rsync is both an interesting program and an interesting idea. rsync the program does a lot of interesting and useful things but what has always been most interesting to me is the rsync algorithm - its ability to update two versions of the same file over a network, one being local and the other being remote while efficiently sending little more than just the data that has changed. rsync the program uses this ability to efficiently copy or sync file changes between networked computers. Many others have built upon this idea to build general backup tools and services.

What has not been so great about rsync is that as a UNIX program it has never really worked well on Windows. Sure there are 'ports' but by not being a native Windows application, many things were ignored or left by the wayside like file ACLs, permissions, NTFS alternative data streams, etc.… You can get it to work and some have done a pretty good job of fixing it up to make it more Windows-friendly but at its core it just doesn't have access to the information or APIs needed to truly integrate into the Windows ecosystem.

rsync itself is pretty old – it predates much of the Internet as we know it, dating back to 1996. The ideas and algorithms have been tweaked over the years but are stable at this point. Newer ideas, such as Multiround rsync have been proposed to better deal with bandwidth-constrained devices such as mobile devices but these are not currently used in rsync.

Years ago I remember reading about a new feature built into Windows Server 2003 R2 called Remote Differential Compression (RDC). At the time it was a technology only used as part of Windows DFS and not really applicable to anything else. Recently while working on a side project I started to wonder how hard it would be to port the rsync algorithm for use in my project. While researching this I came across RDC again and realized that it has since been built into every version of Windows since Vista as a low-level COM API that anyone can use. This RDC API is data source agnostic meaning that it can be used for files, databases, or any other form of raw data that needs to be efficiently transferred. Since Vista's release I suspect that RDC or the ideas behind it are at the core of Windows features like block-level Offline Files updating and BranchCache.

On the surface RDC seems a lot like rsync but it has been researched and developed by Microsoft in significant new ways. Mostly notably it is a recursive algorithm. What this means is, instead of exchanging one set of block hashes or signatures, RDC recursively processes the block signatures to generate smaller sets of signatures to send over the wire. It's like if RDC took the set of signatures for a file and wrote them to disk as a new file and then understanding that both the local and remote signatures are probably similar (since the original files should be similar), used the RDC algorithm to recursively generate a new set of signatures based on the previous set of signatures, each time cutting the size of block signatures to a fraction of what they first were. Now instead of sending the original large set of signatures, it send this reduced set and then uses the RDC algorithm to build the larger set. In the end this means that even less data needs to be transferred over the wire.

Another novel idea in RDC is the idea what you can use more than just the two versions of the same file to generate signatures. In fact, you can likely use parts of other files to further reduce the data that is transferred over the network. The idea comes from something like this: you may have made multiple copies of the same files or even cut/paste similar data between files. Some files may even have blocks that are mostly the same despite being totally unrelated. Since transferring block signatures uses less bandwidth than sending the actual data, why not take advantage of this fact to reduce the transferred data even more. The problem is how do you tell which parts of other local files may match parts of the remote files you want to transfer. You can't reasonably compare every file to every other file as that would be too long of a process. The answer to this problem is solved by pre-analyzing "traits" of each file and storing them so that they can be quickly compared as candidates. This "trait" data is extremely small, being around 96-bits per file and can easily be stored in the file's metadata, in a database or even in memory.

It is no surprise to me that RDC remains unknown to most. Even though it has been a Windows feature with a public API for years now the MSDN documentation on how to use RDC is severely lacking – mostly notably on how RDC recursively generates signatures and how to transfer, reconstruct, and use them. In my extensive searching I have only come up with two examples of how to use the RDC APIs (which are COM APIs, btw). Both of these are complex with one being completely broken, despite being published as an official MSDN sample. The only truly working sample I could find is the one that ships as part of the Windows SDK, which you'll have to install to get to it. You'll find that sample in the Samples\winbase\rdc folder of your Windows SDK installation (at least in SDK v7.1 – I have not checked the Windows 8 SDK version).

As I mentioned, these sample are complex – the working C++ sample uses DCOM so that you can use a local and remote machine. This means customs interfaces and levels of abstraction that make it hard to get to the core of what is really going on, not to mention it being all cross-process. The second, broken sample is built for .NET and is written in C# and uses COM interop and a custom Web Service so that you can test against a local and remote machine. Both of these samples btw, can be run on the same machine which makes it a bit easier to follow but you're still dealing with cross-process debugging. The .NET one is a little easier to follow though as it uses less custom abstraction to hide the complexities of working with RDC. However it is broken.

It was only by understanding the SDK's C++ RDC sample that I could understand how the .NET sample was broken. Remember how I said that RDC recursively generates signatures to cut down on the amount of bandwidth needed? Well, as it turns out the .NET sample was generating these signatures both locally and remotely just fine. What it was not doing was transferring these signatures correctly. As best I can tell it was assuming that there was only one level of signatures to be transferred – in fact it would work if there was only one level which is typically what happens for very small files (in an unrelated bug it would simply crash with files over 64kb which leaves me to believe it was never tested with larger files). However RDC by default decides how many levels of recursion should take place and when it does it means that multiple levels of signatures need to be transferred (but by using the RDC algorithm so the overall bandwidth stays low). Handling these multiple levels of transfers was what was broken. I was able to fix that and test that it works no matter how many levels of signatures where being used or the size of the file. In the process I added more logging, fixed other bugs, and cleaned a few things up as well. Here is that newer, fixed up sample RDC .NET project. It is by no means ready for production (or even useful as an rsync alternative) but it does show how to use RDC with .NET.

RDC.NET sample on Bitbucket.org: https://bitbucket.org/davidjade/rdc.net 

Note: If you're not a Mercurial user and you just want the source as a .zip file, just go to the Downloads | Branches section of the project and select your archive format.

While complex to use RDC works really well. In my sample data I used two bitmap files of the same photograph, one of which had been altered by writing the word "changed" in a 72pt font across it. Since these files were saved as uncompressed bitmaps (as compared to jpegs) they were mostly the same. The total file size was 2.4mb but by using RDC to transfer the changed version from a remote location, the total RDC bandwidth needed as only 217kb making for a 92% reduction in bandwidth needed to update the file. The RDC overhead is likely too great to use for smaller files but for larger files it can really pay off, especially if you are bandwidth constrained.

RDC was developed several years ago – 2006 being the year that the research paper on RDC was published with the samples dating to shortly after that time as well. I would encourage you to check out that research paper if you really want to understand how RDC functions. It is a great example of how Microsoft Research really is producing novel and significant advances in computing. It is also the missing documentation that is needed to best make use of the minimal MSDN documentation on the RDC COM API.

One last note, as sometimes unfortunately happens in the world of Windows, misinformation has spread among Windows users (the type that think every service they don't use directly should be turned off) that has led users to think that RDC should be uninstalled on Windows so that it somehow fixes slow network file transfers. Besides this making totally no sense whatsoever and besides the fact the RDC isn't used for anything in a client Windows OS, it may be something you have to contend with if you build an app that uses RDC. The myth is so widespread that Microsoft had to write about it.


Comments are closed

Flux and Mutability

The mutable notebook of David Jade