rsync is both a useful program and an interesting idea. The program does a lot of handy things, but what has always interested me most is the rsync algorithm itself: its ability to update two versions of the same file over a network, one local and one remote, while efficiently sending little more than the data that has changed. rsync the program uses this ability to efficiently copy or sync file changes between networked computers, and many others have built general backup tools and services on top of the same idea.

What has not been so great about rsync is that, as a UNIX program, it has never really worked well on Windows. Sure, there are 'ports', but by not being a native Windows application many things were ignored or left by the wayside, like file ACLs, permissions, NTFS alternate data streams, and so on. You can get it to work, and some have done a pretty good job of making it more Windows-friendly, but at its core it just doesn't have access to the information or APIs needed to truly integrate into the Windows ecosystem.

rsync itself is pretty old – it predates much of the Internet as we know it, dating back to 1996. The ideas and algorithms have been tweaked over the years but are stable at this point. Newer ideas, such as Multiround rsync, have been proposed to better deal with bandwidth-constrained clients such as mobile devices, but these are not currently used in rsync.

Years ago I remember reading about a new feature built into Windows Server 2003 R2 called Remote Differential Compression (RDC). At the time it was a technology only used as part of Windows DFS and not really applicable to anything else. Recently while working on a side project I started to wonder how hard it would be to port the rsync algorithm for use in my project. While researching this I came across RDC again and realized that it has since been built into every version of Windows since Vista as a low-level COM API that anyone can use. This RDC API is data source agnostic meaning that it can be used for files, databases, or any other form of raw data that needs to be efficiently transferred. Since Vista's release I suspect that RDC or the ideas behind it are at the core of Windows features like block-level Offline Files updating and BranchCache.

On the surface RDC seems a lot like rsync, but Microsoft has researched and developed it in significant new ways. Most notably, it is a recursive algorithm. Instead of exchanging one set of block hashes or signatures, RDC recursively processes the block signatures to generate smaller sets of signatures to send over the wire. It is as if RDC took the set of signatures for a file, wrote them to disk as a new file, and then, knowing that the local and remote signatures are probably similar (since the original files should be similar), applied the RDC algorithm again to generate a new set of signatures from the previous set, each pass cutting the size of the block signatures to a fraction of what they first were. Instead of sending the original large set of signatures, it sends this reduced set and then uses the RDC algorithm to rebuild the larger set. In the end this means that even less data needs to be transferred over the wire.
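To make the recursion concrete, here is a purely conceptual PowerShell sketch of the idea. It is not the RDC COM API, and real RDC chooses chunk boundaries from the data itself (content-defined cut points) rather than the fixed-size chunks used here; the point is simply that each level's signatures become the input for the next, much smaller level.

# Conceptual only: fixed-size chunks and MD5 stand in for RDC's real chunking and signature
# scheme. Level N+1 is computed from level N's signatures, so each level shrinks.
function Get-SignatureLevels([byte[]]$data, [int]$chunkSize = 2KB, [int]$levels = 2)
{
    $hasher = [System.Security.Cryptography.MD5]::Create()
    $allLevels = @()
    for ($level = 0; $level -lt $levels; $level++)
    {
        $signatures = @()
        for ($offset = 0; $offset -lt $data.Length; $offset += $chunkSize)
        {
            $count = [Math]::Min($chunkSize, $data.Length - $offset)
            $signatures += ,($hasher.ComputeHash($data, $offset, $count))
        }
        $allLevels += ,$signatures
        # The signatures themselves become the "data" that gets signatured at the next level.
        $data = [byte[]]($signatures | ForEach-Object { $_ })
    }
    return $allLevels
}

In the real API, RDC itself decides how many recursion levels to use based on the size of the data, a detail that matters later in this story.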

Another novel idea in RDC is that you can use more than just the two versions of the same file to generate signatures. In fact, you can likely use parts of other files to further reduce the data that is transferred over the network. The reasoning goes something like this: you may have made multiple copies of the same files or even cut and pasted similar data between files, and some files may even have blocks that are mostly the same despite being totally unrelated. Since transferring block signatures uses less bandwidth than sending the actual data, why not take advantage of this to reduce the transferred data even more? The problem is how to tell which parts of other local files may match parts of the remote files you want to transfer. You can't reasonably compare every file to every other file, as that would take far too long. RDC solves this by pre-analyzing "traits" of each file and storing them so that candidate files can be compared quickly. This trait data is extremely small, around 96 bits per file, and can easily be stored in the file's metadata, in a database, or even in memory.
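As a hedged illustration of how such a tiny trait might work (this is a generic min-hash style sketch of the idea, not RDC's documented scheme), you could keep just the few numerically smallest chunk-hash values for each file; files that share many chunks will tend to share many of those minima, so comparing a handful of small numbers becomes a cheap pre-filter for finding similarity candidates.

# Hypothetical sketch: derive a compact per-file "trait" from its chunk signatures by keeping
# the k smallest hash values. Similar files share chunks, so they tend to share minima.
function Get-FileTrait([byte[][]]$chunkSignatures, [int]$k = 6)
{
    $chunkSignatures |
        ForEach-Object { [BitConverter]::ToUInt32($_, 0) } |   # reduce each signature to a small number
        Sort-Object |
        Select-Object -First $k
}

# Two files are worth a closer look if their traits overlap enough.
function Test-SimilarityCandidate($traitA, $traitB, [int]$minMatches = 3)
{
    $shared = @(Compare-Object $traitA $traitB -IncludeEqual -ExcludeDifferent).Count
    return ($shared -ge $minMatches)
}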

It is no surprise to me that RDC remains unknown to most. Even though it has been a Windows feature with a public API for years now, the MSDN documentation on how to use RDC is severely lacking – most notably on how RDC recursively generates signatures and how to transfer, reconstruct, and use them. In my extensive searching I have only come up with two examples of how to use the RDC APIs (which are COM APIs, by the way). Both of these are complex, and one is completely broken despite being published as an official MSDN sample. The only truly working sample I could find is the one that ships as part of the Windows SDK, which you'll have to install to get to it. You'll find that sample in the Samples\winbase\rdc folder of your Windows SDK installation (at least in SDK v7.1 – I have not checked the Windows 8 SDK version).

As I mentioned, these samples are complex. The working C++ sample uses DCOM so that you can run it against a local and a remote machine, which means custom interfaces and layers of abstraction that make it hard to get to the core of what is really going on, not to mention that it is all cross-process. The second, broken sample is built for .NET, is written in C#, and uses COM interop and a custom web service so that you can test against a local and a remote machine. Both of these samples, by the way, can be run on a single machine, which makes them a bit easier to follow, but you're still dealing with cross-process debugging. The .NET sample is a little easier to follow as it uses less custom abstraction to hide the complexities of working with RDC. However, it is broken.

It was only by understanding the SDK's C++ RDC sample that I could understand how the .NET sample was broken. Remember how I said that RDC recursively generates signatures to cut down on the amount of bandwidth needed? As it turns out, the .NET sample was generating these signatures both locally and remotely just fine. What it was not doing was transferring these signatures correctly. As best I can tell, it assumed that there was only one level of signatures to be transferred – and in fact it would work if there was only one level, which is typically what happens for very small files (in an unrelated bug it would simply crash with files over 64 KB, which leads me to believe it was never tested with larger files). However, RDC by default decides how many levels of recursion should take place, and when it recurses it means that multiple levels of signatures need to be transferred (again by using the RDC algorithm, so the overall bandwidth stays low). Handling these multiple levels of transfers was what was broken. I was able to fix that and verify that it works no matter how many levels of signatures are used or how large the file is. In the process I added more logging, fixed other bugs, and cleaned a few things up as well. Here is that newer, fixed-up sample RDC .NET project. It is by no means ready for production (or even useful as an rsync alternative) but it does show how to use RDC with .NET.

RDC.NET sample on Bitbucket.org: https://bitbucket.org/davidjade/rdc.net 

Note: If you're not a Mercurial user and you just want the source as a .zip file, just go to the Downloads | Branches section of the project and select your archive format.

While complex to use, RDC works really well. In my sample data I used two bitmap files of the same photograph, one of which had been altered by writing the word "changed" in a 72pt font across it. Since these files were saved as uncompressed bitmaps (as opposed to JPEGs) they were mostly the same. The total file size was 2.4 MB, but by using RDC to transfer the changed version from a remote location, the total RDC bandwidth needed was only 217 KB, making for a 92% reduction in the bandwidth needed to update the file. The RDC overhead is likely too great for smaller files, but for larger files it can really pay off, especially if you are bandwidth constrained.

RDC was developed several years ago – 2006 being the year that the research paper on RDC was published with the samples dating to shortly after that time as well. I would encourage you to check out that research paper if you really want to understand how RDC functions. It is a great example of how Microsoft Research really is producing novel and significant advances in computing. It is also the missing documentation that is needed to best make use of the minimal MSDN documentation on the RDC COM API.

One last note: as sometimes unfortunately happens in the world of Windows, misinformation has spread among Windows users (the type that think every service they don't use directly should be turned off) that has led people to believe that removing the RDC feature somehow fixes slow network file transfers. Besides the fact that this makes no sense whatsoever, and that RDC isn't used for anything in a client Windows OS, it may be something you have to contend with if you build an app that uses RDC. The myth is so widespread that Microsoft had to write about it.


This is the fourth and final article in a series about a new backup process I have implemented for my home network. In the previous article I covered the tools I used to create this new backup process and the issues I faced along the way. In this article I'll give an overview of the backup process and highlight some of the implementation details.

The view from 50,000 feet

At the core of the new backup process, things are pretty straightforward. The 50,000 foot view is something like this, with just 5 main steps (a rough sketch of the driver loop follows the list):

  1. Create a VSS snapshot of the drive to be backed up.
  2. Compare the previously backed-up files to the current files and get the list of files that are new, newer, the same, and missing (i.e. deleted).
  3. Build the directory structure for the new backup set.
  4. For each file that is the same, create hard links in the new directory structure that link to the previous backup's files.
  5. Copy over all the new and newer files.
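
In script form, the driver looks roughly like the sketch below. It is a simplified outline, not the actual script: every function name is a placeholder for a chunk of the real PowerShell, and the real scripts use a trap handler rather than try/finally.

# Simplified driver sketch - all function names are placeholders for the real scripts.
$snapshot = New-VssSnapshot 'C:'                                      # 1. freeze the file system
try
{
    $classified = Get-RobocopyClassification $snapshot $lastSet       # 2. new / newer / same / missing
    if ($classified.HasChanges)
    {
        $newSet = New-BackupSetDirectories $snapshot $backupRoot      # 3. build the directory tree
        New-UnchangedFileHardLinks $classified.Same $lastSet $newSet  # 4. hard link unchanged files
        Copy-ChangedFiles $snapshot $newSet                           # 5. copy the new and newer files
    }
}
finally
{
    # In the real scripts the snapshot is released only after every vault has been processed.
    Remove-VssSnapshot $snapshot
}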

As they say though, the devil's in the details. But first, let me get this out of the way.

In the following discussion, I will be leaving a lot of little details out to keep things simpler. I'm not going to cover things like creating log files, creating system event log entries, email notifications, etc... These are all important things to have when creating a set-and-forget backup system but suffice it to say, the scripts that I have created produce detailed logs and have rich error handling along with notifications for when things go wrong.

Flying lower - servers, vaults, and sets

On my server, I have a dedicated backup network share where each machine that participates in the backup scheme has a directory reserved for it. Within each machine directory there is a machine configuration XML file and one or more backup vaults. Each backup vault is simply a directory and a backup configuration file that specifies the set of files to back up into that vault.

Within each vault are one or more backup sets, where each set is a complete snapshot of the vault's data. These backup sets are again just directories, named using the date that the set was created. Since each backup set in a vault is a complete snapshot of the data being backed up, there will be many duplicate and unchanged files between sets. To minimize the storage necessary to preserve these complete snapshots, files that are unchanged between backup sets are hard linked together. This results in exactly one copy of the unchanged file's data being shared between all backup sets in a vault. It also means that each new backup set only takes as much physical storage space as is necessary to hold the files that have changed since the last backup, yet it still preserves the complete snapshot view for each backup set.

This multi-level approach to defining backups allows me to specify multiple backup vaults for each machine, tuning each for the type of data and/or the frequency of change. In most cases, each machine has one backup vault that contains the Windows "Documents and Settings" folders (i.e. all users' data). On my server, however, I also have other vaults for things like the IIS web server's files and other shared network folders. Having multiple vaults also allows me to separate out backups based on the frequency of change, since a new backup set is only created in a vault if the backup process determines that files have actually changed and need to be backed up.

Having all of the configuration on the backup server also allows me to centrally manage each of the client machines and their backups. To set up a new client, I just create the directories on the server, set up a few configuration files, and then install and schedule the backup scripts on the client. Once that is done I can adjust the backup process for any client just by updating its configuration files on the server. I can even pull down new versions of the backup scripts and components to the client as needed.

Step #1 - Freezing the file system

Since there are several processing steps that access the files to be backed up, the first thing I needed to do was to take a snapshot of the file system so that I had a frozen-in-time view of the current files. To do this I use my custom VSS COM component along with the machine's XML configuration file. This simple XML configuration file, which is trivial to read in PowerShell, specifies which disk volumes to snapshot and what DOS device names (i.e. drive letters) to expose the snapshots as. Special care had to be taken though to ensure that any errors in the backup process would properly release the VSS snapshot. I accomplished this by setting up a global PowerShell trap handler that cleans up the VSS COM object as necessary.
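In rough outline the snapshot step looks something like the following sketch. The COM ProgID, its method names, and the XML element names are all placeholders for my real component and configuration files.

# Sketch only: ProgID, method names, and XML layout are placeholders.
trap
{
    # Global trap handler: whatever goes wrong, make sure the snapshots get released.
    if ($vss -ne $null) { $vss.ReleaseSnapshots() }
    break
}

[xml]$machineConfig = Get-Content '\\server\backup\MACHINE1\machine.xml'

$vss = New-Object -ComObject 'MyBackup.VssHelper'      # custom VSS COM wrapper
foreach ($volume in $machineConfig.machine.volumes.volume)
{
    # e.g. snapshot C: and expose the frozen view under the B: DOS device name
    $vss.CreateSnapshot($volume.source, $volume.exposeAs)
}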

Note: For more details on working with PowerShell trap handlers and exceptions, here is an excellent PDF article on the subject: PowerShell Debugging and Error Handling

Once I have a frozen view of the file system, I can then process each backup vault that is defined to determine what, if anything, needs to be backed up. The process that follows is repeated for each vault defined for the machine. It is only after all vaults have been processed that the VSS snapshots are released.

Note: I have recently read another article about using VSHADOW and accessing the shadow copy on Windows XP by using a utility named DOSDEV. This technique effectively replaces the need for my custom VSS COM object. Here are the details: How to assign drive letters to VSS shadow copies... on Windows XP!

Step #2 - Processing a backup vault and classifying the files

Each vault has a configuration file that specifies the directories to back up along with the files and directories to ignore (e.g. "Temp" directories, temporary files, etc...). Since robocopy has sophisticated options for file selection, I use robocopy job files for this.

Robocopy job files are just a way to save robocopy command line arguments into a file for later use. The robocopy documentation has more details on how to create these, but they are basically just plain text files that contain command line arguments. Robocopy is also flexible in that you can mix job files with options specified directly on the command line. This ability to mix and match options allows me to separate the file selection information from the other command line options used during various parts of the backup process.

Now that I have a way to save the selection of files to be backed up, the next step is to determine the lists of files that are new, newer, the same, or have been deleted. Luckily robocopy will do this by using the listing-only command line switch along with the verbose switch. Using these two switches together causes robocopy to produce a very complete log file describing what would happen if the listing-only switch had not been specified. For each file and directory it outputs the classification of the object (i.e. New, Newer, Changed, Extra, Same) along with the full path. So to get the list of files and their classification I simply invoke robocopy using the source data (as exposed via the VSS snapshot) while specifying the last backup set as the target. This gives me one giant log file that lists every file that is new, newer, the same, or has been deleted since the last backup. It can be a very large log file depending on the number of files being backed up, but it also gives me all the information I need to determine both whether a backup is necessary and, if so, which files have not changed and can therefore be hard linked in the new backup set.
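The invocation looks roughly like this (a sketch: the paths, job file, and log location are placeholders, while /JOB, /L, /V, /NP, and /LOG are standard robocopy switches):

# List-only pass: classify everything against the last backup set without copying a thing.
# $snapshotSource is a path under the DOS device name the VSS snapshot was exposed as.
$classifyLog = Join-Path $env:TEMP 'classify.log'
& robocopy.exe $snapshotSource $lastSetPath /JOB:$vaultJobFile /L /V /NP /LOG:$classifyLog > $null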

To make this log easier to parse, robocopy places the classification information along with the path of the object at a fixed column in the log file, which is documented in the robocopy documentation. This makes it pretty easy to use a tool like PowerShell to parse the robocopy log file and determine which files and directories are the same, new, changed, or have been deleted. There is only one problem that I have encountered with using robocopy to produce the file lists: it seems that robocopy's log files are always written using standard ASCII characters.

Note: Windows Vista ships with a new version of robocopy that fixes this limitation but that new version will only work on Windows Vista unfortunately.

So if you have files that have UNICODE characters in their name (i.e. ©, ®, ½, etc...) then robocopy will substitute plain ASCII characters for those symbols in the output. The side effect of this is that if you have any files with UNICODE characters, the path returned from robocopy cannot be used to directly access the file. This does cause a few hiccups when creating hard links as the files will appear to be missing. The worst case however is that hard link creation fails and then the file gets re-copied to the new backup set. Not an ideal situation but one which does leave the new backup set's integrity intact albeit at the expense of backup storage space. A simple way to reduce or eliminate these soft-errors is to simply rename the files to contain only ASCII characters. I've done this for the handful of files where this was an issue when it was clear that changing the file name would have no ill effects.
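Putting the parsing together, a hedged sketch looks like this; the classification keywords and the column where the path begins are assumptions drawn from the description above, not authoritative values.

# Sort every log line into "same" (hard link later) or "newer" (copy later) buckets.
# Extra (deleted) entries need no action - they simply aren't carried forward.
$pathColumn = 43                      # illustrative only; taken from the robocopy docs in practice
$same  = @()
$newer = @()
foreach ($line in (Get-Content $classifyLog))
{
    if ($line.Length -lt $pathColumn) { continue }
    $path = $line.Substring($pathColumn).Trim()
    if     ($line -match '^\s+Same\s')                      { $same  += $path }
    elseif ($line -match '^\s+(New File|Newer|Changed)\s')  { $newer += $path }
}
$backupNeeded = ($newer.Count -gt 0)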

Once I have the list of files and their classification I can then determine whether a backup is necessary (i.e. are there any New or Newer files). If no files have changed then I simply end the processing for this vault; no new backup set is created since nothing has changed. If I'm backing up something like the Windows "Documents and Settings" folder then this is highly unlikely, but for some sets of infrequently changed data it can be more common.

If a backup is necessary, I now also have my list of unchanged files that are to be hard linked between the last backup set and the new one. Step #2 of the backup process is now complete.

Step #3 - Building a new set

The next step is to build the directory structure for the new backup set. Again robocopy will do this for me but it's not completely obvious how to do this from reading the documentation.

One of robocopy's options tells it which parts of each file and directory to copy. By default it will copy the data, attributes, and timestamps. You can also tell it to copy security descriptors, owner info, and auditing info. The trick is that you can also tell it to copy everything except the data, in which case it will just build the directory structure along with all the NTFS security descriptors, owner info, and auditing info.

Again I use my robocopy job file, this time with a different set of command line options, to copy just the directory structure for the new backup set. First, though, I create a new backup set directory, using the current date as its name (i.e. yyyy-mm-dd), and use that as the target directory for robocopy.
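Something along these lines, as a sketch; the exact switch combination is an assumption based on the description above rather than my scripts verbatim.

# Create the dated set directory, then let robocopy replicate just the directory tree.
# /COPY:ATSOU = attributes, timestamps, security, owner, auditing - everything except D (data);
# /E recurses and /XF '*' keeps file entries out so only directories are created.
$newSet = Join-Path $vaultPath (Get-Date -Format 'yyyy-MM-dd')
New-Item -ItemType Directory -Path $newSet > $null

& robocopy.exe $snapshotSource $newSet /JOB:$vaultJobFile /E /XF '*' /COPY:ATSOU /NP /LOG:$structureLog > $null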

Once robocopy has built the new directory structure, Step #3 of the backup process is now complete.

Step #4 - Hard linking the unchanged files

Next comes the creation of the hard links for all of the files that are unchanged since the last backup set. From my parsed robocopy log file, I iterate over all of the files that were classified as the "same" and create a hard link in the new backup set, at the proper location, to the last backup set's copy of the file. For each hard link I need to create in my PowerShell script, I call the hard link helper function that I wrote in C#. This is typically a very fast procedure, but the sheer number of unchanged files can make it one of the bigger steps in the backup process.
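A simplified sketch of that loop; the class and variable names are placeholders, and the real scripts rely on a trap handler and a bit more path manipulation than shown here.

# $sameFiles holds the paths (relative to the backup root) that were classified as "Same".
$failedLinks = 0
foreach ($relativePath in $sameFiles)
{
    $linkPath     = Join-Path $newSet  $relativePath    # where the link goes in the new set
    $existingPath = Join-Path $lastSet $relativePath    # the unchanged file in the previous set
    try
    {
        [MyBackup.HardLinkHelper]::CreateHardlink($linkPath, $existingPath) > $null
    }
    catch
    {
        # A failed link just means the file gets re-copied later; too many failures aborts the run.
        $failedLinks++
        if ($failedLinks -gt $maxAllowedLinkFailures) { throw 'Too many hard link failures - aborting backup.' }
    }
}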

As I mentioned earlier, there is the case of hard link creation failing due to the limitation of robocopy's log file format and file names with UNICODE characters. There is also another case where hard links may fail: when the account running the backup process does not have enough permission to access the underlying file. In my experience this is pretty rare but it can happen for certain system files that only the system account has access to. The worst case, however, is that the hard link creation fails and the copy stage of the process re-copies the file. Because of the possibility of these hard link failures I allow a certain number of hard links to fail before I abort the entire backup process.

After this stage in the process the new backup set will contain all of the directories and all of the files that are the same since the last backup. This also means that any files that have been deleted since the last backup have been effectively dropped from the new backup set as well. They haven't been physically deleted though; they just have not been carried forward to the new backup set via hard linking.

Once these hard links are created, Step #4 of the backup process is now complete.

Step #5 - Backing up the new data

The next and final step is to simply allow robocopy to copy any new and changed files into the new backup set. Since we are working with a VSS snapshot of the data, there is no possibility that files have changed since we classified them and thus no danger that any hard link to a previously backed-up file will get overwritten by the copy process. Since the only files that currently exist in the new backup set are the files that are unchanged since the last backup, by default robocopy will only copy over files that are either new or newer.

During the copy process, however, there can be issues with file access permissions if the account that the backup process is running under does not have sufficient privileges to copy the files. This will sometimes be the case for certain system files and other users' private files. To get around these issues I use the robocopy option to copy files in a special backup mode. This special backup mode allows robocopy to copy files that it might not otherwise have access to, for the purposes of backing them up. When using this option you have to ensure that the account the backup process runs under is a member of the Backup Operators group on all of the machines participating in the backup.
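The copy pass itself is then roughly this (a sketch; /B is robocopy's backup-mode switch, everything else is a placeholder):

# Final pass: everything classified as "Same" already exists in $newSet as a hard link, so
# only new and newer files are copied. /B copies in backup mode, which requires that the
# backup account be a member of the Backup Operators group on the machines involved.
& robocopy.exe $snapshotSource $newSet /JOB:$vaultJobFile /B /NP /LOG:$copyLog > $null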

Once this file copy step is completed, the backup process is complete for the vault being processed and the next vault can then be processed. This cycle continues until all of the vaults for the machine have been processed.

Tidying up

After all of the vaults have been processed for a machine it is then safe to release the VSS snapshots. After that has been completed, the backup process is complete for that machine and my script exits.

There is always room for improvements

Rsync, which was the main inspiration for my new backup process, was designed as a client-server tool that can greatly reduce the time required to copy data between machines by using a differential file copy process. This special copy process examines each file so that only the parts that have changed get sent over the wire. When I started out, my hope was to patch up rsync to be more Windows-aware, but in the end it was far less work to just recreate the parts that I needed the most, especially given that wire transfer speed within my network was not an issue. So the differential file copy process got dropped in favor of leveraging robocopy. Someday I would like to add that back, but it would require moving to a client-server architecture.

Another thing I would like to add someday is the detection of moved files. This is a typical scenario for me, especially when working with digital camera photos. I copy new photos to an import folder on my main computer where I spend some time working and sorting through them. They are then usually moved to their own folder somewhere in my photo collection. If a backup were to happen in the middle of this workflow, the files would appear to have been added, deleted, and added again (in the new location). From robocopy's viewpoint, these moved files are treated as groups of deleted and new files. One could imagine an extension of the backup process where these deleted and new files were matched together and recognized as a moved file. Then a hard link from the old location to the new location could be created and the file would not have to be re-copied. To do this with absolute certainty, though, the files should be binary-compared first to ensure that they really are the same. Without a client-server process the comparison would move the same amount of data over the wire as the copy does, so for now this feature will have to wait.

Wrapping up

I've been using this system for about two months now on a variety of Windows machines, from servers to laptops. So far it is working very well and has even prevented disaster at least once. Shortly after getting this system up and running I had a hard drive fail. Luckily I had robust backups of all of the important files.

It has also given me the security to try new things, like moving all of my photos and video to a RAID-0 striped drive, which offers no redundancy if a drive fails but is blindingly fast. Knowing that I have reliable and automatic backups lets me sleep at night, secure in the knowledge that if one of these RAID-0 drives fails the most I will lose is that day's work.

I should also note that this new system is not the end point for my backup strategy. While my backups are stored on my server in a RAID-5 disk array, I also try to regularly back up the backup data to tape, which I move off site. I am still experimenting with this, but what I do is create NTFS junction points to all of the latest backup sets in a folder which I then point NTBACKUP at. Each time I want to refresh the tape backup, I remove the previous junction points and set them up again using the latest backup data. By doing this I can even create incremental backups to tape, since to anything looking at the junctioned folders it just appears that the files have been updated (i.e. the ready-for-archiving bit is set).

Down the road I hope to someday utilize an Internet backup service instead of tape, perhaps built upon Amazon's Simple Storage Service (Amazon S3) but for right now DSL upload speeds are too slow for the volume of data I have.

That about wraps up what I had planned to discuss in this series of articles. I know that this final article is lacking in concrete examples of how this was actually done, but it is a rather large amount of PowerShell code spread out over seven files and totaling nearly 1200 lines of script, together with several thousand lines of C++ and C# source code. It is my hope that this series of articles will inspire others to check out the technologies I used to create this backup process and perhaps take the ideas further.

In the meantime, you can always leave me a comment if there's some part of the process where you would like to see more details and I will see if I can write up new articles on those aspects of the process.

Thanks for reading.


This is the third in a series of articles about a new backup process I have implemented for my home network. In the previous article I covered a mirror backup process that maintains a storage-efficient backup history. In this article I'll cover the tools I used and the issues I had to overcome while using them.

Common tools and a not so common use of them

Once I had decided to create a backup system that creates space-conserving mirror backups by leveraging NTFS hard links, I set out to make a simple prototype. It occurred to me that I already had a very good tool for copying data around: robocopy, a free tool from the Windows Resource Kit. Robocopy is a very powerful file copying tool that can be configured in a multitude of ways, including the ability to copy files in backup mode, a special mode of file access that can be used to bypass file security for the purposes of backing up files. It is also faster and more reliable than the file copy tools that come with Windows and has a very good set of options to control which files to copy. However, robocopy knows nothing about creating hard links to previous versions of files. That step I would have to do myself.

In searching for information on how to create hard links, it wasn't long before I ran across references to the fsutil tool that is included in Windows XP and Windows Server 2003. Using this tool you can create NTFS hard links from the command line.
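The syntax is simply the new link name followed by the existing file (the paths here are just placeholders), and it works the same from a CMD script or from PowerShell:

# fsutil creates the new link name first, pointing at the existing file's data.
fsutil hardlink create D:\Backups\2006-12-09\report.doc D:\Backups\2006-12-02\report.doc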

Together with robocopy and a bit of creative CMD scripting, I was able to throw together a prototype that could create mirror backups while hard linking to the files that had not changed since the previous backup, just like rsync does. I started by duplicating the directory structure of the old backup by using robocopy to copy just the directories. Next I used fsutil to hard link copies of the previous backup's files into the new directories, traversing the old backup directories and creating a hard link to each of the older files. Then I used robocopy to generate a list of the files that had changed since the last backup, including files that were no longer present, and from that listing I deleted those files from the newly created mirror backup. Finally, I used robocopy to copy just the newer files into the new mirror backup. While it wasn't the most efficient method, it worked pretty well, but it had one important limitation: fsutil only works on local disks. It was also a pretty hacky bit of CMD script since I had to do string manipulation to create the hard links. I had considered re-writing the whole process in C#, but then something else popped up on my radar.

PowerShell, isn't that some sort of new gasoline?

It was about this time that Microsoft released RC2 of PowerShell (which has just recently gone RTM). PowerShell is Microsoft's new administrative scripting language for the future. Besides being a very good replacement for command shell scripting and VBScript, it is also the new foundation of the management tools for the next version of Microsoft Exchange. It is an amazingly powerful scripting language, easily learned, easily extended, and easily the most important tool I have learned in a long time.

PowerShell is different from other scripting languages because it is based on the concept of pipelining objects. Many scripting languages, including the native Windows shell, support pipelining text data from command to command. PowerShell is different in that it pipelines complete .NET objects instead of just textual data. As full .NET objects, each object in the pipeline has state, properties, and methods. They can be passed as parameters to functions, extended dynamically, coerced into other types, and placed back into the pipeline. Functions in PowerShell can also be treated as objects, allowing you to do some types of functional programming that are not easily done in other .NET languages. It is a very powerful idea and my brief description doesn't even scratch the surface of the power that lies within PowerShell. It is all still very new to me but already I am finding many uses for it.

Tip: Here's a PowerShell gotcha to keep in mind. Every expression in PowerShell that produces output places that output in the pipeline. This can lead to pretty weird debugging issues if you aren't careful. I had more than one case where a function was returning more than I wanted because I was calling a command that placed things in the pipeline without my realizing it. There are two ways to avoid this: one is to assign the output of commands to a variable, and the other is to redirect the output to $null (i.e. do-something > $null).
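For example, ArrayList.Add returns the index of the newly added item, and that return value silently becomes part of the calling function's output:

# ArrayList.Add returns the new item's index - without suppression it leaks into the pipeline.
function Add-Items
{
    $list = New-Object System.Collections.ArrayList
    $list.Add('first')             # leaks 0 into the function's output
    $list.Add('second') > $null    # suppressed - nothing leaks
    return $list.Count
}
Add-Items    # outputs 0 and then 2, not just the 2 you expected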

PowerShell's object pipeline nature, along with the rich set of built-in commands known as cmdlets, makes for a perfect system for doing administrative tasks. There are cmdlets for accessing PowerShell providers such as the file system and the registry, for accessing WMI objects and COM objects, and for the full .NET 2.0 Framework. I've seen examples of everything from simple file parsing scripts to a simple but complete HTTP server written in PowerShell in just a few lines of code. To me it appeared to be the perfect language for scripting a new backup process. However, PowerShell does not offer support for creating NTFS hard links either. For this I would need to extend PowerShell.

Extending PowerShell through custom C# objects and P/Invoke

Starting with Windows XP there is a new API for creating hard links, CreateHardLink. In previous versions of Windows, creating hard links was somewhat of a black art: you had to use the complex and sparsely documented Win32 Backup APIs. It could be done, and there are examples of how to do it out there, but it was not for the faint of heart. The CreateHardLink API solves that, making it almost trivial to create hard links on NTFS. Furthermore, unlike fsutil, the CreateHardLink API fully supports creating hard links on remote network NTFS drives. PowerShell cannot easily call native APIs on its own, though. To do that, you need to extend PowerShell with a bit of .NET code.

PowerShell is very easy to extend. You can write complete cmdlets (objects that fully plug into the PowerShell pipeline framework) or you can just create simple .NET objects that can be created and invoked thanks to PowerShell's ability to access the .NET Framework.

Using C# and a bit of P/Invoke it was almost trivial to solve the problem of not being able to create hard links in PowerShell (and .NET) by writing a simple object that called the Win32 CreateHardlink API. Once that was done, I could easily create my new .NET object in PowerShell and use it to create all the hard links that I wanted. Now I could create a more complete backup script from the ground up using PowerShell.

If you'd like to access the CreateHardLink API from PowerShell or .NET, here is a C# code snippet to help you. Simply create a new class in a .DLL and add the following. I added the wrapper as a static method since it does not require any state from the class, which also makes it very easy to call from PowerShell.

// Requires: using System; and using System.Runtime.InteropServices; at the top of the file.

// P/Invoke declaration for the Win32 CreateHardLink API (returns 0 on failure).
[DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Auto)]
internal static extern int CreateHardLink(string lpFileName,
    string lpExistingFileName, IntPtr lpSecurityAttributes);

// Creates a new hard link at strTarget that points to the same data as strSource,
// throwing a Win32Exception with the underlying error code if the call fails.
static public void CreateHardlink(string strTarget, string strSource)
{
    if (CreateHardLink(strTarget, strSource, IntPtr.Zero) == 0)
    {
        throw new System.ComponentModel.Win32Exception(Marshal.GetLastWin32Error());
    }
}

To call this code from PowerShell, you simply load the .NET assembly and then call that static method on your class. Note that this will throw an exception if it fails so make sure you have a PowerShell trap handler somewhere in your script.

# load the custom .NET assembly (redirect the returned Assembly object so it stays out of the pipeline)
[System.Reflection.Assembly]::LoadFrom('YourLibrary.dll') > $null

# create a hard link
[YourLibraryName.YourClass]::CreateHardlink($Target, $Source) > $null

Whoops, that file is in use

There was still one more issue to tackle before I could write a robust backup system: accessing files that are in use. Starting with Windows XP, Microsoft introduced a new facility for getting at files that are currently open, the Volume Shadow Copy Service (VSS for short, but not to be confused with Microsoft's Visual SourceSafe source control system).

One of the ideas behind VSS is that, when requested, the OS will make a read-only copy of the drive, a snapshot frozen in time, available to a backup program. Other programs can continue to change the original disk files, but this shadow copy, or snapshot, will remain frozen and completely accessible to the program that created it. Furthermore, when a backup program requests that a shadow copy be created, the OS coordinates with shadow copy providers to ensure that the data on the disk is in a consistent state before the shadow copy is created. This ensures that the files the backup program has access to are in a consistent enough state on disk to be backed up. This is especially useful for files that are always open or always changing, like the system registry, user profiles, Exchange, or SQL databases. Once the backup program is finished with this temporary read-only shadow copy, it releases it and it disappears from the system. By using VSS, backup programs can gain access to every file on the drive even if some are exclusively in use by other programs. For me it was essential to use VSS in any backup process I implemented.

There were a few tough problems though. On Windows XP these VSS snapshots are very temporary in that they only exist for as long as you hold a reference to them via COM; once released, they auto-delete themselves. And unlike VSS on Windows Server 2003, they cannot be exposed as a drive letter for easy access. You have to access them via the native NT kernel's method of addressing NT namespace objects, the GLOBALROOT namespace. On XP, when you ask the VSS service to create a snapshot, what you get is an NT GLOBALROOT path that looks like this: \\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy1. Unfortunately this is something that not even the native Windows command shell fully understands, and if you try to access it from PowerShell or .NET you'll get an exception telling you that you really shouldn't be accessing internal NT paths. To solve this I would need another bit of custom code to extend PowerShell.

VSHADOW.EXE and exposing a snapshot as a drive letter

VSHADOW is a sample tool that is part of the VSS SDK. It is a command line interface to the VSS API. By using this tool you can create and release VSS snapshots at will. It even has a way around the COM auto-destruction of snapshots on Windows XP by allowing you to call an external program once the snapshot has been created so that you can access the snapshot while VSHADOW is still keeping it alive. It will even create a set of environment variables for you to let you know the names of the GLOBALROOT shadow copies that it has created. This still didn't solve my problem of not being able to access them via PowerShell though (or robocopy for that matter) but having this source code was a good start.

All physical devices in Windows, like hard drives, exist in the GLOBALROOT namespace. It is only through device name mapping that we can access them via their friendly DOS names like C:, D:, and so on. Normally the OS creates these device mappings automatically at startup or whenever a new device is connected. VSS snapshots, however, don't automatically get recognized and mapped. Mapping a friendly name to a VSS snapshot has to be done directly with the Win32 DefineDosDevice API. Using this API you can create and remove DOS device mappings to VSS snapshots on the fly, even on Windows XP. But since VSS snapshots are temporary, you have to manage them carefully or the system can become unstable.

Creating a VSS snapshot and mapping it to DOS device names is well beyond what I wanted to try to do in C#. Luckily for me, the VSHADOW C++ source code was written in a very reusable manner and I could easily reuse it by wrapping a COM object around it.

The not so nice COM interop experience with .NET 2.0

Creating a snapshot is not the simplest of procedures. You have to query for the list of VSS writers, map them against the target volume, determine which ones to include in the process, and finally request that the snapshot be created. You have to hold on to the VSS COM interface to keep the snapshot alive on XP for the duration of its use. When you are done, you have to release it in a controlled manner or the VSS system can completely degrade, in most cases requiring a system restart to recover. It is also not the fastest process, something that would come back to bite me later. However, the VSHADOW source, which is written in C++, was structured in such a way that it was very easy to turn into a COM object using ATL. It was as simple as creating a new ATL COM object project in Visual Studio and including the core VSHADOW source files in the project. Once I had it building as a COM object it didn't take me long to put a .NET-friendly interface on it that exposed methods to create and destroy VSS snapshots as well as map them to DOS device names.

PowerShell has native support for creating and calling COM objects that is even easier than in other .NET languages. There is no need to create .NET interop classes; you just dynamically create the COM object and use it much like you would in VBScript. Once I created my new VSS COM object it was trivial to create VSS snapshots on the fly and map them to DOS device names using PowerShell. With my new VSS COM object I now had complete access to VSS snapshots from any tool that could access a standard drive. It has some limitations, but for this backup process it works very well.

Releasing the VSS snapshot in PowerShell however was another story. There is no clean way that I can find to force a created COM object to be released in PowerShell. You have to wait for the .NET garbage collector to do its thing which is usually not until the PowerShell process is exiting. My new COM object had its clean-up code in the COM object's Release method so that when it was released it would clean up the VSS state in the proper way, ensuring that the system remained stable. Unfortunately for me relying on a COM object's Release method to work during the .NET shutdown process proved to be one huge headache.

After many, many hours of debugging and not really believing what I was seeing I finally had to accept what was going on. From what I was seeing and from the research I have done it is my understanding that Finalizers in .NET, which are called when an object is being destroyed and which are also responsible for calling a COM object's Release method in PowerShell, are not guaranteed to complete when a process shuts down. Usually this is not a problem as the process is going away anyway. It is a problem however when you have native resources to release.

What I was seeing, and not believing for literally hours and hours, was that in the middle of my COM object's Release method the PowerShell process would just exit normally. No exceptions, no faults, nothing; just poof, it's gone. And every time it did this it would leave the VSS system in such a state that the machine had to be restarted, because I was never given the chance to properly execute the VSS clean-up code, which can be a lengthy process. It seems that the PowerShell shutdown process was timing out my clean-up code. It was a complete mess and still one that I cannot believe is acceptable, but apparently to the folks who created .NET it is (you can read about it here in way more detail than anyone should have to know; just search for "timeout" and "watchdog" on that page). The thought that external native code can have the plug pulled just blows me away.

The fix was rather simple once I realized that I cannot count on my COM object's Release method to always complete. I had to move all critical clean-up code and put it in a public method that my PowerShell script would always call. Luckily PowerShell has pretty decent error handling and it wasn't too hard to ensure that I always called the clean-up method on my COM object before PowerShell normally terminates. I'm still not thrilled about this though. I would have preferred that my COM object be allowed to clean up after itself as necessary.

The moral of this story is that you are responsible for all complex clean-up even when calling native code. Don't depend on the .NET framework to always play nice.

Now that I had this behind me I had all the pieces that I needed: a robust file copy tool, a powerful scripting language, the ability to create hard links, and full access to Volume Shadow Copy snapshots.

 

In part four I'll cover the process overview and implementation details of creating the intelligent mirror backup process that I chose as the foundation of my new backup strategy.


This is the second in a series of articles about a new backup process I have implemented for my home network. In the first article I covered some background information and why I chose a non-traditional backup process for my network. In this article I'll give an overview of mirror backups, their problems, and the ways in which they can be improved. I'll also talk about rsync and why, for me, it was not the right tool for creating backups on Windows.

A more intelligent mirror backup process

Mirror backups have several advantages over the traditional backup process. Like a full backup they are a complete snapshot in time of the data being backed up but since they are typically stored on randomly accessible media (i.e. a disk or online), accessing any part of the backup set is easily accomplished. However, also like a full backup you usually have to copy each file to create the mirror even if that file has not changed since the last backup. This makes storing backup history both a time consuming process and an inefficient use of storage space. You can of course just copy the changed files into an existing mirror backup but then you lose any file change history from the last backup set.

While researching backup solutions I came across a UNIX utility called rsync. Rsync is a tool that was designed to efficiently create file set mirrors. It has a large and flexible set of features including its own network protocol that can efficiently transfer just the differences between versions of a file when creating a new mirror. The most appealing feature to me however is its ability to conserve storage space while still creating complete mirror backups. It does this by leveraging a feature called hard links which are found on modern files systems (including the Windows NTFS file system).

Hard links are a method by which the operating system separates the storage of a file's actual data from the file's directory entry. In other words, a file in a directory is really just a named pointer to the file's physical data, which is stored somewhere else on the disk. Every file on the disk is stored this way. Most of the time there is a one-to-one mapping, with each physical file's data having only one named directory entry, or file name. However, some file systems allow you to create additional named references to the file's physical data. These references can exist in the same directory or in any other directory on the file system, and they can have the same or a completely different file name. When this is done they share all aspects of the actual data, including the file's attributes such as security settings, creation time, and last modified time. If you change the file's data or attributes via one reference then all the other references will reflect those changes immediately. You can even delete the original reference to the data and the others will continue to live on. It is only after all references that point to the physical data have been deleted that the actual file data is deleted.
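A quick way to see this behavior on an NTFS volume, using the fsutil tool mentioned earlier (the paths are just examples and assume C:\temp exists):

# Create a file, give its data a second name, then delete the original name.
Set-Content C:\temp\original.txt 'hello'
fsutil hardlink create C:\temp\second.txt C:\temp\original.txt
Remove-Item C:\temp\original.txt
Get-Content C:\temp\second.txt     # still prints 'hello' - the data lives on under its other name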

What makes rsync really special is that it can leverage this very powerful file system feature when creating mirror backups. It does this by examining the files being backed up, comparing each file against the copy in the previous mirror backup. Then, for each file that it finds to be identical, instead of copying that unchanged file into the new mirror backup it simply creates a new hard link to the file as it exists in the previous mirror backup. This is much, much faster than re-copying the file itself but it essentially accomplishes the same thing: it makes the unchanged version of the file available in the newly created mirror backup. Furthermore it is also very space efficient, since the file's data is still only stored once physically yet it is accessible from every mirror backup that it is hard linked into.

Leveraging hard links also allows for another interesting feature: you can physically delete older mirror backups without affecting any of the newer mirror backups that reference unchanged data within them. Files that have other references will stay alive in those newer mirrors, while older files whose only reference is in the mirror being deleted will be destroyed. This makes managing the amount of backup history to keep very easy; you can just delete older mirrors when you no longer want the history they contain.

Together these two features effectively allow you to break the traditional full/incremental backup cycle forever while keeping the best parts of each. You can have several full mirror backups available to preserve file history with no physical duplication of data for files that remain unchanged while still running a process that only copies newer files like an incremental backup. It is the best of both full backups and incremental backups combined in one logical backup process. You can read more about using rsync to create hard linking mirror backups here.

If rsync is for UNIX, what's this got to do with backing up Windows?

Rsync is a very UNIX-y tool and I am pretty firmly in the Windows camp. However you can still run rsync on Windows using the cygwin system. Cygwin is a software translation layer that was developed to allow certain types of UNIX programs to work on Windows systems. It has been around for a long time and is very functional and reliable. I used cygwin quite successfully to play with rsync on both Windows XP and Windows Server 2003. However for backing up data from Windows' NTFS file system there are some issues.

Windows' NTFS file system supports a wide array of features including security attributes, NTFS alternate data streams, sparse files, encrypted files, and much more. Cygwin, and thus rsync, knows nothing about these additional file attributes since the translation layer was designed to make Windows seem like just another UNIX system. What this means is that rsync can only copy the aspects of the files that it knows about, i.e. the file's regular data. So if you use rsync to make mirror backups on Windows you will essentially lose all of this other, and sometimes important, file and directory information (you also lose this information when simply copying files to a CD/DVD). This information is usually preserved when doing traditional backups or when copying files between NTFS locations, including across Windows networks. What was needed was a way for rsync to preserve this information on Windows.

I debated for quite some time on whether or not to dive in and start hacking away at the rsync source code to try to teach it more about the Windows NTFS file system and this additional file data. I even went so far as to patch cygwin to teach it about Windows' GLOBALROOT device paths, a feature that was essential in order to use Windows' Volume Shadow Copy Service (VSS) with rsync (this patch is now part of the cygwin CVS source, by the way). However, in the end I decided it would probably be much more effort to update rsync than it would be to write my own backup process. Rsync has a lot of features that I do not need, and even though my new backup process was likely to be missing some of the features that I really liked about rsync, it would still contain the core method by which rsync creates space-conserving mirror backups.

This decision to start fresh on my own backup tool coincided with the time that I had started to play with another tool, Microsoft's recently released PowerShell scripting language. It wasn't long before I realized that PowerShell was a perfect fit for this type of problem.

 

In part three I'll cover the tools I used to implement an intelligent mirror backup process in Windows using robocopy, Microsoft's PowerShell scripting language, C#, the Volume Shadow Copy Service, C++, and a little bit of COM interop.


This is the first in a series of articles about a new backup process I have implemented for my home network. In this first article I'll cover background information and why I chose a non-traditional backup process. In future articles, I'll cover my implementation of this system for backing up on Windows.

Note: Currently the custom backup system described here is not publicly available. That is something that I am looking at doing in the future, but as it exists now it is not in a state that would be usable by the general public.

The nature of traditional backup processes

I’ve been thinking off and on for quite some time about setting up a new backup system for my home network. I started down this path for a better backup solution because I was not really happy with my existing NT Backup scripts. At first I thought I would continue to leverage NT Backup with a better set of scripts to handle backup rotations, but the more I thought about it the more I became convinced that this was not the way to build a backup system for the future.

Traditional backup programs are still largely built on concepts from the days when everyone backed up their data to cheap removable media (i.e. discs or tape). They are typically designed to back up all information to a set of backup media. Even though most backup programs now allow you to back up data to external hard disks or network locations, most still create monolithic backup files. Getting to the data once it's on the backup media usually involves a restore process that moves the individual file data back to a hard disk.

Nowadays, however, hard disks are cheap and online storage is practically getting cheaper by the minute. Backing up using a process that creates large, monolithic backup files just doesn't make as much sense anymore. Hard drives are very good at storing individual files, and online backup services are most efficient when they can upload small incremental file changes rather than large monolithic backup files.

After this realization I started to look for possible alternatives to the monolithic backup process. Most of what I found offered little more than what NT Backup and some clever scripts had to offer. There were some notable exceptions and the solution I ultimately settled on is largely inspired by one of these exceptions, a tool named rsync. However once I made the decision to move away from NT Backup and a traditional, monolithic backup system things got a lot more complicated.

Along the way I've learned new technologies such as Microsoft's new PowerShell scripting language, how to work with the Windows Volume Shadow Copy Service (VSS), how to dig into Windows' GLOBALROOT device namespace, and the pitfalls of .NET/CLR and COM interop. I even dived pretty deep into cygwin development at one point along the way.

A more complicated setup requires a more complicated backup plan

I don’t have what one would call a typical home network. On my home network I have several laptops, workstations, a media computer, and a server. The server is a Windows Server 2003 machine acting as a domain controller, and I use domain accounts for all computer logins. I also use several advanced features only available to computers participating in an AD domain environment, like user account folder redirection and domain-based group policy settings for management. This server also runs Microsoft Exchange Server, which is set up to automatically download POP email for all users and make it available from multiple email clients. I even have a few SQL Servers running here and there, although so far this is not data that I have been too concerned with (mostly development projects with test data).

This configuration has allowed me to set up an environment where sharing resources is a cornerstone of the way we use our computers. Exchange allows access to our email from anywhere, with rich calendar and address book support, whether simultaneously from multiple computers, remotely over the web (via the OWA client), or in offline mode on our computers. Folder redirection and Offline Files allow us to share and sync data effortlessly, with shared desktops, documents, favorites, or any other resource on the network, including shared applications. I can freely move from computer to computer, from inside the network to outside, and still have access to my email, data, and other network resources wherever I go, whether online or offline. If it sounds like a complicated setup, that's because it is, but from a management point of view, once it was set up I spend very little time keeping it running. The downside to this configuration is that my data backup scenario is a bit more complicated than just copying files to a CD/DVD.

My primary goal of course is to not lose a user's data: the documents, photos, email, and other things that we all create. My secondary goal, however, is to preserve the state of the entire system. By state of the entire system I mean all the settings and configurations for each user account, on each computer, as well as the OS configuration of each computer. As I said before, it is a complicated system, and if something goes wrong I want to avoid having to rebuild systems from scratch if at all possible. My goal is to get back up and running as quickly as possible.

Running a domain controller also complicates backup strategies. Since the login accounts are all Active Directory domain accounts, if the state of my server is lost then so are the user accounts. Email is stored centrally in the Exchange database, which is tied to these AD accounts and requires a special backup process as well. Altogether this necessitates something more than copying data files to a disc or an external hard disk.

My overall goals in creating a backup system are data preservation and having as little downtime as possible when something does go wrong (and it will at some point). But I also want this to be as automatic as possible, something I can just set and largely forget. And I want it to be resilient too. These are bold goals for such a complicated setup and a one-man IT shop. My previous solution of scripting NT Backup, while simplistic, met some of these goals but failed at many others.

Why my previous traditional backup solution wasn't good enough

My previous solution provided basic data and system state protection. I used simple scripts to control NT Backup to back up both the user data and the Windows system state nightly for each computer. I also used NT Backup to back up my Exchange server's data. I backed everything up to a secondary disk on my network. However, I only ever kept the previous backup set, so I had a history of exactly two backup cycles. By doing so, however, I had at least three copies of everything stored in two separate places.

My data volume however presented issues right from the start. I have a lot of data on one workstation in particular, mostly very large photo files. The data set size for that machine alone is currently 40GB if I don't count DV video, which effectively rules out running nightly full backups and keeping a lot of backup history. Incremental backups offer one solution to this problem, but I am not a fan of creating long chains of incremental backup sets as they make the restore process more complicated and time-consuming. They also have another drawback: if one of the incremental backup sets fails, the chain is broken and you could lose the only backup copy of a particular file. As a compromise I eventually settled on a simple rotation scheme of full backups and differential backups, as sketched below. This made for a restore process that was at most two steps, a full backup restore followed by a single differential backup restore, while also ensuring that I had some file change history preserved.
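For context, the old scheme boiled down to something like the following PowerShell sketch. This is only an approximation for illustration; the job names, paths, .bks selection file, and Sunday-full schedule are hypothetical, not my exact scripts.

```powershell
# Illustrative approximation of the old NT Backup rotation: a full ("normal")
# backup once a week, differentials every other night, plus the system state,
# all written to a share on the backup server.
$target = "\\server\backups\workstation1"
$mode   = if ((Get-Date).DayOfWeek -eq "Sunday") { "normal" } else { "differential" }

# Back up the folders listed in a .bks selection file (created with the NT Backup UI)
ntbackup backup "@C:\Backup\data.bks" /J "Nightly data" /F "$target\data-$mode.bkf" /M $mode

# System state (registry, AD, COM+ catalog, etc.); always backed up in full
ntbackup backup systemstate /J "System state" /F "$target\systemstate.bkf"
```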

While my old scheme provided basic data protection, I felt that it was lacking in several ways. First off, it didn't keep very much history and I wanted to keep more. With that volume of data, it's easy not to notice for quite a while that something has gone wrong. If a single file has been changed or lost, chances are that I will know immediately as I probably caused it myself. But it's not always that simple; files can get corrupted when hard drives have minor, unnoticeable failures. Other actions can also have consequences that aren't immediately apparent. Any good backup system should keep a reasonable amount of history.

Secondly, a lot of redundant data was flying around each night. Differential backups are not very efficient. I had scheduled my backups to run in a staggered sequence so that they were not competing for the server's bandwidth all at once. Still, so much duplicate data was getting copied each night that the process took around 3 hours, and sometimes a lot longer if a full backup of any machine was triggered. I also know that my data volume will continue to grow. I already have data (DV video and other media files) that I don't currently back up but should. With my old process I couldn't grow the data volume much before backups would have started to take all night or longer. I needed to stop copying duplicate data as much as possible.

Lastly, I had no solution for offsite backups, nor did I have a solution for archiving data that doesn't change much. I have DVD burners and I even have a 70GB DAT tape drive, but they were not integrated into my backup process. By using NT Backup on each machine I was left with a collection of large monolithic backup files that would not fit on backup media without file splitting, nor could they be sent to an Internet backup service in a reasonable amount of time.

Mirror mirror

From examining my current situation and researching possible new solutions, it was clear to me that there are now better ways to back up data than the traditional methods. What I really wanted was a mirror backup. However, since a mirror backup is a complete backup of everything frozen at a point in time, a simple file copy scheme is a very inefficient use of storage space when keeping history. Mirror backups do make it very easy to upload incremental changes to offsite storage, as you can easily detect and upload just the files that changed since the last backup set. What is needed is an intelligent mirror backup that conserves space when storing history. One way to do this is by eliminating the physical storage of duplicate files. The solution I ultimately settled on does this by leveraging features of the NTFS file system to eliminate duplicate file storage while still making it possible to browse a complete mirror backup with Windows Explorer.
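To make the idea concrete, here is a rough PowerShell sketch of one common way to build that kind of space-efficient mirror using NTFS hard links. The specific NTFS mechanism I actually chose is the subject of the upcoming articles; this sketch only illustrates the general concept, and the paths and change-detection rule are hypothetical.

```powershell
# Concept sketch only: build today's mirror as a full directory tree, but store
# any file that is unchanged from the previous mirror as an NTFS hard link to
# the existing copy. Unchanged files then take no extra space, yet every mirror
# still browses like a normal folder in Windows Explorer.
# (fsutil generally needs an elevated prompt; both mirrors must be on one volume.)
$previous = "D:\Mirrors\2009-01-01"
$current  = "D:\Mirrors\2009-01-02"
$source   = "C:\Users\David\Documents"

Get-ChildItem -Recurse $source | Where-Object { -not $_.PSIsContainer } | ForEach-Object {
    $relative = $_.FullName.Substring($source.Length + 1)
    $old = Join-Path $previous $relative
    $new = Join-Path $current  $relative
    New-Item -ItemType Directory -Path (Split-Path $new) -Force | Out-Null

    # Treat a file as unchanged if its size and last-write time match the prior mirror
    $unchanged = $false
    if (Test-Path $old) {
        $prior = Get-Item $old
        $unchanged = ($prior.Length -eq $_.Length) -and ($prior.LastWriteTime -eq $_.LastWriteTime)
    }

    if ($unchanged) {
        # Hard-link to the previous mirror's copy instead of storing it again
        fsutil hardlink create $new $old | Out-Null
    } else {
        # New or changed file: copy the actual bytes into the new mirror
        Copy-Item -LiteralPath $_.FullName -Destination $new
    }
}
```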


In part two I'll cover rsync, the intelligent mirror method I chose as the foundation of my new backup strategy, and why it didn't work for me.

