This is the second in a series of articles about a new backup process I have implemented for my home network. In the first article I covered some background information and why I choose a non-traditional backup process for my network. In this article I'll give an overview of mirror backups, their problems, and the ways in which they can be improved. I'll also talk about rsync and why for me, it was not the right tool for creating backups on Windows.

A more intelligent mirror backups process

Mirror backups have several advantages over the traditional backup process. Like a full backup they are a complete snapshot in time of the data being backed up but since they are typically stored on randomly accessible media (i.e. a disk or online), accessing any part of the backup set is easily accomplished. However, also like a full backup you usually have to copy each file to create the mirror even if that file has not changed since the last backup. This makes storing backup history both a time consuming process and an inefficient use of storage space. You can of course just copy the changed files into an existing mirror backup but then you lose any file change history from the last backup set.

While researching backup solutions I came across a UNIX utility called rsync. Rsync is a tool that was designed to efficiently create file set mirrors. It has a large and flexible set of features including its own network protocol that can efficiently transfer just the differences between versions of a file when creating a new mirror. The most appealing feature to me however is its ability to conserve storage space while still creating complete mirror backups. It does this by leveraging a feature called hard links which are found on modern files systems (including the Windows NTFS file system).

Hard links are a method by which the operating system separates the storage of the file's actual data from the file's directory entry. In other words, a file in a directory is really just a named pointer to the file's physical data which is stored somewhere else on the hard disk. Every file stored on a hard disk is stored in this way. Most of the time, there is a one-to-one mapping with each physical file's data having only one named directory entry or file name. However, some file systems allow you to create additional named references to the file's physical data. These reference can exist in the same directory or in any other directory on the file system. Furthermore, they can have the same or a completely different file name. When this is done they share all aspects of the actual data, including the file's attributes such as security settings, creation time, and the last modified time. If you change the file's data or attributes via one reference then all the other references will reflect those changes immediately. You can even delete the original reference to the data and the others will continue to live on. It is only after all references that point to the physical data have been deleted that the actual file data is deleted.

What makes rsync really special is that it can leverage this very powerful file system feature when creating mirror backups. It does this be examining the files being backed up, comparing each file against the copy in the previous mirror backup. Then for each file that it finds to be identical, instead of copying that unchanged file into the new mirror backup it simply creates a new hard link to the file as it exists in the previous mirror backup. This is much, much faster than re-copying the file itself but it essentially does the same thing. It makes the unchanged version of the file available in the newly created mirror backup. Furthermore it is also very space efficient since the file's data is still only stored once physically but it is accessible from every mirror backup that it is hard linked into.

By leveraging hard links, it also allows for another interesting feature. You can physically delete older mirror backups without affecting any of the new mirror backups that may be referencing unchanged data with these backups. Files that have other references will stay alive in those newer mirrors while older files that only have a reference in the mirror being deleted will be destroyed. This makes managing the amount of backup history to keep very easy; you can just delete older mirrors when you no longer want the history they contain.

Together these two features effectively allow you to break the traditional full/incremental backup cycle forever while keeping the best parts of each. You can have several full mirror backups available to preserve file history with no physical duplication of data for files that remain unchanged while still running a process that only copies newer files like an incremental backup. It is the best of both full backups and incremental backups combined in one logical backup process. You can read more about using rsync to create hard linking mirror backups here.

If rsync is for UNIX, what's this got to do with backing up Windows?

Rsync is a very UNIX-y tool and I am pretty firmly in the Windows camp. However you can still run rsync on Windows using the cygwin system. Cygwin is a software translation layer that was developed to allow certain types of UNIX programs to work on Windows systems. It has been around for a long time and is very functional and reliable. I used cygwin quite successfully to play with rsync on both Windows XP and Windows Server 2003. However for backing up data from Windows' NTFS file system there are some issues.

Windows' NTFS file system supports a wide array for features including security attributes, NTFS alternative data streams, sparse files, encrypted file, and much more. Cygwin and thus rsync know nothing about these additional file attributes since the translation layer was designed to make Windows seem like just another UNIX system. What this means is, that rsync can only copy the aspects of the files that it knows about, i.e. the file's regular data. So if you use rsync to make mirror backups on Windows you will essentially lose all of this other and sometimes important file and directory information (you also lose this information when simply copying files to a CD/DVD as well). This information however is usually preserved when doing traditional backups or when copying files between NTFS locations including across Windows networks. What was needed was a way for rsync to preserve this information on Windows.

I debated for quite some time on whether or not to dive in and start hacking away at the rsync source code to try to teach it more about the Windows NTFS file system and this additional file data. I even went so far as to patch cygwin to teach it about Windows' GLOBALROOT device paths, a feature that was essential in order to use Windows' Volume Shadow Copy Service (VSS) with rsync (this patch is now part of the cygwin CVS source, Btw) . However in the end I decided it would probably be much more effort to update rsync than it would be to write my own backup process. Rsync has a lot of features that I do not need and even though my new backup process was likely to be missing some of the features that I really liked about rsync, my new process would still contain the core method by which rsync creates space-conserving mirror backups.

This decision to start fresh on my own backup tool coincided with the time that I had started to play with another tool, Microsoft's recently released PowerShell scripting language. It wasn't long before I realized the PowerShell was a perfect fit for this type of problem.

 

In part three I'll cover the tools I used to implemented an intelligent mirror backup process in Windows using robocopy, Microsoft's PowerShell scripting language, C#, the Volume Shadow Copy Service, C++ and a little bit of COM interop.


Comments are closed

Flux and Mutability

The mutable notebook of David Jade