This is the second in a series of articles about a new backup process I have implemented for my home network. In the first article I covered some background information and why I chose a non-traditional backup process for my network. In this article I'll give an overview of mirror backups, their problems, and the ways in which they can be improved. I'll also talk about rsync and why, for me, it was not the right tool for creating backups on Windows.

A more intelligent mirror backup process

Mirror backups have several advantages over the traditional backup process. Like a full backup, they are a complete snapshot in time of the data being backed up, but since they are typically stored on randomly accessible media (i.e. a disk or online storage), accessing any part of the backup set is easy. However, also like a full backup, you usually have to copy every file to create the mirror, even if that file has not changed since the last backup. This makes storing backup history both a time-consuming process and an inefficient use of storage space. You can of course just copy the changed files into an existing mirror backup, but then you lose any file change history from the last backup set.

While researching backup solutions I came across a UNIX utility called rsync. Rsync is a tool that was designed to efficiently create file set mirrors. It has a large and flexible set of features, including its own network protocol that can efficiently transfer just the differences between versions of a file when creating a new mirror. The most appealing feature to me, however, is its ability to conserve storage space while still creating complete mirror backups. It does this by leveraging a feature called hard links, which is found on modern file systems (including the Windows NTFS file system).

Hard links are a method by which the operating system separates the storage of the file's actual data from the file's directory entry. In other words, a file in a directory is really just a named pointer to the file's physical data, which is stored somewhere else on the hard disk. Every file stored on a hard disk is stored in this way. Most of the time there is a one-to-one mapping, with each physical file's data having only one named directory entry or file name. However, some file systems allow you to create additional named references to the file's physical data. These references can exist in the same directory or in any other directory on the file system. Furthermore, they can have the same or a completely different file name. When this is done they share all aspects of the actual data, including the file's attributes such as security settings, creation time, and the last modified time. If you change the file's data or attributes via one reference then all the other references will reflect those changes immediately. You can even delete the original reference to the data and the others will continue to live on. It is only after all references that point to the physical data have been deleted that the actual file data is deleted.
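
To make that concrete, here is a quick way to see hard links in action on NTFS using the fsutil tool that ships with Windows (the paths are just examples, and fsutil may require an elevated prompt on newer versions of Windows):

    # Create a file, then a second directory entry pointing at the same data.
    Set-Content -Path C:\Temp\original.txt -Value "hello"
    fsutil hardlink create C:\Temp\link.txt C:\Temp\original.txt

    # Both names now refer to the same physical data and attributes.
    Get-Content C:\Temp\link.txt        # prints "hello"

    # Deleting the original name does not delete the data...
    Remove-Item C:\Temp\original.txt
    Get-Content C:\Temp\link.txt        # ...the remaining reference still works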

What makes rsync really special is that it can leverage this very powerful file system feature when creating mirror backups. It does this by examining the files being backed up, comparing each file against the copy in the previous mirror backup. Then, for each file that it finds to be identical, instead of copying that unchanged file into the new mirror backup it simply creates a new hard link to the file as it exists in the previous mirror backup. This is much, much faster than re-copying the file itself but it essentially does the same thing: it makes the unchanged version of the file available in the newly created mirror backup. Furthermore, it is also very space efficient since the file's data is still only stored once physically, yet it is accessible from every mirror backup that it is hard linked into.
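
For reference, a typical invocation of this technique looks something like the following (the directory names are only placeholders). The --link-dest option points rsync at the previous mirror so that unchanged files are hard linked into the new one instead of being copied again:

    # Build today's mirror; unchanged files become hard links into yesterday's mirror.
    rsync -a --link-dest=/backups/2007-01-01/ /data/ /backups/2007-01-02/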

Leveraging hard links also allows for another interesting feature: you can physically delete older mirror backups without affecting any of the newer mirror backups that may be referencing unchanged data within them. Files that have other references will stay alive in those newer mirrors, while older files whose only reference is in the mirror being deleted will be destroyed. This makes managing the amount of backup history to keep very easy; you can just delete older mirrors when you no longer want the history they contain.
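
In practice that means retiring a backup set is nothing more than deleting its directory (the path below is just an example):

    # Retire the oldest mirror. Files still hard linked from newer mirrors keep
    # their data; files referenced only by this mirror are actually deleted.
    Remove-Item -Recurse -Force D:\Backups\2007-01-01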

Together these two features effectively allow you to break the traditional full/incremental backup cycle forever while keeping the best parts of each. You can have several full mirror backups available to preserve file history, with no physical duplication of data for files that remain unchanged, while still running a process that only copies newer files like an incremental backup. It is the best of both full backups and incremental backups combined in one logical backup process. You can read more about using rsync to create hard-linked mirror backups here.

If rsync is for UNIX, what's this got to do with backing up Windows?

Rsync is a very UNIX-y tool and I am pretty firmly in the Windows camp. However, you can still run rsync on Windows using the cygwin system. Cygwin is a software translation layer that was developed to allow certain types of UNIX programs to work on Windows systems. It has been around for a long time and is very functional and reliable. I used cygwin quite successfully to play with rsync on both Windows XP and Windows Server 2003. However, for backing up data from Windows' NTFS file system there are some issues.

Windows' NTFS file system supports a wide array of features including security attributes, NTFS alternate data streams, sparse files, encrypted files, and much more. Cygwin, and thus rsync, knows nothing about these additional file attributes since the translation layer was designed to make Windows seem like just another UNIX system. What this means is that rsync can only copy the aspects of the files that it knows about, i.e. the file's regular data. So if you use rsync to make mirror backups on Windows you will essentially lose all of this other, and sometimes important, file and directory information (you also lose this information when simply copying files to a CD/DVD). This information, however, is usually preserved when doing traditional backups or when copying files between NTFS locations, including across Windows networks. What was needed was a way for rsync to preserve this information on Windows.
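
Alternate data streams are a good example of what gets left behind. Newer versions of PowerShell (3.0 and later) can show them directly; the file and stream names below are made up:

    # Create a file, then attach a named alternate data stream to it.
    Set-Content -Path C:\Temp\report.doc -Value "main document text"
    Set-Content -Path C:\Temp\report.doc -Stream comments -Value "draft - do not send"

    # Lists the main data stream plus the extra "comments" stream.
    Get-Item -Path C:\Temp\report.doc -Stream *

    # A copy made through cygwin/rsync only carries the main data stream,
    # so the "comments" stream is silently dropped.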

I debated for quite some time on whether or not to dive in and start hacking away at the rsync source code to try to teach it more about the Windows NTFS file system and this additional file data. I even went so far as to patch cygwin to teach it about Windows' GLOBALROOT device paths, a feature that was essential in order to use Windows' Volume Shadow Copy Service (VSS) with rsync (this patch is now part of the cygwin CVS source, by the way). However, in the end I decided it would probably be much more effort to update rsync than it would be to write my own backup process. Rsync has a lot of features that I do not need, and even though my new backup process was likely to be missing some of the features that I really liked about rsync, it would still contain the core method by which rsync creates space-conserving mirror backups.

This decision to start fresh on my own backup tool coincided with the time that I had started to play with another tool, Microsoft's recently released PowerShell scripting language. It wasn't long before I realized that PowerShell was a perfect fit for this type of problem.
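
To give a flavor of why it fits (and nothing more; this is not my actual implementation, which part three covers, and the paths and the simple size/timestamp comparison here are only placeholders), the core link-or-copy decision can be sketched in a few lines of PowerShell:

    # A heavily simplified sketch of one mirror pass: hard link unchanged files
    # into the new mirror, copy changed ones. No error handling, illustrative paths.
    $source     = 'C:\Data'
    $prevMirror = 'D:\Backups\2007-01-01'
    $newMirror  = 'D:\Backups\2007-01-02'

    Get-ChildItem -Path $source -Recurse |
        Where-Object { -not $_.PSIsContainer } |
        ForEach-Object {
            $relative = $_.FullName.Substring($source.Length).TrimStart('\')
            $prevCopy = Join-Path $prevMirror $relative
            $newCopy  = Join-Path $newMirror  $relative

            # Make sure the file's directory exists in the new mirror.
            New-Item -ItemType Directory -Path (Split-Path $newCopy) -Force | Out-Null

            $previous = $null
            if (Test-Path $prevCopy) { $previous = Get-Item $prevCopy }

            if ($previous -ne $null -and
                $previous.Length -eq $_.Length -and
                $previous.LastWriteTimeUtc -eq $_.LastWriteTimeUtc) {
                # Unchanged: add a hard link to the copy in the previous mirror.
                fsutil hardlink create $newCopy $prevCopy | Out-Null
            }
            else {
                # New or changed: physically copy the file into the new mirror.
                Copy-Item -Path $_.FullName -Destination $newCopy
            }
        }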

 

In part three I'll cover the tools I used to implement an intelligent mirror backup process in Windows using robocopy, Microsoft's PowerShell scripting language, C#, the Volume Shadow Copy Service, C++, and a little bit of COM interop.


This is the first in a series of articles about a new backup process I have implemented for my home network. In this first article I'll cover background information and why I chose a non-traditional backup process. In future articles, I'll cover my implementation of this system for backing up data on Windows.

Note: Currently the custom backup system described here is not publicly available. That is something that I am looking at doing in the future, but as it exists now it is not in a state that would be usable by the general public.

The nature of traditional backup processes

I’ve been thinking off and on for quite some time about setting up a new backup system for my home network. I started down this path for a better backup solution because I was not really happy with my existing NT Backup scripts. At first I thought I would continue to leverage NT Backup with a better set of scripts to handle backup rotations, but the more I thought about it the more I became convinced that this was not the way to build a backup system for the future.

Traditional backup programs are still largely built on concepts from the days when everyone backed up their data to cheap backup media (i.e. discs or tape). They are typically designed to back up all information to a set of backup media. Even though most backup programs now allow you to back up data to external hard disks or network locations, most still create monolithic backup files. Getting to the data once it's on the backup media usually involves a restore process that moves the individual file data back to a hard disk.

Nowadays, however, hard disks are cheap and online storage is practically getting cheaper by the minute. Backing up using a process that creates large, monolithic backup files just doesn't make as much sense anymore. Hard drives are very good at storing individual files, and online backup services are most efficient when they can upload small incremental file changes rather than larger monolithic backup files.

After this realization I started to look for possible alternatives to the monolithic backup process. Most of what I found offered little more than what NT Backup and some clever scripts had to offer. There were some notable exceptions, and the solution I ultimately settled on is largely inspired by one of these exceptions, a tool named rsync. However, once I made the decision to move away from NT Backup and a traditional, monolithic backup system, things got a lot more complicated.

Along the way I've learned a number of new technologies: Microsoft's new PowerShell scripting language, the Windows Volume Shadow Copy Service (VSS), Windows' GLOBALROOT device namespace, and the pitfalls of .NET/CLR and COM interop. I even dived pretty deep into cygwin development at one point along the way.

A more complicated setup requires a more complicated backup plan

I don’t have what one would call a typical home network. On my home network I have several laptops, workstations, a media computer, and a server. My server is a Windows Server 2003 machine acting as a domain controller, and I use domain accounts for all computer logins. I also use several advanced features only available to computers participating in an AD domain environment, such as user account folder redirection and domain-based group policy settings for management. This server also runs Microsoft Exchange Server, which is set up to automatically download POP email for all users and make it available from multiple email clients. I even have a few SQL Servers running here and there, although so far this is not data that I have been too concerned with (mostly development projects with test data).

This configuration has allowed me to set up an environment where sharing resources is a cornerstone of the way we use our computers. Exchange allows access to our email from anywhere with rich calendar and address book support, whether simultaneously from multiple computers, remotely over the web (via the OWA client), or in offline mode on our computers. Folder redirection and offline files allow us to share and sync data effortlessly with shared desktops, documents, favorites, or any other resource on the network, including shared applications. I can freely move from computer to computer, from inside the network to outside, and still have access to my email, data, and other network resources anywhere I go, whether online or offline. If it sounds like a complicated setup, that's because it is, but from a management point of view, once it was set up I spend very little time keeping it running. The downside to this configuration is that my data backup scenario is a bit more complicated than just copying files to a CD/DVD.

My primary goal of course is to not lose a user's data: the documents, photos, email, and other things that we all create. My secondary goal, however, is to preserve the state of the entire system. By state of the entire system I mean all the settings and configurations for each user account, on each computer, as well as the OS configuration of each computer. As I said before, it is a complicated system, and if something goes wrong I want to avoid having to rebuild systems from scratch if at all possible. My goal is to get back up and running as quickly as possible.

Running a domain controller also complicates backup strategies. Since the login accounts are all Active Directory domain accounts, if the state of my server is lost then so are the user accounts. Email is stored centrally in the Exchange database, which is tied to these AD accounts and requires a special backup process as well. Altogether this necessitates something more than copying data files to a disc or an external hard disk.

My overall goal in creating a backup system is both data preservation and as little downtime as possible when something does go wrong (and it will at some point). But I also want this to be as automatic as possible, something I can just set and largely forget. And I want it to be resilient too. These are bold goals for such a complicated setup and a one-man IT shop. Yet my previous solution of scripting NT Backup, while simplistic, met some of these goals but failed at many others.

Why my previous traditional backup solution wasn't good enough

My previous solution provided basic data and system state protection. I used simple scripts to control NT Backup to back up both the user data and the Windows system state nightly for each computer. I also used NT Backup to back up my Exchange server's data as well. I backed everything up to a secondary disk on my network. However, I only ever kept the previous backup set, so I had a history of exactly two backup cycles. By doing so, however, I had at least three copies of everything stored in two separate places.
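
For a sense of what those scripts wrapped, a nightly system state job boiled down to a single NT Backup command along these lines (the job name and path here are only illustrative, not what my scripts actually used):

    # Back up the Windows system state to a .bkf file on the backup disk.
    ntbackup backup systemstate /J "Nightly system state" /F "D:\Backups\server-systemstate.bkf"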

My data volume, however, presented issues right from the start. I have a lot of data on one workstation in particular, mostly very large photo files. The data set size for that machine alone is currently 40GB if I don't count DV video, which effectively rules out running nightly full backups and keeping a lot of backup history. Incremental backups offer one solution to this problem, but I am not a fan of creating long chains of incremental backup sets as it makes the restore process more complicated and time consuming. They also have another drawback: if one of the incremental backup sets fails, the chain is broken and you could lose the only backup copy of a particular file. As a compromise I eventually settled on a simple rotation scheme of full backups and differential backups. This made for a restore process that was at most two steps, a full backup restore followed by a single differential backup restore, while also ensuring that I had some file change history preserved.

While my old scheme provided basic data protection, I felt that it was lacking in several ways. First off, it didn’t keep very much history and I wanted to keep more. It’s easy to not notice that something has gone wrong for quite a while with that volume of data. If a single file has been changed or lost, chances are that I will know immediately as I probably caused it myself. But it’s not always that simple; files can get corrupted when hard drives have minor and unnoticeable failures. Other actions can also have consequences that aren’t always immediately apparent. Any good backup system should keep a reasonable amount of history.

Secondly, a lot of redundant data was flying around each night. Differential backups are not very efficient. I had scheduled my backups to run in a staggered sequence so that they were not competing for the server’s bandwidth all at once. Still, so much duplicate data was getting copied each night that the process took around 3 hours, and sometimes a lot longer if a full backup of any machine was triggered. I also know that my data volume will continue to grow. I already have data (DV video and other media files) that I don’t currently back up that I should. With my old process I couldn’t grow the data volume much before backups would have started to take all night long or longer. I needed to stop copying duplicate data as much as possible.

Lastly, I had no solution for offsite backups, nor did I have a solution for archiving data that doesn’t change much. I have DVD burners and I even have a 70GB DAT tape drive, but they were not integrated into my backup process. By using NT Backup on each machine I was left with a collection of large monolithic backup files that would not fit on backup media without file splitting, nor could they be sent to an Internet backup service in a reasonable amount of time.

Mirror mirror

From examining my current situation and researching possible new solutions, it was clear to me that there are now better ways to back up data than the traditional methods. What I really wanted was a mirror backup. However, since a mirror backup is a complete backup of everything frozen at a point in time, a simple file copy scheme is a very inefficient use of storage space when keeping history. Mirror backups do, however, make it very easy to upload incremental changes to offsite storage, as you can easily detect and upload just the files that changed since the last backup set. What is needed is an intelligent mirror backup that conserves space when storing history. One way to do this is by eliminating the physical storage of duplicate files. The solution I ultimately settled on does this by leveraging features of the NTFS file system to both eliminate duplicate file storage and still make it possible to browse a complete mirror backup with Windows Explorer.

 

In part two I'll cover the intelligent mirror method I chose to be the foundation of my new backup strategy, rsync, and why it didn't work for me.


Here’s a tip for tracking important email responses that you’re waiting for. Many email clients these days support flagging or labeling items as well as rules for processing items. I use Outlook as my email manager, which has a robust set of rules and, in the 2003 version, something called “search folders”, which are saved searches that you can quickly access. Using Outlook I can set up folders to automatically sort all of my emails with a certain flag into one place for quick access. In my email system I use flags for managing the flow of “Next Actions” as defined in GTD (Getting Things Done by David Allen). Right now I have a fairly simple set of flags that I use: “Action”, “Deferred”, and “Waiting For”. Anything else that is not flagged is by default considered reference information (or completed actions if I have checked them off).

The “Action” and “Deferred” flags are pretty self-explanatory; they mean that there is an action (or possible future action) that I need to perform, usually something that takes more time than a quick response. “Waiting For” is also fairly self-explanatory, but its use is sometimes a little more complicated in practice.

In most cases I use it to simply flag something that I am waiting for, such as flagging an order shipment email or flagging something important that someone has told me they would follow up on. Where it gets more complicated is when I need to track a request that I am sending to someone. In those cases I use a special rule to flag a copy of the item as “Waiting For”. In Outlook I have a rule set up that checks for emails that I have Cc’d to myself. When Outlook’s rule engine finds those items it marks them as read and flags them as “Waiting For”. Now whenever I want to track a request to someone, I just Cc myself on the request and a reminder will automatically be generated and filed in my “Waiting For” folder.


I’ve recently discovered a name (and a tool) for something that most of us do but that I didn’t know had a name. It’s called Mind Mapping, and if you’ve written on a whiteboard chances are you’ve already done it. Mind Mapping in its simplest form is just writing down your thoughts in a visual way, i.e. text connected with lines and more text. The real discovery for me however was not that it had a name but rather that there are tools out there that facilitate creating mind maps on a computer. The product I found is called MindManager from MindJet, and from what I can tell they seem to be the market leader, and for good reason. They have a well-designed and functional product on par with anything in Microsoft Office, which it can fully integrate with.

While MindManager can be used with any PC, the real beauty of this program is that it is probably the best Tablet PC program out there, far better than anything Microsoft has ever produced. When installed on a Tablet PC it has a special Pen mode that takes full advantage of all the features of a Tablet PC, including gestures. Creating mind maps this way is very fast and intuitive. You can insert text and drawings, make connections, and more, all with simple strokes of the Tablet PC’s pen.

So what does a MindManager mind map look like? Here’s a simple example (you can click on it for a larger view):

[Example mind map image]

This is a very simple example of a mind map (there are many more in the MindJet.com online gallery) and in fact they can look just about any way you like. There is an abundance of styles and images included that you can use to build mind maps. You can also link in live data from other sources (Excel, Outlook, databases, etc.). In addition to capturing text you can also insert hyperlinks to other maps or external sources.

Besides just capturing thoughts and ideas, you can use mind maps for all sorts of things, like taking notes during a meeting, creating process flow diagrams, or even creating dashboard-type diagrams that link to other sources of information.

After reading about it I was eager to try it, but I waited until I had a good idea that I wanted to capture. I’ve always been unsatisfied with handwritten notes, both on paper and electronic, especially when trying to capture complex thoughts. Usually I find that when I go back to them (if I can read them) I have a hard time recalling the thinking behind the notes. My notes just don’t seem as connected to my original thoughts or my original thought process. So when my next big idea came along, I picked up MindManager and started creating a mind map of it. I was amazed at how quickly I could capture thoughts and, more importantly, relate them to one another. Rearranging my thoughts throughout the capture process was also easily done just by dragging things around. I found it far more effective at capturing and organizing my thoughts than paper. Running out of room for inserting new thoughts was just not an issue at all.

The real test though came days later, after I had captured my original thought stream and went back to try and decipher it. I found that my mind map notes made more sense to me than the chicken scratch I normally write down. I could see the connections between my thoughts much better than with my old note-taking style. The best thing however was how easy it was to extend the map with new thoughts. As I drilled down deeper into my original notes, adding new layers was by far easier than with any other system of note taking I have ever used. It made me an instant convert. I would highly recommend giving it a try.

MindManager is available for both the PC and Mac. You can download a free 21-day trial from MindJet.com. Be warned though that it is not cheap, retailing at $350 for the Pro version (I got my copy for $275 on eBay).


In addition to a Pocket PC I also use a Tablet PC. It’s a convertible model and I don’t always use it in tablet mode, but that is changing as I find more tools to fill the gaps I see in the user interface.

As I mentioned, when using a handwriting user interface I’m a fan of something called gestures. The Tablet PC has these too, but it is up to each application to map them to commands, and so far few applications do so. There are a great many things that gestures could be used for, like selection or Cut, Copy, and Paste, that would make the Tablet PC feel much more natural. It’s a shame that Microsoft never implemented a system-wide gesture user interface on the Tablet PC. However, like Calligrapher, which added this feature to the Pocket PC, there is something called StrokeIt for the Tablet PC (actually any PC) that adds both system and application-specific gestures.

StrokeIt is not made specifically for the Tablet PC but despite this, it works really well. As such it doesn’t use most of the Microsoft-defined gestures, but you can teach it any new gesture you want, including the missing ones defined by Microsoft. With StrokeIt you can use gestures anywhere to invoke any set of keys or commands using a macro system for defining gesture actions. They’re not always the easiest thing to set up, but they are quite powerful, and you can set up gestures and actions to be global or per-application. It comes with a bunch of pre-defined actions for popular applications to get you started. It’s also very lightweight, taking just 360k on my Tablet PC (by contrast, the Tablet PC recognition UI takes about 34,000k).

Using StrokeIt my Tablet PC feels much more natural. When I need to invoke the spell checker, I can just draw a big checkmark like on my Pocket PC. I can also set up gestures and actions for Cut, Copy, and Paste or anything else I want. And best of all StrokeIt is completely free for personal use. If you’re using a Tablet PC I would highly recommend trying StrokeIt.

