I use iTunes on Windows only because I have to. I don't really care for it as a music organizer and besides, I rarely listen to music on my computer. I have an iPod though and what I do like about iTunes is that I can just plug in my iPod and it will automatically sync everything without any intervention from me. I also like that iTunes allows me to subscribe to podcasts. It annoys me though that unless I have iTunes running it will not update my podcast subscriptions. Since I only use iTunes for syncing my iPod, iTunes is never running.

Luckily iTunes has a COM automation interface that you can use to force it to update your podcasts. With a bit of PowerShell scripting I am now able to automatically update all of my podcasts in the middle of the night, when I don't have to be bothered with the fact that iTunes is running. I used PowerShell for this scripting task because, once again, it seems perfect for this sort of thing.

My script is a little more complicated than just starting up iTunes and then starting the podcast downloads. Since I am scheduling this process to run nightly, I also wanted to be able to close iTunes once it was finished. Unfortunately the iTunes COM automation interface provides no way to tell whether or not iTunes is busy downloading new podcasts. However, iTunes seems to create a pretty predictable temporary download folder structure, which it then removes when it has finished downloading all of the podcasts that it updates. I used this bit of knowledge in my script to detect when iTunes was busy downloading new podcasts. I just watch for those directories, wait for iTunes to finish, and then shut iTunes down. So far this has worked pretty well, although I'm sure it is not completely robust under all circumstances.

Here is the PowerShell script I developed. If you want to use it, you'll have to tweak the directory path that is checked in the script to match your computer. I should also mention that on at least one of my machines iTunes creates the temporary 'downloads\podcast' directory in a different place under the iTunes music folder. I have not found a setting in iTunes that determines where the temporary folder gets created, so if this does not work you may have to poke around a little while iTunes is downloading podcast updates to figure out where the folder is on your machine.

 

# update iTunes podcasts

function test-Downloading
{
    if(test-path 'C:\Documents and Settings\MusicBox\My Documents\My Music\iTunes\iTunes Music\Downloads\Podcasts')
    {
        return $True
    }
    return $False
}


$iTunes = new-Object -comobject iTunes.Application
if($iTunes -ne $null)
{
    'iTunes started' | out-Host

    # start iTunes podcasts update
    'updating podcasts' | out-Host
    $iTunes.UpdatePodcastFeeds()

    # set a timeout
    $TimeOut = (get-Date).AddMinutes(30)

    # wait a short while for downloads to start
    'waiting for downloads to start...' | out-Host
    start-sleep (30)

    # loop while iTunes seems to be downloading (presence of temporary download folder)
    'checking for download activity' | out-Host
    while((test-Downloading) -and ((get-Date) -lt $TimeOut))
    {
        # give it some more time
        'download activity detected, waiting...' | out-Host
        start-sleep (60)
    }

    if((get-Date) -ge $TimeOut)
    {
        'downloading timed-out' | out-Host
    }

    # now quit iTunes after a little settling time
    start-Sleep (30)
    'quitting iTunes' | out-Host
    $iTunes.Quit()
    $iTunes = $null
}
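
To run the script unattended each night you can wire it up with the Windows Task Scheduler. Something along these lines should work; the task name, time, and script path are placeholders, and the exact schtasks syntax varies a little between Windows versions:

schtasks /create /tn "Update iTunes Podcasts" /sc daily /st 03:00:00 /tr "powershell.exe -noprofile -command C:\Scripts\update-podcasts.ps1"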

This is the third in a series of articles about a new backup process I have implemented for my home network. In the previous article I covered a mirror backup process that maintains a storage-efficient backup history. In this article I'll cover the tools I used and the issues I had to overcome while using them.

Common tools and a not so common use of them

Once I had decided to create a backup system that creates space-conserving mirror backups by leveraging NTFS hard links, I set out to make a simple prototype. It occurred to me that I already had a very good tool for copying data around, a free tool called robocopy from the Windows Resource Kit. Robocopy is a very powerful file copying tool that can be configured in a multitude of ways, including the ability to copy files in backup mode, a special mode of file access that can be used to bypass file security for the purposes of backing up files. It is also faster and more reliable than the file copy tools that come with Windows and has a very good set of options to control which files to copy. However, robocopy knows nothing about creating hard links to previous versions of files. That step I would have to do myself.
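
To give a flavor of it, a robocopy invocation along these lines (the paths are just illustrative) copies an entire tree in backup mode, preserving file attributes and security, retrying briefly on errors and logging the run:

robocopy D:\Data E:\Backups\Current /E /COPYALL /B /R:2 /W:5 /NP /LOG:E:\Backups\robocopy.log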

In searching for information on how to create hard links, it wasn't long before I ran across references to the fsutil tool that is included in Windows XP and Windows Server 2003. Using this tool you can create NTFS hard links from the command line.
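
For example, this creates a second directory entry for an existing file (paths are illustrative); both names then refer to the same physical data on disk:

fsutil hardlink create E:\Backups\New\photo.jpg E:\Backups\Previous\photo.jpg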

Together with robocopy and a bit of creative CMD scripting, I was able to throw together a prototype that could create mirror backups while hard linking to the files that had not changed since the previous backup, just like rsync does. I started by duplicating the directory structure of the old backup by using robocopy to copy just the directories. Next I used fsutil to hard-link copies of the previous backup's files into the new directories. I did this by traversing the old backup directories and using fsutil to create hard links to each of the older files. Then I used robocopy to generate a list of the files that had changed since the last backup, including files that were no longer present. From that listing I then deleted those files from the newly created mirror backup. Finally, I used robocopy to copy just the newer files into the new mirror backup. While it wasn't the most efficient method, it worked pretty well, but it had one important limitation: fsutil only works on local disks. It was also a pretty hacky bit of CMD script since I had to do string manipulation to create the hard links. I had considered re-writing the whole process in C# but then something else popped up on my radar.

PowerShell, isn't that some sort of new gasoline?

It was about this time that Microsoft released RC2 of PowerShell (which has just recently gone RTM). PowerShell is Microsoft's new administrative scripting language for the future. Besides being a very good replacement for command shell scripting and VBScript, it is also the new foundation of the management tools for the next version of Microsoft Exchange. It is an amazingly powerful scripting language, easily learned, easily extended, and easily the most important tool I have learned in a long time.

PowerShell is different from other scripting languages because it is based on the concept of pipelining objects. Many scripting languages, including the native Windows shell, support pipelining text data from command to command. PowerShell is different in that it pipelines complete .NET objects instead of just textual data. As full .NET objects, each object in the pipeline has state, properties, and methods. Objects can be passed as parameters to functions, extended dynamically, coerced into other types, and placed back into the pipeline. Functions in PowerShell can also be treated as objects, allowing you to do some types of functional programming tasks that are not easily done in other .NET languages. It is a very powerful idea and my brief description doesn't even scratch the surface of the power that lies within PowerShell. It is all still very new to me but already I am finding many uses for it.
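
For instance, a one-liner like the following (with an illustrative path) pushes FileInfo objects, not text, down the pipeline, so each stage can work directly with properties like Length and LastWriteTime:

get-ChildItem 'D:\Data' -recurse |
    where-Object { $_.LastWriteTime -gt (get-Date).AddDays(-1) } |
    sort-Object Length -descending |
    select-Object FullName, Length, LastWriteTime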

Tip: Here's a PowerShell gotcha to keep in mind. Every expression in PowerShell that produces output places that output in the pipeline. This can lead to pretty weird debugging issues if you aren't careful. I had more than one case where a function was returning more than I wanted because I was calling a command that placed things in the pipeline without my realizing it. There are two ways to avoid this: assign the output of commands to a variable, or redirect the output to $null (i.e. do-something > $null).
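
A contrived example of the problem and the fix:

function add-Numbers
{
    # new-Item emits the object it creates into the pipeline...
    new-Item -type directory (join-Path $env:TEMP 'Scratch')
    return 2 + 2
}
# ...so $result ends up being an array: a DirectoryInfo object AND the number 4
$result = add-Numbers

function add-NumbersFixed
{
    # suppress the unwanted output by redirecting it to $null
    new-Item -type directory (join-Path $env:TEMP 'Scratch2') > $null
    return 2 + 2
}
# now $result is just 4
$result = add-NumbersFixed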

PowerShell's object pipeline nature, along with the rich set of built-in commands known as cmdlets, makes for a perfect system for doing administrative computer tasks. There are cmdlets for accessing PowerShell providers such as the file system and the registry, for accessing WMI objects and COM objects, and for the full .NET 2.0 Framework. I've seen examples of everything from simple file-parsing scripts to a small but complete HTTP server written in PowerShell in just a few lines of code. To me it appeared to be the perfect language for scripting a new backup process. However PowerShell does not offer support for creating NTFS hard links either. For this I would need to extend PowerShell.
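
A few one-liners hint at the reach; the registry provider, WMI, COM automation, and the .NET Framework are all directly accessible:

get-ChildItem HKLM:\SOFTWARE                               # registry provider
get-WmiObject Win32_LogicalDisk -filter "DriveType = 3"    # WMI
$shell = new-Object -comobject Shell.Application           # COM automation
[System.Math]::Round([System.Math]::PI, 4)                 # the .NET Framework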

Extending PowerShell through custom C# objects and P/Invoke

Starting with Windows XP there is a new API for creating hard links, CreateHardLink. In previous versions of Windows, creating hard links was somewhat of a black art: you had to use the complex and sparsely documented Win32 Backup APIs. It could be done, and there are examples of how to do it out there, but it was not for the faint of heart. The CreateHardLink API solves that, making it almost trivial to create hard links on NTFS. Furthermore, unlike fsutil, the CreateHardLink API fully supports creating hard links on remote network NTFS drives. PowerShell cannot easily call native APIs on its own though. To do that, you need to extend PowerShell with a bit of .NET code.

PowerShell is very easy to extend. You can write complete cmdlets, objects that fully plug into the PowerShell pipeline framework, or you can just create simple .NET objects that PowerShell can load and invoke thanks to its ability to access the .NET Framework.

Using C# and a bit of P/Invoke, it was almost trivial to solve the problem of not being able to create hard links in PowerShell (and .NET) by writing a simple object that called the Win32 CreateHardLink API. Once that was done, I could easily create my new .NET object in PowerShell and use it to create all the hard links that I wanted. Now I could create a more complete backup script from the ground up using PowerShell.

If you'd like to access the CreateHardLink API in PowerShell or .NET, here is a C# code snippet to help you. Simply create a new class in a .DLL and add these methods (you'll also need a using directive for System.Runtime.InteropServices). I made the wrapper a static member since it does not require any state from the class, which also makes it very easy to call from PowerShell.

[DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Auto)]
internal static extern int CreateHardLink(string lpFileName, 
    string lpExistingFileName, IntPtr lpSecurityAttributes);

static public void CreateHardlink(string strTarget, string strSource)
{
    if(CreateHardLink(strTarget, strSource, IntPtr.Zero) == 0)
    {
        throw new System.ComponentModel.Win32Exception(Marshal.GetLastWin32Error());
    }
}

To call this code from PowerShell, you simply load the .NET assembly and then call the static method on your class. Note that this will throw an exception if it fails, so make sure you have a PowerShell trap handler somewhere in your script (a minimal example follows below).

# load the custom .NET assembly
[System.Reflection.Assembly]::LoadFrom('YourLibrary.dll')

# create a hard link
[YourLibraryName.YourClass]::CreateHardlink($Target, $Source) > $null
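
A minimal trap handler, placed near the top of the script, might look like this:

# catch any terminating error (including a failed CreateHardlink call)
trap
{
    'error: ' + $_.Exception.Message | out-Host
    break   # stop the script instead of continuing with a partial backup
}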

Whoops, that file is in use

There was still one more issue to tackle before I could write a robust backup system: accessing files that are in use. Starting with Windows XP, Microsoft introduced a new system for accessing files that are currently in use, the Volume Shadow Copy Service (VSS for short, but not to be confused with Microsoft's VSS source control system).

One of the ideas behind VSS is that, when requested, the OS will make a read-only copy of the drive, a snapshot frozen in time, available to a backup program. Other programs can continue to change the original disk files but this shadow copy, or snapshot, will remain frozen and completely accessible to the program that created it. Furthermore, when a backup program requests that a shadow copy be created, the OS can coordinate with shadow copy providers to ensure that the data on the disk is in a consistent state before the shadow copy is created. This ensures that the files the backup program has access to are in a consistent enough state on the disk to be backed up. This is especially useful for files that are either always open or always changing, like the system registry, user profiles, Exchange, or SQL databases. Once the backup program is finished with this temporary read-only shadow copy, it releases it and it disappears from the system. By using the VSS system, backup programs can gain access to every file on the drive even if those files are exclusively in use by other programs. For me it was essential to use VSS with any backup process I implemented.

There were a few tough problems though. On Windows XP these VSS snapshots are very temporary in that they only exist for as long as you hold a reference to them via COM. Once released, they auto-delete themselves. And unlike VSS on Windows Server 2003, they cannot be exposed as a drive letter for easy access. You have to access them via the native NT kernel's method of addressing NT namespace objects, the GLOBALROOT namespace. On XP, when you ask the VSS service to create a snapshot, what you get back is an NT GLOBALROOT path that looks like this: \\?\GLOBALROOT\Device\HarddiskVolumeShadowCopy1. Unfortunately this is something that not even the native Windows command shell fully understands, and if you try to access it from PowerShell or .NET you'll get an exception telling you that you really shouldn't be accessing internal NT paths in .NET. To solve this I would need another bit of custom code to extend PowerShell.

VSHADOW.EXE and exposing a snapshot as a drive letter

VSHADOW is a sample tool that is part of the VSS SDK. It is a command line interface to the VSS API. By using this tool you can create and release VSS snapshots at will. It even has a way around the COM auto-destruction of snapshots on Windows XP by allowing you to call an external program once the snapshot has been created so that you can access the snapshot while VSHADOW is still keeping it alive. It will even create a set of environment variables for you to let you know the names of the GLOBALROOT shadow copies that it has created. This still didn't solve my problem of not being able to access them via PowerShell though (or robocopy for that matter) but having this source code was a good start.
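
As an illustration (and assuming a vshadow build that supports the -script and -exec options described in the SDK sample documentation), a run looks roughly like this; vshadow creates the snapshot of C:, writes the snapshot names to environment variables via setvars.cmd, runs the backup script while holding the snapshot alive, and then releases it:

vshadow.exe -script=setvars.cmd -exec=run-backup.cmd C: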

All physical devices in Windows, like hard drives, exist in the GLOBALROOT namespace. It is only through device name mapping that we can access them via their friendly DOS names like C:, D:, etc. Normally the OS creates these device mappings automatically at startup or whenever a new device is connected. VSS snapshots however don't automatically get recognized and mapped. Mapping a friendly name to a VSS snapshot has to be done directly with the Win32 DefineDosDevice API. Using this API you can create and remove DOS device mappings to VSS snapshots on the fly, even on Windows XP. But since VSS snapshots are temporary you have to manage them carefully or the system can become unstable.
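
For illustration only: I ended up doing this inside a custom COM object (described below), but the raw API call itself looks roughly like the sketch that follows. Note that add-Type requires PowerShell 2.0 or later, the drive letter and snapshot path are placeholders, and the DDD_* flag values come from the Win32 headers:

# P/Invoke signature for DefineDosDevice
$signature = @'
[DllImport("kernel32.dll", SetLastError = true, CharSet = CharSet.Auto)]
public static extern bool DefineDosDevice(uint dwFlags, string lpDeviceName, string lpTargetPath);
'@
add-Type -MemberDefinition $signature -Name 'DosDevices' -Namespace 'Win32'

$DDD_RAW_TARGET_PATH       = 0x00000001
$DDD_REMOVE_DEFINITION     = 0x00000002
$DDD_EXACT_MATCH_ON_REMOVE = 0x00000004

# map X: directly to the snapshot's NT device path
$snapshot = '\Device\HarddiskVolumeShadowCopy1'
[Win32.DosDevices]::DefineDosDevice($DDD_RAW_TARGET_PATH, 'X:', $snapshot) > $null

# ... back up files from X: here ...

# remove the mapping again when finished
[Win32.DosDevices]::DefineDosDevice($DDD_RAW_TARGET_PATH -bor $DDD_REMOVE_DEFINITION -bor $DDD_EXACT_MATCH_ON_REMOVE, 'X:', $snapshot) > $null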

Creating a VSS snapshot and mapping it to DOS device names is well beyond what I wanted to try to do in C#. Luckily for me, the VSHADOW C++ source code was written in a very reusable manner and I could easily reuse it by wrapping a COM object around it.

The not so nice COM interop experience with .NET 2.0

Creating a snapshot is not the simplest of procedures. You have to query for the list of VSS writers, map them against the target volume, determine which ones to include in the process, and finally request that the snapshot be created. You have to hold on to the VSS COM interface to keep the snapshot alive on XP for the duration of its use. When you are done, you have to release it in a controlled manner or the VSS system can completely degrade, in most cases requiring a system restart to recover. It is also not the fastest process, something that would come back to bite me later. However the VSHADOW source, which is written in C++, was structured in such a way that it was very easy to turn into a COM object using ATL. It was as simple as creating a new ATL COM object project in Visual Studio and including the core VSHADOW source files in the project. Once I had it building as a COM object, it didn't take me long to put a .NET-friendly interface on it that exposed methods to create and destroy VSS snapshots as well as map them to DOS device names.

PowerShell has native support for creating and calling COM objects that is even easier than in other .NET languages. There is no need to create .NET interop classes; you just dynamically create the COM object and use it much like you would in VBScript. Once I created my new VSS COM object it was trivial to create VSS snapshots on the fly and map them to DOS device names using PowerShell. With my new VSS COM object I now had complete access to VSS snapshots from any tool that could access a standard drive. It has some limitations, but for this backup process it works very well.

Releasing the VSS snapshot in PowerShell, however, was another story. There is no clean way that I could find to force a created COM object to be released in PowerShell. You have to wait for the .NET garbage collector to do its thing, which is usually not until the PowerShell process is exiting. My new COM object had its clean-up code in the COM object's Release method so that when it was released it would clean up the VSS state in the proper way, ensuring that the system remained stable. Unfortunately for me, relying on a COM object's Release method to work during the .NET shutdown process proved to be one huge headache.

After many, many hours of debugging and not really believing what I was seeing, I finally had to accept what was going on. From what I observed and from the research I have done, it is my understanding that finalizers in .NET, which are called when an object is being destroyed and which are also responsible for calling a COM object's Release method in PowerShell, are not guaranteed to complete when a process shuts down. Usually this is not a problem as the process is going away anyway. It is a problem, however, when you have native resources to release.

What I was seeing, and not believing for literally hours and hours, was that in the middle of my COM object's Release method the PowerShell process would just exit normally. No exceptions, no faults, nothing - just poof, it's gone. And every time it did this it would leave the VSS system in such a state that the machine had to be restarted, because I was never given the chance to properly execute the VSS clean-up code, which can be a lengthy process. It seems that the PowerShell shutdown process was timing out my clean-up code. It was a complete mess and still one that I cannot believe is acceptable, but apparently to the folks who created .NET it is (you can read about it here in way more detail than anyone should have to know; just search for "timeout" and "watchdog" on that page). The thought that external native code can have the plug pulled just blows me away.

The fix was rather simple once I realized that I cannot count on my COM object's Release method to always complete. I moved all critical clean-up code into a public method that my PowerShell script always calls explicitly. Luckily PowerShell has pretty decent error handling, and it wasn't too hard to ensure that I always call the clean-up method on my COM object before the script terminates normally, as sketched below. I'm still not thrilled about this though. I would have preferred that my COM object be allowed to clean up after itself as necessary.
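
A minimal sketch of that pattern; the ProgID and method names here are hypothetical, stand-ins for whatever your own COM wrapper exposes:

# hypothetical ProgID and method names, for illustration only
$vss = new-Object -comobject 'MyBackupTools.VssSnapshot'

# make sure the VSS clean-up still runs if anything goes wrong
trap
{
    'error encountered, releasing VSS snapshot...' | out-Host
    $vss.Cleanup()
    break
}

# ... create the snapshot, map a drive letter, run the backup ...

# normal path: clean up explicitly rather than relying on COM Release at shutdown
$vss.Cleanup()
$vss = $null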

The moral of this story is that you are responsible for all complex clean-up even when calling native code. Don't depend on the .NET framework to always play nice.

Now that I had this behind me I had all the pieces that I needed: a robust file copy tool, a powerful scripting language, the ability to create hard links, and full access to Volume Shadow Copy snapshots.

 

In part four I'll cover the process overview and implementation details of creating the intelligent mirror backup process that I chose to be the foundation of my new backup strategy.


This is the second in a series of articles about a new backup process I have implemented for my home network. In the first article I covered some background information and why I chose a non-traditional backup process for my network. In this article I'll give an overview of mirror backups, their problems, and the ways in which they can be improved. I'll also talk about rsync and why, for me, it was not the right tool for creating backups on Windows.

A more intelligent mirror backup process

Mirror backups have several advantages over the traditional backup process. Like a full backup, they are a complete snapshot in time of the data being backed up, but since they are typically stored on randomly accessible media (i.e. a disk or online storage), accessing any part of the backup set is easily accomplished. However, also like a full backup, you usually have to copy each file to create the mirror even if that file has not changed since the last backup. This makes storing backup history both a time-consuming process and an inefficient use of storage space. You can of course just copy the changed files into an existing mirror backup, but then you lose any file change history from the last backup set.

While researching backup solutions I came across a UNIX utility called rsync. Rsync is a tool that was designed to efficiently create file set mirrors. It has a large and flexible set of features, including its own network protocol that can efficiently transfer just the differences between versions of a file when creating a new mirror. The most appealing feature to me, however, is its ability to conserve storage space while still creating complete mirror backups. It does this by leveraging a feature called hard links, which is found on modern file systems (including the Windows NTFS file system).

Hard links are a method by which the operating system separates the storage of the file's actual data from the file's directory entry. In other words, a file in a directory is really just a named pointer to the file's physical data, which is stored somewhere else on the hard disk. Every file stored on a hard disk is stored in this way. Most of the time there is a one-to-one mapping, with each physical file's data having only one named directory entry or file name. However, some file systems allow you to create additional named references to the file's physical data. These references can exist in the same directory or in any other directory on the file system. Furthermore, they can have the same or a completely different file name. When this is done they share all aspects of the actual data, including the file's attributes such as security settings, creation time, and the last modified time. If you change the file's data or attributes via one reference then all the other references will reflect those changes immediately. You can even delete the original reference to the data and the others will continue to live on. It is only after all references that point to the physical data have been deleted that the actual file data is deleted.

What makes rsync really special is that it can leverage this very powerful file system feature when creating mirror backups. It does this by examining the files being backed up, comparing each file against the copy in the previous mirror backup. Then for each file that it finds to be identical, instead of copying that unchanged file into the new mirror backup it simply creates a new hard link to the file as it exists in the previous mirror backup. This is much, much faster than re-copying the file itself but it essentially accomplishes the same thing: it makes the unchanged version of the file available in the newly created mirror backup. Furthermore it is also very space efficient, since the file's data is still only stored once physically yet it is accessible from every mirror backup that it is hard linked into.
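
A typical invocation of this technique (with illustrative paths) looks something like the following; --link-dest points rsync at the previous mirror, and any file unchanged since that backup is hard linked into the new mirror instead of copied:

rsync -a --delete --link-dest=/backups/2006-12-01 /data/ /backups/2006-12-02/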

By leveraging hard links, this approach also allows for another interesting feature. You can physically delete older mirror backups without affecting any of the newer mirror backups that may be referencing unchanged data within them. Files that have other references will stay alive in those newer mirrors, while older files whose only reference is in the mirror being deleted will be destroyed. This makes managing the amount of backup history to keep very easy; you can just delete older mirrors when you no longer want the history they contain.

Together these two features effectively allow you to break the traditional full/incremental backup cycle forever while keeping the best parts of each. You can have several full mirror backups available to preserve file history with no physical duplication of data for files that remain unchanged while still running a process that only copies newer files like an incremental backup. It is the best of both full backups and incremental backups combined in one logical backup process. You can read more about using rsync to create hard linking mirror backups here.

If rsync is for UNIX, what's this got to do with backing up Windows?

Rsync is a very UNIX-y tool and I am pretty firmly in the Windows camp. However you can still run rsync on Windows using the cygwin system. Cygwin is a software translation layer that was developed to allow certain types of UNIX programs to work on Windows systems. It has been around for a long time and is very functional and reliable. I used cygwin quite successfully to play with rsync on both Windows XP and Windows Server 2003. However for backing up data from Windows' NTFS file system there are some issues.

Windows' NTFS file system supports a wide array of features including security attributes, NTFS alternate data streams, sparse files, encrypted files, and much more. Cygwin, and thus rsync, knows nothing about these additional file attributes since the translation layer was designed to make Windows seem like just another UNIX system. What this means is that rsync can only copy the aspects of the files that it knows about, i.e. the file's regular data. So if you use rsync to make mirror backups on Windows you will essentially lose all of this other, and sometimes important, file and directory information (you also lose this information when simply copying files to a CD/DVD). This information, however, is usually preserved when doing traditional backups or when copying files between NTFS locations, including across Windows networks. What was needed was a way for rsync to preserve this information on Windows.

I debated for quite some time whether or not to dive in and start hacking away at the rsync source code to try to teach it more about the Windows NTFS file system and this additional file data. I even went so far as to patch cygwin to teach it about Windows' GLOBALROOT device paths, a feature that was essential in order to use Windows' Volume Shadow Copy Service (VSS) with rsync (this patch is now part of the cygwin CVS source, by the way). However, in the end I decided it would probably be much more effort to update rsync than it would be to write my own backup process. Rsync has a lot of features that I do not need, and even though my new backup process was likely to be missing some of the features that I really liked about rsync, it would still contain the core method by which rsync creates space-conserving mirror backups.

This decision to start fresh on my own backup tool coincided with the time that I had started to play with another tool, Microsoft's recently released PowerShell scripting language. It wasn't long before I realized that PowerShell was a perfect fit for this type of problem.

 

In part three I'll cover the tools I used to implement an intelligent mirror backup process in Windows using robocopy, Microsoft's PowerShell scripting language, C#, the Volume Shadow Copy Service, C++, and a little bit of COM interop.


This is the first in a series of articles about a new backup process I have implemented for my home network. In this first article I'll cover background information and why I chose a non-traditional backup process. In future articles, I'll cover my implementation of this system on Windows.

Note: Currently the custom backup system described here is not publicly available. That is something that I am looking at doing in the future, but as it exists now it is not in a state that would be usable by the general public.

The nature of traditional backup processes

I’ve been thinking off and on for quite some time about setting up a new backup system for my home network. I started down this path for a better backup solution because I was not really happy with my existing NT Backup scripts. At first I thought I would continue to leverage NT Backup with a better set of scripts to handle backup rotations, but the more I thought about it the more I became convinced that this was not the way to build a backup system for the future.

Traditional backup programs are still largely built on concepts from the days when everyone backed up their data to cheap backup media (i.e. discs or tape). They are typically designed to back up all information to a set of backup media. Even though most backup programs now allow you to back up data to external hard disks or network locations, most still create monolithic backup files. Getting to the data once it's on the backup media usually involves a restore process that moves the individual file data back to a hard disk.

Nowadays, however, hard disks are cheap and online storage is practically getting cheaper by the minute. Backing up using a process that creates large, monolithic backup files just doesn't make as much sense anymore. Hard drives are very good at storing individual files, and online backup services are most efficient when they can upload small incremental file changes rather than large monolithic backup files.

After this realization I started to look for possible alternatives to the monolithic backup process. Most of what I found offered little more than what NT Backup and some clever scripts had to offer. There were some notable exceptions and the solution I ultimately settled on is largely inspired by one of these exceptions, a tool named rsync. However once I made the decision to move away from NT Backup and a traditional, monolithic backup system things got a lot more complicated.

Along the way I've learned new technologies such as Microsoft's new PowerShell scripting language, working with the Windows Volume Shadow Copy Service (VSS), digging into Windows' GLOBALROOT device namespace, and the pitfalls of .NET/CLR and COM interop. I even dived pretty deep into cygwin development at one point along the way.

A more complicated setup requires a more complicated backup plan

I don't have what one would call a typical home network. On my home network I have several laptops, workstations, a media computer, and a server. My server is a Windows Server 2003 machine acting as a domain controller, and I use domain accounts for all computer logins. I also use several advanced features only available to computers participating in an AD domain environment, like user account folder redirection and domain-based group policy settings for management. This server also runs Microsoft Exchange Server, which is set up to automatically download POP email for all users and make it available from multiple email clients. I even have a few SQL Servers running here and there, although so far this is not data that I have been too concerned with (mostly development projects with test data).

This configuration has allowed me to set up an environment where sharing resources is a cornerstone of the way we use our computers. Exchange allows access to our email from anywhere with rich calendar and address book support, whether simultaneously from multiple computers, remotely over the web (via the OWA client), or in offline mode on our computers. Folder redirection and offline files allow us to share and sync data effortlessly with shared desktops, documents, favorites, or any other resource on the network, including shared applications. I can freely move from computer to computer, from inside the network to outside, and still have access to my email, data, and other network resources anywhere I go, whether online or offline. If it sounds like a complicated setup, that's because it is, but from a management point of view, once it was set up I spend very little time keeping it running. The downside to this configuration is that my data backup scenario is a bit more complicated than just copying files to a CD/DVD.

My primary goal of course is to not lose a user's data; the documents, photos, email, and other things that we all create. My secondary goal however is to preserve the state of the entire system. By state of the entire system I mean all the settings and configurations for each user account, on each computer, as well as the OS configuration of each computer. As I said before it is a complicated system and if something goes wrong I want to avoid having to rebuild systems from scratch if at all possible. My goal is to get back up and running as quickly as possible.

Running a domain controller also complicates backup strategies. Since the login accounts are all Active Directory domain accounts, if the state of my server is lost then so are the user accounts. Email is stored centrally in the Exchange database, which is tied to these AD accounts and requires a special backup process as well. Altogether this necessitates something more than copying data files to a disc or an external hard disk.

My overall goal in creating a backup system is both data preservation and having as little downtime as possible when something does go wrong (and it will at some point). But I also want this to be as automatic as possible, something I can just set and largely forget. And I want it to be resilient too. These are bold goals for such a complicated setup and a one-man IT shop. My previous solution of scripting NT Backup, while simplistic, met some of these goals but failed at many others.

Why my previous traditional backup solution wasn't good enough

My previous solution provided basic data and system state protection. I used simple scripts to control NT Backup to back up both the user data and the Windows system state nightly for each computer. I also used NT Backup to back up my Exchange server's data. I backed everything up to a secondary disk on my network. However, I only ever kept the previous backup set, so I had a history of exactly two backup cycles. By doing so, though, I had at least three copies of everything stored in two separate places.

My data volume, however, presented issues right from the start. I have a lot of data on one workstation in particular, mostly very large photo files. The data set size for that machine alone is currently 40GB if I don't count DV video, which effectively rules out running nightly full backups and keeping a lot of backup history. Incremental backups offer one solution to this problem, but I am not a fan of creating long chains of incremental backup sets as it makes the restore process more complicated and time consuming. They also have other drawbacks: if one of the incremental backup sets fails, the chain is broken and you could lose the only backup copy of a particular file. As a compromise I eventually settled on a simple rotation scheme of full backups and differential backups. This made for a restore process that was at most two steps, a full backup restore followed by a single differential backup restore, while also ensuring that I had some file change history preserved.

While my old scheme provided basic data protection, I felt that it was lacking in several ways. First off, it didn't keep very much history and I wanted to keep more. With that volume of data, it's easy not to notice for quite a while that something has gone wrong. If a single file has been changed or lost, chances are that I will know immediately as I probably caused it myself. But it's not always that simple; files can get corrupted when hard drives have minor, unnoticeable failures. Other actions can also have consequences that aren't immediately apparent. Any good backup system should keep a reasonable amount of history.

Secondly, a lot of redundant data was flying around each night. Differential backups are not very efficient. I had scheduled my backups to run in a staggered sequence so that they were not competing for the server’s bandwidth all at once. Still, so much duplicate data was getting copied each night that the process took around 3 hours and sometimes a lot longer if a full backup of any machine was triggered. I also know that my data volume will continue to grow. I already have data (DV video and other media files) that I don’t currently back up that I should. With my old process I couldn’t grow the data volume too much before backups would have started to take all night long or longer. I needed to stop copying duplicate data as much as possible.

Lastly, I had no solution for offsite backups, nor did I have a solution for archiving data that doesn't change much. I have DVD burners and I even have a 70GB DAT tape drive, but they were not integrated into my backup process. By using NT Backup on each machine I was left with a collection of large monolithic backup files that would not fit on backup media without file splitting, nor could they be sent to an Internet backup service in a reasonable amount of time.

Mirror mirror

From examining my current situation and researching possible new solutions, it was clear to me that there are now better ways to back up data than the traditional methods. What I really wanted was a mirror backup. However, since a mirror backup is a complete backup of everything frozen at a point in time, a simple file copy scheme is a very inefficient use of storage space when keeping history. Mirror backups do, however, make it very easy to upload incremental changes to offsite storage, since you can easily detect and upload just the files that changed since the last backup set. What is needed is an intelligent mirror backup that conserves space when storing history. One way to do this is by eliminating the physical storage of duplicate files. The solution I ultimately settled on does this by leveraging features of the NTFS file system to both eliminate duplicate file storage and still make it possible to browse a complete mirror backup with Windows Explorer.

 

In part two I'll cover the intelligent mirror method I chose to be the foundation of my new backup strategy, rsync, and why it didn't work for me.

