Update: Since installing this newer RE2 drive firmware, my RAID array has been working flawlessly. I have not had a single timeout or error. It appears that this firmware completely solved my issues.

Here are the links on the Western Digital site:
WDxxxxYS firmware update information
WDxxxxYS firmware download

A hard drive in my server crashed early last December. It was only about 3 years old, but it was also out of warranty. Instead of just replacing it I decided to buy a new set of drives and build a RAID 5 array so that if one drive crashes in the future I will have some level of redundancy. After doing some research I chose to build a software RAID 5 array (yes, I know) because I wanted to be able to guarantee that I could move my RAID 5 array to any other Windows machine in case of hardware failure. I didn't want to worry about becoming dependent on a certain RAID controller with a certain revision, certain driver, etc... For the most part this has been a good decision.

In order to do this I also decided to switch to SATA drives, which meant I would need to get a PCI SATA controller since my server is a bit older and doesn't support SATA natively. I chose a basic Promise controller that had 4 SATA 3.0 Gb/s ports. I then installed the controller and driver and easily built a 1.5TB RAID 5 array. All was well.

Then on December 18th my server mysteriously dropped one of the drives from the RAID array. In my event logs I saw a whole slew of device timeout messages for the failed drive. When I looked in the disk manager, sure enough, one of the disks was missing, but because it was a RAID 5 array, no data was lost (yet). I suspected that the drive was toast, so I shut down the machine, planning to reboot and run diagnostics in preparation for sending the drive back for replacement. However, once I rebooted, the drive came back online without any errors. I ran the diagnostics and they said the drive was fine. Windows happily rebuilt the RAID array and all was fine, until January 18th.

On January 18th the same thing happened again: a drive was dropped from the RAID array after a whole slew of device timeout messages. I figured that it was the same drive getting flakier, but then I noticed that it was a different drive this time. My next thought was that it must be a controller error. Perhaps the cheap Promise controller I bought was not the best decision. I ordered an Adaptec SATA PCI controller as a replacement and kept my fingers crossed that the array would not crash again before it arrived.

Once the new controller arrived I felt a little vindicated in my decision to go with software RAID. I simply swapped out controllers and rebooted, and the RAID array came online without a hitch. Now, I felt, everything was going to be OK. That was until February 18th.

On February 18th the system dropped yet another drive. The fact that each incident came almost exactly 4 weeks after the last was not lost on me. Could it have just been a strange coincidence? Whatever it was, it was clear to me that it was not just a controller issue. But neither was it a single drive, as each time it was a different drive that was crashing. Perhaps it was some weird configuration error. I rebuilt the array (which takes 14 hours) and started poking around the system for things that could cause this.
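For anyone who wants to check the spacing of those three drops, here is a minimal Python sketch. The years are an assumption on my part (only the month and day matter here; any pair of consecutive years gives the same gaps), and the helper name is just for illustration.

```python
from datetime import date

# Drop dates from the narrative. The years are assumed for illustration;
# the gaps between these month/day pairs do not depend on the year.
drops = [date(2006, 12, 18), date(2007, 1, 18), date(2007, 2, 18)]


def gaps_in_days(dates):
    """Return the number of days between each consecutive pair of dates."""
    return [(later - earlier).days for earlier, later in zip(dates, dates[1:])]


gaps = gaps_in_days(drops)
for earlier, later, gap in zip(drops, drops[1:], gaps):
    print(f"{earlier} -> {later}: {gap} days ({gap / 7:.1f} weeks)")
```

Both gaps come out at 31 days, so a little over four weeks each time, on the same day of the month.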

I found all sorts of suspicious things, which would all eventually turn out to be red herrings. Things like the disks being set for auto spin-down, my UPS mysteriously disconnecting for a few seconds (which led the server to think it was running on batteries for a few moments), old bits of the Promise filter drivers still installed, etc... Each time I thought I had found the cause, until the array crashed again. By now, however, the array was crashing much more unpredictably and frequently (did I mention that it also almost always crashed when I was out of town?). I also started experiencing other strange issues on the server, such as the system clock jumping into the future whenever the RAID array crashed. At this point I resigned myself to believing that the old server hardware must be going south, so I set out to build a new server.

Transferring everything to a new server (domain, configuration, services, Exchange, SQL, IIS, data, etc...) turned out to be a LOT of work, more so because I also decided to build a new primary domain controller with all the important services in a virtual machine running on the new hardware (which is also a DC with little else running on it). It took me well over a week to plan things out and to transfer and set up all the domain services. The only worrisome part was when I attempted to transfer over my RAID array. The new server recognized it as an array, but it kept telling me that not all of the drives were present and that I would lose data if I imported it. After much research (and backing things up) I determined that this was probably not going to be the case, so I let it import the array, which it did instantly and perfectly. The RAID array was now transferred and functioning in the new server. Surely everything must be right now. My RAID array by this time had survived no fewer than 6 crashes without losing data, and each time the failing drive appeared to be fine after a reboot.

Then on July 3rd, while I was out of town, the new server dropped a drive from the RAID array again after a whole slew of device timeouts. At this point I was just going to send the drives back to Western Digital for replacement. There must be something wrong with them, I figured. As I prepared to request an RMA, I decided to download and run the diagnostic tools one more time. That is when I noticed that there was a firmware update for the Western Digital RE2 drives. When I read the description in their knowledge base I almost fell out of my seat (emphasis mine):

WD hard drives have an internal routine that is periodically executed as part of the internal “Data Lifeguard” process that enhances the operational life expectancy. While the drive is running this routine, if the drive encounters an error, the drive’s internal host/device timer for this routine is NOT canceled causing the drive to be locked in this routine, never becoming accessible to the host computer/controller. This condition can only be reset by a Power Cycle. WD has resolved this issue by making a change to the firmware so when a disk error is encountered, the host/device timer is checked first and then the routine is canceled allowing the drive to be accessible to the host computer/controller. The interval rate for the error condition to occur is 1-4 weeks, and will only occur if the drive encounters a disk error when running this routine.

Could it be that I was suffering from this? It seems that this is a description of EXACTLY what I was experiencing every 1-4 weeks. I shut down my server and flashed all the drives with the newer firmware. Again, since I was running software RAID I could ignore the warnings about not updating drives that are part of a RAID array since, as far as everything else is concerned, they are just a bunch of single drives. Note that this KB article seems to imply that my drives are in fact experiencing disk errors that are triggering this, and perhaps they are and will still need replacing. So far no diagnostic tool shows that there are any. Unfortunately for me, only time will tell. Hopefully it will only be 1-4 weeks before I know.

Flux and Mutability

The mutable notebook of David Jade