About RAID Levels

This is about computer hardware.

RAID

RAID stands for Redundant Array of Inexpensive Disks. Originally, it was a system to make several physical hard disks appear as one big logical disk. There's plenty of information to be found on the Web if you want an in-depth explanation.

Anyway, RAID comes in various levels. If you know about these levels, I won't need to explain them to you. If you don't, you can look it up, but chances are you're not all that interested anyway. I'll be discussing RAID 0 and RAID 1 briefly, since those are the variations that are being discussed more and more lately.

So, what's the point of all this? Well, recently some stories have been appearing on the Web claiming that RAID 0 has little or no performance benefit for desktop systems. See here and here, for example. Since these people have done some actual testing, I'm not going to dispute these claims.

Basically, the articles claim that RAID 0 only makes sense if you have a lot of sustained disk transfers, as in server systems. I believe this to be true. However, in the past I had some write performance problems with capturing analog video, resulting in many dropped frames: the disk could not quite keep up with the data stream (640x480 pixels, 24-bit, at 30 frames/s makes 27.6 MB/s). Modern (fast) disks should be able to handle this under optimal circumstances, but it's pushing the limits. RAID 0 may have helped here.
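
For reference, here's the back-of-the-envelope calculation behind that 27.6 MB/s figure, as a few lines of Python (I'm counting 3 bytes per pixel and a megabyte as one million bytes):

    # Raw data rate for uncompressed video capture.
    width, height = 640, 480   # frame size in pixels
    bytes_per_pixel = 3        # 24-bit color
    frames_per_second = 30

    bytes_per_second = width * height * bytes_per_pixel * frames_per_second
    print(bytes_per_second / 1_000_000)  # ~27.6 MB/s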

However, what irks me about these articles is that they invariably state that RAID 0 arrays are "unreliable" or even "very unreliable", because one disk failure causes total data loss. Usually, despite all the solid analysis of the negligible performance increase, there is little or no explanation of the reduced reliability.

So, let's examine this.

MTBF

More alphabet soup. MTBF stands for Mean Time Between Failures. Simply put, this is the average operating time before your hard disk fails. This does not mean that it is guaranteed to work without failure for this amount of time: your disk could fail within the first hour of operation. Nor does it mean that the disk will fail once this time has elapsed: it could function without a glitch for many times the MTBF.
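
To make that a bit more concrete: if we assume a constant failure rate (exponentially distributed lifetimes, which is my modeling assumption, not something printed on the spec sheet), a few lines of Python show what an MTBF figure does and does not promise. The one-million-hour figure below is just an example number.

    import math

    # Assumes a constant failure rate (exponential lifetimes) -- a modeling choice.
    mtbf = 1_000_000  # hours, an example figure
    hours_per_year = 24 * 365

    def prob_failure_within(hours):
        """Chance of the disk failing at some point within 'hours' of operation."""
        return 1 - math.exp(-hours / mtbf)

    print(prob_failure_within(1))                    # first hour: ~0.0001%
    print(prob_failure_within(5 * hours_per_year))   # five years of 24/7 use: ~4.3%
    print(prob_failure_within(mtbf))                 # by the MTBF itself: ~63%
    print(1 - prob_failure_within(3 * mtbf))         # chance of lasting 3x the MTBF: ~5%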

Without getting into mathematics too much, a RAID 0 arrangement with two disks effectively cuts the MTBF in half. Fortunately, hard disks have become ridiculously reliable over the years. Take for example the Western Digital Raptor WD740. This is a recently released, top-of-the-line SATA hard disk. Western Digital claims an MTBF of 1.2 million hours for this drive. Putting two of these puppies in RAID 0 cuts the MTBF to 600,000 hours. But let's live dangerously, and put 4 of them in a RAID 0 array (RAID 0 can handle 2, 3, 4, or more disks). This means that our MTBF is now 300,000 hours. That's bad. Or is it?
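
Here's that arithmetic spelled out in a few lines of Python. The "divide by the number of disks" rule assumes the disks fail independently at a constant rate (more on that assumption below):

    # RAID 0 has no redundancy: the array is lost as soon as ANY disk fails.
    # With n independent disks at a constant failure rate, the failure rates
    # add up, so the array MTBF is the single-disk MTBF divided by n.
    single_disk_mtbf = 1_200_000  # hours, Western Digital's figure for the WD740

    for n in (2, 3, 4):
        print(f"{n} disks: {single_disk_mtbf // n:,} hours")
    # 2 disks: 600,000 hours
    # 3 disks: 400,000 hours
    # 4 disks: 300,000 hours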

If we assume that we keep our disks operating 24 hours a day, 365 days a year, then an MTBF of 300,000 hours works out to 12,500 days, or more than 34 years. As mentioned, this does not guarantee 34 years of faultless operation, but the chances are pretty slim that there will be a disk failure within the lifetime of the computer. The chances of other components failing are quite a bit higher. Take power supplies, for example: typical MTBF ratings for high-quality power supplies range from 50,000 to 80,000 hours. Pretty good (5.7 to 9.1 years), but quite a bit worse than our "unreliable" RAID array. And, apart from the cooling fan(s), they have no moving parts.
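
For anyone who wants to check my arithmetic, the conversion (assuming round-the-clock operation) is simply:

    def mtbf_in_years(mtbf_hours):
        """Convert an MTBF in hours to years of 24/7 operation."""
        return mtbf_hours / (24 * 365)

    print(mtbf_in_years(300_000))  # 4-disk RAID 0 array: ~34.2 years
    print(mtbf_in_years(50_000))   # quality power supply, low end: ~5.7 years
    print(mtbf_in_years(80_000))   # quality power supply, high end: ~9.1 years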

I've worked with many different computer systems over the course of many years. So far, I have had one power supply failure, one speaker system go south on me, and two CD drives giving up. Not a single hard drive failure.

Now, let's have a look at a two-disk RAID 1 array. If we handle failures in the dumbest way possible (wait for the first disk to fail, then do nothing and wait for the second disk to fail), we still come out well ahead of a single disk: the expected time until both disks have failed is about 1.5 times the single-disk MTBF. In the case of the WD740, that takes us from 137 years to roughly 205 years. That's kind of ridiculous for a desktop system. In reality, the effective mean time to data loss would be far larger still, since you would replace a disk as soon as it fails, going back to a mirrored configuration again. Of course, on average, you would have to wait almost 70 years for that first failure to happen...
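
If you don't feel like taking the 1.5x factor on faith, a quick simulation (again assuming independent disks with a constant failure rate) confirms both numbers:

    import random

    # Two-disk RAID 1 without repair, under an exponential lifetime model.
    # You notice trouble at the FIRST failure; you lose data at the SECOND.
    mtbf = 1_200_000  # hours, single-disk MTBF
    trials = 200_000

    total_first, total_loss = 0.0, 0.0
    for _ in range(trials):
        a = random.expovariate(1 / mtbf)  # lifetime of disk 1
        b = random.expovariate(1 / mtbf)  # lifetime of disk 2
        total_first += min(a, b)
        total_loss += max(a, b)

    print(total_first / trials / mtbf)  # ~0.5: first failure after ~68 years
    print(total_loss / trials / mtbf)   # ~1.5: data loss after ~205 years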

I'm making quite a few assumptions in all of this. One of them is that the odds of one disk failing are independent of the odds of another disk failing. This is not necessarily true. For example, a disk failure could be caused by a faulty power supply. Chances are that if one of the disks is fried because of this, the other one will suffer the same fate. So, the MTBF situation could be less rosy than pictured here.

The Point

So, what's my point? Maybe RAID 0 does not belong in a desktop configuration, since the performance advantages are negligible under most circumstances. However, the disadvantages ("unreliable") are greatly exaggerated, in my humble opinion. And RAID 1 does not belong in a desktop configuration at all: the advantages ("reliable") are irrelevant, while the disadvantages (lost disk capacity) are quite significant.

Additional Thoughts

So, do "reliable" RAID configurations like RAID 1 (and other levels, most prominently RAID 5) not make any sense? Of course they do. In mission-critical enterprise systems that need uptimes of 99.99% and higher, "redundant" RAID configurations are essential in order to achieve these uptime numbers. Enterprise systems often have huge arrays of disks, requiring a high level of redundancy, usually with "hot-swappable" disks. However, this is a completely different situation from desktop systems. Which brings me to...

Backups. If you have a desktop system and have critical data, then you should make regular backups. Relying on a RAID 1 array is a bad idea in this case (your system could fail in so many other ways). With current prices for CD-R(W) drives and media, there is no excuse for not making backups. Now, if only I could take my own advice...


Please send comments to webmaster@oldeloohuis.com.