About RAID Levels

This is about computer hardware.

RAID

RAID stands for Redundant Array of Inexpensive Disks. Originally, it was a system to make several physical hard disks appear as one big logical disk. There's plenty of information to be found on the Web if you want an in-depth explanation.

Anyway, RAID comes in various levels. If you know about these levels, I won't need to explain them to you. If you don't, you can look it up, but chances are you're not all that interested anyway. I'll be discussing RAID 0 and RAID 1 briefly, since those are the variations that are being discussed more and more lately.

So, what's the point of all this? Well, recently some stories have been appearing on the Web claiming that RAID 0 has little or no performance benefit for desktop systems. See here and here, for example. Since these people have done some actual testing, I'm not going to dispute these claims.

Basically, the articles claim that RAID 0 only makes sense if you have a lot of sustained disk transfers, as in server systems. I believe this to be true. However, in the past I had some write performance problems with capturing analog video, resulting in many dropped frames: the disk could not quite keep up with the data stream (640x480 pixels, 24-bit, at 30 frames/s makes 27.6 MB/s). Modern (fast) disks should be able to handle this under optimal circumstances, but it's pushing the limits. RAID 0 may have helped here.
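
For reference, here's the back-of-the-envelope calculation behind that 27.6 MB/s figure, as a few lines of Python (I'm counting 3 bytes per pixel and a megabyte as one million bytes):

    # Raw data rate for uncompressed video capture.
    width, height = 640, 480   # frame size in pixels
    bytes_per_pixel = 3        # 24-bit color
    frames_per_second = 30

    bytes_per_second = width * height * bytes_per_pixel * frames_per_second
    print(bytes_per_second / 1_000_000)  # ~27.6 MB/s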

However, what irks me about these articles is that they invariably state that RAID 0 arrays are "unreliable" or even "very unreliable", because one disk failure causes total data loss. Usually, despite all the solid analysis of the negligible performance increase, there is little or no explanation of the reduced reliability.

So, let's examine this.

MTBF

More alphabet soup. MTBF stands for Mean Time Between Failures. Simply put, this is the average operating time before your hard disk fails. This does not mean that it is guaranteed to work without failure for this amount of time: your disk could fail within the first hour of operation. Nor does it mean that the disk will fail once this time has elapsed: it could function without a glitch for many times the MTBF.
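
To make that a bit more concrete: if we assume a constant failure rate (exponentially distributed lifetimes, which is my modeling assumption, not something printed on the spec sheet), a few lines of Python show what an MTBF figure does and does not promise. The one-million-hour figure below is just an example number.

    import math

    # Assumes a constant failure rate (exponential lifetimes) -- a modeling choice.
    mtbf = 1_000_000  # hours, an example figure
    hours_per_year = 24 * 365

    def prob_failure_within(hours):
        """Chance of the disk failing at some point within 'hours' of operation."""
        return 1 - math.exp(-hours / mtbf)

    print(prob_failure_within(1))                    # first hour: ~0.0001%
    print(prob_failure_within(5 * hours_per_year))   # five years of 24/7 use: ~4.3%
    print(prob_failure_within(mtbf))                 # by the MTBF itself: ~63%
    print(1 - prob_failure_within(3 * mtbf))         # chance of lasting 3x the MTBF: ~5%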

Without getting into mathematics too much, a RAID 0 arrangement with two disks effectively cuts the MTBF in half. Fortunately, hard disks have become ridiculously reliable over the years. Take for example the Western Digital Raptor WD740. This is a recently released, top-of-the-line SATA hard disk. Western Digital claims an MTBF of 1.2 million hours for this drive. Putting two of these puppies in RAID 0 cuts the MTBF to 600,000 hours. But let's live dangerously, and put 4 of them in a RAID 0 array (RAID 0 can handle 2, 3, 4, or more disks). This means that our MTBF is now 300,000 hours. That's bad. Or is it?
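
Here's that arithmetic spelled out in a few lines of Python. The "divide by the number of disks" rule assumes the disks fail independently at a constant rate (more on that assumption below):

    # RAID 0 has no redundancy: the array is lost as soon as ANY disk fails.
    # With n independent disks at a constant failure rate, the failure rates
    # add up, so the array MTBF is the single-disk MTBF divided by n.
    single_disk_mtbf = 1_200_000  # hours, Western Digital's figure for the WD740

    for n in (2, 3, 4):
        print(f"{n} disks: {single_disk_mtbf // n:,} hours")
    # 2 disks: 600,000 hours
    # 3 disks: 400,000 hours
    # 4 disks: 300,000 hours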

If we assume that we keep our disks operating 24 hours a day, 365 days a year, then an MTBF of 300,000 hours works out to 12,500 days, or more than 34 years. As mentioned, this does not guarantee 34 years of faultless operation, but the chances are pretty slim that there will be a disk failure within the lifetime of the computer. The chances of other components failing are quite a bit higher. Take power supplies, for example: typical MTBF ratings for high-quality power supplies range from 50,000 to 80,000 hours. Pretty good (5.7 to 9.1 years), but quite a bit worse than our "unreliable" RAID array. And, apart from the cooling fan(s), they have no moving parts.
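
For anyone who wants to check my arithmetic, the conversion (assuming round-the-clock operation) is simply:

    def mtbf_in_years(mtbf_hours):
        """Convert an MTBF in hours to years of 24/7 operation."""
        return mtbf_hours / (24 * 365)

    print(mtbf_in_years(300_000))  # 4-disk RAID 0 array: ~34.2 years
    print(mtbf_in_years(50_000))   # quality power supply, low end: ~5.7 years
    print(mtbf_in_years(80_000))   # quality power supply, high end: ~9.1 years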

I've worked with many different computer systems over the course of many years. So far, I have had one power supply failure, one speaker system go south on me, and two CD drives giving up. Not a single hard drive failure.

Now, let's have a look at a two-disk RAID 1 array. If we handle failures in the dumbest way possible (wait for the first disk to fail, then do nothing and wait for the second disk to fail), we still come out well ahead of a single disk: the expected time until both disks have failed is about 1.5 times the single-disk MTBF. In the case of the WD740, that takes us from 137 years to roughly 205 years. That's kind of ridiculous for a desktop system. In reality, the effective mean time to data loss would be far larger still, since you would replace a disk as soon as it fails, going back to a mirrored configuration again. Of course, on average, you would have to wait almost 70 years for that first failure to happen...
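
If you don't feel like taking the 1.5x factor on faith, a quick simulation (again assuming independent disks with a constant failure rate) confirms both numbers:

    import random

    # Two-disk RAID 1 without repair, under an exponential lifetime model.
    # You notice trouble at the FIRST failure; you lose data at the SECOND.
    mtbf = 1_200_000  # hours, single-disk MTBF
    trials = 200_000

    total_first, total_loss = 0.0, 0.0
    for _ in range(trials):
        a = random.expovariate(1 / mtbf)  # lifetime of disk 1
        b = random.expovariate(1 / mtbf)  # lifetime of disk 2
        total_first += min(a, b)
        total_loss += max(a, b)

    print(total_first / trials / mtbf)  # ~0.5: first failure after ~68 years
    print(total_loss / trials / mtbf)   # ~1.5: data loss after ~205 years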

I'm making quite a few assumptions in all of this. One of them is that the odds of one disk failing are independent of the odds of another disk failing. This is not necessarily true. For example, a disk failure could be caused by a faulty power supply. Chances are that if one of the disks is fried because of this, the other one will suffer the same fate. So, the MTBF situation could be less rosy than pictured here.

The Point

So, what's my point? Maybe RAID 0 does not belong in a desktop configuration, since the performance advantages are negligible under most circumstances. However, the disadvantages ("unreliable") are greatly exaggerated, in my humble opinion. And RAID 1 does not belong in a desktop configuration at all: the advantages ("reliable") are irrelevant, while the disadvantages (lost disk capacity) are quite significant.

Additional Thoughts

So, do "reliable" RAID configurations like RAID 1 (and other levels, most prominently RAID 5) not make any sense? Of course they do. In mission-critical enterprise systems that need uptimes of 99.99% and higher, "redundant" RAID configurations are essential in order to achieve these uptime numbers. Enterprise systems often have huge arrays of disks, requiring a high level of redundancy, usually with "hot-swappable" disks. However, this is a completely different situation from desktop systems. Which brings me to...

Backups. If you have a desktop system and have critical data, then you should make regular backups. Relying on a RAID 1 array is a bad idea in this case (your system could fail in so many other ways). With current prices for CD-R(W) drives and media, there is no excuse for not making backups. Now, if only I could take my own advice...


Please send comments to webmaster@oldeloohuis.com.