So, recently I started looking for some nice hardware to provide a solid enclosure for a FreeNAS-based homemade NAS.  While looking into this, I ran across this page: Freenas Raid Overview.  What really caught my eye was the statement “CAUTION: RAID5 ‘died’ back in 2009” and a link to this article: Why RAID 5 stops working in 2009.  Worried that I had made a fatal error in my existing 12TB (6x2TB) RAID 5 setup, I read on and realized something wasn’t right.  And it got worse: a follow-up article from 2013, Has RAID5 stopped working, by the same author repeated the error.  “What’s the problem?” you might ask.  Well, it is a failure to understand fundamental math.

See, the author (and, to be fair, lots of people) makes a mistake when combining the probabilities of separate events.  They assume that if you have six separate events, each with a given probability of happening, and you put them all together, then the probability of each event goes up with the size of the whole.  That’s completely wrong.  Each disk’s probability is no greater for being part of an array, because each individual event has no effect on the other events.  So since you have six 2TB disks, each with a max URE rate (unrecoverable read error, the probability of a failed read) of 1 in 1×10^14 bits, you are still only looking at the failure of that 2TB disk, not of the 12TB of storage.  If you really want to account for combined events, you can take the chance of two drives hitting a URE at the same time.  This is done by multiplying the probabilities together.  So 1/(1×10^14) times 1/(1×10^14) equals 1/(1×10^28), that is, a URE rate of 1 in 1×10^28!  All the failure probabilities are completely independent.  And it gets better from there:

1.  With the probability and statistics error above set aside, you are only looking at the chance of failure for each individual disk, not the whole storage array.  So you have a 1 in 1×10^14 chance per bit of a read failure on a 2TB disk during the recovery of any disk.  Yes, this technically gets worse as drive sizes increase, but 1×10^14 bits is 12.5TB of data, so you would need to read each individual, COMPLETELY FULL 2TB disk in whole 6.25 times to reach the point where one error is expected on that disk.  For a 4TB disk you have to read the entire full disk 3.125 times, so worse odds, but in most setups this still is unlikely to occur during a rebuild (unless you’ve just got bad luck).

2. That 1 in 1×10^14 is the MAX unrecoverable read error rate.  That means you should see errors no more often than that.  You are actually likely to get fewer failures than that, so you can expect to read more than 12.5TB of data before a failure.  See, more good news!

3.  When RAID 5 is in recovery mode, you are not reading a full 2TB of data off your full 2TB disks to rebuild your failed drive.  The information needed to recover the drive is only the total usable storage divided by the number of drives in the array.  For a 6x2TB array (12TB of raw storage) you get 10TB of usable storage.  That 10TB is divided by 6 to give you about 1.67TB of data that needs to be read off each individual 2TB drive to recover the failed drive in the array.  So, again, your odds get better.
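For anyone who wants to check the numbers, here is the arithmetic from the points above as a small Python sketch.  It only reproduces the figures as this post computes them (the product of two independent per-bit error rates, the 12.5TB figure, the number of full-disk reads, and the per-drive share from point 3).  The function and constant names are made up for illustration, and terabytes are assumed to be decimal (10^12 bytes), the way drive makers count them.

```python
from fractions import Fraction

BITS_PER_BYTE = 8
TB = 10**12                    # assumption: decimal terabytes
BITS_PER_ERROR = 10**14        # spec figure used in this post: max 1 URE per 1e14 bits


def both_occur(p_a, p_b):
    """Probability that two independent events BOTH happen (product rule)."""
    return p_a * p_b


def tb_per_expected_error(bits_per_error=BITS_PER_ERROR):
    """Data (in TB) read per one expected URE at the spec rate."""
    return bits_per_error / BITS_PER_BYTE / TB


def full_reads(disk_tb, bits_per_error=BITS_PER_ERROR):
    """How many times a completely full disk must be read to cover that much data."""
    return tb_per_expected_error(bits_per_error) / disk_tb


def raid5_usable_tb(n_drives, disk_tb):
    """RAID 5 gives up one drive's worth of capacity to parity."""
    return (n_drives - 1) * disk_tb


p = Fraction(1, BITS_PER_ERROR)
print(both_occur(p, p) == Fraction(1, 10**28))   # True

print(tb_per_expected_error())                   # 12.5
print(full_reads(2))                             # 6.25
print(full_reads(4))                             # 3.125

usable = raid5_usable_tb(6, 2)                   # 10
print(round(usable / 6, 2))                      # 1.67 (per-drive figure from point 3)
```

Fraction is used for the product so the 1 in 1×10^28 result is exact rather than a rounded float.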

Yes, the chance of failure does go up as drives get larger (assuming URE rates don’t improve), and, yes, you should ALWAYS have an offsite backup (a different RAID box) for anything you don’t want to risk losing (good disaster recovery strategy anyway).  But RAID 5 isn’t dead, and it is still an excellent choice for good performance, reliability, and cost.

And here is my real-life example:  I made the mistake of purchasing Seagate “green” 2TB drives for my original 6x2TB NAS box.  These drives have a little bug: when used with some hardware RAID solutions, they report “failed” even when they haven’t really failed.  For 4 months after I installed these drives, I had a drive failure just about every three weeks and had to do a rebuild of 5TB of data (take the failed drive out, format it blank, stick it back in, rebuild).  That’s about five RAID 5 rebuilds before I finally gathered the funds to replace all the drives with WD Red NAS drives (no failures since).  Oh, and each time I swapped out a green drive for a red one, that was another RAID 5 rebuild, so six more rebuilds for a total of eleven.  Guess what, I got lucky and there were no URE events during any of those rebuilds and no data was lost (yes, I have offsite backup as well).  Of course, when I say luck, I mean my odds were pretty good that I wouldn’t have the catastrophic failure the other author claimed I would.  😉