RAID Is Not Backup: Rebuild Math

By SumGuy 9 min read
Your RAID Array Just Lost a Drive. Congratulations, You’re Not Done Yet.

The drive health alert fires at 11 PM. One of your RAID 5 drives is dead. You’ve got redundancy, so you’re fine, right? You order a replacement on next-day delivery and go to bed.

Here’s what’s happening while you sleep: your array is running degraded, one drive death away from total data loss. And while you’re waiting for that replacement, the surviving drives are under extra stress. If any of them has a sector that’s been slowly going bad — and on drives that are a few years old, there’s a real chance — you’re going to find out about it during the rebuild.

This is the part of RAID that the enthusiast forums gloss over. RAID protects against a complete drive failure. It does nothing about silent corruption, simultaneous failures, or the specific window of vulnerability during a rebuild. Let’s talk about the math.

Unrecoverable Read Errors: The Rebuild Killer

Every hard drive has a spec called the Unrecoverable Read Error rate (URE). For consumer drives the spec sheet has historically said 1 error per 10^14 bits read. Enterprise drives usually spec at 10^15. This sounds like a lot until you do the arithmetic.

2026 update. Read your actual datasheet. A lot of the modern high-capacity helium drives — 18TB-22TB consumer SKUs — quietly spec at 10^13 instead of 10^14 (sometimes buried under “non-recoverable read errors per bit read” rather than the cleaner URE wording). Some manufacturers also dropped the number from the public spec entirely. The math below uses 10^14 as the historical baseline, but if your drives are 18TB+, assume the real-world failure curve is steeper.

A 4TB drive contains roughly 3.2 × 10^13 bits (4 × 10^12 bytes × 8 bits per byte). Rebuilding a RAID 5 array after a failure requires reading every bit from every surviving drive. A 3-drive RAID 5 with 4TB drives has two survivors, so that’s approximately 2 × 3.2 × 10^13 = 6.4 × 10^13 bits read to rebuild one 4TB drive’s worth of data.

Consumer URE rate: 1 error per 10^14 bits.
Bits read during rebuild: ~6.4 × 10^13.
Probability of hitting at least one URE: 1 - e^(-6.4 × 10^13 / 10^14) = 1 - e^(-0.64) ≈ 47%.

Nearly a coin flip. During that rebuild, you have roughly a 47% chance of hitting an unrecoverable read error on one of the surviving drives. When mdadm hits a URE during a RAID 5 rebuild, it can’t reconstruct the missing data — the parity calculation breaks down. Depending on your setup, this either corrupts that portion of data silently or aborts the rebuild entirely.

On larger drives the math gets worse. A 3-drive RAID 5 with 8TB drives reads ~1.28 × 10^14 bits during rebuild. At the consumer spec rate that is an expected 1.28 UREs, or roughly a 72% chance of hitting at least one.
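If you want to plug in your own drive sizes, here is a minimal sketch of the arithmetic above. It assumes decimal terabytes (1 TB = 8 × 10^12 bits), treats the spec-sheet URE rate as an independent per-bit probability, and models "at least one error" as 1 - e^(-bits × rate). The function name ure_probability is made up for this example.

```shell
# ure_probability DRIVE_TB SURVIVING_DRIVES [URE_EXPONENT]
# Prints the percent chance of at least one URE while reading every bit
# of the surviving drives. URE_EXPONENT defaults to 14 (1 error per 10^14 bits).
ure_probability() {
  local drive_tb=$1 surviving=$2 exponent=${3:-14}
  awk -v tb="$drive_tb" -v n="$surviving" -v e="$exponent" 'BEGIN {
    bits = tb * 8e12 * n      # total bits read during the rebuild
    rate = 10 ^ (-e)          # spec-sheet errors per bit read
    printf "%.1f\n", (1 - exp(-bits * rate)) * 100
  }'
}

ure_probability 4 2       # 3-drive RAID 5, 4TB drives: prints 47.3
ure_probability 8 2       # 3-drive RAID 5, 8TB drives: prints 72.2
ure_probability 4 2 15    # same 4TB array on 10^15 drives: prints 6.2
```

Treat the output as a spec-sheet estimate, not a measurement. It is only as good as the URE figure you feed it.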

This is not a reason to panic. It is a reason to:

  1. Use RAID 6 instead of RAID 5 when drives are 4TB or larger (see RAID 6 vs RAID 10)
  2. Keep your arrays monitored so you know about degraded state immediately
  3. Replace failed drives fast — the longer you run degraded, the larger your exposure window

How Long Does a Rebuild Actually Take?

RAID 5 rebuild speed depends on drive speed and array activity. A rough benchmark on spinning drives doing a sequential rebuild on an idle array:

2026 reality check. Those are best-case numbers for an idle array. Add 2-3× for any production workload — competing IO, scrubbing, or just normal NAS service stretches rebuilds significantly. An 18TB rebuild on a busy 8-bay array routinely runs past 48 hours. This is exactly why RAID 6 / Z2 / Z3 exist as minimum-viable parity for modern drive sizes.

During that entire window, your array is degraded. If a second drive fails — for any reason — you lose everything. No recovery. No “but I had RAID.” Just gone.
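To get a feel for how long that window is on your own hardware, a rough sketch: rebuild time is about the capacity of one drive divided by sustained rebuild throughput, multiplied by a slowdown factor for load. The 150 MB/s figure and the 2.5× factor below are illustrative assumptions, not measurements, and rebuild_hours is a made-up helper name.

```shell
# rebuild_hours CAPACITY_TB MB_PER_SEC [LOAD_FACTOR]
# Prints estimated rebuild time in hours for one replaced drive.
# LOAD_FACTOR defaults to 1 (idle array); use 2-3 for a busy one.
rebuild_hours() {
  local tb=$1 mbps=$2 factor=${3:-1}
  awk -v tb="$tb" -v mbps="$mbps" -v f="$factor" 'BEGIN {
    seconds = (tb * 1e12) / (mbps * 1e6) * f   # decimal TB and MB
    printf "%.1f\n", seconds / 3600
  }'
}

rebuild_hours 18 150        # idle array at an assumed 150 MB/s: prints 33.3
rebuild_hours 18 150 2.5    # same drive with a 2.5x slowdown: prints 83.3
```

Swap in the speed your array actually reports in /proc/mdstat during a test rebuild for a better number.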

Terminal window
# Check rebuild progress and estimated time remaining
watch cat /proc/mdstat
md5 : active raid5 sdd[3] sdc[1] sdb[0]
8388608 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
[===========>.........] recovery = 58.3% (2457600/4194304) finish=12.3min speed=39247K/sec

[3/2] means the array expects three members and has two active, and [UU_] shows the third slot is the one missing. The rebuild progress and finish estimate are right there. That finish=12.3min is optimistic: production arrays with larger drives and active workloads take much longer.
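If you would rather have a script spot the degraded state than eyeball it, here is a small sketch that scans mdstat-style output for arrays whose [expected/active] counter shows a missing member. The function name degraded_arrays is hypothetical, and the parsing only assumes the layout shown in the sample above.

```shell
# degraded_arrays [FILE]
# Prints "name expected active" for every md array with fewer active
# members than expected. FILE defaults to /proc/mdstat; "-" reads stdin.
degraded_arrays() {
  awk '
    /^md[0-9a-z_]* *:/ { name = $1 }        # array header line
    match($0, /\[[0-9]+\/[0-9]+\]/) {        # the [expected/active] counter
      split(substr($0, RSTART + 1, RLENGTH - 2), c, "/")
      if (c[2] + 0 < c[1] + 0) print name, c[1], c[2]
    }
  ' "${1:-/proc/mdstat}"
}

# Demo on the sample output above; on a real box, call it with no argument.
degraded_arrays - <<'EOF'
md5 : active raid5 sdd[3] sdc[1] sdb[0]
8388608 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [UU_]
[===========>.........] recovery = 58.3% (2457600/4194304) finish=12.3min speed=39247K/sec
EOF
# prints: md5 3 2
```

Empty output means every array has its full set of members.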

Software RAID vs Hardware HBA: What You Actually Need

Home lab RAID conversations inevitably hit the “should I use a hardware RAID card?” question. Short answer for most home labs: no, use software RAID with a plain HBA.

Software RAID (mdadm/Linux MD):

Cheap “RAID cards” (IT mode HBA + fake RAID):

True hardware RAID controllers (Areca, LSI in IR mode, Dell PERC):

For a 4–8 bay home NAS: grab an LSI 9211-8i (or equivalent) flashed to IT mode for about $30 on eBay, run mdadm, sleep soundly.

Setting Up Monitoring That Actually Tells You Things

The most important part of running RAID long-term is knowing about failures immediately — not when you go looking for a file and find it gone. mdadm has built-in monitoring that sends email alerts. Set it up.

Terminal window
# Add your email to the mdadm config
# (/etc/mdadm/mdadm.conf on Debian/Ubuntu; /etc/mdadm.conf on RHEL-family distros)
echo 'MAILADDR your@email.com' >> /etc/mdadm/mdadm.conf
# Start the mdadm monitor daemon by hand (checks every 30 minutes)
mdadm --monitor --daemonise --mail=your@email.com --delay=1800 /dev/md0 /dev/md5
# Or, instead of the manual daemon, enable your distro's mdmonitor service
systemctl enable mdmonitor
systemctl start mdmonitor

The --delay=1800 means check every 1800 seconds (30 minutes). For more aggressive monitoring, drop it to 600.
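Email is not the only option. mdadm's monitor mode can also run a program on each event, set with a PROGRAM line in mdadm.conf or the --program flag. It calls the program with the event name, the md device, and sometimes a component device. Below is a sketch of what such a handler might format. The event names come from the mdadm man page, but the critical/info split is my own grouping, and format_md_alert is a made-up name. Check the event list against your installed version.

```shell
# format_md_alert EVENT MD_DEVICE [COMPONENT]
# Builds a one-line alert from the arguments mdadm passes to a PROGRAM hook.
# The severity grouping is an assumption for illustration, not an mdadm concept.
format_md_alert() {
  local event=$1 device=$2 component=${3:-}
  local severity
  case "$event" in
    Fail|FailSpare|DegradedArray|SparesMissing|DeviceDisappeared)
      severity=critical ;;
    *)
      severity=info ;;
  esac
  if [ -n "$component" ]; then
    echo "[$severity] $event on $device (component $component)"
  else
    echo "[$severity] $event on $device"
  fi
}

format_md_alert Fail /dev/md5 /dev/sdd     # [critical] Fail on /dev/md5 (component /dev/sdd)
format_md_alert RebuildFinished /dev/md5   # [info] RebuildFinished on /dev/md5
```

In a real handler script you would call the function with "$@" and pipe the line to whatever notifier you already use.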

Also set up periodic scrubbing — a consistency check that reads every block and verifies parity. This catches UREs before a drive failure makes them catastrophic:

Terminal window
# Trigger a manual scrub on your array
echo check > /sys/block/md5/md/sync_action
# Watch progress
watch cat /proc/mdstat
# Automate scrubs monthly via cron (add to /etc/cron.d/raid-scrub)
# 0 2 1 * * root echo check > /sys/block/md5/md/sync_action

A scrub that finds errors is telling you a drive is going bad before it fully fails. That’s the early warning system you want.
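Once a check finishes, the kernel leaves a mismatch counter in sysfs next to sync_action. Here is a sketch of a helper that reads both so a cron job can complain when the count is non-zero. The helper name scrub_status and the optional second argument (an alternate sysfs root, handy for testing) are assumptions for this example. A non-zero count on RAID 1 or RAID 10 is not always a dying drive, so treat it as a prompt to look closer.

```shell
# scrub_status ARRAY [SYSFS_ROOT]
# Prints the current sync_action and mismatch_cnt for an md array.
# Returns 0 when mismatch_cnt is 0, 1 when it is not, 2 if sysfs is unreadable.
scrub_status() {
  local array=$1 root=${2:-/sys/block}
  local md_dir="$root/$array/md"
  local action mismatches
  action=$(cat "$md_dir/sync_action") || return 2
  mismatches=$(cat "$md_dir/mismatch_cnt") || return 2
  echo "$array: sync_action=$action mismatch_cnt=$mismatches"
  [ "$mismatches" -eq 0 ]
}

# Example (run on a machine that has the array):
#   scrub_status md5 || echo "md5 scrub found mismatches, go look at SMART data"
```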

Simulating a Failure (Before You Have a Real One)

The best time to test your recovery procedure is not at 11 PM when a real drive dies. Do it now, on purpose:

Terminal window
# Mark a drive as failed (non-destructive — just marks it failed in md)
mdadm /dev/md5 --fail /dev/sdd
# Check degraded state
mdadm --detail /dev/md5 | grep State
# State : clean, degraded
# Remove the failed device
mdadm /dev/md5 --remove /dev/sdd
# Add a replacement (hot spare or new drive)
mdadm /dev/md5 --add /dev/sde
# Watch the rebuild
watch cat /proc/mdstat

Run through this once. Know what the commands are, know what the output looks like, know how long the rebuild takes on your hardware. The first time you do this should not be during an actual emergency.

RAID Is Not a Backup. Seriously.

This is the part of the article you forward to the person in your life who thinks “I’ve got RAID so I’m fine”:

RAID protects against one thing: a complete physical drive failure causing service interruption. It does not protect against silent corruption, simultaneous drive failures, or a second failure during the rebuild window.

RAID is for uptime. Backups are for data recovery. You need both. The 3-2-1 rule: three copies of your data, on two different media types, with one copy off-site. RAID is not one of those copies. RAID is the thing that lets your NAS keep serving files while you wait for a replacement drive.

Run Restic, Borg, or any backup tool to a second location. Then run RAID. In that order of importance.

The Full Picture

This series covers the full RAID landscape for home lab use:

Build the array. Set up monitoring. Test a simulated failure before you need the real thing. And then go set up your backups, because that’s the part that actually saves you when everything else goes sideways.

