
RAID-Z and dRAID: ZFS Parity Explained

By SumGuy 12 min read

Why Your RAID 5 Rebuild Took 36 Hours and ZFS Wouldn’t

You’ve read the RAID rebuild math. You’ve maybe even looked at RAID 50 and RAID 60 nesting — stacking mdadm arrays to reduce rebuild risk. Both answer the same problem: drives are huge now, rebuilds are slow, and the failure exposure window has stretched from hours into days.

Here’s the thing. mdadm does RAID as a block-layer abstraction. The filesystem has no idea parity is happening. It writes, mdadm shuffles bits, life goes on — until a power blip mid-write leaves parity inconsistent. Write hole. Now you’re praying the UPS held.

ZFS takes a different approach. The filesystem is the RAID layer. Checksums, snapshots, copy-on-write, and parity all live in the same engine. That changes what’s actually possible.

Let’s talk about RAID-Z, why it’s not just RAID 5 with a rebrand, and when to reach for dRAID instead.

Full example: Clone the working files at github.com/KingPin/sumguy-examples/linux/raid-z-and-draid


RAID-Z vs mdadm Parity: Three Key Differences

Before we get into Z1/Z2/Z3, there are three things that genuinely separate RAID-Z from mdadm RAID 5/6. Not marketing — actual architectural differences.

1. Variable-width stripes (no read-modify-write)

Traditional RAID 5 has a fixed stripe width. When you write a chunk smaller than a full stripe, the RAID layer has to read the old data and the old parity, compute the new parity in memory, then write the new data and the new parity back. That’s four I/Os for what looked like one write (six on RAID 6). That’s the write penalty.

RAID-Z uses variable-width stripes. Every write is its own complete stripe, sized to exactly the data being written. If you write a 4KB block, the stripe is 4KB of data plus the parity. ZFS never does read-modify-write. The write penalty you’ve been warned about with RAID 5 does not exist here.
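To see what "sized to the data" means in sectors, here’s a rough sketch. It is modeled on the allocation rounding in OpenZFS’s vdev_raidz_asize() as we read it, so treat it as an approximation: count the data sectors, add parity sectors for each row of data, then pad up to a multiple of parity + 1. The raidz_asize helper is hypothetical, and 4 KiB sectors (ashift=12) are an assumption.

```shell
#!/bin/sh
# raidz_asize BYTES ASHIFT WIDTH PARITY
# Approximate on-disk bytes one logical block consumes in a RAID-Z vdev.
# Hypothetical helper, modeled on OpenZFS's vdev_raidz_asize() rounding.
raidz_asize() {
  local psize="$1" ashift="$2" cols="$3" nparity="$4" sectors
  # Data sectors: ceil(psize / sector size), where sector size = 2^ashift
  sectors=$(( ((psize - 1) >> ashift) + 1 ))
  # Parity: nparity sectors per row of (cols - nparity) data sectors, rounded up
  sectors=$(( sectors + nparity * ((sectors + cols - nparity - 1) / (cols - nparity)) ))
  # Pad the total up to a multiple of (nparity + 1) sectors
  sectors=$(( (sectors + nparity) / (nparity + 1) * (nparity + 1) ))
  echo $(( sectors << ashift ))
}

# 6-drive RAID-Z2 with 4 KiB sectors
raidz_asize 4096 12 6 2     # a lone 4 KiB block
raidz_asize 131072 12 6 2   # a 128 KiB block
```

On a 6-wide RAID-Z2, that lone 4 KiB block takes 12 KiB on disk (one data sector plus two parity), while the 128 KiB block lands at the expected two-thirds efficiency. The read-modify-write penalty really is gone; very small blocks just pay in space instead.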

2. No write hole (COW + transactional writes)

The write hole — where a power failure during a RAID 5 stripe write leaves parity inconsistent — can’t happen in RAID-Z. ZFS writes are transactional. Data lands in new blocks first (copy-on-write), then the metadata is atomically updated to point to the new blocks. Either the whole transaction commits, or nothing does. Old blocks stay untouched until the transaction succeeds.

You still want a UPS. But you’re not losing sleep over write-hole corruption.

3. Per-block checksums

Every block in ZFS has a checksum. When you read data, ZFS verifies the checksum. If it doesn’t match, ZFS knows it’s corrupt. If you have redundancy (parity or mirror), ZFS heals it automatically from the good copy. It logs what it found and fixed.

mdadm knows nothing about this. If a bit flips on a RAID 5 drive and the read comes back wrong, mdadm serves you the corrupted data and calls it a day. Silent corruption is real, and it’s why a monthly zpool scrub is worth more than a year of crossed fingers.

RAID-Z1, Z2, Z3

These map roughly to RAID 5, RAID 6, and triple parity (sometimes informally called RAID 7, though that isn’t a standard level):

| Level | Parity drives | Can survive | Min drives |
|---|---|---|---|
| RAID-Z1 | 1 | 1 drive failure | 3 |
| RAID-Z2 | 2 | 2 simultaneous drive failures | 4 |
| RAID-Z3 | 3 | 3 simultaneous drive failures | 5 |

“Roughly” is doing real work there. RAID 6 also spreads its parity across all drives, but in fixed-width stripes with a set rotation. RAID-Z2 writes parity per block, inside variable-width stripes. The logical behavior is the same (lose two drives, still alive), but the on-disk layout is completely different.

For home labs and small businesses, Z2 is the sweet spot. Z3 is for large arrays where the math on simultaneous failures starts looking less academic.
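The minimum drive counts in the table above are easy to fumble when you’re typing device lists by hand. A minimal dry-run sketch, assuming a POSIX shell: a hypothetical raidz_create_cmd helper checks the drive count against those minimums (parity + 2) and only prints the zpool create command, so you can eyeball it before running anything as root.

```shell
#!/bin/sh
# raidz_create_cmd POOL LEVEL DEVICE...
# Hypothetical dry-run helper: prints a zpool create command, never runs it.
# LEVEL is 1, 2 or 3 (raidz1 / raidz2 / raidz3).
raidz_create_cmd() {
  [ "$#" -ge 2 ] || { echo "usage: raidz_create_cmd POOL LEVEL DEVICE..." >&2; return 1; }
  local pool="$1" level="$2" min
  shift 2
  case $level in
    1|2|3) ;;
    *) echo "level must be 1, 2 or 3" >&2; return 1 ;;
  esac
  # Minimums from the table: Z1=3, Z2=4, Z3=5, i.e. parity + 2
  min=$(( level + 2 ))
  if [ "$#" -lt "$min" ]; then
    echo "raidz$level needs at least $min drives, got $#" >&2
    return 1
  fi
  echo "zpool create $pool raidz$level $*"
}

raidz_create_cmd mypool 2 /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf
```

In practice, /dev/sdX names can change between boots; the stable paths under /dev/disk/by-id/ are the safer thing to pass in.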


The Expansion Problem (Historic)

For a long time — and I mean a decade-plus — RAID-Z had a notorious limitation: you could not add drives to an existing vdev. Once you created a 6-drive RAID-Z2 vdev, that was your geometry forever.

Want more space? Add a new vdev. Your pool now has two separate RAID-Z2 vdevs. Capacity went up, but so did your vdev count and your parity overhead, and only new writes got spread across both vdevs; existing data stayed where it was. You couldn’t consolidate. You couldn’t take six drives and turn them into seven without destroying and recreating the pool.

This was a genuine adoption killer for home labs. “I’ll just add a drive when I need space” is how people think about their NAS. ZFS said no. You plan your vdev geometry at creation and you live with it.

For years, the workaround was: plan your vdev sizes carefully upfront, or buy a bunch of drives at once and build the right-sized vdev. Not ideal. Definitely annoying.


RAID-Z Expansion in OpenZFS 2.3

In January 2025, OpenZFS 2.3 shipped RAID-Z expansion: the ability to add a single drive to an existing RAID-Z vdev. This was years in the making and a genuinely big deal.

Here’s how it works:

Terminal window
# Check current pool status before expansion
zpool status mypool

# Add a drive to an existing RAID-Z2 vdev
# This converts a 6-drive Z2 into a 7-drive Z2
zpool attach mypool raidz2-0 /dev/sdg

# Monitor the expansion progress
zpool status mypool
  pool: mypool
 state: ONLINE
status: One or more devices is currently being resilvered.
  scan: resilver in progress since Tue Jul 8 09:15:00 2026
        3.68T scanned at 1.12G/s, 821G issued at 250M/s
config:

        NAME        STATE     READ WRITE CKSUM
        mypool      ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sda     ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     ONLINE       0     0     0  (expanding)

Here’s the caveat nobody tells you upfront: existing data keeps the old data-to-parity ratio until it’s rewritten. If you had a 6-drive Z2 vdev and you add a 7th drive, the expansion spreads the existing blocks across all seven drives, but each one still carries 4 data to 2 parity. New writes use the wider 5-to-2 ratio. A scrub won’t change that, because scrubs verify data rather than rewrite it. Old blocks only move to the new ratio when they’re rewritten, either naturally over time or deliberately, for example by copying files or running zfs send | zfs receive into a new dataset.

This means the capacity gain arrives gradually: the new drive’s space is there once the expansion finishes, but old blocks keep paying the 6-drive parity overhead until they’re rewritten. Check zpool status for the reflow progress if you’re impatient.
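Some rough numbers make the caveat concrete. This sketch assumes large records in full-width stripes and ignores padding and metadata; raw_gib is a hypothetical helper that works out how much raw space a given amount of data occupies at a given vdev width and parity level.

```shell
#!/bin/sh
# raw_gib DATA_GIB WIDTH PARITY
# Raw space (GiB, rounded down) that DATA_GIB of data occupies in
# full-width RAID-Z stripes. Hypothetical helper; ignores padding/metadata.
raw_gib() {
  local data="$1" width="$2" parity="$3"
  echo $(( data * width / (width - parity) ))
}

old=$(raw_gib 8192 6 2)   # 8 TiB written before expansion (4 data + 2 parity)
new=$(raw_gib 8192 7 2)   # same data rewritten afterwards (5 data + 2 parity)
echo "before rewrite: ${old} GiB raw, after: ${new} GiB raw, reclaimed: $(( old - new )) GiB"
```

Under those assumptions, 8 TiB written before the expansion occupies about 820 GiB more raw space than the same data rewritten under the 7-drive geometry. That gap is the capacity you only get back by rewriting.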

The expand-raidz.sh script in the example repo walks through this with loop devices so you can test it without touching real hardware. See the link at the top of this article.


vdev Sizing Rules of Thumb

Pool IOPS in ZFS doesn’t scale with drive count — it scales with vdev count. This is the single most important thing to understand about ZFS pool design, and it trips up people coming from mdadm.

In mdadm RAID 6, adding two more drives to your array increases read performance, because reads get spread across more spindles. In ZFS, every block in a RAID-Z vdev is split across the vdev’s drives, so a random read has to touch most of them. A single RAID-Z2 vdev with 10 drives therefore has roughly the same random IOPS as one with 6 drives. You’re getting more capacity, not more parallelism.

Want more IOPS? Add more vdevs. Two 6-drive Z2 vdevs in a pool will give you roughly double the IOPS of one 12-drive Z2 vdev. This is the ZFS equivalent of going from RAID 6 to RAID 60 — the same concept we covered in the RAID 50/60 article, but baked into how ZFS pools are designed rather than stacked mdadm arrays.
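To put that rule of thumb into numbers, here’s a small sketch that treats each RAID-Z vdev as delivering roughly one drive’s worth of random IOPS (the simplification described above, not a benchmark) and counts data drives per layout. The 150 IOPS per drive default is an assumed ballpark for a 7200 rpm disk, and layout_summary is a hypothetical helper.

```shell
#!/bin/sh
# layout_summary VDEVS WIDTH PARITY [PER_DRIVE_IOPS]
# Rough model: random IOPS ~ vdev count x one drive's IOPS;
# usable capacity ~ data drives (ignores padding and metadata).
layout_summary() {
  local vdevs="$1" width="$2" parity="$3" drive_iops="${4:-150}"
  local data_drives total
  data_drives=$(( vdevs * (width - parity) ))
  total=$(( vdevs * width ))
  echo "${vdevs}x${width}-wide raidz${parity}: ${data_drives}/${total} data drives, ~$(( vdevs * drive_iops )) random IOPS"
}

layout_summary 1 12 2
layout_summary 2 6 2
```

Same twelve drives either way: the two-vdev layout roughly doubles random IOPS but gives two more drives to parity.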

Practical sizing guide for RAID-Z2:

| Drives per vdev | Usable | Notes |
|---|---|---|
| 4 drives | 50% | Tight. Small arrays only. |
| 6 drives | 67% | Sweet spot. |
| 8 drives | 75% | Good. Rebuild time starts to grow on large drives. |
| 12 drives | 83% | Split into two 6-drive vdevs instead. |

The sweet spot is 4–8 drives per Z2 vdev. Above 8, you’re better off splitting into multiple smaller vdevs for IOPS and shorter rebuild windows.

12 drives? Two 6-drive RAID-Z2 vdevs. Double the IOPS and shorter rebuilds, at the cost of two more drives’ worth of parity (67% usable instead of 83%).


dRAID: Declustered Parity

dRAID is what happens when the ZFS team looked at large arrays — 24, 48, 60+ drives — and said “the traditional resilver model is going to take a week, that’s unacceptable, let’s rethink this.”

In a standard RAID-Z pool, when a drive fails, the resilver (rebuild) reads from every other drive in the failed drive’s vdev and writes the reconstructed data to a single hot spare or replacement drive. N-1 drives feed one drive, so the rebuild can’t go faster than that one drive can write.

dRAID distributes the data, parity, and spares differently from the start. Spare capacity is spread across all drives in the array — not sitting idle on dedicated spare drives. When a drive fails, every other drive in the pool participates in the rebuild simultaneously. The work fans out instead of funneling through a single spare.
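A back-of-envelope model shows why fanning out matters. This sketch assumes rebuild time is bounded only by write bandwidth: either one spare writes everything, or every surviving drive writes an equal share. Real resilvers are also limited by reads, fragmentation, and whatever else the pool is doing, so treat the dRAID number as a best case. The 20 TB of data on the failed drive and 150 MB/s per drive are illustrative assumptions, and rebuild_time is a hypothetical helper.

```shell
#!/bin/sh
# rebuild_time USED_TB WRITE_MBPS WRITERS
# Idealized lower bound: data to rebuild / (per-drive write speed x writers).
# Hypothetical helper; decimal units (1 TB = 1,000,000 MB), rounded down.
rebuild_time() {
  local used_tb="$1" mbps="$2" writers="$3" secs
  secs=$(( used_tb * 1000000 / (mbps * writers) ))
  printf '%dh%02dm\n' $(( secs / 3600 )) $(( secs % 3600 / 60 ))
}

echo "single hot spare:        ~$(rebuild_time 20 150 1)"
echo "dRAID, 23 of 24 drives:  ~$(rebuild_time 20 150 23)"
```

The single-spare figure lands right around the 36-hour window from the title. The idealized dRAID figure overstates what real pools achieve, but it shows why the rebuild drops from days to hours.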

The result: resilver times that used to take days drop to hours. On a 60-drive array, the difference is dramatic.

dRAID was introduced in OpenZFS 2.1 (2021). The command syntax reflects the more complex layout:

Terminal window
# draid2:8d:1s = 2 parity, 8 data drives per stripe, 1 distributed spare
# This needs at least 11 drives (8 data + 2 parity + 1 spare)
zpool create mydraidpool draid2:8d:1s \
  /dev/sda /dev/sdb /dev/sdc /dev/sdd \
  /dev/sde /dev/sdf /dev/sdg /dev/sdh \
  /dev/sdi /dev/sdj /dev/sdk
zpool status mydraidpool
  pool: mydraidpool
 state: ONLINE
config:

        NAME                  STATE     READ WRITE CKSUM
        mydraidpool           ONLINE       0     0     0
          draid2:8d:11c:1s-0  ONLINE       0     0     0
            sda               ONLINE       0     0     0
            sdb               ONLINE       0     0     0
            sdc               ONLINE       0     0     0
            sdd               ONLINE       0     0     0
            sde               ONLINE       0     0     0
            sdf               ONLINE       0     0     0
            sdg               ONLINE       0     0     0
            sdh               ONLINE       0     0     0
            sdi               ONLINE       0     0     0
            sdj               ONLINE       0     0     0
            sdk               ONLINE       0     0     0
        spares
          draid2-0-0          AVAIL

Note that draid2-0-0 shows as an available spare even though there’s no physically separate spare drive — it’s virtual capacity distributed across all 11 drives. When a real drive fails, that distributed spare capacity activates and all surviving drives contribute to the rebuild.
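The spec string packs the whole geometry into one token. A minimal sketch, assuming only the parity:data:spares form used above (OpenZFS also accepts an optional children count, written as Nc, which this hypothetical draid_summary helper simply rejects), that checks the minimum drive count and estimates usable capacity as (children - spares) × data / (data + parity), ignoring padding and metadata.

```shell
#!/bin/sh
# draid_summary SPEC CHILDREN
# SPEC in the form draid<P>:<D>d:<S>s, e.g. draid2:8d:1s.
# Hypothetical helper; capacity estimate ignores padding and metadata.
draid_summary() {
  local spec="$1" children="$2" rest p d s n min
  case $spec in
    draid[1-3]:*d:*s) ;;
    *) echo "unsupported spec: $spec" >&2; return 1 ;;
  esac
  rest=${spec#draid}   # e.g. 2:8d:1s
  p=${rest%%:*}        # parity level
  rest=${rest#*:}      # e.g. 8d:1s
  d=${rest%%d:*}       # data drives per redundancy group
  s=${rest#*d:}        # e.g. 1s
  s=${s%s}             # distributed spares
  for n in "$d" "$s"; do
    case $n in ''|*[!0-9]*) echo "unsupported spec: $spec" >&2; return 1 ;; esac
  done
  min=$(( d + p + s ))
  if [ "$children" -lt "$min" ]; then
    echo "$spec needs at least $min drives, got $children" >&2
    return 1
  fi
  echo "$spec on $children drives: min $min, ~$(( (children - s) * d / (d + p) )) drives of usable capacity"
}

draid_summary draid2:8d:1s 11
draid_summary draid2:8d:1s 24
```

On the 11-drive example that works out to about 8 drives of usable space; the same spec across 24 drives gives roughly 18.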

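Where dRAID shines: large arrays (16+ drives) where resilver time is the dominant risk, the data-center end of the spectrum.

Where dRAID is overkill: 4–12 drive home labs and small NAS boxes, and any pool you plan to grow one drive at a time, since RAID-Z expansion doesn’t apply to dRAID.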


RAID-Z vs dRAID: Which One

Here’s the honest decision guide. No hedging.

|  | RAID-Z2 | dRAID2 |
|---|---|---|
| Drive count | 4–12 drives | 16+ drives |
| Rebuild speed | Slow on large arrays | Dramatically faster |
| Hot spares | Separate spare drive(s) | Distributed in layout |
| Complexity | Low: easy to reason about | Higher: layout geometry matters |
| vdev expansion | Supported (OpenZFS 2.3+) | Not supported |
| Random IOPS | Scales with vdev count | Same |
| Comparable mdadm | RAID 6 or RAID 60 | No direct mdadm equivalent |
| Best for | Home labs, small NAS | Large arrays, data centers |

If you have a 6-bay or 8-bay NAS, RAID-Z2 is your answer. It’s simpler, well-understood, and the rebuild times on 8TB drives are uncomfortable but survivable. Two 6-drive Z2 vdevs if you want more IOPS — same concept as RAID 60 from the RAID 50/60 article, but done in ZFS natively.

If you’re building something with 24+ drives — maybe you’re running TrueNAS or OMV on real server hardware — dRAID2 is worth understanding seriously. The rebuild time difference at that scale is not academic. A 36-hour resilver window on a 24-drive pool running degraded is a 36-hour window where a second failure ends everything. dRAID shrinks that window to hours.


When NOT to Use ZFS

ZFS is great. ZFS is also not for everyone.

RAM constraints. The ARC (ZFS’s in-memory cache) is not optional. Rule of thumb: 4–8GB of ARC handles most home pools fine, but if you’re running a 4GB server where every gigabyte fights for space, ZFS is going to lose that fight. Btrfs or ext4+mdadm will serve you better. The ZFS vs Btrfs article has the full breakdown.
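If you do run ZFS on a tight box anyway, the usual knob is the zfs_arc_max module parameter, which takes a value in bytes. A small sketch (the arc_max_line helper is hypothetical) that does the GiB-to-bytes conversion and prints a line for /etc/modprobe.d/zfs.conf; check your distro’s documentation on whether the initramfs also needs regenerating before the cap applies at boot.

```shell
#!/bin/sh
# arc_max_line GIB -> modprobe option capping the ARC at GIB GiB.
# Hypothetical helper; zfs_arc_max is specified in bytes.
arc_max_line() {
  echo "options zfs zfs_arc_max=$(( $1 * 1024 * 1024 * 1024 ))"
}

arc_max_line 2
```

A cap won’t make ZFS comfortable on 4GB; it just keeps the ARC from crowding out everything else on the box.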

Single-drive systems. With no redundancy, ZFS can detect data corruption but has no second copy to repair it from, so you’re mostly paying extra RAM and a learning curve. Use ext4.

Distros with kernel licensing anxiety. ZFS is CDDL-licensed and Linux is GPL, and the two licenses are arguably incompatible, which is why ZFS lives outside the mainline kernel. Ubuntu ships ZFS anyway. Some distros don’t. If ZFS requires significant manual installation work on your system and you’re not committed to learning it, don’t bother.

You just want a quick NAS. If you’re spinning up a two-drive mirror for Plex media and you don’t want to learn pool management, mdadm RAID 1 + ext4 gets the job done in five minutes. ZFS is a commitment. Make sure you want to make it before you format six drives.


Real Talk

ZFS isn’t magic. It’s a specific set of trade-offs: more RAM, more concepts to learn, in exchange for checksums on every block, no write hole, COW snapshots, and replication that actually works. If those trade-offs match your use case, ZFS is genuinely excellent. If they don’t, it’s a forklift for your couch.

RAID-Z2 is the right default for 4–12 drive home lab arrays. dRAID is the right answer when traditional resilver times stop being theoretical and start being your Saturday problem.

Snapshots and replication are where ZFS really pulls away from mdadm. zfs send | zfs receive is one of the most elegant backup primitives in Linux. Covered in the ZFS replication with Syncoid and Sanoid article.

Picking a NAS OS to run ZFS — TrueNAS Scale, OpenMediaVault, Unraid — is its own rabbit hole. The TrueNAS vs OMV vs Unraid comparison sorts it out.

And as always: RAID is not a backup. RAID-Z2 survives two simultaneous drive failures. It does nothing when you rm -rf the wrong directory, ransomware encrypts your pool, or the server falls off a shelf. Run ZFS. Run replication. Run an offsite backup.

Your 2 AM self will appreciate the distinction.

