My Apple Photos library is about 3 TB — roughly 1.3 million files. It outgrew a single drive a while back, and I had a wish list for the replacement setup:
- Span multiple drives — consumer SSDs go up to 8 TB now, but they’re outside the price-per-TB sweet spot. I wanted to stay future-proof without paying a premium.
- Single drive fault tolerance — I don’t want a dead SSD to mean lost photos
- Bit rot protection — silent data corruption is real and I want checksums
- APFS volume — Apple Photos requires it. I did try pointing Photos at a ZFS dataset directly. It looked like it worked for a moment, right up until I tried to do anything, at which point it came to a screeching halt. APFS also has instant, zero-cost file clones, which I rely on to keep files both inside the Photos library and in my own folder structure on disk. ZFS doesn’t have that.
- Direct-attached to my Mac Mini — not a NAS. I’d tried running Apple Photos over a network share before and didn’t enjoy the experience
No single technology does all of this. My esoteric solution: APFS on a ZFS zvol. I built a ZFS pool from three 4 TB SATA SSDs in a RAIDZ1 configuration (ZFS’s equivalent of RAID5), created a zvol, formatted it as APFS, and ticked every box. I did not benchmark the write performance first. This turned out to be a mistake.
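The setup, reconstructed from memory with placeholder device names and sizes, was roughly:

```shell
# Three-disk RAIDZ1 pool (single-drive fault tolerance).
zpool create tank raidz1 disk2 disk3 disk4

# A zvol to carry APFS; volblocksize is fixed at creation time.
zfs create -V 7T -o volblocksize=16K tank/photos

# The zvol surfaces as a block device; format it as APFS
# (diskX is whatever device node the zvol appears as).
diskutil eraseDisk APFS Photos /dev/diskX
```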
It was painfully slow. Not “a little slow” — 19 MB/s for file-level operations on SSDs capable of hundreds of megabytes per second. I spent a few weeks running experiments to figure out why, and the answer surprised me.
The original (wrong) hypothesis
The pool was created with 16K volblocksize, and I’d read that small volblocksizes on RAIDZ cause write amplification. So I assumed migrating to 128K blocks would fix things. I created a new 128K zvol and started rsyncing 3 TB to it — without benchmarking first. Again.
The rsync crawled at 3.5 MB/s. Worse than the original.
It turned out the new zvol had sync=standard (the default), under which ZFS honors every flush request the layer above sends down, and APFS sends them constantly. Setting sync=disabled on the 128K zvol brought it up to 25 MB/s, but 16K with sync=disabled was 280 MB/s. The original blocksize was fine. The original hypothesis was completely wrong.
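Checking and changing this is a one-liner per zvol (pool and volume names are placeholders):

```shell
# See which properties the zvol actually has.
zfs get sync,volblocksize,compression tank/photos

# Trade durability of in-flight writes for throughput.
# Acceptable here only because an offsite backup exists.
zfs set sync=disabled tank/photos
```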
But 280 MB/s was a cached result (the test only wrote 10 GB into 64 GB of RAM). The real question was: what’s the sustained throughput for real file-level workloads?
The experiments
I ran a series of benchmarks writing 6,000 files of 14 MB each (82 GB total), with the ZFS dirty data buffer reduced to 512 MB to prevent caching from masking the results. Each test ran long enough to reach steady state.
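The file-create side of these benchmarks can be sketched as a small POSIX shell function. This is my simplified reconstruction, not the exact script from the repo, and the paths are placeholders:

```shell
# Write $1 files of $2 MB each into directory $3 and print
# aggregate throughput. Real runs should also shrink the ZFS
# dirty data buffer so results reflect the disks, not RAM.
write_bench() {
  count=$1; size_mb=$2; dir=$3
  mkdir -p "$dir"
  start=$(date +%s)
  i=0
  while [ "$i" -lt "$count" ]; do
    dd if=/dev/zero of="$dir/file_$i" bs=1048576 count="$size_mb" 2>/dev/null
    i=$((i + 1))
  done
  sync                               # flush so the drives are measured, not cache
  end=$(date +%s)
  elapsed=$((end - start))
  [ "$elapsed" -eq 0 ] && elapsed=1  # avoid division by zero on tiny runs
  echo "$((count * size_mb / elapsed)) MB/s"
}

# The real run was equivalent to: write_bench 6000 14 /Volumes/Photos/bench
```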
Here’s what I found:
The two bars in the middle are barely visible. That’s the point.
What each test measured
APFS on bare SSD (234 MB/s): Writing files directly to an APFS volume on a single SATA SSD, no ZFS involved. This is the baseline — how fast the hardware and filesystem can go.
APFS on RAIDZ1 zvol (19 MB/s): The production setup. APFS sitting on a ZFS zvol on the RAIDZ1 pool. 12x slower than bare APFS.
ZFS dataset on RAIDZ1 (15 MB/s): ZFS’s own native filesystem on the same RAIDZ1 pool, no APFS, no zvol. Even slower. This ruled out APFS as the culprit — the double copy-on-write “APFS on ZFS” stack wasn’t the problem.
ZFS stripe (254 MB/s): A ZFS pool with two drives, no parity. Each drive is its own vdev — data is striped across them, but no parity is computed. Actually faster than the single bare SSD because two drives share the load. This was the last experiment I ran, and the one that confirmed the fix.
RAIDZ1 is fine for big files
Here’s the thing that makes this confusing: RAIDZ1 is fast for sequential I/O. Writing a single 40 GB file to the same pool hit 203 MB/s. The hardware path is fine. The problem is specific to creating many files.
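The sequential number came from streaming one large file, along the lines of the following (path is a placeholder; macOS dd spells the block size `1m`, GNU dd wants `1M`):

```shell
# One 40 GB sequential write: RAIDZ1 sustains this fine.
dd if=/dev/zero of=/Volumes/Photos/bigfile bs=1m count=40960
sync
```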
A 10x difference on the same hardware, same pool, same drives. The only variable is whether you’re writing one big file or many smaller ones.
Why RAIDZ1 is pathologically slow for file creates
In a ZFS stripe, when you create a file, the allocator sends the data and metadata blocks to whichever vdev it picks next (roughly round-robin, biased toward free space). One drive does the work, and you’re done.
In RAIDZ1, every block, no matter how small, carries its own parity. When ZFS needs to write a 16K metadata block (a dnode, an indirect block, a space map entry), it can’t just write 16K to one drive. It splits the data across the data drives, computes parity, and writes that to another drive. Even the smallest write becomes a multi-drive operation.
For sequential writes, this is fine. ZFS fills full stripes efficiently, and the parity cost is amortized across large chunks of data.
But creating a file generates many tiny, scattered metadata writes — B-tree updates, dnode allocations, indirect blocks, space map entries. Each one triggers a full-width stripe write. What would be a quick single-drive operation in a stripe becomes a synchronized multi-drive operation in RAIDZ1, serialized on the slowest drive.
This isn’t a bug — it’s inherent to how parity-based redundancy works with copy-on-write. Every COW metadata block on RAIDZ1 requires a parity stripe. And a photo library with a million files generates a lot of metadata.
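You can watch the mechanism directly: run a per-vdev I/O monitor while creating many small files, and all three raidz1 members show write activity even though each logical write is tiny (pool name is a placeholder):

```shell
# Per-vdev I/O statistics, refreshed every second. During a
# metadata-heavy workload, every raidz1 member takes writes.
zpool iostat -v tank 1
```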
The stripe benchmark
Once I understood the problem, I tested a ZFS stripe pool (two drives, no parity) to confirm that removing RAIDZ1 parity was the fix:
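For the record, building a stripe is just listing disks with no raidz keyword (device names are placeholders):

```shell
# Each bare disk becomes its own top-level vdev;
# ZFS stripes writes across them, with no parity computed.
zpool create scratch disk5 disk6
```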
The stripe held steady at 254 MB/s for the entire 82 GB write with no degradation. The RAIDZ1 line barely registers on the same scale.
The fix
I’m going to destroy the RAIDZ1 pool and rebuild it as a stripe — three 4 TB SSDs as independent vdevs, no parity. I’ll go from ~8 TB usable (RAIDZ1 loses a drive to parity) to ~12 TB, and from 15 MB/s to 250+ MB/s for file-level operations.
The trade-off is obvious: any single drive failure takes down the whole pool. But I have offsite backup, so a drive failure means a day of restoring, not data loss. For a personal photo library, that’s an acceptable trade. For a production database, it wouldn’t be.
I do lose the bit rot protection from RAIDZ1 parity (a stripe pool can detect corruption via checksums, but can’t repair it without redundancy). I’m keeping ZFS for the checksums — at least I’ll know if something goes wrong — and relying on the offsite backup for recovery.
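Checksums still catch corruption on a stripe; a periodic scrub plus a status check tells you when to reach for the backup (pool name is a placeholder):

```shell
# Read every block and verify it against its checksum.
zpool scrub tank

# A nonzero CKSUM column or listed errors means: restore from backup.
zpool status -v tank
```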
Lessons
- Benchmark before you migrate. I migrated 3 TB onto RAIDZ1 without testing write performance. Then I almost migrated it again to a 128K zvol based on a theory that turned out to be completely wrong. Measure twice, rsync once.
- Cache will lie to you. With 64 GB of RAM and a 3 GB ZFS dirty data buffer, any benchmark under ~10 GB is meaningless on this machine. Early tests reported 280 MB/s for a setup that actually sustained 19 MB/s.
- RAIDZ is not RAID5. Traditional hardware RAID5 has its own problems, but it doesn’t do copy-on-write. ZFS COW + RAIDZ parity is a specific combination that creates pathological performance for small scattered writes.
- “APFS on ZFS” wasn’t the problem. My initial suspicion was that running APFS on a ZFS zvol (double copy-on-write) was causing the slowdown. The experiments showed APFS-on-zvol was actually slightly faster than native ZFS for file writes. APFS is innocent.
- This isn’t just an SSD thing. It’s tempting to think the penalty is only visible because SSDs are fast enough to expose it. But the underlying mechanism — every metadata write requiring a full parity stripe — applies to any storage. On spinning disks the absolute numbers are lower and the penalty may not be exactly 17x, but the same fundamental overhead is there. It’s just easier to blame the hardware when everything is already slow.
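For reference, the dirty data buffer is a tunable, not a pool property. On Linux it is the module parameter `zfs_dirty_data_max` under /sys/module/zfs/parameters; the OpenZFS-on-macOS build exposes its tunables via sysctl, and the exact key below is my best recollection of that namespace, so verify it with `sysctl -a` first:

```shell
# Cap ZFS's in-RAM write buffer at 512 MB so benchmarks hit the disks.
# Key name is an assumption; confirm via: sysctl -a | grep dirty_data_max
sudo sysctl kstat.zfs.darwin.tunable.zfs_dirty_data_max=536870912
```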
System: M4 Mac Mini, 64 GB RAM, 3x 4 TB SATA SSDs in OWC ThunderBay 4 mini (Thunderbolt 3), OpenZFS 2.3.0 on macOS. All experiment data and scripts are on GitHub.