What ZFS pool states mean and what to do first
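Short answer: ZFS is a combined file system and volume manager, widely used on NAS (network-attached storage) and server storage, that keeps a checksum for every data block. The default checksum (fletcher4) is not cryptographic; SHA-256 and similar algorithms are optional. When a block read does not match its checksum, ZFS repairs it from a redundant copy if one exists and otherwise reports the error. A pool is marked DEGRADED when a device has failed but enough redundancy remains to keep running, and FAULTED when it does not. Most DEGRADED pools are recoverable without data loss. FAULTED pools require investigation before declaring data gone. Start with diagnostics, not panic.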
Step-by-step ZFS pool corruption recovery
Step 1: Run zpool status and understand what it tells you
Run zpool status -v poolname on the server or NAS. The output will show each disk's state (ONLINE, DEGRADED, FAULTED, REMOVED, UNAVAIL), the number of read/write/checksum errors per disk, and whether the pool itself is ONLINE, DEGRADED, or FAULTED. Read errors usually indicate a failing drive. Checksum errors often indicate bit rot (data corruption on disk) or cable issues. Write errors indicate drive problems or controller failures. This output tells you which specific drive is causing problems before you touch anything.
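To make the reading of that table concrete, here is a minimal Python sketch that parses the device table from zpool status text and lists the rows that deserve attention. The sample text is illustrative rather than captured from a real system, the function names parse_devices and suspects are our own, and the exact column layout can differ between ZFS versions, so treat it as a starting point.

```python
# Illustrative sample only; real `zpool status -v` output has more sections.
SAMPLE_STATUS = """\
  pool: tank
 state: DEGRADED
config:

        NAME        STATE     READ WRITE CKSUM
        tank        DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            sda     ONLINE       0     0     0
            sdb     FAULTED     12     0     3
            sdc     ONLINE       0     0     7

errors: No known data errors
"""

KNOWN_STATES = {"ONLINE", "DEGRADED", "FAULTED", "REMOVED", "UNAVAIL", "OFFLINE"}


def parse_devices(status_text):
    """Return one dict per row of the NAME/STATE/READ/WRITE/CKSUM table."""
    rows = []
    in_table = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields[:2] == ["NAME", "STATE"]:
            in_table = True  # header found; device rows follow
            continue
        if not in_table:
            continue
        if len(fields) < 5 or fields[1] not in KNOWN_STATES:
            continue  # blank line or a trailing section such as "errors:"
        name, state, read, write, cksum = fields[:5]
        # Counters stay as strings because large values may be abbreviated.
        rows.append({"name": name, "state": state,
                     "read": read, "write": write, "cksum": cksum})
    return rows


def suspects(rows):
    """Rows that are not ONLINE or that show any non-zero error counter."""
    return [r for r in rows
            if r["state"] != "ONLINE"
            or any(r[k] != "0" for k in ("read", "write", "cksum"))]


if __name__ == "__main__":
    for row in suspects(parse_devices(SAMPLE_STATUS)):
        print(row["name"], row["state"], row["read"], row["write"], row["cksum"])
```

The pool and vdev rows are reported along with the disks because they share the same columns. In this sample, sdc is still ONLINE but is flagged for its checksum errors, which is exactly the kind of drive that is easy to overlook.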
Step 2: Run zpool scrub before making any changes
A scrub reads every data block in the pool, verifies it against its stored checksum, and (if redundancy is available) repairs any mismatches automatically. Run zpool scrub poolname and let it complete. The command returns immediately and the scrub runs in the background; it can take hours on large pools, and zpool status shows its progress. After the scrub, check zpool status again. If the scan line reports "scrub repaired 0B" with 0 errors, or a small repaired amount with 0 errors, your data is intact and the pool is healthy again. If the scrub repairs large amounts, reports errors, or fails to complete, a drive is probably failing and should be replaced.
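The check after a scrub can be sketched in Python as follows. This assumes the scan line has the form "scrub repaired X in DURATION with N errors", which is what recent OpenZFS releases print; older versions word it differently, so the pattern is an assumption to adjust. The name parse_scrub_result and the sample lines are our own.

```python
import re

# Assumed wording of the scan line; adjust for your ZFS version.
SCAN_RE = re.compile(r"scrub repaired (\S+) in (.+?) with (\d+) errors")


def parse_scrub_result(status_text):
    """Return repaired amount, duration and error count, or None if no
    completed scrub is reported in the text."""
    match = SCAN_RE.search(status_text)
    if match is None:
        return None
    return {
        "repaired": match.group(1),
        "duration": match.group(2),
        "errors": int(match.group(3)),
    }


if __name__ == "__main__":
    # Illustrative line, not captured from a real system.
    done = "  scan: scrub repaired 0B in 02:41:17 with 0 errors on Sun Mar  3 03:05:18 2024"
    print(parse_scrub_result(done))
```

A result of None means the scrub is still running or has never been run, so check again later before drawing conclusions.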
Step 3: Replace failed drives and rebuild
For a RAIDZ1 pool (one drive can fail without data loss), replace the failed drive with a drive of equal or larger capacity and run zpool replace poolname /dev/old-disk /dev/new-disk. ZFS will begin a resilver (a rebuild operation, similar to a RAID rebuild) automatically. Monitor progress with zpool status. A RAIDZ2 pool tolerates two failed drives. Never remove a second drive while the first resilver is in progress: a RAIDZ1 pool has no redundancy left from the moment one drive is out, so any further failure during the resilver can lose data.
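The redundancy arithmetic behind this step is simple enough to write down. The Python sketch below uses our own hypothetical helper names; it only builds the zpool replace command as a list for review and does not run it. RAIDZ3, which tolerates three failures, is included for completeness.

```python
# Number of drives each RAIDZ level can lose per vdev without data loss.
PARITY = {"raidz1": 1, "raidz2": 2, "raidz3": 3}


def remaining_tolerance(vdev_type, failed_drives):
    """How many MORE drives the vdev can lose. Zero means no redundancy
    is left; a negative value means the vdev is beyond its tolerance."""
    if vdev_type not in PARITY:
        raise ValueError(f"unknown vdev type: {vdev_type}")
    return PARITY[vdev_type] - failed_drives


def replace_command(pool, old_disk, new_disk):
    """Build (but do not run) the replace command described above."""
    return ["zpool", "replace", pool, old_disk, new_disk]


if __name__ == "__main__":
    print(remaining_tolerance("raidz1", 1))  # no redundancy left
    print(remaining_tolerance("raidz2", 1))  # one more failure survivable
    print(" ".join(replace_command("tank", "/dev/old-disk", "/dev/new-disk")))
```

A tolerance of zero is the situation to get out of quickly, and a negative value is the point where DIY recovery should stop.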
Step 4: The India angle — power cuts and ZFS pool imports
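The most common ZFS failure scenario we encounter for Indian SME NAS setups is pool import failure after an abrupt power cut. ZFS uses transaction groups (TXGs), atomic commits of all changes, together with copy-on-write, so an interrupted write normally leaves the previous consistent state on disk. Import problems after a power cut typically have one of two causes. First, the pool was never exported, so zpool import warns that it may be in use by another system. Running zpool import -f poolname overrides that check; it does not discard anything. Second, and less often, the most recent TXGs are damaged, usually because a drive or controller acknowledged writes it had not yet committed. In that case zpool import -F poolname (capital F, recovery mode) rewinds to an earlier consistent TXG and discards the last few transactions. Run zpool import -F -n poolname first to check whether the rewind would succeed without changing anything. Some files written in the last few seconds before the power cut may be lost, but the bulk of the data will be intact. A UPS (uninterruptible power supply) with even 5 minutes of runtime, set up to shut the NAS down cleanly, prevents most of these cases.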
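One cautious order in which to try an import after a power cut can be sketched in Python, least destructive first. The flags are the documented zpool import options: -f overrides the "pool may be in use" check, -o readonly=on imports without writing, -F is recovery mode, and -n combined with -F only reports whether recovery would work. The ordering, the helper name import_attempts, and the descriptions are our own suggestion, and nothing here is executed.

```python
import shlex


def import_attempts(pool):
    """Return (command, purpose) pairs, least destructive first.
    The commands are only built for review, never run."""
    return [
        (["zpool", "import", pool],
         "plain import; works if the pool was cleanly exported"),
        (["zpool", "import", "-o", "readonly=on", "-f", pool],
         "override the in-use check, read-only so nothing is written"),
        (["zpool", "import", "-F", "-n", pool],
         "dry run: report whether recovery mode could import the pool"),
        (["zpool", "import", "-F", pool],
         "recovery mode: discards the last few transactions"),
    ]


if __name__ == "__main__":
    for command, purpose in import_attempts("tank"):
        print(f"{shlex.join(command):45}  # {purpose}")
```

If the read-only import succeeds, copy off anything critical before attempting the later steps.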
When to call a recovery service (and what it costs in India)
When DIY ends
Stop and call a professional if any of the following is true: two or more drives have failed in a RAIDZ1 pool (beyond the pool's redundancy tolerance); the pool imports and zfs list shows the datasets, but zfs mount fails for all of them; zpool scrub still reports uncorrectable errors after the drive has been replaced; or the ZFS metadata (pool configuration) itself is corrupt and the pool refuses to import at all.
Typical cost in India
Professional ZFS pool analysis and recovery assistance for software-level issues: ₹5,000–₹20,000. Physical drive recovery if individual drives have bad sectors: ₹5,000–₹18,000 per drive. For NAS-based recovery in Indian SME setups, also see our NAS volume rebuild recovery guide and the broader data recovery service.
A note from the LRW Engineer Team
ZFS is one of the most resilient file systems available, and we consistently see it save data that would be permanently lost on ext4 or NTFS under the same conditions. The key is responding to a DEGRADED state immediately rather than running the pool short a drive for weeks. A ZFS pool showing DEGRADED is a ticking clock, not a solved problem. On RAIDZ1 the redundancy is already gone, and the next drive failure will take your data with it; on RAIDZ2 you are one failure away from the same position. Replace the failed drive the same day.