I run several fileservers using the excellent ZFS filesystem. While originally a Solaris invention, it was successfully ported to FreeBSD in version 7.
It is extremely well suited for large fileservers due to its combination of reliable software RAID (modes 1, 5, and 6), checksumming, and snapshots. Combined with the option of nested filesystems and the ability to set options on each of these filesystems separately, this makes it extremely scalable, as well as endlessly configurable.
A quick introduction: ZFS works on two levels, pools and filesystems. All pool configuration is done through the zpool command, while filesystem configuration is done through the zfs command.
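To make that split concrete, here is a minimal sketch; the pool name tank and the device names are placeholders, not taken from any real system. The pool is created once with zpool, and filesystems are then carved out of it with zfs:
zpool create tank mirror ad4 ad6
zfs create tank/data
zfs set compression=on tank/data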
Pools
Pool configuration is generally only done once, and after that the zpool command is only used for maintenance, repairs or expansion. Before you can use any ZFS filesystem, you need to create a pool: a collection of one or more disks or partitions ZFS can use to store data. The most basic pool is of course a single disk, but this wouldn't give you any redundancy whatsoever. Using two disks you can create a mirrored pool, with three or more disks you can create a RAID 5 pool (called raidz by ZFS), and with four or more disks RAID 6 (raidz2) becomes an option. It is possible to combine multiple RAID sets into a single pool, so you could for example have a single pool consisting of two mirrored 500GB drives and five 1TB drives in a RAID 5 configuration.
Note that the building blocks can be complete disks, partitions or files. However, using one drive as two partitions means you lose two building blocks when that single drive fails, so this doesn't really add redundancy, and using files makes you dependent on another filesystem. It is also not possible to nest RAID levels, so building a RAID 5 set out of three mirrors is not an option.
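That last example would be created with something along these lines; this is only a sketch, and both the pool name and the device names are made up:
zpool create -f tank mirror ad4 ad6 raidz ad8 ad10 ad12 ad14 ad16
The single command gives you one pool backed by two RAID sets, a two-disk mirror and a five-disk raidz; the -f is needed because zpool warns when you mix different vdev types in one pool.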
Expansion
But what happens if your pool is getting full? Let's say we have a RAID 5 pool consisting of three 500GB disks, giving us about 1TB of usable space. Ideally, we'd add a single 500GB disk to grow that RAID set to 1.5TB. Unfortunately, this is one of the few things not yet supported by ZFS. There are some alternatives though:
Adding a new RAID set. If your system has enough free connectors and bays, simply add several more disks to the system, and add them to the pool as a new RAID set. For example, we could add three 1.5TB disks in a raidz configuration to the pool, growing it by an effective 3TB. ZFS will automatically spread any new data over all disks to optimize performance.
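For the example above, that would look something like the following; the device names are placeholders for whatever the new disks show up as:
zpool add honeycomb raidz ad12 ad14 ad16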
If your system doesn't have the hardware available to add more disks, you can grow the existing RAID set by replacing its disks one by one, allowing the array to rebuild after each swap. This is what I recently did on one of my servers. It was running the three-disk raidz setup with 500GB disks mentioned above, and one of the disks had failed. I purchased three new 1.5TB disks, shut down the system, and replaced the damaged one.
After rebooting, I gave the command:
[root@honeycomb /honeycomb/]# zpool replace honeycomb ad6
Where honeycomb is the name of the zpool, and ad6 the drive to be replaced. By indicating only one device, I’m telling ZFS to replace the device with a new one at the same location. Once it’s done, shut down, replace the next disk, and repeat.
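For this particular array, the whole cycle thus comes down to something like this, with a shutdown and a physical disk swap before each command:
zpool replace honeycomb ad6
zpool replace honeycomb ad8
zpool replace honeycomb ad10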
While the rebuild (a resilver, in ZFS terms) is in progress, you can use the status command to see how far along it is:
[root@honeycomb /honeycomb/]# zpool status -v
  pool: honeycomb
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool
        will continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 4h35m, 93.59% done, 0h18m to go
config:

        NAME            STATE     READ WRITE CKSUM
        honeycomb       DEGRADED     0     0     0
          raidz1        DEGRADED     0     0     0
            ad6         ONLINE       0     0     0  229M resilvered
            ad8         ONLINE       0     0     0  229M resilvered
            replacing   DEGRADED     0     0     0
              ad10/old  OFFLINE      0  137K     0
              ad10      ONLINE       0     0     0  337G resilvered

errors: No known data errors
Here you see the last disk in the array being replaced. Note that because the old disk is offline, the rebuilding is done by recalculating the data from the other disks in the array. If the old disk is still operational, and you have a spare SATA port, you could replace the disk while the old one is still connected, possibly speeding up the process.
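In that case you name both the old and the new device, so ZFS can rebuild onto the new disk while the old one is still part of the array; a sketch, with ad12 standing in for wherever the extra disk ends up:
zpool replace honeycomb ad10 ad12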
Update 2013-11-12
The last part of this post concerns an older version of FreeBSD/ZFS and is no longer correct. Read on here for the correct way to proceed.
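On newer FreeBSD/ZFS versions, the pool is generally grown in place instead, either by enabling the autoexpand property on the pool before swapping disks or by expanding each replaced device afterwards with zpool online -e. A rough sketch, not necessarily identical to the linked instructions:
zpool set autoexpand=on honeycomb
zpool online -e honeycomb ad6 ad8 ad10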
After all disks have been replaced, you need to export and import the pool, after which ZFS will see the larger disks and grow the pool accordingly:
[root@honeycomb /honeycomb/]# zfs list
NAME        USED  AVAIL  REFER  MOUNTPOINT
honeycomb   908G  23.3G  5.26M  /honeycomb
[root@honeycomb /honeycomb/]# zpool export honeycomb
[root@honeycomb /honeycomb/]# zpool import honeycomb
[root@honeycomb /honeycomb/]# zpool status
  pool: honeycomb
 state: ONLINE
 scrub: resilver completed after 4h56m with 0 errors on Mon Jan 17 11:21:38 2011
config:

        NAME         STATE     READ WRITE CKSUM
        honeycomb    ONLINE       0     0     0
          raidz1     ONLINE       0     0     0
            ad6      ONLINE       0     0     0
            ad8      ONLINE       0     0     0
            ad10     ONLINE       0     0     0

errors: No known data errors
[root@honeycomb /honeycomb/]# zfs list
NAME        USED  AVAIL  REFER  MOUNTPOINT
honeycomb   908G  1.84T  5.26M  /honeycomb
As you can see, the pool is complete again, with no trace of the failed drive, and the available disk space has grown by a little over 1.8TB[1].
Next time, I’ll go deeper into the zfs command and creating filesystems.
- [1] Not by 2TB, as you might expect, since for marketing reasons hard drives are specified in powers of 10, not the powers of 2 that computers use.
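To put a rough number on it: the raidz set's usable capacity grows from about 2 × 500GB to 2 × 1.5TB, a gain of 2TB in drive-label terms, and 2 × 10^12 bytes divided by 2^40 bytes per TiB is about 1.82 TiB, which matches the roughly 1.8TB jump zfs list shows.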