Growing a ZFS pool

I run several fileservers using the excellent ZFS filesystem. While originally a Solaris invention, it was successfully ported to FreeBSD in version 7.

It is extremely well suited for large fileservers due to its combination of reliable software RAID (equivalents of RAID 1, 5, and 6), checksumming and snapshots. Combined with nested filesystems and the ability to set options on each of these filesystems separately, this makes it extremely scalable as well as endlessly configurable.

A quick introduction: ZFS works on two levels, pools and filesystems. All pool configuration is done through the zpool command, while filesystem configuration is done through the zfs command.
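
To give a feel for this split, here is a minimal, purely illustrative example (the pool name tank and the da0-da2 device names are made up, not taken from my servers):

# zpool create tank raidz da0 da1 da2
# zfs create tank/backups
# zfs set compression=on tank/backups

The first command builds the pool from three disks; the other two create a filesystem inside it and change an option on just that filesystem.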

Pools

Pool configuration is generally only done once, and after that the zpool command is only used for maintenance, repairs or expansion. Before you can use any ZFS filesystem, you need to create a pool: a collection of one or more disks or partitions ZFS can use to store data. The most basic pool is of course a single disk, but this wouldn’t give you any redundancy whatsoever. With two disks, you can create a mirrored pool; with three or more disks, you can create a RAID 5 pool (called raidz by ZFS); and with four or more disks, RAID 6 (raidz2) becomes an option. It is possible to combine multiple RAID sets into a single pool, so you could for example have a single pool consisting of two mirrored 500GB drives plus five 1TB drives in a RAID 5 configuration.
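
In addition to the raidz example above, the other layouts are created like this (each line is a separate alternative, again with made-up device names):

# zpool create tank mirror da0 da1
# zpool create tank raidz2 da0 da1 da2 da3
# zpool create -f tank mirror da0 da1 raidz da2 da3 da4 da5 da6

The last line builds the combined example: a mirrored pair plus a five-disk raidz set in one pool. The -f is needed because zpool warns when you mix different vdev types in a single pool.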

Note that the building blocks can be complete disks, partitions or files. However, splitting one drive into two partitions means you lose both building blocks if that drive fails, so this doesn’t really add redundancy, and using files makes you dependent on another filesystem. It is not possible to nest RAID levels, so building a RAID 5 set out of 3 mirrors is not an option.
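
If you want to experiment with the commands in this post without dedicating real disks, file-backed vdevs are handy for testing (the paths below are arbitrary examples; don’t trust real data to such a pool):

# truncate -s 1G /tmp/zdisk0 /tmp/zdisk1 /tmp/zdisk2
# zpool create testpool raidz /tmp/zdisk0 /tmp/zdisk1 /tmp/zdisk2
# zpool destroy testpool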

Expansion

But what happens if your pool is getting full? Let’s say we have a RAID 5 pool consisting of three 500GB disks, giving us about 1TB of usable space. Ideally, we’d add a single 500GB disk to grow that RAID set to 1.5TB. Unfortunately, this is one of the few things not yet supported by ZFS. There are some alternatives though:

Adding a new RAID set. If your system has enough free connectors and bays, simply add several more disks to the system, and add them to the pool as a new RAID set. For example, we could add three 1.5TB disks in a raidz configuration to the pool, growing it by an effective 3TB. ZFS will automatically spread any new data over all disks to optimize performance.
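
Assuming the three new disks showed up as ad12, ad14 and ad16 (made-up names), adding them to my pool would look something like this:

[root@honeycomb /honeycomb/]# zpool add honeycomb raidz ad12 ad14 ad16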

If your system doesn’t have the available hardware to add more disks, you can grow the existing RAID set by replacing the disks one by one, allowing the array to rebuild after each swap. This is what I recently did on one of my servers. It was running the above-mentioned three-disk raidz setup with 500GB disks, and one of them had failed. I purchased three new 1.5TB disks, shut down the system, and replaced the damaged one.

After rebooting, I gave the command:

[root@honeycomb /honeycomb/]# zpool replace honeycomb ad6

Here honeycomb is the name of the zpool, and ad6 the drive to be replaced. By specifying only one device, I’m telling ZFS to replace the device with a new one at the same location. Once the resilver is done, shut down, replace the next disk, and repeat.

While the rebuild is in progress, you can use the status command to see how far along it is:

[root@honeycomb /honeycomb/]# zpool status -v
  pool: honeycomb
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool 
        will continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scrub: resilver in progress for 4h35m, 93.59% done, 0h18m to go
config:

        NAME            STATE     READ WRITE CKSUM
        honeycomb       DEGRADED     0     0     0
          raidz1        DEGRADED     0     0     0
            ad6         ONLINE       0     0     0  229M resilvered
            ad8         ONLINE       0     0     0  229M resilvered
            replacing   DEGRADED     0     0     0
              ad10/old  OFFLINE      0  137K     0
              ad10      ONLINE       0     0     0  337G resilvered

errors: No known data errors

Here you see the last disk in the array being replaced. Note that because the old disk is offline, the rebuilding is done by recalculating the data from the other disks in the array. If the old disk is still operational, and you have a spare SATA port, you could replace the disk while the old one is still connected, possibly speeding up the process.
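
In that case, zpool replace takes two devices: the old one and its replacement. If the new disk had shown up as, say, ad12 (a made-up name), the command would have been:

[root@honeycomb /honeycomb/]# zpool replace honeycomb ad10 ad12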

Update 2013-11-12

The last part of this post concerns an older version of FreeBSD/ZFS and is no longer correct. Read on here for the correct way to proceed.


After all disks have been replaced, you need to export and import the pool, after which ZFS will see the larger disk size, and grow the system accordingly:

[root@honeycomb /honeycomb/]# zfs list
NAME                                  USED  AVAIL  REFER  MOUNTPOINT
honeycomb                            908G  23.3G  5.26M  /honeycomb

[root@honeycomb /honeycomb/]# zpool export honeycomb
[root@honeycomb /honeycomb/]# zpool import honeycomb
[root@honeycomb /honeycomb/]# zpool status
  pool: honeycomb
 state: ONLINE
 scrub: resilver completed after 4h56m with 0 errors on Mon Jan 17 11:21:38 2011
config:

        NAME        STATE     READ WRITE CKSUM
        honeycomb   ONLINE       0     0     0
          raidz1    ONLINE       0     0     0
            ad6     ONLINE       0     0     0
            ad8     ONLINE       0     0     0
            ad10    ONLINE       0     0     0

errors: No known data errors
[root@honeycomb /honeycomb/]# zfs list
NAME                                  USED  AVAIL  REFER  MOUNTPOINT
honeycomb                            908G  1.84T  5.26M  /honeycomb

As you can see, the pool is complete again, with no trace of the failed drive, and the available space has grown by a little over 1.8TB[1].

Next time, I’ll go deeper into the zfs command and creating filesystems.

  1. Not by 2TB, as you might expect, since for marketing reasons, hard drives are specified in powers of 10, not the powers of 2 that computers use.

17 Comments

  • Stark wrote:

    Did you do a scrub after the first disk?
    With a failed drive it won’t do you much good, but if expansion is the reason, I’d do a scrub before swapping the first drive.
    But then again… I’m paranoid…

  • Quite frankly, I’m not entirely certain, but I think that when the disk first gave errors, I did a scrub just to see if it was really broken, or just messed up.

    You can still do a scrub with a critical array by the way, but it will only do something for filesystems with checksumming turned on.

    Doing it before a grow operation without a failed drive is certainly a good idea.
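
    For reference, kicking one off is simply:

    # zpool scrub honeycomb

    and zpool status will show its progress, just like with a resilver.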

  • After reconsidering whether ZFS was the right option to use for a backup drive array, this article reassured me that there is a way to expand them. Thanks! I will be expanding as soon as I can purchase another 2TB drive.

    I just set up a raidz with 3 drives: 2TB, 500GB, 500GB. I’m trying to figure out if I can do something similar to a JBOD span with them. I have found no documentation that mentions RAID 1 either, except for this article. Is there any way I can make these 3 drives into 3TB?

  • Technically, JBOD is possible, by just adding each disk as a separate vdev, without specifying mirror (which is the zfs name for RAID 1), or raidz.

    But JBOD will of course offer no redundancy whatsoever, and data repair from checksums (scrubbing) will not be available.

    You could simulate some redundancy by setting “copies=2” on the zfs filesystem, but this won’t be as reliable, and it still won’t support scrubbing.

    So using a 1TB raidz for now, and growing it in the future by replacing the 500GB disks with 2TB ones would be a better long-term strategy. Remember that it is not possible to increase the number of disks in a vdev, so once you have 3 vdevs of 1 disk, you can’t replace them with other vdevs that have redundancy. You can only add new vdevs to the pool.
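
    To make that concrete (the pool and device names here are just examples), the JBOD-style pool and the copies setting would be created like this:

    # zpool create backup ada0 ada1 ada2
    # zfs set copies=2 backup

    But as said, a raidz you can grow later is the better long-term option.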

  • Thanks, great article. Could you please clarify the
    enlarge by ‘adding a RAID set’ idea? Would I end up with raidz2? I.e., if I have a raidz1 with 5x2TB disks, then create another raidz1 with 3x3TB disks … then somehow add that to the first pool? Doesn’t that mean there are 2 disks’ worth of parity?
    Thanks!

  • Dan, you would indeed end up with 2 disks’ worth of parity. One in the 5×2 array, and one in the 3×3 array. They will be completely separate vdevs; their only connection is that they’re assigned to the same zpool. ZFS will automatically spread data out over both arrays, so that neither will be full while the other isn’t (since that would impose a performance hit).

    This may seem a waste of space, but actually has several advantages:

    - The larger the number of disks in a single array, and the larger the disks, the more realistic the chance becomes that a second disk will fail while you are rebuilding the array after replacing a disk. For this reason, many people move to RAID6 or even more parity with big disk/big array combinations.
    - You are not limited to the size of your existing disks, and you have more granularity while upgrading. So while on a ‘normal’ RAID system you’d have to grow your 5x 2TB array by adding more 2TB disks, or replacing all 5 disks with bigger ones, you can instead use whatever size and number of disks is preferable at the moment. This way, you are very flexible in the construction of your arrays.

    For example, in my main fileserver, I started with a 4x 1.5TB array. I recently had a lot of spare older disks (mostly 250GB). I added a second set of 4x 250GB. Since those are older disks, I expect them to fail at some point, and then I can either replace them with other disks I have lying around, or grow the array with newer bigger disks. Should I need more space, I can upgrade or grow with four disks at a time, instead of having to grow all disks at once. Since disks keep getting bigger and cheaper, buying only what you need now is generally a better investment.

    I hope this helps.

  • […] an earlier post, I explained how it was possible to grow a ZFS pool by replacing all disks one-by-one. In that […]

  • Mattew Panz wrote:

    How to grow a pool encrypted with GELI?

  • @Mattew Panz: I honestly don’t know. I have never worked with disk encryption.

  • I’m still using my ZFS. I haven’t had enough space in my PCs (not having a SAN storage location yet) to create a 2nd array making a nice raidz setup that actually does the checksumming, but since I’m using it as backup, I already have the original data as a duplicate, so it’s not that big of a deal.

    Occasionally I’ll have the problem where my pools go down, about 1 or 2 times per year, and since I don’t use those commands enough, digging through the documentation to get them back online is a bit difficult. But it sure is an AWESOME filesystem, and very robust. Thanks for the tips a year or two ago, they helped me get this system up and it’s still kicking today.

  • […] an earlier post I described how you can grow an existing ZFS pool by replacing all the disks with bigger […]

  • Question. When ZFS is resilvering the new, larger drive, does it resilver using the full capacity of the drive, or with the same capacity as the original drive? And is it only after you bring the disk online with the ‘zpool online -e’ command that ZFS expands the pool to use the full drive?

    Also, Question 2. What if you only replaced one disk with a larger one? Will you be able to run it that way? How does ZFS deal with differently sized disks? Will it only utilize 500 GB of the 1.5 TB disk?

    Thanks

  • @Eric: A resilver will generally not re-distribute data. It will just verify and repair damaged sectors. If you replace a disk with a larger one and resilver, it will use as much space on the new disk as was available on the old one.

    If you want to use the extra space after replacing disks with bigger ones, use the zpool online -e command, as described here.

    Note that, indeed, the smallest disk in an array determines the amount of space used on every disk, so if the smallest disk in an array is 500GB, that’s the amount of space available on each disk.
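
    On my own pool, that expansion step would look something like this, run once all disks have been swapped for bigger ones:

    # zpool online -e honeycomb ad6 ad8 ad10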

  • @Itsacon Thanks for the reply! Also, what are your thoughts about utilizing md for building the RAID 6 and LVM for creating a single volume to put ZFS on? So rather than needing to resilver individual drives, could you just expand the LV, then ‘zpool online -e’ the ZFS pool?

  • @Eric: I wouldn’t consider it a good idea. First of all, it gives a lot more overhead, so performance will likely suffer. 

    Furthermore, using MDs (I assume you intend to use swap disks with the lvm for storage) will likely cause problems with write caching, as each layer is unaware of the caching policies of the other layers, so a crash or power outage might lead to corrupted or missing data. ZFS will likely be able to recover from that, but I think prevention is better than repairs.

    And finally, lvm has its own limitations, so you’re just swapping one set of problems for another, compounded by the fact that you now have three things to monitor instead of one. Simplicity is key, in my opinion.

  • fusionstream wrote:

    Say I have two mirrored vdevs in a pool. Can I expand 1 vdev only and get more space?

  • Yes, each mirror or raidzX set in the pool is a separate vdev, and can be grown independently.
