Reducing the number of disks in a ZFS pool

In the past I’ve shown how easy it is to expand a ZFS pool by adding extra disks.

While this is the easiest way of expanding your storage capacity, it does come with a downside: From that moment on, you’re stuck with the larger number of disks, as ZFS does not have an easy way of removing disks from a pool without destroying it.

This can be annoying: with future replacements you’ll need just as many disks, or you’re stuck running old disks well past their warranty date. The extra disks also take up connector space and power.

Luckily, there is a way to work around this, but there is one big caveat: you need a lot of SATA connectors.

As you may recall from the previous article, I expanded my pool by adding a second 4‑disk RAID set, bringing the number of disks to 8. As all the disks in my pool were over 5 years old, I wanted to replace them. By using bigger disks, I’d gain a lot of space, so I didn’t need 8 disks any more, and I wouldn’t mind having a few SATA connectors available for other things.

First I temporarily added an extra controller card to the system, giving me 4 more ports to play with. I added 4 new disks, and created a second zpool:

[root@thunderflare ~]# zpool create newdata raidz \
/dev/label/TB4TB0 /dev/label/TB4TB1 \
/dev/label/TB4TB2 /dev/label/TB4TB3
[root@thunderflare ~]# zpool status
  pool: data
 state: ONLINE
  scan: scrub repaired 0 in 8h40m with 0 errors on Mon Oct 12 23:46:40 2015
config:

        NAME                      STATE     READ WRITE CKSUM
        data                      ONLINE       0     0     0
          raidz1-0                ONLINE       0     0     0
            diskid/DISK-5XW00Z7D  ONLINE       0     0     0
            diskid/DISK-5XW00X65  ONLINE       0     0     0
            diskid/DISK-5XW01GG4  ONLINE       0     0     0
            diskid/DISK-5XW01RQM  ONLINE       0     0     0
          raidz1-1                ONLINE       0     0     0
            raid/r2               ONLINE       0     0     0
            raid/r0               ONLINE       0     0     0
            raid/r3               ONLINE       0     0     0
            raid/r1               ONLINE       0     0     0

errors: No known data errors

  pool: newdata
 state: ONLINE
  scan: none requested
config:

        NAME              STATE     READ WRITE CKSUM
        newdata           ONLINE       0     0     0
          raidz1-0        ONLINE       0     0     0
            label/TB4TB0  ONLINE       0     0     0
            label/TB4TB1  ONLINE       0     0     0
            label/TB4TB2  ONLINE       0     0     0
            label/TB4TB3  ONLINE       0     0     0

errors: No known data errors

As you can see, the old pool has been moved between systems a lot, and there’s some old metadata on there from different controllers. For the new pool, I first labeled the disks using glabel. The labels are also written on the physical disks themselves, so individual replacements are much easier, as I won’t have to figure out what disk is on what port. You can also see that I did a scrub of the old pool shortly before starting the process. This is not required, but always good practice.
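
For reference, labeling is a one-liner per disk. A minimal sketch (the ada4 through ada7 device names are placeholders for whatever the new disks show up as on your system):

[root@thunderflare ~]# glabel label TB4TB0 /dev/ada4  # ada4-ada7 are example device names
[root@thunderflare ~]# glabel label TB4TB1 /dev/ada5
[root@thunderflare ~]# glabel label TB4TB2 /dev/ada6
[root@thunderflare ~]# glabel label TB4TB3 /dev/ada7

After this, the disks are available as /dev/label/TB4TB0 and so on, regardless of which port they’re plugged into.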

Now, on to moving the filesystems. Of course, we could just recreate all filesystems on the new pool, move the data over and be done, but then we would lose all snapshots, clones and so on, and since I use a lot of filesystems, it’d be a lot of effort. I’d much rather move everything at once.

Luckily, ZFS has a nice set of commands to make this process easier: zfs send and zfs receive. With zfs send, you can convert a snapshot to a data stream. This can be sent to a file (for backup purposes), or piped to the zfs receive command, which will convert it back into a filesystem or snapshot.

So after creating the new pool, I gave the following commands:

[root@thunderflare ~]# zfs snapshot -r data@migrate
[root@thunderflare ~]# zfs send -R data@migrate | \
 zfs receive -dF newdata

The first command creates a snapshot @migrate on the root filesystem and all descendant filesystems. The second command sends that snapshot to the new pool. The ‑R flag tells ZFS that I don’t want just the indicated snapshot, but also all descendant filesystems. Since I do this on the top-most filesystem, I’m sending everything. The ‑d flag tells ZFS to use the full path structure when receiving the new snapshot, while the ‑F flag is required with the ‑R flag and means that any other filesystems on the destination pool will be overwritten (which is what we want).

Running this command will take a while (about 12 hours on my system, for a little over 4TB of data). You can keep track of approximate progress by running zfs list and comparing the sizes of the old pool and the new pool.
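
For example, something along these lines shows the used and available space of both pools side by side (not output from the original run, just an illustration of the comparison):

[root@thunderflare ~]# zfs list -o name,used,avail data newdata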

Once it’s done, we’ll need to copy over everything that was changed on the old pool since the process started. To prevent any further changes, I stopped all services that used the old pool, and then unmounted its filesystems.

[root@thunderflare ~]# zfs unmount -a
[root@thunderflare ~]# zfs snapshot -r data@migrate2
[root@thunderflare ~]# zfs send -R -I data@migrate data@migrate2 | \
 zfs receive -dF newdata

The first command unmounts all ZFS filesystems. If you forgot to stop any processes that are still using them, you’ll get an error. Either stop some more stuff, or use the ‑f flag to force unmounting.

Next, I made another snapshot of the old pool.

The third command is similar to the previous send/receive, but with one important difference: the ‑I flag. This flag must be followed by the name of an older snapshot. ZFS will look at the changes made between that snapshot and the second one given, and only send the difference.

Since little had changed overnight, this process was a lot quicker.

So now we have two pools with the exact same content. At this point we could simply change all links and config files to point to newdata, and we’d be done. But that would mean going over a lot of config files, so I wanted to rename the new pool to the name of the old one. Unfortunately, zpool doesn’t have a rename command.

But, of course, there’s a workaround for that as well:

[root@thunderflare ~]# zpool export data
[root@thunderflare ~]# zpool import data olddata
[root@thunderflare ~]# zpool export newdata
[root@thunderflare ~]# zpool import newdata data

Simply export each pool, then import it under the desired name. Obviously, renaming the old pool isn’t strictly necessary; we could just destroy it. But since I didn’t need the disks out of the system in a hurry, I preferred to keep the old pool around for a bit while I verified that the new one was working properly. The first thing I did after that was to start a scrub on the new pool to make sure all the data had been written correctly.
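
For completeness, starting that scrub (the new pool now answers to the old name data) is just:

[root@thunderflare ~]# zpool scrub data
[root@thunderflare ~]# zpool status data

The second command lets you follow the scrub’s progress.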

So there you have it: reducing a ZFS pool from eight disks down to a more manageable four. All you need is enough ports to connect the old disks and the new disks at the same time.

If you don’t have that many ports available, but do have enough disk space somewhere to store your entire pool, you could do the zfs send to a file, then replace the disks with the new ones, and do the zfs receive from the file. Doing so would require more downtime though, as you’d have to unmount the old pool right at the start to prevent changes.
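
A rough sketch of that variant, assuming a hypothetical /backup location big enough to hold the stream (the path and filename are placeholders):

[root@thunderflare ~]# zfs unmount -a
[root@thunderflare ~]# zfs snapshot -r data@migrate
[root@thunderflare ~]# zfs send -R data@migrate > /backup/data-migrate.stream

Then swap the old disks for the new ones, create the new pool as before, and restore:

[root@thunderflare ~]# zfs receive -dF newdata < /backup/data-migrate.stream

After that, you can rename the pool with the same export/import trick as above.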

One Comment

  • vpatil wrote:

    Nice and detailed info.
    I have created a zpool using a 1TB Google Cloud disk.
    Google provides a way to expand that disk using “gcloud compute disks resize”.
    Can you help me expand the zpool in that case?
