NVMe Partitions for ZFS SLOG and L2ARC on TrueNAS

Two NVMe drives partitioned for SLOG and L2ARC on TrueNAS. I also tried a special metadata vdev but ended up dropping it.


I bought two 512GB NVMe drives (Patriot P320) for my NAS rebuild. Two NVMe slots, and I wanted a SLOG and L2ARC on my ZFS pool.

I actually did this twice.

The first time the pool was a 4-drive RAIDZ1 and I also added a special (metadata) vdev on the NVMe drives. The pool had been running for over a year without one, and after a week with it I didn't notice a difference, probably not enough workload to surface it.

The catch is that a special vdev holds real data: if you lose the mirror, you lose the entire pool. That's a steep price for a performance gain I couldn't even perceive.

I was already planning to rebuild as RAIDZ2 (added a fifth drive), so I used TrueNAS replication to rebuild and dropped the special vdev. This time I kept it to just SLOG and L2ARC, both of which are safe to lose without taking the pool with them.

Partition layout

Each drive gets two partitions; the rest is left unused.

per drive (512GB Patriot P320):
  p1: 64GB   -> SLOG (mirrored across both drives)
  p2: 250GB  -> L2ARC (not mirrored)
  ~163GB     -> unused

A SLOG absorbs sync writes, the kind where an application (NFS clients, databases) waits for confirmation that data hit stable storage before moving on. Without a SLOG, ZFS has to flush all the way to spinning rust before acknowledging. With one, it acknowledges from the NVMe and flushes to the pool in the background. If the SLOG dies, ZFS falls back to writing sync data directly to the pool: slower, but no data loss.
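To see whether sync writes are actually hitting the log vdev, you can watch it directly. A sketch (the dataset name swamp/vms is hypothetical; substitute your own):

```shell
# Check how a dataset handles sync writes: sync=standard is the default,
# sync=always forces every write through the SLOG, sync=disabled bypasses it.
# "swamp/vms" is a hypothetical dataset name.
zfs get sync swamp/vms

# Watch per-vdev activity at 1-second intervals; a sync-heavy workload
# should show writes landing on the mirror under "logs".
zpool iostat -v swamp 1
```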

L2ARC is a read cache that extends the in-memory ARC onto NVMe, so frequently read blocks that don't fit in RAM get served from SSD instead of spinning drives. Same deal if it dies, reads just go back to the pool and nothing is lost.
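Whether the L2ARC is earning its keep shows up in the hit-rate counters. Both arc_summary and arcstat ship with OpenZFS on TrueNAS SCALE, though exact section and field names can vary by version:

```shell
# Print only the L2ARC section of the ARC report
arc_summary -s l2arc

# Live view: ARC reads and hit rate alongside L2ARC reads/hits, every second
arcstat -f time,read,hit%,l2read,l2hits 1
```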

Partitioning the drives

The DXP8800 Plus ships with a 128GB NVMe (nvme1n1) that I wiped and installed TrueNAS on.

The two drives I added are nvme0n1 and nvme2n1. Worth checking lsblk first so you don't accidentally wipe the boot drive.

lsblk -d -o NAME,SIZE,ROTA,MODEL /dev/nvme0n1 /dev/nvme2n1

sudo smartctl -H -i -A /dev/nvme0n1
sudo smartctl -H -i -A /dev/nvme2n1

Both drives get identical partitions:

sudo sgdisk -Z /dev/nvme0n1
sudo sgdisk \
  -n 1:0:+64G  -t 1:bf01 -c 1:slog \
  -n 2:0:+250G -t 2:bf01 -c 2:l2arc \
  /dev/nvme0n1

sudo sgdisk -Z /dev/nvme2n1
sudo sgdisk \
  -n 1:0:+64G  -t 1:bf01 -c 1:slog \
  -n 2:0:+250G -t 2:bf01 -c 2:l2arc \
  /dev/nvme2n1

-Z wipes any existing partition table. -n creates a partition (number:start:size), -t sets the type (bf01 is Solaris/ZFS), and -c assigns a name. Verify the layout with sgdisk -p.
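A quick sanity check before adding anything to the pool:

```shell
# Print each drive's partition table to confirm sizes, type codes, and names
sudo sgdisk -p /dev/nvme0n1
sudo sgdisk -p /dev/nvme2n1
```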

Adding them to the pool

I added the vdevs one at a time so I could verify after each. The -f flag forces the add if the drives had previous ZFS labels.

sudo zpool add -f swamp log mirror /dev/nvme0n1p1 /dev/nvme2n1p1
sudo zpool status swamp

sudo zpool add -f swamp cache /dev/nvme0n1p2 /dev/nvme2n1p2
sudo zpool status swamp

The SLOG is a log mirror because you ideally want it redundant while it's alive: a single drive failure shouldn't degrade sync write performance. The L2ARC is plain cache with no mirror, since it only holds throwaway read data anyway.

After both adds, zpool status looked like this:

  pool: swamp
  ...
    raidz2-0        (5 drives)
    logs
      mirror-1      (nvme0n1p1 + nvme2n1p1)
    cache
      nvme0n1p2
      nvme2n1p2

NFS tuning

While testing NFS read/write speeds over 10G, I tried bumping the network buffer sizes to see if I got any improvement. I set them through midclt, TrueNAS SCALE's middleware CLI:

sudo midclt call tunable.create \
  '{"var": "net.core.rmem_max", "value": "16777216", "type": "SYSCTL", "enabled": true}'

sudo midclt call tunable.create \
  '{"var": "net.core.wmem_max", "value": "16777216", "type": "SYSCTL", "enabled": true}'

sudo midclt call tunable.create \
  '{"var": "sunrpc.tcp_slot_table_entries", "value": "128", "type": "SYSCTL", "enabled": true}'

You can do the same thing in the GUI under System > Advanced > Sysctl. Either way, they persist across reboots. (I've been trying to use commands more often because they're easy to reproduce on systems without, say, a Terraform provider.) The same rmem_max, wmem_max, and sunrpc.tcp_slot_table_entries values need to be set on the NFS clients too (the Proxmox nodes in my case); both ends need the larger buffers.
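To confirm the tunables actually took effect, and to see what the middleware has registered, something like:

```shell
# Read the live kernel values on the TrueNAS host
sysctl net.core.rmem_max net.core.wmem_max

# List the tunables the middleware knows about
sudo midclt call tunable.query
```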

If / when a drive dies

If one NVMe fails, the mirrored SLOG degrades but keeps working and the L2ARC just loses that cache device. The replacement process would look something like this:

sudo zpool replace swamp /dev/old_nvmeXn1p1 /dev/new_nvmeXn1p1

sudo zpool remove swamp /dev/old_nvmeXn1p2
sudo zpool add swamp cache /dev/new_nvmeXn1p2

ZFS resilvers the SLOG mirror in place; L2ARC devices you simply remove and re-add.
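One step the commands above gloss over: the replacement drive needs the same partition layout before zpool replace can use its partitions. Assuming the new drive shows up as /dev/nvme3n1 (a placeholder, confirm with lsblk first):

```shell
# Recreate the same layout on the replacement drive
# (/dev/nvme3n1 is a placeholder -- confirm with lsblk before wiping)
sudo sgdisk -Z /dev/nvme3n1
sudo sgdisk \
  -n 1:0:+64G  -t 1:bf01 -c 1:slog \
  -n 2:0:+250G -t 2:bf01 -c 2:l2arc \
  /dev/nvme3n1
```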