Occasional blog posts from a random systems engineer

Goodbye SSDs, long live ZFS: Homelab storage failure

· Read in about 8 min · (1497 Words)

Homelab storage

Over the past 15 years, my homelab has seen a biiiig variety of iterations and solutions, in somewhat of an order:

  • Random desktop machine with a hard drive
  • First actual servers, with local hard drives
  • Home-built NAS with iSCSI for VMs (ESX!)
  • Random array of laptops with iPXE and NFS root drives (running out of money!)
  • Real datacenter with iSCSI to NEXSAN SATABOY!
  • Back home to iSCSI to Synology NAS

Then, the latest upgrade about 6 years ago:

  • 11 SSDs - 8 in software RAID for VMs, 2 x boot and 1 x build VM

I kept the Synology for biiig VMs, but for the smaller, this 4TB of SSD was quiiiiiiick.

Uh oh

But.. of all days, Jan 2nd, I saw that an SSD had failed. For those who are used to hard drive failures, you might not think too much about it. For those who have used SSDs before (especially consumer SSDs, like in my case), this spelt baaad news. Since these are in RAID, all SSDs get the same number of writes (which is the killing factor here), meaning it was just a matter of time before more failed..

I had luckily bought 2 spares when I originally purchased the array, so I replaced the failed drive and things were fine.

Moving forward

Of course this had to happen in 2026, the year when SATA SSDs are basically non-existent (I literally could not find any on Amazon) and those that did exist (higher capacities) were much more expensive.

So, looking at moving to 2TB SSDs, I’d be looking at ~£500 to replace the array - keeping in mind the original array was only £400, I wasn’t too happy - particularly because only having 3 drives would mean:

  • Less redundancy
  • Higher cost per failure
  • Realistically, I’d still end up with consumer drives, which aren’t meant for this workload (though surviving nearly 6 years wasn’t bad)

Spinning rust

I considered how much of a performance hit I’d take with spinning rust. Looking at (at least what I’d consider) the crème de la crème of hard drives (2.5" 15K SAS drives), I could make somewhat of a comparison:

|                    | Crucial MX500 SSD | Dell 15K 2.5" SAS |
| ------------------ | ----------------- | ----------------- |
| IOPS               | ~95K              | ~150-210          |
| Throughput (write) | ~510MB/s          | ~115MB/s          |
| Throughput (read)  | ~560MB/s          | ~115MB/s          |
| Latency            | <1ms              | ~2-3ms            |
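Figures like these can be sanity-checked with fio rather than taken from spec sheets. A hedged sketch of a 4K random-read test - the filename is a placeholder scratch file (deliberately not a raw device, which would be destructive):

```shell
# Measure 4K random-read IOPS and latency against a 1GB scratch file.
# iodepth=32 keeps the queue busy; direct=1 bypasses the page cache
# so we measure the device, not RAM.
fio --name=randread --filename=/tmp/fio.scratch --size=1G \
    --rw=randread --bs=4k --iodepth=32 --ioengine=libaio \
    --direct=1 --runtime=30 --time_based --group_reporting
```

Run the same job against the SSD and the SAS drive and the IOPS/latency gap in the table above shows up immediately.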

Unfortunately, this isn’t a surprise - we can clearly see who the real winner is.

And to be clear, my memory of running 50+ VMs on spinning disk is definitely a bit hazy, but I started to remember bursts of high load averages (> 300 !!!) and times when a machine could barely claw 1MB/s of read/write. I know it’s a homelab, but I’ve gotten used to the SSD lifestyle.

But, at the same time, 10 x 1.2TB SAS drives for £120 is beyond cheap.

Getting the drives working

This should really have been a non-issue, but after putting in a single drive to test, I could not get it detected by the operating system. After checking a variety of things (the server is not easily physically accessible and I didn’t want to turn it off whilst the SSDs were actively working).. I came to the conclusion that the drive backplane was SSD-only. So after buying a replacement backplane and cable (£30), only at this point did I log in to the iLO (I leave its network cable unplugged) and check - and of course, the drive showed up as a foreign RAID member, so it needed initialising :cry:

Cache without cash

So I considered: what if I could cache as much as possible?

Read cache would obviously be free and easy - I’ve got a whole bunch of SSDs that are yet to fail, so these were a no-brainer: 2 x 500GB read cache was a definite must.

However, write cache is more problematic… failure scenarios, data corruption and data loss. My knowledge around this wasn’t great, but at least from some very ancient history on the topic, my worst case is:

  • Power outage
  • Data wasn’t written
  • Filesystems are unrecoverable

So I set out what I care about and what I don’t care about:

  • Loss of filesystem - disastrous (recoverable from backups, sure, but a very bad time - especially if we’re talking about an entire array)
  • Outage due to drive failure - fine, as long as it can be started again
  • Non-corrupting loss of data - fine. Imagine a write cache failure meant a certain amount of time’s worth of data was, effectively, lost BUT CONSISTENT. That is, the last 5m of changes just weren’t present, but the array could be brought back to life as if they’d never happened, with no more filesystem corruption risk than a power loss. If I need to re-create a ticket, re-push some changes in git, or redo 5m of whatever I was doing in the homelab.. who cares!

As long as I could get a write cache that could handle this, then I might be good - and I still have 2 basically brand new SSDs that had just gone into the array.

ZFS

The technology chain up to this point had been:

SSDs -> md -> LVM PV -> VG -> LV (per VM) -> Exposed to qemu as VM disk

I’d used some per-LV caching in LVM before for some read/write caches for the NAS, but honestly I wasn’t keen:

  1. It seemed fairly brittle - it would sometimes have real issues detecting the cache PV (even though it was there), and seeing the status of it all was a pain
  2. It was set up per LV, meaning each VM had its own cache - at scale this would be a pain to keep up with.

ZFS on the other hand..

  • Read cache: L2ARC - give it two SSDs and bang, it’s done
  • Write cache: SLOG (strictly, a separate log device that only absorbs synchronous writes) - give it two SSDs and bang, it’s done!
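For comparison with the LVM chain above, the whole ZFS setup boils down to a few commands. A minimal sketch - the raidz2 layout and device names are assumptions for illustration, not my exact setup:

```shell
# Create the pool from the 10 SAS drives (raidz2 layout assumed here)
zpool create vm_pool raidz2 /dev/sd[b-k]

# Read cache: two SSDs as L2ARC (striped; a cache device failure
# only costs cached reads, never data)
zpool add vm_pool cache /dev/sdl /dev/sdm

# Write cache: two SSDs as a mirrored SLOG (mirrored so a single
# SSD failure can't lose in-flight synchronous writes)
zpool add vm_pool log mirror /dev/sdn /dev/sdo
```

Mirroring the SLOG is what makes the "lost but consistent" failure mode above acceptable: the only window for data loss is a SLOG failure combined with a crash before the next transaction group commits.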

Data transfer

This isn’t particularly interesting, but I thought I’d note it, just because I was quite particular about ensuring this transfer worked:

  • I had a spare server with 24TB storage, so I re-installed and setup LVs:
    • 1 for the PV of the SSD pool
    • 1 for RootFS backup of the server (see next section)

This isn’t my first rodeo with copying block devices over a network, so for anyone who’s interested:

dd if=/dev/md0 bs=1M | pv | ssh user@temp-r720 "dd of=/dev/backup_vg/tmp_backup bs=1M"

I then ran the following to verify the transfer:

ssh user@temp-r720 "cmp /dev/backup_vg/tmp_backup - " < /dev/md0

The next step was to set up each ZFS volume, copy the data from the SSD array, and then compare each ZFS VM volume against the temp server (this verifies the copy from the original against the backup):

LV=VM-DISK-NAME
LV_BYTES=$(blockdev --getsize64 /dev/ssd-vg/$LV)

# Create ZFS volume with same size
zfs create -V ${LV_BYTES}B vm_pool/$LV
# Disable sync for copy
zfs set sync=disabled vm_pool/$LV

dd if=/dev/ssd-vg/$LV of=/dev/zvol/vm_pool/$LV bs=1M oflag=direct

# Re-enable sync
zfs set sync=standard vm_pool/$LV
zpool sync vm_pool

# Checksum target volume to compare against temp backup volumes
sha256sum /dev/zvol/vm_pool/$LV
ssh user@temp-r720 "sha256sum /dev/ssd-vg/$LV"

ZFS versions

I noted that, as with a lot of my homelab, this physical server had been set up once and maintained only as necessary.. meaning it was running Ubuntu 14.04. This meant it had an old version of ZFS - to avoid tempting fate, I decided to upgrade to the latest version and dodge any horrendous old bugs.

Fortunately, this was incredibly easy… 5 rounds of do-release-upgrade and reboots actually worked flawlessly.

Performance

I went into this quite blind - I had to migrate and get the data safe, no matter the performance.

However, somewhat magically, some of the applications were actually quicker.

ZFS provides a wealth of information about its status, and I found it wonderful that I could just take all of the raw data and ChatGPT could give me great insight into how it’s doing.

I took three takeaways from the output:

  • The L2ARC read cache was hitting about 25% of the time:
    • Compressed: 72.7%
    • Hit ratio: 25.5%
    • This seems to be pretty good
  • The SLOG write cache:
    • Transactions to SLOG storage pool: 3.1 TiB (111.2M txns)
    • This should be a pretty great saving instead of waiting for writes to spinning rust

The final thing I really really hadn’t considered which was very interesting:

  • ARC (RAM cache)
    • Size 85GB (out of 90GB max)
    • Memory throttle count: 0 (very important)

This explained EXACTLY why it felt faster.. I had moved from SSDs to caching in RAM! I’m not sure how it decided on a 90GB max, but it seemed to be a good default (given the amount of use without any throttling).
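These ARC figures come from /proc/spl/kstat/zfs/arcstats (the same data that arc_summary reports). A small sketch of pulling out the headline numbers - fed a captured sample matching my stats here so it runs anywhere, but on a live system you’d point awk at the proc file instead:

```shell
# Sample arcstats lines (name, type, value) - captured, not live
arcstats='size 4 91268055040
c_max 4 96636764160
memory_throttle_count 4 0'

# On a real system: awk '...' /proc/spl/kstat/zfs/arcstats
echo "$arcstats" | awk '
  $1 == "size"                  { printf "ARC size: %.0f GiB\n", $3/2^30 }
  $1 == "c_max"                 { printf "ARC max:  %.0f GiB\n", $3/2^30 }
  $1 == "memory_throttle_count" { printf "throttles: %d\n", $3 }'
```

With the sample above this prints the 85GiB-of-90GiB picture; a non-zero throttle count would mean ZFS is being forced to shrink the ARC under memory pressure.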

Was it a success?

Overall, I’ve reduced my reliance on consumer SSDs, allowing for more of a mix-and-match (I can replace the current SSDs without a data transfer).

I’ve now moved to enterprise hard drives, which are available cheaply from eBay, and I can hoard a bunch of spares for little money.