ZFS and Proxmox performance on consumer SSDs
Background - I have a pair of Samsung QVO 8TB consumer-grade drives, and some Seagate IronWolf 4TB enterprise-grade SSDs. The 8TB drives were configured as a mirror and were performing terribly - like 3 fsyncs a second terrible. The enterprise drives work great. The following are my conclusions:
The root issue was consumer-grade performance and the design of consumer SSDs, which need regular trimming. Once enough bytes have been written (say, by moving a VM onto them), their performance falls off a cliff. Proxmox has a handy tool called pveperf that reports fsyncs per second. Running it on one of my old drives in situ gave 10(!) a second; the enterprise drive gave 1100. Surely I could make this better. So I emptied the drive, recreated the pool, and got 700-900 fsyncs a second. Move a VM on and run some benchmarks with dd, and I could write 100 MB/s, but fsyncs dropped to 300. Repeat, and fsyncs drop into the 200s and writes drop to 50 MB/s. Trimming the drive takes 30 minutes and performance returns, but basically you can ruin the performance of consumer-grade drives very quickly.
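A rough sketch of that test cycle, assuming a pool called ssdpool mounted at /ssdpool (both names are placeholders for your own):

pveperf /ssdpool                                            # baseline fsyncs per second
dd if=/dev/random of=/ssdpool/testfile bs=1M count=32768    # write ~32 GiB to dirty the drive
pveperf /ssdpool                                            # fsyncs fall once the drive runs short of clean blocks
zpool trim ssdpool                                          # kick off a manual trim
zpool status -t ssdpool                                     # watch trim progress
pveperf /ssdpool                                            # performance recovers once the trim completes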
The same test on enterprise-grade drives sees no fall-off in performance - they are effectively shielded from this problem. However, 100 MB/s seemed very low for these drives, and one box was even worse. It turned out it had been created with ashift=0 (not by me!!), which tells ZFS to auto-detect the sector size, and these SSDs report 512-byte sectors. Everyone says ashift=12, which is 4K, but there are general mutterings that some enterprise SSDs can use bigger values. Nothing hard and fast. So I tested ashift=13 and 14: 13 gave 229 MB/s and 14 gave 254 MB/s on my Samsung QVO 8TB mirror. Big improvement!
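If you are recreating a pool to test this, here is a minimal sketch of building the mirror with a bigger ashift and then checking it took - the device paths are placeholders for your own by-id names, and note that ashift is fixed at vdev creation, so this means destroying and rebuilding the pool:

zpool create -o ashift=14 ssdpool mirror \
  /dev/disk/by-id/ata-Samsung_SSD_870_QVO_8TB_SERIAL1 \
  /dev/disk/by-id/ata-Samsung_SSD_870_QVO_8TB_SERIAL2
zpool get ashift ssdpool    # confirm the value actually applied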
This does not make the degradation problem go away, but it does mean less performance loss. Reading up on how ZFS works, this makes a lot of sense if the drive's internal pages are 16K: every 4K write forces the drive to rewrite a whole 16K page, so small writes are amplified roughly fourfold. It's a quirk of ZFS that makes it very unsuitable for TRIM-dependent SSDs - being copy-on-write, it constantly allocates new space, which defeats the wear-levelling algorithm in them. The enterprise drives do not suffer this, presumably through a different heuristic. This will also greatly affect SMR spinning-rust drives in a cluster.
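If the 16K page theory holds, one mitigation worth experimenting with - my speculation, not something I benchmarked above - is matching the zvol block size to the page size and letting ZFS trim continuously as blocks are freed (the zvol name is hypothetical):

zfs create -V 32G -o volblocksize=16k ssdpool/vm-101-disk-0    # VM disk whose blocks line up with 16K pages
zpool set autotrim=on ssdpool                                  # issue TRIMs as space is freed instead of in big batches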
So mirroring was a red herring - it was fundamentally an issue with how the drive interacts with ZFS. Now that I have spent some time with it, I would say ZFS definitely needs modernising. It has a load of great stuff, but it produces opaque problems, and in its current form it is not ideal as a VM backing store. Why? Because caching is the bane of VMs. A VM host should have an effectively write-through filesystem, as the host is probably already caching, and definitely knows what it needs to commit now. VMware's VMFS basically works this way, and behaves much more predictably.
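The nearest ZFS knobs to that behaviour that I know of - a sketch, assuming the VM disks live under ssdpool/vmdata, and with the obvious caveat that sync=always will hammer consumer SSDs even harder:

zfs set primarycache=metadata ssdpool/vmdata    # keep only metadata in the ARC, don't double-cache VM data blocks
zfs set sync=always ssdpool/vmdata              # commit every write before acknowledging - the closest thing to write-through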
However, I forgive ZFS for much of this, as it has many other benefits and its heart is in the right place - encryption, compression, data integrity and expansion are off the charts better. Let's hope they continue to iterate on it for modern hardware.
And what of the 48TB? Well, I am eventually going to pull those drives out and put them in a single server to play with RAID 5. I am still seeing low disk performance and would like to see what can be done in that configuration.
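For when that happens, the ZFS equivalent of RAID 5 is a raidz1 vdev - a placeholder sketch with made-up device names:

zpool create -o ashift=12 bigpool raidz1 \
  /dev/disk/by-id/ata-DRIVE_A \
  /dev/disk/by-id/ata-DRIVE_B \
  /dev/disk/by-id/ata-DRIVE_C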
I've learned a lot on this journey, and I think it was well worth it - it means people will not have to suffer broken servers in production under load, which is the worst.
zpool trim ssdpool
root@proxmox3:/ssd12/t# dd if=/dev/random of=test bs=2G count=16
dd: warning: partial read (2147479552 bytes); suggest iflag=fullblock
0+16 records in
0+16 records out
34359672832 bytes (34 GB, 32 GiB) copied, 115.415 s, 298 MB/s
root@proxmox3:/ssd13/t# dd if=/dev/random of=test bs=2G count=16
dd: warning: partial read (2147479552 bytes); suggest iflag=fullblock
0+16 records in
0+16 records out
34359672832 bytes (34 GB, 32 GiB) copied, 117.676 s, 292 MB/s
root@proxmox3:/ssd14/t# dd if=/dev/random of=test bs=2G count=16
dd: warning: partial read (2147479552 bytes); suggest iflag=fullblock
0+16 records in
0+16 records out
34359672832 bytes (34 GB, 32 GiB) copied, 116.004 s, 296 MB/s
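As the dd warnings above note, /dev/random will not return a full 2 GiB in a single read, so each block comes up slightly short; adding iflag=fullblock (or simply using a smaller block size) writes the full amount:

dd if=/dev/random of=test bs=2G count=16 iflag=fullblock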
root@proxmox3:/ssd12/t# pveperf /ssd12/t/
CPU BOGOMIPS: 192034.80
REGEX/SECOND: 3250181
HD SIZE: 3596.50 GB (ssd12)
FSYNCS/SECOND: 965.86
DNS EXT: 11.80 ms
DNS INT: 10.24 ms