Here’s a useful tool for quickly testing whether a disk (or a cloud block store volume) is a good candidate for your database WAL/redo logs and any other files that require low-latency writes. The pg_test_fsync tool is bundled with standard Postgres packages, so no extra installation is needed. You don’t actually have to use Postgres as your database; this tool’s output is universally valuable for any workload that requires fast synchronous writes.
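If you want to point the test at a specific filesystem or change the test duration, pg_test_fsync takes a target file name and a seconds-per-test value (the path below is just an example; use a file on the disk you want to test):

$ /usr/lib/postgresql/16/bin/pg_test_fsync --help
$ /usr/lib/postgresql/16/bin/pg_test_fsync -s 5 -f /data/pg_test_fsync.out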
First, I’ll show the disks attached to my test server, using my new lsds tool. You can scroll right to see the full output:
$ lsds -a FUA,HWSEC
DEVNAME  MAJ:MIN  SIZE        TYPE      SCHED        ROT  MODEL                      QDEPTH  NR_RQ  WCACHE         FUA  HWSEC
nvme0n1  259:1    186.3 GiB   NVMeDisk  none         0    Micron_7400_MTFDKBA960TDZ  -       1023   write through  0    4096
nvme0n2  259:6    200.0 GiB   NVMeDisk  none         0    Micron_7400_MTFDKBA960TDZ  -       1023   write through  0    4096
nvme1n1  259:0    1863.0 GiB  NVMeDisk  none         0    Samsung SSD 990 PRO 2TB    -       1023   write back     1    512
nvme2n1  259:2    260.8 GiB   NVMeDisk  none         0    INTEL SSDPED1D280GA        -       1023   write through  0    512
sda      8:0      3726.0 GiB  Disk      mq-deadline  1    P9233                      30      60     write back     0    4096
sdb      8:16     3726.0 GiB  Disk      mq-deadline  1    P9233                      30      60     write back     0    4096
sdc      8:32     3726.0 GiB  Disk      mq-deadline  1    P9233                      30      60     write back     0    4096
sdd      8:48     3726.0 GiB  Disk      mq-deadline  1    P9233                      30      60     write back     0    4096
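If you don’t have lsds at hand, the same attributes can be read straight from sysfs on reasonably recent kernels; a quick sketch, using the device names from the listing above:

$ cat /sys/block/nvme1n1/queue/write_cache      # "write back" or "write through" (WCACHE)
$ cat /sys/block/nvme1n1/queue/fua              # 1 if the device supports Force Unit Access writes (FUA)
$ cat /sys/block/nvme1n1/queue/hw_sector_size   # hardware sector size (HWSEC)
$ cat /sys/block/nvme1n1/queue/rotational       # 0 = SSD, 1 = spinning disk (ROT)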
I’ll first run a test on my consumer-grade Samsung 990 Pro locally attached NVMe SSD (/dev/nvme1n1):
$ mount | grep /data
/dev/nvme1n1 on /data type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)

$ cd /data

$ /usr/lib/postgresql/16/bin/pg_test_fsync
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync               249.578 ops/sec    4007 usecs/op
        fdatasync                   608.573 ops/sec    1643 usecs/op
        fsync                       177.431 ops/sec    5636 usecs/op
        fsync_writethrough                        n/a
        open_sync                   185.095 ops/sec    5403 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync               125.091 ops/sec    7994 usecs/op
        fdatasync                   765.775 ops/sec    1306 usecs/op
        fsync                       178.218 ops/sec    5611 usecs/op
        fsync_writethrough                        n/a
        open_sync                    91.849 ops/sec   10887 usecs/op
There’s more output below, but let me add a few comments here first:
- fdatasync is faster than fsync or the O_SYNC flag set at file open, as it can avoid waiting for the additional filesystem journal I/Os for metadata updates (when overwriting blocks that already exist on disk). See the dd sketch after this list for a quick way to get a feel for this on your own disk.
- Nevertheless, even the single 8kB physical write done by fdatasync still takes 1.6 milliseconds!
- Consumer SSDs (and any enterprise SSDs) without a controller DRAM-based, power-loss-protected (PLP) write cache will have high synchronous write latency, due to how NAND writes work. That’s why my lsds tool also shows the write_cache setting of each disk.
- You may be able to run a lot of writes concurrently and get 500k+ write IOPS per disk, but this does not mean that each individual write operation has very low latency.
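Here’s the dd sketch mentioned above. GNU dd can open its output file with O_DSYNC (oflag=dsync, wait for data only) or O_SYNC (oflag=sync, wait for data plus metadata) on every write, so you can get a rough feel for that difference on your own disk. This is only an approximation of what pg_test_fsync measures, and the gap depends on the filesystem and on whether any metadata actually needs journaling; the file name is just an example:

$ cd /data
$ dd if=/dev/zero of=ddtest.tmp bs=8k count=1000 oflag=dsync   # wait for data only, after every 8 kB write
$ dd if=/dev/zero of=ddtest.tmp bs=8k count=1000 oflag=sync    # wait for data + metadata, after every 8 kB write
$ rm ddtest.tmp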
The rest of the output from this tool is below. The latencies compound if you issue multiple separate (smaller) writes in O_SYNC mode, effectively serializing the SSD to handle one write at a time, before you even have a chance to issue the next one. This is where first buffering the writes in OS RAM and later fdatasync’ing them all to disk together would help. The individual write latency won’t get any lower, but at least you can sync a bunch of writes to the SSD together and the SSD controller will take advantage of that.
Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
         1 * 16kB open_sync write   180.052 ops/sec    5554 usecs/op
         2 *  8kB open_sync writes   98.289 ops/sec   10174 usecs/op
         4 *  4kB open_sync writes   43.374 ops/sec   23055 usecs/op
         8 *  2kB open_sync writes   21.095 ops/sec   47404 usecs/op
        16 *  1kB open_sync writes   11.298 ops/sec   88513 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
        write, fsync, close         184.788 ops/sec    5412 usecs/op
        write, close, fsync         180.081 ops/sec    5553 usecs/op

Non-sync'ed 8kB writes:
        write                    651627.688 ops/sec       2 usecs/op
The last line above is about just doing pwrite() calls into the filesystem page cache, without waiting for the actual disk flush I/O, so it’s not relevant for physical disk I/O measurement.
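You can see the batching effect I mentioned above with the same dd trick: compare syncing after every 8 kB write against buffering all writes in the page cache and issuing a single fdatasync at the end (conv=fdatasync). Again, this is just a rough sketch of the idea, not the exact code path a database WAL writer uses:

$ cd /data
$ dd if=/dev/zero of=ddtest.tmp bs=8k count=1000 oflag=dsync      # sync after every single 8 kB write
$ dd if=/dev/zero of=ddtest.tmp bs=8k count=1000 conv=fdatasync   # buffer in OS RAM, one fdatasync at the end
$ rm ddtest.tmp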
The above output was from a consumer SSD; now let’s run the same tests on the enterprise-grade SSD (Micron 7400) that I highlighted above. It has capacitor-based power-loss protection built in for its controller cache DRAM, therefore enabling the “write through” cache mode for this SSD. Synchronous writes can just be copied into the SSD controller DRAM and a “write OK” acknowledgement sent back to the host immediately. The destaging to the persistent NAND media can happen asynchronously later, or, if a power-loss event is detected, quickly, while the capacitors still provide enough power for an emergency flush.
This time I ran the test in my home directory, which happens to be on a filesystem on an LVM block device:
$ df -h .
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv   90G   68G   18G  80% /

$ sudo pvs
  PV              VG        Fmt  Attr PSize     PFree
  /dev/nvme0n1p3  ubuntu-vg lvm2 a--   183.21g  <91.61g
  /dev/nvme0n2    backup    lvm2 a--  <200.00g        0
  /dev/sdb        backup    lvm2 a--    <3.64t        0
  /dev/sdc        backup    lvm2 a--    <3.64t        0
  /dev/sdd        backup    lvm2 a--    <3.64t        0

$ /usr/lib/postgresql/16/bin/pg_test_fsync
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync             52813.315 ops/sec      19 usecs/op
        fdatasync                 42003.972 ops/sec      24 usecs/op
        fsync                     39751.307 ops/sec      25 usecs/op
        fsync_writethrough                        n/a
        open_sync                 48070.142 ops/sec      21 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync             26275.705 ops/sec      38 usecs/op
        fdatasync                 32551.276 ops/sec      31 usecs/op
        fsync                     31099.826 ops/sec      32 usecs/op
        fsync_writethrough                        n/a
        open_sync                 24105.752 ops/sec      41 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
         1 * 16kB open_sync write 40447.014 ops/sec      25 usecs/op
         2 *  8kB open_sync writes 23854.185 ops/sec     42 usecs/op
         4 *  4kB open_sync writes 13436.020 ops/sec     74 usecs/op
         8 *  2kB open_sync writes
pg_test_fsync: error: write failed: Invalid argument
Everything’s super-fast! And, as long as the SSD vendor’s promises hold true, persistent too!
Note that the tool failed with an “Invalid argument” error at the end, when it got to trying 2kB-sized I/Os. This is because that disk is currently configured to use a 4kB sector size (scroll up to see the lsds output and its HWSEC field).
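If you want to double-check the sector size without lsds, the standard tools report the same thing. O_DIRECT I/Os smaller than (or not aligned to) the device’s logical sector size fail with EINVAL, which is exactly the error above:

$ sudo blockdev --getss --getpbsz /dev/nvme0n1    # logical and physical sector sizes
$ cat /sys/block/nvme0n1/queue/logical_block_size
$ cat /sys/block/nvme0n1/queue/physical_block_size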