Using Postgres pg_test_fsync tool for testing low latency writes

2025-05-27

Here’s a useful tool for quickly testing whether a disk (or a cloud block store volume) is a good candidate for your database WAL/redo logs and any other files that require low-latency writes. The pg_test_fsync tool is bundled with standard Postgres packages, so no extra installation is needed. You don’t actually have to use Postgres as your database; this tool’s output is universally valuable for any workload that requires fast synchronous writes.
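
The tool is version-specific and not necessarily on your PATH. On Debian/Ubuntu packaging it lives under the version-specific bin directory (the path below is from my test server, yours may differ), and it also accepts a -f option to point it at a test file on the disk you want to measure, instead of cd-ing into a directory on that disk first:

$ ls /usr/lib/postgresql/*/bin/pg_test_fsync
/usr/lib/postgresql/16/bin/pg_test_fsync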

First, I’ll show the disks attached to my test server, using my new lsds tool. You can scroll right to see the full output:

$ lsds -a FUA,HWSEC
DEVNAME  MAJ:MIN  SIZE        TYPE      SCHED        ROT  MODEL                      QDEPTH  NR_RQ  WCACHE         FUA  HWSEC
nvme0n1  259:1    186.3 GiB   NVMeDisk  none         0    Micron_7400_MTFDKBA960TDZ  -       1023   write through  0    4096 
nvme0n2  259:6    200.0 GiB   NVMeDisk  none         0    Micron_7400_MTFDKBA960TDZ  -       1023   write through  0    4096 
nvme1n1  259:0    1863.0 GiB  NVMeDisk  none         0    Samsung SSD 990 PRO 2TB    -       1023   write back     1    512  
nvme2n1  259:2    260.8 GiB   NVMeDisk  none         0    INTEL SSDPED1D280GA        -       1023   write through  0    512  
sda      8:0      3726.0 GiB  Disk      mq-deadline  1    P9233                      30      60     write back     0    4096 
sdb      8:16     3726.0 GiB  Disk      mq-deadline  1    P9233                      30      60     write back     0    4096 
sdc      8:32     3726.0 GiB  Disk      mq-deadline  1    P9233                      30      60     write back     0    4096 
sdd      8:48     3726.0 GiB  Disk      mq-deadline  1    P9233                      30      60     write back     0    4096 
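
If you don’t have lsds at hand, the same flags come straight from sysfs. For example, for the Samsung drive tested first below, the values match the WCACHE, FUA and HWSEC columns above:

$ grep . /sys/block/nvme1n1/queue/{write_cache,fua,hw_sector_size}
/sys/block/nvme1n1/queue/write_cache:write back
/sys/block/nvme1n1/queue/fua:1
/sys/block/nvme1n1/queue/hw_sector_size:512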

I’ll first run a test on my consumer-grade Samsung 990 Pro locally attached NVMe SSD (/dev/nvme1n1):

$ mount | grep /data
/dev/nvme1n1 on /data type xfs (rw,relatime,attr2,inode64,logbufs=8,logbsize=32k,noquota)
 
$ cd /data
 
$ /usr/lib/postgresql/16/bin/pg_test_fsync
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                       249.578 ops/sec    4007 usecs/op
        fdatasync                           608.573 ops/sec    1643 usecs/op
        fsync                               177.431 ops/sec    5636 usecs/op
        fsync_writethrough                              n/a
        open_sync                           185.095 ops/sec    5403 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                       125.091 ops/sec    7994 usecs/op
        fdatasync                           765.775 ops/sec    1306 usecs/op
        fsync                               178.218 ops/sec    5611 usecs/op
        fsync_writethrough                              n/a
        open_sync                            91.849 ops/sec   10887 usecs/op

There’s more output below, but first, a few comments on what we’ve seen so far:

  1. fdatasync is faster than fsync or the O_SYNC flag used at file open, as it can avoid waiting for the additional filesystem journal I/Os (when overwriting blocks that already exist on disk).
  2. Nevertheless, even the single 8kB physical write done by fdatasync still takes 1.6 milliseconds!
  3. Consumer SSDs (and any enterprise SSD) without a DRAM-based, power-loss-protected (PLP) write cache in the controller will have high synchronous write latency, due to how NAND writes work. That’s why my lsds tool also shows the write_cache setting of each disk.
  4. You may be able to run a lot of writes concurrently and get 500k+ write IOPS per disk, but that does not mean that each individual write operation completes with low latency (a quick way to get a feel for this per-write latency with plain dd is sketched right after this list).
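
As a quick sanity check without any Postgres tooling, you can approximate the same single-threaded synchronous write pattern with plain dd: oflag=dsync makes every 8kB write wait until the data has reached stable storage, so dividing the elapsed time by the write count gives a rough per-write latency. This is only a rough sketch (arbitrary temp file name; dd’s code path is not identical to pg_test_fsync’s), not a replacement for the tool:

$ cd /data
$ dd if=/dev/zero of=ddsync.tmp bs=8k count=1000 oflag=dsync
$ rm ddsync.tmp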

The rest of the output from this tool is below. The latencies compound if you issue multiple separate (smaller) writes in O_SYNC mode: you effectively serialize the SSD to handle one write at a time, before you even get a chance to issue the next one. This is where first buffering the writes in OS RAM and later fdatasync’ing them all to disk together helps (a quick dd sketch of this batching effect follows further below). The individual write latency won’t get any lower, but at least you can sync a bunch of writes to the SSD together and the SSD controller will take advantage of that.

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
         1 * 16kB open_sync write           180.052 ops/sec    5554 usecs/op
         2 *  8kB open_sync writes           98.289 ops/sec   10174 usecs/op
         4 *  4kB open_sync writes           43.374 ops/sec   23055 usecs/op
         8 *  2kB open_sync writes           21.095 ops/sec   47404 usecs/op
        16 *  1kB open_sync writes           11.298 ops/sec   88513 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written on a different
descriptor.)
        write, fsync, close                 184.788 ops/sec    5412 usecs/op
        write, close, fsync                 180.081 ops/sec    5553 usecs/op

Non-sync'ed 8kB writes:
        write                            651627.688 ops/sec       2 usecs/op

The last line above just measures pwrite calls into the filesystem page cache, without waiting for the actual flush to disk, so it’s not relevant for physical disk I/O measurement.
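
To see the batching effect described earlier in action, you can contrast the per-write dd run from above with one that buffers everything in the page cache and issues a single fdatasync before exiting (conv=fdatasync). Again just a sketch with arbitrary file names; the second command still persists all the data, it just pays the sync cost once instead of a thousand times:

$ dd if=/dev/zero of=ddsync.tmp  bs=8k count=1000 oflag=dsync      # wait for stable storage after every 8kB write
$ dd if=/dev/zero of=ddbatch.tmp bs=8k count=1000 conv=fdatasync   # buffer in page cache, one fdatasync at the end
$ rm ddsync.tmp ddbatch.tmp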

The above output was from a consumer SSD; now let’s run the same tests on an enterprise-grade SSD (Micron 7400) from the lsds listing above. It has capacitor-based power-loss protection (PLP) built in for its controller cache DRAM, which is why this SSD can operate with a “write through” cache setting. Synchronous writes can simply be copied into the SSD controller’s DRAM and a “write OK” acknowledgement sent back to the host immediately. Destaging to the persistent NAND media can then happen asynchronously, or, if a power-loss event is detected, quickly, while the capacitors still provide enough power for an emergency flush.

This time I ran the test in my home directory, which happens to be on a filesystem on an LVM block device:

$ df -h .
Filesystem                         Size  Used Avail Use% Mounted on
/dev/mapper/ubuntu--vg-ubuntu--lv   90G   68G   18G  80% /

$ sudo pvs
  PV             VG        Fmt  Attr PSize    PFree  
  /dev/nvme0n1p3 ubuntu-vg lvm2 a--   183.21g <91.61g
  /dev/nvme0n2   backup    lvm2 a--  <200.00g      0 
  /dev/sdb       backup    lvm2 a--    <3.64t      0 
  /dev/sdc       backup    lvm2 a--    <3.64t      0 
  /dev/sdd       backup    lvm2 a--    <3.64t      0 
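
To double-check which physical device the root LV (and thus my home directory) actually sits on, lvs can list the underlying devices per logical volume. Given the pvs output above, this should report /dev/nvme0n1p3, i.e. the Micron 7400, as the only device behind ubuntu--vg-ubuntu--lv:

$ sudo lvs -o +devices ubuntu-vg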

$ /usr/lib/postgresql/16/bin/pg_test_fsync
5 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                     52813.315 ops/sec      19 usecs/op 
        fdatasync                         42003.972 ops/sec      24 usecs/op
        fsync                             39751.307 ops/sec      25 usecs/op
        fsync_writethrough                              n/a
        open_sync                         48070.142 ops/sec      21 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync is Linux's default)
        open_datasync                     26275.705 ops/sec      38 usecs/op
        fdatasync                         32551.276 ops/sec      31 usecs/op
        fsync                             31099.826 ops/sec      32 usecs/op
        fsync_writethrough                              n/a
        open_sync                         24105.752 ops/sec      41 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB in different write
open_sync sizes.)
         1 * 16kB open_sync write         40447.014 ops/sec      25 usecs/op
         2 *  8kB open_sync writes        23854.185 ops/sec      42 usecs/op
         4 *  4kB open_sync writes        13436.020 ops/sec      74 usecs/op
         8 *  2kB open_sync writes    pg_test_fsync: error: write failed: Invalid argument

Everything’s super-fast! And, as long as the SSD vendor’s promises hold true, persistent too!

Note that this tool failed with an “Invalid argument” error at the end, when it got to trying 2kB-sized I/Os. This is because that disk is currently configured to use a 4kB sector size (scroll up to see the lsds output and its HWSEC field).
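
A quick way to confirm the sector size without lsds is blockdev or sysfs (nvme0n1 here, the device backing the LVM volume, as shown in the pvs output above; the 4096 values simply reflect the HWSEC column):

$ sudo blockdev --getss /dev/nvme0n1
4096
$ cat /sys/block/nvme0n1/queue/logical_block_size
4096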
