Testing Oracle's Use of Optane Persistent Memory, Part 2 - Fast Log File Sync Waits

Update: Just as I started blogging about the Optane technology, it turns out that Intel is shutting down the Optane business (after bringing the latest-gen “Crow Pass” Optane memory to market - probably just for satisfying existing commitments). While not exactly the same thing, keep an eye on the new CXL standard developments to see where the data center/cloud infrastructure world is going (CXL is basically cache-coherent disaggregated memory access protocol via PCIe5.0+).

In the previous post about Oracle’s Use of Optane Persistent memory for fast commits, I wrote about microsecond-scale “I/O” done against Optane persistent memory modules. This allowed Oracle’s LGWR process to complete small log buffer “flush” operations in a few microseconds, even with the additional redolog header update for high-water mark increases for each write.

But the LGWR log buffer flush time is just one part of the total commit latency that an application process experiences. There’s some interprocess signaling overhead, process posting/wakeup & scheduling latency and related CPU usage too.¹ When the redo writes take only a couple of microseconds, it’s the other stuff that ends up a significant latency contributor at high transaction rates. Even things like instrumentation syscalls, metrics updates and redolog block checksum calculations may noticeably increase the commit latency.

In this post I will show a new feature introduced in Oracle 21c that apparently works only when redologs are placed on Optane PMEM. Judging from the various undocumented parameter and related metrics’ names, the feature is called fast log file syncs (not to be confused with adaptive log file syncs²).

With this new feature, a committing Oracle foreground processes can now very briefly go to sleep and then wake up to do “busywaiting” (spinning on CPU) right around the time when the log buffer flush I/O to PMEM is expected to complete.

But first, some history!

History

This is a short, high-level overview of some relevant LGWR bottlenecks and performance improvements from the past; it doesn’t aim to be a complete reference.

Back in the old times, Oracle foreground processes always went to sleep using a semtimedop() syscall and it was LGWR’s duty to wake up all the (sometimes hundreds of!) waiting processes after a log file parallel write. This put a lot of extra burden on LGWR though, as issuing hundreds of semop() system calls after each disk write used kernel mode CPU and reduced LGWR responsiveness to subsequent flush requests.

This is why the adaptive log file sync was introduced back in Oracle 11.2 - it allowed LGWR to switch foreground processes to polling mode instead of the semaphore-based post/wait mode. With polling mode, foreground sessions assess how long it usually takes for LGWR to flush redo to disk and do any related housekeeping - measured from the start time of a log file sync request issued by the foreground.

The committing foregrounds could then use this measurement to voluntarily go to sleep for the right amount of time, so that by the time they wake up, LGWR ought to be done with its disk write. This allows the foreground to confirm if the disk write was done just by reading a SGA memory location that LGWR maintains. No semaphore “wakeup” system calls by LGWR are needed and LGWR can focus on its most important task - persisting redo logs quickly.

Oracle 12c enhanced LGWRs capabilities further, now we have a feature called adaptive scalable log writers that can give us multiple LGWR worker processes (LG00, LG01, … etc) that do all the I/O work and semaphore posting/foreground-wakeups and the single LGWR process itself acts as a coordinator.

And now back to the new feature described in this post, fast log file syncs with PMEM.

Fast Log File Sync

If it’s expected that LGWR will persist the redo and update its progress via a SGA variable in a matter of a few microseconds, does it make sense to go to sleep at all? Why not make the committing foreground processes always spin and poll the LGWR progress from its SGA variable immediately? This would be a very simple, but also naive approach. If the LGWR slows down for whatever reason, many committing processes would be burning CPU unnecessarily - and likely make the problem worse.

Oracle folks have chosen an adaptive-duration sleep path here as well. When issuing the commit and sending a message to LGWR, your foreground process immediately checks (via the SGA variable) if LGWR has managed to write the redo to PMEM and if yes, no sleeping/spinning is needed at all, the foreground commit operation can finish immediately. But in the more likely case, the redo hasn’t been persisted yet, so the foreground will still go to sleep (using nanosleep()) for an adaptive, very short duration that includes the estimated LGWR write time and also other delays, like the OS scheduling latency for the foreground to get back onto CPU after the sleep (more about that later).

After the short adaptive wait, the foreground process wakes up slightly before the LGWR redo write is expected to be done and starts spinning on the CPU instead, to “wait” for LGWR by polling the SGA memory location that LGWR maintains. This approach should help to get foreground processes the absolutely lowest commit times by proactively hiding any scheduling latency, while not going too crazy spinning on the CPU all the time either.

Note that the LGWR itself doesn’t go to sleep at the OS level when it writes to PMEM. A write to persistent memory is “just” a memory access - or could be even a CPU cache write only. In case of CPU cache flush/writeback instructions, the CPU will stall and burn some empty cycles until the cache line flush instruction finishes, but as far as the OS can tell, LGWR is running on CPU. Of course, when LGWR does not have any work to do, it will sleep with a semtimedop() system call as usual, showing the rdbms ipc message wait event in Oracle.

Now let’s look into some implementation details. First, let’s look into a few new parameters that hint what’s going on:

SQL> @pd fast%sync
Show all parameters and session values from x$ksppi/x$ksppcv...

  NUM NAME                             VALUE      DESCRIPTION
----- -------------------------------- ---------- -----------------------------------
 2312 _fg_fast_sync_sleep_usecs        0          DAX log file sync sleep time
 2313 _fg_fast_sync_spin_usecs         100        DAX log file sync spin time
 2314 _fg_fast_sync_slack_usecs        1          DAX log file sync sleep slack time
 2315 _fg_fast_sync_sleep_target_pct   50         DAX log file sync sleep target pct

The _fg_fast_sync_sleep_usecs parameter controls how many microseconds to sleep for, before waking up and starting spinning & polling the LGWR IO completion. However, when the parameter value is 0, it means that the sleep duration is adaptive based on tracking of recent LGWR IO durations. So, there’s no need to tune this parameter, it will adapt/evolve with your system behavior. It’s possible to trace that, as you’ll see later.

The _fg_fast_sync_spin_usecs parameter indicates that the database foreground process (application session) is willing to spin on CPU and poll LGWR progress from memory in a loop, for up to 100 microseconds. As mentioned above, the foreground process still goes to sleep first, for a few microseconds. The Linux context switch “overhead” is often said to be a couple of microseconds, but it depends on a number of factors, including your CPU architecture and whether CPU cache/TLB entries are discarded on the switch and wether the various CPU side-channel attack mitigations are enabled. Nevertheless, Oracle engineers have decided that it makes sense to sleep first and then spin for quite a long time “100 microseconds”, to optimize for the lowest commit latencies and potentially sacrifice some CPU time if the LGWR writes get slower for some reason.

Back in Oracle 18c (when this feature was apparently developed), only the two first parameters in the above list showed up. But from 19c onwards, the two latter parameters were introduced too. The _fg_fast_sync_sleep_target_pct parameter says that the short sleep before spinning should be 50% of the known redo write time of LGWR. So, the sleep duration is actually dynamic, modern LGWR keeps track of how much time the redolog write preparation, I/O wait and postprocessing actually take. The _fg_fast_sync_slack_usecs adds 1 microsecond to the sleep time just to account for any additional overhead (I didn’t test that parameter much).

Tracing Log Writer Debug Module

There’s an undocumented event for tracing some log writer and log file sync decisions.

10468, 00000, "log writer debug module"
// *Document: NO
// *Cause:
// *Action: Set this event to the appropriate level for log writer debugging.

I enabled levels 64+32+16+8 = 120 for tracing a foreground process doing lots of commits:

SQL> exec dbms_system.set_ev(9318, 47441, 10468, 120, '')

PL/SQL procedure successfully completed.

The trace output was following:

Trace file /u01/app/oracle/diag/rdbms/lin21c/LIN21C/trace/LIN21C_ora_54908.trc
Oracle Database 21c Enterprise Edition Release 21.0.0.0.0 - Production
Version 21.3.0.0.0

...

LFS: mode=fast sync, polling=Y, adaptive=Y, sleep=8, spin=100
LFS: mode=fast sync, polling=Y, adaptive=Y, sleep=8, spin=100
LFS: mode=fast sync, polling=Y, adaptive=Y, sleep=8, spin=100
LFS: mode=fast sync, polling=Y, adaptive=Y, sleep=8, spin=100
LFS: mode=fast sync, polling=Y, adaptive=Y, sleep=8, spin=100
LFS: mode=fast sync, polling=Y, adaptive=Y, sleep=8, spin=100
LFS: mode=fast sync, polling=Y, adaptive=Y, sleep=8, spin=100

Later I enabled SQL*Trace for that session too and you can see how the adaptive sleep duration was now at 17us, while the log file sync wait event duration (ela= 36) was around twice higher. The spinning & polling CPU time is included in the Oracle-level log file sync wait event. This is a yet another example, how Oracle’s wait events include more stuff than just the sleep/wait time and it becomes even more obvious nowadays when the physical IO events take just a handful of microseconds.

Note that the adaptive sleep duration had increased from 8 to 17 microseconds as later I had enabled tracing for LGWR too and the tracing overhead itself caused LGWR to work slower than usual.

LFS: mode=fast sync, polling=Y, adaptive=Y, sleep=17, spin=100
WAIT #140470331913416: nam='log file sync' ela= 36 buffer#=2044 sync scn=2842848485 p3=0 obj#=-1 tim=329070600191
EXEC #140470331913416:c=249,e=249,p=0,cr=0,cu=1,mis=0,r=0,dep=1,og=0,plh=0,tim=329070600263
CLOSE #140470331913416:c=0,e=1,dep=1,type=3,tim=329070600326
EXEC #140470331161328:c=0,e=36,p=0,cr=2,cu=3,mis=0,r=1,dep=1,og=1,plh=1135238127,tim=329070600410
CLOSE #140470331161328:c=0,e=1,dep=1,type=3,tim=329070600493
XCTEND rlbk=0, rd_only=0, tim=329070600562
LFS: mode=fast sync, polling=Y, adaptive=Y, sleep=17, spin=100
WAIT #140470331913416: nam='log file sync' ela= 35 buffer#=13107 sync scn=2842848512 p3=0 obj#=-1 tim=329070600746
EXEC #140470331913416:c=0,e=263,p=0,cr=0,cu=1,mis=0,r=0,dep=1,og=0,plh=0,tim=329070600822
CLOSE #140470331913416:c=0,e=0,dep=1,type=3,tim=329070600875
EXEC #140470331161328:c=0,e=36,p=0,cr=2,cu=3,mis=0,r=1,dep=1,og=1,plh=1135238127,tim=329070600974
CLOSE #140470331161328:c=0,e=1,dep=1,type=3,tim=329070601054
XCTEND rlbk=0, rd_only=0, tim=329070601121

Oracle wait event durations at microsecond (and sub-microsecond) level get less precise, especially as there’s some CPU time included within wait event durations. And there’s the long-standing weird behavior of instrumenting asynchronous I/Os, it gets even weirder with PMEM writes that are entirely CPU time - no OS level thread sleeping involved. Thankfully, Oracle folks introduced a bunch of new V$SESSTAT metrics for both LGWR writes and foreground session commits quite a while ago and extended them further in 21c.

These new metrics don’t have to follow the wait event measurement logic (that tries to instrument system call durations), but can time more complex operations, like the (CPU) time it takes to setup up and schedule asynchronous I/O requests vs only measuring the time it takes for async I/O completion checks.

Redo Metrics

redo synch% - foreground log file sync duration breakdown (been around since 11.2 or so)
redo synch fast sync% - new foreground metrics introduced in 21c for fast sync on PMEM
redo write% - background LGWR write IO duration breakdown

Here’s Snapper output from a foreground session doing over 13,000 commits per second. I have edited the output to be terser than usual. Scroll right to see additional useful info about the yellow lines:

SQL> @snapper all 5 1 9318
Sampling SID 9318 with interval 5 seconds, taking 1 snapshots...

-- Session Snapper v4.33 - by Tanel Poder ( https://tanelpoder.com/snapper ) - Enjoy the Most Advanced Oracle Troubleshooting Script on the Planet! :)

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
TYPE, STATISTIC                                    ,         DELTA, HDELTA/SEC,    %TIME, GRAPH       , NUM_WAITS,  WAITS/SEC,   AVERAGES
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
STAT, opened cursors cumulative                    ,        131252,     26.85k,         ,             ,          ,           ,          2 per execution
STAT, user commits                                 ,         65626,     13.43k,         ,             ,          ,           ,          1 per execution
STAT, recursive calls                              ,        196881,     40.28k,         ,             ,          ,           ,    21.49us recursive CPU per recursive call
STAT, messages sent                                ,         62728,     12.83k,         ,             ,          ,           ,        .96 per execution
STAT, redo entries                                 ,        131264,     26.86k,         ,             ,          ,           ,          2 per execution
STAT, redo size                                    ,      44790900,      9.16M,         ,             ,          ,           ,     682.52 bytes per user commit
STAT, redo ordering marks                          ,           731,     149.56,         ,             ,          ,           ,        .01 per execution
STAT, redo synch time                              ,           117,      23.94,         ,             ,          ,           ,          0 per execution
STAT, redo synch time (usec)                       ,       1172476,    239.88k,         ,             ,          ,           ,      17.87 per execution
STAT, redo synch time overhead (usec)              ,        289057,     59.14k,         ,             ,          ,           ,      4.4us FG wakeup overhead per log file sync
STAT, redo synch time overhead count (  2ms)       ,         65626,     13.43k,         ,             ,          ,           ,          1 per execution
STAT, redo synch fast sync all sleep (usec)        ,       1041171,    213.01k,         ,             ,          ,           ,      15.87 per execution
STAT, redo synch fast sync all sleep count         ,         65626,     13.43k,         ,             ,          ,           ,          1 per execution
STAT, redo synch fast sync all sleep (20us)        ,         63079,     12.91k,         ,             ,          ,           ,        .96 per execution
STAT, redo synch fast sync all sleep (40us)        ,          2522,     515.98,         ,             ,          ,           ,        .04 per execution
STAT, redo synch fast sync all sleep (60us)        ,            23,       4.71,         ,             ,          ,           ,  .35*10^-3 per execution
STAT, redo synch fast sync all sleep (80us)        ,             2,        .41,         ,             ,          ,           ,  .03*10^-3 per execution
STAT, redo synch fast sync sleep count             ,         53449,     10.94k,         ,             ,          ,           ,        .81 per execution
STAT, redo synch fast sync sleep (usec)            ,        938720,    192.05k,         ,             ,          ,           ,       14.3 per execution
STAT, redo synch fast sync spin count              ,         12177,      2.49k,         ,             ,          ,           ,        .19 per execution
STAT, redo synch fast sync spin (usec)             ,        248912,     50.93k,         ,             ,          ,           ,       3.79 per execution
STAT, redo synch writes                            ,         65626,     13.43k,         ,             ,          ,           ,          1 per execution
STAT, redo synch poll writes                       ,         65626,     13.43k,         ,             ,          ,           ,          1 per execution
STAT, redo synch polls                             ,       3225548,    659.92k,         ,             ,          ,           ,      49.15 per execution
STAT, redo write info find                         ,         65626,     13.43k,         ,             ,          ,           ,          1 per execution
STAT, execute count                                ,         65626,     13.43k,         ,             ,          ,           ,          ~ executions per parse
TIME, PL/SQL execution elapsed time                ,        115674,    23.67ms,     2.4%, [#         ],          ,           ,
TIME, DB CPU                                       ,       3605697,   737.69ms,    73.8%, [@@@@@@@@  ],          ,           ,
TIME, sql execute elapsed time                     ,       4003639,   819.11ms,    81.9%, [######### ],          ,           ,
TIME, DB time                                      ,       4003639,   819.11ms,    81.9%, [######### ],          ,           ,      18.09 % unaccounted-for time*
WAIT, log file sync                                ,        690732,   141.32ms,    14.1%, [WW        ],     65626,     13.43k,    10.53us average wait

--  End of Stats snap 1, end=2022-05-07 21:05:55, seconds=4.9

---------------------------------------------------------------------------------------------------------------
  ActSes   %Thread | INST | SQL_ID          | SQL_CHILD | EVENT                               | WAIT_CLASS
---------------------------------------------------------------------------------------------------------------
     .41     (41%) |    1 | 14q1rdj55ud17   | 0         | ON CPU                              | ON CPU
     .39     (39%) |    1 | 4dcjpm8279gw8   | 0         | ON CPU                              | ON CPU
     .17     (17%) |    1 | 14q1rdj55ud17   | 0         | log file sync                       | Commit
     .02      (2%) |    1 | 4dcjpm8279gw8   | 0         | log file sync                       | Commit

--  End of ASH snap 1, end=2022-05-07 21:05:55, seconds=5, samples_taken=46, AAS=1

If you look deeper into the highlighted (yellow) metrics, you’ll see this:

redo synch time (usec) - 17.87us/exec
redo synch time overhead (usec) - 4.4us (FG wakeup overhead per log file sync)
redo synch fast sync sleep count - 10.94k/s
redo synch fast sync spin count - 2.49k/s
log file sync - 10.53us average wait

So even though the log file sync wait event takes 10.53us on average, the redo synch time (usec) metric shows that the total amount of time from setting up the commit sync processing (sending a message to LGWR, before waiting for it, etc) is more like 17.87us on average. This total redo synch time includes various overhead (like the foreground taking time to wake up after the LGWR write is already completed) - 4.4us on average delay in this case.

When you compare the redo synch fast sync sleep count (10.94k/s) and redo synch fast sync spin count (2.49k/s), you’ll see that in most cases, when the Oracle foreground process woke up after that short sleep, LGWR was already done with its IO and no spinning was needed. In the remaining cases (2.49k/s), a poll spin was needed too after the foreground wakeup.

I’m not big on tweaking undocumented parameters “for performance” in production, but if you’re after low commit latency in this scenario, one finetuning approach would be to reduce the _fg_fast_sync_sleep_target_pct value (after doing all the other things needed for low latency and confirming with Oracle Support). I changed that parameter from 50% to 10% and indeed the sleep durations changed:

EXEC #140336886422824:c=0,e=72,p=0,cr=0,cu=1,mis=0,r=0,dep=1,og=0,plh=0,tim=195429133444
CLOSE #140336886422824:c=0,e=0,dep=1,type=3,tim=195429133451
EXEC #140336887086272:c=29,e=29,p=0,cr=2,cu=3,mis=0,r=1,dep=1,og=1,plh=412410146,tim=195429133485
CLOSE #140336887086272:c=0,e=0,dep=1,type=3,tim=195429133495
XCTEND rlbk=0, rd_only=0, tim=195429133502
LFS: mode=fast sync, polling=Y, adaptive=Y, sleep=2, spin=100
WAIT #140336886422824: nam='log file sync' ela= 30 buffer#=4906 sync scn=2822183515 p3=0 obj#=-1 tim=195429133552
EXEC #140336886422824:c=63,e=63,p=0,cr=0,cu=1,mis=0,r=0,dep=1,og=0,plh=0,tim=195429133563
CLOSE #140336886422824:c=0,e=0,dep=1,type=3,tim=195429133570
EXEC #140336887086272:c=27,e=27,p=0,cr=2,cu=3,mis=0,r=1,dep=1,og=1,plh=412410146,tim=195429133602
CLOSE #140336887086272:c=0,e=0,dep=1,type=3,tim=195429133612
XCTEND rlbk=0, rd_only=0, tim=195429133619
LFS: mode=fast sync, polling=Y, adaptive=Y, sleep=2, spin=100
WAIT #140336886422824: nam='log file sync' ela= 28 buffer#=4908 sync scn=2822183518 p3=0 obj#=-1 tim=195429133666

This would make the Oracle foreground process to sleep less, wake up and start spinning/polling sooner. Sadly the log file sync wait event still shows a pretty long duration as I had tracing enabled for both the LGWR and this foreground process. Someday I’ll rerun the tests without any tracing (as your production system would do) and also set statistics_level=basic too - see if we can shed some more microseconds off the total commit latency!