What Caused This Wait Event: Using Oracle's wait_event[] tracing

I talked about a new diagnostic event wait_event[] back at OakTableWorld 2014. Oracle introduced this feature in version 12.1. But since there’s only a camera recorded video of that talk, I’ll document some examples here.

Attaching Actions to Wait Events

The new wait_event[] event name allows you to attach actions, like tracing, to wait events. The action runs whenever a wait event ends. It behaves like Oracle’s SQL Trace that writes out trace lines for wait events only when the wait ends.

For example, I could do something silly (but fun) like this:

SQL> ALTER SESSION SET EVENTS 'wait_event["log file sync"] crash()';

Session altered.

SQL> COMMIT;

Commit complete.

SQL> DELETE t WHERE rownum = 1;

1 row deleted.

SQL> COMMIT;
COMMIT
     *
ERROR at line 1:
ORA-03113: end-of-file on communication channel
Process ID: 15872
Session ID: 135 Serial number: 140728898484950

I just told my session to crash itself whenever a log file sync wait ends in it. The first commit above didn’t crash as it didn’t have any outstanding transaction to commit. So there was no need for a log file sync wait. But the second commit had some changes to persist, so it followed my direction and crashed itself, right after the wait ended.

Augmenting Wait Event Traces

A more practical use for this event is additional tracing. Oracle SQL Trace can write out wait events, their elapsed times and parameters. But in some cases, I also want to know which exact Oracle kernel function caused that wait event, ideally the whole stack backtrace. This would give me more insight into where and why did this wait happen. A full table scan example is below:

SQL> SELECT MAX(e) FROM t;

MAX(E)
-----------
e

When I ran the above query, I saw lots of db file sequential read waits showing up. It was surprising, as I would have expected almost all waits to be multi-block reads for this full table scan. Either db file scattered read or direct path read wait events.

...
WAIT #139658341140928: nam='db file scattered read' ela= 854 file#=7 block#=29236 blocks=123 obj#=117150 tim=411962792936
WAIT #139658341140928: nam='db file scattered read' ela= 559 file#=7 block#=29359 blocks=65 obj#=117150 tim=411962794592
WAIT #139658341140928: nam='db file sequential read' ela= 462 file#=7 block#=29695 blocks=1 obj#=117150 tim=411962796752
WAIT #139658341140928: nam='db file sequential read' ela= 423 file#=7 block#=29680 blocks=1 obj#=117150 tim=411962797242
WAIT #139658341140928: nam='db file sequential read' ela= 382 file#=7 block#=29681 blocks=1 obj#=117150 tim=411962797724
...

The execution plan confirms that we should be doing a “simple” full table scan throughout the table segment with multiblock reads:

SQL> @x
Display execution plan for last statement for this session from library cache...

-----------------------------------
| Id  | Operation          | Name |
-----------------------------------
|   0 | SELECT STATEMENT   |      |
|   1 |  SORT AGGREGATE    |      |
|   2 |   TABLE ACCESS FULL| T    |
-----------------------------------

In past I have written about another approach for troubleshooting such surprises using V$SESSTAT metrics ¹ ². Stack tracing is more universal though, but it is a much lower level tool. I use it to “zoom in” when the typical higher-level tools do not provide enough information.

Let’s enable SQL Trace with the added shortstack tracing for the db file sequential read wait event and rerun the earlier query. I also flushed the buffer cache to avoid having all table blocks already cached in the buffer pool.

SQL> ALTER SESSION SET EVENTS 'wait_event["db file sequential read"] 
                               trace("stack is: %\n", shortstack())';

Session altered.

SQL> SELECT MAX(e) FROM t;

MAX(E)
-----------
e

Now we have one-line stack backtraces (“short stacks”) printed out whenever a db file sequential read wait ends:

WAIT #139658340304896: nam='db file scattered read' ela= 392 file#=7 block#=28668 blocks=4 obj#=117150 tim=412079037275
WAIT #139658340304896: nam='db file scattered read' ela= 796 file#=7 block#=28676 blocks=123 obj#=117150 tim=412079038251
WAIT #139658340304896: nam='db file sequential read' ela= 323 file#=7 block#=28915 blocks=1 obj#=117150 tim=412079039074
stack is: ksfd_io<-ksfdread<-kcfrbd1<-kcbzib<-kcbgtcr<-ktrget2<-kdsgrp<-kdsgnp1<-kafger<-kdstf000010100001000km<-kdsttgr<-qertbFetch<-qerstFetch<-qergsFetch<-qerstFetch<-opifch2<-kpoal8<-opiodr<-ttcpip<-opitsk<-opiino<-opiodr<-opidrv<-sou2o<-opimai_real<-ssthrdmain< 
WAIT #139658340304896: nam='db file sequential read' ela= 511 file#=7 block#=28916 blocks=1 obj#=117150 tim=412079070653
stack is: ksfd_io<-ksfdread<-kcfrbd1<-kcbzib<-kcbgtcr<-ktrget2<-kdsgrp<-kdsgnp1<-kafger<-kdstf000010100001000km<-kdsttgr<-qertbFetch<-qerstFetch<-qergsFetch<-qerstFetch<-opifch2<-kpoal8<-opiodr<-ttcpip<-opitsk<-opiino<-opiodr<-opidrv<-sou2o<-opimai_real<-ssthrdmain< 
WAIT #139658340304896: nam='db file sequential read' ela= 599 file#=7 block#=28917 blocks=1 obj#=117150 tim=412079105620
stack is: ksfd_io<-ksfdread<-kcfrbd1<-kcbzib<-kcbgtcr<-ktrget2<-kdsgrp<-kdsgnp1<-kafger<-kdstf000010100001000km<-kdsttgr<-qertbFetch<-qerstFetch<-qergsFetch<-qerstFetch<-opifch2<-kpoal8<-opiodr<-ttcpip<-opitsk<-opiino<-opiodr<-opidrv<-sou2o<-opimai_real<-ssthrdmain<

Indeed, we do see stack traces printed out after each wait event of interest. If I replace the “<-” with “\n” in one of the traces, I see the following:

stack is: 
ksfd_io  kernel service file disk I/O function
ksfdread read random access file in sync mode
kcfrbd1  Read Buffers from Database:
kcbzib   KCB: input buffer - reads a block from disk into a buffer
kcbgtcr  KCB: Get a block for CR
ktrget2  kernel transaction read consistency - get
kdsgrp   Get row piece (fetch by rowid)
kdsgnp1  Get NEXT row piece (fetch by rowid)
kafger
kdstf000010100001000km
kdsttgr
qertbFetch
qerstFetch
qergsFetch
qerstFetch
opifch2
kpoal8
opiodr
ttcpip
opitsk
opiino
opiodr
opidrv
sou2o
opimai_real
ssthrdmain<

Without going too deep into stack-walking topics in this article, you can read the stack from top down. The idea is to just acknowledge low level physical I/O and logical I/O functions that always show up anyway when accessing data blocks, until you reach “higher level” functions that explain why such single block accesses were requested.

The wait event ended somewhere under the ksfd_io function all in the top.³ Then I continued walking down the list, just acknowledging and skipping the functions that I’d expect to show up anyway, whenever single block buffered physical reads are used. In short, the kcb* functions are about buffer cache, ktr* functions about transactional consistency and kds* ones are about the data layer (the functions that actually navigate to rows in blocks and read required columns).

I expected some of the kcb, ktr, kds functions to show up as usual, but I stopped walking the stack trace once I saw kdsgnp1() in there. The presence of this function in the current stack trace gives us some new information. kdsgnp* functions are about fetching the NEXT row piece for chained and migrated rows (and any columns spilled over to IOT overflow segments).

These additional single block reads that our simple full table scan was causing, come from chained or migrated rows! And since we are just doing a simple, serial SELECT query without any index-driven access, immediate fetching of next rowpieces during the scan very likely means having too wide for a single block chained rows and not just “plain” migration. There are exceptions, as usual.

Let’s confirm what my table looks like:

SQL> @desc t
           Name                            Null?    Type
           ------------------------------- -------- ----------------------------
    1      A                                        CHAR(2000)
    2      B                                        CHAR(2000)
    3      C                                        CHAR(2000)
    4      D                                        CHAR(2000)
    5      E                                        CHAR(2000)

Oh yeah, no surprise here. Thanks to the CHAR(2000) columns and 8kB block size, every row that has non-NULL values in these columns ends up chained between 2 blocks.

If you know about the table fetch continued row metric in V$SESSTAT, you would be able to systematically troubleshoot row chaining/migration-related issues without needing wait event tracing. But as I said earlier, tracing any wait event, regardless what causes it, is a much more universal tool, with wide coverage. Not all unexpected single block reads are caused by row chaining, as you’ll see below.

Here’s an example of the same full table scan after I had forced serial direct path read scanning to kick in. This time I’ll show you the trace from start:

PARSE #140324515798504:c=22,e=22,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=2966233522,tim=84517501201
EXEC #140324515798504:c=42,e=42,p=0,cr=0,cu=0,mis=0,r=0,dep=0,og=1,plh=2966233522,tim=84517501267
WAIT #140324515798504: nam='SQL*Net message to client' ela= 4 driver id=1413697536 #bytes=1 p3=0 obj#=114666 tim=84517501333
WAIT #140324515798504: nam='db file sequential read' ela= 1030 file#=7 block#=8034 blocks=1 obj#=117150 tim=84517502603
stack is: ksfd_io<-ksfdread<-kcfrbd1<-kcbzib<-kcbgtcr<-ktecgsc<-ktecgetsh<-ktecgshx<-kteinpscan_int<-qertbFetch<-qerstFetch<-qergsFetch<-qerstFetch<-opifch2<-kpoal8<-opiodr<-ttcpip<-opitsk<-opiino<-opiodr<-opidrv<-sou2o<-opimai_real<-ssthrdmain<-main<-__libc_start_m
WAIT #140324515798504: nam='direct path read' ela= 601 file number=7 first dba=8035 block cnt=13 obj#=117150 tim=84517538040
WAIT #140324515798504: nam='db file sequential read' ela= 502 file#=7 block#=8042 blocks=1 obj#=117150 tim=84517538829
stack is: ksfd_io<-ksfdread<-kcfrbd1<-kcfrbd<-kcbldrget<-kcbgtcr<-ktrget2<-kdsgrp<-kdsgnp1<-kafger<-kdstf000010100001000km<-kdsttgr<-qertbFetch<-qerstFetch<-qergsFetch<-qerstFetch<-opifch2<-kpoal8<-opiodr<-ttcpip<-opitsk<-opiino<-opiodr<-opidrv<-sou2o<-opimai_real<-
WAIT #140324515798504: nam='direct path read' ela= 571 file number=7 first dba=8209 block cnt=15 obj#=117150 tim=84517575425
WAIT #140324515798504: nam='db file sequential read' ela= 443 file#=7 block#=8202 blocks=1 obj#=117150 tim=84517575992
stack is: ksfd_io<-ksfdread<-kcfrbd1<-kcfrbd<-kcbldrget<-kcbgtcr<-ktrget2<-kdsgrp<-kdsgnp1<-kafger<-kdstf000010100001000km<-kdsttgr<-qertbFetch<-qerstFetch<-qergsFetch<-qerstFetch<-opifch2<-kpoal8<-opiodr<-ttcpip<-opitsk<-opiino<-opiodr<-opidrv<-sou2o<-opimai_real<-

The first db file sequential read shows a bunch of kte* functions causing it. If you scroll right, you’ll see kteinpscan_int() right under the table scan row source function (qertbFetch() means TABLE ACCESS FULL Fetch in this case). This single block read right in the beginning of the table scan is done for reading this segment’s extent map and High-Water Mark from the segment header block (block #8034) above, so that Oracle would know which block ranges in the data files to scan through. The remaining two single block reads above are the familiar “get next row piece” reasons I described earlier (kdsgnp1).

But if you continue reading further down in the trace file, there are more reasons, check out this excerpt:

WAIT #140324515798504: nam='db file sequential read' ela= 319 file#=7 block#=38915 blocks=1 obj#=117150 tim=84681878386
stack is: ksfd_io<-ksfdread<-kcfrbd1<-kcbzib<-kcbgtcr<-ktspgfblk3<-ktspScanInit<-ktspGenExtentMap1<-kteinmap0<-kteinmap<-kdselget<-kdstsne<-kdstsnb<-kdst_fetch0<-kdstf000010100001000km<-kdsttgr<-qertbFetch<-qerstFetch<-qergsFetch<-qerstFetch<-opifch2<-kpoal8<-opiodr
WAIT #140324515798504: nam='db file sequential read' ela= 743 file#=7 block#=8033 blocks=1 obj#=117150 tim=84681909584
stack is: ksfd_io<-ksfdread<-kcfrbd1<-kcbzib<-kcbgtcr<-ktspgsb3<-ktspScanInit<-ktspGenExtentMap1<-kteinmap0<-kteinmap<-kdselget<-kdstsne<-kdstsnb<-kdst_fetch0<-kdstf000010100001000km<-kdsttgr<-qertbFetch<-qerstFetch<-qergsFetch<-qerstFetch<-opifch2<-kpoal8<-opiodr<-

The ktsp functions (Kernel Transaction Space) usually have to do with free space search within a segment - find which blocks below the HWM of a segment still have space for insertion. This is interesting as I am running a simple select query here, not an insert or update, intra-segment free space awareness should not be needed? Before guessing what’s going on, let’s see what exact blocks the block numbers #38915 and #8033 point to in file #7.

I could dump these blocks to trace too, but I’ll use a different technique using just V$ views. As this table’s blocks happen to be in buffer cache due to previous testing, I can query the block class from V$BH:

SQL> SELECT class# FROM v$bh WHERE file# = 7 AND block# = 38915;

    CLASS#
----------
         8

SQL> @bclass.sql 8

CLASS              UNDO_SEGMENT_ID
------------------ ---------------
1st level bmb

Apparently, the single block read in the middle of the segment is done against an ASSM 1st level bitmap block. Indeed, my testing table is using Automatic Segment Space Management. Let’s check the second single block I/O too:

SQL> SELECT class# FROM v$bh WHERE file# = 7 AND block# = 8033;

    CLASS#
----------
         9

SQL> @bclass 9

CLASS              UNDO_SEGMENT_ID
------------------ ---------------
2nd level bmb

Ok, the 2nd I/O is also against an ASSM bitmap block (L2). You may have noticed something intriguing above. The ASSM L2 bitmap block is at block number #8033. But wait a minute, wasn’t the table’s segment header at block #8034, thus making this ASSM block reside before the segment header block in the datafile? The answer is yes, this is how ASSM works, there are always two or more ASSM blocks (L1+L2) stored in the first extent of a segment and the segment header block comes after.

FYI: You can download my 15yr old presentation on Oracle ASSM internals, if you dare :-)

Ok, back to the troubleshooting. Why would my simple full table scan need to do single block I/Os to read ASSM blocks? The extent map and High-Water Mark of a segment are stored in the segment header block. A map with offsets of up to 505 extents can be stored in the segment header with 8kB block size, my table only had 92 extents. If the table had more than 505 extents, then since Oracle 8.0, additional overflow extent map blocks would have been created in the segment as it grows.

Regardless of the extent map layout, in my case we were reading ASSM bitmap blocks, not extent map blocks anyway. The reason why my simple full table scan ends up reading ASSM space metadata is thanks to HWM changes in ASSM. There’s not just one HWM in ASSM, but two - Low High-Water Mark (LHWM) and High High-Water Mark (HHWM). All blocks below LHWM are formatted, so ok to scan, there are no formatted blocks at all above the HHWM and some blocks may be formatted between LHWM and HHWM. In order to know which blocks are formatted and may contain data, my table scan needs to consult the relevant ASSM bitmap blocks.

If you are wondering why such complexity is needed - it’s not really for query speedup. This is designed to spread contention when many concurrent sessions try to insert data and allocate blocks for themselves within a segment. The reason why you typically don’t see such single block reads showing up with buffered table scans, is that a buffered multiblock read will read the ASSM bitmap blocks in from the beginning of each extent as it does its multiblock reads. The ASSM bitmap blocks are read into cache together with regular data blocks under the db file scattered read events you see. With a direct path read, you’d read the block ranges into PGA private memory, but the ASSM space search codepath only knows how to read ASSM bitmap blocks via buffer cache.

Ok, I’ll stop the ASSM-specific content here, I wanted to demo just another scenario where wait event tracing can shed light on hard to explain I/Os or waits.

Note that the wait_event[] actions are independent of SQL Tracing, I just happened to enable both SQL Trace and custom wait event tracing in this example.

Dumping PL/SQL Call Stack on a Wait Event

In the previous section we did some pretty microscopic analysis using Oracle RDBMS kernel call stacks. It is possible to use Oracle’s errorstack() action for also dumping PL/SQL call stack and source code line number info. For example, you can trace where exactly in your stored procedures are all these DBMS_LOCK.SLEEP() waits coming from:

SQL> ALTER SESSION SET EVENTS 'wait_event["PL/SQL lock timer"] errorstack(1)';

Session altered.

SQL> CREATE OR REPLACE PROCEDURE p AS
  2  BEGIN
  3    -- do something useful
  4    DBMS_LOCK.SLEEP(1);
  5  END;
  6  /

Procedure created.

SQL> EXEC p

PL/SQL procedure successfully completed.

The errorstack trace will also interpret and dump your PL/SQL call stack (in addition to the binary kernel C function call stack). As you see below, it even tells you the source code line number for use with DBA_SOURCE (line #4 in SYSTEM.P):

***2019-03-22T15:48:23.312943-04:00
dbkedDefDump(): Starting a non-incident diagnostic dump (flags=0x0, level=1, mask=0x0)
----- Error Stack Dump -----
 at 0x7ffc1adc9608 placed dbkda.c@295
----- Current SQL Statement for this session (sql_id=0g33y61cdq54y) -----
BEGIN p; END;
----- PL/SQL Stack -----
----- PL/SQL Call Stack -----
  object      line  object
  handle    number  name
0x708aeeb8       215  package body SYS.DBMS_LOCK.SLEEP
0x677c6850         4  procedure SYSTEM.P
0x676edf98         1  anonymous block

----- Call Stack Trace -----
calling              call     entry                argument values in hex      
location             type     point                (? means dubious value)     
-------------------- -------- -------------------- ----------------------------
ksedst1()+110        call     kgdsdst()            7FFC1ADC8A70 000000002
                                                   7FFC1ADC2DB0 ? 7FFC1ADC2EC8 ?
                                                   7FFC1ADC8590 ? 000000082 ?
ksedst()+64          call     ksedst1()            000000000 000000001
                                                   7FFC1ADC2DB0 ? 7FFC1ADC2EC8 ?
                                                   7FFC1ADC8590 ? 000000082 ?
... a lot of extra output removed ...
...

As dumping the session’s errorstack is a relatively heavy operation, you might not want to trace all occurrences of this event and perhaps just dump it out once in a session:

SQL> ALTER SESSION SET EVENTS 'wait_event["PL/SQL lock timer"] 
                               {occurence:end_after 1} errorstack(1)';

Session altered.

Note that you have to use the misspelled word occurence in the above syntax for it to work.

Or you could invoke the errorstack dump action only when the wait event’s duration was longer than N milliseconds:

SQL> ALTER SESSION SET EVENTS 'wait_event["PL/SQL lock timer"]
                               {wait: minwait=5000} errorstack(1)';

Session altered.

Most of the tracing options are documented in the ORADEBUG DOC command output (you’ll need to run it as SYS). For example, I can list the event filters (for narrowing down when the actions fire) using ORADEBUG DOC EVENT FILTER and drill down deeper from there:

SQL> ORADEBUG DOC EVENT FILTER rdbms

Event filters in library RDBMS:
------------------------------
wait                 filter for specific wait parameters and wait duration
process              filter to set events only for a specific process
px                   filter to check identity of the process for fault injection


SQL> ORADEBUG DOC EVENT FILTER rdbms.wait

wait: filter for specific wait parameters and wait duration

Usage
-------
{wait:  minwait         ,
        p1              ,
        p2              ,
        p3              ,
        _actual_wait_time  default 'evargn(pos=1)',
        _actual_wait_p1  default 'evargn(pos=2)',
        _actual_wait_p2  default 'evargn(pos=3)',
        _actual_wait_p3  default 'evargn(pos=4)' }


SQL> ORADEBUG DOC EVENT FILTER rdbms.process

process: filter to set events only for a specific process

Usage
-------
{process:  ospid           [20],
           orapid          ,
           pname           [20],
           con_id           }

You can see all the options by experimenting with these commands yourself or just go read my ORADEBUG DOC article.

Tracing ALL wait events

So far, we have traced only specific wait events by their name. Oracle allows to also trace all wait events when passing in “all” as the wait event name:

SQL> ALTER SESSION SET EVENTS 'wait_event[all] 
             trace(''event="%" ela=% p1=% p2=% p3=%\n'', 
             evargs(5), evargn(1), evargn(2), evargn(3), evargn(4))';

Session altered.

Since I’m tracing all wait events now, I want the trace action to print out the wait event names too. The event argument #5 is a pointer to the event name string, so I’ll have to copy the string in using evargs(5) function. The other arguments are just numeric values, so they can be listed with evargn().

When running DBMS_LOCK from my session (SQL Trace is not enabled, only the custom wait_event trace), we get the following trace output:

SQL> EXEC dbms_lock.sleep(1.234)

PL/SQL procedure successfully completed.

Trace output:

*** 2019-03-29T10:47:14.131674-04:00
event="SQL*Net message from client" ela=11952206 p1=1413697536 p2=1 p3=0
event="db file sequential read" ela=1067 p1=1 p2=20527 p3=1
event="db file sequential read" ela=563 p1=1 p2=65915 p3=1
event="db file sequential read" ela=449 p1=1 p2=65892 p3=1
event="db file sequential read" ela=434 p1=1 p2=47369 p3=1
event="db file sequential read" ela=458 p1=1 p2=33381 p3=1
event="db file sequential read" ela=492 p1=1 p2=65877 p3=1

*** 2019-03-29T10:47:15.417846-04:00
event="PL/SQL lock timer" ela=1279073 p1=0 p2=0 p3=0
event="SQL*Net message to client" ela=7 p1=1413697536 p2=1 p3=0

We just built our own, flexible, SQL Trace-like wait event tracing using the Oracle’s diagnostic tracing infrastructure.

Summary

The aim of this post was to demonstrate the flexibility of Oracle’s built in diagnostic infrastructure, in the context of augmenting wait events. We looked into how to print out Oracle RDBMS kernel stack traces, PL/SQL stack traces and custom tracing when a wait event of interest completes. We looked into how to fire the diagnostic events only when specific filter conditions match and even how to conditionally crash our session for fun!

As these tracing techniques are undocumented and as some dumps can slow your process down, you should not use these in production as your first tool of choice. You’d still start from things like SQL Monitoring, ASH, SQL Trace and if these techniques are not enough, see how to apply selective additional tracing with as narrow focus as possible. Additionally, you may want to make these trace events turn themselves off automatically, after a few occurrences of the problematic event.

More info

If you want more info about low-level wait event internals and some pretty advanced low-level workings, watch these YouTube videos:

My OakTableWorld 2014 talk where I first showed this technique (camera recording)
My Hacking session about wait event internals and X$TRACE (screencast)
Evgeny Puzikov has published another good use case of using wait_event debug hooks for dumping detailed PGA usage info when measuring PGA growth caused by PL/SQL activity (it works for other things like memory leaks too)

Footnotes

Detect Chained and Migrated rows in Oracle ↩︎
Advanced Oracle Troubleshooting Guide: Index Unique Scan Doing Multiblock Reads ↩︎
Actually, the function dealing with wait event ending (and calling any additional debug/tracing actions) is called kslwtectx(), but the traced function itself is not dumped out by the shortstack() action. It can be printed using the evfunc() trace function. ↩︎