I recently diagnosed a performance issue where the “events in waitclass Other” wait occasionally took a significant part of a session’s response time. For example, Snapper (which reads wait event data from V$SESSION_EVENT) reported that during the measurement interval 39.9% of the response time was spent on “events in waitclass Other”.
SQL> @sn 1 119

-- Session Snapper v1.07 by Tanel Poder ( )
---------------------------------------------------------------------------------------------------------------------------------------------
 HEAD,  SID, SNAPSHOT START    , SECONDS, TYPE, STATISTIC                               ,     DELTA, DELTA/SEC,   HDELTA, HDELTA/SEC
---------------------------------------------------------------------------------------------------------------------------------------------
 DATA,  119, 20080621 05:22:05,        1, STAT, session logical reads                   ,     18284,     18284,   18.28k,     18.28k
 DATA,  119, 20080621 05:22:05,        1, STAT, consistent gets                         ,     15301,     15301,    15.3k,      15.3k
 DATA,  119, 20080621 05:22:05,        1, STAT, consistent gets from cache              ,     15228,     15228,   15.23k,     15.23k
 DATA,  119, 20080621 05:22:05,        1, STAT, consistent gets from cache (fastpath)   ,     15136,     15136,   15.14k,     15.14k
 DATA,  119, 20080621 05:22:05,        1, STAT, calls to get snapshot scn: kcmgss       ,        89,        89,       89,         89
 DATA,  119, 20080621 05:22:05,        1, STAT, no work - consistent read gets          ,     14883,     14883,   14.88k,     14.88k
 DATA,  119, 20080621 05:22:05,        1, STAT, table scans (short tables)              ,        21,        21,       21,         21
 DATA,  119, 20080621 05:22:05,        1, STAT, table scan rows gotten                  ,   1429227,   1429227,    1.43M,      1.43M
 DATA,  119, 20080621 05:22:05,        1, STAT, table scan blocks gotten                ,     17440,     17440,   17.44k,     17.44k
 DATA,  119, 20080621 05:22:05,        1, WAIT, events in waitclass Other               ,    399831,    399831, 399.83ms,   399.83ms
--  End of snap 1
Starting from Oracle 10g, Oracle has consolidated lots of the wait events into “events in waitclass Other”. This is because saving all 900+ wait event stats for every session (in the V$SESSION_EVENT array) would waste too much memory while giving little benefit (normally there’s only a handful of troublemaking events anyway). Therefore it makes sense to aggregate the rarely occurring events under some common category. It looks like the Oracle kernel coders have set a threshold on the event number, above which all events are grouped under the “Other” wait class.
See below, this is from 11g:
SQL> select count(*) from v$event_name;

  COUNT(*)
----------
       961

SQL> select wait_class, count(*) from v$event_name group by wait_class;

WAIT_CLASS                                                         COUNT(*)
---------------------------------------------------------------- ----------
Concurrency                                                              26
System I/O                                                               23
User I/O                                                                 22
Administrative                                                           51
Other                                                                   632
Configuration                                                            21
Scheduler                                                                 3
Cluster                                                                  47
Application                                                              15
Queueing                                                                  4
Idle                                                                     80
Network                                                                  35
Commit                                                                    2

13 rows selected.
There are 961 different wait events in total in this version of Oracle, and 632 of these are categorized under the “Other” name.
So, what to do when this “events in waitclass Other” wait becomes significant in the response time profile?
Actually it’s simple. Only V$SESSION_EVENT experiences this event aggregation issue; V$SESSION_WAIT (and SQL tracing) do not. The reason (again) is that unlike V$SESSION_EVENT, the V$SESSION_WAIT view does not have to store cumulative stats anywhere in memory (it only reflects the session’s current wait), therefore aggregation for memory savings is not needed there.
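For example, one quick manual way to see the underlying event (a minimal sketch, not my WaitProf logic; session 119 is the SID from the example above) is to sample V$SESSION_WAIT directly a few times in a row:

```sql
-- Sample the session's current wait directly. Unlike V$SESSION_EVENT,
-- V$SESSION_WAIT shows the individual event name instead of the
-- aggregated "events in waitclass Other". Run this repeatedly to get
-- a rough picture of where the session spends its time.
select sid, state, event, p1text, p1, p2text, p2, p3text, p3
from   v$session_wait
where  sid = 119;
```

Manual sampling like this is crude (you only see the wait in progress at each poll), which is exactly why a high-frequency sampler is more convenient.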
So, let’s use my WaitProf script to check which event my session is really waiting for:
SQL> @waitprof print 119 e123 1000000

-- WaitProf 1.04 by Tanel Poder ( )

                                                                                                                           % Total  Total Event    Distinct   Avg time
    SID STATE   EVENT                               P1                         P2              P3                    SEQ#     Time      Time ms      Events   ms/Event
------- ------- ----------------------------------- -------------------------- --------------- --------------- ---------- -------- ------------ ---------- ----------
    119 WORKING On CPU / runqueue                                                                                             78.04    13056.393          1  13056.393
    119 WAITING latch free                          address= 000000005E60BE28  number= 148     tries= 0                       21.96     3673.607          1   3673.607

Elapsed: 00:00:16.73
See, it’s actually waiting for a “latch free” event!
What is this latch?
SQL> @la 5E60BE28

ADDR         LATCH#   CHLD NAME                                           GETS      IGETS     MISSES    IMISSES     SLEEPS  WAIT_TIME
-------- ---------- ------ ---------------------------------------- ---------- ---------- ---------- ---------- ---------- ----------
5E60BE28        148      6 simulator lru latch                        16718933          3     144011          0       3456 1894137768
And we found that this is a buffer cache simulation (V$DB_CACHE_ADVICE) related latch!
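A lookup like the one my @la script does above can be approximated with a plain query against V$LATCH_CHILDREN (a sketch under the assumption that the wait’s P1 address corresponds to a child latch; parent latches would need a similar query against V$LATCH_PARENT):

```sql
-- Look up a child latch by the address reported in the latch free
-- wait's P1 value. ADDR is a RAW column, so convert it to hex text
-- for a suffix match (the P1 value may be zero-padded to 16 digits).
select addr, latch#, child# chld, name,
       gets, misses, sleeps, wait_time
from   v$latch_children
where  upper(rawtohex(addr)) like '%5E60BE28';
```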
To be honest, I kind of knew what the issue was already – I had noticed that the server’s CPUs were 100% utilized and the CPU runqueues were quite lengthy. Once you have 100% CPU usage and runqueues start to build up, you will start seeing latch contention on solitary latches and latches with only a few children. This is mostly because if a latch holder happens to be scheduled off the CPU, it may take tens or hundreds of milliseconds before the holder gets back onto the CPU to complete its work and release the latch. However, it’s likely that during that time many other processes try to grab the latch and start spinning on it. This raises the CPU usage even more. In this database there were thousands of cache buffers chains latches, but only 16 simulator lru latches. Therefore, even if a CBC latch holder went to sleep, a collision was much less likely than in a similar case with a simulator lru latch holder.
This brings me to the point that whenever you see any latch contention (especially an unusual one) in your performance tool output, first check the CPU utilization. High CPU usage (due to an execution plan gone bad, parallelism kicking in unexpectedly, etc.) can cause latch contention, and in those conditions the “latch free” events are just a symptom of your real problem.
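Since 10g you can even do a rough CPU sanity check from inside the database (a sketch; the exact V$OSSTAT statistic names can vary by version and platform):

```sql
-- Compare the OS load average against the CPU count: a load figure much
-- higher than NUM_CPUS means processes are queuing for CPU, which is
-- exactly the condition where latch holders get preempted mid-work.
select stat_name, value
from   v$osstat
where  stat_name in ('NUM_CPUS', 'LOAD');
```

Of course OS-level tools (vmstat, sar, etc.) remain the authoritative source for runqueue lengths.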
Anyway, I was writing about “events in waitclass Other” today, not latches, so here are a few more points on events:
I have a script which queries V$FIXED_VIEW_DEFINITION to see the definitions of V$ views – let’s check which X$ table V$SESSION_EVENT is based on:
SQL> @v gv$session_event

VIEW_NAME                      TEXT
------------------------------ ----------------------------------------------------------------------------------------------------
GV$SESSION_EVENT               select s.inst_id, s.kslessid, d.kslednam, s.ksleswts, s.kslestmo,
                               round(s.kslestim / 10000), round(s.kslestim / (10000 * s.ksleswts), 2),
                               round(s.kslesmxt / 10000), s.kslestim, d.ksledhash, d.ksledclassid,
                               d.ksledclass#, d.ksledclass
                               from x$ksles s, x$ksled d
                               where s.ksleswts != 0 and s.kslesenm = d.indx

SQL>
X$KSLES stores the session-level wait event stats.
X$KSLED stores the event names (as X$KSLES doesn’t store the event name itself, just its ID).
So, we check how many V$SESSION_EVENT records we have for session 119:
SQL> select count(*) from x$ksles where kslessid = 119;

  COUNT(*)
----------
       331

SQL>
Only 331 records for this session.
But, as I showed above, V$EVENT_NAME which is based on X$KSLED had 900+ event names in it.
So let’s see which events come after event# 331:
SQL> select indx, kslednam from x$ksled where indx > 331;

      INDX KSLEDNAM
---------- ---------------------------------------------------------
       332 latch free
       333 unspecified wait event
       334 latch activity
       335 wait list latch activity
       336 wait list latch free
       337 kslwait unit test event 1
       338 kslwait unit test event 2
       339 kslwait unit test event 3
       340 global enqueue expand wait
       341 free process state object
       342 inactive session
       343 process terminate
       344 latch: session allocation
       345 check CPU wait times
       346 enq: CI - contention
       347 enq: PR - contention
       348 ksim generic wait event
       349 debugger command
       350 ksdxexeother
       351 ksdxexeotherwait
       352 enq: PE - contention
       353 enq: PG - contention
       354 ksbsrv
[...snipped...]
Most of these events are fairly obscure ones, but it looks like latch free is also one of the “Other” ones now!
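This can be cross-checked against V$EVENT_NAME (a quick sketch):

```sql
-- Check which wait class the latch free event is categorized under
select event#, name, wait_class
from   v$event_name
where  name = 'latch free';
```

On this 11g instance the query should report the “latch free” event’s wait class as Other, matching the per-session aggregation seen above.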
That explains why Snapper showed “events in wait class Other” instead of latch free.
Note that V$SYSTEM_EVENT doesn’t have this event aggregation issue – as you need to store system-level wait stats in only one location per instance (and not for each session).
V$SYSTEM_EVENT is based on X$KSLEI:
SQL> select count(*) from x$kslei;

  COUNT(*)
----------
       961
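So at the instance level the individual statistics are all still there; for example (a sketch):

```sql
-- V$SYSTEM_EVENT keeps separate cumulative stats for every event,
-- including the ones that V$SESSION_EVENT lumps together under
-- "events in waitclass Other"
select event, total_waits, time_waited
from   v$system_event
where  event = 'latch free';
```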
That’s pretty much all for today, but there’s one more thing I want to show, related to X$ table quirks. As V$ views and X$ tables are not normal tables (but rather structure definitions and pointers to underlying functions), they don’t have to play by normal table rules. Check this:
SQL> select min(indx),max(indx) from X$ksles where kslessid in (1);

 MIN(INDX)  MAX(INDX)
---------- ----------
         0        330

SQL> select min(indx),max(indx) from X$ksles where kslessid in (2);

 MIN(INDX)  MAX(INDX)
---------- ----------
         0        330
You see that regardless of which single value I pass to the query, I get min and max values of 0 and 330 in the INDX column.
SQL> select min(indx),max(indx) from X$ksles where kslessid in (1,2);

 MIN(INDX)  MAX(INDX)
---------- ----------
         0        661
However, if I pass both values together, the max value reaches 661.
SQL> select min(indx),max(indx) from X$ksles where kslessid in (1,2,3);

 MIN(INDX)  MAX(INDX)
---------- ----------
         0        992
If I add a third value, the max(indx) goes up to 992… however, if I use only the third value (the one which caused the max to go up to 992), the max goes back down to 330:
SQL> select min(indx),max(indx) from X$ksles where kslessid in (3);

 MIN(INDX)  MAX(INDX)
---------- ----------
         0        330
This illustrates that the INDX column in X$ tables behaves quite like ROWNUM in normal tables. Actually, the INDX (as the name says) is the index into the underlying X$ table array. However, as this X$ table is a “dynamic” one, which is not based on a fixed SGA structure directly but rather calls an intermediate X$ service function that fetches only the required data from the SGA and places it into an array in the UGA, the INDX represents the slot in the resulting UGA array, not an SGA one ;)
I’ll write about identifying different variations of X$ tables in a future post ;)