I have been too busy since getting back from vacation, thus no posts for a while. But I hope the wait was worthwhile, as I present you LatchProf, a tool for digging into latch contention problems – using plain SQL and sqlplus!
As I'm still busy, I'll keep it short.
LatchProf is a script similar to WaitProf, only it samples latch holder statistics from V$LATCHHOLDER. As V$LATCHHOLDER contains a SID column (the session ID of a latch holder), it becomes possible to find out who is hitting a latch the most (and, for example, to prove that crappy monitoring tools which constantly scan through V$SQL DO cause library cache latch contention themselves).
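To get a feel for the raw data LatchProf is built on, here is a minimal single-snapshot query against V$LATCHHOLDER (just a sketch of the idea, not the actual LatchProf code – the real script effectively repeats this sampling tens of thousands of times in a tight loop and aggregates the results):

```sql
-- One point-in-time snapshot of current latch holders.
-- SID   = session holding the latch
-- LADDR = address of the (child) latch being held
-- NAME  = latch name
select sid, laddr, name
  from v$latchholder;
```

A single execution only shows who happens to hold which latch at one instant; it is the high-frequency repetition and aggregation that turns these snapshots into meaningful statistics.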
Some day I will write about this at greater length (and go over the script itself), but here's the syntax:
@latchprof <what> <sid|%> <latch|%|childaddr> <#samples>
Here are the parameter definitions:
- <what>: the columns to select. These map directly to V$LATCHHOLDER columns, so you can use SID,NAME,LADDR there, for example
- <sid|%>: either a session ID as a number, or % to see latches held by all SIDs
- <latch|%|childaddr>: either % for all latches, a LIKE pattern such as %library% to see all latches with "library" in their name, or an actual child latch address. This allows us to monitor only the specific child latches we are interested in
- <#samples>: the number of samples to take. I normally use 100 000 samples as a starting point
As LatchProf (and WaitProf, for that matter) never sleeps, it burns as much CPU as it can get. This is good for sampling accuracy (well, at least on multiprocessor systems with enough spare CPU capacity), but it will consume one of your CPUs, so you don't want to run it for too long at a time (you could implement a short wait in there with a nice trick by Jonathan Lewis).
100k samples usually complete in around a second on my test kit (giving a 100+ kHz sampling rate :) This will vary with your hardware, your Oracle version and your PROCESSES parameter (as V$LATCHHOLDER iterates through process state objects – the more processes you've got, the longer it takes).
Anyway, let's say we have a performance issue and, using the wait interface, we have determined that for a particular session, waits for cache buffers chains latches are unreasonably high.
Now, one way forward would be to identify the database blocks involved in this problem, by looking at which blocks are protected by the latch (using the X$BH.HLADDR = V$LATCH_CHILDREN.ADDR mapping) and trying to figure out the hottest block by its X$BH.TCH touch count. But where would that lead us? We would find out which segment the block belongs to, but it doesn't really help us understand who the troublemaker causing this issue is.
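For completeness, here is a hedged sketch of that block-level approach (run as SYS, since X$ tables are involved; &child_latch_addr is a placeholder for a V$LATCH_CHILDREN.ADDR value you want to examine):

```sql
-- Which buffers hang off a given cache buffers chains child latch,
-- ordered by touch count (TCH) to spot the hottest block.
-- X$BH.HLADDR is the protecting child latch address;
-- X$BH.OBJ maps to DBA_OBJECTS.DATA_OBJECT_ID.
select bh.tch, bh.file#, bh.dbablk, o.owner, o.object_name
  from x$bh bh, dba_objects o
 where bh.hladdr = hextoraw('&child_latch_addr')
   and bh.obj    = o.data_object_id
 order by bh.tch desc;
```

As argued above, though, this tells you *what* is hot, not *who* is making it hot.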
Therefore we use another approach – sampling V$LATCHHOLDER to see which sessions hold the latches the most:
Just for demo purposes I start with listing general latchholder stats (not broken down by holding SID):
SQL> @latchprof name % % 100000

NAME                                 Held       Gets  Held %     Held ms Avg hold ms
------------------------------ ---------- ---------- ------- ----------- -----------
cache buffers chains                21248      17418   21.25     267.725        .015
simulator lru latch                  7192       7169    7.19      90.619        .013
simulator hash latch                   66         66     .07        .832        .013
enqueue hash chains                     4          4     .00        .050        .013
ges resource hash list                  3          3     .00        .038        .013
process allocation                      2          2     .00        .025        .013
library cache                           2          2     .00        .025        .013
name-service namespace bucket           2          2     .00        .025        .013
name-service memory objects             1          1     .00        .013        .013
ges caches resource lists               1          1     .00        .013        .013
library cache pin                       1          1     .00        .013        .013

11 rows selected.
Yes, it looks like cache buffers chains latches are held the most. The Held column shows during how many samples a latch was held. The Gets column shows roughly how many times that latch was gotten during the sampling time.
As we took 100,000 samples and during 21,248 of them a cache buffers chains latch was held, we can conclude that this latch was busy (held) 21.25% of the time. And as the sampling lasted about 1.3 seconds this time, we can extrapolate that out of those 1.3 seconds, cache buffers chains latches were in held state for roughly 267 milliseconds. And as we know the number of gets, we can compute the average latch hold time in milliseconds. This can help us detect very long latch holds, like what can happen when someone is scanning through X$KSMSP or the server is CPU starved. The current estimate of 15 microseconds per latch hold is not that bad (note that with very few samples, like 10,000 or less, the avg hold time estimates become unreliable).
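The arithmetic behind those derived columns can be reproduced by hand (assuming here that the 100k samples took about 1.26 seconds of wall-clock time, which is what the numbers above imply):

```sql
-- Held %      = held samples / total samples * 100
-- Held ms     = held samples * (elapsed ms / total samples)
-- Avg hold ms = Held ms / gets
select 21248 / 100000 * 100           as held_pct,    -- 21.25
       21248 * 1260 / 100000          as held_ms,     -- ~267.7
       21248 * 1260 / 100000 / 17418  as avg_hold_ms  -- ~.015 (15 microseconds)
  from dual;
```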
Anyway, let’s break down the individual latch holding sessions now:
SQL> @latchprof sid,name % % 100000

       SID NAME                                 Held       Gets  Held %     Held ms Avg hold ms
---------- ------------------------------ ---------- ---------- ------- ----------- -----------
        81 cache buffers chains                14916      14915   14.92     187.942        .013
        81 simulator lru latch                  6797       6745    6.80      85.642        .013
        85 cache buffers chains                 6195       2284    6.20      78.057        .034
        85 simulator lru latch                   524        506     .52       6.602        .013
        81 simulator hash latch                  121         85     .12       1.525        .018
       112 name-service namespace bucket           2          2     .00        .025        .013
       100 enqueue hash chains                     2          2     .00        .025        .013
        98 library cache                           2          2     .00        .025        .013
       112 process allocation                      2          2     .00        .025        .013
       112 ges caches resource lists               1          1     .00        .013        .013
       112 enqueues                                1          1     .00        .013        .013
       104 messages                                1          1     .00        .013        .013
       100 cache buffers chains                    1          1     .00        .013        .013
       100 active service list                     1          1     .00        .013        .013
       100 enqueues                                1          1     .00        .013        .013
        85 simulator hash latch                    1          1     .00        .013        .013
       100 library cache                           1          1     .00        .013        .013

17 rows selected.
We see that sessions 81 and 85 are hitting CBC latches the most. Let's say (for demo purposes again) that SID 85 was the one which experienced bad performance and led us to investigate this issue.
As there are tens of thousands (or even more) of CBC latches in a decent Oracle database, looking only at aggregated latch-name-level stats is of course not enough. Do the sessions compete for a single child latch? Or do they just use lots of different CBC latches over time and never collide at all? This is the kind of question LatchProf can help answer.
Let's include the latch address (child latch address) in the picture and query only "cache"-related latches. This returns lots of rows, so I normally press CTRL+C after the first pageful of results, as those are the most important ones (the results are ordered with the busiest latches first):
SQL> @latchprof sid,name,laddr % cache 100000

       SID NAME                           LADDR          Held       Gets  Held %     Held ms Avg hold ms
---------- ------------------------------ -------- ---------- ---------- ------- ----------- -----------
        81 cache buffers chains           41AACEEC      15061      15017   15.06     186.756        .012
        85 cache buffers chains           41B2C37C         34         34     .03        .422        .012
        85 cache buffers chains           41B85D64         31         31     .03        .384        .012
        85 cache buffers chains           41BCF7FC         30         30     .03        .372        .012
        85 cache buffers chains           41B415EC         28         28     .03        .347        .012
        85 cache buffers chains           41BCA658         25         25     .03        .310        .012
        85 cache buffers chains           41B4E738         25         25     .03        .310        .012
        85 cache buffers chains           41AE4694         25         25     .03        .310        .012
        85 cache buffers chains           41AB02E0         25          9     .03        .310        .034
        85 cache buffers chains           427FA1F8         24         24     .02        .298        .012
        85 cache buffers chains           427F6E04         22         22     .02        .273        .012
        85 cache buffers chains           41B78D94         22         22     .02        .273        .012
        85 cache buffers chains           41BA2220         21         21     .02        .260        .012
        85 cache buffers chains           41BEDA68         21         21     .02        .260        .012
        85 cache buffers chains           41BC8D1C         21         21     .02        .260        .012
        85 cache buffers chains           41B486BC         20         20     .02        .248        .012
        85 cache buffers chains           42B99698         19         19     .02        .236        .012
        85 cache buffers chains           41AE6EA8         19         14     .02        .236        .017
        85 cache buffers chains           41B1F230         19         19     .02        .236        .012
        85 cache buffers chains           41B09E44         19         17     .02        .236        .014
        85 cache buffers chains           41AE7A88         19         19     .02        .236        .012
        85 cache buffers chains           41B69E98         18         18     .02        .223        .012
[...snip...]
And voila, we see that SID 81 is hitting one single child latch in maniac fashion, while SID 85 is behaving more normally, just hitting lots of different CBC latches (you can see that its number of gets on any single child during the 100k samples doesn't exceed 34). Interesting.
So now that we know the child latch address (LADDR), we can refine our query to monitor only that child latch, to see whether any other sessions are affected:
SQL> @latchprof sid,name,laddr % 41AACEEC 100000

       SID NAME                           LADDR          Held       Gets  Held %     Held ms Avg hold ms
---------- ------------------------------ -------- ---------- ---------- ------- ----------- -----------
        81 cache buffers chains           41AACEEC      14058      14047   14.06     177.131        .013
        85 cache buffers chains           41AACEEC         18         18     .02        .227        .013
As of now it looks like no one else except the maniac SID 81 and the victim SID 85 contends for that latch (otherwise one would expect to see at least a few successful holds against it by other sessions).
So, as SID 81 is so evidently the troublemaker here, let’s check what it’s doing:
SQL> select sql_text from v$sql where hash_value = (select sql_hash_value from v$session where sid = 81);

SQL_TEXT
---------------------------------------------------------------------------------------------------------
select count(*) X from kill_cpu connect by n > prior n start with n = 1

1 row selected.
This is Jonathan Lewis's kill_cpu script in action, which hammers the same data block in a tight loop using CONNECT BY (you need to set _old_connect_by_enabled=true on 9i+ for this test to work :)
I will blog about some more use cases and an extended version of the LatchProf script in the (hopefully) near future.
Note once more that this script only detects latch holders. The latch held the most is not necessarily the latch waited for the most. You still need to start your diagnosis from the wait interface, using Snapper or WaitProf for example, and only if these tools show significant latch waits should you drill down into latches further. And as Anjo Kolk recently (very well) put it, latch contention is a symptom, not the problem. The root cause is somewhere else (likely in the application); once you fix that, the latch contention will go away by itself.
Update: I have published a more detailed latch profiling tool (latchprofX) which uses X$ tables for getting the extended data and can give even more detailed info about the latch holders. Read this article plus links in it:
/2009/08/27/latch-cache-buffers-chains-latch-contention-a-better-way-for-finding-the-hot-block/