I have been too busy since getting back from vacation, thus no posts for a while. But I hope the wait was worthwhile, as I present you LatchProf, a tool for digging into latch contention problems – using plain SQL and sqlplus!
As I'm still busy, I'll keep it short.
LatchProf is a script similar to WaitProf, only it samples latch holder statistics from V$LATCHHOLDER. As V$LATCHHOLDER contains a SID column (the session ID of a latch holder), it becomes possible to find out who is hitting a latch the most (and, for example, to prove that crappy monitoring tools which constantly scan through V$SQL DO cause library cache latch contention themselves).
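To get a feel for the raw data LatchProf is built on, here is a minimal single-snapshot query against V$LATCHHOLDER (just a sketch of the idea, not the actual LatchProf code – the real script effectively repeats this sampling tens of thousands of times in a tight loop and aggregates the results):

```sql
-- One point-in-time snapshot of current latch holders.
-- SID   = session holding the latch
-- LADDR = address of the (child) latch being held
-- NAME  = latch name
select sid, laddr, name
  from v$latchholder;
```

A single execution only shows who happens to hold which latch at one instant; it is the high-frequency repetition and aggregation that turns these snapshots into meaningful statistics.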
Some day I will write about this at greater length (and go over the script itself), but here's the syntax:
@latchprof <what> <sid|%> <latch|%|childaddr> <#samples>
Here are the parameter definitions:
- <what>: the columns to select. These map directly to V$LATCHHOLDER columns, so you can use SID,NAME,LADDR there, for example
- <sid|%>: either a session ID as a number, or % to see latches held by all SIDs
- <latch|%|childaddr>: either % for all latches, a LIKE pattern such as %library% to see all latches with "library" in their name, or an actual child latch address. This allows us to monitor only the specific child latches we are interested in
- <#samples>: the number of samples to take. I normally use 100 000 samples as a starting point
As LatchProf (and WaitProf, for that matter) never sleeps, it burns as much CPU as it can get. This is good for sampling accuracy (well, at least on multiprocessor systems with enough spare CPU capacity), but it will consume one of your CPUs, so you don't want to run it for too long at a time (you could implement a short wait in there with a nice trick by Jonathan Lewis).
100k samples usually complete in around a second on my test kit (giving a 100+ kHz sampling rate :) This will vary with your hardware, your Oracle version and your PROCESSES parameter (as V$LATCHHOLDER iterates through process state objects – the more processes you've got, the longer it takes).
Anyway, let's say we have a performance issue and, using the wait interface, we have determined that for a particular session, waits for cache buffers chains latches are unreasonably high.
Now, one way forward would be to identify the database blocks involved in this problem, by looking at which blocks are protected by the latch (using the X$BH.HLADDR = V$LATCH_CHILDREN.ADDR mapping) and trying to figure out the hottest block by its X$BH.TCH touch count. But where would that lead us? We would find out which segment the block belongs to, but it doesn't really help us understand who the troublemaker causing this issue is.
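For completeness, here is a hedged sketch of that block-level approach (run as SYS, since X$ tables are involved; &child_latch_addr is a placeholder for a V$LATCH_CHILDREN.ADDR value you want to examine):

```sql
-- Which buffers hang off a given cache buffers chains child latch,
-- ordered by touch count (TCH) to spot the hottest block.
-- X$BH.HLADDR is the protecting child latch address;
-- X$BH.OBJ maps to DBA_OBJECTS.DATA_OBJECT_ID.
select bh.tch, bh.file#, bh.dbablk, o.owner, o.object_name
  from x$bh bh, dba_objects o
 where bh.hladdr = hextoraw('&child_latch_addr')
   and bh.obj    = o.data_object_id
 order by bh.tch desc;
```

As argued above, though, this tells you *what* is hot, not *who* is making it hot.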
Therefore we use another approach – sampling V$LATCHHOLDER to see which sessions hold the latches the most:
Just for demo purposes I start with listing general latchholder stats (not broken down by holding SID):
SQL> @latchprof name % % 100000

NAME                                 Held       Gets  Held %     Held ms Avg hold ms
------------------------------ ---------- ---------- ------- ----------- -----------
cache buffers chains                21248      17418   21.25     267.725        .015
simulator lru latch                  7192       7169    7.19      90.619        .013
simulator hash latch                   66         66     .07        .832        .013
enqueue hash chains                     4          4     .00        .050        .013
ges resource hash list                  3          3     .00        .038        .013
process allocation                      2          2     .00        .025        .013
library cache                           2          2     .00        .025        .013
name-service namespace bucket           2          2     .00        .025        .013
name-service memory objects             1          1     .00        .013        .013
ges caches resource lists               1          1     .00        .013        .013
library cache pin                       1          1     .00        .013        .013

11 rows selected.
Yes, it looks like cache buffers chains latches are held the most. The Held column shows during how many samples a latch was held. The Gets column shows roughly how many times that latch was gotten during the sampling time.
As we took 100,000 samples and during 21,248 of them a cache buffers chains latch was held, we can conclude that this latch was busy (held) 21.25% of the time. And as the sampling lasted about 1.3 seconds this time, we can extrapolate that out of those 1.3 seconds, cache buffers chains latches were in held state for roughly 267 milliseconds. And as we know the number of gets, we can compute the average latch hold time in milliseconds. This can help us detect very long latch holds, like what can happen when someone is scanning through X$KSMSP or the server is CPU starved. The current estimate of 15 microseconds per latch hold is not that bad (note that with very few samples, like 10,000 or less, the avg hold time estimates become unreliable).
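The arithmetic behind those derived columns can be reproduced by hand (assuming here that the 100k samples took about 1.26 seconds of wall-clock time, which is what the numbers above imply):

```sql
-- Held %      = held samples / total samples * 100
-- Held ms     = held samples * (elapsed ms / total samples)
-- Avg hold ms = Held ms / gets
select 21248 / 100000 * 100           as held_pct,    -- 21.25
       21248 * 1260 / 100000          as held_ms,     -- ~267.7
       21248 * 1260 / 100000 / 17418  as avg_hold_ms  -- ~.015 (15 microseconds)
  from dual;
```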
Anyway, let’s break down the individual latch holding sessions now:
SQL> @latchprof sid,name % % 100000

       SID NAME                                 Held       Gets  Held %     Held ms Avg hold ms
---------- ------------------------------ ---------- ---------- ------- ----------- -----------
        81 cache buffers chains                14916      14915   14.92     187.942        .013
        81 simulator lru latch                  6797       6745    6.80      85.642        .013
        85 cache buffers chains                 6195       2284    6.20      78.057        .034
        85 simulator lru latch                   524        506     .52       6.602        .013
        81 simulator hash latch                  121         85     .12       1.525        .018
       112 name-service namespace bucket           2          2     .00        .025        .013
       100 enqueue hash chains                     2          2     .00        .025        .013
        98 library cache                           2          2     .00        .025        .013
       112 process allocation                      2          2     .00        .025        .013
       112 ges caches resource lists               1          1     .00        .013        .013
       112 enqueues                                1          1     .00        .013        .013
       104 messages                                1          1     .00        .013        .013
       100 cache buffers chains                    1          1     .00        .013        .013
       100 active service list                     1          1     .00        .013        .013
       100 enqueues                                1          1     .00        .013        .013
        85 simulator hash latch                    1          1     .00        .013        .013
       100 library cache                           1          1     .00        .013        .013

17 rows selected.
We see that sessions 81 and 85 are hitting CBC latches the most. Let's say (for demo purposes again) that SID 85 was the one which experienced bad performance and led us to investigate this issue.
As there are tens of thousands (or even more) of CBC latches in a decent Oracle database, looking only at aggregated latch-name-level stats is of course not enough. Do the sessions compete for a single child latch? Or do they just use lots of different CBC latches over time and never collide at all? This is the kind of question LatchProf can help answer.
Let's include the latch address (child latch address) in the picture and query only "cache"-related latches. This returns lots of rows, so I normally press CTRL+C after the first pageful of results, as those are the most important ones (the results are ordered with the busiest latches first):
SQL> @latchprof sid,name,laddr % cache 100000

       SID NAME                           LADDR          Held       Gets  Held %     Held ms Avg hold ms
---------- ------------------------------ -------- ---------- ---------- ------- ----------- -----------
        81 cache buffers chains           41AACEEC      15061      15017   15.06     186.756        .012
        85 cache buffers chains           41B2C37C         34         34     .03        .422        .012
        85 cache buffers chains           41B85D64         31         31     .03        .384        .012
        85 cache buffers chains           41BCF7FC         30         30     .03        .372        .012
        85 cache buffers chains           41B415EC         28         28     .03        .347        .012
        85 cache buffers chains           41BCA658         25         25     .03        .310        .012
        85 cache buffers chains           41B4E738         25         25     .03        .310        .012
        85 cache buffers chains           41AE4694         25         25     .03        .310        .012
        85 cache buffers chains           41AB02E0         25          9     .03        .310        .034
        85 cache buffers chains           427FA1F8         24         24     .02        .298        .012
        85 cache buffers chains           427F6E04         22         22     .02        .273        .012
        85 cache buffers chains           41B78D94         22         22     .02        .273        .012
        85 cache buffers chains           41BA2220         21         21     .02        .260        .012
        85 cache buffers chains           41BEDA68         21         21     .02        .260        .012
        85 cache buffers chains           41BC8D1C         21         21     .02        .260        .012
        85 cache buffers chains           41B486BC         20         20     .02        .248        .012
        85 cache buffers chains           42B99698         19         19     .02        .236        .012
        85 cache buffers chains           41AE6EA8         19         14     .02        .236        .017
        85 cache buffers chains           41B1F230         19         19     .02        .236        .012
        85 cache buffers chains           41B09E44         19         17     .02        .236        .014
        85 cache buffers chains           41AE7A88         19         19     .02        .236        .012
        85 cache buffers chains           41B69E98         18         18     .02        .223        .012
[...snip...]
And voila, we see that SID 81 is hitting one single child latch in maniac fashion, while SID 85 is behaving more normally, just hitting lots of different CBC latches (you can see that its number of gets on any single child during the 100k samples doesn't exceed 34). Interesting.
So now that we know the child latch address (LADDR), we can refine our query to monitor only that child latch, to see whether any other sessions are affected:
SQL> @latchprof sid,name,laddr % 41AACEEC 100000

       SID NAME                           LADDR          Held       Gets  Held %     Held ms Avg hold ms
---------- ------------------------------ -------- ---------- ---------- ------- ----------- -----------
        81 cache buffers chains           41AACEEC      14058      14047   14.06     177.131        .013
        85 cache buffers chains           41AACEEC         18         18     .02        .227        .013
As of now it looks like no one else except the maniac SID 81 and the victim SID 85 contends for that latch (otherwise one would expect to see at least a few successful holds against it by other sessions).
So, as SID 81 is so evidently the troublemaker here, let’s check what it’s doing:
SQL> select sql_text from v$sql where hash_value = (select sql_hash_value from v$session where sid = 81);

SQL_TEXT
---------------------------------------------------------------------------------------------------------
select count(*) X from kill_cpu connect by n > prior n start with n = 1

1 row selected.
This is Jonathan Lewis's kill_cpu script in action, which hammers the same data block in a tight loop using CONNECT BY (you need to set _old_connect_by_enabled=true on 9i+ for this test to work :)
I will blog about some more use cases and an extended version of the LatchProf script in the (hopefully) near future.
Note once more that this script only detects latch holders. The latch held the most is not necessarily the latch waited for the most. You still need to start your diagnosis from the wait interface, using Snapper or WaitProf for example, and only if these tools show significant latch waits should you drill down into latches further. And as Anjo Kolk recently (very well) put it, latch contention is a symptom, not the problem. The root cause is somewhere else (likely in the application); once you fix that, the latch contention will go away by itself.
Update: I have published a more detailed latch profiling tool (latchprofX) which uses X$ tables for getting the extended data and can give even more detailed info about the latch holders. Read this article plus links in it:
/2009/08/27/latch-cache-buffers-chains-latch-contention-a-better-way-for-finding-the-hot-block/