In my last AOT post I published my LatchProf script, which can sample detailed latch holder data from V$LATCHHOLDER.
LatchProf allows you to drill down into your latching problems at session level (which V$LATCH, V$LATCH_PARENT and V$LATCH_CHILDREN can't do). It gives you valuable details about the individual sessions that are holding a latch the most and are therefore likely contributing the most to the latch contention problem.
However, after you have discovered the troublemaking session, what next? One way forward is to look into V$SESSTAT counters using the Snapper tool. Depending on which latch is the problematic one, you would look for different stats: various buffer get statistics for cache buffers chains latches, or parsing/executing statistics when looking into library cache latches. However, if those stats look "normal", is there any other way to drill down further?
Yeah, there is, so let's look into it!
A side comment: nowadays I try to avoid using X$ tables in favor of V$ views (because in my experience you don't always have easy SYS access, or X$ table "proxy views" created, in many client environments). Also, if you can get your job done using simple V$ views, then why bother with the more complex X$ tables? (Yeah, it's much cooler, I know ;)
Anyway, here is an exception: the LatchProfX script, which stands for Latch holder Profiler eXtended, because it provides some very interesting extended information about latch holders: it shows why the currently held latch was taken in the first place. It uses X$KSUPRLAT instead of V$LATCHHOLDER, as the latter doesn't externalize all the required information. I'm talking about latch Where statistics.
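For comparison, here is roughly all that V$LATCHHOLDER gives you (a quick illustrative query; note that there is no "where" information in it, and the GETS column exists in 10g+ only, as mentioned at the end of this post):

-- V$LATCHHOLDER only externalizes who holds which latch, not why it was taken
select pid, sid, laddr, name, gets
from   v$latchholder;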
By “Where statistics” I mean that Oracle kernel’s latching module coders have instrumented their code well enough – whenever they get a latch, they mark down the code location ID of the latch getter within the latch structure itself.
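As a quick illustration of that instrumentation, you can list those code locations and their action labels straight from X$KSLLW (a sketch, assuming you have SYS access to X$ tables; the kgh% filter and the row limit are just examples):

-- A sketch: list some instrumented latch get locations ("where" opcodes)
-- and their action labels
select ksllwnam, ksllwlbl
from   x$ksllw
where  lower(ksllwnam) like 'kgh%'
and    rownum <= 20;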
Let's not keep you waiting any longer and show some examples. The output is quite wide, so you may need to size your browser window accordingly.
Normally you would run LatchProfX with specific enough parameters to monitor only the one or few latches which you know are the troublemakers. However, I do the first example run across all instance latches (the third parameter is %) to better illustrate the usefulness of the LatchProfX result data.
The LatchProfX syntax is similar to the original LatchProf script, with the addition of the ksllwnam and ksllwlbl field names you can sample. These are exactly the fields (from X$KSLLW) which give us the descriptions of the "where in the kernel was the latch acquired from" opcodes we get from X$KSUPRLAT (see the script itself for syntax and internals).
So (finally), let's run it:
SQL> @latchprofx name,ksllwnam,ksllwlbl % % 100000

-- LatchProfX 1.06 by Tanel Poder ( )

NAME                                KSLLWNAM                                 KSLLWLBL                   Held       Gets  Held %     Held ms Avg hold ms
----------------------------------- ---------------------------------------- -------------------- ---------- ---------- ------- ----------- -----------
shared pool                         kghalo                                                              9705       5987    9.71     185.366        .031
ges resource hash list              kjuscl: lock close request                                          2429       2350    2.43      46.394        .020
kks stats                           kksAllocChildStat                                                   2001       1460    2.00      38.219        .026
ges resource hash list              kjrmas1: lookup master node                                         1780       1724    1.78      33.998        .020
shared pool                         kghupr1                                  Chunk Header               1618        891    1.62      30.904        .035
row cache objects                   kqrpre: find obj                                                     988        986     .99      18.871        .019
shared pool simulator               kglsim_unpin_simhp                                                   804         97     .80      15.356        .158
ges process parent latch            kjlalc: lock allocation from lmon                                    779        766     .78      14.879        .019
active service list                 kswsgetso: get service object                                        663        658     .66      12.663        .019
shared pool                         kghalp                                                               635        628     .64      12.129        .019
ges process parent latch            kjatioc                                                              518        518     .52       9.894        .019
shared pool simulator               kglsim_upd_newhp                                                     402        401     .40       7.678        .019
ges resource hash list              kjrref: find matched resource                                        374        371     .37       7.143        .019
row cache objects                   kqreqd                                                               360        360     .36       6.876        .019
ges process parent latch            kjlrem: detach lock from group                                       235        229     .24       4.489        .020
enqueues                            ksqdel                                                               162        162     .16       3.094        .019
enqueues                            ksqgel: create enqueue                   parent obj                   148        148     .15       2.827        .019
row cache objects                   kqreqd: reget                                                        108        105     .11       2.063        .020
enqueue hash chains                 ksqrcl                                   resource addr                101        101     .10       1.929        .019
shared pool                         kgh_heap_sizes                                                        99         99     .10       1.891        .019
enqueue hash chains                 ksqgtl3                                  acquiring session             93         93     .09       1.776        .019
ges enqueue table freelist          kjlalc: lock allocation                                               85         83     .09       1.624        .020
shared pool                         kghfre                                   Chunk Header                 74         65     .07       1.413        .022
shared pool simulator               kglsim_scan_lru: scan                                                 69         69     .07       1.318        .019
ges process parent latch            kjucll: closing lock                                                  68         61     .07       1.299        .021
shared pool simulator               kglsim_chg_simhp_free                                                 48         47     .05        .917        .020
ges domain table                    kjdmlad: add a lock to an xid-hashed loc                              42         41     .04        .802        .020
shared pool simulator               kglsim_chg_simhp_noob                                                 41         41     .04        .783        .019
shared pool simulator               kglsim_unpin_simhp: fast path                                         39         39     .04        .745        .019
ksuosstats global area              ksugetosstat                                                          29          1     .03        .554        .554
ges group table                     kjgdgll: move a lock from group lock lis                              13          6     .01        .248        .041
kokc descriptor allocation latch    kokcdlt: regular free latch                                           11          4     .01        .210        .053
ges enqueue table freelist          kjlfr: remove lock from parent object                                 11         11     .01        .210        .019
kokc descriptor allocation latch    kokcdlt: allocation latch                                              5          4     .01        .096        .024
cache buffers chains                kcbgtcr: fast path                                                     3          3     .00        .057        .019
ges caches resource lists           kjrchc: cached obj cleanup                                             2          2     .00        .038        .019
process allocation                  ksufap: active procs                                                   2          2     .00        .038        .019
ges group table                     kjgagll: move a lock from process lock l                               2          2     .00        .038        .019
ges domain table                    kjdmlrm: remove a lock from its xid-hash                               1          1     .00        .019        .019
ges caches resource lists           kjchc: clean resource cache                                            1          1     .00        .019        .019
ges resource hash list              kjruch: move resource from cache to free                               1          1     .00        .019        .019
name-service memory objects         kjxgsnep: get a value-block object                                     1          1     .00        .019        .019
active service list                 kswsite: service iterator                                              1          1     .00        .019        .019

43 rows selected.

Elapsed: 00:00:02.06
What do we see? While LatchProf would just have provided latch holder info about which (child) latch was held and by which SID, LatchProfX additionally shows you why this latch was taken.
From the above example you see that there were various reasons why the shared pool latch was taken, but the reason which caused the SP latch to be held for the most time was the code location kghalo, which stands for something like Kernel Generic Heap Allocate. This is essentially the function used for searching for and allocating free space in the shared pool. This kind of makes sense, as operations like freeing a chunk or updating global heap statistics are simple (Oracle already knows the memory location it needs to update), but searching for free chunks for allocation can require scanning through a lot of chunks if there are no suitable ones in the freelists and an LRU scan is needed.
Next, there's the kghupr1 location, which has caused shared pool latches to be held for 1.62% of the total sampling time. This location, in function kghupr (un-pin recreatable), needs to take the shared pool latch for putting an unpinned chunk back onto the heap LRU list. An Oracle kernel coder has also specified a label (action description), "Chunk Header", which is seen in the KSLLWLBL column. It's there to state that this kernel location takes the shared pool latch because it needs to update a chunk header: putting a chunk back onto the LRU list means updating the chunk header (and its neighbors' headers).
There are also the kgh_heap_sizes and kghfre locations, which update global heap statistics (the stuff you see in v$sgastat, for example) and free a chunk in the heap, respectively.
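By the way, if you want to see what those global heap statistics look like, plain V$SGASTAT shows the shared pool allocations that these locations maintain (a simple illustrative query):

-- the global shared pool heap statistics mentioned above
select name, bytes
from   v$sgastat
where  pool = 'shared pool'
order by bytes desc;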
Once you have established which latch is the troublemaker (under too much contention) and you know the "why" reason behind it, you probably also want to know the sessions contributing to the problem. And LatchProfX can show that:
SQL> @latchprofx sid,name,ksllwnam,ksllwlbl % "shared pool" 100000

-- LatchProfX 1.06 by Tanel Poder ( )

       SID NAME                                KSLLWNAM                                 KSLLWLBL                   Held       Gets  Held %     Held ms Avg hold ms
---------- ----------------------------------- ---------------------------------------- -------------------- ---------- ---------- ------- ----------- -----------
       113 shared pool                         kghalo                                                              5038       2787    5.04      92.699        .033
       112 shared pool                         kghalo                                                              4723       2820    4.72      86.903        .031
       113 shared pool                         kghupr1                                  Chunk Header                875        369     .88      16.100        .044
       113 shared pool simulator               kglsim_unpin_simhp                                                   728         48     .73      13.395        .279
       112 shared pool                         kghupr1                                  Chunk Header                623        438     .62      11.463        .026
       112 shared pool simulator               kglsim_unpin_simhp                                                   273         37     .27       5.023        .136
       112 shared pool                         kghalp                                                               232        232     .23       4.269        .018
       113 shared pool simulator               kglsim_upd_newhp                                                     230        230     .23       4.232        .018
       113 shared pool                         kghalp                                                               224        223     .22       4.122        .018
       112 shared pool simulator               kglsim_upd_newhp                                                     194        194     .19       3.570        .018
       112 shared pool                         kgh_heap_sizes                                                        49         49     .05        .902        .018
       113 shared pool                         kgh_heap_sizes                                                        49         49     .05        .902        .018
       112 shared pool                         kghfre                                   Chunk Header                 37         37     .04        .681        .018
       113 shared pool simulator               kglsim_scan_lru: scan                                                 32         32     .03        .589        .018
       112 shared pool simulator               kglsim_scan_lru: scan                                                 28         27     .03        .515        .019
       112 shared pool simulator               kglsim_chg_simhp_free                                                 27         27     .03        .497        .018
       113 shared pool simulator               kglsim_chg_simhp_free                                                 26         25     .03        .478        .019
       113 shared pool                         kghfre                                   Chunk Header                 24         23     .02        .442        .019
       112 shared pool simulator               kglsim_unpin_simhp: fast path                                         23         23     .02        .423        .018
       113 shared pool simulator               kglsim_chg_simhp_noob                                                 22         22     .02        .405        .018
       112 shared pool simulator               kglsim_chg_simhp_noob                                                 18         18     .02        .331        .018
       113 shared pool simulator               kglsim_unpin_simhp: fast path                                         18         18     .02        .331        .018
       112 shared pool sim alloc               kglsim_chk_objlist: alloc                                              1          1     .00        .018        .018
       122 shared pool                         kghfre                                   Chunk Header                  1          1     .00        .018        .018

24 rows selected.

Elapsed: 00:00:01.93

SQL>
From the above you see there are 2 main sessions contributing to the shared pool latch contention, and both are doing lots of kghalo calls, thus allocating memory. If V$SESSTAT for those sessions showed a high "parse count (hard)" statistic, it would be obvious that the heavy shared pool activity is due to hard parsing (which is kind of common knowledge, as excessive hard parsing is very often the reason for shared pool latch contention). However, check out another sample with the same parameters:
SQL> @latchprofx sid,name,ksllwnam,ksllwlbl % "shared pool" 100000

-- LatchProfX 1.06 by Tanel Poder ( )

       SID NAME                                KSLLWNAM                                 KSLLWLBL                   Held       Gets  Held %     Held ms Avg hold ms
---------- ----------------------------------- ---------------------------------------- -------------------- ---------- ---------- ------- ----------- -----------
         0 shared pool                         kghalo                                                             10714          5   10.71     212.137      42.427
       122 shared pool                         kghalo                                                              7685         42    7.69     152.163       3.623
       112 shared pool                         kghalo                                                              3874       2305    3.87      76.705        .033
       113 shared pool                         kghalo                                                              3616       2318    3.62      71.597        .031
       112 shared pool                         kghupr1                                  Chunk Header                659        327     .66      13.048        .040
       113 shared pool                         kghupr1                                  Chunk Header                597        353     .60      11.821        .033
       112 shared pool simulator               kglsim_unpin_simhp                                                   416         29     .42       8.237        .284
[...snip...]
       122 shared pool simulator               kglsim_upd_newhp                                                       2          2     .00        .040        .020
       117 shared pool                         kghupr1                                  Chunk Header                  2          2     .00        .040        .020
       122 shared pool                         kghalp                                                                 2          2     .00        .040        .020
       117 shared pool                         kghalp                                                                 1          1     .00        .020        .020

38 rows selected.

Elapsed: 00:00:02.14

SQL>
What's this? Why is there a session with SID=0 which has called the kghalo function and held the latch much longer on average than any other session? LatchProfX has detected only 5 latch gets by that session from the kghalo location, yet these have taken the most time in total. And the average hold time for that latch is estimated at over 42 milliseconds, compared to most of the others, which are in the microsecond range (even though the 3.6 millisecond average hold time for a latch is quite long too).
One of the reasons for such long hold times could be CPU preemption: a latch holder process was switched off the CPU and it took a while before it got back onto it to complete its work and release the latch again. However, on my 4 CPU test machine there was plenty of CPU time available, so such preemption should not have been the problem. The SID 0 gives a clue. When a new session is started in Oracle, it's apparently reported as SID 0 in some places during the session initialization sequence, which includes the memory allocation for things like the v$sesstat and v$session_event arrays and also the session parameter values and set events in some cases. Considering that the total size of these structures is quite large in a normal shared pool usage context (tens of kB), if the freelists are empty and an LRU scan must happen, the allocation inevitably takes longer, and the shared pool latch is held all that time. (Phew, I must write shorter sentences.)
Anyway, I recommend playing around with this script in a test environment: run different workloads like hard parses, soft parses, physical IOs and buffer gets, and you should get quite a good overview of some of Oracle's internal workings that way. This hopefully also answers how I know some of the Oracle function names and their purpose: Oracle itself exposes quite a lot of this information in the X$ tables I'm using in my script. You just need to experiment, observe and try to figure out what's really behind function names like kghalo etc. Fun :)
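For example, here is a trivial hard parse driver for such a test (a sketch for a test environment only; every execution uses a unique literal, so each statement gets hard parsed, which should light up the kghalo locations seen above):

-- Test-only sketch: generate hard parses with unique literals while sampling
-- the shared pool latch with LatchProfX from another session
declare
  v number;
begin
  for i in 1 .. 100000 loop
    execute immediate 'select ' || i || ' from dual' into v;
  end loop;
end;
/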
Few more notes:
If you want to drill down to latch child level, then use LADDR as another field (among SID, NAME etc.) in the first parameter to LatchProfX (I've described this in AOT guide part 7); an example invocation is sketched at the end of these notes.
Do not use LatchProf, LatchProfX and WaitProf when your server’s CPUs are 100% utilized, for two reasons:
- These scripts never sleep during their sampling but spin on CPU, thus they would make the CPU starvation even worse
- More importantly, if all your CPUs are running at 100% utilization, the latch contention you see is probably just a symptom of the CPU starvation itself! The same goes for wait events as well: when your CPU runqueues are long, some of the CPU runqueue waiting time can be attributed to wait events such as IO instead (for example, when an IO completes but the process can't get back onto the CPU to "end" the IO wait)
I have updated both LatchProf and LatchProfX to detect whether they're running on Oracle 9i or 10g+ and dynamically run different code based on the version (on 9i, there is no GETS column in V$LATCHHOLDER).
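And returning to the child latch note above, an example invocation with LADDR added could look like this (a sketch; the cache buffers chains latch name and the sample count are just for illustration):

SQL> @latchprofx sid,name,laddr,ksllwnam % "cache buffers chains" 100000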