Here's a treat for Oracle performance professionals and geeks who are looking for more systematic ways for cache buffers chains (CBC) latch contention troubleshooting. Cache buffers chains latches are taken when a process wants to walk through a cache buffer hash chain, looking if the block with required DBA (data block address) is in buffer cache. If the block happens to be in cache, then in most cases it has to be pinned first before use and unpinned after use, to make sure no-one else can perform an incompatible operation on that block at the same time. The modification of pin structures (pinning/unpinning) is also protected by CBC latches.
CBC latch contention can happen for multiple reasons, but one reason is that there is some really hot block in a SMP system with high number of CPUs (or CMT system with high number of threads like Sun T-series servers). Sometimes there happen to be multiple moderately hot blocks "under" the same CBC latch, which can result in latch contention again.
Traditionally DBAs used to look up the child latch address from V$SESSION_WAIT, sql_trace output or ASH and then look up all buffers protected by that latch from X$BH, using HLADDR column (HLADDR stands for Hash Latch Address). I also have a script for that, bhla.sql (Buffer Headers by Latch Address), which reports me all blocks currently in buffer cache, "under" that particular latch and the corresponding data block addresses and object names:
SQL> @bhla 27E5A780 <-- latch address reported by wait interface (Parameter1 in latch: cache buffers chains wait event) FLG_LRUFLG OBJ OBJECT_TYPE object TCH DBA ------------------- ---------- ------------------- ---------------------------------------- ---------- --------- 80000:8 2 TABLE SYS.SUBCOLTYPE$ 0 1 7873 80000:8 2 TABLE SYS.ATTRCOL$ 0 1 7873 0:0 50472 INDEX SYSMAN.MGMT_JOB_EXEC_IDX04 1 3 27699 0:0 50472 INDEX SYSMAN.MGMT_JOB_EXEC_IDX04 1 3 27699 0:4 50472 INDEX SYSMAN.MGMT_JOB_EXEC_IDX04 1 3 27699 2202000:8 3710 TABLE SYS.WRI$_ADV_REC_ACTIONS 1 3 972 0:0 50472 INDEX SYSMAN.MGMT_JOB_EXEC_IDX04 1 3 27699 0:0 50472 INDEX SYSMAN.MGMT_JOB_EXEC_IDX04 1 3 27699 2202000:8 94227 TABLE PARTITION SYS.WRH$_ROWCACHE_SUMMARY 2 3 36142 80000:8 54880 INDEX PERFSTAT.STATS$UNDOSTAT_PK 14 4 118331 0:8 37 INDEX SYS.I_OBJ2 24 1 55591 2000:8 3680 TABLE SYS.WRI$_ADV_TASKS 26 3 739 0:8 75 TABLE SYS.IDL_UB2$ 35 1 11745 0:8 73 TABLE SYS.IDL_UB1$ 38 1 12211
From above we see that this child latch is protecting many blocks in buffer cache. So which one is the troublemaker? This is where the touchcount (TCH column) has been traditionally used, the higher the TCH the hotter the buffer, right?
Actually this may not always be true. For example if there is a somewhat hot block which has been steadily in buffer cache for weeks, it may have very high touch count. Now if there is an extremely hot block used by every morning’s batch job (and causing the trouble) but not used later on in the day, this block will get aged out later in the day. Once a block is aged out, its metadata, including TCH is gone! So when the block is loaded back in next morning (causing trouble again), its touch count starts from zero again.
So, by just looking into touchcount numbers you can’t always reliably detect who’s the current troublemaker. One option here would be to query the touchcounts, then wait for a minute, query the touchcounts again and see who’s touchcount increased the most during the minute. This would give a bit better picture about who is causing trouble now.
But still, it would not be always reliable for another reason – touchcounts are incremented only after 3 seconds have passed since last increment! This factor has been coded in to avoid situation such a short but crazy nested loop join hammering a single buffer hundreds of thousands of times in few seconds and then finishing. The buffer wouldn’t be hot anymore but the touchcount would be hundreds of thousands due a single SQL execution. So, unless 3 seconds (of SGA internal time) has passed since last TCH update, the touchcounts would not be increased during buffer access.
This time is controlled by SGA variable kcbatt_ by the way:
SQL> oradebug dumpvar sga kcbatt_ ub4 kcbatt_ [3C440F4, 3C440F8) = 00000003
This 3-second delay leaves us in the following situation, let say there are 2 blocks protected by a CBC child latch:
One block has been accessed once every 3 seconds for 24 hours in a row. A block accessed once per 3 seconds is definitely not a hot block, but its touchcount would be around 28800 (86400 seconds per 24 hours / 3 = 28800).
And there is another block which is accessed crazily for 2 seconds in a row and this happens every 10 seconds. 2 seconds of consecutive access would increase the touchcount by 1. If such access pattern has been going on every 10 seconds over last 24 hours, then the touch count for that buffer would be 86400 / 10 = 8640.
In the first case we can have a very cold block with TCH = 28800 and in second case a very hot block with TCH = 8640 only and this can mislead DBAs to fixing the wrong problem.
Luckily this problem can be solved with LatchProfX :-)
If you don’t know what LatchProfX is, you may want to read these articles first - LatchProf gives you a better way for latch contention troubleshooting compared to old approaches:
- /2008/07/09/advanced-oracle-troubleshooting-guide-part-7-sampling-latch-holder-statistics-using-latchprof/
- /2008/07/23/advanced-oracle-troubleshooting-guide-part-8-even-more-detailed-latch-troubleshooting-using-latchprofx/
- /2009/03/20/another-latchprofx-use-case/
And now I introduce a new feature in LatchProfX.
In addition to being able to report the SID of the latch holder and the kernel function why this latch was taken by the session, the newest LatchProfX can also report the object what was protected by the latch under investigation. For cache buffers chains the object is data block address of the block being accessed.
Here’s an example, let say I have CBC latch contention problems and from monitoring tools or sql_trace I have identified the hot latch, I use my sw.sql script here, to report what a session is waiting for:
SQL> @sw 138 SID STATE EVENT SEC_IN_WAIT P1 ------- ------- ---------------------------- ----------- ------------------ 138 WAITING latch: cache buffers chains 0 address=0x27E5A780
Sw reports that this session is waiting on cache buffer chains child latch with address 0x27E5a780.
Now we can run latchprofx and report sid, name, hold mode and protected object for that child latch only (instead of latch name we specified its address):
SQL> @latchprofx sid,name,hmode,object % 27E5A780 100000 SID NAME HMODE OBJECT Held Gets Held % Held ms ---- ---------------------- ------------ ------- ------ ----- ------- -------- 138 cache buffers chains exclusive 40EB02 3928 3797 3.93 41.637 151 cache buffers chains exclusive 40EB02 3711 3660 3.71 39.337 138 cache buffers chains shared 40EB02 623 623 .62 6.604 151 cache buffers chains shared 40EB02 544 544 .54 5.766
As you see, LatchProfX has reported a data block address in the object column. This is the datablock we are trying to access in buffer cache under protection of the given child latch.
Lets see what this DBA translates to:
SQL> @dba 40EB02 RFILE# BLOCK# ---------- ---------- 1 60162 Press enter to find the segment using V$BH (this may take CPU time), CTRL+C to cancel: STATE BLOCK_CLASS OBJECT_TYPE object TCH MODE_HELD ---------- ------------------ ------------------- ---------------------------------------- ---------- ---------- xcur data block INDEX SYS.SYS_IOT_TOP_94276 331 1 Press enter to query what segment resides there using DBA_EXTENTS (this can be IO intensive), CTRL+C to cancel: OWNER SEGMENT_NAME PARTITION_NAME TABLESPACE_NAME ------------------------------ -------------------- --------------- ----------------- SYS SYS_IOT_TOP_94276 SYSTEM
So, the troublemaker is an IOT index segment block belonging to table with object_id 94276. I can find the table name from dba_objects by object_id, or by using my script:
SQL> @oid 94276 owner object_name object_type CREATED ------------------------- ------------------------------ ------------------ ----------------- SYS KILL_CPU TABLE 20090825 23:19:49
Also, as we know the segment name and relative file number (RFILE#) and block number (BLOCK#) from above output we can dump the datablock with an ALTER SYSTEM DUMP DATAFILE command.
As that command requires absolute file id (not relative) we have to do a little conversion first (in cases where there are multiple files with same relative file number in database):
SQL> select file_id from dba_extents where relative_fno = 1 and segment_name = 'SYS_IOT_TOP_94276'; FILE_ID ---------- 1
Of course in my little test case all relative file numbers match with absolute numbers (but if you have lots of datafiles, have directly migrated from Oracle 7 or use transportable tablespaces, you may hit this issue).
Finally I can run ALTER SYSTEM DUMP DATAFILE 1 BLOCK 60162 to find out more about the contents of that block. Another option would be to construct a rowid from the data_object_id, rfile and block and query the table using that but as that’s complicated for IOTs, I’ll leave this to a future post.
You can download latest version of LatchProfX here: