Get ready for some more adventures in Oracle process stack!
Before proceeding though, please read this post about safety of different stack sampling approaches.
I have had few non-trivial Oracle troubleshooting cases, related to query hangs and bad performance, where I’ve wanted to know where exactly in execution plan the current execution is.
Remember, Oracle is just another program executing instructions clustered in functions on your server, so stack sampling can help out here as well.
So, I was looking into the following stack trace taken from an Oracle 10.1 database on Solaris SPARC, running a SQL with this execution plan.
$ cat pstack.txt ------------------------------------------------------------------------ 0000000101eca6e4 kdirfrs (ffffffff78eca5d8, 1, 0, 2, 86521ee00, 0) + 24 0000000102502adc qerixFetchFastFullScan (ffffffff78eca3e0, 10248ebc0, 0, 1, 86521ebf8, fffd) + 33c 0000000102558720 qergiFetch (ffffffff78ecaca8, 10248ebc0, ffffffff7fff5ee8, 10449dba0, 1, 86bb6bb38) + 80 000000010248edd8 qerjoFetch (45, 10248ebc0, ffffffff7fff5ee8, ffffffff78ecad60, 86bb6bc08, 114) + 118 000000010248eeb8 qerjoFetch (7ffe, 10248ebc0, ffffffff7fff5fe8, ffffffff78ecb5e8, 7f399b6f8, 7ffe) + 1f8 00000001025171b8 rwsfcd (8177747a8, 10248e1e0, ffffffff7fff6168, 7fff, f0, 10449dba0) + 78 000000010248e4dc qeruaFetch (745c0ab40, 10248e1e0, ffffffff7fff6168, ffffffff78ecb620, 10449dba8, 8177747a8) + 11c 000000010248d840 qervwFetch (1d, 1d, ffffffff7fff6238, 7fff, 7cfc87c50, 10449d000) + a0 00000001025171b8 rwsfcd (7876e8d18, 10248e1e0, ffffffff7fff63b8, 7fff, c0, 10449dba0) + 78 000000010248e4dc qeruaFetch (817d1e438, 10248e1e0, ffffffff7fff63b8, ffffffff78e7d318, 10449dba8, 7876e8d18) + 11c 000000010248d840 qervwFetch (1d, 1d, ffffffff7fff6488, 7fff, 82355a348, 10449d000) + a0 00000001025171b8 rwsfcd (7454b0408, 10249e4e0, ffffffff7fff6608, 7fff, c0, 10449dba0) + 78 00000001024a4620 qerhjFetch (86b1d64d8, 0, 0, 1, ffffffff78e7df08, 0) + 300 000000010248f99c qerjotFetch (78d8ea1d8, 0, 0, 1, ffffffff78e7e8d0, 10449dba0) + dc 000000010248eeb8 qerjoFetch (1, 10248ebc0, ffffffff7fff6838, ffffffff78e7f2b0, 7d53bb730, 1) + 1f8 000000010248eeb8 qerjoFetch (1, 10248ebc0, ffffffff7fff6938, ffffffff7b9b2f78, 74b306ba0, 7fff) + 1f8 000000010248d840 qervwFetch (5, 5, ffffffff7fff6a08, 7fff, 80b0faa18, 10449d000) + a0 00000001025171b8 rwsfcd (865e988f8, 10249e4e0, ffffffff7fff6b88, 7fff, c0, 10449dba0) + 78 00000001024a4620 qerhjFetch (7a57ae350, 10249e4e0, ffffffff7fff6d88, 1, ffffffff7bb89f00, 0) + 300 00000001025171b8 rwsfcd (823642298, 10249e4e0, ffffffff7fff6d88, 7fff, 6f0, 10449dba0) + 78 00000001024a4620 qerhjFetch (751158588, 102517980, 7511587f0, 1, ffffffff7b9b5090, 0) + 300 000000010251d1f0 qersoFetch (94, 10506adf8, 10449d000, ffffffff7b9b53b8, 799854d60, 7511587f0) + 350 0000000101aa6f24 opifch2 (7, f, 150, 1, 104400, 1050685e8) + a64 0000000101aa6384 opifch (5, 2, ffffffff7fff79f8, 105000, 0, 10434c0a8) + 44 0000000101ad81ec opipls (104000, 10434c, 1, ffffffff7bba3e6a, 0, 140010) + f4c 00000001002d0058 opiodr (6, 10506ae10, 10434cfb0, 10506a, 105000, 104000) + 598 00000001002d4ec0 rpidrus (ffffffff7fff8820, 105067f18, 105068860, ffffffff7bb5f560, 4a8c, 200000) + a0 0000000102f615e4 skgmstack (ffffffff7fff8a48, ffffffff7f87cf8f, ffffffff7fff898f, 1002d4e20, ffffffff7fff8a70, acc01800) + a4 00000001002d504c rpidru (ffffffff7fff9140, 10422b000, 10422a918, 104229598, 410, 82) + ac 00000001002d4808 rpiswu2 (0, 104556000, ffffffff7fff8b88, 2, 104556418, ffffffff7fff92c0) + 1a8 00000001002d61cc rpidrv (a, ffffffff7fff9044, 9, ffffffff7fff90c0, ffffffff7fff9140, 1002d5180) + 62c 0000000102797f90 psddr0 (104400, 86e2fc628, ffffffff7fff92c0, 9, 1050769d8, 1050769d8) + 1f0 0000000102798e04 psdnal (ffffffff7fffa0b8, ffffffff7fffa258, 105000, ffffffff7bc5eb00, 7f7aa86d0, 40) + 184 000000010376d268 pevm_BFTCHC (0, 7f7aa86d0, 50, ffffffff7bb5f560, ffffffff7bc5eb00, 0) + 188 000000010373dff4 pfrinstr_FTCHC (10, 15d0000000000000, ffffffff7bb5f5c8, ffffffff7bb5f560, 4a8c, ffffffff7bb66330) + b4 00000001037362c8 pfrrun_no_tool (ffffffff7bb5f560, 779953684, ffffffff7bb5f5c8, 10457c9d8, 2001, 2001) + 48 00000001037372d0 pfrrun (ffffffff7bb5f5c8, 200000, 0, 200000, ffffffff7bb5f560, 779abb110) + 2f0 0000000103783374 plsql_run (ffffffff7bb55408, 1, 0, ffffdfff, ffffffff7fffa0b8, 0) + 274 0000000103722554 peicnt (ffffffff7fffa0b8, 105068860, 6, ffffffff7fff9f28, 41d8, 1050685e8) + d4 000000010327b784 kkxexe (105000, 104000, 105068, 104296000, 1050685e8, ffffffff7bb5f560) + 284 0000000101ad0228 opiexe (4, ffffffff7bc68ee8, ffffffff7fffab00, 0, 0, ffffffff7bc70480) + 33c8 0000000101a4c0a8 kpoal8 (40008, 1, ffffffff7fffd890, 0, 0, 4) + 648 00000001002d0058 opiodr (14, 10506ae10, 10434ce70, 10506a, 105000, 104000) + 598 0000000102cded94 ttcpip (105071450, 18, ffffffff7fffd890, ffffffff7fffcb88, 104229c98, ffffffff7fffcb84) + 694 00000001002cd3e8 opitsk (1002cf000, 1, 0, ffffffff7fffd9e8, 105071450, 105071458) + 428 0000000101aaf564 opiino (105070000, 1050683c0, 0, 0, f5, 105070290) + 404 00000001002d0058 opiodr (4, 10506ae10, 10434c920, 10000, 105071, 105000) + 598 00000001002cc174 opidrv (0, 4, 10506a, 105071450, 0, 3c) + 354 00000001002c9828 sou2o (ffffffff7fffe6b8, 3c, 4, ffffffff7fffe698, 104aa6000, 104aa6) + 48 00000001002a7b34 main (2, ffffffff7fffe798, ffffffff7fffe7b0, 0, 0, 100000000) + 94 00000001002a7a7c _start (0, 0, 0, 0, 0, 0) + 17c
Not too encouraging, huh?
So, let’s run this stack trace through my os_explain script:
$ ./os_explain pstack.txt SELECT FETCH: SORT: Fetch HASH JOIN: Fetch * HASH JOIN: Fetch * VIEW: Fetch NESTED LOOP OUTER: Fetch NESTED LOOP OUTER: Fetch NESTED LOOP JOIN: Fetch HASH JOIN: Fetch * VIEW: Fetch UNION-ALL: Fetch * VIEW: Fetch UNION-ALL: Fetch * NESTED LOOP OUTER: Fetch NESTED LOOP OUTER: Fetch GRANULE ITERATOR: Fetch INDEX: FetchFastFullScan kdirfrs
Now this is much more understandable!
All my script did was:
- remove the bottom part of the stack not relevant to plan execution
- reverse the order of stack trace lines for better human readability
- translating Query Execution Rowsource (qer) function prefixes to their corresponding names using info provided in Metalink note 175982.1
Easy :)
So, how to read this?
First, by now it should be obvious that in Oracle, each rowsource operator (the different lines you see in SQL execution plans) is actually just a function inside Oracle kernel. These are the row source functions, starting with “qer”. QER stands for Query Execution Rowsource as far as I know.
Whenever an Oracle process fetches data from a cursor, it calls opifch2() which in turn calls the root rowsource function in execution plan. In my case that function was qersoFetch and my os_explain script just substituted the “qerso” part with SORT (as per the Metalink note I mentioned above). The first child function of qersoFetch was qerhjFetch, which is a hash join rowsource, and so on. Note that os_explain prefixes some lines with an asterisk (*), this indicates that the output of given function is in turn filtered by a filter operation (the same filter ops what you normally see in the bottom of DBMS_XPLAN explained plans).
So, logically you can imagine an execution plan as a tree of Oracle functions:
- For a SELECT query the OPI fetch function (opifch2) would be the root.
- Various join and union functions like qerhjFetch (HASH JOIN) and qerjotFetch (NESTED LOOPS JOIN) would be branches.
- The leaves would always be some sort of access path functions like qertbFetch (TABLE ACCESS FULL) or qerixFetch ( INDEX UNIQUE / FULL / RANGE SCAN ).
But physically, an execution plan is just a memory structure in subheap 6 of a child cursor (x$kglcursor.kglobhd6), which has a bunch of rowsource function opcodes in it.
During plan execution Oracle process “just” traverses through those opcodes, looks up the corresponding rowsource function starting address using a lookup table and calls it. That function does its task (probably calls other rowsource functions recursively) and returns to caller.
Note that many rowsource functions are designed to be cascading, being able to do only the work needed for returning a small subset of rows and return only few rows at a time, as opposed to the whole resultset.
This is a very good thing as rows can be cascaded, or pipelined back to parent functions as rows become available. For example a table fetch only fetches a handful of rows (and not the whole table) at a time and returns these “up” for further processing. Also, a nested loop join is able to pass matching rows “up” from the moment first matches are found, again there’s no need to perform the join on full dataset first before returning first rows.
This also means that there is no need to store the whole intermediate resultset somewhere in memory before passing it up to parent function; instead we just revisit that branch of execution plan tree whenever we need more rows from it. And the os_explain script shows you exactly in which execution branch the execution currently is.
Addition: I will elaborate on how to match the execution plan with stack trace in an upcoming post – it’s too much material for an introductory post.
So, cascading rowsources allow us to “incrementally” execute plans involving large datasets, without need to keep the intermediate resultsets in memory. On the other hand, few rowsource operators in your execution plan, like SORT, can not return any rows up before all its children’s rows are processed. With SORT (and aggregate operations which also use SORT) you just have to process all the source rows before returning any meaningful result back. You can’t just go through only half of the source data, order it and start returning rows in hope that the rest of the rows should have not been returned as first in the order. This is where the SQL cursor workareas come into play for such operations.
SORT, HASH and BITMAP rowsources can allocate SQL workareas for them, while others can’t. This can easily be identified from execution plan statistics of following sample query:
SQL> select /*+ gather_plan_statistics */ owner, count(*) from dba_source group by owner; OWNER COUNT(*) ------------------------------ ---------- WKSYS 8988 HR 34 [...some output snipped...] SYS 129299 WMSYS 704 25 rows selected. SQL> select * from table(dbms_xplan.display_cursor(null,null,'MEMSTATS LAST')); PLAN_TABLE_OUTPUT ---------------------------------------------------------------------------------------------------------------------------------- SQL_ID dcp37kxt02m9f, child number 0 ------------------------------------- select /*+ gather_plan_statistics */ owner, count(*) from dba_source group by owner Plan hash value: 114136443 ---------------------------------------------------------------------------------------------------------------------------- | Id | Operation | Name | Starts | E-Rows | A-Rows | A-Time | OMem | 1Mem | Used-Mem | ---------------------------------------------------------------------------------------------------------------------------- | 1 | HASH GROUP BY | | 1 | 593K| 25 |00:00:18.24 | 821K| 821K| 5213K (0)| | 2 | VIEW | DBA_SOURCE | 1 | 593K| 595K|00:00:16.84 | | | | | 3 | UNION-ALL | | 1 | | 595K|00:00:15.06 | | | | |* 4 | FILTER | | 1 | | 595K|00:00:10.88 | | | | |* 5 | HASH JOIN | | 1 | 595K| 595K|00:00:08.50 | 884K| 884K| 1316K (0)| |* 6 | HASH JOIN | | 1 | 6527 | 6292 |00:00:00.12 | 870K| 870K| 1179K (0)| | 7 | TABLE ACCESS FULL | USER$ | 1 | 98 | 98 |00:00:00.01 | | | | |* 8 | HASH JOIN | | 1 | 6527 | 6292 |00:00:00.09 | 909K| 909K| 1181K (0)| | 9 | INDEX FULL SCAN | I_USER2 | 1 | 98 | 98 |00:00:00.01 | | | | |* 10 | INDEX FAST FULL SCAN | I_OBJ2 | 1 | 6527 | 6292 |00:00:00.04 | | | | | 11 | INDEX FAST FULL SCAN | I_SOURCE1 | 1 | 595K| 595K|00:00:01.56 | | | | | 12 | NESTED LOOPS | | 0 | 1 | 0 |00:00:00.01 | | | | |* 13 | INDEX FULL SCAN | I_USER2 | 0 | 1 | 0 |00:00:00.01 | | | | |* 14 | INDEX RANGE SCAN | I_OBJ4 | 0 | 1 | 0 |00:00:00.01 | | | | | 15 | NESTED LOOPS | | 1 | 1 | 0 |00:00:00.01 | | | | | 16 | NESTED LOOPS | | 1 | 1 | 0 |00:00:00.01 | | | | | 17 | NESTED LOOPS | | 1 | 1 | 0 |00:00:00.01 | | | | |* 18 | INDEX FAST FULL SCAN | I_OBJ2 | 1 | 6 | 0 |00:00:00.01 | | | | |* 19 | FIXED TABLE FIXED INDEX| X$JOXFS (ind:1) | 0 | 1 | 0 |00:00:00.01 | | | | |* 20 | INDEX RANGE SCAN | I_USER2 | 0 | 1 | 0 |00:00:00.01 | | | | | 21 | TABLE ACCESS CLUSTER | USER$ | 0 | 1 | 0 |00:00:00.01 | | | | |* 22 | INDEX UNIQUE SCAN | I_USER# | 0 | 1 | 0 |00:00:00.01 | | | | ----------------------------------------------------------------------------------------------------------------------------
You see how the HASH operations do have their last Mem columns populated, therefore those rowsource functions did allocate a SQL workarea for them. Others like NESTED LOOPS joins and data access rowsources did not have any SQL workarea memory allocated as they are completely cascading.
Note that os_explain can also read the stack from STDIN as seen in example below. Also the -a option will tell os_explain to show all functions in the stack and not only the execution plan ones and their children.
Command:
SQL> alter session set statistics_level=typical; Session altered. SQL> select avg(length(text)) from dba_source where owner = 'SYS'; AVG(LENGTH(TEXT)) ----------------- 125.032127
Plan:
--------------------------------------------------------------------------------------------------- | Id | Operation | Name | E-Rows | OMem | 1Mem | Used-Mem | --------------------------------------------------------------------------------------------------- | 1 | SORT AGGREGATE | | 1 | | | | | 2 | VIEW | DBA_SOURCE | 17972 | | | | | 3 | UNION-ALL | | | | | | |* 4 | FILTER | | | | | | | 5 | NESTED LOOPS | | | | | | | 6 | NESTED LOOPS | | 18056 | | | | |* 7 | HASH JOIN | | 198 | 909K| 909K| 1193K (0)| | 8 | INDEX FULL SCAN | I_USER2 | 98 | | | | | 9 | NESTED LOOPS | | 198 | | | | | 10 | TABLE ACCESS BY INDEX ROWID| USER$ | 1 | | | | |* 11 | INDEX UNIQUE SCAN | I_USER1 | 1 | | | | |* 12 | INDEX RANGE SCAN | I_OBJ5 | 198 | | | | |* 13 | INDEX RANGE SCAN | I_SOURCE1 | 93 | | | | | 14 | TABLE ACCESS BY INDEX ROWID | SOURCE$ | 91 | | | | | 15 | NESTED LOOPS | | 1 | | | | |* 16 | INDEX FULL SCAN | I_USER2 | 1 | | | | |* 17 | INDEX RANGE SCAN | I_OBJ4 | 1 | | | | | 18 | NESTED LOOPS | | 1 | | | | | 19 | NESTED LOOPS | | 1 | | | | | 20 | NESTED LOOPS | | 1 | | | | | 21 | TABLE ACCESS BY INDEX ROWID | USER$ | 1 | | | | |* 22 | INDEX UNIQUE SCAN | I_USER1 | 1 | | | | |* 23 | INDEX RANGE SCAN | I_OBJ5 | 1 | | | | |* 24 | FIXED TABLE FIXED INDEX | X$JOXFS (ind:1) | 1 | | | | |* 25 | INDEX RANGE SCAN | I_USER2 | 1 | | | | ---------------------------------------------------------------------------------------------------
Stack:
$ pstack 23740 | ./os_explain -a main ssthrdmain opimai_real sou2o opidrv opiodr opiino opitsk ttcpip opiodr kpoal8 SELECT FETCH: GROUP BY SORT: Fetch VIEW: Fetch UNION-ALL: Fetch * FILTER DEFINITION: FetchOutside UNION-ALL: RowProcedure VIEW: RowProcedure qesaAggNonDistSS. evaopn2 evalen lxsCntChar
The FILTER rowsources (not talking about filter predicates on normal rowsources here) can make things more complicated though as they can introduce their own logic and loops into the execution plan (for running correlated subqueries etc).
By the way, see what happens when I run exactly the same query with rowsource level statistics collection enabled:
SQL> alter session set statistics_level=all; Session altered. SQL> select avg(length(text)) from dba_source where owner = 'SYS'; AVG(LENGTH(TEXT)) ----------------- 125.032127 $ pstack 23740 | ./os_explain -a main ssthrdmain opimai_real sou2o opidrv opiodr opiino opitsk ttcpip opiodr kpoal8 SELECT FETCH: QUERY EXECUTION STATISTICS: Fetch GROUP BY SORT: Fetch QUERY EXECUTION STATISTICS: Fetch VIEW: Fetch QUERY EXECUTION STATISTICS: Fetch UNION-ALL: Fetch QUERY EXECUTION STATISTICS: Fetch * QUERY EXECUTION STATISTICS: Fetch FILTER DEFINITION: FetchOutside QUERY EXECUTION STATISTICS: Fetch NESTED LOOP JOIN: Fetch QUERY EXECUTION STATISTICS: Fetch QUERY EXECUTION STATISTICS: SnapStats sltrgftime64 gettimeofday __kernel_vsyscall
Every rowsource is wrapped into a QUERY EXECUTION STATISTICS wrapper, which’ task is just to count the number of rows sent “up” to parents in the tree, logical IOs and also rowsource timing info whenever an internal counter (set by _rowsource_statistics_sampfreq parameter) wraps.
This is just an intro to Oracle execution plan internals and troubleshooting. Hopefully you don’t need this technique too often, however it has helped me to successfully pinpoint the root cause of a problem in few non-trivial database problems.
Note that in Oracle 11g there’s an excellent new feature called Real-time SQL Monitoring. It allows you to monitor the progress of currently running SQL statements. For serially running statements the monitoring kicks in for a statement after it’s used total 5 seconds of CPU or IO time (this time is controlled by _sqlmon_threshold parameter, but for PX the monitoring is always enabled). After that you can query V$SQL_MONITOR and V$SQL_PLAN_MONITOR for seeing how much time/rows/LIOs that session has spent executing a statement. You can see these details even at the SQL execution plan line level. Alternatively you can use DBMS_SQLTUNE. REPORT_SQL_MONITOR function to get this info nicely formatted. Greg Rahn has written a good blog entry about it.
However it’s important to note that both DBMS_SQLTUNE and V$SQL_MONITOR/V$SQL_PLAN_MONITOR use requires you to have Oracle Tuning Pack license which in turn requires Oracle Diagnostic Pack license. Details are in Oracle 11g Licensing Guide.
So another, low level approach for real-time monitoring will still be handy even after 11g becomes mainstream. In a future post I will be showing how to measure the progress and get execution profile of your plan by aggregating multiple stack traces and also some cool opportunities with DTrace.