…or rather the thread stack, as nowadays decent operating systems execute threads (or tasks, as they’re called in the Linux kernel).
Anyway, a stack trace gives you the ultimate truth about what your program is doing, exactly right now. There are a couple of caveats, like stack corruption and missing symbol information, which may make the traces less useful for us, but for detailed hang & performance troubleshooting the stack traces are a goldmine.
So, I present another case study – how to diagnose a complete database hang when you can’t even log on to the database.
I was doing some performance diagnosis work and wanted to find out where some pointers in some stack traces were pointing in the shared pool. X$KSMSP is the place where you’d find out which type of chunk owns a given address and which heap is the parent of that chunk.
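As a rough sketch (an illustration rather than a tested script – the &addr substitution variable is just a placeholder for the pointer value of interest), the kind of lookup I had in mind looks something like this:

-- which chunk covers a given address, and which subheap (KSMCHIDX) and
-- duration (KSMCHDUR) that chunk belongs to
select ksmchptr, ksmchsiz, ksmchcls, ksmchcom, ksmchidx, ksmchdur
from   x$ksmsp
where  to_number(rawtohex(ksmchptr), 'XXXXXXXXXXXXXXXX') <= to_number('&addr', 'XXXXXXXXXXXXXXXX')
and    to_number(rawtohex(ksmchptr), 'XXXXXXXXXXXXXXXX') + ksmchsiz > to_number('&addr', 'XXXXXXXXXXXXXXXX');

Of course, as the rest of this post shows, you should think twice about where you run such a query.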
However, I remember Jonathan Lewis mentioning at the Hotsos Symposium a year or two ago that you should be very careful with selecting from X$KSMSP on busy systems with large shared pools.
The reason is that X$KSMSP will take the shared pool latches and start scanning through the shared pool heaps, reporting every chunk in them.
- This means that the shared pool latches are held for a very long time (the larger the shared pool, the more chunks it contains and the longer the scan takes).
- This in turn means that you will not be able to allocate or deallocate any chunk from these pools while the latches are held.
- This in turn means that you will not be able to log on to the database (as logon allocates shared pool space for session parameter values), you will not be able to hard parse a statement (as that needs to allocate shared pool space for your cursor and execution plan), and you may not even be able to do some soft parses (as those may also take shared pool latches in some cases).
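If you can still get a session into the instance, a quick (illustrative) way to see who is holding these latches is V$LATCHHOLDER:

select pid, sid, laddr, name from v$latchholder;

In my case, as you’ll see below, I couldn’t even log on any more, so I had to work from the OS level instead.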
The shared pool size was set to 5GB, due to some memory leak issues in the past.
I decided not to run a plain select * from x$ksmsp against the full X$ table; instead I ran select * from x$ksmsp where rownum <= 10, just to see whether this approach of fetching a few rows at a time would work better. Obviously I did this in a test environment first… and it hung completely. Oh man.
So I started up another sqlplus session to see what was going on – and it hung too. By then it was pretty clear to me that the culprit was this killer X$KSMSP query, which didn’t care about my “where rownum <= 10” trick and probably kept sweeping through the whole shared pool – holding the shared pool latch the entire time.
As I wasn’t able to log on to Oracle, the next thing was to check what my process was doing at the OS level. The problem was that I had no idea what my session’s process ID was, nor did I know exactly when I had logged on with that session. I didn’t have access to /dev/kmem to use the lsof tool either, which would have shown me which SPIDs corresponded to TCP connections from my workstation’s IP.
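Just for illustration, had I had the privileges, something along these lines would have narrowed down the candidates (the IP address here is a made-up placeholder for my workstation’s address):

$ lsof -a -c oracle -i @192.168.1.100    # oracle processes AND connections to/from that host

The -c oracle part selects processes whose command name starts with “oracle”, -i @host matches network connections involving that host, and -a ANDs the two conditions together.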
Note that even though this was a test environment, it was a busy system, heavily used by various other processes and batch jobs. So I needed to be really sure which process was mine before killing it at the OS level.
I started with ps and prstat:
$ prstat 1
   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
 24331 xxx_user 8312M 8254M cpu515  10    0   1:22.44 4.1% oracle/1
 26136 oracle   8347M 8285M cpu528  10    0   0:10.25 4.1% oracle/1
 23949 xxx_user 8306M 8248M cpu520   0    0   1:23.21 4.1% oracle/1
 21853 oracle     20G   59M sleep   58    0   0:00.00 2.2% oracle/21
 21849 oracle     20G   59M cpu16   31    0   0:19.52 2.2% oracle/21
 21860 oracle     20G   58M cpu512  32    0   0:23.32 2.0% oracle/19
 24681 oracle     20G   35M cpu531  42    0   8:06.28 0.9% oracle/15
 26516 oracle     20G   13M sleep   58    0  11:34.29 0.0% oracle/1
 27522 oracle     20G   11M sleep   58    0   0:57.33 0.0% oracle/152
 13742 oracle   2064K 1800K cpu19   56    0   0:00.00 0.0% prstat/1
 27526 oracle     20G   11M sleep   58    0   0:58.45 0.0% oracle/149
 27528 oracle     20G   11M sleep   58    0   0:57.07 0.0% oracle/148
 27520 oracle     20G   11M sleep   58    0   0:57.28 0.0% oracle/157
 27524 oracle     20G   11M sleep   58    0   0:58.18 0.0% oracle/166
Even though the CPU usage and the process owner names gave me some hints about which one could have been the troublemaker, I wanted to get proof before taking any action. And pstack was very helpful for that (note that due to this hang I wasn’t even able to log on with another session!).
So I ran pstack on the main suspect process, 26136. I picked this one as it was the only remote Oracle process using 100% CPU (on a 24-CPU server). The other two top processes were spawned by a local user using bequeath, thus their username was different.
$ pstack 26136
26136:  oracleDBNAME01 (LOCAL=NO)
 000000010301f6c0 kghsrch (105068700, 0, ffffffff7bc77a68, 118, ffffffff7bc77ad8, 0) + 20
 0000000103021f20 kghprmalo (7fffffff, 0, ffffffff7bc77a68, 0, ffffffff77130b90, 140) + 460
 0000000103024804 kghalp (ffffffff7fffa898, ffffffff7bc77a68, 58, 7ffffffc, fffffffffffffffc, 104b86ce8) + 884
 0000000100bb78a4 ksmspc (0, 38004d988, ffffffff7bc7ad80, 54e50a7d0, 458, 104e22180) + 44
 000000010303564c kghscn (104e22180, 38004d988, 1000000000000000, 7ffffff8, 105068700, 54e50a7d0) + 68c
 0000000103034750 kghnwscn (0, ffffffff7bc7ad80, 100bb7860, 1858, 2fa0, 105068700) + 270
 0000000100bb7a44 ksmshp (100bb7, 1, 380000030, ffffffff7bc77a68, 100bb7860, ffffffff7bc7ad80) + 64
 00000001024e67a8 qerfxFetch (10449c000, 0, 10506ae10, ffffffff7bc7ad40, 5651c5518, 1043ae680) + 328
 00000001024e7edc qercoFetch (541d07900, 10449dba0, 4e, 1, ffffffff7bc7ad98, 1024e6480) + fc
 0000000101aa6f24 opifch2 (2, 5, 60, 1, 104400, 1050685e8) + a64
 0000000101a4c694 kpoal8 (0, 1, ffffffff7fffdc70, 0, 10434c0a8, 1) + c34
 00000001002d0058 opiodr (14, 10506ae10, 10434ce70, 10506a, 105000, 104000) + 598
 0000000102cded94 ttcpip (105071450, 18, ffffffff7fffdc70, ffffffff7fffcf68, 104229c98, ffffffff7fffcf64) + 694
 00000001002cd3e8 opitsk (1002cf000, 1, 0, ffffffff7fffddc8, 105071450, 105071458) + 428
 0000000101aaf564 opiino (105070000, 105000, 57a321e48, 105000, dc, 105070290) + 404
 00000001002d0058 opiodr (4, 10506ae10, 10434c920, 10000, 105071, 105000) + 598
 00000001002cc174 opidrv (0, 4, 10506a, 105071450, 0, 3c) + 354
 00000001002c9828 sou2o (ffffffff7fffea98, 3c, 4, ffffffff7fffea78, 104aa6000, 104aa6) + 48
 00000001002a7b34 main (2, ffffffff7fffeb78, ffffffff7fffeb90, 0, 0, 100000000) + 94
 00000001002a7a7c _start (0, 0, 0, 0, 0, 0) + 17c
So, reading the interesting functions from the bottom up (and using the Metalink note 175982.1 I’ve mentioned earlier):
- opifch2: this is a FETCH call being executed
- qercoFetch: is a COUNT row source in the execution plan
- qerfxFetch: is a FIXED TABLE row source (in other words, accessing an X$ table)
- ksmshp: looks like this is the function which is called under the covers when X$KSMSP is accessed
- kghscn: sounds a lot like Kernel Generic Heap SCaN :)
So, this stack trace proves that the process was definitely executing code which was doing a fetch from an X$ table (qerfxFetch), and this had eventually resulted in an operation which sounded like a heap scan (kghscn). This is exactly what X$KSMSP does: it scans through the entire shared pool heap and returns a row for every memory chunk in there.
Why was there a qercoFetch function involved even though my query did not have a count(*) in it? The explanation is below:
SQL> select * from x$ksmsp where rownum <=1;

ADDR           INDX    INST_ID   KSMCHIDX   KSMCHDUR KSMCHCOM
-------- ---------- ---------- ---------- ---------- ---------------
04AA0A00          0          1          1          1 free memory

SQL> @x

PLAN_TABLE_OUTPUT
--------------------------------------------------------------------
--------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost  |
--------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |       |       |       |
|*  1 |  COUNT STOPKEY    |         |       |       |       |
|   2 |   FIXED TABLE FULL| X$KSMSP |       |       |       |
--------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter(ROWNUM<=1)
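(The @x above is just a small helper script which displays the execution plan of the last executed statement – a plain dbms_xplan call like the one below gives essentially the same output:)

SQL> select * from table(dbms_xplan.display_cursor(null, null, 'TYPICAL'));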
The qercoFetch row source function was the one which should have enforced the “where rownum <= 10” part of my query (the COUNT STOPKEY operation in the plan above). It should have stopped calling the qerfxFetch function once it had received its 10 rows, but for some reason the scan continued. Or the reason could be that the qerfxFetch function is not sophisticated enough to return control back to the calling function with only part of its result set, leaving qercoFetch no chance to decide to stop fetching half-way through. (Many row sources do have this flexibility of returning data up the stack in a cascading fashion. For example, nested loops joins are cascading (qerjoFetch, qerjotFetch), and hash joins are cascading from the moment the hash join build partition has been built in memory (qerhjFetch). Sort-merge joins are not cascading from a high-level perspective, as both of the joined row sources need to be sorted before any rows can be returned. However, once both row sources have been sorted (either in memory or in a temp tablespace), the matched rows can be returned in a cascading fashion again, meaning that the calling function can fetch only X rows, do its processing, fetch the next X rows and so on. The X normally depends on the arraysize – the number of rows you fetch from Oracle at a time.) Anyway, I’ll leave this interesting topic for a future post ;)
Continuing with my hang diagnosis: after I had identified the troublemaking X$KSMSP scanner process using prstat and pstack, I remembered that I could also have checked my sqlplus.exe’s start time against the server process start times (accounting for the time zone difference and clock drift, of course). This can be helpful when all your candidate processes are either trying to get onto CPU or sleeping – unless you like to pstack through all of them.
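For example, something along these lines lists this instance’s dedicated server processes together with their elapsed running time, which you can then compare against the time you started your client (the DBNAME01 part is taken from the pstack output above):

$ ps -e -o pid,user,etime,args | grep "oracleDBNAME01 (LOCAL=NO)" | grep -v grep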
I had identified one such process, with a start time close to my sqlplus.exe’s start time, and took a pstack of it:
$ pstack 4848
4848:   oracleDBNAME01 (LOCAL=NO)
 ffffffff7c9a5288 semsys   (2, 6d0a000a, ffffffff7fffa79c, 1, 10)
 0000000102f782c8 sskgpwwait (ffffffff7fffab48, 10501dba8, 0, 200, ffffffffffffffff, 2b) + 168
 0000000100af4b74 kslges (31c, 0, 57f338090, 0, 876, 578802b68) + 4f4
 0000000102d70ea0 kglhdgn (3000, ffffffff7fffb3d0, 12, 0, 0, 380003910) + 180
 0000000102d58974 kglget (105068700, ffffffff7fffb360, fc, 1000, 0, 1) + b94
 000000010128f454 kkttrex (8, 1050683c0, 10506ae08, 105000, 10506a000, 9) + 4b4
 000000010128e6c8 kktexeevt0 (31a8, 57a3ae0a0, 105068f74, 2, 4, 10506ae08) + 3a8
 0000000101a77b60 kpolon (51, 51, 40002d91, 105071450, 80000, ffffffff7fffdc70) + c0
 00000001002d0058 opiodr (1a, 10506ae10, 10434cc68, 10506a, 105000, 105000) + 598
 0000000102cded94 ttcpip (105071450, 20, ffffffff7fffdc70, ffffffff7fffcf68, 104228a98, ffffffff7fffcf64) + 694
 00000001002cd3e8 opitsk (1002cf000, 1, 0, ffffffff7fffddc8, 105071450, 105071458) + 428
 0000000101aaf564 opiino (105070000, 105000, 57f337fd8, 105000, e3, 105070290) + 404
 00000001002d0058 opiodr (4, 10506ae10, 10434c920, 10000, 105071, 105000) + 598
 00000001002cc174 opidrv (0, 4, 10506a, 105071450, 0, 3c) + 354
 00000001002c9828 sou2o (ffffffff7fffea98, 3c, 4, ffffffff7fffea78, 104aa6000, 104aa6) + 48
 00000001002a7b34 main (2, ffffffff7fffeb78, ffffffff7fffeb90, 0, 0, 100000000) + 94
 00000001002a7a7c _start (0, 0, 0, 0, 0, 0) + 17c
Looks like this one was the other session which I had tried to start up after my first session got hung.
Let’s read the relevant parts of the stack from the top this time:
- semsys is a Solaris system call which allows waiting for a semaphore’s count (value) to increase above 0. In other words, this system call lets a thread sleep until someone posts it through the semaphore
- sskgpwwait is an Oracle operating system dependent (OSD) layer function which allows an Oracle process to wait for an event
- kslges is a latch get function (with sleep and timeout capability)
- kglhdgn is a call for creating a new library cache object (handle)
- kglget is a lookup call for locating a library cache object (if it doesn’t find one, a library cache miss is incremented and kglhdgn() above is called)
- kkt calls are related to firing some internal triggers (logon triggers and auditing, anyone?)
- kpolon is related to logon and session setup as far as I know
So this stack indicates (I might not be entirely correct here) that this process had got stuck while trying to create a library cache object in the shared pool for an internal trigger during logon (this database used session auditing, by the way).
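If you’re wondering what might drive such recursive work at logon time, a quick sanity check (just a sketch, run wherever you can actually log on) is to look for logon triggers and the audit settings:

SQL> select owner, trigger_name, triggering_event from dba_triggers where triggering_event like '%LOGON%';

SQL> show parameter audit_trail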
Anyway, when I killed my original session (the one which was scanning through X$KSMSP), the database worked OK again.
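For completeness – killing a dedicated server process at the OS level is roughly as simple as the following (the PID comes from the prstat/pstack output above), and PMON then cleans up after the dead process:

$ kill 26136      # polite attempt first
$ kill -9 26136   # only if the process refuses to die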
The advantage of stack tracing was once again demonstrated, in a case where Oracle’s own instrumentation was not usable (due to my stupid mistake ;). This also illustrates once more that you should test everything thoroughly in a test environment, even if you think it should work OK.