Back in the (really) old days, systemstate dumps had to be used for diagnosing hangs and finding blockers of hung databases. Basically you just identified which resource your waiting sessions were waiting for and then scanned through the rest of the system state dump to see which session held that particular resource (note that over time the instrumentation and systemstate dumps have evolved to resolve and dump the blocker information right at the waiting session section in the dump).
Diagnosing complex hangs was a tedious manual process (or required some scripting), so hanganalyze was a really welcome addition to hang diagnosis. Hanganalyze basically walks through the blocker-waiter chains and finds which session is the LEAF in the chain (the LEAF is the ultimate/final blocker in the wait chain), the one that everybody else is directly or indirectly waiting for.
Additionally the hanganalyze output allows you to draw the topology of the hang too, which may be useful in complex hang scenarios. In fact Enterprise Manager’s hang topology page also just calls ORADEBUG HANGANALYZE internally and formats the output into a visual hang graph (this is why the hang page requires OS credentials or sysdba privileges).
Starting from 11g, there is a welcome change though: there’s a new V$WAIT_CHAINS view, which lists all the hangs it has detected, is RAC-aware and is accessible via plain SQL. No need to run oradebug anymore, assuming that you are able to log on to query that V$ view :-)
And this is where we have a little problem. In rare circumstances an instance can get hung so badly that even a SYSDBA privilege holder cannot log on, so how would you run the diagnostic queries & commands? This is why Oracle has introduced the preliminary connection option in sqlplus for diagnosing hung databases. With a prelim connection you will have a server process started for you and it attaches to the SGA shared memory segments, but it skips the last step of setting up the session/process-related structures in the SGA. Those operations on shared SGA structures have to be protected by latches/mutexes and would potentially get blocked if the related latches/mutexes are already held by someone else who is part of the hang.
I have already written about How to log on to Oracle when even sysdba cannot do so, but here’s an example:
$ sqlplus -prelim "/as sysdba"
SQL*Plus: Release 11.2.0.3.0 Production on Sun Jan 29 12:48:03 2012
Copyright (c) 1982, 2011, Oracle. All rights reserved.
SYS:dbm1> ORADEBUG HANGANALYZE 3
Statement processed.
The “problem” though is that prelim connections only allow you to run ORADEBUG commands, as running regular SQL does require that all the session/process structures are set up properly, and the parsing, library cache access etc. would take latches/mutexes, which could again block.
But we just wanted to dump hang analysis and ORADEBUG HANGANALYZE above seems to have succeeded. However when looking into the resulting tracefile, we see this:
*** 2012-01-29 12:48:11.041
Processing Oradebug command 'HANGANALYZE 3'
===============================================================================
HANG ANALYSIS:
ERROR: Can not perform hang analysis dump without a process state object and a session state object.
( process=(nil), sess=(nil) )
===============================================================================
It looks like hanganalyze isn’t able to work anymore without a fully logged on session where the process & session state objects (in v$process and v$session respectively) have been set up. This is a bit weird as one of the benefits of hanganalyze (and systemstate dumps) was that it did not require allocating any SGA structures nor take any latches & locks.
This behavior change seems to have appeared since Oracle 11.2.0.2 and it’s documented in MOS note 452358.1, but that note doesn’t offer a good alternative for diagnosing hangs in such situations.
The alternative is actually simple – you should attach to an existing fully-initialized & logged on process with oradebug and let that process do the hanganalyze for you. Whatever process you attach to with oradebug is the process which will be performing the work for you.
So if you ever see the error above when diagnosing extreme hangs, just identify some least critical process logged in to the database (ideally not a background process, and definitely not a critical background process such as DBWR or LGWR) and attach to that with oradebug (as opposed to “oradebug setmypid”). The hang analysis should then work OK.
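Since you cannot query v$process from a prelim connection, the candidate process has to be found from the OS side. Here is a minimal sketch, assuming a Unix-like host where dedicated server processes show up in ps as oracle<SID> and background processes as ora_<name>_<SID>. The function name pick_server_pids is made up for illustration, and you should still eyeball the list and pick a process you can afford to disturb.

```shell
# pick_server_pids SID
# Reads "PID COMMAND..." lines on stdin (e.g. from: ps -e -o pid,args)
# and prints the PIDs of foreground server processes of the given instance.
# Assumption: server processes are named oracle<SID>, while background
# processes (ora_lgwr_<SID>, ora_dbw0_<SID>, ...) are named differently
# and are therefore skipped.
pick_server_pids() {
  awk -v sid="$1" '$2 == ("oracle" sid) { print $1 }'
}

# Example usage (not executed here):
#   ps -e -o pid,args | pick_server_pids "$ORACLE_SID"
```

Any PID printed by this can then be passed to ORADEBUG SETOSPID.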
However, the catch is that the above ORADEBUG HANGANALYZE command is a shortcut, which always tries to run hanganalyze in your own (preliminary) connection, by calling the hanganalyze function directly in that process. So you have to use ORADEBUG DUMP HANGANALYZE 3 instead (3 is the detail level I usually start from), as this syntax will send the “dump hanganalyze” command to the target (fully-initialized) process that you’ve attached to.
Update: I added this to clarify which commands you should use when you hit this problem:
- sqlplus -prelim "/as sysdba"
- ORADEBUG SETOSPID OS_PID
- ORADEBUG DUMP HANGANALYZE 3
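If you want to have this ready before an emergency, the steps above can be wrapped in a small helper. This is only a sketch: the function name hanganalyze_script is hypothetical, and feeding the commands to sqlplus through a pipe assumes that sqlplus reads them from standard input in a prelim connection the same way it does in a normal one.

```shell
# hanganalyze_script OS_PID [LEVEL]
# Prints the oradebug commands that attach to an existing, fully logged-on
# process and ask it to run the hang analysis. LEVEL defaults to 3.
hanganalyze_script() {
  # Refuse anything that is not a plain number, so that a typo does not
  # end up inside an oradebug command.
  case "$1" in
    ''|*[!0-9]*) echo "usage: hanganalyze_script OS_PID [LEVEL]" >&2; return 1 ;;
  esac
  printf 'ORADEBUG SETOSPID %s\n' "$1"
  printf 'ORADEBUG DUMP HANGANALYZE %s\n' "${2:-3}"
}

# Example usage (not executed here; 5678 is a made-up PID):
#   hanganalyze_script 5678 | sqlplus -prelim "/as sysdba"
```

The output is written to the trace file of the process you attached to, not to your own.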
Enjoy :)