Whenever I deliver training or conference presentations on advanced troubleshooting topics, I usually spend some time demonstrating how to get and interpret Oracle server process stack traces.
As I’ve mentioned before, stack traces are the ultimate indicators of where in the Oracle kernel (or any other application) code the execution currently is (or where it was when a crash occurred). This is why Oracle Support asks for stack traces whenever a crash or non-trivial hang is involved, and why the Oracle database dumps errorstacks when ORA-600s and other exceptions occur.
There are multiple ways to get stack traces for Oracle, but not all of them are equal. Some give you more contextual info, some less, but what I’m blogging about today is that some are less safe than others.
I was using pstack on Linux to diagnose an IO-related performance issue. I executed a create table as select (CTAS) statement and ran pstack in a loop to get stack traces from the running process.
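A sampling loop like the one described could be sketched roughly as follows. This is only an illustration for a test system, not the script I actually used: the function name sample_stacks and its parameters are made up for this example, and it assumes a pstack command that takes a PID as its only argument.

```python
import subprocess
import time


def sample_stacks(pid, samples=5, interval=1.0, tool=("pstack",)):
    """Run the stack tool against pid repeatedly and return the captured outputs.

    `tool` is the command prefix to run; the PID is appended as the last
    argument. The default assumes a `pstack <pid>` style command is on PATH.
    """
    traces = []
    for i in range(samples):
        result = subprocess.run(
            [*tool, str(pid)],
            capture_output=True,  # keep stdout/stderr instead of printing them
            text=True,
            check=False,          # a failed sample should not abort the loop
        )
        traces.append(result.stdout)
        if i < samples - 1:
            time.sleep(interval)  # pause between samples, not after the last one
    return traces


# Hypothetical usage: take 10 samples, one second apart, from PID 12345.
# for n, trace in enumerate(sample_stacks(12345, samples=10), start=1):
#     print(f"--- sample {n} ---")
#     print(trace)
```

Each sample starts a new tool process, so with a debugger-based pstack the target process gets attached to and detached from once per iteration.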
However, in one of the test runs I got the following error in my Oracle session:
SQL> create table t as select * from dba_source;
create table t as select * from dba_source
                                *
ERROR at line 1:
ORA-01115: IO error reading block from file 1 (block # 11161)
ORA-01110: data file 1: '/u01/oradata/LIN10G/system01.dbf'
ORA-27091: unable to queue I/O
ORA-27072: File I/O error
Additional information: 3
Additional information: 11145
Additional information: 32768
I suspected that this issue was due to Linux pstack, so I stopped the pstack script and ran my CTAS from the same Oracle session again:
SQL> create table t as select * from dba_source;

Table created.

The command now succeeded.
I tried to reproduce the failure again, but during the few minutes I spent testing, it didn’t occur again.
The failure most likely happened because pstack on Linux is actually just a wrapper script around gdb. GDB in turn suspends the process under investigation and attaches to it through the ptrace() syscall. And the ptrace() syscall (and debuggers in general) has historically caused issues for the traced processes when they communicate with the kernel and other processes. For example, a debugger can block some signals or interrupts from being propagated back up to the “host” process.
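On Linux you can see whether something is currently attached to a process via ptrace() by looking at the TracerPid field in /proc/<pid>/status (0 means nobody is tracing it). A minimal sketch, with helper names I made up for this example:

```python
def parse_tracer_pid(status_text):
    """Return the TracerPid value from /proc/<pid>/status text, or None if absent."""
    for line in status_text.splitlines():
        if line.startswith("TracerPid:"):
            # Line looks like "TracerPid:\t4242"; take the part after the colon.
            return int(line.split(":", 1)[1].strip())
    return None


def tracer_of(pid):
    """Return the PID of the process tracing `pid` (0 if it is not being traced)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_tracer_pid(f.read())


# Hypothetical usage against an Oracle server process with PID 12345:
# if tracer_of(12345):
#     print("a debugger is attached right now")
```

Because pstack attaches only briefly, you would have to check at just the right moment to catch a non-zero value. It is still a handy way to confirm that a debugger left behind in some other terminal is attached to a process.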
I have long warned people about debugger-based stack tracing for exactly those reasons, and now I have managed to capture some nice evidence.
I recommend keeping debuggers away from critical background processes on production systems, unless things have already collapsed (that ought to be common sense anyway). So it’s good to know that Linux pstack is actually just a script with a GDB backtrace command in it.
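If you are not sure what the pstack on your own system really is, a quick heuristic check is to see whether the file is a script that mentions gdb. The sketch below does only that (a shebang line plus the string "gdb" somewhere in the file). The function names are hypothetical, and a True result is a hint, not proof:

```python
import shutil


def looks_like_gdb_wrapper(path):
    """Heuristic: True if the file starts with '#!' and mentions gdb."""
    with open(path, "rb") as f:
        content = f.read(65536)  # wrapper scripts are small; 64 KB is plenty
    return content.startswith(b"#!") and b"gdb" in content.lower()


def check_pstack():
    """Return None if pstack is not on PATH, else the heuristic result for it."""
    path = shutil.which("pstack")
    if path is None:
        return None
    return looks_like_gdb_wrapper(path)


# Hypothetical usage:
# print(check_pstack())
```

Simply opening the file in a pager tells you the same thing, of course. The point is just that it is worth checking before pointing the tool at a production process.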
So, what are the safer and less safe stack tracing options?
- oradebug dump errorstack – unsafe for production (as dump errorstack actually alters the process under investigation from its original codepath)
- debugger-based stack traces (gdb, dbx, mdb and Linux pstack) – can be unsafe for production due to missed signals & interrupts if you get unlucky. Therefore you should keep such tools away from at least the critical background processes
- pstack on Solaris (and procstack on AIX) – safe as they don’t use the ptrace() interface but just read the info required from /proc filesystem
- DTrace – safe by design
I haven’t checked how HP-UX pstack works, so I can’t advise on that.