Welcome to read my first real post on this blog!
If I ever manage to post any more entries, the type and style of content will be pretty much as this one: Oracle problem diagnosis and troubleshooting techniques with some OS and hardware touch in it. And internals! ;-)
Nevertheless, I am also a fan of systematic approaches and methods, so I plan to propose some less known OS and Oracle techniques for reducing guesswork in advanced Oracle troubleshooting even further.
Ok, the topic of today - troubleshooting.
Troubleshooting = finding out what is really going on + finding suitable solutions or workarounds for the problem.
This post covers one unexplained issue I once had with Oracle external tables that eventually turned out to be a problem with Oracle wait interface instrumentation. I used some of these “what’s really going on” techniques to drill down into a problem. I reproduced the problem on Solaris 10 x64 / Oracle 10.2.0.2.
I worked on a project for which I needed to read data through an external table from an Unix pipe ( ever wanted to load compressed flat file contents to Oracle on-the-fly? ;-)
I created a Unix pipe:
$ mknod /tmp/tmp_pipe p
I created an Oracle external table, reading from that pipe:
Connected to: Oracle Database 10g Enterprise Edition Release 10.2.0.2.0 - Production With the Partitioning, OLAP and Data Mining options USERNAME INSTANCE_NAME HOST_NAME VER STARTED SID SERIAL# SPID ------------ ---------------- ------------------------- ---------- -------- ------- ------- ------- TANEL SOL01 solaris01 10.2.0.2.0 20070618 470 14 724 Tanel@Sol01> CREATE DIRECTORY dir AS '/tmp'; Directory created. Tanel@Sol01> CREATE TABLE ext ( 2 value number 3 ) 4 ORGANIZATION EXTERNAL ( 5 TYPE oracle_loader 6 DEFAULT DIRECTORY dir 7 ACCESS PARAMETERS ( 8 FIELDS TERMINATED BY ';' 9 MISSING FIELD VALUES ARE NULL 10 (value) 11 ) 12 LOCATION ('tmp_pipe') 13 ) 14 ; Table created. Tanel@Sol01> select * from ext;
So far so good… unfortunately this select statement never returned any results. As it turned out later, the gunzip over remote ssh link which should have fed the Unix pipe with flat file data, had got stuck.
Without realizing that, I approached this potential session hang condition with first obvious check – a select from V$SESSION_WAIT:
Tanel@Sol01> select sid, event, state, seq#, seconds_in_wait, p1,p2,p3 2 from v$session_wait 3 where sid = 470; SID EVENT STATE SEQ# SECONDS_IN_WAIT P1 P2 P3 ------- ------------------------------ ------------------- ---------- --------------- ---------- ---------- ---------- 470 db file sequential read WAITED KNOWN TIME 164 7338 1 1892 1 Tanel@Sol01> / SID EVENT STATE SEQ# SECONDS_IN_WAIT P1 P2 P3 ------- ------------------------------ ------------------- ---------- --------------- ---------- ---------- ---------- 470 db file sequential read WAITED KNOWN TIME 164 7353 1 1892 1 Tanel@Sol01> / SID EVENT STATE SEQ# SECONDS_IN_WAIT P1 P2 P3 ------- ------------------------------ ------------------- ---------- --------------- ---------- ---------- ---------- 470 db file sequential read WAITED KNOWN TIME 164 7374 1 1892 1
The STATE and SECONDS_IN_WAIT columns in V$SESSION_WAIT say we have been crunching the CPU for last two hours, right? (as WAITED… means NOT waiting on any event, in this case the EVENT just shows the last event on which we waited before getting on CPU)
Hmm.. let’s check it out:
$ prstat -p 724 PID USERNAME SIZE RSS STATE PRI NICE TIME CPU PROCESS/NLWP 724 oracle 621M 533M sleep 59 0 0:00:00 0.0% oracle/1
prstat
reports that this process is currently in sleep state, is not using CPU and has used virtually no CPU during its 2-hour “run” time!
Let’s check with ps (which is actually a quite powerful tool):
$ ps -o user,pid,s,pcpu,time,etime,wchan,comm -p 724 USER PID S %CPU TIME ELAPSED WCHAN COMMAND oracle 724 S 0.0 00:01 02:18:08 ffffffff8135cadc oracleSOL01
ps
also confirms that the process 724 has existed for over 2 hours 18 minutes (ELAPSED), but has only used roughly 1 second of CPU time (TIME). The state column “S” also indicates the sleeping status.
So, either Oracle V$SESSION_WAIT or standard Unix tools are lying to us. From above evidence it is pretty clear that it’s Oracle who’s lying (also, in cases like that, lower-level instrumentation always has a better chance to know what’s really going on at the upper level than vice versa).
So, let’s use truss
(or strace
on Linux, tusc
on HP-UX) to see if our code is making any system calls or is sleeping within a system call…
$ truss -p 724 read(14, 0xFFFFFD7FFD6FDE0F, 524273) (sleeping...)
Hmm, as no followup is printed to this line, it looks like the process is waiting for a read operation on a file descriptor 14 to complete.
Which file is this fd 14 about?
$ pfiles 724 724: oracleSOL01 (LOCAL=NO) ...snip... 14: S_IFIFO mode:0644 dev:274,2 ino:4036320452 uid:100 gid:300 size:0 O_RDONLY|O_LARGEFILE /tmp/tmp_pipe ...snip...
So from here it’s already pretty obvious where the problem is. There is no data coming from the tmp_pipe. This led me to check what was my gunzip doing on the other end of the pipe and it was stuck, in turn waiting for ssh to feed more data into it. And ssh had got stuck due some network transport issue.
The baseline is that you can rely on low-level (OS) tools to identify what’s really going on when higher level tools (like Oracle wait interface) provide weird or contradicting information, in this case the Oracle wait interface was not recording external table read wait events. I reported this info to Oracle people and I think it has been filed as a bug by now.
This was only a simple demo, identifying a pretty clear case of a session hang, however with the use of a pretty intrusive tool (I would not attach truss
to a critical production instance process without thinking twice).
However there are other options. In the next part of this guide ( when I manage to write it ) I will deal with more complex problems like what to do when the session is not reporting significant waits and is spinning heavily on CPU. Using Oracle and Unix tools it is quite easy to figure out the execution profile of a spinning server process, even without connecting to Oracle at all ( do I hear pstack, mdb and stack tracing? ;-)
As I’ve just started blogging, I would appreciate any feedback, including about things like blog layout, font sizes, readability, understandability etc. Also I think it will take few days before I manage to post the Part 2 of this troubleshooting guide.
Thank you for your patience reading through this :-)