Oracle Fatal Background Processes


Some Oracle background processes carry the qualifier “fatal”. When one of these processes crashes, whichever process detects its disappearance (PMON, LGWR or possibly CLMN) shuts down the instance, as it cannot function normally anymore.

Not every background process crash takes down the whole instance. For example, processes like J000 and P000 are technically background processes (daemons disconnected from the network), but their crash won’t take down the instance. It’s less well known that crashing or killing the log archiver processes (ARCn) won’t take down the instance either. These processes are automatically restarted. Please don’t test this out in production :-)

I’ve been wondering for a while whether there’s some flag in a V$ or X$ view that shows which background processes are fatal and which ones are expendable (yes, a movie reference!).

I had some time to look into this today, and it looks like there’s a flag in X$KSUPR (the X$ table underlying V$PROCESS; unfortunately the V$ view doesn’t expose this flag). It looks like it’s the 3rd bit of X$KSUPR.KSUPRFLG, that is, bit mask 4.

Here’s an example from a single instance 18c database:

SQL> SELECT indx,ksuprpnm,TO_CHAR(ksuprflg,'XXXXXXXXXXXXXXXX')
  2  FROM x$ksupr
  3  WHERE BITAND(ksuprflg,4) = 4 ORDER BY indx
  4  /

      INDX KSUPRPNM                                         TO_CHAR(KSUPRFLG,
---------- ------------------------------------------------ -----------------
         2 oracle@linux01.localdomain (PMON)                                E
         3 oracle@linux01.localdomain (CLMN)                                E
         4 oracle@linux01.localdomain (PSP0)                                6
         5 oracle@linux01.localdomain (VKTM)                                6
         6 oracle@linux01.localdomain (GEN0)                                6
         7 oracle@linux01.localdomain (MMAN)                                6
        14 oracle@linux01.localdomain (DBRM)                                6
        17 oracle@linux01.localdomain (PMAN)                                6
        19 oracle@linux01.localdomain (DBW0)                                6
        20 oracle@linux01.localdomain (DBW1)                                6
        21 oracle@linux01.localdomain (LGWR)                                6
        22 oracle@linux01.localdomain (CKPT)                                6
        23 oracle@linux01.localdomain (LG00)                                6
        24 oracle@linux01.localdomain (SMON)                               16
        25 oracle@linux01.localdomain (LG01)                                6
        29 oracle@linux01.localdomain (LREG)                                6
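
To make the bit test concrete, here is a minimal sketch of the same check that the BITAND(ksuprflg,4) = 4 predicate performs, applied to the hex values printed in the output above (E, 6 and 16). The helper name has_fatal_bit is made up for illustration, and the value 2 is an invented contrast case that does not come from any query output here.

```shell
#!/bin/sh
# Sketch: test the 3rd bit (mask 0x4) of a KSUPRFLG value given in hex.
# has_fatal_bit is a hypothetical helper name, not an Oracle utility.
has_fatal_bit() {
    [ $(( 0x$1 & 4 )) -eq 4 ]
}

# E, 6 and 16 are the flag values from the query output;
# 2 is an invented value, included only to show a non-match.
for flag in E 6 16 2; do
    if has_fatal_bit "$flag"; then
        echo "0x$flag: bit 0x4 set"
    else
        echo "0x$flag: bit 0x4 not set"
    fi
done
```

E is binary 1110, 6 is 0110 and 16 is 10110, so all three share the 0x4 bit even though their other bits differ. What those other bits mean is not covered here.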

This is an example from a RAC instance (19c):

SQL> SELECT indx,ksuprpnm,TO_CHAR(ksuprflg,'XXXXXXXXXXXXXXXX')
  2  FROM x$ksupr
  3  WHERE BITAND(ksuprflg,4) = 4 ORDER BY indx
  4  /

      INDX KSUPRPNM                                         TO_CHAR(KSUPRFLG,
---------- ------------------------------------------------ -----------------
         2 oracle@ol7-19-rac1.localdomain (PMON)                            E
         3 oracle@ol7-19-rac1.localdomain (CLMN)                            E
         4 oracle@ol7-19-rac1.localdomain (PSP0)                            6
         5 oracle@ol7-19-rac1.localdomain (IPC0)                            6
         6 oracle@ol7-19-rac1.localdomain (VKTM)                            6
         7 oracle@ol7-19-rac1.localdomain (GEN0)                            6
         8 oracle@ol7-19-rac1.localdomain (MMAN)                            6
         9 oracle@ol7-19-rac1.localdomain (LG00)                            6
        15 oracle@ol7-19-rac1.localdomain (DBRM)                            6
        19 oracle@ol7-19-rac1.localdomain (ACMS)                            6
        20 oracle@ol7-19-rac1.localdomain (PMAN)                            6
        22 oracle@ol7-19-rac1.localdomain (LMON)                            6
        23 oracle@ol7-19-rac1.localdomain (LMD0)                            6
        24 oracle@ol7-19-rac1.localdomain (LMS0)                            6
        26 oracle@ol7-19-rac1.localdomain (LMS1)                            6
        28 oracle@ol7-19-rac1.localdomain (LMD1)                            6
        29 oracle@ol7-19-rac1.localdomain (RMS0)                            6
        30 oracle@ol7-19-rac1.localdomain (RS01)                            6
        31 oracle@ol7-19-rac1.localdomain (RS00)                            6
        33 oracle@ol7-19-rac1.localdomain (LCK1)                            6
        34 oracle@ol7-19-rac1.localdomain (DBW0)                            6
        35 oracle@ol7-19-rac1.localdomain (LGWR)                            6
        36 oracle@ol7-19-rac1.localdomain (CKPT)                            6
        37 oracle@ol7-19-rac1.localdomain (SMON)                           16
        38 oracle@ol7-19-rac1.localdomain (LG01)                            6
        43 oracle@ol7-19-rac1.localdomain (LREG)                            6
        45 oracle@ol7-19-rac1.localdomain (RBAL)                            6
        46 oracle@ol7-19-rac1.localdomain (ASMB)                            6
        47 oracle@ol7-19-rac1.localdomain (FENC)                            6
        53 oracle@ol7-19-rac1.localdomain (IMR0)                            6
        55 oracle@ol7-19-rac1.localdomain (LCK0)                            6
        71 oracle@ol7-19-rac1.localdomain (GTX0)                            6

So, these are the fatal background processes and, as I mentioned, the log archiver processes are not part of this set. Let’s try to kill some ARCn processes:

$ ps -ef | grep -i arc.*LINPRD
oracle    4368     1  0 03:05 ?        00:00:00 ora_arc0_LINPRD
oracle    4372     1  0 03:05 ?        00:00:00 ora_arc1_LINPRD
oracle    4374     1  0 03:05 ?        00:00:00 ora_arc2_LINPRD
oracle    4376     1  0 03:05 ?        00:00:00 ora_arc3_LINPRD

$ kill -9 4368 4372 4374 4376

The ARCn processes disappeared, but the instance didn’t crash. After some seconds, these entries showed up in the alert.log:

TMON (PID:4024): Detected ARCH process failure
TMON (PID:4024): Detected ARCH process failure
TMON (PID:4024): Detected ARCH process failure
TMON (PID:4024): Detected ARCH process failure
Starting background process ARC0
ARC0 started with pid=43, OS id=13833 
Starting background process ARC1
ARC1 started with pid=93, OS id=13835 
Starting background process ARC2
ARC2 started with pid=96, OS id=13837 
Starting background process ARC3
ARC3 started with pid=101, OS id=13839 
TMON (PID:4024): ARC0: Archival started
TMON (PID:4024): ARC1: Archival started
TMON (PID:4024): ARC2: Archival started
ARC2 (PID:13837): Becoming a 'no FAL' ARCH
ARC2 (PID:13837): Becoming the 'no SRL' ARCH
TMON (PID:4024): ARC3: Archival started

Apparently some process called TMON (Transport Monitor) restarted the log archivers. Let’s kill TMON!

After a while, TMON was started again:

Restarting dead background process TMON
Starting background process TMON
TMON started with pid=39, OS id=18949 
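
The restart messages in the two alert.log excerpts above can be pulled out mechanically. The following is a rough sketch, assuming the "Starting background process" line looks exactly as shown above; restarted_processes is a hypothetical helper name, and the sample input is copied from those excerpts.

```shell
#!/bin/sh
# Sketch: list background processes that alert.log reports as started.
# restarted_processes is a hypothetical helper; it reads the files given
# as arguments, or standard input when there are none.
restarted_processes() {
    awk '/^Starting background process /{ print $4 }' "$@"
}

# Sample lines copied from the alert.log excerpts above.
restarted_processes <<'EOF'
TMON (PID:4024): Detected ARCH process failure
Starting background process ARC0
ARC0 started with pid=43, OS id=13833
Restarting dead background process TMON
Starting background process TMON
EOF
```

For this sample the function prints ARC0 and TMON, one per line. The same "Starting background process" text may also appear during a normal instance startup, so a match is not by itself proof that a process crashed.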

But let’s pick something from the fatal background process list above. Not some obvious process (like LGWR, DBWR, SMON, PMON or CKPT), but something that doesn’t look that necessary:

$ ps -ef | grep lreg.*LINPRD
oracle    4002     1  0 03:05 ?        00:00:00 ora_lreg_LINPRD

$ kill -9 4002

LREG is the Listener Registration process and killing it apparently takes down the whole instance:

PMON (ospid: 3940): terminating the instance due to ORA error 500
Cause - 'Instance is being terminated due to fatal process death (pid: 29, ospid: 4002, LREG)'
System state dump requested by (instance=1, osid=3940 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/linprd/LINPRD/trace/LINPRD_diag_3961_20200324031039.trc
Dumping diagnostic data in directory=[cdmp_20200324031039], requested by (instance=1, osid=3940 (PMON)), summary=[abnormal instance termination].
Instance terminated by PMON, pid = 3940

I also killed one of the fatal background processes (IMR0) in my RAC VM:

$ ps -ef | grep -i imr
oracle   13811 13774  0 07:20 pts/0    00:00:00 grep --color=auto -i imr
oracle   15044     1  0 Mar14 ?        00:51:58 asm_imr0_+ASM1
oracle   15703     1  0 Mar14 ?        00:35:20 ora_imr0_cdbrac1

$ kill -9 15703

… and indeed that instance was terminated by PMON:

System state dump requested by (instance=1, osid=15488 (PMON)), summary=[abnormal instance termination]. error - 'Instance is terminating.
System State dumped to trace file /u01/app/oracle/diag/rdbms/cdbrac/cdbrac1/trace/cdbrac1_diag_15510.trc
PMON (ospid: ): terminating the instance due to ORA error
Cause - 'Instance is being terminated due to fatal process death (pid: 53, ospid: 15703, IMR0)'
ORA-1092 : opitsk aborting process
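
Both terminations leave the same "fatal process death" line in the alert.log, which names the process that died. As a sketch, assuming the message format is exactly the one in the two excerpts above, the process name and IDs can be extracted with sed; fatal_process_deaths is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch: report which process death caused an instance termination.
# fatal_process_deaths is a hypothetical helper; it reads the files given
# as arguments, or standard input when there are none.
fatal_process_deaths() {
    sed -n 's/.*fatal process death (pid: \([0-9]*\), ospid: \([0-9]*\), \([A-Za-z0-9]*\)).*/\3 pid=\1 ospid=\2/p' "$@"
}

# The two "Cause" lines copied from the alert.log excerpts above.
fatal_process_deaths <<'EOF'
Cause - 'Instance is being terminated due to fatal process death (pid: 29, ospid: 4002, LREG)'
Cause - 'Instance is being terminated due to fatal process death (pid: 53, ospid: 15703, IMR0)'
EOF
```

For these two lines it prints "LREG pid=29 ospid=4002" and "IMR0 pid=53 ospid=15703". Lines that do not match the pattern are silently skipped, so a different message format in another Oracle version would simply produce no output.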

So far the basic tests show that this flag indeed indicates whether a background process is fatal or not. But I haven’t tested all of them and sometimes the reasons for instance crashes are more complicated than just a single-process crash (like some process holding the control file enqueue for too long, but that’s for another day).

I have uploaded the fatal_bg_proc.sql script to my GitHub repo.
