When talking about Oracle background processes, you sometimes see the qualifier “fatal” background process. It means that when one of these background processes crashes, whoever detects the process disappearance (PMON, LGWR or possibly CLMN) shuts down the instance, as it cannot function normally anymore.
Not all background process crashes take down the whole instance. For example, processes like J000 and P000 are technically background processes (daemons disconnected from the network), but their crash won’t take down the instance. It’s less well known that crashing or killing the log archiver processes (ARCn) won’t take down the instance either. These processes are automatically restarted. Please don’t test this out in production :-)
I’ve been wondering for a while whether there’s some flag in a V$ view or X$ table that shows which background processes are fatal and which ones are expendable (yes, a movie reference!).
I had some time to look into this today and it looks like there’s a flag in X$KSUPR (the X$ underlying V$PROCESS; unfortunately the V$ view doesn’t expose this flag). It looks like it’s the 3rd bit of X$KSUPR.KSUPRFLG, i.e. bit value 4.
Here’s an example from a single instance 18c database:
SQL> SELECT indx,ksuprpnm,TO_CHAR(ksuprflg,'XXXXXXXXXXXXXXXX')
2 FROM x$ksupr
3 WHERE BITAND(ksuprflg,4) = 4 ORDER BY indx
4 /
INDX KSUPRPNM TO_CHAR(KSUPRFLG,
---------- ------------------------------------------------ -----------------
2 oracle@linux01.localdomain (PMON) E
3 oracle@linux01.localdomain (CLMN) E
4 oracle@linux01.localdomain (PSP0) 6
5 oracle@linux01.localdomain (VKTM) 6
6 oracle@linux01.localdomain (GEN0) 6
7 oracle@linux01.localdomain (MMAN) 6
14 oracle@linux01.localdomain (DBRM) 6
17 oracle@linux01.localdomain (PMAN) 6
19 oracle@linux01.localdomain (DBW0) 6
20 oracle@linux01.localdomain (DBW1) 6
21 oracle@linux01.localdomain (LGWR) 6
22 oracle@linux01.localdomain (CKPT) 6
23 oracle@linux01.localdomain (LG00) 6
24 oracle@linux01.localdomain (SMON) 16
25 oracle@linux01.localdomain (LG01) 6
29 oracle@linux01.localdomain (LREG) 6
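As a quick sanity check of the hex values in the last column (E, 6 and 16), the same bit test can be reproduced outside the database. This is only a minimal shell sketch: `is_fatal_flag` is a hypothetical helper name made up for illustration, and the value 2 is an invented flag added for contrast, not something taken from the output above.

```shell
# Succeeds (exit status 0) if the "fatal" bit (value 4, the 3rd bit)
# is set in a hexadecimal KSUPRFLG value such as E, 6 or 16.
# is_fatal_flag is a hypothetical helper, not part of any Oracle tool.
is_fatal_flag() {
    [ $(( 0x$1 & 4 )) -ne 0 ]
}

# E, 6 and 16 come from the query output; 2 is a made-up value for contrast.
for flag in E 6 16 2; do
    if is_fatal_flag "$flag"; then
        echo "$flag: fatal bit set"
    else
        echo "$flag: fatal bit not set"
    fi
done
```

All three values from the output have the bit set, which is expected, since the query only returns rows where BITAND(ksuprflg,4) = 4.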
This is an example from a RAC instance (19c):
SQL> SELECT indx,ksuprpnm,TO_CHAR(ksuprflg,'XXXXXXXXXXXXXXXX')
2 FROM x$ksupr
3 WHERE BITAND(ksuprflg,4) = 4 ORDER BY indx
4 /
INDX KSUPRPNM TO_CHAR(KSUPRFLG,
---------- ------------------------------------------------ -----------------
2 oracle@ol7-19-rac1.localdomain (PMON) E
3 oracle@ol7-19-rac1.localdomain (CLMN) E
4 oracle@ol7-19-rac1.localdomain (PSP0) 6
5 oracle@ol7-19-rac1.localdomain (IPC0) 6
6 oracle@ol7-19-rac1.localdomain (VKTM) 6
7 oracle@ol7-19-rac1.localdomain (GEN0) 6
8 oracle@ol7-19-rac1.localdomain (MMAN) 6
9 oracle@ol7-19-rac1.localdomain (LG00) 6
15 oracle@ol7-19-rac1.localdomain (DBRM) 6
19 oracle@ol7-19-rac1.localdomain (ACMS) 6
20 oracle@ol7-19-rac1.localdomain (PMAN) 6
22 oracle@ol7-19-rac1.localdomain (LMON) 6
23 oracle@ol7-19-rac1.localdomain (LMD0) 6
24 oracle@ol7-19-rac1.localdomain (LMS0) 6
26 oracle@ol7-19-rac1.localdomain (LMS1) 6
28 oracle@ol7-19-rac1.localdomain (LMD1) 6
29 oracle@ol7-19-rac1.localdomain (RMS0) 6
30 oracle@ol7-19-rac1.localdomain (RS01) 6
31 oracle@ol7-19-rac1.localdomain (RS00) 6
33 oracle@ol7-19-rac1.localdomain (LCK1) 6
34 oracle@ol7-19-rac1.localdomain (DBW0) 6
35 oracle@ol7-19-rac1.localdomain (LGWR) 6
36 oracle@ol7-19-rac1.localdomain (CKPT) 6
37 oracle@ol7-19-rac1.localdomain (SMON) 16
38 oracle@ol7-19-rac1.localdomain (LG01) 6
43 oracle@ol7-19-rac1.localdomain (LREG) 6
45 oracle@ol7-19-rac1.localdomain (RBAL) 6
46 oracle@ol7-19-rac1.localdomain (ASMB) 6
47 oracle@ol7-19-rac1.localdomain (FENC) 6
53 oracle@ol7-19-rac1.localdomain (IMR0) 6
55 oracle@ol7-19-rac1.localdomain (LCK0) 6
71 oracle@ol7-19-rac1.localdomain (GTX0) 6
So, these are the fatal background processes and as I mentioned, log archiver processes are not part of this set. Let’s try to kill some ARCn processes:
$ ps -ef | grep -i arc.*LINPRD
oracle 4368 1 0 03:05 ? 00:00:00 ora_arc0_LINPRD
oracle 4372 1 0 03:05 ? 00:00:00 ora_arc1_LINPRD
oracle 4374 1 0 03:05 ? 00:00:00 ora_arc2_LINPRD
oracle 4376 1 0 03:05 ? 00:00:00 ora_arc3_LINPRD
$ kill -9 4368 4372 4374 4376
The ARCn processes disappeared, but the instance didn’t crash. After some seconds, these entries showed up in the alert.log:
2020-03-24T03:06:45.000971-04:00
TMON (PID:4024): Detected ARCH process failure
TMON (PID:4024): Detected ARCH process failure
TMON (PID:4024): Detected ARCH process failure
TMON (PID:4024): Detected ARCH process failure
TMON (PID:4024): STARTING ARCH PROCESSES
Starting background process ARC0
2020-03-24T03:06:45.018122-04:00
ARC0 started with pid=43, OS id=13833
Starting background process ARC1
2020-03-24T03:06:45.032923-04:00
ARC1 started with pid=93, OS id=13835
Starting background process ARC2
2020-03-24T03:06:45.047141-04:00
ARC2 started with pid=96, OS id=13837
Starting background process ARC3
2020-03-24T03:06:45.062135-04:00
ARC3 started with pid=101, OS id=13839
TMON (PID:4024): ARC0: Archival started
TMON (PID:4024): ARC1: Archival started
TMON (PID:4024): ARC2: Archival started
2020-03-24T03:06:45.062469-04:00
ARC2 (PID:13837): Becoming a 'no FAL' ARCH
ARC2 (PID:13837): Becoming the 'no SRL' ARCH
2020-03-24T03:06:45.072580-04:00
TMON (PID:4024): ARC3: Archival started
TMON (PID:4024): STARTING ARCH PROCESSES COMPLETE
Apparently some process called TMON (Transport Monitor) restarted the log archivers. Let’s kill TMON!
After a while, TMON was restarted as well:
2020-03-24T03:08:26.440213-04:00
Restarting dead background process TMON
Starting background process TMON
2020-03-24T03:08:26.455607-04:00
TMON started with pid=39, OS id=18949
But let’s pick something from the fatal background process list above. Not some obvious process (like LGWR, DBWR, SMON, PMON or CKPT), but something that doesn’t look like it’s that necessary:
$ ps -ef | grep lreg.*LINPRD
oracle 4002 1 0 03:05 ? 00:00:00 ora_lreg_LINPRD
$ kill -9 4002
LREG is the Listener Registration process and killing it apparently takes down the whole instance:
2020-03-24T03:10:39.838690-04:00
PMON (ospid: 3940): terminating the instance due to ORA error 500
Cause - 'Instance is being terminated due to fatal process death (pid: 29, ospid: 4002, LREG)'
2020-03-24T03:10:39.839918-04:00
System state dump requested by (instance=1, osid=3940 (PMON)), summary=[abnormal instance termination].
System State dumped to trace file /u01/app/oracle/diag/rdbms/linprd/LINPRD/trace/LINPRD_diag_3961_20200324031039.trc
2020-03-24T03:10:41.108874-04:00
Dumping diagnostic data in directory=[cdmp_20200324031039], requested by (instance=1, osid=3940 (PMON)), summary=[abnormal instance termination].
2020-03-24T03:10:42.246134-04:00
Instance terminated by PMON, pid = 3940
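If you want to find out which process death took an instance down after the fact, the "fatal process death" line in the alert.log excerpt above is easy to parse. The following is a rough sketch only: `fatal_proc_name` is a hypothetical helper, and the pattern assumes the message format is exactly as shown in the excerpt, which may differ between Oracle versions.

```shell
# fatal_proc_name is a hypothetical helper: it reads alert.log text on stdin
# and prints the process name from each "fatal process death" line.
# Assumes the message format shown in the alert.log excerpt above.
fatal_proc_name() {
    sed -n 's/.*fatal process death (pid: [0-9]*, ospid: [0-9]*, \([A-Za-z0-9]*\)).*/\1/p'
}

# Example using the Cause line from the excerpt above
printf '%s\n' "Cause - 'Instance is being terminated due to fatal process death (pid: 29, ospid: 4002, LREG)'" | fatal_proc_name
```

Against a real alert log you would redirect the file into the function instead, for example `fatal_proc_name < alert_LINPRD.log` (the file name here is an assumption based on the usual naming convention).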
I also killed one of the fatal background processes (IMR0) in my RAC VM:
$ ps -ef | grep -i imr
oracle 13811 13774 0 07:20 pts/0 00:00:00 grep --color=auto -i imr
oracle 15044 1 0 Mar14 ? 00:51:58 asm_imr0_+ASM1
oracle 15703 1 0 Mar14 ? 00:35:20 ora_imr0_cdbrac1
$ kill -9 15703
… and indeed that instance was terminated by PMON:
2020-03-24T07:21:05.379992+00:00
System state dump requested by (instance=1, osid=15488 (PMON)), summary=[abnormal instance termination]. error - 'Instance is terminating.
'
System State dumped to trace file /u01/app/oracle/diag/rdbms/cdbrac/cdbrac1/trace/cdbrac1_diag_15510.trc
2020-03-24T07:21:05.423752+00:00
PMON (ospid: ): terminating the instance due to ORA error
2020-03-24T07:21:05.425906+00:00
Cause - 'Instance is being terminated due to fatal process death (pid: 53, ospid: 15703, IMR0)'
2020-03-24T07:21:06.764433+00:00
ORA-1092 : opitsk aborting process
So far the basic tests show that this flag indeed indicates whether a background process is fatal or not. But I haven’t tested all of them and sometimes the reasons for instance crashes are more complicated than just a single-process crash (like some process holding the control file enqueue for too long, but that’s for another day).
I have uploaded the fatal_bg_proc.sql script to my GitHub repo.