Yes!
Various traps, like page faults, may switch your user process into a kernel code path even when the process is minding its own business in userspace. The process just needs to do something that causes such a trap, like touching a new page in its virtual address space that hasn’t been “fully materialized” yet in the kernel memory structures.
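To make this concrete, here is a minimal sketch (my own illustration, not part of the measurements in this post) that touches every page of a freshly created anonymous mapping with plain memory stores and reads the process's fault counter before and after. The function name touch_fresh_pages and the 64-page size are arbitrary choices, and the exact fault count you see depends on your kernel and its hugepage settings.

```python
import mmap
import resource


def touch_fresh_pages(n_pages):
    """Touch each page of a new anonymous mapping and report what it cost."""
    page = mmap.PAGESIZE
    # Anonymous mapping: virtual address space only, nothing materialized yet.
    buf = mmap.mmap(-1, n_pages * page)
    before = resource.getrusage(resource.RUSAGE_SELF)
    touched = 0
    for offset in range(0, n_pages * page, page):
        buf[offset] = 1  # a plain memory store, not a system call
        touched += 1
    after = resource.getrusage(resource.RUSAGE_SELF)
    buf.close()
    return {
        "pages_touched": touched,
        # Minor faults: pages the kernel had to set up on first touch.
        "minor_faults": after.ru_minflt - before.ru_minflt,
        # CPU time spent in kernel mode while we "only" wrote to memory.
        "system_cpu_seconds": after.ru_stime - before.ru_stime,
    }


if __name__ == "__main__":
    print(touch_fresh_pages(64))
```

On a typical Linux system the minor fault counter should grow by roughly one per touched page, even though the loop itself issues no system calls.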
For example, after starting up an Oracle database instance that uses a large amount of shared memory, there’s one process that burns CPU for a while:
$ sudo pidstat -G ora_sa 1
Linux 5.14.0-427.24.1.el9_4.x86_64 (linux01.localdomain)   07/10/2024   _x86_64_   (104 CPU)

08:57:22 AM   UID       PID    %usr %system  %guest   %wait    %CPU  Command
08:57:23 AM 54321     25950    0.00   97.00    0.00    0.00   97.00  ora_sa00_lin19m

08:57:23 AM   UID       PID    %usr %system  %guest   %wait    %CPU  Command
08:57:24 AM 54321     25950    0.00   99.00    0.00    0.00   99.00  ora_sa00_lin19m
If you run top, you’ll just see the total CPU consumption of the process, but pidstat shows you the breakdown into %usr, %system, etc. Indeed, the process is mostly burning CPU in kernel mode.
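The same user vs. kernel CPU split is available from the utime and stime fields of /proc/PID/stat, counted in clock ticks. The sketch below is a hedged illustration of reading them yourself; parse_proc_stat and system_share are hypothetical helper names, and the field positions follow the proc(5) man page.

```python
def parse_proc_stat(line):
    """Parse the fields of /proc/PID/stat that matter for CPU accounting."""
    # comm is wrapped in parentheses and may itself contain spaces or
    # parentheses, so split at the LAST ')' instead of on whitespace.
    head, _, rest = line.rpartition(")")
    pid_text, _, comm = head.partition("(")
    fields = rest.split()  # fields[0] is field 3 (state) in proc(5) numbering
    return {
        "pid": int(pid_text),
        "comm": comm,
        "state": fields[0],
        "minflt": int(fields[7]),        # field 10: minor page faults
        "majflt": int(fields[9]),        # field 12: major page faults
        "utime_ticks": int(fields[11]),  # field 14: user mode CPU time
        "stime_ticks": int(fields[12]),  # field 15: kernel mode CPU time
    }


def system_share(before, after):
    """Fraction of CPU time spent in kernel mode between two samples."""
    user = after["utime_ticks"] - before["utime_ticks"]
    system = after["stime_ticks"] - before["stime_ticks"]
    total = user + system
    return system / total if total else 0.0


if __name__ == "__main__":
    with open("/proc/self/stat") as f:
        print(parse_proc_stat(f.read()))
```

Taking two samples a second apart and passing them to system_share gives a ratio comparable to the %system vs. %usr columns above.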
Usually this means that it’s running lots of system calls, but let’s not guess! One way to drill down further would be to run perf top -p PID on that process and see which symbols and modules it reports. Since I’m building out the 0x.tools xcapture-bpf utility (eventually for always-on X-ray vision into all activity on Linux), I’ll show you that instead.
$ sudo ./xcapture-bpf -xN -G oncpu_u,oncpu_k
=== [0x.tools] xtop 2.0.3 BETA by Tanel Poder. Rhel Linux 9.4 5.14.0 x86_64
=== Loading BPF...
=== Ready (mypid 30216)

=== Active Threads ===================================================================

seconds | visual_pct | st | username | comm           | syscall  | oncpu_u | oncpu_k
--------------------------------------------------------------------------------------
   3.68 | ███████    | R  | oracle   | ora_sa*_lin*m  | -        | 45834   | 45539
   0.92 | ██         | R  | oracle   | ora_sa*_lin*m  | -        | 45834   | 7941
   0.13 | ▎          | R  | oracle   | ora_mmon_lin*m | -        | -       | -
   0.07 | ▏          | R  | oracle   | ora_dbrm_lin*m | -        | -       | -
   0.07 | ▏          | R  | oracle   | ora_psp*_lin*c | read     | -       | -
   0.07 | ▏          | R  | oracle   | ora_psp*_lin*c | -        | -       | -
   0.07 | ▏          | R  | oracle   | ora_q*_lin*c   | -        | -       | -
   0.07 | ▏          | R  | oracle   | ora_cjq*_lin*c | -        | -       | -
   0.07 | ▏          | R  | oracle   | ora_smco_lin*m | openat   | -       | -
   0.07 | ▏          | R  | oracle   | ora_clmn_lin*m | -        | -       | -
   0.07 | ▏          | R  | oracle   | ora_sa*_lin*m  | pselect6 | 45834   | 7941
   0.07 | ▏          | R  | oracle   | ora_vktm_lin*m | -        | -       | -
   0.07 | ▏          | R  | oracle   | ora_sa*_lin*m  | pselect6 | 45834   | 45539
   0.07 | ▏          | R  | oracle   | ora_dia*_lin*m | -        | -       | -

sampled: 76 times, avg_thr: 1.11
start: 2024-07-10 08:59:59, duration: 5s
You see that there’s a process on CPU (in runnable “R” state) and it is not in a system call. Nevertheless, the oncpu_k field shows a kernel mode stack ID being sampled while that process was on CPU. Since I ran xcapture-bpf with the -N option, it also prints out the symbolicated stack traces of any stack IDs noticed during earlier sampling:
----------------------------------------------------------------------------------
kstack 7941          | kstack 45539         |
asm_exc_page_fault   | asm_exc_page_fault   |
exc_page_fault       | exc_page_fault       |
do_user_addr_fault   | do_user_addr_fault   |
handle_mm_fault      | handle_mm_fault      |
hugetlb_fault        | hugetlb_fault        |
hugetlb_no_page      | hugetlb_no_page      |
clear_huge_page      | clear_huge_page      |
                     | clear_page_erms      |
----------------------------------------------------------------------------------
ustack 45834            |
__libc_start_call_main  |
main                    |
ssthrdmain              |
opimai_real             |
sou2o                   |
opidrv                  |
opirip                  |
ksvrdp_int              |
ksm_sslv_exec_cbk       |
ksmprepage              |
skgmapply               |
ksm_prepage_sga_seg     |
ksmprepage_memory       |
If you look at the stack tile at the bottom first, it shows a userspace stack (ustack) captured from this process, with ksmprepage_memory as the “current” userspace function being executed. This Oracle process is proactively touching all shared virtual memory pages right after startup, so that everything that needs to be physically “materialized” in the kernel gets done immediately. Newly used pages on Linux are zero-filled on their first page fault, so for hugepages that may take a while. That way, other processes doing real work won’t hit such exceptions and latency hiccups at random times later on (sometimes while holding busy spinlocks)!
The entry point into kernel space (the highlighted stack tile in the top right) is asm_exc_page_fault. There’s no system call involved, otherwise it would have shown up in the syscall column of the first section above, and the kernel stack trace would also have included various functions like sys_xyz and do_xyz.
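The same reasoning can be applied mechanically to any sampled kernel stack. The sketch below is only an illustration: kernel_entry_kind is a hypothetical helper, it assumes frames are listed outermost first (as in the stack tiles above), and the syscall entry symbol names are the x86_64 ones, which differ on other architectures and may vary between kernel versions.

```python
# Entry symbols on x86_64 Linux (assumption: other architectures differ).
PAGE_FAULT_ENTRIES = {"asm_exc_page_fault"}
SYSCALL_ENTRIES = {"entry_SYSCALL_64_after_hwframe", "do_syscall_64"}


def kernel_entry_kind(frames):
    """Classify how a thread entered the kernel, given outermost-first frames."""
    for frame in frames:
        # The first recognized frame decides: a page fault taken inside a
        # system call is still reported as a syscall entry.
        if frame in SYSCALL_ENTRIES:
            return "syscall"
        if frame in PAGE_FAULT_ENTRIES:
            return "page_fault"
    return "other"


if __name__ == "__main__":
    kstack_45539 = [
        "asm_exc_page_fault", "exc_page_fault", "do_user_addr_fault",
        "handle_mm_fault", "hugetlb_fault", "hugetlb_no_page",
        "clear_huge_page", "clear_page_erms",
    ]
    print(kernel_entry_kind(kstack_45539))
```

Feeding it the frames of kstack 45539 classifies the sample as a page fault entry, matching the manual reading above.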
More info about 0x.tools and my other Linux performance & troubleshooting articles available here: