TL;DR: Starting from Linux kernel version 6.9 on x86_64, there’s a new config option CONFIG_X86_FRED enabled by default, and it reserves 16 bytes of padding at the top of a task’s kernel stack area, so you’ll need to account for this extra padding in your “raw” kernel stack & pt_regs lookup code.
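In code terms, the fix boils down to one extra term in the pt_regs address calculation (a sketch only; stack_base stands for the task->stack pointer and THREAD_SIZE is defined later in this post):

```c
// Sketch only: on x86_64 kernels built with CONFIG_X86_FRED=y (v6.9+),
// pt_regs no longer sits exactly at the top of the kernel stack - 16 bytes
// of padding (TOP_OF_KERNEL_STACK_PADDING) now come first:
regs_addr = stack_base + THREAD_SIZE
            - 16                        // the new FRED-era stack padding
            - sizeof(struct pt_regs);
```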
Introduction
I’ve been using Ubuntu 24.04 as my main eBPF development and testing platform without issues since its release. It shipped with Linux kernel version 6.8.0, but Canonical recently released an optional newer HWE kernel (version 6.11) for it too. So, naturally, I upgraded to the latest version (linux-image-generic-hwe-24.04 -> 6.11.0) and moved on.
Then my 0x.tools xcapture-next (v3) eBPF prototype suddenly started returning garbage values for the current system call and argument samples of its monitored OS threads. There was no problem when booting up with the old 6.8.0 kernel.
The xcapture-next (v3) tool passively samples other threads’ activity & state by reading their task_struct kernel memory as a decoupled outside observer, without needing to inject any probes or tracepoints into the other tasks’ execution paths. This gives us a pretty good starting point for building a “top for wall-clock time” tool on Linux, without any slowdown for all the other threads in the system (thanks to passive sampling with BPF task iterators).1
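Here’s a minimal BPF task iterator sketch (my illustration, not the actual xcapture code) showing the passive-observer idea - the program runs only when the observer process reads the iterator file, so the sampled tasks themselves execute no extra probe code:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

char LICENSE[] SEC("license") = "GPL";

SEC("iter/task")
int dump_task(struct bpf_iter__task *ctx)
{
    struct task_struct *task = ctx->task;
    if (!task)
        return 0;   // a NULL task marks the end of the iteration

    // passively read another thread's task_struct fields via CO-RE
    pid_t pid   = BPF_CORE_READ(task, pid);
    long  state = BPF_CORE_READ(task, __state); // renamed from 'state' in v5.14

    BPF_SEQ_PRINTF(ctx->meta->seq, "tid=%d state=%ld\n", pid, state);
    return 0;
}
```

Userspace then attaches the program with bpf_link_create()/bpf_iter_create() and simply read()s the resulting file descriptor - the iterator program runs in that reader’s context, not in the sampled tasks’.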
By “garbage”, I mean that my eBPF task iterator program that looped through all the other threads’ task_struct info and accessed things like task->stack->pt_regs->orig_ax to get the current system call number (if a task was in a system call) suddenly reported that all userspace threads were in a getsockname syscall, instead of the usual syscall names seen in previous tests. This was clearly not correct, so I started investigating:
```
$ grep getsockname syscall_x86_64.tbl
51      common  getsockname     sys_getsockname
```
The reported syscall NR is 51. I first wondered if my eBPF program hit some memory access violation on the new kernel and the xcapture frontend incorrectly treated that error code as a syscall number (the EACCES errno is 13 and EFAULT is 14):
```
$ grep 51 include/uapi/asm-generic/errno.h
#define EL2HLT          51      /* Level 2 halted */
```
Yeah, this doesn’t look like my issue. Errno 51 is archaic and esoteric, and I don’t think it’s really used anywhere - a grep through the Linux source code confirmed that (although some out-of-tree module could still raise it if it wanted to).
I then changed my eBPF code to use the BPF_CORE_READ_INTO macro (instead of BPF_CORE_READ), which returns the result/error code and the value read as separate outputs, and it confirmed that I wasn’t hitting any kernel memory reading errors. The memory address I was reading just happened to hold a value of 51 in it - but only when booting the installation up with the new 6.11 kernel, not with the previous 6.8 kernel that worked perfectly fine on the same system.
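A minimal sketch of the difference (regs here stands for the pt_regs pointer computed as shown later in this post): BPF_CORE_READ_INTO writes the value into a destination variable you provide and returns the probe-read error code separately, so a failed read can’t masquerade as a value.

```c
__u64 orig_ax = 0;
int err = BPF_CORE_READ_INTO(&orig_ax, regs, orig_ax);
if (err)
    return 0;   // the kernel memory read itself failed (bad address, fault, ...)
// err == 0 here: the address was perfectly readable - its value really was 51
```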
Long story short: on x86_64 platforms, starting from Linux kernel v6.9, there’s a new feature called FRED enabled by default in the kernel config, and it affects the kernel stack area init/usage behavior. 16 extra bytes of padding are reserved at the very top (highest address) of a task’s kernel-side stack, and the pt_regs structure is now placed right below that padding on syscall entry from userspace. The definition is in arch/x86/include/asm/thread_info.h in the kernel source (search for FRED).
I have 2 kernels installed, 6.8 and 6.11:
```
$ ls -l /boot/config-6.*
-rw-r--r-- 1 root root 292076 Jan 20 15:47 /boot/config-6.11.0-17-generic
-rw-r--r-- 1 root root 287562 Jan 17 07:05 /boot/config-6.8.0-53-generic
```
When I grep for the CONFIG_X86_FRED variable, 6.8 does not have such a feature flag at all!
```
$ grep CONFIG_X86_FRED /boot/config-6.*
/boot/config-6.11.0-17-generic:CONFIG_X86_FRED=y
```
The Linux thread_info.h for x86 shows this:
```c
...
#ifdef CONFIG_X86_32
# ifdef CONFIG_VM86
#  define TOP_OF_KERNEL_STACK_PADDING 16
# else
#  define TOP_OF_KERNEL_STACK_PADDING 8
# endif
#else /* x86-64 */
# ifdef CONFIG_X86_FRED
#  define TOP_OF_KERNEL_STACK_PADDING (2 * 8)
# else
#  define TOP_OF_KERNEL_STACK_PADDING 0
# endif
#endif
...
```
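To visualize where this padding sits (my own sketch, derived from the macro above and the address arithmetic shown later in this post), a task’s x86_64 kernel stack now looks roughly like this:

```
task->stack + THREAD_SIZE   <- top of kernel stack (highest address)
[ 16 bytes of padding   ]   <- TOP_OF_KERNEL_STACK_PADDING, only if CONFIG_X86_FRED=y
[ struct pt_regs        ]   <- user registers saved here on syscall entry
[ ...                   ]   <- rest of the stack, growing downwards
task->stack                 <- stack base (lowest address)
```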
This flag showed up in v6.9-rc1, so that explains why my previous 6.8 kernel didn’t have this issue, but 6.11 did!
What is FRED?
What is FRED, the Flexible Return and Event Delivery system? It’s basically a further CPU privilege level switching (and returning) optimization in Intel CPUs - somewhat like the earlier evolution from having to raise int 0x80 interrupts to just running the built-in syscall or sysenter CPU instructions for faster context & privilege level switching when invoking system calls. FRED (on Intel CPUs) brings us two brand new instructions, ERETU and ERETS. You can read the entire Intel FRED architecture spec (I didn’t), but the best summary I found is in this article:
Apparently, FRED means that Intel CPUs are moving away from the four CPU privilege levels (ring 0-3) that nobody ever widely used anyway, back to just two - privileged and unprivileged.
Dynamic FRED-detection implementation in my eBPF code
It’s not easy to read such compile-time constants or build-time settings from the eBPF kernel-land, so my current (unpublished) code snippet for detecting FRED in xcapture on x86 is roughly the following.
I have defined a custom fred_info___check structure in a separate .h file, so that the program still compiles on kernels that have no knowledge of the actual FRED fred_info structure. The triple underscore “___” in the struct name has a special meaning.2
```c
struct fred_info___check {
    long unsigned int edata;
} __attribute__((preserve_access_index));
```
And a snippet from the main .bpf.c program code:
```c
// Default page size and thread stack size (THREAD_SIZE) configuration
#define PAGE_SIZE 4096
#define KASAN_STACK_ORDER 0
#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)
#define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)

// x86_64 only (CONFIG_X86_FRED)
#define TOP_OF_KERNEL_STACK_PADDING_FRED (2 * 8)

// The PT_REGS structure is located at the "top" (highest address) of the kernel stack
static __always_inline struct pt_regs *get_task_pt_regs(struct task_struct *task)
{
    // will use true=1 and false=0 for regs_addr calculation later on
    const bool fred_enabled = bpf_core_type_exists(struct fred_info___check);

    __u64 stack_base = (__u64) BPF_CORE_READ(task, stack);
    if (!stack_base)
        return NULL;

    __u64 regs_addr = (
        stack_base
        + THREAD_SIZE
        - (fred_enabled * TOP_OF_KERNEL_STACK_PADDING_FRED)
        - sizeof(struct pt_regs)
    );

    struct pt_regs *regs = (struct pt_regs *) regs_addr;
    return regs;
}
```
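For completeness, here’s a hedged usage sketch (not the published xcapture code) that reads another task’s current syscall number through the pt_regs pointer located by the helper above:

```c
static __always_inline long get_task_syscall_nr(struct task_struct *task)
{
    struct pt_regs *regs = get_task_pt_regs(task);
    if (!regs)
        return -1;

    // on x86_64, orig_ax holds the syscall number saved at syscall entry
    __u64 orig_ax = 0;
    if (BPF_CORE_READ_INTO(&orig_ax, regs, orig_ax))
        return -1;  // kernel memory read failed

    return (long) orig_ax;
}
```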
I’ll publish the next version of xcapture-next with the full code soon, so you’ll see how the dynamic & direct kernel feature detection works. I’m using a bpf_core_type_exists trick to see if the new kernel structures added by FRED exist in the currently running kernel, and decide whether to add extra “padding” to my pt_regs lookup logic based on that.
A few details worth remembering
- FRED is an Intel x86 platform feature. AMD has (or will have) something similar too, but since there’s a single x86 kernel build for both AMD and Intel x86_64, this FRED padding kicks in on AMD machines as well.
- Disabling FRED behavior with fred=off as a kernel boot argument won’t remove this extra struct/padding, as long as CONFIG_X86_FRED is enabled in your kernel build settings and you don’t compile a whole new kernel.
- This should not affect programs that just follow/dereference the standard arguments & structs that the typical eBPF probes present. The built-in bpf_core_* helper functions and corresponding BPF_CORE_* macros should take care of any “surprises” introduced in newer kernel versions and hide that complexity from you.
- The moment you get into the “raw” eBPF program territory, you will have to deal with any underlying behavior changes and address shifts yourself. In this case I was doing custom (non-BTF) memory address arithmetic on plain unsigned long ints and later “casting” the resulting number to a struct address (see the sketch after this list). The eBPF verifier thankfully allowed me to do that, but I’d better know what I’m doing - on the given platform and kernel version!
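As a minimal illustration of that last point (my sketch, reusing the constants from earlier in this post): the address is computed as a plain number, and the verifier accepts the read because bpf_probe_read_kernel() fails gracefully on bad addresses - it just can’t warn you when a readable-but-wrong address holds garbage.

```c
// compute the would-be pt_regs address with plain integer arithmetic...
__u64 regs_addr = stack_base + THREAD_SIZE
                  - TOP_OF_KERNEL_STACK_PADDING_FRED  // only correct on FRED builds!
                  - sizeof(struct pt_regs);

// ...and probe-read through it: an unreadable address returns an error, but a
// wrong-yet-readable address happily returns garbage (like my mysterious 51)
struct pt_regs regs_copy;
if (bpf_probe_read_kernel(&regs_copy, sizeof(regs_copy), (void *) regs_addr))
    return 0;   // the read failed outright
```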
I hope this article saves a few hours of head-scratching & troubleshooting time for some future eBPF developers turned web-searchers out there, just like me a week ago :-)
HN discussion is here:
1. See the videos on the front page of the 0x.tools page for a full overview of what I’m building right now! I briefly demo the BPF task iterator approach at the end of my 2024 interview video with Liz Rice. ↩︎