TL;DR: Starting from Linux kernel version 6.9 on x86_64, there’s a new config option CONFIG_X86_FRED enabled by default, and it reserves 16 bytes of padding at the top of a task’s kernel stack area, so you’ll need to account for this extra padding in your “raw” kernel stack & pt_regs lookup code.
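In code terms, the fix boils down to one extra term in the pt_regs address calculation (a sketch only; stack_base stands for the task->stack pointer and THREAD_SIZE is defined later in this post):

```c
// Sketch only: on x86_64 kernels built with CONFIG_X86_FRED=y (v6.9+),
// pt_regs no longer sits exactly at the top of the kernel stack - 16 bytes
// of padding (TOP_OF_KERNEL_STACK_PADDING) now come first:
regs_addr = stack_base + THREAD_SIZE
            - 16                        // the new FRED-era stack padding
            - sizeof(struct pt_regs);
```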
Introduction
I’ve been using Ubuntu 24.04 as my main eBPF development and testing platform without issues since its release. It shipped with Linux kernel version 6.8.0, but Canonical recently released an optional newer HWE kernel (version 6.11) for it too. So, naturally, I upgraded to the latest version (linux-image-generic-hwe-24.04 -> 6.11.0) and moved on.
Then my 0x.tools xcapture-next (v3) eBPF prototype suddenly started returning garbage values for the current system call and argument samples of its monitored OS threads. There was no problem when booting up with the old 6.8.0 kernel.
The xcapture-next (v3) tool passively samples other threads’ activity & state by reading their task_struct kernel memory as a decoupled outside observer, without needing to inject any probes or tracepoints into the other tasks’ execution paths. This gives us a pretty good starting point for building a “top for wall-clock time” tool on Linux, without any slowdown for all the other threads in the system (thanks to passive sampling with BPF task iterators).1
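Here’s a minimal BPF task iterator sketch (my illustration, not the actual xcapture code) showing the passive-observer idea - the program runs only when the observer process reads the iterator file, so the sampled tasks themselves execute no extra probe code:

```c
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

char LICENSE[] SEC("license") = "GPL";

SEC("iter/task")
int dump_task(struct bpf_iter__task *ctx)
{
    struct task_struct *task = ctx->task;
    if (!task)
        return 0;   // a NULL task marks the end of the iteration

    // passively read another thread's task_struct fields via CO-RE
    pid_t pid   = BPF_CORE_READ(task, pid);
    long  state = BPF_CORE_READ(task, __state); // renamed from 'state' in v5.14

    BPF_SEQ_PRINTF(ctx->meta->seq, "tid=%d state=%ld\n", pid, state);
    return 0;
}
```

Userspace then attaches the program with bpf_link_create()/bpf_iter_create() and simply read()s the resulting file descriptor - the iterator program runs in that reader’s context, not in the sampled tasks’.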
By “garbage”, I mean that my eBPF task iterator program that looped through all the other threads’ task_struct info and accessed things like task->stack->pt_regs->orig_ax to get the current system call number (if a task was in a system call) suddenly reported that all userspace threads were in a getsockname syscall, instead of the usual syscall names seen in previous tests. This was clearly not correct, so I started investigating:
```
$ grep getsockname syscall_x86_64.tbl
51      common  getsockname     sys_getsockname
```
The reported syscall NR is 51. I first wondered if my eBPF program hit some memory access violation on the new kernel and the xcapture frontend incorrectly treated that error code as a syscall number (the EACCES errno is 13 and EFAULT is 14):
```
$ grep 51 include/uapi/asm-generic/errno.h
#define EL2HLT          51      /* Level 2 halted */
```
Yeah, this doesn’t look like my issue. Errno 51 is archaic and esoteric, and I don’t think it’s really used anywhere - a grep through the Linux source code confirmed that (although some out-of-tree module could still raise it if it wanted to).
I then changed my eBPF code to use the BPF_CORE_READ_INTO macro (instead of BPF_CORE_READ), which returns the result/error code and the value read as separate outputs, and it confirmed that I wasn’t hitting any kernel memory reading errors. The memory address I was reading just happened to hold a value of 51 in it - but only when booting the installation up with the new 6.11 kernel, not with the previous 6.8 kernel that worked perfectly fine on the same system.
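A minimal sketch of the difference (regs here stands for the pt_regs pointer computed as shown later in this post): BPF_CORE_READ_INTO writes the value into a destination variable you provide and returns the probe-read error code separately, so a failed read can’t masquerade as a value.

```c
__u64 orig_ax = 0;
int err = BPF_CORE_READ_INTO(&orig_ax, regs, orig_ax);
if (err)
    return 0;   // the kernel memory read itself failed (bad address, fault, ...)
// err == 0 here: the address was perfectly readable - its value really was 51
```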
Long story short: on x86_64 platforms, starting from Linux kernel v6.9, there’s a new feature called FRED enabled by default in the kernel config, and it affects the kernel stack area init/usage behavior. 16 extra bytes of padding are reserved at the very top (highest address) of a task’s kernel-side stack, and the pt_regs structure is now placed right below that padding on syscall entry from userspace. The definition is in arch/x86/include/asm/thread_info.h in the kernel source (search for FRED).
I have 2 kernels installed, 6.8 and 6.11:
```
$ ls -l /boot/config-6.*
-rw-r--r-- 1 root root 292076 Jan 20 15:47 /boot/config-6.11.0-17-generic
-rw-r--r-- 1 root root 287562 Jan 17 07:05 /boot/config-6.8.0-53-generic
```
When I grep for the CONFIG_X86_FRED variable, 6.8 does not have such a feature flag at all!
```
$ grep CONFIG_X86_FRED /boot/config-6.*
/boot/config-6.11.0-17-generic:CONFIG_X86_FRED=y
```
The Linux thread_info.h for x86 shows this:
```c
...
#ifdef CONFIG_X86_32
# ifdef CONFIG_VM86
#  define TOP_OF_KERNEL_STACK_PADDING 16
# else
#  define TOP_OF_KERNEL_STACK_PADDING 8
# endif
#else /* x86-64 */
# ifdef CONFIG_X86_FRED
#  define TOP_OF_KERNEL_STACK_PADDING (2 * 8)
# else
#  define TOP_OF_KERNEL_STACK_PADDING 0
# endif
#endif
...
```
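To visualize where this padding sits (my own sketch, derived from the macro above and the address arithmetic shown later in this post), a task’s x86_64 kernel stack now looks roughly like this:

```
task->stack + THREAD_SIZE   <- top of kernel stack (highest address)
[ 16 bytes of padding   ]   <- TOP_OF_KERNEL_STACK_PADDING, only if CONFIG_X86_FRED=y
[ struct pt_regs        ]   <- user registers saved here on syscall entry
[ ...                   ]   <- rest of the stack, growing downwards
task->stack                 <- stack base (lowest address)
```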
This flag showed up in v6.9-rc1, so that explains why my previous 6.8 kernel didn’t have this issue, but 6.11 did!
What is FRED?
What is FRED, the Flexible Return and Event Delivery system? It’s basically a further CPU privilege level switching (and returning) optimization in Intel CPUs - somewhat like the earlier evolution from having to raise int 0x80 interrupts to just running the built-in syscall or sysenter CPU instructions for faster context & privilege level switching when invoking system calls. FRED (on Intel CPUs) brings us two brand new instructions, ERETU and ERETS. You can read the entire Intel FRED architecture spec (I didn’t), but the best summary I found is in this article:
Apparently, FRED means that Intel CPUs are moving away from the four CPU privilege levels (ring 0-3) that nobody ever widely used anyway, back to just two - privileged and unprivileged.
Dynamic FRED-detection implementation in my eBPF code
It’s not easy to read such compile-time constants or build-time settings from the eBPF kernel-land, so my current (unpublished) code snippet for detecting FRED in xcapture on x86 is roughly the following.
I have defined a custom fred_info___check structure in a separate .h file, so that the program still compiles on kernels that have no knowledge of the actual FRED fred_info structure. The triple underscore “___” in the struct name has a special meaning.2
```c
struct fred_info___check {
    long unsigned int edata;
} __attribute__((preserve_access_index));
```
And a snippet from the main .bpf.c program code:
```c
// Default page size and thread stack size (THREAD_SIZE) configuration
#define PAGE_SIZE 4096
#define KASAN_STACK_ORDER 0
#define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)
#define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)

// x86_64 only (CONFIG_X86_FRED)
#define TOP_OF_KERNEL_STACK_PADDING_FRED (2 * 8)

// The PT_REGS structure is located at the "top" (highest address) of the kernel stack
static __always_inline struct pt_regs *get_task_pt_regs(struct task_struct *task)
{
    // will use true=1 and false=0 for regs_addr calculation later on
    const bool fred_enabled = bpf_core_type_exists(struct fred_info___check);

    __u64 stack_base = (__u64) BPF_CORE_READ(task, stack);
    if (!stack_base)
        return NULL;

    __u64 regs_addr = (
        stack_base
        + THREAD_SIZE
        - (fred_enabled * TOP_OF_KERNEL_STACK_PADDING_FRED)
        - sizeof(struct pt_regs)
    );

    struct pt_regs *regs = (struct pt_regs *) regs_addr;
    return regs;
}
```
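For completeness, here’s a hedged usage sketch (not the published xcapture code) that reads another task’s current syscall number through the pt_regs pointer located by the helper above:

```c
static __always_inline long get_task_syscall_nr(struct task_struct *task)
{
    struct pt_regs *regs = get_task_pt_regs(task);
    if (!regs)
        return -1;

    // on x86_64, orig_ax holds the syscall number saved at syscall entry
    __u64 orig_ax = 0;
    if (BPF_CORE_READ_INTO(&orig_ax, regs, orig_ax))
        return -1;  // kernel memory read failed

    return (long) orig_ax;
}
```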
I’ll publish the next version of xcapture-next with the full code soon, so you’ll see how the dynamic & direct kernel feature detection works. I’m using a bpf_core_type_exists trick to see if the new kernel structures added by FRED exist in the currently running kernel, and decide whether to add extra “padding” to my pt_regs lookup logic based on that.
A few details worth remembering
- FRED is an Intel x86 platform feature. AMD has (or will have) something similar too, but since there’s a single x86 kernel build for both AMD and Intel x86_64, this FRED padding kicks in on AMD machines as well.
- Disabling FRED behavior with fred=off as a kernel boot argument won’t remove this extra struct/padding, as long as CONFIG_X86_FRED is enabled in your kernel build settings and you don’t compile a whole new kernel.
- This should not affect programs that just follow/dereference the standard arguments & structs that the typical eBPF probes present. The built-in bpf_core_* helper functions and corresponding BPF_CORE_* macros should take care of any “surprises” introduced in newer kernel versions and hide that complexity from you.
- The moment you get into the “raw” eBPF program territory, you will have to deal with any underlying behavior changes and address shifts yourself. In this case I was doing custom (non-BTF) memory address arithmetic on plain unsigned long ints and later “casting” the resulting number to a struct address (see the sketch after this list). The eBPF verifier thankfully allowed me to do that, but I’d better know what I’m doing - on the given platform and kernel version!
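As a minimal illustration of that last point (my sketch, reusing the constants from earlier in this post): the address is computed as a plain number, and the verifier accepts the read because bpf_probe_read_kernel() fails gracefully on bad addresses - it just can’t warn you when a readable-but-wrong address holds garbage.

```c
// compute the would-be pt_regs address with plain integer arithmetic...
__u64 regs_addr = stack_base + THREAD_SIZE
                  - TOP_OF_KERNEL_STACK_PADDING_FRED  // only correct on FRED builds!
                  - sizeof(struct pt_regs);

// ...and probe-read through it: an unreadable address returns an error, but a
// wrong-yet-readable address happily returns garbage (like my mysterious 51)
struct pt_regs regs_copy;
if (bpf_probe_read_kernel(&regs_copy, sizeof(regs_copy), (void *) regs_addr))
    return 0;   // the read failed outright
```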
I hope this article saves a few hours of head-scratching & troubleshooting time for some future eBPF developers turned web-searchers out there, just like me a week ago :-)
HN discussion is here:
1. See the videos on the front page of the 0x.tools page for a full overview of what I’m building right now! I briefly demo the BPF task iterator approach at the end of my 2024 interview video with Liz Rice. ↩︎