The Slowdown That Doesn’t Show Up in Profiles

006 · 2026-05-17 · false sharing, cache lines, struct layout

I had a channel state struct with three atomic fields — a status flag and two counters. Each one was written by a different thread, and they didn’t share any data through mutexes or references. Every field was independently owned.

#[repr(C)]
struct ChannelState {
    status: AtomicU8,    // control thread
    rx_count: AtomicU64, // reader thread
    tx_count: AtomicU64, // writer thread
}

It was fast single-threaded. When I added a second thread it got slower, and a third made it worse. The more cores I threw at it, the less work each one actually got done.

I ran perf stat and IPC looked fine. Flamegraph showed nothing unexpected — the hot function was a tight fetch_add loop, exactly where it should be. CPU utilization was high but work wasn’t getting done.

I spent an afternoon on it before realizing the answer had nothing to do with my code.

Cache lines

CPUs don’t read individual bytes from memory. They pull in contiguous blocks called cache lines: 64 bytes on x86-64 and most other current CPUs, 128 on some (Apple’s M-series, for example). When any core writes to any byte in a line, every other core’s cached copy of that entire block gets invalidated. Not just the byte that changed, the whole block.

That’s the cache coherency protocol doing its job. A round-trip to re-fetch a line from another core’s cache costs tens of nanoseconds, which is fast in isolation but adds up quickly in a tight loop.

My struct fit in a single cache line:

cache line 0 — 64 bytes

status (1B) padding (7B) rx_count (8B) tx_count (8B) unused (40B)
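The layout above can be checked directly. This is a minimal sketch, assuming 64-byte lines; the `LINE` constant and the `offsets` helper are names invented for this example, not part of the original code.

```rust
use std::mem::size_of;
use std::sync::atomic::{AtomicU64, AtomicU8};

#[repr(C)]
struct ChannelState {
    status: AtomicU8,
    rx_count: AtomicU64,
    tx_count: AtomicU64,
}

/// Assumed cache line size in bytes (not queried from the CPU).
const LINE: usize = 64;

/// Byte offsets of (status, rx_count, tx_count) from the start of the struct.
fn offsets(s: &ChannelState) -> (usize, usize, usize) {
    let base = s as *const ChannelState as usize;
    (
        &s.status as *const AtomicU8 as usize - base,
        &s.rx_count as *const AtomicU64 as usize - base,
        &s.tx_count as *const AtomicU64 as usize - base,
    )
}

fn main() {
    let s = ChannelState {
        status: AtomicU8::new(0),
        rx_count: AtomicU64::new(0),
        tx_count: AtomicU64::new(0),
    };
    let (a, b, c) = offsets(&s);
    println!("size = {} bytes", size_of::<ChannelState>());
    println!("offsets: status={a} rx_count={b} tx_count={c}");
    // All three offsets fall inside the first LINE bytes of the struct.
    println!("fits in one line: {}", a < LINE && b < LINE && c < LINE);
}
```

With #[repr(C)] this reports a size of 24 and offsets 0, 8 and 16. One caveat: the struct is only 8-byte aligned, so a given instance could in principle straddle two lines. The three fields are still close enough that at least two of them always share one.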

Three threads writing to three separate fields, with no shared data as far as the source code is concerned. But they all sit in the same 64-byte block, so every time core 0 writes status, cores 1 and 2 lose their cached copies of rx_count and tx_count.

That’s false sharing — the threads aren’t sharing any data, they’re sharing a cache line.

At the hardware level, two cores passing the same line back and forth, each write invalidating the other:

STEP 1

core 0: Modified [x y], writes x
core 1: - [x y]

Core 0 owns the line. Writes x: no stall, the data is in L1.

STEP 2

core 0: Invalid [x y], invalidated (← RFO · data →)
core 1: Modified [x y], writes y

Core 1 writes y. Sends Request For Ownership. Core 0 flushes the line. ~40 cycle stall.

STEP 3

core 0: Modified [x y], writes x (RFO → · ← data)
core 1: Invalid [x y], invalidated

Core 0 writes x. Needs the line back. Another RFO, another ~40 cycles. Repeat forever.

total bus stalls: 80 cycles (and counting)
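The three steps above can be modeled as a tiny state machine. This is a toy sketch, not a real MESI implementation: it keeps only the Modified and Invalid states, and `RFO_CYCLES` is an assumed constant taken from the ~40 cycle figure in the walkthrough. The `Line` type and its methods are invented for illustration.

```rust
/// Cache-line state for one core, reduced to the two states the walkthrough uses.
#[derive(Clone, Copy, Debug, PartialEq)]
enum LineState {
    Modified,
    Invalid,
}

/// Assumed cost of one Request For Ownership round trip. Real latencies vary by CPU.
const RFO_CYCLES: u64 = 40;

/// One cache line as seen by two cores. `states[i]` is core i's view of it.
struct Line {
    states: [LineState; 2],
    stall_cycles: u64,
}

impl Line {
    /// Start as in step 1: core 0 owns the line.
    fn new() -> Self {
        Line {
            states: [LineState::Modified, LineState::Invalid],
            stall_cycles: 0,
        }
    }

    /// Core `core` writes any field that lives on this line.
    fn write(&mut self, core: usize) {
        if self.states[core] == LineState::Invalid {
            // RFO: take ownership, invalidate the other copy, pay the stall.
            self.states[1 - core] = LineState::Invalid;
            self.states[core] = LineState::Modified;
            self.stall_cycles += RFO_CYCLES;
        }
        // Already Modified: the write hits in L1 and costs nothing extra.
    }
}

fn main() {
    // Replay steps 1 to 3.
    let mut line = Line::new();
    line.write(0);
    line.write(1);
    line.write(0);
    println!("stalls after 3 steps: {} cycles", line.stall_cycles);

    // Two writers alternating on one line versus a single writer on a private line.
    let mut shared = Line::new();
    for i in 0..1000 {
        shared.write(i % 2);
    }
    let mut private = Line::new();
    for _ in 0..1000 {
        private.write(0);
    }
    println!(
        "alternating: {} cycles, single writer: {} cycles",
        shared.stall_cycles, private.stall_cycles
    );
}
```

The model ignores which field is written, and that is the point: the protocol tracks ownership per line, so writes to x and y are indistinguishable to it.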

Proving it

I stripped it down to the smallest possible repro: two versions of the same struct, one that packs both fields onto the same cache line and one that pads them apart.

// Version A: both fields on one cache line
#[repr(C)]
struct Contended {
    x: AtomicU64,  // thread 1 writes here
    y: AtomicU64,  // thread 2 writes here
}

// Version B: each field on its own line
#[repr(C)]
struct Padded {
    x: AtomicU64,
    _pad: [u8; 56],
    y: AtomicU64,
}

contended: same cache line

line 0: x (8B) y (8B)

padded: separate lines

line 0: x (8B) pad (56B)
line 1: y (8B)

Two threads, each doing 50M fetch_add calls on its own field. Warmup, then measure:

use std::sync::{Arc, atomic::{AtomicU64, Ordering::Relaxed}};
use std::time::Instant;

#[repr(C)]
struct Contended { x: AtomicU64, y: AtomicU64 }

#[repr(C)]
struct Padded { x: AtomicU64, _pad: [u8; 56], y: AtomicU64 }

const N: u64 = 50_000_000;

fn bench<T: Send + Sync + 'static>(
    label: &str,
    data: Arc<T>,
    f0: fn(&T), f1: fn(&T),
) {
    // warmup
    let (d0, d1) = (data.clone(), data.clone());
    std::thread::scope(|s| { s.spawn(|| f0(&d0)); s.spawn(|| f1(&d1)); });

    let t = Instant::now();
    std::thread::scope(|s| { s.spawn(|| f0(&data)); s.spawn(|| f1(&data)); });
    println!("{label}: {:?}", t.elapsed());
}

fn main() {
    bench("contended", Arc::new(Contended {
        x: AtomicU64::new(0), y: AtomicU64::new(0),
    }), |d| { for _ in 0..N { d.x.fetch_add(1, Relaxed); }},
       |d| { for _ in 0..N { d.y.fetch_add(1, Relaxed); }});

    bench("padded", Arc::new(Padded {
        x: AtomicU64::new(0), _pad: [0; 56], y: AtomicU64::new(0),
    }), |d| { for _ in 0..N { d.x.fetch_add(1, Relaxed); }},
       |d| { for _ in 0..N { d.y.fetch_add(1, Relaxed); }});
}

50M fetch_add(Relaxed) per thread, 2 threads, Zen 4 single CCD

contended   924 ms
padded      184 ms   (5.0x faster)

Same work, same atomic operations, just 56 bytes of padding between the fields. 5x difference: roughly 18.5 ns per fetch_add when contended versus 3.7 ns when padded.

The fix

The fix is to put each contended field on its own cache line. crossbeam has CachePadded<T> for exactly this:

use crossbeam_utils::CachePadded;

struct ChannelState {
    status: CachePadded<AtomicU8>,
    rx_count: CachePadded<AtomicU64>,
    tx_count: CachePadded<AtomicU64>,
}

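Or without the dependency, manual padding. The align(64) matters: without it the struct is only 8-byte aligned and 136 bytes long, so tx_count can end up sharing a line with whatever sits after it in memory.

#[repr(C, align(64))]
struct ChannelState {
    status: AtomicU8,
    _pad0: [u8; 63],
    rx_count: AtomicU64,
    _pad1: [u8; 56],
    tx_count: AtomicU64,
}

It costs 192 bytes instead of 24, three cache lines instead of one, but it’s worth it. (The CachePadded version is bigger still on x86-64, where crossbeam pads each field to 128 bytes.)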
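A third option avoids counting pad bytes by hand: let the compiler do it with an alignment attribute. This is a sketch, assuming 64-byte lines; `LineAligned` is a hypothetical wrapper written for this example, not a standard library or crossbeam type.

```rust
use std::mem::{align_of, size_of};
use std::sync::atomic::{AtomicU64, AtomicU8, Ordering::Relaxed};

/// Hypothetical wrapper: gives `T` its own 64-byte-aligned slot.
/// 64 is an assumed line size; adjust it for targets with larger lines.
#[repr(align(64))]
struct LineAligned<T>(T);

struct ChannelState {
    status: LineAligned<AtomicU8>,
    rx_count: LineAligned<AtomicU64>,
    tx_count: LineAligned<AtomicU64>,
}

fn main() {
    let s = ChannelState {
        status: LineAligned(AtomicU8::new(0)),
        rx_count: LineAligned(AtomicU64::new(0)),
        tx_count: LineAligned(AtomicU64::new(0)),
    };

    // Each field is reached through `.0`, exactly like a plain tuple struct.
    s.status.0.store(1, Relaxed);
    s.rx_count.0.fetch_add(1, Relaxed);
    s.tx_count.0.fetch_add(1, Relaxed);

    // The alignment rounds every wrapped field up to a full line.
    println!("field size  = {}", size_of::<LineAligned<AtomicU8>>());
    println!("struct size = {}", size_of::<ChannelState>());
    println!("struct align = {}", align_of::<ChannelState>());
}
```

Because the alignment propagates to the containing struct, every instance starts on a line boundary and each field occupies exactly one line, regardless of field order.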

Once I fixed the struct, throughput scaled linearly with core count.

Why you can’t see it

There’s no function call, no lock, no syscall involved. The stall happens entirely inside the CPU. A core issues a store, the cache controller sees the line is Invalid, sends a Request For Ownership, waits for the data, transitions to Modified, then completes the store. None of that appears in the source or in a flamegraph: the time is simply charged to the store instruction.

perf stat can surface it if you know which counters to look at — Intel’s MEM_LOAD_L3_HIT_RETIRED.XSNP_HITM or AMD’s ls_dmnd_fills_from_sys.remote_cache — but you have to already suspect false sharing to think to check those.

perf c2c breaks that catch-22 by profiling cache-to-cache transfers and reporting which addresses are bouncing between cores. It’s heavy though: it samples loads and stores using the CPU’s memory-sampling hardware and produces large reports, so it’s not something you’d run in CI.

Reading:

I wrote a linter that catches this from source: snarf.