When the Compiler Is the Side Channel
Compilers optimize your code. That's their job—fewer instructions, faster execution. But when you're writing security-critical code, "faster" and "correct" can be in direct conflict. The optimizer doesn't know which parts of your code exist for security reasons. It just sees inefficiency and removes it.
I found this out firsthand while checking the machine code output of a function I'd written for RISC-V. At -O0—the "no optimization" setting, where the compiler does exactly what you wrote—it looked right. Then I switched to -O2—aggressive optimization—and the compiler had rewritten my carefully constructed security code into something that leaks secrets through timing.
I started looking at what else the optimizer does to security-critical patterns, and it keeps getting worse. The same logic that introduces timing leaks will also erase code that scrubs secrets from memory, and silently drop commands your hardware was waiting for.
All output below is real—clang 23.0.0, RISC-V 32-bit target, not pseudocode.
THE TIMING LEAK
When code handles a secret—a cryptographic key, a password comparison—it must not behave differently depending on what the secret is. If the code takes one path for a 1 bit and a different path for a 0 bit, an attacker can measure which path was taken by timing how long the function takes. Different paths take different amounts of time. That's a side channel: information leaking through execution time, not through the return value.
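To make the leak concrete, here is a minimal sketch of the kind of code that has the problem. The function name leaky_equal and the steps counter are made up for illustration; the counter stands in for elapsed time so the effect shows up without a stopwatch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical early-exit comparison (illustration only, do not use).
 * *steps counts loop iterations and stands in for execution time. */
int leaky_equal(const uint8_t *a, const uint8_t *b, size_t n, size_t *steps) {
    *steps = 0;
    for (size_t i = 0; i < n; i++) {
        (*steps)++;
        if (a[i] != b[i])
            return 0;   /* returns at the first mismatch */
    }
    return 1;
}
```

A guess that is wrong in the first byte returns after one iteration; a guess that is wrong in the third byte returns after three. An attacker who can measure that difference learns how long the matching prefix is, one byte at a time.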
The defense is to write code that always does the same work regardless of the secret. No if statements, no branches. Instead, you use bitwise arithmetic: turn the flag into a mask, then select with AND and OR. The result depends on the flag, but the instruction sequence doesn't.
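#include <stdint.h>

uint32_t ct_select(uint32_t x, uint32_t y, uint32_t flag) {
    uint32_t mask = -(flag & 1);        // all ones if the low bit is set, else zero
    return (x & mask) | (y & ~mask);    // x when mask is all ones, y when it is zero
}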
At -O0, the compiler does exactly what you wrote. Five arithmetic instructions, no branches:
# clang -O0 (no optimization)
ct_select:
# ... stack setup ...
slli a0, a0, 31 # mask = -(flag & 1)
srai a0, a0, 31 # sign-extend to full word
xor a0, a0, a1 # x ^ y
and a0, a0, a2 # (x ^ y) & mask
xor a0, a0, a1 # ((x ^ y) & mask) ^ y
ret
Same instructions run whether the flag is 0 or 1. That's the whole point.
At -O2, LLVM rewrites it:
# clang -O2 (optimized)
ct_select:
andi a2, a2, 1
beqz a2, .LBB0_2 # ← branch on secret
mv a1, a0
.LBB0_2:
mv a0, a1
ret
I traced it through every optimization pass to find where this happens. It's a pass called InstCombine, LLVM's pattern-matching simplifier. It recognizes that the mask arithmetic is really just choosing between x and y based on a condition, and replaces the whole thing with a select on that condition. The base RISC-V instruction set has no conditional move, so the select comes out of the backend as a branch. Fewer instructions, faster code. The output is identical for every input: functionally, nothing changed.
But everything changed for security. The CPU now takes a different path depending on the flag. If that flag is a secret (a key bit, a password comparison result), an attacker who can measure timing just learned its value. LLVM doesn't know the flag is secret. It sees a chance to shorten the code and takes it.
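One common mitigation is a value barrier: an empty inline-asm statement that claims to modify the mask, so the optimizer can no longer prove the mask is derived from a single bit. The sketch below assumes GCC or Clang inline-asm syntax; the names value_barrier_u32 and ct_select_barrier are hypothetical. It is not a guarantee, and the generated assembly still needs checking for each compiler version and target.

```c
#include <stdint.h>

/* Hypothetical helper: returns v unchanged, but the empty asm tells the
 * compiler that v may have been modified, hiding where it came from. */
static inline uint32_t value_barrier_u32(uint32_t v) {
    __asm__("" : "+r"(v));
    return v;
}

uint32_t ct_select_barrier(uint32_t x, uint32_t y, uint32_t flag) {
    /* The optimizer now sees an opaque 32-bit mask, not a boolean in disguise. */
    uint32_t mask = value_barrier_u32(-(flag & 1));
    return (x & mask) | (y & ~mask);
}
```

The asm emits no instructions. What it changes is what the optimizer is allowed to assume about mask, which is the information the select rewrite depends on.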
THE VANISHING CLEANUP
The timing leak is bad, but at least the code still runs—it just runs differently depending on the secret. This one's worse. The compiler deletes your security code entirely and you get no warning.
When a function handles a secret key, standard practice is: copy the key in, do your work, then overwrite the key with zeros before returning. This is called scrubbing or zeroing. You don't want that key sitting in memory after you're done—any code that later reuses the same region of memory could read it. A crash dump, a memory scanner, a speculative-execution attack—all of these can harvest secrets left behind on the stack.
uint32_t process_secret(const uint8_t *input) {
    uint8_t key[32];
    for (int i = 0; i < 32; i++)
        key[i] = input[i];

    uint32_t result = 0;
    for (int i = 0; i < 32; i++)
        result ^= key[i];

    // scrub the key
    for (int i = 0; i < 32; i++)
        key[i] = 0;
    return result;
}
At -O0, the zeroing loop is there:
# clang -O0 — third loop: zeroing
li a0, 0
sb a0, 0(a1) # key[i] = 0
# ... loop continues for all 32 bytes ...
At -O2, the zeroing is gone:
# clang -O2 (optimized)
process_secret:
# ... copy input, compute xor ...
xor a0, a0, a1
xor a0, a0, t0 # final xor for result
#
# no zeroing. no stores. nothing.
#
addi sp, sp, 48 # stack frame reclaimed
ret
The optimization pass responsible is called Dead Store Elimination—it removes writes to memory that nothing will ever read. From LLVM's perspective it's making the right call. key is a local variable, nobody reads it after the zeroing loop, and the function returns right after—so the writes are "dead." What LLVM doesn't model is that "dead" here means your secret key is still sitting in memory when the next function reuses that same stack space.
Fix 1: volatile cast
The blunt fix: cast the pointer to volatile. The compiler isn't allowed to optimize away volatile stores.
volatile uint8_t *vp = (volatile uint8_t *)key;
for (int i = 0; i < 32; i++)
    vp[i] = 0;
# clang -O2 — volatile version: zeroing is emitted
sb zero, 0(sp)
sb zero, 1(sp)
sb zero, 2(sp)
# ... 32 byte-stores total ...
sb zero, 31(sp)
ret
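32 byte-stores, all present. It works, but it's byte-at-a-time, and on a 32-bit target you'd rather do word-aligned stores. There are two further caveats. key itself isn't declared volatile, so the guarantee for stores through a cast pointer rests on compiler practice (GCC and Clang both emit them) more than on the letter of the C standard. And volatile accesses are only ordered against other volatile accesses, so the compiler may still move non-volatile loads and stores across them.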
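One way to get word-sized stores is to declare the key as an array of words and scrub it through a volatile word pointer. This is a sketch under assumptions, not compiler output: the names zero_words and process_secret_words are made up, and it relies on the same compiler behavior as the byte version.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: zero nwords 32-bit words through a volatile pointer,
 * so each store is a volatile access the compiler has to emit. */
static void zero_words(uint32_t *buf, size_t nwords) {
    volatile uint32_t *vp = (volatile uint32_t *)buf;
    for (size_t i = 0; i < nwords; i++)
        vp[i] = 0;
}

uint32_t process_secret_words(const uint8_t *input) {
    uint32_t key[8];                          /* 32 bytes, word-aligned by type */
    const uint8_t *kb = (const uint8_t *)key; /* byte view for the XOR loop */
    uint32_t result = 0;

    memcpy(key, input, sizeof key);
    for (size_t i = 0; i < sizeof key; i++)
        result ^= kb[i];

    zero_words(key, sizeof key / sizeof key[0]);
    return result;
}
```

Declaring the buffer as uint32_t avoids casting a byte array to a word pointer, which would raise alignment and aliasing questions of its own.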
Fix 2: compiler barrier
The cleaner fix: keep the regular zeroing loop, but add a compiler barrier after it.
for (int i = 0; i < 32; i++)
    key[i] = 0;
__asm__ __volatile__("" :: "r"(key) : "memory");
The asm block is empty: zero instructions at runtime. But the "memory" clobber is a lie to the compiler: "something here might read all of memory." The "r"(key) operand hands the buffer's address to the asm, so the buffer counts as reachable from it. Now LLVM can't prove the zeroing stores are dead, because something might observe them.
# clang -O2 — barrier version
sb zero, 28(sp)
sb zero, 29(sp)
# ... zeroing stores ...
sb zero, 59(sp)
#APP
#NO_APP # ← the barrier: zero runtime cost
ret
libsodium's sodium_memzero uses a barrier like this as one of its fallbacks, when the platform offers no dedicated primitive. The Rust zeroize crate takes a similar approach: volatile writes plus a compiler fence. C23 added memset_explicit to standardize this, but C library support is still patchy.
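The barrier pattern is easy to wrap in a reusable helper. The version below is a minimal sketch assuming GCC or Clang inline asm; scrub and process_secret_scrubbed are hypothetical names, not functions from any library.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: zero n bytes at p, then stop the compiler from
 * treating those stores as dead. The asm emits no instructions. */
static void scrub(void *p, size_t n) {
    memset(p, 0, n);
    __asm__ __volatile__("" : : "r"(p) : "memory");
}

uint32_t process_secret_scrubbed(const uint8_t *input) {
    uint8_t key[32];
    uint32_t result = 0;

    memcpy(key, input, sizeof key);
    for (size_t i = 0; i < sizeof key; i++)
        result ^= key[i];

    scrub(key, sizeof key);   /* must run on every return path */
    return result;
}
```

Where the C library provides memset_explicit or explicit_bzero, calling that is the better choice; a hand-rolled helper like this is the fallback.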
THE MISSING COMMAND
This isn't just a crypto problem. It shows up anywhere software talks to hardware.
In embedded systems, software controls hardware by writing to specific memory addresses. These aren't normal memory: they're registers mapped into the address space, and every write triggers a physical action. It's called memory-mapped I/O. Say you're writing firmware for a crypto accelerator. The command register at 0x40000008 controls the hardware state machine: write 0x00 to reset, write 0x01 to start. Two writes, specific order.
void reset_and_start(void) {
    uint32_t *cmd = (uint32_t *)0x40000008;
    *cmd = 0x00;  // clear
    *cmd = 0x01;  // start
}
At -O0, two stores:
# clang -O0 (no optimization)
li a0, 0
sw a0, 0(a1) # *cmd = 0x00
li a0, 1
sw a0, 0(a1) # *cmd = 0x01
At -O2:
# clang -O2 (optimized)
reset_and_start:
lui a0, 262144
li a1, 1
sw a1, 8(a0) # *cmd = 0x01 only
ret
Dead Store Elimination again—the same pass that erased the key-zeroing. Two writes to the same address, no read in between, so LLVM assumes the first one is pointless. For regular memory, it is. But this is a hardware register where every write triggers a physical state transition. The accelerator never got its reset command.
The fix is volatile again.
volatile uint32_t *cmd = (volatile uint32_t *)0x40000008;
*cmd = 0x00;
*cmd = 0x01;
# clang -O2 — volatile version: both writes emitted
reset_and_start:
lui a0, 262144
li a1, 1
sw zero, 8(a0) # clear
sw a1, 8(a0) # start
ret
Both stores present, correct order. volatile tells LLVM these accesses have side effects it can't reason about.
One thing to know: volatile constrains the compiler, not the CPU. On ARM or POWER, the processor itself can still reorder stores unless the region is mapped as device or strongly-ordered memory, so you need either that mapping or explicit barrier instructions. On RISC-V the barrier is the FENCE instruction, whose I and O bits cover device input and output accesses. volatile is a floor, not a ceiling.
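Combining the two layers might look like the sketch below: a volatile store for the compiler, followed by a fence for the CPU. The names mmio_write32 and reset_and_start_at are hypothetical, and whether a given platform actually needs the fence depends on how the register region is mapped, so treat its placement as an assumption.

```c
#include <stdint.h>

/* Hypothetical helper: write a device register through a volatile pointer,
 * then order that write before any later device writes. */
static inline void mmio_write32(volatile uint32_t *reg, uint32_t value) {
    *reg = value;
#if defined(__riscv)
    /* Device output before this point completes before device output after it. */
    __asm__ __volatile__("fence o, o" : : : "memory");
#else
    /* Other targets: compiler-only barrier; the CPU barrier is target-specific. */
    __asm__ __volatile__("" : : : "memory");
#endif
}

void reset_and_start_at(volatile uint32_t *cmd) {
    mmio_write32(cmd, 0x00);  /* clear */
    mmio_write32(cmd, 0x01);  /* start */
}
```

On the hardware from the example, the call would be reset_and_start_at((volatile uint32_t *)0x40000008). Taking the address as a parameter also lets the function run against an ordinary variable on a development machine.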
THE PATTERN
Three examples, one root cause. The compiler's job is to produce the fastest correct program—but its definition of "correct" is purely functional: same inputs, same outputs. It has no concept of how long something should take, whether memory should be zeroed after use, or whether two writes to the same address both matter. Security properties that depend on timing, on cleanup, on side effects—those are invisible to the optimizer.
This isn't academic. Every TLS library, every cryptocurrency wallet, every smart card runtime, every disk encryption driver has to deal with exactly these problems. OpenSSL, libsodium, the Linux kernel's crypto subsystem, BoringSSL—all of them contain workarounds for the patterns described above. When they get it wrong, the result is a key extraction vulnerability in production.
All three examples have fixes—volatile, compiler barriers, memset_explicit, inline assembly. They work today. But you're annotating source code to fight a general-purpose optimizer, and new passes ship with every LLVM release. A pattern that's safe in clang 23 might not survive clang 24. You end up in an arms race with your own toolchain.
The alternative is to push the requirement into hardware. A fixed-cycle datapath doesn't branch on secrets because it can't: the timing is set at the circuit level. A hardware register that accepts write commands can't have those writes optimized away: the bus transaction happens unconditionally. The RISC-V-based OpenTitan project already takes this approach with OTBN, its big-number accelerator. It has its own register file and its own small instruction set, and its instructions are designed to take the same time whatever the operand values. OTBN code still has to avoid branching on secrets, but it is written directly in assembly, so there is no optimizer to rewrite a constant-time sequence behind your back.
Software workarounds treat the symptoms. Hardware treats the cause.
Reading:
- Kocher 1996 — Timing Attacks on Implementations of Diffie-Hellman, RSA, DSS, and Other Systems
- BearSSL: Constant-Time Crypto — Thomas Pornin's guide to writing constant-time C
- cryptocoding — Coding rules for constant-time implementations
- zeroize — Rust crate for securely zeroing memory
- C23 memset_explicit — WG14 proposal for non-dead-store-eliminable memset
- LLVM Passes — Reference for the passes mentioned above
- RISC-V ISA Specification — Unprivileged spec, calling convention in Chapter 18