98.5% Less Latency: What Actually Matters When Tuning Linux on Apple Silicon
Everyone has a list of kernel flags that’ll make your system faster. Most of them are copium. Hell, some of mine might end up copium, too. It’s too early to tell. :)
This post walks through five kernel optimization tiers, each benchmarked A/B on an M1 Max, each building on the last. This is part of arashiOS — a Linux kernel purpose-built for Apple Silicon. Here’s what actually worked, and what was just noise.
TL;DR: Five kernel optimization tiers benchmarked on an M1 Max. Config-only changes (Tier 1) and the Clang compiler migration (Tier 2a) were noise — zero meaningful gains. The BORE scheduler was the single biggest win, dropping wake-up latency 98.5% (3,581 us to 52 us against the Tier 2a base; 98.9% against stock's 4,808 us). Memory subsystem tuning (ZRAM, sysctl, boot params) boosted page fault throughput 41%. The Clang migration pays off later when ThinLTO lands in kernel 7.0.
The Setup
- Hardware: Apple M1 Max, 32GB, 2 Icestorm E-cores + 8 Firestorm P-cores
- Base: Arch Linux ARM, linux-asahi 6.18.15
- Methodology: `bench-full.sh` harness, controlled conditions, multiple runs, coefficient of variation tracked per metric. Same machine, same disk, same measurement code.
If a number moved less than the noise floor, it gets called noise.
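As a sketch of that noise-floor rule (illustrative numbers and a hypothetical helper, not the harness's actual code), the per-metric coefficient of variation can be computed like this:

```shell
# cv: print the coefficient of variation (stddev / mean, as a percent) for a
# list of samples. Any A/B delta smaller than this number is called noise.
cv() {
  echo "$@" | tr ' ' '\n' | awk '
    { sum += $1; sumsq += $1 * $1; n++ }
    END {
      mean = sum / n
      sd = sqrt(sumsq / n - mean * mean)   # population stddev
      printf "%.2f\n", 100 * sd / mean
    }'
}

cv 9.53 9.47 9.51 9.49 9.55   # e.g. five PyBench runs; prints 0.30
```

With a CV around 0.3%, the Tier 1 PyBench "win" of -0.73% is barely above twice the run-to-run wobble, which is why it gets no credit below.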
Tier 0 is the stock kernel. Every claim in this post is a delta against these numbers:
| Metric | Tier 0 (Stock) |
|---|---|
| PyBench | 9.536s |
| Hackbench pipe | 6.93s |
| Hackbench socket | 14.29s |
| Schbench p99 | 4,808 us |
| Page fault | 29,301 ops/s |
| Boot time | 5.640s |
| FIO seq read | 22,328 MB/s |
| FIO seq write | 9,428 MB/s |
| FIO rand read | 900,302 IOPS |
| glmark2 | 3,184 |
Tier 1: Config-Only
MGLRU, DAMON, THP, RCU Lazy, sched_ext. The stuff you see in every “optimize your kernel” blog post. Turn on the good flags, turn off the bad ones, rebuild, reboot.
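For concreteness, here is how those flips look using the kernel tree's own `scripts/config` helper. The option names are my mapping of the features above to their upstream Kconfig symbols, not arashiOS's exact config:

```shell
# Run from the kernel source root against an existing .config.
scripts/config --enable LRU_GEN --enable LRU_GEN_ENABLED   # MGLRU on by default
scripts/config --enable DAMON --enable DAMON_SYSFS         # DAMON monitoring
scripts/config --enable TRANSPARENT_HUGEPAGE               # THP
scripts/config --enable RCU_LAZY                           # lazy RCU callbacks
scripts/config --enable SCHED_CLASS_EXT                    # sched_ext framework
make olddefconfig   # resolve any newly exposed dependencies
```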
| Metric | T0 | T1 | Delta |
|---|---|---|---|
| PyBench | 9.536s | 9.466s | -0.73% |
| Hackbench pipe | 6.93s | 7.01s | +1.15% |
| Hackbench socket | 14.29s | 14.00s | -2.03% |
| FIO seq read | 22,328 MB/s | 22,541 MB/s | +0.96% |
| FIO rand read | 900K IOPS | 897K IOPS | -0.35% |
| Boot time | 5.640s | 5.632s | -0.14% |
Verdict: Participation trophy. Every delta is within noise. Safe to keep, nothing to brag about.
FIO sequential write showed +18% (9,428 to 11,128 MB/s) but that was an environmental artifact — fresh boot vs. fragmented allocator. Uptime difference, not config. This is why you control your test conditions.
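A minimal pre-run environment snapshot along those lines (sysfs paths are assumptions for this machine, not the actual harness code); diffing two snapshots catches exactly the fresh-boot-vs-fragmented-allocator trap:

```shell
# Record battery, thermals, and uptime before each benchmark run.
{
  cat /sys/class/power_supply/*/capacity 2>/dev/null || true     # battery %
  cat /sys/class/thermal/thermal_zone*/temp 2>/dev/null || true  # millidegrees C
  cat /proc/uptime                                               # seconds since boot
  date -Is
} > "bench-env-$(date +%s).txt"
```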
Lesson: There are no magic Kconfig switches. If there were, the upstream maintainers would have flipped them years ago and written a smug commit message about it.
Tier 2a: Clang + -O3 + -mcpu=apple-m1
Recompile the entire kernel with Clang instead of GCC, crank optimization to -O3, and target the exact CPU microarchitecture. The theory: Apple’s cores have wide pipelines and deep reorder buffers. A compiler that knows this should produce tighter code.
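One plausible shape of that build (an assumed invocation, not the project's exact recipe): `LLVM=1` swaps the whole toolchain to Clang/LLVM, and `KCFLAGS` appends flags after the kernel's defaults, so the later `-O3` wins over the default `-O2`.

```shell
# Full Clang/LLVM toolchain build targeting the M1 microarchitecture.
make LLVM=1 KCFLAGS="-O3 -mcpu=apple-m1" -j"$(nproc)" Image modules
```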
| Metric | T1 | T2a | Delta |
|---|---|---|---|
| PyBench | 9.466s | 9.480s | +0.15% |
| Hackbench pipe | 7.01s | 7.48s | +6.7% |
| Hackbench socket | 14.00s | 14.13s | +0.9% |
| FIO seq write | 11,128 MB/s | 10,900 MB/s | -2.0% |
| FIO seq read | 22,541 MB/s | 22,849 MB/s | +1.4% |
| FIO rand read | 897K IOPS | 905K IOPS | +0.8% |
| glmark2 | 3,052 | 3,071 | +0.6% |
| Boot time | 5.632s | 5.915s | +5.0% |
Verdict: Expensive nothing. We expected 0-4%. We got noise. We’re showing you anyway because honesty is a personality trait.
The Clang migration wasn’t free, either. Clang + pahole (the tool that generates BTF metadata for BPF programs) has compatibility issues that took 4 patches to resolve. Boot time regressed 5%, likely from BTF validation overhead.
So why bother?
The short version: Clang unlocks a specific optimization (ThinLTO) that the kernel build system can’t use yet because of a Rust conflict. Here’s the long version.
ThinLTO is link-time optimization that inlines across file boundaries — that’s where the real 2-5% kernel-wide gains live. But right now the kernel build system won’t allow Rust + LTO + BTF simultaneously. LTO merges compilation units; pahole can’t separate Rust debug info from C after the merge. No BTF means no BPF means no sched_ext (the pluggable scheduler framework).
CachyOS sidesteps this by disabling Rust. We can’t. DRM_ASAHI — the GPU driver — is 21,000 lines of Rust. No Rust = no display. That’s the Apple Silicon tax.
Alice Ryhl’s patches resolving one of the two blockers are landing in kernel 7.0. When Asahi Linux rebases to 7.0 (estimated May-June 2026), we flip the switch. The Clang migration is already done.
Zero gains today. But when 7.0 drops, we’re ready and everyone else is starting from scratch.
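For reference, the eventual flip is just Kconfig. A hypothetical sketch using the tree's `scripts/config` helper with upstream's existing Clang LTO options; today the build system rejects this combination while Rust is enabled:

```shell
# The "switch" mentioned above, once the Rust/pahole blocker is resolved.
scripts/config --disable LTO_NONE --enable LTO_CLANG_THIN
make olddefconfig
```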
Tier 3a: Sysctl + ZRAM + Boot Params
15 Kconfig changes. ZRAM at 15.4GB with LZ4 compression. Sysctl tuning. Boot params tuned. Debug infrastructure stripped.
vm.swappiness = 180
vm.page-cluster = 0
vm.dirty_writeback_centisecs = 600
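A hedged sketch of wiring this up by hand (many distros use zram-generator instead; the device size, file name, and swap priority here are illustrative):

```shell
# Persist the sysctl values, then build the ZRAM swap device manually.
cat > /etc/sysctl.d/99-arashi.conf <<'EOF'
vm.swappiness = 180
vm.page-cluster = 0
vm.dirty_writeback_centisecs = 600
EOF
sysctl --system

modprobe zram
echo lz4    > /sys/block/zram0/comp_algorithm   # must be set before disksize
echo 15770M > /sys/block/zram0/disksize         # ~15.4GB, matching the figure above
mkswap /dev/zram0
swapon -p 100 /dev/zram0   # high priority: ZRAM is tried before any disk swap
```

The high `vm.swappiness` only makes sense with compressed RAM as the swap target; on disk-backed swap it would be a regression.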
The initial run had regressions — FIO sequential read down 35%, glmark2 down 19%. All five regressions root-caused to a single boot parameter: nohz_full=2-9, which disables timer ticks on all 8 P-cores. Intended for dedicated real-time workloads, not desktops. Removed it, re-benchmarked. Every regression recovered with zero tradeoff — the latency wins from other changes were fully retained.
The full regression investigation is its own post. Short version: Apple’s AIC2 interrupt controller hardware-manages IRQ routing across cores. It dynamically routes to awake, least-loaded cores — better than any static affinity mask. The IRQ affinity service I wrote was dead code. Don’t fight the hardware.
Key wins that survived: page fault throughput +41.2%, cyclictest latency improvements across the board. These carried forward into Tier 3b.
Tier 3b: BORE + BBRv3
BORE (Burst-Oriented Response Enhancer) is a drop-in scheduler replacement that’s smarter about which thread runs next. BBRv3 is Google’s congestion control algorithm for TCP. Both applied as patches on top of the Tier 2a Clang base.
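The TCP half is a runtime switch once the patched kernel is booted (standard Linux knobs; I'm assuming the v3 patchset keeps the upstream module name `bbr`):

```shell
# Select BBR as the congestion control algorithm; fq pairs well with
# BBR's pacing model.
modprobe tcp_bbr
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr
```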
This is where things got stupid fast — in the good way.
| Metric | T2a | T3b | Delta |
|---|---|---|---|
| Schbench p99 | 3,581 us | 52 us | -98.5% |
| Hackbench pipe | 7.48s | 5.96s | -20.3% |
| Hackbench socket | 14.13s | 11.90s | -15.8% |
| Page fault | 27,693 ops/s | 39,115 ops/s | +41.2% |
| Boot time | 5.915s | 5.42s | -8.3% |
| PyBench | 9.48s | 9.46s | -0.2% (noise) |
| FIO seq write | 10,900 MB/s | 10,885 MB/s | -0.1% (noise) |
| FIO seq read | 22,849 MB/s | 21,550 MB/s | -5.7% (suspect) |
| FIO rand read | 905K IOPS | 888K IOPS | -1.8% (noise) |
| glmark2 | 3,071 | 2,771 | -9.8% (suspect) |
Schbench p99 went from 3,581 microseconds to 52. That’s wake-up latency — how long a thread waits after being marked runnable before it actually gets a CPU. In human terms: how snappy your desktop feels under load. 98.5% improvement. This one isn’t optional.
Hackbench (a scheduler stress test that measures throughput across many communicating threads) dropped 20% on pipes and 16% on sockets. Page fault handling — how fast the kernel can set up new memory pages — jumped 41%.
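Representative invocations of the two benchmarks (flags assumed from common usage; check your local versions, and note these are the shape of the runs, not the harness's exact parameters):

```shell
# schbench reports wake-up latency percentiles under scheduler load;
# hackbench floods the scheduler with communicating thread groups.
schbench -m 2 -t 8 -r 30      # 2 message threads x 8 workers, 30s run
hackbench -p -g 10 -l 1000    # pipe transport, 10 groups
hackbench -g 10 -l 1000       # socket transport (the default)
```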
Honest caveats: The glmark2 and FIO seq read regressions are suspicious. T3b benchmarks ran at 24% battery vs 88% for T2a, charge regulator was 62.6C vs 54.3C. Apple Silicon throttles under thermal pressure and at low charge. Pending clean re-test with matched conditions. I’m not claiming wins I can’t prove, and I’m not hiding losses I can’t explain yet.
The Full Picture
Stock Asahi to Arashi Tier 3b, every number in one table:
| Metric | T0 (Stock) | T1 (Config) | T2a (Clang) | T3b (BORE+BBR) | T0 to T3b |
|---|---|---|---|---|---|
| PyBench (s) | 9.536 | 9.466 | 9.480 | 9.46 | -0.8% |
| Hackbench pipe (s) | 6.93 | 7.01 | 7.48 | 5.96 | -14.0% |
| Schbench p99 (us) | 4,808 | 4,264 | 3,581 | 52 | -98.9% |
| Hackbench socket (s) | 14.29 | 14.00 | 14.13 | 11.90 | -16.7% |
| Page fault (ops/s) | 29,301 | 28,522 | 27,693 | 39,115 | +33.5% |
| Boot time (s) | 5.640 | 5.632 | 5.915 | 5.42 | -3.9% |
| FIO seq read (MB/s) | 22,328 | 22,541 | 22,849 | 21,550 | -3.5%* |
| FIO rand read (IOPS) | 900K | 897K | 905K | 888K | -1.3% |
| glmark2 | 3,184 | 3,052 | 3,071 | 2,771 | -13.0%* |
*Suspect — thermal/battery confound. Re-test pending.
The pattern is clear. Tiers 1 and 2a were noise. The real gains came from BORE, memory subsystem tuning, and knowing what to remove.
What We Learned
- The compiler didn’t matter. Sorry, compiler nerds. The real gains came from the scheduler and memory config.
- Some optimizations are regressions in disguise. `nohz_full` on all P-cores cratered I/O throughput by 35%. We found it, proved it, and removed it.
- Apple’s interrupt controller (`AIC2`) is smarter than manual tuning. It hardware-routes 85 of 89 IRQs. Userspace affinity masks are a no-op. Don’t fight the hardware.
- `BORE` is the single biggest win. Wake-up latency dropped 98.5%. Your desktop feels snappier because threads stop waiting in line.
- If you’re not controlling for battery level, thermals, and uptime, your benchmarks are fan fiction. We caught it. Twice.
What’s Next
- Clean stock-vs-Arashi A/B with matched environmental conditions (battery, thermals, uptime)
- ZRAM algorithm comparison: `LZ4` vs `ZSTD`, with data
- Power optimization — 77 items tracked, 34 shipped so far
- `ThinLTO` when kernel 7.0 lands — the Clang groundwork pays off
- Public GitHub repo
Arashi (嵐) — storm. Built on an M1 Max in Arch Linux ARM. No compromises, no generics. Just the kernel, the benchmarks, and the diff.