98.5% Less Latency: What Actually Matters When Tuning Linux on Apple Silicon
Everyone has a list of kernel flags that’ll make your system faster. Most of them are copium. Hell, some of mine might end up copium, too. It’s too early to tell. :)
This post walks through five kernel optimization tiers, each benchmarked A/B on an M1 Max, each building on the last. This is part of arashiOS — a Linux kernel purpose-built for Apple Silicon. Here’s what actually worked, and what was just noise.
TL;DR: Five kernel optimization tiers benchmarked on an M1 Max. Config-only changes (Tier 1) and the Clang compiler migration (Tier 2a) were noise — zero meaningful gains. The BORE scheduler was the single biggest win, dropping wake-up latency 98.5% (3,581 us to 52 us against the Tier 2a base; 98.9% against stock's 4,808 us). Memory subsystem tuning (ZRAM, sysctl, boot params) boosted page fault throughput 41%. The Clang migration pays off later when ThinLTO lands in kernel 7.0.
The Setup
- Hardware: Apple M1 Max, 32GB, 2 Icestorm E-cores + 8 Firestorm P-cores
- Base: Arch Linux ARM, linux-asahi 6.18.15
- Methodology: `bench-full.sh` harness, controlled conditions, multiple runs, coefficient of variation tracked per metric. Same machine, same disk, same measurement code.
If a number moved less than the noise floor, it gets called noise.
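As a sketch of that noise-floor rule (illustrative numbers and a hypothetical helper, not the harness's actual code), the per-metric coefficient of variation can be computed like this:

```shell
# cv: print the coefficient of variation (stddev / mean, as a percent) for a
# list of samples. Any A/B delta smaller than this number is called noise.
cv() {
  echo "$@" | tr ' ' '\n' | awk '
    { sum += $1; sumsq += $1 * $1; n++ }
    END {
      mean = sum / n
      sd = sqrt(sumsq / n - mean * mean)   # population stddev
      printf "%.2f\n", 100 * sd / mean
    }'
}

cv 9.53 9.47 9.51 9.49 9.55   # e.g. five PyBench runs; prints 0.30
```

With a CV around 0.3%, the Tier 1 PyBench "win" of -0.73% is barely above twice the run-to-run wobble, which is why it gets no credit below.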
Tier 0 is the stock kernel. Every claim in this post is a delta against these numbers:
| Metric | Tier 0 (Stock) |
|---|---|
| PyBench | 9.536s |
| Hackbench pipe | 6.93s |
| Hackbench socket | 14.29s |
| Schbench p99 | 4,808 us |
| Page fault | 29,301 ops/s |
| Boot time | 5.640s |
| FIO seq read | 22,328 MB/s |
| FIO seq write | 9,428 MB/s |
| FIO rand read | 900,302 IOPS |
| glmark2 | 3,184 |
Tier 1: Config-Only
MGLRU, DAMON, THP, RCU Lazy, sched_ext. The stuff you see in every “optimize your kernel” blog post. Turn on the good flags, turn off the bad ones, rebuild, reboot.
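For concreteness, here is how those flips look using the kernel tree's own `scripts/config` helper. The option names are my mapping of the features above to their upstream Kconfig symbols, not arashiOS's exact config:

```shell
# Run from the kernel source root against an existing .config.
scripts/config --enable LRU_GEN --enable LRU_GEN_ENABLED   # MGLRU on by default
scripts/config --enable DAMON --enable DAMON_SYSFS         # DAMON monitoring
scripts/config --enable TRANSPARENT_HUGEPAGE               # THP
scripts/config --enable RCU_LAZY                           # lazy RCU callbacks
scripts/config --enable SCHED_CLASS_EXT                    # sched_ext framework
make olddefconfig   # resolve any newly exposed dependencies
```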
| Metric | T0 | T1 | Delta |
|---|---|---|---|
| PyBench | 9.536s | 9.466s | -0.73% |
| Hackbench pipe | 6.93s | 7.01s | +1.15% |
| Hackbench socket | 14.29s | 14.00s | -2.03% |
| FIO seq read | 22,328 MB/s | 22,541 MB/s | +0.96% |
| FIO rand read | 900K IOPS | 897K IOPS | -0.35% |
| Boot time | 5.640s | 5.632s | -0.14% |
Verdict: Participation trophy. Every delta is within noise. Safe to keep, nothing to brag about.
FIO sequential write showed +18% (9,428 to 11,128 MB/s) but that was an environmental artifact — fresh boot vs. fragmented allocator. Uptime difference, not config. This is why you control your test conditions.
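A minimal pre-run environment snapshot along those lines (sysfs paths are assumptions for this machine, not the actual harness code); diffing two snapshots catches exactly the fresh-boot-vs-fragmented-allocator trap:

```shell
# Record battery, thermals, and uptime before each benchmark run.
{
  cat /sys/class/power_supply/*/capacity 2>/dev/null || true     # battery %
  cat /sys/class/thermal/thermal_zone*/temp 2>/dev/null || true  # millidegrees C
  cat /proc/uptime                                               # seconds since boot
  date -Is
} > "bench-env-$(date +%s).txt"
```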
Lesson: There are no magic Kconfig switches. If there were, the upstream maintainers would have flipped them years ago and written a smug commit message about it.
Tier 2a: Clang + -O3 + -mcpu=apple-m1
Recompile the entire kernel with Clang instead of GCC, crank optimization to -O3, and target the exact CPU microarchitecture. The theory: Apple’s cores have wide pipelines and deep reorder buffers. A compiler that knows this should produce tighter code.
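One plausible shape of that build (an assumed invocation, not the project's exact recipe): `LLVM=1` swaps the whole toolchain to Clang/LLVM, and `KCFLAGS` appends flags after the kernel's defaults, so the later `-O3` wins over the default `-O2`.

```shell
# Full Clang/LLVM toolchain build targeting the M1 microarchitecture.
make LLVM=1 KCFLAGS="-O3 -mcpu=apple-m1" -j"$(nproc)" Image modules
```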
| Metric | T1 | T2a | Delta |
|---|---|---|---|
| PyBench | 9.466s | 9.480s | +0.15% |
| Hackbench pipe | 7.01s | 7.48s | +6.7% |
| Hackbench socket | 14.00s | 14.13s | +0.9% |
| FIO seq write | 11,128 MB/s | 10,900 MB/s | -2.0% |
| FIO seq read | 22,541 MB/s | 22,849 MB/s | +1.4% |
| FIO rand read | 897K IOPS | 905K IOPS | +0.8% |
| glmark2 | 3,052 | 3,071 | +0.6% |
| Boot time | 5.632s | 5.915s | +5.0% |
Verdict: Expensive nothing. We expected 0-4%. We got noise. We’re showing you anyway because honesty is a personality trait.
The Clang migration wasn’t free, either. Clang + pahole (the tool that generates BTF metadata for BPF programs) has compatibility issues that took 4 patches to resolve. Boot time regressed 5%, likely from BTF validation overhead.
So why bother?
The short version: Clang unlocks a specific optimization (ThinLTO) that the kernel build system can’t use yet because of a Rust conflict. Here’s the long version.
ThinLTO is link-time optimization that inlines across file boundaries — that’s where the real 2-5% kernel-wide gains live. But right now the kernel build system won’t allow Rust + LTO + BTF simultaneously. LTO merges compilation units; pahole can’t separate Rust debug info from C after the merge. No BTF means no BPF means no sched_ext (the pluggable scheduler framework).
CachyOS sidesteps this by disabling Rust. We can’t. DRM_ASAHI — the GPU driver — is 21,000 lines of Rust. No Rust = no display. That’s the Apple Silicon tax.
Alice Ryhl’s patches resolving one of the two blockers are landing in kernel 7.0. When Asahi Linux rebases to 7.0 (estimated May-June 2026), we flip the switch. The Clang migration is already done.
Zero gains today. But when 7.0 drops, we’re ready and everyone else is starting from scratch.
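For reference, the eventual flip is just Kconfig. A hypothetical sketch using the tree's `scripts/config` helper with upstream's existing Clang LTO options; today the build system rejects this combination while Rust is enabled:

```shell
# The "switch" mentioned above, once the Rust/pahole blocker is resolved.
scripts/config --disable LTO_NONE --enable LTO_CLANG_THIN
make olddefconfig
```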
Tier 3a: Sysctl + ZRAM + Boot Params
15 Kconfig changes. ZRAM at 15.4GB with LZ4 compression. Sysctl tuning. Boot params tuned. Debug infrastructure stripped.
vm.swappiness = 180
vm.page-cluster = 0
vm.dirty_writeback_centisecs = 600
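A hedged sketch of wiring this up by hand (many distros use zram-generator instead; the device size, file name, and swap priority here are illustrative):

```shell
# Persist the sysctl values, then build the ZRAM swap device manually.
cat > /etc/sysctl.d/99-arashi.conf <<'EOF'
vm.swappiness = 180
vm.page-cluster = 0
vm.dirty_writeback_centisecs = 600
EOF
sysctl --system

modprobe zram
echo lz4    > /sys/block/zram0/comp_algorithm   # must be set before disksize
echo 15770M > /sys/block/zram0/disksize         # ~15.4GB, matching the figure above
mkswap /dev/zram0
swapon -p 100 /dev/zram0   # high priority: ZRAM is tried before any disk swap
```

The high `vm.swappiness` only makes sense with compressed RAM as the swap target; on disk-backed swap it would be a regression.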
The initial run had regressions — FIO sequential read down 35%, glmark2 down 19%. All five regressions root-caused to a single boot parameter: nohz_full=2-9, which disables timer ticks on all 8 P-cores. Intended for dedicated real-time workloads, not desktops. Removed it, re-benchmarked. Every regression recovered with zero tradeoff — the latency wins from other changes were fully retained.
The full regression investigation is its own post. Short version: Apple’s AIC2 interrupt controller hardware-manages IRQ routing across cores. It dynamically routes to awake, least-loaded cores — better than any static affinity mask. The IRQ affinity service I wrote was dead code. Don’t fight the hardware.
Key wins that survived: page fault throughput +41.2%, cyclictest latency improvements across the board. These carried forward into Tier 3b.
Tier 3b: BORE + BBRv3
BORE (Burst-Oriented Response Enhancer) is a drop-in scheduler replacement that’s smarter about which thread runs next. BBRv3 is Google’s congestion control algorithm for TCP. Both applied as patches on top of the Tier 2a Clang base.
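The TCP half is a runtime switch once the patched kernel is booted (standard Linux knobs; I'm assuming the v3 patchset keeps the upstream module name `bbr`):

```shell
# Select BBR as the congestion control algorithm; fq pairs well with
# BBR's pacing model.
modprobe tcp_bbr
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr
```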
This is where things got stupid fast — in the good way.
| Metric | T2a | T3b | Delta |
|---|---|---|---|
| Schbench p99 | 3,581 us | 52 us | -98.5% |
| Hackbench pipe | 7.48s | 5.96s | -20.3% |
| Hackbench socket | 14.13s | 11.90s | -15.8% |
| Page fault | 27,693 ops/s | 39,115 ops/s | +41.2% |
| Boot time | 5.915s | 5.42s | -8.3% |
| PyBench | 9.48s | 9.46s | -0.2% (noise) |
| FIO seq write | 10,900 MB/s | 10,885 MB/s | -0.1% (noise) |
| FIO seq read | 22,849 MB/s | 21,550 MB/s | -5.7% (suspect) |
| FIO rand read | 905K IOPS | 888K IOPS | -1.8% (noise) |
| glmark2 | 3,071 | 2,771 | -9.8% (suspect) |
Schbench p99 went from 3,581 microseconds to 52. That’s wake-up latency — how long a thread waits after being marked runnable before it actually gets a CPU. In human terms: how snappy your desktop feels under load. 98.5% improvement. This one isn’t optional.
Hackbench (a scheduler stress test that measures throughput across many communicating threads) dropped 20% on pipes and 16% on sockets. Page fault handling — how fast the kernel can set up new memory pages — jumped 41%.
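Representative invocations of the two benchmarks (flags assumed from common usage; check your local versions, and note these are the shape of the runs, not the harness's exact parameters):

```shell
# schbench reports wake-up latency percentiles under scheduler load;
# hackbench floods the scheduler with communicating thread groups.
schbench -m 2 -t 8 -r 30      # 2 message threads x 8 workers, 30s run
hackbench -p -g 10 -l 1000    # pipe transport, 10 groups
hackbench -g 10 -l 1000       # socket transport (the default)
```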
Honest caveats: The glmark2 and FIO seq read regressions are suspicious. T3b benchmarks ran at 24% battery vs 88% for T2a, charge regulator was 62.6C vs 54.3C. Apple Silicon throttles under thermal pressure and at low charge. Pending clean re-test with matched conditions. I’m not claiming wins I can’t prove, and I’m not hiding losses I can’t explain yet.
The Full Picture
Stock Asahi to Arashi Tier 3b, every number in one table:
| Metric | T0 (Stock) | T1 (Config) | T2a (Clang) | T3b (BORE+BBR) | T0 to T3b |
|---|---|---|---|---|---|
| PyBench (s) | 9.536 | 9.466 | 9.480 | 9.46 | -0.8% |
| Hackbench pipe (s) | 6.93 | 7.01 | 7.48 | 5.96 | -14.0% |
| Schbench p99 (us) | 4,808 | 4,264 | 3,581 | 52 | -98.9% |
| Hackbench socket (s) | 14.29 | 14.00 | 14.13 | 11.90 | -16.7% |
| Page fault (ops/s) | 29,301 | 28,522 | 27,693 | 39,115 | +33.5% |
| Boot time (s) | 5.640 | 5.632 | 5.915 | 5.42 | -3.9% |
| FIO seq read (MB/s) | 22,328 | 22,541 | 22,849 | 21,550 | -3.5%* |
| FIO rand read (IOPS) | 900K | 897K | 905K | 888K | -1.3% |
| glmark2 | 3,184 | 3,052 | 3,071 | 2,771 | -13.0%* |
*Suspect — thermal/battery confound. Re-test pending.
The pattern is clear. Tiers 1 and 2a were noise. The real gains came from BORE, memory subsystem tuning, and knowing what to remove.
What We Learned
- The compiler didn’t matter. Sorry, compiler nerds. The real gains came from the scheduler and memory config.
- Some optimizations are regressions in disguise. `nohz_full` on all P-cores cratered I/O throughput by 35%. We found it, proved it, and removed it.
- Apple’s interrupt controller (`AIC2`) is smarter than manual tuning. It hardware-routes 85 of 89 IRQs. Userspace affinity masks are a no-op. Don’t fight the hardware.
- `BORE` is the single biggest win. Wake-up latency dropped 98.5%. Your desktop feels snappier because threads stop waiting in line.
- If you’re not controlling for battery level, thermals, and uptime, your benchmarks are fan fiction. We caught it. Twice.
What’s Next
- Clean stock-vs-Arashi A/B with matched environmental conditions (battery, thermals, uptime)
- ZRAM algorithm comparison: `LZ4` vs `ZSTD`, with data
- Power optimization — 77 items tracked, 34 shipped so far
- `ThinLTO` when kernel 7.0 lands — the Clang groundwork pays off
- Public GitHub repo
Arashi (嵐) — storm. Built on an M1 Max in Arch Linux ARM. No compromises, no generics. Just the kernel, the benchmarks, and the diff.