Everyone has a list of kernel flags that’ll make your system faster. Most of them are copium. Hell, some of mine might end up copium, too. It’s too early to tell. :)

This post walks through five kernel optimization tiers, each benchmarked A/B on an M1 Max, each building on the last. This is part of arashiOS — a Linux kernel purpose-built for Apple Silicon. Here’s what actually worked, and what was just noise.

TL;DR: Five kernel optimization tiers benchmarked on an M1 Max. Config-only changes (Tier 1) and the Clang compiler migration (Tier 2a) were noise — zero meaningful gains. The BORE scheduler was the single biggest win, dropping wake-up latency 98.5% (3,581 us to 52 us). Memory subsystem tuning (ZRAM, sysctl, boot params) boosted page fault throughput 41%. The Clang migration pays off later when ThinLTO lands in kernel 7.0.

The Setup

  • Hardware: Apple M1 Max, 32GB, 2 Icestorm E-cores + 8 Firestorm P-cores
  • Base: Arch Linux ARM, linux-asahi 6.18.15
  • Methodology: bench-full.sh harness, controlled conditions, multiple runs, coefficient of variation tracked per metric. Same machine, same disk, same measurement code.

If a number moved less than the noise floor, it gets called noise.
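To make the noise floor concrete: for each metric, the harness tracks the coefficient of variation (stddev over mean) across repeated runs. A minimal sketch in awk; the run times below are illustrative, not measured data:

```shell
# Coefficient of variation across repeated runs of one benchmark.
# A delta smaller than this fraction gets classified as noise.
runs="9.52 9.55 9.53 9.56 9.51"   # illustrative PyBench-style times, seconds

echo "$runs" | awk '{
    n = NF
    for (i = 1; i <= n; i++) { sum += $i; sumsq += $i * $i }
    mean = sum / n
    var  = sumsq / n - mean * mean   # population variance
    printf "CV: %.4f\n", sqrt(var) / mean
}'
```

With a CV around 0.2% here, any delta under a few tenths of a percent is indistinguishable from run-to-run jitter.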

Tier 0 is the stock kernel. Every claim in this post is a delta against these numbers:

| Metric | Tier 0 (Stock) |
|---|---|
| PyBench | 9.536s |
| Hackbench pipe | 6.93s |
| Hackbench socket | 14.29s |
| Schbench p99 | 4,808 us |
| Page fault | 29,301 ops/s |
| Boot time | 5.640s |
| FIO seq read | 22,328 MB/s |
| FIO seq write | 9,428 MB/s |
| FIO rand read | 900,302 IOPS |
| glmark2 | 3,184 |

Tier 1: Config-Only

MGLRU, DAMON, THP, RCU Lazy, sched_ext. The stuff you see in every “optimize your kernel” blog post. Turn on the good flags, turn off the bad ones, rebuild, reboot.
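For reference, a sketch of how those flags get flipped with the kernel's own scripts/config helper. The symbol names are the upstream Kconfig ones; verify them against your tree before building:

```shell
# Tier 1 Kconfig changes via the kernel's scripts/config helper.
cd linux-asahi

./scripts/config --enable LRU_GEN             # MGLRU
./scripts/config --enable LRU_GEN_ENABLED     # MGLRU on by default
./scripts/config --enable DAMON               # data-access monitoring
./scripts/config --enable TRANSPARENT_HUGEPAGE
./scripts/config --enable RCU_LAZY            # batch RCU callbacks
./scripts/config --enable SCHED_CLASS_EXT     # sched_ext plumbing

make olddefconfig                             # resolve new dependencies
```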

| Metric | T0 | T1 | Delta |
|---|---|---|---|
| PyBench | 9.536s | 9.466s | -0.73% |
| Hackbench pipe | 6.93s | 7.01s | +1.15% |
| Hackbench socket | 14.29s | 14.00s | -2.03% |
| FIO seq read | 22,328 MB/s | 22,541 MB/s | +0.96% |
| FIO rand read | 900K IOPS | 897K IOPS | -0.35% |
| Boot time | 5.640s | 5.632s | -0.14% |

Verdict: Participation trophy. Every delta is within noise. Safe to keep, nothing to brag about.

FIO sequential write showed +18% (9,428 to 11,128 MB/s) but that was an environmental artifact — fresh boot vs. fragmented allocator. Uptime difference, not config. This is why you control your test conditions.

Lesson: There are no magic Kconfig switches. If there were, the upstream maintainers would have flipped them years ago and written a smug commit message about it.


Tier 2a: Clang + -O3 + -mcpu=apple-m1

Recompile the entire kernel with Clang instead of GCC, crank optimization to -O3, and target the exact CPU microarchitecture. The theory: Apple’s cores have wide pipelines and deep reorder buffers. A compiler that knows this should produce tighter code.
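One way to drive that build, as a sketch (the real arashiOS build scripts may wire the flags through Kconfig instead). LLVM=1 selects the full LLVM toolchain; KCFLAGS appends extra codegen flags on top of whatever the config chose:

```shell
# Illustrative Clang kernel build for arm64.
make LLVM=1 olddefconfig
make LLVM=1 KCFLAGS="-O3 -mcpu=apple-m1" -j"$(nproc)" Image.gz modules
```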

| Metric | T1 | T2a | Delta |
|---|---|---|---|
| PyBench | 9.466s | 9.480s | +0.15% |
| Hackbench pipe | 7.01s | 7.48s | +6.7% |
| Hackbench socket | 14.00s | 14.13s | +0.9% |
| FIO seq write | 11,128 MB/s | 10,900 MB/s | -2.0% |
| FIO seq read | 22,541 MB/s | 22,849 MB/s | +1.4% |
| FIO rand read | 897K IOPS | 905K IOPS | +0.8% |
| glmark2 | 3,052 | 3,071 | +0.6% |
| Boot time | 5.632s | 5.915s | +5.0% |

Verdict: Expensive nothing. We expected 0-4%. We got noise. We’re showing you anyway because honesty is a personality trait.

The Clang migration wasn’t free, either. Clang + pahole (the tool that generates BTF metadata for BPF programs) has compatibility issues that took 4 patches to resolve. Boot time regressed 5%, likely from BTF validation overhead.

So why bother?

The short version: Clang unlocks a specific optimization (ThinLTO) that the kernel build system can’t use yet because of a Rust conflict. Here’s the long version.

ThinLTO is link-time optimization that inlines across file boundaries — that’s where the real 2-5% kernel-wide gains live. But right now the kernel build system won’t allow Rust + LTO + BTF simultaneously. LTO merges compilation units; pahole can’t separate Rust debug info from C after the merge. No BTF means no BPF means no sched_ext (the pluggable scheduler framework).
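The three-way conflict, written out as Kconfig symbols. This is a sketch of the incompatibility, not a working config: as of current trees, Kconfig's dependency resolution quietly drops one of the three.

```shell
# The combination the kernel build system won't allow today.
./scripts/config --enable LTO_CLANG_THIN   # ThinLTO: cross-unit inlining
./scripts/config --enable RUST             # required by DRM_ASAHI
./scripts/config --enable DEBUG_INFO_BTF   # required by BPF / sched_ext
make LLVM=1 olddefconfig                   # one of the three won't survive
```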

CachyOS sidesteps this by disabling Rust. We can’t. DRM_ASAHI — the GPU driver — is 21,000 lines of Rust. No Rust = no display. That’s the Apple Silicon tax.

Alice Ryhl’s patches resolving one of the two blockers are landing in kernel 7.0. When Asahi Linux rebases to 7.0 (estimated May-June 2026), we flip the switch. The Clang migration is already done.

Zero gains today. But when 7.0 drops, we’re ready and everyone else is starting from scratch.


Tier 3a: Sysctl + ZRAM + Boot Params

15 Kconfig changes. ZRAM at 15.4GB with LZ4 compression. Sysctl tuning. Boot params tuned. Debug infrastructure stripped.

vm.swappiness = 180
vm.page-cluster = 0
vm.dirty_writeback_centisecs = 600
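Wired up at boot, the ZRAM side looks roughly like this (standard zram sysfs interface; 15870M is my rounding of the 15.4GB figure):

```shell
# ZRAM swap device with LZ4 compression, plus the sysctl tuning above.
modprobe zram num_devices=1
echo lz4    > /sys/block/zram0/comp_algorithm
echo 15870M > /sys/block/zram0/disksize
mkswap /dev/zram0
swapon -p 100 /dev/zram0                     # outrank any disk swap

sysctl -w vm.swappiness=180                  # strongly prefer cheap zram swap
sysctl -w vm.page-cluster=0                  # no readahead on swap-in
sysctl -w vm.dirty_writeback_centisecs=600   # flush dirty pages every 6s
```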

The initial run had regressions — FIO sequential read down 35%, glmark2 down 19%. All five regressions root-caused to a single boot parameter: nohz_full=2-9, which disables timer ticks on all 8 P-cores. Intended for dedicated real-time workloads, not desktops. Removed it, re-benchmarked. Every regression recovered with zero tradeoff — the latency wins from other changes were fully retained.

The full regression investigation is its own post. Short version: Apple’s AIC2 interrupt controller hardware-manages IRQ routing across cores. It dynamically routes to awake, least-loaded cores — better than any static affinity mask. The IRQ affinity service I wrote was dead code. Don’t fight the hardware.
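You can verify this on any box in two minutes: write an affinity mask, then watch whether the per-CPU counters actually move. The IRQ number below is a hypothetical placeholder.

```shell
head -1 /proc/interrupts                  # header: one column per CPU
echo 4 > /proc/irq/42/smp_affinity        # try to pin IRQ 42 to CPU 2 (bit 2)
watch -n1 'grep " 42:" /proc/interrupts'  # on AIC2, the counters keep landing
                                          # wherever the hardware routes them
```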

Key wins that survived: page fault throughput +41.2%, cyclictest latency improvements across the board. These carried forward into Tier 3b.


Tier 3b: BORE + BBRv3

BORE (Burst-Oriented Response Enhancer) is a drop-in scheduler replacement that’s smarter about which thread runs next. BBRv3 is Google’s congestion control algorithm for TCP. Both applied as patches on top of the Tier 2a Clang base.
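BORE is compiled in, so there is nothing to toggle at runtime. BBR is selected per-boot via sysctl; this sketch assumes the BBRv3 patchset keeps the upstream algorithm name bbr:

```shell
sysctl -w net.core.default_qdisc=fq               # pacing-friendly qdisc
sysctl -w net.ipv4.tcp_congestion_control=bbr
sysctl net.ipv4.tcp_available_congestion_control  # confirm bbr is listed
```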

This is where things got stupid fast — in the good way.

| Metric | T2a | T3b | Delta |
|---|---|---|---|
| Schbench p99 | 3,581 us | 52 us | -98.5% |
| Hackbench pipe | 7.48s | 5.96s | -20.3% |
| Hackbench socket | 14.13s | 11.90s | -15.8% |
| Page fault | 27,693 ops/s | 39,115 ops/s | +41.2% |
| Boot time | 5.915s | 5.42s | -8.3% |
| PyBench | 9.48s | 9.46s | -0.2% (noise) |
| FIO seq write | 10,900 MB/s | 10,885 MB/s | -0.1% (noise) |
| FIO seq read | 22,849 MB/s | 21,550 MB/s | -5.7% (suspect) |
| FIO rand read | 905K IOPS | 888K IOPS | -1.8% (noise) |
| glmark2 | 3,071 | 2,771 | -9.8% (suspect) |

Schbench p99 went from 3,581 microseconds to 52. That’s wake-up latency — how long a thread waits after being marked runnable before it actually gets a CPU. In human terms: how snappy your desktop feels under load. 98.5% improvement. This one isn’t optional.

Hackbench (a scheduler stress test that measures throughput across many communicating threads) dropped 20% on pipes and 16% on sockets. Page fault handling — how fast the kernel can set up new memory pages — jumped 41%.
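For anyone reproducing these numbers, representative invocations of both tools (flags are illustrative; the harness's exact flags live in bench-full.sh):

```shell
schbench -m 2 -t 8 -r 30        # wake-up latency; read the p99 row
hackbench -g 10 -l 1000 --pipe  # pipe variant
hackbench -g 10 -l 1000         # socket variant (the default)
```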

Honest caveats: The glmark2 and FIO seq read regressions are suspicious. T3b benchmarks ran at 24% battery vs 88% for T2a, charge regulator was 62.6C vs 54.3C. Apple Silicon throttles under thermal pressure and at low charge. Pending clean re-test with matched conditions. I’m not claiming wins I can’t prove, and I’m not hiding losses I can’t explain yet.


The Full Picture

Stock Asahi to Arashi Tier 3b, every number in one table:

| Metric | T0 (Stock) | T1 (Config) | T2a (Clang) | T3b (BORE+BBR) | T0 to T3b |
|---|---|---|---|---|---|
| PyBench (s) | 9.536 | 9.466 | 9.480 | 9.46 | -0.8% |
| Hackbench pipe (s) | 6.93 | 7.01 | 7.48 | 5.96 | -14.0% |
| Schbench p99 (us) | 4,808 | 4,264 | 3,581 | 52 | -98.9% |
| Hackbench socket (s) | 14.29 | 14.00 | 14.13 | 11.90 | -16.7% |
| Page fault (ops/s) | 29,301 | 28,522 | 27,693 | 39,115 | +33.5% |
| Boot time (s) | 5.640 | 5.632 | 5.915 | 5.42 | -3.9% |
| FIO seq read (MB/s) | 22,328 | 22,541 | 22,849 | 21,550 | -3.5%* |
| FIO rand read (IOPS) | 900K | 897K | 905K | 888K | -1.3% |
| glmark2 | 3,184 | 3,052 | 3,071 | 2,771 | -13.0%* |

*Suspect — thermal/battery confound. Re-test pending.

The pattern is clear. Tiers 1 and 2a were noise. The real gains came from BORE, memory subsystem tuning, and knowing what to remove.


What We Learned

  • The compiler didn’t matter. Sorry, compiler nerds. The real gains came from the scheduler and memory config.
  • Some optimizations are regressions in disguise. nohz_full on all P-cores cratered I/O throughput by 35%. We found it, proved it, and removed it.
  • Apple’s interrupt controller (AIC2) is smarter than manual tuning. It hardware-routes 85 of 89 IRQs. Userspace affinity masks are a no-op. Don’t fight the hardware.
  • BORE is the single biggest win. Wake-up latency dropped 98.5%. Your desktop feels snappier because threads stop waiting in line.
  • If you’re not controlling for battery level, thermals, and uptime, your benchmarks are fan fiction. We caught it. Twice.

What’s Next

  • Clean stock-vs-Arashi A/B with matched environmental conditions (battery, thermals, uptime)
  • ZRAM algorithm comparison: LZ4 vs ZSTD, with data
  • Power optimization — 77 items tracked, 34 shipped so far
  • ThinLTO when kernel 7.0 lands — the Clang groundwork pays off
  • Public GitHub repo

Arashi (嵐) — storm. Built on an M1 Max in Arch Linux ARM. No compromises, no generics. Just the kernel, the benchmarks, and the diff.