Phase 10 P1–P3 — Results

This closes the first batch of attacks from hot-paths.md §7. Three commits land between the 2026-05-17 baseline (d95a5ba) and the post-P3 commit (d33ede0). Net result: the interpreter is 2.06× faster on the CPU-bound steady-state workload (cpubound-mix), which now simulates faster than the LEON3 it emulates.

Result summary

All numbers are the median of 5 runs of lince-bench, pinned to a single P-core thread (taskset -c 0) on a 12th-gen Intel i5-12450HX. See bench/profiles/phase10-p{1,2,3}-exit/ for the raw .json evidence.

| Sub-task | Commit | cpubound-mix (MIPS) | fptest01 (MIPS) | boot (MIPS) |
|---|---|---|---|---|
| Pre-P1 baseline | d95a5ba | 16.68 | 14.35 | 14.61 |
| P1: typed fetch fast path | 67dd6e7 | 21.57 (+29.3 %) | 17.70 (+23.3 %) | 17.61 (+20.5 %) |
| P2: PC-indexed decode cache | 67063f0 | 32.16 (+49.1 %) | 19.07 (+7.7 %) | 18.89 (+7.3 %) |
| P3: FP gate to dispatcher | d33ede0 | 34.41 (+7.0 %) | 20.68 (+8.4 %) | 21.36 (+13.1 %) |
| Total over baseline | | +106.3 % (2.06×) | +44.1 % (1.44×) | +46.2 % (1.46×) |

On cpubound-mix, %realtime climbed from 66.7 % to 137.6 %. The emulator now simulates ~1.38× faster than a real LEON3 at 50 MHz on a single host thread.

Predicted vs actual

hot-paths.md §7 projected the ROI of each sub-task; the projections turned out conservative overall, although P1 alone came in under. Outcomes:

| Sub-task | Projected | Actual | Delta |
|---|---|---|---|
| P1 | 1.4× | 1.29× | −0.11× (under, though close to the model) |
| P2 (incremental over P1) | 1.25× | 1.49× | +0.24× (over) |
| P3 (incremental over P2) | 1.02× | 1.07× | +0.05× (over) |
| P1+P2+P3 combined | 1.78× | 2.06× | +0.28× (over) |

Reasons the combined factor over-shot:

  1. P1 and P2 reinforce each other. P1 collapses the fetch path (SystemBus::read_physical_u32 drops from 26.6 % to 3.4 % of CPU). P2 then eliminates the fetch entirely on cache hits, so the saving applies to every "warm PC" execution rather than only the first, while P1's cheaper fetch still covers the miss path. The two optimisations were modelled independently, but their interaction is multiplicative.
  2. P3 helped the pipeline, not just cycles. Moving the is_fp_kind && !state.ef() branch out of step removed a correctly-predicted-but-still-fetched branch from the inner loop. The compiler now produces a tighter step body with better instruction-level parallelism, beyond the literal cycle count of the removed comparison.
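To make the P1/P2 interaction concrete, here is a minimal sketch of a direct-mapped, PC-indexed decode cache. It illustrates the general technique only: `DecodedInsn`, `decode`, `DecodeCache`, the 4096-entry size and the fetch callback are all assumptions for this example, and the real layout in commit 67063f0 may differ.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical decoded-instruction record; the real fields are not shown here.
struct DecodedInsn {
    std::uint32_t raw = 0;
    std::uint8_t kind = 0;
};

// Stand-in decoder: it only extracts the top two bits of the instruction word.
inline DecodedInsn decode(std::uint32_t raw) {
    return DecodedInsn{raw, static_cast<std::uint8_t>(raw >> 30)};
}

class DecodeCache {
public:
    // On a hit, neither the fetch callback nor the decoder runs.
    // On a miss, fetch + decode run once and the slot is refilled.
    template <typename FetchFn>
    const DecodedInsn& lookup(std::uint32_t pc, FetchFn&& fetch) {
        Entry& e = entries_[index_of(pc)];
        if (!(e.valid && e.pc == pc)) {
            e.insn = decode(fetch(pc));  // miss path: this is where a fast fetch matters
            e.pc = pc;
            e.valid = true;
            ++misses_;
        }
        return e.insn;
    }

    // Drop the slot that `pc` maps to, e.g. after a store to code memory.
    void invalidate(std::uint32_t pc) { entries_[index_of(pc)].valid = false; }

    std::size_t misses() const { return misses_; }

private:
    struct Entry {
        std::uint32_t pc = 0;
        bool valid = false;
        DecodedInsn insn{};
    };

    static constexpr std::size_t kEntries = 4096;  // power of two, assumed size

    // Instructions are 4-byte aligned, so the low two PC bits carry no information.
    static std::size_t index_of(std::uint32_t pc) { return (pc >> 2) & (kEntries - 1); }

    std::array<Entry, kEntries> entries_{};
    std::size_t misses_ = 0;
};
```

In this shape the fetch cost is paid once per cold or evicted PC, which is why a cheaper fetch and a decode cache do not simply add up: the cache decides how often the fetch runs, and the fast path decides how much each remaining fetch costs.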

Hot-path shift

cpubound-mix, perf record --call-graph=dwarf flat top-10, %CPU:

| Function | Pre-P1 | Post-P1 | Post-P2 | Post-P3 |
|---|---|---|---|---|
| SystemBus::read_physical_u32 | 26.63 | 3.40 | (out) | (out) |
| core::step | 21.67 | 31.39 | 40.30 | 38.11 |
| bus::Ram::read / read_u32_be | 8.51 | 20.71 | 2.21 | 2.37 |
| core::decode | 7.42 | 14.89 | (out) | (out) |
| SystemBus::read_physical (byte) | 6.66 | (out) | (out) | (out) |
| __memmove_avx2 | 4.50 | (out) | (out) | (out) |
| core::detail::exec_alu | 4.49 | (out) | 12.96 | 14.50 |
| core::execute | 1.53 | (out) | 5.11 | 6.06 |
| CpuState::icc | 1.60 | (out) | 5.24 | 5.93 |
| Emulator::run_until_unpaced | (out) | (out) | 10.59 | 9.35 |

Reading:

  • The bus dispatch path collapsed: read_physical_u32 fell from 26.6 % to 3.4 % after P1, then below 1 % and out of the top 10 (shown as "(out)") from P2 onward. P1 done.
  • Decode disappeared from the top 10 post-P2. P2 done.
  • core::step grew its share because everything else shrank. P3 chipped about 2 points off it (40.3 % to 38.1 %). The remaining ~38 % is the irreducible per-step bookkeeping (PSR pipeline, branch_request clear, error_mode check, the cache lookup itself) plus the bus call for data loads.
  • exec_alu became the new largest non-step bucket (14.5 %). This is the giant switch over AluOp enumerators. It's the natural target for Phase 10.2 (threaded-code dispatch).
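For orientation, the sketch below shows one way to replace a large switch with an indexed handler table. It is a plain function-pointer table, a simpler and portable relative of threaded code, and not the design in plans/phase10-2-threaded-code.md, which this document does not describe. The `AluOp` enumerators, the handler signature and `dispatch_alu` are assumptions for illustration.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical subset of ALU operations; the real AluOp has many more enumerators.
enum class AluOp : std::uint8_t { Add, Sub, And, Or, Count };

using AluHandler = std::uint32_t (*)(std::uint32_t, std::uint32_t);

inline std::uint32_t op_add(std::uint32_t a, std::uint32_t b) { return a + b; }
inline std::uint32_t op_sub(std::uint32_t a, std::uint32_t b) { return a - b; }
inline std::uint32_t op_and(std::uint32_t a, std::uint32_t b) { return a & b; }
inline std::uint32_t op_or(std::uint32_t a, std::uint32_t b) { return a | b; }

// Handler table indexed by enumerator value; the order must match AluOp.
constexpr std::array<AluHandler, static_cast<std::size_t>(AluOp::Count)> kAluTable{
    {op_add, op_sub, op_and, op_or}};

// One indexed load plus one indirect call replaces the switch.
inline std::uint32_t dispatch_alu(AluOp op, std::uint32_t a, std::uint32_t b) {
    return kAluTable[static_cast<std::size_t>(op)](a, b);
}
```

Whether an indirect call beats a compiler-generated jump table depends on the host branch predictor and on how much inlining is lost, so this shape would need to be measured rather than assumed faster.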

Correctness gate

All three sub-tasks honored the cross-cutting principle from the post-MVP roadmap: pass-rate ≥ entry pass-rate.

| Test set | Pre-P1 | Post-P1 | Post-P2 | Post-P3 |
|---|---|---|---|---|
| ctest (full suite) | 541/543 ¹ | 543/543 | 543/543 | 543/543 |
| RTEMS sptests N=1 (GR712RC) | 178/189 | 178/189 | 178/189 | 178/189 |
| RTEMS sptests N=1 (GR740) | 178/189 | 178/189 | 178/189 | 178/189 |
| RTEMS smptests N=2 (GR712RC) | 41/49 | 42/49 | 42/49 | 42/49 |
| RTEMS smptests N=4 (GR740) | 39/49 | 40/49 | 40/49 | 40/49 |
| Lince FP unit tests | 47/47 | 47/47 | 47/47 | 47/47 ² |

¹ The pre-P1 baseline was checked locally before the sweep started; its two failures are the known timing-marginal SMP tests from the original baseline, which turn green under P1's faster simulation.

² P3 initially broke 98 FP unit tests because they called execute() directly without set_ef(true). The fix landed in the same sub-task: the tests now set EF before calling FP handlers, which is the operational scenario being tested. The structural move of the EF gate is sound; see the body of commit d33ede0 for the full reasoning.
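The effect described in footnote ² can be illustrated with a minimal sketch of an EF gate placed in the dispatcher. The names `execute`, `ef()` and `set_ef()` come from this document, but every signature and type below is an assumption, as is the choice to report an FP-disabled trap (which is what SPARC V8 specifies when PSR.EF is clear).

```cpp
#include <cstdint>

// Hypothetical instruction kinds and trap codes, for illustration only.
enum class InsnKind : std::uint8_t { Add, FAdd };
enum class Trap : std::uint8_t { None, FpDisabled };

// Minimal stand-in for the CPU state: only the EF (FPU enable) bit.
struct CpuState {
    bool ef_bit = false;
    bool ef() const { return ef_bit; }
    void set_ef(bool on) { ef_bit = on; }
};

// Dispatcher with the EF gate on the FP arm only. Integer instructions never
// evaluate the check, so the per-step loop no longer carries that branch.
inline Trap execute(CpuState& state, InsnKind kind) {
    switch (kind) {
        case InsnKind::FAdd:
            if (!state.ef()) {
                return Trap::FpDisabled;  // gate lives here, not in step
            }
            return Trap::None;  // the FP handler would run here
        case InsnKind::Add:
            return Trap::None;  // the integer handler would run here
    }
    return Trap::None;
}
```

A test that calls `execute` directly on an FP instruction now sees the trap unless it first calls `state.set_ef(true)`, which matches the breakage and the fix described above.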

What was not done (deliberately deferred)

From hot-paths.md §7.4–§7.5:

  • P4 — Lazy __memset_avx2 for RAM/PROM zero-init: still dominates boot-rtems-hello-world (~70 % CPU) and fptest01 (~67 % CPU) on short runs. Irrelevant for sustained 1:1 throughput; only matters for cold-start latency. Defer indefinitely.
  • P5 — Histogram tail truncation: informational, not an optimisation. The top 5 InsnKind buckets cover 80 % of RTEMS executions; Phase 10.2 threaded-code prioritises specialising those kinds.
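The coverage figure quoted for P5 is a cumulative share over buckets sorted by count. A minimal sketch of that calculation follows; `buckets_to_cover` is a hypothetical helper, and the input is whatever per-InsnKind counts the histogram produces.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Returns the smallest k such that the k largest buckets account for at
// least `fraction` (0.0 to 1.0) of the total count.
inline std::size_t buckets_to_cover(std::vector<std::uint64_t> counts, double fraction) {
    // Largest buckets first.
    std::sort(counts.begin(), counts.end(), std::greater<std::uint64_t>());
    const std::uint64_t total =
        std::accumulate(counts.begin(), counts.end(), std::uint64_t{0});
    const double target = fraction * static_cast<double>(total);

    std::uint64_t running = 0;
    std::size_t k = 0;
    while (k < counts.size() && static_cast<double>(running) < target) {
        running += counts[k];
        ++k;
    }
    return k;
}
```

With made-up counts such as {40, 20, 10, 6, 4, 5, 5, 4, 3, 3}, a fraction of 0.8 is reached after the five largest buckets (40 + 20 + 10 + 6 + 5 = 81 of 100).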

Cross-references

  • Plan supersession: hot-paths.md §7.1 reordered the original "decode cache first" plan into "bus fast path first". This document is the post-mortem confirming that reordering was the right call — P1 alone gave 29 % and unlocked P2's 49 % on top.
  • The next obvious work (core::step overhead + exec_alu switch) folds into Phase 10.2 (threaded-code dispatch). See plans/phase10-2-threaded-code.md.

Reproduction

git checkout d33ede0
cmake --preset bench-profile
cmake --build --preset bench-profile
bench/profiles/measure.sh    # full sweep, ~3 min

The post-P3 bundle lives at bench/profiles/phase10-p3-exit/, and the per-sub-task checkpoints at phase10-p{1,2}-exit/. The p3-exit flamegraph is the canonical evidence for the state after this batch.