Phase 10 P1–P3 — Results

Closes the first batch of attacks from hot-paths.md §7. Three commits landed between the 2026-05-17 baseline d95a5ba and the post-P3 commit d33ede0. Net result: the interpreter is 2.06× faster on the CPU-bound steady-state workload, and cpubound-mix now simulates faster than the LEON3 it emulates.

Result summary

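All numbers are the median of 5 runs of lince-bench, pinned to a single P-core thread (taskset -c 0) on a 12th-gen Intel i5-12450HX. Percentages in the P1–P3 rows are relative to the previous row; the last row is relative to the baseline. See bench/profiles/phase10-p{1,2,3}-exit/ for the raw .json evidence.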

Sub-task	Commit	cpubound-mix (MIPS)	fptest01 (MIPS)	boot (MIPS)
Pre-P1 baseline	`d95a5ba`	16.68	14.35	14.61
P1 — typed fetch fast path	`67dd6e7`	21.57 (+29.3 %)	17.70 (+23.3 %)	17.61 (+20.5 %)
P2 — PC-indexed decode cache	`67063f0`	32.16 (+49.1 %)	19.07 (+7.7 %)	18.89 (+7.3 %)
P3 — FP gate to dispatcher	`d33ede0`	34.41 (+7.0 %)	20.68 (+8.4 %)	21.36 (+13.1 %)
Total over baseline		+106.3 % (2.06×)	+44.1 % (1.44×)	+46.2 % (1.46×)

cpubound-mix %realtime climbed from 66.7 % to 137.6 %. On a single host thread, the emulator now runs at ~1.38× the speed of a real LEON3 at 50 MHz.
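The %realtime figures follow from the MIPS column by simple scaling. Below is a minimal sketch of that conversion. It assumes the 100 % reference is 25 MIPS, a value inferred from 16.68 MIPS being reported as 66.7 %, not taken from the benchmark output. `kReferenceMips` and `realtime_percent` are illustrative names.

```cpp
// Assumption: 100 % real time corresponds to 25 MIPS. This is inferred from
// the baseline pair (16.68 MIPS reported as 66.7 %), not read from lince-bench.
constexpr double kReferenceMips = 25.0;

// Percentage of real time reached by a given emulated throughput in MIPS.
constexpr double realtime_percent(double emulated_mips) {
    return 100.0 * emulated_mips / kReferenceMips;
}
```

Under that assumption, `realtime_percent(16.68)` is about 66.7 and `realtime_percent(34.41)` is about 137.6, matching the figures above.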

Predicted vs actual

hot-paths.md §7 projected the per-sub-task ROI conservatively. Outcomes:

Sub-task	Projected	Actual	Delta
P1	1.4×	1.29×	−0.11× (under)
P2 (additive)	1.25×	1.49×	+0.24× (over)
P3 (additive)	1.02×	1.07×	+0.05× (over)
P1+P2+P3 combined	1.78×	2.06×	+0.28×
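The combined row is the product of the per-step factors, because each step is measured against the one before it. A quick arithmetic check, with `combined_speedup` as a hypothetical helper:

```cpp
#include <initializer_list>

// Combined speedup of sequential optimisation steps: the product of the
// per-step factors, each measured relative to the previous step.
inline double combined_speedup(std::initializer_list<double> step_factors) {
    double total = 1.0;
    for (double factor : step_factors) {
        total *= factor;
    }
    return total;
}
```

With the projected factors `{1.4, 1.25, 1.02}` this gives about 1.785, and with the measured factors `{1.29, 1.49, 1.07}` about 2.06.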

Reasons the combined factor over-shot:

P1 + P2 reinforce each other. P1 collapses the fetch path (SystemBus::read_physical_u32 from 26.6 % → 3.4 % CPU). P2 then eliminates the fetch entirely for cache hits — so P1's cost savings on the miss path turn into a win on every "warm PC" execution, not just the first. The two optimisations were modelled independently; their interaction is multiplicative.
P3 helped the pipeline, not just cycles. Moving the is_fp_kind && !state.ef() branch out of step removed a correctly-predicted-but-still-fetched branch from the inner loop. The compiler now produces a tighter step body with better instruction-level parallelism, beyond the literal cycle count of the removed comparison.
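To make the P1/P2 interaction concrete, here is a minimal sketch of a PC-indexed decode cache. It assumes a direct-mapped table of 1024 entries and a caller-supplied slow path that does the fetch and decode. `Decoded`, `DecodeCache`, and the table size are hypothetical, not the emulator's actual types, and invalidation on stores to cached code is left out.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical decoded form; the real layout is not shown in this document.
struct Decoded {
    std::uint32_t raw;
};

struct DecodeCache {
    static constexpr std::size_t kEntries = 1024;  // assumed size, power of two

    struct Entry {
        std::uint32_t pc = 0;
        bool valid = false;
        Decoded insn{};
    };

    std::array<Entry, kEntries> entries{};
    std::uint64_t hits = 0;
    std::uint64_t misses = 0;

    // SPARC instructions are 4-byte aligned, so the two low PC bits carry
    // no information and are dropped before indexing.
    static std::size_t index(std::uint32_t pc) {
        return (pc >> 2) & (kEntries - 1);
    }

    // On a hit, neither the bus fetch nor the decoder runs.
    // On a miss, slow_path(pc) performs both and the result is cached.
    template <class FetchAndDecode>
    const Decoded& lookup(std::uint32_t pc, FetchAndDecode&& slow_path) {
        Entry& e = entries[index(pc)];
        if (e.valid && e.pc == pc) {
            ++hits;
            return e.insn;
        }
        ++misses;
        e.insn = slow_path(pc);
        e.pc = pc;
        e.valid = true;
        return e.insn;
    }
};
```

In this shape the fetch fast path only runs on misses, while every repeated execution of a warm PC skips fetch and decode altogether.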

Hot-path shift

cpubound-mix, perf record --call-graph=dwarf flat top-10, %CPU:

Function	Pre-P1	Post-P1	Post-P2	Post-P3
`SystemBus::read_physical_u32`	26.63	3.40	(out)	(out)
`core::step`	21.67	31.39	40.30	38.11
`bus::Ram::read` / `read_u32_be`	8.51	20.71	2.21	2.37
`core::decode`	7.42	14.89	(out)	(out)
`SystemBus::read_physical` (byte)	6.66	(out)	(out)	(out)
`__memmove_avx2`	4.50	(out)	(out)	(out)
`core::detail::exec_alu`	4.49	(out)	12.96	14.50
`core::execute`	1.53	(out)	5.11	6.06
`CpuState::icc`	1.60	(out)	5.24	5.93
`Emulator::run_until_unpaced`	(out)	(out)	10.59	9.35

Reading:

The bus dispatch path collapsed (read_physical_u32 26.6 % → < 1 %, never appears again). P1 done.
Decode disappeared from the top 10 post-P2. P2 done.
core::step grew its share because everything else shrank. P3 chipped 2 points off it. The remaining ~38 % is the irreducible per-step bookkeeping (PSR pipeline, branch_request clear, error_mode check, cache lookup itself) plus the bus call for data loads.
exec_alu became the new largest non-step bucket (14.5 %). This is the giant switch over AluOp enumerators. It's the natural target for Phase 10.2 (threaded-code dispatch).
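As an illustration of why the switch is a target, the sketch below contrasts switch dispatch with one common form of threaded dispatch: resolving a handler pointer once at decode time and calling it directly on each execution. This is a hedged sketch, not the Phase 10.2 design. The four `AluOp` values, `Insn`, `decode`, and the handler names are all hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical subset of operations; the real AluOp enumerators are not listed here.
enum class AluOp : std::uint8_t { Add, Sub, And, Or };

struct Insn;
using Handler = std::uint32_t (*)(const Insn&);

struct Insn {
    AluOp op;
    std::uint32_t a;
    std::uint32_t b;
    Handler handler;  // resolved once, at decode time
};

// Switch dispatch: the operation is re-examined on every execution.
inline std::uint32_t exec_switch(const Insn& i) {
    switch (i.op) {
        case AluOp::Add: return i.a + i.b;
        case AluOp::Sub: return i.a - i.b;
        case AluOp::And: return i.a & i.b;
        case AluOp::Or:  return i.a | i.b;
    }
    return 0;  // not reached for valid enumerators
}

inline std::uint32_t do_add(const Insn& i) { return i.a + i.b; }
inline std::uint32_t do_sub(const Insn& i) { return i.a - i.b; }
inline std::uint32_t do_and(const Insn& i) { return i.a & i.b; }
inline std::uint32_t do_or(const Insn& i)  { return i.a | i.b; }

// Decode looks the handler up once; table order must match the enum order.
inline Insn decode(AluOp op, std::uint32_t a, std::uint32_t b) {
    static const Handler table[] = {do_add, do_sub, do_and, do_or};
    return Insn{op, a, b, table[static_cast<std::size_t>(op)]};
}

// Threaded-style dispatch: one indirect call, no switch on the hot path.
inline std::uint32_t exec_threaded(const Insn& i) {
    return i.handler(i);
}
```

Combined with a decode cache, the handler lookup cost is paid once per cached instruction rather than once per execution. Whether that beats a well-predicted switch on a given host is something only measurement can settle.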

Correctness gate

All three sub-tasks honored the cross-cutting principle from the post-MVP roadmap: pass-rate ≥ entry pass-rate.

Test set	Pre-P1	Post-P1	Post-P2	Post-P3
`ctest` (full suite)	541/543 ¹	543/543	543/543	543/543
RTEMS sptests N=1 (GR712RC)	178/189	178/189	178/189	178/189
RTEMS sptests N=1 (GR740)	178/189	178/189	178/189	178/189
RTEMS smptests N=2 (GR712RC)	41/49	42/49	42/49	42/49
RTEMS smptests N=4 (GR740)	39/49	40/49	40/49	40/49
Lince FP unit tests	47/47	47/47	47/47	47/47 ²

¹ Pre-P1 baseline checked locally before the sweep started; the two failures were known timing-marginal SMP tests from the original baseline, which turn green under P1's faster simulation.

² P3 initially broke 98 FP unit tests because they called execute() directly without set_ef(true). The fix landed in the same sub-task: the tests now set EF before calling FP handlers, which is the operational scenario being tested. The structural move of the EF gate is sound; see the body of commit d33ede0 for the full reasoning.
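The following sketch shows why direct callers of execute() were affected once the EF gate moved there. It is a simplified model, not the emulator's code: `Outcome`, `Insn`, and the handler functions are hypothetical, and only `execute()`, `ef()`, `set_ef()`, and `is_fp_kind` echo names used in this document. The trap name follows the SPARC V8 fp_disabled trap.

```cpp
enum class Outcome { Executed, FpDisabledTrap };

// Minimal stand-in for the CPU state; only the PSR.EF bit is modelled.
struct CpuState {
    bool ef() const { return ef_; }
    void set_ef(bool enabled) { ef_ = enabled; }
private:
    bool ef_ = false;
};

struct Insn {
    bool is_fp_kind;
};

// Handlers do not check EF themselves; the gate sits in the dispatcher.
inline Outcome run_fp_handler(CpuState&, const Insn&) { return Outcome::Executed; }
inline Outcome run_int_handler(CpuState&, const Insn&) { return Outcome::Executed; }

// Dispatcher: the EF check is only evaluated on the FP route, so integer
// instructions never pay for it.
inline Outcome execute(CpuState& state, const Insn& insn) {
    if (insn.is_fp_kind) {
        if (!state.ef()) {
            return Outcome::FpDisabledTrap;
        }
        return run_fp_handler(state, insn);
    }
    return run_int_handler(state, insn);
}
```

A test that calls `execute()` on an FP instruction therefore has to call `set_ef(true)` first, otherwise it observes the trap instead of the FP result.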

What was not done (deliberately deferred)

From hot-paths.md §7.4–§7.5:

P4 — Lazy __memset_avx2 for RAM/PROM zero-init: still dominates boot-rtems-hello-world (~70 % CPU) and fptest01 (~67 % CPU) on short runs. Irrelevant for sustained 1:1 throughput; only matters for cold-start latency. Defer indefinitely.
P5 — Histogram tail truncation: informational, not an optimisation. The top 5 InsnKind buckets cover 80 % of RTEMS executions; Phase 10.2 threaded-code prioritises specialising those kinds.

Cross-references

Plan supersession: hot-paths.md §7.1 reordered the original "decode cache first" plan into "bus fast path first". This document is the post-mortem confirming that reordering was the right call — P1 alone gave 29 % and unlocked P2's 49 % on top.
The next obvious work (core::step overhead + exec_alu switch) folds into Phase 10.2 (threaded-code dispatch). See plans/phase10-2-threaded-code.md.

Reproduction

git checkout d33ede0
cmake --preset bench-profile
cmake --build --preset bench-profile
bench/profiles/measure.sh    # full sweep, ~3 min

The post-P3 bundle lives at bench/profiles/phase10-p3-exit/, and the per-sub-task checkpoints at phase10-p{1,2}-exit/. The p3-exit flamegraph is the canonical post-batch evidence.