Phase 10 P1–P3: Results
Closes the first batch of attacks from
hot-paths.md §7. Three commits land between
2026-05-17 baseline d95a5ba and the post-P3 commit d33ede0.
Net result: the interpreter is 2.06× faster on the CPU-bound
steady-state workload, and cpubound-mix now simulates faster
than the LEON3 it emulates.
Result summary
All numbers are median of 5 runs of lince-bench, pinned to a
single P-core thread (taskset -c 0) on a 12th-gen Intel
i5-12450HX. See bench/profiles/phase10-p{1,2,3}-exit/ for the
raw .json evidence.
| Sub-task | Commit | cpubound-mix (MIPS) | fptest01 (MIPS) | boot (MIPS) |
|---|---|---|---|---|
| Pre-P1 baseline | d95a5ba | 16.68 | 14.35 | 14.61 |
| P1: typed fetch fast path | 67dd6e7 | 21.57 (+29.3 %) | 17.70 (+23.3 %) | 17.61 (+20.5 %) |
| P2: PC-indexed decode cache | 67063f0 | 32.16 (+49.1 %) | 19.07 (+7.7 %) | 18.89 (+7.3 %) |
| P3: FP gate to dispatcher | d33ede0 | 34.41 (+7.0 %) | 20.68 (+8.4 %) | 21.36 (+13.1 %) |
| Total over baseline | | +106.3 % (2.06×) | +44.1 % (1.44×) | +46.2 % (1.46×) |
cpubound-mix %realtime climbed from 66.7 % to 137.6 %. On a
single host thread, the emulator now runs about 1.38× as fast as
a real LEON3 at 50 MHz.
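The percentage gains and %realtime figures above are plain ratios of the MIPS numbers. The following is a minimal sketch of that arithmetic, not code from lince-bench. The function names are hypothetical, and the 25 MIPS real-time reference is an assumption inferred from the table (16.68 MIPS at 66.7 %), not a figure stated in this document.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a set of benchmark samples (the report uses 5 runs per number).
double median(std::vector<double> runs) {
    assert(!runs.empty());
    std::sort(runs.begin(), runs.end());
    const std::size_t n = runs.size();
    return (n % 2 == 1) ? runs[n / 2]
                        : 0.5 * (runs[n / 2 - 1] + runs[n / 2]);
}

// Percentage gain of `after_mips` over `before_mips`,
// e.g. 16.68 -> 21.57 MIPS is roughly +29.3 %.
double gain_percent(double before_mips, double after_mips) {
    assert(before_mips > 0.0);
    return (after_mips / before_mips - 1.0) * 100.0;
}

// %realtime relative to the target's own instruction rate.
// ASSUMPTION: 25.0 MIPS is inferred from 16.68 MIPS == 66.7 %.
double percent_realtime(double host_mips, double target_mips = 25.0) {
    assert(target_mips > 0.0);
    return host_mips / target_mips * 100.0;
}
```

Under that assumption, `percent_realtime(34.41)` lands at about 137.6, matching the post-P3 figure.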
Predicted vs actual
hot-paths.md §7 projected the per-sub-task ROI conservatively.
Outcomes:
| Sub-task | Projected | Actual | Delta |
|---|---|---|---|
| P1 | 1.4× | 1.29× | −0.11× (closer to the model than expected) |
| P2 (additive) | 1.25× | 1.49× | +0.24× (over) |
| P3 (additive) | 1.02× | 1.07× | +0.05× (over) |
| P1+P2+P3 combined | 1.78× | 2.06× | +0.28× |
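The combined row is the product of the per-sub-task factors, since each factor is measured on top of the previous one. A small sketch of that compounding, using a hypothetical helper name:

```cpp
#include <cassert>
#include <initializer_list>

// Compound a chain of per-stage speedup factors into one overall factor.
// Each factor is "this stage relative to the previous stage".
double compound(std::initializer_list<double> factors) {
    double total = 1.0;
    for (double f : factors) {
        assert(f > 0.0);
        total *= f;
    }
    return total;
}
```

With the table's values, `compound({1.4, 1.25, 1.02})` gives 1.785 (the projected 1.78×), and `compound({1.29, 1.49, 1.07})` gives about 2.06.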
Reasons the combined factor over-shot:
- P1 + P2 reinforce each other. P1 collapses the fetch path (SystemBus::read_physical_u32 from 26.6 % → 3.4 % CPU). P2 then eliminates the fetch entirely for cache hits, so P1's cost savings on the miss path turn into a win on every "warm PC" execution, not just the first. The two optimisations were modelled independently; their interaction is multiplicative.
- P3 helped the pipeline, not just cycles. Moving the is_fp_kind && !state.ef() branch out of step removed a correctly-predicted-but-still-fetched branch from the inner loop. The compiler now produces a tighter step body with better instruction-level parallelism, beyond the literal cycle count of the removed comparison.
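To make the P1/P2 interaction concrete, here is a minimal sketch of a PC-indexed decode cache. It is not the code from commit 67063f0: the `Decoded` layout, the `DecodeCache` name, the direct-mapped organisation and the `invalidate` hook are all assumptions chosen for illustration. The slow path (bus fetch plus decode) is passed in as a callable so that hits never touch it.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical decoded form; the real decoder's output type is not shown here.
struct Decoded {
    std::uint32_t raw = 0;  // original instruction word
    std::uint8_t op = 0;    // SPARC format field, bits 31:30 of the word
};

// PCs are word-aligned, so an odd tag can never match a real PC.
constexpr std::uint32_t kInvalidTag = 0xFFFFFFFFu;

struct CacheLine {
    std::uint32_t tag = kInvalidTag;
    Decoded insn;
};

// Direct-mapped cache keyed by PC. Entries must be a power of two.
template <std::size_t Entries>
class DecodeCache {
    static_assert(Entries != 0 && (Entries & (Entries - 1)) == 0,
                  "Entries must be a power of two");

    std::array<CacheLine, Entries> lines_{};

    CacheLine& line_for(std::uint32_t pc) {
        return lines_[(pc >> 2) & (Entries - 1)];
    }

public:
    // On a hit, returns the cached decode without calling the slow path.
    // On a miss, calls fetch_and_decode(pc) once and stores the result.
    template <typename SlowPath>
    const Decoded& lookup(std::uint32_t pc, SlowPath&& fetch_and_decode) {
        assert((pc & 3u) == 0 && "PC must be word-aligned");
        CacheLine& line = line_for(pc);
        if (line.tag != pc) {
            line.insn = fetch_and_decode(pc);
            line.tag = pc;
        }
        return line.insn;
    }

    // A store into code memory would need to drop the stale entry.
    void invalidate(std::uint32_t addr) {
        const std::uint32_t pc = addr & ~3u;
        CacheLine& line = line_for(pc);
        if (line.tag == pc) line.tag = kInvalidTag;
    }
};
```

In this shape the fetch cost from P1 is paid only on misses, which is one way to read why the two gains compound rather than merely stack.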
Hot-path shift
cpubound-mix, perf record --call-graph=dwarf flat top-10, %CPU:
| Function | Pre-P1 | Post-P1 | Post-P2 | Post-P3 |
|---|---|---|---|---|
| SystemBus::read_physical_u32 | 26.63 | 3.40 | (out) | (out) |
| core::step | 21.67 | 31.39 | 40.30 | 38.11 |
| bus::Ram::read / read_u32_be | 8.51 | 20.71 | 2.21 | 2.37 |
| core::decode | 7.42 | 14.89 | (out) | (out) |
| SystemBus::read_physical (byte) | 6.66 | (out) | (out) | (out) |
| __memmove_avx2 | 4.50 | (out) | (out) | (out) |
| core::detail::exec_alu | 4.49 | (out) | 12.96 | 14.50 |
| core::execute | 1.53 | (out) | 5.11 | 6.06 |
| CpuState::icc | 1.60 | (out) | 5.24 | 5.93 |
| Emulator::run_until_unpaced | (out) | (out) | 10.59 | 9.35 |
Reading:
- The bus dispatch path collapsed (read_physical_u32 26.6 % → < 1 %, never appears again). P1 done.
- Decode disappeared from the top 10 post-P2. P2 done.
- core::step grew its share because everything else shrank. P3 chipped 2 points off it. The remaining ~38 % is the irreducible per-step bookkeeping (PSR pipeline, branch_request clear, error_mode check, cache lookup itself) plus the bus call for data loads.
- exec_alu became the new largest non-step bucket (14.5 %). This is the giant switch over AluOp enumerators. It's the natural target for Phase 10.2 (threaded-code dispatch).
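The difference between a central switch and a handler-based dispatch can be sketched as below. This is not the emulator's exec_alu and not the Phase 10.2 design: the `AluOp` enumerators, handler names and table are hypothetical. The technique shown is plain function-pointer table dispatch, a portable relative of threaded code. True direct threading usually relies on compiler extensions such as computed goto, which this sketch avoids.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical subset of ALU operations; the real enumerators are not listed here.
enum class AluOp : std::uint8_t { Add, Sub, And, Or, Xor, Count };

// Baseline shape: one switch evaluated on every executed instruction.
std::uint32_t exec_alu_switch(AluOp op, std::uint32_t a, std::uint32_t b) {
    switch (op) {
        case AluOp::Add: return a + b;
        case AluOp::Sub: return a - b;
        case AluOp::And: return a & b;
        case AluOp::Or:  return a | b;
        case AluOp::Xor: return a ^ b;
        case AluOp::Count: break;
    }
    return 0;
}

// Alternative shape: one small handler per operation.
using AluHandler = std::uint32_t (*)(std::uint32_t, std::uint32_t);

std::uint32_t op_add(std::uint32_t a, std::uint32_t b) { return a + b; }
std::uint32_t op_sub(std::uint32_t a, std::uint32_t b) { return a - b; }
std::uint32_t op_and(std::uint32_t a, std::uint32_t b) { return a & b; }
std::uint32_t op_or(std::uint32_t a, std::uint32_t b)  { return a | b; }
std::uint32_t op_xor(std::uint32_t a, std::uint32_t b) { return a ^ b; }

// Table order must match the enumerator order above.
const std::array<AluHandler, static_cast<std::size_t>(AluOp::Count)> kAluTable = {
    {op_add, op_sub, op_and, op_or, op_xor}};

// Resolve the handler by index instead of branching through a switch.
std::uint32_t exec_alu_table(AluOp op, std::uint32_t a, std::uint32_t b) {
    const std::size_t idx = static_cast<std::size_t>(op);
    assert(idx < kAluTable.size());
    return kAluTable[idx](a, b);
}
```

A design along these lines could resolve the handler once at decode time and store the pointer in the cached decode entry, so the hot loop only performs an indirect call. Whether that beats the compiler's jump table for the switch would need measuring.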
Correctness gate
All three sub-tasks honored the cross-cutting principle from the post-MVP roadmap: pass-rate ≥ entry pass-rate.
| Test set | Pre-P1 | Post-P1 | Post-P2 | Post-P3 |
|---|---|---|---|---|
| ctest (full suite) | 541/543 ¹ | 543/543 | 543/543 | 543/543 |
| RTEMS sptests N=1 (GR712RC) | 178/189 | 178/189 | 178/189 | 178/189 |
| RTEMS sptests N=1 (GR740) | 178/189 | 178/189 | 178/189 | 178/189 |
| RTEMS smptests N=2 (GR712RC) | 41/49 | 42/49 | 42/49 | 42/49 |
| RTEMS smptests N=4 (GR740) | 39/49 | 40/49 | 40/49 | 40/49 |
| Lince FP unit tests | 47/47 | 47/47 | 47/47 | 47/47 ² |
¹ Pre-P1 baseline checked locally before the sweep started; the two failures were known timing-marginal SMP tests from the original baseline, which flip green under P1's faster simulation.
² P3 initially broke 98 FP unit tests because they called
execute() directly without set_ef(true). Architectural fix in
the same sub-task: tests updated to set EF before calling FP
handlers (the operational scenario being tested). The structural
move of the EF gate is sound — see commit d33ede0 body for the
full reasoning.
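The effect described in footnote ² can be illustrated with a trimmed-down sketch. It is not the code from commit d33ede0: `CpuState`, `InsnKind`, `Outcome` and the handler names are hypothetical stand-ins. The only behaviour it models is that the EF check sits in the dispatcher, so a caller that invokes execute() on an FP instruction without enabling EF first gets an fp_disabled trap instead of reaching the handler.

```cpp
#include <cassert>

// Hypothetical, trimmed-down CPU state: only the PSR.EF bit is modelled.
struct CpuState {
    bool ef_ = false;
    bool ef() const { return ef_; }
    void set_ef(bool on) { ef_ = on; }
};

enum class InsnKind { IntAdd, FpAdd };
enum class Outcome { Ok, FpDisabledTrap };

bool is_fp_kind(InsnKind k) { return k == InsnKind::FpAdd; }

// Integer handlers never look at EF.
Outcome exec_int(CpuState&) { return Outcome::Ok; }

// FP handlers assume the gate was already checked by the dispatcher.
Outcome exec_fp(CpuState& s) {
    assert(s.ef());
    return Outcome::Ok;
}

// Dispatcher: the EF gate is evaluated here, only for FP kinds,
// rather than once per step for every instruction.
Outcome execute(CpuState& s, InsnKind k) {
    if (is_fp_kind(k)) {
        if (!s.ef()) return Outcome::FpDisabledTrap;
        return exec_fp(s);
    }
    return exec_int(s);
}
```

A unit test that targets an FP handler through execute() therefore has to call set_ef(true) first, which mirrors the test update described in the footnote.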
What was not done (deliberately deferred)
From hot-paths.md §7.4–§7.5:
- P4: Lazy __memset_avx2 for RAM/PROM zero-init: still dominates boot-rtems-hello-world (~70 % CPU) and fptest01 (~67 % CPU) on short runs. Irrelevant for sustained 1:1 throughput; only matters for cold-start latency. Defer indefinitely.
- P5: Histogram tail truncation: informational, not an optimisation. The top 5 InsnKind buckets cover 80 % of RTEMS executions; Phase 10.2 threaded-code prioritises specialising those kinds.
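The "top 5 buckets cover 80 %" style of statement is a cumulative share over a sorted histogram. A minimal sketch of how such a figure could be computed from per-kind execution counts follows. The function name is hypothetical and the counts are supplied by the caller; no profile data is embedded.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Fraction (0..1) of all executions covered by the k most frequent buckets.
double top_k_coverage(std::vector<std::uint64_t> counts, std::size_t k) {
    const std::uint64_t total =
        std::accumulate(counts.begin(), counts.end(), std::uint64_t{0});
    assert(total > 0);
    k = std::min(k, counts.size());
    const auto mid = counts.begin() + static_cast<std::ptrdiff_t>(k);
    // Only the first k positions need to be in descending order.
    std::partial_sort(counts.begin(), mid, counts.end(),
                      std::greater<std::uint64_t>());
    const std::uint64_t top =
        std::accumulate(counts.begin(), mid, std::uint64_t{0});
    return static_cast<double>(top) / static_cast<double>(total);
}
```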
Cross-references
- Plan supersession: hot-paths.md §7.1 reordered the original "decode cache first" plan into "bus fast path first". This document is the post-mortem confirming that reordering was the right call: P1 alone gave 29 % and unlocked P2's 49 % on top.
- The next obvious work (core::step overhead + exec_alu switch) folds into Phase 10.2 (threaded-code dispatch). See plans/phase10-2-threaded-code.md.
Reproduction
git checkout d33ede0
cmake --preset bench-profile
cmake --build --preset bench-profile
bench/profiles/measure.sh # full sweep, ~3 min
The post-P3 bundle lives at
bench/profiles/phase10-p3-exit/,
and the per-sub-task checkpoints at phase10-p{1,2}-exit/. The
p3-exit flamegraph is the canonical evidence for the state after this batch.