Phase 10.2 Tanda 4 — Results¶

Tanda 4 of the threaded-code dispatcher plan (plans/phase10-2-threaded-code.md) adds four more kind-specialised handlers — th_handler<Shift>, th_handler<Jmpl>, th_handler<Save>, th_handler<Restore> — following the same skeleton as Tanda 2's top-5. Together with Tanda 3's per-AluOp specialisation, the threaded dispatcher now resolves nine InsnKinds directly to specialised handlers (SetHi, Branch, Load, Store, Shift, Jmpl, Save, Restore, plus 24 th_alu<AluOp> for AluReg); the rest still goes through th_fallback.

Net result: structurally complete (each new kind is reached directly from the decode cache without paying the th_fallback inner switch), correctness equivalent (ctest 543/543 in both modes), and a small but real RTEMS-side delta. cpubound-mix is unchanged because it is AluReg-dominated.

Result summary¶

All numbers are the median of 5 lince-bench runs, pinned to a single P-core (taskset -c 0) on a 12th-gen Intel i5-12450HX. Side-by-side rows were captured back-to-back in the same session, so the switch-vs-threaded delta is free of cross-session drift.

Side-by-side, post-Tanda-4 session¶

Workload	switch	threaded	ratio	%realtime (threaded)
cpubound-mix	34.80 MIPS	46.48 MIPS	1.336×	185.92 %
boot-rtems-hello-world	18.81 MIPS	23.49 MIPS	1.249×	633.40 %
fptest01	21.46 MIPS	23.99 MIPS	1.118×	651.43 %
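
The ratio column is threaded MIPS divided by switch MIPS, each figure being the median of its five runs. A minimal sketch of that arithmetic, assuming the per-run MIPS values are available as plain doubles; the helper names median_of and speedup are illustrative and are not part of lince-bench:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a non-empty sample. Taken by value so the caller's order is kept.
double median_of(std::vector<double> runs) {
    assert(!runs.empty());
    std::sort(runs.begin(), runs.end());
    const std::size_t mid = runs.size() / 2;
    // Odd count (e.g. 5 runs): the middle element.
    // Even count: mean of the two middle elements.
    return (runs.size() % 2 == 1) ? runs[mid]
                                  : (runs[mid - 1] + runs[mid]) / 2.0;
}

// Ratio as reported in the table: threaded throughput over switch throughput.
double speedup(double switch_mips, double threaded_mips) {
    assert(switch_mips > 0.0);
    return threaded_mips / switch_mips;
}
```

For the cpubound-mix row, speedup(34.80, 46.48) evaluates to roughly 1.336, which matches the table.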

Cross-session delta over Tanda 3 (`bf5e274`)¶

The "additional ≥ 1.05× on RTEMS" goal in the plan is measured against Tanda 3's reported numbers. Cross-session comparisons absorb host jitter (a few percent run-to-run), so these are indicative rather than apples-to-apples.

Workload	Tanda 3	Tanda 4	delta
cpubound-mix	45.63 MIPS	46.48 MIPS	1.019× (noise)
boot-rtems-hello-world	22.80 MIPS	23.49 MIPS	1.030×
fptest01	22.86 MIPS	23.99 MIPS	1.049×

fptest01 at 1.049× lands essentially on the 1.05× plan target; boot-rtems at 1.030× is below it. The signal is real but small; the next section explains why and what it tells us.

Why the RTEMS delta is modest¶

Tanda 2 already routed Shift / Save / Restore / Jmpl through th_fallback. That fallback handler runs the same chain skeleton (PSR pipeline commit, branch-request clear, alignment check, annul, cycles bump, PC/nPC advance, musttail to the next slot) — the only thing Tanda 4 strips out is the body's switch (insn.kind) in core::execute, which dispatches to one of ~30 cases. That's a small jump-table dispatch on a value the branch predictor sees repeatedly, so the inner switch was already near-free.
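
The chain skeleton described above can be sketched as follows. This is a simplified illustration, not the project's code: Cpu, Slot, th_add and th_stop are hypothetical stand-ins, the alignment check and annul handling are omitted, and the tail-call attribute is only applied when building with Clang.

```cpp
#include <cassert>
#include <cstdint>

// Guaranteed tail call on Clang; a plain call elsewhere (assumption for portability).
#if defined(__clang__)
#define TAILCALL [[clang::musttail]]
#else
#define TAILCALL
#endif

struct Cpu;
struct Slot;
using Handler = void (*)(Cpu&, const Slot*);

struct Slot {
    Handler fn;             // handler resolved at decode time
    std::uint32_t operand;  // hypothetical pre-decoded operand
};

struct Cpu {
    std::uint32_t pc = 0, npc = 4;
    std::uint64_t cycles = 0;
    std::uint32_t acc = 0;
    bool branch_requested = false;
    void commit_psr_pipeline() {}  // usually a no-op
    void clear_branch_request() { branch_requested = false; }
};

// Per-link work every handler pays before and after its body.
inline void link_prologue(Cpu& c) {
    c.commit_psr_pipeline();
    c.clear_branch_request();
}
inline void link_epilogue(Cpu& c) {
    ++c.cycles;   // cycles bump
    c.pc = c.npc; // PC/nPC advance
    c.npc += 4;
}

// A specialised handler: skeleton + body + tail call to the next slot.
void th_add(Cpu& c, const Slot* s) {
    assert(s != nullptr);
    link_prologue(c);
    c.acc += s->operand;  // the body; th_fallback would run a switch here
    link_epilogue(c);
    TAILCALL return s[1].fn(c, s + 1);
}

// Ends the chain (stands in for a trap or chain break).
void th_stop(Cpu&, const Slot*) {}
```

The only per-instruction difference between a specialised handler and the fallback in this picture is the body line; everything around it is shared, which is why removing the inner switch buys little.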

Contrast with Tanda 3, which removed a second switch (alu_op) inside exec_alu for the 24 AluOps — a deeper switch on a hotter axis. Tanda 4's gain is structurally smaller because the inner cost it eliminates was smaller to begin with.

The dominant chain cost on RTEMS workloads is now:

1. state.commit_psr_pipeline() per-link (most calls are no-op but still touch the pipeline slot).
2. state.clear_branch_request() per-link.
3. prepare_slot()'s cache probe (pc_tag == pc then a fn load).
4. RamRegion::read_u32 for instruction fetches on cache miss.
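
The probe and miss path named in the last two items can be pictured with a direct-mapped cache. A minimal sketch, assuming a power-of-two slot count, a sentinel tag and a big-endian fetch; the layout of DecodedSlot and DecodeCache, the cache size and th_fallback_stub are assumptions made for illustration, and only the names prepare_slot, pc_tag, fn and read_u32 come from this document:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Handler = void (*)();        // placeholder handler type
inline void th_fallback_stub() {}  // placeholder for a decoded handler

// Hypothetical RAM model with a big-endian 32-bit fetch, as on SPARC.
struct RamRegion {
    std::vector<std::uint8_t> bytes;
    std::uint32_t read_u32(std::uint32_t addr) const {
        assert(addr + 3 < bytes.size());
        return (std::uint32_t{bytes[addr]} << 24) |
               (std::uint32_t{bytes[addr + 1]} << 16) |
               (std::uint32_t{bytes[addr + 2]} << 8) |
               std::uint32_t{bytes[addr + 3]};
    }
};

struct DecodedSlot {
    std::uint32_t pc_tag = 0xFFFFFFFFu;  // never a valid 4-byte-aligned PC
    std::uint32_t word = 0;              // raw instruction word
    Handler fn = nullptr;                // handler chosen at decode time
};

struct DecodeCache {
    static constexpr std::size_t kSlots = 256;  // assumed size, power of two
    std::array<DecodedSlot, kSlots> slots{};
    unsigned misses = 0;

    const DecodedSlot& prepare_slot(std::uint32_t pc, const RamRegion& ram) {
        DecodedSlot& s = slots[(pc >> 2) & (kSlots - 1)];
        if (s.pc_tag != pc) {          // miss: fetch, decode, retag
            s.word = ram.read_u32(pc);
            s.fn = &th_fallback_stub;  // real code would pick by decoded kind
            s.pc_tag = pc;
            ++misses;
        }
        return s;  // hit path: one compare, then the caller loads fn
    }
};
```

On a hit the cost is the tag compare plus the fn load; the read_u32 fetch is only paid on a miss or when two PCs alias to the same slot.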

The "stop-epilogue" refactor (move the first two costs above, commit_psr_pipeline() and clear_branch_request(), to once-per-chain, with a chain-break on WRPSR) is the next big lever. The plan flagged this as held-until-Tanda-4 in the Tandas 1–3 results doc, which makes Tanda 4 the right place to revisit it.
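
To make the expected saving concrete, the sketch below counts how often the commit and clear steps run in the current per-link shape versus a hoisted shape where they run once at the end of each chain and a PSR-writing instruction forces a chain break. It is a counting model only, under the assumption that hoisting is semantically safe once WRPSR ends the chain; Op, Counters and both run_* functions are hypothetical names, not the planned implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Op {
    bool writes_psr;  // true for a WRPSR-like instruction
};

struct Counters {
    std::size_t commits = 0;   // commit_psr_pipeline() calls
    std::size_t clears = 0;    // clear_branch_request() calls
    std::size_t executed = 0;  // instructions run
};

// Current shape: both steps are paid on every link of the chain.
Counters run_per_link(const std::vector<Op>& ops) {
    Counters c;
    for (std::size_t i = 0; i < ops.size(); ++i) {
        ++c.commits;
        ++c.clears;
        ++c.executed;
    }
    return c;
}

// Hoisted shape: run a chain until it ends or an op writes the PSR,
// then pay the two steps once in the chain's stop-epilogue.
Counters run_stop_epilogue(const std::vector<Op>& ops) {
    Counters c;
    std::size_t i = 0;
    while (i < ops.size()) {
        while (i < ops.size()) {
            const bool chain_break = ops[i].writes_psr;
            ++c.executed;
            ++i;
            if (chain_break) break;  // WRPSR ends the chain
        }
        ++c.commits;  // once per chain
        ++c.clears;   // once per chain
    }
    assert(c.executed == ops.size());
    return c;
}
```

With five instructions of which the third writes the PSR, the per-link shape performs five commits and the hoisted shape performs two, one per chain.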

What landed¶

Single-commit change: extends call_exec_for<K> with four new if constexpr arms (routing to detail::exec_shift, detail::exec_jmpl, detail::exec_save, detail::exec_restore), and extends dispatch_for_decoded with four new cases mapping each InsnKind to its th_handler<K> instantiation.
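
The shape of that change can be sketched as below. The names call_exec_for, dispatch_for_decoded, th_handler, th_fallback and the detail::exec_* functions come from this document, and the enumerator values for the four new kinds follow the objdump listing in the structural verification section; everything else (signatures, CpuState, Insn, the Other kind, exec_generic) is a simplified assumption for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Values 6/10/11/12 follow the th_handler<N> instantiations reported by objdump.
// "Other" is a hypothetical stand-in for every kind still on the fallback path.
enum class InsnKind : std::uint8_t {
    Other = 0, Shift = 6, Save = 10, Restore = 11, Jmpl = 12
};

struct Insn { InsnKind kind; };
struct CpuState { int last = -1; };  // records which body ran (demo only)

namespace detail {
inline void exec_shift(CpuState& s, const Insn&)   { s.last = 6; }
inline void exec_jmpl(CpuState& s, const Insn&)    { s.last = 12; }
inline void exec_save(CpuState& s, const Insn&)    { s.last = 10; }
inline void exec_restore(CpuState& s, const Insn&) { s.last = 11; }
// Stands in for the generic switch (insn.kind) body.
inline void exec_generic(CpuState& s, const Insn&) { s.last = 0; }
}  // namespace detail

// Compile-time routing: only the matching arm is instantiated for each K.
template <InsnKind K>
void call_exec_for(CpuState& s, const Insn& i) {
    if constexpr (K == InsnKind::Shift)        detail::exec_shift(s, i);
    else if constexpr (K == InsnKind::Jmpl)    detail::exec_jmpl(s, i);
    else if constexpr (K == InsnKind::Save)    detail::exec_save(s, i);
    else if constexpr (K == InsnKind::Restore) detail::exec_restore(s, i);
    else                                       detail::exec_generic(s, i);
}

using Handler = void (*)(CpuState&, const Insn&);

template <InsnKind K>
void th_handler(CpuState& s, const Insn& i) {
    assert(i.kind == K);      // slot was decoded as kind K
    call_exec_for<K>(s, i);   // chain skeleton omitted in this sketch
}

inline void th_fallback(CpuState& s, const Insn& i) { detail::exec_generic(s, i); }

// Decode-time mapping from kind to handler; runs once per cached slot.
inline Handler dispatch_for_decoded(const Insn& i) {
    switch (i.kind) {
        case InsnKind::Shift:   return &th_handler<InsnKind::Shift>;
        case InsnKind::Jmpl:    return &th_handler<InsnKind::Jmpl>;
        case InsnKind::Save:    return &th_handler<InsnKind::Save>;
        case InsnKind::Restore: return &th_handler<InsnKind::Restore>;
        default:                return &th_fallback;
    }
}
```

The switch in dispatch_for_decoded runs when a slot is decoded, not on every execution, so the hot path sees only the stored function pointer.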

No new templates, no per-sub-op specialisation: Shift's inner switch has 3 cases (Sll/Srl/Sra) and the others have none, so a Tanda-3-style per-sub-op split would add code for negligible gain.

Structural verification¶

$ objdump -d -C build-bench-threaded/.../dispatch_threaded.cpp.o \
  | grep -oE 'th_handler<[^>]+>' | sort -u

shows th_handler<3> (SetHi), <4> (Branch), <6> (Shift, new), <10> (Save, new), <11> (Restore, new), <12> (Jmpl, new), <15> (Load), <16> (Store) — eight specialised instantiations plus 24 th_alu<AluOp> from Tanda 3. The four new kinds are reached from dispatch_for_decoded directly; the linker keeps the symbols because the dispatch table calls them.

Correctness gate¶

Test set	switch	threaded
`ctest` full suite	543/543	543/543
Total wall time	698.39 s	721.95 s

Both runs used ctest -j4. The roughly 24 s of extra wall time in threaded mode (about 3 %) is in the noise: smptests N=2 and N=4 each take 600–700 s and absorb host variance. The PROM/SMP boot path through threaded mode is identical because the chain ends at every trap (SAVE/RESTORE window overflows during boot, IRQ entry from GPTIMER) and re-enters fresh; TLBs, branch predictor, and the chain state all reset at each trap.

What's left in Phase 10.2¶

Tanda	Status	What it does
1 — Scaffolding	✅ `7999f0e`	CMake option + types.
2 — Top-5 handlers + chain	✅ `44fa1f2`	First measurable win.
3 — Per-AluOp specialisation	✅ `bf5e274`	Inner ALU switch out of hot edge.
4 — Shift / Jmpl / Save / Restore	✅ this commit	Last per-kind threading; structurally complete.
5 — FP threading + EF gate move	⬜ Pending	fptest01 target ≥ 1.1×.
6 — Tuning + write-up	⬜ Pending	cpubound exit target ≥ 65 MIPS.

Plan exit target on cpubound-mix is ≥ 65 MIPS. We are at 46.48 MIPS — still ~1.40× to find across Tandas 5 + 6 + the stop-epilogue lever.

Reproduction¶

# Side-by-side bench (same session)
taskset -c 0 build-bench-profile/bench/lince-bench \
    --workload cpubound-mix --runs 5             # switch
taskset -c 0 build-bench-threaded/bench/lince-bench \
    --workload cpubound-mix --runs 5             # threaded

# ctest both modes
ctest --test-dir build -j4                  # switch mode
ctest --test-dir build-bench-threaded -j4   # threaded mode

Cross-references¶

Plan: plans/phase10-2-threaded-code.md.
Tandas 1–3 results: phase10-2-tandas-1-3-results.md.
Post-P3 baseline: phase10-p1-p3-results.md.