Phase 10.2 Tanda 4 — Results

Tanda 4 of the threaded-code dispatcher plan (plans/phase10-2-threaded-code.md) adds four more kind-specialised handlers — th_handler<Shift>, th_handler<Jmpl>, th_handler<Save>, th_handler<Restore> — following the same skeleton as Tanda 2's top-5. Together with Tanda 3's per-AluOp specialisation, the threaded dispatcher now resolves nine InsnKinds directly to specialised handlers (SetHi, Branch, Load, Store, Shift, Jmpl, Save, Restore, plus 24 th_alu<AluOp> for AluReg); the rest still goes through th_fallback.

Net result: structurally complete (each new kind is reached directly from the decode cache without paying the th_fallback inner switch), correctness equivalent (ctest 543/543 in both modes), and a small but real RTEMS-side delta. cpubound-mix is unchanged because it is AluReg-dominated.

Result summary

All numbers are medians of 5 lince-bench runs, pinned to a single P-core (taskset -c 0) on a 12th-gen Intel i5-12450HX. Side-by-side rows were captured back-to-back in the same session, so the switch-vs-threaded delta is free of cross-session drift, though not of ordinary run-to-run jitter.

Side-by-side, post-Tanda-4 session

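| Workload               | switch     | threaded   | ratio  | %realtime (threaded) |
|------------------------|------------|------------|--------|----------------------|
| cpubound-mix           | 34.80 MIPS | 46.48 MIPS | 1.336× | 185.92 %             |
| boot-rtems-hello-world | 18.81 MIPS | 23.49 MIPS | 1.249× | 633.40 %             |
| fptest01               | 21.46 MIPS | 23.99 MIPS | 1.118× | 651.43 %             |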
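
The ratio column is the threaded median divided by the switch median. A minimal C++ sketch of that reduction follows, assuming each lince-bench run yields a single MIPS figure. The helper names `median` and `speedup` are illustrative only and are not part of lince-bench.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a non-empty sample. For an odd count (such as 5 runs) this is
// the middle element after sorting; for an even count, the mean of the
// two middle elements.
inline double median(std::vector<double> samples) {
    assert(!samples.empty());
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    return (n % 2 == 1) ? samples[n / 2]
                        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
}

// Speedup of the candidate build over the baseline, both given in MIPS.
inline double speedup(double baseline_mips, double candidate_mips) {
    return candidate_mips / baseline_mips;
}
```

With the medians from the table, `speedup(34.80, 46.48)` comes out at about 1.336, which matches the cpubound-mix row.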

Cross-session delta over Tanda 3 (bf5e274)

The "additional ≥ 1.05× on RTEMS" goal in the plan is measured against Tanda 3's reported numbers. Cross-session comparisons absorb host jitter (a few percent run-to-run), so these are indicative rather than apples-to-apples.

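| Workload               | Tanda 3    | Tanda 4    | delta          |
|------------------------|------------|------------|----------------|
| cpubound-mix           | 45.63 MIPS | 46.48 MIPS | 1.019× (noise) |
| boot-rtems-hello-world | 22.80 MIPS | 23.49 MIPS | 1.030×         |
| fptest01               | 22.86 MIPS | 23.99 MIPS | 1.049×         |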

fptest01 lands just under the 1.05× plan target at 1.049×; boot-rtems at 1.030× falls short of it. The signal is real but small. The next section explains why and what it tells us.

Why the RTEMS delta is modest

Tanda 2 already routed Shift / Save / Restore / Jmpl through th_fallback. That fallback handler runs the same chain skeleton (PSR pipeline commit, branch-request clear, alignment check, annul, cycles bump, PC/nPC advance, musttail to the next slot) — the only thing Tanda 4 strips out is the body's switch (insn.kind) in core::execute, which dispatches to one of ~30 cases. That's a small jump-table dispatch on a value the branch predictor sees repeatedly, so the inner switch was already near-free.
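
To make the chain skeleton concrete, here is a heavily simplified sketch of a handler that does its work, runs the shared per-link bookkeeping, and tail-calls the next slot. Everything in it (`Cpu`, `Slot`, `th_add`, `th_stop`, `run_chain`, `chain_epilogue`) is a hypothetical stand-in rather than the project's code, and the PSR pipeline commit, alignment check and annul handling are left out. The `[[clang::musttail]]` attribute is a Clang extension, so the sketch hides it behind a macro.

```cpp
#include <cassert>
#include <cstdint>

// On Clang, force a guaranteed tail call. Other compilers get an ordinary
// call, which is still correct but may consume stack per chain link.
#if defined(__clang__)
#define TH_MUSTTAIL [[clang::musttail]]
#else
#define TH_MUSTTAIL
#endif

// Hypothetical, minimal CPU state for the sketch.
struct Cpu {
    std::uint32_t pc = 0;
    std::uint32_t npc = 4;
    std::uint64_t cycles = 0;
    std::uint32_t acc = 0;
    bool branch_request = false;
};

struct Slot;
using Handler = void (*)(Cpu&, const Slot*);

// One pre-decoded instruction: the handler to run plus a toy operand.
struct Slot {
    Handler fn;
    std::uint32_t imm;
};

// Per-link bookkeeping that every handler pays, specialised or not.
inline void chain_epilogue(Cpu& cpu) {
    cpu.branch_request = false;  // branch-request clear
    ++cpu.cycles;                // cycles bump
    cpu.pc = cpu.npc;            // PC/nPC advance
    cpu.npc += 4;
}

// Toy handler body: add an immediate, then tail-call the next slot.
void th_add(Cpu& cpu, const Slot* slot) {
    cpu.acc += slot->imm;
    chain_epilogue(cpu);
    const Slot* next = slot + 1;
    TH_MUSTTAIL return next->fn(cpu, next);
}

// Ends the chain by simply returning to whoever started it.
void th_stop(Cpu&, const Slot*) {}

inline void run_chain(Cpu& cpu, const Slot* first) {
    assert(first != nullptr && first->fn != nullptr);
    first->fn(cpu, first);
}
```

In this shape the per-link bookkeeping costs the same whether the slot holds a specialised handler or the fallback, which is why removing only the inner switch moves the numbers so little.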

Contrast with Tanda 3, which removed a second switch (alu_op) inside exec_alu for the 24 AluOps — a deeper switch on a hotter axis. Tanda 4's gain is structurally smaller because the inner cost it eliminates was smaller to begin with.

The dominant chain cost on RTEMS workloads is now:

  1. state.commit_psr_pipeline() per-link (most calls are no-op but still touch the pipeline slot).
  2. state.clear_branch_request() per-link.
  3. prepare_slot()'s cache probe (pc_tag == pc then a fn load).
  4. RamRegion::read_u32 for instruction fetches on cache miss.
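
Item 3 in the list above can be pictured as a direct-mapped cache keyed by PC. The sketch below is an assumption about the general shape, not the project's `prepare_slot()`: the `DecodeCache` type, its size, the `handler_id` field and the `decode` callback are all invented for illustration.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// One cache entry: the PC it was decoded for and a stand-in for the
// handler pointer a real dispatcher would store.
struct CacheSlot {
    std::uint32_t pc_tag = 0xFFFFFFFFu;  // misaligned, so it never matches a valid PC
    int handler_id = -1;
};

struct DecodeCache {
    static constexpr std::size_t kSlots = 1024;  // hypothetical size
    static_assert((kSlots & (kSlots - 1)) == 0, "kSlots must be a power of two");

    std::array<CacheSlot, kSlots> slots{};
    std::uint64_t hits = 0;
    std::uint64_t misses = 0;

    // Probe the slot for `pc`. On a tag match the stored handler is reused;
    // otherwise `decode(pc)` runs (the slow path that would fetch and decode
    // the instruction word) and the slot is overwritten.
    template <class Decode>
    const CacheSlot& prepare_slot(std::uint32_t pc, Decode&& decode) {
        assert((pc & 3u) == 0);  // instruction addresses are word aligned
        CacheSlot& slot = slots[(pc >> 2) & (kSlots - 1)];
        if (slot.pc_tag == pc) {
            ++hits;
            return slot;
        }
        ++misses;
        slot.handler_id = decode(pc);
        slot.pc_tag = pc;
        return slot;
    }
};
```

Even on a hit, the probe costs an index computation, a tag compare and a dependent load per link, which is consistent with it showing up once the handler bodies themselves get cheaper.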

The "stop-epilogue" refactor (move items 1 and 2 to once-per-chain, with a chain-break on WRPSR) is the next big lever. The plan flagged this as held-until-Tanda-4 in the Tandas 1–3 results doc, so Tanda 4 is the right place to revisit it.

What landed

Single-commit change: extends call_exec_for<K> with four new if constexpr arms (routing to detail::exec_shift, detail::exec_jmpl, detail::exec_save, detail::exec_restore), and extends dispatch_for_decoded with four new cases mapping each InsnKind to its th_handler<K> instantiation.
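
The pattern can be sketched as follows. This is an illustration under assumptions, not the committed code: the enum values, `DecodedInsn`, `CpuState`, the handler signature and the toy bodies of the `detail::exec_*` functions are invented, and only the names `call_exec_for`, `th_handler`, `th_fallback` and `dispatch_for_decoded` come from the text above. Window overflow and underflow traps are ignored.

```cpp
#include <cassert>
#include <cstdint>

enum class InsnKind : std::uint8_t { Shift, Jmpl, Save, Restore, Other };
enum class ShiftOp : std::uint8_t { Sll, Srl, Sra };

struct DecodedInsn {
    InsnKind kind;
    ShiftOp shift_op;
    std::uint32_t a;
    std::uint32_t b;
};

struct CpuState {
    std::uint32_t result = 0, pc = 0, npc = 4, link = 0, cwp = 0;
    int fallback_hits = 0;
};

constexpr std::uint32_t kWindows = 8;  // hypothetical register-window count

namespace detail {
inline void exec_shift(CpuState& s, const DecodedInsn& i) {
    const std::uint32_t n = i.b & 31u;
    switch (i.shift_op) {  // the 3-case inner switch that stays in place
        case ShiftOp::Sll: s.result = i.a << n; break;
        case ShiftOp::Srl: s.result = i.a >> n; break;
        case ShiftOp::Sra:  // arithmetic shift on mainstream compilers and in C++20
            s.result = static_cast<std::uint32_t>(static_cast<std::int32_t>(i.a) >> n);
            break;
    }
}
inline void exec_jmpl(CpuState& s, const DecodedInsn& i) { s.link = s.pc; s.npc = i.a + i.b; }
inline void exec_save(CpuState& s, const DecodedInsn&) { s.cwp = (s.cwp + kWindows - 1) % kWindows; }
inline void exec_restore(CpuState& s, const DecodedInsn&) { s.cwp = (s.cwp + 1) % kWindows; }
}  // namespace detail

// Compile-time selection: each instantiation contains exactly one call.
template <InsnKind K>
inline void call_exec_for(CpuState& s, const DecodedInsn& i) {
    if constexpr (K == InsnKind::Shift) detail::exec_shift(s, i);
    else if constexpr (K == InsnKind::Jmpl) detail::exec_jmpl(s, i);
    else if constexpr (K == InsnKind::Save) detail::exec_save(s, i);
    else if constexpr (K == InsnKind::Restore) detail::exec_restore(s, i);
}

template <InsnKind K>
void th_handler(CpuState& s, const DecodedInsn& i) { call_exec_for<K>(s, i); }

// Stand-in for the generic path, which would switch on i.kind at run time.
void th_fallback(CpuState& s, const DecodedInsn&) { ++s.fallback_hits; }

using HandlerFn = void (*)(CpuState&, const DecodedInsn&);

// Run once at decode time; the chosen pointer is what the chain later calls.
inline HandlerFn dispatch_for_decoded(const DecodedInsn& i) {
    switch (i.kind) {
        case InsnKind::Shift:   return &th_handler<InsnKind::Shift>;
        case InsnKind::Jmpl:    return &th_handler<InsnKind::Jmpl>;
        case InsnKind::Save:    return &th_handler<InsnKind::Save>;
        case InsnKind::Restore: return &th_handler<InsnKind::Restore>;
        default:                return &th_fallback;
    }
}
```

The switch in `dispatch_for_decoded` is paid once per decoded instruction rather than once per execution, which is the point of resolving the kind to a handler pointer up front.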

No new templates, no per-sub-op specialisation: Shift's inner switch has 3 cases (Sll/Srl/Sra) and the others have none, so a Tanda-3-style per-sub-op split would add code for negligible gain.

Structural verification

$ objdump -d build-bench-threaded/.../dispatch_threaded.cpp.o \
  | grep -oE 'th_handler<[^>]+>' | sort -u

shows th_handler<3> (SetHi), <4> (Branch), <6> (Shift, new), <10> (Save, new), <11> (Restore, new), <12> (Jmpl, new), <15> (Load), <16> (Store) — eight specialised instantiations plus 24 th_alu<AluOp> from Tanda 3. The four new kinds are reached from dispatch_for_decoded directly; the linker keeps the symbols because the dispatch table calls them.

Correctness gate

| Test set         | switch   | threaded |
|------------------|----------|----------|
| ctest full suite | 543/543  | 543/543  |
| Total wall time  | 698.39 s | 721.95 s |

Both runs used ctest -j4. The roughly 24 s of extra wall time in threaded mode is in the noise (smptests N=2 and N=4 each take 600–700 s and absorb host variance). The PROM/SMP boot path through threaded mode is identical because the chain ends at every trap (SAVE/RESTORE window overflows during boot, IRQ entry from GPTIMER) and re-enters fresh: TLBs, branch predictor, and the chain state all reset at each trap.

What's left in Phase 10.2

| Tanda                              | Status         | What it does                                    |
|------------------------------------|----------------|-------------------------------------------------|
| 1 - Scaffolding                    | 7999f0e        | CMake option + types.                           |
| 2 - Top-5 handlers + chain         | 44fa1f2        | First measurable win.                           |
| 3 - Per-AluOp specialisation       | bf5e274        | Inner ALU switch out of hot edge.               |
| 4 - Shift / Jmpl / Save / Restore  | ✅ this commit | Last per-kind threading; structurally complete. |
| 5 - FP threading + EF gate move    | ⬜ Pending     | fptest01 target ≥ 1.1×.                         |
| 6 - Tuning + write-up              | ⬜ Pending     | cpubound exit target ≥ 65 MIPS.                 |

Plan exit target on cpubound-mix is ≥ 65 MIPS. We are at 46.48 MIPS — still ~1.40× to find across Tandas 5 + 6 + the stop-epilogue lever.

Reproduction

# Side-by-side bench (same session)
taskset -c 0 build-bench-profile/bench/lince-bench \
    --workload cpubound-mix --runs 5             # switch
taskset -c 0 build-bench-threaded/bench/lince-bench \
    --workload cpubound-mix --runs 5             # threaded

# ctest both modes
ctest --test-dir build -j4                  # switch mode
ctest --test-dir build-bench-threaded -j4   # threaded mode

Cross-references