Phase 10.2 Tanda 4 — Results
Tanda 4 of the threaded-code dispatcher plan
(plans/phase10-2-threaded-code.md)
adds four more kind-specialised handlers — th_handler<Shift>,
th_handler<Jmpl>, th_handler<Save>, th_handler<Restore> —
following the same skeleton as Tanda 2's top-5. Together with
Tanda 3's per-AluOp specialisation, the threaded dispatcher
now resolves nine InsnKinds directly to specialised handlers
(SetHi, Branch, Load, Store, Shift, Jmpl, Save,
Restore, plus 24 th_alu<AluOp> for AluReg); the rest still
goes through th_fallback.
Net result: structurally complete (each new kind is reached
directly from the decode cache without paying the th_fallback
inner switch), correctness equivalent (ctest 543/543 in both
modes), and a small but real RTEMS-side delta. cpubound-mix is
unchanged because it is AluReg-dominated.
Result summary
All numbers are median of 5 lince-bench runs, pinned to a single
P-core (taskset -c 0) on a 12th-gen Intel i5-12450HX. Side-by-side
rows were captured back-to-back in the same session, so the
switch-vs-threaded delta is largely free of cross-session jitter.
Side-by-side, post-Tanda-4 session
| Workload | switch | threaded | ratio | %realtime (threaded) |
|---|---|---|---|---|
| cpubound-mix | 34.80 MIPS | 46.48 MIPS | 1.336× | 185.92 % |
| boot-rtems-hello-world | 18.81 MIPS | 23.49 MIPS | 1.249× | 633.40 % |
| fptest01 | 21.46 MIPS | 23.99 MIPS | 1.118× | 651.43 % |
Cross-session delta over Tanda 3 (bf5e274)
The "additional ≥ 1.05× on RTEMS" goal in the plan is measured against Tanda 3's reported numbers. Cross-session comparisons absorb host jitter (a few percent run-to-run), so these are indicative rather than apples-to-apples.
| Workload | Tanda 3 | Tanda 4 | delta |
|---|---|---|---|
| cpubound-mix | 45.63 MIPS | 46.48 MIPS | 1.019× (noise) |
| boot-rtems-hello-world | 22.80 MIPS | 23.49 MIPS | 1.030× |
| fptest01 | 22.86 MIPS | 23.99 MIPS | 1.049× |
fptest01, at 1.049×, lands essentially at the 1.05× plan target;
boot-rtems at 1.03× is below it. The signal is real but small; the
next section explains why and what it tells us.
Why the RTEMS delta is modest
Tanda 2 already routed Shift / Save / Restore / Jmpl through
th_fallback. That fallback handler runs the same chain skeleton
(PSR pipeline commit, branch-request clear, alignment check,
annul, cycles bump, PC/nPC advance, musttail to the next slot) —
the only thing Tanda 4 strips out is the body's switch
(insn.kind) in core::execute, which dispatches to one of
~30 cases. That's a small jump-table dispatch on a value the
branch predictor sees repeatedly, so the inner switch was already
near-free.
Contrast with Tanda 3, which removed a second switch (alu_op)
inside exec_alu for the 24 AluOps — a deeper switch on a
hotter axis. Tanda 4's gain is structurally smaller because the
inner cost it eliminates was smaller to begin with.
The dominant chain cost on RTEMS workloads is now:
1. state.commit_psr_pipeline() per-link (most calls are no-ops but still touch the pipeline slot).
2. state.clear_branch_request() per-link.
3. prepare_slot()'s cache probe (pc_tag == pc, then a fn load).
4. RamRegion::read_u32 for instruction fetches on cache miss.
The "stop-epilogue" refactor (run items 1 and 2 once per chain
instead of once per link, and break the chain on WRPSR) is the next
big lever. The plan flagged it as held until Tanda 4 in the
Tandas 1–3 results doc, so this is the right point to revisit it.
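
To make the intended saving concrete, here is a minimal sketch of the two shapes. It is an assumption about how the refactor might look, not the planned implementation: State, Link, run_per_link and run_stop_epilogue are hypothetical stand-ins, and only the two member-function names come from the list above.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in that only counts how often each call runs.
struct State {
    std::uint64_t commits = 0;
    std::uint64_t clears = 0;
    void commit_psr_pipeline() { ++commits; }
    void clear_branch_request() { ++clears; }
};

// One chain link; writes_psr marks a WRPSR-like instruction.
struct Link { bool writes_psr; };

// Current shape: every link pays both calls.
inline void run_per_link(State& s, const std::vector<Link>& chain) {
    for (const Link& link : chain) {
        s.commit_psr_pipeline();
        s.clear_branch_request();
        (void)link;  // the instruction body would execute here
    }
}

// Stop-epilogue shape (assumed): both calls run once when a chain stops,
// and a PSR write forces a stop so the commit is not delayed past it.
inline void run_stop_epilogue(State& s, const std::vector<Link>& chain) {
    bool open = false;  // true while links have run since the last epilogue
    for (const Link& link : chain) {
        open = true;    // the instruction body would execute here
        if (link.writes_psr) {
            s.commit_psr_pipeline();
            s.clear_branch_request();
            open = false;
        }
    }
    if (open) {
        s.commit_psr_pipeline();
        s.clear_branch_request();
    }
}
```

For a six-link chain with one PSR write in the middle, the per-link form performs six commits and six clears, while the stop-epilogue form performs two of each. Whether the real gain is proportional depends on how cheap the no-op commits already are.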
What landed
Single-commit change: extends call_exec_for<K> with four new
if constexpr arms (routing to detail::exec_shift,
detail::exec_jmpl, detail::exec_save, detail::exec_restore),
and extends dispatch_for_decoded with four new cases mapping
each InsnKind to its th_handler<K> instantiation.
No new templates, no per-sub-op specialisation: Shift's inner switch has 3 cases (Sll/Srl/Sra) and the others have none, so a Tanda-3-style per-sub-op split would add code for negligible gain.
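
The shape of the change can be sketched in a few dozen lines. This is a toy model under stated assumptions, not the project's code: the Insn and State layouts, the exec_* bodies, the no_arm_for helper and the single-counter "skeleton" are invented for illustration, and the real handlers tail-call the next slot rather than returning. Only the names call_exec_for, th_handler, th_fallback, dispatch_for_decoded and detail::exec_* come from the text.

```cpp
#include <cstdint>

// Toy stand-ins; the real InsnKind has more values and different numbering.
enum class InsnKind : std::uint8_t { Shift, Jmpl, Save, Restore, Other };
struct Insn  { InsnKind kind; std::uint32_t a; std::uint32_t b; };
struct State { std::uint32_t result = 0; std::uint64_t cycles = 0; };

namespace detail {
// Placeholder bodies: the real functions implement SPARC semantics.
inline void exec_shift(State& s, const Insn& i)   { s.result = i.a << (i.b & 31u); }
inline void exec_jmpl(State& s, const Insn& i)    { s.result = i.a + i.b; }
inline void exec_save(State& s, const Insn& i)    { s.result = i.a - i.b; }
inline void exec_restore(State& s, const Insn& i) { s.result = i.a + i.b; }
}  // namespace detail

// Hypothetical helper so that a missing arm fails at compile time.
template <InsnKind> inline constexpr bool no_arm_for = false;

// One if-constexpr arm per specialised kind; resolved at compile time.
template <InsnKind K>
inline void call_exec_for(State& s, const Insn& i) {
    if constexpr (K == InsnKind::Shift)        detail::exec_shift(s, i);
    else if constexpr (K == InsnKind::Jmpl)    detail::exec_jmpl(s, i);
    else if constexpr (K == InsnKind::Save)    detail::exec_save(s, i);
    else if constexpr (K == InsnKind::Restore) detail::exec_restore(s, i);
    else static_assert(no_arm_for<K>, "no call_exec_for arm for this kind");
}

using Handler = void (*)(State&, const Insn&);

// Specialised handler: shared skeleton (here just a cycle bump), then the
// body for K with no runtime switch.
template <InsnKind K>
inline void th_handler(State& s, const Insn& i) {
    ++s.cycles;
    call_exec_for<K>(s, i);
}

// Fallback handler: same skeleton, but the body is found by a runtime switch.
inline void th_fallback(State& s, const Insn& i) {
    ++s.cycles;
    switch (i.kind) {
        case InsnKind::Shift:   detail::exec_shift(s, i);   break;
        case InsnKind::Jmpl:    detail::exec_jmpl(s, i);    break;
        case InsnKind::Save:    detail::exec_save(s, i);    break;
        case InsnKind::Restore: detail::exec_restore(s, i); break;
        case InsnKind::Other:   s.result = 0;               break;
    }
}

// Decode-time mapping from kind to handler; unlisted kinds use the fallback.
inline Handler dispatch_for_decoded(InsnKind k) {
    switch (k) {
        case InsnKind::Shift:   return &th_handler<InsnKind::Shift>;
        case InsnKind::Jmpl:    return &th_handler<InsnKind::Jmpl>;
        case InsnKind::Save:    return &th_handler<InsnKind::Save>;
        case InsnKind::Restore: return &th_handler<InsnKind::Restore>;
        default:                return &th_fallback;
    }
}
```

In this model the only difference between th_handler<K> and th_fallback is the inner switch, which matches the earlier observation that Tanda 4 removes a cheap, well-predicted dispatch rather than any of the per-link skeleton work.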
Structural verification

```
$ objdump -d -C build-bench-threaded/.../dispatch_threaded.cpp.o \
    | grep -oE 'th_handler<[^>]+>' | sort -u
```
shows th_handler<3> (SetHi), <4> (Branch), <6> (Shift,
new), <10> (Save, new), <11> (Restore, new), <12> (Jmpl,
new), <15> (Load), <16> (Store) — eight specialised
instantiations plus 24 th_alu<AluOp> from Tanda 3. The four
new kinds are reached from dispatch_for_decoded directly; the
linker keeps the symbols because the dispatch table calls them.
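
When reading such a listing it helps to translate the numeric template arguments back to kind names. The lookup below is a hypothetical convenience, not project code; it encodes only the eight numeric values reported above and assumes they are the underlying values of InsnKind.

```cpp
#include <string_view>

// Maps the numeric template argument seen in the listing to the kind name
// given in this document. Values not listed here are reported as "unknown".
constexpr std::string_view kind_name(int id) {
    switch (id) {
        case 3:  return "SetHi";
        case 4:  return "Branch";
        case 6:  return "Shift";
        case 10: return "Save";
        case 11: return "Restore";
        case 12: return "Jmpl";
        case 15: return "Load";
        case 16: return "Store";
        default: return "unknown";
    }
}
```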
Correctness gate
| Test set | switch | threaded |
|---|---|---|
| ctest full suite | 543/543 | 543/543 |
| Total wall time | 698.39 s | 721.95 s |
Both runs used ctest -j4. The ~23.6 s of extra wall time in threaded mode
is within noise (smptests N=2 and N=4 each take 600–700 s and
absorb host variance). The PROM/SMP boot path through threaded
mode is identical because the chain ends at every trap (SAVE/
RESTORE window overflows during boot, IRQ entry from GPTIMER)
and re-enters fresh — TLBs, branch predictor, and the chain
state all reset at each trap.
What's left in Phase 10.2
| Tanda | Status | What it does |
|---|---|---|
| 1 — Scaffolding | ✅ 7999f0e | CMake option + types. |
| 2 — Top-5 handlers + chain | ✅ 44fa1f2 | First measurable win. |
| 3 — Per-AluOp specialisation | ✅ bf5e274 | Inner ALU switch out of hot edge. |
| 4 — Shift / Jmpl / Save / Restore | ✅ this commit | Last per-kind threading; structurally complete. |
| 5 — FP threading + EF gate move | ⬜ Pending | fptest01 target ≥ 1.1×. |
| 6 — Tuning + write-up | ⬜ Pending | cpubound exit target ≥ 65 MIPS. |
Plan exit target on cpubound-mix is ≥ 65 MIPS. We are at 46.48 MIPS — still ~1.40× to find across Tandas 5 + 6 + the stop-epilogue lever.
Reproduction

```sh
# Side-by-side bench (same session)
taskset -c 0 build-bench-profile/bench/lince-bench \
    --workload cpubound-mix --runs 5    # switch
taskset -c 0 build-bench-threaded/bench/lince-bench \
    --workload cpubound-mix --runs 5    # threaded

# ctest both modes (both at -j4, as in the correctness gate)
ctest --test-dir build -j4                   # switch mode
ctest --test-dir build-bench-threaded -j4    # threaded mode
```
Cross-references
- Plan: plans/phase10-2-threaded-code.md.
- Tandas 1–3 results: phase10-2-tandas-1-3-results.md.
- Post-P3 baseline: phase10-p1-p3-results.md.