Phase 10.2 Tandas 1–3: Results

The first three tandas of the threaded-code dispatcher plan (plans/phase10-2-threaded-code.md) have landed. The interpreter's hot edge no longer goes through switch (insn.kind) followed by switch (insn.alu_op). Instead, each slot in the decode cache stores a specialised ThreadedHandler that tail-calls the next slot's handler via [[clang::musttail]]. Three commits separate the post-P3 baseline d33ede0 from the post-Tanda-3 commit bf5e274.

Net result: cpubound-mix runs 1.326× faster than the post-P3 switch baseline (34.41 → 45.63 MIPS) and now simulates a GR712RC core at 1.85× real time on a single host thread.

Result summary

All numbers are the median of 5 lince-bench runs, pinned to a single P-core (taskset -c 0) on a 12th-gen Intel i5-12450HX. Runs that appear side by side were captured back-to-back on the same host state, so switch-vs-threaded deltas are largely free of host noise; cross-tanda comparisons absorb host jitter (a few percent across runs). Per-tanda percentages in the table are relative to the previous row; the last row is cumulative over the post-P3 baseline.

Sub-task	Commit	cpubound-mix (MIPS)	boot-rtems (MIPS)	fptest01 (MIPS)
Post-P3 baseline (switch)	`d33ede0`	34.41	21.36	20.68
Tanda 1 (scaffolding)	`7999f0e`	≡ baseline	≡ baseline	≡ baseline
Tanda 2 (top-5 + chain)	`44fa1f2`	43.42 (+26.2 %)	23.81 (+11.5 %)	21.13 (+2.2 %)
Tanda 3 (per-AluOp)	`bf5e274`	45.63 (+5.1 %)	22.80 (-4.2 %) ¹	22.86 (+8.2 %)
Total over post-P3		+32.6 % (1.326×)	+6.7 % (1.07×)	+10.5 % (1.10×)

¹ Tanda 3 boot-rtems regression is within host-jitter noise (the two values straddle 22 MIPS run-to-run). Boot is FP-and-trap heavy and barely uses AluReg, so Tanda 3's specialisation does not move it.
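The percentages in the table can be re-derived from the MIPS medians. The following is a minimal sketch, assuming a hypothetical helper named percent_delta (it is not part of lince-bench); it only illustrates that each tanda's delta is taken against the previous row while the total is taken against the post-P3 baseline.

```cpp
#include <cmath>

// Hypothetical helper (not part of lince-bench): relative change in percent,
// rounded to one decimal place as in the results table.
inline double percent_delta(double before_mips, double after_mips) {
    const double pct = (after_mips / before_mips - 1.0) * 100.0;
    return std::round(pct * 10.0) / 10.0;
}

// cpubound-mix medians copied from the results table.
constexpr double kBaselineMips = 34.41;  // post-P3, switch dispatch
constexpr double kTanda2Mips   = 43.42;
constexpr double kTanda3Mips   = 45.63;
```

percent_delta(kBaselineMips, kTanda2Mips) and percent_delta(kTanda2Mips, kTanda3Mips) give the two per-tanda figures, and percent_delta(kBaselineMips, kTanda3Mips) gives the cumulative one. The cumulative ratio is the product of the per-tanda ratios, not the sum of the percentages.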

%realtime on cpubound-mix climbed from 137 % (post-P3) → 174 % (Tanda 2) → 185 % (Tanda 3). A single GR712RC core now simulates 1.85× faster than the silicon it emulates.

What landed in each tanda

Tanda 1: scaffolding (commit `7999f0e`)

Inert prep: LINCE_DISPATCH={switch,threaded} CMake option (default switch), ThreadedHandler function-pointer typedef in decoded_insn.hpp, DecodeCacheEntry::fn slot in cpu_state.hpp, and an empty dispatch_threaded.cpp translation unit. Zero behaviour change vs post-P3 in either mode. ctest 543/543 in both configurations.

The point was to land the binary-stable surface so subsequent tandas could be reviewed as diffs against a fixed scaffolding.

Tanda 2: top-5 threaded handlers + chain (commit `44fa1f2`)

Five kind-specialised handlers — th_handler<K> instantiations for AluReg, Branch, SetHi, Store, Load — plus a th_fallback that threads the chain skeleton but uses the classic execute() switch for the body of every long-tail kind. Each handler:

Runs the same per-instruction bookkeeping core::step does today (commit_psr_pipeline, clear_branch_request, PC alignment check, annul handling).
Calls the kind-specific exec helper (inlined via if constexpr in call_exec_for<K>).
Bumps the per-chain instruction / cycle counter on CpuState.
Returns immediately on trap, power-down, or budget exhaustion.
Otherwise advances PC/nPC, looks up the next slot (populating slot.fn on miss), and [[clang::musttail]] jumps to it.
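The handler steps above can be sketched on a toy machine. This is an illustration under stated assumptions, not the code in dispatch_threaded.cpp: the names Machine, Slot, Kind, th_handler_sketch, run_chain_sketch and SKETCH_MUSTTAIL are all hypothetical, the two instruction kinds are invented, and the PSR, annul and trap bookkeeping is left out. The macro falls back to an ordinary call on compilers without the attribute, which is also an assumption.

```cpp
#include <cstddef>
#include <cstdint>

// Tail-call attribute where the compiler offers it; otherwise an ordinary
// call that the optimiser may or may not turn into a jump (assumed fallback).
#if defined(__has_cpp_attribute)
#  if __has_cpp_attribute(clang::musttail)
#    define SKETCH_MUSTTAIL [[clang::musttail]]
#  endif
#endif
#ifndef SKETCH_MUSTTAIL
#  define SKETCH_MUSTTAIL
#endif

enum class Kind { Add, Halt };  // toy instruction kinds, not SPARC

struct Machine;
struct Slot;
using Handler = void (*)(Machine&, const Slot*);

struct Slot {
    Handler fn;        // specialised handler cached for this instruction
    std::int32_t imm;  // toy operand
};

struct Machine {
    const Slot* program = nullptr;
    std::size_t pc = 0;
    std::int64_t acc = 0;
    std::uint64_t instructions_ran = 0;
    std::uint64_t budget = 0;  // chain stops once instructions_ran reaches this
    bool halted = false;
};

// Deliberately not noexcept, mirroring the constraint the handlers have.
template <Kind K>
void th_handler_sketch(Machine& m, const Slot* slot) {
    // Kind-specific body, selected at compile time.
    if constexpr (K == Kind::Add) {
        m.acc += slot->imm;
    } else {
        m.halted = true;
    }
    ++m.instructions_ran;  // per-chain counter
    // Stop conditions: return to whoever started the chain.
    if (m.halted || m.instructions_ran >= m.budget) return;
    // Otherwise advance, look up the next slot, and jump to its handler.
    ++m.pc;
    const Slot* next = &m.program[m.pc];
    SKETCH_MUSTTAIL return next->fn(m, next);
}

// Enters the chain at the current pc; max_instructions must be at least 1.
inline void run_chain_sketch(Machine& m, std::uint64_t max_instructions) {
    m.budget = m.instructions_ran + max_instructions;
    const Slot* first = &m.program[m.pc];
    first->fn(m, first);
}
```

A program here is an array of Slot values whose fn fields point at th_handler_sketch<Kind::Add> or th_handler_sketch<Kind::Halt>. One call to run_chain_sketch then executes instructions until a Halt or the budget is hit, with control passing from handler to handler rather than returning to a central loop.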

The ChainResult { instructions_ran, cycles_used, status } flows out of run_threaded_chain and into Emulator::run_until_unpaced, which uses it instead of a quantum-long core::step loop. The chain is engaged only when gdb_stub_ == nullptr && observer_ == nullptr — GDB single-step / breakpoint sampling and observer notifications keep the classic per-step path for correct per-instruction granularity.
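The gating described above might look roughly like the sketch below. Only the ChainResult field names come from the text; the ChainStatus enumerators, the run_quantum_sketch function, and the callback-based shape are assumptions made so the example is self-contained, and the real Emulator::run_until_unpaced is not shown in this document.

```cpp
#include <cstdint>

// Enumerator names are hypothetical; the text only lists the stop causes.
enum class ChainStatus { BudgetExhausted, Trap, PowerDown };

struct ChainResult {
    std::uint64_t instructions_ran;
    std::uint64_t cycles_used;
    ChainStatus status;
};

// Hypothetical stand-in for the run loop. run_chain(budget) returns a
// ChainResult; step_once() executes one instruction and returns false if
// execution must stop after it.
template <typename RunChain, typename StepOnce>
std::uint64_t run_quantum_sketch(std::uint64_t budget,
                                 bool has_gdb_stub,
                                 bool has_observer,
                                 RunChain run_chain,
                                 StepOnce step_once) {
    if (!has_gdb_stub && !has_observer) {
        // Fast path: one call covers the whole quantum.
        const ChainResult result = run_chain(budget);
        return result.instructions_ran;
    }
    // Debugger or observer attached: keep per-instruction granularity.
    std::uint64_t ran = 0;
    while (ran < budget) {
        ++ran;
        if (!step_once()) break;
    }
    return ran;
}
```

The design point is that the choice is made once per quantum, so the per-step path pays nothing for the existence of the fast path and vice versa.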

Technical gotcha: Clang refuses [[clang::musttail]] from a noexcept caller ("cannot compile this tail call skipping over cleanups yet" — the implicit terminate landingpad counts as a cleanup). All threaded handlers in dispatch_threaded.cpp are therefore deliberately non-noexcept; internal helpers (call_exec_for, prepare_slot, chain_status_to_tt, exec_alu_specific) stay noexcept. GCC 15's [[gnu::musttail]] is unaffected. A LINCE_MUSTTAIL macro routed through __has_cpp_attribute(clang::musttail) covers both toolchains — GCC 15 ships clang::musttail as a synonym.
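A minimal sketch of the noexcept split follows. The definition of LINCE_MUSTTAIL is not shown in this document, so the macro below is named SKETCH_MUSTTAIL and its empty fallback is an assumption; Ctx, accumulate and count_down are hypothetical names used only to show a noexcept helper next to a non-noexcept tail-calling handler.

```cpp
#include <utility>

#if defined(__has_cpp_attribute)
#  if __has_cpp_attribute(clang::musttail)
#    define SKETCH_MUSTTAIL [[clang::musttail]]
#  endif
#endif
#ifndef SKETCH_MUSTTAIL
#  define SKETCH_MUSTTAIL  // assumed fallback: plain call, no tail-call guarantee
#endif

struct Ctx {
    unsigned remaining;      // links still to run; must start above zero
    unsigned long long sum;  // toy accumulator
};

// Internal helper: performs no tail call, so it can stay noexcept.
inline void accumulate(Ctx& c) noexcept { c.sum += c.remaining; }

// Handler: deliberately not noexcept, because Clang rejects a musttail
// return from a noexcept caller.
inline void count_down(Ctx& c) {
    accumulate(c);
    if (--c.remaining == 0) return;
    SKETCH_MUSTTAIL return count_down(c);
}

static_assert(noexcept(accumulate(std::declval<Ctx&>())),
              "helper stays noexcept");
static_assert(!noexcept(count_down(std::declval<Ctx&>())),
              "handler must not be noexcept");
```

The two static_assert lines make the convention checkable at compile time: adding noexcept to count_down would fail the second assertion before Clang ever reports the tail-call error.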

objdump -d build-threaded/.../dispatch_threaded.cpp.o confirms that the chain edge compiles to jmp *%r10, an indirect tail jump rather than a call.

Tanda 3: per-`AluOp` specialisation (commit `bf5e274`)

24 th_alu<Op> instantiations replace the inner switch (insn.alu_op) in detail::exec_alu. Coverage: ADD / ADDcc / ADDX / ADDXcc; SUB family; AND / ANDcc / ANDN / ANDNcc; OR family; XOR family; tagged Tadd/Tsub (which can trap on overflow). The dispatch lookup becomes dispatch_for_decoded(const DecodedInsn&): for AluReg it forwards to dispatch_alu_op(insn.alu_op); everything else behaves as before.
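The idea of trading a per-instruction inner switch for a decode-time lookup can be sketched as follows. This is a reduced illustration, not the project's code: the five-value AluOp enum, alu_specific_sketch and dispatch_alu_op_sketch are hypothetical, the functions operate on plain integers rather than CpuState, and condition-code updates are omitted.

```cpp
#include <cstdint>

// Reduced, hypothetical subset of the ALU operations.
enum class AluOp : std::uint8_t { Add, Sub, And, Or, Xor };

using AluFn = std::uint32_t (*)(std::uint32_t, std::uint32_t);

// One instantiation per operation: the operation is a template parameter,
// so no switch remains inside the function that runs per instruction.
template <AluOp Op>
std::uint32_t alu_specific_sketch(std::uint32_t a, std::uint32_t b) {
    if constexpr (Op == AluOp::Add) return a + b;
    else if constexpr (Op == AluOp::Sub) return a - b;
    else if constexpr (Op == AluOp::And) return a & b;
    else if constexpr (Op == AluOp::Or) return a | b;
    else return a ^ b;
}

// The switch still exists, but it runs when a slot is populated,
// not every time the instruction executes.
inline AluFn dispatch_alu_op_sketch(AluOp op) {
    switch (op) {
        case AluOp::Add: return &alu_specific_sketch<AluOp::Add>;
        case AluOp::Sub: return &alu_specific_sketch<AluOp::Sub>;
        case AluOp::And: return &alu_specific_sketch<AluOp::And>;
        case AluOp::Or:  return &alu_specific_sketch<AluOp::Or>;
        case AluOp::Xor: return &alu_specific_sketch<AluOp::Xor>;
    }
    return nullptr;  // unreachable for valid enumerators
}
```

Storing the returned pointer in the cached slot means a hot loop over the same instructions pays for the lookup once and for the specialised body on every iteration.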

Flag helpers (flags_add, flags_sub, flags_logical) moved from handlers_alu.cpp's anonymous namespace to handlers_internal.hpp so the threaded exec_alu_specific<Op> templates and the runtime exec_alu switch share one implementation. detail::exec_alu is retained (used by switch mode and by th_fallback).

th_handler<InsnKind::AluReg> is no longer reached by the dispatch table — the linker drops it from the final library.

Correctness gate

Test set	Post-P3	Tanda 1	Tanda 2	Tanda 3
`ctest` (full suite, `LINCE_DISPATCH=switch`)	543/543	543/543	543/543	543/543
`ctest` (full suite, `LINCE_DISPATCH=threaded`)	n/a	543/543	543/543 ¹	543/543
RTEMS sptests N=1 (GR712RC)	178/189	178/189	178/189	178/189
RTEMS smptests N=2 (GR712RC)	42/49	42/49	42/49	42/49
RTEMS smptests N=4 (GR740)	40/49	40/49	40/49	40/49

¹ Under heavy parallel host load (two ctest -j N runs racing on the same machine) test #524 (GdbStub m-packet reads IRQMP MMIO via word path) flaked once in the threaded ctest. The test passes in isolation and on a serial-load re-run; the threaded fast path is disabled when gdb_stub_ != nullptr, so behaviour through that test is identical to switch mode. Same pattern as pacing_test_flaky_under_load (see MEMORY.md).

What's left in Phase 10.2

The plan has six tandas. Half are done.

Tanda	Status	What it does
1 — Scaffolding	✅ `7999f0e`	CMake option + types.
2 — Top-5 handlers + chain	✅ `44fa1f2`	First measurable win.
3 — Per-AluOp specialisation	✅ `bf5e274`	Inner ALU switch out of hot edge.
4 — Shift / Save / Restore / Jmpl threading	⬜ Next	RTEMS-workload boost (~1.05×).
5 — FP threading + EF gate move	⬜ Pending	fptest01 target ≥ 1.1×.
6 — Tuning + write-up	⬜ Pending	cpubound exit target ≥ 65 MIPS.

Plan exit target on cpubound-mix is ≥ 65 MIPS. We are at 45.63 MIPS — ~1.42× still to find across Tandas 4, 5, 6.

The biggest remaining lever is not in the formal plan: the per-instruction state.commit_psr_pipeline() and state.clear_branch_request() calls inside every chain link account for a measurable share of overhead. The plan's original intent was to move these to a "stop epilogue" that runs once per chain, with a chain-break on WRPSR to preserve the SPARC V8 3-instruction PSR delay. That refactor is held until Tanda 4 lands, at which point it can be measured against a known-good baseline.

Hot-path shift

The pre-10.2 hot-path analysis in phase10-p1-p3-results.md §"Hot-path shift" predicted that Tanda 3 would push exec_alu out of the top-10. This is confirmed structurally: in threaded mode the slot's fn resolves directly to th_alu<AluOp::Foo>, never to exec_alu. (Switch mode and th_fallback still use it.)

The remaining hot symbols in threaded mode are register-file accessors (CpuState::read_r, CpuState::write_r, CpuState::icc), the chain-prologue work (commit_psr_pipeline, clear_branch_request), and bus reads for instruction fetches on cache miss (RamRegion::read_u32).

Reproduction

# Switch baseline (post-P3 reference)
cmake --preset bench-profile && cmake --build --preset bench-profile
taskset -c 0 build-bench-profile/bench/lince-bench --workload cpubound-mix --runs 5

# Threaded (Tanda 3)
cmake -S . -B build-bench-threaded -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -DCMAKE_CXX_FLAGS_RELWITHDEBINFO="-O2 -g -fno-omit-frame-pointer -gdwarf-4" \
    -DLINCE_DISPATCH=threaded
cmake --build build-bench-threaded -j
taskset -c 0 build-bench-threaded/bench/lince-bench --workload cpubound-mix --runs 5

# ctest both modes (serial to avoid host-load flakes)
ctest --test-dir build               # switch mode
ctest --test-dir build-threaded      # threaded mode

Cross-references

Plan: plans/phase10-2-threaded-code.md.
Post-P3 baseline: phase10-p1-p3-results.md.
Long-horizon trajectory: plans/post-mvp-1to1-roadmap.md. At 1.85× wall-clock on one host thread for cpubound, Phase 13 (multi-thread, ~4 host cores ≈ 4 × 185 % = 7.4× / GR740-N4) + Phase 12 (LLVM JIT, 2–3× further) put 1:1 GR740 SMP within reach.