Phase 10.2 Tandas 1–3: Results
The first three tandas of the threaded-code dispatcher plan
(plans/phase10-2-threaded-code.md)
have landed. The interpreter's hot edge no longer goes through
switch (insn.kind) followed by switch (insn.alu_op). Instead, each
slot in the decode cache stores a specialised ThreadedHandler
that tail-calls into the next slot's handler via
[[clang::musttail]]. The work landed as three commits on top of the
post-P3 baseline d33ede0: 7999f0e, 44fa1f2, and the post-Tanda-3
commit bf5e274.
Net result: cpubound-mix runs 1.326× as fast as the post-P3 switch baseline (34.41 → 45.63 MIPS) and now simulates a GR712RC core at 1.85× wall-clock speed on a single host thread.
Result summary
All numbers are the median of 5 lince-bench runs, pinned to a single
P-core (taskset -c 0) on a 12th-gen Intel i5-12450HX. Runs that
appear side by side were captured back-to-back on the same host
state, so switch-vs-threaded deltas are largely free of host noise;
cross-tanda comparisons absorb host jitter (a few percent across runs).
| Sub-task | Commit | cpubound-mix | boot-rtems | fptest01 |
|---|---|---|---|---|
| Post-P3 baseline (switch) | d33ede0 | 34.41 MIPS | 21.36 MIPS | 20.68 MIPS |
| Tanda 1 (scaffolding) | 7999f0e | ≡ baseline | ≡ baseline | ≡ baseline |
| Tanda 2 (top-5 + chain) | 44fa1f2 | 43.42 (+26.2 %) | 23.81 (+11.5 %) | 21.13 (+2.2 %) |
| Tanda 3 (per-AluOp) | bf5e274 | 45.63 (+5.1 %) | 22.80 (-4.2 %) ¹ | 22.86 (+8.2 %) |
| Total over post-P3 | | +32.6 % (1.326×) | +6.7 % (1.07×) | +10.5 % (1.10×) |
¹ Tanda 3 boot-rtems regression is within host-jitter noise (the two values straddle 22 MIPS run-to-run). Boot is FP-and-trap heavy and barely uses AluReg, so Tanda 3's specialisation does not move it.
%realtime on cpubound-mix climbed from 137 % (post-P3) →
174 % (Tanda 2) → 185 % (Tanda 3). A single GR712RC core
now simulates at 1.85× the speed of the silicon it emulates.
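The percentages in the table are each relative to the row above, so they compose multiplicatively rather than additively. The following is a small sketch of that arithmetic using the cpubound-mix medians; the function and constant names are illustrative only and do not come from lince-bench.

```cpp
#include <cmath>

// Throughput ratio between two MIPS figures.
constexpr double speedup(double before_mips, double after_mips) {
    return after_mips / before_mips;
}

// The same ratio expressed as a percentage gain.
constexpr double percent_gain(double before_mips, double after_mips) {
    return (speedup(before_mips, after_mips) - 1.0) * 100.0;
}

// cpubound-mix medians taken from the result table.
constexpr double kBaseline = 34.41;  // post-P3 (switch)
constexpr double kTanda2   = 43.42;
constexpr double kTanda3   = 45.63;

// Per-tanda gains multiply: (Tanda 2 over baseline) x (Tanda 3 over Tanda 2).
constexpr double kTotal =
    speedup(kBaseline, kTanda2) * speedup(kTanda2, kTanda3);

// Tolerance comparison, since the table rounds to one decimal place.
inline bool close_to(double a, double b, double tol) {
    return std::fabs(a - b) <= tol;
}
```

Simply adding the two per-tanda figures (26.2 % + 5.1 % = 31.3 %) would understate the 32.6 % total reported in the last row.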
What landed in each tanda
Tanda 1: scaffolding (commit 7999f0e)
Inert prep: LINCE_DISPATCH={switch,threaded} CMake option
(default switch), ThreadedHandler function-pointer typedef in
decoded_insn.hpp, DecodeCacheEntry::fn slot in cpu_state.hpp,
and an empty dispatch_threaded.cpp translation unit. Zero
behaviour change vs post-P3 in either mode. ctest 543/543 in both
configurations.
The point was to land the binary-stable surface so subsequent tandas could be reviewed as diffs against a fixed scaffolding.
Tanda 2: top-5 threaded handlers + chain (commit 44fa1f2)
Five kind-specialised handlers — th_handler<K> instantiations
for AluReg, Branch, SetHi, Store, Load — plus a
th_fallback that threads the chain skeleton but uses the
classic execute() switch for the body of every long-tail kind.
Each handler:
- Runs the same per-instruction bookkeeping core::step does today (commit_psr_pipeline, clear_branch_request, PC alignment check, annul handling).
- Calls the kind-specific exec helper (inlined via if constexpr in call_exec_for<K>).
- Bumps the per-chain instruction / cycle counter on CpuState.
- Returns immediately on trap, power-down, or budget exhaustion.
- Otherwise advances PC/nPC, looks up the next slot (populating slot.fn on miss), and [[clang::musttail]] jumps to it.
The ChainResult { instructions_ran, cycles_used, status } flows
out of run_threaded_chain and into Emulator::run_until_unpaced,
which uses it instead of a quantum-long core::step loop. The
chain is engaged only when gdb_stub_ == nullptr && observer_ ==
nullptr — GDB single-step / breakpoint sampling and observer
notifications keep the classic per-step path for correct
per-instruction granularity.
Technical gotcha: Clang refuses [[clang::musttail]] from a
noexcept caller ("cannot compile this tail call skipping over
cleanups yet" — the implicit terminate landingpad counts as a
cleanup). All threaded handlers in dispatch_threaded.cpp are
therefore deliberately non-noexcept; internal helpers
(call_exec_for, prepare_slot, chain_status_to_tt,
exec_alu_specific) stay noexcept. GCC 15's
[[gnu::musttail]] is unaffected. A
LINCE_MUSTTAIL macro routed through
__has_cpp_attribute(clang::musttail) covers both toolchains —
GCC 15 ships clang::musttail as a synonym.
The chain edge is confirmed to compile to jmp *%r10 (an indirect
tail call) in the output of
objdump -d build-threaded/.../dispatch_threaded.cpp.o.
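As a rough illustration of the handler shape described in this section, the sketch below chains toy handlers through a function pointer stored in each slot. Everything in it is hypothetical: SKETCH_MUSTTAIL stands in for LINCE_MUSTTAIL, and the Kind, Status, Slot and Cpu types plus the flat program array stand in for the real decode cache and CpuState. The real bookkeeping (PSR pipeline, annul handling, slot population on miss) is omitted. It assumes a C++17 compiler; where the attribute is not advertised, the macro degrades to an ordinary call.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for LINCE_MUSTTAIL: use the attribute where the
// compiler advertises it, otherwise fall back to a plain call.
#if defined(__has_cpp_attribute)
#  if __has_cpp_attribute(clang::musttail)
#    define SKETCH_MUSTTAIL [[clang::musttail]]
#  endif
#endif
#ifndef SKETCH_MUSTTAIL
#  define SKETCH_MUSTTAIL
#endif

enum class Kind { Add, Halt };                  // toy instruction kinds
enum class Status { Halted, BudgetExhausted };

struct Cpu;
struct Slot;
// Every handler shares one signature so a tail call between them is legal.
using Handler = Status (*)(Cpu&, const Slot*);

struct Slot { Handler fn; std::uint32_t imm; };
struct Cpu {
    std::uint32_t acc = 0;
    std::uint64_t instructions_ran = 0;
    std::uint64_t budget = 0;
};

// Internal helper: may stay noexcept because it performs no tail call.
inline void exec_add(Cpu& cpu, const Slot& slot) noexcept { cpu.acc += slot.imm; }

// Deliberately not noexcept: Clang rejects musttail from a noexcept caller.
template <Kind K>
Status th_handler(Cpu& cpu, const Slot* slot) {
    if constexpr (K == Kind::Halt) {
        return Status::Halted;                       // stop condition
    } else {
        exec_add(cpu, *slot);                        // kind-specific body
        ++cpu.instructions_ran;                      // per-chain counter
        if (cpu.instructions_ran >= cpu.budget) {
            return Status::BudgetExhausted;          // leave the chain
        }
        const Slot* next = slot + 1;                 // toy "advance PC"
        SKETCH_MUSTTAIL return next->fn(cpu, next);  // chain edge
    }
}

struct ChainResult { std::uint64_t instructions_ran; Status status; };

// Runs handlers until one returns. The program must end in a Halt slot
// (or the budget must run out first) so the chain never walks off the end.
inline ChainResult run_chain(Cpu& cpu, const std::vector<Slot>& program,
                             std::uint64_t budget) {
    cpu.instructions_ran = 0;
    cpu.budget = budget;
    const Status status = program.front().fn(cpu, program.data());
    return ChainResult{cpu.instructions_ran, status};
}
```

The caller sees a single call that returns only when the chain stops, which is the role ChainResult plays for run_threaded_chain in the text. Because each link is a jump rather than a nested call, stack depth should stay constant regardless of chain length when the attribute is honoured.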
Tanda 3: per-AluOp specialisation (commit bf5e274)
24 th_alu<Op> instantiations replace the inner switch
(insn.alu_op) in detail::exec_alu. Coverage: ADD / ADDcc /
ADDX / ADDXcc; SUB family; AND / ANDcc / ANDN / ANDNcc; OR
family; XOR family; tagged Tadd/Tsub (which can trap on
overflow). The dispatch lookup becomes
dispatch_for_decoded(const DecodedInsn&): for AluReg it
forwards to dispatch_alu_op(insn.alu_op); everything else
behaves as before.
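The idea behind this lookup can be sketched as follows: the switch on the ALU opcode runs once, when a slot is populated, and returns a pointer to a handler that already knows its operation. The types, the five-entry AluOp enum, the handler signature and the operand fields below are simplified assumptions for illustration; only the names dispatch_for_decoded, dispatch_alu_op, th_alu and th_fallback come from the text.

```cpp
#include <cstdint>

enum class InsnKind : std::uint8_t { AluReg, Branch, Other };
enum class AluOp : std::uint8_t { Add, Sub, And, Or, Xor };

// Hypothetical, heavily reduced decoded-instruction layout.
struct DecodedInsn {
    InsnKind kind;
    AluOp alu_op;
    std::uint32_t a;
    std::uint32_t b;
};

// Simplified handler signature: returns the ALU result directly.
using ThreadedHandler = std::uint32_t (*)(const DecodedInsn&);

// One instantiation per operation; no runtime switch in the body.
template <AluOp Op>
std::uint32_t th_alu(const DecodedInsn& insn) {
    if constexpr (Op == AluOp::Add) return insn.a + insn.b;
    else if constexpr (Op == AluOp::Sub) return insn.a - insn.b;
    else if constexpr (Op == AluOp::And) return insn.a & insn.b;
    else if constexpr (Op == AluOp::Or) return insn.a | insn.b;
    else return insn.a ^ insn.b;
}

// Placeholder for the long-tail path.
inline std::uint32_t th_fallback(const DecodedInsn&) { return 0; }

// The only place that switches on alu_op; executed at slot-population time.
inline ThreadedHandler dispatch_alu_op(AluOp op) {
    switch (op) {
        case AluOp::Add: return &th_alu<AluOp::Add>;
        case AluOp::Sub: return &th_alu<AluOp::Sub>;
        case AluOp::And: return &th_alu<AluOp::And>;
        case AluOp::Or:  return &th_alu<AluOp::Or>;
        case AluOp::Xor: return &th_alu<AluOp::Xor>;
    }
    return &th_fallback;
}

inline ThreadedHandler dispatch_for_decoded(const DecodedInsn& insn) {
    if (insn.kind == InsnKind::AluReg) {
        return dispatch_alu_op(insn.alu_op);
    }
    return &th_fallback;  // other kinds keep their existing handlers
}
```

Once the returned pointer is cached in the slot, every later execution of that instruction skips both switches.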
Flag helpers (flags_add, flags_sub, flags_logical) moved
from handlers_alu.cpp's anonymous namespace to
handlers_internal.hpp so the threaded exec_alu_specific<Op>
templates and the runtime exec_alu switch share one
implementation. detail::exec_alu is retained (used by switch
mode and by th_fallback).
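A minimal sketch of what sharing the flag helpers might look like is given below. The Icc struct, the helper signatures and the two-entry AluOp enum are assumptions, not the contents of handlers_internal.hpp; the overflow and carry formulas follow the SPARC V8 definitions for ADDcc and SUBcc.

```cpp
#include <cstdint>

struct Icc { bool n, z, v, c; };  // SPARC integer condition codes

constexpr bool msb(std::uint32_t x) noexcept { return (x >> 31) != 0; }

// icc after r = a + b.
constexpr Icc flags_add(std::uint32_t a, std::uint32_t b, std::uint32_t r) noexcept {
    return Icc{msb(r), r == 0,
               (msb(a) && msb(b) && !msb(r)) || (!msb(a) && !msb(b) && msb(r)),
               (msb(a) && msb(b)) || (!msb(r) && (msb(a) || msb(b)))};
}

// icc after r = a - b; c acts as the borrow.
constexpr Icc flags_sub(std::uint32_t a, std::uint32_t b, std::uint32_t r) noexcept {
    return Icc{msb(r), r == 0,
               (msb(a) && !msb(b) && !msb(r)) || (!msb(a) && msb(b) && msb(r)),
               (!msb(a) && msb(b)) || (msb(r) && (!msb(a) || msb(b)))};
}

enum class AluOp { AddCc, SubCc };
struct AluResult { std::uint32_t value; Icc icc; };

// Compile-time path, as a threaded handler would use it.
template <AluOp Op>
constexpr AluResult exec_alu_specific(std::uint32_t a, std::uint32_t b) noexcept {
    if constexpr (Op == AluOp::AddCc) {
        return AluResult{a + b, flags_add(a, b, a + b)};
    } else {
        return AluResult{a - b, flags_sub(a, b, a - b)};
    }
}

// Runtime path, as switch mode would use it. Same helpers, same results.
inline AluResult exec_alu(AluOp op, std::uint32_t a, std::uint32_t b) noexcept {
    switch (op) {
        case AluOp::AddCc: return AluResult{a + b, flags_add(a, b, a + b)};
        case AluOp::SubCc: return AluResult{a - b, flags_sub(a, b, a - b)};
    }
    return AluResult{0, Icc{false, true, false, false}};  // not reached for valid ops
}
```

Keeping a single definition of each helper means the two dispatch modes cannot drift apart on condition-code semantics, which is what lets the same ctest suite gate both.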
th_handler<InsnKind::AluReg> is no longer reached by the
dispatch table — the linker drops it from the final library.
Correctness gate
| Test set | Post-P3 | Tanda 1 | Tanda 2 | Tanda 3 |
|---|---|---|---|---|
| ctest (full suite, LINCE_DISPATCH=switch) | 543/543 | 543/543 | 543/543 | 543/543 |
| ctest (full suite, LINCE_DISPATCH=threaded) | n/a | 543/543 | 543/543 ¹ | 543/543 |
| RTEMS sptests N=1 (GR712RC) | 178/189 | 178/189 | 178/189 | 178/189 |
| RTEMS smptests N=2 (GR712RC) | 42/49 | 42/49 | 42/49 | 42/49 |
| RTEMS smptests N=4 (GR740) | 40/49 | 40/49 | 40/49 | 40/49 |
¹ Under heavy parallel host load (two ctest -j N runs racing on
the same machine) test #524 (GdbStub m-packet reads IRQMP MMIO
via word path) flaked once in the threaded ctest. The test
passes in isolation and on a serial-load re-run; the threaded
fast path is disabled when gdb_stub_ != nullptr, so behaviour
through that test is identical to switch mode. Same pattern as
pacing_test_flaky_under_load (see MEMORY.md).
What's left in Phase 10.2
The plan has six tandas. Half are done.
| Tanda | Status | What it does |
|---|---|---|
| 1: Scaffolding | ✅ 7999f0e | CMake option + types. |
| 2: Top-5 handlers + chain | ✅ 44fa1f2 | First measurable win. |
| 3: Per-AluOp specialisation | ✅ bf5e274 | Inner ALU switch out of hot edge. |
| 4: Shift / Save / Restore / Jmpl threading | ⬜ Next | RTEMS-workload boost (~1.05×). |
| 5: FP threading + EF gate move | ⬜ Pending | fptest01 target ≥ 1.1×. |
| 6: Tuning + write-up | ⬜ Pending | cpubound exit target ≥ 65 MIPS. |
The plan's exit target on cpubound-mix is ≥ 65 MIPS. We are at 45.63 MIPS, so roughly 1.42× (65 / 45.63) remains to be found across Tandas 4, 5, and 6.
The biggest remaining lever is not in the formal plan: the
per-instruction state.commit_psr_pipeline() and
state.clear_branch_request() calls inside every chain link
account for a measurable share of overhead. The plan's original
intent was to move these to a "stop epilogue" that runs once per
chain, with a chain-break on WRPSR to preserve the SPARC V8
3-instruction PSR delay. That refactor is held until Tanda 4
lands, at which point it can be measured against a known-good
baseline.
Hot-path shift
The pre-10.2 hot paths from phase10-p1-p3-results.md §"Hot-path
shift" predicted Tanda 3 would push exec_alu out of the
top-10 — confirmed structurally: in threaded mode the slot's
fn resolves directly to th_alu<AluOp::Foo>, never to
exec_alu. (Switch mode and th_fallback still use it.)
The remaining hot symbols in threaded mode are register-file
accessors (CpuState::read_r, CpuState::write_r,
CpuState::icc), the chain-prologue work
(commit_psr_pipeline, clear_branch_request), and bus reads
for instruction fetches on cache miss
(RamRegion::read_u32).
Reproduction
# Switch baseline (post-P3 reference)
cmake --preset bench-profile && cmake --build --preset bench-profile
taskset -c 0 build-bench-profile/bench/lince-bench --workload cpubound-mix --runs 5
# Threaded (Tanda 3)
cmake -S . -B build-bench-threaded -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DCMAKE_CXX_FLAGS_RELWITHDEBINFO="-O2 -g -fno-omit-frame-pointer -gdwarf-4" \
-DLINCE_DISPATCH=threaded
cmake --build build-bench-threaded -j
taskset -c 0 build-bench-threaded/bench/lince-bench --workload cpubound-mix --runs 5
# ctest both modes (serial to avoid host-load flakes)
ctest --test-dir build # switch mode
ctest --test-dir build-threaded # threaded mode
Cross-references
- Plan: plans/phase10-2-threaded-code.md.
- Post-P3 baseline: phase10-p1-p3-results.md.
- Long-horizon trajectory: plans/post-mvp-1to1-roadmap.md. At 1.85× wall-clock on one host thread for cpubound, Phase 13 (multi-thread, ~4 host cores ≈ 4 × 185 % = 7.4× / GR740-N4) + Phase 12 (LLVM JIT, 2–3× further) put 1:1 GR740 SMP within reach.