Phase 10.2 — Threaded-code dispatch: Results¶

Historical — threaded code was removed

This page is a historical record. The threaded-code dispatcher reached ~1.5× over the Switch interpreter but missed its exit targets and was removed in favour of the arch-neutral IR + LLVM JIT (the current fast path; default translation = true). See Design decisions (Decision 59) and IR and LLVM JIT. "Default-capable mode" and LINCE_DISPATCH below refer to the threaded prototype as it existed, not to current behaviour.

Closes the threaded-code dispatcher plan (plans/phase10-2-threaded-code.md). Tandas 1–4 landed 2026-05-17/18 (7999f0e, 44fa1f2, bf5e274, 75a5df5). The final batch — Tanda 5 (FP-handler threading + PSR.EF gate move) plus the dispatcher de-duplication and the stop-epilogue refactor — landed in 1530f99 (2026-05-24), bundled with the CPI timing-model rework that shares the same files.

What landed in the final batch (`1530f99`)¶

One th_skeleton<Body> replaces the per-handler th_handler<K> / th_alu<Op> duplication. Every chain body now delegates to the same detail::exec_* / detail::exec_alu_op<Op> that core::step uses, so the threaded path no longer duplicates instruction semantics. tests/unit/test_dispatch_equivalence.cpp is the differential safety net: it asserts step() and run_threaded_chain() reach bit-identical architectural state, including the threaded FP path.
Tanda 5 — FP threading: FpLoad/FpStore/FpOp1/FpOp2/ FpBranch/FpUnknown now resolve directly to th_skeleton<&fp_body<K>> instead of falling through th_skeleton<&execute>. The PSR.EF gate moved into fp_body, so non-FP instructions (>99 % of the hot path) never test state.ef(). The EF check stays a live runtime read inside fp_body rather than being baked into handler selection — so no decode-cache invalidation on WRPSR(EF) is needed (the plan's alternative). The classic execute() switch keeps its own EF gate for the step() path (single-step, observer, GDB).
Stop-epilogue refactor: commit_psr_pipeline() now inlines the no-pending fast path (one predictable branch) and pushes the rare 3-instruction PSR-delay apply work out-of-line into commit_psr_pipeline_slow(). The per-instruction cycle counter dropped from the chain bookkeeping (ChainResult/StepResult no longer carry cycles), since the CPI rework makes sim-time a single global scalar.

Result summary¶

All numbers are the median of 5 internal lince-bench runs, pinned to a single P-core (taskset -c 0) on a 12^th-gen Intel i5-12450HX, captured warm (after a cpubound warm-up) and back-to-back so the switch-vs-threaded comparison sees the same host state. Commit 1530f99, LINCE_DISPATCH=threaded vs =switch.

Workload	switch (MIPS)	threaded (MIPS)	ratio
cpubound-mix	~37.1	~55.7	1.50×
fptest01	~23.4	~24.3	1.04×
boot-rtems-hello-world	~24.1	~25.9	1.07×

cpubound-mix threaded sustains > 200 % realtime — the emulator simulates more than 2× a real LEON3 at 50 MHz on one host thread.

Progress over Tanda 4¶

Tanda 4 measured threaded cpubound-mix at 46.48 MIPS (memory snapshot). The de-dup + stop-epilogue refactor lifts it to ~55 MIPS — a ~1.18× gain within threaded mode, confirming the stop-epilogue was, as predicted, the largest remaining dispatch-level lever.

Honest read on the exit criteria¶

The plan set two numeric exit targets. Neither is met as written:

Criterion	Target	Measured	Met?
Tanda 5 — fptest01 threaded delta	≥ 1.1× over switch	~1.04×	❌
Tanda 6 — cpubound-mix absolute	≥ 65 MIPS	~55 MIPS (best 61.35)	❌

Why, and why this is still the right place to stop dispatch-level work:

fptest01 is compute-bound, not dispatch-bound. FP ops are backed by Berkeley SoftFloat; an fadd/fmul/fdiv costs far more than the switch (insn.kind) that Tanda 5 removed. Threading FP dispatch shaves a sliver off a large constant, so the workload barely moves (~1.04×). The prior expectation that "Tanda 5 wins on FP" was optimistic — the FP dispatch was never the bottleneck; the FP arithmetic is.
Threaded dispatch is structurally exhausted as a lever. Every hot InsnKind now resolves directly from the decode cache to a specialised th_skeleton<Body> with no th_fallback on the hot path. There is no further dispatch overhead to remove; the remaining cost is real work (register-file accesses, ALU bodies, SoftFloat). Closing the 55→65 gap is IR/block-level territory — Phase 11 (IR + block chaining), not more threading.
Host variance is high. This i5-12450HX laptop swings ±10 MIPS run-to-run on cpubound-mix (thermal boost, hybrid cores). The best warm runs reach 61 MIPS; a quiet server-class host with a fixed performance governor would likely clear 65, but acceptance should not depend on cherry-picking the host.

Correctness¶

ctest is 554/554 in both dispatch modes at 1530f99: - LINCE_DISPATCH=switch: 554/554 (24 min). - LINCE_DISPATCH=threaded: 554/554 (30 min).

This includes the full RTEMS sptest/smptest/fptest integration suites and test_dispatch_equivalence (which compiles and exercises both dispatch paths regardless of the build's mode).

Recommendation¶

Phase 10.2 is structurally complete and correct: threaded dispatch is the default-capable mode, all hot kinds are threaded, the dispatcher is de-duplicated against step(), and it delivers a solid 1.5× cpubound win over the switch interpreter. The two numeric targets are missed for structural reasons (FP is compute-bound; dispatch overhead is exhausted), not for lack of remaining threading work. The residual ~55→65 MIPS push belongs to Phase 11 (IR + block chaining), where block-level translation can eliminate the per-instruction skeleton overhead the threaded chain still pays.