Phase 10.2: Threaded-code dispatch results
Historical — threaded code was removed
This page is a historical record. The threaded-code dispatcher
reached ~1.5× over the Switch interpreter but missed its exit targets
and was removed in favour of the arch-neutral IR + LLVM JIT (the
current fast path; default translation = true). See
Design decisions (Decision 59) and
IR and LLVM JIT. "Default-capable mode" and
LINCE_DISPATCH below refer to the threaded prototype as it existed,
not to current behaviour.
Closes the threaded-code dispatcher plan
(plans/phase10-2-threaded-code.md). Tandas 1–4 landed
2026-05-17/18 (7999f0e, 44fa1f2, bf5e274, 75a5df5). The final
batch — Tanda 5 (FP-handler threading + PSR.EF gate move) plus the
dispatcher de-duplication and the stop-epilogue refactor — landed
in 1530f99 (2026-05-24), bundled with the CPI timing-model rework that
shares the same files.
What landed in the final batch (1530f99)
- One `th_skeleton<Body>` replaces the per-handler `th_handler<K>` / `th_alu<Op>` duplication. Every chain body now delegates to the same `detail::exec_*` / `detail::exec_alu_op<Op>` that `core::step` uses, so the threaded path no longer duplicates instruction semantics. `tests/unit/test_dispatch_equivalence.cpp` is the differential safety net: it asserts `step()` and `run_threaded_chain()` reach bit-identical architectural state, including the threaded FP path.
- Tanda 5, FP threading: `FpLoad` / `FpStore` / `FpOp1` / `FpOp2` / `FpBranch` / `FpUnknown` now resolve directly to `th_skeleton<&fp_body<K>>` instead of falling through `th_skeleton<&execute>`. The `PSR.EF` gate moved into `fp_body`, so non-FP instructions (>99 % of the hot path) never test `state.ef()`. The EF check stays a live runtime read inside `fp_body` rather than being baked into handler selection, so no decode-cache invalidation on `WRPSR(EF)` is needed (the plan's alternative). The classic `execute()` switch keeps its own EF gate for the `step()` path (single-step, observer, GDB).
- Stop-epilogue refactor: `commit_psr_pipeline()` now inlines the no-pending fast path (one predictable branch) and pushes the rare 3-instruction PSR-delay apply work out-of-line into `commit_psr_pipeline_slow()`. The per-instruction cycle counter dropped from the chain bookkeeping (`ChainResult` / `StepResult` no longer carry `cycles`), since the CPI rework makes sim-time a single global scalar.
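The first two items can be pictured with a toy model. The sketch below is not the project's code: only the names `th_skeleton`, `fp_body`, `step` and `run_threaded_chain` echo this page, while every type, field and signature (`State`, `Insn`, `Kind`, `resolve`, `make`, the `bool` return convention) is an assumption made for illustration. It shows one templated skeleton wrapping shared `detail::exec_*` bodies, with the EF check read at run time inside the FP body rather than chosen at decode time.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy model. Only th_skeleton, fp_body, step and run_threaded_chain echo names
// from this page; all types, fields and signatures are assumptions.
enum class Kind : std::uint8_t { Add, FpOp, Halt };

struct State {
    std::uint32_t r[4] = {0, 0, 0, 0};  // integer registers
    std::uint32_t f[4] = {0, 0, 0, 0};  // FP registers, modelled as raw words
    bool ef = false;                    // stands in for PSR.EF
    bool fp_trap = false;               // set when an FP insn runs with EF clear
    std::size_t pc = 0;
};

struct Insn;
using Handler = void (*)(State&, const Insn*);
using Body = bool (*)(State&, const Insn&);  // returns false to stop the chain

struct Insn {
    Kind kind;
    std::uint8_t rd, rs1, rs2;
    Handler handler;  // resolved once, at decode time
};

namespace detail {
// Shared instruction semantics: the only place each operation is defined.
inline bool exec_add(State& s, const Insn& i) {
    s.r[i.rd] = s.r[i.rs1] + s.r[i.rs2];
    return true;
}
inline bool exec_fp(State& s, const Insn& i) {
    s.f[i.rd] = s.f[i.rs1] ^ s.f[i.rs2];  // placeholder for a real FP op
    return true;
}
inline bool exec_halt(State&, const Insn&) { return false; }
}  // namespace detail

// FP body: EF is read at run time, so toggling it needs no re-decode.
inline bool fp_body(State& s, const Insn& i) {
    if (!s.ef) { s.fp_trap = true; return false; }
    return detail::exec_fp(s, i);
}

// One skeleton for every kind: run the body, advance, call the next handler.
// Real threaded code would want a guaranteed tail call here.
template <Body B>
void th_skeleton(State& s, const Insn* insn) {
    if (!B(s, *insn)) return;
    ++s.pc;
    const Insn* next = insn + 1;
    next->handler(s, next);
}

inline Handler resolve(Kind k) {
    switch (k) {
        case Kind::Add:  return &th_skeleton<&detail::exec_add>;
        case Kind::FpOp: return &th_skeleton<&fp_body>;
        case Kind::Halt: return &th_skeleton<&detail::exec_halt>;
    }
    return nullptr;
}

inline Insn make(Kind k, std::uint8_t rd = 0, std::uint8_t rs1 = 0,
                 std::uint8_t rs2 = 0) {
    return Insn{k, rd, rs1, rs2, resolve(k)};
}

// Threaded path: the program must end in Halt so the chain terminates.
inline void run_threaded_chain(State& s, const std::vector<Insn>& prog) {
    assert(!prog.empty() && prog.back().kind == Kind::Halt);
    const Insn* first = &prog[s.pc];
    first->handler(s, first);
}

// Switch path: same bodies, with its own EF gate for the FP case.
inline bool step(State& s, const std::vector<Insn>& prog) {
    const Insn& i = prog[s.pc];
    bool cont = false;
    switch (i.kind) {
        case Kind::Add:  cont = detail::exec_add(s, i); break;
        case Kind::FpOp:
            if (s.ef) cont = detail::exec_fp(s, i);
            else s.fp_trap = true;
            break;
        case Kind::Halt: cont = detail::exec_halt(s, i); break;
    }
    if (cont) ++s.pc;
    return cont;
}
```

Because both paths call the same `detail::exec_*` functions, a change to an instruction's semantics lands in one place. Running a program through `run_threaded_chain` and through a `while (step(...))` loop from the same starting `State` should leave both copies identical, whether `ef` is set or clear.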
Result summary
All numbers are the median of 5 internal lince-bench runs, pinned to
a single P-core (taskset -c 0) on a 12th-gen Intel i5-12450HX, captured
warm (after a cpubound warm-up) and back-to-back so the
switch-vs-threaded comparison sees the same host state. Commit 1530f99,
LINCE_DISPATCH=threaded vs =switch.
| Workload | switch (MIPS) | threaded (MIPS) | ratio |
|---|---|---|---|
| cpubound-mix | ~37.1 | ~55.7 | 1.50× |
| fptest01 | ~23.4 | ~24.3 | 1.04× |
| boot-rtems-hello-world | ~24.1 | ~25.9 | 1.07× |
cpubound-mix threaded sustains > 200 % realtime — the emulator
simulates more than 2× a real LEON3 at 50 MHz on one host thread.
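For readers who want to reproduce the ratio column, the reduction is small enough to sketch. The helpers below are hypothetical (they are not part of `lince-bench`); they assume each configuration yields a list of per-run MIPS values, take the median, and round the threaded/switch quotient to two decimals as the table does.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Median of a sample. For an odd count (the page uses 5 runs) this is the
// middle value; for an even count it returns the upper of the two middles.
inline double median(std::vector<double> runs) {
    assert(!runs.empty());
    auto mid = runs.begin() + runs.size() / 2;
    std::nth_element(runs.begin(), mid, runs.end());
    return *mid;
}

// Threaded-over-switch speed-up, rounded to two decimals.
inline double ratio(double threaded_mips, double switch_mips) {
    assert(switch_mips > 0.0);
    return std::round(threaded_mips / switch_mips * 100.0) / 100.0;
}
```

Feeding the table's medians back in, `ratio(55.7, 37.1)`, `ratio(24.3, 23.4)` and `ratio(25.9, 24.1)` give 1.50, 1.04 and 1.07 respectively, matching the ratio column.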
Progress over Tanda 4
Tanda 4 measured threaded cpubound-mix at 46.48 MIPS (memory snapshot).
The de-dup + stop-epilogue refactor lifts it to ~55 MIPS — a ~1.18× gain
within threaded mode, confirming the stop-epilogue was, as predicted,
the largest remaining dispatch-level lever.
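The shape of the stop-epilogue change can be sketched as an inline fast path guarding an out-of-line slow path. Only the two function names come from this page; the `Cpu` struct, its fields, the `write_psr` helper and the countdown model of the 3-instruction delay are assumptions for illustration and may not match the real pipeline state.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical state: a PSR write that takes effect a few instructions later.
struct Cpu {
    std::uint32_t psr = 0;
    std::uint32_t pending_psr = 0;
    std::uint8_t psr_delay = 0;  // 0 means no write is pending
};

// Rare path, kept out of line so the hot caller stays small.
// [[gnu::noinline]] is a GCC/Clang attribute; other compilers may ignore it.
[[gnu::noinline]] inline void commit_psr_pipeline_slow(Cpu& cpu) {
    assert(cpu.psr_delay > 0);
    if (--cpu.psr_delay == 0) cpu.psr = cpu.pending_psr;
}

// Hot path: a single, almost always not-taken branch per instruction.
inline void commit_psr_pipeline(Cpu& cpu) {
    if (cpu.psr_delay == 0) return;
    commit_psr_pipeline_slow(cpu);
}

// Assumed model of a delayed PSR write: visible after three commits.
inline void write_psr(Cpu& cpu, std::uint32_t value) {
    cpu.pending_psr = value;
    cpu.psr_delay = 3;
}
```

The design point is that the common case costs one compare and a return, while the bookkeeping for a pending write lives in a function the compiler is asked not to inline into every handler.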
Honest read on the exit criteria
The plan set two numeric exit targets. Neither is met as written:
| Criterion | Target | Measured | Met? |
|---|---|---|---|
| Tanda 5 — fptest01 threaded delta | ≥ 1.1× over switch | ~1.04× | ❌ |
| Tanda 6 — cpubound-mix absolute | ≥ 65 MIPS | ~55 MIPS (best 61.35) | ❌ |
Why, and why this is still the right place to stop dispatch-level work:
- fptest01 is compute-bound, not dispatch-bound. FP ops are backed by Berkeley SoftFloat; an `fadd` / `fmul` / `fdiv` costs far more than the `switch (insn.kind)` that Tanda 5 removed. Threading FP dispatch shaves a sliver off a large constant, so the workload barely moves (~1.04×). The prior expectation that "Tanda 5 wins on FP" was optimistic: the FP dispatch was never the bottleneck; the FP arithmetic is.
- Threaded dispatch is structurally exhausted as a lever. Every hot `InsnKind` now resolves directly from the decode cache to a specialised `th_skeleton<Body>` with no `th_fallback` on the hot path. There is no further dispatch overhead to remove; the remaining cost is real work (register-file accesses, ALU bodies, SoftFloat). Closing the 55→65 gap is IR/block-level territory: Phase 11 (IR + block chaining), not more threading.
- Host variance is high. This i5-12450HX laptop swings ±10 MIPS run-to-run on `cpubound-mix` (thermal boost, hybrid cores). The best warm runs reach 61 MIPS; a quiet server-class host with a fixed `performance` governor would likely clear 65, but acceptance should not depend on cherry-picking the host.
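The first two points can be put in numbers with an Amdahl-style estimate. This is a back-of-the-envelope sketch, not a measurement: it assumes the only difference between the two modes is the time removed from dispatch, and both helper names are hypothetical.

```cpp
#include <cassert>

// Fraction of the original runtime that must have been removed to yield
// the given speed-up, assuming all other work is unchanged: 1 - 1/s.
inline double removed_fraction(double speedup) {
    assert(speedup >= 1.0);
    return 1.0 - 1.0 / speedup;
}

// Speed-up still required to move from the current to the target throughput.
inline double required_speedup(double current_mips, double target_mips) {
    assert(current_mips > 0.0);
    return target_mips / current_mips;
}
```

Under that assumption, the 1.04× on fptest01 corresponds to roughly 3.8 % of switch-mode runtime removed, against roughly 33 % for the 1.50× on cpubound-mix. Going from ~55 to 65 MIPS would need a further ~1.18×, which is `required_speedup(55.0, 65.0)`.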
Correctness
ctest is 554/554 in both dispatch modes at 1530f99:
- LINCE_DISPATCH=switch: 554/554 (24 min).
- LINCE_DISPATCH=threaded: 554/554 (30 min).
This includes the full RTEMS sptest/smptest/fptest integration suites and
test_dispatch_equivalence (which compiles and exercises both dispatch
paths regardless of the build's mode).
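The idea behind a differential test of this kind can be sketched in miniature. The code below is not `test_dispatch_equivalence`; it is a hypothetical stand-in with an invented three-opcode ISA, two independently written dispatch paths, and a deterministic program generator. It only illustrates the technique: run the same program through both paths and compare the final state.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical miniature ISA; nothing here is the project's real code.
enum class Op : std::uint8_t { Add, Xor, Shl };
struct Insn { Op op; std::uint8_t rd, rs1, rs2; };

struct State {
    std::uint32_t r[4] = {1, 2, 3, 4};
    bool operator==(const State& o) const {
        for (int i = 0; i < 4; ++i)
            if (r[i] != o.r[i]) return false;
        return true;
    }
};

// Path A: one switch per instruction.
inline void run_switch(State& s, const std::vector<Insn>& prog) {
    for (const Insn& i : prog) {
        switch (i.op) {
            case Op::Add: s.r[i.rd] = s.r[i.rs1] + s.r[i.rs2]; break;
            case Op::Xor: s.r[i.rd] = s.r[i.rs1] ^ s.r[i.rs2]; break;
            case Op::Shl: s.r[i.rd] = s.r[i.rs1] << (s.r[i.rs2] & 31u); break;
        }
    }
}

// Path B: handler table indexed by opcode, written independently of path A.
using Handler = void (*)(State&, const Insn&);
inline void h_add(State& s, const Insn& i) { s.r[i.rd] = s.r[i.rs2] + s.r[i.rs1]; }
inline void h_xor(State& s, const Insn& i) { s.r[i.rd] = s.r[i.rs2] ^ s.r[i.rs1]; }
inline void h_shl(State& s, const Insn& i) { s.r[i.rd] = s.r[i.rs1] << (s.r[i.rs2] % 32u); }
inline void run_table(State& s, const std::vector<Insn>& prog) {
    static const Handler table[] = {&h_add, &h_xor, &h_shl};
    for (const Insn& i : prog) table[static_cast<std::size_t>(i.op)](s, i);
}

// Deterministic pseudo-random program from a seed (simple LCG).
inline std::vector<Insn> random_program(std::uint32_t seed, std::size_t len) {
    std::vector<Insn> prog;
    std::uint32_t x = seed;
    auto next = [&x](std::uint32_t bound) {
        x = x * 1664525u + 1013904223u;
        return static_cast<std::uint8_t>((x >> 24) % bound);
    };
    for (std::size_t n = 0; n < len; ++n) {
        const Op op = static_cast<Op>(next(3));
        const std::uint8_t rd = next(4);
        const std::uint8_t rs1 = next(4);
        const std::uint8_t rs2 = next(4);
        prog.push_back(Insn{op, rd, rs1, rs2});
    }
    return prog;
}

// Returns the first seed whose final states differ, or -1 if none do.
inline int first_divergence(int programs, std::size_t len) {
    for (int seed = 0; seed < programs; ++seed) {
        const std::vector<Insn> prog =
            random_program(static_cast<std::uint32_t>(seed), len);
        State a, b;
        run_switch(a, prog);
        run_table(b, prog);
        if (!(a == b)) return seed;
    }
    return -1;
}
```

Returning the failing seed rather than a bare boolean makes a divergence reproducible: the same seed regenerates the same program for debugging.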
Recommendation
Phase 10.2 is structurally complete and correct: threaded dispatch is
the default-capable mode, all hot kinds are threaded, the dispatcher is
de-duplicated against step(), and it delivers a solid 1.5× cpubound
win over the switch interpreter. The two numeric targets are missed for
structural reasons (FP is compute-bound; dispatch overhead is exhausted),
not for lack of remaining threading work. The residual ~55→65 MIPS push
belongs to Phase 11 (IR + block chaining), where block-level
translation can eliminate the per-instruction skeleton overhead the
threaded chain still pays.