Performance: `lince-bench` and the Phase 9 baseline

lince-bench is the reproducible measurement tool for Lince. It runs a fixed list of RTEMS workloads on a fresh Emulator (PacingMode::Turbo) and reports sustained MIPS, % real-time, per-slice host jitter, host CPU usage, and peak RSS. Every phase from Phase 10 onward must publish bench numbers before merge so that the project never optimises blindly; this is the cross-cutting rule from plans/post-mvp-1to1-roadmap.md.

This page is the Phase 9 interpreter baseline

The numbers below were captured before binary translation existed and measure the naive interpreter (~15 MIPS). They remain useful as the historical entry point for the phase-exit gate, but they are not today's performance: with translation = true (the default since the JIT landed) cpubound-mix reaches ~2000 MIPS single-core. For the current numbers and the tier/region breakdown, see IR and LLVM JIT. By default lince-bench now measures the translation path; pass --no-translate to measure the Switch interpreter instead.

Running it

cmake --build build --target lince_bench
./build/bench/lince-bench --list                # show default workloads
./build/bench/lince-bench --all --runs 5 \
    --json results.json                         # JSON + stdout table
./build/bench/lince-bench --workload fptest01   # one workload only

--runs N repeats every workload N times and returns the median (by MIPS). Use --runs 5 or more for a stable baseline on noisy hosts (a single short workload sees ~30 % drift run-to-run; a 5-repeat median brings that into single digits).
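A minimal sketch of what "median (by MIPS)" could look like for N repeats, assuming each repeat is a record with a `mips` field as in the JSON schema on this page. The helper name `median_run` and the lower-middle tie-break for an even N are assumptions for illustration, not lince-bench's documented behaviour.

```python
def median_run(runs):
    """Return the repeat whose MIPS is the median of all repeats.

    For an even number of repeats this picks the lower of the two middle
    entries (an assumption; the real tool may break the tie differently).
    """
    if not runs:
        raise ValueError("need at least one run")
    ordered = sorted(runs, key=lambda r: r["mips"])
    return ordered[(len(ordered) - 1) // 2]


# Five hypothetical repeats of one workload, with one noisy outlier.
repeats = [{"mips": m} for m in (14.8, 15.1, 10.9, 15.3, 15.0)]
print(median_run(repeats)["mips"])  # 15.0
```

The outlier at 10.9 MIPS does not move the reported value, which is why a median over five repeats is steadier than any single run.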

1:1 real-time status (GR740, the long-horizon goal)

The Phase-9+ goal is sustained 1:1 wall-clock real-time emulation of the GR740 (quad-core LEON4FT @ 250 MHz). Here 1:1 means at least one simulated second per wall-clock second. Measuring that honestly depends on the multi-core time model (ADR-005, plans/adr-005-multicore-time-model.md).

Measure multi-core 1:1 by %realtime only under TimeAdvance::Concurrent (the default). The legacy Sum fold advances the N-core clock N× too fast, inflating %realtime ~N× for multi-core (a 4-core run that did ¼ of a chip-second of work reported "4× realtime"). Concurrent (max-of-deltas) advances one shared timeline like the SIS oracle, so %realtime = sim_time / wall is the true ratio. Aggregate host-MIPS (insns / wall) is correct under both folds and is the unambiguous cross-check.
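The difference between the two folds can be sketched in a few lines. This is an illustration of the arithmetic only: the function names `fold_sum` and `fold_concurrent` are hypothetical and the per-core deltas are made-up numbers, not measurements.

```python
def fold_sum(core_deltas_ns):
    """Legacy fold: add every core's advance onto the shared clock."""
    return sum(core_deltas_ns)


def fold_concurrent(core_deltas_ns):
    """Concurrent fold: cores run in parallel, so the shared timeline
    advances by the largest per-core delta."""
    return max(core_deltas_ns)


def realtime_ratio(sim_time_ns, wall_ns):
    return sim_time_ns / wall_ns


# Hypothetical quad-core slice: each core simulates 250 ms while the
# host spends 1 s of wall-clock time.
deltas = [250_000_000] * 4
wall = 1_000_000_000
print(realtime_ratio(fold_sum(deltas), wall))         # 1.0  (inflated 4x)
print(realtime_ratio(fold_concurrent(deltas), wall))  # 0.25 (true ratio)
```

Instructions divided by wall time is unaffected by the choice of fold, which is why aggregate host-MIPS serves as the cross-check.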

Measured (2026-06-05, `governor=performance`, cpubound, `Concurrent` default)

Config	`%realtime`	host-MIPS	vs 1:1
GR740 MultiThread (4 cores → 4 host threads)	1.82×	~1840	past 1:1 ✅
GR740 SingleThread (4 cores → 1 host thread)	0.41×	~410	~2.4× short — structural (one host thread cannot drive a ~1000-MIPS quad)

On compute-bound load the GR740 reaches 1:1 in MultiThread — the only path that can, since MT scales ~2.8–4× over the cooperative round-robin. SingleThread is structurally short and is reserved for the SMP2-compatible cooperative model. The large gap the old inflated metric implied never existed — it was an artefact of the summed time model, now fixed. GR712RC (2 cores @ 80 MHz, ~160 MIPS demand) clears 1:1 even in SingleThread (~2× headroom).

Workloads

Name	Image	SoC	Stop	Required?
`boot-rtems-hello-world`	`tests/guest-programs/rtems/hello-world/hello-world.elf`	GR712RC	first UART byte	yes (committed)
`fptest01`	`tests/guest-programs/rtems/fptest01/fptest01.elf`	GR712RC	`*** END OF TEST` marker	yes (committed)
`sp24`	`tests/guest-programs/rtems/sptests/bin/sp24.exe`	GR712RC	RTEMS marker	no — built locally
`smp01-gr740-n4`	`tests/guest-programs/rtems/smptests-gr740-n4/bin/smp01.exe`	GR740	RTEMS marker	no — built locally

Optional workloads are gracefully skipped when their image is missing. Build them via tests/guest-programs/rtems/build_sptests.sh and build_smptests.sh. The two committed workloads are intentionally short so the bench can run end-to-end on a fresh checkout without host-side toolchains.
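The skip rule can be pictured as a small planning step before any workload runs. This is a sketch only: the names, image paths and Required? flags are copied from the table above, but `plan_workloads` is a hypothetical helper, and raising an error for a missing required image is an assumption about behaviour this page does not specify.

```python
import os

# (name, image path, required?) copied from the workload table.
WORKLOADS = [
    ("boot-rtems-hello-world",
     "tests/guest-programs/rtems/hello-world/hello-world.elf", True),
    ("fptest01",
     "tests/guest-programs/rtems/fptest01/fptest01.elf", True),
    ("sp24",
     "tests/guest-programs/rtems/sptests/bin/sp24.exe", False),
    ("smp01-gr740-n4",
     "tests/guest-programs/rtems/smptests-gr740-n4/bin/smp01.exe", False),
]


def plan_workloads(workloads, exists=os.path.isfile):
    """Split workloads into those that can run and those to skip."""
    runnable, skipped = [], []
    for name, image, required in workloads:
        if exists(image):
            runnable.append(name)
        elif required:
            # Assumption: a missing committed image is treated as an error.
            raise FileNotFoundError(image)
        else:
            skipped.append(name)
    return runnable, skipped


# Simulate a fresh checkout where only the committed .elf images exist.
print(plan_workloads(WORKLOADS, exists=lambda path: path.endswith(".elf")))
```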

JSON schema

{
  "lince_bench_version": "<lince version string>",
  "host": {
    "hardware_concurrency": 12,
    "cmdline": "<full argv>"
  },
  "results": [
    {
      "name": "<workload name>",
      "outcome": "REACHED|TIMEOUT|ERROR_MODE|SETUP_FAILED|SKIPPED",
      "note": "<free-form reason for non-REACHED outcomes>",
      "sim_time_ns": 0,
      "host_time_ns": 0,            // total wall: setup + run + teardown
      "run_host_time_ns": 0,        // sum of `run_for` slice durations
      "instructions_executed": 0,
      "mips": 0.0,                  // = insns / run_host_time_ns × 1000 (insns per ns → millions per s)
      "realtime_ratio": 0.0,        // = sim_time / run_host_time_ns
      "host_cpu_pct": 0.0,          // (utime+stime) / host_time, via getrusage
      "peak_rss_bytes": 0,          // getrusage ru_maxrss (normalised)
      "slice_jitter_ns": {
        "samples": 0,               // one per `run_for` call
        "p50": 0, "p95": 0, "p99": 0, "p999": 0, "p9999": 0
      }
    }
  ]
}

mips and realtime_ratio deliberately exclude Emulator setup (Emulator::create, load_elf, initialize) so that the metric reflects hot-loop performance — what later phases will actually move. host_time_ns includes setup so an external observer can still tell how long the bench process ran on the host.
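As a sketch of how the two derived fields relate to the raw counters, the snippet below recomputes them from one `results[]` entry. The field names come from the schema above; the helper `derive_metrics` and the sample values are hypothetical, and the factor of 1000 is only the unit conversion from instructions per nanosecond to millions of instructions per second.

```python
import json


def derive_metrics(result):
    """Recompute the derived fields of one `results[]` entry."""
    run_ns = result["run_host_time_ns"]
    if run_ns <= 0:
        return {"mips": 0.0, "realtime_ratio": 0.0}
    return {
        # instructions per nanosecond x 1000 = millions per second
        "mips": result["instructions_executed"] * 1e3 / run_ns,
        "realtime_ratio": result["sim_time_ns"] / run_ns,
    }


# Made-up entry: 15 M instructions and 0.6 s of sim time in 1 s of run time.
sample = json.loads("""
{"name": "example", "sim_time_ns": 600000000,
 "host_time_ns": 1200000000, "run_host_time_ns": 1000000000,
 "instructions_executed": 15000000}
""")
print(derive_metrics(sample))  # {'mips': 15.0, 'realtime_ratio': 0.6}
```

Note that `host_time_ns` (1.2 s in the sample) plays no part in either value, so setup cost cannot dilute the hot-loop figures.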

slice_jitter_ns measures per-run_for host wall duration. It is not inter-IRQ latency — that is targeted for Phase 9.3 observability. Slice jitter is what controls 1:1 pacing precision in Phase 14.
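To make the percentile fields concrete, here is a sketch that builds a `slice_jitter_ns`-shaped record from a list of per-slice durations. It assumes the nearest-rank percentile method, which this page does not confirm; the helper names and the sample durations are hypothetical.

```python
def nearest_rank(sorted_ns, basis_points):
    """Nearest-rank percentile; basis_points is the percentile x 100
    (9990 means p99.9). Integer maths avoids float rounding surprises."""
    n = len(sorted_ns)
    rank = max(1, (basis_points * n + 9999) // 10000)  # ceil(bp * n / 10000)
    return sorted_ns[rank - 1]


def slice_jitter(durations_ns):
    """Summarise per-slice host durations into the schema's percentile keys."""
    ordered = sorted(durations_ns)
    out = {"samples": len(ordered)}
    for key, bp in (("p50", 5000), ("p95", 9500), ("p99", 9900),
                    ("p999", 9990), ("p9999", 9999)):
        out[key] = nearest_rank(ordered, bp) if ordered else 0
    return out


# 100 hypothetical slices: 99 steady ones and a single 9 ms stall.
durations = [1_600_000] * 99 + [9_000_000]
print(slice_jitter(durations))
```

In this made-up run p50 through p99 all read 1.6 ms and only p999 and p9999 expose the 9 ms stall, which is why the schema carries the deep tail percentiles.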

Phase 9 baseline (host: Fedora Linux on x86-64 laptop, 12 hyperthreads)

Captured 2026-05-15 on main at commit fb2bd58. Single-thread naive interpreter, no decode cache, no JIT. --runs 5.

Workload	MIPS	% real-time	host CPU	RSS MiB	p50 slice (µs)	p99 slice (µs)
`boot-rtems-hello-world`	~15	~60 %	~99 %	~84	~1600	~1740
`fptest01`	~15	~60 %	~99 %	~84	~1610	~1725

Observed run-to-run drift: ~7–9 % on MIPS across 5 invocations of --all --runs 5. This is wider than the post-mvp-1to1-roadmap acceptance target of < 5 % and reflects measurement on a busy laptop (other workloads competing for the CPU). On a dedicated CI runner the drift is expected to drop into the 1–3 % range. Tighten this number once Phase 9.4 lands the CI bench runner; do not chase < 5 % on developer machines.

How to interpret the baseline

~15 MIPS, not the ~50 MIPS that CLAUDE.md describes as the performance target. The CLAUDE.md figure is the aspirational "real LEON3 speed at 50 MHz" goal. Today the interpreter is ~3× slower than the simulated CPU on this host. Phase 10 (decode cache + threaded code) targets a 3–5× lift, after which we expect to clear the 50 MIPS line. Hot-paths analysis (Phase 9.4) will quantify where the gap lives before any optimisation work starts.
~60 % real-time is higher than MIPS / sim_MHz = 15/50 = 30 % because the workloads spend non-trivial sim-time in CPU-idle states ("idle time skipping" in CLAUDE.md's Timing Model section). The emulator advances sim-time without executing instructions when all cores are halted waiting for IRQs.
Slice p99 ≈ p50: jitter is very flat on these short workloads (no allocator activity after warmup, no LLVM compile pauses yet). This will get more interesting in Phase 12.
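The gap between 30 % and ~60 % can be worked through numerically. The sketch below assumes one simulated cycle per instruction (the same assumption behind 15/50 = 30 %); the function names are hypothetical.

```python
def executing_realtime_pct(mips, clock_mhz, cpi=1.0):
    """% real-time if sim-time advanced only through executed instructions."""
    return 100.0 * mips * cpi / clock_mhz


def idle_share_of_sim_time(observed_pct, executing_pct):
    """Fraction of simulated time that advanced without executing instructions."""
    return 1.0 - executing_pct / observed_pct


executing = executing_realtime_pct(15, 50)
print(executing)                                # 30.0
print(idle_share_of_sim_time(60.0, executing))  # 0.5
```

Under that assumption, roughly half of the simulated time in these two workloads would be idle time that the emulator skipped rather than executed.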

Reproducing the baseline

git clean -fdx build/
cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build build --target lince_bench
./build/bench/lince-bench --all --runs 5 --json baseline.json

Compare against the table above. Differences > 20 % on MIPS warrant investigation; differences < 20 % are likely measurement noise on shared host hardware.
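The 20 % rule is easy to apply mechanically. A sketch, assuming the MIPS values are read from the baseline table and from the `mips` field of a fresh run; both helper names are hypothetical.

```python
def mips_delta_pct(baseline_mips, measured_mips):
    """Signed difference of a new measurement relative to the baseline."""
    return 100.0 * (measured_mips - baseline_mips) / baseline_mips


def needs_investigation(baseline_mips, measured_mips, threshold_pct=20.0):
    """True when the difference exceeds the noise band in either direction."""
    return abs(mips_delta_pct(baseline_mips, measured_mips)) > threshold_pct


print(needs_investigation(15.0, 13.5))  # False: -10 % is within noise
print(needs_investigation(15.0, 11.0))  # True: about -27 %
```

The check is symmetric on purpose: an unexpected speed-up of more than 20 % deserves the same scrutiny as a slowdown.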

Phase exit gate

plans/post-mvp-1to1-roadmap.md requires every phase from 10 onward to publish bench output at phase entry and phase exit, and to demonstrate testsuite pass rate ≥ entry rate. Phase 9 itself landed lince-bench; the baseline above is the entry point for that gate.

Hot-path analysis

Phase 9.4 produced hot-paths.md, the per-workload profile-derived view of where the interpreter spends host CPU. The Phase 10 attack-plan section (§ 7) of that document supersedes the informal "Phase 10 (decode cache + threaded code) targets a 3–5× lift" hand-wave above: the data shows bus dispatch is the larger target than decode, and the recommended ordering is bus fetch fast-path → PC-indexed decode cache → FP-class gate removal.

Phase 13/14: MultiThread and the 1:1 GR740 status

Phase 13 added ExecutionMode::MultiThread (thread-per-core, each simulated core on its own host thread); Phase 14 measures it against the 1:1 GR740 goal. New lince-bench knobs: --mt, --cores N, --quantum-batch N, --max-sim-ms N (apply to --image).

Host caveat. The Phase 13/14 measurements below were captured on a powersave-governor host, so they are a floor. The tier-1 acceptance host (ADR-003: ≥4 physical cores, AVX2, performance governor) is faster. Treat the absolute MIPS as conservative.

Reproduce

# Single-core realistic (Dhrystone), % real-time vs 250 MHz:
lince-bench --image dhrystone.elf --soc gr712rc --clock-mhz 250 --runs 3
# Four-core sustained compute, SingleThread vs MultiThread + batching:
lince-bench --image cpubound60-gr740.elf --soc gr740 --clock-mhz 250 \
            --max-sim-ms 600 --runs 3                            # ST
lince-bench --image cpubound60-gr740.elf --soc gr740 --clock-mhz 250 \
            --max-sim-ms 600 --runs 3 --mt --quantum-batch 32    # MT

Measured (powersave floor, 250 MHz target clock)

Workload	Mode	Aggregate MIPS	realtime_ratio
Dhrystone (1 core, realistic)	JIT	~219	~1.75 (cpi 2)
cpubound60 (4 cores, integer)	ST round-robin	~304	~1.22
cpubound60 (4 cores)	MT, batch 1	~538	~2.15
cpubound60 (4 cores)	MT, batch 32	~666	~2.67

realtime_ratio > 1.0 ⇒ the emulator sustains faster than real time on that workload (1:1 is then achievable by pacing down).
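"Pacing down" can be sketched as idling the host for whatever part of a slice the emulator finished early. This illustrates the idea only and is not Lince's pacing implementation; `pacing_sleep_ns` is a hypothetical helper.

```python
def pacing_sleep_ns(sim_slice_ns, host_slice_ns):
    """Host time to idle after a slice so sim time does not outrun wall time.

    Returns 0 when the slice already took longer than the time it simulated:
    a slow emulator cannot be paced up, only a fast one paced down.
    """
    return max(0, sim_slice_ns - host_slice_ns)


# A slice simulating 1 ms that ran in ~0.571 ms (realtime_ratio ~1.75).
print(pacing_sleep_ns(1_000_000, 571_428))    # 428572
# The same slice at realtime_ratio ~0.41: nothing to sleep off.
print(pacing_sleep_ns(1_000_000, 2_439_024))  # 0
```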

The 1:1 status and the CPI caveat

The honest, cpi-independent number is the raw host throughput: ~219 MIPS per core on realistic (Dhrystone) code — roughly 4× the pre-Phase-12 figure. Whether that is 1:1 depends on the real GR740's average cycles-per-instruction, which the sim-clock models via EmulatorConfig::cpi (longer term, the P10 bucket-resync model):

Assumed real GR740 cpi	insns/s per core for 1:1	per-core ratio @ 219 MIPS
1.0 (optimistic, no stalls)	250 M	0.87× (just short)
1.5	167 M	1.31×
2.0 (realistic SPARC, cache+FP stalls)	125 M	1.75×

So the per-core JIT is at or near 1:1 on realistic code, not ~16× short as the older numbers implied. The dominant remaining variable is CPI-model fidelity (is the sim-clock honest?), which is orthogonal to JIT codegen speed.
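The CPI table follows from two one-line formulas, sketched here so the rows can be recomputed for other clock rates or throughput figures. The function names are hypothetical; the inputs (250 MHz, ~219 MIPS) are the ones used on this page.

```python
def required_mips_per_core(clock_mhz, cpi):
    """Instructions per second (in millions) one core must retire for 1:1."""
    return clock_mhz / cpi


def per_core_ratio(host_mips, clock_mhz, cpi):
    """Measured host throughput relative to the 1:1 requirement."""
    return host_mips * cpi / clock_mhz


for cpi in (1.0, 1.5, 2.0):
    print(cpi,
          round(required_mips_per_core(250, cpi)),
          round(per_core_ratio(219, 250, cpi), 3))
```

The ratio scales linearly with the assumed CPI, so pinning the real GR740 CPI matters as much as any further JIT speed-up.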

Phase-13/14 findings

MT scales sub-linearly: MT(N=4)/ST ≈ 2.1× on cpubound. Losses are barrier-per-quantum overhead (mitigated by --quantum-batch: +24% at batch 32, P14-3), shared-RAM bandwidth, and warmup.
Tier-2 under MT (P14-2) recovered +32% over baseline-only MT — per-core O2 matters more than raw parallelism.
Batch granularity is a trade-off: large batches help CPU-bound throughput but coarsen inter-processor IRQ delivery, so the default stays 1 (determinism-exact). Tune per workload.

Toward the formal P14 acceptance (not yet done)

Re-measure on the tier-1 host (performance governor) for the true ceiling.
Broaden the workload set (FP-, IRQ/IO-, syscall-heavy) — acceptance is "1:1 on ≥⅔ bench workloads".
CPI-model honesty (P10) — pin the real-GR740 cpi so the ratio is defensible.
P99.9 jitter bounding (LLVM module GC, mmap, page faults) + pacing precision.

Phase 14 per-core optimization wins (2026-06): current state

Landed on main 2026-06-05. Three new bit-exact per-core wins (rows 1–3 in the table below) were found by profiling, and each was confirmed by a controlled same-host back-to-back A/B with governor=performance. Together they roughly double uniprocessor Dhrystone throughput. The full ctest suite stays green (709 cases pass, 0 fail, across uniprocessor + SMP N=2/N=4, Switch + JIT + MultiThread). These numbers supersede the ~219 MIPS Phase-13/14 figures above as the current single-core state; the older numbers remain as the pre-win baseline.

Headline: with all four wins below on main (lever-A dispatch anchor landed 2026-06-05 alongside the three new wins), Dhrystone N=1 measures ~295 MIPS (governor=performance) — roughly 2× the pre-Phase-12 per-core baseline.

#	Win	Where	Dhrystone N=1 delta	Other effects
0	Dispatch-cache anchor (lever A)	`IrBlock::jit_entry`; `TieredJit::*_anchored`; `Emulator::run_ir_quantum`	+19.5% (238 → 285 MIPS)	skips the per-dispatch `unordered_map` hash
1	IR `BlockCache` 1024 → 8192 slots	`src/ir/include/lince/ir/block_cache.hpp`	+39% (153 → 212 MIPS)	p99 slice jitter 28.7 ms → 8.9 ms; cpubound flat
2	Event-bounded single-core quantum	`Emulator::run_core_quantum` (`src/runtime/src/emulator.cpp`)	+40% (212 → 298 MIPS)	sub-quantum-precise event firing (also more accurate)
3	Interrupt-poll fast-path (`raw_pending()`)	`iinterrupt_controller.hpp`; IrqMP/IrqAMP; `Emulator::sample_interrupts`	+2.7% (301.9 → 310.2 MIPS, controlled A/B)	boot/call-heavy benefit similarly; cpubound flat

Dispatch-cache anchor (lever A). For an already-compiled (pc, mode) the TieredJit CacheEntry node address is stable, so the dispatcher stashes it in IrBlock::jit_entry on first compile and reuses it (*_anchored overloads) to resolve the best fn and count executions without the per-dispatch unordered_map hash that was the #1 hotspot on call-heavy code. Bit-exact by construction (the anchor is read only after a validated find).
BlockCache 1024 → 8192 slots. The cache is direct-mapped, indexed by (pc >> 2) & (Size - 1). At 1024 slots the index covered only a 4 KiB PC window, so on a call-heavy guest the hot code and its libc aliased and evicted each other. On Dhrystone, libc strcmp/strcpy hash to indices inside the Proc_/Func_ block span, forcing constant re-translation (a steady-state profile put translate_block at ~9–16% of wall time on a loop that should translate each block once). 8192 slots (a 32 KiB window) removes the aliasing. Bit-exact: a larger cache only ever produces fewer evictions, never different results. 16384 and 32768 slots measured no further gain, so 8192 is the size/locality sweet spot.
Event-bounded single-core quantum. A true single core (cores_.size() == 1) has no round-robin interleaving to preserve, so the scheduling quantum (default 1000) only throttles throughput by paying per-round overhead (sync_global_up_counter, the interrupt scan, the tail-step, run_*_quantum setup) every quantum. It now runs a large burst (cap 1 << 16 = 65536) but never past the next scheduled event, so timer/peripheral IRQs still fire at their exact simulated time — in fact more precisely than before, where events were rounded up to the round boundary (up to a full quantum late). Gated strictly to cores == 1, so multi-core round-robin interleaving and its sim-time accounting are byte-unchanged (SMP suites unaffected). Bit-exact between the JIT and the Switch oracle (both honour the same bound); N=1 RTEMS sptests were identical in a controlled same-host back-to-back run.
Interrupt-poll fast-path via IInterruptController::raw_pending(). The dispatcher polls the interrupt controller at every block boundary (poll_self_interrupt → sample_interrupts → pending_mask), but with a 100 Hz tick the controller is empty ~99.99% of the time; each poll paid a virtual call + a 5-load per-CPU scan + the EIRQ-redirect branch only to compute 0 (a profile put pending_mask at ~5.8% of Dhrystone wall time). raw_pending() is a maintained single-word superset (pending_ | ifr0_ | OR over iforce_[cpu]) refreshed at the 7 raw-source mutation sites in both controllers; it is one load, and a 0 result proves pending_mask(cpu) == 0 for every cpu, so sample_interrupts early-outs before the full scan. Provably bit-exact — the early-out fires only when the full scan would have returned 0 anyway. The reader early-out is gated to SingleThread (the maintained word can lag a concurrent assert under MultiThread, so MT keeps the full pending_mask — no MT behavior change).
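Three of the mechanisms above reduce to very small pieces of arithmetic, sketched here in Python for illustration. The real code is C++ in the files named in the table. The function names, the two example addresses, and the choice to recompute the `raw_pending` word on each call (the real controllers maintain it incrementally at the mutation sites) are all assumptions of this sketch.

```python
def block_cache_index(pc, size):
    """Direct-mapped slot for a 4-byte-aligned pc; size must be a power of two."""
    return (pc >> 2) & (size - 1)


def burst_length(steps_to_next_event, cap=1 << 16):
    """Single-core burst: up to the cap, but never past the next event."""
    return min(cap, steps_to_next_event)


def raw_pending(pending, ifr0, iforce):
    """OR of every raw interrupt source; zero proves no CPU has anything pending."""
    word = pending | ifr0
    for per_cpu_force in iforce:
        word |= per_cpu_force
    return word


# Two hypothetical block addresses exactly 4 KiB apart.
hot_loop, libc_fn = 0x40001000, 0x40002000
print(block_cache_index(hot_loop, 1024) == block_cache_index(libc_fn, 1024))  # True: they alias
print(block_cache_index(hot_loop, 8192) == block_cache_index(libc_fn, 8192))  # False: distinct slots
print(burst_length(1_000_000), burst_length(1234))  # 65536 1234
print(raw_pending(0, 0, [0, 0, 0, 0]))  # 0, so the full per-CPU scan can be skipped
```

With 1024 slots any two blocks whose addresses differ by a multiple of 4 KiB share a slot; with 8192 slots the collision distance grows to 32 KiB, which is the window described above.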

Rejected levers (measured net-negative; do not re-attempt)

Each was caught by a controlled same-host A/B and kept default-OFF (or not merged):

Cross-region JIT block linking ("lever B") — a trampoline running an already-compiled successor region in native code instead of returning to the dispatcher. Measured ~9% slower on Dhrystone: the dispatch residual was already lean, and the trampoline's per-hop cross-library self-IPI poll + recursion + per-exit fan-out cost more than it saved. Kept default-OFF on a feature branch, not merged.
MT host-thread affinity pinning — net-negative on a shared host (~6% lower MIPS and 4–6× worse P99.9 jitter): the per-round barrier makes every core wait for the slowest, and a pinned thread on a transiently contended host core can't migrate away. Only useful with host core isolation (isolcpus); kept opt-in, default OFF.
Bigger JIT region fusion (jit_max_region_blocks 8 → 16/32/64) — −8% on Dhrystone: larger fused regions overshoot the quantum and tail-step through the slow Switch path. 8 stays the sweet spot.

Method note

All wins were found by profiling and measurement, and every dead end was caught by a controlled same-host back-to-back A/B. The committed result CSVs (tests/results/*.csv) are thermal-stale captures from other hosts and have produced phantom regressions multiple times, so perf comparisons must be same-host, back-to-back, with governor=performance.
