Performance — lince-bench and the Phase 9 baseline¶
lince-bench is the reproducible measurement tool for Lince. It runs a
fixed list of RTEMS workloads on a fresh Emulator (PacingMode::Turbo)
and reports sustained MIPS, % real-time, per-slice host jitter, host CPU
usage, and peak RSS. Every Phase from 10 onward must publish bench
numbers before merge so the project never optimises blindly — this is
the cross-cutting rule from plans/post-mvp-1to1-roadmap.md.
This page is the Phase 9 interpreter baseline
The numbers below were captured before binary translation existed and
measure the naive interpreter (~15 MIPS). They remain useful as the
historical entry point for the phase-exit gate, but they are not
today's performance: with translation = true (the default since the
JIT landed) cpubound-mix reaches ~2000 MIPS single-core. For the
current numbers and the tier/region breakdown, see
IR and LLVM JIT. By default
lince-bench now measures the translation path; pass --no-translate
to measure the Switch interpreter instead.
Running it¶
cmake --build build --target lince_bench
./build/bench/lince-bench --list # show default workloads
./build/bench/lince-bench --all --runs 5 \
--json results.json # JSON + stdout table
./build/bench/lince-bench --workload fptest01 # one workload only
--runs N repeats every workload N times and returns the median (by
MIPS). Use --runs 5 or more for a stable baseline on noisy hosts (a
single short workload sees ~30 % drift run-to-run; a 5-repeat median
brings that into single digits).
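As a rough illustration of "median by MIPS", the selection can be sketched as below. `RunResult` and `median_by_mips` are hypothetical names, not the lince-bench internals, and the tie-break for an even run count (lower-middle run) is an assumption; the source does not say how even counts are handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-run record; the real lince-bench result type is not shown here.
struct RunResult {
    double mips;
    double realtime_ratio;
};

// Returns the run whose MIPS value is the median of the set.
// Assumption: for an even count the lower-middle run is returned, so the
// result is always a run that was actually measured, never an average.
inline RunResult median_by_mips(std::vector<RunResult> runs) {
    assert(!runs.empty());
    const std::ptrdiff_t mid = static_cast<std::ptrdiff_t>((runs.size() - 1) / 2);
    // Partial sort: only the element at `mid` needs to be in its final place.
    std::nth_element(runs.begin(), runs.begin() + mid, runs.end(),
                     [](const RunResult& a, const RunResult& b) {
                         return a.mips < b.mips;
                     });
    return runs[static_cast<std::size_t>(mid)];
}
```

Returning a whole run rather than a per-field median keeps the reported MIPS and % real-time mutually consistent, because both come from the same invocation.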
1:1 real-time status (GR740 — the long-horizon goal)¶
The Phase-9+ goal is sustained 1:1 wall-clock real-time emulation of the
GR740 (quad-core LEON4FT @ 250 MHz). 1:1 = simulated seconds per wall-clock
second ≥ 1, measured honestly — which depends on the multi-core time model
(ADR-005, plans/adr-005-multicore-time-model.md).
Measure multi-core 1:1 by %realtime only under TimeAdvance::Concurrent
(the default). The legacy Sum fold advances the N-core clock N× too fast,
inflating %realtime ~N× for multi-core (a 4-core run that did ¼ of a
chip-second of work reported "4× realtime"). Concurrent (max-of-deltas)
advances one shared timeline like the SIS oracle, so %realtime = sim_time / wall
is the true ratio. Aggregate host-MIPS (insns / wall) is correct under both
folds and is the unambiguous cross-check.
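The difference between the two folds can be sketched with a few lines of arithmetic. This is a minimal illustration, not the Lince time model: `fold_sum`, `fold_concurrent`, and `CoreDeltasNs` are hypothetical names, and the per-core deltas are assumed to be simulated nanoseconds advanced in one scheduling round.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Per-core simulated-time advance (ns) over one scheduling round (assumed unit).
using CoreDeltasNs = std::vector<std::uint64_t>;

// Legacy "Sum" fold: every core's delta is added to the shared clock, so
// N cores that ran side by side advance it roughly N times too fast.
inline std::uint64_t fold_sum(const CoreDeltasNs& deltas) {
    return std::accumulate(deltas.begin(), deltas.end(), std::uint64_t{0});
}

// "Concurrent" fold (max-of-deltas): the shared timeline advances by the
// longest single-core delta, because the cores ran in parallel.
inline std::uint64_t fold_concurrent(const CoreDeltasNs& deltas) {
    assert(!deltas.empty());
    return *std::max_element(deltas.begin(), deltas.end());
}

// %realtime as defined in the text: sim_time / wall.
inline double realtime_ratio(std::uint64_t sim_ns, std::uint64_t wall_ns) {
    assert(wall_ns > 0);
    return static_cast<double>(sim_ns) / static_cast<double>(wall_ns);
}
```

With four cores each advancing 1 ms in a round that took 2 ms of wall time, the Sum fold yields a ratio of 2.0 while the Concurrent fold yields 0.5: the same work, reported 4× apart.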
Measured (2026-06-05, governor=performance, cpubound, Concurrent default)¶
| Config | %realtime | host-MIPS | vs 1:1 |
|---|---|---|---|
| GR740 MultiThread (4 cores → 4 host threads) | 1.82× | ~1840 | past 1:1 ✅ |
| GR740 SingleThread (4 cores → 1 host thread) | 0.41× | ~410 | ~2.4× short — structural (one host thread cannot drive a ~1000-MIPS quad) |
On compute-bound load the GR740 reaches 1:1 in MultiThread — the only path that can, since MT scales ~2.8–4× over the cooperative round-robin. SingleThread is structurally short and is reserved for the SMP2-compatible cooperative model. The large gap the old inflated metric implied never existed — it was an artefact of the summed time model, now fixed. GR712RC (2 cores @ 80 MHz, ~160 MIPS demand) clears 1:1 even in SingleThread (~2× headroom).
Workloads¶
| Name | Image | SoC | Stop | Required? |
|---|---|---|---|---|
| boot-rtems-hello-world | tests/guest-programs/rtems/hello-world/hello-world.elf | GR712RC | first UART byte | yes (committed) |
| fptest01 | tests/guest-programs/rtems/fptest01/fptest01.elf | GR712RC | *** END OF TEST marker | yes (committed) |
| sp24 | tests/guest-programs/rtems/sptests/bin/sp24.exe | GR712RC | RTEMS marker | no (built locally) |
| smp01-gr740-n4 | tests/guest-programs/rtems/smptests-gr740-n4/bin/smp01.exe | GR740 | RTEMS marker | no (built locally) |
Optional workloads are gracefully skipped when their image is missing.
Build them via tests/guest-programs/rtems/build_sptests.sh and
build_smptests.sh. The two committed workloads are intentionally
short so the bench can run end-to-end on a fresh checkout without
host-side toolchains.
JSON schema¶
{
"lince_bench_version": "<lince version string>",
"host": {
"hardware_concurrency": 12,
"cmdline": "<full argv>"
},
"results": [
{
"name": "<workload name>",
"outcome": "REACHED|TIMEOUT|ERROR_MODE|SETUP_FAILED|SKIPPED",
"note": "<free-form reason for non-REACHED outcomes>",
"sim_time_ns": 0,
"host_time_ns": 0, // total wall: setup + run + teardown
"run_host_time_ns": 0, // sum of `run_for` slice durations
"instructions_executed": 0,
"mips": 0.0, // = insns / run_host_time_ns
"realtime_ratio": 0.0, // = sim_time / run_host_time_ns
"host_cpu_pct": 0.0, // (utime+stime) / host_time, via getrusage
"peak_rss_bytes": 0, // getrusage ru_maxrss (normalised)
"slice_jitter_ns": {
"samples": 0, // one per `run_for` call
"p50": 0, "p95": 0, "p99": 0, "p999": 0, "p9999": 0
}
}
]
}
mips and realtime_ratio deliberately exclude Emulator setup
(Emulator::create, load_elf, initialize) so that the metric
reflects hot-loop performance — what later phases will actually move.
host_time_ns includes setup so an external observer can still tell
how long the bench process ran on the host.
slice_jitter_ns measures per-run_for host wall duration. It is
not inter-IRQ latency — that is targeted for Phase 9.3
observability. Slice jitter is what controls 1:1 pacing precision in
Phase 14.
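The two derived fields follow directly from the raw counters in the schema. A minimal sketch, assuming the field names above map one-to-one onto a plain struct (`RawCounters` and the two helper functions are hypothetical, not the lince-bench source):

```cpp
#include <cassert>
#include <cstdint>

// Raw counters, named after the JSON fields in the schema above.
struct RawCounters {
    std::uint64_t sim_time_ns;
    std::uint64_t run_host_time_ns;        // sum of run_for slice durations
    std::uint64_t instructions_executed;
};

// MIPS = millions of instructions per second of hot-loop wall time.
// insns / (ns * 1e-9) / 1e6 simplifies to insns * 1e3 / ns.
inline double mips(const RawCounters& c) {
    assert(c.run_host_time_ns > 0);
    return static_cast<double>(c.instructions_executed) * 1e3 /
           static_cast<double>(c.run_host_time_ns);
}

// realtime_ratio = simulated time / hot-loop wall time (setup excluded).
inline double realtime_ratio(const RawCounters& c) {
    assert(c.run_host_time_ns > 0);
    return static_cast<double>(c.sim_time_ns) /
           static_cast<double>(c.run_host_time_ns);
}
```

For example, 15,000,000 instructions and 0.6 s of simulated time over 1 s of `run_for` wall time give 15 MIPS and a ratio of 0.6. These inputs are illustrative values, not measured data.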
Phase 9 baseline (host: Fedora Linux on x86-64 laptop, 12 hyperthreads)¶
Captured 2026-05-15 on main at commit fb2bd58. Single-thread
naive interpreter, no decode cache, no JIT. --runs 5.
| Workload | MIPS | % real-time | host CPU | RSS MiB | p50 slice (µs) | p99 slice (µs) |
|---|---|---|---|---|---|---|
| boot-rtems-hello-world | ~15 | ~60 % | ~99 % | ~84 | ~1600 | ~1740 |
| fptest01 | ~15 | ~60 % | ~99 % | ~84 | ~1610 | ~1725 |
Observed run-to-run drift: ~7–9 % on MIPS across 5 invocations of
--all --runs 5. This is wider than the post-mvp-1to1-roadmap
acceptance target of < 5 % and reflects measurement on a busy
laptop (other workloads competing for the CPU). On a dedicated CI
runner the drift is expected to drop into the 1–3 % range. Tighten
this number once Phase 9.4 lands the CI bench runner; do not chase
< 5 % on developer machines.
How to interpret the baseline¶
- ~15 MIPS, not the ~50 MIPS that CLAUDE.md describes as the performance target. The CLAUDE.md figure is the aspirational "real LEON3 speed at 50 MHz" goal. Today the interpreter is ~3× slower than the simulated CPU on this host. Phase 10 (decode cache + threaded code) targets a 3–5× lift, after which we expect to clear the 50 MIPS line. Hot-paths analysis (Phase 9.4) will quantify where the gap lives before any optimisation work starts.
- ~60 % real-time is higher than MIPS / sim_MHz = 15/50 = 30 % because the workloads spend non-trivial sim-time in CPU-idle states ("idle time skipping" in CLAUDE.md's Timing Model section). The emulator advances sim-time without executing instructions when all cores are halted waiting for IRQs.
- Slice p99 ≈ p50: jitter is very flat on these short workloads (no allocator activity after warmup, no LLVM compile pauses yet). This will get more interesting in Phase 12.
Reproducing the baseline¶
git clean -fdx build/
cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build build --target lince_bench
./build/bench/lince-bench --all --runs 5 --json baseline.json
Compare against the table above. Differences > 20 % on MIPS warrant investigation; differences < 20 % are likely measurement noise on shared host hardware.
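The 20 % rule of thumb can be written down as a small check. This is a sketch only: `relative_diff` and `warrants_investigation` are hypothetical helpers, not part of lince-bench, and the comparison is assumed to be symmetric (a large speed-up is flagged as well as a slowdown).

```cpp
#include <cassert>
#include <cmath>

// Relative MIPS difference of a fresh measurement versus the baseline,
// expressed as a fraction (0.20 means 20 %).
inline double relative_diff(double baseline_mips, double measured_mips) {
    assert(baseline_mips > 0.0);
    return std::fabs(measured_mips - baseline_mips) / baseline_mips;
}

// Applies the 20 % threshold from the text above; anything at or below it
// is treated as measurement noise on shared host hardware.
inline bool warrants_investigation(double baseline_mips, double measured_mips,
                                   double threshold = 0.20) {
    return relative_diff(baseline_mips, measured_mips) > threshold;
}
```

Against the ~15 MIPS baseline, a run at 14 MIPS (about 7 % off) would pass as noise, while a run at 11 MIPS (about 27 % off) would be flagged.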
Phase exit gate¶
plans/post-mvp-1to1-roadmap.md requires every phase from 10
onward to publish bench output at phase entry and phase exit, and
to demonstrate testsuite pass rate ≥ entry rate. Phase 9 itself
landed lince-bench; the baseline above is the entry point for
that gate.
Hot-path analysis¶
Phase 9.4 produced hot-paths.md, the per-workload
profile-derived view of where the interpreter spends host CPU. The
Phase 10 attack-plan section (§ 7) of that document supersedes the
informal "Phase 10 (decode cache + threaded code) targets a 3–5×
lift" hand-wave above: the data shows bus dispatch is the larger
target than decode, and the recommended ordering is bus fetch
fast-path → PC-indexed decode cache → FP-class gate removal.
Phase 13/14 — MultiThread and the 1:1 GR740 status¶
Phase 13 added ExecutionMode::MultiThread (thread-per-core, each simulated
core on its own host thread); Phase 14 measures it against the 1:1 GR740
goal. New lince-bench knobs: --mt, --cores N, --quantum-batch N,
--max-sim-ms N (apply to --image).
Host caveat. Captured on a powersave-governor host, so these figures are a floor. The tier-1 acceptance host (ADR-003: ≥4 physical cores, AVX2, performance governor) is faster. Treat the absolute MIPS as conservative.
Reproduce¶
# Single-core realistic (Dhrystone), % real-time vs 250 MHz:
lince-bench --image dhrystone.elf --soc gr712rc --clock-mhz 250 --runs 3
# Four-core sustained compute, SingleThread vs MultiThread + batching:
lince-bench --image cpubound60-gr740.elf --soc gr740 --clock-mhz 250 \
--max-sim-ms 600 --runs 3 # ST
lince-bench --image cpubound60-gr740.elf --soc gr740 --clock-mhz 250 \
--max-sim-ms 600 --runs 3 --mt --quantum-batch 32 # MT
Measured (powersave floor, 250 MHz target clock)¶
| Workload | Mode | Aggregate MIPS | realtime_ratio |
|---|---|---|---|
| Dhrystone (1 core, realistic) | JIT | ~219 | ~1.75 (cpi 2) |
| cpubound60 (4 cores, integer) | ST round-robin | ~304 | ~1.22 |
| cpubound60 (4 cores) | MT, batch 1 | ~538 | ~2.15 |
| cpubound60 (4 cores) | MT, batch 32 | ~666 | ~2.67 |
realtime_ratio > 1.0 ⇒ the emulator sustains faster than real time on that
workload (1:1 is then achievable by pacing down).
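"Pacing down" means idling the host whenever simulated time runs ahead of wall-clock time. The sketch below shows only that arithmetic. It is not the Lince pacing implementation, and `pacing_sleep` is a hypothetical helper introduced for illustration.

```cpp
#include <cassert>
#include <chrono>

using std::chrono::nanoseconds;

// How long the host should idle so that wall-clock time catches up with
// simulated time. Returns zero when the emulator is at or behind real time,
// which is the realtime_ratio <= 1.0 case where pacing cannot help.
inline nanoseconds pacing_sleep(nanoseconds sim_elapsed, nanoseconds wall_elapsed) {
    assert(sim_elapsed.count() >= 0 && wall_elapsed.count() >= 0);
    return sim_elapsed > wall_elapsed ? sim_elapsed - wall_elapsed
                                      : nanoseconds(0);
}
```

At a ratio of 1.75, a slice that simulated 1750 ns in 1000 ns of wall time leaves 750 ns to sleep. At a ratio below 1.0 the result is always zero.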
The 1:1 status and the CPI caveat¶
The honest, cpi-independent number is the raw host throughput: ~219 MIPS
per core on realistic (Dhrystone) code — roughly 4× the pre-Phase-12
figure. Whether that is 1:1 depends on the real GR740's average
cycles-per-instruction, which the sim-clock models via EmulatorConfig::cpi
(longer term, the P10 bucket-resync model):
| Assumed real GR740 cpi | insns/s per core for 1:1 | per-core ratio @ 219 MIPS |
|---|---|---|
| 1.0 (optimistic, no stalls) | 250 M | 0.87× (just short) |
| 1.5 | 167 M | 1.31× |
| 2.0 (realistic SPARC, cache+FP stalls) | 125 M | 1.75× |
So the per-core JIT is at or near 1:1 on realistic code — not the ~16× short older numbers implied. The dominant remaining variable is CPI-model fidelity (is the sim-clock honest?), orthogonal to JIT codegen speed.
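The table rows above come from two divisions, sketched here so the figures can be re-derived for other clock or CPI assumptions. `required_mips` and `per_core_ratio` are hypothetical helper names; only the 250 MHz clock and the ~219 MIPS throughput are taken from the text.

```cpp
#include <cassert>

// Instructions per second (in millions) one core must retire for 1:1,
// given the target clock and an assumed average cycles-per-instruction.
inline double required_mips(double clock_mhz, double cpi) {
    assert(clock_mhz > 0.0 && cpi > 0.0);
    return clock_mhz / cpi;
}

// Ratio of measured host throughput to the 1:1 requirement.
// A value >= 1.0 means the core keeps up with real time under that CPI.
inline double per_core_ratio(double host_mips, double clock_mhz, double cpi) {
    return host_mips / required_mips(clock_mhz, cpi);
}
```

For 219 MIPS at 250 MHz this gives about 0.88 at cpi 1.0, 1.31 at cpi 1.5, and 1.75 at cpi 2.0, which matches the table up to rounding.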
Phase-13/14 findings¶
- MT scales sub-linearly: MT(N=4)/ST ≈ 2.1× on cpubound. Losses are barrier-per-quantum overhead (mitigated by --quantum-batch: +24% at batch 32, P14-3), shared-RAM bandwidth, and warmup.
- Tier-2 under MT (P14-2) recovered +32% over baseline-only MT: per-core O2 matters more than raw parallelism.
- Batch granularity is a trade-off: large batches help CPU-bound throughput but coarsen inter-processor IRQ delivery, so the default stays 1 (determinism-exact). Tune per workload.
Toward the formal P14 acceptance (not yet done)¶
- Re-measure on the tier-1 host (performance governor) for the true ceiling.
- Broaden the workload set (FP-, IRQ/IO-, syscall-heavy); acceptance is "1:1 on ≥⅔ bench workloads".
- CPI-model honesty (P10) — pin the real-GR740 cpi so the ratio is defensible.
- P99.9 jitter bounding (LLVM module GC, mmap, page faults) + pacing precision.
Phase 14 per-core optimization wins (2026-06) — current state¶
Landed on main 2026-06-05. Three bit-exact per-core wins, found by
profiling and each confirmed by a controlled same-host back-to-back A/B with
governor=performance. Together they roughly double uniprocessor Dhrystone
throughput. The full ctest suite stays green (709 cases pass, 0 fail, across
uniprocessor + SMP N=2/N=4, Switch + JIT + MultiThread). These numbers
supersede the ~219 MIPS Phase-13/14 figures above as the current
single-core state; the older numbers remain as the pre-win baseline.
Headline: with all four wins below on main (lever-A dispatch anchor landed 2026-06-05 alongside the three new wins), Dhrystone N=1 measures ~295 MIPS (governor=performance), roughly 2× the pre-Phase-12 per-core baseline.
| # | Win | Where | Dhrystone N=1 delta | Other effects |
|---|---|---|---|---|
| 0 | Dispatch-cache anchor (lever A) | IrBlock::jit_entry; TieredJit::*_anchored; Emulator::run_ir_quantum | +19.5% (238 → 285 MIPS) | skips the per-dispatch unordered_map hash |
| 1 | IR BlockCache 1024 → 8192 slots | src/ir/include/lince/ir/block_cache.hpp | +39% (153 → 212 MIPS) | p99 slice jitter 28.7 ms → 8.9 ms; cpubound flat |
| 2 | Event-bounded single-core quantum | Emulator::run_core_quantum (src/runtime/src/emulator.cpp) | +40% (212 → 298 MIPS) | sub-quantum-precise event firing (also more accurate) |
| 3 | Interrupt-poll fast-path (raw_pending()) | iinterrupt_controller.hpp; IrqMP/IrqAMP; Emulator::sample_interrupts | +2.7% (301.9 → 310.2 MIPS, controlled A/B) | boot/call-heavy benefit similarly; cpubound flat |
- Dispatch-cache anchor (lever A). For an already-compiled (pc, mode) the TieredJit CacheEntry node address is stable, so the dispatcher stashes it in IrBlock::jit_entry on first compile and reuses it (*_anchored overloads) to resolve the best fn and count executions without the per-dispatch unordered_map hash that was the #1 hotspot on call-heavy code. Bit-exact by construction (the anchor is read only after a validated find).
- BlockCache 1024 → 8192 slots. The cache is direct-mapped, indexed by (pc >> 2) & (Size - 1). At 1024 the index covered only a 4 KiB PC window, so on a call-heavy guest the hot code and its libc aliased and evicted each other: on Dhrystone, libc strcmp/strcpy hash to indices inside the Proc_/Func_ block span, forcing constant re-translation (a steady-state profile put translate_block at ~9–16% of wall time on a loop that should translate each block once). 8192 (a 32 KiB window) removes the aliasing. Bit-exact: a cache only ever produces fewer evictions, never different results. 16384/32768 measured no further gain, so 8192 is the size/locality sweet spot.
- Event-bounded single-core quantum. A true single core (cores_.size() == 1) has no round-robin interleaving to preserve, so the scheduling quantum (default 1000) only throttles throughput by paying per-round overhead (sync_global_up_counter, the interrupt scan, the tail-step, run_*_quantum setup) every quantum. It now runs a large burst (cap 1 << 16 = 65536) but never past the next scheduled event, so timer/peripheral IRQs still fire at their exact simulated time; in fact more precisely than before, where events were rounded up to the round boundary (up to a full quantum late). Gated strictly to cores == 1, so multi-core round-robin interleaving and its sim-time accounting are byte-unchanged (SMP suites unaffected). Bit-exact between the JIT and the Switch oracle (both honour the same bound); N=1 RTEMS sptests were identical in a controlled same-host back-to-back run.
- Interrupt-poll fast-path via IInterruptController::raw_pending(). The dispatcher polls the interrupt controller at every block boundary (poll_self_interrupt → sample_interrupts → pending_mask), but with a 100 Hz tick the controller is empty ~99.99% of the time; each poll paid a virtual call + a 5-load per-CPU scan + the EIRQ-redirect branch only to compute 0 (a profile put pending_mask at ~5.8% of Dhrystone wall time). raw_pending() is a maintained single-word superset (pending_ | ifr0_ | OR over iforce_[cpu]) refreshed at the 7 raw-source mutation sites in both controllers; it is one load, and a 0 result proves pending_mask(cpu) == 0 for every cpu, so sample_interrupts early-outs before the full scan. Provably bit-exact: the early-out fires only when the full scan would have returned 0 anyway. The reader early-out is gated to SingleThread (the maintained word can lag a concurrent assert under MultiThread, so MT keeps the full pending_mask, with no MT behavior change).
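Three of the mechanisms described above reduce to very small pieces of arithmetic, sketched here for illustration. None of this is the Lince source: `block_cache_index`, `event_bounded_burst`, `RawSources`, and the free function `raw_pending` are hypothetical stand-ins, the distance to the next event is assumed to be expressed in instructions, and the four-entry `iforce` array is an assumption for a quad-core part.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Direct-mapped slot index as quoted in the text: (pc >> 2) & (Size - 1).
// With 4-byte instructions, Size slots cover a Size * 4 byte PC window.
template <std::size_t Size>
inline std::size_t block_cache_index(std::uint32_t pc) {
    static_assert(Size != 0 && (Size & (Size - 1)) == 0,
                  "Size must be a power of two");
    return (pc >> 2) & (Size - 1);
}

// Event-bounded burst: run up to `cap` instructions, but never past the
// next scheduled event (distance assumed to be given in instructions).
inline std::uint64_t event_bounded_burst(std::uint64_t insns_until_next_event,
                                         std::uint64_t cap = std::uint64_t{1} << 16) {
    assert(cap > 0);
    return std::min(insns_until_next_event, cap);
}

// Hypothetical stand-in for the controller's raw interrupt sources.
struct RawSources {
    std::uint32_t pending;
    std::uint32_t ifr0;
    std::uint32_t iforce[4];   // one force register per CPU (assumed count)
};

// Single-word superset of every raw source. A zero result proves the full
// per-CPU scan would also return zero, so the caller may skip that scan.
inline std::uint32_t raw_pending(const RawSources& s) {
    std::uint32_t word = s.pending | s.ifr0;
    for (std::uint32_t f : s.iforce) {
        word |= f;
    }
    return word;
}
```

The aliasing is easy to reproduce: PCs 0x40001000 and 0x40002000 map to the same slot with 1024 entries but to different slots with 8192. The superset word may be non-zero when a given CPU has nothing deliverable; that case simply falls through to the full scan, which is why the early-out cannot change results.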
Rejected levers (measured net-negative — do not re-attempt)¶
Each was caught by a controlled same-host A/B and kept default-OFF (or not merged):
- Cross-region JIT block linking ("lever B"): a trampoline running an already-compiled successor region in native code instead of returning to the dispatcher. Measured ~9% slower on Dhrystone: the dispatch residual was already lean, and the trampoline's per-hop cross-library self-IPI poll + recursion + per-exit fan-out cost more than it saved. Kept default-OFF on a feature branch, not merged.
- MT host-thread affinity pinning. Net-negative on a shared host (~6% lower MIPS and 4–6× worse P99.9 jitter): the per-round barrier makes every core wait for the slowest, and a pinned thread on a transiently contended host core can't migrate away. Only useful with host core isolation (isolcpus); kept opt-in, default OFF.
- Bigger JIT region fusion (jit_max_region_blocks 8 → 16/32/64). −8% on Dhrystone: larger fused regions overshoot the quantum and tail-step through the slow Switch path. 8 stays the sweet spot.
Method note¶
All wins were found by profiling/measurement, and every dead-end was caught by a
controlled same-host back-to-back A/B. The committed result CSVs
(tests/results/*.csv) are thermal-stale from other hosts and produced phantom
regressions multiple times — perf comparisons must be same-host
back-to-back with governor=performance.