Hot paths: Phase 9.4 baseline
Captured 2026-05-17 on commit d95a5ba (Tanda 4 measurement bundle
landed). This document is the entry point for the Phase 10 attack
plan; every number cites a specific artefact under
bench/profiles/phase9-baseline/.
1. Method and host details
| Aspect | Value |
|---|---|
| Host | Fedora Linux 43 on 12th Gen Intel Core i5-12450HX (Alder Lake, hybrid) |
| Cores | 4 P-cores (HT, cpu_core PMU, CPUs 0–7) + 4 E-cores (cpu_atom PMU, CPUs 8–11) |
| Pinning | All lince-bench runs taskset-pinned to cpu0 (one P-core thread) |
| Compiler | GCC 15.2.1, -O2 -g -fno-omit-frame-pointer -gdwarf-4 |
| Build | cmake --preset bench-profile, plus a sibling build-histogram with -DLINCE_OPCODE_HISTOGRAM=ON |
| perf | 7.0.8 (Fedora kernel-tools), 999 Hz, --call-graph=dwarf |
| FlameGraph | Brendan Gregg v1.0 (commit a8d807a) |
| Runs per workload | 3 × perf record + 1 × opcode histogram + 1 × bench JSON (5 internal runs) |
Total Lost Samples: 0 on every perf-report.txt — perf never had
to drop samples due to buffer pressure. Pinning was non-negotiable:
on hybrid CPUs perf script silently picks one PMU and drops
samples that landed on the other type. Without pinning, the first
round of measurements produced flamegraphs where two functions each
read 50 % (impossible) — see bench/profiles/phase9-baseline/README.md.
2. Host-function hot paths
2.1 cpubound-mix (steady-state interpreter)
130 M instructions, ~8 s wall, 16.7 MIPS, ~8 000 samples. This is
the workload Phase 10 should optimise against. Source: cpubound-mix/run-1.perf-flat.txt.
| Rank | % CPU | Function | Role |
|---|---|---|---|
| 1 | 26.63 % | bus::SystemBus::read_physical_u32 | Per-fetch dispatch: bounds-checks RAM regions, memcpys 4 bytes, calls decode_be. |
| 2 | 21.67 % | core::step | Main interpreter loop: PC update, fetch, dispatch, trap handling. |
| 3 | 8.51 % | bus::Ram::read | std::memcpy from the RAM byte buffer. |
| 4 | 7.42 % | core::decode | SPARC V8 instruction decoder. |
| 5 | 6.66 % | bus::SystemBus::read_physical (byte path) | The variadic-length variant called by read_physical_u32. |
| 6 | 4.54 % | runtime::Emulator::run_until_unpaced | Round-robin scheduler / quantum loop. |
| 7 | 4.50 % | __memmove_avx_unaligned_erms (libc) | The AVX memcpy invoked from Ram::read (4-byte copies coalesced). |
| 8 | 4.49 % | core::detail::exec_alu | ALU handler dispatch (one giant switch). |
| 9 | 2.01 % | core::is_fp_class | Pre-execute check for FP encoding (returns false ~always here). |
| 10 | 1.60 % | CpuState::icc | Condition-code accessor (called from branch handlers). |
| 11 | 1.53 % | core::execute | Dispatch by InsnKind. |
| 12 | 1.45 % | CpuState::commit_psr_pipeline | Per-step PSR write delay accounting. |
| 13 | 1.28 % | CpuState::read_r | Register-window read. |
| 14 | 0.98 % | runtime::CpuBusBridge::read_u32 | Translates VirtAddr → PhysAddr for the core. |
| 15 | 0.97 % | memcpy@plt (libc) | PLT trampoline to memcpy. |
The bus path (rows 1, 3, 5, 7, 15) sums to 47.27 % of host CPU, versus 7.42 % for decode and 4.49 % for ALU execution. Reading guest memory dominates the interpreter, not decoding what it found.
2.2 boot-rtems-hello-world
~250 ms wall × 5 internal runs ≈ 1.25 s per perf session. 143 k
instructions of guest code per single run; ~1 200 perf samples per
session. Source: boot-rtems-hello-world/run-1.perf-flat.txt.
| Rank | % CPU | Function | Role |
|---|---|---|---|
| 1 | 70.58 % | __memset_avx2_unaligned_erms (libc) | One-time zero-init of the 16 MiB Ram std::vector<byte>. |
| 2 | 7.35 % | core::step | |
| 3 | 6.49 % | bus::SystemBus::read_physical_u32 | |
| 4 | 3.88 % | bus::Ram::read | |
| 5 | 2.15 % | runtime::Emulator::run_until_unpaced | |
| 6 | 1.66 % | CpuState::read_r | |
| 7 | 1.44 % | bus::SystemBus::read_physical | |
| 8 | 1.29 % | core::decode | |
| 9 | 1.24 % | kernel 0xffffffffa7401968 | Unresolved (perf_event_paranoid=2). |
| 10 | 0.79 % | __memmove_avx_unaligned_erms (libc) | |
Reading: 70 % of boot is the one-time RAM zero-init. The
remaining 30 % is structurally the same hot path as cpubound-mix
(bus dispatch + step + decode), just compressed into 30 % of host
CPU. Boot is not the workload for Phase 10 analysis — it's the
workload for boot-path optimisation (if we ever care to amortise
the memset). Phase 10 should ignore row 1.
2.3 fptest01
~6 s wall, 148 k instructions of guest code per single run. The
PROM segment of the GR712RC recipe forces a second std::vector<byte>
zero-init (~256 KiB PROM region) on every run, so memset is still
dominant. Source: fptest01/run-1.perf-flat.txt.
| Rank | % CPU | Function | Role |
|---|---|---|---|
| 1 | 67.61 % | __memset_avx2 | RAM + PROM zero-init. |
| 2 | 9.99 % | bus::SystemBus::read_physical_u32 | |
| 3 | 7.15 % | core::step | |
| 4 | 2.99 % | core::decode | |
| 5 | 2.54 % | kernel 0xffffffffa7401968 | |
| 6 | 2.29 % | core::execute | |
| 7 | 2.08 % | core::is_fp_class | Notable: not statistically larger than cpubound-mix (2.01 %). |
| 8 | 1.92 % | bus::SystemBus::read_physical | |
| 9 | 1.16 % | runtime::Emulator::run_until | |
| 10 | 0.88 % | libstdc++ filesystem deleter | ELF-loader cost. |
Reading: this is structurally boot-rtems-hello-world plus a
few thousand FP instructions on top. The boot dominates so heavily
that the "FP-heavy" character is invisible in the per-host-function
view. The opcode histogram (§ 3) tells the same story.
3. SPARC opcode histogram
Captured by the LINCE_OPCODE_HISTOGRAM=ON build (build-histogram/),
which adds a std::array<uint64_t, 32> per core indexed by
static_cast<uint8_t>(InsnKind) and emits CSV on Emulator
destruction. Sources: <workload>/opcode-histogram.csv.
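A minimal sketch of that instrumentation, assuming a reduced InsnKind enum and a two-column CSV layout. Both are hypothetical: the real enum has more members and the real CSV format is whatever opcode-histogram.csv contains.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical subset of InsnKind: names come from the tables in this
// section, numeric values are invented for the sketch.
enum class InsnKind : std::uint8_t { AluReg, Branch, SetHi, Store, Load, Count };

constexpr const char* kKindNames[] = {"AluReg", "Branch", "SetHi", "Store", "Load"};

class OpcodeHistogram {
public:
    // One increment per executed instruction; the enum value is the index.
    void record(InsnKind kind) {
        const std::size_t idx = static_cast<std::uint8_t>(kind);
        assert(idx < counts_.size());
        ++counts_[idx];
    }

    std::uint64_t count(InsnKind kind) const {
        return counts_[static_cast<std::uint8_t>(kind)];
    }

    // Emits a header line followed by one "kind,count" row per named kind.
    std::string to_csv() const {
        std::ostringstream out;
        out << "kind,count\n";
        for (std::size_t i = 0; i < static_cast<std::size_t>(InsnKind::Count); ++i) {
            out << kKindNames[i] << ',' << counts_[i] << '\n';
        }
        return out.str();
    }

private:
    std::array<std::uint64_t, 32> counts_{};  // 32 slots, zero-initialised
};
```

In the real build the CSV would be written from the Emulator destructor; here to_csv() just returns the text so the behaviour is easy to check.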
3.1 RTEMS workloads: boot + fptest01
The two RTEMS workloads' distributions match to within 0.15
percentage points on every kind, because fptest01 is an RTEMS boot
followed by a tiny FP test (143 k vs 148 k guest instructions, a
~3 % difference, so the shared boot prefix dominates both counts).
| Kind | % (boot) | % (fptest01) | Cum (boot) |
|---|---|---|---|
| AluReg | 37.31 | 37.45 | 37.31 |
| Branch | 11.91 | 11.98 | 49.23 |
| SetHi | 11.44 | 11.37 | 60.67 |
| Store | 9.54 | 9.57 | 70.21 |
| Load | 9.48 | 9.45 | 79.69 |
| Jmpl | 5.56 | 5.52 | 85.25 |
| Call | 3.06 | 3.03 | 88.31 |
| Save | 2.21 | 2.22 | 90.52 |
| Restore | 2.20 | 2.19 | 92.72 |
| Shift | 1.50 | 1.49 | 94.21 |
| ReadSpecial | 1.47 | 1.46 | 95.69 |
| WriteSpecial | 1.35 | 1.34 | 97.04 |
| Rett | 1.30 | 1.29 | 98.33 |
| Ticc | 1.19 | 1.17 | 99.52 |
| tail (LoadAlt, StoreAlt, Mul, Div, FpLoad, FpStore, …) | 0.48 | 0.61 | 100.00 |
Top 5 cover ~80 %. Top 10 cover ~94 %. The long tail (FpOp1, FpOp2, Casa, Stbar, Flush, …) is 0.48 % combined on boot and 0.61 % on fptest01.
3.2 cpubound-mix
Designed mix; histogram matches the design to 0.001 %.
| Kind | Count | % | Cum |
|---|---|---|---|
| AluReg | 70 000 007 | 53.74 | 53.74 |
| Shift | 20 000 000 | 15.36 | 69.10 |
| SetHi | 10 124 997 | 7.77 | 76.87 |
| Branch | 10 124 993 | 7.77 | 84.65 |
| Store | 10 000 003 | 7.68 | 92.32 |
| Load | 10 000 000 | 7.68 | 100.00 |
The SetHi slot (7.77 %) is the nop that fills SPARC branch
delay slots — nop is canonically sethi 0, %g0. Bear this in
mind when reading the RTEMS histograms: a non-trivial fraction of
their SetHi count is also nop (RTEMS code is full of branch
delay slots), not actual immediate-load instructions.
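To separate nop from real immediate loads when post-processing a trace, the encoding can be tested directly. The sketch below relies only on the SPARC V8 format-2 layout; the function names are hypothetical and not part of Lince.

```cpp
#include <cassert>
#include <cstdint>

// SPARC V8 format 2: op = bits 31:30, rd = bits 29:25,
// op2 = bits 24:22, imm22 = bits 21:0. SETHI is op = 0, op2 = 4.
constexpr bool is_sethi(std::uint32_t word) {
    return (word >> 30) == 0u && ((word >> 22) & 0x7u) == 4u;
}

// The canonical nop is "sethi 0, %g0": rd = 0 and imm22 = 0,
// which encodes as 0x01000000.
constexpr bool is_nop(std::uint32_t word) {
    return word == 0x01000000u;
}

static_assert(is_sethi(0x01000000u) && is_nop(0x01000000u),
              "the canonical nop is counted as a SetHi");
```

For example, sethi %hi(0x40000000), %g1 encodes as 0x03100000: it satisfies is_sethi but not is_nop, so it would count as a genuine immediate load.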
3.3 Implication for Phase 10
Specialised handlers / decode-cache entries for AluReg, Branch,
SetHi, Store, Load cover ~80 % of RTEMS executions and ~85 % of
cpubound-mix (84.64 %; the remaining 15.36 % is Shift, so adding
Shift reaches 100 % there). Below that, generic dispatch is fine:
LoadAlt (0.16 %), Casa (≈ 0 %), Stbar, Flush are not worth
specialising in the first JIT pass.
4. Hot guest PC ranges
Not collected in this baseline. Per-guest-PC sampling requires either:
- An LINCE_PC_HISTOGRAM instrumentation flag analogous to the opcode histogram (sized differently: RAM is 16 MiB, so a flat array is 32 MiB of counters), or
- A sample-on-step hook driven by IEmulatorObserver.
Neither exists today. Phase 10.1 (decode cache) is the natural place to add this: the decode-cache map is already indexed by PC, so a side-effect of populating it is knowing which PCs are hot. Gap deliberately left open — picking the PC-cache index structure (flat array, hash map, sparse tree) is a Phase 10.1 design decision that benefits from observing the actual access pattern through that structure.
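The 32 MiB figure for a flat per-PC array can be checked as compile-time arithmetic. This sketch assumes one 64-bit counter per 4-byte instruction slot (the counter width is carried over from the opcode histogram, not stated for the PC histogram); pc_to_slot and its ram_base parameter are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMiB = 1024u * 1024u;
constexpr std::size_t kRamBytes = 16 * kMiB;            // RAM size from the text
constexpr std::size_t kInsnBytes = 4;                   // SPARC V8 instruction width
constexpr std::size_t kSlots = kRamBytes / kInsnBytes;  // one slot per aligned word
constexpr std::size_t kFlatBytes = kSlots * sizeof(std::uint64_t);

static_assert(kFlatBytes == 32 * kMiB, "flat uint64_t PC histogram is 32 MiB");

// Maps a physical PC inside RAM to its counter slot.
constexpr std::size_t pc_to_slot(std::uint32_t pc, std::uint32_t ram_base) {
    return (pc - ram_base) / kInsnBytes;
}
```

Narrower counters would shrink the array proportionally, at the cost of saturation handling on hot loops.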
5. Peripheral / IRQ activity
Not collected in this baseline. bench.json currently reports
host-level metrics (MIPS, jitter, RSS) but not guest-side counters
(IrqMP deliveries, GPTimer ticks, APBUart bytes).
What we can infer:
- cpubound-mix: zero peripheral activity by design (no IRQs raised, two writes total at the very end). The hot-path table contains no peripheral-related host functions, consistent with this.
- boot-rtems-hello-world and fptest01: peripheral functions (APBUart::mmio_write, IrqMP::*, GPTimer::*) do not appear in the top-20 host functions of any run. RTEMS boot does write to the UART and program GPTimer, but those writes are < 0.5 % of host CPU each.
Gap to fill in Tanda 5+ extension: extend lince-bench to
read per-peripheral counters at end of run and include them in
bench.json. Useful for Tanda 4 of any future profiling run — not
critical for Phase 10 since the data we have already shows
peripherals are off the hot path.
6. Cross-workload comparison
Universal hot paths: in the top 10 of all three workloads, except bus::Ram::read, which falls outside fptest01's top 10:
| Function | boot | fptest01 | cpubound-mix |
|---|---|---|---|
| bus::SystemBus::read_physical_u32 | 6.49 % | 9.99 % | 26.63 % |
| core::step | 7.35 % | 7.15 % | 21.67 % |
| core::decode | 1.29 % | 2.99 % | 7.42 % |
| bus::Ram::read | 3.88 % | (top 15) | 8.51 % |
| bus::SystemBus::read_physical (byte path) | 1.44 % | 1.92 % | 6.66 % |
Observation: the same five Lince functions dominate every workload, just at different absolute fractions. Boot and fptest01 look diluted because the libc memset eats 70 % of their CPU; once that is amortised they'd land at proportions close to cpubound-mix. This means a Phase 10 win on these five functions improves every workload — there is no boot-specific optimisation hiding here.
Workload-specific hot paths (top 10 of one workload only):
- cpubound-mix only: core::detail::exec_alu (4.49 %), CpuState::commit_psr_pipeline (1.45 %), CpuBusBridge::read_u32 (0.98 %). All structural: visible in cpubound because boot noise is gone, not because they're absent elsewhere.
- fptest01 only: core::execute (2.29 %) explicitly in top 10. Promoted by the FP-handler tail relative to boot.
There are no surprise workload-specific functions — the profile is internally consistent across all three.
7. Attack plan for Phase 10
Three optimisations are now justified by data, in priority order.
7.1 P1: Collapse the bus / RAM fetch fast path
Evidence: §2.1 rows 1, 3 and 5 sum to 41.80 % (≈ 42 %) of host CPU on steady-state code; adding row 7, the libc memmove reached from Ram::read, brings the bus path to 46.30 %. The current code path for a guest 32-bit fetch is:
core::step
 └─ bus.read_u32(VirtAddr)                    CpuBusBridge::read_u32
     └─ MMU stub: pass-through
         └─ SystemBus::read_physical_u32(PhysAddr)
             ├─ std::array<std::byte,4> buf{}
             ├─ SystemBus::read_physical(PhysAddr, span<byte>)
             │   ├─ find_ram_region: linear scan over ram_regions_
             │   │   ├─ AddressRange::contains
             │   │   └─ (1 ram region in GR712RC, but a function call
             │   │       and a vector access every fetch)
             │   └─ Ram::read(offset, span<byte>) → std::memcpy → AVX
             └─ decode_be(buf) → manual byte swap
For RAM-resident fetches (≈ 100 % of fetches in tight loops),
this is at minimum 3 function calls, one memcpy of 4 bytes, and
one byteswap — to load a single word from a std::vector<byte>.
Proposed change: add SystemBus::read_u32_be_fast(PhysAddr) or
specialised Ram::read_u32_be(offset) that:
- Skips the byte-buffer round trip.
- Returns directly from the typed underlying storage (one memcpy into a uint32_t register + one __builtin_bswap32, both inlined).
- Optional: a per-SystemBus 1-entry cache of the most recently hit RamRegion* (sequential PCs hit the same region for thousands of fetches in a row). Skip find_ram_region on a hit.
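The three points above can be sketched as follows. This is an illustration under stated assumptions, not the Lince implementation: Ram, SystemBus, Region, add_ram and try_read_u32_be are hypothetical stand-ins, std::uint8_t replaces the real byte type, the cache stores a region index rather than a RamRegion*, and __builtin_bswap32 and __BYTE_ORDER__ assume GCC or Clang.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical stand-in for bus::Ram.
class Ram {
public:
    explicit Ram(std::size_t size) : bytes_(size) {}
    std::size_t size() const { return bytes_.size(); }
    void write_u8(std::size_t offset, std::uint8_t value) { bytes_.at(offset) = value; }

    // One 4-byte memcpy into a word-sized value plus one byte swap.
    std::uint32_t read_u32_be(std::size_t offset) const {
        assert(offset + 4 <= bytes_.size());
        std::uint32_t word;
        std::memcpy(&word, bytes_.data() + offset, sizeof word);
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
        word = __builtin_bswap32(word);  // guest is big-endian, host is not
#endif
        return word;
    }

private:
    std::vector<std::uint8_t> bytes_;
};

class SystemBus {
public:
    // Registers a RAM region and returns its index.
    std::size_t add_ram(std::uint32_t base, std::size_t size) {
        regions_.push_back(Region{base, Ram(size)});
        return regions_.size() - 1;
    }
    Ram& ram(std::size_t index) { return regions_.at(index).ram; }

    // Fast path: returns false when addr is not in RAM so the caller can
    // fall back to the generic read_physical route.
    bool try_read_u32_be(std::uint32_t addr, std::uint32_t& out) {
        if (last_hit_ == kNoHit || !regions_[last_hit_].contains(addr)) {
            last_hit_ = find_ram_region(addr);  // linear scan only on a miss
            if (last_hit_ == kNoHit) return false;
        }
        const Region& region = regions_[last_hit_];
        out = region.ram.read_u32_be(addr - region.base);
        return true;
    }

private:
    struct Region {
        std::uint32_t base;
        Ram ram;
        // True when the whole 4-byte word at addr lies inside this region.
        bool contains(std::uint32_t addr) const {
            return addr >= base &&
                   static_cast<std::uint64_t>(addr - base) + 4u <= ram.size();
        }
    };
    static constexpr std::size_t kNoHit = static_cast<std::size_t>(-1);

    std::size_t find_ram_region(std::uint32_t addr) const {
        for (std::size_t i = 0; i < regions_.size(); ++i) {
            if (regions_[i].contains(addr)) return i;
        }
        return kNoHit;
    }

    std::vector<Region> regions_;
    std::size_t last_hit_ = kNoHit;  // 1-entry cache of the last region hit
};
```

Storing an index instead of a pointer keeps the cache valid if the region vector reallocates. Whether the compiler actually inlines the memcpy and byte swap at -O2 should be confirmed in the disassembly before trusting the projected gain.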
Expected ROI: collapse the 42 % bus path to 10–15 %. Frees ~30 % of host CPU → ~1.4× MIPS gain. Estimate: 16.7 → 23–26 MIPS.
Effort: small — one or two functions, a unit test that confirms byte order, no public-API change. Days, not weeks.
Why this before the decode cache (10.1 in the original roadmap): decode is 7.42 % of host CPU, so optimising it to zero buys at most 1 / (1 − 0.0742) ≈ 1.08×. Bus dispatch is about 6× larger and the fix is mechanically simpler than a decode cache. Reorder Phase 10.1 to be "bus fetch fast path"; the decode-cache work then moves to 10.2 or merges into a single "fetch+decode cache" entry that does both.
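The ROI figures in this section follow from a simple fixed-work model: if a component's share of runtime drops while everything else stays the same, the speed-up is the old runtime divided by the new one. The helper names below are hypothetical, and the 12 % target is just one point inside the 10–15 % range quoted above.

```cpp
#include <cassert>
#include <cmath>

// Speed-up when a component's share of the original runtime drops from
// `before` to `after` (both fractions in [0, 1)); the rest is unchanged.
constexpr double speedup(double before, double after) {
    return 1.0 / (1.0 - (before - after));
}

// Projects a new MIPS figure from a measured baseline.
constexpr double projected_mips(double baseline_mips, double before, double after) {
    return baseline_mips * speedup(before, after);
}
```

Under this model, shrinking the bus path from 42 % to 12 % gives speedup(0.42, 0.12) ≈ 1.43×, or roughly 23.9 MIPS from the 16.7 MIPS baseline, while removing decode entirely gives speedup(0.0742, 0.0) ≈ 1.08×. The model ignores second-order effects such as cache behaviour, so the numbers are bounds to test against, not predictions.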
7.2 P2: PC-indexed decode cache (the original Phase 10.1)
Evidence: §2.1 rows 4, 8, 11 (decode + exec_alu + execute) sum
to 13 % of host CPU. A decode cache that memoises
PhysAddr → DecodedInsn + handler_ptr would cut most of this on
loop bodies (cpubound-mix's inner loop is 13 instructions visited
10 M times — same instruction stream parsed 10 M times today).
Proposed change: PC-indexed map PhysAddr → CacheEntry { DecodedInsn,
fn_ptr }. Cache invalidation on writes to executable RAM (rare in
RTEMS; reset entirely on Emulator::reset). Compatible with the
threaded-code transition in Phase 10.2 — the fn_ptr field is the
hook for the threaded handler.
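A minimal sketch of such a cache, assuming a std::unordered_map as the index structure (the text leaves that choice open) and heavily reduced stand-ins for DecodedInsn, decode and the handler type. Everything named here other than PhysAddr, DecodedInsn and CacheEntry is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>

using PhysAddr = std::uint32_t;

// Reduced stand-in: the real DecodedInsn carries far more fields.
struct DecodedInsn {
    std::uint32_t raw;
    std::uint8_t op;  // bits 31:30 of the instruction word
};
using Handler = void (*)(const DecodedInsn&);

inline void generic_handler(const DecodedInsn&) {}

// Placeholder decoder that extracts only the 2-bit op field.
inline DecodedInsn decode(std::uint32_t raw) {
    return DecodedInsn{raw, static_cast<std::uint8_t>(raw >> 30)};
}

struct CacheEntry {
    DecodedInsn insn;
    Handler fn;
};

class DecodeCache {
public:
    // Returns the entry for pc; fetch(pc) and decode run only on a miss.
    template <typename Fetch>
    const CacheEntry& lookup(PhysAddr pc, Fetch&& fetch) {
        auto it = entries_.find(pc);
        if (it == entries_.end()) {
            ++misses_;
            const CacheEntry entry{decode(fetch(pc)), &generic_handler};
            it = entries_.emplace(pc, entry).first;
        }
        return it->second;
    }

    // Called on guest stores: drops every cached word overlapping
    // the byte range [addr, addr + len).
    void invalidate(PhysAddr addr, std::uint32_t len) {
        const std::uint64_t end = std::uint64_t{addr} + len;
        for (std::uint64_t a = addr & ~PhysAddr{3}; a < end; a += 4) {
            entries_.erase(static_cast<PhysAddr>(a));
        }
    }

    void reset() { entries_.clear(); }  // e.g. from Emulator::reset
    std::uint64_t misses() const { return misses_; }
    std::size_t size() const { return entries_.size(); }

private:
    std::unordered_map<PhysAddr, CacheEntry> entries_;
    std::uint64_t misses_ = 0;
};
```

A hash map keeps the sketch short, but a lookup per instruction may itself be costly; a flat or direct-mapped array indexed by (pc - ram_base) / 4 is the obvious alternative to measure. The per-word loop in invalidate is fine for ordinary stores and would need a range-based structure for bulk writes.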
Expected ROI: collapse decode (~7 %) plus the dispatch overhead
in step itself (part of the 21.67 %) → estimate another 1.25×
MIPS gain on top of P1. Combined with P1: 16.7 → 28–32 MIPS.
Effort: medium — the cache lives in lince_core, needs a
test suite (write-invalidates, address-range invalidation, fresh
Emulator state, etc.). Weeks.
Dependency: best done after P1, because the decode cache benefits compound on top of a fast bus path. Doing them in the other order means P1's measurement after P2 is harder to attribute.
7.3 P3: Lift the FP-class check out of the per-step hot path
Evidence: §2.1 row 9 — core::is_fp_class at 2.01 %.
Currently in step.cpp:
if (is_fp_class(raw_insn) && !state.ef()) {
    status = ExecStatus::FpDisabled;
} else {
    const DecodedInsn insn = decode(raw_insn);
    ...
}
is_fp_class runs on every fetched word because it is the left
operand of &&. With FP enabled (the default for RTEMS),
!state.ef() is false, so the branch is never taken and the
classification result is discarded. On cpubound-mix is_fp_class
itself always returns false. Either way, about 2 % of host CPU buys
nothing.
Proposed change: fold the FP-disabled check into decode():
when state.ef() is false, decode() returns a sentinel kind
(InsnKind::FpDisabled) for FP opcodes; otherwise it returns the
normal kind. step then checks the kind, not the raw word.
Expected ROI: 2 % win, additive. Trivial in isolation, and it fits naturally into the decode-cache (P2) work, provided cached entries are keyed on the relevant PSR.EF state or flushed when it changes; §7.2 as written keys on PhysAddr only.
Effort: small. Half a day standalone, or zero if folded into P2.
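A sketch of the folded check, deliberately simplified: the classifier recognises only FPop1 and FPop2 (op = 2, op3 = 0x34 or 0x35 in SPARC V8 format 3) and omits FP loads, stores and FBfcc, which a real decoder would also have to cover. decode_kind, step_status and the reduced enums are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Reduced enums for the sketch; the real ones have many more members.
enum class InsnKind : std::uint8_t { FpOp1, FpOp2, FpDisabled, Other };
enum class ExecStatus : std::uint8_t { Ok, FpDisabled };

// SPARC V8 format 3: op = bits 31:30, op3 = bits 24:19.
constexpr std::uint32_t field_op(std::uint32_t w) { return w >> 30; }
constexpr std::uint32_t field_op3(std::uint32_t w) { return (w >> 19) & 0x3Fu; }

// The FP-enable state is consulted only once the word is known to be
// an FP opcode, so integer code never pays for a separate pre-check.
inline InsnKind decode_kind(std::uint32_t raw, bool fp_enabled) {
    if (field_op(raw) == 2u) {
        const std::uint32_t op3 = field_op3(raw);
        if (op3 == 0x34u || op3 == 0x35u) {
            if (!fp_enabled) return InsnKind::FpDisabled;
            return op3 == 0x34u ? InsnKind::FpOp1 : InsnKind::FpOp2;
        }
    }
    return InsnKind::Other;
}

// step now inspects the decoded kind instead of re-classifying the raw word.
inline ExecStatus step_status(std::uint32_t raw, bool fp_enabled) {
    switch (decode_kind(raw, fp_enabled)) {
        case InsnKind::FpDisabled: return ExecStatus::FpDisabled;
        default:                   return ExecStatus::Ok;
    }
}
```

If decode results are cached, the fp_enabled input has to be part of what keeps an entry valid; otherwise a cached FpDisabled entry would survive the guest setting PSR.EF.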
7.4 P4: boot-rtems-hello-world is a memset benchmark (defer)
Evidence: §2.2 row 1 — __memset_avx2 at 70.58 % of CPU.
The whole 16 MiB Ram block is zeroed via std::vector<byte>(size)
every Emulator::create. AVX2 memset is already maximally fast;
the only way to shrink this number is to not allocate-and-zero
16 MiB per emulator.
Proposed change (only if a future use case needs it): keep RAM pages lazily zeroed (track which pages have been touched and zero a page on its first access, read or write) or skip zeroing entirely for regions the guest image will overwrite (the ELF loader knows which regions it will fill).
Expected ROI: dramatic on the boot number (70 % → < 1 %), but irrelevant for any long-running guest because the cost is amortised. Defer indefinitely — Phase 10's "1:1 GR740" goal is about sustained throughput, not cold-start latency. Document so the finding isn't lost.
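One low-effort variant of lazy zeroing is to let the allocator and kernel do it. The sketch below swaps std::vector<byte>(size) for std::calloc; on glibc, large calloc requests are typically served by fresh anonymous mappings that the kernel zero-fills on first touch, so untouched pages cost nothing up front. That behaviour is an assumption about the platform, not a guarantee of the C++ standard, and the names FreeDeleter, RamBuffer and allocate_zeroed are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <memory>

// Releases calloc'd memory with free rather than delete[].
struct FreeDeleter {
    void operator()(void* p) const { std::free(p); }
};
using RamBuffer = std::unique_ptr<std::uint8_t[], FreeDeleter>;

// Returns `size` bytes that read as zero, or an empty pointer on failure.
inline RamBuffer allocate_zeroed(std::size_t size) {
    return RamBuffer(static_cast<std::uint8_t*>(std::calloc(size, 1)));
}
```

Whether this actually removes the memset row from the boot profile would have to be confirmed with the same perf sweep; the content guarantee (all bytes zero) is unchanged either way.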
7.5 P5: Histogram tail truncation (informational, not optimisation)
Evidence: §3.3 — LoadAlt, StoreAlt, Mul, Div, FpOp1,
FpOp2, Ldstub, Swap, Casa, Stbar, Flush, FpLoad,
FpStore, FpBranch, FpUnknown, Unimp, Unknown, MulScc
collectively account for < 1 % of executions on every workload.
Implication for Phase 10.2 (threaded code / JIT): prioritise fast paths for the top 5 kinds (AluReg, Branch, SetHi, Store, Load); generic dispatch handles the rest with negligible aggregate cost. The cost of not specialising the long tail is ~1 % of total runtime.
8. Reproduction recipe
# One-time setup
bench/profiles/setup.sh # clones FlameGraph to ~/.cache
sudo dnf install perf perl-open # Fedora; equivalent on other distros
# Build the two binaries
cmake --preset bench-profile
cmake --build --preset bench-profile
cmake -S . -B build-histogram -G Ninja \
-DCMAKE_BUILD_TYPE=RelWithDebInfo \
-DLINCE_OPCODE_HISTOGRAM=ON
cmake --build build-histogram --target lince_bench sparc_test_binaries
# Run the sweep (~3 min total)
bench/profiles/measure.sh
# Outputs land under bench/profiles/phase9-baseline/<workload>/
To re-measure on a different host, the only knob you typically override is the CPU pin:
LINCE_BENCH_PIN_CPU=4 bench/profiles/measure.sh # pin to cpu4
LINCE_BENCH_PIN_CPU= bench/profiles/measure.sh # disable pinning
On non-hybrid Intel / AMD hosts pinning is a harmless reproducibility
helper. On Apple Silicon / ARM hosts the script still runs but the
flamegraph fidelity depends on the kernel's perf equivalent (Apple:
none committed; Linux ARM64: perf works the same way).
Document status: Phase 9.4 Tanda 5 deliverable.
Cited artefacts: bench/profiles/phase9-baseline/ — 54 files,
672 KiB, all auditable text + SVG. The *.perf.data files (the
binary stream perf wrote) are not committed; regenerable from
bench/profiles/measure.sh.
Next: Phase 10 design PR. The PR description should open with
"This implements §7.1 (P1) of docs/development/hot-paths.md",
specify the test plan (bench delta on cpubound-mix and boot,
ctest pass rate unchanged), and quote the projected ROI from
that section.