Hot paths: Phase 9.4 baseline

Captured 2026-05-17 on commit d95a5ba (Tanda 4 measurement bundle landed). This document is the entry point for the Phase 10 attack plan; every number cites a specific artefact under bench/profiles/phase9-baseline/.

1. Method and host details

Aspect	Value
Host	Fedora Linux 43 on 12th Gen Intel Core i5-12450HX (Alder Lake, hybrid)
Cores	4 P-cores (HT, cpu_core PMU, CPUs 0–7) + 4 E-cores (cpu_atom PMU, CPUs 8–11)
Pinning	All `lince-bench` runs taskset-pinned to cpu0 (one P-core thread)
Compiler	GCC 15.2.1, `-O2 -g -fno-omit-frame-pointer -gdwarf-4`
Build	`cmake --preset bench-profile`, plus a sibling `build-histogram` with `-DLINCE_OPCODE_HISTOGRAM=ON`
`perf`	7.0.8 (Fedora kernel-tools), 999 Hz, `--call-graph=dwarf`
FlameGraph	Brendan Gregg `v1.0` (commit `a8d807a`)
Runs per workload	3 × `perf record` + 1 × opcode histogram + 1 × bench JSON (5 internal runs)

Total Lost Samples: 0 on every perf-report.txt — perf never had to drop samples due to buffer pressure. Pinning was non-negotiable: on hybrid CPUs perf script silently picks one PMU and drops samples that landed on the other type. Without pinning, the first round of measurements produced flamegraphs where two functions each read 50 % (impossible) — see bench/profiles/phase9-baseline/README.md.

2. Host-function hot paths

2.1 `cpubound-mix` (steady-state interpreter)

130 M instructions, ~8 s wall, 16.7 MIPS, ~8 000 samples. This is the workload Phase 10 should optimise against. Source: cpubound-mix/run-1.perf-flat.txt.

Rank	% CPU	Function	Role
1	26.63 %	`bus::SystemBus::read_physical_u32`	Per-fetch dispatch: bounds-checks RAM regions, memcpys 4 bytes, calls `decode_be`.
2	21.67 %	`core::step`	Main interpreter loop: PC update, fetch, dispatch, trap handling.
3	8.51 %	`bus::Ram::read`	`std::memcpy` from the RAM byte buffer.
4	7.42 %	`core::decode`	SPARC V8 instruction decoder.
5	6.66 %	`bus::SystemBus::read_physical` (byte path)	The variadic-length variant called by `read_physical_u32`.
6	4.54 %	`runtime::Emulator::run_until_unpaced`	Round-robin scheduler / quantum loop.
7	4.50 %	`__memmove_avx_unaligned_erms` (libc)	The AVX memcpy invoked from `Ram::read` (4-byte copies coalesced).
8	4.49 %	`core::detail::exec_alu`	ALU handler dispatch (one giant switch).
9	2.01 %	`core::is_fp_class`	Pre-execute check for FP encoding (returns false ~always here).
10	1.60 %	`CpuState::icc`	Condition-code accessor (called from branch handlers).
11	1.53 %	`core::execute`	Dispatch by `InsnKind`.
12	1.45 %	`CpuState::commit_psr_pipeline`	Per-step PSR write delay accounting.
13	1.28 %	`CpuState::read_r`	Register-window read.
14	0.98 %	`runtime::CpuBusBridge::read_u32`	Translates `VirtAddr` → `PhysAddr` for the core.
15	0.97 %	`memcpy@plt` (libc)	PLT trampoline to memcpy.

The bus path (rows 1, 3, 5, 7, 15) sums to 47.27 % of host CPU, versus 7.42 % for decode and 4.49 % for ALU execution. Reading guest memory dominates the interpreter, not decoding what it found.

2.2 `boot-rtems-hello-world`

~250 ms wall × 5 internal runs ≈ 1.25 s per perf session. 143 k instructions of guest code per single run; ~1 200 perf samples per session. Source: boot-rtems-hello-world/run-1.perf-flat.txt.

Rank	% CPU	Function	Role
1	70.58 %	`__memset_avx2_unaligned_erms` (libc)	One-time zero-init of the 16 MiB `Ram` `std::vector<byte>`.
2	7.35 %	`core::step`
3	6.49 %	`bus::SystemBus::read_physical_u32`
4	3.88 %	`bus::Ram::read`
5	2.15 %	`runtime::Emulator::run_until_unpaced`
6	1.66 %	`CpuState::read_r`
7	1.44 %	`bus::SystemBus::read_physical`
8	1.29 %	`core::decode`
9	1.24 %	kernel `0xffffffffa7401968`	Unresolved (perf_event_paranoid=2).
10	0.79 %	`__memmove_avx_unaligned_erms` (libc)

Reading: 70 % of boot is the one-time RAM zero-init. The remaining 30 % is structurally the same hot path as cpubound-mix (bus dispatch + step + decode), just compressed into 30 % of host CPU. Boot is not the workload for Phase 10 analysis — it's the workload for boot-path optimisation (if we ever care to amortise the memset). Phase 10 should ignore row 1.

2.3 `fptest01`

~6 s wall, 148 k instructions of guest code per single run. The PROM segment of the GR712RC recipe forces a second std::vector<byte> zero-init (~256 KiB PROM region) on every run, so memset is still dominant. Source: fptest01/run-1.perf-flat.txt.

Rank	% CPU	Function	Role
1	67.61 %	`__memset_avx2`	RAM + PROM zero-init.
2	9.99 %	`bus::SystemBus::read_physical_u32`
3	7.15 %	`core::step`
4	2.99 %	`core::decode`
5	2.54 %	kernel `0xffffffffa7401968`
6	2.29 %	`core::execute`
7	2.08 %	`core::is_fp_class`	Notable: not statistically larger than `cpubound-mix` (2.01 %).
8	1.92 %	`bus::SystemBus::read_physical`
9	1.16 %	`runtime::Emulator::run_until`
10	0.88 %	libstdc++ filesystem deleter	ELF-loader cost.

Reading: this is structurally boot-rtems-hello-world plus a few thousand FP instructions on top. The boot dominates so heavily that the "FP-heavy" character is invisible in the per-host-function view. The opcode histogram (§ 3) tells the same story.

3. SPARC opcode histogram

Captured by the LINCE_OPCODE_HISTOGRAM=ON build (build-histogram/), which adds a std::array<uint64_t, 32> per core indexed by static_cast<uint8_t>(InsnKind) and emits CSV on Emulator destruction. Sources: <workload>/opcode-histogram.csv.
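A minimal sketch of what such a per-core counter could look like. The `OpcodeHistogram` class, the reduced `InsnKind` enum, and the CSV column layout are assumptions made for illustration; the real build may name and lay these out differently.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical subset of the decoder's instruction kinds (assumption: the
// real enum has more entries but still fits in 32 slots).
enum class InsnKind : std::uint8_t { AluReg, Branch, SetHi, Store, Load };

// Per-core counter array, indexed by the numeric value of InsnKind.
class OpcodeHistogram {
public:
    void record(InsnKind kind) noexcept {
        ++counts_[static_cast<std::uint8_t>(kind)];
    }

    std::uint64_t count(InsnKind kind) const noexcept {
        return counts_[static_cast<std::uint8_t>(kind)];
    }

    // Emits one "kind_index,count" row for every non-zero slot.
    std::string to_csv() const {
        std::ostringstream out;
        out << "kind,count\n";
        for (std::size_t i = 0; i < counts_.size(); ++i) {
            if (counts_[i] != 0) {
                out << i << ',' << counts_[i] << '\n';
            }
        }
        return out.str();
    }

private:
    std::array<std::uint64_t, 32> counts_{};  // zero-initialised
};
```

Calling `record()` once per executed instruction and `to_csv()` from the owner's destructor would match the behaviour described above; keeping the array per core avoids any need for atomics.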

3.1 RTEMS workloads: boot + fptest01

The two RTEMS workloads' distributions match to within 0.15 percentage points on every kind, because fptest01 is an RTEMS boot followed by a tiny FP test (143 k vs 148 k guest instructions, a difference of about 3 %, so almost all of fptest01 is the shared boot prefix).

Kind	% (boot)	% (fptest01)	Cum (boot)
AluReg	37.31	37.45	37.31
Branch	11.91	11.98	49.23
SetHi	11.44	11.37	60.67
Store	9.54	9.57	70.21
Load	9.48	9.45	79.69
Jmpl	5.56	5.52	85.25
Call	3.06	3.03	88.31
Save	2.21	2.22	90.52
Restore	2.20	2.19	92.72
Shift	1.50	1.49	94.21
ReadSpecial	1.47	1.46	95.69
WriteSpecial	1.35	1.34	97.04
Rett	1.30	1.29	98.33
Ticc	1.19	1.17	99.52
tail (LoadAlt, StoreAlt, Mul, Div, FpLoad, FpStore, …)	0.48	0.61	100.00

Top 5 cover ~80 %. Top 10 cover ~94 %. Everything below Ticc (LoadAlt, StoreAlt, Mul, Div, FpLoad, FpStore, FpOp1, FpOp2, Casa, Stbar, Flush, …) totals 0.48 % on boot and 0.61 % on fptest01.

3.2 `cpubound-mix`

Designed mix; histogram matches the design to 0.001 %.

Kind	Count	%	Cum
AluReg	70 000 007	53.74	53.74
Shift	20 000 000	15.36	69.10
SetHi	10 124 997	7.77	76.87
Branch	10 124 993	7.77	84.65
Store	10 000 003	7.68	92.32
Load	10 000 000	7.68	100.00

The SetHi slot (7.77 %) is the nop that fills SPARC branch delay slots — nop is canonically sethi 0, %g0. Bear this in mind when reading the RTEMS histograms: a non-trivial fraction of their SetHi count is also nop (RTEMS code is full of branch delay slots), not actual immediate-load instructions.
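To separate nops from real immediate loads in a future histogram, the raw word can be tested directly. The sketch below relies only on the SPARC V8 format-2 encoding; the function names are hypothetical and nothing like this is claimed to exist in the current decoder.

```cpp
#include <cstdint>

// SPARC V8 format-2 layout: op[31:30] | rd[29:25] | op2[24:22] | imm22[21:0].
constexpr std::uint32_t kOp2SetHi = 0b100;

// True for any SETHI instruction (op == 0, op2 == 4).
constexpr bool is_sethi(std::uint32_t insn) noexcept {
    return (insn >> 30) == 0 && ((insn >> 22) & 0x7) == kOp2SetHi;
}

// True only for the canonical nop, i.e. sethi 0, %g0 (rd == 0, imm22 == 0).
constexpr bool is_canonical_nop(std::uint32_t insn) noexcept {
    return insn == 0x01000000u;
}

static_assert(is_sethi(0x01000000u) && is_canonical_nop(0x01000000u),
              "the canonical nop is a SETHI encoding");
```

Counting `is_canonical_nop` hits separately from other `SetHi` executions would show how much of the RTEMS SetHi share is delay-slot filler.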

3.3 Implication for Phase 10

Specialised handlers / decode-cache entries for AluReg, Branch, SetHi, Store, Load cover about 80 % of RTEMS executions and about 85 % of cpubound-mix; adding Shift (15.36 % there) brings cpubound-mix to 100 %. Below that, generic dispatch is fine: LoadAlt (0.16 %), Casa (≈ 0 %), Stbar, Flush are not worth specialising in the first JIT pass.

4. Hot guest PC ranges

Not collected in this baseline. Per-guest-PC sampling requires either:

A LINCE_PC_HISTOGRAM instrumentation flag analogous to the opcode histogram (sized differently: RAM is 16 MiB, so a flat array with one 64-bit counter per 4-byte instruction slot is 32 MiB of counters), or
A sample-on-step hook driven by IEmulatorObserver.

Neither exists today. Phase 10.1 (decode cache) is the natural place to add this: the decode-cache map is already indexed by PC, so a side-effect of populating it is knowing which PCs are hot. Gap deliberately left open — picking the PC-cache index structure (flat array, hash map, sparse tree) is a Phase 10.1 design decision that benefits from observing the actual access pattern through that structure.
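The 32 MiB figure for a flat per-PC table can be checked at compile time. This is a sizing sketch only; it assumes 64-bit counters (as in the opcode histogram) and one slot per 4-byte instruction word, and all names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kMiB = 1024 * 1024;
constexpr std::size_t kRamBytes = 16 * kMiB;  // guest RAM size
constexpr std::size_t kInsnBytes = 4;         // SPARC V8 instructions are 4 bytes
constexpr std::size_t kSlots = kRamBytes / kInsnBytes;

// Memory needed for a flat table with one counter per instruction slot.
constexpr std::size_t flat_table_bytes(std::size_t counter_bytes) {
    return kSlots * counter_bytes;
}

// Maps a RAM-relative byte offset to its counter slot.
constexpr std::size_t slot_for_offset(std::size_t ram_offset) {
    return ram_offset / kInsnBytes;
}

static_assert(flat_table_bytes(sizeof(std::uint64_t)) == 32 * kMiB,
              "64-bit counters: 32 MiB");
static_assert(flat_table_bytes(sizeof(std::uint32_t)) == 16 * kMiB,
              "32-bit counters would halve the table");
```

Whether 32 MiB of mostly untouched counters is acceptable, or a hash map or sparse tree is preferable, remains the Phase 10.1 design decision described above.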

5. Peripheral / IRQ activity

Not collected in this baseline. bench.json currently reports host-level metrics (MIPS, jitter, RSS) but not guest-side counters (IrqMP deliveries, GPTimer ticks, APBUart bytes).

What we can infer:

cpubound-mix: zero peripheral activity by design (no IRQs raised, two writes total at the very end). The hot-path table contains no peripheral-related host functions, consistent with this.
boot-rtems-hello-world and fptest01: peripheral functions (APBUart::mmio_write, IrqMP::*, GPTimer::*) do not appear in the top-20 host functions of any run. RTEMS boot does write to the UART and program GPTimer, but those writes are < 0.5 % of host CPU each.

Gap to fill in a Tanda 5+ extension: extend lince-bench to read per-peripheral counters at end of run and include them in bench.json. Useful for Tanda 4 of any future profiling run, but not critical for Phase 10, since the data we already have shows peripherals are off the hot path.

6. Cross-workload comparison

Universal hot paths, present in the top 10 of all three workloads (except `bus::Ram::read`, which only makes the top 15 on fptest01):

Function	boot	fptest01	cpubound-mix
`bus::SystemBus::read_physical_u32`	6.49 %	9.99 %	26.63 %
`core::step`	7.35 %	7.15 %	21.67 %
`core::decode`	1.29 %	2.99 %	7.42 %
`bus::Ram::read`	3.88 %	(top 15)	8.51 %
`bus::SystemBus::read_physical` (byte path)	1.44 %	1.92 %	6.66 %

Observation: the same five Lince functions dominate every workload, just at different absolute fractions. Boot and fptest01 look diluted because the libc memset eats 70 % of their CPU; once that is amortised they'd land at proportions close to cpubound-mix. This means a Phase 10 win on these five functions improves every workload — there is no boot-specific optimisation hiding here.
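The dilution argument can be checked by renormalising each function's share against the time left after excluding memset. The helper below is a hypothetical illustration; its inputs are the percentages from the tables in §2.

```cpp
// Share of the remaining host time taken by a function, in percent, once
// `excluded_pct` (here: the libc memset) is removed from the total.
constexpr double renormalised(double function_pct, double excluded_pct) {
    return function_pct / (100.0 - excluded_pct) * 100.0;
}

// boot-rtems-hello-world, memset at 70.58 %:
//   read_physical_u32: 6.49 % -> about 22 %   (cpubound-mix: 26.63 %)
//   core::step:        7.35 % -> about 25 %   (cpubound-mix: 21.67 %)
//   core::decode:      1.29 % -> about 4.4 %  (cpubound-mix: 7.42 %)
constexpr double kBootReadU32 = renormalised(6.49, 70.58);
constexpr double kBootStep    = renormalised(7.35, 70.58);
constexpr double kBootDecode  = renormalised(1.29, 70.58);
```

The renormalised boot shares land in the same range as cpubound-mix without matching it exactly, which is expected given that boot executes a different instruction mix.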

Workload-specific hot paths (top 10 of one workload only):

cpubound-mix only: core::detail::exec_alu (4.49 %), CpuState::commit_psr_pipeline (1.45 %), CpuBusBridge::read_u32 (0.98 %). All structural — visible in cpubound because boot noise is gone, not because they're absent elsewhere.
fptest01 only: core::execute (2.29 %) explicitly in top 10. Promoted by the FP-handler tail relative to boot.

There are no surprise workload-specific functions — the profile is internally consistent across all three.

7. Attack plan for Phase 10

Three optimisations are now justified by data, in priority order.

7.1 P1: Collapse the bus / RAM fetch fast path

Evidence: §2.1 rows 1, 3, 5 sum to 41.80 % (≈ 42 %) of host CPU on steady-state code, and the libc memmove in row 7 adds another 4.50 %. The current code path for a guest 32-bit fetch is:

```text
core::step
  └─ bus.read_u32(VirtAddr)            CpuBusBridge::read_u32
       └─ MMU stub: pass-through
       └─ SystemBus::read_physical_u32(PhysAddr)
            ├─ std::array<std::byte,4> buf{}
            ├─ SystemBus::read_physical(PhysAddr, span<byte>)
            │    ├─ find_ram_region: linear scan over ram_regions_
            │    │    ├─ AddressRange::contains
            │    │    └─ (1 ram region in GR712RC, but a function call
            │    │       and a vector access every fetch)
            │    └─ Ram::read(offset, span<byte>) → std::memcpy → AVX
            └─ decode_be(buf) → manual byte swap
```

For RAM-resident fetches (≈ 100 % of fetches in tight loops), this is at minimum 3 function calls, one memcpy of 4 bytes, and one byteswap — to load a single word from a std::vector<byte>.

Proposed change: add SystemBus::read_u32_be_fast(PhysAddr) or specialised Ram::read_u32_be(offset) that:

Skips the byte-buffer round trip.
Returns directly from the typed underlying storage (one memcpy into a uint32_t register + one __builtin_bswap32, both inlined).
Optional: a per-SystemBus 1-entry cache of the most recently hit RamRegion* (sequential PCs hit the same region for thousands of fetches in a row). Skip find_ram_region on a hit.
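A minimal sketch of the typed fast path described in the list above. The `Ram` class here is a hypothetical stand-in for `bus::Ram`, the `std::optional` return type is an assumption about how out-of-range reads would be reported, and `__builtin_bswap32` ties the sketch to GCC or Clang. The optional one-entry region cache is not shown.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <vector>

// Hypothetical stand-in for bus::Ram; the real class has a wider interface.
class Ram {
public:
    explicit Ram(std::size_t size) : bytes_(size) {}

    std::byte* data() noexcept { return bytes_.data(); }

    // Loads one big-endian 32-bit word without a byte-buffer round trip.
    // Returns std::nullopt when the access would run past the end of RAM.
    std::optional<std::uint32_t> read_u32_be(std::size_t offset) const noexcept {
        if (offset > bytes_.size() ||
            bytes_.size() - offset < sizeof(std::uint32_t)) {
            return std::nullopt;
        }
        std::uint32_t word;
        // A fixed-size 4-byte memcpy is normally lowered to a single load.
        std::memcpy(&word, bytes_.data() + offset, sizeof(word));
#if defined(__BYTE_ORDER__) && __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
        word = __builtin_bswap32(word);  // big-endian guest, little-endian host
#endif
        return word;
    }

private:
    std::vector<std::byte> bytes_;
};
```

The unit test mentioned under "Effort" would store a known byte sequence and compare the result against the expected big-endian value, for example bytes `DE AD BE EF` reading back as `0xDEADBEEF`.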

Expected ROI: collapse the 42 % bus path to 10–15 %. Frees ~30 % of host CPU → ~1.4× MIPS gain. Estimate: 16.7 → 23–26 MIPS.

Effort: small — one or two functions, a unit test that confirms byte order, no public-API change. Days, not weeks.

Why this before the decode cache (10.1 in the original roadmap): decode is about 7 % of host CPU, so optimising it to zero buys at most about 1.08× (1 / (1 − 0.07)). Bus dispatch is 6× larger and the fix is mechanically simpler than a decode cache. Reorder Phase 10.1 to be "bus fetch fast path"; the decode-cache work then moves to 10.2 or merges into a single "fetch+decode cache" entry that does both.
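Both the decode bound and the P1 estimate follow from an Amdahl-style calculation: shrink one slice of host time and hold the rest constant. The helper below is a hypothetical sketch of that arithmetic, not a measured result.

```cpp
// Speedup when a slice taking `before` of the original run time shrinks to
// `after` (both expressed as fractions of the original run time).
constexpr double speedup(double before, double after) {
    return 1.0 / (1.0 - before + after);
}

// Decode (7.42 %) removed entirely: roughly 1.08x.
constexpr double kDecodeBound = speedup(0.0742, 0.0);

// Bus path (about 42 %) reduced to 15 % or 10 %: roughly 1.37x to 1.47x.
constexpr double kBusLow  = speedup(0.42, 0.15);
constexpr double kBusHigh = speedup(0.42, 0.10);

// Projected throughput from the 16.7 MIPS baseline.
constexpr double kMipsLow  = 16.7 * kBusLow;
constexpr double kMipsHigh = 16.7 * kBusHigh;
```

Under these assumptions the projection comes out at roughly 23 to 25 MIPS, in line with the lower part of the range quoted under "Expected ROI".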

7.2 P2: PC-indexed decode cache (the original Phase 10.1)

Evidence: §2.1 rows 4, 8, 11 (decode + exec_alu + execute) sum to 13 % of host CPU. A decode cache that memoises PhysAddr → DecodedInsn + handler_ptr would cut most of this on loop bodies (cpubound-mix's inner loop is 13 instructions visited 10 M times — same instruction stream parsed 10 M times today).

Proposed change: PC-indexed map PhysAddr → CacheEntry { DecodedInsn, fn_ptr }. Cache invalidation on writes to executable RAM (rare in RTEMS; reset entirely on Emulator::reset). Compatible with the threaded-code transition in Phase 10.2 — the fn_ptr field is the hook for the threaded handler.
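One possible shape for such a cache is a direct-mapped table, sketched below. This is one candidate among the index structures Phase 10.1 still has to choose from; `DecodedInsn`, `Handler`, `PhysAddr`, and `DecodeCache` are hypothetical placeholders for the real lince_core types.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical placeholders for the real lince_core types.
struct DecodedInsn { std::uint32_t raw = 0; };
using Handler = void (*)(const DecodedInsn&);
using PhysAddr = std::uint32_t;

struct CacheEntry {
    PhysAddr tag = 0;
    bool valid = false;
    DecodedInsn insn{};
    Handler handler = nullptr;  // hook for a threaded-code handler
};

// Direct-mapped cache indexed by the word index of the instruction address.
template <std::size_t N>
class DecodeCache {
    static_assert(N > 0 && (N & (N - 1)) == 0, "N must be a power of two");

public:
    // Returns the cached entry for pc, or nullptr on a miss.
    const CacheEntry* lookup(PhysAddr pc) const noexcept {
        const CacheEntry& e = entries_[index(pc)];
        return (e.valid && e.tag == pc) ? &e : nullptr;
    }

    void insert(PhysAddr pc, DecodedInsn insn, Handler handler) noexcept {
        entries_[index(pc)] = CacheEntry{pc, true, insn, handler};
    }

    // Call for each 32-bit word touched by a guest store.
    void invalidate(PhysAddr addr) noexcept {
        const PhysAddr word = addr & ~PhysAddr{3};
        CacheEntry& e = entries_[index(word)];
        if (e.valid && e.tag == word) e.valid = false;
    }

    // Full flush, e.g. on emulator reset.
    void clear() noexcept { entries_.fill(CacheEntry{}); }

private:
    static std::size_t index(PhysAddr pc) noexcept { return (pc >> 2) & (N - 1); }
    std::array<CacheEntry, N> entries_{};
};
```

With a 13-instruction inner loop, even a small table would hit on every iteration after the first; conflict misses only matter once the hot code spans addresses that alias to the same slot.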

Expected ROI: collapse decode (~7 %) plus the dispatch overhead in step itself (part of the 21.67 %) → estimate another 1.25× MIPS gain on top of P1. Combined with P1: 16.7 → 28–32 MIPS.

Effort: medium — the cache lives in lince_core, needs a test suite (write-invalidates, address-range invalidation, fresh Emulator state, etc.). Weeks.

Dependency: best done after P1, because the decode-cache benefit compounds on top of a fast bus path. Doing them in the opposite order would make P1's measured gain harder to attribute.

7.3 P3: Lift the FP-class check out of the per-step hot path

Evidence: §2.1 row 9 — core::is_fp_class at 2.01 %.

Currently in step.cpp:

```cpp
if (is_fp_class(raw_insn) && !state.ef()) {
    status = ExecStatus::FpDisabled;
} else {
    const DecodedInsn insn = decode(raw_insn);
    ...
}
```

is_fp_class runs on every fetched word. With FP enabled (the default for RTEMS), the condition as a whole is always false: either is_fp_class returns false, or it returns true and !state.ef() is false. On cpubound-mix is_fp_class itself always returns false. In both cases the 2 % buys nothing.

Proposed change: fold the FP-disabled check into decode(): when state.ef() is false, decode() returns a sentinel kind (InsnKind::FpDisabled) for FP opcodes; otherwise it returns the normal kind. step then checks the kind, not the raw word.
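A sketch of the decode-time classification, under stated assumptions: the three-value `InsnKind` is a hypothetical reduction of the real enum, `classify` stands in for the relevant part of `decode()`, and the opcode constants are the SPARC V8 floating-point encodings (FBfcc, FPop1/FPop2, and the FP load/store group).

```cpp
#include <cstdint>

// Hypothetical subset of InsnKind; FpDisabled is the proposed sentinel.
enum class InsnKind : std::uint8_t { Other, FpOp, FpDisabled };

// SPARC V8 FP encodings: FBfcc (op=0, op2=6), FPop1/FPop2 (op=2, op3=0x34
// or 0x35), FP loads and stores (op=3, op3 in 0x20..0x27 except 0x22).
constexpr bool is_fp_class(std::uint32_t insn) noexcept {
    const std::uint32_t op  = insn >> 30;
    const std::uint32_t op2 = (insn >> 22) & 0x7;
    const std::uint32_t op3 = (insn >> 19) & 0x3F;
    switch (op) {
        case 0:  return op2 == 6;
        case 2:  return op3 == 0x34 || op3 == 0x35;
        case 3:  return op3 >= 0x20 && op3 <= 0x27 && op3 != 0x22;
        default: return false;
    }
}

// The EF test runs only for FP encodings; step then branches on the
// returned kind instead of re-inspecting the raw word.
constexpr InsnKind classify(std::uint32_t insn, bool ef_enabled) noexcept {
    if (!is_fp_class(insn)) return InsnKind::Other;
    return ef_enabled ? InsnKind::FpOp : InsnKind::FpDisabled;
}
```

In a real decoder the FP test would be a branch of the op/op3 switch that `decode()` already performs, so non-FP instructions would pay nothing extra.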

Expected ROI: 2 % win, additive. Trivial in isolation, and it fits naturally into the decode-cache (P2) work, provided the cache key includes the relevant PSR.EF state or the cache is flushed when PSR.EF changes.

Effort: small. Half a day standalone, or zero if folded into P2.

7.4 P4: `boot-rtems-hello-world` is a `memset` benchmark (defer)

Evidence: §2.2 row 1, __memset_avx2_unaligned_erms at 70.58 % of CPU. The whole 16 MiB Ram block is zeroed via std::vector<byte>(size) on every Emulator::create. AVX2 memset is already maximally fast; the only way to shrink this number is to not allocate-and-zero 16 MiB per emulator.

Proposed change (only if a future use case needs it): keep RAM pages lazily-zeroed (track a dirty set, only zero pages on first access) or skip zeroing entirely if the guest will overwrite (the ELF loader knows what regions it'll fill).
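One way to get lazily-zeroed pages without tracking a dirty set in user space is to let the kernel do it: an anonymous private `mmap` region is zero-filled page by page on first touch. This is a different technique from the dirty-set idea above, it is POSIX-only, and the `LazyZeroRam` class is a hypothetical sketch rather than a proposed replacement for `bus::Ram`.

```cpp
#include <cstddef>
#include <sys/mman.h>

// RAM backing store whose pages are zero-filled by the kernel on first
// access, so construction does no eager 16 MiB memset.
class LazyZeroRam {
public:
    explicit LazyZeroRam(std::size_t size) : size_(size) {
        void* p = ::mmap(nullptr, size_, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        data_ = (p == MAP_FAILED) ? nullptr : static_cast<std::byte*>(p);
    }

    ~LazyZeroRam() {
        if (data_ != nullptr) ::munmap(data_, size_);
    }

    // The mapping is owned exclusively; copying is disabled.
    LazyZeroRam(const LazyZeroRam&) = delete;
    LazyZeroRam& operator=(const LazyZeroRam&) = delete;

    std::byte* data() noexcept { return data_; }  // nullptr if mmap failed
    std::size_t size() const noexcept { return size_; }

private:
    std::size_t size_;
    std::byte* data_ = nullptr;
};
```

The cost moves from one up-front memset to a page fault per touched page, which only pays off when the guest touches a small part of its RAM, as a short boot does.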

Expected ROI: dramatic on the boot number (70 % → < 1 %), but irrelevant for any long-running guest because the cost is amortised. Defer indefinitely — Phase 10's "1:1 GR740" goal is about sustained throughput, not cold-start latency. Document so the finding isn't lost.

7.5 P5: Histogram tail truncation (informational, not optimisation)

Evidence: §3.3 — LoadAlt, StoreAlt, Mul, Div, FpOp1, FpOp2, Ldstub, Swap, Casa, Stbar, Flush, FpLoad, FpStore, FpBranch, FpUnknown, Unimp, Unknown, MulScc collectively account for < 1 % of executions on every workload.

Implication for Phase 10.2 (threaded code / JIT): prioritise fast paths for the top 5 kinds (AluReg, Branch, SetHi, Store, Load); generic dispatch handles the rest with negligible aggregate cost. The cost of not specialising the long tail is ~1 % of total runtime.

8. Reproduction recipe

```shell
# One-time setup
bench/profiles/setup.sh                 # clones FlameGraph to ~/.cache
sudo dnf install perf perl-open         # Fedora; equivalent on other distros

# Build the two binaries
cmake --preset bench-profile
cmake --build --preset bench-profile

cmake -S . -B build-histogram -G Ninja \
    -DCMAKE_BUILD_TYPE=RelWithDebInfo \
    -DLINCE_OPCODE_HISTOGRAM=ON
cmake --build build-histogram --target lince_bench sparc_test_binaries

# Run the sweep (~3 min total)
bench/profiles/measure.sh

# Outputs land under bench/profiles/phase9-baseline/<workload>/
```

To re-measure on a different host, the only knob you typically override is the CPU pin:

```shell
LINCE_BENCH_PIN_CPU=4 bench/profiles/measure.sh   # pin to cpu4
LINCE_BENCH_PIN_CPU=   bench/profiles/measure.sh   # disable pinning
```

On non-hybrid Intel / AMD hosts pinning is a harmless reproducibility helper. On Apple Silicon / ARM hosts the script still runs but the flamegraph fidelity depends on the kernel's perf equivalent (Apple: none committed; Linux ARM64: perf works the same way).

Document status: Phase 9.4 Tanda 5 deliverable.

Cited artefacts: bench/profiles/phase9-baseline/ — 54 files, 672 KiB, all auditable text + SVG. The *.perf.data files (the binary stream perf wrote) are not committed; regenerable from bench/profiles/measure.sh.

Next: Phase 10 design PR. The PR description should open with "This implements §7.1 (P1) of docs/development/hot-paths.md", specify the test plan (bench delta on cpubound-mix and boot, ctest pass rate unchanged), and quote the projected ROI from that section.