Arch-neutral IR and the LLVM JIT¶
Merged to main — LLVM is mandatory
The IR engine and the LLVM JIT are part of every build. LLVM (≥ 18) is a
mandatory dependency — there is no LINCE_ENABLE_JIT option and no
LLVM-free configuration. The execution method is chosen at runtime by
EmulatorConfig::translation (bool, default true), not at build time,
so the public API and the SMP2-facing contract are unchanged: the
Switch interpreter is always one config field away.
This page is the reference for Lince's second execution path: an
architecture-neutral basic-block IR and an LLVM JIT that lowers it to
native code. It documents every load-bearing type, the SPARC → IR → LLVM →
native pipeline, the tiered-JIT promotion flow, and the corner cases (precise
traps, delay slots, self-modifying code, GDB, multi-thread). For the
single-instruction core::step cycle that remains the reference oracle and the
fallback, see Execution model. To add a new guest ISA,
see Adding a frontend.
Why an IR¶
Two goals converge on the same seam:
- Multi-architecture. A new guest ISA (e.g. ARM) should be a new frontend that decodes guest bytes into the IR — not a new core. The IR carries no guest-ISA knowledge, so the IR interpreter and the JIT backend are shared across architectures. SPARC register windows and ARM banking never reach the IR; they are the frontend's choice of byte offsets into an opaque guest-state blob.
- Performance. The IR is the unit the JIT compiles. Polymorphism sits at the coarse boundary (translate_block/take_exception/mode_ctx_of, once per block or quantum), never per instruction, so it adds no per-instruction virtual dispatch on the hot path.
The design is frozen in plans/phase11-arch-neutral-ir.md; the JIT roadmap and
progress are in plans/post-mvp-1to1-roadmap.md (ADR-002 = tiered JIT,
ADR-003 = x86-64 host).
Design decisions¶
Six choices make the IR architecture-neutral. Each is a numbered entry in
decisions.md with its rejected alternative; the table is the
index.
| # | Decision | Consequence |
|---|---|---|
| 49 | Guest state is an opaque byte blob; ops touch it only via LdState/StState at (offset, size) | register names, windows, banking are a frontend offset choice, invisible to the IR |
| 50 | The block-cache key is (PhysAddr, ModeCtx); mode-changing instructions are block terminators | within a block the mode is constant → every mode-dependent offset resolves at translate time, no runtime indexing |
| 51 | Endianness is an attribute of LdGuest/StGuest, not the bus | one IR serves big-endian SPARC and little-endian ARM; the swap is in the op/lowered code |
| 52 | No flags register; condition codes are explicit guest-state writes | SPARC icc and ARM CPSR differ; lazy-flag evaluation stays a per-frontend optimisation |
| 53 | Atomics (CASA/LDSTUB/SWAP, later LDREX/STREX) are block boundaries; ordering is TSO | atomicity is preserved even when the JIT adds mid-region exits; TSO is free on the x86-64 host (ADR-003) |
| - | Control flow exits a block with a reason (ExitKind); the arch delivers it (take_exception) | the shared loop dispatches on the reason; trap-vector/priority stays in arch code |
Execution methods¶
The execution method is the runtime field EmulatorConfig::translation
(bool, default true, src/runtime/include/lince/runtime/emulator_config.hpp:179),
not a build flag — both paths are compiled into every build and selected per
Emulator. There is no exposed dispatch enum and no LINCE_ENABLE_JIT.
| translation | What runs | Notes |
|---|---|---|
| false | naive core::step switch | the reference path and correctness oracle |
| true (default) | arch-neutral IR, JIT-compiled, with the IR interpreter as fallback | native blocks/regions; any block the JIT cannot lower runs interpreted |
When translation is true the tiered JIT compiles each hot region and the IR
interpreter executes everything else, so the architectural result is identical
either way — the JIT is a speed layer over the interpreter, never a behaviour
change. Emulator::run_ir_quantum (src/runtime/src/emulator.cpp:950) drives
this path block-at-a-time.
GDB and observers
A per-instruction observer (instruction trace) forces the Switch path,
because block-at-a-time execution has no per-instruction hook — guarded by
config_.translation && !observer_ at emulator.cpp:622. A GDB stub
does not force it: run_ir_quantum is breakpoint-aware. With a stub
attached it runs native until a block boundary, calls
gdb_stub_->should_break at each entry, and single-steps (via
fallback_step → core::step) through any block that holds an interior
breakpoint so it stops on the exact PC (block_has_breakpoint,
emulator.cpp:986). While a stub is attached build_jit_region stops fusing
blocks (emulator.cpp:1201), keeping that interior check exact.
Emulator::single_step (emulator.cpp:542) is always core::step —
per-instruction and method-independent.
The IR data model¶
lince_ir (src/ir/) defines the whole IR and depends only on lince_interfaces
(ICpuBus, types.hpp). It has no dependency on lince_core: it operates on
the opaque GuestState blob, never on core::CpuState.
IrInst — one operation¶
IrInst (src/ir/include/lince/ir/ir.hpp:79) is a fixed-size POD: an Op, a
byte size (1/2/4/8), a MemEndian, up to four block-local Temp operands
(dst, a, b, c), a 64-bit imm, and the originating guest pc. The pc
is stamped on the trapping ops (LdGuest/StGuest/TrapIf) so a mid-block
fault reports the exact trap PC; it slots into existing padding, so the struct
does not grow.
The Op enum (ir.hpp:38) is the full op set as it stands today:
| Family | Ops | Notes |
|---|---|---|
| Values | Const | imm → dst |
| Register file | LdState, StState | guest_state[off .. off+size], host order |
| Guest memory | LdGuest, StGuest | {size, endian}-tagged bus access |
| Integer ALU | Add Sub And Or Xor Shl Shr Sar Mul | 32-bit; shift count masked to 5 bits |
| Multiply-high | UMulHi, SMulHi | high 32 bits of the 64-bit (un)signed product (SPARC writes %Y); Mul is the low 32 |
| Divide | UDiv, SDiv | 32/32; ÷0 → 0 (frontend raises the trap separately) |
| 64-bit divide | UDiv64, SDiv64 | ternary SPARC UDIV/SDIV: a=%Y(high), b=dividend(low), c=divisor; quotient of (a<<32\|b)/c, saturated to U/INT32 MAX/MIN |
| Compare | CmpEq, CmpLtU, CmpLtS | dst = 0/1 |
| Select | Select | dst = a ? b : c |
| Trap | TrapIf | if a != 0: abort the block with exception imm at pc |
Temp (ir.hpp:21) is a block-local SSA-free value: every value-producing op
writes a fresh Temp and they all die at the block boundary. Cross-instruction
state flows through the GuestState blob, never through temps.
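
The arithmetic notes in the op table can be made concrete with a small sketch. The function names below are hypothetical, and the divide-by-zero result for the 64-bit divides is an assumption carried over from the UDiv/SDiv row; only the masking, multiply-high, and saturation rules come from the table.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical helper names; only the semantics come from the op table.

// Shl: the shift count is masked to 5 bits.
constexpr std::uint32_t op_shl(std::uint32_t a, std::uint32_t b) {
    return a << (b & 31u);
}

// UMulHi / SMulHi: high 32 bits of the 64-bit product.
constexpr std::uint32_t op_umulhi(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::uint32_t>((std::uint64_t{a} * b) >> 32);
}
constexpr std::uint32_t op_smulhi(std::uint32_t a, std::uint32_t b) {
    const std::int64_t p =
        std::int64_t{static_cast<std::int32_t>(a)} * static_cast<std::int32_t>(b);
    return static_cast<std::uint32_t>(static_cast<std::uint64_t>(p) >> 32);
}

// UDiv64: (y_hi:lo) / divisor, saturated to UINT32_MAX.
constexpr std::uint32_t op_udiv64(std::uint32_t y_hi, std::uint32_t lo,
                                  std::uint32_t divisor) {
    if (divisor == 0) return 0;  // assumption: same "÷0 → 0" rule as UDiv
    const std::uint64_t q = ((std::uint64_t{y_hi} << 32) | lo) / divisor;
    constexpr std::uint64_t kMax = std::numeric_limits<std::uint32_t>::max();
    return static_cast<std::uint32_t>(q > kMax ? kMax : q);
}

// SDiv64: signed variant, saturated to INT32_MAX / INT32_MIN.
constexpr std::uint32_t op_sdiv64(std::uint32_t y_hi, std::uint32_t lo,
                                  std::uint32_t divisor) {
    if (divisor == 0) return 0;  // assumption, as above
    const auto dividend =
        static_cast<std::int64_t>((std::uint64_t{y_hi} << 32) | lo);
    const std::int64_t d = static_cast<std::int32_t>(divisor);
    constexpr std::int64_t kMax = std::numeric_limits<std::int32_t>::max();
    constexpr std::int64_t kMin = std::numeric_limits<std::int32_t>::min();
    // INT64_MIN / -1 would overflow in C++, so saturate before dividing.
    if (d == -1 && dividend == std::numeric_limits<std::int64_t>::min())
        return static_cast<std::uint32_t>(kMax);
    const std::int64_t q = dividend / d;
    return static_cast<std::uint32_t>(q > kMax ? kMax : (q < kMin ? kMin : q));
}
```

Computing in 64 bits and clamping afterwards keeps every operation free of trapping host instructions, which is the same property the table asks of the interpreter and the lowered code.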
IrExit — the structured terminator¶
Control flow is not an op — a block ends with one IrExit
(ir.hpp:94). Conditional branches need both targets plus a condition temp,
which is why the terminator is a struct rather than an op:
| ExitKind | Fields used | Meaning |
|---|---|---|
| FallThrough / StaticBranch | static_target (+ is_call) | continue at a fixed PC |
| CondBranch | cond, static_target, fallthrough_target | cond ? target : fallthrough |
| IndirectBranch | dyn_target (Temp) | computed target (JMPL / return) |
| Exception | exit_code | deliver an architectural trap |
| PowerDown | - | core halted pending interrupt |
is_call (ir.hpp:104) marks a StaticBranch that is a CALL: the callee runs
at static_target and the return point is the sequential next PC. It lets the
region builder pull in the return block so the callee's return can chain back
(see region chaining).
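
As a rough illustration of how a run loop might turn an exit into a continuation PC, here is a toy model. ToyExit and continuation_pc are hypothetical: the real IrExit stores cond and dyn_target as Temps, whereas this sketch assumes they have already been evaluated to plain values.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

enum class ExitKind {
    FallThrough, StaticBranch, CondBranch, IndirectBranch, Exception, PowerDown
};

// Toy stand-in for ir::IrExit with the Temps already resolved (assumption).
struct ToyExit {
    ExitKind kind;
    std::uint32_t static_target;
    std::uint32_t fallthrough_target;
    bool cond_value;          // resolved value of the `cond` Temp
    std::uint32_t dyn_value;  // resolved value of the `dyn_target` Temp
};

// Returns the PC to continue at, or nullopt when the exit is not a plain
// continuation (trap delivery or power-down is left to the caller).
inline std::optional<std::uint32_t> continuation_pc(const ToyExit& e) {
    switch (e.kind) {
        case ExitKind::FallThrough:
        case ExitKind::StaticBranch:
            return e.static_target;
        case ExitKind::CondBranch:
            return e.cond_value ? e.static_target : e.fallthrough_target;
        case ExitKind::IndirectBranch:
            return e.dyn_value;
        case ExitKind::Exception:
        case ExitKind::PowerDown:
            return std::nullopt;
    }
    return std::nullopt;
}
```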
IrBlock — a translated basic block¶
IrBlock (ir.hpp:117) is a class, not a bare struct: it carries the data and
the builder helpers the frontend composes. Frontends never hand-fill an
IrInst; they call the emit helpers, which allocate temps and append ops:
| Helper | Emits | Notes |
|---|---|---|
| emit_const(imm) → Temp | Const | block-local value |
| emit_ld_state(off, size) → Temp | LdState | read guest reg (host order) |
| emit_st_state(off, size, src) | StState | write guest reg |
| emit_ld_guest(addr, size, endian) → Temp | LdGuest | guest memory load |
| emit_st_guest(addr, val, size, endian) | StGuest | guest memory store |
| emit_binary(Op, a, b) → Temp | ALU/Cmp | the two-operand families above |
| emit_select(cond, t, f) → Temp | Select | cond ? t : f |
| emit_ternary(Op, a, b, c) → Temp | UDiv64/SDiv64 | three-operand divide |
| emit_trap_if(cond, code) | TrapIf | mid-block conditional trap at cur_pc |
| set_cur_pc(pc) | - | stamp the PC onto subsequent trapping ops |
The block's identity and bookkeeping fields:
- entry_pc + mode_ctx: the block-cache key.
- insn_count: guest instructions covered; the run loop bills sim-time from it.
- mode_change_kind (ModeChangeKind, ir.hpp:112): how the block leaves its mode context. It is None (the common case), StaticDelta (a translate-time constant CWP shift, as in SPARC SAVE/RESTORE), or Dynamic (RETT/WRPSR land on an unknown CWP/PSR). A region compiler may chain across a StaticDelta block but never across a Dynamic one.
- exit_mode_ctx: the mode context the static successor runs under; meaningful only when mode_change_kind == StaticDelta.
- exec_count (ir.hpp:141): the interpret-first warmup counter (dispatcher runtime state, not part of the translation); see interpret-first tiering.
- delay_trap_pc / delay_trap_npc / delay_trap_dynamic (ir.hpp:156): the delay-slot trap fixup; see delay-slot traps.
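
To show how a frontend might compose the emit helpers, here is a toy sketch of a register-register add. ToyIrBlock, its fields, and translate_add are stand-ins invented for illustration; only the helper names and their roles come from the table above, and the real IrBlock will differ in detail.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy stand-ins, NOT the real lince::ir types (fields/signatures assumed).
enum class Op { Const, LdState, StState, Add };
using Temp = std::uint32_t;

struct IrInst {
    Op op;
    std::uint8_t size;
    Temp dst, a, b;
    std::uint64_t imm;  // constant, or the guest-state byte offset
};

class ToyIrBlock {
public:
    Temp emit_const(std::uint64_t imm) {
        const Temp t = next_temp_++;
        insts_.push_back({Op::Const, 4, t, 0, 0, imm});
        return t;
    }
    Temp emit_ld_state(std::uint32_t off, std::uint8_t size) {
        const Temp t = next_temp_++;  // every value-producing op gets a fresh Temp
        insts_.push_back({Op::LdState, size, t, 0, 0, off});
        return t;
    }
    void emit_st_state(std::uint32_t off, std::uint8_t size, Temp src) {
        insts_.push_back({Op::StState, size, 0, src, 0, off});
    }
    Temp emit_binary(Op op, Temp a, Temp b) {
        const Temp t = next_temp_++;
        insts_.push_back({op, 4, t, a, b, 0});
        return t;
    }
    const std::vector<IrInst>& insts() const { return insts_; }

private:
    std::vector<IrInst> insts_;
    Temp next_temp_ = 0;
};

// Hypothetical frontend routine: rd = rs1 + rs2, with the register offsets
// already resolved at translate time.
inline void translate_add(ToyIrBlock& b, std::uint32_t rs1_off,
                          std::uint32_t rs2_off, std::uint32_t rd_off) {
    const Temp x = b.emit_ld_state(rs1_off, 4);
    const Temp y = b.emit_ld_state(rs2_off, 4);
    const Temp sum = b.emit_binary(Op::Add, x, y);
    b.emit_st_state(rd_off, 4, sum);
}
```

The result reaches later guest instructions only through the StState write, which matches the rule that temps die at the block boundary.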
GuestState — the opaque blob¶
GuestState (src/ir/include/lince/ir/guest_state.hpp:20) is an opaque
byte-addressed std::vector<std::byte>. The frontend picks offsets; the IR only
does load(offset, size) / store(offset, size, value) in host order.
load/store are inlined (a literal size folds to a single aligned access) —
they are the hot path for every register/PC access, profiled at ~28% of real-code
runtime when out-of-line.
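
A minimal sketch of such a blob, assuming a class shaped roughly like the description above. ToyGuestState is a stand-in; the real GuestState's member names and exact signatures are not shown on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Toy stand-in for ir::GuestState (assumed shape).
class ToyGuestState {
public:
    explicit ToyGuestState(std::size_t size) : bytes_(size) {}

    // Host-order load of 1, 2 or 4 bytes, zero-extended to 32 bits.
    std::uint32_t load(std::size_t offset, std::size_t size) const {
        assert(offset + size <= bytes_.size());
        switch (size) {
            case 1: { std::uint8_t v;  std::memcpy(&v, at(offset), 1); return v; }
            case 2: { std::uint16_t v; std::memcpy(&v, at(offset), 2); return v; }
            default: { std::uint32_t v; std::memcpy(&v, at(offset), 4); return v; }
        }
    }

    // Host-order store of the low `size` bytes of `value`.
    void store(std::size_t offset, std::size_t size, std::uint32_t value) {
        assert(offset + size <= bytes_.size());
        switch (size) {
            case 1: { const auto v = static_cast<std::uint8_t>(value);
                      std::memcpy(bytes_.data() + offset, &v, 1); break; }
            case 2: { const auto v = static_cast<std::uint16_t>(value);
                      std::memcpy(bytes_.data() + offset, &v, 2); break; }
            default: std::memcpy(bytes_.data() + offset, &value, 4); break;
        }
    }

private:
    const std::byte* at(std::size_t offset) const { return bytes_.data() + offset; }
    std::vector<std::byte> bytes_;  // zero-initialised opaque blob
};
```

With a literal size at the call site, a compiler can usually fold the switch and the memcpy into one plain access, which is the effect the inlining note describes.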
State unification — CpuState is a GuestState blob¶
There is no CpuState↔GuestState sync. core::CpuState embeds an
ir::GuestState int_state_{layout::StateSize} and exposes it directly via
CpuState::guest_state() (src/core/include/lince/core/cpu_state.hpp:302). The
SPARC integer-state layout (core::layout, cpu_state.hpp:40) is the single
canonical representation: the SPARC frontend emits LdState/StState against
exactly these offsets, and core::step reads/writes the same bytes, so the
reference interpreter, the IR interpreter, and the JIT agree register-for-register
with zero copy.
core::layout (the SPARC integer blob, cpu_state.hpp)
[0] 8 globals %g0..%g7 (GlobalsBase)
[32] NumWindows*16 windowed slots (8 outs + 8 locals/window;
ins of window w alias outs of (w+1) mod NumWindows) (WindowedBase)
[544] Y, PSR, WIM, TBR, PC, nPC, ASR17 (SpecialBase)
StateSize = 572
What is not in the blob: the FP register file, the cache-control registers,
and the step-loop micro-state (annul_next, psr_write_pending, error mode,
power-down). The IR never touches those; they stay in CpuState and only
core::step ever sets them, which is what keeps the blob clean across IR blocks.
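
The layout above implies a translate-time offset function along the following lines. This is a sketch: NumWindows = 8 is inferred from 32 + NumWindows*16*4 = 544, the ordering of outs before locals inside a window is an assumption, and reg_offset is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

namespace toy_layout {

constexpr std::uint32_t NumWindows = 8;  // inferred from SpecialBase == 544
constexpr std::uint32_t GlobalsBase = 0;
constexpr std::uint32_t WindowedBase = 32;
constexpr std::uint32_t SpecialBase = 544;
static_assert(WindowedBase + NumWindows * 16 * 4 == SpecialBase,
              "windowed slots must end where the special registers begin");

// Byte offset of integer register r (0..31) under window `cwp`.
// Assumed slot order inside a window: 8 outs, then 8 locals.
constexpr std::uint32_t reg_offset(std::uint32_t r, std::uint32_t cwp) {
    if (r < 8) return GlobalsBase + r * 4;  // %g0..%g7, not windowed
    std::uint32_t window = cwp;
    std::uint32_t slot = 0;
    if (r < 16) {
        slot = r - 8;                       // %o0..%o7
    } else if (r < 24) {
        slot = 8 + (r - 16);                // %l0..%l7
    } else {
        window = (cwp + 1) % NumWindows;    // %i0..%i7 alias the next window's outs
        slot = r - 24;
    }
    return WindowedBase + (window * 16 + slot) * 4;
}

}  // namespace toy_layout
```

Because the mode context (CWP) is constant within a block, a frontend can evaluate a function like this once per register operand at translate time and emit a plain constant offset.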
Why SPARC keeps CpuState
SPARC predates the IR. Its reference interpreter, GDB stub, and
per-instruction observer all use core::CpuState, so unification made the
blob a member of CpuState rather than a separate scratch buffer. A
brand-new arch with no legacy core would be GuestState-native and skip
CpuState entirely — see Adding a frontend Step 5.
ModeCtx — the mode key¶
ModeCtx (ir.hpp:28) is the small arch value that, with the entry PC, keys a
block. Mode-changing instructions are block terminators (decision 50), so within
a block it is constant. For SPARC it is the CWP bits of PSR
(SparcArchitecture::mode_ctx_of, sparc_arch.cpp:43 — PSR & CwpMask); S/PS/EF
join the key when the privileged/FP translation paths land.
Guest memory — the one place endianness lives¶
src/ir/include/lince/ir/guest_memory.hpp centralises bus access + byte order:
bswap32, swap_for, bus_load, bus_store, guest_load, guest_store. The
bus is big-endian today; a little-endian access is its byte-reverse over the
access width (swap_for). Both the IR interpreter (interpreter.cpp:36,
:48) and the JIT's extern "C" helpers (ir_jit.cpp:50, :62) call these,
so the two strategies cannot diverge on memory semantics.
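
The byte-reverse rule can be sketched as follows. The signatures and the MemEndian enumerator names are assumptions; only the helper names bswap32 and swap_for and the "reverse over the access width" rule come from the text.

```cpp
#include <cassert>
#include <cstdint>

enum class MemEndian { Big, Little };  // enumerator names assumed

constexpr std::uint32_t bswap32(std::uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u) |
           ((v << 8) & 0x00FF0000u) | (v << 24);
}

// Byte-reverse `v` over an access width of `size` bytes (1, 2 or 4).
constexpr std::uint32_t reverse_over_width(std::uint32_t v, unsigned size) {
    switch (size) {
        case 1: return v & 0xFFu;
        case 2: return ((v & 0xFFu) << 8) | ((v >> 8) & 0xFFu);
        default: return bswap32(v);
    }
}

// The bus is big-endian: a big-endian access passes through unchanged and a
// little-endian access is the byte-reverse over the access width.
constexpr std::uint32_t swap_for(std::uint32_t bus_value, unsigned size,
                                 MemEndian endian) {
    return endian == MemEndian::Little ? reverse_over_width(bus_value, size)
                                       : bus_value;
}
```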
BlockCache — direct-mapped, (PhysAddr, ModeCtx)-keyed¶
BlockCache (src/ir/include/lince/ir/block_cache.hpp) is an 8192-slot
direct-mapped cache: index_of(pc) = (pc >> 2) & (Size - 1), each slot carrying
valid, pc_tag, mode, and the IrBlock. find(pc, mode) matches the tag
and the mode; insert evicts the occupant; invalidate(pc) drops the slot
whose tag matches (the self-modifying-code hook). The shape mirrors the Phase
10.1 decode cache.
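
A minimal sketch of a cache with this shape, assuming a trivial payload in place of IrBlock. ToyBlockCache and ToyBlock are illustrative stand-ins, not the real class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct ToyBlock { std::uint32_t insn_count = 0; };  // stand-in for IrBlock

class ToyBlockCache {
    struct Slot {
        bool valid = false;
        std::uint32_t pc_tag = 0;
        std::uint32_t mode = 0;
        ToyBlock block;
    };

public:
    static constexpr std::size_t Size = 8192;  // must be a power of two

    ToyBlockCache() : slots_(Size) {}

    // Word-aligned PCs: drop the two low bits, then wrap into the table.
    static constexpr std::size_t index_of(std::uint32_t pc) {
        return (pc >> 2) & (Size - 1);
    }

    const ToyBlock* find(std::uint32_t pc, std::uint32_t mode) const {
        const Slot& s = slots_[index_of(pc)];
        return (s.valid && s.pc_tag == pc && s.mode == mode) ? &s.block : nullptr;
    }

    // Direct-mapped: inserting evicts whatever occupied the slot.
    void insert(std::uint32_t pc, std::uint32_t mode, ToyBlock block) {
        slots_[index_of(pc)] = Slot{true, pc, mode, block};
    }

    // Self-modifying-code hook: drop the slot only if the tag matches.
    void invalidate(std::uint32_t pc) {
        Slot& s = slots_[index_of(pc)];
        if (s.valid && s.pc_tag == pc) s.valid = false;
    }

private:
    std::vector<Slot> slots_;
};
```

Two PCs that differ by Size * 4 bytes (32 KiB at 8192 slots) land on the same slot and evict each other; that is the aliasing the slot count trades against memory.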
Why 8192 (2026-06)
The slot count was raised from 1024 → 8192. At 1024 the index covered only a
4 KiB PC window, so on a call-heavy guest the hot code and its libc aliased
and evicted each other (on Dhrystone, libc strcmp/strcpy hashed into the
Proc_/Func_ block span and forced constant re-translation —
translate_block ate ~9–16% of wall time). 8192 (a 32 KiB window) removes
the aliasing (Dhrystone +39%, p99 slice jitter 28.7 ms → 8.9 ms); larger
sizes measured no further gain. A cache only ever changes eviction
frequency, never results, so this is bit-exact. See
Performance.
The pipeline: SPARC → IR → LLVM → native¶
flowchart LR
subgraph Frontend["lince_arch_sparc (per-ISA)"]
A[guest bytes] -->|core::decode| B[DecodedInsn]
B -->|translate_block| C[IrBlock]
end
subgraph IR["lince_ir (arch-neutral)"]
C -->|BlockCache.find/insert| D[(block cache)]
end
subgraph Backend["lince_jit (LLVM)"]
D -->|build_jit_region| E[region: IrBlock+]
E -->|lower_block / IRBuilder| F[LLVM IR module]
F -->|verifyModule + addIRModule| G[ORCv2 LLJIT]
G -->|lookup materialises| H[native BlockExecFn]
end
D -->|JIT can't lower| I[IrInterpreter]
H --> J[GuestState updated in place]
I --> J
The frontend is the only ISA-aware stage. Everything from IrBlock rightward is
arch-neutral and shared.
The IR run loop¶
Emulator::run_ir_quantum (emulator.cpp:950) drives one core for one quantum,
identical for the IR-interpreted and JIT-compiled cases:
flowchart TD
Start([enter quantum]) --> Halt{error / powered-down?}
Halt -->|yes| Done([return ran])
Halt -->|no| IPI[poll_self_interrupt]
IPI --> GDB{GDB break at pc?}
GDB -->|yes| Done
GDB -->|no| Clean{clean boundary?}
Clean -->|"no (delay slot /<br/>pending PSR / annul)"| FB[fallback_step: core::step] --> Loop
Clean -->|yes| Cache["BlockCache.find(pc, mode)"]
Cache -->|miss| TX[translate_block + insert] --> Untr
Cache -->|hit| Untr{insn_count == 0?}
Untr -->|yes| FB
Untr -->|no| Bud{next block<br/>crosses quantum?}
Bud -->|"yes (ran>0)"| Tail[tail-step remainder via core::step] --> Done
Bud -->|no| BP{interior breakpoint?}
BP -->|yes| FB
BP -->|no| Warm{exec_count <<br/>baseline_threshold?}
Warm -->|yes| Interp[IrInterpreter.run] --> Apply
Warm -->|no| Compile["tiered_jit.get_or_compile"]
Compile -->|nullptr| Interp
Compile -->|fn| Native["fn(gs, bus, &res, budget)"] --> Apply
Apply[apply exit: advance PC/nPC<br/>or take_exception] --> Loop{ran < quantum?}
Loop -->|yes| Halt
Loop -->|no| Done
Key points, in order:
- The blob is canonical for the whole quantum. gs = state.guest_state() (emulator.cpp:961). The IR/JIT and core::step all operate on these same bytes, so there is no entry/exit sync. The exit comment (emulator.cpp:1185) is explicit: every IR/JIT update is already reflected in CpuState.
- Self-IPI poll. poll_self_interrupt takes a self-directed IPI at the block boundary (hardware-latency self-interrupt) before fetching the next block (emulator.cpp:1007). Because this runs every block boundary but the controller is empty ~99.99% of the time at a 100 Hz tick, Emulator::sample_interrupts first checks the controller's IInterruptController::raw_pending(), a maintained single-word superset of the pending sources. A 0 result proves pending_mask(cpu) == 0 for every cpu, so the poll early-outs before the full per-CPU scan (provably bit-exact; gated to SingleThread). See Performance.
- Clean-boundary gate (emulator.cpp:1025). A block may be translated only at a clean instruction boundary: !annul_next() && !psr_write_pending() && npc == pc + 4. A delay slot, an annulled slot, or SPARC's delayed-WRPSR window is fallback_step-ed until clean. That micro-state lives in CpuState, not the blob, and only the fallback ever sets it.
- Block cache. find(pc, mode); on a miss, translate_block + insert (emulator.cpp:1038). Translation is a pure function of (pc, mode, guest code) and never reads runtime register state, so a cached block is valid for any later execution and is shared across cores (in SingleThread). insn_count == 0 means "untranslatable op here" → fallback. PhysAddr{pc} is sound only while VA==PA (identity-mapped MMU), which holds until SRMMU lands.
- Quantum-exact yield (emulator.cpp:1067). If the next block would cross the quantum (and ran != 0), the remaining quantum - ran instructions run one-by-one through core::step, landing on the exact same boundary the switch path would. This matters under SMP: block-level overshoot drifts the round-robin interleaving point and once livelocked a lock-free migration handshake (smpschedaffinity04). A block larger than the whole quantum still runs whole when ran == 0, guaranteeing forward progress.
- Execute. Native (JIT) when warm and lowerable, else the IR interpreter. Both leave gs updated in place. The interpreter yields an ir::BlockExit; the JIT writes a jit::BlockResult, which the run loop normalises into the same BlockExit shape (emulator.cpp:1122).
- Apply the outcome (emulator.cpp:1141). A memory_fault maps to DataAccessException; an Exception carries the frontend's exit_code. The saved PC is the faulting instruction's; nPC is normally PC+4 but a trap in a control-transfer delay slot saves the branch's resolved target (the delay-trap fixup). Delivery is via architecture.take_exception when PSR.ET, else the core enters error mode (SPARC V8 §7.3). A normal exit advances PC/nPC to the continuation. Sim-time is billed from the instructions actually executed (jit_insns for the JIT, which may span many self-loop iterations, else block->insn_count).
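
The clean-boundary gate and the quantum-exact yield reduce to two small predicates. The sketch below uses hypothetical names (NextStep, clean_boundary, plan_next) and assumes that "crosses the quantum" means ran + insn_count > quantum.

```cpp
#include <cassert>
#include <cstdint>

enum class NextStep { RunBlock, TailStep, FallbackStep };

// A block may start only at a clean instruction boundary.
constexpr bool clean_boundary(bool annul_next, bool psr_write_pending,
                              std::uint32_t pc, std::uint32_t npc) {
    return !annul_next && !psr_write_pending && npc == pc + 4;
}

// Decide what to do with the block found at the current PC.
//   ran          instructions already executed in this quantum
//   quantum      instruction budget of the quantum
//   block_insns  the block's insn_count (0 = untranslatable op here)
constexpr NextStep plan_next(std::uint64_t ran, std::uint64_t quantum,
                             std::uint64_t block_insns) {
    if (block_insns == 0) return NextStep::FallbackStep;
    // Never start a block that would overshoot, unless nothing has run yet:
    // an oversized block still runs whole when ran == 0 (forward progress).
    if (ran != 0 && ran + block_insns > quantum) return NextStep::TailStep;
    return NextStep::RunBlock;
}
```

On TailStep the caller would single-step the remaining quantum - ran instructions and return, so the quantum ends on the same instruction boundary as the switch path.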
What the JIT/IR can't translate → fallback¶
Anything the SPARC frontend cannot yet emit ends the block (insn_count == 0 or
a bail_at fall-through) and the run loop core::steps that PC. The frontend
currently bails on: FP ops, atomics (CASA/LDSTUB/SWAP), alternate-space access,
RETT/Ticc, the cc-setting multiply-step (MULScc) and SpecialReg reads,
LDD/STD (64-bit guest memory in lower_block, ir_jit.cpp:319), and a
delay slot that is itself a CTI or otherwise non-predicable. Annulled conditional
branches (Bicc,a) are lowered when their delay slot is predicable — rd-only,
cc/%Y-writing, a single-word load, or UDIV/SDIV — running the slot on the
taken path and squashing it when not taken; a store, LDD, or control-transfer
delay slot still bails (sparc_frontend.cpp:671).
Delay-slot traps¶
SPARC's branch + delay slot is translated as one block (the branch is the
terminator, the slot is the trailing edge). If the delay slot is a trap-capable
op (Load/Store/Div) a fault there must save the branch's resolved nPC, not
trap_pc + 4 — the straight-line block model would otherwise mis-save it. The
block records this in delay_trap_pc / delay_trap_npc / delay_trap_dynamic:
- Static (delay_trap_dynamic == false; CALL, BA): the saved nPC is the constant delay_trap_npc (the call/branch target).
- Dynamic (true; JMPL, a true conditional Bicc): the runtime nPC is unknown at translate time, so the frontend stores the resolved nPC into the GuestState nPC slot before the delay slot, and the run loop keeps it on a fault (emulator.cpp:1156) rather than overwriting.
A delay-trap block must stay the dispatch entry — build_jit_region refuses to
bury one mid-region (emulator.cpp:1240) so the run loop can read its
delay_trap_* off the entry to fix the saved nPC.
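
The fixup can be pictured as one selection over the block's delay_trap_* fields. This is a sketch under assumptions: ToyDelayTrap and its has_delay_trap flag are hypothetical (the page does not say how "no delay trap" is encoded), and gs_npc stands for the value currently in the GuestState nPC slot.

```cpp
#include <cassert>
#include <cstdint>

// Toy view of the entry block's delay-trap bookkeeping (shape assumed).
struct ToyDelayTrap {
    bool has_delay_trap;          // hypothetical "this block has a fixup" flag
    std::uint32_t delay_trap_pc;  // PC of the trap-capable delay slot
    std::uint32_t delay_trap_npc; // constant target (static case)
    bool delay_trap_dynamic;      // true: target already stored in the nPC slot
};

// nPC to save when a trap is taken at `trap_pc`.
constexpr std::uint32_t saved_npc(const ToyDelayTrap& b, std::uint32_t trap_pc,
                                  std::uint32_t gs_npc) {
    if (!b.has_delay_trap || trap_pc != b.delay_trap_pc)
        return trap_pc + 4;  // straight-line case
    // Trap in the delay slot: save the branch's resolved target instead.
    return b.delay_trap_dynamic ? gs_npc : b.delay_trap_npc;
}
```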
The JIT¶
lince_jit (src/jit/) is the isolated module that owns LLVM. It depends only
on lince_ir (and LLVM); lince_runtime links it PUBLIC unconditionally. It
uses LLVM ORCv2 (LLJIT).
How ORCv2 turns a region into native code¶
ORCv2 is LLVM's on-request compilation API; LLJIT is its turnkey wrapper.
Lince uses it as a black box that takes an LLVM IR module and returns the address
of a compiled function. The lifecycle per region, in IrJit::compile_region
(ir_jit.cpp:666):
- Build an LLVM IR module. A fresh llvm::LLVMContext + llvm::Module hold one execute_region_N function (N from a per-IrJit counter, ir_jit.cpp:680) whose signature is the BlockExecFn ABI below. One LLVM BasicBlock is created per region member, keyed by (entry_pc, mode_ctx); the function entry block holds the shared allocas and branches to member 0.
- Lower. lower_block (ir_jit.cpp:130) walks each member's IrInsts and IrExit and emits LLVM IR with an IRBuilder; it is the only place that knows both the Lince IR and LLVM. It returns false for an op/exit it cannot lower, and compile_region returns ErrorCode::JitError.
- Verify. llvm::verifyModule rejects malformed IR, so a lowering bug returns JitError instead of producing wrong code (ir_jit.cpp:740).
- Add to the JIT. LLJIT::addIRModule(ThreadSafeModule) hands the module to ORCv2. Nothing is compiled yet; ORCv2 is lazy.
- Look up the symbol. LLJIT::lookup("execute_region_N") materialises the symbol: the optional O2 IR transform (Optimised tier only) runs, then IRCompileLayer lowers LLVM IR → machine code via the host backend (instruction selection, register allocation, scheduling), then the object is linked into executable memory. lookup returns the address, cast to BlockExecFn via sym->toPtr<BlockExecFn>() (ir_jit.cpp:754).
The address stays valid for the JIT's lifetime; the TieredJit caches it per
(pc, mode) so steps 1–5 happen once per region, never per execution.
Block ABI¶
// src/jit/include/lince/jit/ir_jit.hpp
enum class BlockStatus : std::uint32_t { Normal = 0, Exception = 1, MemoryFault = 2 };

struct BlockResult {
    std::uint32_t next_pc;    // continuation (Normal)
    std::uint32_t exit_code;  // architectural trap tt (Exception/TrapIf)
    std::uint32_t trap_pc;    // faulting PC (Exception / MemoryFault)
    std::uint32_t status;     // a BlockStatus value
    std::uint32_t insns;      // guest instructions ACTUALLY executed
};

using BlockExecFn = void (*)(void* guest_state, void* bus,
                             BlockResult* out, std::uint32_t budget);
The native function operates in place on the GuestState blob
(gs.bytes().data()) and the ICpuBus, writing its outcome to *out. The field
byte offsets are asserted against the struct (ir_jit.cpp:102) because the
lowering stores them by offset. insns is the count the call billed — with
self-loop/region chaining one call may iterate many times, and on a mid-block
trap it includes the partial final iteration, so the dispatcher bills sim-time
from insns, not the static insn_count. budget is the instruction budget for
chained back-edges (see region chaining).
BlockExit vs BlockResult
The IR interpreter returns ir::BlockExit (interpreter.hpp:30); the JIT
fills jit::BlockResult (ir_jit.hpp:50). They are distinct structs with the
same information — run_ir_quantum normalises the JIT's into a BlockExit so
the apply-outcome code is shared (emulator.cpp:1122).
Lowering details¶
lower_block (ir_jit.cpp:130) emits one block's ops and exit at the builder's
current insert point:
- State. LdState/StState → byte GEPs + align-1 load/store on the blob, host order (matching the interpreter exactly). The size selects an i8/i16/i32 access with zext/trunc as needed.
- The register file is memory, not LLVM registers. Loads/stores hit the blob pointer. The O2 pipeline's mem2reg/SROA passes promote hot guest registers to SSA values (and back) within a region; that is where most of the optimised tier's speed comes from. The baseline tier skips those passes, so its code keeps every guest register in memory.
- ALU / Cmp / Select. Direct LLVM instructions; shift counts masked to 5 bits; UMulHi/SMulHi via i64 ext + mul + shift; division guarded (÷0 → 0, INT_MIN/-1 saturated, never a trapping LLVM sdiv/udiv) to match the interpreter (ir_jit.cpp:240).
- Guest memory. See inline RAM below; the slow path calls the extern "C" helpers (lince_jit_load/lince_jit_store, resolved by ORCv2 absolute symbols at IrJit::create, ir_jit.cpp:644, to functions that funnel through ir::guest_load/guest_store). 64-bit (LDD/STD) sizes bail (ir_jit.cpp:319).
- Faults and traps. A bus error (fault_slot set by the helper) or a TrapIf aborts the block mid-stream via an early ret that records trap_pc/status/insns (ir_jit.cpp:341, :437). The precise-trap "prefix committed, suffix skipped" guarantee falls out of in-order emission + early return, identical to the interpreter. insns on the fault path is acc_load() + partial(pc): completed iterations plus the prefix of the current one.
- Exits. A static/taken target that is in the region chains directly to its member block (budget-guarded); else it returns to the dispatcher. Exception/PowerDown terminators are not lowered: lower_block returns false and the caller falls back.
Inline RAM access¶
Guest loads/stores were originally always an out-of-line helper call. With a RAM
window the JIT inlines big-endian RAM access (sizes 1/2/4) as native host
access (ir_jit.cpp:372):
offset = addr - window.guest_base
if (offset <=u window.size - access_size)   // whole access in-window
    native load/store at host_ptr + offset  // llvm.bswap for 2/4 (LE host)
else
    call the bus helper                     // MMIO / out-of-window / straddling
- The bounds check guards the whole access (offset <=u size - access_size, a compile-time constant), so a straddling access falls to the slow path where the bus latches it as a BusError, exactly as the interpreter does. The frontend's alignment TrapIf already runs before the access, so aligned accesses never straddle, but the check is correct without that guarantee.
- The host pointer comes from SystemBus::ram_view_at(config_.ram_base) (emulator.cpp:471) and is baked as an i64 constant; it is stable because RAM is mapped once at initialize() and never moved. The window is passed at JIT construction (IrJit::create(std::optional<RamWindow>)); with no window (unit tests with a synthetic bus) or for little-endian accesses, the helper path is used unchanged (ir_jit.cpp:364).
- The inline path byte-reverses explicitly with llvm.bswap, so it assumes a little-endian host; a static_assert enforces this (ir_jit.cpp:36; ADR-003 fixes the JIT host to x86-64). This is independent of the guest endianness.
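
The fast-path test can be expressed in plain C++ as below. This is an illustration, not the lowered code: ToyRamWindow and inline_load_be are hypothetical, and the sketch assembles the big-endian value byte by byte (host-endian independent) where the JIT emits a native load plus llvm.bswap.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Toy stand-in for the RAM window handed to the JIT (field names assumed).
struct ToyRamWindow {
    std::uint32_t guest_base;
    std::uint32_t size;
    const std::uint8_t* host_ptr;
};

// Big-endian guest load of 1, 2 or 4 bytes. Returns nullopt when the access
// is not wholly inside the window, meaning "take the bus-helper slow path".
inline std::optional<std::uint32_t> inline_load_be(const ToyRamWindow& w,
                                                   std::uint32_t addr,
                                                   std::uint32_t access_size) {
    // Unsigned subtraction: an address below the base wraps to a huge offset
    // and fails the same single comparison.
    const std::uint32_t offset = addr - w.guest_base;
    if (access_size > w.size || offset > w.size - access_size)
        return std::nullopt;  // MMIO, out-of-window, or straddling
    std::uint32_t value = 0;
    for (std::uint32_t i = 0; i < access_size; ++i)
        value = (value << 8) | w.host_ptr[offset + i];
    return value;
}
```

A single unsigned comparison covers three cases at once (below the window, above it, and straddling its end), which is why the check stays correct without relying on the alignment trap.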
Self-loop chaining¶
A block whose static/taken exit target is its own entry_pc (a tight
backward branch, or a ba/spin park loop) is lowered as a native loop instead of
returning to the dispatcher each iteration — removing the per-iteration indirect
call + two hash lookups + BlockResult marshalling.
- The function entry holds the loop-invariant allocas (the fault_slot flag and the pre-zeroed instruction accumulator acc_slot); the self-edge is a budget-guarded br back to the member body. The body always runs at least once (forward progress).
- Loop-carried state is the blob (memory) and the accumulator; block-local temps are recomputed each iteration, so LLVM's mem2reg/LICM clean it up with no hand-written PHIs beyond the inline-RAM load merge.
- BlockResult::insns reports the real count across iterations; a mid-loop fault reports completed iterations + the partial final one, so sim-time stays exact.
- Self-loops are mode-safe by construction: a block can only target its own entry, so it re-runs under the mode it was translated for.
The self-loop is the degenerate single-member case of a region.
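
The control shape of a chained self-loop, written as ordinary C++ rather than LLVM IR. run_self_loop and LoopOutcome are hypothetical names, and the back-edge guard is assumed to be the same "whole next iteration must fit" rule that the region chaining section states for static targets.

```cpp
#include <cassert>
#include <cstdint>

struct LoopOutcome {
    std::uint32_t insns;       // what BlockResult::insns would report
    std::uint32_t iterations;  // times the body ran
};

// `body()` executes one iteration of the block against the guest state and
// returns true when its exit targets the block's own entry_pc again.
template <typename Body>
LoopOutcome run_self_loop(std::uint32_t member_insns, std::uint32_t budget,
                          Body&& body) {
    std::uint32_t acc = 0;  // the pre-zeroed instruction accumulator
    std::uint32_t iterations = 0;
    for (;;) {
        const bool loops_again = body();  // the body always runs at least once
        acc += member_insns;
        ++iterations;
        // Take the back-edge only if the whole next iteration fits the budget.
        if (!loops_again || acc + member_insns > budget) break;
    }
    return {acc, iterations};
}
```

Everything loop-carried lives either in the guest state that body mutates or in acc, mirroring the point that block-local temps are recomputed on each iteration.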
Region chaining¶
A region fuses an entry block with its same-mode successors into one native
function. Emulator::build_jit_region (emulator.cpp:1190) discovers the region
by BFS from the entry:
- It follows StaticBranch/CondBranch edges (and the StaticDelta fallthrough), decoding each successor via frontend.translate_block without touching ir_cache_ (so the caller's block pointer into the cache stays valid, emulator.cpp:1232).
- For a CALL (is_call) it also pulls in the return block (the sequential PC after the call+delay slot) so the callee's IndirectBranch return can chain back into native code (emulator.cpp:1274).
- It stops at jit_max_region_blocks (default 8), at an untranslatable target (insn_count == 0), at a delay-trap block (must stay the entry), at a Dynamic mode change (unknown post-change CWP), or when a GDB stub is attached (no fusing, emulator.cpp:1201).
IrJit::compile_region (ir_jit.cpp:666) lowers the members into one function.
resolve(target_pc, mode) maps an exit target to a member's LLVM block:
- An in-region static/taken/fall-through target becomes a direct br, budget-guarded (take_static, ir_jit.cpp:500): chain only if the whole target fits the remaining budget (acc_now + member_insns <=u budget), the same don't-start-an-overshooting-block rule the dispatcher applies. The weaker acc_now < budget would overshoot and reintroduce the SMP interleaving drift.
- An IndirectBranch (JMPL/return) target is a runtime value, so it gets an inline cache (ir_jit.cpp:553): compare it against each in-region member at the successor mode; on a hit (with budget) chain to that body, else fall through to a dispatcher return. This is what lets CALL/return pairs run wholly in native code.
- An out-of-region or no-hit target returns to the dispatcher.
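
The three outcomes can be sketched as a single lookup over the region members. Member and resolve_chain are hypothetical names; in the generated code the comparison chain is emitted inline rather than run as a loop, so this only models the decision.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

struct Member {
    std::uint32_t pc;     // entry_pc
    std::uint32_t mode;   // mode_ctx the member was translated under
    std::uint32_t insns;  // insn_count
};

// Returns the index of the member to chain to, or nullopt to return to the
// dispatcher (target not in the region, wrong mode, or not enough budget).
inline std::optional<std::size_t> resolve_chain(const std::vector<Member>& region,
                                                std::uint32_t target_pc,
                                                std::uint32_t mode,
                                                std::uint32_t acc_now,
                                                std::uint32_t budget) {
    for (std::size_t i = 0; i < region.size(); ++i) {
        const Member& m = region[i];
        if (m.pc == target_pc && m.mode == mode) {
            // Chain only if the WHOLE target block fits the remaining budget.
            if (acc_now + m.insns <= budget) return i;
            return std::nullopt;
        }
    }
    return std::nullopt;
}
```

For a static exit, target_pc is known at compile time and the lookup collapses to one direct branch; for an IndirectBranch it is the runtime value, which is what makes this an inline cache.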
mode_change_kind is the safety flag: the SPARC frontend sets StaticDelta on
SAVE/RESTORE (their CWP rewrite makes the successor a different mode_ctx). The
region builder chains across a StaticDelta block under exit_mode_ctx but never
across a Dynamic block, and resolve keys members by (pc, mode) — so a member
is only ever re-entered under the mode it was translated for. If a member turns
out to be unlowerable, compile_with_fallback (tiered_jit.cpp:37) retries with
the entry block alone, so a region never does worse than single-block.
Tiered compilation (ADR-002)¶
flowchart TD
Miss["run_ir_quantum: cache miss / cold block"] --> Warm{"exec_count <<br/>jit_baseline_threshold (32)?"}
Warm -->|yes| Interp["IrInterpreter.run<br/>(++exec_count)"]
Warm -->|no| GOC["TieredJit.get_or_compile"]
GOC -->|"first request"| Base["compile_with_fallback<br/>(Baseline O0, caller thread)"]
Base -->|nullptr| InterpFb["interpreter fallback<br/>(verdict cached)"]
Base -->|fn| Run["run native (Baseline)"]
GOC -->|"opt published"| RunO["run native (Optimised)"]
Run --> Note["note_execution"]
Note --> Cross{"exec_count == 100?"}
Cross -->|yes| Enq["enqueue on background thread"]
Enq --> BG["worker: compile_with_fallback<br/>(Optimised O2)"]
BG --> Pub["optimised.store(fn, release)"]
Pub -.->|"next get_or_compile"| RunO
IrJit takes an OptLevel (ir_jit.hpp:81). Baseline builds its LLJIT
with CodeGenOptLevel::None (fast instruction selection, no IR passes) — cheap to
produce. Optimised uses CodeGenOptLevel::Aggressive plus the full LLVM O2 IR
pipeline (optimize_module_o2, ir_jit.cpp:76), installed as an
IRTransformLayer transform that runs during materialisation
(ir_jit.cpp:631).
TieredJit (src/jit/src/tiered_jit.cpp) drives two IrJits and owns the
(pc, mode) cache (CacheEntry, tiered_jit.cpp:65):
- On get_or_compile it compiles the Baseline immediately on the calling thread, so a warm block runs at once (no compile stall on the hot path).
- note_execution counts runs; crossing jit_promotion_threshold (default 100) enqueues the block on a single background thread (worker_loop, tiered_jit.cpp:111) that compiles the Optimised tier and publishes the function pointer atomically (optimised.store(fn, std::memory_order_release)). The dispatcher prefers the optimised pointer once present (get_or_compile, acquire-load, tiered_jit.cpp:179).
- Both tiers lower the identical IR, so they are semantically interchangeable. An unlowerable block caches a nullptr verdict (baseline_tried, tiered_jit.cpp:183) and is never re-attempted or promoted.
- A compile budget (MaxPendingCompiles = 64, tiered_jit.cpp:30) bounds the Optimised queue: a burst of promotions drops the oldest request (its block stays on Baseline) rather than growing without bound.
- The CacheEntry keeps the JIT's own copy of the region's IR (entry.region), because the dispatcher's block cache is direct-mapped and evicts, and the background thread must not alias it. std::unordered_map node-pointer stability is what lets the worker hold a raw CacheEntry* while the caller keeps inserting (tiered_jit.cpp:62).
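
The publish/prefer handshake between the worker and the dispatcher can be sketched with a single atomic pointer. ToyCacheEntry, pick, and this free-function note_execution are simplified stand-ins; only the release store and acquire load mirror what the text describes.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

using ToyFn = int (*)();  // stand-in for BlockExecFn

struct ToyCacheEntry {
    ToyFn baseline = nullptr;               // written by the dispatcher thread
    std::atomic<ToyFn> optimised{nullptr};  // published by the worker thread
    std::uint32_t exec_count = 0;
};

// Dispatcher side: prefer the optimised pointer once it has been published.
inline ToyFn pick(const ToyCacheEntry& e) {
    const ToyFn opt = e.optimised.load(std::memory_order_acquire);
    return opt != nullptr ? opt : e.baseline;
}

// Returns true exactly once, on the run that crosses the promotion threshold;
// the caller would then enqueue the entry for the background worker.
inline bool note_execution(ToyCacheEntry& e, std::uint32_t promotion_threshold) {
    return ++e.exec_count == promotion_threshold;
}

// Two dummy "compiled" functions so the tiers can be told apart.
inline int baseline_code() { return 1; }
inline int optimised_code() { return 2; }
```

The release store pairs with the acquire load, so a dispatcher that observes the non-null pointer also observes everything the worker wrote before publishing it.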
Interpret-first tiering¶
Before any compilation, the dispatcher runs a cold block on the IR interpreter
until it proves hot. run_ir_quantum checks block->exec_count <
config_.jit_baseline_threshold (default 32, emulator.cpp:1104): below the
threshold it runs the interpreter and ++exec_count; at the threshold it calls
get_or_compile. Cold / run-few blocks (the bulk of boot and varied control
flow) never pay LLVM compile latency they would not amortise. The IR interpreter
is the lockstep oracle, so warming on it changes no semantics — only which
validated executor runs the block. jit_baseline_threshold == 0 restores the
legacy compile-on-first-sight behaviour.
Runtime knobs (EmulatorConfig, not build flags): jit_baseline_threshold,
jit_promotion_threshold, jit_background_opt (false → Baseline-only, no
background thread), jit_max_region_blocks (region size cap).
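The interpret-first check reads naturally as a small predicate. The sketch below is an assumption-labelled illustration of the rule stated above (interpret and count while exec_count is below the threshold, otherwise hand the block to the JIT); `BlockSketch` and `interpret_first` are hypothetical names, not the code in run_ir_quantum.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a cached IR block; only the counter matters here.
struct BlockSketch {
    std::uint32_t exec_count = 0;
};

// Returns true when this run should go to the IR interpreter (and counts it),
// false when the block has reached the threshold and should be compiled.
inline bool interpret_first(BlockSketch& block,
                            std::uint32_t jit_baseline_threshold) {
    if (block.exec_count < jit_baseline_threshold) {
        ++block.exec_count;
        return true;
    }
    return false;
}
```

With the default threshold of 32 the first 32 runs of a block are interpreted and the 33rd reaches the JIT; with a threshold of 0 the predicate is false on the first run, which is the compile-on-first-sight behaviour.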
Self-modifying code and the code flush¶
A cached block is invalidated when the guest changes the code under it:
- FLUSH. SPARC FLUSH is untranslatable, so it always retires via core::step. The fallback checks state.consume_code_flush() and calls request_code_flush (emulator.cpp:973), which clears every ir_cache_, every tiered_jit_ cache, and every per-core decode cache (flush_code_caches, emulator.cpp:912). Under MultiThread the flush is deferred to the serial round boundary (code_flush_pending_ latch, emulator.cpp:927) where all workers are parked.
- load_elf / reset / write_physical. These also clear the IR + JIT caches (emulator.cpp:525, :1632) since they may overwrite cached code. BlockCache::invalidate(pc) is the finer hook for a single written word; it drops the matching slot.
TieredJit::clear drains the background worker first (so no in-flight compile
touches a cleared entry), then clears the cache map; the LLJIT-resident code is
left orphaned in place — reloads are rare, matching the IR caches' policy
(tiered_jit.cpp:232).
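The deferral of a guest-requested flush under MultiThread can be pictured as a one-bit latch that is only consumed where no worker is running. The following is a sketch under that assumption; `CodeFlushSketch` and its method names are illustrative, and the callback stands in for whatever flush_code_caches actually clears.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <utility>

class CodeFlushSketch {
public:
    CodeFlushSketch(bool multi_thread, std::function<void()> flush_caches)
        : multi_thread_(multi_thread), flush_caches_(std::move(flush_caches)) {}

    // Called from the fallback path after a guest FLUSH retires.
    void request_code_flush() {
        if (multi_thread_) {
            pending_.store(true);  // defer: other cores may be mid-block
        } else {
            flush_caches_();       // single host thread: flush immediately
        }
    }

    // Called at the serial round boundary, where all workers are parked.
    void serial_round_boundary() {
        if (pending_.exchange(false)) flush_caches_();
    }

private:
    bool multi_thread_;
    std::function<void()> flush_caches_;
    std::atomic<bool> pending_{false};
};
```

Several requests within one round collapse into a single flush in this sketch, because the latch is a flag, not a counter.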
Multi-thread: per-core caches¶
Under ExecutionMode::MultiThread (Phase 13) each simulated core runs on its own
host thread, so the per-core IR caches, interpreters, and JITs must not be shared
(the BlockCache slots, IrInterpreter scratch, and TieredJit cache map are
not thread-safe). initialize() allocates n_caches = MultiThread ? num_cores :
1 of each (emulator.cpp:454), and ir_cache_for / tiered_jit_for /
ir_interp_for (emulator.cpp:938) index by core_idx under MT, by 0
otherwise. Each per-core TieredJit keeps its own background O2 thread
(Phase 14 P14-2). The optimiser is bursty (it compiles each hot block once at
warmup, then idles), so N threads are a transient startup spike rather than
sustained load, and they recover the per-core O2 throughput that Baseline-only
MT had sacrificed.
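The allocate-N-or-1 rule and the index-by-core_idx-or-0 accessors can be sketched generically. This is a simplified illustration, assuming a two-valued mode enum; `ModeSketch` and `PerCoreSketch` are hypothetical names, and the real accessors are separate functions per cache type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical two-valued stand-in for the execution mode.
enum class ModeSketch { SingleThread, MultiThread };

template <typename Cache>
class PerCoreSketch {
public:
    PerCoreSketch(ModeSketch mode, std::size_t num_cores)
        : multi_thread_(mode == ModeSketch::MultiThread),
          caches_(multi_thread_ ? num_cores : 1) {}  // n_caches of each

    // Under MultiThread every core gets its own instance; otherwise all
    // cores share slot 0, because only one host thread ever touches it.
    Cache& for_core(std::size_t core_idx) {
        return caches_[multi_thread_ ? core_idx : 0];
    }

    std::size_t size() const { return caches_.size(); }

private:
    bool multi_thread_;
    std::vector<Cache> caches_;
};
```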
Validation strategy¶
Correctness is established by lockstep, layered so each increment lands on a proven base:
- IR interpreter vs reference: the IR diff harness (tests/integration/test_ir_diff_lockstep.cpp + tests/support/ir_diff_harness.hpp) runs two Emulators (IR-driven vs core::step) and memcmps the full GuestState blob after every block across a real RTEMS boot (≈140 k instructions, byte-identical).
- JIT vs interpreter: block-level lockstep (tests/unit/test_ir_jit.cpp), in which a block run through the JIT and through IrInterpreter leaves identical blob + guest memory.
- JIT vs reference: run-loop lockstep (tests/integration/test_jit_run_lockstep.cpp, mirrored by the in-tree oracle harness at emulator.cpp:1900+), with the JIT engine vs core::step single-stepping, byte-identical across RTEMS boot with self-loop chaining and inline RAM active.
On top of that: the full sptest suite under each mode (CSV diff for PASS→FAIL
regressions), SMP smptests under Switch/IR/JIT (N=2 and N=4), and
ASan/UBSan/LSan on the isolated lince_jit_tests exe.
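The lockstep idea behind all three layers is the same: advance two executors by one unit and byte-compare their state. A minimal sketch follows, assuming the guest state can be treated as a flat byte blob; `Blob` and `first_divergence` are hypothetical names and the step callbacks stand in for the real executors.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <functional>
#include <optional>
#include <vector>

using Blob = std::vector<unsigned char>;  // stand-in for the guest-state blob

// Advances two executors in lockstep and byte-compares their state after
// every step. Returns the index of the first diverging step, or nullopt if
// all `steps` steps matched.
inline std::optional<std::size_t> first_divergence(
    Blob a, Blob b,
    const std::function<void(Blob&)>& step_a,
    const std::function<void(Blob&)>& step_b,
    std::size_t steps) {
    for (std::size_t i = 0; i < steps; ++i) {
        step_a(a);
        step_b(b);
        const bool same =
            a.size() == b.size() &&
            (a.empty() || std::memcmp(a.data(), b.data(), a.size()) == 0);
        if (!same) return i;
    }
    return std::nullopt;
}
```

Comparing after every step, not only at the end, is what localises a bug to the first block that diverged.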
TSan
libtsan is absent on the development host, so the background-thread JIT and
per-core MT scaffolding are validated under ASan+UBSan+LSan only.
Performance¶
Single-core, cpubound-mix, x86-64 host (median of 5):
| Mode | MIPS | vs Switch |
|---|---|---|
| Switch (interpreter, reference) | 104 | 1.0× |
| IR interpreted (translation on, JIT fallback) | 49 | 0.47× |
| translation — 12.3 lowering | 347 | 3.3× |
| translation — + 12.4a self-loop chaining | 880 | 8.5× |
| translation — + 12.5 inline RAM | 1962 | 18.9× |
| translation — + tiered + 12.4b region (closeout) | ~2070 | ~20× |
The IR-interpreted mode is slower than the Switch interpreter by design (translate-then-interpret indirection with no native payoff); the IR earns its keep only once JIT-compiled — which is why it is only ever the JIT's fallback, never a mode a user selects.
Steady-state cpubound-mix is dominated by a hot loop the self-loop path already
captured, so the tier and region chaining leave it ~flat. Their win is on the
cold path — where the old single-tier paid full O2 codegen for every block,
most of which run only a handful of times:
| Cold-path workload | single-tier O2 | tiered (+ interpret-first + region) |
|---|---|---|
| RTEMS hello boot (wall) | 4.5 s | 1.0 s |
| sptest suite (190 ELFs, translation, wall) | ~15 min | 6.3 min |
(The removed Threaded prototype peaked at ~143 MIPS / 1.4× over Switch before
the IR JIT superseded it — see Decision 59.)
Build and test¶
# Standard build — LLVM (>= 18) is mandatory; the JIT is in every build.
cmake -S . -B build -G Ninja
cmake --build build -j
# JIT unit tests, sanitiser-clean (links only lince_jit, not lince_core,
# whose musttail chain is incompatible with ASan)
./build/tests/lince_jit_tests
# JIT lockstep + RTEMS boots, excluding the slow sptest suite
./build/tests/lince_tests "[jit]~[sptests]"
# Compare the Switch interpreter against the default translation path
./build/bench/lince-bench --workload cpubound-mix --runs 5 # translation (default)
./build/bench/lince-bench --workload cpubound-mix --no-translate --runs 5 # Switch interpreter
The full sptest suite under the translation path is compile-bound (cold LLVM
codegen per region) and takes minutes — run it without a timeout wrapper.
Module boundaries¶
interfaces ← ir ← arch_sparc ──┐
↑ ├→ runtime ← app
jit (LLVM) ─────────┘ (links jit PUBLIC, always)
core, bus, peripherals ─────────┘
lince_ir depends only on lince_interfaces (not lince_core) — it works on
the opaque GuestState. lince_jit depends only on lince_ir + LLVM, kept
isolated so the JIT tests can build under sanitisers without lince_core, whose
musttail chain is incompatible with ASan. The SPARC frontend
(lince_arch_sparc) bridges core and IR: it reuses core::decode and the
core::layout offsets, but emits arch-neutral IR.
Remaining work¶
Phase 12 is structurally complete (tiered JIT + region chaining + interpret-first landed). What is left is optimisation and the multi-arch payoff, not core functionality:
- Perf tuning + a long soak: promotion-threshold sweep, hot-helper inlining, a sustained-divergence stress run. Gated on a concrete 1:1 target rather than the testsuite.
- FP / atomic / alternate-space lowering: these still bail to core::step; lowering them would shrink the fallback surface (correctness is already preserved by the fallback).
- An ARM frontend: the payoff that confirms the arch-neutral seam. See Adding a frontend for the procedure.
See also¶
- Execution model: the core::step reference cycle and the decode cache the IR path falls back to.
- Adding a frontend: the IArchitecture / IArchFrontend seam and the step-by-step to add a new ISA.
- decisions.md: decisions 49–57 (IR neutrality, runtime dispatch) and 59 (threaded-code removal) with rejected alternatives.