Arch-neutral IR and the LLVM JIT

Merged to main — LLVM is mandatory

The IR engine and the LLVM JIT are part of every build. LLVM (≥ 18) is a mandatory dependency — there is no LINCE_ENABLE_JIT option and no LLVM-free configuration. The execution method is chosen at runtime by EmulatorConfig::translation (bool, default true), not at build time, so the public API and the SMP2-facing contract are unchanged: the Switch interpreter is always one config field away.

This page is the reference for Lince's second execution path: an architecture-neutral basic-block IR and an LLVM JIT that lowers it to native code. It documents every load-bearing type, the SPARC → IR → LLVM → native pipeline, the tiered-JIT promotion flow, and the corner cases (precise traps, delay slots, self-modifying code, GDB, multi-thread). For the single-instruction core::step cycle that remains the reference oracle and the fallback, see Execution model. To add a new guest ISA, see Adding a frontend.

Why an IR

Two goals converge on the same seam:

  • Multi-architecture. A new guest ISA (e.g. ARM) should be a new frontend that decodes guest bytes into the IR — not a new core. The IR carries no guest-ISA knowledge, so the IR interpreter and the JIT backend are shared across architectures. SPARC register windows and ARM banking never reach the IR; they are the frontend's choice of byte offsets into an opaque guest-state blob.
  • Performance. The IR is the unit the JIT compiles. Polymorphism sits at the coarse boundary (translate_block / take_exception / mode_ctx_of — once per block or quantum), never per instruction, so it adds no per-instruction virtual dispatch on the hot path.

The design is frozen in plans/phase11-arch-neutral-ir.md; the JIT roadmap and progress are in plans/post-mvp-1to1-roadmap.md (ADR-002 = tiered JIT, ADR-003 = x86-64 host).

Design decisions

Six choices make the IR architecture-neutral. Each is a numbered entry in decisions.md with its rejected alternative; the table is the index.

| # | Decision | Consequence |
|---|---|---|
| 49 | Guest state is an opaque byte blob; ops touch it only via LdState/StState at (offset, size) | register names, windows, banking are a frontend offset choice, invisible to the IR |
| 50 | The block-cache key is (PhysAddr, ModeCtx); mode-changing instructions are block terminators | within a block the mode is constant → every mode-dependent offset resolves at translate time, no runtime indexing |
| 51 | Endianness is an attribute of LdGuest/StGuest, not the bus | one IR serves big-endian SPARC and little-endian ARM; the swap is in the op/lowered code |
| 52 | No flags register; condition codes are explicit guest-state writes | SPARC icc and ARM CPSR differ; lazy-flag evaluation stays a per-frontend optimisation |
| 53 | Atomics (CASA/LDSTUB/SWAP, later LDREX/STREX) are block boundaries; ordering is TSO | atomicity is preserved even when the JIT adds mid-region exits; TSO is free on the x86-64 host (ADR-003) |
|  | Control flow exits a block with a reason (ExitKind); the arch delivers it (take_exception) | the shared loop dispatches on the reason; trap-vector/priority stays in arch code |

Execution methods

The execution method is the runtime field EmulatorConfig::translation (bool, default true, src/runtime/include/lince/runtime/emulator_config.hpp:179), not a build flag — both paths are compiled into every build and selected per Emulator. There is no exposed dispatch enum and no LINCE_ENABLE_JIT.

| translation | What runs | Notes |
|---|---|---|
| false | naive core::step switch | the reference path and correctness oracle |
| true (default) | arch-neutral IR, JIT-compiled, with the IR interpreter as fallback | native blocks/regions; any block the JIT cannot lower runs interpreted |

When translation is true the tiered JIT compiles each hot region and the IR interpreter executes everything else, so the architectural result is identical either way — the JIT is a speed layer over the interpreter, never a behaviour change. Emulator::run_ir_quantum (src/runtime/src/emulator.cpp:950) drives this path block-at-a-time.

GDB and observers

A per-instruction observer (instruction trace) forces the Switch path, because block-at-a-time execution has no per-instruction hook; this is guarded by config_.translation && !observer_ at emulator.cpp:622. A GDB stub does not force it: run_ir_quantum is breakpoint-aware. With a stub attached it runs native until a block boundary, calls gdb_stub_->should_break at each entry, and single-steps (via fallback_step → core::step) through any block that holds an interior breakpoint so it stops on the exact PC (block_has_breakpoint, emulator.cpp:986). While a stub is attached build_jit_region stops fusing blocks (emulator.cpp:1201), keeping that interior check exact. Emulator::single_step (emulator.cpp:542) is always core::step: per-instruction and method-independent.

The IR data model

lince_ir (src/ir/) defines the whole IR and depends only on lince_interfaces (ICpuBus, types.hpp). It has no dependency on lince_core: it operates on the opaque GuestState blob, never on core::CpuState.

IrInst — one operation

IrInst (src/ir/include/lince/ir/ir.hpp:79) is a fixed-size POD: an Op, a byte size (1/2/4/8), a MemEndian, up to four block-local Temp operands (dst, a, b, c), a 64-bit imm, and the originating guest pc. The pc is stamped on the trapping ops (LdGuest/StGuest/TrapIf) so a mid-block fault reports the exact trap PC; it slots into existing padding, so the struct does not grow.

The Op enum (ir.hpp:38) is the full op set as it stands today:

| Family | Ops | Notes |
|---|---|---|
| Values | Const | imm → dst |
| Register file | LdState, StState | guest_state[off .. off+size], host order |
| Guest memory | LdGuest, StGuest | {size, endian}-tagged bus access |
| Integer ALU | Add Sub And Or Xor Shl Shr Sar Mul | 32-bit; shift count masked to 5 bits |
| Multiply-high | UMulHi, SMulHi | high 32 bits of the 64-bit (un)signed product (SPARC writes %Y); Mul is the low 32 |
| Divide | UDiv, SDiv | 32/32; ÷0 → 0 (frontend raises the trap separately) |
| 64-bit divide | UDiv64, SDiv64 | ternary SPARC UDIV/SDIV: a=%Y(high), b=dividend(low), c=divisor; quotient of (a<<32\|b)/c, saturated to U/INT32 MAX/MIN |
| Compare | CmpEq, CmpLtU, CmpLtS | dst = 0/1 |
| Select | Select | dst = a ? b : c |
| Trap | TrapIf | if a != 0: abort the block with exception imm at pc |
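
The 64-bit divide row can be read as plain integer arithmetic. The sketch below is a hypothetical reference model of UDiv64/SDiv64 as the table describes them; the helper names udiv64_sat and sdiv64_sat are illustrative and are not Lince functions. It assumes a two's-complement conversion from unsigned to signed, which C++20 guarantees.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical model of UDiv64: a = %Y (high word), b = low word, c = divisor.
inline std::uint32_t udiv64_sat(std::uint32_t y, std::uint32_t lo, std::uint32_t divisor) {
    if (divisor == 0) return 0;  // IR rule: /0 yields 0; the frontend raises the trap separately
    const std::uint64_t dividend = (std::uint64_t{y} << 32) | lo;
    const std::uint64_t q = dividend / divisor;
    const std::uint64_t max = std::numeric_limits<std::uint32_t>::max();
    return static_cast<std::uint32_t>(q > max ? max : q);  // saturate on 32-bit overflow
}

// Hypothetical model of SDiv64: same operands, signed quotient saturated to the int32 range.
inline std::int32_t sdiv64_sat(std::uint32_t y, std::uint32_t lo, std::int32_t divisor) {
    constexpr std::int64_t kMax = std::numeric_limits<std::int32_t>::max();
    constexpr std::int64_t kMin = std::numeric_limits<std::int32_t>::min();
    if (divisor == 0) return 0;
    // Two's-complement reinterpretation of the 64-bit dividend (guaranteed since C++20).
    const auto dividend = static_cast<std::int64_t>((std::uint64_t{y} << 32) | lo);
    // Avoid the one 64-bit case that would overflow the host division itself.
    if (dividend == std::numeric_limits<std::int64_t>::min() && divisor == -1) {
        return static_cast<std::int32_t>(kMax);
    }
    const std::int64_t q = dividend / divisor;
    if (q > kMax) return static_cast<std::int32_t>(kMax);
    if (q < kMin) return static_cast<std::int32_t>(kMin);
    return static_cast<std::int32_t>(q);
}
```

With %Y = 1 and a low word of 0, dividing by 1 does not fit in 32 bits, so the unsigned model saturates instead of wrapping.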

Temp (ir.hpp:21) is a block-local SSA-free value: every value-producing op writes a fresh Temp and they all die at the block boundary. Cross-instruction state flows through the GuestState blob, never through temps.

IrExit — the structured terminator

Control flow is not an op — a block ends with one IrExit (ir.hpp:94). Conditional branches need both targets plus a condition temp, which is why the terminator is a struct rather than an op:

| ExitKind | Fields used | Meaning |
|---|---|---|
| FallThrough / StaticBranch | static_target (+ is_call) | continue at a fixed PC |
| CondBranch | cond, static_target, fallthrough_target | cond ? target : fallthrough |
| IndirectBranch | dyn_target (Temp) | computed target (JMPL / return) |
| Exception | exit_code | deliver an architectural trap |
| PowerDown |  | core halted pending interrupt |

is_call (ir.hpp:104) marks a StaticBranch that is a CALL: the callee runs at static_target and the return point is the sequential next PC. It lets the region builder pull in the return block so the callee's return can chain back (see region chaining).

IrBlock — a translated basic block

IrBlock (ir.hpp:117) is a class, not a bare struct: it carries the data and the builder helpers the frontend composes. Frontends never hand-fill an IrInst; they call the emit helpers, which allocate temps and append ops:

| Helper | Emits | Notes |
|---|---|---|
| emit_const(imm) → Temp | Const | block-local value |
| emit_ld_state(off, size) → Temp | LdState | read guest reg (host order) |
| emit_st_state(off, size, src) | StState | write guest reg |
| emit_ld_guest(addr, size, endian) → Temp | LdGuest | guest memory load |
| emit_st_guest(addr, val, size, endian) | StGuest | guest memory store |
| emit_binary(Op, a, b) → Temp | ALU/Cmp | the two-operand families above |
| emit_select(cond, t, f) → Temp | Select | cond ? t : f |
| emit_ternary(Op, a, b, c) → Temp | UDiv64/SDiv64 | three-operand divide |
| emit_trap_if(cond, code) | TrapIf | mid-block conditional trap at cur_pc |
| set_cur_pc(pc) |  | stamp the PC onto subsequent trapping ops |
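
To make the builder style concrete, here is a minimal sketch of how a frontend might compose the helpers for a single `add %g1, 4, %g2`. ToyInst and ToyBlock are reduced stand-ins written for this example, not the real IrInst/IrBlock, and the assumption that global %gN lives at blob offset 4 * N is taken from the layout described later on this page.

```cpp
#include <cstdint>
#include <vector>

// Toy stand-ins for the IR types; the field set is deliberately reduced.
enum class Op : std::uint8_t { Const, LdState, StState, Add };
using Temp = std::uint16_t;

struct ToyInst {
    Op op;
    std::uint8_t size;   // access width in bytes
    Temp dst, a, b;      // block-local temps
    std::uint64_t imm;   // constant value or blob offset
};

class ToyBlock {
public:
    Temp emit_const(std::uint64_t imm) {
        const Temp t = fresh();
        insts_.push_back({Op::Const, 4, t, 0, 0, imm});
        return t;
    }
    Temp emit_ld_state(std::uint32_t off, std::uint8_t size) {
        const Temp t = fresh();
        insts_.push_back({Op::LdState, size, t, 0, 0, off});
        return t;
    }
    void emit_st_state(std::uint32_t off, std::uint8_t size, Temp src) {
        insts_.push_back({Op::StState, size, 0, src, 0, off});
    }
    Temp emit_binary(Op op, Temp a, Temp b) {
        const Temp t = fresh();
        insts_.push_back({op, 4, t, a, b, 0});
        return t;
    }
    const std::vector<ToyInst>& insts() const { return insts_; }

private:
    Temp fresh() { return next_temp_++; }  // every value-producing op gets a fresh temp
    Temp next_temp_ = 1;                   // 0 is used here as "no temp"
    std::vector<ToyInst> insts_;
};

// "add %g1, 4, %g2", assuming 4-byte globals at blob offset 4 * n.
inline ToyBlock translate_add_imm() {
    ToyBlock b;
    const Temp g1 = b.emit_ld_state(4, 4);
    const Temp four = b.emit_const(4);
    const Temp sum = b.emit_binary(Op::Add, g1, four);
    b.emit_st_state(8, 4, sum);
    return b;
}
```

The frontend never fills a ToyInst by hand; it only chains the temps the helpers return, which is the property the real emit helpers are described as enforcing.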

The block's identity and bookkeeping fields:

  • entry_pc + mode_ctx — the block-cache key.
  • insn_count — guest instructions covered; the run loop bills sim-time from it.
  • mode_change_kind (ModeChangeKind, ir.hpp:112) — how the block leaves its mode context: None (the common case), StaticDelta (a translate-time constant CWP shift — SPARC SAVE/RESTORE), or Dynamic (RETT/WRPSR land on an unknown CWP/PSR). A region compiler may chain across a StaticDelta block but never across a Dynamic one.
  • exit_mode_ctx — the mode context the static successor runs under; meaningful only when mode_change_kind == StaticDelta.
  • exec_count (ir.hpp:141) — the interpret-first warmup counter (dispatcher runtime state, not part of the translation); see interpret-first tiering.
  • delay_trap_pc / delay_trap_npc / delay_trap_dynamic (ir.hpp:156) — the delay-slot trap fixup; see delay-slot traps.

GuestState — the opaque blob

GuestState (src/ir/include/lince/ir/guest_state.hpp:20) is an opaque byte-addressed std::vector<std::byte>. The frontend picks offsets; the IR only does load(offset, size) / store(offset, size, value) in host order. load/store are inlined (a literal size folds to a single aligned access) — they are the hot path for every register/PC access, profiled at ~28% of real-code runtime when out-of-line.
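
A minimal sketch of such a blob, assuming nothing beyond what the paragraph states: a byte vector with host-order load/store at (offset, size). ToyGuestState and its private helpers are hypothetical names; the real class may differ in bounds handling and inlining details.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

class ToyGuestState {
public:
    explicit ToyGuestState(std::size_t size) : bytes_(size) {}  // zero-initialised blob

    // Host-order load of 1, 2, 4 or 8 bytes, zero-extended to 64 bits.
    std::uint64_t load(std::size_t off, std::size_t size) const {
        switch (size) {
            case 1: return load_as<std::uint8_t>(off);
            case 2: return load_as<std::uint16_t>(off);
            case 4: return load_as<std::uint32_t>(off);
            default: return load_as<std::uint64_t>(off);
        }
    }

    // Host-order store; the value is truncated to the access width.
    void store(std::size_t off, std::size_t size, std::uint64_t value) {
        switch (size) {
            case 1: store_as<std::uint8_t>(off, value); break;
            case 2: store_as<std::uint16_t>(off, value); break;
            case 4: store_as<std::uint32_t>(off, value); break;
            default: store_as<std::uint64_t>(off, value); break;
        }
    }

private:
    template <class T>
    T load_as(std::size_t off) const {
        T v;
        std::memcpy(&v, bytes_.data() + off, sizeof(T));  // compilers fold this to one load
        return v;
    }
    template <class T>
    void store_as(std::size_t off, std::uint64_t value) {
        const T v = static_cast<T>(value);
        std::memcpy(bytes_.data() + off, &v, sizeof(T));
    }

    std::vector<std::byte> bytes_;
};
```

When the size argument is a literal at the call site, the switch disappears after inlining, which is the effect the paragraph attributes to keeping load/store inline.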

State unification — CpuState is a GuestState blob

There is no CpuState ↔ GuestState sync. core::CpuState embeds an ir::GuestState int_state_{layout::StateSize} and exposes it directly via CpuState::guest_state() (src/core/include/lince/core/cpu_state.hpp:302). The SPARC integer-state layout (core::layout, cpu_state.hpp:40) is the single canonical representation: the SPARC frontend emits LdState/StState against exactly these offsets, and core::step reads/writes the same bytes, so the reference interpreter, the IR interpreter, and the JIT agree register-for-register with zero copy.

core::layout (the SPARC integer blob, cpu_state.hpp)
  [0]    8 globals %g0..%g7                       (GlobalsBase)
  [32]   NumWindows*16 windowed slots (8 outs + 8 locals/window;
         ins of window w alias outs of (w+1) mod NumWindows)   (WindowedBase)
  [544]  Y, PSR, WIM, TBR, PC, nPC, ASR17                       (SpecialBase)
  StateSize = 572

What is not in the blob: the FP register file, the cache-control registers, and the step-loop micro-state (annul_next, psr_write_pending, error mode, power-down). The IR never touches those; they stay in CpuState and only core::step ever sets them, which is what keeps the blob clean across IR blocks.
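
The layout numbers above are consistent with 8 windows of 4-byte slots (32 + 8 × 16 × 4 = 544, and 544 + 7 × 4 = 572). Under that assumption, and assuming outs are stored before locals inside each window, a register-to-offset mapping could look like the sketch below. reg_offset is a hypothetical helper, not a function from cpu_state.hpp.

```cpp
#include <cstdint>

// Constants taken from the layout listing; NumWindows = 8 is inferred from 544 = 32 + 8*16*4.
constexpr std::uint32_t NumWindows = 8;
constexpr std::uint32_t GlobalsBase = 0;
constexpr std::uint32_t WindowedBase = 32;
constexpr std::uint32_t SpecialBase = 544;
constexpr std::uint32_t StateSize = 572;

static_assert(WindowedBase + NumWindows * 16 * 4 == SpecialBase, "windowed area size");
static_assert(SpecialBase + 7 * 4 == StateSize, "seven 4-byte special registers");

// Hypothetical mapping of SPARC register r (0..31) under window cwp to a blob offset.
// Assumes each window stores its 8 outs first, then its 8 locals.
constexpr std::uint32_t reg_offset(std::uint32_t cwp, std::uint32_t r) {
    if (r < 8) return GlobalsBase + 4 * r;                                // %g0..%g7
    if (r < 16) return WindowedBase + (cwp * 16 + (r - 8)) * 4;           // %o0..%o7
    if (r < 24) return WindowedBase + (cwp * 16 + 8 + (r - 16)) * 4;      // %l0..%l7
    const std::uint32_t next = (cwp + 1) % NumWindows;                    // %i alias next window's %o
    return WindowedBase + (next * 16 + (r - 24)) * 4;
}
```

Because cwp is part of the block key, a frontend can evaluate a mapping like this at translate time and emit a constant offset, so no window indexing remains at run time.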

Why SPARC keeps CpuState

SPARC predates the IR. Its reference interpreter, GDB stub, and per-instruction observer all use core::CpuState, so unification made the blob a member of CpuState rather than a separate scratch buffer. A brand-new arch with no legacy core would be GuestState-native and skip CpuState entirely — see Adding a frontend Step 5.

ModeCtx — the mode key

ModeCtx (ir.hpp:28) is the small arch value that, with the entry PC, keys a block. Mode-changing instructions are block terminators (decision 50), so within a block it is constant. For SPARC it is the CWP bits of PSR (SparcArchitecture::mode_ctx_of, sparc_arch.cpp:43: PSR & CwpMask); S/PS/EF join the key when the privileged/FP translation paths land.

Guest memory — the one place endianness lives

src/ir/include/lince/ir/guest_memory.hpp centralises bus access + byte order: bswap32, swap_for, bus_load, bus_store, guest_load, guest_store. The bus is big-endian today; a little-endian access is its byte-reverse over the access width (swap_for). Both the IR interpreter (interpreter.cpp:36, :48) and the JIT's extern "C" helpers (ir_jit.cpp:50, :62) call these, so the two strategies cannot diverge on memory semantics.
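
The byte-order rule can be sketched as follows. The function names mirror the ones listed above, but the signatures and the MemEndian enumerators are assumptions made for this example; only the rule itself (big-endian bus, little-endian access is the byte-reverse over the access width) comes from the text.

```cpp
#include <cstdint>

enum class MemEndian { Big, Little };  // enumerator names assumed

constexpr std::uint16_t bswap16(std::uint16_t v) {
    return static_cast<std::uint16_t>((v << 8) | (v >> 8));
}

constexpr std::uint32_t bswap32(std::uint32_t v) {
    return (v << 24) | ((v << 8) & 0x00FF0000u) | ((v >> 8) & 0x0000FF00u) | (v >> 24);
}

// Value as seen by the op, given the value the big-endian bus returned.
// A big-endian or single-byte access passes through unchanged.
constexpr std::uint32_t swap_for(std::uint32_t bus_value, unsigned size, MemEndian endian) {
    if (endian == MemEndian::Big || size == 1) return bus_value;
    return size == 2 ? bswap16(static_cast<std::uint16_t>(bus_value)) : bswap32(bus_value);
}
```

The same function serves loads and stores, since reversing bytes is its own inverse.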

BlockCache — direct-mapped, (PhysAddr, ModeCtx)-keyed

BlockCache (src/ir/include/lince/ir/block_cache.hpp) is an 8192-slot direct-mapped cache: index_of(pc) = (pc >> 2) & (Size - 1), each slot carrying valid, pc_tag, mode, and the IrBlock. find(pc, mode) matches the tag and the mode; insert evicts the occupant; invalidate(pc) drops the slot whose tag matches (the self-modifying-code hook). The shape mirrors the Phase 10.1 decode cache.

Why 8192 (2026-06)

The slot count was raised from 1024 → 8192. At 1024 the index covered only a 4 KiB PC window, so on a call-heavy guest the hot code and its libc aliased and evicted each other (on Dhrystone, libc strcmp/strcpy hashed into the Proc_/Func_ block span and forced constant re-translation — translate_block ate ~9–16% of wall time). 8192 (a 32 KiB window) removes the aliasing (Dhrystone +39%, p99 slice jitter 28.7 ms → 8.9 ms); larger sizes measured no further gain. A cache only ever changes eviction frequency, never results, so this is bit-exact. See Performance.
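
The aliasing effect is easy to reproduce with a toy direct-mapped cache. ToyBlockCache below is a hypothetical reduction: it stores an int where the real cache stores an IrBlock, and its slot layout only follows the fields named above.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

template <std::size_t Size>
class ToyBlockCache {
    static_assert(Size != 0 && (Size & (Size - 1)) == 0, "Size must be a power of two");

public:
    ToyBlockCache() : slots_(Size) {}

    // SPARC instructions are 4 bytes, so the low two PC bits carry no information.
    static constexpr std::size_t index_of(std::uint32_t pc) { return (pc >> 2) & (Size - 1); }

    const int* find(std::uint32_t pc, std::uint32_t mode) const {
        const Slot& s = slots_[index_of(pc)];
        return (s.valid && s.pc_tag == pc && s.mode == mode) ? &s.block : nullptr;
    }
    void insert(std::uint32_t pc, std::uint32_t mode, int block) {
        slots_[index_of(pc)] = Slot{true, pc, mode, block};  // evicts the occupant
    }
    void invalidate(std::uint32_t pc) {
        Slot& s = slots_[index_of(pc)];
        if (s.valid && s.pc_tag == pc) s.valid = false;
    }

private:
    struct Slot {
        bool valid = false;
        std::uint32_t pc_tag = 0;
        std::uint32_t mode = 0;
        int block = 0;  // stand-in for the IrBlock payload
    };
    std::vector<Slot> slots_;
};
```

Two PCs that are 4 KiB apart, such as 0x1000 and 0x2000, share slot 0 in a 1024-slot cache and evict each other, while in an 8192-slot cache they land in slots 1024 and 2048.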

The pipeline: SPARC → IR → LLVM → native

flowchart LR
    subgraph Frontend["lince_arch_sparc (per-ISA)"]
        A[guest bytes] -->|core::decode| B[DecodedInsn]
        B -->|translate_block| C[IrBlock]
    end
    subgraph IR["lince_ir (arch-neutral)"]
        C -->|BlockCache.find/insert| D[(block cache)]
    end
    subgraph Backend["lince_jit (LLVM)"]
        D -->|build_jit_region| E[region: IrBlock+]
        E -->|lower_block / IRBuilder| F[LLVM IR module]
        F -->|verifyModule + addIRModule| G[ORCv2 LLJIT]
        G -->|lookup materialises| H[native BlockExecFn]
    end
    D -->|JIT can't lower| I[IrInterpreter]
    H --> J[GuestState updated in place]
    I --> J

The frontend is the only ISA-aware stage. Everything from IrBlock rightward is arch-neutral and shared.

The IR run loop

Emulator::run_ir_quantum (emulator.cpp:950) drives one core for one quantum, identical for the IR-interpreted and JIT-compiled cases:

flowchart TD
    Start([enter quantum]) --> Halt{error / powered-down?}
    Halt -->|yes| Done([return ran])
    Halt -->|no| IPI[poll_self_interrupt]
    IPI --> GDB{GDB break at pc?}
    GDB -->|yes| Done
    GDB -->|no| Clean{clean boundary?}
    Clean -->|"no (delay slot /<br/>pending PSR / annul)"| FB[fallback_step: core::step] --> Loop
    Clean -->|yes| Cache["BlockCache.find(pc, mode)"]
    Cache -->|miss| TX[translate_block + insert] --> Untr
    Cache -->|hit| Untr{insn_count == 0?}
    Untr -->|yes| FB
    Untr -->|no| Bud{next block<br/>crosses quantum?}
    Bud -->|"yes (ran>0)"| Tail[tail-step remainder via core::step] --> Done
    Bud -->|no| BP{interior breakpoint?}
    BP -->|yes| FB
    BP -->|no| Warm{exec_count <<br/>baseline_threshold?}
    Warm -->|yes| Interp[IrInterpreter.run] --> Apply
    Warm -->|no| Compile["tiered_jit.get_or_compile"]
    Compile -->|nullptr| Interp
    Compile -->|fn| Native["fn(gs, bus, &res, budget)"] --> Apply
    Apply[apply exit: advance PC/nPC<br/>or take_exception] --> Loop{ran < quantum?}
    Loop -->|yes| Halt
    Loop -->|no| Done

Key points, in order:

  1. The blob is canonical for the whole quantum. gs = state.guest_state() (emulator.cpp:961). The IR/JIT and core::step all operate on these same bytes, so there is no entry/exit sync. The exit comment (emulator.cpp:1185) is explicit: every IR/JIT update is already reflected in CpuState.
  2. Self-IPI poll. poll_self_interrupt takes a self-directed IPI at the block boundary (hardware-latency self-interrupt) before fetching the next block (emulator.cpp:1007). Because this runs every block boundary but the controller is empty ~99.99% of the time at a 100 Hz tick, Emulator::sample_interrupts first checks the controller's IInterruptController::raw_pending() — a maintained single-word superset of the pending sources. A 0 result proves pending_mask(cpu) == 0 for every cpu, so the poll early-outs before the full per-CPU scan (provably bit-exact; gated to SingleThread). See Performance.
  3. Clean-boundary gate (emulator.cpp:1025). A block may be translated only at a clean instruction boundary: !annul_next() && !psr_write_pending() && npc == pc + 4. A delay slot, an annulled slot, or SPARC's delayed-WRPSR window is fallback_step-ed until clean. That micro-state lives in CpuState, not the blob, and only the fallback ever sets it.
  4. Block cache. find(pc, mode); on a miss, translate_block + insert (emulator.cpp:1038). Translation is a pure function of (pc, mode, guest code) — it never reads runtime register state — so a cached block is valid for any later execution and is shared across cores (in SingleThread). insn_count == 0 means "untranslatable op here" → fallback. PhysAddr{pc} is sound only while VA==PA (identity-mapped MMU), which holds until SRMMU lands.
  5. Quantum-exact yield (emulator.cpp:1067). If the next block would cross the quantum (and ran != 0), the remaining quantum - ran instructions run one-by-one through core::step, landing on the exact same boundary the switch path would. This matters under SMP: block-level overshoot drifts the round-robin interleaving point and once livelocked a lock-free migration handshake (smpschedaffinity04). A block larger than the whole quantum still runs whole when ran == 0, guaranteeing forward progress.
  6. Execute. Native (JIT) when warm and lowerable, else the IR interpreter. Both leave gs updated in place. The interpreter yields an ir::BlockExit; the JIT writes a jit::BlockResult, which the run loop normalises into the same BlockExit shape (emulator.cpp:1122).
  7. Apply the outcome (emulator.cpp:1141). A memory_fault maps to DataAccessException; an Exception carries the frontend's exit_code. The saved PC is the faulting instruction's; nPC is normally PC+4 but a trap in a control-transfer delay slot saves the branch's resolved target (the delay-trap fixup). Delivery is via architecture.take_exception when PSR.ET, else the core enters error mode (SPARC V8 §7.3). A normal exit advances PC/nPC to the continuation. Sim-time is billed from the instructions actually executed (jit_insns for the JIT — which may span many self-loop iterations — else block->insn_count).
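
Points 3 and 5 reduce to two small predicates. The sketch below restates them as free functions; Plan, clean_boundary, plan_next and tail_steps are hypothetical names chosen for this example, and the real run loop folds these checks into run_ir_quantum.

```cpp
#include <cstdint>

enum class Plan { RunBlock, TailStep };

// Point 3: a block may only be translated at a clean instruction boundary.
constexpr bool clean_boundary(bool annul_next, bool psr_write_pending,
                              std::uint32_t pc, std::uint32_t npc) {
    return !annul_next && !psr_write_pending && npc == pc + 4;
}

// Point 5: never start a block that would overshoot the quantum,
// unless nothing has run yet (forward progress for oversized blocks).
constexpr Plan plan_next(std::uint32_t ran, std::uint32_t quantum, std::uint32_t block_insns) {
    const bool crosses = std::uint64_t{ran} + block_insns > quantum;
    return (crosses && ran != 0) ? Plan::TailStep : Plan::RunBlock;
}

// Number of single core::step calls needed to land exactly on the quantum boundary.
constexpr std::uint32_t tail_steps(std::uint32_t ran, std::uint32_t quantum) {
    return quantum - ran;
}
```

With a quantum of 100 and 90 instructions already run, a 20-instruction block is not started; the remaining 10 instructions are stepped one by one instead.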

What the JIT/IR can't translate → fallback

Anything the SPARC frontend cannot yet emit ends the block (insn_count == 0 or a bail_at fall-through) and the run loop core::steps that PC. The frontend currently bails on: FP ops, atomics (CASA/LDSTUB/SWAP), alternate-space access, RETT/Ticc, the cc-setting multiply-step (MULScc) and SpecialReg reads, LDD/STD (64-bit guest memory in lower_block, ir_jit.cpp:319), and a delay slot that is itself a CTI or otherwise non-predicable. Annulled conditional branches (Bicc,a) are lowered when their delay slot is predicable — rd-only, cc/%Y-writing, a single-word load, or UDIV/SDIV — running the slot on the taken path and squashing it when not taken; a store, LDD, or control-transfer delay slot still bails (sparc_frontend.cpp:671).

Delay-slot traps

SPARC's branch + delay slot is translated as one block (the branch is the terminator, the slot is the trailing edge). If the delay slot is a trap-capable op (Load/Store/Div) a fault there must save the branch's resolved nPC, not trap_pc + 4 — the straight-line block model would otherwise mis-save it. The block records this in delay_trap_pc / delay_trap_npc / delay_trap_dynamic:

  • Static (delay_trap_dynamic == false — CALL, BA): the saved nPC is the constant delay_trap_npc (the call/branch target).
  • Dynamic (true — JMPL, a true conditional Bicc): the runtime nPC is unknown at translate time, so the frontend stores the resolved nPC into the GuestState nPC slot before the delay slot, and the run loop keeps it on a fault (emulator.cpp:1156) rather than overwriting.

A delay-trap block must stay the dispatch entry — build_jit_region refuses to bury one mid-region (emulator.cpp:1240) so the run loop can read its delay_trap_* off the entry to fix the saved nPC.
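
The fixup rule can be written as one function. This is a sketch only: DelayTrapInfo groups the three delay_trap_* fields with a has_delay_trap flag that is an assumption of this example (the text does not say how a block without a delay trap is marked), and saved_npc is a hypothetical name.

```cpp
#include <cstdint>

struct DelayTrapInfo {
    bool has_delay_trap;            // assumed marker: block ends in a trap-capable delay slot
    std::uint32_t delay_trap_pc;    // PC of the delay-slot instruction
    std::uint32_t delay_trap_npc;   // translate-time target (static case)
    bool delay_trap_dynamic;        // true: target only known at run time
};

// nPC to save when a fault is taken at trap_pc.
// npc_slot is the current value of the nPC slot in the GuestState blob.
constexpr std::uint32_t saved_npc(const DelayTrapInfo& block, std::uint32_t trap_pc,
                                  std::uint32_t npc_slot) {
    if (!block.has_delay_trap || trap_pc != block.delay_trap_pc) {
        return trap_pc + 4;  // straight-line case
    }
    // Dynamic: the frontend already stored the resolved nPC into the blob, so keep it.
    // Static: the branch target is a translate-time constant.
    return block.delay_trap_dynamic ? npc_slot : block.delay_trap_npc;
}
```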

The JIT

lince_jit (src/jit/) is the isolated module that owns LLVM. It depends only on lince_ir (and LLVM); lince_runtime links it PUBLIC unconditionally. It uses LLVM ORCv2 (LLJIT).

How ORCv2 turns a region into native code

ORCv2 is LLVM's on-request compilation API; LLJIT is its turnkey wrapper. Lince uses it as a black box that takes an LLVM IR module and returns the address of a compiled function. The lifecycle per region, in IrJit::compile_region (ir_jit.cpp:666):

  1. Build an LLVM IR module. A fresh llvm::LLVMContext + llvm::Module hold one execute_region_N function (N from a per-IrJit counter, ir_jit.cpp:680) whose signature is the BlockExecFn ABI below. One LLVM BasicBlock is created per region member, keyed by (entry_pc, mode_ctx); the function entry block holds the shared allocas and branches to member 0.
  2. Lower. lower_block (ir_jit.cpp:130) walks each member's IrInsts and IrExit and emits LLVM IR with an IRBuilder — the only place that knows both the Lince IR and LLVM. It returns false for an op/exit it cannot lower, and compile_region returns ErrorCode::JitError.
  3. Verify. llvm::verifyModule rejects malformed IR — a lowering bug returns JitError instead of producing wrong code (ir_jit.cpp:740).
  4. Add to the JIT. LLJIT::addIRModule(ThreadSafeModule) hands the module to ORCv2. Nothing is compiled yet; ORCv2 is lazy.
  5. Look up the symbol. LLJIT::lookup("execute_region_N") materialises the symbol: the optional O2 IR transform (Optimised tier only) runs, then IRCompileLayer lowers LLVM IR → machine code via the host backend (instruction selection, register allocation, scheduling), then the object is linked into executable memory. lookup returns the address, cast to BlockExecFn via sym->toPtr<BlockExecFn>() (ir_jit.cpp:754).

The address stays valid for the JIT's lifetime; the TieredJit caches it per (pc, mode) so steps 1–5 happen once per region, never per execution.

Block ABI

// src/jit/include/lince/jit/ir_jit.hpp
enum class BlockStatus : std::uint32_t { Normal = 0, Exception = 1, MemoryFault = 2 };

struct BlockResult {
    std::uint32_t next_pc;     // continuation (Normal)
    std::uint32_t exit_code;   // architectural trap tt (Exception/TrapIf)
    std::uint32_t trap_pc;     // faulting PC (Exception / MemoryFault)
    std::uint32_t status;      // a BlockStatus value
    std::uint32_t insns;       // guest instructions ACTUALLY executed
};
using BlockExecFn = void (*)(void* guest_state, void* bus,
                             BlockResult* out, std::uint32_t budget);

The native function operates in place on the GuestState blob (gs.bytes().data()) and the ICpuBus, writing its outcome to *out. The field byte offsets are asserted against the struct (ir_jit.cpp:102) because the lowering stores them by offset. insns is the count the call billed — with self-loop/region chaining one call may iterate many times, and on a mid-block trap it includes the partial final iteration, so the dispatcher bills sim-time from insns, not the static insn_count. budget is the instruction budget for chained back-edges (see region chaining).
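
The offset assertions mentioned above might look like the following. The struct and function-pointer type are copied from the listing; the specific static_asserts and the hand-written toy_block stand-in are illustrative assumptions, not the contents of ir_jit.cpp.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

enum class BlockStatus : std::uint32_t { Normal = 0, Exception = 1, MemoryFault = 2 };

struct BlockResult {
    std::uint32_t next_pc;
    std::uint32_t exit_code;
    std::uint32_t trap_pc;
    std::uint32_t status;
    std::uint32_t insns;
};
using BlockExecFn = void (*)(void* guest_state, void* bus, BlockResult* out, std::uint32_t budget);

// Lowered code stores fields by byte offset, so the layout must not drift.
static_assert(std::is_standard_layout_v<BlockResult>, "offsetof requires standard layout");
static_assert(offsetof(BlockResult, next_pc) == 0, "next_pc offset");
static_assert(offsetof(BlockResult, exit_code) == 4, "exit_code offset");
static_assert(offsetof(BlockResult, trap_pc) == 8, "trap_pc offset");
static_assert(offsetof(BlockResult, status) == 12, "status offset");
static_assert(offsetof(BlockResult, insns) == 16, "insns offset");
static_assert(sizeof(BlockResult) == 20, "no padding expected");

// Hand-written stand-in for a compiled block: writes 42 at blob offset 0, then exits normally.
inline void toy_block(void* guest_state, void* /*bus*/, BlockResult* out, std::uint32_t /*budget*/) {
    const std::uint32_t value = 42;
    std::memcpy(guest_state, &value, sizeof value);
    out->next_pc = 0x1004;
    out->exit_code = 0;
    out->trap_pc = 0;
    out->status = static_cast<std::uint32_t>(BlockStatus::Normal);
    out->insns = 1;
}
```

A dispatcher would call such a function through BlockExecFn and then bill sim-time from out->insns.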

BlockExit vs BlockResult

The IR interpreter returns ir::BlockExit (interpreter.hpp:30); the JIT fills jit::BlockResult (ir_jit.hpp:50). They are distinct structs with the same information — run_ir_quantum normalises the JIT's into a BlockExit so the apply-outcome code is shared (emulator.cpp:1122).

Lowering details

lower_block (ir_jit.cpp:130) emits one block's ops and exit at the builder's current insert point:

  • State. LdState/StState → byte GEPs + align-1 load/store on the blob, host order (matching the interpreter exactly). The size selects an i8/i16/i32 access with zext/trunc as needed.
  • The register file is memory, not LLVM registers. Loads/stores hit the blob pointer. The O2 pipeline's mem2reg/SROA passes promote hot guest registers to SSA values (and back) within a region — that is where most of the optimised tier's speed comes from. The baseline tier skips those passes, so its code keeps every guest register in memory.
  • ALU / Cmp / Select. Direct LLVM instructions; shift counts masked to 5 bits; UMulHi/SMulHi via i64 ext + mul + shift; division guarded (÷0 → 0, INT_MIN/-1 saturated, never a trapping LLVM sdiv/udiv) to match the interpreter (ir_jit.cpp:240).
  • Guest memory. See inline RAM below; the slow path calls the extern "C" helpers (lince_jit_load/lince_jit_store, resolved by ORCv2 absolute symbols at IrJit::create, ir_jit.cpp:644, to functions that funnel through ir::guest_load/guest_store). 64-bit (LDD/STD) sizes bail (ir_jit.cpp:319).
  • Faults and traps. A bus error (fault_slot set by the helper) or a TrapIf aborts the block mid-stream via an early ret that records trap_pc/status/insns (ir_jit.cpp:341, :437). The precise-trap "prefix committed, suffix skipped" guarantee falls out of in-order emission + early return, identical to the interpreter. insns on the fault path is acc_load() + partial(pc) — completed iterations plus the prefix of the current one.
  • Exits. A static/taken target that is in the region chains directly to its member block (budget-guarded); else it returns to the dispatcher. Exception/PowerDown terminators are not lowered — lower_block returns false and the caller falls back.
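
The ALU rules in the list above have simple scalar equivalents, which is also a convenient way to unit-test a lowering against the interpreter. The helpers below are a hypothetical reference model (the names are not from the codebase) of the 5-bit shift mask and the "i64 ext + mul + shift" multiply-high.

```cpp
#include <cstdint>

// Shift counts are masked to 5 bits, so a count of 33 behaves like 1.
constexpr std::uint32_t shl32(std::uint32_t a, std::uint32_t count) { return a << (count & 31u); }
constexpr std::uint32_t shr32(std::uint32_t a, std::uint32_t count) { return a >> (count & 31u); }

// UMulHi: zero-extend to 64 bits, multiply, keep the high word.
constexpr std::uint32_t umulhi(std::uint32_t a, std::uint32_t b) {
    return static_cast<std::uint32_t>((std::uint64_t{a} * b) >> 32);
}

// SMulHi: sign-extend to 64 bits, multiply, keep the high word.
// The product is reinterpreted as unsigned before shifting to keep the shift well defined.
constexpr std::uint32_t smulhi(std::int32_t a, std::int32_t b) {
    const std::int64_t product = std::int64_t{a} * std::int64_t{b};
    return static_cast<std::uint32_t>(static_cast<std::uint64_t>(product) >> 32);
}
```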

Inline RAM access

Guest loads/stores were originally always an out-of-line helper call. With a RAM window the JIT inlines big-endian RAM access (sizes 1/2/4) as native host access (ir_jit.cpp:372):

offset = addr - window.guest_base
if (offset <=u window.size - access_size)   // whole access in-window
    native load/store at host_ptr + offset  // llvm.bswap for 2/4 (LE host)
else
    call the bus helper                      // MMIO / out-of-window / straddling
  • The bounds check guards the whole access (offset <=u size - access_size, a compile-time constant), so a straddling access falls to the slow path where the bus latches it as a BusError, exactly as the interpreter does. The frontend's alignment TrapIf already runs before the access, so aligned accesses never straddle — but the check is correct without that guarantee.
  • The host pointer comes from SystemBus::ram_view_at(config_.ram_base) (emulator.cpp:471) and is baked as an i64 constant; it is stable because RAM is mapped once at initialize() and never moved. The window is passed at JIT construction (IrJit::create(std::optional<RamWindow>)); with no window (unit tests with a synthetic bus) or for little-endian accesses, the helper path is used unchanged (ir_jit.cpp:364).
  • The inline path byte-reverses explicitly with llvm.bswap, so it assumes a little-endian host — a static_assert enforces this (ir_jit.cpp:36; ADR-003 fixes the JIT host to x86-64). This is independent of the guest endianness.
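
The fast-path decision can be modelled in ordinary C++. ToyRamWindow and its field names are assumptions for this sketch, and the load assembles the big-endian value byte by byte so the example is host-independent; on a little-endian host that is the same result as a native load followed by a byte swap.

```cpp
#include <cstdint>
#include <optional>

struct ToyRamWindow {
    std::uint32_t guest_base;       // first guest address covered by the window
    std::uint32_t size;             // window size in bytes (assumed >= access size)
    const std::uint8_t* host_ptr;   // host address of guest_base
};

// Single unsigned comparison: if addr < guest_base the subtraction wraps to a huge
// offset and fails the test, so no separate lower-bound check is needed.
inline bool in_window(const ToyRamWindow& w, std::uint32_t addr, std::uint32_t access_size) {
    const std::uint32_t offset = addr - w.guest_base;
    return offset <= w.size - access_size;
}

// Returns the loaded value on the fast path, or nullopt when the access
// must go through the bus helper (MMIO, out-of-window, straddling).
inline std::optional<std::uint32_t> inline_load_be32(const ToyRamWindow& w, std::uint32_t addr) {
    if (!in_window(w, addr, 4)) return std::nullopt;
    const std::uint8_t* p = w.host_ptr + (addr - w.guest_base);
    return (std::uint32_t{p[0]} << 24) | (std::uint32_t{p[1]} << 16) |
           (std::uint32_t{p[2]} << 8) | std::uint32_t{p[3]};
}
```

An access that starts 2 bytes before the end of the window fails the comparison and takes the slow path, which is where the text says the bus latches a BusError.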

Self-loop chaining

A block whose static/taken exit target is its own entry_pc (a tight backward branch, or a ba/spin park loop) is lowered as a native loop instead of returning to the dispatcher each iteration — removing the per-iteration indirect call + two hash lookups + BlockResult marshalling.

  • The function entry holds the loop-invariant allocas (the fault_slot flag and the pre-zeroed instruction accumulator acc_slot); the self-edge is a budget-guarded br back to the member body. The body always runs at least once (forward progress).
  • Loop-carried state is the blob (memory) and the accumulator; block-local temps are recomputed each iteration, so LLVM's mem2reg/LICM clean it up with no hand-written PHIs beyond the inline-RAM load merge.
  • BlockResult::insns reports the real count across iterations; a mid-loop fault reports completed iterations + the partial final one, so sim-time stays exact.
  • Self-loops are mode-safe by construction: a block can only target its own entry, so it re-runs under the mode it was translated for.

The self-loop is the degenerate single-member case of a region.
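
The control shape of a lowered self-loop can be mimicked in C++. run_self_loop is a hypothetical model: the body callback stands in for the member's lowered ops and returns whether the self-edge is taken, and the back-edge uses the whole-iteration budget rule described under region chaining.

```cpp
#include <cstdint>

struct LoopResult {
    std::uint32_t insns;       // what BlockResult::insns would report
    std::uint32_t iterations;  // for illustration only
};

template <class Body>
LoopResult run_self_loop(std::uint32_t member_insns, std::uint32_t budget, Body&& body) {
    std::uint32_t acc = 0;  // the pre-zeroed instruction accumulator
    std::uint32_t iterations = 0;
    for (;;) {
        const bool take_back_edge = body();  // the body always runs at least once
        acc += member_insns;
        ++iterations;
        if (!take_back_edge) break;
        // Chain only if the whole next iteration fits the remaining budget.
        if (std::uint64_t{acc} + member_insns > budget) break;
    }
    return {acc, iterations};
}
```

With a 3-instruction body and a budget of 10, the loop stops after three iterations (9 instructions) rather than starting a fourth that would end at 12.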

Region chaining

A region fuses an entry block with its same-mode successors into one native function. Emulator::build_jit_region (emulator.cpp:1190) discovers the region by BFS from the entry:

  • It follows StaticBranch/CondBranch edges (and the StaticDelta fallthrough), decoding each successor via frontend.translate_block without touching ir_cache_ (so the caller's block pointer into the cache stays valid, emulator.cpp:1232).
  • For a CALL (is_call) it also pulls in the return block (the sequential PC after the call+delay slot) so the callee's IndirectBranch return can chain back into native code (emulator.cpp:1274).
  • It stops at jit_max_region_blocks (default 8), at an untranslatable target (insn_count == 0), at a delay-trap block (must stay the entry), at a Dynamic mode change (unknown post-change CWP), or when a GDB stub is attached (no fusing, emulator.cpp:1201).
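
A reduced model of the discovery walk, under stated simplifications: blocks are looked up in a prebuilt map instead of being translated on demand, the mode context is ignored, and the CALL return-block rule is left out. ToyBlock, ToyCode and build_region are hypothetical names; the entry PC is assumed to be present in the map.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>
#include <unordered_set>
#include <vector>

struct ToyBlock {
    std::uint32_t insn_count = 0;             // 0 means untranslatable
    std::vector<std::uint32_t> static_succs;  // StaticBranch / CondBranch targets
    bool dynamic_mode_change = false;         // successor mode unknown at translate time
    bool delay_trap = false;                  // must remain a dispatch entry
};
using ToyCode = std::unordered_map<std::uint32_t, ToyBlock>;

// Breadth-first region discovery from entry, capped at max_blocks members.
inline std::vector<std::uint32_t> build_region(const ToyCode& code, std::uint32_t entry,
                                               std::size_t max_blocks) {
    std::vector<std::uint32_t> region{entry};
    std::unordered_set<std::uint32_t> seen{entry};
    std::deque<std::uint32_t> work{entry};
    while (!work.empty() && region.size() < max_blocks) {
        const std::uint32_t pc = work.front();
        work.pop_front();
        const ToyBlock& blk = code.at(pc);
        if (blk.dynamic_mode_change) continue;  // never chain across a Dynamic block
        for (const std::uint32_t succ : blk.static_succs) {
            if (region.size() >= max_blocks) break;
            const auto it = code.find(succ);
            if (it == code.end() || it->second.insn_count == 0) continue;  // untranslatable
            if (it->second.delay_trap) continue;                           // keep it an entry
            if (!seen.insert(succ).second) continue;                       // already a member
            region.push_back(succ);
            work.push_back(succ);
        }
    }
    return region;
}
```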

IrJit::compile_region (ir_jit.cpp:666) lowers the members into one function. resolve(target_pc, mode) maps an exit target to a member's LLVM block:

  • An in-region static/taken/fall-through target becomes a direct br, budget-guarded (take_static, ir_jit.cpp:500): chain only if the whole target fits the remaining budget (acc_now + member_insns <=u budget), the same don't-start-an-overshooting-block rule the dispatcher applies. The weaker acc_now < budget would overshoot and reintroduce the SMP interleaving drift.
  • An IndirectBranch (JMPL/return) target is a runtime value, so it gets an inline cache (ir_jit.cpp:553): compare it against each in-region member at the successor mode; on a hit (with budget) chain to that body, else fall through to a dispatcher return. This is what lets CALL/return pairs run wholly in native code.
  • An out-of-region or no-hit target returns to the dispatcher.

mode_change_kind is the safety flag: the SPARC frontend sets StaticDelta on SAVE/RESTORE (their CWP rewrite makes the successor a different mode_ctx). The region builder chains across a StaticDelta block under exit_mode_ctx but never across a Dynamic block, and resolve keys members by (pc, mode) — so a member is only ever re-entered under the mode it was translated for. If a member turns out to be unlowerable, compile_with_fallback (tiered_jit.cpp:37) retries with the entry block alone, so a region never does worse than single-block.

Tiered compilation (ADR-002)

flowchart TD
    Miss["run_ir_quantum: cache miss / cold block"] --> Warm{"exec_count <<br/>jit_baseline_threshold (32)?"}
    Warm -->|yes| Interp["IrInterpreter.run<br/>(++exec_count)"]
    Warm -->|no| GOC["TieredJit.get_or_compile"]
    GOC -->|"first request"| Base["compile_with_fallback<br/>(Baseline O0, caller thread)"]
    Base -->|nullptr| InterpFb["interpreter fallback<br/>(verdict cached)"]
    Base -->|fn| Run["run native (Baseline)"]
    GOC -->|"opt published"| RunO["run native (Optimised)"]
    Run --> Note["note_execution"]
    Note --> Cross{"exec_count == 100?"}
    Cross -->|yes| Enq["enqueue on background thread"]
    Enq --> BG["worker: compile_with_fallback<br/>(Optimised O2)"]
    BG --> Pub["optimised.store(fn, release)"]
    Pub -.->|"next get_or_compile"| RunO

IrJit takes an OptLevel (ir_jit.hpp:81). Baseline builds its LLJIT with CodeGenOptLevel::None (fast instruction selection, no IR passes) — cheap to produce. Optimised uses CodeGenOptLevel::Aggressive plus the full LLVM O2 IR pipeline (optimize_module_o2, ir_jit.cpp:76), installed as an IRTransformLayer transform that runs during materialisation (ir_jit.cpp:631).

TieredJit (src/jit/src/tiered_jit.cpp) drives two IrJits and owns the (pc, mode) cache (CacheEntry, tiered_jit.cpp:65):

  • On get_or_compile it compiles the Baseline immediately on the calling thread, so a warm block runs at once (no compile stall on the hot path).
  • note_execution counts runs; crossing jit_promotion_threshold (default 100) enqueues the block on a single background thread (worker_loop, tiered_jit.cpp:111) that compiles the Optimised tier and publishes the function pointer atomically (optimised.store(fn, std::memory_order_release)). The dispatcher prefers the optimised pointer once present (get_or_compile, acquire-load, tiered_jit.cpp:179).
  • Both tiers lower the identical IR, so they are semantically interchangeable. An unlowerable block caches a nullptr verdict (baseline_tried, tiered_jit.cpp:183) and is never re-attempted or promoted.
  • A compile budget (MaxPendingCompiles = 64, tiered_jit.cpp:30) bounds the Optimised queue: a burst of promotions drops the oldest request (its block stays on Baseline) rather than growing without bound.
  • The CacheEntry keeps the JIT's own copy of the region's IR (entry.region), because the dispatcher's block cache is direct-mapped and evicts — the background thread must not alias it. std::unordered_map node-pointer stability is what lets the worker hold a raw CacheEntry* while the caller keeps inserting (tiered_jit.cpp:62).
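The promotion flow in the list above can be sketched in a few dozen lines. This is a minimal, single-threaded model and not Lince's code: TieredCacheSketch, Entry, publish_next, run_baseline and run_optimised are hypothetical names, the key is the pc alone rather than (pc, mode), and the background worker is replaced by an explicit publish_next call so the behaviour is deterministic.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>

// Hypothetical stand-ins for compiled code; a real JIT would return LLVM-built functions.
using NativeFn = int (*)();
inline int run_baseline() { return 0; }
inline int run_optimised() { return 2; }

struct Entry {
    NativeFn baseline = nullptr;
    std::atomic<NativeFn> optimised{nullptr};  // published with release, read with acquire
    std::uint64_t exec_count = 0;
};

class TieredCacheSketch {
public:
    TieredCacheSketch(std::uint64_t promote_at, std::size_t max_pending)
        : promote_at_(promote_at), max_pending_(max_pending) {
        assert(max_pending_ > 0);
    }

    // Prefer the optimised pointer once present; otherwise install the baseline on first sight.
    NativeFn get_or_compile(std::uint32_t pc, NativeFn baseline_code) {
        Entry& e = cache_[pc];  // unordered_map nodes never move, so &e stays valid
        if (NativeFn opt = e.optimised.load(std::memory_order_acquire)) return opt;
        if (!e.baseline) e.baseline = baseline_code;
        return e.baseline;
    }

    // Count one run; exactly at the threshold the block is queued for promotion.
    void note_execution(std::uint32_t pc) {
        Entry& e = cache_[pc];
        if (++e.exec_count != promote_at_) return;
        if (pending_.size() == max_pending_) pending_.pop_front();  // budget hit: drop the oldest
        pending_.push_back(&e);
    }

    // Stand-in for one iteration of the worker: take a request and publish its result.
    bool publish_next(NativeFn optimised_code) {
        if (pending_.empty()) return false;
        Entry* e = pending_.front();
        pending_.pop_front();
        e->optimised.store(optimised_code, std::memory_order_release);
        return true;
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::uint64_t promote_at_;
    std::size_t max_pending_;
    std::unordered_map<std::uint32_t, Entry> cache_;
    std::deque<Entry*> pending_;  // raw pointers are safe only because map nodes are stable
};
```

With promote_at = 100 and max_pending = 64 this mirrors the defaults quoted above. A real implementation would also need a mutex and condition variable around the queue, which the sketch leaves out.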

Interpret-first tiering

Before any compilation, the dispatcher runs a cold block on the IR interpreter until it proves hot. run_ir_quantum checks block->exec_count < config_.jit_baseline_threshold (default 32, emulator.cpp:1104): below the threshold it runs the interpreter and increments exec_count; once the count reaches the threshold it calls get_or_compile. Cold and rarely-run blocks (the bulk of boot and varied control flow) never pay LLVM compile latency they would not amortise. The IR interpreter is the lockstep oracle, so warming on it changes no semantics, only which validated executor runs the block. Setting jit_baseline_threshold to 0 restores the legacy compile-on-first-sight behaviour.
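The decision described above fits in one small function. The sketch below is illustrative only: Executor, Block and choose_executor are hypothetical names, and the compile callback stands in for get_or_compile, returning something falsy when the block cannot be lowered.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for illustration; not Lince's real types.
enum class Executor { Interpreter, Native };

struct Block {
    std::uint32_t exec_count = 0;
};

// Decide how to run one block. `compile` models get_or_compile: it is only
// called once the block has been interpreted `baseline_threshold` times, and a
// falsy result (the cached nullptr verdict) sends the block back to the interpreter.
template <typename Compile>
Executor choose_executor(Block& block, std::uint32_t baseline_threshold, Compile&& compile) {
    if (block.exec_count < baseline_threshold) {
        ++block.exec_count;  // still warming: interpret and count the run
        return Executor::Interpreter;
    }
    return compile() ? Executor::Native : Executor::Interpreter;
}
```

With a threshold of 32 the first 32 calls return Executor::Interpreter without touching the compiler; with a threshold of 0 the very first call goes straight to compile, which is the compile-on-first-sight behaviour.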

Runtime knobs (EmulatorConfig, not build flags): jit_baseline_threshold, jit_promotion_threshold, jit_background_opt (false → Baseline-only, no background thread), jit_max_region_blocks (region size cap).

Self-modifying code and the code flush

A cached block is invalidated when the guest changes the code under it:

  • FLUSH. SPARC FLUSH is untranslatable, so it always retires via core::step. The fallback checks state.consume_code_flush() and calls request_code_flush (emulator.cpp:973), which clears every ir_cache_, every tiered_jit_ cache, and every per-core decode cache (flush_code_caches, emulator.cpp:912). Under MultiThread the flush is deferred to the serial round boundary (code_flush_pending_ latch, emulator.cpp:927) where all workers are parked.
  • load_elf / reset / write_physical. These also clear the IR + JIT caches (emulator.cpp:525, :1632) since they may overwrite cached code.
  • BlockCache::invalidate(pc) is the finer hook for a single written word; it drops the matching slot.
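The two invalidation granularities, and the deferred flush under MultiThread, can be sketched as follows. Everything here is an assumption made for illustration: the slot count, the (pc >> 2) indexing, and the names BlockCacheSketch and FlushLatchSketch are not taken from Lince's sources, and only one cache is modelled where the real flush clears several.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Slot {
    bool valid = false;
    std::uint32_t pc = 0;
};

// Direct-mapped cache: each pc maps to exactly one slot, so inserts can evict.
class BlockCacheSketch {
public:
    static constexpr std::size_t kSlots = 256;  // assumed size, must be a power of two
    static_assert((kSlots & (kSlots - 1)) == 0, "kSlots must be a power of two");

    void insert(std::uint32_t pc) { slots_[index(pc)] = Slot{true, pc}; }

    bool contains(std::uint32_t pc) const {
        const Slot& s = slots_[index(pc)];
        return s.valid && s.pc == pc;
    }

    // Fine-grained hook: drop the slot only if it currently holds this pc.
    void invalidate(std::uint32_t pc) {
        Slot& s = slots_[index(pc)];
        if (s.valid && s.pc == pc) s.valid = false;
    }

    // Coarse hook: forget every cached block.
    void clear() { slots_.fill(Slot{}); }

private:
    static std::size_t index(std::uint32_t pc) { return (pc >> 2) & (kSlots - 1); }
    std::array<Slot, kSlots> slots_{};
};

// Flush request: immediate when single-threaded, latched until the serial
// round boundary when workers may still be executing cached blocks.
class FlushLatchSketch {
public:
    explicit FlushLatchSketch(bool multi_thread) : multi_thread_(multi_thread) {}

    void request(BlockCacheSketch& cache) {
        if (multi_thread_) pending_.store(true, std::memory_order_release);
        else cache.clear();
    }

    // Called where all workers are parked; applies a latched flush exactly once.
    void round_boundary(BlockCacheSketch& cache) {
        if (pending_.exchange(false, std::memory_order_acq_rel)) cache.clear();
    }

private:
    bool multi_thread_;
    std::atomic<bool> pending_{false};
};
```

The latch trades a short window of stale cached code for never clearing a cache that another thread is reading. Whether that window is acceptable depends on the guest's FLUSH semantics, which this sketch does not model.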

TieredJit::clear drains the background worker first (so no in-flight compile touches a cleared entry), then clears the cache map; the LLJIT-resident code is left orphaned in place — reloads are rare, matching the IR caches' policy (tiered_jit.cpp:232).

Multi-thread: per-core caches

Under ExecutionMode::MultiThread (Phase 13) each simulated core runs on its own host thread, so the IR caches, interpreters, and JITs must not be shared across cores (the BlockCache slots, IrInterpreter scratch, and TieredJit cache map are not thread-safe). initialize() allocates n_caches = MultiThread ? num_cores : 1 of each (emulator.cpp:454), and ir_cache_for / tiered_jit_for / ir_interp_for (emulator.cpp:938) index by core_idx under MT and by 0 otherwise. Each per-core TieredJit keeps its own background O2 thread (Phase 14 P14-2). The optimiser is bursty (it compiles each hot block once at warmup, then idles), so N threads are a transient startup spike rather than sustained load, and they recover the per-core O2 throughput that baseline-only MT had sacrificed.
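The allocation and indexing rule is simple enough to capture in a small generic holder. This is a sketch under assumptions: PerCore, ModeSketch and for_core are hypothetical names, and the Serial enumerator is invented because the text above only names MultiThread.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical mode enum; only MultiThread is named in the text above.
enum class ModeSketch { Serial, MultiThread };

// Holds one T per core under MultiThread, or a single shared T otherwise.
template <typename T>
class PerCore {
public:
    PerCore(ModeSketch mode, std::size_t num_cores)
        : multi_thread_(mode == ModeSketch::MultiThread),
          items_(multi_thread_ ? num_cores : 1) {}  // n_caches = MultiThread ? num_cores : 1

    // Index by core_idx under MultiThread, by 0 otherwise.
    T& for_core(std::size_t core_idx) {
        const std::size_t i = multi_thread_ ? core_idx : 0;
        assert(i < items_.size());
        return items_[i];
    }

    std::size_t size() const { return items_.size(); }

private:
    bool multi_thread_;     // declared first so it is initialised before items_
    std::vector<T> items_;
};
```

Routing every lookup through one accessor means the dispatcher never needs to know which mode it is running in, which is one plausible reason for the ir_cache_for / tiered_jit_for / ir_interp_for helpers.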

Validation strategy

Correctness is established by lockstep, layered so each increment lands on a proven base:

  1. IR interpreter vs reference — the IR diff harness (tests/integration/test_ir_diff_lockstep.cpp + tests/support/ir_diff_harness.hpp) runs two Emulators (IR-driven vs core::step) and memcmps the full GuestState blob after every block across a real RTEMS boot (≈140 k instructions, byte-identical).
  2. JIT vs interpreter — block-level lockstep (tests/unit/test_ir_jit.cpp): a block run through the JIT and through IrInterpreter leaves identical blob + guest memory.
  3. JIT vs reference — run-loop lockstep (tests/integration/test_jit_run_lockstep.cpp, mirrored by the in-tree oracle harness at emulator.cpp:1900+): the JIT engine vs core::step single-stepping, byte-identical across RTEMS boot with self-loop chaining and inline RAM active.
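All three layers share one shape: run a reference and a candidate executor block by block and compare the full state after each one. A minimal sketch of that loop, assuming a fixed-size byte blob and hypothetical names (Blob, BlockExecutor, first_divergence) that are not taken from the real harness:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>

// Hypothetical opaque guest state; the real GuestState layout is not modelled.
using Blob = std::array<std::uint8_t, 64>;

// An executor advances the guest by exactly one block.
using BlockExecutor = std::function<void(Blob&)>;

// Runs both executors in lockstep from identical zeroed state.
// Returns the index of the first block after which the blobs differ, or -1 if
// they stay byte-identical for `blocks` blocks.
long first_divergence(const BlockExecutor& reference,
                      const BlockExecutor& candidate,
                      std::size_t blocks) {
    assert(reference && candidate);
    Blob ref{};
    Blob cand{};
    for (std::size_t i = 0; i < blocks; ++i) {
        reference(ref);
        candidate(cand);
        if (std::memcmp(ref.data(), cand.data(), ref.size()) != 0) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

Comparing after every block rather than only at the end pins a divergence to the block that introduced it, which is what makes a failure in a 140 k instruction boot debuggable.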

On top of that: the full sptest suite under each mode (CSV diff for PASS→FAIL regressions), SMP smptests under Switch/IR/JIT (N=2 and N=4), and ASan/UBSan/LSan runs of the isolated lince_jit_tests executable.

TSan

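libtsan is absent on the development host, so the background-thread JIT and per-core MT scaffolding are validated under ASan+UBSan+LSan only. None of these sanitisers detects data races, so the threaded paths currently have no race-detector coverage.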

Performance

Single-core, cpubound-mix, x86-64 host (median of 5):

| Mode | MIPS | vs Switch |
|---|---|---|
| Switch (interpreter, reference) | 104 | 1.0× |
| IR interpreted (translation on, JIT fallback) | 49 | 0.47× |
| translation: 12.3 lowering | 347 | 3.3× |
| translation: + 12.4a self-loop chaining | 880 | 8.5× |
| translation: + 12.5 inline RAM | 1962 | 18.9× |
| translation: + tiered + 12.4b region (closeout) | ~2070 | ~20× |

The IR-interpreted mode is slower than the Switch interpreter by design (translate-then-interpret indirection with no native payoff); the IR earns its keep only once JIT-compiled — which is why it is only ever the JIT's fallback, never a mode a user selects.

Steady-state cpubound-mix is dominated by a hot loop the self-loop path already captured, so tiering and region chaining leave it roughly flat. Their win is on the cold path, where the old single-tier design paid full O2 codegen for every block, most of which run only a handful of times:

| Cold-path workload | single-tier O2 | tiered (+ interpret-first + region) |
|---|---|---|
| RTEMS hello boot (wall) | 4.5 s | 1.0 s |
| sptest suite (190 ELFs, translation, wall) | ~15 min | 6.3 min |

(The removed Threaded prototype peaked at ~143 MIPS / 1.4× over Switch before the IR JIT superseded it — see Decision 59.)

Build and test

# Standard build — LLVM (>= 18) is mandatory; the JIT is in every build.
cmake -S . -B build -G Ninja
cmake --build build -j

# JIT unit tests, sanitiser-clean (links only lince_jit, not lince_core,
# whose musttail chain is incompatible with ASan)
./build/tests/lince_jit_tests

# JIT lockstep + RTEMS boots, excluding the slow sptest suite
./build/tests/lince_tests "[jit]~[sptests]"

# Compare the Switch interpreter against the default translation path
./build/bench/lince-bench --workload cpubound-mix --runs 5                # translation (default)
./build/bench/lince-bench --workload cpubound-mix --no-translate --runs 5 # Switch interpreter

The full sptest suite under the translation path is compile-bound (cold LLVM codegen per region) and takes minutes — run it without a timeout wrapper.

Module boundaries

interfaces ← ir ← arch_sparc ──┐
            ↑                   ├→ runtime ← app
           jit (LLVM) ─────────┘   (links jit PUBLIC, always)
core, bus, peripherals ─────────┘

lince_ir depends only on lince_interfaces (not lince_core) — it works on the opaque GuestState. lince_jit depends only on lince_ir + LLVM, kept isolated so the JIT tests can build under sanitisers without lince_core, whose musttail chain is incompatible with ASan. The SPARC frontend (lince_arch_sparc) bridges core and IR: it reuses core::decode and the core::layout offsets, but emits arch-neutral IR.

Remaining work

Phase 12 is structurally complete (tiered JIT + region chaining + interpret-first landed). What is left is optimisation and the multi-arch payoff, not core functionality:

  • Perf tuning + a long soak — promotion-threshold sweep, hot-helper inlining, a sustained-divergence stress run. Gated on a concrete 1:1 target rather than the testsuite.
  • FP / atomic / alternate-space lowering — these still bail to core::step; lowering them would shrink the fallback surface (correctness is already preserved by the fallback).
  • An ARM frontend — the payoff that confirms the arch-neutral seam. See Adding a frontend for the procedure.

See also

  • Execution model — the core::step reference cycle and the decode cache the IR path falls back to.
  • Adding a frontend — the IArchitecture / IArchFrontend seam and the step-by-step to add a new ISA.
  • decisions.md — decisions 49–57 (IR neutrality, runtime dispatch) and 59 (threaded-code removal) with rejected alternatives.