Binary translation: a primer

How TERO executes SPARC machine code on an x86-64 host, written for an engineer who has not worked with emulators or JIT compilers before. Every concept is introduced before it is used, and every claim names the type, file, or numbered decision that backs it. The reference-grade companion pages are IR and LLVM JIT (the full data model and backend internals) and Adding a frontend (the contributor procedure for a new guest ISA).

Vocabulary used throughout — the terminology table at the bottom collects all of it:

Guest: the emulated machine and its code. TERO's guest is a SPARC V8 system-on-chip (GR712RC or GR740) running RTEMS.
Host: the machine the emulator runs on — x86-64 Linux (ADR-003).
Functional emulation: reproducing the architectural results of every instruction (registers, memory, traps), not the silicon's internal pipeline timing. TERO's "1:1 real-time" goal means one simulated second per wall-clock second, never cycle accuracy (ADR-004).

1. The problem

A SPARC program is a sequence of 32-bit big-endian instructions for a CPU the host does not have. Something on the host must read each instruction and produce the architectural effect the SPARC manual specifies. The engineering question is how much work to invest per instruction executed versus per instruction translated, because real workloads execute the same small set of instructions billions of times.

The strategies below form a ladder. TERO implements rungs 1 and 4, and keeps rung 1 permanently as its correctness reference; rungs 2 and 3 exist inside rung 4 as its building blocks.

Rung 1: the switch interpreter

Fetch one instruction, decode its bit fields, execute it through a switch over the instruction kind, repeat. This is core::step (src/core/src/step.cpp) over CpuState, with a per-core direct-mapped decode cache so the bit-field extraction is not repeated for a PC already seen. Cost model: every dynamic instruction pays decode-cache lookup + one indirect dispatch + handler body. Measured: ~104 MIPS single-core on the cpubound-mix benchmark (performance table).

The interpreter's value is not speed. Each handler is a direct, readable transcription of the SPARC V8 manual, which makes it the correctness oracle: the implementation every faster path is compared against, bit for bit. It is frozen by ADR-006 — no new features, never removed (Decision 80).
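The fetch, decode, dispatch cycle and the direct-mapped decode cache can be sketched on a toy ISA. This is a minimal illustration, not `core::step`: the three-opcode encoding, `ToyCpu`, `DecodeCache`, and the 64-slot size are all assumptions made for the example.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy ISA (not SPARC): opcode | rd | rs1 | rs2, one byte each.
enum class Kind : std::uint8_t { Add, Sub, Halt };

struct Decoded {
    Kind kind;
    std::uint8_t rd, rs1, rs2;
};

struct ToyCpu {
    std::array<std::uint32_t, 8> r{};
    std::uint32_t pc = 0;  // index into the code vector
    bool halted = false;
};

// Direct-mapped decode cache: the slot is chosen by the low bits of the PC,
// and the stored tag detects a different PC that maps to the same slot.
struct DecodeCache {
    static constexpr std::size_t kSlots = 64;
    struct Slot {
        std::uint32_t tag = 0;
        bool valid = false;
        Decoded insn{};
    };
    std::array<Slot, kSlots> slots{};
    std::uint64_t misses = 0;
};

inline Decoded decode(std::uint32_t word) {
    return Decoded{static_cast<Kind>(word >> 24),
                   static_cast<std::uint8_t>((word >> 16) & 7),
                   static_cast<std::uint8_t>((word >> 8) & 7),
                   static_cast<std::uint8_t>(word & 7)};
}

// One fetch / decode / execute step.
inline void step(ToyCpu& cpu, const std::vector<std::uint32_t>& code, DecodeCache& dc) {
    DecodeCache::Slot& slot = dc.slots[cpu.pc % DecodeCache::kSlots];
    if (!slot.valid || slot.tag != cpu.pc) {  // miss: pay the bit-field extraction
        slot = DecodeCache::Slot{cpu.pc, true, decode(code[cpu.pc])};
        ++dc.misses;
    }
    const Decoded& d = slot.insn;
    switch (d.kind) {  // the single dispatch point
        case Kind::Add: cpu.r[d.rd] = cpu.r[d.rs1] + cpu.r[d.rs2]; break;
        case Kind::Sub: cpu.r[d.rd] = cpu.r[d.rs1] - cpu.r[d.rs2]; break;
        case Kind::Halt: cpu.halted = true; return;
    }
    ++cpu.pc;
}
```

Running the same three-instruction program twice decodes each PC only once; the second pass is served entirely from the cache, which is the saving the cost model above refers to.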

Rung 2: basic blocks

Programs do not branch every instruction. A basic block is a run of instructions with one entry point and one terminator (a branch, call, trap, or other control transfer). Decoding a whole block once and caching the result amortises the per-instruction decode cost over every future execution of that block. The block becomes the unit of translation, caching, and execution everywhere past this point.
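Finding where a block ends can be sketched directly on SPARC V8 instruction words. The opcode fields used here (`op`, `op2`, `op3`) follow the V8 instruction formats; the helper names are hypothetical, and the sketch deliberately ignores delay slots and annul bits, which the real frontend has to handle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Read one 32-bit big-endian instruction word from guest bytes.
inline std::uint32_t fetch_be32(const std::vector<std::uint8_t>& mem, std::size_t off) {
    return (std::uint32_t{mem[off]} << 24) | (std::uint32_t{mem[off + 1]} << 16) |
           (std::uint32_t{mem[off + 2]} << 8) | std::uint32_t{mem[off + 3]};
}

// True for the SPARC V8 control transfers treated as terminators in this sketch.
inline bool is_control_transfer(std::uint32_t w) {
    const std::uint32_t op = w >> 30;
    if (op == 1) return true;  // CALL
    if (op == 0) {
        const std::uint32_t op2 = (w >> 22) & 7;
        return op2 == 2 || op2 == 6 || op2 == 7;  // Bicc, FBfcc, CBccc
    }
    if (op == 2) {
        const std::uint32_t op3 = (w >> 19) & 0x3F;
        return op3 == 0x38 || op3 == 0x39 || op3 == 0x3A;  // JMPL, RETT, Ticc
    }
    return false;  // op == 3: loads and stores
}

// Number of instructions from `start` up to and including the first control
// transfer, capped at `max_insns` and at the end of `mem`.
// Assumption: the delay-slot instruction is not counted here.
inline std::size_t block_length(const std::vector<std::uint8_t>& mem, std::size_t start,
                                std::size_t max_insns) {
    std::size_t n = 0;
    while (n < max_insns && start + 4 * n + 4 <= mem.size()) {
        const std::uint32_t w = fetch_be32(mem, start + 4 * n);
        ++n;
        if (is_control_transfer(w)) break;
    }
    return n;
}
```

For the byte sequence `add %g1, %g2, %g3; nop; ba; nop`, the scan stops at the `ba` and reports a three-instruction block.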

Rung 3: an intermediate representation

Caching decoded SPARC instructions still ties every consumer (an interpreter, a compiler, a tracer) to SPARC. Translating the block instead into a small arch-neutral instruction set — an intermediate representation, IR — decouples them: the IR knows registers only as byte offsets into an opaque state blob and memory only as explicitly-sized, explicitly-endian accesses, so nothing downstream of the translator contains SPARC knowledge. TERO's IR is ir::IrBlock (src/ir/include/tero/ir/ir.hpp); §3 below walks through it. A block of IR can be interpreted (ir::IrInterpreter — execute the ops one by one) or compiled (§5).

Rung 4: just-in-time compilation

Translate the IR block once more, into host-native x86-64 code, and jump into it. The translation cost is paid once per block; every subsequent execution runs at native speed. This is binary translation proper (equivalently: a JIT — just-in-time — compiler whose input language is SPARC machine code). TERO's backend lowers IR through LLVM and reaches ~2070 MIPS on the same benchmark — ~20× the interpreter.

The catch is latency: compiling costs milliseconds, and most blocks in a real workload (boot code, initialisation paths) execute a handful of times — compiling them costs more than it saves. The fix is tiering (§5): run a block on the IR interpreter until it proves hot, then compile cheaply, then recompile with full optimisation in the background.
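The break-even point behind that argument is simple arithmetic: compiling pays off only once `n * interp_cost > compile_cost + n * native_cost`. A small sketch of that inequality follows; the function name and the example costs are hypothetical and are not TERO measurements.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <limits>

// Smallest execution count n for which
//     n * interp_ns  >  compile_ns + n * native_ns
// holds, i.e. the point where compiling the block starts to save time.
inline std::uint64_t break_even_executions(double compile_ns, double interp_ns,
                                           double native_ns) {
    if (interp_ns <= native_ns) {
        // Native code is no faster per execution: compiling never pays off.
        return std::numeric_limits<std::uint64_t>::max();
    }
    const double ratio = compile_ns / (interp_ns - native_ns);
    return static_cast<std::uint64_t>(std::floor(ratio)) + 1;
}
```

With an assumed 1 ms compile, 200 ns per interpreted execution, and 10 ns per native execution, `break_even_executions(1e6, 200.0, 10.0)` returns 5264: a block that runs a handful of times never recovers the compile cost, which is why cold blocks stay on the interpreter.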

2. The pipeline

flowchart LR
    subgraph guest["Guest memory"]
        G["SPARC machine code"]
    end
    subgraph front["tero_arch_sparc"]
        FE["SparcFrontend::translate_block"]
    end
    subgraph ir["tero_ir"]
        B["ir::IrBlock<br/>(TERO IR)"]
        BC["ir::BlockCache<br/>(pc, mode) → block"]
        INT["ir::IrInterpreter"]
    end
    subgraph jit["tero_jit"]
        TJ["jit::TieredJit"]
        LL["LLVM IR → ORCv2 LLJIT"]
        N["native x86-64"]
    end
    G -->|decode once per block| FE --> B --> BC
    BC -->|cold / fallback| INT
    BC -->|hot| TJ --> LL --> N
    INT -->|writes| GS["ir::GuestState blob"]
    N -->|writes| GS

One translation, two executors. The frontend (src/arch/sparc/src/sparc_frontend.cpp) decodes guest bytes into an IrBlock exactly once per (PC, mode); the cached block is then run either by the IR interpreter or by JIT-compiled native code. Both executors mutate the same ir::GuestState byte blob and call the same ir::guest_load/guest_store helpers for guest memory (src/ir/include/tero/ir/guest_memory.hpp), so they cannot drift apart on state layout or endianness.
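The point of a single shared memory helper is that byte order is decided in exactly one function. A minimal sketch of such a helper pair is shown below; `GuestRam`, `guest_load_be`, and `guest_store_be` are hypothetical names, and the alignment and range checks are assumptions rather than a description of `ir::guest_load`/`ir::guest_store`.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct GuestRam {
    std::vector<std::uint8_t> bytes;
};

// Assumption: sizes are 1, 2, or 4 bytes, naturally aligned, and in range.
inline bool access_ok(const GuestRam& ram, std::uint32_t addr, unsigned size) {
    if (size != 1 && size != 2 && size != 4) return false;
    if (addr % size != 0) return false;
    return std::uint64_t{addr} + size <= ram.bytes.size();
}

// Big-endian load: the byte at the lowest address is the most significant.
inline bool guest_load_be(const GuestRam& ram, std::uint32_t addr, unsigned size,
                          std::uint32_t& out) {
    if (!access_ok(ram, addr, size)) return false;  // report a fault to the caller
    std::uint32_t v = 0;
    for (unsigned i = 0; i < size; ++i) v = (v << 8) | ram.bytes[addr + i];
    out = v;
    return true;
}

// Big-endian store: mirror image of the load.
inline bool guest_store_be(GuestRam& ram, std::uint32_t addr, unsigned size,
                           std::uint32_t value) {
    if (!access_ok(ram, addr, size)) return false;
    for (unsigned i = 0; i < size; ++i)
        ram.bytes[addr + i] = static_cast<std::uint8_t>(value >> (8 * (size - 1 - i)));
    return true;
}
```

If both an interpreter and compiled code call helpers like these, neither can disagree with the other about how `0x11223344` is laid out in guest memory.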

Module	Owns	Does not own
`tero_arch_sparc`	SPARC decode → IR; state layout offsets; exception entry (`IArchitecture`)	execution
`tero_ir`	the IR data model; `GuestState`; `BlockCache`; the IR interpreter	any guest-ISA knowledge
`tero_jit`	IR → LLVM IR lowering; tiered compilation; native dispatch	the IR's semantics (it implements them op-for-op)
`tero_runtime`	the run loop that drives all of the above per quantum	—

The dependency rule (see Layers): tero_ir does not depend on tero_core — the IR works on the opaque blob. A new guest ISA is a new frontend module and nothing else; interpreter, cache, JIT, and runtime are shared (Adding a frontend).

Method selection is a runtime config field, not a build flag (EmulatorConfig::translation, default true): false runs the switch interpreter; true runs the JIT with the IR interpreter as its fallback. Both paths are always compiled in.

3. TERO IR in ten minutes

The IR is deliberately small: ~30 operation kinds, no types beyond "32-bit value", no SSA, no control flow inside a block. Everything a guest instruction does decomposes into:

Temps (ir::Temp) — block-local virtual values. Each value-producing op writes a fresh temp; temps die at the block boundary. Cross-instruction state never flows through temps — it flows through the blob.
State ops — LdState/StState read/write the guest register file at (byte offset, size) in the GuestState blob. Register names do not exist in the IR: the SPARC frontend turns %g1 or a windowed %o3 into a byte offset at translate time (sparc_layout.hpp).
Memory ops — LdGuest/StGuest carry an explicit size and endianness (MemEndian::Big for SPARC) and report faults; this is the only place endianness exists in the whole execution stack.
Compute ops — Add, Sub, logic, shifts, multiplies, divides, compares, Select. Plain 32-bit operations on temps.
TrapIf — conditional mid-block exception (alignment, window overflow, divide-by-zero), carrying the exact guest PC.
A structured exit (ir::IrExit) — control flow is not an op; a block ends with one terminator record:

`ExitKind`	Meaning
`FallThrough` / `StaticBranch`	continue at a fixed PC (`is_call` marks a CALL)
`CondBranch`	`cond ? static_target : fallthrough_target`
`IndirectBranch`	continue at a runtime-computed address
`Exception`	deliver an architectural trap
`PowerDown`	core halts pending an interrupt

Worked example — two SPARC instructions and the IR the frontend emits (builder calls, simplified):

guest:                              IR (conceptual):
  add  %g1, %g2, %g3                  t0 = ld_state(off(%g1), 4)
                                      t1 = ld_state(off(%g2), 4)
                                      t2 = add(t0, t1)
                                      st_state(off(%g3), 4, t2)
  ld   [%g3], %g4                     t3 = ld_guest(t2, 4, Big)   ; may fault
                                      st_state(off(%g4), 4, t3)
  (block terminator: FallThrough → next PC)

Every op is stamped with the guest PC and the 0-based instruction index it came from (IrInst::pc / insn_index). That metadata is what makes a mid-block trap report the exact faulting PC, lets the JIT bill retired instructions, and drives the per-instruction step hook (§6).
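The first instruction of the worked example can be run through a very small interpreter to show how temps, state ops, and per-op metadata fit together. This is a sketch under stated assumptions: `Inst`, `Op`, `run_block`, the 64-byte blob, and the layout of one 4-byte slot per global at offset `4 * n` are all invented for the example and do not mirror `ir::IrInst` or `sparc_layout.hpp`.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

using Blob = std::array<std::uint8_t, 64>;  // stand-in for the opaque state blob

enum class Op { LdState, StState, Add };

struct Inst {
    Op op;
    std::uint32_t dst;         // temp written by LdState and Add
    std::uint32_t a, b;        // temp operands; `a` is the value stored by StState
    std::uint32_t offset;      // byte offset into the blob for LdState and StState
    std::uint32_t pc;          // guest PC this op came from
    std::uint32_t insn_index;  // 0-based guest instruction index within the block
};

inline std::uint32_t blob_read32(const Blob& s, std::uint32_t off) {
    std::uint32_t v;
    std::memcpy(&v, s.data() + off, 4);
    return v;
}

inline void blob_write32(Blob& s, std::uint32_t off, std::uint32_t v) {
    std::memcpy(s.data() + off, &v, 4);
}

// Execute the ops one by one; temps live only for the duration of the block.
inline void run_block(const std::vector<Inst>& ops, Blob& state) {
    std::vector<std::uint32_t> temps(ops.size(), 0);  // at most one fresh temp per op
    for (const Inst& i : ops) {
        switch (i.op) {
            case Op::LdState: temps[i.dst] = blob_read32(state, i.offset); break;
            case Op::StState: blob_write32(state, i.offset, temps[i.a]); break;
            case Op::Add:     temps[i.dst] = temps[i.a] + temps[i.b]; break;
        }
    }
}

// IR for `add %g1, %g2, %g3` under the hypothetical layout described above.
inline std::vector<Inst> translate_add_g1_g2_g3(std::uint32_t pc) {
    return {
        {Op::LdState, 0, 0, 0, 4, pc, 0},   // t0 = ld_state(off(%g1), 4)
        {Op::LdState, 1, 0, 0, 8, pc, 0},   // t1 = ld_state(off(%g2), 4)
        {Op::Add,     2, 0, 1, 0, pc, 0},   // t2 = add(t0, t1)
        {Op::StState, 0, 2, 0, 12, pc, 0},  // st_state(off(%g3), 4, t2)
    };
}
```

All four ops carry the same `pc` and `insn_index` because they come from one guest instruction; ops from the following `ld` would carry the next PC and index 1.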

Two pieces of block metadata complete the picture:

ModeCtx — a small arch-defined value that, with the entry PC, keys the block cache. SPARC packs the register-window pointer and three PSR bits into it, so mode-dependent state offsets resolve at translate time and a mode change simply ends the block. Same PC + different window = different cached block.
Delay-slot metadata (delay_trap_*, no_stop_tail) — SPARC branches execute one more instruction after the branch (the delay slot). The frontend encodes the fix-ups this needs on the block so the arch-neutral engine can stay ignorant of the concept.
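Why the window pointer has to be part of the cache key becomes concrete once register names are resolved to offsets. The sketch below uses the standard SPARC overlap rule (the ins of window `cwp` are the outs of window `cwp + 1`, modulo the window count); the 8-window count and the flat layout of 4 bytes per register starting at offset 0 are assumptions for illustration, not the contents of `sparc_layout.hpp`.

```cpp
#include <cassert>
#include <cstdint>

constexpr unsigned kWindows = 8;  // assumption: 8 register windows

// Byte offset of architectural register `reg` (0..31) when the current
// window pointer is `cwp`, in a hypothetical flat layout:
//   globals first, then kWindows * 16 windowed registers, 4 bytes each.
constexpr std::uint32_t reg_offset(unsigned reg, unsigned cwp) {
    if (reg < 8) return 4 * reg;  // %g0..%g7 do not depend on the window
    // %o0..%o7 -> 0..7, %l0..%l7 -> 8..15, %i0..%i7 -> 16..23 within the window
    const unsigned slot = (cwp * 16 + (reg - 8)) % (kWindows * 16);
    return 4 * (8 + slot);
}
```

Under this layout `%o3` (register 11) resolves to offset 44 with `cwp == 0` and to offset 108 with `cwp == 1`. A block translated for one window therefore contains different constants from the same code translated for another, which is why the two must be cached as separate blocks.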

Full field-level reference: IR data model.

The guest state blob

ir::GuestState is a byte array — nothing more. The architecture declares its size (IArchitecture::state_size(): 572 bytes for SPARC, 64 for the toy test ISA) and owns the layout. Since state unification, SPARC's CpuState integer state is this blob: the switch interpreter, the IR interpreter, and JIT-compiled native code all read and write the same bytes, so handing a core from one executor to another requires no synchronisation step at all (State unification).

4. Execution: the dispatch loop

ExecutionEngine::run_ir_quantum (src/runtime/src/engine_translate.cpp) drives one core for one quantum (default 1000 instructions — the round-robin slice that bounds cross-core drift in SMP). Per iteration, for the current PC:

Cache lookup — BlockCache::find(pc, mode); on a miss, call the frontend and insert. The cache is direct-mapped, 8192 slots (details).
Tier check — run the block on the IR interpreter until it has executed jit_baseline_threshold (32) times; then compile at the cheap tier; after jit_promotion_threshold (100) more executions, a background thread recompiles at full optimisation (§5).
Execute — native code or interpreter; both leave the blob updated and return the same BlockExit shape (next PC, or a fault with its exact PC).
Exit handling — normal exits advance the PC through IArchitecture::set_pc; faults go through raise_block_exception, which performs the architecture's trap entry.
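The lookup and tier-check steps above can be sketched as follows. The slot count and the threshold of 32 come from the text; the hash function, the `Block` fields, and the way "compilation" is reduced to flipping a tier flag are assumptions made to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

enum class Tier { Interpreter, Baseline };

struct Block {
    bool valid = false;
    std::uint32_t pc = 0;
    std::uint32_t mode = 0;
    std::uint32_t exec_count = 0;
    Tier tier = Tier::Interpreter;
};

struct BlockCache {
    static constexpr std::size_t kSlots = 8192;
    std::vector<Block> slots = std::vector<Block>(kSlots);
    std::uint64_t translations = 0;

    // Direct-mapped: one candidate slot per key; a mismatching occupant is replaced.
    Block& find_or_translate(std::uint32_t pc, std::uint32_t mode) {
        Block& b = slots[((pc >> 2) ^ mode) % kSlots];  // hypothetical hash
        if (!b.valid || b.pc != pc || b.mode != mode) {
            b = Block{true, pc, mode, 0, Tier::Interpreter};  // stands in for the frontend call
            ++translations;
        }
        return b;
    }
};

constexpr std::uint32_t kBaselineThreshold = 32;

// One dispatch iteration: cache lookup, tier check, then "execute".
// Returns the tier that would run the block on this iteration.
inline Tier dispatch_once(BlockCache& cache, std::uint32_t pc, std::uint32_t mode) {
    Block& b = cache.find_or_translate(pc, mode);
    if (b.tier == Tier::Interpreter && b.exec_count >= kBaselineThreshold) {
        b.tier = Tier::Baseline;  // stands in for a baseline compile
    }
    ++b.exec_count;
    return b.tier;
}
```

In this sketch a block is interpreted for its first 32 executions and runs at the baseline tier from the 33rd onward; the same PC under a different mode value is a separate block with its own counter.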

Some SPARC-specific situations cannot be modelled by the IR: an in-flight delay slot, an annulled-slot micro-state, or an untranslatable instruction (FPU instructions among them). These fall back to the switch oracle for exactly those instructions, then the IR path resumes. A clean-slate ISA without these features never takes the fallback; TERO's toy test frontend (tests/integration/test_toy_frontend.cpp) runs end-to-end with no oracle at all.

Timing is instruction-counted: each retired instruction advances simulated time by ns_per_insn; the JIT changes only how fast wall-clock the same simulated timeline is produced. See Execution model for the quantum, pacing, and idle-skip machinery.
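Instruction-counted time is a multiplication, and the real-time factor is a ratio of two durations. A minimal sketch, with `SimClock` and `realtime_factor` as hypothetical names and the 10 ns figure chosen only for the example:

```cpp
#include <cassert>
#include <cstdint>

// Simulated time advances only when instructions retire.
struct SimClock {
    std::uint64_t ns_per_insn;
    std::uint64_t retired = 0;

    void retire(std::uint64_t n) { retired += n; }
    std::uint64_t now_ns() const { return retired * ns_per_insn; }
};

// Simulated seconds produced per wall-clock second; 1.0 is the 1:1 target.
inline double realtime_factor(std::uint64_t sim_ns, std::uint64_t wall_ns) {
    return static_cast<double>(sim_ns) / static_cast<double>(wall_ns);
}
```

Retiring one 1000-instruction quantum at an assumed 10 ns per instruction advances simulated time by 10 µs regardless of which executor ran it; only the wall-clock denominator changes between the interpreter and the JIT.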

5. LLVM as the backend

LLVM appears in exactly one place: inside tero_jit, turning IR blocks into host machine code at runtime. It is not the source language (guest SPARC is), not the IR of the project (TERO IR is), and not on the reference path (the IR interpreter executes without it). What tero_jit uses, concretely:

Lowering (src/jit/src/ir_jit.cpp): one LLVM function per region (one or more IR blocks, §5.1), one LLVM basic block per member. Each TERO IR op maps to a handful of LLVM instructions — LdState/StState become pointer arithmetic + loads/stores on the blob pointer, LdGuest/StGuest become calls to two extern "C" helpers (jit_guest_load/jit_guest_store) that funnel through the same guest_load/guest_store the interpreter uses. RAM accesses take an inlined fast path that skips the bus call entirely (Inline RAM).
The emitted function ABI (jit::BlockExecFn): void(void* guest_state, void* bus, BlockResult* out, uint32_t budget) — native code receives the blob pointer and an instruction budget, and reports how it exited and how many guest instructions retired.
ORCv2 / LLJIT (LLVM ≥ 18, ADR-003): the on-request compilation API. The module is added, the function symbol is looked up, and the returned pointer is the executable code. No assembler or linker step exists in the project; LLVM owns code emission end to end.
Two optimisation levels (ADR-002): the Baseline tier compiles at CodeGenOptLevel::None — fast translation, adequate code. The Optimised tier runs the full O2 pass pipeline at CodeGenOptLevel::Aggressive on one background thread, and the finished pointer is swapped in with a release-store; the dispatcher acquire-loads it and simply starts calling the better code. Cold-path effect: an RTEMS boot that took 4.5 s under a single mandatory-O2 design takes 1.0 s tiered.
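The release-store / acquire-load handoff between the background compiler and the dispatcher can be sketched with `std::atomic`. The `JitEntry` type and the simplified one-argument function signature are assumptions for the example; the real `jit::BlockExecFn` has the four-parameter ABI described above.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical, simplified signature standing in for compiled block code.
using BlockFn = std::uint32_t (*)(std::uint32_t);

inline std::uint32_t baseline_code(std::uint32_t x) { return x + 1; }
inline std::uint32_t optimised_code(std::uint32_t x) { return x + 1; }  // same result, stands in for faster code

struct JitEntry {
    std::atomic<BlockFn> code{nullptr};

    // Compiler thread: publish finished code. The release store makes every
    // write that produced the code visible before the pointer itself.
    void publish(BlockFn fn) { code.store(fn, std::memory_order_release); }

    // Dispatcher thread: pick up whatever is current. nullptr means
    // "nothing compiled yet, interpret the block".
    BlockFn current() const { return code.load(std::memory_order_acquire); }
};
```

The dispatcher never waits: it calls whichever pointer the acquire load returns, so the upgrade from baseline to optimised code takes effect on the next dispatch after the store and needs no lock.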

5.1 Regions and self-loops

Compiling single blocks leaves money on the table: a hot loop exits to the dispatcher every iteration. Two measures keep execution inside native code:

Self-loop chaining: a block whose branch targets its own entry compiles into a native loop, bounded by the instruction budget.
Region chaining: the compiler fuses up to jit_max_region_blocks (8) successor blocks into one function, following static branch targets (including across SPARC SAVE/RESTORE window shifts, whose mode delta is known at translate time).

Both carry the budget check in the generated code, so a region yields at exactly the quantum boundary — the same boundary the switch interpreter would stop at, which keeps SMP round-robin interleaving deterministic across execution methods.
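The shape of a budget-bounded native loop can be written out in plain C++. This is a sketch of the idea only: `run_self_loop` and `LoopResult` are hypothetical, the loop body is reduced to a counter decrement, and the sketch assumes an iteration starts only when it fits entirely in the remaining budget.

```cpp
#include <cassert>
#include <cstdint>

struct LoopResult {
    std::uint32_t retired;    // guest instructions retired in this call
    bool budget_exhausted;    // true if the loop yielded before finishing
    std::uint32_t counter;    // guest loop counter on exit
};

// Stands in for a compiled self-loop of `insns_per_iter` guest instructions
// that decrements `counter` until it reaches zero.
inline LoopResult run_self_loop(std::uint32_t counter, std::uint32_t insns_per_iter,
                                std::uint32_t budget) {
    std::uint32_t retired = 0;
    while (counter != 0) {
        // Budget check inside the generated loop: yield to the dispatcher
        // instead of running past the quantum boundary.
        if (budget - retired < insns_per_iter) return {retired, true, counter};
        --counter;
        retired += insns_per_iter;
    }
    return {retired, false, counter};
}
```

A 1000-iteration loop of 4 instructions with a budget of 1000 yields after 250 iterations with 750 still to go, and resuming it in later quanta reaches the same final state as one long run would.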

5.2 What the backend refuses to do

No breakpoints in native code: with a GDB stub attached, region fusion is disabled and any block containing a breakpoint runs interpretively — compiled code cannot stop mid-block, and TERO never patches guest memory (GDB under translation, Decision 58).
No semantic shortcuts: the lowering implements the IR op-for-op. Anything the lowering cannot express returns nullptr and the interpreter runs that block — a fallback, never an error.

6. Why a project-owned IR instead of emitting LLVM IR directly

This is the most common question about the design. The hot path gives no reason: once a block is compiled, the native code is identical whether the frontend emitted TERO IR (lowered to LLVM IR at compile time) or LLVM IR directly, and the extra hop costs microseconds, once per block. The reasons all live off the hot path. Two of them are decisive on their own; the table after them collects the rest.

Total semantics: LLVM IR has undefined behaviour by design

LLVM IR is built to compile languages that have undefined behaviour — poison, undef, and partially-defined operations are what license its optimiser. Example: shl %x, 33 on an i32 is poison in LLVM IR, while SPARC SLL with count 33 has an exact architectural result (count taken mod 32, SPARC V8 §B.12). An emulator needs total semantics: every bit pattern the guest can produce must have a defined result.

With TERO IR, totalisation happens once, in one place: the IR defines Shl as a << (b & 31) (interpreter.cpp), and the single lowering emits the mask (ir_jit.cpp). With frontends emitting LLVM IR directly, every author of every frontend must know and avoid LLVM's poison/undef rules on every instruction — and the failure mode is not an error but a silent miscompile under O2, the most expensive bug class in a translator. TERO IR is, in effect, the layer that removes undefined behaviour from LLVM IR before any frontend can touch it.
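The masked shift is easy to state in C++, which has the same trap as LLVM IR: shifting a 32-bit value by 32 or more is undefined behaviour in C++ as well. The helper names below are hypothetical; the `b & 31` mask is the definition quoted above.

```cpp
#include <cassert>
#include <cstdint>

// Total left shift: defined for every 32-bit count, because the count is
// reduced to its low 5 bits before the shift is performed.
inline std::uint32_t shl_total(std::uint32_t a, std::uint32_t b) {
    return a << (b & 31u);
}

// Total logical right shift, masked the same way.
inline std::uint32_t srl_total(std::uint32_t a, std::uint32_t b) {
    return a >> (b & 31u);
}
```

With the mask in place, a count of 33 behaves as a count of 1, so `shl_total(1, 33)` is 2 on every compiler and at every optimisation level. Writing `a << b` directly would leave the result for `b == 33` to whatever the optimiser decides.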

The cold path has no executor

LLVM IR cannot be interpreted in production (lli is a process-level tool, not a per-block engine embeddable against a GuestState). Without an interpreter, every block pays compilation — including the majority that execute a handful of times. The cost is measured in this project: compile-everything versus interpret-first tiering is 4.5 s → 1.0 s on the RTEMS hello boot and ~15 min → 6.3 min on the 190-ELF sptest suite (performance table). Cold-path compile latency is also exactly the jitter the 1:1 real-time target (bounded P99.9) cannot absorb.

The rest of the ledger

Requirement	TERO IR provides	Direct LLVM IR would not
Run cold blocks without compiling	`IrInterpreter` executes IR at zero compile latency; most blocks in a boot run a handful of times and are never worth compiling	LLVM IR has no embeddable per-block interpreter; every block would pay compile latency — the jitter ADR-002 exists to kill
Reference-path duties (trace, GDB single-step, lockstep state compare)	every op carries `pc`/`insn_index`; the step hook stops the interpreter at exact instruction boundaries (Decision 79)	per-instruction guest metadata has no first-class home in LLVM IR; no pass is obligated to preserve it
Internal cross-check	the same IR runs on two independent executors (interpreter vs JIT); a lowering bug shows up as state divergence in the lockstep tests	one executor — a lowering bug is invisible until an external oracle catches it
Dispatch economics	`IrBlock` anchors its JIT cache entry (`jit_entry`), making hot dispatch a pointer deref instead of a hash lookup (+19.5% measured)	the dispatch table would key on raw PCs with nothing to hang the anchor on
Arch-neutral seam	a new ISA implements `translate_block` → IR and nothing else	each frontend would target LLVM's full surface, and every IR-level tool (cache, tracer, comparator) would need to understand it

Secondary, but real: LLVM IR is not stable across major versions (the typed-pointer → opaque-pointer migration is the canonical example). Today an LLVM upgrade touches one file (src/jit/src/ir_jit.cpp); with frontends written in LLVM IR it would touch every frontend. And an IrBlock is plain data — the 8192-slot cache stores it by value and the background O2 tier copies regions with a vector copy, where an LLVM module is a heavyweight object with a Context and ThreadSafeModule discipline.

Two measured results reinforce the conclusion that steady-state performance lives in the backend and dispatch, not in the intermediate format: block linking at the dispatch layer (lever B) measured −9% and was rejected; removing the byte-swap in lowered code had a ceiling of +2–3% because LLVM already folds the swap into MOVBE. Both experiments are recorded in plans/ and the performance log.

Convergent industry design

The thin-own-IR-in-front, heavy-backend-behind shape is not particular to TERO. QEMU translates every guest ISA to TCG ops — its own neutral IR in exactly this role — and interprets or compiles that; the HQEMU research line, which does use LLVM as a backend, goes guest → TCG → LLVM IR rather than emitting LLVM IR from frontends; V8 runs JavaScript through its own bytecode before TurboFan. Production translators converge on a narrow, totally-defined, metadata-carrying contract at the frontend boundary, and treat the optimising compiler as a backend behind it.

The trade-off accepted in exchange: the project owns an IR definition and an interpreter for it (~30 op kinds, one source file each), and frontends must be written against it. ADR-006 extends the same logic to the future frontend generator: EmuGen emits TERO IR, never LLVM IR — generating LLVM IR would orphan the interpreter, the block cache, the tiering, and the GDB integration in one stroke.

7. How we know it is correct

Translation bugs are silent — wrong code runs happily and corrupts state long before anything crashes. TERO's defence is redundancy at every level, all of it exercised in CI:

Layer	Mechanism	Where
Semantic redundancy	Switch oracle vs JIT, full RTEMS suites, bit-exact final state	`tests/integration/test_rtems_sptests.cpp` + suite CSVs
Lowering cross-check	IR interpreter vs JIT over real boots	`test_jit_run_lockstep.cpp`
Block-level lockstep	reference core vs IR engine, full blob `memcmp` per block	`tests/support/ir_diff_harness.hpp`
Instruction-level lockstep	blob compare at every interior instruction boundary (E0)	`run_ir_diff(..., per_insn = true)`, `test_ir_reference_path.cpp`
External oracle	SIS (Gaisler's simulator) lockstep trace compare	`scripts/lockstep_compare.py`

The division of labour: the switch interpreter is hand-checked against the SPARC V8 manual; everything faster is machine-checked against the switch interpreter; the whole stack is spot-checked against an independent implementation (SIS). A new guest architecture keeps the middle layers and replaces the outer ones per ADR-006 — see EmuGen and the multi-arch plan.
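The block-level lockstep idea reduces to: run two executors over the same block sequence and compare the whole state blob after each one. The sketch below shows that loop with hypothetical names (`Executor`, `first_divergence`) and a 16-byte blob; it is not the interface of `ir_diff_harness.hpp`.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <functional>
#include <vector>

using Blob = std::array<std::uint8_t, 16>;

// An executor applies the effect of one block (identified by id) to a blob.
using Executor = std::function<void(Blob&, std::uint32_t)>;

// Run both executors block by block from identical initial state.
// Returns the index of the first block after which the blobs differ, or -1.
inline int first_divergence(const Executor& reference, const Executor& candidate,
                            const std::vector<std::uint32_t>& blocks) {
    Blob ref_state{};
    Blob cand_state{};
    for (std::size_t i = 0; i < blocks.size(); ++i) {
        reference(ref_state, blocks[i]);
        candidate(cand_state, blocks[i]);
        if (std::memcmp(ref_state.data(), cand_state.data(), ref_state.size()) != 0) {
            return static_cast<int>(i);  // first block whose result differs
        }
    }
    return -1;
}
```

Comparing after every block, instead of only at the end of a run, pins a divergence to the block that introduced it. Per-instruction lockstep applies the same comparison at a finer granularity.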

Terminology

Term	Meaning in this project
Guest / host	the emulated SPARC machine / the x86-64 Linux machine running TERO
Basic block	straight-line instruction run, one entry, ends at a control transfer; the unit of translation and caching
Frontend	per-ISA decoder that turns guest bytes into TERO IR (`IArchFrontend::translate_block`)
IR	intermediate representation — TERO's own arch-neutral instruction set (`ir::IrBlock`), not LLVM IR
Lowering	translating one representation into a lower-level one (TERO IR → LLVM IR → x86-64)
Backend	the consumer that turns IR into executable behaviour; here the LLVM-based `tero_jit`
JIT	just-in-time compiler — compiles at runtime, only what executes
Tier	a compilation level; TERO has interpret → Baseline (O0) → Optimised (O2, background)
Quantum	per-core instruction budget per scheduling round (default 1000); bounds SMP drift
Oracle	a trusted reference implementation used for differential testing; the frozen switch interpreter for SPARC, SIS externally
Lockstep	running two implementations input-by-input and comparing state at each step
Delay slot	SPARC executes the instruction after a branch before the branch takes effect; encoded as block metadata, invisible to the IR engine
Blob	`ir::GuestState` — the architecture's register file as an opaque byte array
Mode context	arch-defined bits (`ModeCtx`) that key cached blocks alongside the PC

Pointers

IR and LLVM JIT — full reference: data model, lowering, tiers, performance history.
Adding a frontend — write a new guest ISA.
EmuGen — the planned frontend generator (design, gated).
Execution model — quantum, pacing, idle skip, reference-path duties.
Decisions — numbered judgment calls (49–59, 67–68, 79, 80 cover this stack).