Binary translation: a primer
How TERO executes SPARC machine code on an x86-64 host, written for an engineer who has not worked with emulators or JIT compilers before. Every concept is introduced before it is used, and every claim names the type, file, or numbered decision that backs it. The reference-grade companion pages are IR and LLVM JIT (the full data model and backend internals) and Adding a frontend (the contributor procedure for a new guest ISA).
Vocabulary used throughout — the terminology table at the bottom collects all of it:
- Guest: the emulated machine and its code. TERO's guest is a SPARC V8 system-on-chip (GR712RC or GR740) running RTEMS.
- Host: the machine the emulator runs on — x86-64 Linux (ADR-003).
- Functional emulation: reproducing the architectural results of every instruction (registers, memory, traps), not the silicon's internal pipeline timing. TERO's "1:1 real-time" goal means one simulated second per wall-clock second, never cycle accuracy (ADR-004).
1. The problem
A SPARC program is a sequence of 32-bit big-endian instructions for a CPU the host does not have. Something on the host must read each instruction and produce the architectural effect the SPARC manual specifies. The engineering question is how much work to invest per instruction executed versus per instruction translated, because real workloads execute the same small set of instructions billions of times.
The strategies below form a ladder. TERO implements rungs 1 and 4, and keeps rung 1 permanently as its correctness reference; rungs 2 and 3 exist inside rung 4 as its building blocks.
Rung 1: the switch interpreter
Fetch one instruction, decode its bit fields, execute it through a
switch over the instruction kind, repeat. This is
core::step (src/core/src/step.cpp) over CpuState, with a per-core
direct-mapped decode cache so the bit-field extraction is not repeated for
a PC already seen. Cost model: every dynamic instruction pays decode-cache
lookup + one indirect dispatch + handler body. Measured: ~104 MIPS
single-core on the cpubound-mix benchmark (performance
table).
The interpreter's value is not speed. Each handler is a direct, readable transcription of the SPARC V8 manual, which makes it the correctness oracle: the implementation every faster path is compared against, bit for bit. It is frozen by ADR-006 — no new features, never removed (Decision 80).
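
The fetch, cached-decode, dispatch cycle described above can be sketched for a toy ISA. This is a minimal illustration, not TERO's core::step: the two-instruction encoding and every name here (ToyCpu, toy_decode, toy_step) are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical toy ISA, not SPARC: the top 8 bits select the operation.
enum class Kind : uint8_t { AddImm, Halt, Illegal };

struct Decoded {
    uint32_t pc = 0xFFFFFFFFu;  // tag: the PC this cache slot describes
    Kind kind = Kind::Illegal;
    uint8_t rd = 0, rs = 0;
    uint16_t imm = 0;
};

struct ToyCpu {
    uint32_t pc = 0;
    std::array<uint32_t, 8> regs{};
    bool halted = false;
    std::array<Decoded, 64> decode_cache{};  // direct-mapped, indexed by word address
    uint64_t decode_misses = 0;
};

// Bit-field extraction: the work the decode cache avoids repeating.
inline Decoded toy_decode(uint32_t pc, uint32_t word) {
    Decoded d;
    d.pc = pc;
    const uint32_t op = word >> 24;
    d.kind = op == 1 ? Kind::AddImm : op == 2 ? Kind::Halt : Kind::Illegal;
    d.rd = (word >> 20) & 0x7;
    d.rs = (word >> 16) & 0x7;
    d.imm = word & 0xFFFF;
    return d;
}

// One step: decode-cache lookup, one switch dispatch, then the handler body.
inline void toy_step(ToyCpu& cpu, const std::vector<uint32_t>& code) {
    assert(cpu.pc % 4 == 0 && cpu.pc / 4 < code.size());
    Decoded& slot = cpu.decode_cache[(cpu.pc / 4) % cpu.decode_cache.size()];
    if (slot.pc != cpu.pc) {  // miss: decode once and remember the result
        slot = toy_decode(cpu.pc, code[cpu.pc / 4]);
        ++cpu.decode_misses;
    }
    switch (slot.kind) {
        case Kind::AddImm:
            cpu.regs[slot.rd] = cpu.regs[slot.rs] + slot.imm;
            cpu.pc += 4;
            break;
        case Kind::Halt:
        case Kind::Illegal:
            cpu.halted = true;
            break;
    }
}
```

Running the same three-instruction program twice decodes each instruction only on the first pass; the second pass still pays the lookup and the switch on every instruction, which is the cost the later rungs remove.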
Rung 2: basic blocks
Programs do not branch every instruction. A basic block is a run of instructions with one entry point and one terminator (a branch, call, trap, or other control transfer). Decoding a whole block once and caching the result amortises the per-instruction decode cost over every future execution of that block. The block becomes the unit of translation, caching, and execution everywhere past this point.
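
As a rough illustration of finding a block boundary, the sketch below scans forward from an entry point to the first terminator. The toy encoding (top byte 0x80 or above means "control transfer") and the names is_terminator, BlockExtent, and scan_block are assumptions made for this example; it also ignores SPARC delay slots, which the real frontend has to handle.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical toy encoding: any word whose top byte is 0x80 or above is a
// control transfer. Real SPARC decoding is far more involved.
inline bool is_terminator(uint32_t word) { return (word >> 24) >= 0x80u; }

struct BlockExtent {
    std::size_t first;  // index of the entry instruction
    std::size_t count;  // instructions in the block, terminator included
};

// Scan forward from `first` until a terminator (inclusive) or the end of code.
inline BlockExtent scan_block(const std::vector<uint32_t>& code, std::size_t first) {
    assert(first < code.size());
    std::size_t i = first;
    while (i + 1 < code.size() && !is_terminator(code[i])) ++i;
    return {first, i - first + 1};
}
```

A translator would decode the `count` instructions once and store the result under the entry PC, so later executions skip both the scan and the decode.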
Rung 3: an intermediate representation
Caching decoded SPARC instructions still ties every consumer (an
interpreter, a compiler, a tracer) to SPARC. Translating the block instead
into a small arch-neutral instruction set — an intermediate
representation, IR — decouples them: the IR knows registers only as byte
offsets into an opaque state blob and memory only as explicitly-sized,
explicitly-endian accesses, so nothing downstream of the translator
contains SPARC knowledge. TERO's IR is ir::IrBlock
(src/ir/include/tero/ir/ir.hpp); §3 below walks through it. A block of
IR can be interpreted (ir::IrInterpreter — execute the ops one by one)
or compiled (§5).
Rung 4: just-in-time compilation
Translate the IR block once more, into host-native x86-64 code, and jump into it. The translation cost is paid once per block; every subsequent execution runs at native speed. This is binary translation proper (equivalently: a JIT — just-in-time — compiler whose input language is SPARC machine code). TERO's backend lowers IR through LLVM and reaches ~2070 MIPS on the same benchmark — ~20× the interpreter.
The catch is latency: compiling costs milliseconds, and most blocks in a real workload (boot code, initialisation paths) execute a handful of times — compiling them costs more than it saves. The fix is tiering (§5): run a block on the IR interpreter until it proves hot, then compile cheaply, then recompile with full optimisation in the background.
2. The pipeline
flowchart LR
subgraph guest["Guest memory"]
G["SPARC machine code"]
end
subgraph front["tero_arch_sparc"]
FE["SparcFrontend::translate_block"]
end
subgraph ir["tero_ir"]
B["ir::IrBlock<br/>(TERO IR)"]
BC["ir::BlockCache<br/>(pc, mode) → block"]
INT["ir::IrInterpreter"]
end
subgraph jit["tero_jit"]
TJ["jit::TieredJit"]
LL["LLVM IR → ORCv2 LLJIT"]
N["native x86-64"]
end
G -->|decode once per block| FE --> B --> BC
BC -->|cold / fallback| INT
BC -->|hot| TJ --> LL --> N
INT -->|writes| GS["ir::GuestState blob"]
N -->|writes| GS
One translation, two executors. The frontend
(src/arch/sparc/src/sparc_frontend.cpp) decodes guest bytes into an
IrBlock exactly once per (PC, mode); the cached block is then run
either by the IR interpreter or by JIT-compiled native code. Both
executors mutate the same ir::GuestState byte blob and call the same
ir::guest_load/guest_store helpers for guest memory
(src/ir/include/tero/ir/guest_memory.hpp), so they cannot drift apart on
state layout or endianness.
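
To make the endianness point concrete, here is a sketch of an explicitly sized big-endian load and store on a host of any byte order. The flat-RAM model and the names load_be and store_be are assumptions for illustration; they are not the signatures in guest_memory.hpp, which go through a bus.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical flat guest RAM starting at guest address 0.
using GuestRam = std::vector<uint8_t>;

inline bool access_ok(const GuestRam& ram, uint32_t addr, unsigned size) {
    return addr % size == 0 && static_cast<uint64_t>(addr) + size <= ram.size();
}

// Load `size` bytes (1, 2 or 4) at `addr`, most significant byte first.
// Returns nullopt for a misaligned or out-of-range access (a "fault").
inline std::optional<uint32_t> load_be(const GuestRam& ram, uint32_t addr, unsigned size) {
    assert(size == 1 || size == 2 || size == 4);
    if (!access_ok(ram, addr, size)) return std::nullopt;
    uint32_t v = 0;
    for (unsigned i = 0; i < size; ++i) v = (v << 8) | ram[addr + i];
    return v;
}

// Store the low `size` bytes of `value`, most significant byte first.
// Returns false on a fault and leaves memory untouched.
inline bool store_be(GuestRam& ram, uint32_t addr, unsigned size, uint32_t value) {
    assert(size == 1 || size == 2 || size == 4);
    if (!access_ok(ram, addr, size)) return false;
    for (unsigned i = 0; i < size; ++i)
        ram[addr + i] = static_cast<uint8_t>(value >> (8 * (size - 1 - i)));
    return true;
}
```

Because both executors would call one such pair of helpers, byte order is decided in a single place rather than once per executor.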
| Module | Owns | Does not own |
|---|---|---|
| tero_arch_sparc | SPARC decode → IR; state layout offsets; exception entry (IArchitecture) | execution |
| tero_ir | the IR data model; GuestState; BlockCache; the IR interpreter | any guest-ISA knowledge |
| tero_jit | IR → LLVM IR lowering; tiered compilation; native dispatch | the IR's semantics (it implements them op-for-op) |
| tero_runtime | the run loop that drives all of the above per quantum | (none) |
The dependency rule (see Layers): tero_ir does not depend
on tero_core — the IR works on the opaque blob. A new guest ISA is a new
frontend module and nothing else; interpreter, cache, JIT, and runtime are
shared (Adding a frontend).
Method selection is a runtime config field, not a build flag
(EmulatorConfig::translation, default true): false runs the switch
interpreter; true runs the JIT with the IR interpreter as its fallback.
Both paths are always compiled in.
3. TERO IR in ten minutes
The IR is deliberately small: ~30 operation kinds, no types beyond "32-bit value", no SSA, no control flow inside a block. Everything a guest instruction does decomposes into:
- Temps (ir::Temp): block-local virtual values. Each value-producing op writes a fresh temp; temps die at the block boundary. Cross-instruction state never flows through temps; it flows through the blob.
- State ops: LdState / StState read/write the guest register file at (byte offset, size) in the GuestState blob. Register names do not exist in the IR: the SPARC frontend turns %g1 or a windowed %o3 into a byte offset at translate time (sparc_layout.hpp).
- Memory ops: LdGuest / StGuest carry an explicit size and endianness (MemEndian::Big for SPARC) and report faults; this is the only place endianness exists in the whole execution stack.
- Compute ops: Add, Sub, logic, shifts, multiplies, divides, compares, Select. Plain 32-bit operations on temps.
- TrapIf: conditional mid-block exception (alignment, window overflow, divide-by-zero), carrying the exact guest PC.
- A structured exit (ir::IrExit): control flow is not an op; a block ends with one terminator record:
| ExitKind | Meaning |
|---|---|
| FallThrough / StaticBranch | continue at a fixed PC (is_call marks a CALL) |
| CondBranch | cond ? static_target : fallthrough_target |
| IndirectBranch | continue at a runtime-computed address |
| Exception | deliver an architectural trap |
| PowerDown | core halts pending an interrupt |
Worked example — two SPARC instructions and the IR the frontend emits (builder calls, simplified):
    guest:                 IR (conceptual):
    add %g1, %g2, %g3      t0 = ld_state(off(%g1), 4)
                           t1 = ld_state(off(%g2), 4)
                           t2 = add(t0, t1)
                           st_state(off(%g3), 4, t2)
    ld [%g3], %g4          t3 = ld_state(off(%g3), 4)
                           t4 = ld_guest(t3, 4, Big)   ; may fault
                           st_state(off(%g4), 4, t4)
    (block terminator: FallThrough → next PC)
Every op is stamped with the guest PC and the 0-based instruction index it
came from (IrInst::pc / insn_index). That metadata is what makes a
mid-block trap report the exact faulting PC, lets the JIT bill retired
instructions, and drives the per-instruction step hook (§6).
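
Interpreting such a block amounts to a loop over its ops with a local temp array and the state blob as the only persistent storage. The sketch below covers just three op kinds and uses invented names (Op, Inst, run_block) and an invented encoding; it is not the layout of ir::IrInst, and it keeps register values in host byte order as an assumption.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

enum class Op : uint8_t { LdState, StState, Add };

// One IR operation. Field meaning depends on `op`; all names are illustrative.
struct Inst {
    Op op;
    uint32_t dst;  // temp index written (LdState, Add)
    uint32_t a;    // state byte offset (LdState/StState) or left temp (Add)
    uint32_t b;    // source temp (StState) or right temp (Add)
};

// Execute a straight-line block against the state blob. Temps are local to
// this call, so nothing survives the block except what was stored to the blob.
inline void run_block(const std::vector<Inst>& block, std::vector<uint8_t>& blob) {
    std::vector<uint32_t> temps(block.size(), 0);  // at most one temp per op
    for (const Inst& in : block) {
        switch (in.op) {
            case Op::LdState:
                assert(in.a + 4 <= blob.size() && in.dst < temps.size());
                std::memcpy(&temps[in.dst], &blob[in.a], 4);
                break;
            case Op::StState:
                assert(in.a + 4 <= blob.size() && in.b < temps.size());
                std::memcpy(&blob[in.a], &temps[in.b], 4);
                break;
            case Op::Add:
                assert(in.dst < temps.size() && in.a < temps.size() && in.b < temps.size());
                temps[in.dst] = temps[in.a] + temps[in.b];  // wraps modulo 2^32
                break;
        }
    }
}
```

With hypothetical offsets 4, 8, and 12 standing in for %g1, %g2, and %g3, the four ops of the `add` above become two LdState, one Add, and one StState entries.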
Two pieces of block metadata complete the picture:
- ModeCtx: a small arch-defined value that, with the entry PC, keys the block cache. SPARC packs the register-window pointer and three PSR bits into it, so mode-dependent state offsets resolve at translate time and a mode change simply ends the block. Same PC + different window = different cached block.
- Delay-slot metadata (delay_trap_*, no_stop_tail): SPARC branches execute one more instruction after the branch (the delay slot). The frontend encodes the fix-ups this needs on the block so the arch-neutral engine can stay ignorant of the concept.
Full field-level reference: IR data model.
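
Keying on (PC, mode) can be pictured with a small direct-mapped cache. This is a sketch under assumptions: the slot count matches the 8192 mentioned later on this page, but the index function, the Block payload, and the class name ToyBlockCache are invented here and say nothing about how ir::BlockCache hashes or stores blocks.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Stand-in for a translated block; `payload` represents the IR it would hold.
struct Block {
    uint32_t pc = 0;
    uint32_t mode = 0;
    int payload = 0;
};

struct Slot {
    Block block;
    bool valid = false;
};

class ToyBlockCache {
public:
    static constexpr std::size_t kSlots = 8192;

    // Hit only if both the PC and the mode context match the stored block.
    const Block* find(uint32_t pc, uint32_t mode) const {
        const Slot& s = (*slots_)[index(pc, mode)];
        return (s.valid && s.block.pc == pc && s.block.mode == mode) ? &s.block : nullptr;
    }

    // Direct-mapped: a new block simply overwrites whatever held the slot.
    void insert(const Block& b) {
        Slot& s = (*slots_)[index(b.pc, b.mode)];
        s.block = b;
        s.valid = true;
    }

private:
    // Hypothetical index function: word address mixed with the mode bits.
    static std::size_t index(uint32_t pc, uint32_t mode) {
        return ((pc >> 2) ^ (mode * 0x9E3779B1u)) % kSlots;
    }

    std::unique_ptr<std::array<Slot, kSlots>> slots_ =
        std::make_unique<std::array<Slot, kSlots>>();
};
```

Looking up the same PC under a different mode misses, which is what forces a fresh translation with the new window's state offsets.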
The guest state blob
ir::GuestState is a byte array — nothing more. The architecture declares
its size (IArchitecture::state_size(): 572 bytes for SPARC, 64 for the
toy test ISA) and owns the layout. Since state unification, SPARC's
CpuState integer state is this blob: the switch interpreter, the IR
interpreter, and JIT-compiled native code all read and write the same
bytes, so handing a core from one executor to another requires no
synchronisation step at all (State
unification).
4. Execution: the dispatch loop
ExecutionEngine::run_ir_quantum
(src/runtime/src/engine_translate.cpp) drives one core for one
quantum (default 1000 instructions — the round-robin slice that bounds
cross-core drift in SMP). Per iteration, for the current PC:
- Cache lookup: BlockCache::find(pc, mode); on a miss, call the frontend and insert. The cache is direct-mapped, 8192 slots (details).
- Tier check: run the block on the IR interpreter until it has executed jit_baseline_threshold (32) times; then compile at the cheap tier; after jit_promotion_threshold (100) more executions, a background thread recompiles at full optimisation (§5).
- Execute: native code or interpreter; both leave the blob updated and return the same BlockExit shape (next PC, or a fault with its exact PC).
- Exit handling: normal exits advance the PC through IArchitecture::set_pc; faults go through raise_block_exception, which performs the architecture's trap entry.
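
The tier check in the list above reduces to a per-block execution counter compared against two thresholds. The sketch below uses the default values quoted there (32 and 100); the type and function names are hypothetical, and it pretends compilation finishes instantly, whereas the real Optimised tier arrives whenever the background thread is done.

```cpp
#include <cassert>
#include <cstdint>

enum class Tier { Interpret, Baseline, Optimised };

struct BlockProfile {
    uint32_t exec_count = 0;  // completed executions of this block
};

struct TierConfig {
    uint32_t baseline_threshold = 32;    // interpreted runs before the cheap compile
    uint32_t promotion_threshold = 100;  // further runs before the full-optimisation compile
};

// Decide which tier the next execution uses, then count that execution.
// Simplification: assumes each compile completes before the next dispatch.
inline Tier tier_for_next_run(BlockProfile& p, TierConfig cfg = TierConfig{}) {
    assert(p.exec_count < UINT32_MAX);  // the counter must not wrap
    Tier t = Tier::Interpret;
    if (p.exec_count >= cfg.baseline_threshold + cfg.promotion_threshold) {
        t = Tier::Optimised;
    } else if (p.exec_count >= cfg.baseline_threshold) {
        t = Tier::Baseline;
    }
    ++p.exec_count;
    return t;
}
```

Under these assumptions a block executed 200 times runs 32 times interpreted, 100 times at Baseline, and 68 times at Optimised; a block executed five times never leaves the interpreter.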
SPARC-specific situations the IR cannot model (an in-flight delay slot, an
annulled-slot micro-state, an untranslatable instruction, the FPU among
them) fall back to the switch oracle for exactly those instructions, then
the IR path resumes. A clean-slate ISA without these features never takes
the fallback; TERO's toy test frontend
(tests/integration/test_toy_frontend.cpp) runs end-to-end with no
oracle at all.
Timing is instruction-counted: each retired instruction advances simulated
time by ns_per_insn; the JIT changes only how quickly, in wall-clock terms,
the same simulated timeline is produced. See Execution
model for the quantum, pacing, and idle-skip
machinery.
5. LLVM as the backend
LLVM appears in exactly one place: inside tero_jit, turning IR blocks
into host machine code at runtime. It is not the source language
(guest SPARC is), not the IR of the project (TERO IR is), and not
on the reference path (the IR interpreter executes without it). What
tero_jit uses, concretely:
- Lowering (src/jit/src/ir_jit.cpp): one LLVM function per region (one or more IR blocks, §5.1), one LLVM basic block per member. Each TERO IR op maps to a handful of LLVM instructions: LdState / StState become pointer arithmetic + loads/stores on the blob pointer, LdGuest / StGuest become calls to two extern "C" helpers (jit_guest_load / jit_guest_store) that funnel through the same guest_load / guest_store the interpreter uses. RAM accesses take an inlined fast path that skips the bus call entirely (Inline RAM).
- The emitted function ABI (jit::BlockExecFn): void(void* guest_state, void* bus, BlockResult* out, uint32_t budget). Native code receives the blob pointer and an instruction budget, and reports how it exited and how many guest instructions retired.
- ORCv2 / LLJIT (LLVM ≥ 18, ADR-003): the on-request compilation API. The module is added, the function symbol is looked up, and the returned pointer is the executable code. No assembler or linker step exists in the project; LLVM owns code emission end to end.
- Two optimisation levels (ADR-002): the Baseline tier compiles at CodeGenOptLevel::None, giving fast translation and adequate code. The Optimised tier runs the full O2 pass pipeline at CodeGenOptLevel::Aggressive on one background thread, and the finished pointer is swapped in with a release-store; the dispatcher acquire-loads it and simply starts calling the better code. Cold-path effect: an RTEMS boot that took 4.5 s under a single mandatory-O2 design takes 1.0 s tiered.
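
The release-store / acquire-load hand-off mentioned in the last bullet is a standard C++ atomics pattern. The sketch below shows it with a simplified function-pointer type; JitEntry, publish, and dispatch are illustrative names, and the signature is deliberately not the jit::BlockExecFn ABI.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Simplified stand-in for a compiled block: takes and returns a value.
using BlockFn = uint32_t (*)(uint32_t);

inline uint32_t baseline_code(uint32_t x) { return x + 1; }
inline uint32_t optimised_code(uint32_t x) { return x + 1; }  // same semantics, assumed faster

struct JitEntry {
    std::atomic<BlockFn> fn{nullptr};  // nullptr: nothing compiled yet
};

// Compiler side: publish finished code. The release store makes everything
// written before it (the emitted code) visible to a thread that acquire-loads
// the pointer.
inline void publish(JitEntry& e, BlockFn code) {
    assert(code != nullptr);
    e.fn.store(code, std::memory_order_release);
}

// Dispatcher side: pick up whatever is currently published, with no lock.
inline uint32_t dispatch(const JitEntry& e, uint32_t x) {
    BlockFn f = e.fn.load(std::memory_order_acquire);
    return f != nullptr ? f(x) : x + 1;  // nullptr: stand-in for the interpreter path
}
```

The dispatcher never waits for the compiler: it keeps calling the old pointer until the new one becomes visible, and both compute the same result.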
5.1 Regions and self-loops
Compiling single blocks leaves money on the table: a hot loop exits to the dispatcher every iteration. Two measures keep execution inside native code:
- Self-loop chaining: a block whose branch targets its own entry compiles into a native loop, bounded by the instruction budget.
- Region chaining: the compiler fuses up to jit_max_region_blocks (8) successor blocks into one function, following static branch targets (including across SPARC SAVE/RESTORE window shifts, whose mode delta is known at translate time).
Both carry the budget check in the generated code, so a region yields at exactly the quantum boundary — the same boundary the switch interpreter would stop at, which keeps SMP round-robin interleaving deterministic across execution methods.
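
A hand-written C++ stand-in can show what a budget-bounded self-loop looks like from the dispatcher's side. Everything here is an assumption made for illustration: the block is modelled as a count-down loop, the names are invented, and the sketch yields before any iteration that would overshoot the budget, which may differ from how TERO's generated code treats a partial final iteration.

```cpp
#include <cassert>
#include <cstdint>

struct LoopResult {
    uint32_t retired;  // guest instructions executed in this call
    uint32_t next_pc;  // where the dispatcher should continue
};

// Stand-in for a compiled self-loop block of `insns_per_iter` guest
// instructions at `entry_pc`: decrement `counter`, branch back while non-zero.
inline LoopResult run_self_loop(uint32_t& counter, uint32_t entry_pc,
                                uint32_t insns_per_iter, uint32_t budget) {
    assert(insns_per_iter > 0);
    uint32_t retired = 0;
    while (true) {
        // Budget check inside the loop: yield to the dispatcher rather than
        // run an iteration that would exceed the quantum.
        if (retired + insns_per_iter > budget) return {retired, entry_pc};
        --counter;
        retired += insns_per_iter;
        // Branch not taken: fall through to the instruction after the block.
        if (counter == 0) return {retired, entry_pc + 4 * insns_per_iter};
    }
}
```

A short loop finishes inside one call and falls through; a long one returns with next_pc equal to its own entry, so the next quantum resumes it with the counter preserved in guest state.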
5.2 What the backend refuses to do
- No breakpoints in native code: with a GDB stub attached, region fusion is disabled and any block containing a breakpoint runs interpretively — compiled code cannot stop mid-block, and TERO never patches guest memory (GDB under translation, Decision 58).
- No semantic shortcuts: the lowering implements the IR op-for-op. For anything it cannot express, the lowering returns nullptr and the interpreter runs that block: a fallback, never an error.
6. Why a project-owned IR instead of emitting LLVM IR directly
The most common question about this design. The hot path gives no reason: once a block is compiled, the native code is identical whether the frontend emitted TERO IR (lowered to LLVM IR at compile time) or LLVM IR directly — the extra hop costs microseconds, once per block. The reasons all live off the hot path. Two of them are decisive on their own; the table after them collects the rest.
Total semantics: LLVM IR has undefined behaviour by design
LLVM IR is built to compile languages that have undefined behaviour —
poison, undef, and partially-defined operations are what license its
optimiser. Example: shl %x, 33 on an i32 is poison in LLVM IR, while
SPARC SLL with count 33 has an exact architectural result (count taken
mod 32, SPARC V8 §B.12). An emulator needs total semantics: every bit
pattern the guest can produce must have a defined result.
With TERO IR, totalisation happens once, in one place: the IR defines
Shl as a << (b & 31) (interpreter.cpp), and the single lowering
emits the mask (ir_jit.cpp). With frontends emitting LLVM IR directly,
every author of every frontend must know and avoid LLVM's poison/undef
rules on every instruction — and the failure mode is not an error but a
silent miscompile under O2, the most expensive bug class in a translator.
TERO IR is, in effect, the layer that removes undefined behaviour from
LLVM IR before any frontend can touch it.
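
C++ has the same hazard as LLVM IR here: shifting a 32-bit value by 32 or more is undefined behaviour, so a total shift has to mask the count first. The functions below are a sketch of that totalisation with invented names (ir_shl, ir_srl, ir_sra); only the `a << (b & 31)` definition of Shl comes from the text above, and the right-shift variants are added by analogy.

```cpp
#include <cassert>
#include <cstdint>

// Total 32-bit left shift: defined for every count because only the low
// five bits of the count are used, matching SPARC SLL.
constexpr uint32_t ir_shl(uint32_t a, uint32_t b) { return a << (b & 31u); }

// Logical right shift with the same count masking.
constexpr uint32_t ir_srl(uint32_t a, uint32_t b) { return a >> (b & 31u); }

// Arithmetic right shift written on unsigned values, so it does not rely on
// how the host compiler shifts negative signed integers.
constexpr uint32_t ir_sra(uint32_t a, uint32_t b) {
    return (a >> (b & 31u)) |
           ((a & 0x80000000u) != 0 ? ~(0xFFFFFFFFu >> (b & 31u)) : 0u);
}

static_assert(ir_shl(1u, 33u) == 2u, "count 33 behaves as count 1");
static_assert(ir_shl(1u, 32u) == 1u, "count 32 behaves as count 0");
```

A lowering that emits the mask before the LLVM `shl` keeps the count in range, so the poison case cannot be reached whatever the guest supplies.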
The cold path has no executor
LLVM IR cannot be interpreted in production (lli is a process-level
tool, not a per-block engine embeddable against a GuestState). Without
an interpreter, every block pays compilation — including the majority
that execute a handful of times. The cost is measured in this project:
compile-everything versus interpret-first tiering is 4.5 s → 1.0 s on
the RTEMS hello boot and ~15 min → 6.3 min on the 190-ELF sptest
suite (performance table). Cold-path compile
latency is also exactly the jitter the 1:1 real-time target (bounded
P99.9) cannot absorb.
The rest of the ledger
| Requirement | TERO IR provides | With direct LLVM IR instead |
|---|---|---|
| Run cold blocks without compiling | IrInterpreter executes IR at zero compile latency; most blocks in a boot run a handful of times and are never worth compiling | LLVM IR has no embeddable per-block interpreter; every block would pay compile latency, the jitter ADR-002 exists to kill |
| Reference-path duties (trace, GDB single-step, lockstep state compare) | every op carries pc/insn_index; the step hook stops the interpreter at exact instruction boundaries (Decision 79) | per-instruction guest metadata has no first-class home in LLVM IR; no pass is obligated to preserve it |
| Internal cross-check | the same IR runs on two independent executors (interpreter vs JIT); a lowering bug shows up as state divergence in the lockstep tests | one executor: a lowering bug is invisible until an external oracle catches it |
| Dispatch economics | IrBlock anchors its JIT cache entry (jit_entry), making hot dispatch a pointer deref instead of a hash lookup (+19.5% measured) | the dispatch table would key on raw PCs with nothing to hang the anchor on |
| Arch-neutral seam | a new ISA implements translate_block → IR and nothing else | each frontend would target LLVM's full surface, and every IR-level tool (cache, tracer, comparator) would need to understand it |
Secondary, but real: LLVM IR is not stable across major versions (the
typed-pointer → opaque-pointer migration is the canonical example).
Today an LLVM upgrade touches one file (src/jit/src/ir_jit.cpp); with
frontends written in LLVM IR it would touch every frontend. And an
IrBlock is plain data — the 8192-slot cache stores it by value and the
background O2 tier copies regions with a vector copy, where an LLVM
module is a heavyweight object with a Context and ThreadSafeModule
discipline.
Two measured results reinforce the conclusion that steady-state
performance lives in the backend and dispatch, not in the intermediate
format: block linking at the dispatch layer (lever B) measured −9%
and was rejected; removing the byte-swap in lowered code had a ceiling of
+2–3% because LLVM already folds the swap into MOVBE. Both experiments
are recorded in plans/ and the performance log.
Convergent industry design
The thin-own-IR-in-front, heavy-backend-behind shape is not particular to TERO. QEMU translates every guest ISA to TCG ops — its own neutral IR in exactly this role — and interprets or compiles that; the HQEMU research line, which does use LLVM as a backend, goes guest → TCG → LLVM IR rather than emitting LLVM IR from frontends; V8 runs JavaScript through its own bytecode before TurboFan. Production translators converge on a narrow, totally-defined, metadata-carrying contract at the frontend boundary, and treat the optimising compiler as a backend behind it.
The trade-off accepted in exchange: the project owns an IR definition and an interpreter for it (~30 op kinds, one source file each), and frontends must be written against it. ADR-006 extends the same logic to the future frontend generator: EmuGen emits TERO IR, never LLVM IR — generating LLVM IR would orphan the interpreter, the block cache, the tiering, and the GDB integration in one stroke.
7. How we know it is correct
Translation bugs are silent — wrong code runs happily and corrupts state long before anything crashes. TERO's defence is redundancy at every level, all of it exercised in CI:
| Layer | Mechanism | Where |
|---|---|---|
| Semantic redundancy | Switch oracle vs JIT, full RTEMS suites, bit-exact final state | tests/integration/test_rtems_sptests.cpp + suite CSVs |
| Lowering cross-check | IR interpreter vs JIT over real boots | test_jit_run_lockstep.cpp |
| Block-level lockstep | reference core vs IR engine, full blob memcmp per block | tests/support/ir_diff_harness.hpp |
| Instruction-level lockstep | blob compare at every interior instruction boundary (E0) | run_ir_diff(..., per_insn = true), test_ir_reference_path.cpp |
| External oracle | SIS (Gaisler's simulator) lockstep trace compare | scripts/lockstep_compare.py |
The division of labour: the switch interpreter is hand-checked against the SPARC V8 manual; everything faster is machine-checked against the switch interpreter; the whole stack is spot-checked against an independent implementation (SIS). A new guest architecture keeps the middle layers and replaces the outer ones per ADR-006 — see EmuGen and the multi-arch plan.
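
The block-level lockstep idea fits in a few lines: run two executors from the same starting blob and stop at the first step where the bytes differ. This is a generic sketch, not the interface of ir_diff_harness.hpp; Blob, Executor, and first_divergence are names invented for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

using Blob = std::vector<uint8_t>;
using Executor = std::function<void(Blob&)>;  // runs one block, mutating the blob

// Run both executors from the same initial state for `blocks` steps and
// return the index of the first step after which the blobs differ, or
// nullopt if they agree throughout.
inline std::optional<std::size_t> first_divergence(Blob initial,
                                                   const Executor& reference,
                                                   const Executor& candidate,
                                                   std::size_t blocks) {
    assert(reference && candidate);
    Blob a = initial;
    Blob b = std::move(initial);
    for (std::size_t i = 0; i < blocks; ++i) {
        reference(a);
        candidate(b);
        if (a != b) return i;  // byte-for-byte compare of the whole state
    }
    return std::nullopt;
}
```

Comparing after every block, rather than only at the end of a run, pins a divergence to the block that introduced it instead of to whatever crashed later.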
Terminology
| Term | Meaning in this project |
|---|---|
| Guest / host | the emulated SPARC machine / the x86-64 Linux machine running TERO |
| Basic block | straight-line instruction run, one entry, ends at a control transfer; the unit of translation and caching |
| Frontend | per-ISA decoder that turns guest bytes into TERO IR (IArchFrontend::translate_block) |
| IR | intermediate representation — TERO's own arch-neutral instruction set (ir::IrBlock), not LLVM IR |
| Lowering | translating one representation into a lower-level one (TERO IR → LLVM IR → x86-64) |
| Backend | the consumer that turns IR into executable behaviour; here the LLVM-based tero_jit |
| JIT | just-in-time compiler — compiles at runtime, only what executes |
| Tier | a compilation level; TERO has interpret → Baseline (O0) → Optimised (O2, background) |
| Quantum | per-core instruction budget per scheduling round (default 1000); bounds SMP drift |
| Oracle | a trusted reference implementation used for differential testing; the frozen switch interpreter for SPARC, SIS externally |
| Lockstep | running two implementations input-by-input and comparing state at each step |
| Delay slot | SPARC executes the instruction after a branch before the branch takes effect; encoded as block metadata, invisible to the IR engine |
| Blob | ir::GuestState — the architecture's register file as an opaque byte array |
| Mode context | arch-defined bits (ModeCtx) that key cached blocks alongside the PC |
Pointers
- IR and LLVM JIT — full reference: data model, lowering, tiers, performance history.
- Adding a frontend — write a new guest ISA.
- EmuGen — the planned frontend generator (design, gated).
- Execution model — quantum, pacing, idle skip, reference-path duties.
- Decisions — numbered judgment calls (49–59, 67–68, 79, 80 cover this stack).