
EmuGen — declarative frontend generation

Status: designed and approved; only stage E0 is implemented

EmuGen is the build-time generator planned by ADR-006 (plans/multiarch-emugen-frontends.md, approved 2026-06-10). Of its staged plan, only E0 is implemented (the IR interpreter as a full reference path; Decision 79). Stages E1–E3 are gated on the project committing to two new guest architectures. Apart from stage E0, nothing on this page describes committed code.

The problem it solves

After the arch-neutral IR (primer §3), the only per-ISA artifact in the execution stack is the frontend: a decoder plus a translate_block that emits TERO IR, plus a state-layout header. Writing one by hand has two cost centres:

  1. Opcode tables and decoders — mechanical, voluminous, and transcription-error-prone. A bit-field typo produces a wrong instruction that may execute plausibly for millions of cycles.
  2. Instruction semantics — intellectually irreducible (someone must state what ADD does), but currently expressed as C++ builder calls, which are harder to audit against the ISA manual's pseudocode than a declarative form would be.

EmuGen targets both: a single declarative description per ISA, from which the frontend artifacts are generated at build time. Prior art exists in table-driven emulator generators (TableGen-style pipelines used by industrial full-system simulators); ADR-006 adopts the generation idea while rejecting the part that conflicts with TERO's validation architecture (§"The hard rule" below).
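To make the first cost centre concrete, here is a minimal sketch of a mask/match decoder table with a compile-time ambiguity check, the kind of mechanical verification a generated table makes cheap. Every name in it (Encoding, kDemoTable, overlaps, unambiguous, decode, field) is hypothetical and is not TERO's API. The three rows use the SPARC V8 format-3 layout (op in bits 31:30, op3 in bits 24:19) purely as familiar sample data.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string_view>

// Hypothetical table row: a word matches when (word & mask) == match.
struct Encoding {
    std::string_view name;
    std::uint32_t mask;
    std::uint32_t match;
};

// Sample rows only (SPARC V8 format 3: op = bits 31:30, op3 = bits 24:19).
constexpr std::array<Encoding, 3> kDemoTable{{
    {"add", 0xC1F80000u, 0x80000000u},  // op=2, op3=0x00
    {"and", 0xC1F80000u, 0x80080000u},  // op=2, op3=0x01
    {"sub", 0xC1F80000u, 0x80200000u},  // op=2, op3=0x04
}};

// Two rows are ambiguous if some word satisfies both, i.e. they agree
// on every bit that both masks constrain.
constexpr bool overlaps(const Encoding& a, const Encoding& b) {
    const std::uint32_t common = a.mask & b.mask;
    return (a.match & common) == (b.match & common);
}

template <std::size_t N>
constexpr bool unambiguous(const std::array<Encoding, N>& table) {
    for (std::size_t i = 0; i < N; ++i)
        for (std::size_t j = i + 1; j < N; ++j)
            if (overlaps(table[i], table[j])) return false;
    return true;
}

// Fails the build if two rows can match the same instruction word.
static_assert(unambiguous(kDemoTable), "two encodings match the same word");

// Extract bits hi..lo (inclusive) of an instruction word.
constexpr std::uint32_t field(std::uint32_t word, unsigned hi, unsigned lo) {
    assert(hi >= lo && hi - lo < 31);
    return (word >> lo) & ((1u << (hi - lo + 1)) - 1u);
}

// Linear scan over the table; returns the first matching row's name.
constexpr std::optional<std::string_view> decode(std::uint32_t word) {
    for (std::size_t i = 0; i < kDemoTable.size(); ++i)
        if ((word & kDemoTable[i].mask) == kDemoTable[i].match)
            return kDemoTable[i].name;
    return std::nullopt;
}
```

Under these assumptions, decode(0x86004002u) (the SPARC word for add %g1, %g2, %g3) selects the add row, and field(0x86004002u, 29, 25) yields rd = 3. A typo that collides with another row fails the static_assert at build time; a typo that lands on an unused encoding is not caught by this check and still needs the lockstep validation described later on this page.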

What it will consume and emit

Input — one ISA description with three sections:

| Section | Content | Generated output |
|---|---|---|
| State | registers, widths, layout | the GuestState offset header (the per-arch *_layout.hpp) |
| Encodings | opcode tables, bit fields | the decoder |
| Semantics | per-instruction compositions of TERO IR ops | the IArchFrontend::translate_block bodies |

Output is C++ source, generated as a CMake build step, and is reproducible and reviewable. To every downstream consumer (block cache, IR interpreter, tiered JIT, GDB integration), a generated frontend is indistinguishable from a hand-written one.
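As an illustration of the State row, the following sketch shows what a generated offset header could look like for a made-up 32-bit ISA. DemoGuestState, the demo_layout namespace, and every field name are assumptions chosen for the example; TERO's real GuestState and *_layout.hpp contents are not shown on this page.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <type_traits>

// Hypothetical guest state for an illustrative 32-bit ISA (not TERO's GuestState).
struct DemoGuestState {
    std::uint32_t gpr[32];
    std::uint32_t pc;
    std::uint32_t status;
};

// offsetof is only well-defined for standard-layout types.
static_assert(std::is_standard_layout_v<DemoGuestState>);

namespace demo_layout {

inline constexpr std::size_t kGprBase = offsetof(DemoGuestState, gpr);
inline constexpr std::size_t kPc      = offsetof(DemoGuestState, pc);
inline constexpr std::size_t kStatus  = offsetof(DemoGuestState, status);

// Byte offset of general-purpose register `index` inside the state block.
constexpr std::size_t gpr_offset(unsigned index) {
    assert(index < 32);
    return kGprBase + index * sizeof(std::uint32_t);
}

// Pin the layout so an edit to the description cannot silently move a field.
static_assert(kPc == 128 && kStatus == 132 && sizeof(DemoGuestState) == 136);

}  // namespace demo_layout
```

Emitting the pinning static_assert alongside the constants is a design choice of this sketch: it turns an accidental layout change into a build failure rather than a runtime miscompare.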

What it will NOT generate

  • Interpreters or JITs. TERO has one IR interpreter and one LLVM backend, already written and validated; the generator only fabricates their per-ISA feeder. (This is the rejected half of the prior art, which generates whole interpreter cores per ISA.)
  • The switch interpreter. Frozen as the SPARC oracle (Decision 80); hand-written permanently.
  • Peripherals or SoC composition. Owned by the entity model and the .tero machine files (Peripheral system).

The hard rule: the generator emits TERO IR, never LLVM IR

Generating LLVM IR directly would bypass the project's own IR and orphan everything keyed on it: the IR interpreter (the reference path every IR-only architecture depends on for trace, single-step, and lockstep — Decision 79), the (PC, mode) block cache, the tiered compilation pipeline, and the per-op guest metadata GDB consumes. The full argument is the primer's §6; ADR-006 freezes the conclusion: LLVM remains exactly what it is today — the runtime backend, fed by TERO IR.

ADR-006 in brief

Recorded as Decision 80; full text in plans/multiarch-emugen-frontends.md.

  1. The switch interpreter is frozen — SPARC-only, no new features, kept indefinitely as oracle and trace path.
  2. A new guest architecture is IR-only — it implements IArchFrontend + IArchitecture and nothing else; its reference path is the IR interpreter, its fast path the tiered JIT.
  3. EmuGen generates frontends, nothing deeper — and emits TERO IR.
  4. Each new architecture gets an external oracle — lockstep against an independent emulator plays the role SIS plays for SPARC.
  5. Candidates (revisited when the gate opens): RISC-V RV32 (NOEL-V class) as architecture #2, ARM AArch32 as #3.

Staged plan and status

| Stage | Content | Status |
|---|---|---|
| E0 | IR interpreter as full reference path: per-instruction step hook, exact single-step, interior breakpoints, instruction-level lockstep | Done 2026-06-10 (Decision 79, plans/e0-ir-reference-path.md) |
| E1 | Architecture #2 with a hand-written frontend + the decoder-table generator (the mechanical 80% of the win, at 20% of the risk) | Deferred (gated) |
| E2 | Full EmuGen: the semantics DSL, designed only once two hand-written frontends (SPARC + arch #2) exist as its corpus | Deferred (gated) |
| E3 | SPARC retrofit: describe SPARC V8 in the DSL, compare generated IR block-by-block against the hand-written frontend over the RTEMS corpus; once validated, the generated frontend becomes production (amended 2026-06-10 from optional to mandatory) | Deferred (gated) |
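The block-by-block comparison planned for E3 can be pictured with a small sketch that reports the first op at which a generated block departs from its hand-written counterpart. DemoIrOp, DemoBlock, first_divergence, and require_identical are hypothetical names; the sketch also assumes strict op-for-op equality, whereas a real comparison might need to tolerate differences such as temporary numbering.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical, simplified IR op; TERO's real op representation is not shown here.
struct DemoIrOp {
    std::uint16_t opcode;
    std::uint32_t dst, src_a, src_b;
    friend bool operator==(const DemoIrOp& x, const DemoIrOp& y) {
        return x.opcode == y.opcode && x.dst == y.dst &&
               x.src_a == y.src_a && x.src_b == y.src_b;
    }
};

using DemoBlock = std::vector<DemoIrOp>;

// Index of the first op where the blocks differ (a length mismatch counts
// as a difference at the shorter block's end), or nullopt if identical.
std::optional<std::size_t> first_divergence(const DemoBlock& hand_written,
                                            const DemoBlock& generated) {
    const auto [h, g] = std::mismatch(hand_written.begin(), hand_written.end(),
                                      generated.begin(), generated.end());
    if (h == hand_written.end() && g == generated.end()) return std::nullopt;
    return static_cast<std::size_t>(h - hand_written.begin());
}

// Debug-build guard a comparison driver could call once per translated block.
inline void require_identical(const DemoBlock& hand_written, const DemoBlock& generated) {
    assert(!first_divergence(hand_written, generated).has_value());
}
```

Reporting an index rather than a boolean keeps a failing block easy to triage: the offending guest instruction can be located from the op position.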

The gate ("rule of three"): the DSL is not designed until two real frontends exist to inform it — a generator designed from one example (SPARC) would encode abstractions no second architecture has verified. E0 ran early because it is the capability everything else assumes, and it was provable on SPARC against the existing oracle.

Validation per architecture

| Layer | SPARC | New ISAs (E1+) |
|---|---|---|
| Semantic redundancy | Switch oracle vs JIT, bit-exact | none (single semantic source; accepted by ADR-006) |
| Lowering cross-check | IR interpreter vs JIT | IR interpreter vs JIT |
| Instruction-level lockstep | run_ir_diff(per_insn) vs oracle (E0) | same harness, vs the external oracle |
| External oracle | SIS | per-ISA independent emulator |
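The idea behind instruction-level lockstep can be sketched as follows: step a reference and a candidate one instruction at a time from the same initial state, and stop at the first instruction after which their architectural state differs. This is not the project's run_ir_diff harness, whose interface is not shown on this page; DemoState, Stepper, lockstep, and the two toy steppers are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>

// Hypothetical architectural state, reduced to a PC and one accumulator.
struct DemoState {
    std::uint32_t pc;
    std::uint32_t acc;
    friend bool operator==(const DemoState& a, const DemoState& b) {
        return a.pc == b.pc && a.acc == b.acc;
    }
};

// A stepper executes exactly one guest instruction on the given state.
using Stepper = std::function<void(DemoState&)>;

// Returns the zero-based index of the first instruction after which the
// two states differ, or nullopt if they agree for max_steps instructions.
std::optional<std::size_t> lockstep(DemoState initial, const Stepper& reference,
                                    const Stepper& candidate, std::size_t max_steps) {
    assert(reference && candidate);
    DemoState ref = initial;
    DemoState cand = initial;
    for (std::size_t i = 0; i < max_steps; ++i) {
        reference(ref);
        candidate(cand);
        if (!(ref == cand)) return i;
    }
    return std::nullopt;
}

// Toy steppers: the "buggy" one miscomputes the instruction at pc == 8.
inline void demo_reference_step(DemoState& s) { s.acc += 1; s.pc += 4; }
inline void demo_buggy_step(DemoState& s) { s.acc += (s.pc == 8 ? 2u : 1u); s.pc += 4; }
```

With these toy steppers and a zeroed initial state, lockstep reports instruction index 2, the one fetched at pc == 8. Comparing after every instruction, rather than at block boundaries, is what pins a divergence to a single guest instruction.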

Cross-cutting exit criterion for every stage: SPARC RTEMS pass rate ≥ the rate at stage entry, and no new mandatory runtime dependencies (the generator is a build-time tool; external oracles are test-time tools).

Pointers

  • Binary translation — a primer — the stack this generator feeds.
  • Adding a frontend — the manual procedure EmuGen automates; its contract (builder calls, set_cur_pc stamping, no_stop_tail) is exactly what generated code must satisfy.
  • plans/multiarch-emugen-frontends.md — ADR-006 full text and stage details.