Execution model

This page is the canonical reference for how Tero advances time and runs guest code: the two execution methods, the run_for/run_until loop, the per-core quantum, round-robin scheduling, idle-time skipping, pacing modes, and the single-instruction step() cycle that the translation path falls back to. All citations are path:line into the current source.

The run loop lives in the ExecutionEngine (Runtime decomposition): scheduling rounds in src/runtime/src/engine_run_loop.cpp, the translation quanta in src/runtime/src/engine_translate.cpp, interrupt sampling and the time base in src/runtime/src/engine_irq_time.cpp, the MultiThread workers in src/runtime/src/engine_mt.cpp. Emulator::run_for/run_until are one-line delegations to the engine (emulator.cpp:307-313).

Two execution methods

Tero has two execution methods, both compiled into every build and selected at runtime by EmulatorConfig::translation (bool, default true, emulator_config.hpp:327):

translation | Method | Driver | Role
false | Switch interpreter | core::step per instruction (run_core_quantum's per-step loop, engine_run_loop.cpp:190) | The reference path and correctness oracle
true (default) | Binary translation | ExecutionEngine::run_ir_quantum: tiered LLVM JIT block-at-a-time, IR interpreter fallback (engine_translate.cpp:92) | The fast path, validated bit-identical against the oracle

A third quantum path exists alongside these: the universal IR-interpret path (run_ir_interpret_quantum, engine_translate.cpp:379 — entity-model S9). It drives the arch-neutral IR interpreter block by block with no JIT. It is the non-JIT path for any non-SPARC frontend (which has no core::step oracle), and SPARC opts into it via EmulatorConfig::force_ir_interpret (emulator_config.hpp:357) for differential validation against the switch oracle.

Both methods share the same CpuState and the same trap / PSR / error-mode semantics. The IR/JIT path is not a separate emulator: it operates on the same GuestState byte blob that core::step uses (state unification; see Layers), and it falls back to the reference step for delay slots, annulled slots, and any instruction the frontend cannot translate (run_ir_quantum's fallback_step, which calls IArchitecture::reference_step, engine_translate.cpp:119-123).

This page documents the run-loop scheduling and the step() cycle. The translation engine itself — IR ops, the block cache, the tiered JIT — is documented in IR and LLVM JIT.

Why keep a slow interpreter at all?

The interpreter trades throughput for determinism, debuggability, and a small surface for hardware-modelling bugs. That is exactly why it stays the oracle the JIT is checked against. See Design principles §7.

The public run API

RunResult run_for(SimTimeNs duration);      // advance by duration
RunResult run_until(SimTimeNs deadline);    // advance until sim_time >= deadline
RunResult single_step(CoreId core);         // exactly one instruction, no IRQ/events

run_for is a thin wrapper over run_until (engine_run_loop.cpp:22):

RunResult ExecutionEngine::run_for(SimTimeNs duration) {
    return run_until(SimTimeNs{to_underlying(sim_time_) + to_underlying(duration)});
}

Every call returns a RunResult (run_result.hpp):

Field | Meaning
instructions_executed | total instructions retired across all cores this call
time_elapsed | the sim_time_ at return
reason | a HaltReason (see below)

HaltReason distinguishes the four ways a run stops:

  • DeadlineReached / DurationExpired — normal completion.
  • HaltedMode — a guest core took a trap with ET=0 (SPARC stops the processor). This is the guest's own doing (RTEMS _CPU_Fatal_halt / _exit issue ta 0 with ET=0), not an emulator failure (Decision 56).
  • ErrorMode — reserved for an internal emulator error, distinct from a guest HaltedMode.
  • Breakpoint — a GDB breakpoint, a late-binding GDB attach, or a Ctrl-C async interrupt.
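A caller typically branches on the reason field. The sketch below is illustrative only: the enum and struct are local stand-ins that mirror the names listed above, while the helper names (completed_normally, describe), the label strings, and the integer field types are assumptions rather than anything taken from run_result.hpp.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Stand-in declarations for illustration; the real ones live in run_result.hpp
// and may differ in layout and underlying types.
enum class HaltReason { DeadlineReached, DurationExpired, HaltedMode, ErrorMode, Breakpoint };

struct RunResult {
    std::uint64_t instructions_executed;  // retired across all cores this call
    std::uint64_t time_elapsed;           // simulated ns (plain integer assumed here)
    HaltReason reason;
};

// Hypothetical helper: true when the run simply consumed its requested span.
constexpr bool completed_normally(HaltReason r) {
    return r == HaltReason::DeadlineReached || r == HaltReason::DurationExpired;
}

// Hypothetical helper: a label a front end might print for each reason.
constexpr std::string_view describe(HaltReason r) {
    switch (r) {
        case HaltReason::DeadlineReached: return "deadline reached";
        case HaltReason::DurationExpired: return "duration expired";
        case HaltReason::HaltedMode:      return "guest halted";
        case HaltReason::ErrorMode:       return "internal emulator error";
        case HaltReason::Breakpoint:      return "debugger stop";
    }
    return "unknown";
}
```

Keeping HaltedMode and ErrorMode as separate branches preserves the distinction drawn above between a guest that stopped itself and an emulator fault.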

run_until → pacing → run_until_unpaced

run_until slices the requested span into pacing_slice_ns chunks (default 10 ms simulated, emulator_config.hpp:304) under both pacing modes (engine_run_loop.cpp:56). The slice bounds the per-segment idle-skip: an unbounded run_until_unpaced over a long all-idle stretch could leap simulated time past a clock-dependent guest's periodic GPTimer ticks. Realtime additionally sleeps between chunks; Turbo runs them back-to-back.
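The slicing arithmetic can be sketched in isolation. This is a minimal model, not the engine's code: slice_deadlines is a hypothetical helper, times are plain nanosecond integers, and it assumes every unpaced segment reaches the target it was given (a real segment can return early, for example on a breakpoint).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper: the deadline handed to each unpaced segment when the
// span [sim_now, deadline) is cut into slice_ns chunks.
std::vector<std::uint64_t> slice_deadlines(std::uint64_t sim_now,
                                           std::uint64_t deadline,
                                           std::uint64_t slice_ns) {
    assert(slice_ns > 0);  // a zero slice would never make progress
    std::vector<std::uint64_t> targets;
    while (sim_now < deadline) {
        const std::uint64_t target = std::min(deadline, sim_now + slice_ns);
        targets.push_back(target);
        sim_now = target;  // assumption: the segment ran all the way to its target
    }
    return targets;
}
```

With the default 10 ms slice, a 25 ms request becomes three segments ending at 10 ms, 20 ms, and 25 ms, so no single idle-skip can jump further than one slice.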

flowchart TD
    A["run_until(deadline)"] --> D["anchor wall & sim time"]
    D --> E["slice loop:<br/>run_until_unpaced(min(deadline, sim+slice))"]
    E --> B{pacing?}
    B -- Realtime --> F["sleep_until(wall_anchor + sim_done_ns)"]
    B -- Turbo --> G
    F --> G{sim &lt; deadline?}
    G -- yes --> E
    G -- no --> H[return]

PacingMode::Turbo (tests, batch, SMP2 wrapper)

Free-running. The slice loop runs the chunks back-to-back and never sleeps. This is the mode for CI, the RTEMS testsuite, the bench tools, and the future SMP2 wrapper (where the external scheduler dictates cadence).

PacingMode::Realtime (default for interactive use)

The slice loop calls sleep_until on std::chrono::steady_clock between chunks so that 1 s simulated ≈ 1 s real (engine_run_loop.cpp:89-95). The wall-clock and sim-time anchors are taken once per call and are local: persisting them across calls would let an idle host program drift the simulation arbitrarily ahead of wall-clock. Halving ns_per_insn doubles simulated MHz, so the host must execute twice the instructions per real second to stay on schedule.
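The two relationships in this paragraph are small enough to write out. The following sketch assumes ns_per_insn is an integer number of nanoseconds and that sim time is a plain nanosecond counter; wake_time and required_insn_per_second are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

using Clock = std::chrono::steady_clock;

// Wall-clock instant the slice loop would sleep until: the per-call wall
// anchor plus the simulated time completed since the per-call sim anchor.
Clock::time_point wake_time(Clock::time_point wall_anchor,
                            std::uint64_t sim_anchor_ns,
                            std::uint64_t sim_now_ns) {
    const auto sim_done =
        std::chrono::nanoseconds(static_cast<std::int64_t>(sim_now_ns - sim_anchor_ns));
    return wall_anchor + std::chrono::duration_cast<Clock::duration>(sim_done);
}

// Instructions the host must retire per real second to hold 1 s sim = 1 s real.
constexpr std::uint64_t required_insn_per_second(std::uint64_t ns_per_insn) {
    return 1'000'000'000ULL / ns_per_insn;
}

// 20 ns per instruction is 50 M instructions/s; halving it to 10 ns doubles the load.
static_assert(required_insn_per_second(20) == 50'000'000ULL);
static_assert(required_insn_per_second(10) == 2 * required_insn_per_second(20));
```

Because both anchors are arguments rather than stored state, the sketch has the same property as the engine: nothing carries over from one call to the next.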

Pacing never changes results

Pacing affects only when the host sleeps, never sim_time_ or any guest-visible value. A run is bit-identical under Turbo and Realtime.

The core loop: run_until_unpaced

run_until_unpaced (engine_run_loop.cpp:226) is the heart of the emulator. Each iteration is one scheduling round; the loop runs until sim_time_ >= deadline. A single round, in order:

sequenceDiagram
    participant L as run_until_unpaced
    participant G as GdbStub
    participant IC as IrqController
    participant C as Cores (round-robin)
    participant S as EventScheduler
    participant P as Peripherals

    loop while sim_time < deadline
        L->>G: poll_accept / stop_pending (late-binding GDB)
        L->>L: sync_global_up_counter()  %% %asr22:%asr23
        L->>IC: consume_core_release_wake() → wake parked cores
        L->>IC: sample_interrupts(ci) for every core
        L->>L: error_mode check (→ GDB SIGSEGV or HaltedMode)
        L->>L: all_idle?
        alt all cores powered down
            L->>S: next_event_time()
            L->>L: sim_time = jump toward next event / deadline
        else
            L->>C: run quantum per core (ST inline / MT barrier)
            L->>L: sim_time += fold of per-core delta_ns
        end
        L->>S: fire_pending(sim_time)
        L->>P: tick(sim_time) for every peripheral
    end

The exact per-round sequence (citations into engine_run_loop.cpp):

  1. Late-binding GDB (:246, :257): pick up a client that connected after run_until started, or a stop primed at early-attach. Either returns HaltReason::Breakpoint.
  2. sync_global_up_counter() (:269; impl engine_irq_time.cpp:79): re-base every core's %asr22:%asr23 up-counter to global simulated time for this round. This is the SoC-wide system timecounter RTEMS reads for TOD/uptime, so it must advance with sim_time_ including the idle-skip jump that runs no core. It routes through IArchitecture::set_time_base, so a frontend without such a counter no-ops.
  3. Core-release wake (:277): drain IInterruptController::consume_core_release_wake(ci) before sampling IRQs, so a secondary CPU is running before its per-CPU IRQs are evaluated. The GRLIB controllers drive this from RTEMS SMP boot's MPSTAT[i]=1 write.
  4. sample_interrupts(ci) for every core (:287; impl engine_irq_time.cpp:12): deliver any IRQ that arrived from the previous round's peripheral ticks — before deciding whether to skip time, so a freshly-woken core is not skipped over.
  5. Error-mode check (:298): if any core tripped error mode, halt immediately. With a GDB client attached and not yet notified, redirect through the stub (SIGSEGV at the offending PC); otherwise return HaltedMode.
  6. all_idle test (:316): are all cores powered down? (A GDB Ctrl-C still breaks in while every core is dormant, :327.)
  7. Either idle-skip OR run a round of quanta (see below).
  8. scheduler_.fire_pending(sim_time_) (:401): process timed events whose deadline has passed.
  9. soc_->tick_peripherals(sim_time_) (:403): advance GPTIMER, drain UART, etc. with the round's final sim_time_.

The quantum and round-robin scheduling

The quantum is EmulatorConfig::quantum (default 1000 instructions, emulator_config.hpp:277) — the number of instructions a core executes before the scheduler moves on. One round runs one quantum per active core.

run_core_quantum(core_idx) (engine_run_loop.cpp:101) runs a single core's quantum and returns a CoreQuantumResult { instructions, delta_ns, should_return, stop }:

  • If the core is powered down, it executes nothing but still bills quantum * ns_per_insn so it stays time-synchronised with peripherals (:106).
  • With a single core (core_blobs_.size() == 1), the quantum is enlarged to an event-bounded burst (up to 65536 instructions, never past the next scheduled event) — there is no cross-core interleaving to preserve, so the per-round overhead is amortised while timer interrupts still fire at their exact simulated time (:113-138).
  • If translation && !observer_, it calls run_ir_quantum — the JIT/IR fast path (:147).
  • Else, if use_ir_interpret_ (a non-SPARC frontend, or SPARC under force_ir_interpret), it calls run_ir_interpret_quantum — the universal IR-interpret path (:166).
  • Otherwise (SPARC with observer/trace or translation = false — note MultiThread does NOT force this; it takes the JIT path above) it runs the per-core Switch path: a for loop driving IArchitecture::reference_step (which is core::step on the SPARC lens) up to quantum times (:190), checking should_break, firing the observer, polling self-IPIs, and breaking early on error/power-down.
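The dispatch order in the bullets above can be condensed into a small decision function. This is a sketch under stated assumptions: QuantumPath, select_path, and idle_bill_ns are hypothetical names, the three booleans stand in for translation, observer_, and use_ir_interpret_, and the single-core burst enlargement is left out.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical labels for the three quantum drivers described above.
enum class QuantumPath { TranslatedJit, IrInterpret, SwitchStep };

// Mirrors the order of the bullets: JIT first, then IR-interpret, then Switch.
constexpr QuantumPath select_path(bool translation, bool has_observer, bool use_ir_interpret) {
    if (translation && !has_observer) return QuantumPath::TranslatedJit;
    if (use_ir_interpret) return QuantumPath::IrInterpret;
    return QuantumPath::SwitchStep;
}

// A powered-down core executes nothing but is still billed a full quantum.
constexpr std::uint64_t idle_bill_ns(std::uint64_t quantum, std::uint64_t ns_per_insn) {
    return quantum * ns_per_insn;
}

static_assert(select_path(true, false, false) == QuantumPath::TranslatedJit);
static_assert(select_path(true, true, false) == QuantumPath::SwitchStep);
```

For example, with the default quantum of 1000 and an assumed 20 ns per instruction, an idle core is billed 20 000 ns per round.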

Single-thread round-robin

In ExecutionMode::SingleThread (the default), the loop runs each core's quantum inline, in core order, and reduces the per-core simulated-time deltas into sim_time_ once, at the round boundary (engine_run_loop.cpp:378-400):

std::uint64_t round_delta = 0;
for (std::size_t ci = 0; ci < core_blobs_.size(); ++ci) {
    const auto cr = run_core_quantum(ci);
    result.instructions_executed += cr.instructions;
    round_delta = fold_delta(round_delta, cr.delta_ns);
    if (cr.should_return) { /* GDB stop: commit and return */ }
}
sim_time_ = SimTimeNs{to_underlying(sim_time_) + round_delta};

fold_delta (engine_run_loop.cpp:236) applies EmulatorConfig::time_advance (ADR-005, Multicore and timing): Concurrent (the default) takes the max of the per-core deltas — the cores run concurrently in one wall-window — while the legacy Sum accumulates them. For N=1 the two are identical.

Nothing observes sim_time_ mid-round (peripherals tick after the loop with the final value), so reducing once at the boundary is byte-identical to advancing inline — and it is what makes the multi-thread path below produce the same sim_time_.
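A worked example makes the difference between the two reductions concrete. The sketch below is not the engine's fold_delta: it takes the mode as an explicit parameter, uses plain integers, and the names TimeAdvance and reduce_round are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <initializer_list>

// Stand-in for the two EmulatorConfig::time_advance policies described above.
enum class TimeAdvance { Concurrent, Sum };

// One fold step: Concurrent keeps the largest delta, Sum accumulates.
constexpr std::uint64_t fold_delta(TimeAdvance mode, std::uint64_t acc, std::uint64_t delta) {
    return mode == TimeAdvance::Concurrent ? std::max(acc, delta) : acc + delta;
}

// Hypothetical helper: reduce every core's delta for one round.
constexpr std::uint64_t reduce_round(TimeAdvance mode,
                                     std::initializer_list<std::uint64_t> per_core_ns) {
    std::uint64_t acc = 0;
    for (std::uint64_t d : per_core_ns) acc = fold_delta(mode, acc, d);
    return acc;
}

// Two cores that ran 20 us and 16 us of simulated time in the same round.
static_assert(reduce_round(TimeAdvance::Concurrent, {20000, 16000}) == 20000);
static_assert(reduce_round(TimeAdvance::Sum, {20000, 16000}) == 36000);
```

With a single core both policies return that core's delta, which is the N=1 equivalence noted above.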

Multi-thread (ADR-001)

In ExecutionMode::MultiThread with num_cores > 1, cores 1..N−1 run on their own host threads. start_workers spawns the workers parked on a std::barrier; each round the main thread releases them at the start_barrier_, runs core 0 itself, then rejoins at the done_barrier_ (engine_run_loop.cpp:354-377; worker_loop at engine_mt.cpp:35, start_workers at engine_mt.cpp:46). Per-core deltas are reduced into sim_time_ exactly as in single-thread. A quantum_batch > 1 lets each worker run several quanta back-to-back before the barrier to amortise its cost (Phase 14, emulator_config.hpp:286; run_core_batch, engine_mt.cpp:22). MultiThread runs the per-core JIT path with per-core IR caches and TieredJits (Tier-2 background O2 included); a GDB stub or observer forces SingleThread.

Self-IPI latency

All quantum paths call poll_self_interrupt(core_idx) at each instruction/block boundary (engine_run_loop.cpp:197, engine_translate.cpp:155) so a core that raises an interrupt to itself takes it at hardware (instruction-boundary) latency rather than waiting for the next round — fixing smpmulticast01 (poll_self_interrupt, engine_irq_time.cpp:99).

Idle-time skipping

When all cores are powered down, simulating their quanta is wasted work. The loop instead jumps sim_time_ straight forward (engine_run_loop.cpp:333-353):

auto next_event = to_underlying(scheduler_.next_event_time());
auto jump_target = std::min(to_underlying(deadline), next_event);
if (scheduler_.empty()) {
    constexpr std::uint64_t MaxIdleNs = 1'000'000ULL;   // 1 ms
    jump_target = std::min(jump_target, to_underlying(sim_time_) + MaxIdleNs);
}
if (jump_target > to_underlying(sim_time_)) {
    sim_time_ = SimTimeNs{jump_target};
}

The jump is bounded by the deadline and the next scheduled event. Timed peripherals schedule their underflows through the EventScheduler (event-driven GPTimer), so the event bounds the jump to the exact wake instant — one jump per timer period instead of many small steps. The 1 ms MaxIdleNs fallback applies only when nothing is scheduled, so non-event-driven peripherals (e.g. APBUART RX polling) are still ticked during a long idle with no armed timer, and time never leaps straight to a far deadline.
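The bounds can be checked with a standalone version of the same computation. This sketch uses plain nanosecond integers; idle_jump is a hypothetical name, and kNoEvent is an assumed sentinel for "nothing scheduled" (what next_event_time() actually returns on an empty scheduler is not stated here).

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

constexpr std::uint64_t kMaxIdleNs = 1'000'000ULL;  // 1 ms fallback cap
// Assumed sentinel for an empty scheduler; not taken from the source.
constexpr std::uint64_t kNoEvent = std::numeric_limits<std::uint64_t>::max();

// Returns the new sim time after one idle-skip decision.
constexpr std::uint64_t idle_jump(std::uint64_t sim_time, std::uint64_t deadline,
                                  std::uint64_t next_event, bool scheduler_empty) {
    std::uint64_t target = std::min(deadline, next_event);
    if (scheduler_empty) target = std::min(target, sim_time + kMaxIdleNs);
    return std::max(target, sim_time);  // time never moves backwards
}

// Timer armed at 2.5 ms: one jump lands exactly on the underflow.
static_assert(idle_jump(0, 10'000'000, 2'500'000, false) == 2'500'000);
// Nothing scheduled: the jump is capped at 1 ms instead of leaping to the deadline.
static_assert(idle_jump(0, 10'000'000, kNoEvent, true) == 1'000'000);
```

An event that is already due leaves sim time unchanged, so the round falls through to fire_pending without skipping anything.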

Cores enter the idle state by writing %asr19 (power-down); a pending interrupt wakes them in sample_interrupts (engine_irq_time.cpp:48-54).

How cores get woken: sample_interrupts

sample_interrupts(core_idx) (engine_irq_time.cpp:12) is the single place a hardware interrupt becomes a trap. The engine is arch-neutral here: the architecture makes the decision and performs the delivery (Decision 68).

  1. Fast path: in SingleThread, interrupt_controller_->raw_pending() == 0 proves nothing is pending anywhere and skips the per-CPU scan (:23).
  2. Query interrupt_controller_->pending_mask(core_idx); return if zero.
  3. Ask the architecture: ir_arch_->evaluate_interrupt(state, pending) returns an InterruptDecision { level, take, trap_type, ack_mask } (architecture.hpp:123-141) — the highest pending priority and whether the architecture's enable/priority gates (SPARC: PSR.ET, PSR.PIL) permit delivery.
  4. A pending interrupt wakes a powered-down core regardless of the enable gate (:48-54) — RTEMS SMP boot relies on this: a secondary CPU parks with ET=0 and is woken by an IPI.
  5. If decision.take: acknowledge the controller with the opaque decision.ack_mask the architecture formed (auto-clear the pending/force bit, Decision 39, :64), notify any observer, and deliver the trap through ir_arch_->deliver_interrupt(core_idx, decision) (:76) — for SPARC this enters the trap on the lens, clearing the step micro-state (annul, pending software trap) the blob does not carry.
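For SPARC, the gate in step 3 follows the V8 rule that an interrupt is taken when ET=1 and its level is either 15 or greater than PIL, with trap type 0x10 + level. The sketch below illustrates that rule only; the struct is a local stand-in for InterruptDecision, and the pending-mask encoding (bit n for level n) and the ack_mask value are assumptions, not the contract in architecture.hpp.

```cpp
#include <cassert>
#include <cstdint>

// Local stand-in mirroring the fields named above; the real type may differ.
struct InterruptDecision {
    std::uint8_t level;      // highest pending level, 0 if none
    bool take;               // whether the enable/priority gates permit delivery
    std::uint8_t trap_type;  // SPARC tt for the interrupt
    std::uint32_t ack_mask;  // assumed here: the single bit of the chosen level
};

// Assumed encoding: bit n of `pending` set means level n (1..15) is pending.
constexpr InterruptDecision evaluate_interrupt(bool et, std::uint8_t pil, std::uint32_t pending) {
    std::uint8_t level = 0;
    for (std::uint8_t n = 15; n >= 1; --n) {
        if (pending & (1u << n)) { level = n; break; }
    }
    if (level == 0) return {0, false, 0, 0};
    // SPARC V8: taken only with traps enabled, and level 15 is not maskable by PIL.
    const bool take = et && (level == 15 || level > pil);
    return {level, take, static_cast<std::uint8_t>(0x10 + level), 1u << level};
}
```

Note that a decision with take == false still reports a non-zero level, which is what lets the caller wake a powered-down core even when the enable gate blocks delivery.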

The step() cycle (the reference, and the fallback)

A single iteration of tero::core::step(CpuState&, ICpuBus&) (src/core/src/step.cpp:11) does the following:

flowchart TD
    A0{error_mode?} -- yes --> A1[return ErrorMode]
    A0 -- no --> A2[clear_branch_request]
    A2 --> B{PC aligned?}
    B -- no --> B1[status = AlignError]
    B -- yes --> C[fetch via decode cache]
    C --> D{annul_next?}
    D -- yes --> D1[drop slot, clear annul]
    D -- no --> E[execute → ExecStatus]
    B1 --> F{tt?}
    D1 --> F
    E --> F
    F -- has tt &amp; ET=0 --> X[set error_mode, halt]
    F -- has tt &amp; ET=1 --> H[enter_trap]
    F -- no tt --> G[advance PC/nPC via delay-slot rule]

Key points (all in step.cpp):

  • Error mode short-circuits without fetching, so the caller sees a stable status (:15).
  • Branch-request reset (:21): the previous cycle's CTI request is cleared up front; handlers re-set it if this instruction is a CTI.
  • Decode cache: a per-PC direct-mapped slot (CpuState::decode_cache_slot) skips both the bus fetch and the decoder on a hit (:43-57). On miss it fetches via ICpuBus::read_u32 and fills the slot. This is the Switch path's only optimisation.
  • Annul: if the previous CTI annulled this delay slot, the instruction is dropped (side effects skipped) but PC/nPC still advance (:60-65).
  • Trap derivation (:83-100): a software trap (Ticc) sets pending_tt directly; otherwise the tt is derived from the ExecStatus via status_to_tt. A trap with ET=0 sets error_mode_ and returns; otherwise enter_trap(pc, npc, tt) fires.
  • Normal advance (:104-109): new_pc = npc, new_npc = branch_taken ? branch_target : npc + 4.

Branch delay slots and annul

SPARC V8 has architectural delay slots: the instruction immediately following a control-transfer instruction (CTI) is always fetched and optionally executed before the branch takes effect. Tero models this without a pipeline:

  1. CTIs (JMPL, CALL, taken Bicc, RETT, …) do not mutate PC/nPC. They set branch_taken_ and compute branch_target_ on CpuState.
  2. At the end of the CTI's own step(), the normal advance sets PC ← nPC (the delay slot) and nPC ← branch_target_.
  3. The next step() fetches and executes the delay slot, then advances PC ← nPC, which is now the branch target.

For annulled branches (Bicc,a): if the branch is taken, the delay slot executes normally; if not taken, annul_next_ is set and the next step() skips execution but advances PC/nPC. A trap clears annul_next_ on entry (SPARC V8 §5.1.2.2; CpuState::enter_trap, Decision 37) — without this, the first instruction of an ISR could be silently dropped (the root cause of sp11's ErrorMode crash).
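The bookkeeping can be traced with a toy model. This is a sketch, not Tero's CpuState or step(): MiniCpu, MiniInsn, and mini_step are hypothetical, instructions are reduced to "branch or not", traps are omitted, and only conditional branches are modelled (an unconditional ba,a annuls its slot even though it is taken, which this toy does not cover).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical miniature of the PC/nPC state; field names echo the text above.
struct MiniCpu {
    std::uint32_t pc = 0;
    std::uint32_t npc = 4;
    bool branch_taken = false;
    std::uint32_t branch_target = 0;
    bool annul_next = false;
};

// What the instruction at the current PC asks for.
struct MiniInsn {
    bool is_branch = false;
    bool taken = false;
    bool annul_bit = false;  // the ",a" suffix
    std::uint32_t target = 0;
};

// Runs one cycle; returns false if the instruction was annulled (dropped).
inline bool mini_step(MiniCpu& cpu, const MiniInsn& insn) {
    cpu.branch_taken = false;  // branch-request reset at the top of the cycle
    bool executed = true;
    if (cpu.annul_next) {
        cpu.annul_next = false;  // drop the slot, skip side effects
        executed = false;
    } else if (insn.is_branch) {
        if (insn.taken) {
            cpu.branch_taken = true;  // the CTI only records a request
            cpu.branch_target = insn.target;
        } else if (insn.annul_bit) {
            cpu.annul_next = true;  // untaken ",a" branch annuls its slot
        }
    }
    // Normal advance: new_pc = npc, new_npc = target if requested, else npc + 4.
    const std::uint32_t new_npc = cpu.branch_taken ? cpu.branch_target : cpu.npc + 4;
    cpu.pc = cpu.npc;
    cpu.npc = new_npc;
    return executed;
}
```

Starting at PC 0x100 with a taken branch to 0x200, the fetch sequence is 0x100 (branch), 0x104 (delay slot), 0x200 (target); with an untaken annulled branch it is 0x100, then 0x104 dropped, then 0x108.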

PSR writes are immediate

SPARC V8 §5.1.2.3 permits WRPSR's effect on the S, ET, PS, and CWP fields to be deferred up to three instructions (ICC and PIL are always immediate). That deferral is implementation latitude: real software pads WRPSR with three NOPs, so the observable result is identical whether the write lands now or three instructions later.

Tero applies every writable PSR field immediately, matching the reference oracle (Gaisler SIS). write_psr_writable (cpu_state.cpp:69) masks the read-only fields and writes the rest straight to psr_ in one shot — there is no pending-write buffer, and trap entry/exit set the PSR the same direct way. Modelling the delay diverged from SIS whenever a trap fired inside the three-instruction window (trap entry dropped the still-pending CWP change), desyncing the register windows on trap-dense SMP paths — the root cause of smpschededf03.
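The "mask and write in one shot" idea looks roughly like this. The bit positions are the SPARC V8 PSR layout (impl 31:28, ver 27:24, icc 23:20, reserved 19:14, EC 13, EF 12, PIL 11:8, S 7, PS 6, ET 5, CWP 4:0); which fields Tero actually treats as read-only is an assumption here, and apply_psr_write is a hypothetical free function, not the signature of write_psr_writable.

```cpp
#include <cassert>
#include <cstdint>

constexpr std::uint32_t kPsrImplVer  = 0xFF000000u;  // impl[31:28] and ver[27:24]
constexpr std::uint32_t kPsrReserved = 0x000FC000u;  // bits 19:14
// Assumption: everything outside impl/ver/reserved is writable.
constexpr std::uint32_t kPsrWritable = ~(kPsrImplVer | kPsrReserved);

// Keep the read-only bits of the old value, take the writable bits of the new
// one, and return the result immediately: there is no pending-write state.
constexpr std::uint32_t apply_psr_write(std::uint32_t old_psr, std::uint32_t value) {
    return (old_psr & ~kPsrWritable) | (value & kPsrWritable);
}

static_assert(kPsrWritable == 0x00F03FFFu);
// Writing all ones leaves impl/ver intact and the reserved bits clear.
static_assert(apply_psr_write(0xF3000000u, 0xFFFFFFFFu) == 0xF3F03FFFu);
```

Because the function has no memory of earlier writes, a trap arriving on the very next instruction sees the new CWP, which is the behaviour the paragraph above describes.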

How the translation path reuses step()

With translation = true, run_ir_quantum (engine_translate.cpp:92) runs whole blocks but defers to the reference step (IArchitecture::reference_step, which is core::step for SPARC) whenever the IR cannot safely take over:

  • a delay slot (npc != pc + 4), an annulled slot, or an in-flight PSR write → fallback_step() (:176);
  • an untranslatable instruction (the frontend returns a 0-insn block) → fallback_step() (:195);
  • the quantum-EXACT yield invariant: a block that would cross the quantum boundary is not run as a block — the remaining quantum - ran instructions are stepped one at a time so the core lands on exactly the same boundary the switch path would (:234-242). Under SMP round-robin, a core drifting past its quantum changes the cross-core interleaving and breaks determinism (smpschedaffinity04). A block larger than a whole quantum still runs as a block when ran == 0, so the core always makes progress.
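The quantum-exact rule in the last bullet reduces to a small decision. The sketch below is a hypothetical restatement, not code from engine_translate.cpp: NextAction and plan_next are invented names, and the caller is assumed to stop once ran reaches quantum.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result: either run the whole block, or single-step N instructions.
struct NextAction {
    bool run_block;
    std::uint32_t single_steps;
};

// Precondition (assumed): ran < quantum, i.e. the quantum is not yet exhausted.
constexpr NextAction plan_next(std::uint32_t quantum, std::uint32_t ran, std::uint32_t block_len) {
    const bool crosses = ran + block_len > quantum;
    // A crossing block is replaced by exactly the remaining instructions,
    // unless nothing has run yet, in which case it runs whole to guarantee progress.
    if (crosses && ran > 0) return {false, quantum - ran};
    return {true, 0};
}

static_assert(plan_next(1000, 990, 8).run_block);            // fits: run the block
static_assert(plan_next(1000, 990, 16).single_steps == 10);  // crosses: step to the boundary
static_assert(plan_next(1000, 0, 1500).run_block);           // oversized first block still runs
```

In the middle case the core stops after exactly 1000 instructions, the same boundary the switch path would reach, which keeps the cross-core interleaving unchanged.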

Block exits funnel through the architecture (the S10 fault tail, Decision 67):

  • Exceptional exits — a memory fault or ExitKind::Exception — go through architecture.raise_block_exception(gs, exit, *block) (engine_translate.cpp:358; contract architecture.hpp:115). The architecture sets up its trap PC/nPC (honouring the block's delay-slot metadata) and enters the trap; false means it cannot take the trap (SPARC ET=0) and the engine halts the core into error mode.
  • Normal exits advance the PC through architecture.set_pc(gs, exit.next_pc) (engine_translate.cpp:368; contract architecture.hpp:103) — SPARC sets PC ← next_pc and nPC ← next_pc + 4.

The universal IR-interpret path (run_ir_interpret_quantum, engine_translate.cpp:379) mirrors the interpreter arm of run_ir_quantum with no JIT, and is reachable for any frontend.

The step hook: reference-path duties without an oracle (E0)

Everything above that steps — the quantum tail, GDB interior breakpoints, single-step — leans on reference_step, which only SPARC implements. An IR-only architecture (every post-SPARC frontend, ADR-006) gets the same duties from the interpreter's per-instruction seam (Decision 79): ir::IStepHook fires before the first op of each guest instruction (every builder op carries insn_index/pc; op-less instructions fire via the gap-filling walk), and may stop the block at that boundary. The committed prefix reuses the precise-trap "prefix committed" guarantee; the engine resumes with set_pc(entry_pc + stride · n) and re-translates from the boundary.

A stop is honoured only at a straight-line boundary — IrBlock::no_stop_tail (set by the frontend on every delay-slot-bearing SPARC terminator) excludes the CTI→delay-slot shadow, where nPC ≠ PC+4 and an annulled slot must not be re-entered as a fresh block. Through this seam an IR-only frontend yields at the exact quantum boundary (the SMP determinism guard, with no oracle), single_step is a true one-instruction step, GDB stops before an interior breakpoint instead of running the block past it, and the per-instruction observer fires interleaved with execution — each callback sees the committed state of every instruction before it. SPARC defaults are unchanged: the oracle remains its trace and stepping path; the hook activates for SPARC only under force_ir_interpret and in the per-instruction lockstep harness (run_ir_diff(..., per_insn = true)).

Maintenance contract: the two loops mirror each other

run_ir_quantum and run_ir_interpret_quantum duplicate their shared invariants on purpose (S9 kept the JIT arm frozen): block-cache lookup, untranslatable-block handling, the quantum-exact yield, the GDB interior-breakpoint single-step, and the raise_block_exception fault tail. A fix to any of those belongs in BOTH loops or they drift apart under SMP — see the MIRROR WARNING comments at engine_translate.cpp:94-100 and :386-388.

Because the IR and the interpreter operate on the same GuestState bytes, no synchronisation is needed across the hand-off. The trap, PSR, and error-mode handling above are therefore identical on both paths. See IR and LLVM JIT for the engine itself.

ErrorMode and post-mortem

SPARC V8 §7.1: a trap that fires while PSR.ET == 0 halts the processor and signals the outside world. Tero sets error_mode_ = true on the offending core (step.cpp:95) and the run loop returns HaltReason::HaltedMode (guest halt) — or, with GDB attached, redirects through the stub as SIGSEGV (engine_run_loop.cpp:298-314, Decision 44). The CLI reacts by dumping a post-mortem of every register on core 0; library users can call emu->core(idx) and inspect pc(), psr(), tbr(), wim(), the globals, and the active window.

See also