Execution model
This page is the canonical reference for how Tero advances time and
runs guest code: the two execution methods, the run_for/run_until
loop, the per-core quantum, round-robin scheduling, idle-time skipping,
pacing modes, and the single-instruction step() cycle that the
translation path falls back to. All citations are path:line into the
current source.
The run loop lives in the ExecutionEngine
(Runtime decomposition): scheduling rounds
in src/runtime/src/engine_run_loop.cpp, the translation quanta in
src/runtime/src/engine_translate.cpp, interrupt sampling and the time
base in src/runtime/src/engine_irq_time.cpp, and the MultiThread workers
in src/runtime/src/engine_mt.cpp. Emulator::run_for/run_until are
one-line delegations to the engine (emulator.cpp:307-313).
Two execution methods
Tero has two execution methods, both compiled into every build and
selected at runtime by EmulatorConfig::translation
(bool, default true — emulator_config.hpp:327):
| translation | Method | Driver | Role |
|---|---|---|---|
| false | Switch interpreter | core::step per instruction (run_core_quantum's per-step loop, engine_run_loop.cpp:190) | The reference path and correctness oracle |
| true (default) | Binary translation | ExecutionEngine::run_ir_quantum: tiered LLVM JIT block-at-a-time, IR interpreter fallback (engine_translate.cpp:92) | The fast path, validated bit-identical against the oracle |
A third quantum path exists alongside these: the universal IR-interpret
path (run_ir_interpret_quantum, engine_translate.cpp:379 —
entity-model S9). It drives the arch-neutral IR interpreter block by
block with no JIT. It is the non-JIT path for any non-SPARC frontend
(which has no core::step oracle), and SPARC opts into it via
EmulatorConfig::force_ir_interpret (emulator_config.hpp:357) for
differential validation against the switch oracle.
Both methods share the same CpuState and the same trap / PSR /
error-mode semantics. The IR/JIT path is not a separate emulator: it
operates on the same GuestState byte blob that core::step uses
(state unification — see Layers), and it falls back to
the reference step for delay slots, annulled slots, and any
instruction the frontend cannot translate
(run_ir_quantum's fallback_step, which calls
IArchitecture::reference_step — engine_translate.cpp:119-123).
This page documents the run-loop scheduling and the step() cycle. The
translation engine itself — IR ops, the block cache, the tiered JIT — is
documented in IR and LLVM JIT.
Why keep a slow interpreter at all?
The interpreter trades throughput for determinism, debuggability, and a small surface for hardware-modelling bugs. That is exactly why it stays the oracle the JIT is checked against. See Design principles §7.
The public run API
```cpp
RunResult run_for(SimTimeNs duration);   // advance by duration
RunResult run_until(SimTimeNs deadline); // advance until sim_time >= deadline
RunResult single_step(CoreId core);      // exactly one instruction, no IRQ/events
```
run_for is a thin wrapper over run_until
(engine_run_loop.cpp:22):
```cpp
RunResult ExecutionEngine::run_for(SimTimeNs duration) {
    return run_until(SimTimeNs{to_underlying(sim_time_) + to_underlying(duration)});
}
```
Every call returns a RunResult (run_result.hpp):
| Field | Meaning |
|---|---|
| instructions_executed | total instructions retired across all cores this call |
| time_elapsed | the sim_time_ at return |
| reason | a HaltReason (see below) |
HaltReason distinguishes the four ways a run stops:
- DeadlineReached / DurationExpired: normal completion.
- HaltedMode: a guest core took a trap with ET=0 (SPARC stops the processor). This is the guest's own doing (RTEMS _CPU_Fatal_halt / _exit issue ta 0 with ET=0), not an emulator failure (Decision 56).
- ErrorMode: reserved for an internal emulator error, distinct from a guest HaltedMode.
- Breakpoint: a GDB breakpoint, a late-binding GDB attach, or a Ctrl-C async interrupt.
run_until → pacing → run_until_unpaced
run_until slices the requested span into pacing_slice_ns chunks
(default 10 ms simulated, emulator_config.hpp:304) under both
pacing modes (engine_run_loop.cpp:56). The slice bounds the
per-segment idle-skip: an unbounded run_until_unpaced over a long
all-idle stretch could leap simulated time past a clock-dependent
guest's periodic GPTimer ticks. Realtime additionally sleeps between
chunks; Turbo runs them back-to-back.
```mermaid
flowchart TD
    A["run_until(deadline)"] --> D["anchor wall & sim time"]
    D --> E["slice loop:<br/>run_until_unpaced(min(deadline, sim+slice))"]
    E --> B{pacing?}
    B -- Realtime --> F["sleep_until(wall_anchor + sim_done_ns)"]
    B -- Turbo --> G
    F --> G{sim < deadline?}
    G -- yes --> E
    G -- no --> H[return]
```
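The slicing and the Realtime sleep target shown in the flowchart reduce to simple arithmetic. The following is a minimal sketch, not the engine's code: `next_slice_target`, `realtime_wake_offset_ns`, and `count_slices` are hypothetical helpers, and times are plain nanosecond integers rather than SimTimeNs.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: the target handed to run_until_unpaced for the next
// chunk, i.e. min(deadline, sim + slice).
constexpr std::uint64_t next_slice_target(std::uint64_t sim_ns,
                                          std::uint64_t deadline_ns,
                                          std::uint64_t slice_ns) {
    return std::min(deadline_ns, sim_ns + slice_ns);
}

// Hypothetical helper: under Realtime the host sleeps until
// wall_anchor + (simulated time completed since the call started).
// This returns that offset; Turbo simply never sleeps.
constexpr std::uint64_t realtime_wake_offset_ns(std::uint64_t sim_anchor_ns,
                                                std::uint64_t sim_now_ns) {
    return sim_now_ns - sim_anchor_ns;
}

// Counts the chunks a call would run, assuming (for illustration only) that
// each chunk lands exactly on its target. slice_ns must be non-zero.
inline std::uint64_t count_slices(std::uint64_t start_ns,
                                  std::uint64_t deadline_ns,
                                  std::uint64_t slice_ns) {
    std::uint64_t sim = start_ns;
    std::uint64_t chunks = 0;
    while (sim < deadline_ns) {
        sim = next_slice_target(sim, deadline_ns, slice_ns);
        ++chunks;
    }
    return chunks;
}
```

With the default 10 ms slice, a 25 ms span would be served as three chunks under this sketch: two full slices and one 5 ms remainder.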
PacingMode::Turbo (tests, batch, SMP2 wrapper)
Free-running. The slice loop runs the chunks back-to-back and never sleeps. This is the mode for CI, the RTEMS testsuite, the bench tools, and the future SMP2 wrapper (where the external scheduler dictates cadence).
PacingMode::Realtime (default for interactive use)
The slice loop calls sleep_until against std::chrono::steady_clock between
chunks so 1 s simulated ≈ 1 s real (engine_run_loop.cpp:89-95). The
wall-clock and sim-time anchors are taken once per call and are
local — persisting them across calls would let an idle host program
drift the simulation arbitrarily ahead of wall-clock. Halving
ns_per_insn doubles simulated MHz, so the host must execute twice the
instructions per real second to stay on schedule.
Pacing never changes results
Pacing affects only when the host sleeps, never sim_time_ or any
guest-visible value. A run is bit-identical under Turbo and
Realtime.
The core loop: run_until_unpaced
run_until_unpaced (engine_run_loop.cpp:226) is the heart of the
emulator. Each iteration is one scheduling round; the loop runs
until sim_time_ >= deadline. A single round, in order:
```mermaid
sequenceDiagram
    participant L as run_until_unpaced
    participant G as GdbStub
    participant IC as IrqController
    participant C as Cores (round-robin)
    participant S as EventScheduler
    participant P as Peripherals
    loop while sim_time < deadline
        L->>G: poll_accept / stop_pending (late-binding GDB)
        L->>L: sync_global_up_counter() %% %asr22:%asr23
        L->>IC: consume_core_release_wake() → wake parked cores
        L->>IC: sample_interrupts(ci) for every core
        L->>L: error_mode check (→ GDB SIGSEGV or HaltedMode)
        L->>L: all_idle?
        alt all cores powered down
            L->>S: next_event_time()
            L->>L: sim_time = jump toward next event / deadline
        else
            L->>C: run quantum per core (ST inline / MT barrier)
            L->>L: sim_time += fold of per-core delta_ns
        end
        L->>S: fire_pending(sim_time)
        L->>P: tick(sim_time) for every peripheral
    end
```
The exact per-round sequence (citations into engine_run_loop.cpp):
- Late-binding GDB (:246, :257): pick up a client that connected after run_until started, or a stop primed at early-attach. Either returns HaltReason::Breakpoint.
- sync_global_up_counter() (:269; impl engine_irq_time.cpp:79): re-base every core's %asr22:%asr23 up-counter to global simulated time for this round. This is the SoC-wide system timecounter RTEMS reads for TOD/uptime, so it must advance with sim_time_, including across the idle-skip jump that runs no core. It routes through IArchitecture::set_time_base, so a frontend without such a counter no-ops.
- Core-release wake (:277): drain IInterruptController::consume_core_release_wake(ci) before sampling IRQs, so a secondary CPU is running before its per-CPU IRQs are evaluated. The GRLIB controllers drive this from RTEMS SMP boot's MPSTAT[i]=1 write.
- sample_interrupts(ci) for every core (:287; impl engine_irq_time.cpp:12): deliver any IRQ that arrived from the previous round's peripheral ticks. This happens before deciding whether to skip time, so a freshly-woken core is not skipped over.
- Error-mode check (:298): if any core tripped error mode, halt immediately. With a GDB client attached and not yet notified, redirect through the stub (SIGSEGV at the offending PC); otherwise return HaltedMode.
- all_idle test (:316): are all cores powered down? (A GDB Ctrl-C still breaks in while every core is dormant, :327.)
- Either idle-skip OR run a round of quanta (see below).
- scheduler_.fire_pending(sim_time_) (:401): process timed events whose deadline has passed.
- soc_->tick_peripherals(sim_time_) (:403): advance GPTIMER, drain UART, etc. with the round's final sim_time_.
The quantum and round-robin scheduling
The quantum is EmulatorConfig::quantum (default 1000
instructions, emulator_config.hpp:277) — the number of instructions a
core executes before the scheduler moves on. One round runs one quantum
per active core.
run_core_quantum(core_idx) (engine_run_loop.cpp:101) runs a single
core's quantum and returns a CoreQuantumResult { instructions,
delta_ns, should_return, stop }:
- If the core is powered down, it executes nothing but still bills quantum * ns_per_insn so it stays time-synchronised with peripherals (:106).
- With a single core (core_blobs_.size() == 1), the quantum is enlarged to an event-bounded burst (up to 65536 instructions, never past the next scheduled event). There is no cross-core interleaving to preserve, so the per-round overhead is amortised while timer interrupts still fire at their exact simulated time (:113-138).
- If translation && !observer_, it calls run_ir_quantum, the JIT/IR fast path (:147).
- Else, if use_ir_interpret_ (a non-SPARC frontend, or SPARC under force_ir_interpret), it calls run_ir_interpret_quantum, the universal IR-interpret path (:166).
- Otherwise (SPARC with observer/trace or translation = false; note that MultiThread does NOT force this, it takes the JIT path above) it runs the per-core Switch path: a for loop driving IArchitecture::reference_step (which is core::step on the SPARC lens) up to quantum times (:190), checking should_break, firing the observer, polling self-IPIs, and breaking early on error/power-down.
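The selection order in the list above can be sketched as a small pure function. This is an illustration under stated assumptions, not the body of run_core_quantum: `QuantumPath`, `PathInputs`, `select_path`, and `powered_down_bill_ns` are hypothetical names, and ns_per_insn is assumed to be an integer nanosecond count.

```cpp
#include <cstdint>

// Hypothetical labels for the arms described in the list above.
enum class QuantumPath { PoweredDown, Translation, IrInterpret, Switch };

// Hypothetical snapshot of the inputs the selection depends on.
struct PathInputs {
    bool powered_down;      // core is parked (power-down)
    bool translation;       // EmulatorConfig::translation
    bool has_observer;      // an observer/trace is attached
    bool use_ir_interpret;  // non-SPARC frontend, or force_ir_interpret
};

// Mirrors the documented order: powered-down first, then the JIT/IR fast
// path, then the universal IR-interpret path, else the Switch path.
constexpr QuantumPath select_path(const PathInputs& in) {
    if (in.powered_down) return QuantumPath::PoweredDown;
    if (in.translation && !in.has_observer) return QuantumPath::Translation;
    if (in.use_ir_interpret) return QuantumPath::IrInterpret;
    return QuantumPath::Switch;
}

// A powered-down core runs nothing but is still billed a full quantum of
// simulated time so it stays aligned with the peripherals.
constexpr std::uint64_t powered_down_bill_ns(std::uint64_t quantum,
                                             std::uint64_t ns_per_insn) {
    return quantum * ns_per_insn;
}
```

Note that attaching an observer with translation enabled falls through to the Switch path in this sketch unless use_ir_interpret is set, which matches the "SPARC with observer/trace" case in the last bullet.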
Single-thread round-robin
In ExecutionMode::SingleThread (the default), the loop runs each core's
quantum inline, in core order, and reduces the per-core simulated-time
deltas into sim_time_ once, at the round boundary
(engine_run_loop.cpp:378-400):
```cpp
std::uint64_t round_delta = 0;
for (std::size_t ci = 0; ci < core_blobs_.size(); ++ci) {
    const auto cr = run_core_quantum(ci);
    result.instructions_executed += cr.instructions;
    round_delta = fold_delta(round_delta, cr.delta_ns);
    if (cr.should_return) { /* GDB stop: commit and return */ }
}
sim_time_ = SimTimeNs{to_underlying(sim_time_) + round_delta};
```
fold_delta (engine_run_loop.cpp:236) applies
EmulatorConfig::time_advance (ADR-005,
Multicore and timing): Concurrent (the default)
takes the max of the per-core deltas — the cores run concurrently in
one wall-window — while the legacy Sum accumulates them. For N=1 the
two are identical.
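The two reduction policies can be illustrated with a few lines of arithmetic. This is a minimal sketch: `TimeAdvance`, `fold_delta_sketch`, and `round_delta_sketch` are hypothetical stand-ins for the engine's own types and for fold_delta, whose real signature may differ.

```cpp
#include <algorithm>
#include <cstdint>
#include <initializer_list>

// Hypothetical mirror of the two EmulatorConfig::time_advance policies.
enum class TimeAdvance { Concurrent, Sum };

// Concurrent keeps the largest per-core delta; Sum accumulates all of them.
constexpr std::uint64_t fold_delta_sketch(TimeAdvance mode,
                                          std::uint64_t acc,
                                          std::uint64_t delta_ns) {
    return mode == TimeAdvance::Concurrent ? std::max(acc, delta_ns)
                                           : acc + delta_ns;
}

// Reduce one round's per-core deltas into the amount sim time advances.
inline std::uint64_t round_delta_sketch(TimeAdvance mode,
                                        std::initializer_list<std::uint64_t> deltas) {
    std::uint64_t acc = 0;
    for (std::uint64_t d : deltas) acc = fold_delta_sketch(mode, acc, d);
    return acc;
}
```

For three cores that report 1000, 800, and 1200 ns in one round, Concurrent advances simulated time by 1200 ns and Sum by 3000 ns; with a single core both give the same value.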
Nothing observes sim_time_ mid-round (peripherals tick after the loop
with the final value), so reducing once at the boundary is byte-identical
to advancing inline — and it is what makes the multi-thread path below
produce the same sim_time_.
Multi-thread (ADR-001)
In ExecutionMode::MultiThread with num_cores > 1, cores 1..N−1 run on
their own host threads. start_workers spawns the workers parked on a
std::barrier; each round the main thread releases them at the
start_barrier_, runs core 0 itself, then rejoins at the done_barrier_
(engine_run_loop.cpp:354-377; worker_loop at engine_mt.cpp:35,
start_workers at engine_mt.cpp:46). Per-core deltas are reduced into
sim_time_ exactly as in single-thread. A quantum_batch > 1 lets each
worker run several quanta back-to-back before the barrier to amortise
its cost (Phase 14, emulator_config.hpp:286; run_core_batch,
engine_mt.cpp:22). MultiThread runs the per-core JIT path with
per-core IR caches and TieredJits (Tier-2 background O2 included); a GDB
stub or observer forces SingleThread.
Self-IPI latency
All quantum paths call poll_self_interrupt(core_idx) at each
instruction/block boundary (engine_run_loop.cpp:197,
engine_translate.cpp:155) so a core that raises an interrupt to
itself takes it at hardware (instruction-boundary) latency rather
than waiting for the next round — fixing smpmulticast01
(poll_self_interrupt, engine_irq_time.cpp:99).
Idle-time skipping
When all cores are powered down, simulating their quanta is wasted
work. The loop instead jumps sim_time_ straight forward
(engine_run_loop.cpp:333-353):
```cpp
auto next_event = to_underlying(scheduler_.next_event_time());
auto jump_target = std::min(to_underlying(deadline), next_event);
if (scheduler_.empty()) {
    constexpr std::uint64_t MaxIdleNs = 1'000'000ULL; // 1 ms
    jump_target = std::min(jump_target, to_underlying(sim_time_) + MaxIdleNs);
}
if (jump_target > to_underlying(sim_time_)) {
    sim_time_ = SimTimeNs{jump_target};
}
```
The jump is bounded by the deadline and the next scheduled event. Timed
peripherals schedule their underflows through the EventScheduler
(event-driven GPTimer), so the event bounds the jump to the exact wake
instant — one jump per timer period instead of many small steps. The
1 ms MaxIdleNs fallback applies only when nothing is scheduled, so
non-event-driven peripherals (e.g. APBUART RX polling) are still ticked
during a long idle with no armed timer, and time never leaps straight to
a far deadline.
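The bounding rules can be checked with a standalone version of the same arithmetic. This is a sketch under assumptions: `idle_jump_target` is a hypothetical helper, times are plain nanosecond integers, and an empty scheduler is modelled as std::nullopt rather than through the EventScheduler API.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>

constexpr std::uint64_t kMaxIdleNs = 1'000'000; // 1 ms fallback, as described above

// Hypothetical helper: where simulated time lands after one idle-skip.
// next_event_ns is std::nullopt when nothing is scheduled (an assumption
// standing in for scheduler_.empty()).
inline std::uint64_t idle_jump_target(std::uint64_t sim_ns,
                                      std::uint64_t deadline_ns,
                                      std::optional<std::uint64_t> next_event_ns) {
    std::uint64_t target = deadline_ns;
    if (next_event_ns) {
        target = std::min(target, *next_event_ns);       // stop at the wake instant
    } else {
        target = std::min(target, sim_ns + kMaxIdleNs);  // bounded hop, nothing armed
    }
    return std::max(target, sim_ns);                     // time never moves backwards
}
```

With a timer event at 4 ms and a 10 ms deadline, one jump lands exactly on 4 ms. With nothing scheduled, the same call advances by 1 ms at a time, so polled peripherals are still ticked along the way.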
Cores enter the idle state by writing %asr19 (power-down); a pending
interrupt wakes them in sample_interrupts
(engine_irq_time.cpp:48-54).
How cores get woken: sample_interrupts
sample_interrupts(core_idx) (engine_irq_time.cpp:12) is the single
place a hardware interrupt becomes a trap. The engine is arch-neutral
here: the architecture makes the decision and performs the delivery
(Decision 68).
- Fast path: in SingleThread, interrupt_controller_->raw_pending() == 0 proves nothing is pending anywhere and skips the per-CPU scan (:23).
- Query interrupt_controller_->pending_mask(core_idx); return if zero.
- Ask the architecture: ir_arch_->evaluate_interrupt(state, pending) returns an InterruptDecision { level, take, trap_type, ack_mask } (architecture.hpp:123-141), i.e. the highest pending priority and whether the architecture's enable/priority gates (SPARC: PSR.ET, PSR.PIL) permit delivery.
- A pending interrupt wakes a powered-down core regardless of the enable gate (:48-54). RTEMS SMP boot relies on this: a secondary CPU parks with ET=0 and is woken by an IPI.
- If decision.take: acknowledge the controller with the opaque decision.ack_mask the architecture formed (auto-clear the pending/force bit, Decision 39, :64), notify any observer, and deliver the trap through ir_arch_->deliver_interrupt(core_idx, decision) (:76). For SPARC this enters the trap on the lens, clearing the step micro-state (annul, pending software trap) the blob does not carry.
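The SPARC side of that decision follows the V8 interrupt rule: the highest pending level is taken when ET=1 and the level is either 15 or greater than PIL. The sketch below illustrates it under assumptions: `InterruptDecisionSketch` and `evaluate_interrupt_sketch` are hypothetical, bit n of the pending mask is assumed to mean level n, and the ack mask is assumed to be that single bit. The real InterruptDecision treats ack_mask as opaque.

```cpp
#include <cstdint>

// Hypothetical mirror of the four documented fields.
struct InterruptDecisionSketch {
    std::uint8_t level;      // highest pending level, 0 = nothing pending
    bool take;               // do the enable/priority gates permit delivery?
    std::uint8_t trap_type;  // SPARC V8 interrupt_level_n: 0x10 + level
    std::uint32_t ack_mask;  // assumed here: the single bit for `level`
};

// Assumption: bit n of `pending` means interrupt level n is pending (1..15).
inline InterruptDecisionSketch evaluate_interrupt_sketch(std::uint32_t psr,
                                                         std::uint32_t pending) {
    InterruptDecisionSketch d{0, false, 0, 0};
    for (int lvl = 15; lvl >= 1; --lvl) {
        if (pending & (1u << lvl)) {
            d.level = static_cast<std::uint8_t>(lvl);
            break;
        }
    }
    if (d.level == 0) return d;

    const bool et = ((psr >> 5) & 1u) != 0;         // PSR.ET is bit 5
    const std::uint32_t pil = (psr >> 8) & 0xFu;    // PSR.PIL is bits 11:8
    d.take = et && (d.level == 15 || d.level > pil);
    if (d.take) {
        d.trap_type = static_cast<std::uint8_t>(0x10 + d.level);
        d.ack_mask = 1u << d.level;
    }
    return d;
}

// A pending interrupt wakes a parked core even when `take` is false.
inline bool wakes_powered_down_core(const InterruptDecisionSketch& d) {
    return d.level != 0;
}
```

The last helper captures the RTEMS SMP boot case from the list: with ET=0 the decision has take == false, yet the core still leaves power-down.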
The step() cycle (the reference, and the fallback)
A single iteration of tero::core::step(CpuState&, ICpuBus&)
(src/core/src/step.cpp:11) does the following:
```mermaid
flowchart TD
    A0{error_mode?} -- yes --> A1[return ErrorMode]
    A0 -- no --> A2[clear_branch_request]
    A2 --> B{PC aligned?}
    B -- no --> B1[status = AlignError]
    B -- yes --> C[fetch via decode cache]
    C --> D{annul_next?}
    D -- yes --> D1[drop slot, clear annul]
    D -- no --> E[execute → ExecStatus]
    B1 --> F{tt?}
    D1 --> F
    E --> F
    F -- has tt & ET=0 --> X[set error_mode, halt]
    F -- has tt & ET=1 --> H[enter_trap]
    F -- no tt --> G[advance PC/nPC via delay-slot rule]
```
Key points (all in step.cpp):
- Error mode short-circuits without fetching, so the caller sees a stable status (:15).
- Branch-request reset (:21): the previous cycle's CTI request is cleared up front; handlers re-set it if this instruction is a CTI.
- Decode cache: a per-PC direct-mapped slot (CpuState::decode_cache_slot) skips both the bus fetch and the decoder on a hit (:43-57). On miss it fetches via ICpuBus::read_u32 and fills the slot. This is the Switch path's only optimisation.
- Annul: if the previous CTI annulled this delay slot, the instruction is dropped (side effects skipped) but PC/nPC still advance (:60-65).
- Trap derivation (:83-100): a software trap (Ticc) sets pending_tt directly; otherwise the tt is derived from the ExecStatus via status_to_tt. A trap with ET=0 sets error_mode_ and returns; otherwise enter_trap(pc, npc, tt) fires.
- Normal advance (:104-109): new_pc = npc, new_npc = branch_taken ? branch_target : npc + 4.
Branch delay slots and annul
SPARC V8 has architectural delay slots: the instruction immediately following a control-transfer instruction (CTI) is always fetched and optionally executed before the branch takes effect. Tero models this without a pipeline:
- CTIs (JMPL, CALL, taken Bicc, RETT, …) do not mutate PC/nPC. They set branch_taken_ and compute branch_target_ on CpuState.
- The next step() fetches the instruction at the current nPC (the delay slot) and executes it.
- After the delay slot, the loop adopts branch_target_ as nPC.
For annulled branches (Bicc,a): if the branch is taken, the delay
slot executes normally; if not taken, annul_next_ is set and the
next step() skips execution but advances PC/nPC. A trap clears
annul_next_ on entry (SPARC V8 §5.1.2.2; CpuState::enter_trap,
Decision 37) — without this, the first instruction of an ISR could be
silently dropped (the root cause of sp11's ErrorMode crash).
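The PC/nPC rule and the annul behaviour described above can be traced with a few lines of arithmetic. This is a sketch only: `PcPair`, `advance`, and `step_slot` are hypothetical names that model the documented advance rule (new_pc = npc, new_npc = branch_taken ? branch_target : npc + 4), not the contents of step.cpp.

```cpp
#include <cstdint>

// Hypothetical pair standing in for the PC/nPC fields of CpuState.
struct PcPair {
    std::uint32_t pc;
    std::uint32_t npc;
};

// The documented normal-advance rule, applied after every instruction.
constexpr PcPair advance(PcPair cur, bool branch_taken, std::uint32_t branch_target) {
    return PcPair{cur.npc, branch_taken ? branch_target : cur.npc + 4};
}

// Hypothetical result of stepping a delay slot.
struct SlotResult {
    PcPair next;    // PC/nPC after the slot
    bool executed;  // false when the slot was annulled
};

// An annulled slot is dropped (no side effects) but PC/nPC still advance.
constexpr SlotResult step_slot(PcPair cur, bool annul_next) {
    return SlotResult{advance(cur, false, 0), !annul_next};
}
```

Tracing a taken CALL at 0x1000 to 0x2000: the CTI step yields PC=0x1004, nPC=0x2000, so the delay slot at 0x1004 runs next, and the step after it yields PC=0x2000, nPC=0x2004. For an untaken annulling branch at 0x1000, the slot at 0x1004 is skipped and execution continues at 0x1008.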
PSR writes are immediate
SPARC V8 §5.1.2.3 permits WRPSR's effect on the S, ET, PS, and
CWP fields to be deferred up to three instructions (ICC and PIL are
always immediate). That deferral is implementation latitude: real
software pads WRPSR with three NOPs, so the observable result is
identical whether the write lands now or three instructions later.
Tero applies every writable PSR field immediately, matching the
reference oracle (Gaisler SIS). write_psr_writable (cpu_state.cpp:69)
masks the read-only fields and writes the rest straight to psr_ in one
shot — there is no pending-write buffer, and trap entry/exit set the PSR
the same direct way. Modelling the delay diverged from SIS whenever a trap
fired inside the three-instruction window (trap entry dropped the
still-pending CWP change), desyncing the register windows on trap-dense
SMP paths — the root cause of smpschededf03.
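A masked, immediate PSR write can be sketched as a single expression. The bit layout below is the SPARC V8 PSR layout; which fields Tero's write_psr_writable actually treats as read-only is an assumption here (impl, ver, and the reserved bits), and `kPsrWritableMask` and `write_psr_sketch` are hypothetical names.

```cpp
#include <cstdint>

// SPARC V8 PSR layout: impl[31:28] ver[27:24] icc[23:20] reserved[19:14]
// EC[13] EF[12] PIL[11:8] S[7] PS[6] ET[5] CWP[4:0].
// Assumption: impl, ver, and the reserved bits are the read-only fields.
constexpr std::uint32_t kPsrWritableMask = 0x00F03FFFu;

// Keep the read-only bits of the old value and take every writable bit from
// the new value in one shot. There is no pending-write buffer in this model.
constexpr std::uint32_t write_psr_sketch(std::uint32_t old_psr, std::uint32_t value) {
    return (old_psr & ~kPsrWritableMask) | (value & kPsrWritableMask);
}

// Convenience accessors used to show that the new fields are visible at once.
constexpr bool psr_et(std::uint32_t psr) { return ((psr >> 5) & 1u) != 0; }
constexpr std::uint32_t psr_cwp(std::uint32_t psr) { return psr & 0x1Fu; }
```

Writing 0xA5 (S=1, ET=1, CWP=5) over a PSR of 0xF3000000 gives 0xF30000A5 in this sketch: the new ET and CWP are readable by the very next instruction, and the impl/ver fields are untouched.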
How the translation path reuses step()
With translation = true, run_ir_quantum (engine_translate.cpp:92)
runs whole blocks but defers to the reference step
(IArchitecture::reference_step, which is core::step for SPARC)
whenever the IR cannot safely take over:
- a delay slot (npc != pc + 4), an annulled slot, or an in-flight PSR write → fallback_step() (:176);
- an untranslatable instruction (the frontend returns a 0-insn block) → fallback_step() (:195);
- the quantum-EXACT yield invariant: a block that would cross the quantum boundary is not run as a block. The remaining quantum - ran instructions are stepped one at a time so the core lands on exactly the same boundary the switch path would (:234-242). Under SMP round-robin, a core drifting past its quantum changes the cross-core interleaving and breaks determinism (smpschedaffinity04). A block larger than a whole quantum still runs as a block when ran == 0, so the core always makes progress.
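The quantum-exact yield rule in the last bullet can be simulated on block lengths alone. This is a minimal sketch, not run_ir_quantum: `QuantumTrace` and `run_quantum_sketch` are hypothetical, every block is assumed translatable and straight-line, and the single-stepped tail is only counted rather than executed.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical summary of one quantum.
struct QuantumTrace {
    std::uint64_t ran;           // instructions retired in this quantum
    std::uint64_t blocks_run;    // blocks executed whole
    std::uint64_t single_steps;  // instructions stepped one at a time
};

// Walk the upcoming block lengths and apply the documented rule: a block that
// would cross the boundary is replaced by (quantum - ran) single steps, unless
// nothing has run yet, in which case it runs whole to guarantee progress.
inline QuantumTrace run_quantum_sketch(const std::vector<std::uint64_t>& block_lens,
                                       std::uint64_t quantum) {
    QuantumTrace t{0, 0, 0};
    for (std::uint64_t len : block_lens) {
        if (t.ran >= quantum) break;
        if (t.ran + len > quantum && t.ran != 0) {
            t.single_steps += quantum - t.ran;  // land exactly on the boundary
            t.ran = quantum;
            break;
        }
        t.ran += len;
        ++t.blocks_run;
    }
    return t;
}
```

With a 1000-instruction quantum and three 400-instruction blocks, two blocks run whole and the last 200 instructions are single-stepped, so the core stops at exactly 1000. A lone 1500-instruction block still runs whole because ran == 0.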
Block exits funnel through the architecture (the S10 fault tail, Decision 67):
- Exceptional exits (a memory fault or ExitKind::Exception) go through architecture.raise_block_exception(gs, exit, *block) (engine_translate.cpp:358; contract architecture.hpp:115). The architecture sets up its trap PC/nPC (honouring the block's delay-slot metadata) and enters the trap; false means it cannot take the trap (SPARC ET=0) and the engine halts the core into error mode.
- Normal exits advance the PC through architecture.set_pc(gs, exit.next_pc) (engine_translate.cpp:368; contract architecture.hpp:103). SPARC sets PC ← next_pc and nPC ← next_pc + 4.
The universal IR-interpret path (run_ir_interpret_quantum,
engine_translate.cpp:379) mirrors the interpreter arm of
run_ir_quantum with no JIT, and is reachable for any frontend.
The step hook: reference-path duties without an oracle (E0)
Everything above that steps — the quantum tail, GDB interior
breakpoints, single-step — leans on reference_step, which only SPARC
implements. An IR-only architecture (every post-SPARC frontend,
ADR-006) gets the same duties from the interpreter's per-instruction seam
(Decision 79): ir::IStepHook fires before the first op
of each guest instruction (every builder op carries insn_index/pc;
op-less instructions fire via the gap-filling walk), and may stop the
block at that boundary. The committed prefix reuses the precise-trap
"prefix committed" guarantee; the engine resumes with
set_pc(entry_pc + stride · n) and re-translates from the boundary.
A stop is honoured only at a straight-line boundary —
IrBlock::no_stop_tail (set by the frontend on every delay-slot-bearing
SPARC terminator) excludes the CTI→delay-slot shadow, where nPC ≠ PC+4
and an annulled slot must not be re-entered as a fresh block. Through
this seam an IR-only frontend yields at the exact quantum boundary
(the SMP determinism guard, with no oracle), single_step is a true
one-instruction step, GDB stops before an interior breakpoint instead
of running the block past it, and the per-instruction observer fires
interleaved with execution — each callback sees the committed state of
every instruction before it. SPARC defaults are unchanged: the oracle
remains its trace and stepping path; the hook activates for SPARC only
under force_ir_interpret and in the per-instruction lockstep harness
(run_ir_diff(..., per_insn = true)).
Maintenance contract: the two loops mirror each other
run_ir_quantum and run_ir_interpret_quantum duplicate their
shared invariants on purpose (S9 kept the JIT arm frozen): block-cache
lookup, untranslatable-block handling, the quantum-exact yield, the
GDB interior-breakpoint single-step, and the raise_block_exception
fault tail. A fix to any of those belongs in BOTH loops or they
drift apart under SMP — see the MIRROR WARNING comments at
engine_translate.cpp:94-100 and :386-388.
Because the IR and the interpreter operate on the same GuestState
bytes, no synchronisation is needed across the hand-off. The trap, PSR,
and error-mode handling above are therefore identical on both paths. See
IR and LLVM JIT for the engine itself.
ErrorMode and post-mortem
SPARC V8 §7.1: a trap that fires while PSR.ET == 0 halts the processor
and signals the outside world. Tero sets error_mode_ = true on the
offending core (step.cpp:95) and the run loop returns
HaltReason::HaltedMode (guest halt) — or, with GDB attached, redirects
through the stub as SIGSEGV (engine_run_loop.cpp:298-314, Decision 44). The CLI
reacts by dumping a post-mortem of every register on core 0; library
users can call emu->core(idx) and inspect pc(), psr(), tbr(),
wim(), the globals, and the active window.
See also
- IR and LLVM JIT — the binary-translation engine
- Multicore and timing — deeper on quantum tuning, CPI
- Traps and interrupts — full trap reference
- Design decisions / ADRs — the rationale behind every choice here