Multicore and timing¶
This page is the reference for how Tero schedules multiple cores, advances
simulated time, and dispatches timed events. It covers the cooperative
round-robin run loop, the instruction quantum, SimTimeNs and the
cpu_clock_hz/cpi derivation of ns_per_insn, idle-time skipping, the
PacingMode choice, the EventScheduler, and the ADR-001 MultiThread
thread-per-core mode (with its std::barrier and GatedMutex machinery).
It belongs to the Developer Manual. See Traps and interrupts for how IRQs are delivered between quanta, Memory and bus for the per-core bus bridge, and Design principles for the frozen decisions this page implements.
Two execution modes, one default¶
EmulatorConfig::execution_mode (emulator_config.hpp:170) selects how
simulated cores map onto host threads (ADR-001):
| Mode | Host threads | Determinism | Availability |
|---|---|---|---|
| SingleThread (default) | 1 (cooperative round-robin) | Bit-exact, host-load-independent | Always (the SMP2-compatible path) |
| MultiThread | thread-per-core + barrier | Convergent, not bit-exact | Standalone only (Phase 13) |
SingleThread is the default and the only mode the external SMP2 wrapper
uses: that wrapper expects the model to advance under its scheduler, on one
thread. MultiThread is a throughput addition for the 1:1 GR740 goal. The two
modes are runtime-selected — both are compiled into one binary; there is
no build flag (per the config-by-struct rule). execution_mode is orthogonal
to translation (the JIT vs Switch choice): any execution method runs under
either mode.
SingleThread: cooperative round-robin¶
Why single-threaded round-robin?¶
The decision flows from the design principles:
- TSO is satisfied by construction. Only one core executes at any host instant, so the SPARC Total Store Order memory model holds with no fences or atomics.
- Atomics (CASA, LDSTUB, SWAP) are correct by construction. They run to completion within a quantum; no interleaving is possible.
- Deterministic. Two runs of the same image produce the same trace, regardless of host load. Essential for regression testing and lockstep comparison against SIS.
- Debuggable without host-level races. No host-level data race can corrupt guest state.
The cost is throughput: with N cores, one host thread delivers at best 1/N
of the throughput of an ideal parallel implementation. This is acceptable for the RTEMS testsuite goal,
and with translation = true (the tiered LLVM JIT, the default) single-core
throughput is high enough that the host is rarely the bottleneck (see
IR and LLVM JIT). True host-parallel execution is the
standalone-only MultiThread mode below.
The run loop¶
The core loop is Emulator::run_until_unpaced
(src/runtime/src/emulator.cpp:744). Each iteration is one scheduling
round:
sequenceDiagram
participant E as run_until_unpaced
participant IC as IRQ(A)MP
participant C as Cores
participant S as EventScheduler
participant P as Peripherals
loop while sim_time < deadline
E->>E: sync_global_up_counter() (re-base %asr23)
E->>IC: consume_mpstat_wake() per core (release parked CPUs)
E->>IC: sample_interrupts() per core (deliver pending IRQs)
E->>C: scan error_mode (halt if any tripped)
alt all cores powered down
E->>S: next_event_time()
E->>E: sim_time = min(next_event, deadline, now + 1 ms)
else at least one active
loop for each core (inline, round-robin)
E->>C: run_core_quantum(core)
end
E->>E: sim_time += fold of per-core deltas (max | sum)
end
E->>S: fire_pending(sim_time)
E->>P: tick(sim_time) per peripheral
end
Key ordering invariants, each commented in the source:
- IRQ sampling happens before the idle check (emulator.cpp:793) so a powered-down core that just received an IPI wakes before the loop decides to skip time.
- sim_time_ is advanced once per round, by folding the per-core delta_ns via EmulatorConfig::time_advance (emulator.cpp, the fold_delta lambda). The default Concurrent takes the max: the round's chip-time is the longest core's quantum, since the cores execute concurrently in one wall-window, so each core runs at its full rated clock and %realtime is honest for multi-core. This matches the SIS oracle's run_sim_mp (all CPUs advanced to one shared time) and restores the asr23 == simtime invariant that sync_global_up_counter intends. The legacy Sum option accumulates instead (the N-core clock then runs N× fast, each core at clock/N); it is bit-exact with the pre-ADR-005 behaviour and identical to Concurrent for N=1. See ADR-005 (plans/adr-005-multicore-time-model.md). Nothing observes sim_time_ mid-round: peripherals tick at the round boundary with the final value.
- Peripherals tick last (emulator.cpp:902), after the scheduler fires, with the round's final sim_time_.
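The difference between the two time_advance options is easiest to see on concrete numbers. The following is a minimal sketch, not the fold_delta lambda itself: the TimeAdvance enum name and the fold_round_delta helper are hypothetical stand-ins, and only the Concurrent (max) and Sum (accumulate) behaviours are taken from the text above.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the type behind EmulatorConfig::time_advance.
enum class TimeAdvance { Concurrent, Sum };

// Fold one round's per-core delta_ns values into a single sim-time step.
inline std::uint64_t fold_round_delta(const std::vector<std::uint64_t>& deltas,
                                      TimeAdvance mode) {
    if (deltas.empty()) return 0;
    if (mode == TimeAdvance::Concurrent) {
        // Cores share one wall-window: the round lasts as long as the
        // longest core's quantum.
        return *std::max_element(deltas.begin(), deltas.end());
    }
    // Legacy Sum: quanta are laid end to end, so with N cores simulated
    // time advances N times per round.
    return std::accumulate(deltas.begin(), deltas.end(), std::uint64_t{0});
}
```

With two cores that each report 20 000 ns, Concurrent advances the round by 20 000 ns and Sum by 40 000 ns; with one core the two agree, which is the N=1 equivalence noted above.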
A single core's quantum¶
Emulator::run_core_quantum(core_idx) (emulator.cpp:603) runs one core for
up to quantum instructions and returns a CoreQuantumResult
(instructions, delta_ns, optional GDB stop). It picks a path:
1. Powered down → no instructions, but delta_ns = quantum * ns_per_insn so the core stays time-aligned with peripherals (:609).
2. Translation on, no observer → run_ir_quantum (the tiered JIT / IR interpreter, :622). GDB-aware; arms a stop on a breakpoint.
3. Switch path (observer/trace or translation off; not MultiThread, which takes path 2) → the reference core::step interpreter loop (:640). This polls the self-IPI and the GDB breakpoint set, fires the observer, and steps one instruction at a time.

Both instruction-executing paths (2 and 3) call poll_self_interrupt per instruction so self-directed IPIs
arrive at instruction-boundary latency (see
Traps and interrupts).
The quantum¶
EmulatorConfig::quantum (default 1000, emulator_config.hpp:130) is the
number of instructions each core executes before yielding to the next core.
| Quantum size | Pros | Cons |
|---|---|---|
| Small (≤ 100) | Finer cross-core interrupt latency | More scheduler overhead |
| Default (1000) | Balanced for RTEMS | — |
| Large (≥ 10000) | Lower scheduler overhead | Worse cross-core latency |
You almost never need to touch this; the default keeps RTEMS sptests within a
few-percent variance of reference baselines. The N=2 SMP recipe re-tunes the
quantum to 200 to balance IPI-heavy workloads (see MEMORY.md self-IPI note).
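To put rough numbers on the latency column, the simulated time one quantum spans is quantum × ns_per_insn. The sketch below assumes the default Concurrent time advance, where a round costs one quantum of simulated time, and treats that span as an estimate of how long a cross-core IRQ can wait for the next round boundary. quantum_span_ns is a hypothetical helper, not a Tero function.

```cpp
#include <cstdint>

// Simulated nanoseconds covered by one quantum (hypothetical helper).
constexpr std::uint64_t quantum_span_ns(std::uint64_t quantum,
                                        std::uint64_t ns_per_insn) noexcept {
    return quantum * ns_per_insn;
}

// Worked values at 20 ns/insn (50 MHz, cpi = 1.0):
static_assert(quantum_span_ns(100, 20) == 2'000, "small quantum: 2 us");
static_assert(quantum_span_ns(1'000, 20) == 20'000, "default quantum: 20 us");
static_assert(quantum_span_ns(10'000, 20) == 200'000, "large quantum: 200 us");
```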
Idle-time skipping¶
The most performance-sensitive part of the loop is what happens when every
core is idle. Running quantum no-op instructions per idle core would waste
host cycles, so Tero jumps simulated time forward instead
(emulator.cpp:838):
// src/runtime/src/emulator.cpp:838 — all cores powered down
constexpr std::uint64_t MaxIdleNs = 1'000'000ULL; // 1 ms cap
auto next_event = to_underlying(scheduler_.next_event_time());
auto jump_target = std::min(to_underlying(deadline), next_event);
auto max_wake = to_underlying(sim_time_) + MaxIdleNs;
jump_target = std::min(jump_target, max_wake);
if (jump_target > to_underlying(sim_time_)) {
    sim_time_ = SimTimeNs{jump_target};
}
The jump is bounded by three things:
- The run deadline: never overshoot the requested budget.
- scheduler_.next_event_time(): the next scheduled IEvent.
- MaxIdleNs (1 ms): this cap exists because the GPTIMER raises its periodic interrupt via IInterruptSource::raise() (a peripheral tick), not through the EventScheduler, so its next tick is invisible to next_event_time(). Without the cap, time would jump straight to the deadline and the GPTIMER would only fire once, freezing the RTEMS clock. The 1 ms bound guarantees timer interrupts arrive at roughly their expected rate even while the core sleeps.
This is Decision 25 (see Design decisions); it is what lets
RTEMS sp04 (an explicit idle-loop test) finish in finite real time.
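The three bounds can be restated as a pure function, which makes the behaviour easy to check on sample values. This is a sketch under the assumption that plain std::uint64_t nanoseconds stand in for SimTimeNs; idle_jump_target and kMaxIdleNs are illustrative names, not identifiers from emulator.cpp.

```cpp
#include <algorithm>
#include <cstdint>

constexpr std::uint64_t kMaxIdleNs = 1'000'000ULL;  // 1 ms cap, as in the text

// Where simulated time lands after one all-idle round (hypothetical helper).
constexpr std::uint64_t idle_jump_target(std::uint64_t now,
                                         std::uint64_t deadline,
                                         std::uint64_t next_event) noexcept {
    const std::uint64_t max_wake = now + kMaxIdleNs;
    const std::uint64_t target = std::min({deadline, next_event, max_wake});
    // Time never moves backwards: a target in the past leaves `now` unchanged.
    return std::max(target, now);
}
```

With an empty scheduler (next_event = UINT64_MAX) and a deadline 10 ms away, each idle round advances exactly 1 ms, so a tick-driven timer is still visited once per millisecond of simulated time.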
GDB can break into an idle core
The idle-skip path executes no instructions, so the per-instruction
should_break check never runs. The loop adds an explicit
check_async_interrupt() poll for the all-idle case
(emulator.cpp:832) so a Ctrl-C from GDB is not swallowed indefinitely.
Powering cores up and down¶
A LEON3/LEON4 core powers down by writing a non-zero value to %asr19
(wr %g0, %asr19). The handler (exec_write_special,
src/core/src/handlers_special.cpp:106) calls set_power_down(true). The core
stays parked until any of:
- An IRQ is asserted on its input: sample_interrupts wakes it regardless of ET (emulator.cpp:1320).
- RTEMS SMP boot writes MPSTAT[i]=1: consume_mpstat_wake (emulator.cpp:783) releases the secondary CPU at the top of the round, before IRQ sampling.
- A GDB debug command forces a wake.
On reset, only CPU 0 runs. CPUs 1..N start parked; ElfLoader sets
is_powered_down = true for every core except core 0, mirroring real silicon
where the boot CPU wakes the others once the OS is ready (Decision 28).
Simulated time¶
SimTimeNs and the timing knobs¶
Simulated time is an opaque enum class SimTimeNs : std::uint64_t (nanoseconds
simulated). It is advanced by instruction count, never by the host clock for
the core loop. Two config knobs drive it:
- cpu_clock_hz (emulator_config.hpp:109): the system clock. The primary frequency knob; it clocks the cores, the on-chip buses, and the peripherals (the GR712RC/GR740 manuals describe a single clock domain). Bare default 50 MHz; the recipes set 80 MHz (GR712RC) / 250 MHz (GR740).
- cpi (:119): global cycles-per-instruction (TEMU-style, default 1.0). The throughput knob. There is no per-opcode CPI table; Tero models one global CPI.
ns_per_insn is derived, not set by hand:
// src/runtime/include/tero/runtime/emulator_config.hpp:55,65
constexpr std::uint64_t ns_per_cycle(std::uint64_t clock_hz) noexcept {
    return (clock_hz == 0) ? 0 : (1'000'000'000ULL + clock_hz / 2) / clock_hz;
}
inline std::uint64_t ns_per_insn_for(std::uint64_t clock_hz, double cpi) noexcept {
    const double t = static_cast<double>(ns_per_cycle(clock_hz)) * cpi;
    if (!(t >= 1.0)) return 1U; // clamp NaN / non-positive cpi to 1 ns
    return static_cast<std::uint64_t>(std::lround(t));
}
Emulator::create recomputes cfg.ns_per_insn = ns_per_insn_for(cpu_clock_hz,
cpi) (emulator.cpp:227), so setting cpu_clock_hz and cpi is sufficient.
At 50 MHz, cpi = 1.0 → 20 ns/insn.
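A few more worked values make the rounding visible. The block below repeats the two helpers quoted above so that it compiles on its own, then checks them at the three clock rates the text mentions; the checks are illustrative and not part of Tero's test suite.

```cpp
#include <cmath>
#include <cstdint>

// Copied from the excerpt above so the example is standalone.
constexpr std::uint64_t ns_per_cycle(std::uint64_t clock_hz) noexcept {
    return (clock_hz == 0) ? 0 : (1'000'000'000ULL + clock_hz / 2) / clock_hz;
}

inline std::uint64_t ns_per_insn_for(std::uint64_t clock_hz, double cpi) noexcept {
    const double t = static_cast<double>(ns_per_cycle(clock_hz)) * cpi;
    if (!(t >= 1.0)) return 1U;
    return static_cast<std::uint64_t>(std::lround(t));
}

static_assert(ns_per_cycle(50'000'000) == 20, "50 MHz: exactly 20 ns");
static_assert(ns_per_cycle(80'000'000) == 13, "80 MHz: 12.5 ns rounds to 13 ns");
static_assert(ns_per_cycle(250'000'000) == 4, "250 MHz: exactly 4 ns");
```

Because the cycle time is an integer number of nanoseconds, a clock whose period is not a whole nanosecond is rounded: at 80 MHz the model uses 13 ns per cycle rather than 12.5 ns. Raising cpi to 1.5 at 50 MHz gives ns_per_insn_for(50'000'000, 1.5) == 30.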
cpi scales the CPU only — not the peripheral clock
cpi enters the timing model in exactly one place: ns_per_insn_for.
The GPTIMER prescaler and every peripheral stay on
ns_per_cycle(cpu_clock_hz) (the bus clock), unaffected by cpi.
Raising cpi makes the CPU proportionally slower in simulated time
without retiming the timers (emulator_config.hpp:60-62).
run_for(duration) runs until current_sim_time() + duration
(emulator.cpp:538); run_until(deadline) runs until exactly deadline;
current_sim_time() snapshots the clock.
The global up-counter (%asr22:%asr23)¶
%asr22:%asr23 is the LEON3/LEON4 system timecounter — one SoC-wide
free-running counter at the processor clock (GR740-UM §6.10.4), not a
per-core tally. Emulator::sync_global_up_counter (emulator.cpp:1351)
re-bases every core's counter from global sim_time_ once per round, before
any worker thread is released (single-writer):
// src/runtime/src/emulator.cpp:1361 — floor(ns * hz / 1e9) without 128-bit math
const std::uint64_t cycles =
((ns / NsPerSec) * hz) + (((ns % NsPerSec) * hz) / NsPerSec);
for (auto& core : cores_) core.set_up_counter_base(cycles);
Because it is re-based per round (not per instruction), it is round-granular,
which is intentional: a finer counter broke spcpucounter01, and RTEMS falls
back to the GPTIMER timecounter consistently with the round-granular tick. The
counter advances even across the idle-skip jump, which runs no core, because it
is derived from sim_time_ rather than counted per instruction, so TOD/uptime
keep moving while the CPU sleeps. See the %asr23 note in MEMORY.md.
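The split in the conversion matters because the naive product ns * hz overflows 64 bits quickly: at 250 MHz it exceeds 2^64 after roughly 74 simulated seconds. The sketch below wraps the same expression in a standalone function and checks it on small and large inputs; ns_to_cycles and kNsPerSec are illustrative names, and the function assumes hz is small enough that (ns % 1e9) * hz fits in 64 bits, which holds for any clock below about 18 GHz.

```cpp
#include <cstdint>

constexpr std::uint64_t kNsPerSec = 1'000'000'000ULL;

// floor(ns * hz / 1e9), computed as whole seconds plus the sub-second
// remainder so that no intermediate product overflows.
constexpr std::uint64_t ns_to_cycles(std::uint64_t ns, std::uint64_t hz) noexcept {
    return ((ns / kNsPerSec) * hz) + (((ns % kNsPerSec) * hz) / kNsPerSec);
}

static_assert(ns_to_cycles(20, 50'000'000) == 1, "one 20 ns cycle at 50 MHz");
static_assert(ns_to_cycles(1'500'000'000, 50'000'000) == 75'000'000, "1.5 s at 50 MHz");
// 10 000 s at 250 MHz: the naive product would be 2.5e21, far past 2^64.
static_assert(ns_to_cycles(10'000'000'000'000ULL, 250'000'000ULL) == 2'500'000'000'000ULL,
              "long runs stay exact");
```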
Pacing modes¶
Wall-clock pacing is opt-out, not opt-in. EmulatorConfig::pacing
(emulator_config.hpp:163) selects:
- PacingMode::Realtime (default): run_until (emulator.cpp:561) slices the budget into pacing_slice_ns chunks (default 10 ms simulated), calls run_until_unpaced for each, then std::this_thread::sleep_untils on std::chrono::steady_clock so simulated time tracks wall-clock time. The wall-clock anchor and sim-time anchor are taken once per run_until call and dropped at the end (emulator.cpp:571-572). They never persist across calls, so a host stall between two run_for invocations does not produce a long catch-up sleep.
- PacingMode::Turbo: free-running. run_until short-circuits straight to run_until_unpaced (emulator.cpp:562); the emulator never reads the host clock. Mandatory for tests, batch tools, and the SMP2 wrapper (where an external scheduler dictates time).
Reconciling the 'no wall-clock in the core' rule
The original frozen rule said the core must never read the host clock.
After human review (2026-04-28) wall-clock pacing was moved into the core,
gated by PacingMode. The intent of the rule — the SMP2 wrapper controls
time externally — is preserved by Turbo, which the wrapper hard-sets.
Halving ns_per_insn doubles the simulated instruction rate. In Realtime mode that means the
host must execute twice as many instructions per real second to keep up; if it
cannot, the simulation silently falls behind wall-clock time, because sleep_until can
only delay the simulator, never accelerate it.
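The Realtime slicing described above can be sketched in a few lines. This is an illustration of the anchor-and-sleep pattern only, not the body of run_until: run_paced, pace_target, and the step callback (standing in for run_until_unpaced) are hypothetical names, and simulated time is a plain nanosecond count.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>
#include <thread>

using WallClock = std::chrono::steady_clock;

// Wall-clock instant at which simulated time sim_now_ns is due, given the
// anchors captured at the start of the call.
inline WallClock::time_point pace_target(WallClock::time_point wall_anchor,
                                         std::uint64_t sim_anchor_ns,
                                         std::uint64_t sim_now_ns) {
    const std::chrono::nanoseconds elapsed(
        static_cast<std::chrono::nanoseconds::rep>(sim_now_ns - sim_anchor_ns));
    return wall_anchor + std::chrono::duration_cast<WallClock::duration>(elapsed);
}

// Run step(slice_end) one slice at a time, sleeping after each slice until
// the wall clock catches up with simulated time.
template <typename Step>
void run_paced(std::uint64_t sim_start_ns, std::uint64_t deadline_ns,
               std::uint64_t slice_ns, Step&& step) {
    assert(slice_ns > 0);
    const auto wall_anchor = WallClock::now();  // anchors live only for this call
    std::uint64_t sim_now = sim_start_ns;
    while (sim_now < deadline_ns) {
        const std::uint64_t slice_end = std::min(deadline_ns, sim_now + slice_ns);
        step(slice_end);  // stand-in for run_until_unpaced(slice_end)
        sim_now = slice_end;
        // Returns immediately if the host is already behind schedule.
        std::this_thread::sleep_until(pace_target(wall_anchor, sim_start_ns, sim_now));
    }
}
```

Because the anchors are locals, nothing carries over between calls. When a slice takes longer than its wall-clock budget, sleep_until returns at once and the loop simply continues, which is the "falls behind" case.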
The EventScheduler¶
tero::runtime::EventScheduler (event_scheduler.hpp) is a min-heap of
(deadline, IEvent*) nodes implementing the IScheduler interface
(src/interfaces/include/tero/ischeduler.hpp). Peripherals must not spin on
the CPU clock; they schedule future work and are called back via
IEvent::execute() (src/interfaces/include/tero/ievent.hpp).
| Method | Purpose |
|---|---|
| schedule_event(when, ev) | Enqueue ev to fire at simulated time when. |
| fire_pending(now) | Fire every event with deadline ≤ now, in chronological order. Events may re-schedule themselves from inside execute(). |
| next_event_time() | Earliest pending deadline, or UINT64_MAX if empty. Drives idle-skip. |
| empty() | True when no events are pending. |
// src/runtime/include/tero/runtime/event_scheduler.hpp:32
void fire_pending(SimTimeNs now) {
    while (!heap_.empty() && heap_.top().deadline <= now) {
        auto node = heap_.top();
        heap_.pop();
        node.event->execute(); // may re-schedule itself
    }
}
The run loop calls fire_pending(sim_time_) once per round
(emulator.cpp:900), after cores have run and before peripherals tick.
The scheduler is not thread-safe — and that is by design
EventScheduler is driven only from the main thread (the round boundary),
even under MultiThread: workers run cores, but fire_pending /
schedule_event happen at the serial barrier point. The class header
states this explicitly (event_scheduler.hpp:21). Note that the GPTIMER's
periodic IRQ does not flow through the scheduler — it is raised from
the peripheral tick, which is why idle-skip needs the MaxIdleNs cap.
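The self-rescheduling pattern is worth seeing end to end. The following is a self-contained sketch with minimal stand-ins: IEvent here has only execute(), MiniScheduler is a simplified model of the min-heap described above using plain std::uint64_t deadlines, and PeriodicEvent is a hypothetical peripheral callback. Tero's real IEvent and IScheduler signatures may differ.

```cpp
#include <cstdint>
#include <limits>
#include <queue>
#include <vector>

// Stand-in interface; the real one lives in tero/ievent.hpp.
struct IEvent {
    virtual ~IEvent() = default;
    virtual void execute() = 0;
};

class MiniScheduler {
public:
    void schedule_event(std::uint64_t when, IEvent* ev) { heap_.push(Node{when, ev}); }

    void fire_pending(std::uint64_t now) {
        while (!heap_.empty() && heap_.top().deadline <= now) {
            const Node node = heap_.top();  // copy first: execute() may push
            heap_.pop();
            node.event->execute();
        }
    }

    std::uint64_t next_event_time() const {
        return heap_.empty() ? std::numeric_limits<std::uint64_t>::max()
                             : heap_.top().deadline;
    }

private:
    struct Node { std::uint64_t deadline; IEvent* event; };
    struct Later {
        bool operator()(const Node& a, const Node& b) const { return a.deadline > b.deadline; }
    };
    std::priority_queue<Node, std::vector<Node>, Later> heap_;  // min-heap on deadline
};

// Fires every period_ ns by re-scheduling itself from execute().
class PeriodicEvent final : public IEvent {
public:
    PeriodicEvent(MiniScheduler& s, std::uint64_t first, std::uint64_t period)
        : sched_(s), next_(first), period_(period) {
        sched_.schedule_event(next_, this);
    }
    void execute() override {
        ++fired_;
        next_ += period_;
        sched_.schedule_event(next_, this);
    }
    int fired() const { return fired_; }

private:
    MiniScheduler& sched_;
    std::uint64_t next_;
    std::uint64_t period_;
    int fired_ = 0;
};
```

An event scheduled at 100 ns with a 100 ns period fires three times during fire_pending(350) and leaves next_event_time() at 400, so an idle-skip jump would stop exactly at the next deadline rather than relying on the 1 ms cap.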
MultiThread mode (ADR-001)¶
MultiThread runs each simulated core on its own host thread for aggregate
throughput beyond a single host thread, while preserving the cross-core
ordering points the round-robin loop relies on. It is standalone-only and
inert (falls back to single-thread dispatch) whenever a GDB stub or observer
is attached.
Thread-per-core + std::barrier¶
start_workers (emulator.cpp:701) spawns N-1 worker threads (cores
1..N-1); the main thread runs core 0. Two std::barriers with N
participants bracket each round (emulator.cpp:711-716): a start_barrier_
and a done_barrier_. The MultiThread dispatch in the run loop
(emulator.cpp:855):
// src/runtime/src/emulator.cpp:861 — MultiThread round
start_barrier_->arrive_and_wait(); // release workers
core_results_[0] = run_core_batch(0); // main thread runs core 0
done_barrier_->arrive_and_wait(); // rejoin
// reduce per-core deltas; drain deferred FLUSH; advance sim_time_ once
Each worker's worker_loop (emulator.cpp:690) parks at start_barrier_,
runs run_core_batch(core_idx), then parks at done_barrier_. The
mt_dispatch_now() gate (emulator.hpp:347) is
mt_dispatch_active_ && gdb_stub_ == nullptr && observer_ == nullptr — a GDB
stub or observer forces the SingleThread inline branch.
run_core_batch (emulator.cpp:677) runs quantum_batch quanta back-to-back
before the barrier (Phase 14). Larger quantum_batch amortises the barrier
(higher CPU-bound throughput) at the cost of coarser cross-core IRQ/event
delivery; default 1 is the finest grain.
Per-core state under MultiThread¶
Several pieces that are shared in SingleThread become per-core under MultiThread because their internals are not thread-safe (Hazard G):
- Bus bridge: each core owns a CpuBusBridge with its own RAM fast-path cache (bus_bridge_, one per core, emulator.cpp:288). See Memory and bus.
- IR block cache / IR interpreter / tiered JIT: per-core under MultiThread (ir_cache_for / tiered_jit_for, emulator.hpp:455), each per-core JIT keeping its own Tier-2 background O2 thread (P14-2). SingleThread shares one of each.
- MMIO peripheral locks and the IRQ-controller lock: engaged via set_thread_safe(true) (emulator.cpp:373-378); see GatedMutex.
A guest FLUSH issued by a worker is deferred and drained at the serial
barrier point (code_flush_pending_, emulator.cpp:872) so cache mutation
never races a translation.
GatedMutex (runtime-gated locking)¶
tero::GatedMutex (src/interfaces/include/tero/gated_mutex.hpp) is a
std::mutex whose locking is gated by a runtime flag. In SingleThread the
gate is inactive and lock()/unlock()/try_lock() are no-ops (one
predictable branch, effectively zero sync cost); in MultiThread it behaves
like a real mutex. This is the mechanism ADR-001 prescribes:
Why a runtime gate, not a compile-time NullMutex
ADR-001 originally described the no-op-in-single-thread mechanism as a
compile-time template parameter. That conflicts with the
config-by-struct / no-compile-flags-for-behaviour rule:
execution_mode is a runtime field and both modes must live in one
binary. The runtime gate reconciles them — zero overhead single-thread,
real exclusion multi-thread, one binary (gated_mutex.hpp:14-22).
Threading contract: set_active() is called exactly once during
initialize(), before any worker starts (emulator.cpp:373). The flag is
read-only for the rest of the run, so it never races and lock()/unlock()
observe a stable value. Flipping it while threads run is a usage error.
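A runtime-gated mutex of this kind is small. The class below is a sketch of the idea rather than the contents of gated_mutex.hpp: GatedMutexSketch and guarded_increment are hypothetical names, and the plain bool flag relies on the threading contract above (set once before any worker starts, read-only afterwards).

```cpp
#include <mutex>

class GatedMutexSketch {
public:
    // Must be called before any worker thread exists; never while running.
    void set_active(bool on) noexcept { active_ = on; }

    void lock() { if (active_) mutex_.lock(); }
    void unlock() { if (active_) mutex_.unlock(); }
    bool try_lock() { return active_ ? mutex_.try_lock() : true; }

private:
    std::mutex mutex_;
    bool active_ = false;  // inactive: every operation is a single branch
};

// The class meets the Lockable requirements, so standard guards work.
inline int guarded_increment(GatedMutexSketch& m, int& counter) {
    std::lock_guard<GatedMutexSketch> guard(m);
    return ++counter;
}
```

Call sites are written once, against the lock interface, and the mode decides at run time whether any synchronisation actually happens.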
TSO and atomics under MultiThread¶
ADR-003 makes x86-64 the tier-1 host precisely because its native TSO maps
SPARC TSO with zero fences. Atomics (CASA, LDSTUB, SWAP) are real atomics
on the RAM backing store under MultiThread (SystemBus::atomic_*, see
Memory and bus); the per-region lock serialises MMIO. On a
weak-memory host (ARM64, tier-2) explicit fences would be required — which is
why ARM64 is best-effort, not a 1:1 target.
Determinism gate¶
Byte-identical SingleThread == MultiThread traces are impossible for
lock-contended workloads: the round-robin path has no real contention, so the
two cannot interleave identically. The achievable gate is shared-state
convergence — both modes reach the same architectural end state. RTEMS
smptests pass under MultiThread for the configs validated to date (see the
Phase 13 notes referenced from MEMORY.md).
How to tune and extend¶
- Cross-core IRQ latency too coarse? Lower quantum (SingleThread) or quantum_batch (MultiThread). Both trade throughput for latency.
- Need a timed peripheral callback? Implement IEvent, call schedule_event(when, ev) from attach() or a register write, and re-schedule from execute(). Do not read the host clock or sim_time_ directly for periodic work; use the scheduler.
- Need a periodic interrupt with idle-skip awareness? If you raise IRQs from tick (like the GPTIMER) rather than the scheduler, your tick is invisible to next_event_time(). The 1 ms idle cap covers the common case, but a period longer than 1 ms still stops the idle jump at every 1 ms cap. Prefer schedule_event for event-precise timing.
- Standalone parallelism? Set execution_mode = MultiThread and pacing = Turbo. Confirm the workload is not GDB/observer-driven (those force SingleThread).
Measure SMP timing under control
SMP regression checks are timing-sensitive. Always do back-to-back
fix-vs-baseline runs on the same host — committed CSVs go thermally
stale and produce phantom regressions. Only PASS transitions matter;
TIMEOUT↔FAIL shuffles are noise (see MEMORY.md controlled-baseline note).