
Multicore and timing

This page is the reference for how Tero schedules multiple cores, advances simulated time, and dispatches timed events. It covers the cooperative round-robin run loop, the instruction quantum, SimTimeNs and the cpu_clock_hz/cpi derivation of ns_per_insn, idle-time skipping, the PacingMode choice, the EventScheduler, and the ADR-001 MultiThread thread-per-core mode (with its std::barrier and GatedMutex machinery).

It belongs to the Developer Manual. See Traps and interrupts for how IRQs are delivered between quanta, Memory and bus for the per-core bus bridge, and Design principles for the frozen decisions this page implements.


Two execution modes, one default

EmulatorConfig::execution_mode (emulator_config.hpp:170) selects how simulated cores map onto host threads (ADR-001):

| Mode | Host threads | Determinism | Availability |
|---|---|---|---|
| SingleThread (default) | 1 (cooperative round-robin) | Bit-exact, host-load-independent | Always (the SMP2-compatible path) |
| MultiThread | thread-per-core + barrier | Convergent, not bit-exact | Standalone only (Phase 13) |

SingleThread is the default and the only mode the external SMP2 wrapper uses: that wrapper expects the model to advance under its scheduler, on one thread. MultiThread is a throughput addition for the 1:1 GR740 goal. The two modes are runtime-selected — both are compiled into one binary; there is no build flag (per the config-by-struct rule). execution_mode is orthogonal to translation (the JIT vs Switch choice): any execution method runs under either mode.


SingleThread: cooperative round-robin

Why single-threaded round-robin?

The decision flows from the design principles:

  • TSO is satisfied by construction. Only one core executes at any host instant, so the SPARC Total Store Order memory model holds with no fences or atomics.
  • Atomics (CASA, LDSTUB, SWAP) are correct by construction. They run to completion within a quantum; no interleaving is possible.
  • Deterministic. Two runs of the same image produce the same trace, regardless of host load. Essential for regression testing and lockstep comparison against SIS.
  • Debuggable without host-level races. No host-level data race can corrupt guest state.

The cost is throughput: with N cores, one host thread delivers at best 1/N of a parallel implementation. This is acceptable for the RTEMS testsuite goal, and with translation = true (the tiered LLVM JIT, the default) single-core throughput is high enough that the host is rarely the bottleneck (see IR and LLVM JIT). True host-parallel execution is the standalone-only MultiThread mode below.

The run loop

The core loop is Emulator::run_until_unpaced (src/runtime/src/emulator.cpp:744). Each iteration is one scheduling round:

sequenceDiagram
    participant E as run_until_unpaced
    participant IC as IRQ(A)MP
    participant C as Cores
    participant S as EventScheduler
    participant P as Peripherals

    loop while sim_time < deadline
        E->>E: sync_global_up_counter() (re-base %asr23)
        E->>IC: consume_mpstat_wake() per core (release parked CPUs)
        E->>IC: sample_interrupts() per core (deliver pending IRQs)
        E->>C: scan error_mode (halt if any tripped)
        alt all cores powered down
            E->>S: next_event_time()
            E->>E: sim_time = min(next_event, deadline, now + 1 ms)
        else at least one active
            loop for each core (inline, round-robin)
                E->>C: run_core_quantum(core)
            end
            E->>E: sim_time += fold of per-core deltas (max | sum)
        end
        E->>S: fire_pending(sim_time)
        E->>P: tick(sim_time) per peripheral
    end

Key ordering invariants, each commented in the source:

  • IRQ sampling happens before the idle check (emulator.cpp:793) so a powered-down core that just received an IPI wakes before the loop decides to skip time.
  • sim_time_ is advanced once per round, by folding the per-core delta_ns via EmulatorConfig::time_advance (emulator.cpp, the fold_delta lambda). The default, Concurrent, takes the max: the cores execute concurrently in one wall-window, so the round's chip-time is the longest core's quantum, each core runs at its full rated clock, and %realtime is honest for multi-core. This matches the SIS oracle's run_sim_mp (all CPUs advanced to one shared time) and restores the asr23 == simtime invariant that sync_global_up_counter intends. The legacy Sum option accumulates instead, so the N-core clock runs N× fast with each core at clock/N; it is bit-exact with the pre-ADR-005 behaviour and identical to Concurrent for N=1. See ADR-005 (plans/adr-005-multicore-time-model.md). Nothing observes sim_time_ mid-round: peripherals tick at the round boundary with the final value.
  • Peripherals tick last (emulator.cpp:902), after the scheduler fires, with the round's final sim_time_.
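The once-per-round fold can be illustrated with a small standalone sketch. The names TimeAdvance and fold_round_delta are hypothetical stand-ins for EmulatorConfig::time_advance and the fold_delta lambda, not the actual Tero code.

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical stand-in for the EmulatorConfig::time_advance choice.
enum class TimeAdvance { Concurrent, Sum };

// Fold one round's per-core delta_ns values into a single sim-time advance.
// Concurrent: the round lasts as long as its longest core (max).
// Sum: legacy behaviour, every core's delta is accumulated.
inline std::uint64_t fold_round_delta(const std::vector<std::uint64_t>& deltas_ns,
                                      TimeAdvance mode) {
    if (deltas_ns.empty()) {
        return 0;
    }
    if (mode == TimeAdvance::Concurrent) {
        return *std::max_element(deltas_ns.begin(), deltas_ns.end());
    }
    return std::accumulate(deltas_ns.begin(), deltas_ns.end(), std::uint64_t{0});
}
```

With two cores that each run a 1000-instruction quantum at 20 ns/insn, Concurrent advances the round by 20 µs and Sum by 40 µs; for a single core the two modes agree.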

A single core's quantum

Emulator::run_core_quantum(core_idx) (emulator.cpp:603) runs one core for up to quantum instructions and returns a CoreQuantumResult (instructions, delta_ns, optional GDB stop). It picks a path:

  1. Powered down → no instructions, but delta_ns = quantum * ns_per_insn so the core stays time-aligned with peripherals (:609).
  2. Translation on, no observer → run_ir_quantum (the tiered JIT / IR interpreter, :622). GDB-aware; arms a stop on a breakpoint.
  3. Switch path (observer/trace or translation off — not MultiThread, which takes path 2) → the reference core::step interpreter loop (:640). This polls the self-IPI and the GDB breakpoint set, fires the observer, and steps one instruction at a time.

Both executing paths (2 and 3) call poll_self_interrupt per instruction so self-directed IPIs arrive at instruction-boundary latency (see Traps and interrupts).
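As a rough illustration of the path choice, the decision can be written as a pure function. QuantumPath, QuantumInputs, pick_quantum_path and powered_down_delta_ns are hypothetical names for this sketch, and the MultiThread special case mentioned for path 3 is deliberately left out.

```cpp
#include <cstdint>

// Hypothetical labels for the three paths described above.
enum class QuantumPath { PoweredDown, IrQuantum, SwitchInterpreter };

// Hypothetical bundle of the inputs the decision depends on.
struct QuantumInputs {
    bool powered_down;
    bool translation;
    bool observer_attached;
};

// Path 1 wins first; path 2 needs translation and no observer; else path 3.
constexpr QuantumPath pick_quantum_path(const QuantumInputs& in) noexcept {
    if (in.powered_down) {
        return QuantumPath::PoweredDown;
    }
    if (in.translation && !in.observer_attached) {
        return QuantumPath::IrQuantum;
    }
    return QuantumPath::SwitchInterpreter;
}

// A powered-down core retires no instructions but still accounts a full
// quantum of simulated time, keeping it aligned with the peripherals.
constexpr std::uint64_t powered_down_delta_ns(std::uint64_t quantum,
                                              std::uint64_t ns_per_insn) noexcept {
    return quantum * ns_per_insn;
}
```

For the default quantum of 1000 at 20 ns/insn, a parked core contributes a 20 µs delta to the round.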

The quantum

EmulatorConfig::quantum (default 1000, emulator_config.hpp:130) is the number of instructions each core executes before yielding to the next core.

| Quantum size | Pros | Cons |
|---|---|---|
| Small (≤ 100) | Finer cross-core interrupt latency | More scheduler overhead |
| Default (1000) | Balanced for RTEMS | |
| Large (≥ 10000) | Lower scheduler overhead | Worse cross-core latency |

You almost never need to touch this; the default keeps RTEMS sptests within a few-percent variance of reference baselines. The N=2 SMP recipe re-tunes the quantum to 200 to balance IPI-heavy workloads (see MEMORY.md self-IPI note).


Idle-time skipping

The most performance-sensitive part of the loop is what happens when every core is idle. Running quantum no-op instructions per idle core would waste host cycles, so Tero jumps simulated time forward instead (emulator.cpp:838):

// src/runtime/src/emulator.cpp:838 — all cores powered down
constexpr std::uint64_t MaxIdleNs = 1'000'000ULL;          // 1 ms cap
auto next_event = to_underlying(scheduler_.next_event_time());
auto jump_target = std::min(to_underlying(deadline), next_event);
auto max_wake = to_underlying(sim_time_) + MaxIdleNs;
jump_target = std::min(jump_target, max_wake);
if (jump_target > to_underlying(sim_time_)) {
    sim_time_ = SimTimeNs{jump_target};
}

The jump is bounded by three things:

  1. The run deadline — never overshoot the requested budget.
  2. scheduler_.next_event_time() — the next scheduled IEvent.
  3. MaxIdleNs (1 ms) — this cap exists because the GPTIMER raises its periodic interrupt via IInterruptSource::raise() (a peripheral tick), not through the EventScheduler, so its next tick is invisible to next_event_time(). Without the cap, time would jump straight to the deadline and the GPTIMER would only fire once, freezing the RTEMS clock. The 1 ms bound guarantees timer interrupts arrive at roughly their expected rate even while the core sleeps.

This is Decision 25 (see Design decisions); it is what lets RTEMS sp04 (an explicit idle-loop test) finish in finite real time.
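To make the three bounds concrete, here is a minimal standalone version of the jump computation on raw nanosecond counts. The function name idle_jump_target and the constant kMaxIdleNs are illustrative; the real code operates on SimTimeNs inside the run loop.

```cpp
#include <algorithm>
#include <cstdint>

// 1 ms cap, mirroring MaxIdleNs in the excerpt above.
constexpr std::uint64_t kMaxIdleNs = 1'000'000ULL;

// Where simulated time lands after one all-idle round.
// The result is the earliest of: the run deadline, the next scheduled event,
// and now + 1 ms. Time never moves backwards, so a stale event deadline
// leaves the clock unchanged. (Overflow of now + cap is ignored here.)
constexpr std::uint64_t idle_jump_target(std::uint64_t now_ns,
                                         std::uint64_t deadline_ns,
                                         std::uint64_t next_event_ns) {
    const std::uint64_t bounded =
        std::min({deadline_ns, next_event_ns, now_ns + kMaxIdleNs});
    return std::max(bounded, now_ns);
}
```

With an empty scheduler (next event at UINT64_MAX) and a 10 ms deadline, an idle system therefore advances in 1 ms steps, giving tick-driven peripherals such as the GPTIMER ten chances to fire instead of one.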

GDB can break into an idle core

The idle-skip path executes no instructions, so the per-instruction should_break check never runs. The loop adds an explicit check_async_interrupt() poll for the all-idle case (emulator.cpp:832) so a Ctrl-C from GDB is not swallowed indefinitely.

Powering cores up and down

A LEON3/LEON4 core powers down by writing a non-zero value to %asr19 (wr %g0, %asr19). The handler (exec_write_special, src/core/src/handlers_special.cpp:106) calls set_power_down(true). The core stays parked until any of:

  • An IRQ is asserted on its input — sample_interrupts wakes it regardless of ET (emulator.cpp:1320).
  • RTEMS SMP boot writes MPSTAT[i]=1: consume_mpstat_wake (emulator.cpp:783) releases the secondary CPU at the top of the round, before IRQ sampling.
  • A GDB debug command forces a wake.

On reset, only CPU 0 runs. CPUs 1..N start parked; ElfLoader sets is_powered_down = true for every core except core 0, mirroring real silicon where the boot CPU wakes the others once the OS is ready (Decision 28).


Simulated time

SimTimeNs and the timing knobs

Simulated time is an opaque enum class SimTimeNs : std::uint64_t (simulated nanoseconds). In the core loop it is advanced by instruction count, never by the host clock. Two config knobs drive it:

  • cpu_clock_hz (emulator_config.hpp:109) — the system clock. The primary frequency knob; it clocks the cores, the on-chip buses, and the peripherals (the GR712RC/GR740 manuals describe a single clock domain). Bare default 50 MHz; the recipes set 80 MHz (GR712RC) / 250 MHz (GR740).
  • cpi (:119) — global cycles-per-instruction (TEMU-style, default 1.0). The throughput knob. There is no per-opcode CPI table — Tero models one global CPI.

ns_per_insn is derived, not set by hand:

// src/runtime/include/tero/runtime/emulator_config.hpp:55,65
constexpr std::uint64_t ns_per_cycle(std::uint64_t clock_hz) noexcept {
    return (clock_hz == 0) ? 0 : (1'000'000'000ULL + clock_hz / 2) / clock_hz;
}
inline std::uint64_t ns_per_insn_for(std::uint64_t clock_hz, double cpi) noexcept {
    const double t = static_cast<double>(ns_per_cycle(clock_hz)) * cpi;
    if (!(t >= 1.0)) return 1U;            // clamp NaN / non-positive cpi to 1 ns
    return static_cast<std::uint64_t>(std::lround(t));
}

Emulator::create recomputes cfg.ns_per_insn = ns_per_insn_for(cpu_clock_hz, cpi) (emulator.cpp:227), so setting cpu_clock_hz and cpi is sufficient. At 50 MHz, cpi = 1.0 → 20 ns/insn.

cpi scales the CPU only — not the peripheral clock

cpi enters the timing model in exactly one place: ns_per_insn_for. The GPTIMER prescaler and every peripheral stay on ns_per_cycle(cpu_clock_hz) (the bus clock), unaffected by cpi. Raising cpi makes the CPU proportionally slower in simulated time without retiming the timers (emulator_config.hpp:60-62).
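The following sketch repeats the two helpers from the excerpt above and adds a hypothetical Timing struct and timing_for function (not part of Tero) to show that cpi changes only the CPU step while the bus step stays on ns_per_cycle.

```cpp
#include <cmath>
#include <cstdint>

// Rounded nanoseconds per clock cycle (as in the excerpt above).
constexpr std::uint64_t ns_per_cycle(std::uint64_t clock_hz) noexcept {
    return (clock_hz == 0) ? 0 : (1'000'000'000ULL + clock_hz / 2) / clock_hz;
}

// Nanoseconds per instruction, clamped to at least 1 ns.
inline std::uint64_t ns_per_insn_for(std::uint64_t clock_hz, double cpi) noexcept {
    const double t = static_cast<double>(ns_per_cycle(clock_hz)) * cpi;
    if (!(t >= 1.0)) return 1U;  // NaN or non-positive cpi
    return static_cast<std::uint64_t>(std::lround(t));
}

// Hypothetical helper: the two time steps a configuration produces.
struct Timing {
    std::uint64_t cpu_ns_per_insn;   // scaled by cpi
    std::uint64_t bus_ns_per_cycle;  // peripherals and GPTIMER prescaler
};

inline Timing timing_for(std::uint64_t clock_hz, double cpi) noexcept {
    return Timing{ns_per_insn_for(clock_hz, cpi), ns_per_cycle(clock_hz)};
}
```

At 50 MHz, raising cpi from 1.0 to 2.0 moves the CPU step from 20 ns to 40 ns while the bus step stays at 20 ns. Because ns_per_cycle is an integer, 80 MHz yields 13 ns per cycle (12.5 rounded up) and 250 MHz yields exactly 4 ns.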

run_for(duration) runs until current_sim_time() + duration (emulator.cpp:538); run_until(deadline) runs until exactly deadline; current_sim_time() snapshots the clock.

The global up-counter (%asr22:%asr23)

%asr22:%asr23 is the LEON3/LEON4 system timecounter — one SoC-wide free-running counter at the processor clock (GR740-UM §6.10.4), not a per-core tally. Emulator::sync_global_up_counter (emulator.cpp:1351) re-bases every core's counter from global sim_time_ once per round, before any worker thread is released (single-writer):

// src/runtime/src/emulator.cpp:1361 — floor(ns * hz / 1e9) without 128-bit math
const std::uint64_t cycles =
    ((ns / NsPerSec) * hz) + (((ns % NsPerSec) * hz) / NsPerSec);
for (auto& core : cores_) core.set_up_counter_base(cycles);

Because it is re-based per round (not per instruction), the counter is round-granular. That is intentional: a finer counter broke spcpucounter01, and RTEMS falls back to the GPTIMER timecounter, which is consistent with the round-granular tick. The counter also advances across the idle-skip jump, even though the jump never runs a core, so TOD/uptime keep moving while the CPU sleeps. See the %asr23 note in MEMORY.md.
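The split-by-seconds conversion above can be checked in isolation. In this sketch ns_to_cycles and kNsPerSec are illustrative names; the arithmetic is the same as in the excerpt. Writing ns = q * 1e9 + r with r < 1e9 gives floor(ns * hz / 1e9) = q * hz + floor(r * hz / 1e9), so the result is exact and the intermediate products stay within 64 bits for realistic clock rates.

```cpp
#include <cstdint>

constexpr std::uint64_t kNsPerSec = 1'000'000'000ULL;

// floor(ns * hz / 1e9) without a 128-bit intermediate:
// whole seconds and the sub-second remainder are converted separately.
constexpr std::uint64_t ns_to_cycles(std::uint64_t ns, std::uint64_t hz) noexcept {
    return ((ns / kNsPerSec) * hz) + (((ns % kNsPerSec) * hz) / kNsPerSec);
}

// One simulated second at 250 MHz is exactly 250 million cycles.
static_assert(ns_to_cycles(kNsPerSec, 250'000'000ULL) == 250'000'000ULL,
              "one second maps to hz cycles");
```

A naive ns * hz would overflow for long runs: 100,000 simulated seconds (1e14 ns) at 250 MHz is about 2.5e22 before the division, beyond the 64-bit range, while the split form returns 2.5e13 cycles.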


Pacing modes

Wall-clock pacing is opt-out, not opt-in. EmulatorConfig::pacing (emulator_config.hpp:163) selects:

  • PacingMode::Realtime (default): run_until (emulator.cpp:561) slices the budget into pacing_slice_ns chunks (default 10 ms simulated), calls run_until_unpaced for each, then calls std::this_thread::sleep_until against std::chrono::steady_clock so simulated time tracks wall-clock time. The wall-clock anchor and sim-time anchor are taken once per run_until call and dropped at the end (emulator.cpp:571-572); they never persist across calls, so a host stall between two run_for invocations does not produce a long catch-up sleep.
  • PacingMode::Turbo — free-running. run_until short-circuits straight to run_until_unpaced (emulator.cpp:562); the emulator never reads the host clock. Mandatory for tests, batch tools, and the SMP2 wrapper (where an external scheduler dictates time).

Reconciling the 'no wall-clock in the core' rule

The original frozen rule said the core must never read the host clock. After human review (2026-04-28) wall-clock pacing was moved into the core, gated by PacingMode. The intent of the rule — the SMP2 wrapper controls time externally — is preserved by Turbo, which the wrapper hard-sets.

Halving ns_per_insn doubles the simulated instruction rate. In Realtime mode the host must then execute twice as many instructions per real second to keep up; if it cannot, the simulation silently falls behind, because sleep_until can only delay the simulator, never speed it up.
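A minimal sketch of the slicing-and-sleeping scheme, under stated assumptions: run_until_paced, its parameters, and the run_unpaced callback are hypothetical stand-ins for run_until and run_until_unpaced, and simulated time is a raw nanosecond count rather than SimTimeNs.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>
#include <functional>
#include <thread>

enum class PacingMode { Realtime, Turbo };

// Advances sim_time_ns to deadline_ns and returns the number of slices run.
// Turbo: one unpaced run, the host clock is never read.
// Realtime: after each slice, sleep until the wall clock catches up with
// the simulated time elapsed since the anchors taken at entry.
inline std::uint64_t run_until_paced(
        std::uint64_t& sim_time_ns, std::uint64_t deadline_ns,
        std::uint64_t slice_ns, PacingMode mode,
        const std::function<void(std::uint64_t)>& run_unpaced) {
    if (sim_time_ns >= deadline_ns) return 0;
    if (mode == PacingMode::Turbo) {
        run_unpaced(deadline_ns);
        sim_time_ns = deadline_ns;
        return 1;
    }
    using Clock = std::chrono::steady_clock;
    const auto wall_anchor = Clock::now();        // taken once per call
    const std::uint64_t sim_anchor = sim_time_ns;  // dropped on return
    slice_ns = std::max<std::uint64_t>(slice_ns, 1);  // avoid a zero-length slice
    std::uint64_t slices = 0;
    while (sim_time_ns < deadline_ns) {
        const std::uint64_t slice_end = std::min(deadline_ns, sim_time_ns + slice_ns);
        run_unpaced(slice_end);
        sim_time_ns = slice_end;
        ++slices;
        const auto elapsed_sim = std::chrono::nanoseconds(
            static_cast<std::int64_t>(sim_time_ns - sim_anchor));
        // Returns immediately if the host is already behind schedule.
        std::this_thread::sleep_until(wall_anchor + elapsed_sim);
    }
    return slices;
}
```

Because the sleep target is computed from the anchors rather than from the previous slice, a slow slice is absorbed by shorter sleeps afterwards; if every slice is slow, sleep_until returns at once and the simulation simply runs behind wall-clock time.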


The EventScheduler

tero::runtime::EventScheduler (event_scheduler.hpp) is a min-heap of (deadline, IEvent*) nodes implementing the IScheduler interface (src/interfaces/include/tero/ischeduler.hpp). Peripherals must not spin on the CPU clock; they schedule future work and are called back via IEvent::execute() (src/interfaces/include/tero/ievent.hpp).

| Method | Purpose |
|---|---|
| schedule_event(when, ev) | Enqueue ev to fire at simulated time when. |
| fire_pending(now) | Fire every event with deadline ≤ now, in chronological order. Events may re-schedule themselves from inside execute(). |
| next_event_time() | Earliest pending deadline, or UINT64_MAX if empty. Drives idle-skip. |
| empty() | True when no events are pending. |

// src/runtime/include/tero/runtime/event_scheduler.hpp:32
void fire_pending(SimTimeNs now) {
    while (!heap_.empty() && heap_.top().deadline <= now) {
        auto node = heap_.top();
        heap_.pop();
        node.event->execute();          // may re-schedule itself
    }
}

The run loop calls fire_pending(sim_time_) once per round (emulator.cpp:900), after cores have run and before peripherals tick.
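The self-rescheduling pattern can be tried out with a small stand-in scheduler. MiniScheduler and PeriodicEvent are hypothetical classes written for this sketch; they use raw std::uint64_t deadlines instead of SimTimeNs, and the IEvent shown here is a simplified local interface, not the Tero header.

```cpp
#include <cstdint>
#include <functional>
#include <limits>
#include <queue>
#include <vector>

// Simplified local callback interface for the sketch.
struct IEvent {
    virtual ~IEvent() = default;
    virtual void execute() = 0;
};

// Min-heap of (deadline, event) nodes, earliest deadline on top.
class MiniScheduler {
    struct Node {
        std::uint64_t deadline;
        IEvent* event;
        bool operator>(const Node& other) const { return deadline > other.deadline; }
    };
    std::priority_queue<Node, std::vector<Node>, std::greater<Node>> heap_;

public:
    void schedule_event(std::uint64_t when, IEvent* ev) { heap_.push(Node{when, ev}); }

    // Fire everything due at or before `now`; a node is popped before its
    // callback runs, so the callback may safely push a new node.
    void fire_pending(std::uint64_t now) {
        while (!heap_.empty() && heap_.top().deadline <= now) {
            const Node node = heap_.top();
            heap_.pop();
            node.event->execute();
        }
    }

    std::uint64_t next_event_time() const {
        return heap_.empty() ? std::numeric_limits<std::uint64_t>::max()
                             : heap_.top().deadline;
    }
    bool empty() const { return heap_.empty(); }
};

// Periodic event: re-arms itself one period after its previous deadline.
// The period must be non-zero, otherwise fire_pending would never finish.
class PeriodicEvent : public IEvent {
    MiniScheduler& sched_;
    std::uint64_t period_;
    std::uint64_t next_ = 0;
    std::uint64_t fired_ = 0;

public:
    PeriodicEvent(MiniScheduler& sched, std::uint64_t period)
        : sched_(sched), period_(period) {}
    void start(std::uint64_t first) { next_ = first; sched_.schedule_event(next_, this); }
    void execute() override {
        ++fired_;
        next_ += period_;
        sched_.schedule_event(next_, this);
    }
    std::uint64_t fired() const { return fired_; }
};
```

With a period of 100 ns first armed at 100, a single fire_pending(350) delivers the occurrences at 100, 200 and 300 in order and leaves the next deadline at 400, which is the value an idle-skip would then read from next_event_time().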

The scheduler is not thread-safe — and that is by design

EventScheduler is driven only from the main thread (the round boundary), even under MultiThread: workers run cores, but fire_pending / schedule_event happen at the serial barrier point. The class header states this explicitly (event_scheduler.hpp:21). Note that the GPTIMER's periodic IRQ does not flow through the scheduler — it is raised from the peripheral tick, which is why idle-skip needs the MaxIdleNs cap.


MultiThread mode (ADR-001)

MultiThread runs each simulated core on its own host thread for aggregate throughput beyond a single host thread, while preserving the cross-core ordering points the round-robin loop relies on. It is standalone-only and inert (falls back to single-thread dispatch) whenever a GDB stub or observer is attached.

Thread-per-core + std::barrier

start_workers (emulator.cpp:701) spawns N-1 worker threads (cores 1..N-1); the main thread runs core 0. Two std::barriers with N participants bracket each round (emulator.cpp:711-716): a start_barrier_ and a done_barrier_. The MultiThread dispatch in the run loop (emulator.cpp:855):

// src/runtime/src/emulator.cpp:861 — MultiThread round
start_barrier_->arrive_and_wait();          // release workers
core_results_[0] = run_core_batch(0);        // main thread runs core 0
done_barrier_->arrive_and_wait();            // rejoin
// reduce per-core deltas; drain deferred FLUSH; advance sim_time_ once

Each worker's worker_loop (emulator.cpp:690) parks at start_barrier_, runs run_core_batch(core_idx), then parks at done_barrier_. The mt_dispatch_now() gate (emulator.hpp:347) is mt_dispatch_active_ && gdb_stub_ == nullptr && observer_ == nullptr — a GDB stub or observer forces the SingleThread inline branch.

run_core_batch (emulator.cpp:677) runs quantum_batch quanta back-to-back before the barrier (Phase 14). Larger quantum_batch amortises the barrier (higher CPU-bound throughput) at the cost of coarser cross-core IRQ/event delivery; default 1 is the finest grain.

Per-core state under MultiThread

Several pieces that are shared in SingleThread become per-core under MultiThread because their internals are not thread-safe (Hazard G):

  • Bus bridge — each core owns a CpuBusBridge with its own RAM fast-path cache (bus_bridge_, one per core, emulator.cpp:288). See Memory and bus.
  • IR block cache / IR interpreter / tiered JIT — per-core under MultiThread (ir_cache_for / tiered_jit_for, emulator.hpp:455), each per-core JIT keeping its own Tier-2 background O2 thread (P14-2). SingleThread shares one of each.
  • MMIO peripheral locks and the IRQ-controller lock — engaged via set_thread_safe(true) (emulator.cpp:373-378); see GatedMutex.

A guest FLUSH issued by a worker is deferred and drained at the serial barrier point (code_flush_pending_, emulator.cpp:872) so cache mutation never races a translation.

GatedMutex (runtime-gated locking)

tero::GatedMutex (src/interfaces/include/tero/gated_mutex.hpp) is a std::mutex whose locking is gated by a runtime flag. In SingleThread the gate is inactive and lock()/unlock()/try_lock() are no-ops (one predictable branch, effectively zero sync cost); in MultiThread it behaves like a real mutex. This is the mechanism ADR-001 prescribes.

Why a runtime gate, not a compile-time NullMutex

ADR-001 originally described the no-op-in-single-thread mechanism as a compile-time template parameter. That conflicts with the config-by-struct / no-compile-flags-for-behaviour rule: execution_mode is a runtime field and both modes must live in one binary. The runtime gate reconciles them — zero overhead single-thread, real exclusion multi-thread, one binary (gated_mutex.hpp:14-22).

Threading contract: set_active() is called exactly once during initialize(), before any worker starts (emulator.cpp:373). The flag is read-only for the rest of the run, so it never races and lock()/unlock() observe a stable value. Flipping it while threads run is a usage error.

TSO and atomics under MultiThread

ADR-003 makes x86-64 the tier-1 host precisely because its native TSO maps SPARC TSO with zero fences. Atomics (CASA, LDSTUB, SWAP) are real atomics on the RAM backing store under MultiThread (SystemBus::atomic_*, see Memory and bus); the per-region lock serialises MMIO. On a weak-memory host (ARM64, tier-2) explicit fences would be required — which is why ARM64 is best-effort, not a 1:1 target.

Determinism gate

Byte-identical SingleThread == MultiThread traces are impossible for lock-contended workloads: the round-robin path has no real contention, so the two cannot interleave identically. The achievable gate is shared-state convergence — both modes reach the same architectural end state. RTEMS smptests pass under MultiThread for the configs validated to date (see the Phase 13 notes referenced from MEMORY.md).


How to tune and extend

  • Cross-core IRQ latency too coarse? Lower quantum (SingleThread) or quantum_batch (MultiThread). Both trade throughput for latency.
  • Need a timed peripheral callback? Implement IEvent, call schedule_event(when, ev) from attach() or a register write, and re-schedule from execute(). Do not read the host clock or sim_time_ directly for periodic work — use the scheduler.
  • Need a periodic interrupt with idle-skip awareness? If you raise IRQs from tick (like the GPTIMER) rather than the scheduler, your tick is invisible to next_event_time(). The 1 ms idle cap covers the common case, but the loop still wakes at every 1 ms cap even when the peripheral's period is longer than 1 ms. Prefer schedule_event for event-precise timing.
  • Standalone parallelism? Set execution_mode = MultiThread and pacing = Turbo. Confirm the workload is not GDB/observer-driven (those force SingleThread).

Measure SMP timing under control

SMP regression checks are timing-sensitive. Always do back-to-back fix-vs-baseline runs on the same host — committed CSVs go thermally stale and produce phantom regressions. Only PASS transitions matter; TIMEOUT↔FAIL shuffles are noise (see MEMORY.md controlled-baseline note).