tero_core

The SPARC V8 integer + floating-point unit. It owns the architectural state of one core (CpuState), the pure decoder, the per-category instruction handlers, the single-cycle step() driver, and the SoftFloat-backed FPU. It is the correctness oracle — the naive, verify-by-inspection reference path against which the IR/JIT is validated.

# src/core/CMakeLists.txt
target_link_libraries(tero_core
    PUBLIC  tero::interfaces tero::ir
    PRIVATE tero::warnings tero::softfloat3e)

Responsibility

Execute one SPARC V8 instruction correctly against CpuState, reaching the world only through the injected ICpuBus. The core owns no bus and no peripheral, and performs no I/O of its own, not even logging.

Why core PUBLIC-links tero_ir

tero_core depends on tero::ir — surprising, since the core is "below" the translation stack. The reason is state unification: CpuState's integer register file is an ir::GuestState byte blob (cpu_state.hpp:690). The Switch interpreter and the IR engine read and write the same bytes, so there is no per-fallback sync. The core knows the blob layout, not the IR ops or LLVM — it never links tero_arch_sparc or tero_jit.

Source layout

src/core/
├── include/tero/core/
│   ├── cpu_state.hpp          ← per-core architectural state (the blob)
│   ├── execution_flags.hpp    ← step-loop scratchpad (branch/annul/trap/PSR pipe)
│   ├── decoder.hpp            ← pure decode(uint32_t) → DecodedInsn
│   ├── decoded_insn.hpp       ← InsnKind + DecodedInsn POD + sub-op enums
│   ├── handlers.hpp           ← execute() dispatcher + ExecStatus
│   ├── step.hpp               ← step() driver + StepResult
│   ├── trap.hpp               ← TrapType (tt) constants + tt_of()
│   ├── fpu.hpp                ← FpOp + FP decode helpers
│   ├── fpu_handlers.hpp       ← FPop1/FPop2/FP load-store entry points
│   └── softfloat_context.hpp  ← Berkeley SoftFloat 3e wrapper
└── src/
    ├── cpu_state.cpp  decoder.cpp  decoder_fpu.cpp
    ├── handlers.cpp            ← execute() dispatch
    ├── handlers_alu.cpp  handlers_branch.cpp  handlers_loadstore.cpp
    ├── handlers_regwin.cpp  handlers_special.cpp
    ├── handlers_internal.hpp   ← shared bodies (exec_alu_op<Op>, status_to_tt, flags_*)
    ├── fpu_handlers.cpp  softfloat_context.cpp
    └── step.cpp

CpuState — the architectural state

CpuState (cpu_state.hpp:250) holds everything one SPARC V8 core remembers between instructions. Its integer state lives in a single host-order ir::GuestState blob laid out by namespace layout:

| Region | Offset | Size | Contents |
| --- | --- | --- | --- |
| Globals | GlobalsBase = 0 | 8 × 4 B | %g0..%g7 (%g0 hardwired zero) |
| Windowed | WindowedBase = 32 | NumWindows*16 slots = 128 × 4 B | for window w: 8 outs (0..7) + 8 locals (8..15); ins alias the outs of (w+1) mod 8 |
| Special | SpecialBase = 544 | 7 × 4 B | Y, PSR, WIM, TBR, PC, nPC, ASR17 |

StateSize = 572 bytes (cpu_state.hpp:55). The register-window overlap (ins ≡ next window's outs) is what gives SAVE/RESTORE their zero-copy parameter passing — window_slot() / reg_offset() (cpu_state.hpp:63,74) encode it. NumWindows = 8 (fixed for LEON3/LEON4; SPARC V8 allows 2..32).
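
The offset arithmetic in the table can be checked with a small sketch. The constant names follow the table, but the bodies of window_slot() and reg_offset() below are a reconstruction from the prose (an assumption), not a copy of cpu_state.hpp; the layout_sketch namespace and the SlotsPerWin name are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical reconstruction of the blob layout described in the table.
namespace layout_sketch {
constexpr std::size_t NumWindows   = 8;
constexpr std::size_t SlotsPerWin  = 16;                  // 8 outs + 8 locals
constexpr std::size_t GlobalsBase  = 0;                   // %g0..%g7
constexpr std::size_t WindowedBase = GlobalsBase + 8 * 4; // 32
constexpr std::size_t SpecialBase  = WindowedBase + NumWindows * SlotsPerWin * 4;
constexpr std::size_t StateSize    = SpecialBase + 7 * 4; // Y..ASR17

static_assert(WindowedBase == 32, "matches the table");
static_assert(SpecialBase == 544, "matches the table");
static_assert(StateSize == 572, "matches the documented StateSize");

// Slot index (0..127) inside the windowed region for register r (8..31)
// as seen from window cwp.
constexpr std::size_t window_slot(std::size_t cwp, std::size_t r) {
    assert(cwp < NumWindows && r >= 8 && r < 32);
    if (r < 24) {
        // outs r8..r15 -> slots 0..7, locals r16..r23 -> slots 8..15
        return cwp * SlotsPerWin + (r - 8);
    }
    // ins r24..r31 alias the outs of window (cwp + 1) mod NumWindows
    return ((cwp + 1) % NumWindows) * SlotsPerWin + (r - 24);
}

// Byte offset of register r (0..31) in the blob.
constexpr std::size_t reg_offset(std::size_t cwp, std::size_t r) {
    assert(r < 32);
    return r < 8 ? GlobalsBase + r * 4
                 : WindowedBase + window_slot(cwp, r) * 4;
}
}  // namespace layout_sketch
```

With this mapping a caller in window 1 writing %o0 and a callee in window 0 reading %i0 touch the same four bytes, which is the zero-copy parameter passing described above.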

State unification: one representation, no sync

guest_state() (cpu_state.hpp:302) returns the same blob the integer accessors read/write. The SPARC frontend emits IR LdState/StState against the same layout:: offsets, so the runtime hands this blob straight to the IR engine with no copy between CpuState and GuestState. FP state is not in the blob: the IR never lowers FP, so f_[] and fsr_ stay canonical in CpuState (cpu_state.hpp:710-711).
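
A minimal sketch of the idea, assuming the blob is a plain byte array: the interpreter-side accessors and an offset-based load/store (the shape an LdState/StState could take) operate on the same bytes, so a write through one path is immediately visible through the other. GuestBlob, ld_state, st_state and CpuStateSketch are hypothetical stand-ins, not the real ir::GuestState or CpuState API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Stand-in for ir::GuestState: a byte blob of the documented size.
using GuestBlob = std::array<std::byte, 572>;

// Host-order 32-bit load at a byte offset (offset-addressed, IR-style access).
inline std::uint32_t ld_state(const GuestBlob& blob, std::size_t off) {
    assert(off + 4 <= blob.size());
    std::uint32_t v = 0;
    std::memcpy(&v, blob.data() + off, sizeof v);
    return v;
}

// Host-order 32-bit store at a byte offset.
inline void st_state(GuestBlob& blob, std::size_t off, std::uint32_t v) {
    assert(off + 4 <= blob.size());
    std::memcpy(blob.data() + off, &v, sizeof v);
}

// Hypothetical interpreter-side view: typed accessors over the same bytes.
class CpuStateSketch {
public:
    static constexpr std::size_t kGlobalsBase = 0;

    std::uint32_t read_global(unsigned g) const {
        // %g0 always reads as zero
        return g == 0 ? 0u : ld_state(blob_, kGlobalsBase + g * 4);
    }
    void write_global(unsigned g, std::uint32_t v) {
        // writes to %g0 are discarded
        if (g != 0) st_state(blob_, kGlobalsBase + g * 4, v);
    }
    // The blob itself is what would be handed to the IR engine.
    GuestBlob& guest_state() { return blob_; }

private:
    GuestBlob blob_{};  // zero-initialised
};
```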

Register and special-register accessors

  • Windowed file: read_r(rs) / write_r(rd, v) (window-aware; r0 reads zero, writes discarded). globals_view() / windowed_view() / *_storage() give raw spans for bulk copy.
  • PSR: psr()/set_psr(), plus typed views cwp, s, ps, et, ef, ec, pil, icc(). PSR bit layout is in namespace psr (cpu_state.hpp:92); WRPSR semantics go through write_psr_writable(), which respects the read-only mask (impl/ver/reserved/EC) and applies every writable field immediately (see WRPSR is immediate below).
  • WIM (set_wim masks to the low NumWindows bits), TBR (write_tba preserves the tt field), Y, ASR17 (LEON3 config register — write_asr17 honours the 0x0FFFE000 writable mask; set_core_index writes the per-core index field; svt_enabled() reports single-vector trapping).
  • Cache control (ASI 0x2): cache_control_reg, icache_config, dcache_config — register state only, no real cache. Reset advertises CCR.DS=1 so RTEMS SMP boot's snoop check passes (cpu_state.hpp:162).
  • FPU: f(i)/set_f, fsr()/fsr_mut()/set_fsr, pack_fsr() / unpack_fsr() (cpu_state.hpp:517-540).
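
Several of the accessors above are masked writes. The sketch below shows that pattern as pure functions. The ASR17 mask comes from the text; the PSR and TBR masks are derived from the SPARC V8 register layouts (impl 31:28, ver 27:24, reserved 19:14, EC bit 13; TBA 31:12, tt 11:4). The function signatures are hypothetical: the real accessors are CpuState members, and their exact constants are assumed rather than quoted.

```cpp
#include <cassert>
#include <cstdint>

namespace sreg_sketch {
constexpr std::uint32_t NumWindows    = 8;
constexpr std::uint32_t WimMask       = (1u << NumWindows) - 1u; // low 8 bits
constexpr std::uint32_t TbaMask       = 0xFFFFF000u;  // TBR[31:12]
constexpr std::uint32_t PsrReadOnly   = 0xFF0FE000u;  // impl/ver/reserved/EC
constexpr std::uint32_t Asr17Writable = 0x0FFFE000u;  // from the text

// Replace only the writable bits of old_value with the bits of value.
constexpr std::uint32_t masked_write(std::uint32_t old_value,
                                     std::uint32_t value,
                                     std::uint32_t writable) {
    return (old_value & ~writable) | (value & writable);
}

// WIM keeps one bit per implemented window.
constexpr std::uint32_t set_wim(std::uint32_t value) { return value & WimMask; }

// Writing the trap base address leaves the tt field (TBR[11:4]) untouched.
constexpr std::uint32_t write_tba(std::uint32_t tbr, std::uint32_t value) {
    assert((tbr & 0xFu) == 0);  // TBR[3:0] is always zero
    return masked_write(tbr, value, TbaMask);
}

// PSR write that cannot change the read-only fields.
constexpr std::uint32_t write_psr_writable(std::uint32_t psr, std::uint32_t value) {
    return masked_write(psr, value, ~PsrReadOnly);
}

// ASR17 write limited to the documented writable mask.
constexpr std::uint32_t write_asr17(std::uint32_t asr17, std::uint32_t value) {
    return masked_write(asr17, value, Asr17Writable);
}
}  // namespace sreg_sketch
```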

%asr22:%asr23 global up-counter

up_counter() / set_up_counter_base() (cpu_state.hpp:394-401) model the LEON3/LEON4 free-running cycle counter read via %asr22:%asr23. Per GR740-UM §6.10.4 it is a single SoC-wide counter — the run loop sets it from global simulated time once per scheduling round (Emulator::sync_global_up_counter), so all cores read one monotone value that advances through idle-skip and power-down. It is round-granular, so leon3_up_counter_is_available() reports false and RTEMS falls back to the (also round-granular) GPTIMER timecounter — keeping the counter and the periodic clock tick consistent.

Step-loop micro-state (ExecutionFlags)

exec_ (cpu_state.hpp:716, type in execution_flags.hpp) carries the transient state the fetch/execute loop needs but that is not architectural:

| Field | Meaning |
| --- | --- |
| branch_taken / branch_target | CTI requested a branch; the loop applies it via the delay-slot rule |
| annul_next | Bicc,a not-taken: skip the delay slot's side effects |
| pending_trap / pending_tt | a handler synthesised a software trap (e.g. Ticc) |
| error_mode | trap taken while ET=0; SPARC halts (one-way latch) |
| is_powered_down | %asr19 write (idle loop); cleared by IRQ/trap |
| code_flush_pending | a FLUSH retired; the runtime drains it to clear shared caches |

The transfer/trap helpers on CpuState operate on these: request_branch, clear_branch_request, set_annul_next, raise_trap(tt), enter_trap(saved_pc, saved_npc, tt) (the §7.3 entry sequence), leave_trap(target) (the §B.26 RETT exit), and request_code_flush / consume_code_flush.

WRPSR is immediate

SPARC V8 §5.1.2.3 permits WRPSR's effect on S/ET/PS/CWP to be deferred up to three instructions, but that is implementation latitude. Tero applies every writable field at once (write_psr_writable, cpu_state.cpp:69), matching the SIS oracle. The earlier pending_psr_ / commit_psr_pipeline() delay model was removed because it desynced the register windows when a trap fired inside the three-instruction delay window (smpschededf03).

The per-core decode cache

CpuState carries a 1024-entry, direct-mapped decode cache (DecodeCacheEntry = {pc_tag, DecodedInsn}, cpu_state.hpp:746). On each step, step() indexes decode_cache_slot(pc) ((pc >> 2) & DecodeCacheMask) and reuses the cached DecodedInsn when pc_tag == pc, skipping both the bus fetch and the decoder. Hit rate is > 99 % on warm RTEMS workloads.

  • pc_tag = 0xFFFFFFFF marks an empty slot (can never collide with a real 4-byte-aligned PC).
  • Cleared in full by reset() and invalidate_decode_cache(). A FLUSH drops this core's cache immediately via request_code_flush().
  • It stores nothing architectural — copy_arch_state_from() (cpu_state.hpp:312) deliberately copies the blob, FP, cache-control, and step micro-state but not the decode cache (the oracle reuses one warm scratch core across millions of blocks).
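
The lookup described above is a standard direct-mapped cache with a full-PC tag. The following is a minimal sketch of that scheme, assuming a stand-in DecodedInsnSketch payload and a hypothetical DecodeCacheSketch class; the real cache lives inside CpuState and its method names may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

struct DecodedInsnSketch { std::uint32_t raw = 0; };  // stand-in payload

struct DecodeCacheEntry {
    std::uint32_t pc_tag = 0xFFFFFFFFu;  // empty marker: not a 4-byte-aligned PC
    DecodedInsnSketch insn{};
};

class DecodeCacheSketch {
public:
    static constexpr std::size_t Entries = 1024;
    static constexpr std::uint32_t Mask = Entries - 1;

    // Direct-mapped index: word address modulo the number of entries.
    static constexpr std::size_t slot(std::uint32_t pc) { return (pc >> 2) & Mask; }

    // Returns the cached instruction on a hit, nullptr on a miss.
    // The caller is expected to have checked PC alignment first.
    const DecodedInsnSketch* lookup(std::uint32_t pc) const {
        assert((pc & 3u) == 0);
        const DecodeCacheEntry& e = entries_[slot(pc)];
        return e.pc_tag == pc ? &e.insn : nullptr;
    }

    // Overwrites whatever previously occupied the slot.
    void fill(std::uint32_t pc, DecodedInsnSketch insn) {
        assert((pc & 3u) == 0);
        entries_[slot(pc)] = DecodeCacheEntry{pc, insn};
    }

    // Resets every slot to the empty marker (reset / FLUSH path).
    void invalidate() { entries_.fill(DecodeCacheEntry{}); }

private:
    std::array<DecodeCacheEntry, Entries> entries_{};
};
```

Two PCs that are 4096 bytes apart map to the same slot, so filling one evicts the other; the full-PC tag is what keeps such aliasing from returning a stale instruction.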

Decoder — pure, stateless

decode(uint32_t word) → DecodedInsn (decoder.hpp:22) covers Format 1 (CALL), Format 2 (SETHI/Bicc/FBfcc), and the Format 3 opcode space. The decoder is pure: no state, no allocation, no side effects; it never touches CpuState. Unknown / reserved encodings decode to InsnKind::Unknown (the handler maps that to the illegal-instruction trap) — the decoder never throws.

DecodedInsn (decoded_insn.hpp:127) is a flat trivially-copyable POD: raw, kind (an InsnKind discriminator — not the SPARC encoding), register operands (rd/rs1/rs2), has_imm + simm13 + imm22, pre-shifted displacements (disp_call, disp_branch), cond + annul, and a sub-op union by kind (alu_op, shift_op, mul_op, div_op, mem_op, sreg/asr_index, fp_op), plus raw op/op2/op3/asi bits for diagnostics. The sub-op enums (AluOp, ShiftOp, MulOp, DivOp, MemOp, CondCode, SpecialReg) let handlers switch on one value without re-parsing op3.

is_fp_kind(InsnKind) (decoded_insn.hpp:64) and is_fp_class(raw) (fpu.hpp:81) gate the fp_disabled trap (PSR.EF == 0).
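
To make the "pure function of the instruction word" point concrete, here is a reduced field-extraction sketch based on the SPARC V8 instruction formats. It only distinguishes CALL, SETHI, Bicc and generic Format 3, and maps everything else to Unknown; KindSketch, FieldsSketch and decode_fields are hypothetical names and cover far less than the real decode().

```cpp
#include <cassert>
#include <cstdint>

enum class KindSketch : std::uint8_t { Unknown, Call, Sethi, Bicc, Format3 };

struct FieldsSketch {
    KindSketch kind = KindSketch::Unknown;
    std::uint8_t rd = 0, rs1 = 0, rs2 = 0, op3 = 0;
    bool has_imm = false;
    std::int32_t simm13 = 0;
    std::uint32_t imm22 = 0;
    std::int32_t disp = 0;  // byte displacement, already shifted left by 2
};

// Sign-extend the low `bits` bits of v.
constexpr std::int32_t sign_extend(std::uint32_t v, unsigned bits) {
    assert(bits >= 1 && bits < 32);
    const std::uint32_t sign = 1u << (bits - 1);
    v &= (sign << 1) - 1u;
    return static_cast<std::int32_t>((v ^ sign) - sign);
}

// Pure: the result depends only on `word`; no state, no allocation, no throw.
inline FieldsSketch decode_fields(std::uint32_t word) {
    FieldsSketch d{};
    const std::uint32_t op = word >> 30;
    if (op == 1) {  // Format 1: CALL, disp30 shifted into a byte offset
        d.kind = KindSketch::Call;
        d.disp = static_cast<std::int32_t>(word << 2);
        return d;
    }
    if (op == 0) {  // Format 2
        const std::uint32_t op2 = (word >> 22) & 0x7u;
        if (op2 == 4) {
            d.kind = KindSketch::Sethi;
            d.rd = static_cast<std::uint8_t>((word >> 25) & 0x1Fu);
            d.imm22 = word & 0x3FFFFFu;
        } else if (op2 == 2) {
            d.kind = KindSketch::Bicc;
            d.disp = sign_extend(word, 22) * 4;
        }
        return d;  // other op2 values stay Unknown in this sketch
    }
    // Format 3 (op == 2 or 3)
    d.kind = KindSketch::Format3;
    d.rd  = static_cast<std::uint8_t>((word >> 25) & 0x1Fu);
    d.op3 = static_cast<std::uint8_t>((word >> 19) & 0x3Fu);
    d.rs1 = static_cast<std::uint8_t>((word >> 14) & 0x1Fu);
    d.has_imm = ((word >> 13) & 1u) != 0;
    if (d.has_imm) {
        d.simm13 = sign_extend(word, 13);
    } else {
        d.rs2 = static_cast<std::uint8_t>(word & 0x1Fu);
    }
    return d;
}
```

For example, 0x01000000 (NOP, encoded as SETHI 0, %g0) yields kind Sethi with rd = 0, and 0x84006005 (add %g1, 5, %g2) yields Format3 with rd = 2, rs1 = 1 and simm13 = 5.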

Handlers and execute()

execute(CpuState&, const DecodedInsn&, ICpuBus* = nullptr) (handlers.hpp:48) is the dispatcher. It mutates state and returns an ExecStatus; it does not advance PC/nPC (that is step()'s job). A null bus is legal for callers that never run memory instructions (most unit tests); a Load/Store with null bus returns BusError.

| File | Covers |
| --- | --- |
| handlers_alu.cpp | ADD/SUB/AND/OR/XOR/ANDN/ORN/XNOR + cc/X variants, tagged-add/sub (TADDcc/TSUBcc(TV)), SLL/SRL/SRA, UMUL/SMUL, UDIV/SDIV, MULScc |
| handlers_branch.cpp | Bicc/BA, CALL, JMPL, Tcc, SETHI |
| handlers_loadstore.cpp | LD/LDUB/LDUH/LDSB/LDSH/LDD + ST/STB/STH/STD, LDA/STA (ASI; incl. cache-control ASI 0x2), LDSTUB, SWAP, CASA |
| handlers_regwin.cpp | SAVE, RESTORE, RETT (WIM-checked per §B.26) |
| handlers_special.cpp | RDY/WRY, RDPSR/WRPSR, RDWIM/WRWIM, RDTBR/WRTBR, RDASR/WRASR, FLUSH, STBAR |
| fpu_handlers.cpp | FPop1/FPop2, FP loads/stores, FBfcc |

Shared bodies — exec_alu_op<Op>, status_to_tt, the flags helpers — live in handlers_internal.hpp, included only by the handler .cpp files so they stay off the public surface.

ExecStatus (handlers.hpp:24) is the side-band result; status_to_tt maps it to a SPARC tt:

| ExecStatus | tt | Meaning |
| --- | --- | --- |
| Ok | | completed normally |
| IllegalInsn | 0x02 | unknown / UNIMP / reserved |
| PrivInsn | 0x03 | privileged op in user mode |
| FpDisabled | 0x04 | FP op with PSR.EF == 0 |
| WinOverflow / WinUnderflow | 0x05 / 0x06 | SAVE / RESTORE window check |
| AlignError | 0x07 | unaligned load/store or PC |
| BusError | 0x09 | bus fault on data access |
| TagOverflow | 0x0A | tagged-arithmetic trap |
| DivZero | 0x2A | UDIV/SDIV by zero |
| TrapInsn | 0x80 + n | Tcc software trap |
| InsnFetchError | 0x01 | bus fault on instruction fetch |
| ErrorMode | | trap taken while ET=0 (bubbled up) |

TrapType (trap.hpp:26) names the architectural tt bytes; tt_of() converts. Async interrupts use tt = 0x10 + irl (InterruptLevelBase), software traps 0x80 + imm7 (SoftwareTrapBase); MaxIrqLevel = 15.
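
The mapping in the table can be written as a single switch. The sketch below mirrors the table's values; the enum and function names carry a Sketch suffix or are otherwise hypothetical, and passing the software-trap number as a second argument is an assumption made for illustration (the real code may carry it elsewhere, for example in pending_tt).

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

enum class ExecStatusSketch {
    Ok, IllegalInsn, PrivInsn, FpDisabled, WinOverflow, WinUnderflow,
    AlignError, BusError, TagOverflow, DivZero, TrapInsn, InsnFetchError,
    ErrorMode
};

constexpr std::uint8_t InterruptLevelBase = 0x10;
constexpr std::uint8_t SoftwareTrapBase   = 0x80;
constexpr unsigned     MaxIrqLevel        = 15;

// Returns the tt byte for a status, or nullopt when no trap is implied.
// trap_number is only consulted for TrapInsn (hypothetical parameter).
inline std::optional<std::uint8_t> status_to_tt_sketch(ExecStatusSketch s,
                                                       unsigned trap_number = 0) {
    switch (s) {
        case ExecStatusSketch::Ok:             return std::nullopt;
        case ExecStatusSketch::ErrorMode:      return std::nullopt;
        case ExecStatusSketch::InsnFetchError: return std::uint8_t{0x01};
        case ExecStatusSketch::IllegalInsn:    return std::uint8_t{0x02};
        case ExecStatusSketch::PrivInsn:       return std::uint8_t{0x03};
        case ExecStatusSketch::FpDisabled:     return std::uint8_t{0x04};
        case ExecStatusSketch::WinOverflow:    return std::uint8_t{0x05};
        case ExecStatusSketch::WinUnderflow:   return std::uint8_t{0x06};
        case ExecStatusSketch::AlignError:     return std::uint8_t{0x07};
        case ExecStatusSketch::BusError:       return std::uint8_t{0x09};
        case ExecStatusSketch::TagOverflow:    return std::uint8_t{0x0A};
        case ExecStatusSketch::DivZero:        return std::uint8_t{0x2A};
        case ExecStatusSketch::TrapInsn:
            return static_cast<std::uint8_t>(SoftwareTrapBase + (trap_number & 0x7Fu));
    }
    return std::nullopt;
}

// Asynchronous interrupt at request level irl (1..15).
inline std::uint8_t interrupt_tt(unsigned irl) {
    assert(irl >= 1 && irl <= MaxIrqLevel);
    return static_cast<std::uint8_t>(InterruptLevelBase + irl);
}
```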

step() — the single-cycle driver

step(CpuState&, ICpuBus&) → StepResult (step.hpp:41) advances exactly one instruction. Per call, in order (step.cpp):

  1. If error_mode(), short-circuit with ExecStatus::ErrorMode (SPARC stays in error mode until reset).
  2. clear_branch_request() — drop last cycle's CTI state.
  3. Check PC alignment (misaligned PC → AlignError against the fetch).
  4. Decode-cache lookup: on hit, reuse the cached DecodedInsn; on miss, bus.read_u32(VirtAddr{pc}) then decode() and fill the slot. A fetch bus error → InsnFetchError.
  5. If annul_next() is set, drop the instruction's side effects (still advance PC/nPC); else execute(state, insn, &bus) (the FP-disabled gate is applied inside execute, only for FP kinds).
  6. Resolve the trap: a handler-raised pending_trap wins, else status_to_tt(status). If a tt is pending and !et(), latch error_mode and return; otherwise enter_trap(pc, npc, tt).
  7. Non-trap path: apply the SPARC delay-slot update — new_pc = npc, new_npc = branch_taken ? branch_target : npc + 4.

The function is ~110 LOC, deliberately verifiable by inspection.

Delay-slot model

CTIs (CALL, taken Bicc, JMPL, RETT) only record a branch_target; they never touch PC/nPC. step() applies the two-PC update after execute() returns, so SPARC's branch-delay semantics fall out of one place. See execution model.
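
The two-PC update is small enough to show in full. This is a sketch of the rule as stated (new_pc = npc, new_npc = branch_taken ? branch_target : npc + 4); PcPairSketch, BranchRequestSketch and advance_pc are hypothetical names, and annulment and trap entry are left out.

```cpp
#include <cassert>
#include <cstdint>

struct PcPairSketch {
    std::uint32_t pc;
    std::uint32_t npc;
};

struct BranchRequestSketch {
    bool taken = false;
    std::uint32_t target = 0;
};

// Applied once per retired instruction, after the handler has run.
constexpr PcPairSketch advance_pc(PcPairSketch cur, BranchRequestSketch br) {
    assert(!br.taken || (br.target & 3u) == 0);  // handlers trap on misaligned targets
    return PcPairSketch{cur.npc, br.taken ? br.target : cur.npc + 4};
}
```

Stepping a taken branch at 0x100 with target 0x200 gives (pc, npc) = (0x104, 0x200) after the branch and (0x200, 0x204) after the delay slot, so the instruction at 0x104 executes before control reaches the target.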

FPU — Berkeley SoftFloat 3e

fpu.hpp / fpu_handlers.hpp / softfloat_context.hpp wrap the vendored third-party/softfloat3e build (BSD-3-Clause). FpOp (fpu.hpp:25) names the concrete operation inside an FP InsnKind; decode_fpop1 / decode_fpop2 parse the 9-bit opf field. The implemented set covers FP loads/stores (LDF/LDDF/LDFSR, STF/STDF/STFSR), the trivial moves (FMOVs/FNEGs/FABSs), single- and double-precision add/sub, all four FCMP variants, and FBfcc. Anything outside that set decodes to FpOp::Unimplemented, which raises fp_exception with FTT = 3 (the unimplemented-FPop code in the FSR, the floating-point status register). The FSR's TEM (trap-enable mask) selects which FP faults trap and so drives fp_exception (tt = 0x08). The FP-disabled trap path (PSR.EF == 0) is retained for guests that ship without FP enabled.

The wrapper is per-CpuState — each core has its own SoftFloat state so rounding mode / exception flags do not leak between cores. The specialization is 8086-SSE (upstream default; IEEE-754 default-NaN handling), not the original plan's 8086. RTEMS fptest01 and the FP-using sptests exercise the path end-to-end.
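
How TEM gates fp_exception can be sketched against the SPARC V8 FSR layout (TEM in bits 27:23, ftt in 16:14, aexc in 9:5, cexc in 4:0). This is a simplified illustration, not the Tero implementation: it records every raised flag in cexc and clears ftt on a non-trapping FPop, and it omits the architectural rules about which single exception is reported on a trap. The fsr_sketch namespace and its function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

namespace fsr_sketch {
constexpr unsigned TemShift  = 23;  // TEM:  FSR[27:23]
constexpr unsigned FttShift  = 14;  // ftt:  FSR[16:14]
constexpr unsigned AexcShift = 5;   // aexc: FSR[9:5]
constexpr std::uint32_t Field5  = 0x1Fu;           // cexc: FSR[4:0]
constexpr std::uint32_t FttMask = 7u << FttShift;
constexpr std::uint32_t FttIeee754       = 1;
constexpr std::uint32_t FttUnimplemented = 3;
constexpr std::uint8_t  TtFpException    = 0x08;

struct Result {
    std::uint32_t fsr;
    bool trap;  // true -> raise fp_exception (tt = 0x08)
};

// cexc: 5-bit mask of IEEE flags raised by the FPop that just completed.
constexpr Result apply_ieee_flags(std::uint32_t fsr, std::uint32_t cexc) {
    assert((cexc & ~Field5) == 0);
    const std::uint32_t tem = (fsr >> TemShift) & Field5;
    fsr = (fsr & ~Field5) | cexc;  // cexc reflects the latest FPop
    if ((cexc & tem) != 0) {
        // An enabled exception traps; aexc is left untouched.
        fsr = (fsr & ~FttMask) | (FttIeee754 << FttShift);
        return Result{fsr, true};
    }
    // No enabled exception: accumulate into aexc, clear ftt.
    fsr = (fsr & ~FttMask) | (cexc << AexcShift);
    return Result{fsr, false};
}

// Unimplemented FPop: always traps with ftt = 3.
constexpr Result raise_unimplemented(std::uint32_t fsr) {
    return Result{(fsr & ~FttMask) | (FttUnimplemented << FttShift), true};
}
}  // namespace fsr_sketch
```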

What is deliberately out of tero_core

  • No bus access — fetch, load, store all go through the injected ICpuBus. The core never names SystemBus.
  • No peripherals, no interrupt controller, no scheduler.
  • No I/O of any kind, not even logging — handlers return statuses, not messages.
  • No hardware cache model, no MMU (both deferred). The decode cache holds nothing architectural.
  • No binary translation. The arch-neutral IR and the LLVM JIT live in tero_ir / tero_arch_sparc / tero_jit; tero_core stays the single-instruction reference and oracle.

See also