SMP Support in FRISC OS

February 10, 2026 frisc smp osdev

Bringing multi-core support to FRISC OS with Lateralus channels for inter-processor communication. The journey from "works on one core" to "scales to N cores."

◉ The SMP Challenge

Single-core OS development is hard. Multi-core is harder. Every assumption about "this variable won't change" becomes suspect when another core might be touching it simultaneously.

Key SMP challenges we solved:

Bringing up secondary cores (CPU bringup)
Per-core data structures (stacks, interrupt handlers)
Shared memory synchronization (spinlocks, atomics)
Load-balanced scheduling
Cache coherency awareness

◉ CPU Bring-Up

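On RISC-V, secondary cores start in a parked (stopped) state. OpenSBI implements the SBI Hart State Management (HSM) extension, whose sbi_hart_start call wakes a hart at a given address and hands it one opaque argument, which we use for the stack top: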

fn boot_secondary_cores() {
    let hart_count = device_tree.get_cpu_count()
    let boot_hart = current_hart_id()

    println("Boot hart: " + str(boot_hart) + ", total harts: " + str(hart_count))

    for hart_id in 0..hart_count {
        if hart_id != boot_hart {
            // Each core needs its own stack
            let stack = alloc_pages(CORE_STACK_PAGES)
            let stack_top = stack + (CORE_STACK_PAGES * PAGE_SIZE)

            // Start the core at secondary_entry. SBI delivers the third
            // argument to the new hart in a1; we use it to pass the stack top
            let result = sbi_hart_start(
                hart_id,
                secondary_entry as usize,
                stack_top
            )

            match result {
                Ok(_) => println("  Started hart " + str(hart_id)),
                Err(e) => println("  Failed to start hart " + str(hart_id) + ": " + e),
            }
        }
    }
}

Secondary cores wake up at secondary_entry, initialize their per-core state, then join the scheduler:

fn secondary_entry() {
    let hart_id = current_hart_id()

    // Initialize per-core state
    per_core[hart_id].init()

    // Set up timer interrupt
    timer_init_per_core()

    // Enable interrupts
    enable_interrupts()

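    // Signal ready. This must be one atomic read-modify-write: a separate
    // load and store can lose an increment when two harts race
    cores_ready.fetch_add(1, Ordering::Release)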

    // Enter scheduler (never returns)
    scheduler_loop()
}
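
To illustrate why the ready counter needs a single atomic increment, here is a minimal hosted sketch in Rust, with OS threads standing in for harts. The names signal_ready and bring_up are hypothetical and exist only for this example; it is not the FRISC OS code.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Each stand-in hart announces itself with one atomic read-modify-write,
/// so concurrent increments cannot overwrite each other.
fn signal_ready(cores_ready: &AtomicUsize) {
    cores_ready.fetch_add(1, Ordering::Release);
}

/// Spawns `secondary_count` stand-in harts (threads in this sketch) and
/// returns how many of them reported ready.
fn bring_up(secondary_count: usize) -> usize {
    let cores_ready = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..secondary_count)
        .map(|_| {
            let counter = Arc::clone(&cores_ready);
            thread::spawn(move || signal_ready(&counter))
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    cores_ready.load(Ordering::Acquire)
}
```

With a load followed by a store, two threads could both read 0 and both write 1; fetch_add rules that out because the read and the write happen as one operation.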

◉ Per-Core Data Structures

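Each core owns a block of private state. Only the run queue is ever touched by another core (for work stealing and load balancing), so it is the one field that needs its own synchronization: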

struct PerCoreData {
    hart_id: u32,
    current_thread: *Thread,
    idle_thread: *Thread,
    run_queue: RunQueue,
    interrupt_stack: *u8,
    timer_ticks: u64,
}

We use the tp (thread pointer) register to access per-core data without locking:

fn get_per_core() -> &PerCoreData {
    // tp register holds pointer to current core's PerCoreData
    let ptr: *PerCoreData
    asm!("mv {}, tp", out(reg) ptr)
    return &*ptr
}

◉ Synchronization Primitives

RISC-V provides atomic operations via the A extension. We built standard primitives on top:

// Spinlock using atomic swap
struct Spinlock {
    locked: atomic[u32],
}

fn acquire(lock: &Spinlock) {
    while lock.locked.swap(1, Ordering::Acquire) == 1 {
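        // Spin-wait hint from the Zihintpause extension; it is encoded as a
        // fence hint, so it is harmless on cores that lack the extension.
        // wfi is not a substitute: it stalls until an interrupt is pending,
        // which can be long after the lock is released
        asm!("pause")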
    }
}

fn release(lock: &Spinlock) {
    lock.locked.store(0, Ordering::Release)
}
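
The same acquire/release pairing can be tried out on a host machine. The sketch below is a Rust analogue, not the kernel's lock: SpinLock, with, and hammer are hypothetical names, and it adds one refinement the code above does not have, spinning on a plain load between swap attempts (test-and-test-and-set) so waiting cores do not keep writing to the lock's cache line.

```rust
use std::cell::UnsafeCell;
use std::hint;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread;

/// Spinlock that owns the value it protects (hypothetical type for this sketch).
pub struct SpinLock<T> {
    locked: AtomicBool,
    value: UnsafeCell<T>,
}

// Sharing is sound because every access to `value` happens while `locked` is held.
unsafe impl<T: Send> Sync for SpinLock<T> {}

impl<T> SpinLock<T> {
    pub fn new(value: T) -> Self {
        SpinLock { locked: AtomicBool::new(false), value: UnsafeCell::new(value) }
    }

    /// Runs `f` with exclusive access to the protected value.
    pub fn with<R>(&self, f: impl FnOnce(&mut T) -> R) -> R {
        // Acquire: swap in `true`; an old value of `true` means someone else holds it.
        while self.locked.swap(true, Ordering::Acquire) {
            // Wait on a read-only load until the lock looks free, then retry the swap.
            while self.locked.load(Ordering::Relaxed) {
                hint::spin_loop();
            }
        }
        let result = f(unsafe { &mut *self.value.get() });
        // Release: writes made inside `f` are ordered before the unlock.
        self.locked.store(false, Ordering::Release);
        result
    }
}

/// Has `threads` threads each increment a shared counter `iters` times.
fn hammer(threads: usize, iters: usize) -> usize {
    let lock = Arc::new(SpinLock::new(0usize));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let lock = Arc::clone(&lock);
            thread::spawn(move || {
                for _ in 0..iters {
                    lock.with(|n| *n += 1);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    lock.with(|n| *n)
}
```

If the lock works, hammer(4, 5_000) always returns 20_000. The sketch does not unlock if the closure panics, which a production lock would have to handle.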

For higher-level synchronization, Lateralus channels provide message-passing:

// Inter-core communication via channels
let (tx, rx) = channel[WorkItem]()

// Producer core
core[0].run(fn() {
    work_items
        |> each(fn(item) { tx.send(item) })
})

// Consumer cores
for i in 1..4 {
    core[i].run(fn() {
        loop {
            let item = rx.recv()
            process(item)
        }
    })
}
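
For readers who want to run the producer/consumer pattern on a host, here is a rough Rust analogue using the standard library's mpsc channel. It is a sketch under two assumptions: that Lateralus channels allow several receivers (as the example above implies), and that consumers should exit once the producer is done, which the loop above does not show. fan_out is a hypothetical name, and "processing" is just squaring the item.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

/// Sends `items` from one producer thread to `consumers` worker threads and
/// returns the sum of the processed results.
fn fan_out(items: Vec<u64>, consumers: usize) -> u64 {
    let (tx, rx) = mpsc::channel::<u64>();
    // std's Receiver is single-consumer, so the workers share it behind a mutex.
    let rx = Arc::new(Mutex::new(rx));

    let workers: Vec<_> = (0..consumers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut local_sum = 0u64;
                loop {
                    // The mutex guard is dropped at the end of this statement,
                    // so the lock is held while receiving but not while processing.
                    let msg = rx.lock().unwrap().recv();
                    match msg {
                        Ok(item) => local_sum += item * item,
                        Err(_) => break, // channel closed: the producer is done
                    }
                }
                local_sum
            })
        })
        .collect();

    let producer = thread::spawn(move || {
        for item in items {
            tx.send(item).unwrap();
        }
        // `tx` is dropped here, which closes the channel.
    });

    producer.join().unwrap();
    workers.into_iter().map(|w| w.join().unwrap()).sum()
}
```

Which worker handles which item varies from run to run, but the total does not, which is the property message passing gives us without any shared mutable state in the workers.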

◉ SMP-Aware Scheduler

The scheduler maintains per-core run queues with work stealing:

fn scheduler_loop() {
    let my_queue = &get_per_core().run_queue

    loop {
        // Try to get work from my queue
        let thread = my_queue.pop()

        // If empty, try stealing from other cores
        if thread.is_none() {
            thread = steal_from_busiest_core()
        }

        // If still nothing, run idle thread
        let thread = thread.unwrap_or(get_per_core().idle_thread)

        // Run until preempted or yielded
        context_switch(thread)
    }
}

fn steal_from_busiest_core() -> Option[*Thread] {
    // Find core with most queued threads
    let victim = per_core_data
        |> enumerate()
        |> filter(fn((id, _)) { id != current_hart_id() })
        |> max_by(fn((_, data)) { data.run_queue.len() })

    match victim {
        Some((_, data)) if data.run_queue.len() > 1 => {
            // Steal half their queue
            data.run_queue.steal_half()
        },
        _ => None,
    }
}
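
The victim selection and the "steal half" step can be modelled in a few lines. This Rust sketch is single-threaded and ignores locking entirely, so it only shows the queue arithmetic; the function steal_half and the use of u32 task IDs are assumptions for illustration, not the kernel's types.

```rust
use std::collections::VecDeque;

/// Moves half of the busiest other queue into `queues[me]`.
/// Returns the number of tasks moved (0 if no other queue has more than one task).
fn steal_half(queues: &mut [VecDeque<u32>], me: usize) -> usize {
    // Pick the victim: the longest queue that is not ours.
    let victim = queues
        .iter()
        .enumerate()
        .filter(|(id, _)| *id != me)
        .max_by_key(|(_, q)| q.len())
        .map(|(id, _)| id);

    let victim = match victim {
        Some(id) if queues[id].len() > 1 => id,
        _ => return 0,
    };

    // Take the back half, rounding down, so the victim keeps the tasks at
    // the front of its queue.
    let keep = queues[victim].len() - queues[victim].len() / 2;
    let stolen = queues[victim].split_off(keep);
    let count = stolen.len();
    queues[me].extend(stolen);
    count
}
```

Rounding down means a queue of five gives up two tasks and keeps three, so a victim is never left with less work than the thief took.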

◉ Load Balancing

Every 100 ms, a load balancer moves threads from cores whose run queue is more than 1.5x the average length to cores whose queue is below 0.5x the average:

fn load_balance_tick() {
    let loads = per_core_data
        |> map(fn(data) { (data.hart_id, data.run_queue.len()) })
        |> collect()

    let avg_load = loads |> map(fn((_, l)) { l }) |> mean()

    // Move threads from overloaded to underloaded cores
    let overloaded = loads |> filter(fn((_, l)) { l > avg_load * 1.5 })
    let underloaded = loads |> filter(fn((_, l)) { l < avg_load * 0.5 })

    for (src, _) in overloaded {
        for (dst, _) in underloaded {
            if per_core[src].run_queue.len() > 2 {
                let thread = per_core[src].run_queue.pop_back()
                per_core[dst].run_queue.push(thread)
            }
        }
    }
}
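
To see what one balancing pass does to concrete numbers, here is a small Rust model that tracks only queue lengths. It mirrors the thresholds and the "leave at least two threads" check from load_balance_tick, but balance and its return value are hypothetical, and real thread migration is reduced to moving a count.

```rust
/// One balancing pass over per-core queue lengths.
/// Returns the (src, dst) moves that were made, one thread per move.
fn balance(loads: &mut [usize]) -> Vec<(usize, usize)> {
    let avg = loads.iter().sum::<usize>() as f64 / loads.len() as f64;

    // Classify cores against a snapshot taken before any thread moves.
    let overloaded: Vec<usize> = (0..loads.len())
        .filter(|&i| (loads[i] as f64) > avg * 1.5)
        .collect();
    let underloaded: Vec<usize> = (0..loads.len())
        .filter(|&i| (loads[i] as f64) < avg * 0.5)
        .collect();

    let mut moves = Vec::new();
    for &src in &overloaded {
        for &dst in &underloaded {
            // Re-check the live length so a source never drops to two or fewer.
            if loads[src] > 2 {
                loads[src] -= 1;
                loads[dst] += 1;
                moves.push((src, dst));
            }
        }
    }
    moves
}
```

Starting from queue lengths [8, 0, 0, 0], the average is 2, core 0 is overloaded and the other three are underloaded, so one pass moves three threads and leaves [5, 1, 1, 1]. A single tick moves at most one thread per source/destination pair, so a badly skewed system takes several ticks to even out.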

◉ Cache Coherency

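RISC-V's memory model (RVWMO) is weakly ordered: unless told otherwise, a hart may make its stores visible to other harts in a different order than the program issued them, and may perform its loads out of order too. Cache coherence does not change this, so data handed from one core to another needs explicit fences:

// Writer: call after filling in the data and before setting the flag
// (or pointer) that tells another core the data is ready
fn publish_data(data: &Data) {
    asm!("fence w, w")  // Order the data writes before the flag write
}

// Reader: call after seeing the flag and before reading the data
fn acquire_data() {
    asm!("fence r, r")  // Order the flag read before the data reads
}

The U74 cores on HiFive Unmatched are cache-coherent, but coherence is not ordering, so the fences stay in on every target.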
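
The placement of the two fences relative to a "ready" flag is the part that is easy to get wrong, so here is a hosted Rust sketch of the full handoff. Mailbox, publish, consume, and round_trip are hypothetical names. Rust's release and acquire fences are somewhat stronger than fence w, w and fence r, r (on RISC-V they typically lower to fence rw, w and fence r, rw), so treat this as the same pattern rather than the same instructions.

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// A one-slot mailbox: a payload plus a flag saying the payload is valid.
struct Mailbox {
    payload: AtomicU64,
    ready: AtomicBool,
}

/// Writer side: store the payload, fence, then raise the flag.
fn publish(mb: &Mailbox, value: u64) {
    mb.payload.store(value, Ordering::Relaxed);
    fence(Ordering::Release); // payload write is ordered before the flag write
    mb.ready.store(true, Ordering::Relaxed);
}

/// Reader side: wait for the flag, fence, then read the payload.
fn consume(mb: &Mailbox) -> u64 {
    while !mb.ready.load(Ordering::Relaxed) {
        std::hint::spin_loop();
    }
    fence(Ordering::Acquire); // flag read is ordered before the payload read
    mb.payload.load(Ordering::Relaxed)
}

/// Publishes `value` from this thread and returns what a reader thread saw.
fn round_trip(value: u64) -> u64 {
    let mb = Arc::new(Mailbox {
        payload: AtomicU64::new(0),
        ready: AtomicBool::new(false),
    });
    let reader = {
        let mb = Arc::clone(&mb);
        thread::spawn(move || consume(&mb))
    };
    publish(&mb, value);
    reader.join().unwrap()
}
```

Without the two fences, the reader would be allowed to see the flag set and still read the old payload of 0. With them, a reader that sees the flag is guaranteed to see the published value.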

◉ Benchmarks

Parallel speedup on HiFive Unmatched (4 cores):

Workload	1 core	4 cores	Speedup
Mandelbrot render	4.2s	1.15s	3.65x
Parallel sort (1M items)	890ms	245ms	3.63x
Build kernel	12.4s	4.1s	3.02x
Process spawn stress	1.8s	0.52s	3.46x

We're seeing 3.02x to 3.65x speedup on 4 cores: not quite linear, but respectable for a young OS.
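
The speedup column is just the ratio of the two timings, and two derived figures help when reading it: parallel efficiency (speedup divided by core count) and the serial fraction that Amdahl's law would imply. The Rust helpers below are a sketch for that arithmetic; the function names are hypothetical, and the Amdahl figure assumes the workload fits that simple model at all, which a kernel build with I/O may not.

```rust
/// Speedup of an n-core run relative to the 1-core run.
fn speedup(t1: f64, tn: f64) -> f64 {
    t1 / tn
}

/// Parallel efficiency: the fraction of ideal linear scaling achieved.
fn efficiency(t1: f64, tn: f64, cores: u32) -> f64 {
    speedup(t1, tn) / cores as f64
}

/// Serial fraction s implied by Amdahl's law, S = 1 / (s + (1 - s) / n),
/// solved for s: s = (n / S - 1) / (n - 1).
fn implied_serial_fraction(t1: f64, tn: f64, cores: u32) -> f64 {
    let n = cores as f64;
    let s = speedup(t1, tn);
    (n / s - 1.0) / (n - 1.0)
}
```

Plugging in the table rows, the Mandelbrot render runs at about 91% efficiency, while the kernel build runs at about 76% and would correspond to a serial fraction of roughly 11% under Amdahl's model. These are derived from the timings above, not separate measurements.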

◉ Debugging SMP

SMP bugs are notoriously hard to reproduce. Our debugging toolkit:

Per-core serial output: Each core prefixes its output with hart ID
Deadlock detection: Timer watchdog detects stuck spinlocks
Core dump on panic: All cores halt and dump state on any panic
QEMU logging: -d int,cpu_reset to trace interrupts, exceptions, and CPU resets

[hart 0] Boot complete
[hart 1] Secondary entry
[hart 2] Secondary entry
[hart 3] Secondary entry
[hart 0] All 4 cores online
[hart 2] Running task: compile foo.ltl
[hart 0] Running task: compile bar.ltl
...

◉ What's Next

SMP support is stable for compute workloads. Next up:

NUMA awareness (when hardware supports it)
Core parking for power savings
Real-time scheduling priority for multimedia

See the OS page for build instructions and SMP configuration.