Designing a Shared-Nothing Wasm Orchestrator

The MicroVM was the right answer to the wrong question. It asked “how do we get VM-grade isolation with container-grade startup?” and the answer was Firecracker, a stripped device model, and cold starts that bottomed out around 125ms. For workloads that live for hours, that’s free. For request-scoped edge compute, where a function lives 4ms and serves one HTTP request, 125ms of cold-path jitter isn’t a tax you amortize. It’s the budget.

WebAssembly flips the constants. A precompiled Wasmtime module instantiates in tens of microseconds. Isolation comes from bounds checks on linear memory, either emitted by the compiler or enforced by guard pages, and from module-level type safety, not from a vmexit and a second-stage page-table walk. Density goes from hundreds of MicroVMs per host to tens of thousands of Wasm instances. The interesting platform work is now below the VM, and the orchestrator that lives there has to be designed against the silicon, not against an event loop.

Pin every thread, share nothing

A 64-core EPYC has eight CCDs, each with its own L3. Cross-CCD traffic is a coherence event that costs on the order of a hundred nanoseconds, several times a hit in the local L3. The default Tokio multi-thread runtime is a work-stealing scheduler: a task wakes on core 3, gets stolen onto core 47, and now its working set is cold in L1, cold in L2, possibly on the wrong NUMA node, and the connection state it just touched is sitting in a different L3 entirely. Work stealing optimizes mean throughput at the cost of variance. For p99-sensitive workloads that’s the wrong trade.

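The shared-nothing alternative is mechanical: one OS thread per core, pinned at startup, each owning its own listening socket, its own io_uring, and its own wasmtime::Store, with no locks or queues shared between cores on the hot path.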
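A minimal sketch of the thread-per-core shape, using only the standard library. CoreState and run_workers are hypothetical names, the loop body is a stand-in for a real event loop, and the pinning step is left as a comment because the standard library has no affinity API; real code would likely call sched_setaffinity through the libc crate or use a crate such as core_affinity.

```rust
use std::thread;

/// Hypothetical per-core state: everything a request touches lives here
/// and is owned by exactly one thread.
#[derive(Debug)]
pub struct CoreState {
    pub core: usize,
    pub requests_served: u64,
}

/// Spawn one worker per core. Each worker builds its state locally and
/// returns it only at shutdown, so nothing is shared while it runs.
pub fn run_workers(cores: usize, requests_per_core: u64) -> Vec<CoreState> {
    let handles: Vec<thread::JoinHandle<CoreState>> = (0..cores)
        .map(|core| {
            thread::Builder::new()
                .name(format!("core-{}", core))
                .spawn(move || {
                    // Assumption: a real worker pins itself to `core` here,
                    // before allocating anything, so that first-touch pages
                    // land on the local NUMA node.
                    let mut state = CoreState { core, requests_served: 0 };
                    for _ in 0..requests_per_core {
                        state.requests_served += 1; // stand-in for the event loop
                    }
                    state
                })
                .expect("failed to spawn worker thread")
        })
        .collect();

    handles
        .into_iter()
        .map(|h| h.join().expect("worker panicked"))
        .collect()
}
```

The point of the shape is that CoreState is constructed inside the worker closure and never placed behind an Arc or a Mutex, so the compiler itself rules out cross-core sharing of that state.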

Let the kernel route the flow

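How does a connection actually reach a core in this model? SO_REUSEPORT with a cBPF steering filter. Each core opens its own listening socket on the same port. Left alone, the kernel hashes the 4-tuple to pick a socket in the group; a classic BPF program attached with SO_ATTACH_REUSEPORT_CBPF replaces that hash with an explicit choice, typically the index of the CPU that handled the incoming SYN. Once accepted, the flow stays on that socket for its lifetime. No accept-queue contention, no cross-core wakeups, no rebalancing. The connection state, the io_uring CQEs, and the wasmtime::Store all stay in one core’s cache hierarchy.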
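For illustration, the steering filter can be as small as two classic-BPF instructions: load the id of the CPU processing the packet, then return it as the socket index. The sketch below only builds the instruction array; the constants are taken from the Linux UAPI headers, while SockFilter and cpu_steering_filter are hypothetical names. It assumes the listening sockets joined the reuseport group in core order and that NIC queue interrupts are pinned to match, neither of which the text above specifies.

```rust
/// One classic-BPF instruction, laid out like the kernel's `struct sock_filter`.
#[repr(C)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SockFilter {
    pub code: u16,
    pub jt: u8,
    pub jf: u8,
    pub k: u32,
}

// Values from <linux/bpf_common.h> and <linux/filter.h>.
const BPF_LD: u16 = 0x00;
const BPF_W: u16 = 0x00;
const BPF_ABS: u16 = 0x20;
const BPF_RET: u16 = 0x06;
const BPF_A: u16 = 0x10;
const SKF_AD_OFF: i32 = -0x1000; // base of the ancillary-data "offsets"
const SKF_AD_CPU: i32 = 36;      // ancillary slot holding the current CPU id

/// Build the two-instruction program:
///   A = id of the CPU handling this packet
///   return A   (interpreted as an index into the reuseport group)
pub fn cpu_steering_filter() -> [SockFilter; 2] {
    [
        SockFilter {
            code: BPF_LD | BPF_W | BPF_ABS,
            jt: 0,
            jf: 0,
            k: (SKF_AD_OFF + SKF_AD_CPU) as u32,
        },
        SockFilter { code: BPF_RET | BPF_A, jt: 0, jf: 0, k: 0 },
    ]
}
```

Attaching it is a single setsockopt call with SOL_SOCKET and SO_ATTACH_REUSEPORT_CBPF on one socket in the group, passing a sock_fprog that points at this array. If the program returns an index outside the group, the kernel falls back to its default hash.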

The syscall is the budget

epoll was designed when a syscall cost a few hundred cycles. With KPTI and retpolines, you’re closer to a thousand. At a million RPS per core, a core clocked around 3 GHz has roughly 3,000 cycles per request, so two or three syscalls are the budget.

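io_uring collapses the model. Two ring buffers, submission and completion, mapped into userspace. Set up IORING_SETUP_SQPOLL and a kernel thread polls the SQ, so submitting is a store to shared memory rather than a syscall. (IORING_SETUP_IOPOLL is the storage-side counterpart: it busy-polls completions for O_DIRECT block I/O and does not apply to sockets.) As long as the SQ thread stays awake, the data plane runs without a single io_uring_enter call on the hot path.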
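To make the "no syscall to submit" point concrete, here is a toy single-producer, single-consumer ring that follows the same head/tail protocol. It is a model of the idea, not io_uring's actual layout: Ring is a hypothetical type, the entries are bare u64 tokens standing in for SQEs, and the consumer plays the role of the kernel's polling thread.

```rust
use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};

/// Toy SPSC ring with power-of-two capacity N.
pub struct Ring<const N: usize> {
    head: AtomicU32, // advanced only by the consumer (the poller)
    tail: AtomicU32, // advanced only by the producer (the application)
    slots: [AtomicU64; N],
}

impl<const N: usize> Ring<N> {
    pub fn new() -> Self {
        assert!(N.is_power_of_two());
        Ring {
            head: AtomicU32::new(0),
            tail: AtomicU32::new(0),
            slots: std::array::from_fn(|_| AtomicU64::new(0)),
        }
    }

    /// Producer: write the entry, then publish it with a release store on
    /// the tail. Returns false when the ring is full.
    pub fn push(&self, entry: u64) -> bool {
        let tail = self.tail.load(Ordering::Relaxed);
        let head = self.head.load(Ordering::Acquire);
        if tail.wrapping_sub(head) as usize == N {
            return false;
        }
        self.slots[tail as usize & (N - 1)].store(entry, Ordering::Relaxed);
        self.tail.store(tail.wrapping_add(1), Ordering::Release);
        true
    }

    /// Consumer: the acquire load of the tail makes published entries visible.
    pub fn pop(&self) -> Option<u64> {
        let head = self.head.load(Ordering::Relaxed);
        let tail = self.tail.load(Ordering::Acquire);
        if head == tail {
            return None;
        }
        let entry = self.slots[head as usize & (N - 1)].load(Ordering::Relaxed);
        self.head.store(head.wrapping_add(1), Ordering::Release);
        Some(entry)
    }
}
```

Publishing an entry costs one plain store and one release store. The completion ring works the same way with the roles reversed.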

The hard part is memory ownership. A recv SQE hands a buffer pointer to the kernel until the matching CQE arrives. Rust’s lifetimes don’t have a way to say “borrowed by the kernel for an unbounded duration.” Two patterns:
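One widely used way to express that hand-off, sketched here under assumptions, is to move the buffer out of the caller's hands for the duration of the operation and return it on completion. InFlight, lend, and complete are hypothetical names; the token is whatever value gets written into the SQE's user_data field.

```rust
use std::collections::HashMap;

/// Buffers currently lent to the kernel, keyed by the SQE's user_data.
#[derive(Default)]
pub struct InFlight {
    next_token: u64,
    lent: HashMap<u64, Vec<u8>>,
}

impl InFlight {
    /// Move `buf` into the table and return the token to store in user_data.
    /// Because the Vec is moved, the caller can no longer read, write, or
    /// drop it while the kernel may still be writing into it. Moving a Vec
    /// does not move its heap allocation, so the pointer given to the
    /// kernel stays valid.
    pub fn lend(&mut self, buf: Vec<u8>) -> u64 {
        let token = self.next_token;
        self.next_token += 1;
        self.lent.insert(token, buf);
        token
    }

    /// Called when the CQE for `token` arrives: hand the buffer back,
    /// truncated to the byte count the kernel reported.
    pub fn complete(&mut self, token: u64, bytes: usize) -> Option<Vec<u8>> {
        let mut buf = self.lent.remove(&token)?;
        buf.truncate(bytes);
        Some(buf)
    }
}
```

A caller would take the buffer's pointer and capacity for the SQE, call lend, and get the buffer back only from complete. The sketch does not handle dropping the table with operations still in flight; a real runtime has to cancel those operations or deliberately leak the buffers.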

Back the registered pool with 2MB hugepages. The TLB footprint of a 1GB region drops from 256K entries to 512, which removes the dTLB miss as a tail-latency contributor.
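The arithmetic behind those two numbers, as a small helper (pages_to_cover is a hypothetical name):

```rust
const KIB: u64 = 1024;
const MIB: u64 = 1024 * KIB;
const GIB: u64 = 1024 * MIB;

/// Number of pages, and so distinct translations the TLB may need to hold,
/// required to cover `region` bytes with pages of `page` bytes.
pub fn pages_to_cover(region: u64, page: u64) -> u64 {
    (region + page - 1) / page
}
```

With these definitions, pages_to_cover(GIB, 4 * KIB) is 262,144 (256K) and pages_to_cover(GIB, 2 * MIB) is 512.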

Linear memory is the io_uring buffer

This is the seam between I/O and execution, and it’s the part most designs get wrong. The kernel has just filled a registered buffer with the request bytes. Wasmtime runs the guest inside a separate linear-memory mmap. A naive design copies bytes from the io_uring buffer into linear memory at every dispatch. That copy dominates at small payload sizes and evicts cache lines that had just arrived warm.

The move is to register a fixed prefix of each instance’s linear memory as the io_uring buffer. The kernel writes directly into guest memory. The Wasm entry function sees its input at a known offset. Zero copies, one TLB-warm page, no host-to-guest pointer translation. The cost is coupling: buffer registration is now tied to instance lifetime, so reset has to re-register or reuse a slot.
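A standard-library model of that layout, as a sketch rather than the real integration: Instance stands in for a Wasm instance, its Vec<u8> for linear memory, and any Read source for the kernel. INPUT_OFFSET and INPUT_CAP are hypothetical constants. With Wasmtime, the slice would presumably come from the instance's exported memory (for example through Memory::data_mut) and would be registered with io_uring instead of being read into.

```rust
use std::io::{self, Read};

/// Hypothetical layout: the first INPUT_CAP bytes of linear memory are the
/// I/O window; guest heap and stack live above it.
pub const INPUT_OFFSET: usize = 0;
pub const INPUT_CAP: usize = 4096;

/// Stand-in for an instance; `memory` plays the role of Wasm linear memory.
pub struct Instance {
    memory: Vec<u8>,
}

impl Instance {
    pub fn new(pages: usize) -> Self {
        // Wasm pages are 64 KiB.
        Instance { memory: vec![0; pages * 65536] }
    }

    /// Receive straight into the guest-visible window: the source writes
    /// into linear memory itself, with no staging buffer in between.
    /// Returns the (offset, len) pair the entry function would be called with.
    pub fn recv_into_guest<R: Read>(&mut self, src: &mut R) -> io::Result<(usize, usize)> {
        let window = &mut self.memory[INPUT_OFFSET..INPUT_OFFSET + INPUT_CAP];
        let len = src.read(window)?;
        Ok((INPUT_OFFSET, len))
    }

    /// What the guest sees when it reads its input.
    pub fn guest_view(&self, offset: usize, len: usize) -> &[u8] {
        &self.memory[offset..offset + len]
    }
}
```

Because the window is a fixed prefix, the guest needs only a length to find its input, and the host never translates a pointer.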

AOT, static memory, epoch interruption

Why not JIT and let Cranelift run on a different core? There is no different core. Every core is pinned and serving requests. JIT means tenant code schedules compilation onto the same data-plane core that’s holding a flow’s state, and Cranelift isn’t constant-time. AOT pushes that cost to deploy. Done.

Throughput alone is the wrong benchmark

The number that matters is throughput vs p99.9 latency, not throughput alone. X axis: offered RPS per core. Y axis: p99.9. Three lines: native thread-per-core, Wasm, and default Tokio.

Native thread-per-core wins the absolute throughput number. Wasm pays a ~10-15% tax for the sandbox and gets it back in density and isolation. The default Tokio line diverges hard above 60% utilization because work stealing introduces tail-latency cliffs that pinned cores don’t have.
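For reference, a p99.9 point on such a plot can be computed from raw latency samples with a nearest-rank percentile. This is a sketch of one common definition, not necessarily the one used for the comparison above; percentile_permille is a hypothetical name, and the percentile is passed in tenths of a percent so the rank stays in integer arithmetic.

```rust
/// Nearest-rank percentile. `permille` is the percentile in tenths of a
/// percent, so 999 means p99.9. Sorts `samples` in place.
pub fn percentile_permille(samples: &mut [u64], permille: u64) -> Option<u64> {
    if samples.is_empty() || permille == 0 || permille > 1000 {
        return None;
    }
    samples.sort_unstable();
    let n = samples.len() as u64;
    // ceil(n * permille / 1000), as a 1-based rank into the sorted samples.
    let rank = (n * permille + 999) / 1000;
    Some(samples[(rank - 1) as usize])
}
```

With fewer than a thousand samples at a given load step, this definition of p99.9 collapses to the maximum, so each point on the curve needs a long enough run to be meaningful.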

The last syscall is the network stack

The remaining hot-path syscall is the one userspace can’t remove from above: the kernel network stack itself. A packet enters the NIC, traverses the kernel’s IP and TCP layers, and lands in a socket buffer before io_uring sees it.

The next move is to push L4 termination into eBPF/XDP, run the TLS handshake in userspace while kernel TLS (kTLS) handles record encryption on the bulk path, and let the orchestrator see only dispatch-ready connections. At that point the kernel is a driver, not a participant.