Architecture
One reactor per thread, typically one per core. Each owns an io_uring, a SO_REUSEPORT listener, a connection table, buffer rings, and a connection pool - and is the sole writer of all of them. Nothing is shared between reactors. This page is the system view; The Reactor and The Connection go inside the classes.
The loop
A reactor's life is a single loop: drain the cross-thread queues, enter the kernel once
(io_uring_enter - submitting everything staged and waiting for at least one
completion), then dispatch the whole completion batch.
```csharp
while (true)
{
    // Work handed over by off-reactor handlers. Cheap when empty.
    DrainReturnQ();    // buffer returns
    DrainFlushQ();     // flushes
    DrainRecycleQ();   // connection teardowns
    DrainRemoteOps();  // client ops

    // One syscall per batch: submit everything staged, wait for >= 1 CQE.
    Ring.SubmitAndWait(1);

    // Read the CQ tail once, dispatch the whole batch, publish the head once.
    uint ready = Ring.CqReady();
    for (uint i = 0; i < ready; i++)
        Dispatch(in Ring.CqeAt(i));
    Ring.CqAdvance(ready);
}
```
Dispatching a completion frequently runs handler code - that's the inline-resume model below - so by the time the loop re-enters the kernel, the responses those completions triggered are already staged in the submission queue. One syscall carries the whole request/response batch.
Ring setup
- SINGLE_ISSUER - only this thread submits, so the kernel skips SQ locking.
- DEFER_TASKRUN - completion work runs batched inside enter instead of interrupting the thread as task-work.
- NO_SQARRAY (kernel 6.6+) - drops the SQ indirection array, one store fewer per SQE. Setup falls back automatically on older kernels (EINVAL probe).
- SQ-full handling - if the submission queue fills mid-batch, the reactor flushes it with a no-wait enter and continues; submission never blocks on completions.
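The EINVAL probe can be sketched as follows. This is a minimal illustration, not the engine's setup code: the flag values are the ones defined in the Linux `<linux/io_uring.h>` header, while `RingSetup`, `SetupFlags`, and the `trySetup` delegate (a stand-in for the `io_uring_setup` syscall that returns a ring fd or a negated errno) are hypothetical names introduced here.

```csharp
using System;

// Flag values as defined in the Linux UAPI header <linux/io_uring.h>.
[Flags]
public enum SetupFlags : uint
{
    SingleIssuer = 1u << 12, // IORING_SETUP_SINGLE_ISSUER
    DeferTaskrun = 1u << 13, // IORING_SETUP_DEFER_TASKRUN (requires SINGLE_ISSUER)
    NoSqArray    = 1u << 16, // IORING_SETUP_NO_SQARRAY (kernel 6.6+)
}

public static class RingSetup
{
    public const int EINVAL = 22;

    // trySetup is a hypothetical stand-in for the io_uring_setup syscall:
    // it returns a ring fd (>= 0) on success or a negated errno on failure.
    public static (int Fd, SetupFlags Flags) Create(Func<SetupFlags, int> trySetup)
    {
        var flags = SetupFlags.SingleIssuer | SetupFlags.DeferTaskrun | SetupFlags.NoSqArray;
        int fd = trySetup(flags);
        if (fd == -EINVAL)
        {
            // An older kernel rejects the unknown flag: retry without NO_SQARRAY.
            flags &= ~SetupFlags.NoSqArray;
            fd = trySetup(flags);
        }
        if (fd < 0)
            throw new InvalidOperationException($"io_uring_setup failed: errno {-fd}");
        return (fd, flags);
    }
}
```

The caller keeps the returned flags so the rest of the reactor knows whether the SQ indirection array exists on this ring.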
Accept, recv, send
- Multishot accept - armed once on the listener; every new connection is just a CQE. Accepted sockets get TCP_NODELAY (it doesn't reliably inherit from the listener).
- Multishot recv + provided buffers - armed once per connection. The kernel picks a buffer from a pre-registered ring, fills it, and posts a CQE carrying the buffer id. Handlers consume slices zero-copy and return buffers when done.
- Send with MSG_WAITALL - the kernel retries short sends internally, so a flush is one SQE and one CQE. A genuinely partial send (error paths) is resubmitted from the offset.
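Resubmitting a partial send "from the offset" comes down to a little bookkeeping on the pending range of the write slab. A minimal sketch, assuming the pending data is tracked as an offset and a length; `SendProgress.Advance` is a hypothetical helper, not part of the engine:

```csharp
using System;

public static class SendProgress
{
    // Given the pending range [offset, offset + length) of the write slab and
    // the result field of the send CQE, returns the range still to be sent.
    // A zero Length means the flush is complete; anything else is resubmitted.
    public static (int Offset, int Length) Advance(int offset, int length, int cqeRes)
    {
        if (cqeRes < 0)
            throw new InvalidOperationException($"send failed: errno {-cqeRes}");
        int sent = Math.Min(cqeRes, length);
        return (offset + sent, length - sent);
    }
}
```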
Two buffer-ring modes
- Shared (default): one pool per reactor; every connection draws from it. One recv consumes one whole buffer regardless of size - elastic and simple, but small messages waste space.
- Incremental (IOU_PBUF_RING_INC, kernel 6.12+): a small ring per connection, and the kernel appends successive recvs into the same buffer until it fills. Dense packing and per-connection isolation, paid for with refcounted recycling - a buffer returns only when the handler has returned every slice and the kernel is done appending (F_BUF_MORE cleared) - plus a ring registration per connection (MaxConnections caps the buffer-group ids).
The handler API is identical in both modes; ReturnBuffer(s) picks the right return path.
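The refcounted recycling rule for incremental mode can be sketched as a small per-buffer state object. This assumes all calls happen on the reactor thread (so no atomics are needed) and that every recv CQE hands exactly one slice to the handler; `IncrementalBuffer` and its members are hypothetical names.

```csharp
public sealed class IncrementalBuffer
{
    private int _outstandingSlices; // slices handed to the handler, not yet returned
    private bool _kernelDone;       // set once a CQE arrives with F_BUF_MORE cleared

    // Called for each recv CQE that landed in this buffer.
    // bufMore mirrors the F_BUF_MORE flag of that CQE.
    public void OnRecv(bool bufMore)
    {
        _outstandingSlices++;
        if (!bufMore) _kernelDone = true;
    }

    // Called when the handler returns one slice. Returns true when the whole
    // buffer may go back to the connection's ring: every slice is returned
    // and the kernel has stopped appending.
    public bool ReturnSlice()
    {
        _outstandingSlices--;
        return _kernelDone && _outstandingSlices == 0;
    }
}
```

Both conditions are needed: returning the last outstanding slice while the kernel may still append would hand back a buffer that is about to receive more data.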
Completion routing: tags and generations
Every SQE carries its routing in user_data:
[63:56] kind accept · recv · send · wake · client · cancel
[47:32] gen the connection's generation at submit time
[31:0] fd (or the client-op slot)
Dispatch is an array index (connections[fd] - fds are small dense integers, so
an array beats hashing) plus a generation check. The generation is what makes fd reuse safe:
when a connection dies, its fd number is immediately reusable, and a straggler CQE from the
old life would otherwise reach the new tenant. Stale generation → the CQE is dropped and its
buffer returned. The same guard rides the flush queue and incremental buffer returns.
Teardown also submits an ASYNC_CANCEL for the connection's multishot recv
(matched by exact user_data), so a dead connection can't keep consuming buffers or race the
fd's next tenant. If a connection's recv queue overflows - the handler isn't draining - the
reactor cancels and tears it down rather than leaving it zombied.
Client ops (kind = client) skip the connection table entirely: the low 32 bits
index a slot table holding the submission's completion object.
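The user_data layout above maps onto a few shifts and masks. A minimal sketch: the bit positions come from the layout, but the numeric values of `OpKind`, the `UserData` helper, and the `generations` array are assumptions made for illustration. Bits 55:48 are left at zero because the layout does not assign them.

```csharp
public enum OpKind : byte { Accept = 1, Recv, Send, Wake, Client, Cancel }

public static class UserData
{
    // [63:56] kind, [47:32] generation, [31:0] fd or client-op slot.
    public static ulong Pack(OpKind kind, ushort gen, uint fdOrSlot) =>
        ((ulong)kind << 56) | ((ulong)gen << 32) | fdOrSlot;

    public static OpKind Kind(ulong userData) => (OpKind)(userData >> 56);
    public static ushort Gen(ulong userData) => (ushort)(userData >> 32);
    public static uint Fd(ulong userData) => (uint)userData;

    // generations[fd] holds the current generation of whatever connection
    // occupies that fd. A mismatch means the CQE belongs to a previous life
    // of the fd and must be dropped (after returning its buffer).
    public static bool IsCurrent(ulong userData, ushort[] generations)
    {
        uint fd = Fd(userData);
        return fd < generations.Length && generations[fd] == Gen(userData);
    }
}
```

For example, `UserData.Pack(OpKind.Recv, 7, 42)` yields `0x020000070000002A` under these assumed kind values. If fd 42 is closed and reused, its generation moves on, and a late CQE still carrying generation 7 fails `IsCurrent`.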
Inline resume
Every awaitable - ReadAsync, FlushAsync, every client op - is
backed by a reusable IValueTaskSource core with
RunContinuationsAsynchronously = false. When the reactor dispatches a CQE and
calls SetResult, the awaiting handler continues right there, on the
reactor thread, inside the dispatch loop. Zero allocation per await: connections are pooled
and their cores are reused, with the connection generation as the token so an awaiter from a
previous pool life resolves to a closed result instead of the new tenant's state.
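The inline-resume behaviour can be reproduced with the BCL's `ManualResetValueTaskSourceCore<T>`. This is a sketch under assumptions, not the engine's core: `RecvSource` and `InlineResumeDemo` are hypothetical names, and the token here is the core's own `Version` rather than the connection generation described above.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Sources;

public sealed class RecvSource : IValueTaskSource<int>
{
    // Mutable struct, so the field must not be readonly.
    private ManualResetValueTaskSourceCore<int> _core =
        new() { RunContinuationsAsynchronously = false };

    // Handler side: re-arm the reusable core and hand out an awaitable.
    public ValueTask<int> ReadAsync()
    {
        _core.Reset();
        return new ValueTask<int>(this, _core.Version);
    }

    // Reactor side: called while dispatching the CQE.
    public void Complete(int bytes) => _core.SetResult(bytes);

    public int GetResult(short token) => _core.GetResult(token);
    public ValueTaskSourceStatus GetStatus(short token) => _core.GetStatus(token);
    public void OnCompleted(Action<object?> continuation, object? state, short token,
                            ValueTaskSourceOnCompletedFlags flags) =>
        _core.OnCompleted(continuation, state, token, flags);
}

public static class InlineResumeDemo
{
    // Returns true when the handler resumed on the thread that called Complete.
    public static bool Run()
    {
        var source = new RecvSource();
        int resumedOn = -1;

        async Task Handler()
        {
            await source.ReadAsync().ConfigureAwait(false);
            resumedOn = Environment.CurrentManagedThreadId;
        }

        Task handler = Handler(); // runs up to the await, then returns
        int reactorId = -1;
        var reactor = new Thread(() =>
        {
            reactorId = Environment.CurrentManagedThreadId;
            source.Complete(128); // the continuation runs here, inline
        });
        reactor.Start();
        reactor.Join();
        handler.Wait();
        return resumedOn == reactorId;
    }
}
```

Because the continuation is registered before `Complete` is called and no scheduling context is captured, `SetResult` invokes it directly on the calling thread, which stands in for the reactor here.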
Leaving the reactor (and coming back)
Handlers may wander - await Task.Delay, any BCL async - and resume on the
thread pool. Every reactor-touching operation checks the current thread: on the reactor it
takes the direct path (write the SQE, touch the buf_ring); off it, the operation is queued -
lock-free MPSC queues for buffer returns and flushes, a queue for client ops and recycles -
and the reactor is woken through an eventfd registered as a multishot poll. The
detour costs a queue hop and a syscall; the Playground's hop mode runs every
request through it, end to end.
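The thread check and the queued detour can be sketched for the buffer-return path. Everything named here is an assumption for illustration: `ReactorGate` is hypothetical, `ConcurrentQueue<T>` stands in for the lock-free MPSC queue, and the `wake` delegate stands in for the write to the eventfd. A real implementation might also coalesce wakes instead of issuing one per enqueue.

```csharp
using System;
using System.Collections.Concurrent;

public sealed class ReactorGate
{
    private readonly int _reactorThreadId;
    private readonly Action _wake; // hypothetical: writes to the reactor's eventfd
    private readonly ConcurrentQueue<int> _returnQ = new(); // stand-in for the MPSC queue

    public int Recycled { get; private set; } // only touched on the reactor thread

    public ReactorGate(int reactorThreadId, Action wake)
    {
        _reactorThreadId = reactorThreadId;
        _wake = wake;
    }

    public void ReturnBuffer(int bufferId)
    {
        if (Environment.CurrentManagedThreadId == _reactorThreadId)
        {
            Recycle(bufferId); // direct path: touch the buf_ring right away
            return;
        }
        _returnQ.Enqueue(bufferId); // off-reactor: queue the work...
        _wake();                    // ...and wake the reactor to drain it
    }

    // Called by the reactor at the top of each loop iteration.
    public int DrainReturnQ()
    {
        int drained = 0;
        while (_returnQ.TryDequeue(out int id))
        {
            Recycle(id);
            drained++;
        }
        return drained;
    }

    // Placeholder for putting the buffer back on the provided-buffer ring.
    private void Recycle(int bufferId) => Recycled++;
}
```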
Connection lifetime
A connection has exactly two owners: the reactor (recv side) and the handler. The refcount
starts at 2 on accept; each owner releases once - the reactor on EOF/error, the handler via
conn.DecRef() on exit (exactly once, in a finally). Whoever reaches
zero hands the connection to the reactor for recycling: cancel the multishot recv, return
leftover buffers, close the fd, bump the generation (invalidating stale awaiters and queued
work), reset state, and push to the pool (capped by PoolMax; beyond it, native
memory is freed). Pooled connections keep their slab and buffer-ring allocations across
lives.
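The two-owner refcount can be sketched with a single interlocked counter. `ConnectionRef` and the `recycle` delegate (standing in for "hand the connection to the reactor for recycling") are hypothetical; only `DecRef` is a name taken from the text above.

```csharp
using System;
using System.Threading;

public sealed class ConnectionRef
{
    private int _refs;
    private readonly Action _recycle; // hypothetical: enqueue for recycling on the reactor

    public ConnectionRef(Action recycle) => _recycle = recycle;

    // Two owners from the start: the reactor (recv side) and the handler.
    public void OnAccept() => Volatile.Write(ref _refs, 2);

    // Each owner calls this exactly once. The owners may be on different
    // threads, so the decrement is atomic; whoever reaches zero triggers
    // the recycle.
    public void DecRef()
    {
        if (Interlocked.Decrement(ref _refs) == 0)
            _recycle();
    }
}
```

On the handler side, the release sits in a `finally` so that an exception in the handler body cannot leak the connection: `try { /* serve */ } finally { conn.DecRef(); }`.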
Wiring and services
Three seams connect an application: Handle (the per-connection loop),
OnStart (runs on the reactor thread before serving - open ring-native clients
here so they bind to this reactor's ring), and typed services
(AddService<T> / GetService<T>) so one reactor can carry
any number of clients. The engine never names a client type - see
Ring clients.
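One way a typed service seam of this shape can work is a per-reactor table keyed by type. The following is a generic sketch of that pattern, not the engine's implementation; the `ServiceTable` class and the exact signatures of `AddService<T>` and `GetService<T>` are assumptions.

```csharp
using System;
using System.Collections.Generic;

public sealed class ServiceTable
{
    // One table per reactor, so lookups need no locking.
    private readonly Dictionary<Type, object> _services = new();

    public void AddService<T>(T service) where T : class =>
        _services[typeof(T)] = service;

    public T GetService<T>() where T : class =>
        _services.TryGetValue(typeof(T), out var service)
            ? (T)service
            : throw new InvalidOperationException($"no service of type {typeof(T).Name}");
}
```

Because the table is keyed by `typeof(T)`, the hosting code never has to mention a concrete client type: OnStart registers whatever clients the application needs, and handlers ask for them by type.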
Configuration
| Option | Default | Meaning |
|---|---|---|
| Port | 8080 | SO_REUSEPORT listener port (every reactor binds it) |
| ExtraPorts | [] | additional listener ports; conn.ListenerPort says which one a connection used |
| ReactorCount | 12 | reactors = threads; run one per core |
| RingEntries | 8192 | io_uring SQ/CQ depth |
| RecvBufferSize | 32 KB | shared mode: bytes per recv buffer |
| BufferRingEntries | 4096 | shared mode: buffers per reactor (power of two) |
| WriteSlabSize | 16 KB | per-connection write buffer |
| PoolMax | 1024 | pooled connection objects per reactor (bounds native memory) |
| RecvQueueEntries | 64 | per-connection slice queue depth; overflow closes the connection |
| Incremental | false | per-connection buffer rings (kernel 6.12+) |
| MaxConnections | 4096 | incremental: one buffer-group id per live connection |
| ConnBufRingEntries | 16 | incremental: buffers per connection ring |
| IncRecvBufferSize | 4 KB | incremental: bytes per buffer (kernel appends into it) |
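
The buffer options multiply into a receive-memory budget worth checking before changing the defaults. A rough sketch, assuming 1 KB = 1024 bytes and counting buffer payload only (ring headers, write slabs and other per-connection state are ignored); `BufferBudget` is a hypothetical helper.

```csharp
public static class BufferBudget
{
    // Shared mode: RecvBufferSize * BufferRingEntries bytes per reactor.
    public static long SharedBytesPerReactor(int recvBufferSize, int bufferRingEntries) =>
        (long)recvBufferSize * bufferRingEntries;

    // Incremental mode: IncRecvBufferSize * ConnBufRingEntries bytes per connection.
    public static long IncrementalBytesPerConnection(int incRecvBufferSize, int connBufRingEntries) =>
        (long)incRecvBufferSize * connBufRingEntries;
}
```

With the defaults, shared mode comes to 32 KB x 4096 = 128 MiB per reactor, or 1.5 GiB across 12 reactors. Incremental mode comes to 4 KB x 16 = 64 KiB per connection, which reaches 256 MiB at 4096 live connections.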