
io_uring

io_uring is a Linux kernel interface for asynchronous I/O. zerg uses it as its sole I/O mechanism – there are no epoll, kqueue, or libuv fallbacks.

The Ring Model

How io_uring Works

io_uring uses two lock-free ring buffers shared between userspace and the kernel. Your app writes SQEs (requests), the kernel writes CQEs (results). No syscall is needed to enqueue; a single io_uring_enter() submits the whole batch.

[Diagram] Your C# app and the kernel share two mmap'd rings: your code preps SQEs (prep_recv, prep_send, ...) into the Submission Queue, the kernel reads the SQ, performs the I/O (accept / recv / send), and writes CQEs into the Completion Queue for your handler to process and route. SQE fields: opcode, fd, buf/len, user_data, flags. CQE fields: user_data, res, flags.

SQE — Submission Queue Entry

opcode = what to do (recv, send, accept)
fd = which socket
user_data = your 64-bit tag (returned in CQE)
flags = BUFFER_SELECT, etc.

CQE — Completion Queue Entry

user_data = your original tag (identifies the op)
res = result (bytes transferred, new fd, or -errno)
flags = MORE, BUFFER (contains buffer_id)

Shared Memory

Both rings are mmap'd. The kernel and your app write to them directly. No copy, no syscall for enqueue. Only io_uring_enter() is needed to wake the kernel.


The I/O Lifecycle

Step through the exact sequence: your app queues an SQE, the kernel processes it, and you read the CQE result. All in shared memory.

1. get_sqe() - grab an empty SQE slot
2. prep_recv(sqe, fd) - fill opcode + fd + flags
3. set_data64(sqe, tag) - attach your 64-bit token
4. submit() - io_uring_enter crosses the kernel boundary
5. Kernel processes the I/O - recv(fd) moves data into a buffer
6. CQE written to the CQ - user_data + res + flags
7. App reads the CQE - dispatch by user_data
8. cq_advance(count) - mark the CQEs consumed
zerg — single syscall pattern
// 1. Queue work (no syscall)
io_uring_sqe* sqe = shim_get_sqe(ring);
shim_prep_recv_multishot_select(sqe, fd, bgid, 0);
shim_sqe_set_data64(sqe, PackUd(UdKind.Recv, fd));

// 2. Submit + wait in ONE syscall
shim_submit_and_wait_timeout(ring, &cqes, 1, &ts);

// 3. Batch-read completions (no syscall)
int got = shim_peek_batch_cqe(ring, cqes, batchSize);

// 4. Process results
for (int i = 0; i < got; i++) {
    UdKind kind = UdKindOf(shim_cqe_get_data64(cqes[i]));
    int res = cqes[i]->res;
    // dispatch…
}

// 5. Mark consumed
shim_cq_advance(ring, (uint)got);

Key Feature

Multishot Operations

Traditional I/O: 1 SQE → 1 CQE. Multishot: 1 SQE → many CQEs. The kernel keeps producing completions until an error or you cancel.

Traditional (one-shot)

Submit a recv for each read: SQE recv → CQE data, repeated for every message. Four reads means 4 SQEs, 4 submissions, 4 completions. Cost: re-arm after every read, more SQE slots consumed, more CPU cycles spent on submission.

Multishot (zerg)

Submit once, get many completions: 1 SQE (recv_multishot) produces CQE after CQE, each with IORING_CQE_F_MORE set (F_MORE = 1), until a final CQE with F_MORE = 0 signals that the multishot ended and must be re-armed. Win: 1 submission, N completions. zerg uses multishot for both accept and recv.

user_data Packing

Each SQE carries a 64-bit token so the completion handler knows what operation completed and on which socket.

64-bit user_data layout:

UdKind (bits 63-32) = 1:Accept, 2:Recv, 3:Send, 4:Cancel
File descriptor (bits 31-0) = socket fd cast to uint

PackUd(kind, fd) = ((ulong)kind << 32) | (uint)fd
Zero Copy

Provided Buffer Ring

Instead of passing a buffer with each recv, you pre-register a pool. The kernel picks one, fills it, and tells you which ID it used. You return it when done.

Buffer slab (NativeMemory): buf 0 ... buf N, 32KB each.
Buffer ring (shared with the kernel): slots id:0 ... id:N; a slot is "used" while the kernel holds its buffer.

Recv flow:
1. Kernel picks a buffer from the ring.
2. recv() fills it with data.
3. CQE.flags contains the buffer id: bid = flags >> 16.
4. CQE.res = bytes received.
5. App returns the buffer via buf_ring_add.

Example: recv(fd) picks buf 2 and fills 1,420 bytes:
CQE { user_data: [Recv|fd], res: 1420, flags: (2 << 16) | F_BUFFER | F_MORE }

Buffer return (app to ring): connection.ReturnRing(bid) feeds an MPSC queue; the reactor drains it and calls shim_buf_ring_add(ring, addr, len, bid, mask, idx) followed by shim_buf_ring_advance(1).
End to End

Full zerg Flow

Walk through the complete lifecycle: client connects, data flows in, your app responds, buffers recycle. Step through each phase one at a time.

Acceptor thread: the client's TCP SYN reaches the kernel, where the armed multishot accept completes and produces a CQE with res = the new client fd (e.g. 42) and F_MORE = 1, meaning the accept stays armed. The acceptor then picks the next reactor round-robin (next % N) and hands the fd off. Multishot accept is armed once at startup; each new connection produces a CQE automatically, and the acceptor never re-submits because F_MORE tells us the kernel will keep producing CQEs.

Features Used by zerg

Multishot Accept

A single SQE arms the kernel to produce one CQE per accepted connection indefinitely. The acceptor thread never re-arms. Each CQE contains the new client fd in cqe->res and IORING_CQE_F_MORE to indicate more will follow.

Multishot Recv + Buffer Selection

A single SQE arms recv for a connection. Each time data arrives, the kernel picks a buffer from the buf_ring, fills it, and produces a CQE with the buffer ID in the flags. Eliminates per-recv buffer allocation.

Buffer Rings (Provided Buffers)

Pre-allocated buffer pool registered with the kernel via shim_setup_buf_ring(). Buffers are added with buf_ring_add() and recycled after use. See Buffer Rings for the full lifecycle.

SINGLE_ISSUER

Tells the kernel only one thread submits to this ring. Skips SQ locking for better throughput. Matches zerg's model where each reactor is the sole submitter to its ring.

DEFER_TASKRUN

Defers kernel task_work until the next io_uring_enter() on the ring. Reduces latency spikes from interrupt-context work and makes completions arrive at predictable points for better async/await integration.

SQPOLL (Optional)

Creates a kernel thread polling the SQ continuously, eliminating the io_uring_enter() syscall. Trades a dedicated CPU core for the lowest possible submission latency.

Submit-and-Wait

zerg's reactor uses shim_submit_and_wait_timeout() — a single syscall that submits all pending SQEs AND waits for at least one CQE. One syscall instead of two.

CQE Batching

Instead of one CQE at a time, the reactor peeks a batch with shim_peek_batch_cqe() and processes all before advancing the CQ head. Amortizes the head update across completions.