C# Networking Deep Dive With io_uring — Part 7

As per tradition, this new part does not cover the topic the last part suggested as the continuation (Kestrel integration benchmarks).

In Part 7 we build a real-world application on top of Minima: fractal, a high-performance static file server with an ambitious goal, to be the world's fastest at its use case. The source code for fractal can be found at github.com/dotnet-web-stack/fractal.

As always, a kind reminder that all the code in this series is meant for research purposes only; it is not a product or library meant for general use. Use this work to help you create or improve your own projects, not as a drop-in replacement.

So, how do you serve a static file as fast as the machine allows?

I spent the last week researching static file serving and the mechanisms used by high-performance edge servers. It's a vast and interesting topic. For example, in C# we can use the RandomAccess API, a low-level interface that on Linux uses the pread syscall to read from the kernel page cache. We can also use sendfile, which writes from "disk" directly to the socket, bypassing user space.

Let's explore these in more detail.

pread

pread(fd, buf, count, offset) - positional read into your buffer.

Reads count bytes from offset and does not touch the file's current position, which means one open fd can be read concurrently by every thread/connection without locks.

Cost:

read file data - page cache -> our buffer (1 copy)
then to send we do another copy - our buffer -> socket (1 copy)

So total 2 syscalls and 2 copies.

It is fully synchronous/blocking; however, the page cache is typically hot, so the call is very fast and cheap.
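
To make the positional read concrete, here is a minimal sketch using RandomAccess.Read on a SafeFileHandle. The PositionalReadSketch class and its ReadAt helper are names made up for this illustration, they are not part of fractal or Minima. The loop over short reads is a defensive assumption on my side; a read served from a hot page cache normally comes back full.

```csharp
using System;
using System.IO;
using System.Text;
using Microsoft.Win32.SafeHandles;

public static class PositionalReadSketch
{
    // Hypothetical helper: reads up to buffer.Length bytes starting at `offset`.
    // The handle has no shared cursor here, so many callers can use it at once.
    public static int ReadAt(SafeFileHandle handle, Span<byte> buffer, long offset)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int n = RandomAccess.Read(handle, buffer.Slice(total), offset + total);
            if (n == 0) break;              // end of file
            total += n;
        }
        return total;
    }

    public static void Main()
    {
        string path = Path.GetTempFileName();
        File.WriteAllText(path, "hello, page cache");

        using (SafeFileHandle handle = File.OpenHandle(path, FileMode.Open, FileAccess.Read, FileShare.Read))
        {
            byte[] buffer = new byte[4];
            int n = ReadAt(handle, buffer, 7);                       // skip "hello, "
            Console.WriteLine(Encoding.ASCII.GetString(buffer, 0, n)); // prints "page"
        }

        File.Delete(path);
    }
}
```

Note that this only covers the first copy (page cache to our buffer). Sending the buffer on a socket is the second syscall and the second copy.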

sendfile

sendfile(out_fd, in_fd, offset, count) - copy fd -> fd inside the kernel.

Bypasses user space entirely.

Cost:

page cache -> socket (1 copy) [would be 0 with DMA zero copy send]

So total 1 syscall and 1 copy.

The catch is that we can't "touch" the bytes, which complicates any compression or encryption done on the user-space web framework side. Some edge servers work around this by pre-compressing files on disk and using kTLS (Kernel TLS) for HTTPS.

Just like pread it is synchronous/blocking.
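
From C#, the closest managed entry point I know of is Socket.SendFile. The sketch below sends a small file over a loopback connection and reads it back. The SendFileSketch class and RoundTrip method are made-up names for the demo, and the claim that the runtime maps this call to sendfile(2) on Linux is my understanding of the implementation rather than something this series has verified, so treat it as an assumption.

```csharp
using System;
using System.IO;
using System.Net;
using System.Net.Sockets;

public static class SendFileSketch
{
    // Hypothetical demo helper: sends the file at `path` over loopback TCP
    // and returns the bytes the other end received.
    // Single-threaded on purpose, so only use it with files small enough
    // to fit in the socket buffers, otherwise SendFile would block forever.
    public static byte[] RoundTrip(string path)
    {
        using var listener = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        listener.Bind(new IPEndPoint(IPAddress.Loopback, 0));   // port 0 = let the OS pick
        listener.Listen(1);

        using var sender = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
        sender.Connect(listener.LocalEndPoint!);
        using Socket receiver = listener.Accept();

        sender.SendFile(path);                 // file -> socket, the bytes never enter a managed buffer
        sender.Shutdown(SocketShutdown.Send);  // signal end of stream to the receiver

        using var received = new MemoryStream();
        byte[] buffer = new byte[4096];
        int n;
        while ((n = receiver.Receive(buffer)) > 0)
            received.Write(buffer, 0, n);
        return received.ToArray();
    }
}
```

The call is blocking, just like the syscall underneath, which is exactly the property that matters for the rest of this part.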

FileStream

C#'s high level Stream

Its API supports sync and async calls. I didn't dig deep into this one because its performance was, as expected, in a different ballpark, so it is not a direct comparison to what fractal aims to do. One important thing that jumped out: while it provides an async API, on Linux it is not truly async. Under the hood it is a blocking read scheduled on the thread pool (Windows can do real overlapped I/O, while Linux has no general async I/O for buffered file reads outside io_uring).

These three are what I found to be the state of the art for reading files from disk in C#.

The idea behind fractal

Let's recall the model from Parts 1-6: Minima runs one reactor per core, and each reactor owns its io_uring and its connections. Handlers are async, but every await is completed by the reactor inline (ManualResetValueTaskSourceCore with RunContinuationsAsynchronously = false). So unless the request handler awaits an external call whose continuation is scheduled on the thread pool, there is no thread hopping; everything runs inline on the reactor's thread.

Looking at the options we have so far (pread and sendfile), both are hostile to this architecture by construction: they are synchronous, so they stall the reactor thread. We need true asynchronous file I/O.

We have of course been using io_uring for network I/O, but we can use it for file I/O too, on the very same ring. Our Ring deals in file descriptors, and a file descriptor looks the same whether it refers to a file or a socket, so the reactor can handle CQEs for both socket and file data. A file read is just another tagged SQE on the reactor's ring.

Extending Minima: a third completion source

The connection already speaks two awaitable "facets", recv (IValueTaskSource<RecvSnapshot>, the one we built in Part 2) and flush (IValueTaskSource). A file read is just a third one. Same ManualResetValueTaskSourceCore, same inline-resume trick, a new signal:

public sealed unsafe partial class Connection : IValueTaskSource<int>
{
    private ManualResetValueTaskSourceCore<int> _fileSignal = new()
    {
        RunContinuationsAsynchronously = false,   // resume on the reactor thread
    };
    private int _fileArmed;

    // Stashed by ReadFileAsync, read by the reactor when it builds the SQE. Both run on
    // the reactor thread (the handler is inline), so it's just a field handoff between two
    // points on the same call stack. No queue, no lock.
    internal int   FileReadFd;
    internal byte* FileReadDst;
    internal int   FileReadLen;
    internal long  FileReadOffset;
}

First, the new completion needs a tag. Every SQE we submit carries an opaque user_data we get back on the CQE. Minima packs a kind in the high 32 bits and the client fd in the low 32. We add one constant:

private const ulong KindAccept   = 1UL << 32;
private const ulong KindRecv     = 2UL << 32;
private const ulong KindSend     = 3UL << 32;
private const ulong KindFileRead = 4UL << 32;   // new: file read into the write slab
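
As a quick illustration of that packing, here is a small standalone sketch. The UserDataTag class and the Pack, KindOf and FdOf helpers are names invented for this example; Minima's own dispatch code may unpack the value differently.

```csharp
using System;

public static class UserDataTag
{
    public const ulong KindRecv     = 2UL << 32;
    public const ulong KindFileRead = 4UL << 32;

    private const ulong KindMask = 0xFFFF_FFFF_0000_0000UL;   // high 32 bits

    // kind goes in the high half, the (non-negative) client fd in the low half.
    public static ulong Pack(ulong kind, int fd) => kind | (uint)fd;

    public static ulong KindOf(ulong userData) => userData & KindMask;

    public static int FdOf(ulong userData) => unchecked((int)(uint)userData);

    public static void Main()
    {
        ulong tag = Pack(KindFileRead, 42);
        Console.WriteLine(KindOf(tag) == KindFileRead);  // True
        Console.WriteLine(FdOf(tag));                    // 42
    }
}
```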

The awaitable. ReadFileAsync is what the handler calls. It mirrors ReadAsync/FlushAsync: guard, arm, stash, submit, then return a ValueTask bound to this:

public ValueTask<int> ReadFileAsync(int fileFd, int len, long fileOffset)
{
    if (Volatile.Read(ref _closed) == 1)
        return new ValueTask<int>(0);                 // connection already gone

    if (WriteTail + len > _writeSlabSize)
        throw new InvalidOperationException("Write slab too small for file read.");

    if (Interlocked.Exchange(ref _fileArmed, 1) == 1)
        throw new InvalidOperationException("ReadFileAsync already armed.");

    _fileSignal.Reset(); // Reset it so that it can be awaited again.

    FileReadFd     = fileFd;
    FileReadDst    = WriteBuffer + WriteTail;          // land bytes straight into the slab
    FileReadLen    = len;
    FileReadOffset = fileOffset;

    int gen = Volatile.Read(ref _generation);          // pooled-connection stale-token guard
    _reactor.SubmitFileRead(ClientFd);

    // Close-race recovery (same as FlushAsync): if a close slipped in after the guard,
    // self-complete so we don't hang on a read the reactor will never make.
    if (Volatile.Read(ref _closed) == 1 && Interlocked.Exchange(ref _fileArmed, 0) == 1)
        _fileSignal.SetResult(0);

    return new ValueTask<int>(this, (short)gen);
}

Two things to notice. FileReadDst = WriteBuffer + WriteTail points the read into the write slab, right after whatever the handler already wrote (our headers) so the body lands contiguous with the headers and goes out in one send later. And the ValueTask is bound to the connection itself with the generation as its token, so a stale await left over from a previous pooled life resolves to a no-op instead of leaking into the new tenant.

The submit. The reactor turns the stashed request into a single SQE. This is the whole "file I/O on the network ring" idea, in a handful of lines:

internal void SubmitFileRead(int fd)
{
    if (!Connections.TryGetValue(fd, out var conn)) return;

    IoUringSqe* sqe = GetSqeOrFlush();
    Unsafe.InitBlockUnaligned(sqe, 0, 64);
    sqe->opcode    = IORING_OP_READ;
    sqe->fd        = conn.FileReadFd;            // the *file* fd we read from
    sqe->addr      = (ulong)conn.FileReadDst;    // into the slab
    sqe->len       = (uint)conn.FileReadLen;
    sqe->off       = (ulong)conn.FileReadOffset; // positional, like pread
    sqe->user_data = KindFileRead | (uint)fd;    // tag + the *client* fd
}

The detail that makes it work: sqe->fd is the file fd (we're reading a file), but user_data carries the client fd. The read pulls from the file, while the completion has to tell the reactor whose handler to wake. Two different descriptors, one SQE. Note also that GetSqeOrFlush only grabs a slot in the ring; the read isn't submitted to the kernel here. The handler is still running inside the reactor's dispatch of the recv that triggered it, so the SQE goes out on the reactor's next io_uring_enter, alongside everything else it batched this turn.

The completion. Now the read is just another CQE the reactor reaps. One new branch in the dispatch loop, right next to recv and send:

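else if (kind == KindFileRead)
{
    if (Connections.TryGetValue(fd, out var conn))
        conn.CompleteFileRead(cqe.res);   // cqe.res = bytes read, or -errno on failure
}

internal void CompleteFileRead(int res)
{
    // Only complete if the read is still armed, so a close-race self-completion
    // in ReadFileAsync can never be followed by a second SetResult.
    if (Interlocked.Exchange(ref _fileArmed, 0) == 1)
        _fileSignal.SetResult(res);       // <- the inline resume happens here
}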

That SetResult is the payoff. Because _fileSignal has RunContinuationsAsynchronously = false, calling it doesn't schedule the handler, it runs the handler's continuation inline, on the reactor thread, on the reactor's own call stack, right inside this CQE's dispatch. No thread-pool hop, no scheduler, exactly as in Part 2, only now the thing we waited for came off the disk instead of the wire.
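
If you want to see the inline resume in isolation, the sketch below strips it down to a console program with no ring and no sockets. InlineSignal and InlineResumeDemo are names invented for this demo, not Minima types, and the "reactor" here is simply the thread that calls Complete. I use ConfigureAwait(false) so that no captured context can redirect the continuation.

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Sources;

// Hypothetical stand-in for the connection's file facet.
public sealed class InlineSignal : IValueTaskSource<int>
{
    private ManualResetValueTaskSourceCore<int> _core = new()
    {
        RunContinuationsAsynchronously = false,   // resume on the completing thread
    };

    public ValueTask<int> WaitAsync()
    {
        _core.Reset();
        return new ValueTask<int>(this, _core.Version);
    }

    public void Complete(int value) => _core.SetResult(value);

    int IValueTaskSource<int>.GetResult(short token) => _core.GetResult(token);
    ValueTaskSourceStatus IValueTaskSource<int>.GetStatus(short token) => _core.GetStatus(token);
    void IValueTaskSource<int>.OnCompleted(Action<object?> c, object? s, short token, ValueTaskSourceOnCompletedFlags f)
        => _core.OnCompleted(c, s, token, f);
}

public static class InlineResumeDemo
{
    public static int ResumedOnThread;

    public static async Task<int> HandlerAsync(InlineSignal signal)
    {
        int n = await signal.WaitAsync().ConfigureAwait(false);
        ResumedOnThread = Environment.CurrentManagedThreadId;   // recorded by the continuation
        return n;
    }

    public static void Main()
    {
        var signal = new InlineSignal();
        Task<int> handler = HandlerAsync(signal);     // runs up to the await, then returns to us

        int completingThread = Environment.CurrentManagedThreadId;
        signal.Complete(42);                          // the continuation runs inside this call

        Console.WriteLine(handler.IsCompleted);                    // True
        Console.WriteLine(ResumedOnThread == completingThread);    // True
    }
}
```

By the time Complete returns, the handler has already finished on the caller's thread, which is the same thing that happens inside the reactor's CQE dispatch.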

The IValueTaskSource<int> plumbing is the same three methods the runtime needs that we wrote for recv, just backed by _fileSignal and gated by the generation token:

int IValueTaskSource<int>.GetResult(short token)
    => token == (short)Volatile.Read(ref _generation)
        ? _fileSignal.GetResult(_fileSignal.Version)
        : 0;                                    // stale awaiter from a recycled connection

ValueTaskSourceStatus IValueTaskSource<int>.GetStatus(short token)
    => token == (short)Volatile.Read(ref _generation)
        ? _fileSignal.GetStatus(_fileSignal.Version)
        : ValueTaskSourceStatus.Succeeded;

void IValueTaskSource<int>.OnCompleted(Action<object?> c, object? s, short token, ValueTaskSourceOnCompletedFlags f)
{
    if (token != (short)Volatile.Read(ref _generation)) { c(s); return; }
    _fileSignal.OnCompleted(c, s, _fileSignal.Version, f);
}

And that's the entire extension. The handler now reads a file the same way it reads a socket:

conn.Write(headers);                                  // response header -> slab
int n = await conn.ReadFileAsync(asset.Fd, budget, 0); // file -> slab, async, inline
conn.Advance(n);                                      // n = bytes the read returned
await conn.FlushAsync();                              // slab -> socket

No new threads, no new queue, no second ring. We added a new file tag, an SQE, a dispatch branch, and one more IValueTaskSource. Because the read is both submitted and completed by the reactor, it can never trigger the deadlock we documented in Part 5: it physically can't leave the core. That's true async file I/O, in a model that already knew how to wait without blocking.

The asset cache and the page cache

On the hot path fractal does not go to disk for a file; it reads from the kernel page cache, and it opens every file exactly once at startup. That is the job of the AssetCache: it walks the wwwroot, opens each file read-only, and keeps the handle around for the life of the process.

SafeFileHandle handle = File.OpenHandle(path, FileMode.Open, FileAccess.Read, FileShare.Read);
handles.Add(handle);                            // keep it alive for the cache's lifetime

int fd = (int)handle.DangerousGetHandle();      // the raw fd io_uring needs, no length recorded

io_uring works with raw file descriptors, so we need the int fd, not the managed handle. A SafeFileHandle owns that fd and would close it once the handle is finalized, so we hold every handle in a list for the cache's lifetime and only then hand the raw fd from DangerousGetHandle to the reactor. As long as the handle is alive the fd is valid, and one open fd per file is the price (mind RLIMIT_NOFILE on a huge tree). The reads are positional, so the same fd is read concurrently by every connection on every reactor, with no locks and no seeking.
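
Put together, the open-once idea can be sketched like this. AssetCacheSketch, TryGetFd and the URL-path key format are assumptions made for the illustration; the real AssetCache in the repository also bakes header prefixes and is not reproduced here.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using Microsoft.Win32.SafeHandles;

public sealed class AssetCacheSketch : IDisposable
{
    // Holding the handles is what keeps every raw fd valid.
    private readonly List<SafeFileHandle> _handles = new();
    private readonly Dictionary<string, int> _fds = new(StringComparer.Ordinal);

    public AssetCacheSketch(string root)
    {
        foreach (string path in Directory.EnumerateFiles(root, "*", SearchOption.AllDirectories))
        {
            SafeFileHandle handle = File.OpenHandle(path, FileMode.Open, FileAccess.Read, FileShare.Read);
            _handles.Add(handle);

            // Hypothetical key format: "/css/site.css" relative to the root.
            string urlPath = "/" + Path.GetRelativePath(root, path).Replace(Path.DirectorySeparatorChar, '/');
            _fds[urlPath] = (int)handle.DangerousGetHandle();
        }
    }

    public bool TryGetFd(string urlPath, out int fd) => _fds.TryGetValue(urlPath, out fd);

    public void Dispose()
    {
        foreach (SafeFileHandle handle in _handles)
            handle.Dispose();                 // this is the point where the fds are closed
        _handles.Clear();
        _fds.Clear();
    }
}
```

A lookup on the hot path is then a dictionary hit and nothing else: no open, no stat, no allocation.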

Now the page cache. It is not per fd, it is per inode. The kernel caches a file's pages once under its inode and every fd pointing at that inode shares them. The first read faults the pages in from disk, every read after that is served straight from memory, which is why a warm read completes almost instantly and the whole benchmark runs at memory speed. fractal keeps no copy of the file itself, the page cache is the copy.

Which raises the obvious question: what happens when a file changes? If something writes the file in place, the writes land on the same pages under the same inode, so our next read sees the new bytes and the cache stays coherent for free. The catch is that almost nobody edits in place. Editors and deploy scripts write a new file and rename it over the old one, which is an atomic swap to a brand new inode. Our held fd still points at the old inode, and because we are holding it open that inode and its pages cannot be freed, so fractal keeps serving the old content until the process reopens the files. AssetCache is a startup snapshot of the tree. That is the tradeoff of pre-opening every fd: we never pay an open or a stat on the hot path, and the cost is that the snapshot has to be told when to refresh.

So we give it a reload. fractal rebuilds the whole snapshot off the hot path, opens the new fds and re-bakes the header prefixes, then swaps it in with a single atomic reference write. Because the fd and its header prefix live together in one immutable record, a reactor reading the snapshot sees either the whole old version or the whole new one, never a half and half. A request that already resolved keeps the snapshot it resolved against and finishes cleanly, while the next request picks up the new one.
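
The swap itself can be sketched in a few lines. Asset, SnapshotHolder, Current and Swap are hypothetical names for this illustration, and the dictionary-based snapshot is an assumption about shape, not fractal's actual data structure.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// The fd and its baked header prefix travel together in one immutable record.
public sealed record Asset(int Fd, byte[] HeaderPrefix);

public sealed class SnapshotHolder
{
    private IReadOnlyDictionary<string, Asset> _current;

    public SnapshotHolder(IReadOnlyDictionary<string, Asset> initial) => _current = initial;

    // Reactors read the reference once per request and keep using what they got.
    public IReadOnlyDictionary<string, Asset> Current => Volatile.Read(ref _current);

    // The reload path publishes a fully built snapshot and gets the old one back,
    // so it can close the old fds after a grace period.
    public IReadOnlyDictionary<string, Asset> Swap(IReadOnlyDictionary<string, Asset> next)
        => Interlocked.Exchange(ref _current, next);
}
```

A handler would do `var snapshot = holder.Current;` at the start of a request and resolve the asset against that local, which is what lets an in-flight request finish on the old version while new requests see the new one.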

The old fds are closed a few seconds later rather than right away. The kernel takes its own reference to the file once it has picked up the read's SQE, so a read already in flight is safe even if we close the fd underneath it; the grace period just covers the small gap between a handler resolving the snapshot and its read actually being submitted. The trigger is a signal, the same model nginx uses: you deploy your files, send SIGHUP, and fractal reloads without missing a request. Watching the directory with inotify would make it automatic, with the same machinery underneath.

There is a second kind of staleness, the length. The easy way to send a Content-Length is to measure every file at startup and bake it into the header, but a file edited in place can change size, and through the page cache that edit is already live, so a length measured at startup can quietly stop matching the bytes we are about to send. fractal never measures up front. It bakes only the header prefix, everything except the length. On each request it lays down that prefix, reserves a fixed-width gap where the length line will go, and reads the file into the slab right after the gap. The length line itself is written only after the read returns, once we know what it should say.

conn.Write(prefix);                              // fills the slab, nothing is sent yet
int lineAt = WriteTail;                          // where the Content-Length line will go
WriteTail += ClLineWidth;                        // reserve a gap for the last header line, left blank
int headerLen = WriteTail;                       // prefix + reserved gap

int budget = slab.Length - headerLen;
int n = await conn.ReadFileAsync(asset.Fd, budget, 0);   // read file bytes into the slab, still no send
// n < 0 would be -errno from the CQE; error handling is left out of this sketch
if (n < budget)                                  // the whole file fit, so now we know its length
{
    WriteContentLength(slab.Slice(lineAt, ClLineWidth), n);  // only now do we write the length line
    conn.Advance(n);                             // commit the body bytes, as in the earlier handler
    await conn.FlushAsync();                     // first and only send: prefix + length + body
    return;
}
await SendChunkedAsync(conn, asset, prefix);     // too big, gap never filled, stream chunked instead

The io_uring read result is the exact byte count we just pulled from the page cache, so that is the number we write into the gap, no extra fstat, the count falls out of the read we were already doing. The length is computed from the bytes we are about to ship, never from something we wrote down earlier, so the two cannot disagree.
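
Filling a fixed-width gap after the fact needs some form of padding. One way to do it, shown below, is to right-align the digits and pad with spaces after the colon, relying on HTTP allowing optional whitespace around a field value. That padding scheme, the 28-byte width and the ContentLengthGap class are all assumptions for this sketch; the post does not say how fractal's own WriteContentLength lays the line out.

```csharp
using System;
using System.Buffers.Text;
using System.Text;

public static class ContentLengthGap
{
    private static readonly byte[] Name = Encoding.ASCII.GetBytes("Content-Length: ");
    private const int DigitWidth = 10;                    // enough for int.MaxValue

    // Hypothetical width: 16 (name) + 10 (digits) + 2 (CRLF) = 28 bytes.
    public const int LineWidth = 28;

    // Writes "Content-Length: <spaces><digits>\r\n" into a gap of exactly LineWidth bytes.
    public static void Write(Span<byte> gap, int length)
    {
        if (gap.Length != LineWidth) throw new ArgumentException("gap must be LineWidth bytes", nameof(gap));
        if (length < 0) throw new ArgumentOutOfRangeException(nameof(length));

        Name.AsSpan().CopyTo(gap);

        Span<byte> field = gap.Slice(Name.Length, DigitWidth);
        field.Fill((byte)' ');                            // padding, ignored by the parser

        Span<byte> digits = stackalloc byte[DigitWidth];
        Utf8Formatter.TryFormat(length, digits, out int written);
        digits.Slice(0, written).CopyTo(field.Slice(DigitWidth - written));   // right-align

        gap[LineWidth - 2] = (byte)'\r';
        gap[LineWidth - 1] = (byte)'\n';
    }
}
```

Because the line always occupies the same number of bytes, the body can be read into the slab right after the gap before the length is known, and nothing has to move afterwards.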

That leaves one case, a file too big for the slab. We size the read at the slab minus the header, so a read that comes back full might have more behind it and we cannot know its total length. This is where reserving rather than writing pays off. Every write so far only filled the slab, the flush is the only thing that sends, and we have not flushed, so nothing is committed. The gap we left for the length line stays blank, we rewind the slab to the front, write a Transfer-Encoding: chunked header, and stream the file from there in slab sized frames, each one prefixed with its own size, until a short read marks the end. The client only ever sees the chunked header. Small files, almost everything a static server sees, stay on the single read single send path with a real Content-Length. Big files stream. Neither one trusts a stored size.
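
For reference, the chunked framing itself is tiny: a hexadecimal size line, the payload, a CRLF, and a zero-size chunk to finish. The sketch below writes to a Stream to stay self-contained; ChunkedFraming, WriteChunk and WriteEnd are illustrative names, and fractal's SendChunkedAsync writes into the slab instead and is not shown in this post.

```csharp
using System;
using System.IO;
using System.Text;

public static class ChunkedFraming
{
    private static readonly byte[] Crlf = Encoding.ASCII.GetBytes("\r\n");
    private static readonly byte[] End  = Encoding.ASCII.GetBytes("0\r\n\r\n");

    // One frame: "<size in hex>\r\n<payload>\r\n".
    public static void WriteChunk(Stream output, ReadOnlySpan<byte> payload)
    {
        if (payload.IsEmpty) return;       // a zero-size chunk would end the body early

        byte[] sizeLine = Encoding.ASCII.GetBytes(payload.Length.ToString("x") + "\r\n");
        output.Write(sizeLine);
        output.Write(payload);
        output.Write(Crlf);
    }

    // The terminating zero-size chunk, with no trailers.
    public static void WriteEnd(Stream output) => output.Write(End);
}
```

In the streaming loop each slab-sized read becomes one WriteChunk, and the short read that marks end of file is followed by WriteEnd.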

We have now covered the base mechanism behind fractal. This is still just the bones of it, though: we also need logic to parse HTTP requests and to deal with pre-compressed files and fragmented requests. All of this already exists in the fractal repository but won't be covered in this Part 7 (or any future part). The HTTP parsing is handled by a third-party library, Glyph11.

Let's benchmark this.

Benchmarks

These values are not "scientific", just a ballpark estimate.

All four servers ran on the same box, over loopback, with a warm page cache, one server at a time so they never fight for cores. Same wrk, same file, same everything.

Hardware

Intel Core i9-14900K
64GB DDR5 6400MHz
Linux Kernel 6.17.0-22-generic

Tests are done through localhost loopback (no NIC influence)

Setup

wrk -c 512 -t18 -d5s http://localhost:<port>/10kb.html

10kb.html is 9170 bytes and wrk sends no Accept-Encoding, so everyone serves the same identity body with no compression. fractal runs 12 reactors, nginx 12 workers, ASP.NET with DOTNET_PROCESSOR_COUNT=12 (the nearest knob it has), so it's roughly a 12-core budget across the board. nginx is the tuned config from the repo's nginx-bench (sendfile, reuseport, open_file_cache, access_log off, keepalive cranked up), not stock, so it's a fair fight and not a strawman.

Server	Req/s	Transfer/s	Avg latency
fractal	2.34M	20.3GB/s	233us
fractal (PipeReader/PipeWriter)	2.05M	17.7GB/s	295us
nginx (tuned)	1.61M	14.1GB/s	309us
ASP.NET Core MapStaticAssets	508K	4.47GB/s	0.99ms

fractal lands first. It pushes about 1.45x the throughput of tuned nginx and around 4.6x ASP.NET's MapStaticAssets, and it does it at the lowest average latency of the four. Even the PipeReader/PipeWriter variant, which trades a little speed for the convenience of the standard pipe API, still sits about 27% above nginx. The raw handler beats the pipe one by roughly 14%, and that gap is the cost of the pipe wrappers, the held-buffer bookkeeping and sequence building and extra virtual calls that the raw path simply doesn't pay.

The big caveat is that this is loopback. 20GB/s is about 160 Gbps, which is the loopback memcpy path, not a wire. Real NICs top out far below that, so these numbers are an upper bound on what the server can push, not what you would deliver over a real network, where the NIC becomes the bottleneck long before the server does and everyone converges near line rate. The point isn't "fractal does 160 Gbps", it's that fractal gets out of its own way. It spends less CPU per request than the alternatives, so it has more headroom before anything else becomes the limit.
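
If you want to re-derive the ratios and the bandwidth conversion from the table yourself, the arithmetic is short enough to script. BenchMath and its helpers are names made up for this snippet, and the inputs are just the rounded figures from the table above.

```csharp
using System;
using System.Globalization;

public static class BenchMath
{
    public static double Ratio(double a, double b) => a / b;

    // 1 byte = 8 bits, so GB/s * 8 = Gbps.
    public static double GigabytesToGigabits(double gigabytesPerSecond) => gigabytesPerSecond * 8;

    public static void Main()
    {
        const double fractal = 2.34e6, fractalPipe = 2.05e6, nginx = 1.61e6, aspnet = 508e3;
        var inv = CultureInfo.InvariantCulture;

        Console.WriteLine("fractal / nginx:        " + Ratio(fractal, nginx).ToString("F2", inv));
        Console.WriteLine("fractal / ASP.NET:      " + Ratio(fractal, aspnet).ToString("F2", inv));
        Console.WriteLine("fractal pipe / nginx:   " + Ratio(fractalPipe, nginx).ToString("F2", inv));
        Console.WriteLine("fractal / fractal pipe: " + Ratio(fractal, fractalPipe).ToString("F2", inv));
        Console.WriteLine("20 GB/s in Gbps:        " + GigabytesToGigabits(20).ToString("F0", inv));
    }
}
```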

One more caveat, this time in nginx's favor. nginx serves with sendfile, and on a real NIC that can be a true zero copy send, the card pulls the file's pages straight out of the page cache by DMA with no CPU copy at all. Over loopback there is no NIC and no DMA to a card, so that path never really kicks in and the bytes get copied around in memory like everyone else's. So we are not seeing nginx at its absolute best here, on a real network with a capable NIC sendfile's zero copy could claw back some of this gap.

As always, measure your own workload before drawing conclusions about it.