C# Networking Deep Dive With io_uring — Part 5

Part 5 was going to be about integrating with Kestrel. I actually did it: the integration is done and tested, and it even came out ahead on some benchmarks. Overall, though, the performance lands close to Kestrel's stock Socket transport. So instead of a walkthrough, this turned into a rant about io_uring and the thread pool.

This story doesn't begin with io_uring. To be honest with you, I love epoll (plot twist :o), and the reason I've been experimenting with and researching io_uring for 7 months now is to understand whether it's truly a better alternative to epoll... for networking.

Now, don't get me wrong, io_uring is great. I love so many things about it. It was originally created for disk/file I/O and excels at it, but in my humble opinion it can be a mismatch for typical networking and back-end applications.

So at this point you're probably wondering what the hell is going on, and why I'm saying this when so many people treat io_uring as their Coca-Cola in the desert. For benchmark enthusiasts who want to push and squeeze numbers, io_uring is indeed fast. But that's exactly where it shines: micro-benchmarks.

io_uring is a perfect match for the reactor pattern we've been exploring in this series. It performs especially fast when the reactor is pinned to a thread for its entire lifetime, which, again, lines up perfectly with how IValueTaskSource works. I haven't shared any benchmarks with you yet, but let me tell you, Minima is fast. Frighteningly fast. I'll save the numbers for a future part.

This speed comes with a shackle, though.

io_uring's speed in this model (Minima's multi-reactor) is conditional. Two things have to hold at the same time: the reactor is the sole submitter of the ring (SINGLE_ISSUER with DEFER_TASKRUN), and the handler runs inline on the reactor thread, so the IValueTaskSource resumes without ever leaving it. Both hold only as long as the handler never leaves the reactor. But guess what? The entire .NET backend world is built on leaving it: the thread pool, async/await resuming off-thread, and of course Kestrel, whose whole model is "hand the connection to the pool." So the moment we await any real async work, the handler moves off the reactor thread, the response can no longer be submitted from that thread, and we're forced into a cross-thread handoff. And now that we're juggling multiple threads, every kind of race condition and deadlock becomes possible, and fixing them is not free.

The reactor deadlock

Putting an SQE in the ring and bumping the SQ tail does nothing by itself. The kernel only looks at the SQ when you call io_uring_enter (assuming no SQPOLL), which is an explicit syscall that, in Minima's model, only the reactor can make. The reactor's wake-up, however, is gated on completions (CQEs): the loop blocks in io_uring_enter(to_submit, min_complete=1, GETEVENTS), which submits whatever is pending and then sleeps in the kernel until at least one CQE is available.

Let's dissect:

All connections are idle (keep-alive, no in-flight requests). Every reactor is asleep inside io_uring_enter, waiting for any completion.
A handler finishes on a pool thread and needs to send a response on connection C (owned by reactor R). It produces a SEND SQE and writes it into R's ring, bumping the tail.
But it does not call io_uring_enter (single-issuer, so only R may submit). The SEND SQE now sits in the ring, unsubmitted.
R won't run again until a CQE wakes it, and the only CQE that would wake it is the completion of that SEND, which is never submitted, because R is the one that submits and R is asleep. If C was the only connection with work, no other completion is coming.

So to sum up, the pool thread is waiting for the reactor to submit its SQE, and the reactor is waiting for a completion that only that submission would produce. Each waits on the other. Deadlock.
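To make the cycle concrete, here is a toy model in C. It's only a sketch: it uses plain counters instead of a real ring, makes no io_uring calls at all, and every name in it (toy_ring, toy_enqueue, toy_enter, toy_reactor_wakes) is made up for illustration. It also assumes a submitted SEND completes immediately, which is good enough to show the ordering problem.

```c
#include <stdbool.h>

/* Toy model of one reactor's ring: plain counters, no real io_uring calls. */
struct toy_ring {
    unsigned sq_tail;      /* SQEs written into the ring (by any thread)       */
    unsigned sq_submitted; /* SQEs the kernel has been told about (reactor)    */
    unsigned cq_ready;     /* completions available to wake a sleeping reactor */
};

/* Pool thread: write an SQE and bump the tail.
   No syscall happens here, so the kernel has seen nothing yet. */
static void toy_enqueue(struct toy_ring *r) {
    r->sq_tail++;
}

/* Reactor: the io_uring_enter step. It can only run while the reactor is
   awake. Assumption of this model: each submitted SEND completes at once. */
static void toy_enter(struct toy_ring *r) {
    unsigned pending = r->sq_tail - r->sq_submitted;
    r->sq_submitted = r->sq_tail;
    r->cq_ready += pending;
}

/* A reactor asleep in the kernel wakes only when a CQE is available. */
static bool toy_reactor_wakes(const struct toy_ring *r) {
    return r->cq_ready > 0;
}
```

After toy_enqueue alone, toy_reactor_wakes stays false however long we wait: the only thing that can raise cq_ready is toy_enter, and toy_enter belongs to the thread that is asleep. Something outside the ring has to break the cycle.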

There are ways to avoid this, though. One is to use a wait syscall that accepts a timeout: if no CQE arrives, it simply times out and the reactor loop comes back around. This is zerg's solution, and it has an obvious problem: we might hit that timeout too often. If the timeout is too large, latency takes a hit, because a response enqueued right after the reactor goes to sleep can sit there for the whole timeout. If it's too small, we keep waking up for nothing when traffic is low, burning CPU: a 1 ms timeout means up to 1,000 pointless wake-ups per second on every idle reactor.

Minima's solution is more elegant: an eventfd wake. The reactor keeps a multishot poll armed on an eventfd inside its own ring. After enqueuing the work, the pool thread does a small write() to that eventfd. The write makes it readable, the armed poll turns that into a CQE, and the reactor wakes. It comes with an ironic cost though: an extra syscall. The very thing we were trying to escape from epoll.
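Here is a minimal sketch of that wake path in C (these are plain Linux syscalls, so it maps directly onto P/Invoke). A few things in it are my assumptions and not Minima's actual code: the names wake_reactor and wait_for_wake are made up, and poll() stands in for the reactor blocking in io_uring_enter with a multishot poll armed on the eventfd, so no ring is involved here.

```c
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <sys/types.h>
#include <unistd.h>

/* Pool-thread side (hypothetical name): call after queuing work for the
   reactor. Adds 1 to the eventfd counter, which makes the fd readable.
   This write() is the extra syscall. Returns 0 on success, -1 on error. */
static int wake_reactor(int efd) {
    uint64_t one = 1;
    return write(efd, &one, sizeof one) == (ssize_t)sizeof one ? 0 : -1;
}

/* Reactor side (hypothetical name). poll() is a stand-in for sleeping in
   io_uring_enter with a poll armed on efd; it is NOT what a ring does.
   Returns the number of wake signals that were coalesced, 0 on timeout,
   -1 on error. */
static long wait_for_wake(int efd, int timeout_ms) {
    struct pollfd pfd = { .fd = efd, .events = POLLIN };
    int n = poll(&pfd, 1, timeout_ms);
    if (n < 0) return -1;
    if (n == 0) return 0;

    /* Drain the counter so the fd stops being readable until the next wake. */
    uint64_t count = 0;
    if (read(efd, &count, sizeof count) != (ssize_t)sizeof count) return -1;
    return (long)count;
}
```

The eventfd would be created once per reactor with eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC). If two pool threads signal before the reactor gets around to reading, the counter just accumulates and the reactor sees a single wake with a count of 2, so one armed poll is enough no matter how many threads signal at once.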

Can SQPOLL solve this problem?

On paper, it does: a kernel thread polls the submission queue, so the pool thread's SQE gets picked up and submitted without the reactor ever calling io_uring_enter. No sleeping reactor to wake. But the irony kicks in again. The poller itself goes to sleep after its idle period, and then it needs a wake-up of its own (the kernel sets IORING_SQ_NEED_WAKEUP, and someone has to call io_uring_enter with IORING_ENTER_SQ_WAKEUP), which is the same problem all over again. On top of that, SQPOLL and DEFER_TASKRUN are mutually exclusive, so we'd surrender the very completion-batching that makes the model fast in the first place. So SQPOLL doesn't remove the wake. It relocates it, burns a kernel thread per reactor, and makes us give up DEFER_TASKRUN on the way.

We could set SQPOLL so it never sleeps, or build a more complex mechanism to wake it automatically. I might explore that in a future part, but to be honest, io_uring with SQPOLL has been subpar in my own tests, so I'd rather just go with epoll.

Now, epoll sidesteps all of this because it never separates doing the I/O from submitting it. send and recv are plain syscalls that any thread calls directly. A much cleaner model, no hacks.
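For contrast, this is roughly what "any thread calls it directly" looks like, sketched in C. The function name send_from_any_thread is made up, and the EAGAIN handling is my assumption of how an epoll-based server would typically continue (arm EPOLLOUT and finish the write later), not something taken from Kestrel or Minima.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical helper: callable from any thread, no reactor involved.
   Returns the number of bytes written, which can be short if a non-blocking
   socket's buffer fills up, or -1 on a hard error. */
static ssize_t send_from_any_thread(int fd, const void *buf, size_t len) {
    const char *p = buf;
    size_t sent = 0;

    while (sent < len) {
        /* MSG_NOSIGNAL: get EPIPE back instead of a SIGPIPE on a dead peer. */
        ssize_t n = send(fd, p + sent, len - sent, MSG_NOSIGNAL);
        if (n > 0) {
            sent += (size_t)n;
            continue;
        }
        if (n < 0 && errno == EINTR)
            continue; /* interrupted by a signal, just retry */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            break;    /* buffer full: the caller would arm EPOLLOUT here */
        return -1;
    }
    return (ssize_t)sent;
}
```

Issuing the syscall is the submission. There is no queue entry waiting for some other thread to notice it, so there is nothing to wake.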

So how much faster than epoll is io_uring, really? On a micro-benchmark where the work isn't delegated to the thread pool, it can be a tad faster, 5-10% in my own benchmarks. But on a real workload, it's at best as fast, and given everything io_uring drags along (security concerns, restricted kernel access, recent-kernel requirements, and implementation complexity), as of today it just ain't worth it, in my opinion. I might change my mind with more research.

Regardless, io_uring has its place when the handler/endpoint logic is lightweight, such as WebSockets, or the kind of mostly-synchronous workloads that reactor/event-loop servers like nginx and Redis are built for (and it's no accident that those live on an event-loop model).

For the next parts I'll keep going with Minima and explore the reactor pattern and the send branch. Even though io_uring may not be a great fit for Kestrel, it can still shine in the right kind of application.