C# Networking Deep Dive With io_uring — Part 6

For part 6, let's run some benchmarks.

These numbers are not "scientific"; treat them as ballpark estimates.

What is going to be benchmarked

io_uring read+write with IVTS reactor inline continuations (RunAsynchronousContinuation = false)
io_uring read+write without IVTS reactor inline continuations (threadpool) (RunAsynchronousContinuation = true)
io_uring read + libc send write without IVTS reactor inline continuations (threadpool) (RunAsynchronousContinuation = true)
epoll read+write with IVTS reactor inline continuations
epoll read+write without IVTS reactor inline continuations
System.Net.Socket (Kestrel stock) - epoll threadpool
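
For readers who want to see what the flag in parentheses changes, here is a minimal sketch. It assumes the series' IVTS is built on the BCL's ManualResetValueTaskSourceCore<T>, whose property is named RunContinuationsAsynchronously; RecvSource and ContinuationDemo are hypothetical names used only for illustration, not the series' actual types.

```csharp
#nullable enable
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Sources;

// Hypothetical reusable IValueTaskSource (IVTS) for a "bytes received" event.
public sealed class RecvSource : IValueTaskSource<int>
{
    // Mutable struct: must not be a readonly field.
    private ManualResetValueTaskSourceCore<int> _core;

    public RecvSource(bool runContinuationsAsynchronously)
        => _core.RunContinuationsAsynchronously = runContinuationsAsynchronously;

    // Handed to the request handler, which awaits it.
    public ValueTask<int> WaitAsync() => new ValueTask<int>(this, _core.Version);

    // Called by the reactor thread when a completion arrives.
    // false: the awaiting handler resumes right here, on the reactor thread.
    // true:  the awaiting handler is queued to the threadpool instead.
    public void Complete(int bytes) => _core.SetResult(bytes);

    public int GetResult(short token)
    {
        try { return _core.GetResult(token); }
        finally { _core.Reset(); } // make the source reusable for the next wait
    }

    public ValueTaskSourceStatus GetStatus(short token) => _core.GetStatus(token);

    public void OnCompleted(Action<object?> continuation, object? state, short token,
                            ValueTaskSourceOnCompletedFlags flags)
        => _core.OnCompleted(continuation, state, token, flags);
}

public static class ContinuationDemo
{
    // Returns the managed thread id on which the code after the await ran.
    public static async Task<int> AwaitAndReportThread(RecvSource source)
    {
        await source.WaitAsync().ConfigureAwait(false);
        return Environment.CurrentManagedThreadId;
    }
}
```

With the flag set to false, the thread that calls Complete is the one that runs the handler's continuation, which is what "reactor inline" means in the list above. With true, Complete returns quickly and the continuation is picked up by a threadpool thread.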

Tests

(No pipelining)

Synchronous lightweight plaintext "OK" response.
Asynchronous workload: _ = await Task.Run(static () => JsonSerializer.Serialize("Hello World!"));

The purpose of the async workload is to force the continuation onto the threadpool, not to model a heavy async workload.
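
As a rough sketch, the two workloads could look like the handlers below. The handler shape (returning the response bytes as a ValueTask) and the exact response bytes are assumptions made for illustration; only the Task.Run line is taken from the test description above.

```csharp
using System;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

// Hypothetical handler shapes, not the series' actual API.
public static class BenchmarkHandlers
{
    // Assumed minimal plaintext "OK" response (40 bytes).
    private static readonly ReadOnlyMemory<byte> OkResponse =
        Encoding.ASCII.GetBytes("HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nOK");

    // Sync workload: the ValueTask is already completed when it is returned,
    // so the caller never suspends and no continuation is scheduled.
    public static ValueTask<ReadOnlyMemory<byte>> HandleSync()
        => new ValueTask<ReadOnlyMemory<byte>>(OkResponse);

    // Async workload: Task.Run moves the (tiny) serialization to the threadpool,
    // so the code after the await resumes on a threadpool thread.
    public static async ValueTask<ReadOnlyMemory<byte>> HandleAsync()
    {
        _ = await Task.Run(static () => JsonSerializer.Serialize("Hello World!"));
        return OkResponse;
    }
}
```

The serialization result is discarded on purpose: the work itself is negligible, and the interesting part is the thread hop it introduces.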

Hardware

i9 14900k
64GB DDR5 6400MHz
Linux Kernel 6.17.0-22-generic

Tests are done through localhost loopback (no NIC influence)
MTU 1500

Load generators

HTTP/1.1, no TLS

wrk (epoll)
gcannon (io_uring)

io_uring configuration

All io_uring variants share the same ring setup:

SINGLE_ISSUER + DEFER_TASKRUN
Multishot accept and multishot recv (send is one-shot per response)
No zero copy receive, no zero copy send
Incremental buffer consumption disabled (enabling it gave similar performance in this specific benchmark)
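
For reference, the flags named in this list map to the following constants from the kernel's io_uring.h. The C# type and member names are made up for this sketch; only the numeric values and the IORING_* names in the comments come from the kernel header.

```csharp
using System;

// Flags passed in io_uring_params.flags at ring creation.
[Flags]
public enum IoUringSetupFlags : uint
{
    SingleIssuer = 1u << 12, // IORING_SETUP_SINGLE_ISSUER: only one thread submits to the ring
    DeferTaskrun = 1u << 13, // IORING_SETUP_DEFER_TASKRUN: completion work runs when the
                             // ring owner enters the kernel to wait (requires SINGLE_ISSUER)
}

// Per-request multishot flags, written to the SQE's ioprio field.
public static class IoUringMultishot
{
    public const ushort Accept = 1 << 0; // IORING_ACCEPT_MULTISHOT
    public const ushort Recv = 1 << 1;   // IORING_RECV_MULTISHOT (needs provided buffers)
}

public static class RingConfig
{
    // The combination shared by all io_uring variants in this post.
    public const IoUringSetupFlags SetupFlags =
        IoUringSetupFlags.SingleIssuer | IoUringSetupFlags.DeferTaskrun;
}
```

Both setup flags fit the one-ring-per-reactor design: each reactor thread is the only submitter on its ring, so the kernel can skip cross-thread synchronization and defer completion work until that thread asks for it.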

io_uring read+write with IVTS reactor inline continuations

This is the exact model explored throughout the series. It is expected to deliver high performance on the synchronous test.

Reactor count: 12

Sync workload

Metric	wrk	gcannon
Latency Avg	121.45us	129us
Latency Stdev	178.81us	—
Latency Max	8.32ms	—
Latency p50	—	125us
Latency p90	—	185us
Latency p99	—	245us
Latency p99.9	—	317us
Req/Sec Avg	3.59M	3.95M
Requests Total	18,299,278	19,735,722
Duration	5.10s	5.00s
Transfer/Bandwidth	225.84MB/s	248.42MB/s

Async workload (very unstable)

Metric	wrk	gcannon
Latency Avg	435.74us	185us
Latency Stdev	795.84us	—
Latency Max	12.73ms	—
Latency p50	—	135us
Latency p90	—	229us
Latency p99	—	1.84ms
Latency p99.9	—	4.10ms
Req/Sec Avg	2.53M	2.76M
Requests Total	12,883,294	13,797,048
Duration	5.10s	5.00s
Transfer/Bandwidth	159.05MB/s	173.67MB/s

io_uring read+write without IVTS reactor inline continuations

The same model explored throughout the series, but with RunAsynchronousContinuation set to true on both IVTS. It is expected to deliver similar results on both tests.

Reactor count: 12

Sync workload

Metric	wrk	gcannon
Latency Avg	515.72us	211us
Latency Stdev	821.99us	—
Latency Max	12.67ms	—
Latency p50	—	164us
Latency p90	—	273us
Latency p99	—	1.55ms
Latency p99.9	—	3.79ms
Req/Sec Avg	1.95M	2.41M
Requests Total	9,946,282	12,080,236
Duration	5.10s	5.00s
Transfer/Bandwidth	122.80MB/s	151.97MB/s

Async workload

Metric	wrk	gcannon
Latency Avg	530.17us	213us
Latency Stdev	842.05us	—
Latency Max	13.37ms	—
Latency p50	—	146us
Latency p90	—	265us
Latency p99	—	2.27ms
Latency p99.9	—	4.38ms
Req/Sec Avg	1.93M	2.39M
Requests Total	9,726,083	11,952,675
Duration	5.03s	5.00s
Transfer/Bandwidth	121.82MB/s	150.45MB/s

io_uring read + libc send write without IVTS reactor inline continuations

The same model explored throughout the series, but with RunAsynchronousContinuation set to true on both IVTS and a write path that bypasses io_uring and calls libc's send instead. It is expected to deliver similar results on both tests. This is a hybrid approach and should land in the middle ground between the first two models.
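
A minimal sketch of what calling libc's send from C# could look like on Linux. The LibC class and its Send helper are hypothetical names, and the library name "libc" is an assumption about how the runtime resolves it on the target system (on glibc systems "libc.so.6" is an alternative); the send signature and MSG_NOSIGNAL value are the standard Linux ones.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical wrapper, not the series' actual write path.
public static class LibC
{
    // Linux value: do not raise SIGPIPE if the peer has closed the connection.
    public const int MSG_NOSIGNAL = 0x4000;

    // ssize_t send(int sockfd, const void *buf, size_t len, int flags);
    [DllImport("libc", SetLastError = true)]
    private static extern nint send(int sockfd, ref byte buf, nuint len, int flags);

    // Returns the number of bytes written, or -1 on error.
    // On -1, Marshal.GetLastPInvokeError() holds errno (for example EAGAIN on a
    // non-blocking socket whose send buffer is full).
    public static nint Send(int fd, ReadOnlySpan<byte> data)
        => send(fd, ref MemoryMarshal.GetReference(data), (nuint)data.Length, MSG_NOSIGNAL);
}
```

Because the call is a plain synchronous syscall, it can be issued directly from the threadpool thread that ran the handler, without handing the write back to the reactor that owns the ring.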

Reactor count: 12

Sync workload

Metric	wrk	gcannon
Latency Avg	410.23us	154us
Latency Stdev	782.03us	—
Latency Max	12.08ms	—
Latency p50	—	84us
Latency p90	—	176us
Latency p99	—	2.68ms
Latency p99.9	—	4.32ms
Req/Sec Avg	2.82M	3.31M
Requests Total	14,361,239	16,551,871
Duration	5.10s	5.00s
Transfer/Bandwidth	0.88GB read (total)	208.27MB/s

Async workload

Metric	wrk	gcannon
Latency Avg	418.96us	159us
Latency Stdev	824.32us	—
Latency Max	17.51ms	—
Latency p50	—	85us
Latency p90	—	198us
Latency p99	—	1.99ms
Latency p99.9	—	4.41ms
Req/Sec Avg	2.74M	3.20M
Requests Total	13,955,371	15,997,491
Duration	5.09s	5.00s
Transfer/Bandwidth	172.59MB/s	201.18MB/s

epoll read+write with IVTS reactor inline continuations

Pure epoll approach with the same reactor threading architecture. Inline handler continuation for both IVTS.

Reactor count: 12

Sync workload

Metric	wrk	gcannon
Latency Avg	284.42us	160us
Latency Stdev	610.90us	—
Latency Max	11.06ms	—
Latency p50	—	86us
Latency p90	—	194us
Latency p99	—	2.07ms
Latency p99.9	—	4.39ms
Req/Sec Avg	3.36M	3.17M
Requests Total	17,141,225	15,856,691
Duration	5.10s	5.00s
Transfer/Bandwidth	403.61MB/s	199.56MB/s

Async workload

Metric	wrk	gcannon
Latency Avg	458.63us	159us
Latency Stdev	0.90ms	—
Latency Max	15.96ms	—
Latency p50	—	74us
Latency p90	—	185us
Latency p99	—	2.68ms
Latency p99.9	—	5.32ms
Req/Sec Avg	2.68M	3.08M
Requests Total	13,670,697	15,386,279
Duration	5.10s	5.00s
Transfer/Bandwidth	322.12MB/s	369.72MB/s

epoll read+write without IVTS reactor inline continuations

Pure epoll approach with the same reactor threading architecture. Threadpool handler continuation for both IVTS.

Reactor count: 6

Sync workload

Metric	wrk	gcannon
Latency Avg	391.31us	140us
Latency Stdev	764.42us	—
Latency Max	13.71ms	—
Latency p50	—	96us
Latency p90	—	150us
Latency p99	—	2.06ms
Latency p99.9	—	4.15ms
Req/Sec Avg	2.98M	3.60M
Requests Total	15,179,066	18,019,801
Duration	5.10s	5.00s
Transfer/Bandwidth	357.60MB/s	432.83MB/s

Async workload

Metric	wrk	gcannon
Latency Avg	464.15us	154us
Latency Stdev	838.78us	—
Latency Max	10.74ms	—
Latency p50	—	96us
Latency p90	—	154us
Latency p99	—	2.22ms
Latency p99.9	—	4.48ms
Req/Sec Avg	2.79M	3.27M
Requests Total	14,231,176	16,342,325
Duration	5.10s	5.00s
Transfer/Bandwidth	236.89MB/s	277.35MB/s

System.Net.Socket (Kestrel stock) - epoll threadpool

Kestrel's stock network I/O with some tuning.

Sync workload

Metric	wrk	gcannon
Latency Avg	156.79us	141us
Latency Stdev	342.31us	—
Latency Max	6.98ms	—
Latency p50	—	129us
Latency p90	—	176us
Latency p99	—	305us
Latency p99.9	—	3.17ms
Req/Sec Avg	3.09M	3.60M
Requests Total	15,748,223	18,024,579
Duration	5.10s	5.00s
Transfer/Bandwidth	194.39MB/s	226.84MB/s

Async workload

Metric	wrk	gcannon
Latency Avg	255.07us	169us
Latency Stdev	507.29us	—
Latency Max	12.53ms	—
Latency p50	—	123us
Latency p90	—	237us
Latency p99	—	1.25ms
Latency p99.9	—	3.89ms
Req/Sec Avg	2.67M	3.01M
Requests Total	13,618,906	15,043,820
Duration	5.10s	5.00s
Transfer/Bandwidth	168.14MB/s	189.25MB/s

Comparison at a glance

wrk and gcannon req/s and avg latency for every model, side by side.

Implementation	Reactors	Sync				Async
Implementation	Reactors	wrk req/s	wrk avg	gcannon req/s	gcannon avg	wrk req/s	wrk avg	gcannon req/s	gcannon avg
io_uring r+w, IVTS inline	12	3.59M	121.45us	3.95M	129us	2.53M*	435.74us*	2.76M*	185us*
io_uring r+w, threadpool	12	1.95M	515.72us	2.41M	211us	1.93M	530.17us	2.39M	213us
io_uring recv + libc send	12	2.82M	410.23us	3.31M	154us	2.74M	418.96us	3.20M	159us
epoll r+w, IVTS inline	12	3.36M	284.42us	3.17M	160us	2.68M	458.63us	3.08M	159us
epoll r+w, threadpool	6	2.98M	391.31us	3.60M	140us	2.79M	464.15us	3.27M	154us
System.Net.Socket (Kestrel stock)	—	3.09M	156.79us	3.60M	141us	2.67M	255.07us	3.01M	169us

CPU usage: the inline-IVTS cases (io_uring r+w IVTS, epoll r+w IVTS) cap at around 1200% max, while every other model averages ~1600%.

* Async run flagged as very unstable in its section above.

Conclusion

The numbers line up with part 5's rant. On a fully synchronous benchmark, io_uring with reactor inline continuations comes out ahead because there are no cross-thread handoffs.

Force the continuation onto the threadpool (async workload) and that lead evaporates. The hybrid approach reclaims most of it and is a serious contender for further tests with Kestrel integration.

A side note on the load generators: the results are quite interesting. gcannon seems a lot more stable on latency values, while wrk is all over the place.

It is important to highlight that the reactor inline models consume roughly 20 to 25% less CPU (a cap of around 1200% versus an average of ~1600%), since they are bound to the 12 reactor threads. Solutions that allow threadpool continuations, on the other hand, will use as much CPU as is available. For example, epoll r+w IVTS inline can actually yield 3.9M rps if we increase the reactor count to 16, surpassing System.Net.Socket performance for the same CPU usage.
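
To make the "same CPU usage" argument concrete, throughput can be normalized by CPU consumed. The helper below is a hypothetical sketch; the inputs in the usage note are the wrk sync figures and the approximate CPU percentages quoted in this post, so the outputs are only as rough as those percentages.

```csharp
using System;

public static class Efficiency
{
    // Requests per second delivered per fully used core.
    // cpuPercent follows the top/htop convention: 1200 means 12 cores busy.
    public static double ReqPerSecPerCore(double reqPerSec, double cpuPercent)
    {
        if (cpuPercent <= 0)
            throw new ArgumentOutOfRangeException(nameof(cpuPercent));
        return reqPerSec / (cpuPercent / 100.0);
    }
}
```

Efficiency.ReqPerSecPerCore(3_590_000, 1200) gives about 299K req/s per core for io_uring r+w IVTS inline, while Efficiency.ReqPerSecPerCore(3_090_000, 1600) gives about 193K for the stock Kestrel transport. Raw req/s alone understates the gap between the inline models and the rest.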

The epoll r+w threadpool result is very surprising: I was expecting its performance to be equal to System.Net.Socket. This will be quite interesting for part 7.

In part 7, some of these models will be integrated into Kestrel/ASP.NET for a direct benchmark comparison.