
C# Networking Deep Dive With io_uring — Part 6: Numbers

For part 6, let's run some benchmarks.

These numbers are not "scientific"; treat them as ballpark estimates.

What is going to be benchmarked

  • io_uring read+write with IVTS reactor inline continuations (RunAsynchronousContinuation = false)
  • io_uring read+write without IVTS reactor inline continuations (threadpool) (RunAsynchronousContinuation = true)
  • io_uring read + libc send write without IVTS reactor inline continuations (threadpool) (RunAsynchronousContinuation = true)
  • epoll read+write with IVTS reactor inline continuations
  • epoll read+write without IVTS reactor inline continuations
  • System.Net.Socket (Kestrel stock) - epoll threadpool
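
To make the inline versus threadpool distinction in the list above concrete, here is a minimal sketch. It assumes the series' RunAsynchronousContinuation flag behaves like the BCL's ManualResetValueTaskSourceCore<T>.RunContinuationsAsynchronously; RecvSource, ContinuationDemo and the dedicated thread standing in for a reactor are hypothetical names for illustration, not the series' actual code.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Sources;

// Hypothetical single-waiter awaitable standing in for a connection's recv source.
public sealed class RecvSource : IValueTaskSource<int>
{
    private ManualResetValueTaskSourceCore<int> _core; // mutable struct, must not be readonly

    public RecvSource(bool runContinuationsAsynchronously) =>
        _core.RunContinuationsAsynchronously = runContinuationsAsynchronously;

    public ValueTask<int> WaitAsync() => new ValueTask<int>(this, _core.Version);

    // Called from the "reactor" thread when a completion arrives.
    public void Complete(int bytes) => _core.SetResult(bytes);

    public int GetResult(short token)
    {
        try { return _core.GetResult(token); }
        finally { _core.Reset(); } // re-arm for the next await
    }

    public ValueTaskSourceStatus GetStatus(short token) => _core.GetStatus(token);

    public void OnCompleted(Action<object?> continuation, object? state, short token,
        ValueTaskSourceOnCompletedFlags flags) => _core.OnCompleted(continuation, state, token, flags);
}

public static class ContinuationDemo
{
    // Returns true if the handler resumed on a threadpool thread,
    // false if it resumed inline on the dedicated "reactor" thread.
    public static bool ResumedOnThreadPool(bool runContinuationsAsynchronously)
    {
        var source = new RecvSource(runContinuationsAsynchronously);
        var done = new ManualResetEventSlim();
        bool onPool = false;

        async Task Handler()
        {
            await source.WaitAsync().ConfigureAwait(false);
            onPool = Thread.CurrentThread.IsThreadPoolThread;
            done.Set();
        }

        _ = Handler(); // runs up to the await and registers the continuation

        var reactor = new Thread(() => source.Complete(64)); // dedicated, non-pool thread
        reactor.Start();
        done.Wait();
        reactor.Join();
        return onPool;
    }

    public static void Main()
    {
        Console.WriteLine($"inline:     resumed on threadpool = {ResumedOnThreadPool(false)}");
        Console.WriteLine($"threadpool: resumed on threadpool = {ResumedOnThreadPool(true)}");
    }
}
```

With the flag off, the handler resumes on the thread that called Complete, so there is no hand-off; with the flag on, the continuation is queued to the threadpool and the completing thread returns immediately.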

Tests

(No pipelining)

  • Synchronous lightweight plaintext "OK" response.
  • Asynchronous workload: _ = await Task.Run(static () => JsonSerializer.Serialize("Hello World!"));

The purpose of the async workload is to force the continuation onto the threadpool, not to model a heavy async workload.
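
As a rough illustration of the two workloads, the handlers could look like the sketch below. The Workloads class, the handler signatures and the exact response bytes are assumptions made for this example; only the Task.Run line comes from the test description above.

```csharp
using System;
using System.Text;
using System.Text.Json;
using System.Threading.Tasks;

public static class Workloads
{
    // Assumed response bytes; the post only says the body is a plaintext "OK".
    private static readonly byte[] OkResponse =
        Encoding.ASCII.GetBytes("HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nOK");

    // Sync workload: completes synchronously, so no continuation is ever scheduled.
    public static ValueTask<byte[]> SyncHandler() => new ValueTask<byte[]>(OkResponse);

    // Async workload: the awaited Task.Run hops to the threadpool,
    // so the rest of the handler resumes there.
    public static async ValueTask<byte[]> AsyncHandler()
    {
        _ = await Task.Run(static () => JsonSerializer.Serialize("Hello World!"));
        return OkResponse;
    }
}
```

The serialized string is discarded on purpose: the work itself is trivial, and what matters is where the handler resumes afterwards.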

Hardware

i9 14900k
64GB DDR5 6400MHz
Linux Kernel 6.17.0-22-generic

Tests are done through localhost loopback (no NIC influence)
MTU 1500

Load generators

HTTP/1.1, no TLS

wrk (epoll)
gcannon (io_uring)

io_uring configuration

All io_uring variants share the same ring setup:

  • SINGLE_ISSUER + DEFER_TASKRUN
  • Multishot accept and multishot recv (send is one-shot per response)
  • No zero copy receive, no zero copy send
  • Incremental buffer consumption disabled (performance was similar either way for this specific benchmark)
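
For reference, the setup and multishot flags named above map to the following values in the kernel's linux/io_uring.h header. This is only a sketch of how they might be declared on the managed side; the RingSetupFlags and RingConfig names are hypothetical and not taken from the series' code.

```csharp
using System;

// Values copied from the kernel's <linux/io_uring.h> UAPI header.
[Flags]
public enum RingSetupFlags : uint
{
    SingleIssuer = 1u << 12, // IORING_SETUP_SINGLE_ISSUER
    DeferTaskrun = 1u << 13, // IORING_SETUP_DEFER_TASKRUN (requires SINGLE_ISSUER)
}

public static class RingConfig
{
    // Passed in io_uring_params.flags when the ring is created.
    public const RingSetupFlags SetupFlags = RingSetupFlags.SingleIssuer | RingSetupFlags.DeferTaskrun;

    public const ushort AcceptMultishot = 1 << 0; // IORING_ACCEPT_MULTISHOT, set in sqe->ioprio
    public const ushort RecvMultishot = 1 << 1;   // IORING_RECV_MULTISHOT, set in sqe->ioprio

    public static void Main() =>
        Console.WriteLine($"io_uring_params.flags = 0x{(uint)SetupFlags:X}");
}
```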

io_uring read+write with IVTS reactor inline continuations

This is the exact model explored throughout the series. It is expected to deliver high performance on the synchronous test.

Reactor count: 12

Sync workload

Metric              wrk          gcannon
Latency Avg         121.45us     129us
Latency Stdev       178.81us     -
Latency Max         8.32ms       -
Latency p50         -            125us
Latency p90         -            185us
Latency p99         -            245us
Latency p99.9       -            317us
Req/Sec Avg         3.59M        3.95M
Requests Total      18,299,278   19,735,722
Duration            5.10s        5.00s
Transfer/Bandwidth  225.84MB/s   248.42MB/s

Async workload (very unstable)

Metric              wrk          gcannon
Latency Avg         435.74us     185us
Latency Stdev       795.84us     -
Latency Max         12.73ms      -
Latency p50         -            135us
Latency p90         -            229us
Latency p99         -            1.84ms
Latency p99.9       -            4.10ms
Req/Sec Avg         2.53M        2.76M
Requests Total      12,883,294   13,797,048
Duration            5.10s        5.00s
Transfer/Bandwidth  159.05MB/s   173.67MB/s

io_uring read+write without IVTS reactor inline continuations

The same model explored throughout the series, but with RunAsynchronousContinuation set to true on both IVTS. It is expected to deliver similar results on both tests.

Reactor count: 12

Sync workload

Metric              wrk          gcannon
Latency Avg         515.72us     211us
Latency Stdev       821.99us     -
Latency Max         12.67ms      -
Latency p50         -            164us
Latency p90         -            273us
Latency p99         -            1.55ms
Latency p99.9       -            3.79ms
Req/Sec Avg         1.95M        2.41M
Requests Total      9,946,282    12,080,236
Duration            5.10s        5.00s
Transfer/Bandwidth  122.80MB/s   151.97MB/s

Async workload

Metric              wrk          gcannon
Latency Avg         530.17us     213us
Latency Stdev       842.05us     -
Latency Max         13.37ms      -
Latency p50         -            146us
Latency p90         -            265us
Latency p99         -            2.27ms
Latency p99.9       -            4.38ms
Req/Sec Avg         1.93M        2.39M
Requests Total      9,726,083    11,952,675
Duration            5.03s        5.00s
Transfer/Bandwidth  121.82MB/s   150.45MB/s

io_uring read + libc send write without IVTS reactor inline continuations

The same model explored throughout the series, with RunAsynchronousContinuation set to true on both IVTS, except that the write path does not go through io_uring and uses libc's send instead. It is expected to deliver similar results on both tests. This is a hybrid approach and should land in the middle ground between the first two models.
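
As a sketch of what the write branch of such a hybrid could look like from managed code, the snippet below P/Invokes libc's send directly on a socket file descriptor. The LibcSend wrapper, the "libc" library name and the use of MSG_NOSIGNAL are assumptions for illustration (Linux only); the series' actual write path may differ.

```csharp
using System;
using System.Runtime.InteropServices;

public static class LibcSend
{
    private const int MSG_NOSIGNAL = 0x4000; // Linux value: do not raise SIGPIPE on a closed peer

    // ssize_t send(int sockfd, const void *buf, size_t len, int flags);
    [DllImport("libc", SetLastError = true)]
    private static extern nint send(int sockfd, ref byte buf, nuint len, int flags);

    // Sends as many bytes as the kernel accepts in one call and returns that count.
    // A real server would also handle partial writes and EAGAIN on non-blocking sockets.
    public static int Send(int socketFd, ReadOnlySpan<byte> payload)
    {
        if (payload.IsEmpty) return 0;

        nint sent = send(socketFd, ref MemoryMarshal.GetReference(payload),
            (nuint)payload.Length, MSG_NOSIGNAL);
        if (sent < 0)
            throw new InvalidOperationException($"send failed, errno {Marshal.GetLastPInvokeError()}");
        return (int)sent;
    }
}
```

Because send is a plain syscall, it can be issued from whichever thread the handler resumed on, so the response does not have to be handed back to the reactor to be queued on the ring.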

Reactor count: 12

Sync workload

Metric              wrk          gcannon
Latency Avg         410.23us     154us
Latency Stdev       782.03us     -
Latency Max         12.08ms      -
Latency p50         -            84us
Latency p90         -            176us
Latency p99         -            2.68ms
Latency p99.9       -            4.32ms
Req/Sec Avg         2.82M        3.31M
Requests Total      14,361,239   16,551,871
Duration            5.10s        5.00s
Transfer/Bandwidth  0.88GB read  208.27MB/s

Async workload

Metric              wrk          gcannon
Latency Avg         418.96us     159us
Latency Stdev       824.32us     -
Latency Max         17.51ms      -
Latency p50         -            85us
Latency p90         -            198us
Latency p99         -            1.99ms
Latency p99.9       -            4.41ms
Req/Sec Avg         2.74M        3.20M
Requests Total      13,955,371   15,997,491
Duration            5.09s        5.00s
Transfer/Bandwidth  172.59MB/s   201.18MB/s

epoll read+write with IVTS reactor inline continuations

Pure epoll approach with the same reactor threading architecture. Handler continuations run inline for both IVTS.

Reactor count: 12

Sync workload

Metric              wrk          gcannon
Latency Avg         284.42us     160us
Latency Stdev       610.90us     -
Latency Max         11.06ms      -
Latency p50         -            86us
Latency p90         -            194us
Latency p99         -            2.07ms
Latency p99.9       -            4.39ms
Req/Sec Avg         3.36M        3.17M
Requests Total      17,141,225   15,856,691
Duration            5.10s        5.00s
Transfer/Bandwidth  403.61MB/s   199.56MB/s

Async workload

Metric              wrk          gcannon
Latency Avg         458.63us     159us
Latency Stdev       0.90ms       -
Latency Max         15.96ms      -
Latency p50         -            74us
Latency p90         -            185us
Latency p99         -            2.68ms
Latency p99.9       -            5.32ms
Req/Sec Avg         2.68M        3.08M
Requests Total      13,670,697   15,386,279
Duration            5.10s        5.00s
Transfer/Bandwidth  322.12MB/s   369.72MB/s

epoll read+write without IVTS reactor inline continuations

Pure epoll approach with the same reactor threading architecture. Handler continuations run on the threadpool for both IVTS.

Reactor count: 6

Sync workload

Metric              wrk          gcannon
Latency Avg         391.31us     140us
Latency Stdev       764.42us     -
Latency Max         13.71ms      -
Latency p50         -            96us
Latency p90         -            150us
Latency p99         -            2.06ms
Latency p99.9       -            4.15ms
Req/Sec Avg         2.98M        3.60M
Requests Total      15,179,066   18,019,801
Duration            5.10s        5.00s
Transfer/Bandwidth  357.60MB/s   432.83MB/s

Async workload

Metric              wrk          gcannon
Latency Avg         464.15us     154us
Latency Stdev       838.78us     -
Latency Max         10.74ms      -
Latency p50         -            96us
Latency p90         -            154us
Latency p99         -            2.22ms
Latency p99.9       -            4.48ms
Req/Sec Avg         2.79M        3.27M
Requests Total      14,231,176   16,342,325
Duration            5.10s        5.00s
Transfer/Bandwidth  236.89MB/s   277.35MB/s

System.Net.Socket (Kestrel stock) - epoll threadpool

Kestrel's stock network I/O with some tuning.

Sync workload

Metric              wrk          gcannon
Latency Avg         156.79us     141us
Latency Stdev       342.31us     -
Latency Max         6.98ms       -
Latency p50         -            129us
Latency p90         -            176us
Latency p99         -            305us
Latency p99.9       -            3.17ms
Req/Sec Avg         3.09M        3.60M
Requests Total      15,748,223   18,024,579
Duration            5.10s        5.00s
Transfer/Bandwidth  194.39MB/s   226.84MB/s

Async workload

Metric              wrk          gcannon
Latency Avg         255.07us     169us
Latency Stdev       507.29us     -
Latency Max         12.53ms      -
Latency p50         -            123us
Latency p90         -            237us
Latency p99         -            1.25ms
Latency p99.9       -            3.89ms
Req/Sec Avg         2.67M        3.01M
Requests Total      13,618,906   15,043,820
Duration            5.10s        5.00s
Transfer/Bandwidth  168.14MB/s   189.25MB/s

Comparison at a glance

wrk and gcannon req/s and avg latency for every model, side by side.

Sync workload

Implementation                     Reactors  wrk req/s  wrk avg   gcannon req/s  gcannon avg
io_uring r+w, IVTS inline          12        3.59M      121.45us  3.95M          129us
io_uring r+w, threadpool           12        1.95M      515.72us  2.41M          211us
io_uring recv + libc send          12        2.82M      410.23us  3.31M          154us
epoll r+w, IVTS inline             12        3.36M      284.42us  3.17M          160us
epoll r+w, threadpool              6         2.98M      391.31us  3.60M          140us
System.Net.Socket (Kestrel stock)  -         3.09M      156.79us  3.60M          141us

Async workload

Implementation                     Reactors  wrk req/s  wrk avg    gcannon req/s  gcannon avg
io_uring r+w, IVTS inline          12        2.53M*     435.74us*  2.76M*         185us*
io_uring r+w, threadpool           12        1.93M      530.17us   2.39M          213us
io_uring recv + libc send          12        2.74M      418.96us   3.20M          159us
epoll r+w, IVTS inline             12        2.68M      458.63us   3.08M          159us
epoll r+w, threadpool              6         2.79M      464.15us   3.27M          154us
System.Net.Socket (Kestrel stock)  -         2.67M      255.07us   3.01M          169us

CPU usage: the inline-IVTS models (io_uring r+w IVTS, epoll r+w IVTS) cap at around 1200%, while every other model averages ~1600%.

* This async run was very unstable, as flagged in the io_uring inline results above.
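
One way to read the CPU note together with the table is throughput per busy core. The sketch below is only back-of-the-envelope arithmetic: it takes the wrk sync req/s from the table and assumes the approximate 1200% and 1600% CPU figures hold for those runs. The Efficiency helper is a hypothetical name.

```csharp
using System;

public static class Efficiency
{
    // Requests per second per fully busy core (100% in top-style CPU accounting).
    public static double ReqPerSecPerCore(double reqPerSec, double cpuPercent) =>
        reqPerSec / (cpuPercent / 100.0);

    public static void Main()
    {
        // wrk sync figures from the table; CPU percentages are the approximate values quoted above.
        Console.WriteLine($"io_uring r+w, IVTS inline: {ReqPerSecPerCore(3.59e6, 1200):N0} req/s per core");
        Console.WriteLine($"System.Net.Socket:         {ReqPerSecPerCore(3.09e6, 1600):N0} req/s per core");
    }
}
```

Under those assumptions the inline io_uring model lands at roughly 299k req/s per core against roughly 193k for stock Kestrel sockets, so the gap per unit of CPU is wider than the raw req/s columns suggest.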

Conclusion

The numbers line up with part 5's rant. On a fully synchronous benchmark, io_uring with reactor inline continuations comes out ahead, because there are no cross-thread hand-offs.

Force the continuation on the threadpool (async workload) and that lead evaporates. The hybrid approach reclaims most of it and is a serious contender for further tests with Kestrel integration.

A small note on the load generators: the results are quite interesting. gcannon seems a lot more stable on latency values, while wrk is all over the place.

It is important to highlight that the reactor inline models consume on average 20% less CPU on the sync workload, as they are bound to the 12 reactor threads. Solutions that allow threadpool continuations, on the other hand, will use as much CPU as is available. For example, epoll r+w IVTS inline can actually reach 3.9M rps if we increase the reactor count to 16, surpassing System.Net.Socket performance for the same CPU usage.

The epoll r+w threadpool result is very surprising. I was expecting its performance to be equal to System.Net.Socket, so this will be quite interesting for part 7.

In part 7, some of these models will be integrated into Kestrel/ASP.NET for a direct benchmark comparison.