For part 6, let's run some benchmarks.
These values are not "scientific", just a ballpark estimate.
What is going to be benchmarked
- io_uring read+write with IVTS reactor inline continuations (RunAsynchronousContinuation = false)
- io_uring read+write without IVTS reactor inline continuations (threadpool) (RunAsynchronousContinuation = true)
- io_uring read + libc send write without IVTS reactor inline continuations (threadpool) (RunAsynchronousContinuation = true)
- epoll read+write with IVTS reactor inline continuations
- epoll read+write without IVTS reactor inline continuations
- System.Net.Sockets (Kestrel stock) - epoll threadpool
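To make the inline versus threadpool distinction concrete, here is a minimal sketch. It assumes the series' RunAsynchronousContinuation switch behaves like the BCL's `ManualResetValueTaskSourceCore<T>.RunContinuationsAsynchronously`; `DemoSource` and `ContinuationDemo` are hypothetical names for illustration, not the series' actual types. A dedicated thread plays the reactor and completes the source, and the handler records where its continuation ran.

```csharp
#nullable enable
using System;
using System.Threading;
using System.Threading.Tasks;
using System.Threading.Tasks.Sources;

// Hypothetical one-shot awaitable standing in for the series' read/write IVTS.
public sealed class DemoSource : IValueTaskSource<int>
{
    // Mutable struct, so the field must not be readonly.
    private ManualResetValueTaskSourceCore<int> _core;

    public DemoSource(bool runContinuationsAsynchronously)
    {
        // false: the continuation runs inline on the thread that calls SetResult.
        // true:  the continuation is queued to the thread pool.
        _core.RunContinuationsAsynchronously = runContinuationsAsynchronously;
    }

    public ValueTask<int> AsValueTask() => new ValueTask<int>(this, _core.Version);

    // Called by the "reactor" when the I/O completes.
    public void Complete(int value) => _core.SetResult(value);

    public int GetResult(short token) => _core.GetResult(token);
    public ValueTaskSourceStatus GetStatus(short token) => _core.GetStatus(token);
    public void OnCompleted(Action<object?> continuation, object? state, short token,
                            ValueTaskSourceOnCompletedFlags flags)
        => _core.OnCompleted(continuation, state, token, flags);
}

public static class ContinuationDemo
{
    // Returns true when the code after the await ran on a thread-pool thread.
    public static bool ContinuationRanOnThreadPool(bool runContinuationsAsynchronously)
    {
        var source = new DemoSource(runContinuationsAsynchronously);
        bool ranOnPool = false;

        async Task Handler()
        {
            // ConfigureAwait(false) keeps any ambient SynchronizationContext out of the picture.
            await source.AsValueTask().ConfigureAwait(false);
            ranOnPool = Thread.CurrentThread.IsThreadPoolThread;
        }

        // Runs up to the await and registers the continuation on the source.
        Task handler = Handler();

        // A dedicated (non-pool) thread stands in for a reactor thread.
        var reactor = new Thread(() => source.Complete(42));
        reactor.Start();
        reactor.Join();
        handler.Wait();
        return ranOnPool;
    }

    public static void Main()
    {
        Console.WriteLine($"inline: continuation on pool thread = {ContinuationRanOnThreadPool(false)}");
        Console.WriteLine($"async:  continuation on pool thread = {ContinuationRanOnThreadPool(true)}");
    }
}
```

With the flag off, the handler's continuation should run on the reactor thread itself, which is why the inline models avoid a cross-thread handoff. With the flag on, the reactor only queues the continuation and goes back to its loop.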
Tests
(No pipelining)
- Synchronous lightweight plaintext "OK" response.
- Asynchronous workload:
_ = await Task.Run(static () => JsonSerializer.Serialize("Hello World!"));
The purpose of the async workload is to force the continuation onto the threadpool, not to model a heavy async workload.
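The two workloads can be sketched as a standalone program. This is only an illustration under assumptions: `WorkloadDemo`, `SyncWorkload` and `AsyncWorkload` are hypothetical names, the real handlers live inside the server from the series, and the serialized value is kept here (the benchmark discards it) so the result can be printed.

```csharp
using System;
using System.Text;
using System.Text.Json;
using System.Threading;
using System.Threading.Tasks;

public static class WorkloadDemo
{
    // Synchronous workload: the plaintext body is produced on the calling thread.
    public static byte[] SyncWorkload() => Encoding.ASCII.GetBytes("OK");

    // Asynchronous workload from the benchmark. Returns the serialized value and
    // whether the code after the await was running on a thread-pool thread.
    public static async Task<(string Json, bool OnPool)> AsyncWorkload()
    {
        string json = await Task.Run(static () => JsonSerializer.Serialize("Hello World!"))
                                .ConfigureAwait(false);
        return (json, Thread.CurrentThread.IsThreadPoolThread);
    }

    public static void Main()
    {
        // A dedicated thread stands in for a reactor thread.
        var reactor = new Thread(() =>
        {
            Console.WriteLine($"before await: pool thread = {Thread.CurrentThread.IsThreadPoolThread}");
            var (json, onPool) = AsyncWorkload().GetAwaiter().GetResult();
            Console.WriteLine($"after await:  pool thread = {onPool}, payload = {json}");
        });
        reactor.Start();
        reactor.Join();
    }
}
```

In almost every run the code after the await resumes on a thread-pool thread, because `Task.Run` completes there and the continuation follows it. The rare exception is when the task has already finished by the time it is awaited, in which case execution simply continues on the calling thread.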
Hardware
i9 14900k
64GB DDR5 6400MHz
Linux Kernel 6.17.0-22-generic
Tests are done through localhost loopback (no NIC influence)
MTU 1500
Load generators
HTTP/1.1, no TLS
wrk (epoll)
gcannon (io_uring)
io_uring configuration
All io_uring variants share the same ring setup:
- SINGLE_ISSUER + DEFER_TASKRUN
- Multishot accept and multishot recv (send is one-shot per response)
- No zero copy receive, no zero copy send
- Incremental buffer consumption disabled (enabling it gave similar performance in this specific benchmark)
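As a rough illustration of the setup above, the flags can be written as C# constants. The numeric values mirror my reading of `linux/io_uring.h` and should be checked against the headers of the kernel in use; the `IoUringSetupFlags`, `IoUringOpFlags` and `RingConfig` names are hypothetical and not taken from the series' code.

```csharp
using System;

// Ring-level flags passed to io_uring_setup (values as in linux/io_uring.h).
[Flags]
public enum IoUringSetupFlags : uint
{
    SingleIssuer = 1u << 12, // IORING_SETUP_SINGLE_ISSUER: only one thread submits to the ring
    DeferTaskrun = 1u << 13, // IORING_SETUP_DEFER_TASKRUN: completion work runs when the ring owner waits
}

// Per-operation multishot flags, carried in the SQE ioprio field.
[Flags]
public enum IoUringOpFlags : ushort
{
    AcceptMultishot = 1 << 0, // IORING_ACCEPT_MULTISHOT
    RecvMultishot   = 1 << 1, // IORING_RECV_MULTISHOT
}

public static class RingConfig
{
    // DEFER_TASKRUN is only accepted by the kernel together with SINGLE_ISSUER.
    public static uint SetupFlags() =>
        (uint)(IoUringSetupFlags.SingleIssuer | IoUringSetupFlags.DeferTaskrun);

    public static void Main() =>
        Console.WriteLine($"io_uring setup flags: 0x{SetupFlags():X}"); // prints 0x3000
}
```

One ring per reactor thread fits SINGLE_ISSUER naturally, since each reactor is the only thread that ever submits to its own ring.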
io_uring read+write with IVTS reactor inline continuations
This is the exact model explored throughout the series. It is expected to deliver high performance on the synchronous test.
Reactor count: 12
Sync workload
| Metric | wrk | gcannon |
|---|---|---|
| Latency Avg | 121.45us | 129us |
| Latency Stdev | 178.81us | — |
| Latency Max | 8.32ms | — |
| Latency p50 | — | 125us |
| Latency p90 | — | 185us |
| Latency p99 | — | 245us |
| Latency p99.9 | — | 317us |
| Req/Sec Avg | 3.59M | 3.95M |
| Requests Total | 18,299,278 | 19,735,722 |
| Duration | 5.10s | 5.00s |
| Transfer/Bandwidth | 225.84MB/s | 248.42MB/s |
Async workload (very unstable)
| Metric | wrk | gcannon |
|---|---|---|
| Latency Avg | 435.74us | 185us |
| Latency Stdev | 795.84us | — |
| Latency Max | 12.73ms | — |
| Latency p50 | — | 135us |
| Latency p90 | — | 229us |
| Latency p99 | — | 1.84ms |
| Latency p99.9 | — | 4.10ms |
| Req/Sec Avg | 2.53M | 2.76M |
| Requests Total | 12,883,294 | 13,797,048 |
| Duration | 5.10s | 5.00s |
| Transfer/Bandwidth | 159.05MB/s | 173.67MB/s |
io_uring read+write without IVTS reactor inline continuations
The same model explored throughout the series, but with RunAsynchronousContinuation set to true on both IVTS. It is expected to deliver similar results on both tests.
Reactor count: 12
Sync workload
| Metric | wrk | gcannon |
|---|---|---|
| Latency Avg | 515.72us | 211us |
| Latency Stdev | 821.99us | — |
| Latency Max | 12.67ms | — |
| Latency p50 | — | 164us |
| Latency p90 | — | 273us |
| Latency p99 | — | 1.55ms |
| Latency p99.9 | — | 3.79ms |
| Req/Sec Avg | 1.95M | 2.41M |
| Requests Total | 9,946,282 | 12,080,236 |
| Duration | 5.10s | 5.00s |
| Transfer/Bandwidth | 122.80MB/s | 151.97MB/s |
Async workload
| Metric | wrk | gcannon |
|---|---|---|
| Latency Avg | 530.17us | 213us |
| Latency Stdev | 842.05us | — |
| Latency Max | 13.37ms | — |
| Latency p50 | — | 146us |
| Latency p90 | — | 265us |
| Latency p99 | — | 2.27ms |
| Latency p99.9 | — | 4.38ms |
| Req/Sec Avg | 1.93M | 2.39M |
| Requests Total | 9,726,083 | 11,952,675 |
| Duration | 5.03s | 5.00s |
| Transfer/Bandwidth | 121.82MB/s | 150.45MB/s |
io_uring read + libc send write without IVTS reactor inline continuations
The same model explored throughout the series, but with RunAsynchronousContinuation set to true on both IVTS and with the write branch going through libc's send instead of io_uring. It is expected to deliver similar results on both tests. This is a hybrid approach and should be the middle ground between the first two models.
Reactor count: 12
Sync workload
| Metric | wrk | gcannon |
|---|---|---|
| Latency Avg | 410.23us | 154us |
| Latency Stdev | 782.03us | — |
| Latency Max | 12.08ms | — |
| Latency p50 | — | 84us |
| Latency p90 | — | 176us |
| Latency p99 | — | 2.68ms |
| Latency p99.9 | — | 4.32ms |
| Req/Sec Avg | 2.82M | 3.31M |
| Requests Total | 14,361,239 | 16,551,871 |
| Duration | 5.10s | 5.00s |
| Transfer/Bandwidth | 0.88GB read (total) | 208.27MB/s |
Async workload
| Metric | wrk | gcannon |
|---|---|---|
| Latency Avg | 418.96us | 159us |
| Latency Stdev | 824.32us | — |
| Latency Max | 17.51ms | — |
| Latency p50 | — | 85us |
| Latency p90 | — | 198us |
| Latency p99 | — | 1.99ms |
| Latency p99.9 | — | 4.41ms |
| Req/Sec Avg | 2.74M | 3.20M |
| Requests Total | 13,955,371 | 15,997,491 |
| Duration | 5.09s | 5.00s |
| Transfer/Bandwidth | 172.59MB/s | 201.18MB/s |
epoll read+write with IVTS reactor inline continuations
Pure epoll approach with the same reactor threading architecture. Handler continuations run inline for both IVTS.
Reactor count: 12
Sync workload
| Metric | wrk | gcannon |
|---|---|---|
| Latency Avg | 284.42us | 160us |
| Latency Stdev | 610.90us | — |
| Latency Max | 11.06ms | — |
| Latency p50 | — | 86us |
| Latency p90 | — | 194us |
| Latency p99 | — | 2.07ms |
| Latency p99.9 | — | 4.39ms |
| Req/Sec Avg | 3.36M | 3.17M |
| Requests Total | 17,141,225 | 15,856,691 |
| Duration | 5.10s | 5.00s |
| Transfer/Bandwidth | 403.61MB/s | 199.56MB/s |
Async workload
| Metric | wrk | gcannon |
|---|---|---|
| Latency Avg | 458.63us | 159us |
| Latency Stdev | 0.90ms | — |
| Latency Max | 15.96ms | — |
| Latency p50 | — | 74us |
| Latency p90 | — | 185us |
| Latency p99 | — | 2.68ms |
| Latency p99.9 | — | 5.32ms |
| Req/Sec Avg | 2.68M | 3.08M |
| Requests Total | 13,670,697 | 15,386,279 |
| Duration | 5.10s | 5.00s |
| Transfer/Bandwidth | 322.12MB/s | 369.72MB/s |
epoll read+write without IVTS reactor inline continuations
Pure epoll approach with the same reactor threading architecture. Handler continuations run on the threadpool for both IVTS.
Reactor count: 6
Sync workload
| Metric | wrk | gcannon |
|---|---|---|
| Latency Avg | 391.31us | 140us |
| Latency Stdev | 764.42us | — |
| Latency Max | 13.71ms | — |
| Latency p50 | — | 96us |
| Latency p90 | — | 150us |
| Latency p99 | — | 2.06ms |
| Latency p99.9 | — | 4.15ms |
| Req/Sec Avg | 2.98M | 3.60M |
| Requests Total | 15,179,066 | 18,019,801 |
| Duration | 5.10s | 5.00s |
| Transfer/Bandwidth | 357.60MB/s | 432.83MB/s |
Async workload
| Metric | wrk | gcannon |
|---|---|---|
| Latency Avg | 464.15us | 154us |
| Latency Stdev | 838.78us | — |
| Latency Max | 10.74ms | — |
| Latency p50 | — | 96us |
| Latency p90 | — | 154us |
| Latency p99 | — | 2.22ms |
| Latency p99.9 | — | 4.48ms |
| Req/Sec Avg | 2.79M | 3.27M |
| Requests Total | 14,231,176 | 16,342,325 |
| Duration | 5.10s | 5.00s |
| Transfer/Bandwidth | 236.89MB/s | 277.35MB/s |
System.Net.Sockets (Kestrel stock) - epoll threadpool
Kestrel's stock network I/O with some tuning.
Sync workload
| Metric | wrk | gcannon |
|---|---|---|
| Latency Avg | 156.79us | 141us |
| Latency Stdev | 342.31us | — |
| Latency Max | 6.98ms | — |
| Latency p50 | — | 129us |
| Latency p90 | — | 176us |
| Latency p99 | — | 305us |
| Latency p99.9 | — | 3.17ms |
| Req/Sec Avg | 3.09M | 3.60M |
| Requests Total | 15,748,223 | 18,024,579 |
| Duration | 5.10s | 5.00s |
| Transfer/Bandwidth | 194.39MB/s | 226.84MB/s |
Async workload
| Metric | wrk | gcannon |
|---|---|---|
| Latency Avg | 255.07us | 169us |
| Latency Stdev | 507.29us | — |
| Latency Max | 12.53ms | — |
| Latency p50 | — | 123us |
| Latency p90 | — | 237us |
| Latency p99 | — | 1.25ms |
| Latency p99.9 | — | 3.89ms |
| Req/Sec Avg | 2.67M | 3.01M |
| Requests Total | 13,618,906 | 15,043,820 |
| Duration | 5.10s | 5.00s |
| Transfer/Bandwidth | 168.14MB/s | 189.25MB/s |
Comparison at a glance
wrk and gcannon req/s and avg latency for every model, side by side.
| Implementation | Reactors | Sync wrk req/s | Sync wrk avg | Sync gcannon req/s | Sync gcannon avg | Async wrk req/s | Async wrk avg | Async gcannon req/s | Async gcannon avg |
|---|---|---|---|---|---|---|---|---|---|
| io_uring r+w, IVTS inline | 12 | 3.59M | 121.45us | 3.95M | 129us | 2.53M* | 435.74us* | 2.76M* | 185us* |
| io_uring r+w, threadpool | 12 | 1.95M | 515.72us | 2.41M | 211us | 1.93M | 530.17us | 2.39M | 213us |
| io_uring recv + libc send | 12 | 2.82M | 410.23us | 3.31M | 154us | 2.74M | 418.96us | 3.20M | 159us |
| epoll r+w, IVTS inline | 12 | 3.36M | 284.42us | 3.17M | 160us | 2.68M | 458.63us | 3.08M | 159us |
| epoll r+w, threadpool | 6 | 2.98M | 391.31us | 3.60M | 140us | 2.79M | 464.15us | 3.27M | 154us |
| System.Net.Sockets (Kestrel stock) | — | 3.09M | 156.79us | 3.60M | 141us | 2.67M | 255.07us | 3.01M | 169us |
CPU usage: the inline-IVTS cases (io_uring r+w IVTS, epoll r+w IVTS) cap at around 1200% max, while every other model averages ~1600%.
* This async run was very unstable, as flagged in the io_uring inline results above.
Conclusion
The numbers line up with part 5's rant. On a fully synchronous benchmark, io_uring with reactor inline continuations pulls ahead because there are no cross-thread handoffs.
Force the continuation onto the threadpool (async workload) and that lead evaporates. The hybrid approach reclaims most of it and is a serious contender for further tests with Kestrel integration.
A little note on the load generators: the results are quite interesting. gcannon seems a lot more stable on latency values, while wrk is all over the place.
It is important to highlight that the reactor inline models consume on average 20% less CPU, as they are bound to 12 reactor threads. On the other hand, solutions that allow threadpool continuations will use as much CPU as is available. For example, epoll r+w IVTS inline can actually yield 3.9M rps if we increase the reactor count to 16, surpassing System.Net.Sockets performance for the same CPU usage.
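One way to read the throughput numbers against CPU usage is to normalize req/s by the number of fully busy cores. The sketch below uses the gcannon sync figures from the comparison table and the approximate CPU readings from the note under it (around 1200% and ~1600%), so the results are ballpark values only. `CpuEfficiency` and `ReqPerSecPerCore` are hypothetical helper names.

```csharp
using System;

public static class CpuEfficiency
{
    // Requests per second delivered per 100% of CPU (one fully busy core).
    public static double ReqPerSecPerCore(double reqPerSec, double cpuPercent) =>
        reqPerSec / (cpuPercent / 100.0);

    public static void Main()
    {
        // (label, gcannon sync req/s, approximate CPU usage in percent)
        var rows = new (string Name, double ReqPerSec, double CpuPercent)[]
        {
            ("io_uring r+w, IVTS inline", 3_950_000, 1200),
            ("epoll r+w, IVTS inline",    3_170_000, 1200),
            ("Kestrel stock",             3_600_000, 1600),
        };

        foreach (var row in rows)
        {
            double perCore = ReqPerSecPerCore(row.ReqPerSec, row.CpuPercent);
            Console.WriteLine($"{row.Name,-28} {perCore,12:N0} req/s per core");
        }
    }
}
```

Under these assumptions the inline io_uring model delivers roughly 329K req/s per busy core against roughly 225K for stock Kestrel, which is the sense in which the inline models do more with less CPU.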
A very surprising result came from epoll r+w threadpool: I was expecting its performance to be equal to System.Net.Sockets. This will be quite interesting for part 7.
In part 7, some of these models will be integrated into Kestrel/ASP.NET for a direct benchmark comparison.