An arrow labelled DMA goes directly from the NIC into a registered memory box; the words 'kernel copy' below are struck through, showing zcrx skips the kernel copy.

C# Networking Deep Dive With io_uring — Part 4: Zero Copy Receive

DMA: Direct Memory Access
NIC: Network Interface Card
zcrx: io_uring zero copy receive
CQE: Completion Queue Entry

In part 4 we explore a "side quest": the io_uring zero copy receive mechanism. I'll still add the scaffold for the code, but this will be a more theoretical part, as I don't have a network card that supports this and so can't test it. Future parts will not use the zero copy receive mechanism.

Feel free to skip this part: it isn't required and won't affect later ones.

We usually see a lot of "hot path zero allocation super ultra fast socket server" here and there when people advertise their projects or frameworks. Well that's cute, pre allocate some memory and reuse it, but how about zero copy?

When we receive external data over the network through a network interface card (NIC), the NIC typically uses DMA to write this data to kernel memory, which the kernel then copies to a memory space our apps can access. This copy can be avoided by having the NIC use DMA to write the data bytes directly into user space accessible memory instead of kernel memory.

What is DMA?

Direct Memory Access is a hardware capability that lets a device read or write RAM "on its own", without involving the CPU. Without DMA, the only way to get data from a device into memory is for the CPU to read from the device's registers and copy it to RAM. The CPU is busy during the copy, and for a NIC pushing gigabits/sec this could consume a whole core.

For our case (NIC), the driver (running on the CPU) pre populates a ring of Rx descriptors, each pointing at a RAM buffer. When a packet arrives off the wire, the NIC's DMA engine writes the packet bytes straight into the next buffer in that ring, flags the descriptor as done and interrupts the CPU, which then drains the ring via NAPI. The CPU never copied the packet; the NIC placed it in RAM itself.
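To make the descriptor ring idea concrete, here is a toy model in plain C#. It is only a sketch under loud assumptions: a real ring lives in driver and hardware memory, the "NIC" here is an ordinary method, and the array copy merely stands in for the DMA write. All names (NicReceive, DriverPoll, RingSize and so on) are hypothetical.

```csharp
using System;

// Toy model only: a real NIC does this in hardware, the driver in kernel code.
const int RingSize = 4;      // number of Rx descriptors (hypothetical)
const int BufferSize = 2048; // bytes per receive buffer (hypothetical)

byte[][] buffers = new byte[RingSize][]; // RAM buffers the descriptors point at
int[] lengths = new int[RingSize];       // bytes written into each buffer
bool[] done = new bool[RingSize];        // descriptor "done" flags
for (int i = 0; i < RingSize; i++) buffers[i] = new byte[BufferSize];

int nicHead = 0;    // next descriptor the "NIC" will fill
int driverTail = 0; // next descriptor the "driver" will drain

// "NIC" side: place the packet in the next buffer and flag the descriptor.
// Packets are assumed to fit in BufferSize.
bool NicReceive(byte[] packet)
{
    int slot = nicHead % RingSize;
    if (done[slot]) return false;    // ring is full, packet is dropped
    packet.CopyTo(buffers[slot], 0); // stands in for the DMA write
    lengths[slot] = packet.Length;
    done[slot] = true;
    nicHead++;
    return true;
}

// "Driver" side: drain completed descriptors and make them reusable.
int DriverPoll()
{
    int drained = 0;
    while (done[driverTail % RingSize])
    {
        int slot = driverTail % RingSize;
        Console.WriteLine($"slot {slot}: {lengths[slot]} bytes");
        done[slot] = false; // descriptor is free for the NIC again
        driverTail++;
        drained++;
    }
    return drained;
}

NicReceive(new byte[] { 1, 2, 3 });
NicReceive(new byte[] { 4, 5 });
int count = DriverPoll();
```

The point of the model is the ownership handoff: the producer only fills descriptors that are free, and the consumer hands them back after draining.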

DMA targets physical addresses, not the virtual addresses our app/program sees. This address translation/pinning is why io_uring zcrx (zero copy receive) has to register our memory with the kernel first.

DMA is present on every normal network receive: the NIC DMAs the packet into kernel RAM buffers. The io_uring zcrx idea is to change the descriptor's target address so that the DMA lands in our registered memory instead, avoiding the extra kernel copy. This only works for the payload. The NIC splits headers from data: the headers still DMA into kernel buffers so the stack can do TCP/IP processing, and only the payload is steered into our registered area.

So, what changed?

In previous parts our recv path used an io_uring provided buffer ring: we allocated a big slab, sliced it into buffers and handed them to the kernel. When data arrived, the kernel picked a buffer and copied the data bytes into it. We want to remove that copy by having the NIC DMA the received bytes directly into our buffers.

To avoid making this part too extensive I'll focus on the main changes.

Let's do a high level comparison between Minima (parts 1-3) and MinimaZero (part 4 with zcrx).

Who fills the buffer?

Parts 1-3 - Kernel memcpys into our slab.
Part 4 - NIC DMAs into our registered area.

This is the pivotal difference; everything that follows exists to support it. In parts 1-3 the kernel receives the packet into its own memory and memcpys the payload to one of our pre registered slab buffers. With zero copy rx the NIC DMAs the payload straight into a memory area we register, avoiding the kernel copy.

What do we register?

Parts 1-3 - Provided buffer ring PBUF_RING.
Part 4 - zcrx ifq bound to NIC Rx queue (ZCRX_IFQ).

Parts 1-3 register a provided buffer ring with IORING_REGISTER_PBUF_RING, a pool of our own memory the kernel can copy into. In part 4 we register a zcrx interface queue with IORING_REGISTER_ZCRX_IFQ which binds our memory area to one specific NIC hardware receive queue, "wiring" our memory with the NIC's DMA path.

Recv operation (to be multishotted)

Parts 1-3 - RECV + IOSQE_BUFFER_SELECT
Part 4 - RECV_ZC multishot, no buffer selection

Parts 1-3 use IORING_OP_RECV with the IOSQE_BUFFER_SELECT flag, which tells the kernel to pick a buffer from the provided ring only when data arrives. This makes idle connections cheap. In part 4 we use IORING_OP_RECV_ZC without buffer selection, as the destination was already set when the ifq was registered. Both work with multishot.

Completion

Parts 1-3 - 16-byte CQE, buffer id in flags
Part 4 - 32-byte CQE, CQE32 plus trailing zcrx_cqe

In part 4 the 16 bytes are not enough, as the kernel needs to include information about where inside our area the NIC wrote. The ring therefore uses 32-byte CQEs (CQE32), and the extra 16 bytes carry the zcrx_cqe.
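As an illustration of what the extra 16 bytes look like, here is a minimal sketch that decodes a 32-byte completion from a raw byte array. The field layout is my reading of the Linux UAPI headers (io_uring_cqe followed by io_uring_zcrx_cqe) and assumes a little-endian host. The sample values are made up, and a real reactor would read the CQE straight from the mapped completion ring rather than from a managed array.

```csharp
using System;
using System.Buffers.Binary;

// Assumed layout (Linux UAPI, little-endian host):
//   io_uring_cqe      { u64 user_data; s32 res; u32 flags; }  -> bytes 0..15
//   io_uring_zcrx_cqe { u64 off; u64 __pad; }                 -> bytes 16..31
byte[] cqe = new byte[32];

// Build a made-up completion so the sketch has something to decode.
BinaryPrimitives.WriteUInt64LittleEndian(cqe.AsSpan(0, 8), 42UL);      // user_data
BinaryPrimitives.WriteInt32LittleEndian(cqe.AsSpan(8, 4), 1448);       // res: payload length
BinaryPrimitives.WriteUInt32LittleEndian(cqe.AsSpan(12, 4), 0u);       // flags
BinaryPrimitives.WriteUInt64LittleEndian(cqe.AsSpan(16, 8), 0x3000UL); // zcrx_cqe.off

// Decode: the first 16 bytes are the usual CQE, the next 16 the zcrx part.
ulong userData = BinaryPrimitives.ReadUInt64LittleEndian(cqe.AsSpan(0, 8));
int res = BinaryPrimitives.ReadInt32LittleEndian(cqe.AsSpan(8, 4));
uint flags = BinaryPrimitives.ReadUInt32LittleEndian(cqe.AsSpan(12, 4));
ulong off = BinaryPrimitives.ReadUInt64LittleEndian(cqe.AsSpan(16, 8));

Console.WriteLine($"user_data={userData} res={res} flags={flags} off=0x{off:X}");
```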

Locating the data

Parts 1-3 - slab + bid*size
Part 4 - area + (off & ~AREA_MASK) from the token

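Each completion points to the bytes. Parts 1-3 own fixed, numbered slots, so it's simple arithmetic: slab pointer plus buffer id times buffer size. In part 4 we don't own numbered slots; the NIC picks where to write in the provided area, and the completion's zcrx_cqe carries an offset (off) that we mask with AREA_MASK to get the position inside the area.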
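The two lookups can be sketched side by side. This is a minimal sketch, not the scaffold's code: the helper names are hypothetical, the shift values (16 for IORING_CQE_BUFFER_SHIFT, 48 for IORING_ZCRX_AREA_SHIFT) are what I believe the Linux UAPI headers define, and treating the upper bits of off as an area id is my assumption. The helpers return offsets, and adding them to the slab or area base pointer gives the data address.

```csharp
using System;

const int CqeBufferShift = 16; // IORING_CQE_BUFFER_SHIFT (assumed value)
const int ZcrxAreaShift = 48;  // IORING_ZCRX_AREA_SHIFT (assumed value)
const ulong ZcrxAreaMask = ~((1UL << ZcrxAreaShift) - 1); // IORING_ZCRX_AREA_MASK

// Parts 1-3: the buffer id lives in the upper 16 bits of cqe.flags.
long ProvidedBufferOffset(uint cqeFlags, int bufferSize)
{
    uint bufferId = cqeFlags >> CqeBufferShift;
    return (long)bufferId * bufferSize;
}

// Part 4: the low 48 bits of zcrx_cqe.off are the byte offset into the area.
long ZcrxOffset(ulong off)
{
    return (long)(off & ~ZcrxAreaMask);
}

uint sampleFlags = (7u << CqeBufferShift) | 1u; // buffer id 7, low flag bit set
ulong sampleOff = (1UL << ZcrxAreaShift) | 0x3000UL; // upper bits set, offset 0x3000

long slabOffset = ProvidedBufferOffset(sampleFlags, 4096);
long areaOffset = ZcrxOffset(sampleOff);

Console.WriteLine($"provided buffer offset: {slabOffset}"); // 7 * 4096 = 28672
Console.WriteLine($"zcrx area offset: {areaOffset}");       // 0x3000 = 12288
```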

Returning a buffer

Parts 1-3 - ReturnBuffer
Part 4 - refill queue entry RefillRqe

The lifecycle is similar for both: buffers must be "handed back" so that they can be used again. What changed is that in parts 1-3 we return the buffer id to the provided buffer ring with ReturnBuffer, while in part 4 we post an entry to the refill ring with RefillRqe. This entry is basically a token pointing back at a chunk of the registered area.
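Here is a rough sketch of what building a refill entry could look like, modelled with plain arrays instead of the real shared ring. It is not the scaffold's RefillRqe: the Refill helper, the array names and the areaToken parameter are hypothetical, and the entry layout (u64 off, u32 len) follows my reading of io_uring_zcrx_rqe in the Linux UAPI headers. I assume the kernel hands back an area token at registration time and that it gets OR-ed into the offset.

```csharp
using System;

const int ZcrxAreaShift = 48; // IORING_ZCRX_AREA_SHIFT (assumed value)
const ulong ZcrxAreaMask = ~((1UL << ZcrxAreaShift) - 1);
const uint RingEntries = 8;   // refill ring size, power of two (hypothetical)

// Stand-ins for the refill ring memory shared with the kernel.
ulong[] rqeOff = new ulong[RingEntries];
uint[] rqeLen = new uint[RingEntries];
uint tail = 0;

// Hand the chunk described by a completion back to the kernel.
void Refill(ulong cqeOff, int cqeRes, ulong areaToken)
{
    uint slot = tail & (RingEntries - 1);                 // wrap around the ring
    rqeOff[slot] = (cqeOff & ~ZcrxAreaMask) | areaToken;  // offset plus area token
    rqeLen[slot] = (uint)cqeRes;                          // length reported by the CQE
    tail++; // real code would publish this with a release store to the shared tail
}

ulong token = 1UL << ZcrxAreaShift; // made-up token value for the example
Refill(token | 0x3000UL, 1448, token);

Console.WriteLine($"rqe[0]: off=0x{rqeOff[0]:X} len={rqeLen[0]} tail={tail}");
```

As with ReturnBuffer in parts 1-3, forgetting this step would eventually starve the receive path, since the area has a finite number of chunks.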

Concurrency

Parts 1-3 - N reactors
Part 4 - 1 reactor, 1 ifq, one HW queue

Concurrency changed a lot, and since I cannot test zcrx due to not having a NIC that supports it, I could not work out or optimize the best way to set this up.

In parts 1-3 we run N reactors, each with its own ring and buffer pool, and with SO_REUSEPORT the kernel spreads incoming connections across the reactors. zcrx breaks that: the ifq binds to one hardware receive queue on a single reactor, while SO_REUSEPORT still spreads accepts across all reactors. A connection accepted on another reactor would have its zero copy bytes DMA'd into the ifq owning reactor's area, which breaks the multi reactor architecture where connections are owned by reactors and never thread hop. While it is still possible to have a multi reactor pattern with zcrx by having multiple ifqs, I could not test it so I won't cover it.

Host setup

Parts 1-3 - None
Part 4 - ethtool split + steering, NIC + kernel >= 6.15

Now the code

I decided to not include any code for this part as I cannot test it.

You can find my scaffolding here. It is a port to C# of an existing C implementation plus some "theoretical" changes I cannot test. It might be useful in the future if I manage to get my hands on a NIC that supports this.

Sources: