At the same load, how did latency look for A vs B?
What were throughput and latency at maximum load for A vs B? For whichever one had the smaller max throughput, what did latency look like for the other option at that same throughput?
For bonus points while testing: is there another observable metric that indicates available capacity, if CPU % free is less useful?
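As a rough sketch of how those two comparisons could be tabulated: the result format here (a dict of offered load to raw latency samples) and the names `percentile` and `compare` are made up for illustration, and it assumes each variant's highest tested load is its max sustainable throughput.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile; p is 0-100, samples must be non-empty."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def compare(runs_a, runs_b, p=99):
    """runs_*: {offered_load_rps: [latency_ms, ...]} (hypothetical format).

    Returns (same_load, at_smaller_max):
      same_load:      {load: (pN_a, pN_b)} for every load both variants ran at
      at_smaller_max: the same_load entry at the smaller of the two max loads,
                      or None if the other variant was not tested at that load
    """
    shared = sorted(set(runs_a) & set(runs_b))
    same_load = {
        load: (percentile(runs_a[load], p), percentile(runs_b[load], p))
        for load in shared
    }
    # Assumption: the highest load in each dict is that variant's max throughput.
    smaller_max = min(max(runs_a), max(runs_b))
    return same_load, same_load.get(smaller_max)
```

Looking at a tail percentile (p99 or so) rather than the mean tends to show the difference between the two options sooner as load approaches saturation.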
Specifically, code that is not thread-per-core has the following issues:
* you have to use atomics/locks to synchronize data access. These rely on expensive hardware operations to provide atomic semantics
* you have to deal with lock contention and cache contention
* when the OS migrates your thread to a different core, you’ve suddenly got cold caches all over the place (icache, dcache, and TLB).
There’s also a bunch of related things that pop up: even if you do thread per core, the hardware interrupts for your events probably land on a different CPU, resulting in extra overhead within the OS to deliver the event to you.
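To make the synchronization point from the list above concrete, here is a small structural sketch of the two designs. It is only an illustration: the function names are made up, and Python's GIL hides the actual hardware cost, so this shows the shape of the code and not the performance difference.

```python
import threading


def count_with_shared_lock(num_workers, increments):
    """Every worker bumps one shared counter under one lock (the contended design)."""
    total = 0
    lock = threading.Lock()

    def worker():
        nonlocal total
        for _ in range(increments):
            with lock:  # every increment goes through the same lock
                total += 1

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


def count_with_shards(num_workers, increments):
    """Each worker owns its state; results are only combined after the workers finish."""
    shards = [0] * num_workers

    def worker(idx):
        local = 0
        for _ in range(increments):
            local += 1  # no lock: nothing else touches this worker's state
        shards[idx] = local  # one write at the end, to a slot nobody else writes

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(shards)
```

In a thread-per-core design the second shape is the default: each core owns its data, and anything that has to be shared is exchanged at well-defined points instead of on every operation.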
io_uring doesn’t “handle more things in user space”. It specifically avoids a bunch of overheads: you context switch less (other cores can execute the OS code to process your request), you can pipeline I/O (you can tell the OS “do I/O A, then B, then C, and tell me when that’s all done”), and you get fewer memory copies (the kernel reads into your buffer directly without needing to create another copy, although this is more nuanced).
Anyway, the better mental model is that io_uring is more efficient, and thus CPUs spend less time standing around waiting for things to happen at the hardware level (context switching, waiting for locks, etc.). If the CPUs weren’t actually spending much time waiting, then you don’t get much benefit. This is the same phenomenon as the Jevons paradox in economics: I/O gets cheaper, so you can do more of it within a given unit of time, and thus your CPUs more often end up having real work to do.
For something like networking, if you are maximizing packets per second, you'll hit kernel limits[1] very quickly and instead have to start leveraging features like GSO/GRO or completely bypass the network stack.
[1] https://access.redhat.com/solutions/4723221
Go should reconsider support. They should have a 'go' at it.
With RDMA, DPDK, and io_uring, it’s really kind of up to the user to do the memory isolation.
In io_uring’s case though, you can’t do much because the rings are in the kernel.
I’m hopeful though that with LLMs things will get better.
But it’s just a hard problem to solve. Very difficult to do in the kernel itself, and folks don’t really even understand tuning for it.
I took a (very brief) look at the GitHub repo [1], and it doesn't look like you're doing anything with CPU pinning.
You can probably eke (thanks) out a bit more performance if you CPU-pin your threads and CPU-pin your listen sockets (sockopt SO_INCOMING_CPU).
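A minimal sketch of what that pinning could look like, written in Python for brevity (the same calls exist in C as `sched_setaffinity` and `setsockopt`). The helper names are made up, the fallback value 49 for `SO_INCOMING_CPU` is the asm-generic Linux value and is an assumption for Python versions before 3.11, and how the kernel uses the option to pick among reuseport listeners depends on the kernel version.

```python
import os
import socket
import sys

IS_LINUX = sys.platform.startswith("linux")
# Exposed by Python 3.11+; 49 is the asm-generic Linux value (assumption for
# older Pythons, and not correct on every architecture).
SO_INCOMING_CPU = getattr(socket, "SO_INCOMING_CPU", 49)


def pin_current_thread(cpu):
    """Pin the calling thread to a single CPU. Returns False where unsupported."""
    if not hasattr(os, "sched_setaffinity"):
        return False
    os.sched_setaffinity(0, {cpu})  # 0 means "the calling thread"
    return True


def make_pinned_listener(host, port, cpu, backlog=128):
    """Create one listen socket meant to be served by the worker pinned to `cpu`.

    Returns (sock, steered); steered is False if SO_INCOMING_CPU was not applied.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    if hasattr(socket, "SO_REUSEPORT"):
        # Lets every worker bind its own socket to the same port.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
    steered = False
    if IS_LINUX:
        try:
            sock.setsockopt(socket.SOL_SOCKET, SO_INCOMING_CPU, cpu)
            steered = True
        except OSError:
            pass  # older kernel or restricted environment: carry on unsteered
    sock.bind((host, port))
    sock.listen(backlog)
    return sock, steered
```

The intended use is one worker thread per CPU, where each worker first calls `pin_current_thread(cpu)` and then `make_pinned_listener(host, port, cpu)` with the same port and its own CPU number.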
If you also CPU-align your outgoing sockets, you should get a significant boost, but AFAIK there's no great API for that. Linux does have an API for compatible NICs (traffic steering/flow steering) which can work, but if you know what hash your NIC uses (it's probably Toeplitz) and you manage source port selection to your backend, you can pick ports that will hash properly.
The goal is for your proxy to be able to handle packets without any cross-CPU communication.
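For the port-selection idea, a sketch of a Toeplitz hash and a brute-force source port picker might look like this. Several things here are assumptions: the key is the widely published RSS verification key and your NIC's real key may differ (`ethtool -x` shows it on drivers that expose it), the hash input is the IPv4/TCP 4-tuple, and `hash % num_queues` is a simplification of the indirection table a real NIC uses. The function names are made up.

```python
import ipaddress
import struct

# Widely published RSS verification key (assumption: your NIC may use another).
RSS_KEY = bytes([
    0x6d, 0x5a, 0x56, 0xda, 0x25, 0x5b, 0x0e, 0xc2,
    0x41, 0x67, 0x25, 0x3d, 0x43, 0xa3, 0x8f, 0xb0,
    0xd0, 0xca, 0x2b, 0xcb, 0xae, 0x7b, 0x30, 0xb4,
    0x77, 0xcb, 0x2d, 0xa3, 0x80, 0x30, 0xf2, 0x0c,
    0x6a, 0x42, 0xb7, 0x3b, 0xbe, 0xac, 0x01, 0xfa,
])


def toeplitz_hash(key, data):
    """32-bit Toeplitz hash: XOR the 32-bit key window at every set input bit."""
    key_bits = len(key) * 8
    if len(data) * 8 + 32 > key_bits:
        raise ValueError("key too short for this input")
    key_int = int.from_bytes(key, "big")
    result = 0
    for i in range(len(data) * 8):
        if data[i // 8] & (0x80 >> (i % 8)):  # input bit i, MSB first
            result ^= (key_int >> (key_bits - 32 - i)) & 0xFFFFFFFF
    return result


def rx_hash_input(src_ip, dst_ip, src_port, dst_port):
    """IPv4/TCP hash input: src addr, dst addr, src port, dst port (network order)."""
    return (ipaddress.IPv4Address(src_ip).packed
            + ipaddress.IPv4Address(dst_ip).packed
            + struct.pack("!HH", src_port, dst_port))


def pick_source_port(local_ip, backend_ip, backend_port, want_queue, num_queues,
                     key=RSS_KEY, ports=range(32768, 61000)):
    """Find a local port whose reply traffic hashes to want_queue, or None."""
    for port in ports:
        # Replies arrive as backend -> proxy, so the backend is the *source*.
        data = rx_hash_input(backend_ip, local_ip, backend_port, port)
        # Simplification: real NICs index an indirection table with the hash.
        if toeplitz_hash(key, data) % num_queues == want_queue:
            return port
    return None
```

The proxy would then bind the outgoing socket to the returned port before connecting, so replies from the backend land on the queue (and CPU) that owns the client side of the connection. A real implementation also has to skip ports that are already in use.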
[1] https://github.com/sibexico/TinyGate