Hacker News

NUMA: Cores, memory, and the distance between them

56 points by sys_call 5 days ago | 8 comments

Twirrim 15 minutes ago

NUMA is one of those amazing things that trips you up in all sorts of ways at unexpected times. It's the "invisible" performance killer: invisible because unless you're already aware of NUMA, or remember to check, you won't know it's there, potentially crippling you.

It has been a routine topic of conversation with customers and engineers of all kinds, and it's often one of those things you don't know about until it's too late.

I don't know if the kernel has improved this behaviour in the several years since this was last tested, but a coworker realised that the Linux page cache wasn't fully split by NUMA node. They were benchmarking MySQL, running an instance in each NUMA node, and noticed the second NUMA node was noticeably slower. Then they discovered that after a reboot the second node was fast and the first was slower. After a bit of thinking and tinkering they worked out that libmysql was ending up in the page cache on whichever NUMA node the benchmark client was first run in. So even though they were pinning both the benchmark tool and the mysql process to a NUMA node, the benchmark client on the other node was forcing the OS to reach across to the first node to get at the page-cached library.
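For anyone wanting to check for this kind of thing, here is a rough sketch (not what my coworker actually ran) that sums where a mapped file's resident pages live, based on the `N<node>=<pages>` fields in `/proc/<pid>/numa_maps`. The function name `pages_per_node` and the substring match on the path are my own choices for illustration.

```python
import re
from collections import Counter

def pages_per_node(numa_maps_text, path_fragment):
    """Sum resident pages per NUMA node for file mappings whose path contains path_fragment."""
    totals = Counter()
    for line in numa_maps_text.splitlines():
        fields = line.split()
        # File-backed mappings carry a "file=<path>" field; skip everything else.
        if not any(f.startswith("file=") and path_fragment in f for f in fields):
            continue
        # Each "N<node>=<pages>" field is the number of pages resident on that node.
        for field in fields:
            match = re.fullmatch(r"N(\d+)=(\d+)", field)
            if match:
                totals[int(match.group(1))] += int(match.group(2))
    return dict(totals)
```

Feed it the contents of `/proc/<pid>/numa_maps` for the pinned process and a fragment such as `"libmysql"`. If most of the pages sit on a node other than the one the process is pinned to, you are probably looking at the same problem.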

treesknees 49 minutes ago

Something I didn’t see mentioned is that this unequal memory access time also affects PCIe I/O. If your thread on CPU A needs to get data in or out of a NIC attached to CPU B, your throughput and latency will suffer.

We have to explain this to customers of our software all the time. It’s something that’s easy to miss.
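A minimal sketch of how you might look this up on Linux, assuming the NIC is a PCI device that exposes `device/numa_node` in sysfs (virtual interfaces don't have it, and the kernel reports -1 when it knows of no affinity). The helper names `parse_cpulist` and `nic_local_cpus` are made up for this example.

```python
from pathlib import Path

def parse_cpulist(text):
    """Expand a kernel cpulist string such as "0-3,8-11" into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def nic_local_cpus(iface, sysfs=Path("/sys")):
    """Return (numa_node, cpus) for a network interface; cpus is empty when node is -1."""
    node = int((sysfs / "class/net" / iface / "device/numa_node").read_text())
    if node < 0:
        return node, []
    cpulist = (sysfs / "devices/system/node" / f"node{node}" / "cpulist").read_text()
    return node, parse_cpulist(cpulist)
```

With the CPU list in hand, something like `os.sched_setaffinity(0, cpus)` keeps the current process on the NIC's node. Whether that is the right trade-off depends on where the rest of the workload's memory lives.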

suprjami 17 minutes ago

Same. The drop in performance can be surprisingly bad. 10Gbps becomes 5Gbps. 100Gbps becomes 20Gbps.

lukax 4 hours ago

NUMA can cause really crappy performance. We ran a Go-based LLM gateway in Kubernetes on a server with hundreds of CPU cores. We didn't explicitly set GOMAXPROCS, so the Go runtime scheduled goroutines across many different CPUs; it constantly used 200% CPU and GC was causing latency spikes. Then we set GOMAXPROCS to 8 and all the performance issues went away. Until recently Kubernetes didn't work well with NUMA.
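One way to derive a value instead of hard-coding it is to read the container's CPU quota. The sketch below is not how we picked 8; it assumes cgroup v2, where `cpu.max` holds `<quota> <period>` or `max <period>`, and the function name `cpu_limit_from_cpu_max` is hypothetical.

```python
import math

def cpu_limit_from_cpu_max(text, available_cpus):
    """Turn cgroup v2 cpu.max contents into a whole CPU count, capped at available_cpus."""
    quota, period = text.split()
    if quota == "max":
        # No quota configured; fall back to the CPUs the process may run on.
        return available_cpus
    # Round fractional quotas up so a 0.5 CPU limit still yields one worker.
    limit = math.ceil(int(quota) / int(period))
    return max(1, min(limit, available_cpus))
```

In an entrypoint script you could pass it the contents of `/sys/fs/cgroup/cpu.max` and `len(os.sched_getaffinity(0))`, then export the result as the `GOMAXPROCS` environment variable before starting the Go binary.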

CarRamrod 2 hours ago

There is one instance where the NUMA performance never disappoints: https://www.youtube.com/watch?v=Cqd1Gvq-RBY

re-thc 3 hours ago

Is this on AMD? I wonder whether it's all down to NUMA or to their CCD architecture (although these days Intel and everyone else does something similar to some extent).

Twirrim 20 minutes ago

Intel suffers just as much when NUMA enters the picture, even prior to CCD-style architectures. That extra latency hop across to the other node to get at memory is absolutely crippling, especially in a hot loop. It requires very careful handling, while being this kind of invisible element: unless you know to look for it, nothing will draw your attention to it.

toast0 3 hours ago

Hundreds of cores likely means two sockets, so you've got NUMA there.

Scaling to large core counts has a lot of gotchas.
