Also, as other commenters have mentioned, redirecting allocations to managed memory would enable similar oversubscription.
And the hijack approach only makes sense for giving apps this behavior with no changes at all; the same result could be achieved with minor app changes (e.g. PyTorch has a pluggable allocator interface). App changes also enable intentionally placing specific allocations.
My impression is that this is vibe-coded from beginning to end, starting from a design that only makes sense if you are hallucinating.
"I turned a $95 AMD APU into a 16GB VRAM GPU and it can run stable diffusion!"
I have the 4650G APU, and the best way to describe it is: lacking support. This was even more true three years ago than now. ROCm was absolutely dogshit then; I know this because I tried to do the same thing when that post was made. You had to compile everything from scratch and get the relevant patches, and even then xformers, a library that accelerates diffusion model inference, was not supported for Renoir or ROCm back then. Yes, you could generate an image, but it was much slower and riddled with bugs. You couldn't update ROCm because it broke compatibility, which is partly the reason I got into NixOS. That being said, those APUs are a powerhouse. Nowadays I can run decent agentic workflows on them (I have 64 GB of DDR4 RAM, i.e. the APU can suck up as much as it needs with the latest Linux kernels).
Just note, diffusion models are still second-class citizens on AMD APUs and even GPUs. But then again, there's nothing close on the market right now except for what Apple offers.
But I'm always interested in first-hand experiences of how good it really is - I'm pretty cynical about the idea that AMD actually knows what it takes to build good software end-to-end.
The ExLlamaV3 EXL3 2bpw (8 GB, full VRAM) row is an order of magnitude faster than the baseline - but the baseline seems to be the 32GB model running with the KV cache shared to system memory only (I think?)
But if an 8 GB model gives sufficient quality, then it seems like that would have worked without the shared memory thing?
I think the useful apples-to-apples benchmark is currently the Ollama + GreenBoost shim baseline (2–5 tps) vs ExLlamaV3 + GreenBoost cache (8–20 tps) comparison.
It would be really useful to see this compared with the existing llama CPU/memory offload. There is a note at the start ("Offload layers to CPU — works, but drops token/s by 5–10× because CPU RAM has no CUDA coherence") - but it is unclear if that 5-10x token speed drop is compared to running a model completely in GPU or compared to the greenboost approach.
I think it is vs GPU, in which case it seems likely the performance is similar to what greenboost is giving but probably much more stable.
The reported size of the ModelOpt FP8, 16 GB, sounds wrong to me. If it's 8 bits per parameter, it is going to be a similar size to glm-4.7-flash:q8_0. They repeat this a few times in the readme.
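Back-of-envelope check. The ~30B parameter count below is an assumption inferred from the quoted 31.8 GB q8_0 file (q8_0 stores roughly 8.5 bits per weight once scales and metadata are counted):

```python
# Rough model-file-size arithmetic: size_GB = params * bits_per_weight / 8 / 1e9.
# The 30e9 parameter count is an assumption, back-solved from the 31.8 GB q8_0 file.
def model_size_gb(params, bits_per_weight):
    return params * bits_per_weight / 8 / 1e9

params = 30e9
print(model_size_gb(params, 8.5))  # q8_0 at ~8.5 bits/weight -> ~31.9 GB, matches 31.8 GB
print(model_size_gb(params, 8.0))  # FP8 at 8 bits/weight -> ~30 GB, nowhere near 16 GB
print(model_size_gb(params, 2.0))  # EXL3 2bpw -> ~7.5 GB, consistent with the 8 GB figure
```

So an honest FP8 export of the same model should come out around 30 GB, not 16 GB, which is why the readme number looks off.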
> I have an RTX 5070 with 12 GB VRAM and I wanted to run glm-4.7-flash:q8_0, which is a 31.8 GB model. The standard options are:
> Offload layers to CPU — works, but drops token/s by 5–10× because CPU RAM has no CUDA coherence. You end up waiting.
> Use a smaller quantization — you lose quality. At q4_0 the model is noticeably worse on reasoning tasks.
> Buy a bigger GPU — not realistic for consumer hardware. A 48 GB card costs more than a complete workstation.
> None of those felt right, so I built an alternative: route the overflow memory to DDR4 via DMA-BUF, which gives the GPU direct access to system RAM over PCIe 4.0 without a CPU copy involved.
And then limps home with this caveat on the closest thing to a benchmark:
> The PCIe 4.0 link (~32 GB/s) is the bottleneck when the model overflows VRAM. The best strategy is to shrink the model until it fits — either with EXL3 quantization or ModelOpt PTQ — and use GreenBoost's DDR4 pool for KV cache only.
I think the reason it refers to DDR4 is because that is how the user explained it to their coding agent. LLMs are great at perpetuating unnecessary specificity.
For actual training, explicit sharding and RAM mapping are ugly, but at least you can see where the pressure is and reason about it. 'Transparent' often just means performance falls off a cliff and now debugging it sucks.
https://en.wikipedia.org/wiki/TurboCache
(Not the same thing 1:1, but worth the joke anyway)
Does this make sense? I'd have thought the KV is guaranteed to be used 100% of the time while say in a MoE the same can't be said of the weights.
Though I suppose if you're shooting for huge context, then having that allocation go into RAM makes sense, especially when it's allocated but not used yet.
MoE is kinda related in terms of lower usage requirements vs a dense model of same total param size, but I think your mental model is a bit off.
I would prefer to use system memory to cache different models, focusing on things like embedding, rerankers, and TTS. This is sufficient to run a more complex RAG locally, for example, via Mem0, and then use a larger LLM via the cloud.
(Still cool, still would benefit from better benchmarks)
> The code is really bad, with completely unneeded parts. The LLM (Qwen 2.5 7B) has hardcoded the i9 14700KF topology, and has variables related to it that are never used... It's even funnier that the show hardware function always prints the same string. There are even random pip log files. Why did this slop get coverage here?
https://www.phoronix.com/forums/forum/linux-graphics-x-org-d...

As soon as I switched to Linux I had all sorts of problems on Wayland where as soon as that 2 GB was reached, apps would segfault or act in their own unique ways (opening empty windows) when no GPU memory was available to allocate.
Turns out this is a problem with NVIDIA on Wayland. On X, NVIDIA's drivers act more like Windows. AMD's Linux drivers act more like Windows out of the box on both Wayland and X. System memory gets used when VRAM is full. I know this because I got tired of being unable to use my system after opening 3 browser tabs and a few terminals on Wayland so I bought an AMD RX 480 with 8 GB on eBay. You could say my cost of running Linux on the desktop was $80 + shipping.
A few months ago I wrote a long post going over some of these details at https://nickjanetakis.com/blog/gpu-memory-allocation-bugs-wi.... It even includes videos showing what it's like opening apps both on Wayland and X with that NVIDIA card.
Great way to backstab you if you prefer inference speed.
Most people who know it does this turn it off, because it kicks in too early: if you have 24 GB it'll offload to RAM and tank your inference speed when you hit around 22 GB of use.
https://nvidia.custhelp.com/app/answers/detail/a_id/5490/~/s...
Do not put swap on an SSD you care about at all.
With this approach the compute occurs on the GPU, with the tradeoff that layers in RAM have to be moved back-and-forth through PCI-DMA. It seems to me that this should offer a speedup vs compute split between GPU and CPU. The amount of speedup will depend on how many layers would have been on CPU compute, minus the reduction due to moving those layers between RAM and the GPU.
What's slower? Compute on the CPU or moving data from RAM to GPU through PCI-DMA?
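A rough way to frame it (all bandwidth numbers below are ballpark assumptions, not measurements): both options end up streaming the same overflow weights out of system RAM every token, so the first-order comparison is which link they stream over.

```python
# Per-token cost of touching an overflow slice of weights, assuming the work is
# memory-bound either way. Sizes and bandwidths are illustrative assumptions.
overflow_bytes = 10e9          # assume 10 GB of weights don't fit in VRAM

pcie4_x16_bps = 32e9           # GPU pulling weights over PCIe 4.0 x16
ddr5_local_bps = 90e9          # CPU reading the same weights from local DDR5

t_gpu_over_pcie = overflow_bytes / pcie4_x16_bps   # ~0.31 s per token
t_cpu_local     = overflow_bytes / ddr5_local_bps  # ~0.11 s per token

# Under these assumptions the CPU wins on pure streaming, but it must then do
# the matmuls itself, which is where CPU offload loses on compute-heavy layers.
print(t_gpu_over_pcie, t_cpu_local)
```

So neither is strictly slower: the PCIe path keeps the fast compute but the slowest link, while CPU offload gets faster access to RAM but far less compute.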
Also, we had memory slots on '90s cards. They were extremely expensive and proprietary. Ever seen a Matrox VRAM card? I never did.
Like the M.2 connector?
> Data lines need to be longer
Like the data lines going all the way to an on-motherboard storage device?
Yes, though likely something with a higher pin count since memory access is more likely to be random and can be parallel versus block storage.
> Like the data lines going all the way to an on-motherboard storage device?
Yes. Why would a GPU manufacturer/packager take on that cost, if it’s presently served well enough for most people by offloading it onto other parts of the system?
This is why there are several proposals of improved forms for memory modules, which use different sockets, like LPCAMM2, which should be able to work with faster memories.
However even LPCAMM2 is unlikely to work at the speeds of soldered GDDR7.
Moreover, when you do this manually, unless it is something that you do every day it may be quite difficult to be certain that soldering has been done well enough to remain reliable during long term use. In the industry, very expensive equipment is used to check the quality of soldering, e.g. X-ray machines.
So unlike inserting a memory module in a socket, which is reasonably foolproof, soldering devices is not something that could be used in a product sold to the general population.
When I was young, there still existed computer kits, where you soldered yourself all the ICs on the motherboard, so you could get a computer at a much lower price than for a fully assembled computer. My first PC was of this kind.
However, at that time PCs were still something that was bought by a small fraction of the population, which were people that you could expect to be willing to learn things like how to solder and who would be willing to accept the risk of damaging the product that they have bought. Today PCs are addressed to the general public, so nobody would offer GPU cards that you must solder.
PCI was allowed to be even longer. Old AT and ATX cases had a slotted support bracket to hold the far end of the PCI cards. See what an Adaptec 2400A looks like.
In general, soldered RAM seems to get much higher bandwidth than removable RAM. See Ryzen AI Max vs 9950X max RAM throughput, for example.
Strix Halo seems to use LPDDR with 8000 MT/s, which is a bit faster than the usual 5600–6400 MT/s "normal" DDR5 DIMMs (albeit faster, more expensive ones seem to exist), so there's a slight edge towards soldered memory (not sure about LPCAMM2 and similar tech).
GDDR7 is a different league: a 5070 Ti also has a 256-bit memory interface, but has 896 GB/s bandwidth, compared to Strix Halo's 256 GB/s.
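The gap falls straight out of bus width times transfer rate (the 28 GT/s GDDR7 rate for the 5070 Ti is an assumption based on published specs):

```python
# Peak bandwidth = (bus width in bytes) * (transfers per second).
def peak_gbs(bus_bits, mt_per_s):
    # bus_bits/8 bytes move on every transfer; mt_per_s is in megatransfers/s
    return bus_bits / 8 * mt_per_s / 1000

print(peak_gbs(256, 28000))  # 5070 Ti, 256-bit GDDR7 @ 28 GT/s -> 896.0 GB/s
print(peak_gbs(256, 8000))   # Strix Halo, 256-bit LPDDR5X-8000 -> 256.0 GB/s
print(peak_gbs(128, 6400))   # "normal" dual-channel DDR5-6400  -> ~102.4 GB/s
```

Same bus width, ~3.5x the transfer rate, so ~3.5x the bandwidth.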
I had to get everything top spec to fit four DIMMs of 6000 MT/s on my 9950X (Asus ProArt motherboard and the top-tier Trident Neo RAM sticks) -- otherwise it's reportedly unstable.
Strix Halo simply has more memory controllers. Threadrippers are also quad channel, and should be able to run 4 DIMMs at rated speeds, but the cheapest Zen 5 Threadripper seems to be almost double the price of a 9950X3D.
Thanks for the info on the hardware quirks, useful to know!
We seem to be arriving at a cambrian explosion of viable hardware these days between ARM and x86, soldered vs DIMM, etc.
It's refreshing coming from 20 years of x86 being all that matters.
Not as GPU VRAM expansion.
(Feels especially deceptive when there is another top story right with the headline “nvidia nemoclaw” which is an official project)
Is the virtual memory system a good way to expand memory for inference, via having it directly manage a non-uniform pool? I haven't seen any research on that.
and instead of improving the actual product, it decided to "solve the problem in software"
I expect this GreenBoost to crash and burn, honestly...
I really appreciate thrifty and resourceful points of view. Exploring "what if" and looking for uses is such a great virtue.
In any case, loading a gigantic model just to use system RAM is absurdly slow (due to memory bandwidth), like 1–5 t/s, so it's not practical. It'd take a whole day to process one 86k-token request. Just pay a cloud provider $0.01 to do it in 10 seconds.
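The "whole day" figure checks out at the low end of that range:

```python
# Time to grind through one long request at the assumed worst-case throughput.
tokens = 86_000
tps = 1.0                       # assumed low end of the 1-5 t/s range
hours = tokens / tps / 3600
print(round(hours, 1))          # -> 23.9, i.e. roughly a full day
```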
This is a great project. I love the possibilities it hints at. Thanks for building it!
The benchmarks at the bottom mention memory tiering and manually controlling where things go, but if your application already does that, then you probably don’t also need a CUDA shim. The application should control the VRAM to system memory transfers with boring normal code.
You're basically stating that swapping is also a bad idea. And to take it further, any memory or storage is a bad idea, because there's L1 cache/SRAM which is faster than the rest.
The fundamental problem here is that the workload of LLMs is (vastly simplified) a repeated linear read of all the weights, in order. That is, there is no memory locality in time. There is literally anti-locality; When you read a set of weights, you know you will not need them again until you have processed everything else.
This means that many of the old approaches don't work, because time locality is such a core assumption underlying all of them. The best you can do is really a very large pool of very fast ram.
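A toy simulation of the point: under a repeated linear scan that is larger than the cache, LRU evicts every block just before it is needed again, so the hit rate is exactly zero, while a workload with temporal locality caches almost perfectly.

```python
from collections import OrderedDict

def lru_hit_rate(accesses, capacity):
    # Minimal LRU cache simulator: returns the fraction of accesses that hit.
    cache, hits = OrderedDict(), 0
    for key in accesses:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict least recently used
            cache[key] = True
    return hits / len(accesses)

scan = list(range(100)) * 5        # five full passes over 100 "weight blocks"
print(lru_hit_rate(scan, 50))      # cache holds half the model -> 0.0, every access misses

hot = [0, 1, 2, 3] * 125           # contrast: a workload with heavy reuse
print(lru_hit_rate(hot, 50))       # -> 0.992, classic caching works fine
```

This is why "keep the hot half in VRAM" buys nothing for dense inference: with a cyclic scan there is no hot half.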
In the long term, compute is probably going to move towards the memory.
The activated experts are only known after routing, at which point you need the weights immediately, and you will have very poor performance if they are across PCIe.
Is that a crazy thing to say? I can't recall the last time I was grateful for swap; it might've been before 2010.
Er, I did exactly this over a decade ago and never looked back. It's literally one of the first things I do on a new machine.
> Might be fine if you’re never using all your RAM
That's definitely happened occasionally, and no, swap almost always just makes it worse. The thrashing makes the entire machine unusable instead of making the allocating app(s) potentially unstable. I've recovered most times by just immediately killing the app I'm using. And in fact I have warnings that sometimes tell me fast enough before I reach the limit to avoid such issues in the first place.
Somewhat indirectly but still.
I've been a bit too busy to turn mine on for a while.
The biggest factor in whether AMD GPUs on Linux are a PITA or not is ROCm. Strix Halo is supported in ROCm 6.x, so it should be supported on most platforms (I haven't tested it, though). ROCm 7.x is supposed to be better, but not all apps support it yet.
AMD, if you're reading this, please hire more SWEs. Nvidia will continue to dominate until you beat them at software.
It's pretty weird to insist on a counterargument that has no implications or consequences to the presented argument.
Yes, swapping is a bad idea.
Your second argument also falls flat, because the standard CUDA hardware setup doesn't use CXL so cache coherence isn't available. You're left with manual memory synchronization. Pretending that GPUs have cache for system RAM when they don't is pretty suspect.
Using system memory from the GPU isn’t that bad if your compute is high enough and you don’t transfer that much data. There are commercial applications that support it and only see low 2-digit percentage perf impact and not the multiples you might expect. Plus on Windows on Nvidia hardware, the driver will automatically use system memory if you oversubscribe VRAM, and I believe this was introduced to support running Stable Diffusion on smaller GPUs.
Yes, with current LLMs and current hardware and current supporting software this is a true statement. My point wasn't that this approach suddenly changes that, it was that it makes it easier to explore alternatives that might change that. Let's imagine some possibilities:
- Models that use a lot of weight reuse: If you strategically reuse layers 3-4x that could give a lot of time for async loading of future weights.
- Models that select experts for several layers at a time: Same thing, while crunching on the current layer you have teed-up future layers that can be transferring in
- HW makers start improving memory bandwidth: This is already happening, right? AMD and Apple are pushing unified memory architectures with much higher bandwidth, but still not quite there compared to GPUs. This could lead to a hybrid approach that makes those machines much more competitive. Similarly, HW makers could bring back technologies that died on the vine that could help; things like Intel's Optane come to mind. Start making mass storage as fast as system memory is now, and the equation may change.
These are quick dart throws that probably have obvious holes in them but the point is platforms like this help us explore paths that appeared dead-end until that one change makes them viable and then allows them to take over. It may not happen. It may be a dead end. But that logic means we will never go out on a limb and try something new. We need people and tech that challenges assumptions and makes it easy for people to try out ideas to keep the tech ecosystem evolving. This does that. Even if this particular project doesn't succeed it is a great thing to do if for no other reason it likely just spurred a bunch of people to try their own crazy hacks for LLM inference. Maybe it even enabled a use case with GPUs that nobody realized existed and has nothing to do with LLMs.
For example, 16x PCIe 4.0: ~32 GB/s; 16x PCIe 5.0: ~64 GB/s; while 2x DDR5-6400 DIMMs: 102.4 GB/s. The actual throughput is lower for both PCIe and DDR5, due to communication overhead.
On server/workstation motherboards which may have 4, 8 or 12 DIMMs instead of 2, the ratio between memory bandwidth and PCIe bandwidth becomes proportionally higher, so the memory throughput achievable by the GPU becomes a very small fraction of the system memory bandwidth.
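Plugging in numbers (DDR5-4800 for the 12-channel server board is an assumption; 128b/130b is PCIe's line-encoding overhead):

```python
def pcie_gbs(lanes, gt_per_s):
    # PCIe 4.0/5.0 use 128b/130b encoding; each lane carries 1 bit per transfer
    return lanes * gt_per_s * 128 / 130 / 8

def ddr5_gbs(channels, mt_per_s):
    # each DDR5 channel is 64 bits (8 bytes) wide
    return channels * 8 * mt_per_s / 1000

pcie4 = pcie_gbs(16, 16)      # ~31.5 GB/s into the GPU
desktop = ddr5_gbs(2, 6400)   # 102.4 GB/s  -> GPU sees ~31% of system bandwidth
server = ddr5_gbs(12, 4800)   # 460.8 GB/s  -> GPU sees only ~7% of it
print(pcie4, pcie4 / desktop, pcie4 / server)
```

So the more memory channels the host has, the more the PCIe slot becomes the limiting factor for the GPU, exactly as stated above.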
> On server/workstation motherboards ... the memory throughput [to system RAM] achievable by the GPU becomes a very small fraction of the system memory bandwidth.
Yes, this is a critical point. It means that this is only realistically useful for prefill, which is compute- and not memory-bandwidth bound.
Decode - the model chooses a new token to append to the end of the current token list (i.e. it generates a token), then computes the new token's KVs.
Decode is basically prefill 1 tok -> add 1 tok -> prefill 1 more tok -> ....
but in the initial prefill stage it doesn't need to do generation, since you've provided the toks.
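A toy sketch of the two phases. The "model" functions here are dummy stand-ins, not a real forward pass; the structural point is that prefill computes KVs for the whole prompt in one batch, while decode appends one token (and its KVs) per step:

```python
def compute_kv(token):
    # stand-in for the attention K/V projection of one token
    return (token * 2, token * 2 + 1)

def dummy_next_token(kv_cache):
    # stand-in for the forward pass + sampling; here it just returns the length
    return len(kv_cache)

prompt = [101, 102, 103]
kv_cache = [compute_kv(t) for t in prompt]   # prefill: all prompt KVs in one batch

generated = []
for _ in range(4):                           # decode: one token at a time,
    tok = dummy_next_token(kv_cache)         # each step reruns the whole model
    generated.append(tok)
    kv_cache.append(compute_kv(tok))         # then caches the new token's KVs

print(generated)                             # -> [3, 4, 5, 6]
```

That per-step loop is why decode is memory-bandwidth bound (every step re-reads all the weights for one token) while prefill amortizes the weight reads over the whole prompt.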
edit: Are you sure PCI-E is even that fast? Looking at the chart on Wikipedia (did not research further - so grain of salt here) shows much lower throughput
So don't use it for large requests. Ideal for when you just want to categorise things, for example, "does this task need a shell" or "bucket this email into one of help request, bill due or personal comms".
Sounds ambitious, for the small improvement in effective capacity. In particular when I start wondering if real life speed differences would be small enough for that 10% increase, or if it would be even smaller. And that's before factoring in power/cooling cost for saturating another interface.
Unfortunately that does not matter. Even on a cheap desktop motherboard, the memory bandwidth is higher than that of 16-lane PCIe 5.0.
Therefore the memory bandwidth available to a discrete GPU is determined by its PCIe slot, not by the system memory.
If you install multiple GPUs, in many MBs that will halve the bandwidth of the PCIe slots, for an even lower memory throughput.
Not on boards that have 12 channels of DDR5.
But yeah, squeezing an LLM from RAM through the PCIe bus is silly. I would expect it would be faster to just run a portion of the model on the CPU in llama.cpp fashion.
Edit: the setting is "GGML_CUDA_ENABLE_UNIFIED_MEMORY=1"... useful if you have unified memory, very slow if you do not.
It’s fast for hybrid inference, if you get the KV and MoE layers tuned between the Blackwell card(s) and offloading.
We have a prototype unit and it’s very fast with large MoEs