What's in a GGUF, besides the weights – and what's still missing?
51 points by bashbjorn 5 hours ago | 18 comments

theapadayo 20 minutes ago
IMO the biggest thing still missing is an actual way to define the model architecture outside of it being hard-coded into the current build. It doesn't need 1:1 performance parity with the fully supported models. Having proper, vendor-validated support on day 1 is the difference between people thinking a model is amazing vs. horrible. See the recent Gemma vs. Qwen releases.

Not sure what the solution is, other than writing a DSL to describe the model graphs and embedding it in the GGUF. The other fallback is to just read the PyTorch modules from the official model releases and convert them to GGML ops somehow.
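Just to sketch the DSL idea: GGUF metadata is already arbitrary typed key/value pairs, so a graph description could ride along under a custom key today. A toy example, assuming the gguf-py package from the llama.cpp repo; the key name and the JSON "DSL" here are entirely made up:

    # pip install gguf  (gguf-py, maintained in the llama.cpp repo)
    import json
    from gguf import GGUFWriter

    # Hypothetical mini-DSL: nodes reference tensors stored in the same file.
    graph = {"nodes": [
        {"op": "mul_mat", "in": ["blk.0.attn_q.weight", "x"], "out": "q"},
        {"op": "soft_max", "in": ["q"], "out": "attn"},
    ]}

    w = GGUFWriter("model.gguf", arch="llama")
    w.add_string("custom.graph_dsl", json.dumps(graph))  # made-up key
    w.write_header_to_file()
    w.write_kv_data_to_file()
    w.write_tensors_to_file()  # no tensors in this toy file
    w.close()

Storing the description is the easy part; the hard part is getting every backend to agree on the op vocabulary and run it as fast as the hand-written graphs.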

reply
LoganDark 15 minutes ago
I feel like the computation graph could be embedded alongside the weights, similar to how ONNX works. Then you expose some common interfaces that accept some common parameters, and additional custom ones can effectively be extensions, sort of like how Wayland works. That way you can support not only transformer-ish models like LLaMA, but also RNN-ish models like RWKV, multimodal models, and more. Not sure how this would be implemented in practice, but it sounds like a cool idea. I just worry that if the computation graph is baked into the model file, then improvements to the architecture or optimizations that don't require changes to the weights won't be applied to existing files without a conversion.
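For reference, ONNX really does ship the graph and the weights in one file, and walking it takes no architecture-specific code. A minimal look, assuming the standard onnx Python package and some model.onnx on disk:

    # pip install onnx
    import onnx

    model = onnx.load("model.onnx")
    # Ops and topology live in the same protobuf as the weight initializers.
    for node in model.graph.node:
        print(node.op_type, list(node.input), "->", list(node.output))

The flip side is exactly the worry above: the graph is frozen at export time, so runtime improvements have to come from rewriting the graph rather than from shipping a better hand-tuned implementation of a named architecture.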
reply
Sharlin 44 minutes ago
> The really neat thing about GGUF is that it's just one file. Compare this to a typical safetensors repo on huggingface, where there's a pile of necessary JSON files scattered around [...]

Funny, to me AI models have "always" been single files, since that's the norm in the local image-gen scene. Safetensors files let you stuff all kinds of extra data inside them too; no GGUF needed for that. Though given that the text encoders of modern models are multi-gigabyte language models themselves, nobody includes redundant copies of those in every checkpoint.
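The container is pleasantly simple, too: a little-endian u64 header length, then a JSON header whose optional "__metadata__" dict holds arbitrary string key/values next to the tensor index. Reading it takes a few lines (the filename is a placeholder):

    import json
    import struct

    with open("model.safetensors", "rb") as f:
        (header_len,) = struct.unpack("<Q", f.read(8))  # u64, little-endian
        header = json.loads(f.read(header_len))

    # Everything the author stuffed in beside the tensors:
    print(header.get("__metadata__", {}))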

reply
amelius 34 minutes ago
> <|turn>user Hi there!<turn|><|turn>model Hi there, how can I help you today <turn|>

Good lord, they managed to invent a format that is even less readable than XML.

reply
aktuel 9 minutes ago
It's not supposed to be readable by humans; you rarely have to look at it. It's designed not to be confused with the actual content, which can be any random text from the internet. For that, you have to use markers that don't appear anywhere else.
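You can see this guarantee in any tokenizer with reserved markers: the template machinery emits the marker as a single reserved token id, while inference servers tokenize user content with special-token parsing disabled, so the same characters arriving as content can't produce it. A sketch assuming the transformers library and Qwen's ChatML-style <|im_start|> marker (any chat model with reserved tokens behaves similarly; split_special_tokens needs a reasonably recent transformers):

    # pip install transformers
    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

    # Emitted by the template: one reserved id.
    print(tok.encode("<|im_start|>", add_special_tokens=False))
    # Treated as plain user text: shatters into ordinary pieces, so pasted
    # content can't forge a turn boundary.
    print(tok.encode("<|im_start|>", add_special_tokens=False,
                     split_special_tokens=True))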
reply
badsectoracula 2 hours ago
> not to be confused with the somewhat baffling llama_chat_apply_template exposed in the libllama API, which hardcodes a handful of chat formats directly in C++

As someone who is tinkering with a desktop-based inference app in FLTK[0], i wish this used the actual Jinja2 template parser llama.cpp uses (or that there was another C function that did, since AFAICT for "proper" parsing you need to be able to pass a bunch of data to the template so it knows if you, e.g., do tool calling). Currently i'm using an ad-hoc function, but i guess i'll either write a Jinja2 interpreter or copy/paste the one from llama.cpp's code (depending on how i feel at the time :-P).
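For what it's worth, "applying the template" is just Jinja rendering with the right variables in scope, which is why a string-replace hack falls over once templates start branching on tools or system prompts. A Python sketch with jinja2 (the template string is made up; real ones ship in the tokenizer.chat_template GGUF key, and llama.cpp renders them with its own minimal C++ Jinja engine):

    # pip install jinja2
    from jinja2 import Environment

    src = (
        "{% for m in messages %}"
        "<|turn|>{{ m.role }} {{ m.content }}<|/turn|>"
        "{% endfor %}"
        "{% if add_generation_prompt %}<|turn|>model{% endif %}"
    )

    tmpl = Environment().from_string(src)
    print(tmpl.render(
        messages=[{"role": "user", "content": "Hi there!"}],
        add_generation_prompt=True,
    ))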

But yeah, GGUF's "all-in-one" approach is very convenient. And i agree that it feels odd to have the projection models as separate files; i remember when i first downloaded a vision-capable model, i just grabbed whatever GGUF looked appropriate, then llama.cpp told me it couldn't load the model, and it took me a bit to realize that i had to download an extra file. Literally my thought once i did was "wasn't GGUF supposed to contain everything?" :-P

[0] https://i.imgur.com/GiTBE1j.png

reply
bitwize 2 hours ago
Oh my God I freaking love your app. The 90s Linux desktop vibes hit like a hammer. FLTK FTW!
reply
ge96 3 hours ago
Nice, I recently pulled down TheBloke's Mistral 7B to try out. I have a 4070.
reply
bashbjorn 3 hours ago
I love Mistral, but that model is... not the best. Maybe try Gemma 4 E4B; it's a similar size to Mistral 7B and should run great on your 4070 ("E4B" is slightly misleading naming).
reply
ge96 3 hours ago
Thanks for the tip! What do you use Gemma 4 E4B for?
reply
redanddead 3 hours ago
Some say it's a miniaturized Gemini model.

It's good at writing and coding, and it's decently intelligent.

You can try it on NVIDIA NIM.

reply
mixtureoftakes 2 hours ago
Mistral 7B is quite outdated. On a 12GB 4070 you can run Qwen 3.5 9B at Q4_K_M, or Qwen 3.6 35B; the latter will be a lot smarter but also a lot slower due to RAM offload.

Try both in LM Studio, they really are surprisingly capable.

reply
ge96 2 hours ago
I have 80GB of RAM, but it's slow: capped by the i9 CPU, or maybe this specific Asus mobo just sucks. I think it only runs at 2400MHz despite being DDR4.

Tried all the stuff: BIOS settings, voltage tweaks.

reply
ganelonhb 3 hours ago
I have a 2070 and can confirm it works amazingly fast.

I love TheBloke; I wish he still made stuff.

reply
bashbjorn 3 hours ago
Yeah, the TheBloke era of local LLMs was good times. TBF Unsloth are doing a fantastic job of publishing quants of the major models quickly; they just don't have nearly the volume of "weird" models that TheBloke did.
reply
ge96 3 hours ago
What do you use it for? I'm still trying out agents; I barely use Copilot, only at work when I have to.

I didn't want to get personal with an LLM unless it was local, which is why I was setting this up. So far research is mainly what I was looking at.

reply
kenreidwilson 3 hours ago
>Published May 18, 2026

hmmm...

reply
bashbjorn 3 hours ago
whoops, my bad. Just a typo in the markdown. Fixed :)
reply