> Vision: We replaced Gemma 4’s vision encoder with a lightweight embedding module consisting of a single matrix multiplication, positional embedding and normalizations.
That's technically still encoding, just without a dedicated model such as SigLIP? The Developer's Guide elaborates: it's still a 35M-parameter layer, and I'm curious whether it is robust enough. https://developers.googleblog.com/gemma-4-12b-the-developer-...
> Small enough to run locally on consumer laptops with 16GB of RAM, it unlocks powerful multimodal and agentic experiences right on your machine.
I am assuming that involves quantization, which due to the quality loss makes that statement somewhat misleading IMO.
FAIR did this 2 years ago now: https://arxiv.org/abs/2405.09818
I've been waiting for something like this to be released since then.
The annoying thing is that Chameleon was multi-modal out based on the same principles, but this model is just inputs... (I'm curious how they did pre-training without having multi-modal outputs as well. I wonder if they just chopped them off rather than support image output).
Vision embedder (35M parameters): Replaces the 27 vision transformer layers of the other medium-sized Gemma 4 models. Raw 48x48 pixel patches are projected to the LLM hidden dimension with a single matmul. A factorized coordinate lookup (X and Y matrices) attaches spatial location information directly to the input
The "single matmul" is the key here. I haven't tried it, but it's probably pretty fast and memory efficient. The standard approach for training MM-LLMs is to train the encoder first. There are O(2-10B) good images on the internet, and the encoder needs to see each image O(10-100) times; that is O(100T) tokens, which is more than the entire pre-training budget for most runs. That is the reason we train the encoder separately (smaller model, 2B active vs 30B or 200B active for the LLM); there is nothing magical about training the encoder and LLM together, it is just more token-efficient to train the image modality first.
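For anyone who wants to see what "a single matmul plus a factorized coordinate lookup" amounts to, here is a minimal pure-Python sketch with toy sizes. Assumptions on my part: RGB input (3 channels), position rows that are simply added to the projected patch (the guide only says the lookup "attaches" location information), and no normalization layers. All names (`patchify`, `embed_patches`, `x_table`, `y_table`) are made up for illustration and are not Gemma's actual code.

```python
import random


def patchify(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors.

    Returns a list of (row_index, col_index, vector), one entry per patch.
    """
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            vec = [
                channel
                for y in range(top, top + patch)
                for x in range(left, left + patch)
                for channel in image[y][x]
            ]
            patches.append((top // patch, left // patch, vec))
    return patches


def embed_patches(image, patch, weight, x_table, y_table):
    """Project every patch with one matrix, then add its X and Y position rows.

    Adding the position rows is an assumption; the source does not say how
    the coordinate lookup is combined with the projected patch.
    """
    d_model = len(weight[0])
    tokens = []
    for row, col, vec in patchify(image, patch):
        projected = [
            sum(value * weight[i][j] for i, value in enumerate(vec))
            for j in range(d_model)
        ]
        tokens.append(
            [projected[j] + x_table[col][j] + y_table[row][j] for j in range(d_model)]
        )
    return tokens


def projection_params(patch, channels, d_model):
    """Parameter count of the patch projection matrix alone."""
    return patch * patch * channels * d_model


# Toy sizes so this runs instantly; the real model uses 48x48 patches.
rng = random.Random(0)
PATCH, CHANNELS, D_MODEL = 2, 3, 4
image = [[[rng.random() for _ in range(CHANNELS)] for _ in range(4)] for _ in range(6)]
weight = [[rng.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(PATCH * PATCH * CHANNELS)]
x_table = [[rng.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(2)]  # 2 patch columns
y_table = [[rng.gauss(0, 0.02) for _ in range(D_MODEL)] for _ in range(3)]  # 3 patch rows

tokens = embed_patches(image, PATCH, weight, x_table, y_table)
print(len(tokens), "patch tokens of width", len(tokens[0]))
```

With real 48x48 RGB patches each flattened vector has 48 * 48 * 3 = 6912 entries, so the projection matrix alone costs 6912 * d_model parameters, whatever d_model turns out to be.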
https://developers.google.com/edge/gallery
Anyone with a 16GB Mac — that is quite a lot of journalists, surely — can download that, install a model into it, and play.
Surely journalists have to start asking questions at least about OpenAI's consumer revenue projections now.
I am a major, major AI cynic, but I decided to be an informed cynic so I've been playing with local models for agentic work and a bit of CAD-to-image generation. I really quite like the 26B Gemma model — I've been using it to teach myself some fundamental things and learn OpenCode without developing a cloud dependency. It writes fairly good code and it is helping me learn the things I want to learn at a pace that I prefer.
But if this 12B model is even half as close as they say it is, this casts some doubt on the consumer end of the cloud business model, at least in the short term.
(Not clear if this app is using the MTP drafters; I've still not got them working with Gemma myself, though the Qwen 3.6 built-in MTP support is super in LM Studio)
However, on my 18GB RAM MacBook Pro, selecting Gemma-4-12B-it results in this error:
> The model "Gemma-4-12B-it' requires more memory (RAM) than is available on your device.
So yeah, my questions about the 16GB marketing copy are fair.
(Though perhaps it'll squeeze in with a small context window? Not sure I understand that aspect yet)
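On the context-window aspect: besides the weights, the runtime keeps a KV cache that grows linearly with context length, so a smaller window does free memory. A rough sketch using the usual full-attention formula follows. The layer count, KV head count, and head size below are made-up placeholders, not Gemma 4's actual configuration, and models with sliding-window or shared-KV layers need less than this.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """Bytes needed to cache keys and values for one sequence.

    The leading 2 accounts for storing both K and V in every layer.
    bytes_per_value=2 corresponds to a 16-bit cache.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value


# Hypothetical model shape, NOT the real Gemma 4 12B configuration.
LAYERS, KV_HEADS, HEAD_DIM = 40, 8, 128

for context in (4096, 8192, 32768):
    gib = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, context) / 2**30
    print(f"{context:>6} tokens -> {gib:.2f} GiB of KV cache")
```

Whether an app's "requires more memory" check accounts for a reduced context at all is a separate question that this sketch does not answer.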
It does seem to use MTP, yes, and it is quite quick — seemingly the underlying LiteRT stuff can do MTP with Gemma 4 and presumably MTP is a big part of the practicality picture here.
The system prompt thing was a surprise when I poked around.
After that a s1/s2 system: fast generation, slow wave correction / observation operating over the fast generation seems like the next leap forward.
Tokens create and hide too many problems to be the 'optimal' solution.
Your problem isn't with tokens, but with "language". Tokens have little to do with language, other than usually being consumed in sequence, but that's true of anything that has to span over time. Thinking of tokens as letters or subwords is mistaking the general with the specific. We may have started with letters and words and subwords (trying to find the best balance for training), but then people figured why not add pixel patches to the dictionary, and then sounds, and then other signals, and after iterating on it a bit, we now have image and sound and symbol sequence data all being part of the same token space.
LLMs stopped being about "language" - in the sense of English, or C++ - long, long time ago. We're still using tokens, but they're more like quanta of sensory input now.
You can take it in two directions, I guess - either consider "Large Language Model" to be an anachronym, a name that couldn't keep up with times, but we got used to it back when it made sense, or alternatively, just broaden your understanding of "language" to encompass any pattern of quantized sensory inputs, regardless of modality :).
(Given how we know humans can communicate with pictures, gestures, body language, noises, movement, actions, or even gaze, and that when it becomes common enough, such systems develop their own pattern structure - dare I say vocabulary and grammar - and that none of it requires or usually involves going through a "normal language" intermediary - I'd lean towards the second direction :)).
--
ETA: also wrt. "thinking with tokens", LLMs don't really think in tokens. You may have heard that phrase, that may have been coined by Karpathy, that "for LLMs, tokens are units of thinking". It's a useful shorthand to remind people that prompting models to be terse and skip prose is effectively dumbing them down, but it's also a bit misleading.
A better analogy is that tokens act like clock signals: each consumed token causes a certain amount of computation to happen in the network, much like a single clock signal in digital electronics, or turning a crank one revolution in a mechanical contraption. This makes tokens "units of thinking" in the sense that processing N tokens causes M amount of computation to happen. Now, for whatever problem you're solving, there is a minimum amount X of computation that is required to solve it correctly, and it's mathematically impossible to do with less. So if you ask an LLM to solve it, it needs to process at least as many tokens as it takes for M = X. If you force the model to be so terse that it makes M < X, you literally make it impossible to succeed. In practice, you need M >> X.
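To put rough numbers on the M vs X argument: a common rule of thumb is about 2 FLOPs per parameter per processed token for a dense model's forward pass (attention cost ignored; for MoE you would use active parameters). The required compute X for a real problem is not something anyone can measure directly, so the value below is purely illustrative.

```python
def flops_per_token(n_params):
    """Rule-of-thumb forward-pass cost per token for a dense model."""
    return 2 * n_params


def min_tokens(required_flops, n_params):
    """Smallest token count N such that M = N * flops_per_token >= X."""
    per_token = flops_per_token(n_params)
    return -(-required_flops // per_token)  # integer ceiling division


PARAMS = 12_000_000_000      # a 12B dense model
REQUIRED_FLOPS = 10**15      # made-up X, for illustration only

print(min_tokens(REQUIRED_FLOPS, PARAMS), "tokens at minimum")
```

Any output budget below that count makes M < X no matter how good the weights are, which is the point of the clock-signal analogy.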
My understanding of pixel representation is: slice a grid in an image, each square slice gets projected into a number array of x long (not sure how long x is, or if it's variable), which then gets projected down to a token representing that space (3-4 long as alpha-numeric) and AGAIN gets passed into a "position detector" which outputs a token representing that pixel/position, which gets passed into the lmm (as a significantly reduced/translated signal in token space).
First, before continuing: do I have that mostly correct?
There is no such projection step. The array of x numbers is the token. For text, there is a one-to-one correspondence between the textual representation of a token, its index in the vocabulary of the model, and the array of x numbers that is fed into the linear algebra of the model, so people often equivocate between them; but for images or sound, there is no discrete vocabulary and no textual representation, only the array of x numbers.
12b means 12G @ 8 bits/param (basically lossless) and 6G at 4 b/p (generally accepted 'pretty close' level). Not too bad?
But TBD how well the base model performs before thinking too much about quantization
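The arithmetic behind those figures, as a small sketch. It counts raw weights only, in decimal GB; real quantized files carry some extra scale metadata, and the KV cache and runtime overhead come on top.

```python
def weight_gb(n_params, bits_per_param):
    """Approximate size of the raw weights in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9


PARAMS_12B = 12_000_000_000
SYSTEM_RAM_GB = 16

for bits in (16, 8, 4):
    gb = weight_gb(PARAMS_12B, bits)
    share = gb / SYSTEM_RAM_GB
    print(f"{bits:>2} bits/param: {gb:.0f} GB of weights, {share:.0%} of a 16 GB machine")
```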
> Audio: We simplified audio processing even further. We removed the audio encoder entirely and projected the raw audio signal into the same dimensional space as text tokens.
From the visual guide, there's still the 35M parameter embedder, then the linear projector, for vision, and the linear projector for audio, so it does have some parameters used for the multimodal input to project it into the LLM latent space: https://newsletter.maartengrootendorst.com/p/a-visual-guide-...
And the Unsloth quants, which are missing this, don't support multimodal input. (edit: actually, I may have just needed to update my llama.cpp, will check with an updated llama.cpp soon)
I'm downloading the ggml-org GGUFs now, I tried Unsloth but got some weird problems, double checking with the bf16 model to see if the issue was just the quant.
Isn't that just projecting the patches into the d_model size vectors that the model takes?
> I am assuming that involves quantization
12B model in 16GB seems very reasonable to me, int8 is top quality for running models.
12B at int8 would take up 12G memory, or 75% of the system memory which technically fits within 16GB but the OS will not like that. EDIT: On my 18G memory MacBook Pro, LM Studio reports a "partial GPU offload" for the int8 MLX weights. Can't test because the `gemma_unified` architecture is NYI.
The part I hate though is that I’d bet none of the performance claims are based on int8.
Why do we care about bf16 benchmarks when no one will be using that with this model?
It sounds like marketing spin where the performance claims are based on BF16 and the “runs in 16GB” claim is on a totally different quantized version.
Still progress, but not quite democratic yet.
Weird though that Google might be cannibalising its own AI subscription service?
https://github.com/ggml-org/llama.cpp/pull/23398
Please don't use Ollama, it's a bad actor in the OSS community.
But I've moved on from Ollama for the time being, though I am mainly interested to see what the Gemma 4 MTP speeds are like on my M1 Max, so I may test it.
I am quite impressed with the tools in LM Studio, which is also a beautiful app, but it is not open source (which challenges my personal strategy somewhat) and I dread its inevitable enshittification.
Nevertheless the GUI has been very helpful while I learn, and I will probably use it until something else presents or my usage pattern settles down from experimentation to something a bit more routine.
I will try oMLX, too, but judging by the LiteRT page I may soon be able to just use that for the larger models if I end up settling with Gemma 4.
You have probably convinced me to give it a try, to be honest.
It's just that, to cut a long story short, I am currently recovering from a level of burnout so severe that twelve months ago had me fully convinced I was actually in early-onset cognitive decline (I am a bit over fifty).
Only a little over two months ago I was still sure I'd have to quit IT and find a slow job because I was so out of the loop; this whole industry shift even in just the last few months is so shocking and strange.
So I have to be a bit cautious about how many indirections I add, if that makes sense. But I am compiling bigger projects than llama.cpp so I will give it a go.
Thank you for the extra detail.
That being said, the real value in paid plans is that you get ecosystem integration that can read your gmail, photos, docs, and so on.
I'm both shocked but also not surprised that they continue to develop such efficiencies. Honestly it's like silicon and CPU architecture advancement. We kept shrinking it and shrinking it and it kept getting more and more powerful and here we are with AI and it's only going to be 100x more efficient with time. Maybe there's some point of decay but essentially the next 30 years will be more advanced than the last 30 and we're going to be living in some sort of futurist Blade Runner scenario where gene editing is repairing ageing cells, organs and curing all sorts of cancers that haven't even appeared yet. Beyond our lifetimes people will live to 125 quite steadily and with great mobility and then obviously people will look to how do we get to living 1000 years, which, if anyone is religious, they know Noah and others lived to that age in a totally different era.
Anyway I'm going off on some tangent but look back 30 years. Now look forward 30 years. It's going to be insane. May God protect us.
It's definitely an exciting time, but in terms of advancements in the state of the art, there is a lot of low-hanging fruit left to pick. There IS a bottom, however, as you can only encode so much "knowledge" in a small number of parameters.
This feels to me a lot like what the early days of radio or aviation must have been like. Or, heck, microcomputers even.
Today, data systems and algorithms can be deployed at unprecedented scale and speed. Unintended consequences will affect people with that same scale and speed
—Michael Chapman
My favourite conspiracy theory lately is that the above isn't a silly fairy tale, that we actually used to live much much longer -- until the common cold came on the scene, and the sequelae dramatically shortened our lifespans. Today we dismiss it as "just a cold", unaware of what it robbed us of.
Large models still are quite far ahead, don't be fooled that even Gemma:31b (which is better than the 12b overall) is anywhere close to big models.
There is definitely room for optimization, but fundamentally, for complex tasks, you need visible small gradients for accuracy that allow the model to be trained on (and consequently be followed during inference). For example, if you specify in the instructions not to write code but then ask a coding question, Gemma will still write code, whereas Gemini/Claude will pick up on that and follow your instructions better.
Is it simply goodwill and/or marketing? Or am I missing something strategic?
If that inference becomes popular and valuable enough that those companies make billions of dollars in profit, those companies could use that profit to fund the building of alternative products and platforms that dis-intermediate google's relationship with the customer.
Google already has an 80% gross margin business, the biggest one in the world. Everybody wants a slice of it.
By offering frontier inference closer to cost and open-sourcing everything that's sub-frontier, they're commoditizing frontier labs' models, which inhibits their ability to durably make high gross margins on inference.
It's a strategic play.
> By offering frontier inference closer to cost *and* open-sourcing everything that's sub-frontier
It's two prongs! One prong is that their frontier inference pricing is significantly cheaper/closer-to-at-cost than Anthropic's.
The subject of this thread is the other prong: offering compelling models that are sub-frontier and self-hostable.
Self-hosting models and at-cost frontier models are the high-end and low-end disruptions, respectively, to Ant/OAI/etc.'s business models.
They need one more than ever now.
This is ridiculously anti-competitive.
2. Every time you search for Claude or ChatGPT, you get presented with an AdWords bidding war.
3. Google is deploying its models in Search, Docs/Drive/Office, YouTube, Chrome, ...
2. I'm not sure what this has to do with the case, unless you're arguing Google has an ads monopoly, in which case the best argument would likely not be that adwords lead to bidding wars because that just sounds like they're selling a product people really want to pay for
3. There's nothing criminal about being a very diversified business
Basically with upcoming spark laptops, the smaller models will likely get fine tuned to interface with google services. Then, Google can essentially make Chromebook software include those models, which is the same use case as android.
And you better believe that they will be collecting user data and building advertising models.
That's my experience right now... my company is all in on a plethora of platform products. Also, Microsoft just yesterday said their goal was "Unmetered intelligence". There's a lot of things that can be enabled by small local models, and those things are part of stacks that can generate revenue in other layers.
Of course it is...
This is Windows-Licensing-Level Money Opportunity 2.0.
And Google releases another free local model. As did Microsoft.
The actual facts of the day belie your snort take. At least a little bit.
So it's easier to just release those models as open source and make it official, since someone would inevitably hack the weights out anyway.
Companies don't commonly give away executable binaries "just because", why'd they start now for these binary blobs that are the models?
Not that I'm unhappy about it! Yay for open data any day, I'm just not understanding why, at least beyond PR in nerd circles
Are you sure that isn't about LLMs' outputs? There I know there have been some court cases that say this, but the model itself is a work created in intricate and somewhat creative ways (I hesitate to use the word "creative" here, but would similarly hesitate to label a routine picture of the moon creative whereas pictures basically always have copyright; the bar for creativity is basically an epsilon amount above zero, afaik)
They could lock them down legally which would prevent commercial use, but they choose not to, and they boast about how many tens of millions of times Gemma models have been downloaded by developers.
So there must be more to the rationale than just local model weights getting hacked out of devices.
They rise with the tide of AI adoption. But they gain ground if people opt into Google solutions. And any token sent to a Google model (free or paid) actively punishes their competitors that are then required to spend vast sums to remain bleeding edge.
The question is: do you want to release your models, or use them purely for R&D?
Since everyone else is already releasing models of similar qualities, it's hard to say you're shooting yourself in the foot if you join the chorus.
The added cannibalization of releasing them is effectively zero, so the reputational benefits are likely to be worth it.
Nobody would be looking at Qwen if their ~30b class models weren't fantastically good, it's great advertising and builds significant goodwill with developers, who are going to be your biggest advocates.
The other thing is, all these models are already disposable grade, and in a year they'll all be outclassed by The Next Big Thing. "Open" models are less than 18 months behind SOTA right now and I can't imagine that will slow down much over the next two years, they may even begin to close the gap. Nobody even talks about llama 4 anymore despite only being a year old.
So perhaps another part is just Google showing that they can indeed play at the big boys table.
Here’s a real example.
I’m in a design meeting talking about a model use case. We have a question about the data pipeline or the prompt format that would benefit from knowing about how the model was trained. The enterprise team lead calls the dev tech engineer from the company who produced the model. He is already in the office and walks into the meeting to answer the question.
We saw great results in our use case using Google direct. Moved to OpenRouter because Google wouldn't let us use it beyond a test.
Then OpenRouter's performance looked worse, not sure if there was a quantized version or something. So we instead looked at Deepseek v4 Flash, and opted to go for that.
This model would probably be great for a super low cost cloud model, would love to use it in the cloud, Google makes you go elsewhere.
They remind me a bit of HuggingFace, create something great then make money … maybe.
I'm pretty sure they are doing it because they get some research experience by shrinking and improving these models, and because they know that by doing this they get some good PR among the dev community.
Plus every open model undermines their local competition by furthering open research and reduces moats, especially since Gemini as a frontier model isn't really competitive with GPT nor Claude for most applications.
Eventually the local model is not enough, and you'll upgrade to the big ones.
It may be that the Q6 quant I got is borked (or my LM Studio is), but either way, the 0.8b's performance is mind boggling in comparison.
Then run it with the latest llama.cpp that contains the Gemma 4 12B unified bug fixes, using --image-min-tokens 560 --image-max-tokens 2240 --batch-size 4096 --ubatch-size 4096 --temp 1.0 --top-p 0.95 --top-k 64 --jinja
It's understanding far more complex things for me and can reliably handle tiny text, so it should be easily understanding an image that only contains the text "This is a test".
If it was, they wouldn't need to be using the classifiers they are using to warn Gemini about problematic prompts.
A model that comfortably fits in 16GB of VRAM (allowing room for context) is a welcome upgrade.
- Transcribing scanned documents into formatted text
- Captioning/describing images and classifying them for audience suitability (includes anti-spam)
- Matching documents with relevant Wikipedia pages for tagging
I don't use them like frontier models. I break the work down into micro-tasks with one clear goal for each prompt. I write a lot of glue software to make the complete flow work. I was working on all of these tasks before LLMs appeared on the scene. The LLMs have allowed me to replace a lot of complicated code with less code plus a model, while achieving better results.
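A minimal sketch of what that kind of glue can look like: each micro-task is one prompt with one goal, and the output of a step feeds the next. Everything here is hypothetical (the `MicroTask` type, the prompt wording, the `call_model` callable); in practice `call_model` would wrap whatever local inference endpoint you run, and a stub stands in for it below so the flow can be tested without a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MicroTask:
    name: str
    prompt_template: str  # one clear goal, with an {input} slot


def run_pipeline(
    tasks: List[MicroTask],
    call_model: Callable[[str], str],
    document: str,
) -> Dict[str, str]:
    """Run tasks in order, passing each step's output to the next step."""
    results: Dict[str, str] = {}
    current = document
    for task in tasks:
        prompt = task.prompt_template.format(input=current)
        current = call_model(prompt)
        results[task.name] = current
    return results


def fake_model(prompt: str) -> str:
    """Stand-in for a real local model: upper-cases the last prompt line."""
    return prompt.splitlines()[-1].upper()


tasks = [
    MicroTask("transcribe", "Transcribe this scanned text verbatim.\n{input}"),
    MicroTask("tag", "Suggest one tag for this text.\n{input}"),
]
results = run_pipeline(tasks, fake_model, "scanned page text")
print(results)
```

Keeping the model behind a plain callable is also what makes it cheap to swap a retired hosted model for locally saved weights.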
I use local models for reasons of cost and control. I already had the workstation and GPU. The only running cost is electricity. I have used proprietary models from OpenAI and Google for some of these tasks, but I also encountered churn when the models I built my tools around were retired. I don't worry about that when I have the weights saved locally.
I saw a little app the other day, I think someone posted on here, that looks at your screenshot and renames the file based off the contents of the file.
There's tons of little examples like that. For a lot of use cases, you really don't need the frontier models.
If you have a very specific idea for local model use you can find a way to make it work very well, you don't even need to have a graphics card or NPU chip. You just have to be extremely constrained in how it's used. I think as a generic chatbot they're not great, I'd use a hosted SOTA model and I'm a big fan of local LLMs myself.
Could you talk a bit about how you did the fine-tuning? Did you use unsloth or any other tool, and how did you verify the outcome?
Yea absolutely, but man, where to even start, it is very specific.
Fundamentally I didn't use any wrappers like unsloth or axolotl, although I have used the latter before a year or two back and it was good, but I needed something very very custom. I also wanted the whole pipeline, from fine-tuning to the exported OpenVINO model, to be seamless.
I heavily leaned on codex, claude and some manual sleuthing around the internet to understand what I needed. I'd played about with QLoRA finetuning with axolotl before and felt most comfortable with that. So I needed to keep everything as stripped down as possible and figured I can just utilise the 3 main huggingface libraries (transformers, peft and datasets) and also bitsandbytes (as suggested by claude to quantize the model to keep this working on my GPU) along with some custom scripts generated by claude/codex (each cross referencing each other) that will do the different stages of the training run.
The next part was the data. Obviously didn't have access to thousands of meetings and associated output documents but I did have a 3090ti sitting there and a codex subscription. So I set about working out what format I needed the data in (many thanks again, to claude/codex) and started generating hundreds of different transcripts, different amounts of speakers, content, tones, subjects, spelling mistakes - like all the different things you could think a meeting would have. Then it's a case of actually generating a good meeting document off the back of the transcripts and creating the "gold standard" that we'd use.
I'm going to gloss over a lot here as I'd rather not detail it as it relates to some proprietary stuff that I had to work through, but you basically pair the transcripts together and run the training.
At the verification stage, there were pretty much three things:
1. "Just" do some regex string matching to check that the key facts from the source transcript appear in the output, to ensure fact preservation. Same for owner fabrication (who said what): I don't want something attributed to someone who didn't say it. And finally markdown validation.
2. Using codex/claude to validate the transcript and output from the model - I used the latest frontier models, probably overkill for my task, but they were good at the job
3. Finally me going through some actual recordings of myself, groups, meetings and manually verifying the output
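The first, regex-based check can be sketched roughly like this. This is not the poster's actual code: the `(owner: Name)` convention, the function names, and the two markdown rules are assumptions chosen only to illustrate fact preservation, owner fabrication, and markdown validation.

```python
import re

# Hypothetical convention: action items end with "(owner: Name)".
OWNER_PATTERN = re.compile(r"\(owner:\s*([^)]+)\)", re.IGNORECASE)


def missing_facts(key_facts, summary):
    """Key facts from the transcript that never appear in the summary."""
    return [
        fact
        for fact in key_facts
        if not re.search(re.escape(fact), summary, re.IGNORECASE)
    ]


def fabricated_owners(summary, speakers):
    """Owners named in the summary who were not speakers in the transcript."""
    known = {speaker.lower() for speaker in speakers}
    return [
        name.strip()
        for name in OWNER_PATTERN.findall(summary)
        if name.strip().lower() not in known
    ]


def markdown_problems(summary):
    """Very small structural check: needs a heading and balanced bold markers."""
    problems = []
    if not re.search(r"^#{1,6} \S", summary, re.MULTILINE):
        problems.append("no heading")
    if summary.count("**") % 2:
        problems.append("unbalanced bold markers")
    return problems


summary = (
    "# Weekly sync\n\n"
    "- Ship v2 on Friday (owner: Alice)\n"
    "- Update docs (owner: Carol)\n"
)
print(missing_facts(["v2", "Friday", "budget"], summary))
print(fabricated_owners(summary, ["Alice", "Bob"]))
print(markdown_problems(summary))
```

Plain substring matching like this misses paraphrased facts, which is presumably why the frontier-model and manual passes are still in the loop.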
So a fair bit of work, and for context I'm on version 10 now, so it's been a journey!
Repo is https://github.com/Rebreda/listenr - mainly geared toward Whisper fine-tuning, AMD hardware and local inference
In practice I haven't got around to building something around multimodality since I'm primarily using their text generation capabilities.
Then when I’m getting close to feature-complete, I’ll move to a hosted frontier model for the final integration.
Cost savings are enormous if you’re making dozens of calls to language models a minute.
I found Gemma 4 to be better, or at least more nuanced, than Gemini 2.5 Flash. And, the new Gemini 3.5 Flash is very good but is unrealistically expensive (ten times more expensive than DeepSeek or MiMo). So, since I don't need extremely fast performance, a self-hosted Gemma 4 wins for a bunch of stuff.
I've also found Qwen 3.6 27B to be shockingly good at finding security bugs for its size. It beats several larger models, and is close to Gemini Pro 3.1 (but Gemini 3.5 Flash surprisingly beats it soundly). Since it only costs electricity, and my electricity is cheap and 100% renewable, I can use it more broadly than I might otherwise use a hosted model.
All that said, the smart money is still on buying the subsidized tokens from the providers that offer them, rather than buying the hardware needed to run models that are 30+GB in size, as all of the ones I've been using regularly are (8-bit quantization, as they get a little dumber for every bit you drop below that). A $100 subscription to Claude or Codex currently provides access to the best models at a heavily discounted rate. And, DeepSeek/MiMo are extremely cheap, one or more orders of magnitude cheaper than the top models from Anthropic or OpenAI, if you need an API for automated usage. I spent about $4000 on my two inference machines (a Strix Halo with 128GB unified RAM, and a new desktop build based around two cheap old 32GB AMD data center GPUs), which buys a lot of tokens for tiny models like this...probably a couple/few years worth. But, I like tinkering, so having an excuse to play with hardware is its own reward. If it happens to pay me back some of that money, that's a bonus.
Of course, as the major providers decide they need to ring the cash register and stop burning money on subsidized tokens, that math may change, and I may find I'm grateful to have already bought this stuff before the RAM prices made everything 2-3x more expensive.
But, I think if you're not interested in learning about the technology and doing your own training experiments and such, you should probably not try to run stuff locally most of the time.
Wow LLMs are changing the world, what a utopia.
Think almost like unix pipelines, have used it for many workflows.
I expect it to be something like https://huggingface.co/OuteAI/Llama-OuteTTS-1.0-1B-GGUF
They are charging $15.00 an hour for an llm powered assistant. Like wtf, how do these people think that's a valid business model. This will 1000% annoy every customer that uses it. I hate this timeline so much.
Can you call a receptionist at 10pm and book an appointment? Or ask for directions? What if it's 10am and she's already on the line with someone else and you just want to ask if there's parking?
Yes, they're called after hours answering services and they're exponentially better because I get to talk to a human.
If my doctor's office replaced a receptionist with this I would switch and leave bad reviews across every platform possible.
I've already switched doctors once because they used an LLM transcription service during my appointment that influenced the doctor's recommendations for care. Sorry, technology does not belong everywhere.
AI produces low quality work and will turn your business to shit.
Ideally companies would share the fucking datasets and training code already, but no, no one wants to talk about the source of those or even share the ones they have as then who knows what comes out of Pandora's box...
I am not overly impressed with the smaller gemma models. And gemma 3 was a bit of a mixed bag, great at some things, bad at most others
Qwen 3.6 on the other hand barely uses any memory at all for its KV cache.
I guess we have to wait for someone to produce perplexity curves at different Q's.
It went into a crash loop on a British Columbia 1 dollar coin. This happened with both Q4_1 and Q8. Maybe I'm holding it wrong or it's just really bad for this task.
In contrast, gemma4 gets the British Columbia coin right though it also mis-identifies the quarter. gemini 3.1-flash-lite nails them both.
Was getting about 50 t/s output on a 3090 with Q8 which seems ok.
Gemma 4 26B (a4b MoE) 0.647
Qwen 3 14B 0.621
Gemma 4 12B 0.618
Ministral 14B 2512 0.604
Gemma 3 12B 0.547
The Qwen 3 14B vs Gemma 4 12B difference is within random variance; in some repeat runs they actually got the exact same score. The next step up, Gemma 4 31B, gets 0.676 on this. Or, letting in some reasoning, Qwen 3 14B (reasoning) gets 0.676. I'll run some cheat-proof benchmarks tomorrow to see if Qwen is still on top.
[0] https://ollama.com/library/gemma4/tags
Edit: MLX being Mac-only is independent of the model being MLX (and therefore Mac) only. The latter is what I am asking about.
I was sure "MLX" stood for "Metal-something-something" but can't find any reference to that somehow, anywho, "Metal" is hardware-accelerated graphics on Apple platforms FWIW.
Edit: about the actual release on Ollama, if you're on non-Apple hardware you probably want the NVFP4 variant ("gemma4:12b-nvfp4") which was uploaded 45 minutes ago, especially if you're with a recent nvidia GPU.
For the GGUF 4bit variant (i.e. non-macs) you'll need `gemma4:12b-it-q4_K_M` which I just pushed. You'll also need to upgrade to version 0.30.4 which we're just about to release (it's in prerelease and we're running through our last regression tests).
> You'll also need to upgrade to version 0.30.4 which we're just about to release
Interesting, wasn't Google coordinating today's release with you? Considering the blog post seems to have gone out way before the release had even been cut.
That said, I think the gemma4:12b-nvfp4 model is pretty solid. It's been tuned with Nvidia's model optimizer. I've been waiting on the results for MMLU-Pro, but I'll have to retrigger that after reconverting.
Hah, missed that! Guess that's slightly neat though, you get a second chance ;) NVFP4 has been a blast to use across a wide range of models, seems to work really well, at least with vLLM and an Nvidia card.
The underlying LiteRT-LM framework used in the edge gallery does support the MTP drafters for the smaller models, but according to:
https://developers.google.com/edge/litert-lm/models/gemma-4
> Note: LiteRT-LM supports E2B and E4B models today, with support for larger models coming soon.
So even Google aren't shipping MTP support for the 26B and 31B models yet.
With many laptops dropping back down to 8GB because of the memory shortage there's some interesting pressures building in the industry.
Is the entire point of this model then that it runs if you don’t have enough GPU memory to load the 26B? That one runs faster anyway due to lower active params.
Wait, *Excluding Chinese language.
This is ... curious.
P.S. Where is gemma 4 124b?
I would be interested in how this actually works. I couldn't find a description of the model architecture (and I did check the links in the Google blog)
This is often a separate module grafted onto the main model, and further pre-trained (e.g. OpenAI's CLIP, SigLIP used in the Gemma 3 and PaliGemma series).
The image encoder approach has a few problems.
One problem is that many encoders, like Gemma 3's, have fixed image resolution constraints, so inputs must be resized, with all the attendant distortions that causes for spatial understanding. However, the Gemma 4 series image encoders overcame this and can handle variable-dimension inputs.
Two, these image encoders are somewhat large (ranging from 300-500M parameters), requiring extra memory and FLOPs to run.
Three, if we need to fine-tune a vision language model, updates to its weights may affect its understanding of the representations generated by the image encoder unless we fine-tune both together.
The new Gemma-4-12B replaces the encoder (with its many attention layers and large parameter count) with a simple linear projection to generate the embeddings for images. That reduces the computational requirements and simplifies the input pipelines for image processing.
I don't have any expertise on the topic though and might very well be wrong on some details.
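To make the linear-projection idea concrete, here is a toy NumPy sketch of what the Developer's Guide describes: 48x48 patches, one matmul into the hidden dimension, and separate X and Y lookup tables for position. Everything else is an assumption for illustration: the hidden size, the table sizes, the random weights, the function names, and the choice of a single RMS norm at the end (the announcement only says "normalizations"). It is not Google's implementation.

```python
import numpy as np

PATCH = 48  # patch side length in pixels, as quoted from the Developer's Guide


def rms_norm(x, eps=1e-6):
    # Scale each token vector to unit root-mean-square (assumed norm type).
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)


def embed_image(image, w_proj, x_table, y_table):
    """Turn an (H, W, C) image into (num_patches, hidden) embeddings."""
    h, w, c = image.shape
    if h % PATCH or w % PATCH:
        raise ValueError("image sides must be multiples of the patch size")
    rows, cols = h // PATCH, w // PATCH
    # Cut the image into non-overlapping patches and flatten each one.
    patches = image.reshape(rows, PATCH, cols, PATCH, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(rows * cols, -1)
    tokens = patches @ w_proj  # the single matmul
    # Factorized position: one lookup per column (X) plus one per row (Y).
    row_idx, col_idx = np.divmod(np.arange(rows * cols), cols)
    tokens = tokens + x_table[col_idx] + y_table[row_idx]
    return rms_norm(tokens)


rng = np.random.default_rng(0)
hidden = 64  # toy hidden size, not the real model's
image = rng.random((96, 144, 3), dtype=np.float32)  # 2 x 3 patches
w_proj = (rng.standard_normal((PATCH * PATCH * 3, hidden)) * 0.02).astype(np.float32)
x_table = rng.standard_normal((16, hidden)).astype(np.float32)  # hypothetical size
y_table = rng.standard_normal((16, hidden)).astype(np.float32)  # hypothetical size
embeddings = embed_image(image, w_proj, x_table, y_table)
print(embeddings.shape)  # (6, 64)
```

Because nothing ties the projection to a fixed grid, any image whose sides are multiples of the patch size yields a variable-length token sequence, which fits the variable-resolution point above.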
It's not parameter size - there is apparently such a thing as "Gemini Nano", which famously is downloaded automatically by Chrome. How similar is it to Gemma E4B? And how strange - you have the weights, but you don't "have" them?
Use OpenCode Go instead: https://opencode.ai/go
But at the same (V)RAM requirement, between the 4-bit 26B-A3B and the 8-bit 12B, it's unclear which one will win, especially given one is MoE and the other dense.
All the launch benchmarks are at 16 bit.
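For a rough sense of why those two land in the same (V)RAM bracket, a back-of-the-envelope estimate of weight storage is below. It assumes nominal bit widths and counts weights only; real quantized files carry extra scale data and the KV cache comes on top, so treat the numbers as a lower bound. The function name is made up for this example.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight storage in decimal GB (weights only, no KV cache)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


for name, params, bits in [
    ("26B MoE @ 4-bit", 26, 4),
    ("12B dense @ 8-bit", 12, 8),
    ("12B dense @ 16-bit", 12, 16),
]:
    print(f"{name}: ~{weight_gb(params, bits):.0f} GB")
```

An MoE model still has to keep all of its experts resident, so its total parameter count drives the memory footprint even though only a few billion are active per token.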
I'm curious how they pre-trained it... I feel like it must have had audio/image output that they chopped off.
I wonder how hard it would be to add it back on.
mmmkay.
Similarly, the 26B A4B Gemma 4 and the 35B A3B Qwen 3.6 identify it clearly, give me the title and trends analysis fairly accurately. While this 12B spits out gobbledygook about it having something to do with hard-drive capacity. It's like it can barely see, gets the very broad strokes (knows it's looking at some kind of chart), but can't identify any details clearly.
https://huggingface.co/ggml-org/gemma-4-12B-it-GGUF/tree/mai...
https://huggingface.co/ggml-org/gemma-4-12B-it-GGUF/blob/mai...
It is getting questions like "David has 18 apples and Ivan has 7 apples. How many apples do they have together?" wrong half the time, while Gemma3 12B could very consistently answer that. Other smoke tests (like Chinese translation, and the infamous "Rs in Strawberry" test) also show poor results.
I don't know if it is a quantization/release issue, if the parameters needed for accurate responses have changed (i.e. it needs "thinking" tokens to handle its base error rate), or if the model has been so focused on audio/video that the text processing is bad.
Consumers were complaining about the standard 8GB with the early 2020 refresh of MacBook Pros, many OSes ago. Sure, it might be workable for many tasks (as evidenced by the recent sales of the MacBook Neo), but users with a mere 8GB shouldn't have expectations of LLM performance. Even 16GB feels like a stretch.
The result is decent, but it had a few bizarre/trivial syntax errors I had to fix manually: it would add an extra closing bracket or paren a few times, and wanted to separate function definitions with a comma. Not sure what that was about, but otherwise the output ran just fine.
So, with those qualifiers, I think it's a decent local coding model. It roughly compares with GPT-4.1 (!!), released 14 months ago, on the output: https://senko.net/vibecode-bench/2025/minesweeper-gpt-4.1.ht... (actually I'd call it better, but those syntax errors...)
I ran the quantized version (4-bit GGUF) on my consumer-grade card with 12G of VRAM and got 5t/s for output. Not fast enough for interactive coding, but a fairly capable model.
To me, it's fascinating how much progress we got in over a year. GPT-4.1 was considered an extremely capable coding model. Now we got something with 12B of params performing roughly the same (in this specific benchmark, disclaimers, etc).
Lists of various models I tested: https://senko.net/vibecode-bench/
For 16GB laptops, Qwen 3.5 9B is the undisputed champ.
Gemma 4 31B is the top dog at small model coding, but is dense so it needs ~48GB unified RAM for full context. If you want decent coding on a laptop you need a lot of RAM. But this shouldn't be surprising, dev machines have always needed lots of resources.
Thanks
(not recommendation, I've not used it .. yet)
Which is unsurprising in the AI space.
You get a wall of text showing you various random fine-tuned models by random people, and that is basically it.
Actual sane default requirements like "just give me the normal AI labs", "please filter for dense only" and "I want this exact context size at this quant" are not part of the tool, apparently. Neither is "compare these quants for me for the same model".
Or maybe it's just hidden enough that I did not find them before I've stopped caring.
Conway's law is at it again.
____
Edit:
I have since then had qwen3.6 ponder the codebase and think about my complaints.
Seems to require a major data model overhaul to actually fix those, so they're legit. Which I didn't doubt, but nice to have some extra fabricated confirmation after it initially refused and said "nooooo the readme says otherwise nooo hypfer is just a hater noo"
___
Edit 2:
It gets worse the longer I stare at it. This could've been a web calculator.
For smaller ones like my native Latvian, the output could be confused for good translation from across the room, the words do look like Latvian words. But the quality is Google translate circa 20 years ago, tops.
It could probably do a decent enough translation to English, if all you need is to get the gist of text. But for smaller European language outputs, nothing comes close to Gemini.
I'm not doing translations, rather querying Hebrew text with a Hebrew prompt.
you can run qwen 3.6 35B-A3B on a 12-16GB vram gpu and it works pretty well.
https://www.youtube.com/watch?v=8F_5pdcD3HY&t=1s
even the 27B in some quants can fit.
https://www.reddit.com/r/LocalLLaMA/comments/1tkmgwj/qwen27b...
qwen IMO is far better for coding, esp agentic coding when combined with something like Pi, it comes probably close enough to Sonnet for a lot of use cases.
Gemma family is better for almost all other tasks you'd use a local llm for.
Which quant of Gemma? For coding Qwen seems to be pretty far ahead. Gemma generally seems to have a "vaster" set of knowledge, but armed with a search tool that doesn't really matter, and Qwen 3.6 has been really great for all sorts of tool calling. I mostly do programming and related things though, fwiw.
> I was going off of peoples' opinions on reddit
It's extremely astroturfed all over the place, especially the larger subreddits, and especially the one related to a specific animal in a specific location. It's sad, as early on it was a great resource, but now it's mostly paid posts and a race to the bottom, with lots of piling, and all the knowledgeable people I used to recognize are nowhere to be found.
Screens:
* https://i.ibb.co/TBBV5nJk/kl-01.png (voice design)
* https://i.ibb.co/nNvvKDyV/kl-02.png (quotation attributions)
It's right there in the middle benchmark bar "LiveCode Bench" 72%.
I don't have unified RAM tho and offloading to CPU is dog slow, which is why I'm interested in 7b-12b models.
Why do people with modern laptops have such little amounts of ram?
Fine if work's paying, but for personal devices (that might have been purchased before local models got good), people have what they have.
I’m a system administrator and I can do my job with no issues at 16GB. Most days 8GB would likely be enough, since I’m just using and abusing other systems anyway.
Java devs at my last job were still running 16GB in 2020. Admittedly that was a while ago. Still not a decade.
Close some Chrome tabs?
I think the major win for coding was reasoning. That's why such a small model can match GPT-4.1 in coding, but I suspect that GPT-4.1 still wins in general world knowledge due to its bigger size.
Encyclopedic knowledge matters relatively little in perspective, given the expectable future developments: even the more knowledgeable of us will use that knowledge for reasoning and intuition (and we will have absorbed the intellectual keys during our training), but under our professional hat we should in theory be ready to go "I stand corrected" and "more precisely" with the actual data at hand.
I.e.: for the encyclopedic knowledge needed, the /understander/ will have a RAG subsystem and a corpus of knowledge to inquire upon processing queries.
(Corroboration: we can't delirate, and neither can the machine...)
That token output speed suggests it is running in hybrid mode and involving CPU + system RAM somehow. ~5tk/s is about what DDR4 bandwidth gives you for a model of that size at 4-bit. Any consumer GPU with 12 GB, like an Nvidia RTX 2080 or RTX 3060, should be doing 20+ tk/s with llama.cpp and the CUDA backend.
I should play a bit more with llama.cpp options and see what happened there. Thanks!
Up until this point, I've found the cost/value to unequivocally favor using a cloud subscription, but I would be lying if I didn't worry that one day OpenAI is going to increase the price for my subscription by 5-10x. I rely on these tools enough that if there is a real viable local option, I'm going to take it.
the value of local models is allowing normal people to access AI without needing to subscribe to cloud services. this is esp imp for the rest of the world where even a 12GB gpu is extremely expensive.
there is no real viable local option that will come even close to Sonnet/Gemini Flash or the cheaper chinese models. Even if your pc costs <$2k you are never going to recoup the hw costs, and the results will be far worse.
Not really. There's a reason the announcement didn't include ANY benchmark (!) and didn't mention EXACTLY what is the memory bandwidth. It's going to be dog-slow unusable for large models, as tok/sec is basically bandwidth divided by active weights. Rumoured 300GB/s / 30GB active weights (decent model) = 10 tokens per second, which is really slow
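The rule of thumb in that comment (decode speed is roughly memory bandwidth divided by the bytes of active weights read per token) can be written down directly. This is only an upper-bound sketch that assumes decoding is purely bandwidth-bound; it ignores KV-cache reads, compute limits, and speculative decoding, and the function name is made up for the example.

```python
def decode_tokens_per_second(bandwidth_gb_s, active_weights_gb):
    """Upper bound on decode speed if each token streams all active weights once."""
    if active_weights_gb <= 0:
        raise ValueError("active_weights_gb must be positive")
    return bandwidth_gb_s / active_weights_gb


# The figures from the comment: rumoured 300 GB/s and 30 GB of active weights.
print(decode_tokens_per_second(300, 30))  # 10.0
```

The same division explains why MoE models feel faster on the same hardware: only the active experts are streamed per token, so the denominator shrinks.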
The Nvidia boxes have only slightly more memory bandwidth, so I wouldn't expect them to be notably faster. At least not enough to make it useful interactively at that scale.
Even batched usage needs to be fast enough to deliver results in a reasonable time. Overnight runs are useful, 24 hour runs are...less so.
Anyway, most of the time people are talking about interactive use, and there's currently an upper bound on how large a model can be for local hosting on a reasonable budget (i.e. not a crazy amount more expensive than what a high end developer desktop or laptop costs). The sweet spot is probably currently the big Qwen 3.6 or Gemma 4 models, which are in the ~60GB range for 8-bit quantization plus a large context.
The dense model is almost usable, but feels really sluggish, even with MTP. I think it's about 12-15 tokens/second on the Strix Halo. Slow enough to where I'd rather pay to use a cloud model.
I might try the 6-bit version of the dense model to see how it behaves, though. Maybe it'll retain its bug hunting abilities while making it fast enough to use interactively and not take all day for benchmark runs.
Can you instruct it to use a lsp?
Thank you for giving me hope!