Accelerating Gemma 4: faster inference with multi-token prediction drafters
63 points by amrrs 2 hours ago | 16 comments

zdw 21 minutes ago
MTP support is being added to llama.cpp, at least for the Qwen models (https://github.com/ggml-org/llama.cpp/pull/20533), and I'd imagine Gemma 4 will come soon.

The uplift in both quality and speed for local/self-hosted models over the last few months has been amazing.

reply
WhitneyLand 2 minutes ago
Yeah, it's important conceptually to remember that MTP is basically just more weights; speculative decoding is the runtime algorithm, and that's a significant addition to whatever code is serving the model.
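Roughly, the serving-side verify loop looks like this. This is a minimal Python sketch, assuming hypothetical draft.sample / draft.prob / target.score / target.resample helpers, not any real serving API:

    import random

    def speculative_step(target, draft, tokens, k=4):
        # 1) The small drafter proposes k tokens autoregressively (cheap).
        proposed, ctx = [], list(tokens)
        for _ in range(k):
            t = draft.sample(ctx)                  # hypothetical helper
            proposed.append(t)
            ctx.append(t)

        # 2) The big target scores all k positions in ONE forward pass --
        #    that batching is where the speedup comes from.
        dists = target.score(tokens, proposed)     # hypothetical helper

        # 3) Accept the longest prefix the target agrees with; on the
        #    first rejection, resample from the target so the output
        #    distribution matches the target model exactly.
        out = []
        for i, t in enumerate(proposed):
            accept = min(1.0, dists[i][t] / draft.prob(ctx[:len(tokens) + i], t))
            if random.random() < accept:
                out.append(t)
            else:
                out.append(target.resample(dists[i]))  # adjusted dist; details omitted
                break
        return out

Real implementations (llama.cpp, vLLM) also sample a free "bonus" token from the target's last distribution and manage KV caches for both models, but the accept/reject core above is the runtime complexity I mean.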
reply
EGreg 6 minutes ago
How does this get added in practice?
reply
dakolli 17 minutes ago
Yet, still mostly useless.
reply
skybrian 13 minutes ago
Watching the computer write text sort of reminds me of using a modem to call a BBS in the old days. This seems like going from 300 baud to 1200 - a significant improvement, but still pretty slow, and someday we will wonder how we put up with it.
reply
shay_ker 3 minutes ago
Curious that they're doing speculative decoding with a separate drafter rather than baking MTP into the model, like Nemotron:

https://docs.nvidia.com/megatron-core/developer-guide/0.15.0...
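The baked-in version is conceptually just extra output heads on the shared trunk. A toy PyTorch sketch (illustrative only; Nemotron's and Gemma's actual architectures differ):

    import torch
    import torch.nn as nn

    class MTPHeads(nn.Module):
        """Toy multi-token-prediction heads: extra projections that
        predict tokens t+2, t+3, ... from the same hidden state the
        normal t+1 head sees. (Illustrative; real designs differ.)"""

        def __init__(self, hidden_size: int, vocab_size: int, n_future: int = 3):
            super().__init__()
            # One extra projection per future offset -- the "just more
            # weights" part of MTP.
            self.heads = nn.ModuleList(
                [nn.Linear(hidden_size, vocab_size) for _ in range(n_future)]
            )

        def forward(self, hidden: torch.Tensor) -> list[torch.Tensor]:
            # hidden: [batch, seq, hidden_size] from the shared trunk.
            # A server can treat these logits as draft proposals and
            # verify them against the normal next-token head.
            return [head(hidden) for head in self.heads]

The upside is that no separate drafter model has to be loaded or kept vocabulary-aligned; the downside is the serving stack still needs the verification machinery either way.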

reply
these 32 minutes ago
Has anyone managed to get this to work in LM Studio? They've got an option in the UI, but it never seems to allow me to enable it.
reply
Havoc 2 minutes ago
Normally when LM Studio doesn't like it, it's because of mmproj files in the model folder. Sometimes removing them makes the option show up.

They're somehow connected to vision and they block speculative decoding... don't ask me how or why, though.

For Gemma specifically I've had more luck with speculative decoding via the llama-server route than LM Studio.

reply
dvt 21 minutes ago
It's not implemented in mlx[1] yet (or llama.cpp[2]), so it may take a while.

[1] https://github.com/ml-explore/mlx-lm/pull/990

[2] https://github.com/ggml-org/llama.cpp/pull/22673

reply
svachalek 13 minutes ago
I've gotten it to work with other models. They usually have to be perfectly aligned in terms of provider, quantization, etc. It might be a bit before you can get a matched set.
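For what it's worth, Hugging Face transformers' assisted generation is an easy way to sanity-check whether two checkpoints make a valid pair outside LM Studio. The model IDs below are placeholders; the hard requirement is that target and draft share a tokenizer/vocab, which is the "matched set" problem:

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Placeholder model IDs -- substitute any target/draft pair that
    # shares a tokenizer and vocabulary.
    tok = AutoTokenizer.from_pretrained("big-model")
    target = AutoModelForCausalLM.from_pretrained("big-model")
    draft = AutoModelForCausalLM.from_pretrained("small-model")

    inputs = tok("The capital of France is", return_tensors="pt")
    # assistant_model enables speculative (assisted) decoding.
    out = target.generate(**inputs, assistant_model=draft, max_new_tokens=32)
    print(tok.decode(out[0], skip_special_tokens=True))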
reply
mchusma 39 minutes ago
I find it puzzling that Google doesn't actively promote its own cloud for Gemma 4 inference. Open source is great, love it. But shouldn't Google want me to be able to use and pay for it through Gemini and Vertex?
reply
Farmadupe 31 minutes ago
I wonder if, for a model this small with a permissive license, it might not be worth their time to host a commercial-grade inference stack?

Might be easier to chuck it over the fence and let other providers handle it, as it'll run on almost any commercial-grade card?

Also speculating, but I wonder if it might also create a bit of a pricing problem relative to Gemini Flash-Lite, depending on serving cost and quality of outputs?

As a comparison, despite being SotA for their size, the smallest Qwen models on OpenRouter (27b and 35b) are not at all worth using, as there are far bigger and better models available for a lower price on a per-token basis.

reply
disiplus 19 minutes ago
I don't know what you're talking about. I replaced an older GPT-4o with a fine-tuned Qwen. There is a huge amount of "AI" that can be done with those models, or partly by those models. A huge number of people would not notice the difference, and if you prepare the context correctly, an even bigger slice of people wouldn't notice.
reply
dakolli 13 minutes ago
Genuinely curious: what are you fine-tuning these smaller models to do reliably? I hear this talked about a lot, but very few people actually cough up examples, and I'd love to hear of one.
reply
seamossfet 30 minutes ago
[dead]
reply
disiplus 27 minutes ago
Nice, I'll run it later against qwen3.6 27b. Speed was one of the reasons I was running Qwen and not Gemma; the difference was big. There's some magic that happens when you get more than 100 tps.
reply