Accelerating Gemma 4: faster inference with multi-token prediction drafters
63 points by amrrs 2 hours ago | 16 comments
nalinidash 14 seconds ago
Technical details are here:
https://x.com/googlegemma/status/2051694045869879749
reply
skybrian 13 minutes ago
Watching the computer write text sort of reminds me of using a modem to call a BBS in the old days. This seems like going from 300 baud to 1200 - a significant improvement, but still pretty slow, and someday we will wonder how we put up with it.
reply
shay_ker 3 minutes ago
curious that they are doing speculative decoding and not baking MTP into the model, like Nemotron
https://docs.nvidia.com/megatron-core/developer-guide/0.15.0...
reply
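For anyone unfamiliar with the distinction: with a speculative-decoding drafter, a small model proposes a few tokens and the big model verifies them in one forward pass, whereas MTP in the Nemotron sense trains extra prediction heads into the target model itself, so no second model is needed at serve time. Here's a minimal sketch of the drafter-based approach using Hugging Face transformers' assisted generation; the checkpoint names are placeholders, not actual Gemma releases.

    # Sketch of draft-and-verify speculative decoding with a separate drafter model.
    # Checkpoint names are placeholders for illustration only.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("google/gemma-large")         # hypothetical target checkpoint
    target = AutoModelForCausalLM.from_pretrained("google/gemma-large")
    drafter = AutoModelForCausalLM.from_pretrained("google/gemma-drafter")  # hypothetical small drafter

    inputs = tokenizer("Explain speculative decoding in one sentence.", return_tensors="pt")

    # The drafter proposes several tokens per step; the target verifies them in a
    # single forward pass and keeps the longest accepted prefix, so the output
    # matches what the target model would have produced on its own.
    out = target.generate(**inputs, assistant_model=drafter, max_new_tokens=128)
    print(tokenizer.decode(out[0], skip_special_tokens=True))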
these 32 minutes ago
Has anyone managed to get this to work in LM Studio? They've got an option in the UI, but it never seems to let me enable it.
reply
Havoc 2 minutes ago
Normally when LM Studio doesn't like a model, it's because of the presence of mmproj files in the folder. Sometimes removing them helps it show up.
They're somehow connected to vision & block speculative decoding... don't ask me how/why though.
For Gemma specifically, I've had more luck with speculative decoding via the llama-server route than LM Studio.
reply
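If anyone wants to try the llama-server route, here's a rough sketch of launching it with a draft model (wrapped in Python only to keep the examples in one language). The flag names are from memory of recent llama.cpp builds and the GGUF filenames are placeholders, so double-check against llama-server --help.

    # Launch llama.cpp's server with a small draft model for speculative decoding.
    # Flags and filenames below are assumptions -- verify with `llama-server --help`.
    import subprocess

    subprocess.run([
        "llama-server",
        "-m", "gemma-main-Q4_K_M.gguf",    # main model (placeholder filename)
        "-md", "gemma-draft-Q4_K_M.gguf",  # draft model (placeholder filename)
        "--draft-max", "8",                # cap on tokens drafted per step
        "-ngl", "99",                      # offload all layers to the GPU
        "--port", "8080",
    ])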
dvt 21 minutes ago
It's not implemented in mlx[1] yet (or llama.cpp[2]), so it may take a while.
reply
mchusma 39 minutes ago
I find it puzzling that Google doesn't actively promote its own cloud for Gemma 4 inference. Open source is great, love it. But shouldn't Google want me to be able to use and pay for it through Gemini and Vertex?
reply
Farmadupe 31 minutes ago
I wonder if, for a model that small with a permissive license, it might not be worth their time to host a commercial-grade inference stack?
replyMight be easier to chuck it over the fence and let other providers handle it as it'll run in almost any commercial grade card?
Also speculating, but I wonder if it might also create a bit of a pricing problem relative to Gemini Flash-Lite, depending on serving cost and quality of outputs?
As a comparison: despite being SotA for their size, the smallest Qwen models on OpenRouter (27b and 35b) are not at all worth using, as there are far bigger and better models for a lower price on a per-token basis.
disiplus 19 minutes ago
I don't know what you're talking about; I replaced an older GPT-4o with a fine-tuned Qwen. There is a huge amount of "AI" work that can be done with those models, or partly by those models. A huge number of people would not notice the difference, and if you prepare the context correctly, an even bigger slice of people would not notice.
reply
The uplift in both quality and speed for local/self-hosted models has been amazing in the last few months.