DeepSeek 4 Flash local inference engine for Metal
342 points by tamnd 14 hours ago | 95 comments

kgeist 12 hours ago
Heh, I made something very similar for the Qwen3 models a while back. It only runs Qwen3, supports only some quants, loads from GGUF, and has inference optimized by Claude (in a loop). The whole thing is compact (just a couple of files) and easy to reason about. I made it for my students so they could tinker with it and learn (add different decoding strategies, add abliteration, etc.). Popular frameworks are large, complex, and harder to hack on, while educational projects usually focus on something outdated like GPT-2.

Even though the project was meant to be educational, it gave me an idea I can't get out of my head: what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination? GPUs are expensive and harder to get by the day. If you remove enough abstractions and code directly against the exact hardware/model, you can probably optimize things quite a lot (I hope). Maybe run an agent that tries to optimize inference in a loop (like autoresearch), empirically testing speed/quality.

The only problem with this is that once a model becomes outdated, you have to do it all again from scratch.
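
Sketched as code, the loop I have in mind is basically a hill climb over kernel source. A minimal sketch, where the three hooks (propose_variant, benchmark, is_correct) are hypothetical placeholders you'd wire up yourself (an LLM call, a tok/s timer, an output diff), not any real API:

    from typing import Callable

    def autoloop(
        kernel_src: str,
        propose_variant: Callable[[str], str],  # e.g. an LLM rewriting the kernel
        benchmark: Callable[[str], float],      # measured tokens/sec for a kernel
        is_correct: Callable[[str], bool],      # rejects variants that change outputs
        iterations: int = 100,
    ) -> tuple[str, float]:
        best_src, best_tps = kernel_src, benchmark(kernel_src)
        for _ in range(iterations):
            candidate = propose_variant(best_src)
            if not is_correct(candidate):
                continue  # never trade correctness for speed
            tps = benchmark(candidate)
            if tps > best_tps:  # keep strict improvements only
                best_src, best_tps = candidate, tps
        return best_src, best_tps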

reply
Aurornis 8 hours ago
> what if we started building ultra-optimized inference engines tailored to an exact GPU+model combination?

The inference engines in use already include different backend building blocks optimized for different hardware.

While there are places where you can pick up some low-hanging fruit on less popular platforms, there isn't a lot of room to squeeze in super-optimized model runners for specific GPU families and get much better performance. The core computations are already done by highly optimized kernels for each GPU.

There are forks of llama.cpp with better optimizations for certain CPU architectures, but (barring maintainer disagreements) a better use of time is to merge those improvements upstream rather than build super-specific model+GPU runners.

reply
LoganDark 2 hours ago
When you support multiple backends, you end up having to abstract over them. Each backend may implement the abstraction to the best of its capability, but you still have to deal with the abstraction sitting between your workload and its compute. Wouldn't it be nice if you didn't need that abstraction? That's what GP is talking about, I'm sure: optimizing the workload directly for the hardware, rather than merely the workload and the backend for the abstraction.
reply
xtracto 10 hours ago
This takes me back to the famous high-throughput FizzBuzz code-golf answer [1]. If we could apply optimizations like that to inference, maybe we could increase speeds 10x or more.

[1] https://codegolf.stackexchange.com/questions/215216/high-thr...

reply
Juvination 9 hours ago
I love scrolling and reading through this, thinking: yeah, of course Python is slower than Java; oh wow, Rust is pretty much on par, I wonder what the Java devs did. Then you hit asm and your jaw drops.
reply
slaw 9 hours ago
Check out cpp at 208.3 GiB/s, 3x faster than asm.
reply
mirsadm 11 hours ago
I've built something like this. One issue is that LLMs are actually terrible at writing good shaders. I've spent way too much time trying to get them not to be so awful at it.
reply
wahnfrieden 8 hours ago
Just curious if you've tried GPT 5.5 Pro?
reply
joshmarlow 11 hours ago
Another suggestion for optimizing local inference: the Hermes team talks a lot on X about how much better results are when you use custom parsers tuned to the nuances of each model. Some models like to emit a trailing `,` in JSON output, some don't; if your parser can handle the quirks of the specific model, you get higher-performing functionality.
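
A toy version of what such a parser does, as a sketch (not the Hermes team's actual code): tolerate one model's trailing-comma habit before handing the text to a strict JSON parser.

    import json
    import re

    # Naive: this would also rewrite ",}" inside string values; a real
    # model-specific parser tracks string state while scanning.
    _TRAILING_COMMA = re.compile(r",\s*([}\]])")

    def parse_model_json(text: str) -> dict:
        """Parse JSON from a model that tends to emit trailing commas."""
        cleaned = _TRAILING_COMMA.sub(r"\1", text)  # ",}" -> "}" and ",]" -> "]"
        return json.loads(cleaned)

    print(parse_model_json('{"tool": "search", "args": ["foo",],}'))
    # -> {'tool': 'search', 'args': ['foo']}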
reply
didip 9 hours ago
What if PyTorch were extended to have a pluggable compiler? For M GPU types and N models, if the backend allows it, run a specialized compiler for each combination?
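
For what it's worth, torch.compile already accepts a pluggable backend, which is close to this. A minimal sketch of where a specialized compiler would hook in (the pass-through backend below does no real optimization):

    import torch

    def specialized_backend(gm: torch.fx.GraphModule, example_inputs):
        # A real backend would inspect gm.graph and emit kernels tuned to one
        # exact GPU + model combination; here we just return the unoptimized
        # forward so the sketch stays runnable.
        return gm.forward

    @torch.compile(backend=specialized_backend)
    def fused_op(x, w):
        return torch.relu(x @ w)

    print(fused_op(torch.randn(4, 8), torch.randn(8, 2)).shape)  # torch.Size([4, 2])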
reply
p_stuart82 4 hours ago
This feels closer to ATLAS/FFTW than a model runner: the generated kernel ages out; the tuning harness is the bit you actually want to keep.
reply
lhl 6 hours ago
I think that, especially with SOTA AI now able to optimize kernels, more people should try their hand at building better inference for their specific hardware.

I have an older W7900 (RDNA3) which, besides 48GB of VRAM, has some pretty decent roofline specs - 123 FP16 TFLOPS/INT8 TOPS, 864 GB/s MBW, but has had notoriously bad support both from AMD (ROCm) as well as llama.cpp.

Recently I decided I'd like to turn the card into a dedicated agentic/coder endpoint, so I started tuning a W8A8-INT8 model. Over the course of a few days of autolooping (about 800 iterations using a variety of frontier/SOTA models; Kimi K2.6 did surprisingly well), I ended up with prefill +20% and decode +50% over the best llama.cpp numbers for Qwen3.6 MoE.

I'm currently grinding MTP and DFlash optimization on it, but I've been pretty pleased with the results, and will probably try Gemma 4 next.
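
For anyone wondering what "good" looks like here: batch-1 decode is essentially memory-bandwidth-bound, so the ceiling is easy to estimate. A sketch; the 30B-active INT8 figure below is purely illustrative, not the actual model config:

    def decode_ceiling_tps(active_params: float, bytes_per_param: float,
                           mem_bw_gbs: float) -> float:
        # Each decoded token streams the active weights from VRAM once, so
        # bandwidth / bytes-per-token bounds tokens/sec at batch 1.
        return mem_bw_gbs * 1e9 / (active_params * bytes_per_param)

    # W7900: 864 GB/s memory bandwidth (from the specs above).
    print(f"{decode_ceiling_tps(30e9, 1.0, 864):.1f} tok/s ceiling")  # ~28.8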

reply
maherbeg 14 hours ago
This is so sick. I'm really curious to see what focused effort on optimizing a single open-source model could look like over many months. Not only on the inference-serving side, but also on harness optimization and building custom workflows to narrow the gap between what frontier models can infer and deduce and what open-source models natively lack due to size, training, etc.
reply
dakolli 13 hours ago
There will always be a huge gap between frontier models and open-source models (unless you're very rich). This whole industry makes no sense; everyone is ignoring the unit economics. It costs $20k a month to run Kimi 2.6 at decent tok/s; to sell those tokens at a profit you'd need your hardware costs to be less than $1k a month.

Everyone who's betting their competency on the generosity of billionaires selling tokens at 1/10th-1/20th of cost, or on a delusional future where capable OS models fit on consumer-grade hardware, is actually cooked.

reply
bensyverson 13 hours ago
If you look at a graph of GPU power in consumer hardware and model capability per billion parameters over time, it seems inevitable that in the next few years a "good enough" model will run on entry-level hardware.

Of course there will always be larger flagship models, but if you can count on decent on-device inference, it materially changes what you can build.

reply
physicsguy 12 hours ago
It also massively changes the value economics of the frontier models. In a lot of cases, you really don't need a general purpose intelligence model too.
reply
bensyverson 12 hours ago
Exactly… as HN readers, we sometimes forget that a lot of people are using these tools to search for the best sunscreen or rewrite an email.
reply
dakolli 12 hours ago
No offense, this is a crazy delusional statement.
reply
afro88 12 hours ago
No offense, this is a crazy worthless contribution to the discussion.

Why?

reply
dakolli 11 hours ago
Because everyone in these replies is in complete denial about the physical limits of memory and scaling in general. Y'all are literally living in an alternate reality where model capability increases as size decreases; it's simply not the case. There will be small, focused models that perform well on very narrow tasks, yes, but you will not have "agents" capable of "building most things" running on consumer hardware until more capable (and affordable) consumer hardware exists.
reply
bensyverson 11 hours ago
Ah, you haven't realized that consumer hardware gets more capable over time
reply
adrian_b 9 hours ago
Not this year, when many vendors either offer lower memory capacities or demand higher prices for their devices.
reply
bensyverson 7 hours ago
Correct, the progress is not perfectly linear. But do you believe technological progress has stalled forever? If so, I'd get out of tech and start selling bomb shelters.
reply
dakolli 6 hours ago
Do you really think the trend in consumer hardware is heading toward more memory and better specs? Apple's most popular product this year is a laptop with 8GB of RAM...

The trend is heading in the opposite direction: fewer options for strong consumer hardware, and a push toward cloud-based products. This is a memory issue more than anything. Nvidia is done selling their GDDR7 to gamers and people with AI girlfriends.

reply
bensyverson 3 hours ago
Just so that I have your position straight: you actually believe that over the long term, like 10, 20 years, that the amount of RAM in a laptop is going to go down?

It's not out of the realm of possibility, but I just want to make you aware that this would be a very surprising development in computing history.

reply
fulafel 2 hours ago
This seems to be a different discussion than was going on up thread about:

> in the next few years a "good enough" model will run on entry-level hardware

reply
wtallis 2 hours ago
Exactly. In the next few years, entry-level hardware will not be advancing beyond 16GB. And anything beyond 32GB will remain decidedly high-end.

And that's for laptops with unified memory. In the desktop space, 8GB discrete GPUs are going to be sticking around for a very long time.

reply
dakolli 3 hours ago
A future with less RAM is possible if more applications use computational storage with SSD/NVMe.

But that's beside my main argument, which is that it's delusional for OP to think it's reasonable to expect that soon we'll be able to run models on consumer hardware that can build basically most things.

But I do think there will be many compromises made in consumer electronics. I don't think the powers that be are eager to give consumers all the best memory (that should be clear by now). There are 3 DDR5 DRAM manufacturers in the world that have to supply memory to all the world's militaries, governments, and datacenters/corporations. Consumers are the last priority.

reply
iuffxguy 3 hours ago
This is more than just hardware evolving over time; we're also seeing big improvements in quantization and efficiency.
reply
dakolli 3 hours ago
There are physical limits to how much you can compress data. I'm just saying, don't sit on your hands waiting for this to happen, because it's probably not going to for another decade+. There's no use in waiting; just write the code your fkin self and stop being lazy.
reply
liuliu 12 hours ago
I am not sure where this comment is coming from (possibly written without looking at the project?). This project runs a quasi-frontier model at reasonable tps (~30) with reasonable prefill performance (~500 tps) on a high-end laptop. People are simply projecting what they see in this project onto what you can optimistically expect.

You can argue whether the projection is too optimistic or not, but this project definitely made me a little bit optimistic on that end.

reply
maherbeg 8 hours ago
There will always be a gap, but what's interesting is that because new models are constantly coming out, we as an industry never spend any time extracting the maximum value from an existing model. What if there are techniques and harness workflows that could be optimized for a single model end to end? How far could that push the state of the art?

An example is https://blog.can.ac/2026/02/12/the-harness-problem/ for just improving edits.

Or, if we could really steer these open-source models using well-structured plans, could we spend more time planning in a specific way and kick off the build overnight (a la the night shift https://jamon.dev/night-shift)?

reply
amunozo 12 hours ago
Most tasks do not require frontier models, so as long as open models cover 95-99 percent of tasks, closed frontier models can be left for the harder niche and specialized cases.
reply
dakolli 11 hours ago
Frontier models can hardly do the tasks I want them to; I simply cannot buy into this notion.
reply
drob518 10 hours ago
For instance?
reply
daveguy 6 hours ago
> There will always be a huge gap between frontier models and open source models (unless you're very rich).

They said the same thing about open source chess engines.

reply
otabdeveloper4 12 hours ago
> a delusional future where capable OS models fit on consumer grade hardware

48 GB is enough for a capable LLM.

Doing that on consumer grade hardware is entirely possible. The bottleneck is CUDA and other intellectual property moats.

reply
antirez 12 hours ago
A random, funny, interesting, and telling data point: my MacBook M3 Max, while DS4 is generating tokens at full speed, peaks at 50W of power usage...
reply
minimaxir 12 hours ago
"Data centers for LLMs are technically more energy efficient per-user than self-hosting LLM models due to economies-of-scale" is a data point the internet isn't ready for.
reply
wlesieutre 9 hours ago
But if you're running it on your own hardware, you might only generate tokens when you have something useful to do with them, instead of every time you load a Google search results page because Google decided the future is stuffing Gemini-generated answers down your eyeballs instead of letting you read the primary source yourself for 0.1 watts.
reply
menno-sh 10 hours ago
If LLMs were a mature product then this would be true at some point. However, you could argue (and I will) that the popularization of on-device LLM inference will lead to two things:

- Consumers of LLM inference (developers and hobbyists) will be more aware of compute cost, leading them to develop more token-efficient uses of LLM inference and to pick the right model for the right job (instead of throwing Sonnet at the wall and following up with Opus if that doesn't stick)

- A larger market for on-device (and therefore open-weight) LLMs will probably result in more research concentrated on those inherently more efficient (because compute/memory-constrained) models.

I think that despite the inefficiencies, shifting the market toward local inference would be a net positive in terms of energy use. Remember that 50W might seem like a lot, but it's still much less than what, say, a PS5 draws.

Also, remember how AWS had the same promise, and now we're just deploying stack after stack and need 'FinOps' teams to get us to be more resource-efficient?

reply
aeonfox 6 hours ago
Separate to the self-host/datacentre argument, it would be interesting to see a speed/performance/watts-per-token leaderboard between leading models. Which model is the most watt-efficient?
reply
airstrike 9 hours ago
This is neither a controversial take nor a reason to prefer third-party hosting over self-hosting, so I don't think the internet really needs to be ready for it.
reply
cortesoft 11 hours ago
I thought this was a pretty generally accepted fact?
reply
crazygringo 5 hours ago
I've seen plenty of people on HN claim that LLMs running on their phones are the obvious future in terms of not just privacy but also efficiency, i.e. better along every possible metric.

They don't usually go into much detail, but the impression I get is that they think data centers are energy monsters full of overheated GPUs that need to be constantly replaced, while your phone is full of mostly unused compute capacity and will barely break a sweat if it's only serving queries for a single user at a time.

They don't seem to give much thought to the energy usage per user (or what this will potentially do to your phone battery), or how different phone-sized vs data center-sized models are in terms of capability.

reply
drob518 10 hours ago
This is pretty much true for all applications.
reply
Onavo 12 hours ago
There's a bunch of companies doing garage GPU datacenters now. They could probably act as a heat source during winter too, if you have a heat pump.
reply
kristianp 6 hours ago
That's an interesting idea [1], the value being that it's easier to build servers into a bunch of homes that are being built than to build a datacenter. Every now and then something reminds me of "Dad's Nuke", a novel by Marc Laidlaw about a family that has a nuclear reactor in their basement. A really bizarre, memorable satire [2].

[1] https://finance.yahoo.com/sectors/technology/articles/nvidia...

[2] https://en.wikipedia.org/wiki/Dad%27s_Nuke

reply
Lalabadie 11 hours ago
Using only this dimension in a vacuum, it sounds like an easy choice, but we're extremely early in this market, and the big providers are already a mess of pricing choices, pricing changes, and sudden quota adjustments for consumers.

Plus, a Mac that's not running inference idles down to 1-5W, only drawing power when it needs to. Datacenters must maximize utilization; individuals and their devices don't have to.

A Mac is also the rest of the personal computer!

reply
j_maffe 10 hours ago
But it's simply an economic fact that economies of scale will be more efficient for a task that's this easy to offload somewhere else.
reply
losvedir 10 hours ago
It's so interesting to think about how much power it takes these machines to "think". I had a vague notion that it was "a lot", but it's good to put a number on it.

If DS4 Flash peaks at 50W and is 280B parameters, does that mean DS4 Pro at 1.6T parameters would likely be 300W or so? And the latest GPT 5 and Opus, which feel maybe comparable-ish, around 500W? Is it fair to say that when I'm using Claude Code and it's "autofellating" or whatever, I'm burning 500W in a datacenter somewhere during that time?

reply
Aurornis 8 hours ago
There isn't a relationship between parameter size and energy use like that. You could run a 280B parameter model on a Raspberry Pi with a big SSD if you were so determined. The energy use would be small, but you would be waiting a very long time for your response.

Data center energy use isn't simple to calculate because servers are configured to process a lot of requests in parallel. You're not getting an entire GPU cluster to yourself while your request is being processed. Your tokens are being processed in parallel with a lot of other people's requests for efficiency.

This is why some providers can offer a fast mode: Your request gets routed to servers that are tuned to process fewer requests in parallel for a moderate speedup. They charge you more for it because they can't fit as many requests into that server.

reply
zozbot234 7 hours ago
You're thinking about power use, not energy. There are systems that can more directly minimize energy per operation at the cost of high latency but they look more like TPUs than Raspberry Pi's.
reply
zozbot234 9 hours ago
Energy use for any given request is going to be roughly proportional to active parameters, not total. That would be something like 13B for Flash and 49B for Pro. So you'd theoretically get something like 190W if you could keep the same prefill and decode speed as Flash, which is unlikely.
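
Worked through (assuming power scales with active params at constant tok/s, both rough assumptions):

    flash_active, pro_active = 13e9, 49e9  # active parameters per token, as above
    flash_watts = 50
    print(flash_watts * pro_active / flash_active)  # ~188 W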
reply
eurekin 10 hours ago
Batching lowers that, since the model is read from memory once per batch. Activation accumulation doesn't scale as nicely.
reply
wmf 10 hours ago
Power isn't proportional to parameters. It may be vaguely proportional to tokens/s although batching screws that up.

Claude Sonnet is probably running on an 8-GPU box that consumes 10 kW, while Opus might use more like 50 kW, but that's shared by a bunch of users thanks to batching.
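
Back-of-the-envelope for the per-user share (the 64 concurrent streams is a made-up batch size, purely illustrative):

    box_watts = 10_000   # the 8-GPU box above
    streams = 64         # hypothetical concurrent requests sharing the box
    print(box_watts / streams)  # ~156 W per active user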

reply
jwr 11 hours ago
Not everybody might realize this, but this is a truly excellent and very impressive result. Most models on my M4 Max run at 150W of power consumption.
reply
Aurornis 8 hours ago
Power consumption numbers aren't useful for efficiency calculations without also considering the tokens per second for the same model and quantization.

I could write an engine that only uses 10W on your machine, but it wouldn't be meaningful if it was also 10X slower.

More power consumption is usually an indicator that the hardware is being fully utilized, all things being equal (comparing GPU to GPU or CPU to CPU, not apples to oranges).

reply
dkga 8 hours ago
Is that a serious number? By the way, how does a hardware normie like me even measure this?
reply
wmf 8 hours ago
Most components have built-in power measurement (although some are more accurate than others). Apps like Intel Power Gadget, Mx Power Gadget, Afterburner, Adrenalin, etc. can show power usage in real time.
reply
bertili 12 hours ago
That equals 2 or 3 human brains in power usage. Amazing work!
reply
antirez 12 hours ago
True quantitatively, not qualitatively. DeepSeek V4 is not capable of doing what a human brain can do, of course, but the tasks it can do, it does at a speed that is completely impossible for a human, so comparing the two requires some normalization for speed.
reply
scotty79 11 hours ago
I'm sure human brain, at least my present brain, is incapable of many things DeepSeek V4 can do. Qualitatively.
reply
Hamuko 12 hours ago
I think I've seen about 60 watts of total system power whenever I've used a local model on a MacBook Pro or a Mac Studio. Baseline for the Mac Studio is like 10 W, and like 6 W for the MacBook Pro.
reply
layoric 6 hours ago
Very impressive. One thing that seems odd to me is that it takes like 4 minutes before it starts a response for a large input? I don't use Mac hardware for LLMs, but that is quite surprising and would seem to be a pretty large stumbling block for practical usage.

Edit: The caching story makes a lot more sense for regular usage: > Claude Code may send a large initial prompt, often around 25k tokens, before it starts doing useful work. Keep --kv-disk-dir enabled: after the first expensive prefill, the disk KV cache lets later continuations or restarted sessions reuse the saved prefix instead of processing the whole prompt again.
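
For intuition, a disk KV prefix cache along those lines usually boils down to something like this sketch (not this project's actual code):

    import hashlib
    from pathlib import Path

    def prefix_key(tokens: list[int]) -> str:
        return hashlib.sha256(repr(tokens).encode()).hexdigest()

    def longest_cached_prefix(tokens: list[int], cache_dir: Path) -> int:
        """Length of the longest token prefix whose KV state is saved on disk."""
        # Naive linear scan; real implementations hash fixed-size token blocks
        # so a lookup doesn't rehash every possible prefix length.
        for n in range(len(tokens), 0, -1):
            if (cache_dir / prefix_key(tokens[:n])).exists():
                return n  # only tokens[n:] still need an expensive prefill
        return 0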

reply
visarga 12 hours ago
Large LLMs on a MacBook produce tokens at an acceptable speed, but the problem is reading context. Not incremental reading, like in an ongoing chat session (the KV cache covers that), but bulk reading, like when you paste a big file. It can take minutes.
reply
habosa 2 hours ago
Can you ELI5 why this is so slow for local inference but so fast for using hosted models?
reply
antirez 11 hours ago
DS4 can process 460 prompt tokens per second on an M3 Max. Not stellar, but not that slow. See the benchmarks in the README.
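
So the minutes-long waits mentioned upthread are just arithmetic, e.g.:

    prompt_tokens = 100_000  # say, a big pasted file
    prefill_tps = 460        # M3 Max prefill speed from the benchmarks
    print(prompt_tokens / prefill_tps)  # ~217 s, about 3.6 minutes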
reply
bel8 11 hours ago
And unless I'm mistaken, the repo is about running it with 2-bit quantization.

This is probably far from the raw intelligence provided by cloud providers.

Still, this shines more light on local LLMs for agentic workflows.

reply
antirez 9 hours ago
It runs both q2 and the original (4-bit routed experts), at more or less the same speed. The q2 quants are not what you might expect: they work extremely well, for a few reasons. For the full model you need a Mac with 256GB.
reply
someone13 3 hours ago
Out of curiosity, do you have any theories of why it works so well at such aggressive quantization levels?
reply
brcmthrowaway 10 hours ago
Why is this the case?

Are there any architectures that don't rely on feeding the entire history back into the chat?

Recurrent LLMs?

reply
amunozo 12 hours ago
I am curious about it producing fewer tokens except in max mode. I love DeepSeek V4 Flash and I use it extensively; it's so cheap I can use it all day and still not use up my $10 OpenCode Go subscription. Because of this I always use it in max mode, but now I wonder whether I should use high instead.
reply
unshavedyak 12 hours ago
What do you use it for? I tend to just stick to SOTA (Claude 4.7 Max thinking) and put up with the slow req/response. I'm not sure what type of work I'd trust to a less capable thinking model, as my intuition is built around what Claude vSOTA Max can handle.

Nonetheless, eventually I want to build an at-home system. I imagine some smaller local model could handle metadata assignment quite well.

edit: Though TIL the Mac Studio doesn't offer 512GB anymore... DRAM shortage lol. Rough.

reply
amunozo 11 hours ago
I am experimenting with some game development and my thesis's Beamer slides. I have a $20 Codex account, and I use GPT-5.5 for planning and DeepSeek for execution in OpenCode. This makes my Codex 5-hour tokens last more than 10 minutes.
reply
actsasbuffoon 10 hours ago
Apple just dropped the 128GB option as well.
reply
fgfarben 3 hours ago
It is still available for the M5 Max Macbook Pro, but yes, the Mac Studio is now only offered with up to 96 GB.
reply
PhilippGille 12 hours ago
On max it uses more than twice as many tokens as on high when running the ArtificialAnalysis benchmark suite, and it's indeed the model with the highest token usage among the current top-tier models. See the "Intelligence vs. Token Use" chart here:

https://artificialanalysis.ai/models?models=gpt-5-5%2Cgpt-5-...

reply
amunozo 11 hours ago
Wow, the difference is quite considerable and the gain in intelligence is not that large. I might try using high and just iterating more often. I'm working on hobby stuff, so I don't have to worry about whether it breaks things.
reply
syntaxing 12 hours ago
How has OpenCode Go been for you? Worth changing over from Claude Pro?
reply
DefineOutside 11 hours ago
I've found that OpenCode and Codex are the two subscriptions that still seem to subsidize usage. DeepSeek V4 has been the most powerful model in OpenCode IMO; I trust it with problems where I can validate the solution, such as debugging an issue, but I only trust the proprietary GPT-5.5 and Claude Opus 4.7 models for writing code that matters.
reply
amunozo 11 hours ago
Given the price, extremely satisfied, especially thanks to DeepSeek V4 Flash, which makes it last forever. I use it on top of my $20 Codex, which is great, but its tokens last no time at all.
reply
nazgulsenpai 12 hours ago
I keep seeing DS4, and in order my brain interprets it as: Dark Souls 4 (sadface), DualShock 4, Deep Seek 4.
reply
throwaway613746 11 hours ago
[dead]
reply
Havoc 9 hours ago
Was excited until I realized DS flash is still enormous. Oh well...glad it exists anyway & happy to see antirez still doing fun stuff
reply
zozbot234 7 hours ago
It could run viably with SSD offload on Macs with very little memory. You could even exploit batching to make the model almost compute-limited even in that challenging setting, seeing as the KV cache is so extremely small (for non-humongous context). In fact, if that approach can be made to work, I'd like to see a comparison between DS4 Flash and Pro on the same (Mac) hardware.
reply
Havoc 6 hours ago
>It could run viably with SSD offload on Macs with very little memory

Not really. That's going to land you somewhere in the 0.2-0.5 tokens-per-second range.

Lovely as modern NVMes are, they're not memory.

reply
zozbot234 6 hours ago
You can run multiple inferences in parallel on the same set of weights; that's what batching is. Given enough parallelization it can be almost entirely compute-limited, at least for small contexts (max ~10GB per request apparently, but that's for 1M tokens!).
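
Roughly, with all numbers illustrative:

    single_stream_tps = 0.3  # middle of the 0.2-0.5 tok/s SSD estimate above
    batch = 32               # hypothetical parallel requests sharing one weight pass
    print(single_stream_tps * batch)  # ~9.6 aggregate tok/s, until compute saturates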
reply
sourcecodeplz 12 hours ago
Great project!

This is also a fine example of a vibe-coded project with purpose, as you acknowledged.

reply
brcmthrowaway 10 hours ago
How does this compare with oMLX?
reply
micalo 2 hours ago
[flagged]
reply
andrefelipeafos 7 hours ago
[flagged]
reply
m00dy 14 hours ago
[dead]
reply
happyPersonR 12 hours ago
So just gonna ask a question, probably will get downvoted.

I know this is Flash, but…

Other than this guy, did our whole society seriously never flamegraph this stuff before we started requesting nuclear reactors colocated at data centers and, like, more than 10% of GDP?

Someone needs to answer, because this isn't even an M4 or M5… WHAT THE FUCK

Sidenote: shout out antirez, love my Redis :)

reply
AlotOfReading 12 hours ago
This is built atop a tower of stuff people built with profiling and performance-oriented design.

That said, I've found that most corporate environments are unintentionally hostile to this kind of optimization work. It's hard to justify until the work is already done. That means you often need people with the skills, means, and motivation to do this that are outside normal corporate constraints. There aren't many of those.

reply
happyPersonR 12 hours ago
Building this into agentic dev workflows (subject to token/time constraints) is something I spent a lot of time doing at work. I'm actually kind of proud of that hahah

But you're right, I agree.

In the corporate world they sadly don't take kindly to performance profiling as a first-class citizen.

Granted, I will say optimization without requirements may not be beneficial, but profiling itself seems worthwhile if you have use cases.

A lot of us have been working in the network packet pusher software, distributed systems, and distributed storage space.

I'm happy to see more stuff like this :)

TL;DR: I've not seen a lot of flamegraphs of LLMs end to end… idk if anyone else has?

reply
liuliu 12 hours ago
DSv4 generates much faster on NVIDIA class hardware. It is just a very efficient model.
reply
wmf 12 hours ago
Every lab has a bunch of people doing nothing but optimizing.
reply
fgfarben 12 hours ago
The world is not China.
reply