Is this solution based on what Apple describes in their 2023 paper 'LLM in a flash' [1]?
This is why mixture of experts (MoE) models are favored for these demos: Only a portion of the weights are active for each token.
The iPhone 17 Pro only has 12GB of RAM. This is a MoE model with ~17B-parameter experts. Even quantized, you can only realistically fit one expert in RAM at a time. Maybe 2 with extreme quantization. It's just swapping them out constantly.
If some of the experts were unused then you could distill them away. This has been tried! You can find reduced MoE models that strip away some of the experts, though it's only a small number. Their output is not good. You really need all of the experts to get the model's quality.
When the individual expert sizes are similar to the entire size of the RAM on the device, that's your only option.
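For anyone unfamiliar, the "only a portion of the weights are active" property comes from a learned router picking the top-k experts per token. A minimal sketch in Python (shapes and names are illustrative, not from any particular model):

```python
import numpy as np

def moe_forward(x, router_w, experts, k=2):
    """Route one token through only the top-k of n experts.

    x: (d,) token activation; router_w: (n_experts, d);
    experts: callables whose weights would only need to be
    fetched from flash/RAM when the router selects them.
    """
    logits = router_w @ x
    topk = np.argsort(logits)[-k:]            # indices of the chosen experts
    gates = np.exp(logits[topk] - logits[topk].max())
    gates /= gates.sum()                      # softmax over the chosen k
    # Only k expert weight matrices are touched for this token.
    return sum(g * experts[i](x) for g, i in zip(gates, topk))

d, n_experts = 8, 16
rng = np.random.default_rng(0)
experts = [(lambda W: (lambda x: W @ x))(rng.standard_normal((d, d)))
           for _ in range(n_experts)]
y = moe_forward(rng.standard_normal(d), rng.standard_normal((n_experts, d)), experts)
print(y.shape)
```

With 16 experts and k=2, only 1/8 of the expert weights are read per token, which is what makes flash-offloading demos like this possible at all.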
https://scienceleadership.org/thumbnail/34729/1920x1920
Just in case someone still didn't realize - we do live in Idiocracy
> OpenAI's original logo was a simple, text-based mark. Then came the redesign: a perfect circle with a subtle gradient and central void.
The redesign is neither a circle nor does it have a gradient.
> Was that even written by a human?
Was anyone supposed to think a post about comparing logos to buttholes was meant to be serious? Either way, the joke doesn’t work if what you’re describing makes no sense (circle and gradient) and are stretching the definition to unrecognizability.
> > Was that even written by a human?
So, probably not:
> I build AI agent systems and help companies implement AI that works in practice
> OpenClaw maintainer
> I make YouTube videos about AI workflows, agent architecture, and practical automation
Quantizing is also a cheat code that makes the numbers lie. Next up, someone is going to claim to be running a large model when they're really running a 1-bit quantization of it.
There's no misleading here, they show every detail from model to quantization to that atrocious time to first token. Stuff like this feels more like code golf than anyone claiming the mainstream phone user is going to even download 100GB of model weights.
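For scale, the memory footprint at different quantization levels is simple arithmetic (a sketch; 400B is used purely as an example size):

```python
def model_gb(n_params_billion, bits):
    """Approximate weight storage in GB for a model of the given size."""
    # 1e9 params * (bits / 8) bytes each, expressed directly in GB
    return n_params_billion * bits / 8

for bits in (16, 8, 4, 3):
    print(f"400B @ {bits}-bit: ~{model_gb(400, bits):,.0f} GB")
```

Even at 3-bit, a 400B model is ~150GB of weights, which is why the time to first token is so brutal when streaming from flash.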
https://www.reddit.com/r/EmulationOnAndroid/comments/1m269k0...
Something of this sort should keep the device moisturised:
https://www.thehydrobros.com/products/automatic-water-spraye...
0.2ml/s at its lowest setting looks like the ballpark of what's required to maintain temperature.
https://onexplayerstore.com/products/onexplayer-super-x?vari...
https://www.notebookcheck.net/Xiaomi-launches-new-mobile-wat...
Apple fans never cease to amaze me.
EDIT: found this in the replies: https://github.com/Anemll/flash-moe/tree/iOS-App
With 64GB of RAM you should look into Qwen3.5-27B or Qwen3.5-35B-A3B. From my experience, I suggest going no lower than Q5 quantization. Q4 works on short responses but gets weird in longer conversations.
There are dynamic quants such as Unsloth which quantize only certain layers to Q4. Some layers are more sensitive to quantization than others. Smaller models are more sensitive to quantization than the larger ones. There are also different quantization algorithms, with different levels of degradation. So I think it's somewhat wrong to put "Q4" under one umbrella. It all depends.
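To see why bit width matters so much, here's a crude round-to-nearest quantizer applied to a random weight matrix (real schemes like GPTQ or AWQ are much smarter; this only illustrates how reconstruction error grows as bits shrink):

```python
import numpy as np

def fake_quant(w, bits):
    """Symmetric round-to-nearest quantization, the crudest possible scheme."""
    scale = np.abs(w).max() / (2 ** (bits - 1) - 1)
    return np.round(w / scale) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((512, 512)).astype(np.float32)
for bits in (8, 5, 4, 3):
    rmse = float(np.sqrt(np.mean((w - fake_quant(w, bits)) ** 2)))
    print(f"Q{bits}: rmse {rmse:.4f}")
```

Each bit you drop roughly doubles the quantization step, so error climbs fast below Q5 - which lines up with the "Q4 gets weird" experience above.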
This approach also makes less sense for discrete GPUs where VRAM is quite fast but scarce, and the GPU's PCIe link is a key bottleneck. I suppose it starts to make sense again once you're running the expert layers with CPU+RAM.
Also I wouldn’t trust 3-bit quantization for anything real. I run a 5-bit Qwen3.5-35B-A3B MoE model on my Studio for coding tasks, and even the 4-bit quant was more flaky (hallucinations, and sometimes it would think about running tool calls and just not run them, lol).
If you decide to give it a go, make sure to use the MLX version over the GGUF one! You’ll get a bit more speed out of it.
Apple has always seen RAM as an economic advantage for their platform: Make the development effort to ensure that the OS and apps work well with minimal memory and save billions every year in hardware costs. In 2026, iPhones still come with 8GB of RAM; Pro/Max models come with 12GB.
The problem is that AI workloads (ML/LLM training and inference) are areas where you can't get around the need for copious amounts of fast working memory. (Thus the critical shortage of RAM at the moment, as AI data centers consume as many memory chips as possible.)
Unless there's something I don't know (which is more than possible) Apple can't code their way around this problem, nor create specialized SoCs with ML cores that obviate the need for lots and lots of RAM.
So, it's going to be interesting whether they accept this reality and we start seeing future iPhones with 16GB, 32GB or more as standard in order to make AI performant. And whether they give up on adding AI to the billions of iPhones with minimal RAM already out there.
As a side note, 8GB of RAM hasn't been enough for a decade. It prevents basic tasks like keeping web tabs live in the background. My pet peeve is having just a few websites open, and having the page refresh when swapping between them because of aggressive memory management.
To me, Apple's obvious strength is pushing AI to the edge as much as possible. While other companies are investing in massive data centers which will have millions of chips that will be outdated within the next couple years, Apple will be able to incrementally improve their ML/AI features by running on the latest and greatest chips every year. Apple has a huge advantage in that they can design their chips with a mega high speed bus, which is just as important as the quantity of RAM.
But all that depends on Apple's willingness to accept that RAM isn't an area they can skimp on any more, and I'm not sure they will.
Sorry for the brain dump. I'd love to be educated on this in case I'm totally off base.
If you're loading gigabytes of model weights into memory, you're also pushing gigabytes through the compute for inference. No matter how you slice it, no matter how dense you make the chips, that's going to cost a lot of energy. It's too energy intensive, simple as that.
"On device" inference (for large LLMs, I mean) is a total red herring. You basically never want to do it unless you have unique privacy considerations and you've got a power cable attached to the wall. For a phone, maybe you would want a very small model (3B or so) for Siri-like capabilities.
On a phone, each query/response is going to cost you 0.5% of your battery. That just isn't tenable for the way these models are being used.
Try this for yourself. Load a 7B model on your laptop and talk to it for 30 minutes. These things suck energy like a vacuum, even the shitty models. A network round trip gets you hundreds of tokens from a SOTA model and costs about 1 joule. By contrast, a single forward pass (one token) of a shitty 7B model costs 1 joule. It's just not tenable.
The gap will always be there. If the silicon gets efficient enough to compute a question/response on the phone in 1 joule, the datacenter will be able to do it with a far smarter model in 0.1 joule. And if the silicon gets that efficient, everything else on the phone will get more efficient too, so the battery will get smaller and lighter, making 1 joule more 'expensive' relative to the battery's capacity. It will never make sense no matter how good the silicon gets.
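The ~0.5%-per-reply battery claim upthread is easy to reproduce as back-of-envelope arithmetic. Battery size and reply length below are assumptions; the ~1 J/token figure is the one quoted above:

```python
# All numbers are assumptions except the ~1 J/token figure quoted above.
battery_wh = 13            # typical flagship phone battery
battery_j = battery_wh * 3600
j_per_token = 1.0          # local 7B-class model, per the comment
tokens_per_reply = 250     # a modest reply length

reply_j = tokens_per_reply * j_per_token
pct = 100 * reply_j / battery_j
print(f"one local reply: {reply_j:.0f} J, ~{pct:.2f}% of the battery")
```

That's roughly half a percent of the battery per reply, while a cloud round trip at ~1 J total is orders of magnitude cheaper in on-device energy.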
We have GPT-4 level performance in 22b models today. Only a tiny tiny minority actually use those, because opus is that much better. When it comes to energy efficiency the bar gets higher everywhere in inference and training.
That said, power consumption is one of the reasons I think pushing this stuff to the edge is the only real path for AI in terms of a business model. It basically spreads the load and passes the cost of power to the end user, rather than trying to figure out how to pay for it at the data center level.
Apple recently stated on an earnings call they signed contracts with RAM vendors before prices got out of control, so they should be good for a while. Nvidia also uses TSMC for their chips, which may affect A series and M series chip production.
Yes, TSMC has a plant in Arizona but my understanding is they can't make the cutting edge chips there; at least not yet.
Pros will want higher intelligence or throughput. Less demanding or knowledgeable customers will get price-funneled to what Apple thinks is the market premium for their use case.
It'll probably be a little harder to keep their developers RAM disciplined (if that's even still true) for typical concerns. But model swap will be a big deal. The same exit vs voice issues will exist for apple customers but the margin logic seems to remain.
Why do you say they can't do this?
It's not like Apple's GPU designs are world-class anyways, they're basically neck-and-neck with AMD for raster efficiency. Except unlike AMD, Apple has all the resources in the world to compete with Nvidia and simply chooses to sit on their ass.
Apple technically hasn't supported the professional GPGPU workflow for over a decade. macOS doesn't support CUDA anymore, Apple abandoned OpenCL on all of their platforms and Metal is a bare-minimum effort equivalent to what Windows, Android and Linux get for free. Dedicated matmul hardware is what Apple should have added to the M1 instead of wasting silicon on sluggish, rinky-dink NPUs. The M5 is a day late and a dollar short.
According to reports, even Apple can't quite justify using Apple Silicon for bulk compute: https://9to5mac.com/2026/03/02/some-apple-ai-servers-are-rep...
If the alternative is paying a subscription and/or being fed ads, people will try the local private ones first.
I expect OpenAI, Anthropic, and other companies will attempt to do the same, but the OS manufacturers will have a step up.
Having a complete computer in my pocket was very new to me, coming from Nokia where I struggled (as a teenager) to get any software running besides some JS in a browser. I still don't know where they hid whatever you needed to make apps for this device. Android's power, for me, was being able to hack on it (in the HN sense of the word)
Instead, take advantage of Termux's power, namely the fact that you can install things like OpenClaw or Gemini CLI. Google AI Plus or Pro plans are actually really good value, considering they bundle it with storage.
https://www.mobile-hacker.com/2025/07/09/how-to-install-gemi...
There is also Termux:GUI with bindings for several languages, which you can use to vibecode your own GUI app that basically serves as an interface to an agent, and a Termux API which lets you interface with the phone, including USB devices.
Furthermore, Termux has the cloudflared package available, which lets you use Cloudflare's free SSH tunnels (as long as you have a domain name).
All put together, you can do some pretty cool things.
With all the money you will save on subscription fees you should be able to afford treatment for your psychosis!
Don't get me wrong, it's an awesome achievement, but 0.6 tokens/s at presumably fairly heavy compute (and battery drain), on a mobile device? There aren't too many use cases for that :)
This exists[0], but the chip in question is physically large and won't fit on a phone.
Getting bigger (foldable) phones, without losing battery life, and running useable models in the same form-factor is a pretty big ask.
Moore's law will shrink it to 8mm soon. I think it'll be like a microSD card you plug in.
Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.
> Or we develop a new silicon process that can mimic synaptic weights in biology. Synapses have plasticity.
It's amazing to me that people consider this to be more realistic than FAANG collaborating on a CUDA-killer. I guess Nvidia really does deserve their valuation.
The $$$ would probably make my eyes bleed tho.
Realistically you need 300+ GB/s of fast access memory on the accelerator, with enough capacity to fully hold at least a 4-bit quant. That's at least 380GB of memory. You can gimmick a demo like this with an SSD, but the SSD is just not fast enough to meet the minimum specs for anything more than showing off a neat trick on Twitter.
The only hope for a handheld execution of a practical, and capable AI model is both an algorithmic breakthrough that does way more with less, and custom silicon designed for running that type of model. The transformer architecture is neat, but it's just not up for that task, and I doubt anyone's really going to want to build silicon for it.
The latest M5 MacBook Pros start at 307 GB/s of memory bandwidth, the 32-core GPU M5 Max gets 460 GB/s, and the 40-core M5 Max gets 614 GB/s. The CPU, GPU, and Neural Engine all share the memory.
The A19/A19 Pro in the current iPhone 17 line is essentially the same processor (minus the laptop and desktop features that aren’t needed for a phone), so it would seem we're not that far off from being able to run sophisticated AI models on a phone.
As such I can't agree with "The only hope for a handheld execution of a practical, and capable AI model is both an algorithmic breakthrough" - we are much closer than 15-20 years away from getting these on a phone.
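A rough sanity check: decode speed is typically memory-bandwidth-bound, so tokens/s is at best bandwidth divided by bytes read per token. The sketch below assumes ~17B active parameters at 4-bit, per the figures upthread (all numbers illustrative):

```python
def decode_tps(bandwidth_gb_s, active_params_billion, bits):
    """Upper bound on decode tokens/s when every active weight is
    streamed from memory once per generated token."""
    bytes_per_token = active_params_billion * 1e9 * bits / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

for name, bw in [("phone LPDDR5, ~60 GB/s", 60),
                 ("M5 base, 307 GB/s", 307),
                 ("40-core M5 Max, 614 GB/s", 614)]:
    print(f"{name}: <= {decode_tps(bw, 17, 4):.0f} tok/s")
```

So bandwidth alone caps a phone at single-digit tokens/s for a model this size, even before flash swapping enters the picture.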
This is a toy.
We need to build open infrastructure in the cloud capable of hosting a robust ecosystem of open weights.
And then we need to build very large scale open weights.
That's the only way we don't get owned by the hyperscalers.
At the edge isn't going to happen in a meaningful way to save us.
The fact that it's running on a phone now just sets the goalpost and gets everyone excited about it: add more RAM and GPU to the next iPhone and it's not a toy anymore. Coincidentally, phone companies also have thousands of engineers sitting around wondering what to do in their next release to convince consumers to buy ...
We're not going to get more RAM and GPU in consumer devices.
All of the supply is going into data center build outs. As the hyper scaler gamble on the future continues, we get left with weaker (or more expensive) devices - not stronger ones.
The market makers make more money if we're left to thin clients. They're also the ones who control supply and the shapes of devices.
While there are problems that can be solved with 0.6t/sec, particularly offline, at the edge, in the field applications, these are currently vastly outnumbered by other applications.
There's just no competing. Local sucks.
absolutely, however this doesn’t mean we should abandon local. i can’t remember who, but someone in the ai nuts and bolts arena said “smaller local models is where the exciting stuff is happening right now. it’s the area real fast progression is happening.” and it seems to be true. new big models aren’t making near the leaps smaller models are.
it’s so important we keep moving forward on running locally for the same reason it was important for us to use open standards when building the internet. if we hadn’t we’d all be connected through aol with 10 hours/month allowed internet usage and termed in through a sun workstation renting cpu cycles from some mainframe company at like “you’ve got 10,000 cpu cycles left on your monthly plan, please deposit $500 for 5,000 more.”
while all of this is before my time, i’ve heard and read so many horror stories about how people could only connect through dumb terminals to “you wouldn’t believe it, computers then were the size of buildings” 1000 miles away and had to sign up for workload timeslots. make no mistake, this is the future these companies want, they want us to rent everything and own nothing.
I don't know why we can't just get over the local compute thing and instead build open infra and models in the cloud. That's literally the only way we'll be able to keep pace with hyperscalers.
Local is not going to benefit 99% of use cases. It's a silly toy.
If we build open infra for cloud-based provisioning and inference, we could build a future we still have some ownership in. We'd be able to fine tune large models for lots of purposes. We wouldn't be locked in to major vendors.
use the experience we gain from both to bolster the other.
a future where we are unable to locally run is kind of troubling. as is a future with no open cloud. we need both to stop some of the horrors the hyperscalers will happily inflict.
0.6 t/s, wait 30 seconds to see what these billions of calculations get us:
"That is a profound observation, and you are absolutely right ..."
Which makes it even funnier.
It makes me a little sad that Douglas Adams didn't live to see it.
https://gwern.net/doc/fiction/science-fiction/1953-dahl-theg...
The joke revolves around the incongruity of "42" being precisely correct.
(One) source: https://www.reddit.com/r/Fedora/comments/1mjudsm/comment/n7d...
To quote the message from the universe’s creators to its creation: “We apologise for the inconvenience”. Does seem to sum up Douglas Adams’s views on the absurdity of life.
This is 100% correct!
"You are absolutely right to be confused"
That was the closest AI has been to calling me "dumb meatbag".
You're absolutely right. Now, LLMs are too slow to be useful on handheld devices, and the future of LLMs is brighter than ever.
LLMs can be useful, but quite often the responses are about as painful as LinkedIn posts. Will they get better? Maybe. Will they get worse? Maybe.
I find it hard to understand your uncertainty; how could they not keep getting even better when we've been seeing qualitative improvements literally every second week for months on end? These improvements being eminently public and applied across multiple relevant dimensions: raw inference speed (https://github.com/ggml-org/llama.cpp/releases), external-facing capabilities (https://github.com/open-webui/open-webui/releases) and performance against established benchmarks (https://unsloth.ai/docs/models/qwen3.5/gguf-benchmarks)
Emphasis on slowly.
So this post is like saying that yes an iPhone is Turing complete. Or at least not locked down so far that you're unable to do it.
With hardware and model improvements, the future is bright.
Local LLMs are going to make people sit on their phones instead of talking to real people.
Practical LLMs on mobile devices are at least a few years away.
I understand this is for a demo, but do we really need a 400B model on a phone? A 10B model would do fine, right? What do we miss with a pared-down one?
Putting the GPU and CPU together and having them both access the same physical memory is standard for phone design.
Mobile phones don't have separate GPUs and separate VRAM like some desktops.
This isn't a new thing and it's not unique to Apple
> I understand this is for a demo, but do we really need a 400B model on a phone? A 10B model would do fine, right? What do we miss with a pared-down one?
There is already a smaller model in this series that fits nicely into the iPhone (with some quantization): Qwen3.5 9B.
The smaller the model, the less accurate and capable it is. That's the tradeoff.
> Mobile phones don't have separate GPUs and separate VRAM like some desktops.
That's true. The difference is the iPhone has wider memory buses and uses faster LPDDR5 memory. Apple places the RAM dies directly on the same package as the SoC (PoP — Package on Package), minimizing latency. Some Android phones have started to do this, too.
iOS is tuned to this architecture which wouldn't be the case across many different Android hardware configurations.
Package-on-Package has been used in mobile SoCs for a long time. This wasn't an Apple invention. It's not new, either. It's been this way for 10+ years. Even cheap Raspberry Pi models have used package-on-package memory.
The memory bandwidth of flagship iPhone models is similar to the memory bandwidth of flagship Android phones.
There's nothing uniquely Apple in this. This is just how mobile SoCs have been designed for a long time.
More correct to say that the memory bandwidth of ALL iPhone models is similar to the memory bandwidth of flagship Android models. The A18 and A18 pro do not differ in memory bandwidth.
Tl;dr a lot, model is much worse
(Source: maintaining llama.cpp / cloud based llm provider app for 2-3 years now)
> That is a profound observation, and you are absolutely right
Twenty seconds and a hot phone for that.
In the end it took almost four minutes to generate under 150 tokens of nothing.
Impressive that they got it to run, but that’s about the only thing.
You do have a lot of "MLEs" and "Data Scientists" who only know basic PyTorch and SKLearn, but that kind of fat is being trimmed industry wide now.
Domain experience remains gold, especially in a market like today's.
They didn't make special purpose hardware to run a model. They crafted a large model so that it could run on consumer hardware (a phone).
We haven't had phones running laptop-grade CPUs/GPUs for that long, and that is a very real hardware feat. Likewise, nobody would've said running a 400B LLM on a low-end laptop was feasible, and that is very much a software triumph.
Agree to disagree, we've had laptop-grade smartphone hardware for longer than we've had LLMs.
We've had solid CPUs for a while, but GPUs have lagged behind (and they're the ones that matter for this particular application). iPhones still lead by a comfortable margin on this front, but have historically been pretty limited on the IO front (only supported USB2 speeds until recently).
And even if you raise the requirements, we still have to contend with cheap CUDA-capable GPUs like the one in the ($300!!!) Nintendo Switch, or the Jetson SOCs. The mobile market has had tons of high-speed/low-power options for a very long time now.
It’s been a lot of years, but all I can hear after reading that is … I’m making a note here, huge success
It's just so slow that nobody pursued it seriously. It's fun to see these tricks implemented, but even on this 2025 top spec iPhone Pro the output is 100X slower than output from hosted services.
It is objectively slow at around 100X slower than what most people consider usable.
The quality is also degraded severely to get that speed.
> but the point of this is that you can run cheap inference in bulk on very low-end hardware.
You always could, if you didn't care about speed or efficiency.
iPhone 17 Pro outperforms AMD’s Ryzen 9 9950X per https://www.igorslab.de/en/iphone-17-pro-a19-pro-chip-uebert...
Remember when people were arguing about whether to use mmap? What a ridiculous argument.
At some point someone will figure out how to tile the weights and the memory requirements will drop again.
Experts are predicted by layer and the individual layer reads are quite small, so this is not really feasible. There's just not enough information to guide a prefetch.
That said, it'd be a fun quote and I've jokingly said it as well, as I think of it more as part of 'popular' culture lol
https://duckdb.org/2024/12/06/duckdb-tpch-sf100-on-mobile#a-...
"The phone a few minutes after finishing the benchmark. It no longer booted because the battery was too cold!"
Your time-average power budget for things that run on phones is about 0.5W (batteries are about 10Wh and should last at least a day). That's about three orders of magnitude lower than the GPUs running in datacenters.
Even if battery technology improves you can't have a phone running hot, so there are strong physical limits on the total power budget.
More or less the same applies to laptops, although there you get maybe an additional order of magnitude.
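The 0.5W figure is just the battery capacity spread over a day. The datacenter comparison below assumes a ~700W accelerator, which is an assumption in the right ballpark for current hardware:

```python
battery_wh = 10
avg_w = battery_wh / 24            # battery capacity spread over a day
gpu_w = 700                        # assumed datacenter accelerator draw
ratio = gpu_w / avg_w
print(f"phone budget: {avg_w:.2f} W, ~{ratio:.0f}x less than one GPU")
```

That ratio lands in the ~1000x range, hence "three orders of magnitude."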
This is extremely inefficient though. For efficiency you need to batch many requests (like 32+, probably more like 128+), and when you do that with MoE you lose the advantage of only having to read a subset of the model during a single forward pass, so the trick does not work.
But this did remind me that with dense models you might be able to use disk to achieve high throughput at high latency on GPUs that don't have a lot of VRAM.
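The batching point can be made concrete with a small simulation: under top-k routing (uniform routing is assumed here, which real routers aren't, but the trend holds), the fraction of experts a whole batch touches saturates quickly:

```python
import random

def frac_experts_touched(n_experts, k, batch, trials=500):
    """Average fraction of experts hit by at least one token in a batch,
    assuming each token independently picks k distinct experts uniformly."""
    total = 0.0
    for _ in range(trials):
        hit = set()
        for _ in range(batch):
            hit.update(random.sample(range(n_experts), k))
        total += len(hit) / n_experts
    return total / trials

random.seed(0)
for b in (1, 8, 32, 128):
    print(f"batch {b:>3}: ~{frac_experts_touched(64, 4, b):.0%} of experts read")
```

With 64 experts and top-4 routing, a batch of 32 already touches the vast majority of experts, so the "read only a sliver of the model" advantage evaporates at serving-scale batch sizes.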
It’s only paying Google $1 billion a year for access to Gemini for Siri
Put another way, there is no demonstrated first mover advantage in LLM-based AI so far and all of the companies involved are money furnaces.
Apple’s bet is intelligent; the “presumed winners” are staking our economic stability on a miracle, like a shaking gambling addict at a horse race who just withdrew his rent money.
The financial math on actually buying over $40k worth of Mac for 1 to 2 YouTube videos probably doesn't work that well, even for the really big players.
Pretty sure the M5 Ultra will be out after WWDC, so my M3 Ultra is (while still completely capable of fulfilling my needs) looking a bit long in the tooth. If I can get a good price for it now, I might be able to offset most of the M5 post WWDC...
If they continue to increase.
That blows up the whole “industrial complex” being developed around massive data centers, proprietary models, and everything that goes with that. Complete implosion.
Apple has sat on the sidelines for much of this as it seems clear they know the end game is everyone just does this stuff locally on their phone or computer and then it’s game over for everything going on now.
I'd put my work into that. Not the only option just an example.
Every great project takes time to build. It's possible.
Weights, datasets, code, multiple checkpoints...
I like their FlexOlmo concept.
We’re seeing a massive slowing in the value of all that additional training. Folks don’t like to talk about that, but absent a completely new breakthrough, the current math of LLMs has largely run its course.
We simply don’t need massive training forever and ever. We’re getting to the point that “good enough” models will solve most use cases. The demonstrated business value is also still broadly missing for AI on the level required to keep funding all this training for much longer.
I'm adapting all of this to Rust+WGPU with compute shaders if you want to follow along.
See this repo: https://github.com/tmzt/shady-thinker
Goal is Qwen3.5 27b on a Pixel 10 Pro running GrapheneOS.
I don't think the argument is that it isn't true; it's that the gains from those massive training runs are diminishing. Eventually, it won't be worth it to do the run for each new idea; you'll have to bundle a bunch together to get any noticeable change.
I think local will always have a place, but the infrastructure is going to be used in my humble opinion.
Maybe, for current use cases. I'd argue that anyone who thinks they can do everything a 10kW server can do on their 10W device just isn't being creative enough :)