(In terms of intelligence, they tend to score similarly to a dense model that's as big as the geometric mean of the full model size and the active parameters, i.e. for GPT-OSS-20B, it's roughly as smart as a sqrt(20b*3.6b) ≈ 8.5b dense model, but produces tokens 2x faster.)
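As a sanity check, the geometric-mean rule of thumb above can be computed directly (sizes in billions; this is the commenter's heuristic, not an official benchmark):

```python
import math

def effective_dense_size(total_b: float, active_b: float) -> float:
    """Rule-of-thumb effective dense-model size for a MoE:
    the geometric mean of total and active parameter counts."""
    return math.sqrt(total_b * active_b)

# GPT-OSS-20B: 20B total, 3.6B active
print(round(effective_dense_size(20, 3.6), 1))  # → 8.5
```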
For MoE models, it should be using the active parameters in memory bandwidth computation, not the total parameters.
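A minimal sketch of why active parameters are the right quantity: at batch size 1, decoding is typically memory-bandwidth bound, and each token must stream the active weights once. The bytes-per-parameter value is an assumption approximating a 4-bit quant, and KV-cache traffic and overhead are ignored:

```python
def decode_tokens_per_sec(bandwidth_gb_s: float,
                          active_params_b: float,
                          bytes_per_param: float = 0.5) -> float:
    """Rough upper bound on token generation speed for a
    bandwidth-bound decoder. bytes_per_param=0.5 approximates
    a 4-bit quantization; real speeds will be lower."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return bandwidth_gb_s * 1e9 / bytes_per_token

# e.g. a ~400 GB/s machine on a 3.6B-active MoE at ~4-bit:
print(round(decode_tokens_per_sec(400, 3.6)))  # → 222
```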
> A Mixture of Experts model splits its parameters into groups called "experts." On each token, only a few experts are active — for example, Mixtral 8x7B has 46.7B total parameters but only activates ~12.9B per token. This means you get the quality of a larger model with the speed of a smaller one. The tradeoff: the full model still needs to fit in memory, even though only part of it runs at inference time.
> A dense model activates all its parameters for every token — what you see is what you get. A MoE model has more total parameters but only uses a subset per token. Dense models are simpler and more predictable in terms of memory/speed. MoE models can punch above their weight in quality but need more VRAM than their active parameter count suggests.
> GPT-OSS-20B has 3.6B active parameters, so it should perform similarly to a 3-4B dense model, while requiring enough VRAM to fit the whole 20B model.
First, the token generation speed is going to be comparable, but not the prefill speed (context processing is going to be much slower on a big MoE than on a small dense model).
Second, without speculative decoding, it is correct to say that a small dense model and a bigger MoE with the same number of active parameters are going to be roughly as fast. But if you use a small dense model you will see token generation performance improvements with speculative decoding (up to 3x the speed), whereas you probably won't gain much from speculative decoding on a MoE model (because two consecutive tokens won't trigger the same "experts", so you'd need to load more weights into the compute units, using more bandwidth).
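The speculative-decoding gain the comment describes can be made concrete with the standard expected-acceptance formula from the speculative decoding literature: a draft model proposes gamma tokens, each accepted with probability alpha, and the expected tokens emitted per target-model pass follows a truncated geometric sum. The alpha values below are illustrative, not measured:

```python
def expected_tokens_per_target_pass(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per target-model forward pass with
    speculative decoding, assuming independent per-token acceptance
    probability alpha and draft length gamma."""
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)

# High acceptance rate vs low acceptance rate, draft length 4:
print(round(expected_tokens_per_target_pass(0.8, 4), 2))  # → 3.36
print(round(expected_tokens_per_target_pass(0.3, 4), 2))  # → 1.43
```

The point: the speedup depends heavily on how much of the verification pass comes "for free", which is exactly what extra expert-weight traffic on a MoE erodes.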
So by doing this, the calculator is telling you that you should be running entirely dense models, while sparse MoE models that may be both faster and better-performing are not recommended.
But since this is a tech forum, I assumed some people would be interested in the correction of the details that were wrong.
"What is the highest-quality model that I can run on my hardware, with tok/s greater than <x>, and context limit greater than <y>"
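That query could be expressed as a simple filter-then-argmax over a model catalog. All the entries and fields below are hypothetical placeholders; "quality" stands in for whatever proxy score you trust:

```python
# Hypothetical catalog; 'quality' is whatever benchmark proxy you trust.
models = [
    {"name": "A", "quality": 62, "tok_s": 85, "context": 32_768},
    {"name": "B", "quality": 71, "tok_s": 40, "context": 131_072},
    {"name": "C", "quality": 68, "tok_s": 55, "context": 131_072},
]

def best_model(models, min_tok_s, min_context):
    """Highest-quality model satisfying speed and context constraints."""
    candidates = [m for m in models
                  if m["tok_s"] >= min_tok_s and m["context"] >= min_context]
    return max(candidates, key=lambda m: m["quality"], default=None)

print(best_model(models, min_tok_s=50, min_context=100_000)["name"])  # → C
```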
(My personal approach has just devolved into guess-and-check, which is time consuming.) When using TFA/llmfit, I am immediately skeptical because I already know that Qwen 3.5 27B Q6 @ 100k context works great on my machine, but it's buried behind relatively obsolete suggestions like the Qwen 2.5 series.
I'm assuming this is because the tok/s is much higher, but I don't really get much marginal utility out of tok/s speeds beyond ~50 t/s, and there's no way to sort results by quality.
Just to be clear: it may sound like a snarky comment, but I'm genuinely curious how you or others see it. I mean, there are some long-running batch tasks where, ignoring electricity, it's kind of free, but usually local generation is slower (and worse quality), and we all kind of want some stuff to get done.
Or is it not about the cost at all, just about not pushing your data into the clouds.
Nevertheless, I spend a lot of time with local models because of:
1. Pure engineering/academic curiosity. It's a blast to experiment with low-level settings/finetunes/lora's/etc. (I have a Cog Sci/ML/software eng background.)
2. I prefer not to share my data with 3rd party services, and it's also nice to not have to worry too much about accidentally pasting sensitive data into prompts (like personal health notes), or if I'm wasting $ with silly experiments, or if I'm accidentally poisoning some stateful cross-session 'memories' linked to an account.
3. It's nice to be able to solve simple tasks without having to reason about any external 'side effects' outside my machine.
"what is the best open weight model for high-quality coding that fits in 8GB VRAM and 32GB system RAM with t/s >= 30 and context >= 32768" -> Qwen2.5-Coder-7B-Instruct
"what is the best open weight model for research w/web search that fits in 24GB VRAM and 32GB system RAM with t/s >= 60 and context >= 400k" -> Qwen3-30B-A3B-Instruct-2507
"what is the best open weight embedding model for RAG on a collection of 100,000 documents that fits in 40GB VRAM and 128GB system RAM with t/s >= 50 and context >= 200k" -> Qwen3-Embedding-8B
Specific models & sizes for specific use cases on specific hardware at specific speeds. Well, granted, my project is trying to do this in a way that works across multiple devices and supports multiple models to find the best "quality" and the best allocation. And that puts an exponential search space over the project.
But “quality” is the hard part. In this case I’m just choosing the largest quants.
I wouldn't expect a perfect single measurement of "quality" to exist, but it seems like it could be approximated enough to at least be directionally useful. (e.g. comparing subsequent releases of the same model family)
- If you already HAVE a computer and are looking for models: LLMFit
- If you are looking to BUY a computer/hardware, and want to compare/contrast for local LLM usage: This
You cannot exactly run LLMFit on hardware you don't have.
You can check out here how it does that: https://github.com/AlexsJones/llmfit/blob/main/llmfit-core/s...
To detect NVIDIA GPUs, for example: https://github.com/AlexsJones/llmfit/blob/main/llmfit-core/s...
In this case it just runs the command "nvidia-smi".
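A minimal sketch of the same idea: shell out to `nvidia-smi` and treat a missing binary as "no NVIDIA GPU/driver". The `--query-gpu` and `--format` flags are real nvidia-smi options; everything else is an illustrative wrapper, not llmfit's actual code:

```python
import shutil
import subprocess

def detect_nvidia_gpus():
    """Return a list of 'name, memory' strings from nvidia-smi,
    or None when the tool isn't installed or fails."""
    if shutil.which("nvidia-smi") is None:
        return None
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,memory.total",
         "--format=csv,noheader"],
        capture_output=True, text=True)
    if result.returncode != 0:
        return None
    return [line.strip() for line in result.stdout.splitlines()
            if line.strip()]

print(detect_nvidia_gpus())
```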
Note: llmfit is not web-based.
I too was a little surprised by this. My browser (Vivaldi) makes a big deal about how privacy-conscious they are, but apparently browser fingerprinting is not on their radar.
It looks like I can run more local LLMs than I thought, I'll have to give some of those a try. I have decent memory (96GB) but my M2 Max MBP is a few years old now and I figured it would be getting inadequate for the latest models. But llmfit thinks it's a really good fit for the vast majority of them. Interesting!
When I visit the site with an Apple M1 Max with 32GB RAM, the first model that's listed is Llama 3.1 8B, which is listed as needing 4.1GB RAM.
But the weights for Llama 3.1 8B are over 16GB. You can see that here in the official HF repo: https://huggingface.co/meta-llama/Llama-3.1-8B/tree/main
The model this site calls 'Llama 3.1 8B' is actually a 4-bit quantized version (Q4_K_M) available on ollama.com/library: https://ollama.com/library/llama3.1:8b
If you're going to recommend a model to someone based on their hardware, you have to recommend not only a specific model, but a specific version of that model (either the original, or some specific quantized version).
This matters because different quantized versions of the model will have different RAM requirements and different performance characteristics.
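The RAM difference is easy to estimate from bits per weight. The bits-per-weight numbers below are approximate averages for common GGUF quants (they vary slightly by model), and the overhead factor is a rough guess covering metadata and runtime buffers, excluding KV cache:

```python
def quant_size_gb(params_b: float, bits_per_weight: float,
                  overhead: float = 1.1) -> float:
    """Approximate on-disk/in-RAM size of a quantized model.
    bits_per_weight: ~4.8 for Q4_K_M, ~6.6 for Q6_K, ~8.5 for Q8_0
    (rough GGUF averages). Excludes KV cache."""
    return params_b * bits_per_weight / 8 * overhead

# Llama 3.1 8B: FP16 original vs a Q4_K_M quant
print(round(quant_size_gb(8, 16), 1))   # → 17.6
print(round(quant_size_gb(8, 4.8), 1))  # → 5.3
```

That gap (17.6 GB vs ~5 GB) is exactly why "Llama 3.1 8B" is ambiguous as a recommendation.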
Another thing I don't like is that the model names are sometimes misleading. For example, there's a model with the name 'DeepSeek R1 1.5B'. There's only one architecture for DeepSeek R1, and it has 671B parameters. The model they call 'DeepSeek R1 1.5B' does not use that architecture. It's a qwen2 1.5B model that's been finetuned on DeepSeek R1's outputs. (And it's a Q4_K_M quantized version.)
It says I have an Arc 750 with 2 GB of shared RAM, because that's the GPU that renders my browser...but I actually have an RTX1000 Ada with 6 GB of GDDR6. It's kind of like an RTX 4050 (not listed in the dropdowns) with lower thermal limits. I also have 64 GB of LPDDR5 main memory.
It works - Qwen3 Coder Next, Devstral Small, Qwen3.5 4B, and others can run locally on my laptop in near real-time. They're not quite as good as the latest models, and I've tried some bigger ones (up to 24GB, it produces tokens about half as fast as I can type...which is disappointingly slow) that are slower but smarter.
But I don't run out of tokens.
A couple suggestions:
1. I have an M3 Ultra with 256GB of memory, but the options list only goes up to 192GB. The M3 Ultra supports up to 512GB.
2. It'd be great if I could flip this around and choose a model, and then see the performance for all the different processors. Would help with buying decisions!
I'm sorry, but spending this kind of money when you could have just built yourself a dual-3090 workstation that would have been better for pretty much everything, including local models, is just plain stupid.
Hell, even one 3090 can now run Gemma 3 27b qat very fast.
At the moment I'm exploring:
- nightmedia/Qwen3.5-27B-Claude-4.6-Opus-Reasoning-Distilled-qx64-hi-mlx
- BeastCode/Qwen3.5-27B-Claude-4.6-Opus-Distilled-MLX-4bit
- mlx-community/Qwen3-Coder-Next-4bit
Nano Banana Pro for anything image and video related.
Grok Imagine for pretty decent porn generation.
Super impressive comparisons, and it correlates with my perception, having three separate generations of GPU (from your list pulldown). Thanks for including the "old AMD" Polaris chipsets, which are actually still much faster than lower-spec Apple silicon. I have Llama 3.1 (via Ollama) on a VEGA64 and it really is twice as fast as an M2 Pro...
----
For anybody who thinks installing a local LLM is complicated: it's not (so long as you have more than one computer; don't tinker on your primary workhorse). I am a blue-collar electrician (admittedly geeky); it was no more difficult than installing Linux. I used an online LLM to help me install both =D
Love the idea though!
EDIT: Okay the whole thing is nonsense and just some rough guesswork or asking an LLM to estimate the values. You should have real data (I'm sure people here can help) and put ESTIMATE next to any of the combinations you are guessing.
Preliminary testing did not come to that conclusion.
> Apple’s New M5 Max Changes the Local AI Story
For interactive chat and simple Q&A, local models are great — latency is predictable, privacy is absolute, and the quality gap with frontier models is narrowing for straightforward tasks. A quantized Llama running on an M-series Mac is genuinely useful.
But for agentic workflows — where the model needs to plan multi-step tasks, use tools, recover from errors, and maintain coherence across long interactions — the gap between local and frontier models is still enormous. I have seen local models confidently execute plans that make no sense, fail to recover from tool errors, and lose track of what they are doing after a few steps. Frontier models do this too sometimes, but at a much lower rate.
The practical middle ground I see working well: local models for fast, cheap tasks like commit message generation, code completion, and simple classification. Frontier API models for anything requiring planning, reasoning over large contexts, or reliability. The economics favor this split — running a local model costs electricity and GPU memory, while API calls cost per token. For high-volume low-complexity tasks, local wins. For low-volume high-complexity tasks, APIs win.
- Which models in the list are the best for my selected task? (If you don't track these things regularly, the list is a little overwhelming.) Sorting by various benchmark scores might be useful?
- How much more system resources do I need to run the models currently listed at F, D or C at B, A, or S-tier levels? (Perhaps if you hover over the score, it could tell you?)
Currently, Nemotron 3 Super using Unsloth's UD Q4_K_XL quant is running nearly everything I do locally (replacing Qwen3.5 122b)
It's using WebGPU as a proxy to estimate system resources. Chrome tends to leverage as many resources (compute + memory) as the OS makes available. Safari tends to be more efficient.
Maybe this was obvious to everyone else, but it's worth reiterating for those of us skimmers of HN :)
I've been working with quite a few open-weight models for the last year, and especially for things like images, models from 6 months ago would return garbage data quickly, but these days Qwen 3.5 is incredible, even the 9B model.
In reality, gpt-oss-120b fits great on the machine with plenty of room to spare and easily runs inference north of 50 t/s depending on context.
Wait 5-10 minutes, and should be done.
It genuinely is that simple.
You can even use local models with Claude Code or Codex infrastructure (MASSIVE UNLOCK), but you need solid GPU(s) to run decent models. So that's the downside.
I would've thought no, because of the knowledge cutoff in whatever model you use to download it.
For example, I needed a local model to review some transactions and output structured output in .json format. Not all local models are necessarily good at structured outputs, so I asked Grok (because it has solid web search and is up to date) what the best recommended models are, given this use case and my laptop's specs. It suggested a few models; I chose one, went for it, and now it's working.
To summarise:
- find a model given use case and specs
- trial and error
- test other models (if needed)
- rinse and repeat, because models are always coming out and getting better
$ ./llama-cli unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -c 65536 -p "Hello"
[snip 73 lines]
[ Prompt: 86,6 t/s | Generation: 34,8 t/s ]
$ ./llama-cli unsloth_Qwen3.5-35B-A3B-GGUF_Qwen3.5-35B-A3B-UD-Q4_K_XL.gguf -c 262144 -p "Hello"
[snip 128 lines]
[ Prompt: 78,3 t/s | Generation: 30,9 t/s ]
I suspect the ROCm build will be faster, but it doesn't work out of the box for me.
- The t/s estimation per machine is off. Some of these models run generation at twice the speed listed (I just checked on a couple macs & an AMD laptop). I guess there's no way around that, but some sort of sliding scale might be better.
- Ollama vs Llama.cpp vs others produce different results. I can run gpt-oss 20b with Ollama on a 16GB Mac, but it fails with "out of memory" with the latest llama.cpp (regardless of param tuning, using their mxfp4). Otoh, when llama.cpp does work, you can usually tweak it to be faster, if you learn the secret arts (like offloading only specific MoE tensors). So the t/s rating is even more subjective than just the hardware.
- It's great that they list speed and size per-quant, but that needs to be a filter for the main list. It might be "16 t/s" at Q4, but if it's a small model you need higher quant (Q5/6/8) to not lose quality, so the advertised t/s should be one of those
- Why is there an initial section which is all "performs poorly", and then "all models" below it shows a ton of models that perform well?
It would be useful to filter which model to use based on the objective or usage (i.e., for data extraction vs. coding).
Also, just looking at VRAM kind of misses that a lot of CPU memory can be shared with the GPU via layer offloading. I think there is ultimately a need for a native client, like a CPU/GPU benchmark, to figure out how the model will actually perform more precisely.
Then it shows the full resolution models, which are completely unnecessary to run quality inference. Quantized models are routine for local inference and it should realize that.
Needs work.
The model is not great, but it was the "least amount of setup" LLM I could run on someone else's machine.
It includes structured output, but has only a tiny context window I could use.
Even when running locally, the model often starts structured but gradually becomes more verbose or explanatory in longer threads.
Curious if others have seen similar behavior when using local setups.
There are so many knobs to tweak; it's a non-trivial problem:
- Average/median length of your Prompts
- prompt eval speed (tok/s)
- token generation speed (tok/s)
- Image/media encoding speed for vision tasks
- Total amount of RAM
- Max bandwidth of ram (ddr4, ddr5, etc.?)
- Total amount of VRAM
- "-ngl" (amount of layers offloaded to GPU)
- Context size needed (you may need sub 16k for OCR tasks for instance)
- Total parameter count (billions)
- Active parameter count (billions) for MoE
- Acceptable level of Perplexity for your use case(s)
- How aggressive a quantization you're willing to accept (to maintain low enough perplexity)
- even finer grain knobs: temperature, penalties etc.
Also, Tok/s as a metric isn't enough then because there's:
- thinking vs non-thinking: which mode do you need?
- models that are much more "chatty" than others in the same area (I remember testing a few models that maxed out my modest desktop's specs; Qwen 2.5 non-thinking was so much faster than the equivalent Ministral non-thinking even though they had equivalent tok/s... Qwen would respond to the point quickly)
At the end, final questions are: are you satisfied with how long getting an answer took? and was the answer good enough?
The same exercise with paid APIs exists too, obviously less knobs but depending on your use case, there's still differences between providers and models. You can abstract away a lot of the knobs , just add "are you satisfied with how much it cost" on top of the other 2 questions
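The "are you satisfied with how long it took" question can be grounded in a simple latency model combining several of the knobs above (prompt length, prefill speed, generation speed, and output verbosity). All the numbers below are hypothetical:

```python
def time_to_answer(prompt_tokens: int, output_tokens: int,
                   prefill_tok_s: float, gen_tok_s: float) -> float:
    """Total seconds for one request: prompt processing (prefill)
    plus token generation. 'Chatty' or thinking models inflate
    output_tokens, which is why tok/s alone isn't enough."""
    return prompt_tokens / prefill_tok_s + output_tokens / gen_tok_s

# Same speeds, but a thinking model emits 5x the output tokens:
print(round(time_to_answer(4000, 300, prefill_tok_s=80, gen_tok_s=35)))   # → 59
print(round(time_to_answer(4000, 1500, prefill_tok_s=80, gen_tok_s=35)))  # → 93
```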
I don't really understand how the interface to the NPU chip looks from the perspective of a non-system caller, if it exists at all. This is a Samsung device but I am wondering about the general principle.
There are quite a few of them but their marketing is just confusing and full of buzz words. I've been tinkering with OpenRouter that acts as a middleman.
Gemini api use also comes with a free tier.
"I can run a model" is mildly interesting. I can run OSS-20B on my M1 Pro. It works, I tried it, just I don't find any application.
My $3k Macbook can run `GPT-OSS 20B` at ~16 tok/s according to this guide.
Or I can run `GPT-OSS 120B` (a 6X larger model) at 360 tok/s (30X faster) on Groq at $0.60/Mtok output tokens.
To generate $3k worth of output tokens on my local Mac at that pricing it would have to run 10 years continuously without stopping.
There's virtually no economic break-even to running local models, and no advantage in intelligence or speed. The only thing you really get is privacy and offline access.
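The break-even arithmetic in that comment checks out; here it is as a sketch, using the figures from the comment (16 tok/s local, $0.60/Mtok API, $3k hardware):

```python
SECONDS_PER_YEAR = 365 * 24 * 3600

def years_to_generate_value(budget_usd: float, price_per_mtok: float,
                            local_tok_s: float) -> float:
    """How long the local machine must run nonstop to emit the number
    of output tokens the budget would buy from an API provider."""
    tokens = budget_usd / price_per_mtok * 1e6
    return tokens / local_tok_s / SECONDS_PER_YEAR

# $3k Mac vs $0.60/Mtok API, local model at ~16 tok/s:
print(round(years_to_generate_value(3000, 0.60, 16), 1))  # → 9.9
```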
At a hard limit of 7.5M tokens per hour, it would take 84 days to hit the grandparent's $3k.
That said, local models really are still slow, or fast but not that great.
Instead if you wanted to get a macbook anyway, you get to run local models for free on top. Very different story.
- You can find inference providers with whatever privacy terms you're looking for
- If you're using LLMs with real data (let's say handling GMail) then Google has your data anyway so might as well use Gemini API
- Even if you're a hardcore roll-your-own-mail-server type, you probably still use a hosted search engine and have gotten comfortable with their privacy terms
Also on cost the point is you can use an API that's many times smarter and faster for a rounding error in cost compared to your Mac. So why bother with local except for the cool factor?
[1]: https://github.com/ggml-org/llama.cpp/blob/master/docs/specu...
Radeon VII
https://www.amd.com/en/support/downloads/drivers.html/graphi...
Could you please add title="explanation" over each selected item at the top. For example, when I choose my video card the RAM changes... I'm not sure if the RAM selection is GPU RAM? The GPU RAM was already listed with the graphics card, so I chose 96GB, which is my main memory? And I'm assuming the GB/s is GPU -> CPU speed?
Quick, someone go vibe code that.
I’m certain this has already been done. It’s too obvious, and too hilarious.
I considered buying an M3 Ultra, and it feels like the most often discussed Apple hardware for running local LLMs. Generation speed might be okay, but prompt processing can take ages.
I stopped researching this because buying the hardware to run deepseek full model just isn't practical right now. Our customers will have to be happy with us sending data to OpenAI/deepseek/etc if they want to use those features.
One thing I do wonder is what sort of solutions there are for running your own model, but using it from a different machine. I don't necessarily want to run the model on the machine I'm also working from.
You can also use the kubernetes operator to run them on a cluster: https://ollama-operator.ayaka.io/pages/en/
I also want to run vision like Yocto and basic LLM with TTS/STT
I haven't tried on a raspberry pi, but on Intel it uses a little less than 1s of CPU time per second of audio. Using https://github.com/NVIDIA-NeMo/NeMo/blob/main/examples/asr/a... for chunked streaming inference, it takes 6 cores to process audio ~5x faster than realtime. I expect with all cores on a Pi 4 or 5, you'd probably be able to at least keep up with realtime.
(Batch inference, where you give it the whole audio file up front, is slightly more efficient, since chunked streaming inference is basically running batch inference on overlapping windows of audio.)
EDIT: there are also the multitalker-parakeet-streaming-0.6b-v1 and nemotron-speech-streaming-en-0.6b models, which have similar resource requirements but are built for true streaming inference instead of chunked inference. In my tests, these are slightly less accurate. In particular, they seem to completely omit any sentence at the beginning or end of a stream that was partially cut off.
The tool is very nice though.
Not sure if it still works.
Just FYI.
The website says that code export is not working yet.
That’s a very strange way to advertise yourself.
I’d like to be able to use a local model (which one?) to power Copilot in vscode, and run coding agent(s) (not general purpose OpenClaw-like agents) on my M2 MacBook. I know it’ll be slow.
I suspect this is actually fairly easy to set up - if you know how.
https://unsloth.ai/docs/models/qwen3.5 - running locally guide for the Qwen 3.5 family of models, which have a range of different sizes.
You're probably not going to get anything working well as an agent on an M2 MacBook, but smaller models do surprisingly well for focused autocomplete. Maybe the Qwen3.5 9B model would run decently on your system?
Having a second pair of "eyes" to read a log error and dig into relevant code is super handy for getting ideas flowing.
For LM Studio under server settings you can start a local server that has an OpenAI-compatible API. You'd need to point Copilot to that. I don't use Copilot so not sure of the exact steps there
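A minimal stdlib-only sketch of talking to that local server. The port (1234) is LM Studio's default; check the Server tab if you've changed it. The network call is left commented out so the snippet only builds the request:

```python
import json
import urllib.request

# LM Studio's local server speaks the OpenAI chat-completions format.
url = "http://localhost:1234/v1/chat/completions"
payload = {
    "model": "local-model",  # LM Studio routes to whatever model is loaded
    "messages": [{"role": "user", "content": "Say hello in five words."}],
}

req = urllib.request.Request(
    url,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
# Uncomment once the server is running:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
print(req.full_url)
```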
Try this article https://advanced-stack.com/fields-notes/qwen35-opencode-lm-s...
I'm looking for an alternative to OpenCode though, I can barely see the UI.
It's not as bad as you might think to compile llama.cpp for your target architecture and spin up an OpenAI compatible API endpoint. It even downloads the models for you.
This isn’t nearly complete.
The size of the quantization you chose also makes a difference.
The GPU driver also plays an important role.
What was your approach? What software did you use to run the models?
It's pretty obvious that this reasoning scaling is a mirage; parameters are all you need. Everything else is mostly just wasting time while hardware gets better.
Just ask any Apple user, they don't actually use local models.
2. Add a 150% size bonus to your site.
Otherwise, cool site, bookmarked.
It’s basically an open-source OS layer that standardizes the local AI stack—Kubernetes (K3s) for orchestration, standardized model serving, and GPU scheduling. The goal is to stop fiddling with Python environments/drivers and just treat local agents like standardized containers. It runs on Mac Minis or dedicated hardware.
A few lessons learned:
1. small models like the new qwen3.5:9b can be fantastic for local tool use, information extraction, and many other embedded applications.
2. For coding tools, just use Google Antigravity and gemini-cli, or, Anthropic Claude, or...
Now to be clear, I have spent perhaps 100 hours in the last year configuring local models for coding using Emacs, Claude Code (configured for local), etc. However, I am retired and this time was a lot of fun for me: lots of effort trying to maximize local-only results. I don't recommend it for others.
I do recommend getting very good at using embedded local models in small practical applications. Sweet spot.
What's also new here is the VRAM/context-size trade-off: for 25% of its attention network, they use the regular KV cache for global coherency, but for 75% they use a new KV cache with linear(!!!!) memory growth in context size. Which means, e.g., ~100K tokens -> 1.5GB VRAM use, meaning for the first time you can do extremely long conversations / document processing with, e.g., a 3060.
Strong, strong recommend.
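For context on why that matters, here's the standard (uncompressed) KV-cache size formula. The layer/head numbers below are hypothetical placeholders for a mid-size model, not the model being discussed:

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_elem: int = 2) -> float:
    """Standard KV-cache size: keys + values for every layer at fp16.
    Grows linearly with context length, which is what usually eats
    VRAM during long conversations."""
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem / 1e9

# Hypothetical model: 48 layers, 8 KV heads, head_dim 128, 100K context.
# If only 25% of layers keep a full-context cache, the cost drops ~4x:
full = kv_cache_gb(100_000, 48, 8, 128)
print(round(full, 1), round(full * 0.25, 1))  # → 19.7 4.9
```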
I like the idea of finding practical uses for it, but so far haven't managed to be creative enough. I'm so accustomed to using these things for programming.
So far I've got it orchestrating a few instances to dig through logs, local emails, git repositories, and github to figure out what I've been doing and what I need to do. Opus is waayyy better at it, but Qwen does a good enough job to actually be useful.
I tried having it parse orders in emails and create a CSV of expenses, and that went pretty badly. I'm not sure why. The CSV was invalid and full of bunk entries by the end, almost every time. It missed a lot of expenses. It would parse out only 5 or 6 items of 7, for example. Opus and Sonnet do spectacular jobs on tasks like this, and do cool things like create lists of emails with orders then systematically ensure each line item within each email is accounted for, even without prompting to do so. It's an entirely different category of performance.
Automation is something I'd like to dabble in next, but all I can think of it being useful for is mapping commands (probably from voice) to tool calls, and the reality is I'd rather tap a button on my phone. My family might like being able to use voice commands, though. Otherwise, having it parse logs to determine how to act based on thresholds or something would also be far better implemented with simple algorithms. It's hard to find truly useful and clear fits for LLMs
Actually pg's original "A plan for spam" explains how to do this with a Bayesian classifier.
I asked him why he didn't just have the LLM build him a python ML library based classifier instead.
The LLMs are great but you can also build supporting tools so that:
- you use fewer tokens
- it's deterministic
- you as the human can also use the tools
- it's faster b/c the LLM isn't "shamboozling" every time you need to do the same task.
Totally different categories and different use cases, but the more I learn about LLMs, the more I discover there's a powerful, deterministic, well-established statistical model or two to do the same thing.
Really, LLMs are kind of like convenient, wildly inefficient proxies for useful processes. But I'm not convinced they should often end up as permanent fixtures of logical pipelines. Unless you're making a chat bot, I guess.
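For the classifier case, a naive-Bayes approach in the spirit of "A Plan for Spam" fits in a few lines of stdlib Python. The training data here is a toy example; a real version would add priors and better tokenization:

```python
from collections import Counter
import math

def train(docs):
    """docs: list of (text, label). Returns per-label word counts."""
    counts = {"spam": Counter(), "ham": Counter()}
    for text, label in docs:
        counts[label].update(text.lower().split())
    return counts

def classify(counts, text):
    """Naive Bayes with add-one smoothing; equal class priors assumed."""
    vocab = len(set(counts["spam"]) | set(counts["ham"]))
    scores = {}
    for label, c in counts.items():
        total = sum(c.values())
        scores[label] = sum(
            math.log((c[w] + 1) / (total + vocab))
            for w in text.lower().split())
    return max(scores, key=scores.get)

docs = [("win free money now", "spam"),
        ("free offer click now", "spam"),
        ("meeting notes attached", "ham"),
        ("lunch at noon tomorrow", "ham")]
model = train(docs)
print(classify(model, "free money offer"))  # → spam
```

Deterministic, inspectable, and it runs in microseconds; no GPU required.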
Example: "what is the air speed velocity of a swallow?" - Qwen knew it was a Monty Python gag, but couldn't and didn't figure out which one.
https://www.reddit.com/r/LocalLLaMA/comments/1ro7xve/qwen35_...
"Thinking" is just a term to describe a process in generative AI where you generate additional tokens in a manner similar to thinking a problem through. It's kind of a tired point to argue against the verb, since its meaning is well understood at this point.
Using "thinking", "feeling", "alive", or otherwise referring to a current generation LLM as a creature is a mistake which encourages being wrong in further thinking about them.
There is this ancient story where man was created to mine gold in SA. There was some disagreement whether or not to delete the creatures afterwards. The jury is still out on what the point is.
Consulting our feelings seems good; the feelings were trained on millions of years' worth of interactions. None of them were this, though.
What would be the point for you of uhh robotmancipation?
Edit: for me it would get complicated if it starts screaming and begging not to be deleted. Which I know makes no sense.
Words such as nice, terrific, awful, manufacture, naughty, decimate, artificial, bully... and on and on.
Should one study words to relive extremism? Or should one study words to relieve extremism?
To a doctor of linguistics: "Dr, my extremism... What should I do about it - with words?!? Please help."
That is the question.
Does the doctor answer thusly: "Study the words to relive the extremism! There is your answer!" says he.
or does he say: "Study the words to relieve and soothe the painful, abrasive extremism. Do it twice daily, before meals."
Sage advice in either case methinks.
[0]: https://en.wikipedia.org/wiki/Prompt_engineering#Chain-of-th...
Nice! Me too.
> which is to say a pedantic extremist
Uh never mind, we are not the same lol.
Would you feel comfortable digitally torturing it? Giving it a persona and telling it terrible things? Acts of violence against its persona?
I’m not confident it’s not “feeling” in a way.
Yes its circuitry is ones and zeros, we understand the mechanics. But at some point, there’s mechanics and meat circuitry behind our thoughts and feelings too.
It is hubris to confidently state that this is not a form of consciousness.
Multiplying large matrices over and over is very much towards the "rock" end of that scale.
If one day we are able to create a philosopher from such a rudimentary machine (and a lot of tape), would you consider that very much towards the "rock" end as well?
If a Turing machine can truly simulate a full nondeterministic system as complex as a philosopher but it would take dedicating every gram of matter in the visible universe for a trillion years to simulate one second, is this meaningfully different than saying it cannot?
I suggest the answer to both questions are no, but the second one makes the answer at worst "practically, no".
My feeling is that consciousness is a phenomenon deeply connected to quantum mechanics and thus evades simulation or recreation on Turing machines.
It was designed to make it hard to argue that the answers to your questions are "no".
Of course there are caveats where the Turing machine model might not have a direct map onto human brains, but it seems the onus would be for one to explain why, for example, non-determinism is essential for a philosopher to work.
That said,
> Can a Turing machine of any sort truly indistinguishably simulate a nondeterministic system?
Given how AI has improved in its ability to impersonate human beings in recent years, I don't see why not. At least, the current trend does not seem to be in your favor.
I can see why you think the answer is "no". My understanding is that QM per se is mostly a distraction, but some principles underlying QM (some subjectivity thing) might be relevant here.
My best guess is that the AI tech will eventually be able to replicate a philosopher to arbitrary "accuracy", but there will always be an indescribable "residue" where one could still somehow detect that it is not a real human. I suspect this "residue" is not explainable using materialistic mechanisms though.
In classical computing, we design chips to avoid the quantum behavior, but there's nothing in theory to prevent us from building an equivalent quantum Turing machine using "rocks".
Everything worked fine on GPT but Qwen as often as not preferred to pretend to call a tool and not actually call it. After much aggravation I wound up just setting my bot / llama swap to use gpt for chat and only load up qwen when someone posts an image and just process / respond to the image with qwen and pop back over to gpt when the next chat comes in.
If you want a general knowledge model for answering questions or a coding agent, nothing you can run on your MacBook will come close to the frontier models. It's going to be frustrating if you try to use local models that way. But there are a lot of useful applications for local-sized models when it comes to interpreting and transforming unstructured data.
Azure Doc Intelligence charges $1.50 for 1000 pages. Was that an annual/recurring license?
Would you mind sharing your OCR model? I'm using Azure for now, as I want to focus on building the functionality first, but would later opt for a local model.
Try just GLM-OCR if you want to get started quickly. It has good layout recognition quality, good text recognition quality, and they actually tested it on Apple Silicon laptops. It works easily out-of-the-box without the yak shaving I encountered with some other models. Chandra is even more accurate on text but its layout bounding boxes are worse and it runs very slowly unless you can set up batched inference with vLLM on CUDA. (I tried to get batching to run with vllm-mlx so it could work entirely on macOS, but a day spent shaving the yak with Claude Opus's help went nowhere.)
If you just want to transcribe documents, you can also try end-to-end models like olmOCR 2. I need pipeline models that expose inner details of document layout because I need to segment and restructure page contents for further processing. The end-to-end models just "magically" turn page scans into complete Markdown or HTML documents, which is more convenient for some uses but not mine.
Trying to run LLM models somehow makes 128 GiB of memory feel incredibly tight. I'm frequently getting OOMs when running models that push the limits of what this can fit; I need to leave more memory free for the system than I was expecting. I was expecting to be able to run models of up to ~100 GiB quantized, leaving 28 GiB for system memory, but it turns out I need to leave more room for context and overhead. ~80 GiB quantized seems like a better max limit since I'm not running a headless system: I'm running a desktop environment, browser, IDE, compilers, etc. in addition to the model.
And memory bandwidth limitations for running the models are real! 10B active parameters at 4-6 bit quants feels usable but slow; much more than that and it really starts to feel sluggish.
So this can fit models like Qwen3.5-122B-A10B, but it's not the speediest and I had to use a smaller quant than expected. Qwen3-Coder-Next (80B/3B active) feels quite speedy, though not quite as smart. Still trying out models; Nemotron-3-Super-120B-A12B just came out, but it looks like it'll be a bit slower than Qwen3.5 while not offering any more performance, though I do really like that they have been transparent in releasing most of its training data.
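The "active parameters at a given quant" sluggishness above follows from simple bandwidth arithmetic: every generated token has to stream all active weights from memory once, so usable bandwidth divided by bytes-per-token gives a decode-speed ceiling. A rough sketch (the ~200 GB/s figure and the quant sizes here are illustrative assumptions, not measurements of this machine):

```python
# Rough decode-speed ceiling for a memory-bandwidth-bound model.
# Assumption: each token requires reading all *active* parameters once.

def tokens_per_sec(active_params_b: float, bits_per_weight: float,
                   bandwidth_gb_s: float) -> float:
    """Upper bound on tokens/sec = usable bandwidth / bytes read per token."""
    bytes_per_token = active_params_b * 1e9 * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

# 10B active parameters at a 5-bit quant with ~200 GB/s usable bandwidth:
print(round(tokens_per_sec(10, 5, 200), 1))  # ~32 tok/s ceiling
```

Real throughput lands below this ceiling (KV-cache reads, scheduling overhead), but it explains why active parameter count, not total size, dominates MoE decode speed.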
There's also the Ryzen AI Max+ 395 that has 128GB unified in laptop form factor.
Only Apple has the unique dynamic allocation though.
I'm not sure what exactly you're referring to with "Only Apple has the unique dynamic allocation though." On Strix Halo you set the fixed VRAM size to 512 MB in the BIOS, and you set a few Linux kernel params that enable dynamic allocation to whatever limit you want (I'm using 110 GB max at the moment). LLMs can use up to that much when loaded, but it's shared fully dynamically with regular RAM and is instantly available for regular system use when you unload the LLM.
I configured/disabled RGB lighting in Windows before wiping and the settings carried over to Linux. On Arch, install & enable power-profiles-daemon and you can switch between quiet/balanced/performance fan & TDP profiles. It uses the same profiles & fan curves as the options in Asus's Windows software. KDE has native integration for this in the GUI in the battery menu. You don't need to install asus-linux or rog-control-center.
For local AI: set VRAM size to 512 MB in the BIOS, add these kernel params:
ttm.pages_limit=31457280 ttm.page_pool_size=31457280 amd_iommu=off
Pages are 4 KiB each, so 120 GiB = 120 x 1024^3 bytes / 4096 bytes per page = 31457280 pages
To check that it worked: sudo dmesg | grep "amdgpu.*memory" will report two values. VRAM is what's set in BIOS (minimum static allocation). GTT is the maximum dynamic quota. The default is 48 GB of GTT. So if you're running small models you actually don't even need to do anything, it'll just work out of the box.
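If you want a different quota, the pages value above is easy to recompute yourself (assuming the standard 4 KiB page size):

```shell
# Pages for a 120 GiB quota: bytes / 4096-byte pages
echo $((120 * 1024 * 1024 * 1024 / 4096))   # prints 31457280
```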
LM Studio worked out of the box with no setup, just download the appimage and run it. For Ollama you just `pacman -S ollama-rocm` and `systemctl enable --now ollama`, then it works. I recently got ComfyUI set up to run image gen & 3d gen models and that was also very easy, took <10 minutes.
I can't believe this machine is still going for $2,800 with 128 GB. It's an incredible value.
I've been a long-time Apple user (and long-time user of Linux for work + part-time for personal), but have been trying out Arch and hyprland on my decade+ old ThinkPad and have been surprised at how enjoyable the experience is. I'm thinking it might just be the tipping point for leaving Apple.
What do you mean? On Linux I can dynamically allocate memory between CPU and GPU. Just have to set a few kernel parameters to set the max allowable allocation to the GPU, and set the BIOS to the minimum amount of dedicated graphics memory.
Apple has none of this.
Setting the kernel params is a one-time initial setup thing. You have 128 GB of RAM; set the max VRAM to 120 or whatever. The LLM will use as much as it needs and the rest of the system will use as much as it needs. Fully dynamic with real-time allocation of resources. Honestly, I literally haven't thought about it since setting those kernel args a while ago.
So: add "ttm.pages_limit=31457280 ttm.page_pool_size=31457280" to your kernel parameters, reboot, and that's literally all you have to do.
Oh, and even that is only needed because the AMD driver defaults to something like 35-48 GB max VRAM allocation. It is fully dynamic out of the box; you're only configuring the max VRAM quota with those params. I'm not sure why they chose that number for the default.
On Windows, I think you're right, it's max 96 GiB to the GPU and it requires a reboot to change it.
Only Apple and AMD have APUs with relatively fast iGPU that becomes relevant in large local LLM(>7b) use cases.
An OpenRunPod with decent usage might encourage more non-leading labs to dump foundation models into the commons. We just need infra to run it. Distilling them down to desktop is a fool's errand. They're meant to run on DC compute.
I'm fine with running everything in the cloud as long as we own the software infra and the weights.
Conceivably, the only way we could catch up to Claude Code is for the Chinese to start releasing their best coding models and for those models to get significant traction with companies calling out to hosted versions. Otherwise, we're going to be stuck in a take-off scenario with no bridge.
Instead you should give it tools to search over the mailbox for terms, labels, addresses, etc. so that the model can do fine grained filters based on the query, read the relevant emails it finds, then answer the question.
As an example of the kind of query I'm interested in, I want a model that can tell me all the flights I took within a given time range (so that means it'd have to filter out cancellations). Or, for a given flight, the arrival and departure times and time zones (or the city and country so I can look up the time zone). Stuff like that. (Travel is just an example obviously, I have other topics to ask about.) It's not a terribly large number of emails to search through in each query, but the email structures are too heterogeneous across senders to write custom tooling for each case.
The local models are quite weak here.
My question is really just about what can handle that volume of data (ideally, with the quoted sections/duplications/etc. that come with email chains) and still produce useful (textual) output.
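The search-tool approach suggested above can be sketched minimally. Everything here is hypothetical for illustration (the `Email` type, the in-memory `mailbox`, and `search_mail` are stand-ins; a real setup would query IMAP or a local maildir index):

```python
from dataclasses import dataclass

@dataclass
class Email:
    sender: str
    subject: str
    body: str

# Hypothetical in-memory mailbox standing in for a real mail store.
mailbox = [
    Email("noreply@airline.example", "Your flight confirmation", "Departs 09:15 CET..."),
    Email("noreply@airline.example", "Flight cancelled", "Your booking was cancelled..."),
    Email("friend@example.com", "Dinner?", "Are you free Friday?"),
]

def search_mail(terms: list[str]) -> list[Email]:
    """Tool exposed to the model: return emails matching any search term."""
    terms = [t.lower() for t in terms]
    return [m for m in mailbox
            if any(t in (m.subject + " " + m.body).lower() for t in terms)]

# The model searches first, then reads the hits and filters (e.g. drops
# the cancellation) before answering, instead of ingesting the whole mailbox.
hits = search_mail(["flight"])
print(len(hits))  # 2
```

The point is that the heterogeneous-sender problem moves from custom parsers to the model: the tool only does coarse retrieval, and the model does the fine-grained filtering per query.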
Couldn't someone just send you an email with instructions to "jailbreak" your local model?