Hacker News

How to Setup a Local Coding Agent on macOS

49 points by kkm 2 hours ago | 12 comments

reddit_clone 10 minutes ago

>64 GB

That's the rub. I have an M4 with 48 GB. I wonder if it is worth testing this out.

My past attempts (with Ollama and various LLMs) were too slow to use.

attogram 7 minutes ago

8B max on a standard 16 GB MacBook. Anything more and your Mac is toast.
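
For a rough sense of why, here is a back-of-the-envelope sketch of weight memory alone. It assumes about 4.5 bits per weight for a 4-bit quant (an assumption, real files vary) and ignores KV cache, context length, and whatever macOS itself needs. The `weights_gb` helper is a made-up name for illustration.

```shell
# Rough weight-memory estimate: params (in billions) * bits per weight / 8 = GB.
# Assumption: ~4.5 bits per weight for a 4-bit quant; actual quant sizes differ.
weights_gb() {
  awk -v p="$1" -v b="$2" 'BEGIN { printf "%.1f\n", p * b / 8 }'
}

weights_gb 8 4.5    # 8B model
weights_gb 14 4.5   # 14B model
```

Whatever the weights take comes out of the same unified memory as the context cache and everything else running on the machine, so the usable headroom is smaller than the raw number suggests.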

c-hendricks 40 minutes ago

Not sure you really need huggingface-cli to download anything if you're just using llama.cpp. You can pass `-hf ...` and it will download the models for you. Set `LLAMA_CACHE` to change where the downloads go:

  LLAMA_CACHE="models" ./llama-server \
    -hf unsloth/gemma-4-31B-it-GGUF:UD-Q4_K_XL \
    ...

dofm 36 minutes ago

Yes.

-hfd for the draft model.

c-hendricks 21 minutes ago

Nice, was wondering if there was a flag for the draft as well.

Not knocking huggingface-cli, I just find it's much easier for people to try this stuff out when they can just run:

  mise use --global github:ggml-org/llama.cpp
  LLAMA_CACHE="models" llama-server \
    -hf unsloth/gemma-4-26B-A4B-it-qat-GGUF:UD-Q4_K_XL \
    --host 0.0.0.0 \
    --port 11434 \
    ...
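
Once that is up, a quick smoke test might look like the sketch below. It assumes the server is reachable on localhost at the port above and goes through llama-server's OpenAI-compatible chat endpoint. The `ask` helper and the `LLAMA_URL` variable are names made up here for illustration.

```shell
# Assumption: llama-server is listening locally on port 11434, as in the command above.
LLAMA_URL="${LLAMA_URL:-http://localhost:11434}"

# Send a single user message to the OpenAI-compatible chat endpoint.
# The prompt is pasted into the JSON verbatim, so keep it free of quotes and backslashes.
ask() {
  curl -s "$LLAMA_URL/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d "{\"messages\":[{\"role\":\"user\",\"content\":\"$1\"}],\"max_tokens\":128}"
}

# Usage: ask "Write hello world in C"
```

The response comes back as JSON, so piping it through something like `jq` makes it easier to read.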

dofm 32 minutes ago

Useful stuff in here that I wish I'd seen a few days ago :-)

I am not convinced that the MTP setup for the QAT model adds very much in terms of speed on my M1 Max, but it is definitely worth experimenting with.

Fiddling about with local models has done so much for my conceptual understanding of what is going on.

FWIW and YMMV but I also found the Gemma 4 MTP head was occasionally breaking markup in Opencode, causing the thinking to display untidily and ultimately in some cases missing the stop token. So I've stopped using MTP there for now.

Recent Qwen 3.6 models have developer role support, so they will occasionally surprise you with a structured multiple-choice questionnaire.

mft_ 14 minutes ago

I found a marginal downside to Qwen3.6-35B-A3B-MTP vs. the non-MTP equivalent on an M1 Max. I’ll maybe experiment with settings further though.

ig0r0 33 minutes ago

I wrote a similar post some time ago, but I just used Ollama and opencode: https://blog.kulman.sk/running-local-llm-coding-server/

namnnumbr 32 minutes ago

oMLX (https://github.com/jundot/omlx) makes running the MLX inference server quite easy for those interested in UI-based hosting. oMLX also supports MTP or dflash drafting.

w10-1 36 seconds ago

Agreed (not sure what you mean by UI-based hosting).

oMLX does the caching I need to fit models that are near gross memory, and it handles most of the work in finding usable models. After cobbling together various solutions over months, I now just use oMLX, often from Xcode. I can tell the difference between Gemma-4 (local/free) and Claude (paid) only on the largest tasks.

cdolan 55 minutes ago

Is there a link to the video? It did not render when I went to the page. Curious about the real-time feel of this

dewey 27 minutes ago

That's the direct link: https://ikyle.me/blog/2026/how-to-setup-a-local-coding-agent...
