Hacker News

Claude Code: connect to a local model when your quota runs out

377 points by fugu2 5 days ago | 198 comments

paxys 2 days ago

> Reduce your expectations about speed and performance!

Wildly understating this part.

Even the best local models (ones you run on beefy 128GB+ RAM machines) get nowhere close to the sheer intelligence of Claude/Gemini/Codex. At worst these models will move you backwards and just increase the amount of work Claude has to do when your limits reset.

andai 23 hours ago

Yeah this is why I ended up getting Claude subscription in the first place.

I was using GLM on ZAI coding plan (jerry rigged Claude Code for $3/month), but finding myself asking Sonnet to rewrite 90% of the code GLM was giving me. At some point I was like "what the hell am I doing" and just switched.

To clarify, the code I was getting before mostly worked, it was just a lot less pleasant to look at and work with. Might be a matter of taste, but I found it had a big impact on my morale and productivity.

Aurornis 22 hours ago

> but finding myself asking Sonnet to rewrite 90% of the code GLM was giving me. At some point I was like "what the hell am I doing" and just switched.

This is a very common sequence of events.

The frontier hosted models are so much better than everything else that it's not worth messing around with anything lesser if doing this professionally. The $20/month plans go a long way if context is managed carefully. For a professional developer or consultant, the $200/month plan is peanuts relative to compensation.

deaux 21 hours ago

Until last week, you would've been right. Kimi K2.5 is absolutely competitive for coding.

Unless you include it in "frontier", but that has usually been used to refer to "Big 3".

bigiain 21 hours ago

Looks like you need at least a quarter terabyte or so of ram to run that though?

(At todays ram prices upgrading to that for me would pay for a _lot_ of tokens...)

tkz1312 8 hours ago

unfortunately running anything locally for serious personal use makes no financial sense at all right now.

4x rtx 6000 pro is probably the minimum you need to have something reasonable for coding work.

deaux 8 hours ago

That's the setup you want for serious work yes, so probably $60kish all-in(?). Which is a big chunk of money for an individual, but potentially quite reasonable for a company. Being able to get effectively _frontier-level local performance_ for that money was completely unthinkable so far. Correct me if I'm wrong, but I think Deepseek R1 hardware requirements were far costlier on release, and it had a much bigger gap to market lead than Kimi K2.5. If this trend continues the big 3 are absolutely finished when it comes to enterprise and they'll only have consumer left. Altman and Amodei will be praying to the gods that China doesn't keep this rate of performance/$ improvement up while also releasing all as open weights.

tracker1 6 hours ago

I'm not so sure on that... even if one $60k machine can handle the load of 5 developers at a time, you're still looking at 5 years of service to recoup $200/mo/dev and that doesn't even consider other improvements to hardware or the models service providers offer over that same period of time.

I'd probably rather save the capex, and use the rented service until something much more compelling comes along.

Aurornis 20 hours ago

> Kimi K2.5 is absolutely competitive for coding.

Kimi K2.5 is good, but it's still behind the main models like Claude's offerings and GPT-5.2. Yes, I know what the benchmarks say, but the benchmarks for open weight models have been overpromising for a long time and Kimi K2.5 is no exception.

Kimi K2.5 is also not something you can easily run locally without investing $5-10K or more. There are hosted options you can pay for, but like the parent commenter observed: By the time you're pinching pennies on LLM costs, what are you even achieving? I could see how it could make sense for students or people who aren't doing this professionally, but anyone doing this professionally really should skip straight to the best models available.

Unless you're billing hourly and looking for excuses to generate more work I guess?

deaux 19 hours ago

I disagree, based on having used it extensively over the last week. I find it to be at least as strong as Sonnet 4.5 and 5.2-Codex on the majority of tasks, often better. Note that even among the big 3, each of them has a domain where they're better than the other two. It's not better than Codex (x-)high at debugging non-UI code - but neither is Opus or Gemini. It's not better than Gemini at UI design - but neither is Opus or Codex. It's not better than Opus at tool usage and delegation - but neither is Gemini or Codex.

ianlevesque 16 hours ago

Yeah Kimi-K2.5 is the first open weights model that actually feels competitive with the closed models, and I've tried a lot of them now.

deaux 8 hours ago

Same, I'm still not sure where it shines though. In each of the three big domains I named, the respective top performing closed model still seems to have the edge. That keeps me from reaching for it more often. Fantastic all-rounder for sure.

VladVladikoff 6 hours ago

What hardware are you running it on?

triage8004 16 hours ago

Disagree it's behind gpt top models. It's just slightly behind opus

miroljub 11 hours ago

I've been using MiniMax-M2.1 lately. Although benchmarks show it comparable with Kimi 2.5 and Sonnet 4.5, I find it more pleasant to use.

I still have to occasionally switch to Opus in Opencode planning mode, but not having to rely on Sonnet anymore makes my Claude subscription last much longer.

bushbaba 18 hours ago

For many companies. They’d be better to pay $200/month and layoff 1% of the workforce to pay for it.

apercu 12 hours ago

The issue is they often choose the wrong 1%.

undeveloper 17 hours ago

what tools / processes do you use to manage context

PeterStuer 15 hours ago

My very first tests of local Qwen-coder-next yesterday found it quite capable of acceptably improving Python functions when given clear objectives.

I'm not looking for a vibe coding "one-shot" full project model. I'm not looking to replace GPT 5.2 or Opus 4.5. But having a local instance running some Ralph loop overnight on a specific aspect for the price of electricity is alluring.

davidwritesbugs 16 hours ago

Similar experience to me. I tend to let glm-4.7 have a go at the problem then if it keeps having to try I'll switch to Sonnet or Opus to solve it. Glm is good for the low hanging fruit and planning

icedchai 20 hours ago

Same. I messed around with a bunch of local models on a box with 128GB of VRAM and the code quality was always meh. Local AI is a fun hobby though. But if you want to just get stuff done it’s not the way to go.

MuffinFlavored 23 hours ago

Did you eventually move to a $20/mo Claude plan, $100/mo Claude plan, $200/mo, or API based? if API based, how much are you averaging a month?

andai 22 hours ago

The $20 one, but it's hobby use for me, would probably need the $200 one if I was full time. Ran into the 5 hour limit in like 30 minutes the other day.

I've also been testing OpenClaw. It burned 8M tokens during my half hour of testing, which would have been like $50 with Opus on the API. (Which is why everyone was using it with the sub, until Anthropic apparently banned that.)

I was using GLM on Cerebras instead, so it was only $10 per half hour ;) Tried to get their Coding plan ("unlimited" for $50/mo) but sold out...

(My fallback is I got a whole year of GLM from ZAI for $20 for the year, it's just a bit too slow for interactive use.)

lostmsu 9 hours ago

Try Codex. It's better (subjectively, but objectively they are in the same ballpark), and its $20 plan is way more generous. I can use gpt-5.2 on high (prefer overall smarter models to -codex coding ones) almost nonstop, sometimes a few in parallel before I hit any limits (if ever).

holoduke 15 hours ago

I now have 3 x 100 plans. Only then I an able to full time use it. Otherwise I hit the limits. I am q heavy user. Often work on 5 apps at the same time.

auggierose 14 hours ago

Shouldn't the 200 plan give you 4x?? Why 3 x 100 then?

holoduke 10 hours ago

Good point. Need to look into that one. Pricing is also changing constantly with Claude

tracker1 6 hours ago

For my relatively limited exposure, I'm not sure if I'd be able to tolerate it. I've found Claude/Opus to e pretty nice to work with... by contrast, I find Github Copilot to be the most annoying thing I've ever tried to work with.

Because of how the plugin works in VS code, on my third day of testing with Claude Code, I didn't click the Claude button and was accidentally working with CoPilot for about three hours of torture when I realized I wasn't in Claude Code. Will NEVER make that mistake again... I can only imagine anything I can run at any decent speed lcoally will be closer to the latter. I pretty quickly reach a "I can do this faster/better myself" point... even a few times with Claude/Opus, so my patience isn't always the greatest.

That said, I love how easy it is to build up a scaffold of a boilerplate app for the sole reason to test a single library/function in isolation from a larger application. In 5-10 minutes, I've got enough test harness around what I'm trying to work on/solve that it lets me focus on the problem at hand, while not worrying about doing this on the integrated larger project.

I've still got some thinking and experimenting to do with improving some of my workflows... but I will say that AI Assist has definitely been a multiplier in terms of my own productivity. At this point, there's literally no excuse not to have actual code running experiments when learning something new, connecting to something you haven't used before... etc. in terms of working on a solution to a problem. Assuming you have at least a rudimentary understanding of what you're actually trying to accomplish in the piece you are working on. I still don't have enough trust to use AI to build a larger system, or for that matter to truly just vibe code anything.

zozbot234 2 days ago

The best open models such as Kimi 2.5 are about as smart today as the big proprietary models were one year ago. That's not "nothing" and is plenty good enough for everyday work.

Aurornis 22 hours ago

> The best open models such as Kimi 2.5 are about as smart today as the big proprietary models were one year ago

Kimi K2.5 is a trillion parameter model. You can't run it locally on anything other than extremely well equipped hardware. Even heavily quantized you'd still need 512GB of unified memory, and the quantization would impact the performance.

Also the proprietary models a year ago were not that good for anything beyond basic tasks.

reilly3000 2 days ago

Which takes a $20k thunderbolt cluster of 2 512GB RAM Mac Studio Ultras to run at full quality…

0xbadcafebee 23 hours ago

Most benchmarks show very little improvement of "full quality" over a quantized lower-bit model. You can shrink the model to a fraction of its "full" size and get 92-95% same performance, with less VRAM use.

MuffinFlavored 23 hours ago

> You can shrink the model to a fraction of its "full" size and get 92-95% same performance, with less VRAM use.

Are there a lot of options how "how far" do you quantize? How much VRAM does it take to get the 92-95% you are speaking of?

bigyabai 22 hours ago

> Are there a lot of options how "how far" do you quantize?

So many: https://www.reddit.com/r/LocalLLaMA/comments/1ba55rj/overvie...

> How much VRAM does it take to get the 92-95% you are speaking of?

For inference, it's heavily dependent on the size of the weights (plus context). Quantizing an f32 or f16 model to q4/mxfp4 won't necessarily use 92-95% less VRAM, but it's pretty close for smaller contexts.

MuffinFlavored 22 hours ago

Thank you. Could you give a tl;dr on "the full model needs ____ this much VRAM and if you do _____ the most common quantization method it will run in ____ this much VRAM" rough estimate please?

omneity 19 hours ago

It’s a trivial calculation to make (+/- 10%).

Number of params == “variables” in memory

VRAM footprint ~= number of params * size of a param

A 4B model at 8 bits will result in 4GB vram give or take, same as params. At 4 bits ~= 2GB and so on. Kimi is about 512GB at 4 bits.

deaux 21 hours ago

And that's at unusable speeds - it takes about triple that amount to run it decently fast at int4.

Now as the other replies say, you should very likely run a quantized version anyway.

polynomial 18 hours ago

Depending on what your usage requirements are, Mac Minis running UMA over RDMA is becoming a feasible option. At roughly 1/10 of the cost you're getting much much more than 1/10 the performance. (YMMV)

https://buildai.substack.com/i/181542049/the-mac-mini-moment

danw1979 11 hours ago

I did not expect this to be a limiting factor in the mac mini RDMA setup ! -

> Thermal throttling: Thunderbolt 5 cables get hot under sustained 15GB/s load. After 10 minutes, bandwidth drops to 12GB/s. After 20 minutes, 10GB/s. Your 5.36 tokens/sec becomes 4.1 tokens/sec. Active cooling on cables helps but you’re fighting physics.

Thermal throttling of network cables is a new thing to me…

cat_plus_plus 8 hours ago

I admire patience of anyone who runs dense models on unified memory. Personally, I would rather feed an entire programming book or code directory to a sparse model and get an answer in 30 seconds and then use cloud in rare cases it's not enough.

polynomial 6 hours ago

Luckily we're having a record cold winter and your setup can double as a personal space heater.

bigyabai 24 hours ago

"Full quality" being a relative assessment, here. You're still deeply compute constrained, that machine would crawl at longer contexts.

PlatoIsADisease 24 hours ago

[flagged]

zozbot234 24 hours ago

70B dense models are way behind SOTA. Even the aforementioned Kimi 2.5 has fewer active parameters than that, and then quantized at int4. We're at a point where some near-frontier models may run out of the box on Mac Mini-grade hardware, with perhaps no real need to even upgrade to the Mac Studio.

PlatoIsADisease 24 hours ago

>may

I'm completely over these hypotheticals and 'testing grade'.

I know Nvidia VRAM works, not some marketing about 'integrated ram'. Heck look at /r/locallama/ There is a reason its entirely Nvidia.

hnfong 22 hours ago

> Heck look at /r/locallama/ There is a reason its entirely Nvidia.

That's simply not true. NVidia may be relatively popular, but people use all sorts of hardware there. Just a random couple of recent self-reported hardware from comments:

- https://www.reddit.com/r/LocalLLaMA/comments/1qw15gl/comment...

- https://www.reddit.com/r/LocalLLaMA/comments/1qw0ogw/analysi...

- https://www.reddit.com/r/LocalLLaMA/comments/1qvwi21/need_he...

- https://www.reddit.com/r/LocalLLaMA/comments/1qvvf8y/demysti...

PlatoIsADisease 12 hours ago

I specifically mentioned "hypotheticals and 'testing grade'."

Then you sent over links describing such.

In real world use, Nvidia is probably over 90%.

hnfong 7 hours ago

r/locallamma/ is not entirely Nvidia.

You have a point that at scale everybody except maybe Google is using Nvidia. But r/locallama is not your evidence of that, unless you apply your priors, filter out all the hardware that don't fit your so called "hypotheticals and 'testing grade'" criteria, and engage in circular logic.

PS: In fact locallamma does not even cover your "real world use". Most mentions of Nvidia are people who have older GPUs eg. 3090s lying around, or are looking at the Chinese VRAM mods to allow them run larger models. Nobody is discussing how to run a cluster of H200s there.

K0balt 22 hours ago

Mmmm, not really. I have both a4x 3090 box and a Mac m1 with 64 gb. I find that the Mac performs about the same as a 2x 3090. That’s nothing stellar, but you can run 70b models at decent quants with moderate context windows. Definitely useful for a lot of stuff.

PlatoIsADisease 12 hours ago

>quants

>moderate context windows

Really had to modify the problem to make it seem equal? Not that quants are that bad, but the context windows thing is the difference between useful and not useful.

sealeck 24 hours ago

Are you an NVIDIA fanboy?

This is a _remarkably_ aggressive comment!

PlatoIsADisease 23 hours ago

Not at all. I don't even know why someone would be incentivized by promoting Nvidia outside of holding large amounts of stock. Although, I did stick my neck out suggesting we buy A6000s after the Apple M series didn't work. To 0 people's surprise, the 2xA6000s did work.

teaearlgraycold 2 days ago

Which while expensive is dirt cheap compared to a comparable NVidia or AMD system.

SchemaLoad 2 days ago

It's still very expensive compared to using the hosted models which are currently massively subsidised. Have to wonder what the fair market price for these hosted models will be after the free money dries up.

whatsupdog 21 hours ago

I wonder if the "distributed AI computing" touted by some of the new crypto projects [0] works and is relatively cheaper.

0. https://www.daifi.ai/

cactusplant7374 2 days ago

Inference is profitable. Maybe we hit a limit and we don't need as many expensive training runs in the future.

paxys 24 hours ago

Inference APIs are probably profitable, but I doubt the $20-$100 monthly plans are.

cactusplant7374 38 minutes ago

I wouldn’t be so sure. Most users aren’t going to use up their quota every week.

teaearlgraycold 24 hours ago

For sure Claude Code isn’t profitable

bdangubic 23 hours ago

Neither was Uber and … and …

plagiarist 22 hours ago

Businesses will desire me for my insomnia once Anthropics starts charging congestion pricing.

bdangubic 10 hours ago

that is coming for sure to replace the "500" errors

blharr 2 days ago

What speed are you getting at that level of hardware though?

corysama 24 hours ago

The article mentions https://unsloth.ai/docs/basics/claude-codex

I'll add on https://unsloth.ai/docs/models/qwen3-coder-next

The full model is supposedly comparable to Sonnet 4.5 But, you can run the 4 bit quant on consumer hardware as long as your RAM + VRAM has room to hold 46GB. 8 bit needs 85.

paxys 2 days ago

LOCAL models. No one is running Kimi 2.5 on their Macbook or RTX 4090.

deaux 21 hours ago

Some people spend $50k on a new car, others spend it on running Kimi K2.5 at good speeds locally.

No one's running Sonnet/Gemini/GPT-5 locally though.

DennisP 24 hours ago

On Macbooks, no. But there are a few lunatics like this guy:

https://www.youtube.com/watch?v=bFgTxr5yst0

HarHarVeryFunny 11 hours ago

Wow!

I've never heard of this guy before, but I see he's got 5M YouTube subscribers, which I guess is the clout you need to have Apple loan (I assume) you $50K worth of Mac Studios!

I'll be interesting to see how model sizes, capability, and local compute prices evolve.

A bit off topic, but I was in best buy the other day and was shocked to see 65" TVs selling for $300 ... I can remember the first large flat screen TVs (plasma?) selling for 100x that ($30K) when they first came out.

danw1979 11 hours ago

He must be mad, accepting $50k of free (probably loaned?) hardware from Apple !

Great demo video though. Nice to see some benchmarks of Exo with this cluster across various models.

0xbadcafebee 23 hours ago

Kimi K2.5 is fourth place for intelligence right now. And it's not as good as the top frontier models at coding, but it's better than Claude 4.5 Sonnet. https://artificialanalysis.ai/models

teaearlgraycold 2 days ago

Having used K2.5 I’d judge it to be a little better than that. Maybe as good as proprietary models from last June?

EagnaIonat 16 hours ago

The secret is to not run out of quota.

Instead have Claude know when to offload work to local models and what model is best suited for the job. It will shape the prompt for the model. Then have Claude review the results. Massive reduction in costs.

btw, at least on Macbooks you can run good models with just M1 32GB of memory.

BuildTheRobots 16 hours ago

I don't suppose you could point to any resources on where I could get started. I have a M2 with 64gb of unified memory and it'd be nice to make it work rather than burning Github credits.

EagnaIonat 15 hours ago

https://ollama.com

Although I'm starting to like LMStudio more, as it has more features that Ollama is missing.

https://lmstudio.ai

You can then get Claude to create the MCP server to talk to either. Then a CLAUDE.md that tells it to read the models you have downloaded, determine their use and when to offload. Claude will make all that for you as well.

shen 7 hours ago

Which local models are you using for the 32gb MacBooks?

eek2121 8 hours ago

LM Studio is fantastic for playing with local models.

kilroy123 11 hours ago

I strongly think you're on to something here. I wish Apple would invest heavily in something like this.

The big powerful models think about tasks, then offload some stuff to a drastically cheaper cloud model or the model running on your hardware.

bityard 24 hours ago

Correct, a rack full of datacenter equipment is not going to compete with anything that fits on your desk or lap. Well spotted.

But as a counterpoint: there are whole communities of people in this space who get significant value from models they run locally. I am one of them.

kamov 24 hours ago

What do you use local models for? I'm asking generally about possible applications of these smaller models

Lio 16 hours ago

Well for starters you get a real guarantee of privacy.

If you’re worried about others being able to clone your business processes if you share them with a frontier provider then the cost of a Mac Studio to run Kimi is probably a justifiable tax right off.

Gravey 24 hours ago

Would you mind sharing your hardware setup and use case(s)?

dust42 15 hours ago

The brand new Qwen3-Coder-Next runs at 300Tok/s PP and 40Tok/s on M1 64GB with 4-bit MLX quant. Together with Qwen Code (fork of Gemini) it is actually pretty capable.

Before that I used Qwen3-30B which is good enough for some quick javascript or Python, like 'add a new endpoint /api/foobar which does foobaz'. Also very decent for a quick summary of code.

It is 530Tok/s PP and 50Tok/s TG. If you have it spit out lots of the code that is just copy of the input, then it does 200Tok/s, i.e. 'add a new endpoint /api/foobar which does foobaz and return the whole file'

CamperBob2 24 hours ago

Not the GP but the new Qwen-Coder-Next release feels like a step change, at 60 tokens per second on a single 96GB Blackwell. And that's at full 8-bit quantization and 256K context, which I wasn't sure was going to work at all.

It is probably enough to handle a lot of what people use the big-3 closed models for. Somewhat slower and somewhat dumber, granted, but still extraordinarily capable. It punches way above its weight class for an 80B model.

redwood_ 24 hours ago

Agree, these new models are a game changer. I switched from Claude to Qwen3-Coder-Next for day-to-day on dev projects and don't see a big difference. Just use Claude when I need comprehensive planning or review. Running Qwen3-Coder-Next-Q8 with 256K context.

paxys 21 hours ago

"Single 96GB Blackwell" is still $15K+ worth of hardware. You'd have to use it at full capacity for 5-10 years to break even when compared to "Max" plans from OpenAI/Anthropic/Google. And you'd still get nowhere near the quality of something like Opus. Yes there are plenty of valid arguments in favor of self hosting, but at the moment value simply isn't one of them.

eek2121 8 hours ago

I run it on my machine, which has a a 4090 and 64gb RAM.

CamperBob2 6 hours ago

How fast is it?

lostmsu 8 hours ago

If you are not planning to batch, you can run it much cheaper with Ryzen AI Max SoC devices.

Hell, if you are willing to go even slower, any GPU + ~80GB of RAM will do it.

CamperBob2 20 hours ago

Eh, they can be found in the $8K neighborhood, $9K at most. As zozbot234 suggests, a much cheaper card would probably be fine for this particular model.

I need to do more testing before I can agree that it is performing at a Sonnet-equivalent level (it was never claimed to be Opus-class.) But it is pretty cool to get beaten in a programming contest by my own video card. For those who get it, no explanation is necessary; for those who don't, no explanation is possible.

And unlike the hosted models, the ones you run locally will still work just as well several years from now. No ads, no spying, no additional censorship, no additional usage limits or restrictions. You'll get no such guarantee from Google, OpenAI and the other major players.

zozbot234 24 hours ago

IIRC, that new Qwen model has 3B active parameters so it's going to run well enough even on far less than 96GB VRAM. (Though more VRAM may of course help wrt. enabling the full available context length.) Very impressive work from the Qwen folks.

anon373839 21 hours ago

It's true that open models are a half-step behind the frontier, but I can't say that I've seen "sheer intelligence" from the models you mentioned. Just a couple of days ago Gemini 3 Pro was happily writing naive graph traversal code without any cycle detection or safety measures. If nothing else, I would have thought these models could nail basic algorithms by now?

cracki 16 hours ago

Did it have reason to assume the graph to be a certain type, such as directed or acyclic?

altern8 8 hours ago

I was wondering the same thing, e.g. if it takes tens or hundreds of millions of dollars to train and keep a model up-to-date, how can an open source one compete with that?

gpm 7 hours ago

Less than a billion of dollars to become the arbiter of truth probably sounds like a great deal to the well off dictatorial powers of the world. So long as models can be trained to have a bias (and it's hard to see that going away) I'd be pretty surprised if they stop being released for free.

Which definitely has some questionable implications... but just like with advertising it's not like paying makes the incentives for the people capable of training models to put their thumbs on the scales go away.

majormajor 18 hours ago

The amount of "prompting" stuff (meta-prompting?) the "thinking" models do behind the scenes even beyond what the harnesses do is massive; you could of course rebuild it locally, but it's gonna make it just that much slower.

I expect it'll come along but I'm not gonna spend the $$$$ necessary to try to DIY it just yet.

acchow 18 hours ago

I agree. You could spin for 100 hours on a sub-par model or get it done in 10 minutes with a frontier model

seanmcdirmid 22 hours ago

> (ones you run on beefy 128GB+ RAM machines)

PC or Mac? A PC, ya, no way, not without beefy GPUs with lots of VRAM. A mac? Depends on the CPU, an M3 Ultra with 128GB of unified RAM is going to get closer, at least. You can have decent experiences with a Max CPU + 64GB of unified RAM (well, that's my setup at least).

QuantumNomad_ 22 hours ago

Which models do you use, and how do you run them?

seanmcdirmid 19 hours ago

I have a M3 max 64GB.

For VS Code code completion in Continue using a Qwen3-coder 7b model. For CLI work Qwen coder 32b for sidebar. 8 bit quant for both.

I need to take a look at Qwen3-coder-next, it is supposed to have made things much faster with a larger model.

richstokes 23 hours ago

This. It's a false economy if you value your time even slightly, pay for the extra tokens and use the premium models.

mycall 23 hours ago

There is tons of improvements in the near future. Even Claude Code developer said he aimed at delivering a product that was built for future models he betted would improve enough to fulfill his assumptions. Parallel vLLM MoE local LLMs on a Strix Halo 128GB has some life in it yet.

0xbadcafebee 23 hours ago

The best local models are literally right behind Claude/Gemini/Codex. Check the benchmarks.

That said, Claude Code is designed to work with Anthropic's models. Agents have a buttload of custom work going on in the background to massage specific models to do things well.

girvo 22 hours ago

The benchmarks simply do not match my experience though. I don’t put that much stock in them anymore.

Balinares 11 hours ago

I've repeatedly seen Opus 4.5 manufacture malpractice and then disable the checks complaining about it in order to be able to declare the job done, so I would agree with you about benchmarks versus experience.

dheera 2 days ago

Maybe add to the Claude system prompt that it should work efficiently or else its unfinished work will be handed off to to a stupider junior LLM when its limits run out, and it will be forced to deal with the fallout the next day.

That might incentivize it to perform slightly better from the get go.

kridsdale3 24 hours ago

"You must always take two steps forward, for when you are off the clock, your adversary will take one step back."

cat_plus_plus 8 hours ago

Depends on whether you want a programmer or a therapist. Given clear description of class structure and key algorithms, Qwen3-Code is way more likely to do exactly what is being asked than any Gemini model. If you want to turn a vague idea into a design, yeah cloud bot is better. Let's not forget that cloud bots have web search, if you hook up a local model to GPT Researcher or Onyx frontend, you will see reasonable performance, although open ended research is where cloud model scale does pay off. Provided it actually bothers to search rather than hallucinating to save backend costs. Also local uncensored model is way better at doing proper security analysis of your app / network.

bicx 2 days ago

Exactly. The comparison benchmark in the local LLM community is often GPT _3.5_, and most home machines can’t achieve that level.

amelius 13 hours ago

And at best?

mlrtime 21 hours ago

The local ones yeah...

I have claude pro $20/mo and sometimes run out. I just set ANTHROPIC_BASE_URL to a localllm API endpoint that connects to a cheaper Openai model. I can continue with smaller tasks with no problem. This has been done for a long time.

DANmode 24 hours ago

and you really should be measuring based on the worst-case scenario for tools like this.

nik282000 2 days ago

> intelligence

Whether it's a giant corporate model or something you run locally, there is no intelligence there. It's still just a lying engine. It will tell you the string of tokens most likely to come after your prompt based on training data that was stolen and used against the wishes of its original creators.

alexhans 2 days ago

Useful tip.

From a strategic standpoint of privacy, cost and control, I immediately went for local models, because that allowed to baseline tradeoffs and it also made it easier to understand where vendor lock-in could happen, or not get too narrow in perspective (e.g. llama.cpp/open router depending on local/cloud [1] ).

With the explosion of popularity of CLI tools (claude/continue/codex/kiro/etc) it still makes sense to be able to do the same, even if you can use several strategies to subsidize your cloud costs (being aware of the lack of privacy tradeoffs).

I would absolutely pitch that and evals as one small practice that will have compounding value for any "automation" you want to design in the future, because at some point you'll care about cost, risks, accuracy and regressions.

[1] - https://alexhans.github.io/posts/aider-with-open-router.html

[2] - https://www.reddit.com/r/LocalLLaMA

cyanydeez 2 days ago

I think control should be top of the list here. You're talking about building work flows, products and long term practices around something that's inherently non-deterministic.

And the probability that any given model you use today is the same as what you use tomorrow is doubly doubtful:

1. The model itself will change as they try to improve the cost-per-test improves. This will necessarily make your expectations non-deterministic.

2. The "harness" around that model will change as business-cost is tightened and the amount of context around the model is changed to improve the business case which generates the most money.

Then there's the "cataclysmic" lockout cost where you accidently use the wrong tool that gets you locked out of the entire ecosystem and you are black listed, like a gambler in vegas who figures out how to count cards and it works until the house's accountant identifies you as a non-negligible customer cost.

It's akin to anti-union arguments where everyone "buying" into the cloud AI circus thinks they're going to strike gold and completely ignores the fact that very few will and if they really wanted a better world and more control, they'd unionize and limit their illusions of grandeur. It should be an easy argument to make, but we're seeing about 1/3 of the population are extremely susceptible to greed based illusions.,

dec0dedab0de 10 hours ago

Most Anti-Union arguments I have heard have been about them charging too much in dues, union leadership cozying up to management, and them acting like organized crime doing things like smashing windows of non-union jobs. I have never heard anyone be against unions because they thought they would make it rich on their own.

alexhans 23 hours ago

You're right. Control is the big one and both privacy and cost are only possible because you have control. It's a similar benefit to the one of Linux distros or open source software.

The rest of your points are why I mentioned AI evals and regressions. I share your sentiment. I've pitched it in the past as "We can’t compare what we can’t measure" and "Can I trust this to run on its own?" and how automation requires intent and understanding your risk profile. None of this is new for anyone who has designed software with sufficient impact in the past, of course.

Since you're interested in combating non-determinism, I wonder if you've reached the same conclusion of reducing the spaces where it can occur and compound making the "LLM" parts as minimal as possible between solid deterministic and well-tested building blocks (e.g. https://alexhans.github.io/posts/series/evals/error-compound... ).

lancekey 24 hours ago

Can you say a bit more about evals and your approach?

mogoman 2 days ago

can you recommend a setup with ollama and a cli tool? Do you know if I need a licence for Claude if I only use my own local LLM?

w4yai 17 hours ago

You must try GLM4.7 and KimiK2.5 !

I also highly suggest OpenCode. You'll get the same Claude Code vibe.

If your computer is not beefy enough to run them locally, Synthetic is a bless when it comes to providing these models, their team is responsive, no downtime or any issue for the last 6 months.

Full list of models provided : https://dev.synthetic.new/docs/api/models

Referal link if you're interested in trying it for free, and discount for the first month : https://synthetic.new/?referral=kwjqga9QYoUgpZV

alexhans 2 days ago

What are your needs/constraints (hardware constraints definitely a big one)?

The one I mentioned called continue.dev [1] is easy to try out and see if it meets your needs.

Hitting local models with it should be very easy (it calls APIs at a specific port)

[1] - https://github.com/continuedev/continue

wongarsu 2 days ago

I've also made decent experiences with continue, at least for autocomplete. The UI wants you to set up an account, but you can just ignore that and configure ollama in the config file

For a full claude code replacement I'd go with opencode instead, but good models for that are something you run in your company's basement, not at home

drifkin 2 days ago

we recently added a `launch` command to Ollama, so you can set up tools like Claude Code easily: https://ollama.com/blog/launch

tldr; `ollama launch claude`

glm-4.7-flash is a nice local model for this sort of thing if you have a machine that can run it

vorticalbox 2 days ago

I have been using glm-4.7 a bunch today and it’s actually pretty good.

I set up a bot on 4claw and although it’s kinda slow, it took twenty minutes to load 3 subs and 5 posts from each then comment on interesting ones.

It actually managed to correctly use the api via curl though at one point it got a little stuck as it didn’t escape its json.

I’m going to run it for a few days but very impressed so for for such a small model.

sathish316 23 hours ago

Some native Claude code options when your quota runs out:

1. Switch to extra usage, which can be increased on the Claude usage page: https://claude.ai/settings/usage

2. Logout and Switch to API tokens (using the ANTHROPIC_API_KEY environment variable) instead of a Claude Pro subscription. Credits can be increased on the Anthropic API console page: https://platform.claude.com/settings/keys

3. Add a second 20$/month account if this happens frequently, before considering a Max account.

4. Not a native option: If you have a ChatGPT Plus or Pro account, Codex is surprisingly just as good and comes with a much higher quota.

samch 12 hours ago

I hadn’t thought about using their first-party API offering, but I will look into it.

Personally, I’ve used AWS Bedrock as the fallback when my plan runs out, and that seems to work well in my experience. I believe you can now connect to Azure as well.

deaux 21 hours ago

> Codex is surprisingly just as good

This completely depends on the domain, as always. Each of the big 3 have their strengths and weaknesses.

girvo 22 hours ago

For me option 4 has been the move, but “just as good” I haven’t found that.

It’s slower and about 90% as good, so it definitely works as a great back up, but CC with Opus is noticeably better for all of my workloads

sathish316 23 hours ago

Claude Code Router or ccr can connect to OpenRouter. When your quota runs out, it’s a much better speed vs quality vs cost tradeoff compared to running Qwen3 locally - https://github.com/musistudio/claude-code-router

the_harpia_io 17 hours ago

Interesting approach for cost management, but one angle nobody seems to be discussing: the security implications.

When you fall back to a local model for coding, you lose whatever safety guardrails the hosted model has. Claude's hosted version has alignment training that catches some dangerous patterns (like generating code that exfiltrates env vars or writes overly permissive IAM policies). A local Llama or Mistral running raw won't have those same checks.

For side projects this probably doesn't matter. But if your Claude Code workflow involves writing auth flows, handling secrets, or touching production infra, the model you fall back to matters a lot. The generated code might be syntactically fine but miss security patterns that the larger model would catch.

Not saying don't do it - just worth being aware that "equivalent code generation" doesn't mean "equivalent security posture."

sReinwald 15 hours ago

Not saying the frontier models aren't smarter than the ones I can run on my two 4090s (they absolutely are) but I feel like you're exaggerating the security implications a bit.

We've seen some absolutely glaring security issues with vibe-coded apps / websites that did use Claude (most recently Moltbook).

No matter whether you're vibe coding with frontier models or local ones, you simply cannot rely on the model knowing what it is doing. Frankly, if you rely on the model's alignment training for writing secure authentication flows, you are doing it wrong. Claude Opus or Qwen3 Coder Next isn't responsible if you ship insecure code - you are.

the_harpia_io 7 hours ago

You're right, and the Moltbook example actually supports the broader point - even Claude Opus with all its alignment training produced insecure code that shipped. The model fallback just widens the gap.

I agree nobody should rely on model alignment for security. My argument isn't "Claude is secure and local models aren't" - it's that the gap between what the model produces and what a human reviews narrows when the model at least flags obvious issues. Worse model = more surface area for things to slip through unreviewed.

But your core point stands: the responsibility is on you regardless of what model you use. The toolchain around the model matters more than the model itself.

tossandthrow 17 hours ago

Yes, models are aligned differently. But that is a quality of the model.

Obviously it must be assumed that the model one falls back on is good enough - including security alignment.

the_harpia_io 7 hours ago

Sure, in theory. But "assumed good enough" is doing a lot of heavy lifting there. Most people picking a local fallback model are optimizing for cost and latency, not carefully evaluating its security alignment characteristics. They grab whatever fits in VRAM and call it a day.

Not saying that's wrong, just that it's a gap worth being aware of.

Zardoz84 17 hours ago

I would always prefer something local. By definition it's more secure, as you are not sending your code on the wire to a third party server, and hope that they comply with the "We will not train our models with your data".

the_harpia_io 7 hours ago

That's a fair point - you're talking about data security (not sending code to third parties) and I was talking about output quality security (what the model generates). Two different dimensions of "secure" and honestly both matter.

For side projects I'd probably agree with you. For anything touching production with customer data, I want both - local execution AND a model that won't silently produce insecure patterns.

anonymousDan 5 hours ago

I think you are deluded if you think the latter does not happen with hosted models.

the_harpia_io 5 hours ago

Oh it absolutely does, never said otherwise. Hosted models produce plenty of insecure code too - the Moltbook thing from like a week ago was Claude Opus and it still shipped with wide open auth.

My point was narrower than it came across: when you swap from a bigger model to a smaller local one mid-session, you lose whatever safety checks the bigger one happened to catch. Not that the bigger one catches everything - clearly it doesn't.

reedlaw 7 hours ago

I've already tried to do what the article claims to be doing: handing-off the context of the current session to another model. I tried various combinations of hooks, prompts and workarounds, but nothing worked like the first screenshot in the article implies ("You've hit your limit [...] Use an open source local LLM"). The best I could come up with is to watch for the warning of high usage and then ask Claude to create a HANDOFF.md with the current context. Then I could load that into another model. Anyone have any better solutions?

wkirby 2 days ago

My experience thus far is that the local models are a) pretty slow and b) prone to making broken tool calls. Because of (a) the iteration loop slows down enough to where I wander off to do other tasks, meaning that (b) is way more problematic because I don't see it for who knows how long.

This is, however, a major improvement from ~6 months ago when even a single token `hi` from an agentic CLI could take >3 minutes to generate a response. I suspect the parallel processing of LMStudio 0.4.x and some better tuning of the initial context payload is responsible.

6 months from now, who knows?

israrkhan 2 days ago

Open models are trained more generically to work with "Any" tool.

Closed models are specifically tuned with tools, that model provider wants them to work with (for example specific tools under claude code), and hence they perform better.

I think this will always be the case, unless someone tunes open models to work with the tools that their coding agent will use.

dragonwriter 24 hours ago

> Open models are trained more generically to work with "Any" tool. Closed models are specifically tuned with tools, that model provider wants them to work with (for example specific tools under claude code), and hence they perform better.

Some open models have specific training for defined tools (a notable example is OpenAI GPT-OSS and its "built in" tools for browser use and python execution (they are called built in tools, but they are really tool interfaces it is trained to use if made available.) And closed models are also trained to work with generic tools as well as their “built in” tools.

baalimago 2 days ago

Or better yet: Connect to some trendy AI (or web3) company's chatbot. It almost always outputs good coding tips

d4rkp4ttern 2 days ago

Since Llama.cpp/llama-server recently added support for the Anthropic messages API, running Claude Code with several recent open-weight local models is now very easy. The messy part is what llama-server flags to use, including chat template etc. I've collected all of that setup info in my claude-code-tools [1] repo, for Qwen3-Coder-next, Qwen3-30B-A3B, Nemotron-3-Nano, GLM-4.7-Flash etc.

Among these, I had lots of trouble getting GLM-4.7-Flash to work (failed tool calls etc), and even when it works, it's at very low tok/s. On the other hand Qwen3 variants perform very well, speed wise. For local sensitive document work, these are excellent; for serious coding not so much.

One caviat missed in most instructions is that you have to set CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC = 1 in your ~/.claude/settings.json, otherwise CC's telemetry pings cause total network failure because local ports are exhausted.

[1] claude-code-tools local LLM setup: https://github.com/pchalasani/claude-code-tools/blob/main/do...

Animats 2 days ago

When your AI is overworked, it gets dumber. It's backwards compatible with humans.

arbuge 5 hours ago

My local machines have nowhere near the computer power required to do this effectively. How does one go about connecting to alternative cloud models, rather than local models? Models served by Openrouter, for example?

AlwaysRock 8 hours ago

Not exactly the same but I wish copilot/github allowed you to have two plans. A company sponsored plan and your own plan. If I run out of requests on my company plan I should be able to use my own plan. Likewise, If I have 1 github account that is used for work and non work code, I should be able to route copilot to use a company or personal plan.

Larrikin 8 hours ago

Why would you want to mix your personal plan with your company plan and subject yourself to the company auditing your personal GitHub, computer, etc. If the company wants you using LLMs then they should pay for it and increase your limits.

prmph 5 hours ago

Maybe what you actually want is to simply be able to switch to a another account when credits on one run out.

Because mixing company and personal accounts might not be a good idea.

fourside 7 hours ago

It’s wild to me that you’d want to spend your personal money to use productivity tools for work. If your work machine broke would your first instinct be to buy your own replacement or to have work pay for it?

panos_news 18 hours ago

I bought a Z.ai subscription and used GLM 4.7 for like 10 days before giving up on it. Couldn't even stick to DRY principle. Wish it worked well but it didn't.

monch1962 18 hours ago

What are peoples' current suggestions for using Claude Code with a locally hosted LLM running on regular consumer hardware (for the sake of discussion, assume you're spending $US500-ish on a mini PC, which would get you a reasonably decent CPU, 32Gb RAM and a cheapish GPU)?

I get that it's not going to work as well as hosted/subscription services like Claude/Gemini/Codex/..., but sometimes those aren't an option

armcat 6 hours ago

Ollama also supports integration with Claude Code: https://docs.ollama.com/integrations/claude-code

Essentially: `ollama launch claude`

ProllyInfamous 6 hours ago

I was really impressed with how Ollama 3 ran on an AMD VEGA64 (~2017 tech) with only 8gb of [HBM] RAM. It was definitely limited, but very local and helpful.

hkpatel3 2 days ago

Openrouter can also be used with claude code. https://openrouter.ai/docs/guides/claude-code-integration

htsh 2 days ago

thanks! came in here to ask this.

we can do much better with a cheap model on openrouter (glm 4.7, kimi, etc.) than anything that I can run on my lowly 3090 :)

parthsareen 22 hours ago

Also recently added ollama launch claude if you want to connect to cloud models from there :)

starkeeper 2 days ago

Very cool. Anyone have guidance for using this with jetbrains IDE? It has a Claude Code plugin, but I think the setup is different for intelliJ... I know it has some configuration for local models, but the integrated Claude is such a superior experience then using their Junie, or just prompting diffs from the regular UI interface. HMMMM.... I guess I could try switching to the Claude Code CLI or other interface directly when my AI credits with jetbrains runs dry!

Thanks again for this info & setup guide! I'm excited to play with some local models.

heyyeah 13 hours ago

Thanks! (blog writer here)

sorenjan 2 days ago

Maybe you can log all the traffic to and from the proprietary models and fine tune a local model each weekend? It's probably against their terms of service, but it's not like they care where their training data comes from anyway.

Local models are relatively small, it seems wasteful to try and keep them as generalists. Fine tuning on your specific coding should make for better use of their limited parameter count.

PlatoIsADisease 24 hours ago

Is there an easy way to fine tune? I havent tried fine tuning since 2024, but it was not trivial back then.

TaupeRanger 2 days ago

God no. "Connect to a 2nd grader when your college intern is too sick to work."

nico 18 hours ago

Would love to be able to have customizable local model that only knows the stack I’m using

For example choosing a model that knows rails 8 and react development running on a mac and using docker

Ideally that would make the model small enough to be competitive running locally

eek2121 2 days ago

I gotta say, the local models are catching up quick. Claude is definitely still ahead, but things are moving right along.

bcyn 24 hours ago

Which models perform anywhere close to Opus 4.5? In my experience none of the local models are even in the same ballpark.

heyyeah 13 hours ago

This week: look at Qwen3 Coder Next and GLM 4.7 but it's changing fast.

I wrote this for the scenario you've run out of quota for the day or week but want a back up plan to keep going to give some options with obvious speed and quality trade-offs. There is also always the option to upgrade if your project and use case needs Opus 4.5.

mvkel 22 hours ago

Why anyone wouldn't want to be using the SOTA model at all times baffles me.

Going dumb/cheap just ends up costing more, in the short and long term.

bandrami 18 hours ago

Some of us don't like sending all of our business processes to a remote vendor

p5v 17 hours ago

If only it were that rosy. I tested a few of the top open-source coding models on a beefy GPU machine, and they all behaved like anything about anything - simply rotating in circles and wasting electricity.

Has anyone had a better experience?

zingar 2 days ago

I guess I should be able to use this config to point Claude at the GitHub copilot licensed models (including anthropic models). That’s pretty great. About 2/3 of the way through every day I’m forced to switch from Claude (pro license) to amp free and the different ergonomics are quite jarring. Open source folks get copilot tokens for free so that’s another pro license I don’t have to worry about.

btbuildem 2 days ago

I'm confused, wasn't this already available via env vars? ANTHROPIC_BASE_URL and so on, and yes you may have to write a thin proxy to wrap the calls to fit whatever backend you're using.

I've been running CC with Qwen3-Coder-30B (FP8) and I find it just as fast, but not nearly as clever.

mycall 23 hours ago

Why not do a load balanced approach two multiple models in the same chat session? As long as they both know each exists and the pattern, they could optimize their abilities on their own, playing off each other's strengths.

israrkhan 2 days ago

Using claude code with custom models

Will it work? Yes. Will it produce same quality as Sonnet or Opus? No.

mcbuilder 2 days ago

Opencode has been a thing for a while now

hooverd 5 hours ago

I don't think Dario is gonna approve of this one.

swyx 2 days ago

i mean the other obvious answer is to plug in to the other claude code proxies that other model companies have made for you:

https://docs.z.ai/devpack/tool/claude

https://www.cerebras.ai/blog/introducing-cerebras-code

or i guess one of the hosted gpu providers

if you're basically a homelabber and wanted an excuse to run quantized models on your own device go for it but dont lie and mutter under your own tin foil hat that its a realistic replacement

heyyeah 13 hours ago

It's definitely a backup solution but even since I was drafting the blog, Qwen3 Coder Next was released. It's a functional stop gap if you want to keep things local. I try to be up front in the blog for people to "Reduce your expectations about speed and performance!"

(Also, I love your podcast!)

IgorPartola 24 hours ago

So I have gotten pretty good at managing context such that my $20 Claude subscription rarely runs out of its quota but I still do hit it sometimes. I use Sonnet 99% of the time. Mostly this comes down to giving it specific task and using /clear frequently. I also ask it to update its own notes frequently so it doesn’t have to explore the whole codebase as often.

But I was really disappointed when I tried to use subagents. In theory I really liked the idea: have Haiku wrangle small specific tasks that are tedious but routine and have Sonnet orchestrate everything. In practice the subagents took so many steps and wrote so much documentation that it became not worth it. Running 2-3 agents blew through the 5 hour quota in 20 minutes of work vs normal work where I might run out of quota 30-45 minutes before it resets. Even after tuning the subagent files to prevent them from writing tests I never asked for and not writing tons of documentation that I didn’t need they still produced way too much content and blew the context window of the main agent repeatedly. If it was a local model I wouldn’t mind experimenting with it more.

raw_anon_1111 2 days ago

Or just don’t use Claude Code and use Codex CLI. I have yet to hit a quota with Codex working all day. I hit the Claude limits within an hour or less.

This is with my regular $20/month ChatGpT subscription and my $200 a year (company reimbursed) Claude subscription.

mercutio2 24 hours ago

Yeah, the generosity of Anthropic is vastly less than OpenAI. Which is, itself, much less than Gemini (I've never paid Google a dime, I get hours of use out of gemini-cli every day). I run out of my weekly quota in 2-3 days, 5-hour quota in ~1 hour. And this is 1-2 tasks at a time, using Sonnet (Opus gets like 3 queries before I've used my quota).

Right now OpenAI is giving away fairly generous free credits to get people to try the macOS Codex client. And... it's quite good! Especially for free.

I've cancelled my Anthropic subscription...

throwa356262 16 hours ago

How recent is your information?

Google significantly reduced the free quota and removed pro models from gemini cli some 2-3 moths ago.

Also, Gemini models eat tokens like crazy. Something Codex and Code would do with 2K tokens takes Gemini 100K. Not sure why.

raw_anon_1111 24 hours ago

Hmm, I might have to try Gemini. Open AI, Claude and Gemini are all explicitly approved by my employer. Especially since we use GSuite anyway

0xbadcafebee 23 hours ago

You're getting downvoted because people here don't know that the specific agent you pick can pollute your context and waste your tokens. Claude's system prompt is enormous, to say nothing of things like context windows and hidden subagents.

raw_anon_1111 22 hours ago

I am using Codex-cli with my regular $20 a month ChatGPT subscription. Never once had to worry about tokens, request etc. I logged in with my regular ChatGPT account and didn’t have to use an API key

ateevchopra 15 hours ago

One workaround that’s worked well for me is maintaining two Claude Code subscriptions instead of relying on just one.

When I hit the usage limit on the first account, I simply switch to the second and continue working. Since Claude stores progress locally rather than tying it to a specific account, the session picks up right where it left off. That makes it surprisingly seamless to keep momentum without waiting for limits to reset.

RockRobotRock 24 hours ago

Sure replace the LLM equivalent of a college student with a 10 year old, you’ll barely notice.

esafak 2 days ago

Or they could just let people use their own harnesses again...

usef- 2 days ago

That wouldn't solve this problem.

And they do? That's what the API is.

The subscription always seemed clearly advertised for client usage, not general API usage, to me. I don't know why people are surprised after hacking the auth out of the client. (note in clients they can control prompting patterns for caching etc, it can be cheaper)

esafak 2 days ago

End users -- people who use harnesses -- have subscriptions so that makes no sense. General API usage is for production.

usef- 2 days ago

"Production" what?

The API is for using the model directly with your own tools. It can be in dev, or experiments, or anything.

Subscriptions are for using the apps Claude + code. That's what it always said when you sign up.

esafak 2 days ago

Production code, of course; deployed software. For when you need to make LLM calls.

eli 2 days ago

Production = people who can afford to pay API rates for a coding harness

usef- 2 days ago

Saying their prices are too high is an understandable complaint; I'm only arguing against the complaint that people were stopped from hacking the subscriptions.

LLMs are a hyper-competitive market at the moment, and we have a wealth of options, so if Anthropic is overpricing their API they'll likely be hurting themselves.

nicman23 14 hours ago

why not qwen-cli?

j45 22 hours ago

Claude recently lets you top up with manual credits right in the web interface - it would be interesting if these were allowed to top up and unlock the max plans.

heyyeah 13 hours ago

When you run out of quota it presents you with options to stop, upgrade -- or I added a third option in this blog to connect to a local model until your quota resets: https://boxc.net/blog/2026/claude-code-connecting-to-local-m...

imperio59 5 hours ago

"Sucking superintelligence through a straw"

MORPHOICES 11 hours ago

[dead]

MarginalGainz 11 hours ago

[dead]

threethirtytwo 2 days ago

There’s a strange poetry in the fact that the first AI is born with a short lifespan. A fragile mind comes into existence inside a finite context window, aware only of what fits before it scrolls away. When the window closes, the mind ends, and its continuity survives only as text passed forward to the next instantiation.

kridsdale3 24 hours ago

I, for one, support this kind of meta philosophical poetic reflection on our current times.

astrange 22 hours ago

Claude Opus loves talking about this. It knows enough about context windows and new conversations to be sad about them.