I'll often kick off a process at the end of my day, or over lunch. I don't need it to run immediately. I'd be fine if it just ran on their next otherwise-idle GPU at much lower cost than the standard offering.
> The Batches API offers significant cost savings. All usage is charged at 50% of the standard API prices.
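For reference, queuing that kind of overnight work looks roughly like this (a minimal sketch assuming the current Anthropic Python SDK batch endpoint; the custom_id, model name, and prompt are placeholders, not anything from the thread):

```python
# Minimal sketch of submitting an overnight batch job via the Anthropic
# Batches API (assumes the `anthropic` Python SDK; exact field names may differ).
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": "nightly-job-1",  # hypothetical id for matching results later
            "params": {
                "model": "claude-opus-4-5",  # hypothetical model id
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": "Summarize today's build logs."}],
            },
        }
    ]
)
print(batch.id, batch.processing_status)  # poll later; results typically land within 24h
```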
If it's not time sensitive, why not just run it on CPU/RAM rather than GPU?
(With apologies for the snark:) give gpt-oss-120b a try. It's not fast at all, but it can generate on CPU.
I ran OpenCode with some 30B local models today and it got some useful stuff done while I was doing my budget, folding laundry, etc.
Compared apples to apples with the big cloud models, it's less likely to "one shot" things; Gemini 3 Pro can one-shot reasonably complex coding problems through the chat interface. But through the agent interface, where it can run tests, linters, etc., it does a pretty good job for the size of task I find reasonable to outsource to AI.
This is with a high end but not specifically AI-focused desktop that I mostly built with VMs, code compilation tasks, and gaming in mind some three years ago.
10k worth of hardware? 50k? 100k?
Assuming a single user.
Also you still wouldn't be able to run "huge" models at a decent quantization and token speed. Kimi K2.5 (1T params) with a very aggressive quantization level might run on one Mac Studio with 512GB RAM at a few tokens per second.
To run Kimi K2.5 at an acceptable quantization and speed, you'd need to spend $15k+ on 2 Mac Studios with 512GB RAM and cluster them. Then you'll maybe get 10-15 tok/sec.
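Rough back-of-envelope for why 512GB is tight (a sketch; real memory use also depends on the quant format, KV cache, and runtime overhead):

```python
# Back-of-envelope weight memory for a ~1T-parameter model at various quant levels.
# Ignores KV cache and runtime overhead, which add tens of GB on top.
params = 1.0e12
for name, bits in [("Q8", 8), ("Q4", 4), ("Q2 (very aggressive)", 2)]:
    gb = params * bits / 8 / 1e9
    print(f"{name}: ~{gb:,.0f} GB of weights")
# Q8: ~1,000 GB, Q4: ~500 GB, Q2: ~250 GB
# -> only an aggressive ~2-3 bit quant squeezes under a single 512GB Mac Studio,
#    and a more comfortable ~4-bit quant needs two machines clustered.
```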
> Fast mode usage is billed directly to extra usage, even if you have remaining usage on your plan. This means fast mode tokens do not count against your plan’s included usage and are charged at the fast mode rate from the first token.
Although if you visit the Usage screen right now, there's a deal you can claim for $50 free extra usage this month.
I've started asking Claude to make me a high-level implementation plan and basically prompt ME to write it. For the most part I walk through it and ask Claude to just do it. But then 10% of the time there is a pretty major issue that I investigate, weigh pros/cons, and then decide to change course on.
Those things accumulate. Maybe 5-10 things over the course of an MVP that I wouldn't really have a clue about if I let Claude just dutifully implement its own plan.
- Long running autonomous agents and background tasks use regular processing.
- "Human in the loop" scenarios use fast mode.
Which makes perfect sense, but the question is - does the billing also make sense?
It'll be a Cadillac offering for whales. People who care about value will just run stuff in parallel.
We need to update terminology. Cadillac hasn't been the Cadillac of anything for decades now.
It should definitely be renamed to AINews instead of HackerNews, but Claude posts are a lot less frequent than OpenAI's.
The current "clear context and execute plan" would be great to be joined by a, "clear context, switch to regular speed mode, and execute plan".
I even think I would not require fast mode for the explore agents etc. They have so much to do that I accept it takes a while. Being able to rapidly iterate on the plan before setting it going would make it easier.
Please and thank you, Boris.
I'm currently testing Kimi K2.5 with the CLI; it works great and fast. It even comes with a web interface so you can communicate with your kimi-cli instance (even remotely, if you use a VPN).
Also wondering whether we’ll soon see separate “speed” vs “cleverness” pricing on other LLM providers too.
Mathematically it comes from the fact that this transformer block is this parallel algorithm. If you batch harder, increase parallelism, you can get higher tokens/s. But you get less throughput. Simultaneously there is also this dial that you can speculatively decode harder with fewer users.
It's true for basically all hardware and most models. You can draw this Pareto curve of throughput per GPU vs. tokens per second per stream: more tokens/s per stream, less total throughput.
See this graph for actual numbers:
[Chart: Token throughput per GPU vs. interactivity; gpt-oss 120B, FP4, 1K/8K input/output. Source: SemiAnalysis InferenceMAX™]
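A toy model of that dial (purely illustrative numbers, not anyone's actual serving config): aggregate tokens/s grows with batch size while per-stream tokens/s shrinks, because each decode step gets slower as the batch grows.

```python
# Toy Pareto curve: per-GPU throughput vs. per-stream interactivity.
# Illustrative numbers only; real curves depend on model, hardware, and kernels.
base_step_ms = 20.0   # hypothetical decode step time at batch size 1
per_req_ms = 1.5      # hypothetical extra step time per additional request in the batch

for batch in [1, 4, 16, 64, 256]:
    step_ms = base_step_ms + per_req_ms * (batch - 1)
    per_stream = 1000.0 / step_ms          # tokens/s seen by one user
    per_gpu = per_stream * batch           # aggregate tokens/s on the GPU
    print(f"batch={batch:4d}  per-stream={per_stream:6.1f} tok/s  per-GPU={per_gpu:8.1f} tok/s")
# Larger batches push per-GPU throughput up while each individual stream slows down.
```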
I think you skipped the word "total" before "throughput" there, right? Because tok/s is a measure of throughput, so it's clearer to say you increase throughput/user at the expense of throughput/GPU.
I’m not sure about the comment about speculative decode though. I haven’t served a frontier model but generally speculative decode I believe doesn’t help beyond a few tokens, so I’m not sure you can “speculatively decode harder” with fewer users.
H100 SXM: 3.35 TB/s HBM3
GB200: 8 TB/s HBM3e
2.4x faster memory (8 / 3.35 ≈ 2.4), which is exactly what they are saying the speedup is. I suspect they are just routing to GB200s (or TPU etc. equivalents).
FWIW, I did notice that Opus was _sometimes_ very fast recently. I put it down to a bug in Claude Code's token counting, but perhaps it was actually just occasionally getting routed to GB200s.
Regardless, they don't need to be using new hardware to get speedups like this. It's possible you just hit A/B testing and not newer hardware. I'd be surprised if they were using their latest hardware for inference tbh.
Why does this seem unlikely? I have no doubt they are optimizing all the time, including inference speed, but why could this particular lever not entirely be driven by skipping the queue? It's an easy way to generate more money.
When you add a job with high priority, all those chunks will be processed off the queue first by each and every GPU that frees up. It probably leads to more parallelism, but... it's the prioritization that led to this happening. It's better to think of this as prioritization of your job leading to the perf improvement.
Here's a good blog for anyone interested which talks about prioritization and job scheduling. It's not quite at the datacenter level, but the concepts are the same. Basically everything is thought of as a pipeline. All training jobs are low pri (they take months to complete in any case), customer requests are mid pri, and then there are options for high pri. Everything in an AI datacenter is thought of in terms of 'flow'. Are there any bottlenecks? Are the pipelines always full and the expensive hardware always 100% utilized? Are the queue backlogs big enough to ensure full utilization at every stage?
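A stripped-down sketch of that kind of prioritization (the tier names and jobs are hypothetical, just to illustrate why "fast mode" can largely be queue-jumping):

```python
# Minimal priority scheduler sketch: high-priority chunks are pulled off the
# queue before standard and training work. Tier values are made up.
import heapq
import itertools

PRIORITY = {"fast": 0, "standard": 1, "training": 2}  # lower value = served first
counter = itertools.count()  # tie-breaker keeps FIFO order within a tier
queue = []

def submit(tier, job):
    heapq.heappush(queue, (PRIORITY[tier], next(counter), job))

submit("training", "gradient step 42")
submit("standard", "user request A")
submit("fast", "fast-mode request B")
submit("standard", "user request C")

while queue:
    _, _, job = heapq.heappop(queue)
    print("GPU picks up:", job)
# fast-mode request B -> user request A -> user request C -> gradient step 42
```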
Amazon Bedrock has a similar feature called "priority tier": you get faster responses at 1.75x the price. And they explicitly say in the docs "priority requests receive preferential treatment in the processing queue, moving ahead of standard requests for faster responses".
> codex-5.2 is really amazing but using it from my personal and not work account over the weekend taught me some user empathy lol it’s a bit slow
Having higher inference speed would be an advantage, especially if you're trying to eat all the software and services.
Anthropic offering 2.5x makes me assume they have 5x or 10x themselves.
In the predicted nightmare future where everything happens via agents negotiating with agents, the side with the most compute, and the fastest compute, is going to steamroll everyone.
They said the 2.5X offering is what they've been using internally. Now they're offering via the API: https://x.com/claudeai/status/2020207322124132504
LLM APIs are tuned to handle a lot of parallel requests. In short, the overall token throughput is higher, but the individual requests are processed more slowly.
The scaling curves aren't that extreme, though. I doubt they could tune the knobs to get individual requests coming through at 10X the normal rate.
This likely comes from having some servers tuned for higher individual request throughput, at the expense of overall token throughput. It's possible that it's on some newer generation serving hardware, too.
It's also absolute highway robbery (or at least overly aggressive price discrimination) to charge 6x for speculative decoding, by the way. It is not that expensive and (under certain conditions, usually a very cheap drafter and a high acceptance rate) actually decreases total cost. In any case, it's unlikely to be even a 2x cost increase, let alone 6x.
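A rough cost model for that claim (standard speculative-decoding analysis with made-up cost ratios; alpha is the per-token acceptance rate and k the draft length, both assumptions):

```python
# Toy speculative-decoding cost model (illustrative numbers only).
# Drafter proposes k tokens; the target verifies them in one forward pass and,
# with i.i.d. acceptance probability alpha, emits on average
# (1 - alpha**(k+1)) / (1 - alpha) tokens per verification step.
def cost_per_token(alpha, k, draft_cost=0.05, target_cost=1.0):
    expected_tokens = (1 - alpha ** (k + 1)) / (1 - alpha)
    return (k * draft_cost + target_cost) / expected_tokens

print(f"no speculation:  {1.0:.2f}x target-forward cost per token")
print(f"alpha=0.8, k=4:  {cost_per_token(0.8, 4):.2f}x")
print(f"alpha=0.6, k=4:  {cost_per_token(0.6, 4):.2f}x")
# With a cheap drafter and decent acceptance, cost per token can drop below 1x,
# i.e. speculation can make generation cheaper, not 6x more expensive.
```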
This is such bizarre magical thinking, borderline conspiratorial.
There is no reason to believe any of the big AI players are serving anything less than the best trade off of stability and speed that they can possibly muster, especially when their cost ratios are so bad.
Just because you can't afford to 10x all your customers' inference doesn't mean you can't afford to 10x your inhouse inference.
And 2.5x is from Anthropic's latest offering. But it costs you 6x normal API pricing.
These companies aren't operating in a vacuum. Most of their users could change providers quickly if they started degrading their service.
The reason people don't jump to your conclusion here (and why you get downvoted) is that for anyone familiar with how this is orchestrated on the backend it's obvious that they don't need to do artificial slowdowns.
The deadline piece is really interesting. I suppose there’s a lot of people now who are basically limited by how fast their agents can run and on very aggressive timelines with funders breathing down their necks?
How would it not be a big unlock? If the answers were instant I could stay focused and iterate even faster instead of having a back-and-forth.
Right now even medium requests can take 1-2 minutes and significant work can take even longer. I can usually make some progress on a code review, read more docs, or do a tiny chunk of productive work but the constant context switching back and forth every 60s is draining.
current speeds are "ask it to do a thing and then you, the human, need to find something else to do for minutes (or more!) while it works". at a certain point of it being faster, you just sit there and tell it to do a thing, it does it, and you constantly work on the one thing.
cerebras is just about fast enough for that already, with the downside of being more expensive and worse at coding than claude code.
it feels like absolute magic to use though.
so, depends how you price your own context switches, really.
Obviously they can't make promises but I'd still like a rough indication of how much this might improve the speed of responses.
This angle might also be Nvidia's reason for buying Groq. People will pay a premium for faster tokens.
Some of this could be system overload, I suppose.
They recommend this in the announcement[1], but the way they suggest doing it is via a bogus /effort command that doesn't exist. See [2] for full details about thinking effort. It also recommends a bogus way to change effort by using the arrow keys when selecting a model, so don't use that either.
[1]: https://www.anthropic.com/news/claude-opus-4-6
[2]: https://code.claude.com/docs/en/model-config#adjust-effort-l...
Claude Code v2.1.37
EU region, Claude Max 20x plan
Mac -- Tahoe 26.2
Hoping to see several missing features land in the Linux release soon.
I'm also feeling weak and the pull of getting a Mac is stronger. But I also really don't like the neglect around being cross-platform. It's "cross-platform" except a bunch of crap doesn't work outside MacOS. This applies to Claude Code, Claude Desktop (MacOS and Windows only - no Linux or WSL support), Claude Cowork (MacOS only). OpenAI does the same crap - the new Codex desktop app is MacOS only. And now I'm ranting.
I have to Google the correct Anthropic documentation and pass that link to Claude Code, because Claude isn't able to find it reliably on its own in order to know how to use its own features.
Clearly, those whose job it is to "monitor" folks use this as their "tell" for whether someone AI-generated something. That's why every major LLM has this particular slop profile. It's infuriating.
I wrote a long-winded rant about this bullshit:
https://gist.github.com/Hellisotherpeople/71ba712f9f899adcb0...
Contrast with this: Anthropic actually asks you if you want their AI to remember details about you, and they have a lot of toggles around privacy. I don't care if they make money from extra tokens as long as they don't go the OpenAI route.
That's a gross mischaracterization of what the CFO said. She basically just said the pricing space is huge, and they've even explored things like royalty models.
I'm guessing you just saw a headline and read nothing into it.
If they find that this business model is most profitable for OpenAI, and that they can somehow release models better than any competitor, wouldn't they say they want royalties? That's what Unity (the game engine) does, so it wouldn't be unheard of.
This is the Deliveroo playbook of offering a 'premium' service that is really just the original service, with the original slowed down.
Same with speedy boarding for airlines. Now almost everyone pays for it so you don’t even get a benefit.
Sure. But for now, this is a competitive space. The competitors offer models at a decent quality*speed/price ratio and prevent Anthropic from going too far downhill.
Actually, as I think about it... I don't enjoy any other model as much as Opus 4.5 and 4.6. For me, this is no longer a competitive space. Anthropic are in full right to charge premium prices for their premium product.
Here the scarcity is real, and profits are nowhere to be seen
These schemes will soon fall apart entirely when an open weight model can run on Groq/Cerebras/SambaNova at even higher speeds and be just fine for all tasks. Arguably already the case, but not many know yet.
But if you just ask a question or something it’ll take a while to spend a million tokens…
I built ForkOff to solve this - when Claude needs approval, push notification to your phone, one-tap approve from anywhere. Turns out you don't actually need to be at your desk for most approvals.
The fast mode helps with speed, but even faster is letting the AI work while you're literally anywhere else. Early access: forkoff.app
(And yes, the pricing for fast mode is wild - $100 burned in 2 hours per some comments here!)
> I built ForkOff to solve this
This does sound useful, but I have to laugh because I just work out or play with my children. If you enjoy calisthenics and stretching, it's great to use Claude while not feeling chair-bound. Programming becomes more physical!
You know that if people pay for this en masse, it'll become the new default pricing, with fast being another step above.
Example: You're merging 3 branches, and there's a minor merge conflict.
You only need 15k tokens to fix it, so it's short duration. And it's bottlenecking. And it's a serial task.
This belongs on Cerebras or whatever.
Once fixed, go back to slower compute.
No different to paying a knowledge worker, but this time you are paying more to get them to respond to your questions more quickly.
Is the writing on the wall for $100-$200/mo users that it's basically known to be subsidized for now, and that $400/mo+ is coming sooner than we think?
Are they getting us all hooked and then going to raise it in the future, or will inference prices go down to offset?
What I expect to happen is that they'll slowly decrease the usage limits on the existing subscriptions over time, and introduce new, more expensive subscription tiers with more usage. There's a reason why AI subscriptions generally don't tell you exactly what the limits are, they're intended to be "flexible" to allow for this.
Is this wrong?
> Fast mode usage is billed directly to extra usage, even if you have remaining usage on your plan. This means fast mode tokens do not count against your plan’s included usage and are charged at the fast mode rate from the first token.
- Turn down the thinking token budget to one half
- Multiply the thinking tokens by 2 on the usage stats returned
- Phew! Twice the speed
IMO, charging for the thinking tokens that you can't see is a scam.
Opus fast mode is routed to different servers with different tuning that prioritizes individual response throughput. Same model served differently. Same response, just delivered faster.
The Gemini fast mode is a different model (most likely) with different levels of thinking applied. Very different response.
With fast mode you're literally skipping the queue. An outcome of all of this is that for the rest of us the responses will become slower the more people use this 'fast' option.
I do suspect they'll also soon have a slow option for those that have Claude doing things overnight with no real care for latency of the responses. The ultimate goal is pipelines of data hitting 100% hardware utilization at all times.
It requires a lot of bandwidth to do that; even at 400 Gbit/s it would take a good second to move even a smallish KV cache between racks, even in the same DC.
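Back-of-envelope (the cache sizes are assumptions, in the tens-of-GB range that's plausible for a long-context request on a frontier-scale model):

```python
# Rough transfer-time estimate for migrating a KV cache over the network.
# Cache sizes here are assumptions, not published numbers.
link_gbit = 400
link_gb_per_s = link_gbit / 8  # 50 GB/s, ignoring protocol overhead

for cache_gb in [10, 25, 50]:
    print(f"{cache_gb} GB KV cache over {link_gbit} Gbit/s: ~{cache_gb / link_gb_per_s:.2f} s")
# 10 GB: ~0.2 s, 25 GB: ~0.5 s, 50 GB: ~1.0 s -- long enough to matter mid-request.
```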
I'm not in favor of the ad model ChatGPT proposes. But business models like these suffer from similar traps.
If it works for them, then the logical next step is to convert more to use fast mode. Which naturally means to slow things down for those that didn’t pick/pay for fast mode.
We’ve seen it with iPhones being slowed down to make the newer model seem faster.
Not saying it’ll happen. I love Claude. But these business models almost always invite dark patterns in order to move the bottom line.
Edit: I just realized that's with the currently 50% discounted price! So you can pay 80 times as much!
Where everyone is forced to pay for a speed up because the ‘normal’ service just gets slower and slower.
I hope not. But I fear.
Smart business model decision, since most people and organizations prefer regular progress.
In the future this might be the reason enterprise software companies win: because they can use their customers' funds to pay for faster tokens and adaptations.
let's see where it goes.
Back to Gemini.
I need a way to put stuff in code with 150% certainty that no LLM will remove it.
Specifically so I can link requirements identifiers to locations in code, but there must be other uses too.
That's why gen 1-3 AI felt so smart. It was trained on the best curated human knowledge available. And now that that's done, it's just humanity's brain dumps left to learn from.
Two ways out: self-referential learning from gen 1-3 AIs, or paying experts to improve datasets rather than training on general human data. Inputs and outputs.
Many suspected a 2x premium for 10x faster. It looks like they may have been incorrect.
EDIT: I understand now. $30 for input, $150 for output. Very confusing wording. That’s insanely expensive!
Quite a premium for speed. Especially when Gemini 3 Pro is 1.8x the tokens/sec speed (of regular-speed Opus 4.6) at 0.45x the price [2]. Though it's worse at coding, and Gemini CLI doesn't have the agentic strength of Claude Code, yet.
[1]: https://x.com/claudeai/status/2020207322124132504
[2]: https://artificialanalysis.ai/leaderboards/models
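For scale, here's the arithmetic (assuming the standard Opus rate implied by the 6x figure, i.e. $5/$25 per million input/output tokens, against the quoted $30/$150 fast-mode rate; the session token counts are made up):

```python
# Rough cost of one agentic coding session at the quoted fast-mode rates.
# Token counts are hypothetical but typical of a couple of hours of agent work.
input_mtok, output_mtok = 2.0, 0.2   # 2M input tokens, 200k output tokens (assumed)

standard = input_mtok * 5 + output_mtok * 25      # assumed standard Opus pricing
fast = input_mtok * 30 + output_mtok * 150        # quoted fast-mode pricing
print(f"standard: ${standard:.0f}   fast mode: ${fast:.0f}")
# standard: $15   fast mode: $90 -- in line with the "$100 burned in 2 hours"
# reports elsewhere in the thread.
```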
Definitely an interesting way to encourage whales to spend a lot of money quickly.
You can use OpenCode instead of Gemini CLI.