Hacker News

103 points by hexagr 3 days ago | 26 comments

This is awesome!! I use Cursor and I've been trending towards medium thinking models as much as possible - I don't like the dev cadence with something like opus 4.7 (thinking: very high) (great for some tasks, like complex plans). Eventually I'd like to make my way to open models and open harness, and this tool or something like it could help me understand what performance I'd need for productive work - bookmarked!

antirez 24 minutes ago

Token/sec only makes sense once you tell me four things:

1. decoding t/s, that is, when the model is generating text in the autoregressive fashion.

2. prefill t/s, that is, prompt processing speed.

3. What is the slope of those two numbers as the context size increases? An implementation that decodes at 50t/s with 2k context but decodes at 7t/s at 100k context is going to be a lot less useful than it seems at first glance for a large number of real-world use cases.

4. What's your use case? Reading a huge text and then producing a small output like "fraud probability=12%"? Or reading a small question and generating a lot of text? Whether a model is usable at a given prefill/decoding speed depends substantially on this.

For instance my DS4F inference on the DGX Spark does prefill at 350 t/s and at 200 t/s on already large contexts. But decodes at 13 t/s.

On the Mac Ultra the prefill is like 400 t/s and decoding 35 t/s.

The two systems can perform dramatically differently or almost the same based on the use case. In general for local inference to be acceptable, even if slow, you want at least 100 t/s prefill, at least 10 t/s generation. To be ok-ish from 200 to 400 t/s prefill, 15-25 t/s generation. To be a wonderful experience thousands of t/s prefill, 100 t/s generation.
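A rough sketch of the point above, assuming the simplest possible model: total time is prompt tokens divided by prefill speed plus output tokens divided by decode speed. The function name and the two workloads are hypothetical; the rates are the ones quoted above for the DGX Spark and the Mac Ultra, and the sketch ignores the slowdown at larger contexts.

```python
def response_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Naive latency model: prefill time plus decode time, constant rates assumed."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps


# (prefill t/s, decode t/s) as quoted in the comment above.
SYSTEMS = {
    "DGX Spark": (350.0, 13.0),
    "Mac Ultra": (400.0, 35.0),
}

# Hypothetical workloads: (prompt tokens, output tokens).
WORKLOADS = {
    "read-heavy (100k in, 20 out)": (100_000, 20),
    "write-heavy (200 in, 2k out)": (200, 2_000),
}

for workload, (prompt_tokens, output_tokens) in WORKLOADS.items():
    for system, (prefill_tps, decode_tps) in SYSTEMS.items():
        secs = response_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps)
        print(f"{workload:30s} {system:10s} {secs:7.1f} s")
```

Under these assumptions the read-heavy case differs by roughly 15% between the two machines, while the write-heavy case differs by more than 2.5x.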

bjelkeman-again 2 hours ago

Interesting. It seems to me that with that speed (20-30) on local hardware the real issue is quality of output, not tokens per sec.

NitpickLawyer 2 hours ago

It really depends. With the new "thinking" models they usually spend some time before writing the final answer. If they "think" for 1k tokens, that's a minute of spinning wheel you're gonna see for each question. Add that to the prompt processing, and diminishing speeds as context increases, and it becomes really slow for longer sessions.
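As a back-of-the-envelope sketch of that wait, assuming constant speeds: the time before the final answer starts is prompt processing plus the hidden "thinking" tokens at decode speed. The function name, the 4k-token prompt, and the 200 t/s prefill rate are hypothetical; the 1k thinking tokens and the 20-30 t/s decode range come from the comments above.

```python
def seconds_until_answer(prompt_tokens, thinking_tokens, prefill_tps, decode_tps):
    """Time before the first token of the final answer appears (constant rates assumed)."""
    return prompt_tokens / prefill_tps + thinking_tokens / decode_tps


for decode_tps in (20, 30):
    wait = seconds_until_answer(
        prompt_tokens=4_000,    # hypothetical prompt size
        thinking_tokens=1_000,  # "think for 1k tokens"
        prefill_tps=200,        # hypothetical prefill speed
        decode_tps=decode_tps,
    )
    print(f"{decode_tps} t/s decode: ~{wait:.0f} s before the answer starts")
```

This does not model speeds dropping as the context grows, so a long session would be slower than these numbers suggest.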

emehrkay 21 minutes ago

I just looked up what my computer is capable of (m2 MacBook Air) and it says 15-35 tokens per second. I could live with that writing code with a local model.

ohadron 41 minutes ago

This is great. Agentic coding at 600+ tokens/sec is going to be a radically different beast. Coming soon-ish?

black_knight 2 minutes ago

People seem to use these tools very differently from each other. I value intelligence over speed any day. My programs are written in Haskell, so there are rarely any tasks which require thousands and thousands of lines to solve. Just intelligence. If there are rote tasks, I want the LLM to help me find intelligent ways of automating them: the right abstraction, the right meta-programming technique.

I constantly push Opus and GPT, and they are getting better. But I still have to do the hardest parts myself. I would not mind waiting 10-15 minutes for the right 20 lines of code!

dkersten 31 minutes ago

For small enough tasks with tight enough workflows, you can have it right now. I.e., if you can constrain the task to work well with GPT OSS 120B/llama 3.3/qwen 3, then you can get upwards of 600 TPS on Groq and up to 3k TPS on Cerebras.

Those models aren’t comparable to Opus, or even weaker models like MiniMax, but for certain tasks (focused context and prompts, strict workflows, single purpose requests) you absolutely can use these models and get insane speeds.

c7b 13 minutes ago

Do you have ideas/suggestions for agentic workflows that only start making sense at such speeds?

8note 26 minutes ago

i really want a qwen on one of these chips: https://chatjimmy.ai

15k tokens/s would get me feeling like it's actually worth splitting out worktrees to try several approaches to a problem

Cerium 19 minutes ago

Why is that? It seems like it should go the other direction. I want to be sure I can complete a task in a certain amount of wall clock time. If the tokens per second are slow, then I am risking more by running a single approach at a time, and I have an incentive to multiplex my attention between separate work-streams. If the generation is fast enough to occupy my attention, then there is nothing more to gain from parallel threads.

philipp-gayret 33 minutes ago

If you have a Cerebras Code subscription you can experience it right now. Indeed, a very different experience.

KronisLV 25 minutes ago

Used them for a while! They didn't seem to have prompt caching, so I burnt through the daily 24M token limit really quickly when doing large-scale changes on a codebase (essentially a team's worth of menial migration/refactoring work). A lot of it was okay, but plenty had to be redone and I still spotted some issues months down the line. In part I blame their model catalogue, which did get an update to GLM 4.7 sometime way back but is definitely showing its age: https://inference-docs.cerebras.ai/models/overview

Quality wise, Anthropic gives me the best results (Opus for almost everything; I have sub-agents with fresh context review its work, and after 2-10 loops that usually finds most issues). Token amount wise for agentic work, DeepSeek V4 is up there. What Cerebras is doing is pretty cool though, apparently they even have prompt caching now like the other big providers: https://inference-docs.cerebras.ai/capabilities/prompt-cachi... At the same time, producing bad code faster was annoying in a uniquely new way.

Wish they'd update the models with their subscription, it could genuinely be great with the proper harness. Like if they can run GLM 4.7, surely they could at least get DeepSeek V4 Flash with a big context window going as a starting point. How can you have so much money to make your own chips, but can't run modern models that you can get for free? It's like they don't want people to use their subscription.

dkersten 27 minutes ago

It’s GLM 4.7, GPT OSS 120B, or llama 3.1 8B so not exactly the latest or best models.

But GLM is good enough for many small tasks, certainly enough to get a taste for Cerebras’ high speeds!

[edit: actually that’s just their general models, I can’t see what Cerebras code offers. It was Qwen-coder when it launched but I don’t know what it is now. I think GLM 4.7 but I’m not completely sure]

tantalor 4 minutes ago

> Now switch between c and t at the same rate. The difference is striking — and intentional.

I don't see a big difference.

aurareturn 13 minutes ago

We truly are in the dial up era of GenAI.

johng 3 days ago

Neat website, the visualization is great. I had a hard time wrapping my head around the tokens/s thing but this made it easy.

raverbashing 53 minutes ago

On avg 1 token = 4 chars

So 75 tokens/s is ~ 300 chars per second which is the speed you'd get with a 2400 baud modem
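The arithmetic above, as a small sketch. The function name is hypothetical, and the 8 bits per character is an assumption that ignores serial framing; with one start and one stop bit per byte (10 bits on the wire) the equivalent line rate comes out higher.

```python
def modem_bits_per_second(tokens_per_second, chars_per_token=4, bits_per_char=8):
    """Convert a token rate into an equivalent serial line rate in bits per second."""
    chars_per_second = tokens_per_second * chars_per_token
    return chars_per_second * bits_per_char


print(modem_bits_per_second(75))                    # 75 tok/s * 4 chars * 8 bits
print(modem_bits_per_second(75, bits_per_char=10))  # same, with start/stop bits counted
```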

dfollent 3 days ago

Neat visual. 5 tok/s is still faster than me!

himata4113 2 hours ago

I had the opposite reaction, 5tok/s is so slow that when you include all the reasoning and thinking + warmup it is far slower than me.

warmwaffles 28 minutes ago

The sweet spot for being just fast enough to not irritate you is 10tok/s. Still slow but faster than you can sustain at typing and thinking. Just interesting to observe.

zurfer 35 minutes ago

yeah 3t/s seems human. only that i never wrote code perfectly top to bottom.

Eswo 46 minutes ago

super cool, thanks

dbalatero 2 hours ago

This is cool, thanks for making it.

dario-dentes 3 days ago

Thank you for this great utility. I love the "gut feel" calibration utilities like this one!

victorbjorklund 44 minutes ago

This is great.
