How fast is N tokens per second really?
103 points by hexagr 3 days ago | 26 comments

adampzakaria 2 minutes ago
This is awesome!! I use Cursor and I've been trending towards medium-thinking models as much as possible. I don't like the dev cadence with something like Opus 4.7 (thinking: very high), though it's great for some tasks, like complex plans. Eventually I'd like to make my way to open models and an open harness, and this tool or something like it could help me understand what performance I'd need for productive work. Bookmarked!
reply
antirez 24 minutes ago
Token/sec only makes sense once you tell me four things:

1. decoding t/s, that is, when the model is generating text in the autoregressive fashion.

2. prefill t/s, that is, prompt processing speed.

3. The slope of those two numbers as the context size increases. An implementation that decodes at 50 t/s with 2k context but at 7 t/s with 100k context is going to be a lot less useful than it seems at first glance for a large number of real-world use cases.

4. What's your use case? Reading a huge text and then producing a small output like "fraud probability=12%"? Or reading a small question and generating a lot of text? Whether a model is usable changes substantially depending on how its prefill and decoding speeds match the use case.

For instance, my DS4F inference on the DGX Spark does prefill at 350 t/s, dropping to 200 t/s on already large contexts, but decodes at 13 t/s.

On the Mac Ultra the prefill is like 400 t/s and decoding 35 t/s.

The two systems can perform dramatically differently or almost the same depending on the use case. In general, for local inference to be acceptable, even if slow, you want at least 100 t/s prefill and at least 10 t/s generation. To be ok-ish, 200 to 400 t/s prefill and 15-25 t/s generation. To be a wonderful experience, thousands of t/s prefill and 100 t/s generation.
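
A rough sketch of how these numbers combine, assuming total wall-clock time is just prompt tokens divided by prefill speed plus output tokens divided by decode speed. It ignores the slowdown with growing context from point 3, so the long-prompt case is optimistic. The function name and the two workloads are made up for illustration; the speeds are the ones quoted above.

```python
def response_seconds(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Estimate wall-clock time as prefill phase plus decode phase."""
    return prompt_tokens / prefill_tps + output_tokens / decode_tps

# Speeds quoted above: (prefill t/s, decode t/s).
systems = {
    "DGX Spark": (350, 13),
    "Mac Ultra": (400, 35),
}

# Hypothetical workloads: (prompt tokens, output tokens).
workloads = {
    "long read, short answer": (50_000, 20),
    "short question, long answer": (200, 2_000),
}

for wname, (prompt, output) in workloads.items():
    for sname, (prefill, decode) in systems.items():
        secs = response_seconds(prompt, output, prefill, decode)
        print(f"{wname:28s} {sname:10s} {secs:7.1f} s")
```

Under these assumptions the long-read workload takes roughly 144 s vs 126 s (almost the same), while the long-answer workload takes roughly 154 s vs 58 s (dramatically different).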

reply
bjelkeman-again 2 hours ago
Interesting. It seems to me that with that speed (20-30) on local hardware the real issue is quality of output, not tokens per sec.
reply
NitpickLawyer 2 hours ago
It really depends. With the new "thinking" models they usually spend some time before writing the final answer. If they "think" for 1k tokens, that's a minute of spinning wheel you're gonna see for each question. Add that to the prompt processing, and diminishing speeds as context increases, and it becomes really slow for longer sessions.
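
As a back-of-the-envelope check, assuming the thinking tokens are decoded at the same rate as the visible answer and nothing else is waiting (the function name is hypothetical):

```python
def thinking_wait_seconds(thinking_tokens, decode_tps):
    """Seconds spent generating hidden reasoning before the answer starts."""
    return thinking_tokens / decode_tps

# 1k thinking tokens at the 20-30 t/s range mentioned in the parent comment.
for tps in (20, 30):
    print(f"{tps} t/s -> {thinking_wait_seconds(1_000, tps):.0f} s before the answer starts")
```

That works out to about 50 s at 20 t/s and about 33 s at 30 t/s, before counting prompt processing.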
reply
emehrkay 21 minutes ago
I just looked up what my computer is capable of (M2 MacBook Air) and it says 15-35 tokens per second. I could live with that for writing code with a local model.
reply
ohadron 41 minutes ago
This is great. Agentic coding at 600+ tokens/sec is going to be a radically different beast. Coming soon-ish?
reply
black_knight 2 minutes ago
People seem to use these tools very differently from each other. I value intelligence over speed any day. My programs are written in Haskell, so there are rarely any tasks which require thousands and thousands of lines to solve. Just intelligence. If there are rote tasks, I want the LLM to help me find intelligent ways of automating it: the right abstraction, the right meta-programming technique.

I constantly push Opus and GPT, and they are getting better. But still have to do the hardest parts myself. I would not mind waiting 10-15 minutes for the right 20 lines of code!

reply
dkersten 31 minutes ago
For small enough tasks with tight enough workflows, you can have it right now. I.e., if you can constrain the task to work well with GPT OSS 120B/llama 3.3/qwen 3, then you can get upwards of 600 TPS on Groq and up to 3k TPS on Cerebras.

Those models aren’t comparable to Opus, or even weaker models like MiniMax, but for certain tasks (focused context and prompts, strict workflows, single purpose requests) you absolutely can use these models and get insane speeds.

reply
c7b 13 minutes ago
Do you have ideas/suggestions for agentic workflows that only start making sense at such speeds?
reply
8note 26 minutes ago
i really want a qwen on one of these chips: https://chatjimmy.ai

15k tokens/s would get me feeling like it's actually worth splitting out worktrees to try several approaches to a problem

reply
Cerium 19 minutes ago
Why is that? To me it seems to go the other direction. I want to be sure I can complete a task in a certain amount of wall clock time. If the tokens per second are slow, then I am risking more by running a single approach at a time, and I have an incentive to multiplex my attention between separate work-streams. If the generation is fast enough to occupy my attention, then there is no further improvement available from having parallel threads.
reply
philipp-gayret 33 minutes ago
If you have a Cerebras Code subscription you can experience it right now. Indeed, a very different experience.
reply
KronisLV 25 minutes ago
Used them for a while! They didn't seem to have prompt caching, so I burnt through the daily 24M token limit really quickly when doing large scale changes on a codebase (essentially a team's worth of menial migration/refactoring work). A lot of it was okay, but plenty had to be re-done and I still spotted some issues months down the line. In part I blame their model catalogue, which did get an update to GLM 4.7 sometime way back but is definitely showing its age: https://inference-docs.cerebras.ai/models/overview

Quality wise, Anthropic gives me the best results (Opus for almost everything; I have sub-agents with fresh context review its work, and after 2-10 loops that usually finds most issues). Token amount wise for agentic work, DeepSeek V4 is up there. What Cerebras is doing is pretty cool though, apparently they even have prompt caching now like the other big providers: https://inference-docs.cerebras.ai/capabilities/prompt-cachi... At the same time, producing bad code faster was annoying in a uniquely new way.

Wish they'd update the models with their subscription, it could genuinely be great with the proper harness. Like if they can run GLM 4.7, surely they could at least get DeepSeek V4 Flash with a big context window going as a starting point. How can you have so much money to make your own chips, but can't run modern models that you can get for free? It's like they don't want people to use their subscription.

reply
dkersten 27 minutes ago
It’s GLM 4.7, GPT OSS 120B, or llama 3.1 8B so not exactly the latest or best models.

But GLM is good enough for many small tasks, certainly enough to get a taste for Cerebras’ high speeds!

[edit: actually that’s just their general models, I can’t see what Cerebras code offers. It was Qwen-coder when it launched but I don’t know what it is now. I think GLM 4.7 but I’m not completely sure]

reply
tantalor 4 minutes ago
> Now switch between c and t at the same rate. The difference is striking — and intentional.

I don't see a big difference.

reply
aurareturn 13 minutes ago
We truly are in the dial up era of GenAI.
reply
johng 3 days ago
Neat website, the visualization is great. I had a hard time wrapping my head around the tokens/s thing but this made it easy.
reply
raverbashing 53 minutes ago
On avg 1 token = 4 chars

So 75 tokens/s is ~ 300 chars per second which is the speed you'd get with a 2400 baud modem
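
The arithmetic, as a small sketch. It assumes the 4 chars per token average from above, plain 8-bit characters with no start/stop bits, and treats "baud" as bits per second the way the comment does. With 10 bits per character on the wire (8N1 framing), a 2400 bit/s line would carry 240 chars/s instead.

```python
CHARS_PER_TOKEN = 4  # average quoted in the comment above
BITS_PER_CHAR = 8    # assumption: 8-bit characters, no framing overhead

def tokens_to_chars_per_second(tokens_per_second):
    """Convert a token rate into an approximate character rate."""
    return tokens_per_second * CHARS_PER_TOKEN

def tokens_to_bits_per_second(tokens_per_second):
    """Convert a token rate into an approximate line rate in bit/s."""
    return tokens_to_chars_per_second(tokens_per_second) * BITS_PER_CHAR

print(tokens_to_chars_per_second(75))  # 300
print(tokens_to_bits_per_second(75))   # 2400
```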

reply
dfollent 3 days ago
Neat visual. 5 tok/s is still faster than me!
reply
himata4113 2 hours ago
I had the opposite reaction: 5 tok/s is so slow that when you include all the reasoning and thinking + warmup, it is far slower than me.
reply
warmwaffles 28 minutes ago
The sweet spot for being just fast enough to not irritate you is 10 tok/s. Still slow, but faster than you can sustain while typing and thinking. Just interesting to observe.
reply
zurfer 35 minutes ago
yeah 3t/s seems human. only that i never wrote code perfectly top to bottom.
reply
Eswo 46 minutes ago
super cool, thanks
reply
dbalatero 2 hours ago
This is cool, thanks for making it.
reply
dario-dentes 3 days ago
Thank you for this great utility. I love the "gut feel" calibration utilities like this one!
reply
victorbjorklund 44 minutes ago
This is great.
reply
tuo-lei 15 minutes ago
[flagged]
reply