Claude Token Counter, now with model comparisons
26 points by twapi 5 hours ago | 7 comments
great_psy 28 minutes ago
Is there any reason given by Anthropic for why they changed the tokenizer? Is there a quality increase from this change, or is it a money grab?
tomglynch 2 hours ago
Interesting findings. Might need a way to downsample images on upload to keep costs down.
mudkipdev 2 hours ago
Why do you need an API key to tokenize the text? Isn't it supposed to be a cheap step that everything else in the model relies on?
kouteiheika 16 minutes ago
I'd guess it's because they don't want people to reverse engineer it.
Note that they're the only provider which doesn't make their tokenizer available offline as a library (i.e. the only provider whose tokenizer is secret).
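For context, Anthropic does expose token counting, but only as a server-side endpoint (`POST /v1/messages/count_tokens` in the Messages API), which is consistent with the tokenizer itself staying private. Here's a minimal sketch of the request that endpoint expects - the payload is only built, never sent, and the model id is just an example:

```python
import json

# Anthropic's token-counting endpoint (Messages API). Counting happens
# server-side, which is why an API key is required even though nothing
# is generated. Nothing is sent here; we just build the request pieces.
ENDPOINT = "https://api.anthropic.com/v1/messages/count_tokens"

def build_count_tokens_request(model: str, text: str) -> dict:
    return {
        "url": ENDPOINT,
        "headers": {
            "x-api-key": "<YOUR_API_KEY>",      # required even just to count
            "anthropic-version": "2023-06-01",  # API version header
            "content-type": "application/json",
        },
        "body": json.dumps({
            "model": model,
            "messages": [{"role": "user", "content": text}],
        }),
    }

req = build_count_tokens_request("claude-sonnet-4-5", "Hello, tokenizer!")
print(req["url"])
```

The response is a small JSON object with an `input_tokens` count; there's no offline equivalent the way tiktoken works for OpenAI models.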
Interesting. Unfortunately Anthropic doesn't actually share their tokenizer, but my educated guess is that they might have made the tokenizer more semantically aware to make the model perform better. What do I mean by that? Let me give you an example. (This isn't necessarily what they did exactly; just illustrating the idea.)
Let's take the gpt-oss-120b tokenizer as an example. Here's how a few pieces of text tokenize (I use "|" here to separate tokens):

    "Kill"   -> Kill
    "kill"   -> kill
    " kill"  -> <space>kill
    "killed" -> killed

You have 3 different tokens which encode the same word (Kill, kill, <space>kill) depending on its capitalization and whether there's a space before it, you have separate tokens if it's in the past tense, etc. This is not necessarily an ideal way of encoding text, because the model must learn by brute force that these tokens are, indeed, related. Now, imagine if you'd encode these like this:

    "Kill"   -> <capitalize>|kill
    "kill"   -> kill
    " kill"  -> <space>|kill
    "killed" -> kill|ed
Notice that this makes much more sense now - the model only has to learn what "<capitalize>" is, what "kill" is, what "<space>" is, and what "ed" (the past tense suffix) is, and it can compose those together. The downside is that it increases the token usage. So I wouldn't be surprised if this is what they did. Or, my guess #2: they removed the tokenizer altogether, replaced it with a small trained model (something like the Byte Latent Transformer), and simply "emulate" the token counts.
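To make the contrast concrete, here's a toy sketch in Python - my own illustration of the idea above, not the actual gpt-oss or Claude tokenizer; the marker names and the suffix rule are made up:

```python
def encode_surface(word: str) -> list[str]:
    # Surface-form scheme: each variant ("Kill", "kill", " kill", "killed")
    # is its own opaque token, so the vocabulary grows with every variant
    # and the model must learn their relatedness by brute force.
    return [word]

def encode_decomposed(word: str) -> list[str]:
    # Decomposed scheme: pull the leading space and the capitalization out
    # into marker tokens, and split the past-tense "ed" off the stem.
    # (Toy rule: a real tokenizer would need morphology, not endswith().)
    tokens = []
    if word.startswith(" "):
        tokens.append("<space>")
        word = word[1:]
    if word[:1].isupper():
        tokens.append("<capitalize>")
        word = word[0].lower() + word[1:]
    if word.endswith("ed"):
        tokens.extend([word[:-2], "ed"])
    else:
        tokens.append(word)
    return tokens

variants = ["Kill", "kill", " kill", "killed", " Killed"]
surface_vocab = {t for w in variants for t in encode_surface(w)}
decomposed_vocab = {t for w in variants for t in encode_decomposed(w)}
for w in variants:
    print(repr(w), "->", "|".join(encode_decomposed(w)))
print(len(surface_vocab), "surface tokens vs", len(decomposed_vocab), "composable tokens")
```

Every variant reuses the same "kill" token in the decomposed scheme, so the vocabulary stops growing as you add variants - at the cost of spending more tokens per word, which matches the higher counts people are seeing.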