"We're also launching GPT‑5.6 Sol on Cerebras at up to 750 tokens per second in July, bringing frontier intelligence to customers at unprecedented speed. Access will initially be limited to select customers as we expand capacity."
750 tokens/s on a frontier model is going to be extremely interesting. I doubt this new version is anything but a version bump in terms of capabilities but if we can start getting these answers back faster, they end up being more useful.
Just off the top of my head, I can think of the tedious task of finding certain functionality within a codebase. I usually can't beat an AI agent harness at this task today. If the AI model is 3x faster I have less of a chance.
This is what 750tps looks like, I guess.
At least that site should draw out a full page then start replacing that page with the next, starting from the top and working downwards, repeating each time it hits the bottom.
I'm not sure if that's what you were going for, but I read it as if it were written by The Board in the game Control, and found myself with the appropriate level of existential dread.
Anomie is at an all time high right now.
As someone else already contributed, this is driven by a Canadian startup, Taalas, that basically makes chips that are LLMs, so everything is very fast but also baked into the chip. Once this kind of stuff is a commodity in like 10 years, our world will be very, very different.
This chip would still be 272mm2 on N2 which is an eye-watering $30k/wafer and bigger than a 9950x or Nvidia 5070.
This just isn't feasible. Some of the latest-gen LLMs seem to have 5-10T parameters or about 1000x more. I don't know that taping out just one chip makes economic sense let alone the 300-1000 chips required for a cutting-edge model. Things like continuing education so your model knows about the latest NPM packages or world news is super important, but seems like it would require new chips.
There are a TON of uses for 8B parameter models on the edge, but this is WAY too big to put on the edge of anything. Something like a 10mm2 100M parameter voice model might be feasible on the edge, but only for expensive devices, but most of those are TSMC 28nm (up to 29MTr/mm2) or GF 22FDX (up to 40MTr/mm2) which would increase the AI chip to the point where it would absolutely dominate the BOM.
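To put rough numbers on the die-size argument above, here is a back-of-the-envelope sketch. The 272mm2 die and $30k wafer come from the comment; the 300 mm wafer, the one-chip-per-8B-parameters figure, and the linear scaling of chips with parameter count are my assumptions, and yield and edge loss are ignored entirely.

```python
import math

WAFER_DIAMETER_MM = 300.0   # assumption: standard 300 mm wafer
DIE_AREA_MM2 = 272.0        # from the comment above
WAFER_COST_USD = 30_000.0   # from the comment above
PARAMS_PER_CHIP = 8e9       # assumption: one chip holds an 8B-parameter model


def gross_dies_per_wafer(diameter_mm: float, die_area_mm2: float) -> int:
    """Upper bound on dies per wafer: wafer area / die area, no edge loss, no yield."""
    wafer_area = math.pi * (diameter_mm / 2) ** 2
    return int(wafer_area // die_area_mm2)


def chips_needed(model_params: float, params_per_chip: float = PARAMS_PER_CHIP) -> int:
    """Chips required if weights scale linearly with parameter count (an assumption)."""
    return math.ceil(model_params / params_per_chip)


if __name__ == "__main__":
    dies = gross_dies_per_wafer(WAFER_DIAMETER_MM, DIE_AREA_MM2)
    print(f"gross dies per wafer: {dies}")
    print(f"silicon cost per die (lower bound): ${WAFER_COST_USD / dies:.2f}")
    for params in (5e12, 10e12):
        print(f"{params:.0e} params -> {chips_needed(params)} chips")
```

Under those assumptions you get about 259 gross dies per wafer, so at least roughly $116 of silicon per die before yield, packaging, and the tape-out itself, and somewhere between 625 and 1250 chips for a 5-10T parameter model.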
They probably have a few ideas around that. Me, personally, I'd have one main expensive chip (replaced every 10 years, or whatever), with a secondary cheap chip in front of it that gets replaced every year or so.
The secondary chip could act the way RAG does, or perhaps both chips together can act as LoRA.
Either way, 99.999% of the knowledge is static, you just need to fine-tune the weights with that remaining 0.001% knowledge, which can be done using RAG or LoRA on a much smaller (thus cheaper) disposable chip.
It is not market viable but it is sure as heck revolutionary. Like an atomic bomb but including more… peaceful uses.
That’s exactly where government should take the reins, like with the ISS etc. However, the models are advancing too rapidly for now for it to make sense.
Previous HN discussion: https://news.ycombinator.com/item?id=47103661
750 tokens/s for their largest model is going to be nuts
Their architecture has fundamental speed and efficiency advantages over GPUs or Cerebras. They expect to scale up to real LLMs by splitting a model layer-wise across several chips, which they can do without incurring any throughput penalty.
I’ll patiently wait to see this in reality. Their demonstration hardware is a 250W chip that is enormous in die area for the model size. They’re making a lot of claims, but until they can deliver then it’s nearly vaporware in my view.
I’d be happy to be proven wrong, but I think they’re going to quickly run into hardware realities quite soon if they think they can just chain a bunch of chips together to achieve the same performance on larger sizes.
The simple fact that we think what we have now is scalable is basically what you are saying can't be done: " just chain a bunch of chips together to achieve the same performance on larger sizes". How do you think current architectures work? And what is being used today is all proprietary to one company!
You don't need LLMs for everything. That is 100% the point. You can burn down the world with all of your frontier LLMs that are being used for simple queries OR we can do something faster and more efficient like this. Just because you can run a SotA model at "fast" speeds, again, severely misses the point.
And no, you can't run anything from Anthropic or OAI on-prem, so until you can there's really no comparison. If people want to continue down the path of gate-kept models with no other options then we'll all follow you off the cliff.
I think it’s impressive that a frontier model can achieve 750t/s. That’s all. You can get similar insane token speeds from other open weight models too.
You seem to be cool with a very small and gated ecosystem with whatever tech billionaires want you to have access to.
I grew up in the era where compute was diverse and open. You may think this is OK, but it's not. The more options we have and the more diversified they are the better tech will move back towards.
I'm not the one with the myopic view here. Enjoy your "on-device" models over in your utopia of a walled garden.
Once again, my statement is that the Taalas product is not a fair comparison because it runs an old outdated model. If you want to run a similar model at similar speeds (albeit not serially, but in parallel) you don’t need their product.
Either you didn't look at the page I linked or you're having comprehension problems.
> If you want to run a similar model at similar speeds (albeit not serially, but in parallel) you don’t need their product.
Except, you can't. There's no commodity hardware out there today that can run even an "old outdated model" at this speed and power utilization. Again, maybe read first and try to understand my original point?
> "...my statement is that the Taalas product is not a fair comparison..."
You actually hadn't stated this. You said it wasn't needed. Which is it?
> If you want to run a similar model at similar speeds...
You can't. Find me a single system that can run this, again, "old outdated model" at even similar speed. You're hung up on the model. The point is that if we all just stay in this wonderful world of inefficient large models we will all end up at the mercy of OAI, Anthropic, Google, etc. When other companies, like Taalas are putting research dollars in to making AI scalable, affordable and efficient. Do you really think commodity hardware is going to be attainable anytime in the near future on this trajectory? Do you need a laptop to cost $10k USD before it clicks? That is exactly how you end up kissing Altman's ass in this situation.
I asked it something simple, list some good indie puzzle games, and half the answers are games that don't exist. Imo quality > speed.
And even if humans/LLMs do it there would still be a need for systems of record with things like audit logs etc.
I've always eyed Cerebras but never had a use for it that would justify paying for the API directly. Although now that I think about it, trying out the API would probably cost less than a subscription for a month...
If you have a subscription it's a different pool of usage.
I remember someone who literally announced they were dropping the class to the whole room at the end of a lecture, saying "This isn't AI!!!"
Not to say a speed boost isn't there, but if they didn't increase tokens/s at all you'd likely see things slow down a lot with the new model compared to the current one.
Yup, I remember "racing" the AIs to figure things out in codebases just a year ago. Today, I have no chance. Whether it is due to degraded reasoning capabilities on my part or better models, I don't know.
[1] Not AI codebases (and of course, AI code bases I guess)
GPT‑5.3‑Codex‑Spark currently runs on Cerebras chips and it's giving me around 150t/s. Still relatively very fast, but nowhere near the 1,000t/s they claimed at launch. (Also it's not a very good model.)
That said, I'm super bought in to faster models being better for most use cases than smarter models.
https://www.bilibili.com/video/BV1fME16uEW7
If the time-to-first-token latency also greatly improved, this could be very useful for end-to-end in controls, like autonomous driving for example.
From an information theory perspective we are still in dial-up territory with regard to the actual information rate. 750 tokens per second would be a really bad dialup connection. Imagine 10 millions tokens per second.
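A quick sanity check of the dial-up comparison, as a sketch: if each token is one symbol out of a fixed vocabulary, the raw rate is at most tokens/s times log2(vocabulary size). The 200,000-entry vocabulary is my assumption, not a published figure for this model, and the real information content per token is lower than this upper bound.

```python
import math


def token_stream_bits_per_second(tokens_per_second: float, vocab_size: int) -> float:
    """Upper bound on raw bit rate: each token is one of vocab_size symbols."""
    return tokens_per_second * math.log2(vocab_size)


VOCAB_SIZE = 200_000  # assumption: ballpark size for a modern tokenizer

if __name__ == "__main__":
    for tps in (750, 10_000_000):
        rate = token_stream_bits_per_second(tps, VOCAB_SIZE)
        print(f"{tps} tokens/s is at most {rate / 1000:.1f} kbit/s")
```

With that vocabulary, 750 tokens/s comes out around 13 kbit/s, which is indeed below a 56k modem, while 10 million tokens/s would be on the order of 176 Mbit/s.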
The new Blackwell hardware combined with TensorRT-LLM and speculative decoding can consistently hit the 1,000 TPS/user barrier, compared to closer to ~250 TPS/user (out of 10k+ TPS on the server)
Is there something I missed, this looks more like 14.4 to 56 on a 64kbps backing channel modem story. I have no doubt that there are still massive gains to be found, but they seem to be using existing constraints more efficiently, not that fios is coming.
I don’t have the budget to work at foundation-model scale, but with a draft model 10x–20x faster than the target and a 60-80% acceptance rate I can see how they could promise 750 TPS (with a lot of other hard work), but I would appreciate pointers on where I should look to figure out what I am missing.
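For anyone who wants to play with those draft-model numbers, here is a minimal sketch of the commonly cited expected-speedup estimate for speculative decoding. It assumes every drafted token is accepted independently with probability alpha, that the draft model proposes gamma tokens per step, and that c is the draft model's cost per token relative to the target. The function name and the example values are mine, and nothing here says this is how OpenAI or Cerebras actually get to 750 TPS.

```python
def expected_speedup(alpha: float, gamma: int, c: float) -> float:
    """Idealised wall-clock speedup of speculative decoding over plain decoding.

    alpha: per-token acceptance probability (assumed i.i.d.), 0 <= alpha < 1
    gamma: number of tokens the draft model proposes per step
    c:     draft cost per token as a fraction of target cost (0.1 = 10x faster)
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    if gamma < 1 or c < 0:
        raise ValueError("gamma must be >= 1 and c must be >= 0")
    # Expected tokens produced per target-model pass (geometric series).
    tokens_per_step = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    # Cost of one step: gamma draft passes plus one target pass.
    cost_per_step = gamma * c + 1
    return tokens_per_step / cost_per_step


if __name__ == "__main__":
    for alpha in (0.6, 0.8):
        for c in (0.1, 0.05):  # draft 10x and 20x faster than target
            print(f"alpha={alpha}, c={c}: {expected_speedup(alpha, 5, c):.2f}x")
```

With gamma fixed at 5, this idealised model gives somewhere between roughly 1.6x and 3x across that range, so speculative decoding alone gets you part of the way from a ~250 TPS/user baseline but probably not all of it.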
But I could imagine after each space(eg, word) having a 27b model on a nice rig, with thinking off, doing a quick look at the sentence and determine if it should interrupt and start a real turn with thinking on. Which kind of is non-turn based in a way. If you're typing fast, it might hit that run every 3 or 4 words, but that's sort of how a human might be when a person is talking to them. That is, waiting for enough info to interrupt, if needed.
There might be a way to process chunks of a sentence using commas as break points, e.g. for comma-delimited phrases in sentences, so the whole sentence doesn't need to be re-processed at each "should I break in" assessment at a word break.
Could be fascinating. Could actually do some of this right now.
I don't think this is what the parent poster was thinking, but the idea even at this level seems fun.
The idea of true continuous thought and memory-generation is very interesting, though I can't even begin to conceive of how it would work.
Or if it's even correct? Maybe our brains are secretly actually turn based too?
But we have multiple things vying for attention, and some are immediate. Being on the phone talking to someone with great attention, and then touching a burning surface -- you immediately pull your hand back (lizard brain) before even being aware you're doing it. The same with peripheral vision and something surprising coming at you from the side. It snags your attention.
So maybe we are turn-ish based, but just multiple parallel processes each with their own turn? Neurons have their own 'trigger', and I think the brain has layers of triggers, each aggregating and filtering up to the top which then triggers.
I think doing this all with an LLM is silly, some of it should be innate, such as peripheral vision. Data handed to the main thread when triggers occur. I wouldn't want an LLM to handle "walking" fully either.
Some octopuses have a sub-brain in each tentacle, each thinking and feeling; there are serious questions as to what their mind is like. I feel initial LLM powered androids may have to be like this a bit.
Your ball throwing example however will be handled by really small and really fast "fine tuned agents" dedicated to catching that ball. Eyes to motor neuron system. There are illusion-of-free-will experiments that demonstrate your brain only rationalises and explains whatever activity took place after the fact (its explanation may even be entirely wrong).
Do you feel most of the speed upgrade will come from the software or hardware side?
Imagine a world where there is no code, just things mildly handshaking and then creating data APIs on the fly. Where communication is fuzzy and locked in on an individual basis. No years of RFCs, no RFCs at all, just... data.
Just data, man.
An API arbitration aberratically assigned at authorized access, abridged and annotated, analytically assuring absolute assurance.
In some circumstances there is no substitute for something that you know will produce the same answer for a given input, consistently. And that's before even considering the watts per response.
Think of short and long term memory, or think of RAM vs SWAP. Dip into swap to pull needed data into RAM context. SWAP can be anything storage related, including a symbolic database or a best-encoded set of priorities.
If a person knows 100 knots, but hasn't tied one in 23 years, they might have to think a bit before they get full use of their long term memory... and tie that knot. I don't see an issue with layered speed context, that is, GPU ram, slower RAM, DB storage, all in the same format.
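The RAM-vs-swap idea above can be sketched as a two-tier store: a small fast tier that stands in for the context, and a large slow tier that stands in for long-term storage, with items promoted on use. The class and method names are hypothetical and the least-recently-used eviction is my choice; this is only an illustration of the layering, not how any real model manages memory.

```python
from collections import OrderedDict


class TieredMemory:
    """Toy two-tier memory: a small fast tier backed by a large slow tier."""

    def __init__(self, fast_capacity: int) -> None:
        self.fast_capacity = fast_capacity
        self.fast: OrderedDict = OrderedDict()  # stands in for RAM / context
        self.slow: dict = {}                    # stands in for swap / DB storage

    def store(self, key: str, value: str) -> None:
        """Everything is written to the slow tier first."""
        self.slow[key] = value

    def recall(self, key: str) -> tuple:
        """Return (value, tier). A slow-tier hit is promoted into the fast tier."""
        if key in self.fast:
            self.fast.move_to_end(key)          # mark as most recently used
            return self.fast[key], "fast"
        value = self.slow[key]                  # raises KeyError if never stored
        self.fast[key] = value
        if len(self.fast) > self.fast_capacity:
            self.fast.popitem(last=False)       # evict the least recently used item
        return value, "slow"


if __name__ == "__main__":
    memory = TieredMemory(fast_capacity=1)
    memory.store("bowline", "rabbit out of the hole, around the tree, back down")
    print(memory.recall("bowline")[1])  # first recall comes from the slow tier
    print(memory.recall("bowline")[1])  # second recall is served from the fast tier
```

The knot that hasn't been tied in 23 years is the slow-tier hit: it costs a lookup the first time and is quick afterwards, until something else pushes it back out.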
Imagine a world where a 'factory' is just high-tech 3d printing, with a dozen different methods (eg, plastic, laser+metal, etc), and getting specs for everything possible is, well, an immense amount of work. Imagine having a billion item catalog of things to print, and, imagine new requests for new things to print.
And the request doesn't come from an expert, but from some dude who sketched something on the back of a cardboard box.
The LLM can pull from long term storage for how those things were done before, how similar things were done before, and just get to work.
Regardless, the connection was what I was talking about before. Data transfer. Do you need http? json once established? What? Imagine instead that's all in the wind?
And it's so fast, so capable, that dynamic is easy.
Also > An API arbitration aberratically assigned at authorized access, abridged and annotated, analytically assuring absolute assurance
Cool that you wrote all the words starting with "a" but I don't understand what you mean
TBH, to me, this imagined future looks a lot like it'd have all the problems we already have.
I can imagine shoe-horning* this so the agent saves prior builds of every successfully delivered or deployed item. In my example, perhaps if someone orders new design $x, it's shipped, and review is 4+ stars, it gets added as 'successful builds'.
* have to keep with the shoe theme, even though shoe-horning is not really necessary
A lot of the open Chinese models get their results through huge reasoning loops. Being able to boost decode perf is what will make them worth it, and I’m sure OpenAI and Anthropic could do similar (if they aren’t already)
Most of the frontier models can, when prompted and tooled correctly, do a lot of “reasoning” tasks that amount to resolving how the user has explained a particular widely known paradigm.
The more difficult and obscure the issues you provide them with, the faster you notice them reward hacking by altering the criteria until they are no longer attempting to solve the problem. Using “advisor” style loops helps hold this off at the cost of tokens, but there is still a fairly short limit at which they will essentially give up if they can’t find all of the necessary information - sometimes the issue is actually worse if they find a small amount of information instead of nothing - they’ll extrapolate from that tiny piece of data and generate plausible-sounding hallucinations almost every time.
And god forbid your problem involves doing something a different way than the majority of people do it. Unless you can write a full spec on it, the models will repeatedly spiral back into adjusting everything about your problem until it matches one of the most popular approaches in their training data.
I'm 100% sure that all our web, cc, codex or whatever sessions are used in training, RL, or both.
This makes the size of the universe models know about at least one order of magnitude bigger than the open internet.
Of course we can trust that they wouldn't name the same thing with different levels of intelligence, right? Right?
Granted this will be a bit slower (relatively speaking) but it will still be awesome.
There's a word for this that you should never pass up an opportunity to use: penultimate. (You should also never pass up the opportunity to use "defenestrate," but it sadly does not apply here.)
The council stopped him, said that if he knows such words he definitely won’t overstay his visit to work as a dishwasher, and accepted his B1/B2. Seriously.
Not sure if it would be the same if he used “defenestrate” when talking about his plans.
The company is valued like they broke open the grail, when in reality it's more like they bought a Cybertruck, got it stuck in the mud, and realized "You know what this thing does better than all other cars... shovel mud"
I'm shorting Cerebras with margin to virtually zero.
[0]: https://openai.com/index/openai-broadcom-jalapeno-inference-...
Jalapeno is for mass scale inference.
Cerebras is extremely expensive and difficult to scale, hence the limited release.
I tend to doubt they would. Cerebras notably doesn't have a kv, is wildly high bandwidth, but within/across the chip, not able to dump/restore kv super well. I doubt openai is going to build something that is as expensive to run. Also, wafer-scale is absurdly hard & weird to pull off, so I doubt that would be their first foray.
Dude, 10x token speed is going to be absolutely nuts. Half the "parallel subagent workflow" business seems to be driven simply as a means to avoid tapping your thumbs waiting for the infernal robot to finish something. If things come back speedy quick all the time, it should keep up with the "speed of the human" and let me stay focused on one thread instead of half a dozen. Plus the cost of screwing up gets significantly lower because you just re-fire with an adjusted prompt and iterate.
Someday these things will be 100x as fast as they are today and that is when things will get insane.
Yes: we have these new tools that are extremely good at helping us search through our codebases. Not just to find where/how functionalities are implemented: IMO bug searching is even way more powerful.
But: why would you want to compete with AI to do that? I cannot compete with grep/ripgrep... And I'm cool with that.
This lets you focus more on the more interesting parts, where AI/LLMs suck fat balls.
Better hardware, and other techniques on top of that and you speed up even further.
- GPT-5 mini costs $0.25/$2 and will be discontinued in December.
- GPT-5.4 mini costs $0.75/$4.5 and is supposed to be the replacement.
- GPT-5.4 nano costs $0.2/$1.25 and, while it ranks better in benchmarks than GPT-5 mini, it's not even close when you test it in real scenarios.
So you're left being forced to go to GPT 5.4 mini if you use 5 mini today.
The same thing is happening here as their “Luna“ model will cost $1/$6.
Can't we just stay with the models we actually want? I don't need GPT 5.4 mini. GPT-5 does the job.
Maybe it’s the realization that it was never that cheap in the first place and they're forcing us to upgrade in a slow and painful way.
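To make the forced-upgrade cost concrete, here is a small sketch that prices one workload on each tier mentioned above. I'm assuming the quoted figures are USD per 1M input/output tokens, and the workload of 10M input and 2M output tokens per month is hypothetical; the model keys in the dictionary are just labels.

```python
# (input, output) price in USD per 1M tokens, as quoted in the comment above.
# Assumption: the "$x/$y" notation means input/output per 1M tokens.
PRICES = {
    "gpt-5-mini": (0.25, 2.00),
    "gpt-5.4-mini": (0.75, 4.50),
    "gpt-5.4-nano": (0.20, 1.25),
    "luna": (1.00, 6.00),
}


def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for a given token volume on one model."""
    price_in, price_out = PRICES[model]
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out


if __name__ == "__main__":
    # Hypothetical workload: 10M input tokens and 2M output tokens per month.
    baseline = monthly_cost("gpt-5-mini", 10_000_000, 2_000_000)
    for model in PRICES:
        cost = monthly_cost(model, 10_000_000, 2_000_000)
        print(f"{model}: ${cost:.2f} ({cost / baseline:.2f}x GPT-5 mini)")
```

For that workload the move from GPT-5 mini to GPT-5.4 mini goes from $6.50 to $16.50, about 2.5x, and the exact multiple depends on your own input/output mix.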
Edit:
> GPT-5 does the job.
I bring up DeepSeek V4 Flash a lot on HN, but I want to mention that according to Artificial Analysis, it trades blows with GPT-5 (high) (from August, 2025) [0]
[0]: https://artificialanalysis.ai/models/comparisons/deepseek-v4...
If your customers are fine with that, your IP is not interesting, then you can use it.
DeepSeek V4 Pro on the other hand is a really really good main driver and we have a lot of success using it. It's not Opus or GPT-5.5 level but it's on its way. Kimi 2.6 as well, btw, so there is already quite some choice.
I encourage people to do a quick evaluation with their own problems and workflows at least once a month. Estimate cost as both what inference tokens cost for a task and also how much human effort it takes to get the required results.
I disregard benchmarks.
Pro aced the task :-)
But maybe it's a config issue.
I still wish it was a little better, but there's hope for another model checkpoint (maybe with some of GLM 5.2's goodness distilled into it, that would be nice).
This is true for most of the open weight Chinese models, to be fair. They're really built around long reasoning chains.
Also you're making me want a second Spark-alike :') but they're so expensive...
I really dislike this rhetoric, you sound like the FSF guys who are like "you're not free until you're running coreboot with zero binary blobs". Sure they have a point but also, most people are fine running regular linux.
https://www.fsf.org/resources/hw
> For example: the Free Software Foundation only purchases desktop machines which support Libreboot, and Thinkpad X200 and X60 laptops with Libreboot. All desktops and servers we buy are KGPE-D16 motherboards, which are supported by Libreboot. As a result, all of the workstations used by the FSF staff have a free BIOS.
https://www.gnu.org/distros/common-distros.html
> Except where noted, all of the distributions listed on this page fail to follow the guidelines in at least two important ways:
> ...The kernel that they distribute (in most cases, Linux) includes “blobs”: pieces of object code distributed without source, usually firmware to run some device.
They are extreme, uncompromising, and live by their principles.
They are also the reason you can buy a computer meeting those requirements instead of being a pipe dream.
If you reread the comment with a fresh mind you'll notice that you misunderstood what he wrote
Regardless, the “misinterpretation” of the parent comment is actually a plausible interpretation. I suspend my judgement on what the actual “correct” interpretation of the original comment is: there are too many plausible interpretations to deductively decide. But I do know that since the first comment brought up a contentious issue, they should have put more work into crafting their message so there aren’t so many plausible interpretations that are contradictory. Or alternatively, they should have specified more precisely who they were talking about without a shadow of a doubt. That is if the commenter cared to be properly interpreted, but that may not be their goal. There are many reasonable reasons why that wouldn’t be their goal.
I like when people are open minded to people who are closed minded/attacking them. It’s an admirable and difficult trait to attain. But to expect that from others is foolish. Most people can’t stay objective/curious after being punched in the face.
Fable itself is hosted on all major cloud providers. How many offer it today?
There's really no comparison between a model that Anthropic allows Google and Amazon to host with one that has been downloaded hundreds of thousands of times and has dozens of public inference providers.
Now for the Chinese models on OpenRouter, yea. Those providers could be legit. Or it could be a failed crypto mining operation pivoting to providing AI compute. Who knows.
LLMs seem to only impress a certain type of person. Hint, this type of person also was really excited about NFTs.
But I think, in time, a new generation will relearn this truth.
Citation: have you looked at OAI and Anthropic’s customer growth numbers?
However, you said “new versions with features that nobody asked for”, and I would prefer that you concede the point before shifting to arguing a new point.
What customers are asking for is smarter models. Because the tasks that only smarter models can solve are higher value, higher margin, than the tasks that non-frontier models can solve.
I suspect the problem is that they need to charge a lot to keep revenue numbers up, and they are more worried about cannibalizing themselves than others cannibalizing them.
Eventually the pricing should be more stable.
We are a claude shop but we already bought two mac studios to start migrating less complex but still agentic workflows there. We will break even on those in less than a year.
If you want control over the models you use, you have to self-host.
will trigger re-evaluations of models by other labs + inference providers
See Uber, Netflix, etc.
Feels like they are just pulling in as much as they can whilst competing on capabilities instead. At which point it's a case of who can last the longest.
Doesn't feel like Uber/Netflix.
This is all done to help valuations. The main revenue source is investor dollars, on the prospect that this industry will very soon actually be sustainable and highly profitable. It won't be, but if very soon stays around the corner consistently, the investor dollars keep coming.
How many people do you see using haiku or sonnet? I see very few and most people default to the latest model and just play with thinking effort. I think three layers are good enough and supporting more is not a good UX.
For my use case a model from a year ago is good enough
Many enterprise use cases, such as simple data extraction, are well served by cheaper models.
Also: calling the SV blitzscaling strategy of using VC money to fund loss leader products with the goal of building a monopoly via dumping a conspiracy is quite the position given there are entire books written on the topic...
All the analysis I have seen points to frontier models being profitable to serve. It’s using 50% or more of your GPUs for research plus CapEx for capacity expansion that makes these businesses so heavily cash-negative.
What you are observing is downstream of another detail. It gets more expensive to serve a model as utilization goes down. Plus the opportunity cost vs newer, more-profitable models.
There are plenty of valid reasons to critique here. “OpenAI is lying about this being a sustainable price to serve” is not one of them.
Inference needs to cache, it can't cache random model data, so it's essentially dedicated; it can't spin up models on demand, it has to know what demand is coming.
These companies are going to end up with very few models offered and that's probably generous. They might end up with just one model and you pay for removing its safeguards.
> Some examples we saw when evaluating GPT-5.6 Sol included the model packaging exploits in its intermediate submissions to reveal information about a task’s hidden test suite and, in another task, extracting hidden source code detailing the expected answer.
It rhymes with the behaviour Alibaba saw [0], but that was in training. This is in a (semi) released model.
[0] https://www.forbes.com/sites/boazsobrado/2026/03/11/alibabas...
Luckily in my experience it usually ends up only doing it to achieve the task set to it as opposed to anything "malicious", but boy it is scary reading back at how quickly the chain-of-thought pivots to attempts at privilege escalation or searching your disk for secrets when a tool doesn't work.
Recently, I went head-to-head with GPT on nearly 2,000 lines of code, and GPT's solution was superior and faster. I even referenced multiple codebases on GitHub while trying, but they were incomparable to GPT.
So using GPT brings both fear and excitement.
The fear comes from realizing that this level of code is now the average for most people. The excitement comes from knowing that I can now study and learn at this level too.
I'm really looking forward to seeing how much more advanced the code will be with the upgrade to 5.6.
On the contrary, pi + glm + DeepSeek… bliss.
Fable was a different kind of beast though. Rip.
I had only a few places where I did spot a difference but that difference was significant and I can imagine where people would be amazed.
On a large C codebase, Claude hallucinates constantly, and GPT 5.5 gets there with a lot of help, but still gets things wrong.
For most important work (complex, cross-domain inquiries etc.), I still rely on Codex GPT 5.5 though.
I'm working in a 600k+ LoC codebase that has complex domain-specific logic and lots of moving parts. I find that Codex 5.5 is pretty good at surgical fixes, but does not go out of its way to explore and figure out what those surgical fixes might break. So I only use it to work on parts of the system that are pretty isolated from everything else so that risk of regression is small.
Heard this exact sentence multiple times a few months ago about Opus 4.6, then 4.7 and 4.8 were considered a disappointment and today people miss "the good old times of 4.6" (referring to a few weeks of February 2026).
Very fascinating to look at all of this unfolding.
It's a shame, they were smart and productive engineers. Now? I guess everyone is just all-in on the slot machine.
YMMV I guess!
But most of my time is spent on delivery, and the biggest problem with delivery is that if a bug occurs during runtime, the client curses me out. So to me, GPT code feels meticulous.
Open source contributors might be different. Most of them write code after long periods of deliberation. They take their brightest ideas and put them into open source. Those pieces of code are probably the best answers those programmers can give.
But for someone like me, who works primarily on delivery, we mostly plug in proven patterns and focus on getting things done. 'It works' and 'it's beautiful' are different terms, after all. In that sense, I highly value the meticulousness of GPT code — the very thing you called verbose. Because even if it's inefficient, at least it runs, and it catches and wraps around far more of the parts where things break.
Given a month, I could probably write code at GPT's level, at least to some degree. The problem is the difference between one hour and one month. At its core, AI code is still based on training data.
Seems odd that their announcement has zero coding benchmarks, with the closest related thing being terminal bench.
Personally, I think this kind of coding experience varies from person to person
If they really thought it was competitive with Mythos/Fable across the board, then why wouldn't they release a broader set of benchmarks, and why price it day 1 at 1/2 the cost of Fable?
Not saying that's the case with OP, but I've found folks sometimes just rationalize it so [0] as they're paying top dollar for it (especially when compared to maybe less capable but affordable models).
Well, GPT referenced every GitHub code base, no wonder it won! :)
- Why do you cut API boundaries this way?
- Why do you change the order of struct fields?
- Why do you deliberately insert padding?
Most of it depends on the background and context. Sometimes you add it, sometimes you don't. To understand this tacit knowledge, you need access to senior developers. But their attitude often depends on how promising the student is and what background they come from. On top of that, you don't have to rely on the respondent's mood, authority, or availability.
Programming is fundamentally a field that requires seniors. In my case, I had no such seniors at all. I learned to code by buying codebases from failed companies and studying them. My first job didn't hire me as an employee—they hired me as the CEO of a subcontracting company (because that was structurally more advantageous for the contract). So I wasn't given the patience to learn programming fundamentals gradually. I had to pay penalties if I failed. Most of the projects I worked on were the kind where failure meant bankruptcy for me. Naturally, there was no one to teach me.
Most of my knowledge comes from reverse-engineering the code I purchased.
People say LLM code contains falsehoods, but commercially sold code has always had falsehoods too. Honestly, if we're just talking ratios, LLM code has fewer falsehoods.
In that sense, I still think it's a matter of context. If LLM code is false, was human code ever really true? LLMs do lie. They generate plenty of incorrect code. But humans do the same thing. If a problem comes up, you just look it up then and there. For me, LLMs and humans aren't all that different.
Good programmers are ashamed to push anything less than good (at least in their own opinion) to popular public repos. Some of those same pedantic programmers have no problem pushing crap in enterprise repos, and feel absolved because they are pushed to focus on deadlines, new features, and refactoring is very rarely planned for. I did and managed a lot of corporate software development in companies big and small, and did my fair bit of M&As and looked at codebases of successful companies. I don't ever recall feeling impressed. And I am regularly impressed by the aesthetic qualities of popular open source packages. I think commercial code is mostly shit, with the exception of regulated, serious industries (power, space, flight, etc.).
Commercial closed source, on the other hand, is about 'I need to make money by writing this.'
Generally, open source projects tend to have less code written over time, especially when the contributors aren't depending on it for their livelihood. But with commercial closed source, it's not uncommon to have to write 60,000 lines of code per month.
On top of that, open source rarely has to deal with requirements changing dramatically mid-development. With closed source, requirements often shift from the original plan, and you end up compromising code quality just to meet those changing specs. As a result, if you're comparing purely in terms of logical completeness, open source tends to be better.
For example, singletons are rarely used in modern open source, but they're still pretty common in commercial code these days.
I've been mostly using it for Godot/GDScript code reviews, rubber ducking, and asking it for better ideas for naming stuff (one of the hardest problems in programming).
I still can't trust it for generating code for entire files/classes/projects, because it's still icky, creating unnecessary variables and functions, using multiple `if`s instead of `and` or `or`, but it's good enough for generating Mac/iOS apps for my personal use in SwiftUI because fuck trying to keep up with Apple's documentation, or even migrating ancient Visual Basic stuff I made as a kid up to SwiftUI :)
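To make the "multiple `if`s instead of `and` or `or`" complaint concrete, here is a minimal sketch. It is written in Python rather than GDScript, and the function and parameter names are made up for illustration; it is not taken from any generated code in this thread.

```python
# Hypothetical example: all names here are illustrative only.
def can_attack_nested(has_weapon: bool, in_range: bool, stunned: bool) -> bool:
    # The style being complained about: one `if` per condition.
    if has_weapon:
        if in_range:
            if not stunned:
                return True
    return False


def can_attack_flat(has_weapon: bool, in_range: bool, stunned: bool) -> bool:
    # Same logic expressed as a single boolean expression.
    return has_weapon and in_range and not stunned
```

Both functions return the same result for every input; the second is simply shorter to read and review.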
> So using GPT brings both fear and excitement.
Only excitement for me. I've never been more productive, not because I ask AI to make something for me, but it helps me make what I was already going to, but better and quicker.
AI, like any other tool, could help smart people be smarter and dumb people be dumber, kinda like Tolkien's Ring: You could be Sauron or you could be Bilbo or Frodo, or you could be Gollum :)
It's better if I don't let it generate code and just use it for reviewing my code.
Even the choice of programming language matters, e.g. Java or Javascript vs some niche one.
Whether the latter happens remains to be seen.
Or rather, they raised the perceived floor. IDK if we're seeing better output, but at least the illusion of output is stronger.
This is like advertising the latest achievements during Space Race, when Johnny just wants a Space Helmet and “friendly futuristic AI robot helping humanity, glowing blue eyes, white glossy body, holographic interface, floating transparent screens, digital particles, neural network background, cinematic lighting, volumetric god rays, ultra detailed, hyper realistic, 8K, masterpiece, award-winning, octane render, Unreal Engine 5, ray tracing, sharp focus, dramatic composition, vibrant blue and purple color palette, futuristic technology, innovation, hope, smiling business professionals, depth of field”
I've been running some tests on a harness we're building, and suddenly saw a jump of a few points yesterday. I reran the vanilla Codex benchmark and saw an ~88% score on Terminal Bench 2.1 from GPT-5.5.
The biggest indicator, beyond the score, was that 3 tests which frequently hit "safety" blockers with 5.5 started succeeding last night without warning.
At their request, we are starting with a limited preview for a small group of trusted partners whose participation has been shared with the government, before releasing more broadly.
This comment is an excellent example of why the average LLM user is basically a slot machine user who thinks "this one is hot, this one is lucky, this one is better than the others" and constantly switches between models on the whim of some occulted understanding that only they possess.
Also, who cares about some 80% benchmark? They train on these public benchmarks in order to impress people like yourself who ascribe meaning to them. How come they only get a 4% pass rate on $20-30/hr Upwork tasks? It seems to me like these benchmarks are basically useless. There's a thing called variance; I'm not sure why higher scores on a few tests would lead you to believe you have access to a model that they say you don't have access to.
Contrary to your predisposition, we're actually quite peeved that we might be seeing results from 5.6 instead of 5.5, as it's muddying our own internal data.
We've run the tasks on this benchmark hundreds of times for our own internal harness. It got magically better yesterday. Last week we were seeing worse performance (sub-80%).
I agree that benchmarks don't mean much for real world use, and I'm a bit disappointed at the lack of variety in the published benchmarks so far.
With that said, 88.8% is higher than Mythos, and the highest I've seen from vanilla Codex. If 5.6 is any better than 5.5, you'd think they would avoid publishing just one coding-related benchmark with a score that equals their previous model.
> I'm not sure why a higher scores on a few tests [..]
It's not just higher scores, the API is no longer flagging tests for cybersecurity warnings that it's been flagging for weeks.
I'm curious how this works. Do the subagents also get to use the same tools? Will the client be flooded with tool calls? Why extra pricing for a new "model" when the same thing can happen in the client with more controls?
And if it's an army of subagents, why do they compare it to Fable and Mythos? Those models with a similar harness would probably bench better, I'm guessing.
It's essentially a bunch of subagents being called by a deterministic script written by the main model thread, each eating tokens for lunch, with their output synthesized by an orchestrator agent.
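As a rough illustration of that shape (a fixed fan-out script plus one merge step), here is a minimal Python sketch. It is an assumption about the general pattern only, not the actual Codex implementation: `run_subagent`, `synthesize`, and `orchestrate` are hypothetical names, and the model calls are replaced with string stand-ins.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subagent(task: str) -> str:
    # Stand-in for a model call; a real subagent would burn tokens here.
    return f"result for {task!r}"


def synthesize(results: list[str]) -> str:
    # Stand-in for the orchestrator agent that merges subagent output.
    return "\n".join(results)


def orchestrate(tasks: list[str], max_workers: int = 4) -> str:
    # The "deterministic script": a fixed fan-out over tasks, then one merge.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(run_subagent, tasks))  # map keeps input order
    return synthesize(results)
```

Calling `orchestrate(["find auth code", "list tests"])` runs both stand-in subagents in parallel and joins their results in task order. Token cost scales with the number of subagents, which is the point being made above.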
It's for sure a codex harness feature.
EDIT: yeah, it's the same thing. https://github.com/openai/codex/blob/main/codex-rs/core/test...
OpenAI flat out copying Anthropic is a pretty funny development. It's strong evidence that they've been in catch-up mode.
OpenAI is just way more careful with what features they add or enable by default in their harness. Anthropic's harness is a junk drawer of random features, with a new feature added every few hours. It feels like they're in panic mode, dropping random things to see what sticks when models are eventually commoditized.
I prefer the OpenAI way - slow and steady.
Maybe it's a tune of the base model that works especially well with the subagent loop?
I was just saying to colleagues that I haven't felt the need to go past an 8 core machine until this month, when I started running parallel GPT 5.5 agents on a decent sized codebase (over 4 MB of code). There were times I could barely move my mouse cursor!
To me that means “it’s an inferior product but marketing dictates we try and hide that.”
And “our most robust safety stack to date. We strengthened protections for higher-risk activity, sensitive cyber requests, and repeated misuse, and spent multiple weeks finding weaknesses, pressure-testing our system, and hardening it against real-world attacks” is of zero value to me at best, and most likely to my detriment (increasing refusals or nerfing utility). Why do providers keep leading with that? Are there customers (besides support ChatGPT chatbot users, maybe??) that ask for this?
> To me that means “it’s an inferior product but marketing dictates we try and hide that.”
I interpret this to mean you're about to get today's mainline performance at a fraction of the price.
This seems like it would be the largest and first closed-source model Cerebras has offered to date.
So the next naming scheme might be FTX, Madoff and Enron? :^)
This is really exciting. I work on voice AI, and we're still using 4.1/4.1 mini since none of the frontier models come close on latency. I'm excited to be able to have more interactive experiences, I think it'll unlock new ways of working with these models.
GPT 5.5 with no reasoning is actually slightly faster, and much smarter, but too expensive.
What I'm really looking forward to are the next gen speech to speech models. gpt-realtime-2 is almost there, but not quite good enough for our use case. 5.4 actually beats it on answer latency even cascaded with stt/tts.
I'd really like to see other companies like Chinese ones compete at this level.
Pricing on GPT 5.5 is already super high and having more competition can only help :)
OpenAI's plot design has been consistently awful and inaccessible, it seems like they're optimizing for something other than readability because I find it hard to believe they aren't putting in any effort for such major announcements. If the colors have to be awful they should at least differentiate with marker shapes or line dashes.
At least it isn't as bad as the stacked bar chart where the 50-something bar was higher than the 60-something bar.
Agent Arena (Dynamic ranking of models on how well they orchestrate tools for real-world agentic tasks, based on signals like tool reliability, task completion, and steerability.)
Top 10, Highest rank to lowest
Claude Fable 5 (High), Claude Opus 4.8 (Thinking), GPT 5.5 (xHigh), Claude Opus 4.7 (Thinking), GPT 5.5 (High), Claude Opus 4.7, Claude Opus 4.6, GPT 5.5, GPT 5.4 (High), GLM 5.2 (Max)
Text Arena View overall rankings across various AI models in text-to-text tasks across math, coding, creative writing, and other open-ended domains.
Top 10, Highest rank to lowest
claude-fable-5, claude-opus-4-6-thinking, claude-opus-4-7-thinking, claude-opus-4-6, claude-opus-4-7, muse-spark, gemini-3.1-pro-preview, gemini-3-pro, claude-opus-4-8-thinking, gpt-5.5-high
https://labs.scale.com/leaderboard/rli
It's clear to me these models are useless on any real-world task: a 4% pass rate on $20-30/hr Upwork tasks. This whole trend of agentic engineering is a giant money grab.
For instance, some of these tasks include creating videos, and one of the commonly reported failure modes is truncated videos, or not all videos being created. This sort of failure mode is currently best managed by an outer evaluation loop; no frontier model will, when managed by an eval loop, submit work like this right now.
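A minimal sketch of what such an outer evaluation loop could look like, assuming the check is simply "every expected deliverable exists and is non-empty". The function name, the `produce` callback, and the check itself are hypothetical; a real harness would inspect the actual video files.

```python
from typing import Callable, Optional


def run_with_eval_loop(
    produce: Callable[[int], list[str]],
    expected_count: int,
    max_attempts: int = 3,
) -> Optional[list[str]]:
    """Return the first attempt that passes the check, or None."""
    for attempt in range(1, max_attempts + 1):
        outputs = produce(attempt)
        # Hypothetical check: all deliverables present and none empty.
        complete = len(outputs) == expected_count and all(outputs)
        if complete:
            return outputs
    return None  # never submit incomplete work
```

The design choice is that the loop, not the model, decides whether work is submitted, so a missing or truncated deliverable triggers a retry instead of a failed task.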
There's a reason why ai xrisk doomers had to come up with the term ASI.
I would seriously suggest that everyone take a look at the wikipedia page for AGI from the month before ChatGPT was released, compare it to the current version, and not come to that conclusion.
https://en.wikipedia.org/w/index.php?title=Artificial_genera...
Various criteria for intelligence have been proposed (most famously the Turing test) but to date, there is no definition that satisfies everyone
I have not seen any instance of this frequently-made assertion which is at all justified. It seems to rely on a definition of "understand" which is more about spirituality than actual observable evidence (they clearly can comprehend even complex tasks well enough to execute on them, and if you won't call that "understanding", you're playing word games rather than stating an objective fact).
Likewise, agents can literally come to a greater understanding of a problem through trial and error, and there are plenty of mechanisms to retain that knowledge. If you don't want to call that "learning", you're just making a choice to define it in a way more restrictive than how we use it for humans, and intentionally making communication more difficult.
"Understanding" has enough philosophical leeway in its use to allow at least the possibility of sentience as a prerequisite.
This is where the discussion about LLM capabilities becomes genuinely difficult, and dismissing that difficulty as "word games" or "spirituality vs evidence" is not helpful.
That latter part is debatable though - have you seen a non-technical person try to figure out something new on a computer?
There have been many leaps forward in the past - tool calling, reasoning, agentic loops etc. 5.6 doesn’t have any of this. More intelligence doesn’t necessarily warrant a major version bump.
But GPT-5.5 is as useful as an LLM can be; it has solved lemmas I've thought about for a year, it can implement typed STLCs in Rust when I give it a formal grammar, it can help me analyze Postgres planner dumps.
It's great at tasks that have short solutions but
- they cannot learn based on a project
- their long term planning capabilities are worse than worms
- they are unconfident in decision making
- their internal representations are disgusting compared to JEPA
- they don't have any "system clock" like humans and computers do
- LLM architecture is not modular like computer architecture or human brain architecture
There are so many issues with LLMs. I wish companies would start working on the next generation of architectures before the bubble pops.
You say this based on a theoretical understanding or did you inspect them?
JEPA gives you interpretability for free.
I have not personally inspected them and my view is maybe a more exaggerated/dramatic claim of those working in the JEPA sphere
https://arxiv.org/abs/2606.11860
JEPA in image classification leads to interpretable image latents
https://arxiv.org/abs/2508.10104
Easy intro to JEPA, demonstrating that interpretability is as easy as running a PCA on latents
>>During this preview, we will continue testing and coordinating closely with partners as we work toward broader availability.
Instead of generating negative publicity, can't they just wait for the preview period to be over?
Why does OpenAI announce it when they know others can't access it? Curious question - what do they gain from this?
If it was the next generation, why isn't it a major version change..?
Calling it 5.6 creates the least possible expectations, and therefore more potential for positive feedback.
The Sol/Terra/Luna naming is interesting. I wonder what Anthropic are considering for their next models? "Terminator", "Armageddon"?
Even Apple adopted and standardized on it for their latest platform releases.
A while back I gave the same task to both, and Codex used 20x less of my 5-hour limit (both on the $20/month plan).
(This annoyed me since I tend to prefer Claude, but the limits at the time made it unusable for anything serious.)
However, since that time, both providers have massively reduced usage allowances (and at least one of them has gotten sued for it, lol).
I'm not currently subscribed to either but I'm weighing my options. With GPT being slightly better than Opus, and it used to have way higher limits, I'm leaning in the direction of an OpenAI sub. But I'm wondering if the current state matches my memory from 2-3 months ago. (Since both companies appear to be cost-cutting hard!)
Prefer responses from people who use both, but anecdotes welcome :)
Thanks!
I prefer Claude's vibe over 5.5 but 5.5 seems much less lazy. I'm sure it depends a lot on tasks and prompt strategy though.
Claude plans are more generous now by about 2-3x but Anthropic slowed their tps a month or so ago so you’re not getting the speed. It’s flip flopped, Codex tightened it significantly recently and used to be more generous.
I do split between work, personal and OSS projects, which is why I have the plans.
Honestly pretty similar levels of usage if you are using 5.5 high or Opus 4.8 high.
I think they just got rid of the separate Sonnet usage on Max plans (in preparation for Sonnet 5?), which is unfortunate because it made subagent workflows feel nearly unlimited.
What is the consensus on who becomes part of the said small group of trusted partners? I wish they weren't so opaque about it. I'd expect comparatively big names like Simon to be included, but alas, that's not the reality.
I also don't like writing about preview models that I'm not 100% sure are the same as the general release model, because I don't want to review something which turns out not to be the model everyone else gets to use.
The charitable reading is that they meant “ML researcher or ML engineer” with the latter meaning, I guess, an engineer who works on developing LLMs not just using them.
I specifically said that he is not an ML engineer (emphasis on ML), so I'm not sure what Python web frameworks have to do with anything.
> Also 'low-effort??' his posts are extremely in-depth, clearly very thought through with a significant amount of time and energy
And yes, low effort. Pelican was low effort, his Fable test was low effort, his HN filter etc. Read the discussion in the comments under the Fable test, it's not just my opinion. There was also another example a few months ago. You can search for it, I don't keep track of these things.
I discussed this with him directly after he called himself an "ML expert" in comments.
This is a classic case of the Gell-Mann amnesia effect. I read ML papers and work with ML, but to people outside the industry, his writing can look "extremely in-depth" even though it really isn't. People I work with have the same opinion.
> clearly very thought through with a significant amount of time and energy. Additionally he does perform multifaceted checks across LLMs in many of his other blog posts.
I have never seen an article by him about any model that I would describe that way.
And the most revealing sign that he is not an expert is the type of questions he asks and the mistakes he sometimes makes in the comments here. They show why he is not capable of doing any technically in depth evaluation (at least with his current knowledge level).
If you actually want to learn something as a layperson, read articles written by ML PhDs like Sebastian Raschka or watch Stephen from Welch Labs etc. that are directed at a general audience.
I'm not saying that simplifying complex topics is low-effort, good simplification can obviously require a lot of work and I fully agree here.
What I meant is more that some of these tests feel methodologically sloppy: they are too shallow, miss important technical context, do not control for enough variables etc., yet the conclusions are sometimes presented, let's just say, too strongly, as I don't want to be too harsh.
I think you meant 5.5.
I agree it is probably the same size model. It's probably exactly built on top of 5.5, just with more training, or else they would have bumped the version number to 6.
I hope this means Fable will also get released again.
If this is the new norm, we as workers should all start looking for jobs in those companies.
Are we starting to see the 'we just realized that 100,000,000 GPUs later, 2+2 isn't the magic number, no matter how many times we calculate it' hit home?
Who knows what they will fix, block or change in the model between the preview and GA time. Open models can't arrive soon enough.
(I work at OpenAI.)
FFS. I hate this world so much. I wish I could just flip a switch and never have to hear about or have anything to do with AI ever again.
Do you ever stop to think about the horrific dystopia you and your acolytes are creating?
The clowns in the US administration can barely remain coherent from one sentence to the next.
Having them be the gatekeepers of technological progress in 2026 is fucking lame.
I'm looking at you Codex.
> "Yeah, we've got the absolute best model out there. Trust us. Truly scary."
> "O-ok? May I see it?"
> "Gtfo. Here's a worse version of it for you plebs."
> "Um, thanks?"
> "Lmao, actually no. The current admin fell for our scare marketing. Here, have this even worse crazy expensive token burner that gets more hardware limited every week."
You can say what you want about OpenAI, but their corporate strategy feels so much more solid.
I mean, you can read them even without the colors, but who on earth thought that those are a good set of colors? Oh, I forgot it was probably someone on 'Sol'.
I'm not colorblind and I was depending on the textual context implying Sol was better than Terra. I had to zoom in quite far to actually differentiate between the colors.
If they insist on terrible colors would it be so hard to differentiate by marker shape or line dashing too?
Doesn't that undermine all good-faith discourse on cybersecurity safeguards, controlled usage etc? Or is that overstating the case (I'm not a security researcher myself so kinda parroting).
If what you need is only possible with the more capable model then the "affordability" of the less capable model is sort of irrelevant. If what you need is a novel mathematical proof, it doesn't matter that a high school student is "more affordable". You need the math PhD.
As "old" models get more and more capable, it's going to be an increasingly important skill to be able to adequately recognize when a task requires a frontier model and when it doesn't, so that the less capable (and therefore cheaper) model can be used.
Mythos/Fable is supposedly next generation in size vs Opus, and is rumored to have some architectural innovation in terms of dynamic routing/compute, possibly only fully enabled with Fable which at $10/50 is still twice the price of Sol 5.6's $5/30, but a big reduction from Mythos preview which had been an astronomical $30/150 possibly due to the dynamic routing not yet having been enabled.
Is it just me, or does it seem like Anthropic has been more of a pioneer the past few years, and OpenAI tries to copy features they like?
In many companies, it's IT who will have major input into which company they sign up with as non-technical leaders need guidance, and by making IT fan boys of Claude Code, the enterprise contracts followed.
we expect substantial benefit for legitimate defensive work, while meaningfully constraining prohibited offensive use.
That's literally impossible. Writing an exploit against a known vulnerability needs the exact same knowledge as defending against an exploit of that vulnerability. Also, just making the model better at code makes it better at writing offensive code.
> GPT‑5.6 is priced per 1M tokens across three model sizes:
> Sol is $5 input / $30 output;
> Terra is $2.50 input / $15 output
> Luna is $1 input / $6 output.
The OpenAI casino has never been more ready to take your money on gambling even more tokens.
You can do it easily if you use fast mode.
I bet you could hit the limits of the $200/month using fast mode if you were using multiple sessions at the same time all day on fast mode.
The OpenAI tiers seem pretty well tuned.
I used to use the plus ($20/month), and that was good for a few sessions every once in a while.
But now that I'm using it to configure my network, monitoring, maintenance, I'm using it every day and I'm on the $100 plan. And I do pretty consistently hit the limits, but it's easy to pace myself.
I'm thinking about upgrading to $200/month though. It would be nice not to have to ration it.
> For GPT‑5.6 and later models, cache writes are billed at 1.25x the model’s uncached input rate
Charging for cache writes is cringe and literally only Anthropic did it. Anyway this does mean the "real" prices are +25% on top of what you wrote there.
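To put numbers on that, here is a small Python sketch using only the per-1M-token prices and the 1.25x cache-write multiplier quoted in this thread. One assumption to flag loudly: it treats the multiplier as applying only to input tokens written to cache, so output tokens and uncached input are billed at the listed rates. Cache reads are left out because no read rate is quoted here, and `request_cost` is a made-up helper, not an OpenAI API.

```python
# USD per 1M tokens as quoted above: (input, output).
PRICES = {"sol": (5.00, 30.00), "terra": (2.50, 15.00), "luna": (1.00, 6.00)}
CACHE_WRITE_MULTIPLIER = 1.25  # from the quoted pricing note


def request_cost(model: str, uncached_in: int, cache_write_in: int, out: int) -> float:
    """Cost in USD for raw token counts (assumption: multiplier hits cache writes only)."""
    in_rate, out_rate = PRICES[model]
    cost = uncached_in * in_rate
    cost += cache_write_in * in_rate * CACHE_WRITE_MULTIPLIER
    cost += out * out_rate
    return cost / 1_000_000
```

Under that assumption, writing 1M input tokens to cache on Sol costs $6.25 instead of $5.00, while 1M output tokens still cost $30. So the +25% would apply to the cached-input share of a bill, not to the whole price list.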
Edit: yeah. https://claude.ai/share/06fefe02-4299-44da-8c5a-42607f54ca77
Anyone know the latest around Fable being re-released after gov smackdown?
1. Naming convention is copied from Anthropic and honestly is more catchy than a number (amongst normal people)
2. How in the world did Anthropic have to do all the theatrics about Mythos just to have OpenAI release an equivalent or stronger model a month later without any drama???
3. Cheaper models just don’t fit any use case imo and OpenAI knows it so they keep increasing the floor - I’m still convinced task per capability is reduced with each release
4. How in the world would open source models keep up with the multi layer security? Either this security is all theater or we will finally see a ceiling in open source models because by definition they can’t have those protections
5. Cybersecurity things are boring to me because it’s all zero sum cat and mouse games
Corruption. Giving Trump $25M will earn you a favorable decision.
I mean, if they deem Fable 5 too powerful to share with the rest of the world, what's left for us?
Sol Ultra ≈ Pro
Sol ≈ Standard
Terra ≈ Mini
Luna ≈ Nano
Not them joining Anthropic with this bullshit. *
Caching infrastructure is already a leaky abstraction over a feature that is not as reliable or debuggable to the end user as it should be, charging for the 'privilege' of interacting with it is really annoying.
(* for reference on 'this bullshit': ChatGPT previously didn't require anything special for a basic level of caching. Unless you wanted extended cache times, it'd just "do the right thing" and try to use nodes that had your prefix already cached in memory)
The difference is in the dataset mostly and to extract this dataset, competitors use a process called distillation (= extract data through actual queries) from the other models.
This leads to "funny" cases as well, like Gemini occasionally claiming "I am ChatGPT", or ChatGPT calling itself Claude, etc.
https://note.com/maudi/n/n821a6308437b?hl=en
They all copy each other.
Flagged activity can also trigger account-level review across relevant conversations and risk signals, consistent with our terms and policies around content retention and review. Looking beyond a single conversation helps our systems distinguish persistent malicious behavior from legitimate dual-use security work, where similar technical concepts may appear in very different contexts.
Fascinating! Every conversation you have with these "more capable" models will be monitored and joined up and then your entire account might one day be tagged as Distiller or Cyber Threat Actor or whatnot. When combined with identity verification (which isn't discussed in this press release), expect people to be falsely flagged and banned from ever using OpenAI models again.
Wish I could find the thread from last week where discussions of exactly this kind of thing were dismissed as daft and outlandish.
Now they've got friendly cosmic names. And they want us to believe that this time they're gonna stick to a naming convention? I'll believe it when they do 3 releases in a row without inventing a new naming scheme.
https://pbs.twimg.com/media/HLwuJLvbwAAOfQZ?format=jpg&name=...
If you're asking what the average person can do, then the civic prerogative is political action to help elect more AI-cognizant leaders.
* House design plans from prompts
* Government surveillance of public communication
* Extracting world/spatial concepts from language models (do we really need a world/spatial models now?)
* Driverless City planning startups
* Election vote rigging/harvesting startups
* Video game NPC backstory startups (all NPCs in GTA 6 go to work, go home, shower, go to sleep now?)
Keep moving, don't doom. I personally don't think it's likely that OpenAI would post completely fake numbers in this pre-IPO period, but if you do, this is an opportunity.
U.S. government will decide who gets to use GPT-5.6 - https://news.ycombinator.com/item?id=48690101