The new Opus 4.6 scores 65.4 on Terminal-Bench 2.0, up from the 64.7 scored by GPT-5.2-Codex.
GPT-5.3-codex scores 77.3.
That said ... I do think Codex 5.2 was the best coding model for more complex tasks, albeit quite slow.
So very much looking forward to trying out 5.3.
https://gist.github.com/drorm/7851e6ee84a263c8bad743b037fb7a...
I typically use github issues as the unit of work, so that's part of my instruction.
The way "Phases" are handled is incredible with research then planning, then execution and no context rot because behind the scenes everything is being saved in a State.md file...
I'm on Phase 41 of my own project and the reliability and almost total absence of errors is amazing. Investigate and see if it's a fit for you. You can set up the PAL MCP to have Gemini, with its large context, review what Claude codes.
I use 5.2 Codex for the entire task, then ask Opus 4.5 at the end to double check the work. It's nice to have another frontier model's opinion and ask it to spot any potential issues.
Looking forward to trying 5.3.
Every new model overfits to the latest overhyped benchmark.
Someone should take this to a logical extreme and train a tiny model that scores better on a specific benchmark.
But even an imperfect yardstick is better than no yardstick at all. You’ve just got to remember to maintain a healthy level of skepticism is all.
When such benchmarks aren’t available, what you often get instead is teams creating their own benchmark datasets and then testing both their own and existing models’ performance against them. Which is even worse, because they probably still run the test multiple times (there’s simply no way to hold others accountable on this front), but on top of that they often hyperparameter-tune their own model for the dataset while reusing previously published hyperparameters for the other models. That gives them an unfair advantage, because those hyperparameters were tuned to a different dataset and may not even have been optimizing for the same task.
AI agents, perhaps? :-D
You can take off your tinfoil hat. The same models can perform differently depending on the programming language, frameworks and libraries employed, and even project. Also, context does matter, and a model's output greatly varies depending on your prompt history.
Cost to Run Artificial Analysis Intelligence Index:
GPT-5.2 Codex (xhigh): $3244
Claude Opus 4.5-reasoning: $1485
(and probably similar values for the newer models?)
Not throwing shade anyone's way. I actually do prefer Claude for webdev (even if it does cringe things like generate custom CSS on every page) -- because I hate webdev and Claude designs are always better looking.
But the meat of my code is backend and "hard" and for that Codex is always better, not even a competition. In that domain, I want accuracy and not speed.
Solution, use both as needed!
This is the way. People are unfortunately starting to divide themselves into camps on this (it’s human nature, we’re tribal), but we should try to avoid turning it into a Yankees-Red Sox rivalry.
Both companies are producing incredible models and I’m glad they have strengths because if you use them both where appropriate it means you have more coverage for important work.
Ah and let me guess all your frontends look like cookie cutter versions of this: https://openclaw.dog/
(I'm also a "small steps under guidance" user rather than a "fire and forget" user, so maybe that plays into it too).
It's a very nice UX for iteratively creating a spec that I can refine.
Opus is the first model I can trust to just do things, and do them right, at least small things. For larger/more complex things I have to keep either model on extremely short leashes. But the difference is enough that I canceled my GPT Pro sub so I could switch to Claude. Maybe 5.3 will change things, but I also cannot continue to ethically support Sam Altman's business.
All that said, the single biggest reason why I use Codex a lot more is because the $200 plan for it is so much more generous. With Claude, I very quickly burn through the quota and then have to wait for several days or else buy more credit. With Codex, running in High reasoning mode as standard with occasional use of XHigh to write specs or debug gnarly issues, and having agents run almost around the clock in the background, I have hit the limit exactly once so far.
The only valid ARC AGI results are from tests done by the ARC AGI non-profit using an unreleased private set. I believe lab-conducted ARC AGI tests must be on public sets and taken on a 'scout's honor' basis that the lab self-administered the test correctly, didn't cheat or accidentally have public ARC AGI test data slip into their training data. IIRC, some time ago there was an issue when OpenAI published ARC AGI 1 test results on a new model's release which the ARC AGI non-profit was unable to replicate on a private set some weeks later (to be fair, I don't know if these issues were resolved). Edit to Add: Summary of what happened: https://grok.com/share/c2hhcmQtMw_66c34055-740f-43a3-a63c-4b...
I have no expertise to verify how training-resistant ARC AGI is in practice but I've read a couple of their papers and was impressed by how deeply they're thinking through these challenges. They're clearly trying to be a unique test which evaluates aspects of 'human-like' intelligence other tests don't. It's also not a specific coding test and I don't know how directly ARC AGI scores map to coding ability.
As an analogy, Terence Tao may be one of the smartest people alive now, but IQ alone isn’t enough to do a job with no domain-specific training.
Hopefully performance will pick up after the rollout.
While I love Codex and believe it's an amazing tool, I believe their preparedness framework is out of date. As it becomes more and more capable of vibe coding complex apps, it's getting clear that the main security issues will come from having more and more security-critical software vibe coded.
It's great to look at systems written by humans and how well Codex can be used against software written by humans, but it's getting more important to measure the opposite: how well humans (or their own software) are able to infiltrate complex systems written mostly by Codex, and get better on that scale.
In simpler terms: Codex should write secure software by default.
https://www.nbcnews.com/tech/tech-news/openai-releases-chatg...
I wonder if this will continue to be the case.
"We added some more ACLs and updated our regex"
Security could be about not adding certain things/making certain mistakes. Like not adding direct SQL queries with data inserted as part of the query string and instead using bindings or ORM.
If you have an insecure raw query that you feed into an ORM you added on top, that's not going to make the query any more secure.
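To make that concrete, a minimal Python/sqlite3 sketch (illustrative only): the bound parameter is what makes the query safe; wrapping an already-interpolated string in an ORM call changes nothing.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")

    user_input = "alice'; DROP TABLE users; --"

    # Insecure: the value is spliced into the SQL string itself, so it gets
    # parsed as SQL. Feeding this finished string through an ORM later does
    # not undo the damage.
    unsafe = f"SELECT * FROM users WHERE name = '{user_input}'"

    # Secure: the value travels as a bound parameter and is never parsed as SQL.
    rows = conn.execute("SELECT * FROM users WHERE name = ?", (user_input,)).fetchall()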
But on the other hand when you're securing some endpoints in APIs you do add things like authorization, input validation and parsing.
So I think a lot depends on what you mean when you're talking about security.
Security is security - making sure bad things don't happen. In some cases that means a different approach in the code, in some cases additions to the code, and in some cases removing things from the code.
> GPT‑5.3‑Codex is our first model that was instrumental in creating itself. The Codex team used early versions to debug its own training
I'm happy to see the Codex team moving to this kind of dogfooding. I think this was critical for Claude Code to achieve its momentum.
- "Someone you know has an AI boyfriend"
- "Generalist agent AIs that can function as a personal secretary"
I'd be curious how many people know someone that is sincerely in a relationship with an AI.
And also I'd love to know anyone that has honestly replaced their human assistant / secretary with an AI agent. I have an assistant, and they're valuable far beyond rote input-output tasks... Also I encourage my assistant to use LLMs when they can be useful, like for supplementing research tasks.
Fundamentally though, I just don't think any AI agents I've seen can legitimately function as a personal secretary.
Also they said by April 2026:
> 22,000 Reliable Agent copies thinking at 13x human speed
And when moving from "Dec 2025" to "Apr 2026" they switch "Unreliable Agent" to "Reliable Agent". So again, we'll see. I'm very doubtful given the whole OpenClaw mess. Nothing about that says "two months away from reliable".
MyBoyfriendIsAI is a thing
> Generalist agent AIs that can function as a personal secretary
Isn't that what MoltBot/OpenClaw is all about?
So far these look like successful predictions.
Like, it can't even answer the phone.
I'm sure there are research demos in big companies, I'm sure some AI bro has done this with the Twilio API, but no one is seriously doing this.
All it takes is one "can you take this to the post office", the simplest, of requests, and you're in a dead end of at best refusal, but more likely role-play.
What does work in production (at least for SMB/customer-support style calls) is making the problem less magical:
1) narrow domain + explicit capabilities (book/reschedule/cancel, take a message, basic FAQs)
2) strict tool whitelist + typed schemas + confirmations for side effects (sketch below)
3) robust out-of-scope detection + graceful handoff (“I can’t do that, but I can X/Y/Z”)
4) real logs + eval/test harnesses so regressions get caught
Once you do that, you can get genuinely useful outcomes without the role-play traps you’re describing.
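As a rough illustration of points 2 and 3 (tool names and fields below are made up, not any particular stack):

    # Sketch: a strict tool whitelist with typed arguments; side-effecting
    # tools require an explicit confirmation step, and anything off-list
    # becomes a graceful handoff instead of improvisation.
    ALLOWED_TOOLS = {
        "book_appointment": {"args": {"date": str, "time": str, "service": str},
                             "side_effect": True},
        "take_message":     {"args": {"caller_name": str, "message": str},
                             "side_effect": False},
    }

    def dispatch(tool_name, args, confirmed=False):
        spec = ALLOWED_TOOLS.get(tool_name)
        if spec is None:
            # Out of scope: hand off instead of role-playing an answer.
            return {"status": "handoff", "reason": f"out of scope: {tool_name}"}
        for field, typ in spec["args"].items():
            if not isinstance(args.get(field), typ):
                return {"status": "error", "reason": f"bad or missing field: {field}"}
        if spec["side_effect"] and not confirmed:
            return {"status": "needs_confirmation", "tool": tool_name, "args": args}
        return {"status": "ok"}  # here you'd actually call the underlying system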
We’ve been building this at eboo.ai (voice agents for businesses). If you’re curious, happy to share the guardrails/eval setup we’ve found most effective.
is obviously a staged demo but it seems pretty serious for him. He's wearing a suit and everything!
https://www.instagram.com/p/DK8fmYzpE1E/
seems like research by some dude (no disrespect, he doesn't seem like he's at a big company though).
https://www.instagram.com/p/DH6EaACR5-f/
could be astroturf, but seems maybe a little bit serious.
that's certainly one way to refer to Scott Alexander
Do we still think we'll have soft take off?
There's still no evidence we'll have any take off. At least in the "Foom!" sense of LLMs independently improving themselves iteratively to substantial new levels being reliably sustained over many generations.
To be clear, I think LLMs are valuable and will continue to significantly improve. But self-sustaining runaway positive feedback loops delivering exponential improvements resulting in leaps of tangible, real-world utility is a substantially different hypothesis. All the impressive and rapid achievements in LLMs to date can still be true while major elements required for Foom-ish exponential take-off are still missing.
(= it might take an order of magnitude of improvements to be perceived as a substantial upgrade)
So the perceived rate of change might be linear.
It's definitely true for some things such as wealth:
- $2,000 is a lot if you have $1,000.
- It's a substantial improvement if you have $10,000.
- It's not a lot if you have $1m.
- It does not matter if you have $1b.
It feels crazy to just say we might see a fundamental shift in 5 years.
But the current addition to compute and research etc. def goes in this direction I think.
I don't think the model will figure that out on its own, because the human in the loop is the verification method for saying whether it's doing better or not, and more importantly, for defining "better"
Dirty tricks and underhanded tactics will happen - I think Demis isn't savvy in this domain, but might end up stomping out the competition on pure performance.
Elon, Sam, and Dario know how to fight ugly and do the nasty political boardroom crap. 26 is gonna be a very dramatic year, lots of cinematic potential for the eventual AI biopics.
>Dirty tricks and underhanded tactics
As long as the tactics are legal (i.e. not corporate espionage, bribes, etc.), no-holds-barred free market competition is the best thing for the market and for consumers.
The implicit assumption here is that we have constructed our laws so skillfully that the only path to winning a free market competition is producing a better product, or that all efforts will be spent doing so. This is never the case. It should be self-evident from this that there are more productive ways for companies to compete, and that our laws are not sufficient to create the conditions for them.
Model costs continue to collapse while capability improves.
Competition is fantastic.
However, the investors currently subsidizing those wins to below cost may be getting huge losses.
And yet RAM prices are still sky high. Game consoles are getting more expensive, not cheaper, as a result. When will competition benefit those consumers? Or consumers of desktop RAM?
The system isn't mathematically perfect, but that doesn't make it arbitrary. It works through an evolutionary process: bad bets lose money, better ones gain more resources.
Any claim that the outcome is suboptimal only really means something if the claimant can point to a specific alternative that would reliably do better under the same conditions. Otherwise critics are mostly just expressing personal frustration with the outcome.
There aren't any insurmountable large moats, plenty of open weight models that perform close enough.
> CO₂ emissions
A different industry that could also benefit from more competition? Clean(er) energy is not even more expensive than dirty sources on pure $/kWh, though we still do need dirty sources for workloads like base demand, peakers, etc. that the cheap clean sources cannot service today.
[1] https://en.wikipedia.org/wiki/United_States_antitrust_law
---
Sadly, that was the core of antitrust law; since the 1970s things have changed.
The predominant view today (i.e. the Chicago School view) in both the judiciary and the executive is influenced by Robert Bork's idea that consumer benefit, rather than a company's actions, should be the deciding factor.
Consumer benefit becomes a matter of opinions and projections about the future from either side of a case, whereas company actions like collusion, price fixing or M&A are hard facts with strong evidence. Today it is all vibes about how the courts (or the executive) feel.
So now we have government-sanctioned cartels, like the aviation alliances [1], justified by convoluted, catch-22-esque reasoning because they favor strategic goals, even though they violate the letter/spirit of the law.
[1] https://www.transportation.gov/office-policy/aviation-policy...
Europe is prematurely regarded as having lost the AI race. And yet a large portion of Europe live higher quality lives compared to their American counterparts, live longer, and don't have to worry about an elected orange unleashing brutality on them.
This may lead to better life outcomes, but if the west doesn't control the whole stack then they have lost their sovereignty.
This is already playing out today as Europe is dependent on the US for critical tech infrastructure (cloud, mail, messaging, social media, AI, etc). There's no home grown European alternatives because Europe has failed to create an economic environment to assure its technical sovereignty.
When the welfare state, enabled by technology, falls apart, it won't take long for European society to fall apart. Except France maybe.
I know that's anecdotal, but it just seems Claude is often the default.
I'm sure there are key differences in how they handle coding tasks and maybe Claude is even a little better in some areas.
However, the note I see the most from Claude users is running out of usage.
Coding differences aside, this would be the biggest factor for me using one over the other. After several months on Codex's $20/mo. plan (and some pretty significant usage days), I have only come close to my usage limit once (never fully exceeded it).
That (at least to me) seems to be a much bigger deal than coding nuances.
Claude also doesn't let you use a worse model after you reach your usage limits, which is a bit hard to swallow when you're paying for the service.
opus: 5/25 gpt: 1.75/14
As to why, I think in part it is because people who are willing to pay that much per month are much more likely to be using it heavily on "serious" tasks, which is, of course, a goldmine for training data - even if you can't use the inputs directly for training, just looking at various real world issues and how agents handle them (or not) is valuable, especially when all the low-hanging fruit have already been picked.
I wouldn't even be surprised if the $20 users are actually subsidizing the $200 users.
Claude Code was instrumental for Anthropic.
What's interesting is that people haven't heard of it/them outside of software development circles. I work on a volunteer project, a webapp basically, and even the other developers don't know the difference between Cursor and Claude Code.
I suspect that tells us less about model capability/efficiency and more about each company's current need to paint a specific picture for investors re: revenue, operating costs, capital requirements, cash on hand, growth rate, retention, margins etc. And those needs can change at any moment.
Use whatever works best for your particular needs today, but expect the relative performance and value between leaders to shift frequently.
My guess is that it's partly that, and partly momentum: developers who started using CC when it was far superior to Codex stuck with it, which has allowed it to become so much more popular. It might also be that, as it's more autonomous, it's better for true vibe-coding and more popular with the Twitter/LinkedIn wantrepreneur crew, which means it gets a lot of publicity, which increases adoption quicker.
Are you feeling the benefits of the switch? What prompted you to change?
I've been running cursor with my own workflows (where planning is definitely a key step) and it's been great. However, the feeling of missing out, coupled with the fact I am a paying ChatGPT customer, got me to try codex. It hasn't really clicked in what way this is better, as so far it really hasn't been.
I have this feeling that supposedly you can give these tools a bit more of a hands-off approach so maybe I just haven't really done that yet. Haven't fiddled with worktrees or anything else yet either.
I've been using Unix command lines since before most people here were born. And I actively prefer Cursor to the text-only coding agents. I like being able to browse the code next to the chat and easily switch between sessions, see markdown rendered properly, etc.
On fundamentals I think the differences are vanishing. They have converged on the same skills format standards. Cursor uses RAG for file lookups but Claude reads the whole file - token efficiency vs completeness. They both seem to periodically innovate some orchestration function which the other copies a few weeks later.
I think it really is just a stylistic preference. But the Claude people seem convinced Claude is better. Having spent a bunch of time analyzing both I just don’t see it.
I wish they would share the full conversation, token counts and more. I'd like to have a better sense of how they normalize these comparisons across versions. Is this a 3-prompt 10m-token game? A 30-prompt 100m-token game? Are both models using similar prompts/token counts?
I vibe coded a small factorio web clone [1] that got pretty far using the models from last summer. I'd love to compare against this.
This was built using old versions of Codex, Gemini and Claude. I'll probably work on it more soon to try the latest models.
I think these days the $200 Max subscription wouldn't be needed. I bet with these latest models you can make do with mixing two $20/mo subscriptions.
Real time was 2 weeks of watching the agents while watching TV and playing games, waiting for limit resets, etc... Very little dedicated, focused time.
| Name | Score |
|---------------------|-------|
| OpenAI Codex 5.3 | 77.3 |
| Anthropic Opus 4.6 | 65.4 |

not saying there's a better way but both suck
With the right scaffolding these models are able to perform serious work at high quality levels.
Like can the model take your plan and ask the right questions where there appear to be holes.
How broad an understanding of architecture and system design around your language does it have.
How does it choose to use algorithms available in the language or common libraries.
How often does it hallucinate features/libraries that aren't there.
How does it perform as context gets larger.
And that's for one particular language.
I’d feel unscientific and broken? Sure maybe why not.
But at the end of the day I’m going to choose what I see with my own two eyes over a number in a table.
Benchmarks are sometimes a useful tool. But we are in prime Goodhart's Law territory.
Honestly, I have no idea what benchmarks are benchmarking. I don’t write JavaScript or do anything remotely webdev related.
The idea that all models have very close performance across all domains is a moderately insane take.
At any given moment the best model for my actual projects and my actual work varies.
Quite honestly, Opus 4.5 is proof that benchmarks are dumb. When Opus 4.5 released, no one was particularly excited. It was better, with some slightly larger numbers, but whatever. It took about a month before everyone realized “holy shit, this is a step function improvement in usefulness”. Benchmarks being +15% better on SWE-bench didn’t mean a damn thing.
Can you guys point me to a single useful, majority LLM-written, preferably reliable, program that solves a non-trivial problem that hasn't already been solved a bunch of times in publicly available code?
You are correct that these models primarily address problems that have already been solved. However, that has always been the case for the majority of technical challenges. Before LLMs, we would often spend days searching Stack Overflow to find and adapt the right solution.
Another way to look at this is through the lens of problem decomposition as well. If a complex problem is a collection of sub-problems, receiving immediate solutions for those components accelerates the path to the final result.
For example, I was recently struggling with a UI feature where I wanted cards to follow a fan-like arc. I couldn't quite get the implementation right until I gave it to Gemini. It didn't solve the entire problem for me, but it suggested an approach involving polar coordinates and sine/cosine values. I was able to take that foundational logic and turn it into the feature I wanted.
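The core of that suggestion is only a handful of lines. Roughly the shape of the polar-coordinate approach (Python just for readability, values are placeholders; the same math drops straight into JS/CSS transforms):

    import math

    def fan_positions(n_cards, radius=400.0, spread_deg=60.0):
        """Offsets and tilt for cards fanned along an arc, center card upright."""
        out = []
        for i in range(n_cards):
            t = 0.5 if n_cards == 1 else i / (n_cards - 1)
            angle_deg = -90 - spread_deg / 2 + t * spread_deg  # -90 = straight up
            a = math.radians(angle_deg)
            x = radius * math.cos(a)
            y = radius * math.sin(a) + radius  # shift so the arc's pivot sits below the hand
            out.append((x, y, angle_deg + 90))  # third value: card rotation in degrees
        return out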
Was it a 100x productivity gain? No. But it was easily a 2x gain, because it replaced hours of searching and waiting for a mental breakthrough with immediate direction.
There was also a relevant thread on Hacker News recently regarding "vibe coding":
https://news.ycombinator.com/item?id=45205232
The developer created a unique game using scroll behavior as the primary input. While the technical aspects of scroll events are certainly "solved" problems, the creative application was novel.
For example, consider this game: the game creates a randomly placed target on the screen and has a player at the middle of the screen that needs to hit it. When a key is pressed, the player swings a rope attached to a metal ball in circles above its head at a certain rotational velocity. Upon key release, the player lets go of the rope and the ball travels tangentially from the point of release. Each time you hit the target you score.
Now, if I'm trying to calculate the tangential velocity of a projectile leaving a circular path, I could find the trig formulas on Stack Overflow. But with an LLM, I can describe the 'vibe' of the game mechanic and get the math scaffolded in seconds.
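For the curious, the scaffold really is small: the ball leaves with speed ω·r, pointing along the tangent at the release point. A minimal sketch (Python, made-up parameter names):

    import math

    def release_velocity(omega, radius, release_angle_deg):
        # omega: angular velocity in rad/s (positive = counter-clockwise)
        # radius: rope length; release_angle_deg: ball's position on the circle
        a = math.radians(release_angle_deg)
        speed = abs(omega) * radius
        # The tangent is perpendicular to the radius vector (cos a, sin a);
        # its sense depends on the spin direction.
        if omega >= 0:
            direction = (-math.sin(a), math.cos(a))
        else:
            direction = (math.sin(a), -math.cos(a))
        return (speed * direction[0], speed * direction[1])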
It's that shift from searching for syntax to architecting the logic that feels like the real win.
...This may still be worth it. In any case it will stop being a problem once the human is completely out of the loop.
edit: but personally I hate missing out on the chance to learn something.
Today, I know very well how to multiply 98123948 and 109823593 by hand. That doesn't mean I will do it by hand if I have a calculator handy.
Also, ancient scholars, most notably Socrates via Plato, opposed writing because they believed it would weaken human memory, create false wisdom, and stifle interactive dialogue. But hey, turns out you learn better if you write and practice.
Today with LLMs you can literally spend 5 minutes defining what you want to get, press send, go grab a coffee and come back to a working POC of something, in literally any programming language.
This is literally stuff of wonders and magic that redefines how we interface with computers and code. And the only thing you can think of is to ask if it can do something completely novel (which is so hard to even quantify for humans that we don't have software patents mainly for that reason).
And the same model can also answer you if you ask it about maths, making you an itinerary or a recipe for lasagnas. C'mon now.
With LLMs this phase becomes worse. We speed up the POC time 10x, then slow down almost as much in the next phases, because now you have a POC of 10k lines that you are not familiar with at all, that you have to pay way more attention to at code review, and that has security bolted on as an afterthought (a major slowdown now, so much so that there are dedicated companies whose business model has become fixing security problems caused by LLM POCs). Next phase: POCs are almost always 99% happy path. Bolt on edge cases as another afterthought, and because you did not write any of those 10k lines, how do you even know what edge cases might be necessary to cover? Maybe you guessed right, or you spend even more time studying the unfamiliar code.
We use LLMs extensively now in our day to day; development has become somewhat more enjoyable, but there is, at least as of now, no real improvement in final delivery times. We have just redistributed where effort and time go.
I know we all think we are always so deep into absolutely novel territory, which only our beautiful mind can solve. But for the vast majority of work done in the world, that work is transformative. You take X + Y and you get Z. Even with a brand new API, you can just slap in the documentation and navigate it an order of magnitude faster than without.
I started using it for embedded systems doing something which I could literally find nothing about in rust but plenty in arduino/C code. The LLM allowed me to make that process so much faster.
I'm using Copilot for Visual Studio at work. It is useful for me to speed some typing up using the auto-complete. On the other hand in agentic mode it fails to follow simple basic orders, and needs hand-holding to run. This might not be the most bleeding-edge setup, but the discrepancy between how it's sold and how much it actually helps for me is very real.
And this matters because? Most devs are not working on novel never before seen problems.
I can name a few times where I worked on something that you could consider groundbreaking (for some values of groundbreaking), and even that was usually more the combination of small pieces of work or existing ideas.
As maybe a more poignant example- I used to do a lot of on-campus recruiting when I worked in HFT, and I think I disappointed a lot of people when I told them my day to day was pretty mundane and consisted of banging out Jiras, usually to support new exchanges, and/or securities we hadn't traded previously. 3% excitement, 97% unit tests and covering corner cases.
To bridge the containers in userland only, without root, I had to build: https://github.com/puzed/wrapguard
I'm sure it's not perfect, and I'm sure there are lots of performance/productivity gains that can be made, but it's allowed us to connect our CDN based containers (which don't have root) across multiple regions, talking to each other on the same Wireguard network.
No product existed that I could find to do this (at least none I could find), and I could never build this (within the timeframe) without the help of AI.
Not to be outdone, ChatGPT 5.2 thinking high only needed about 8 iterations to get a mostly-working ffmpeg conversion script for bash. It took another 5 messages to translate it to run on Windows, in PowerShell (models escaping newlines on Windows properly will be pretty much AGI, as far as I’m concerned).
People should stop focusing on vibecoding and realize how many things LLMs can do: investigating messy codebases that used to take me ages of paper notes to connect the dots, finding information about dependencies just by giving them access (replacing painful googling, GitHub issue archaeology, and digging through outdated documentation), etc.
Hell I can jump in projects I know nothing about, copy paste a Jira ticket, investigate, have it write notes, ask questions and in two hours I'm ready to implement with very clear ideas about what's going on. That was multi day work till few years ago.
I can also have it investigate the task at hand and automatically find the many unknown unknowns that usual work tasks have, which means shorter delivery times and higher quality software. Getting feedback early is important.
LLMs are super useful even if you don't make them author a single line of code.
And yes, they are increasingly good at writing boilerplate if you have a nice and well documented codebase thus sparing you time. And in my career I've written tons of mostly boilerplate code, that was another api, another form, another table.
And no, this is not vibe coding. I review every single line, I use all of its failures to write better architectural and coding practices docs which further improves the output at each iteration.
Honestly, I just don't get how people can miss the huge productivity bonus you get, even if you don't have it edit a single line of code.
Some people just hate progress.
Sure:
"The resulting compiler has nearly reached the limits of Opus’s abilities. I tried (hard!) to fix several of the above limitations but wasn’t fully successful. New features and bugfixes frequently broke existing functionality.
As one particularly challenging example, Opus was unable to implement a 16-bit x86 code generator needed to boot into 16-bit real mode. While the compiler can output correct 16-bit x86 via the 66/67 opcode prefixes, the resulting compiled output is over 60kb, far exceeding the 32k code limit enforced by Linux. Instead, Claude simply cheats here and calls out to GCC for this phase (This is only the case for x86. For ARM or RISC-V, Claude’s compiler can compile completely by itself.)"[1]
1. https://www.anthropic.com/engineering/building-c-compiler
Another example: Red Dead Redemption 2
Another one: Roller coaster tycoon
Another one: ShaderToy
You're not gonna one-shot RDR2, but neither will a human. You can one-shot particles and shader passes though.
From my perspective, comments like these read as people having their head stuck in the sand (no offense, I might be missing something.)
Also, try building any complex effects by prompting LLMs; you won't get very far. This is why all of the LLM-coded websites look stupidly bland.
As to your second question, it is about prompting them correctly, for example [0]. Now I don't know about you, but some of those sites, especially after using the frontend skill, look pretty good to me. If those look bland to you then I'm not really sure what you're expecting, keeping in mind that the examples you showed with the graphics are not regular sites but more design-oriented, and even still nothing stops LLMs from producing such sites.
Edit: I found examples [0] of games too with generated assets as well. These are all one shot so I imagine with more prompting you can get a decent game all without coding anything yourself.
But I have plenty of examples of really atrocious human written code to show you! TheDailyWtf has been documenting the phenomenon for decades.
It satisfies your relevant criteria: LLM-written, reliable, non-trivial.
No major program is perfectly reliable so I wouldn't call it that (but we have fewer incidents vs human-written code), and "useful" is up to the reader (but our code is certainly useful to us.)
I see this originality criteria appended a lot, and
1) I don't think it's representative of the actual requirements for something to be extremely useful and productivity-enhancing, even revolutionary, for programming. IDE features, testing, code generation, compilers: none of these things directly helped you produce more original solutions to original problems, and yet they were huge advances in programmer productivity.
I mean like. How many such programs are there in general?
The vast vast majority of programs that are written are slight modifications, reorganizations, or extensions, of one or more programs that are already publicly available a bunch of times over.
Even the ones that aren't could fairly easily be considered just recombinations of different pieces of programs that have been written and are publicly available dozens or more times over, just different parts of them combined in a different order.
Hell, most code is a reorganization or recombination of the exact same types of patterns just in a different way corresponding to different business logic or algorithms, if you want to push it that far.
And yet plenty of deeply unoriginal programs are very useful and fill a useful niche, so they get written anyway.
2) Nor is it a particularly satisfiable goal. If there aren't, as a percentage, very many reliable, useful, and original programs that have been written in the decades since open source became a thing, why would we expect a five-year-old technology to have produced one? Especially since, obviously, the more reliable, original, and broadly useful programs have already been written, the narrower the scope for new ones to satisfy the originality criterion.
3) Nor is it actually something we would expect even under the hypothesis that agents make people significantly more productive at programming. Even if agents give 100x productivity gains for writing a useful tool or service, or for improving existing ones with new features, we still wouldn't expect them to give much of a productivity gain for writing original programs, precisely because genuinely original programs are a product of deep thinking, understanding a specific domain, seeing a niche, inspiration, science, talent and luck, much more than of the ability to do productive engineering.
Now... increasingly it's like changing a partner just so slightly. I can feel that something is different and it gives me pause. That's probably not a sign of the improvement diminishing. Maybe more so my capability to appreciate them.
I can see how one might get from here to the whole people being upset about 4o thing.
The French military had pioneered a way to make fully interchangeable weapon parts, but the French public fought back in fear for the jobs of the artisans who used to hand-make weapons. Over the next 20 years they completely lost their edge on the battlefield; nothing could be repaired in the field. Other countries embraced the change, could repair anything in the field with cheap and precise spare parts, and soon ushered in the industrial revolution.
The artisans stopped being people who made weapons, the artisans became people who made machines that made weapons.
It’s good to be cautious and not in denial, but i usually ignore people who talk so authoritatively about the future. It’s just a waste of time. Everyone thinks they are right.
My recommendation is have a very generous emergency fund and do your best to be effective at work. That’s the only thing you can control and the only thing that matters.
It's possible the job might change drastically, but I'm struggling to think of any scenario that doesn't also put most white collar professions out of work alongside me, and I don't think that's worth worrying about
If the AI performance gains are a 50% improvement, and companies decide they'd rather cut costs and pocket the difference (which could happen for many reasons), that leaves millions out of a job. And those performance gains are coming for many white collar jobs. I guess your premise is that mass unemployment is not worth worrying about, so okay then.
Marginal changes in productivity can make huge impacts to industries employment rates.
I am not a software engineer and it seems to me if someone has experience as a software engineer before LLMs, they have skills no one will really be able to acquire again in the same way.
I would expect current software engineers to eat the entire non-customer facing back office in the next ten years.
Wedding photography used to be the lowest in the pecking order of professional photography. Now all the photojournalists, travel magazine and corporate events photographers are as good as extinct. Even the arts market for photography has been in decline for years.
My point wasn't that it's not a big deal. My point there is that if AI ends up taking a large % of white collar work you're going to have a huge portion of the population in the same boat. Maybe an overly optimistic view but that'll end up forcing change through politics
..I also think this is a ridiculously low % chance of happening and it would take something close to AGI to bring about. I don't know how you can use AI regularly and think we're anywhere close to that
Contracting an incurable illness that renders me blind and thus unable to work is just as likely and not something I spend time worrying about
> Marginal changes in productivity can make huge impacts to industries employment rates
Maybe? We also have Jevons paradox. Software is incredibly expensive to build right now - how many more applications for it can people find if the cost halves?
You don't need to be out of a job to struggle. Just for your pay to remain the same (or lower), for your work conditions to degrade (you think jQuery spaghetti was a mess? good luck with AI spaghetti slop), or for competition to increase because now most of the devving involves tedious fixing of AI code and the actual programming-heavy jobs are as fought over as dev roles at Google/Jane Street/etc.
Devving isn't going anywhere, but just like you don't punch cards anymore, you shouldn't expect your role in the coming decades to be the same as it was in the 1990s-2025 period.
My experience is that most developers have little to no understanding about engineering at all: meaning weighting pros and cons, understanding the requirements thoroughly, having a business oriented mindset.
Instead they think engineering is about coding practices and technologies to write better code.
That's because they focus on the code, the craft, not money.
When we achieve true AGI we're truly cooked, but it won't just be software developers by definition of AGI, it will be everyone else too. But the last people in the building before they turn the lights out for good will be the software developers.
It can only replace whoever is not writing a fat cheque to it.
They're specifically saying that they're planning for an overall improvement over the general-purpose GPT 5.2.
Something I have been experimenting with is AI-assisted proofs. Right now I've been playing with TLAPS to help write some more comprehensive correctness proofs for a thing I've been building, and 5.2 didn't seem quite up to it; I was able to figure out proofs on my own a bit better than it was, even when I would tell it to keep trying until it got it right.
I'm excited to see if 5.3 fares a bit better; if I can get mechanized proofs working, then Fields Medal here I come!
Interesting
Need to keep the hype going if they are both IPO'ing later this year.
Consider the fact that 7-year-old TPUs are still sitting at near 100% utilization today.
Compute.
Google didn't announce $185 billion in capex to do cataloguing and flash cards.
Given that they already pre-approved various language and marketing materials beforehand there's no real reason they couldn't just leave it lined up with a function call to go live once the key players make the call.
I suppose coincidences happen too but that just seems too unlikely to believe honestly. Some sort of knowledge leakage does seem like the most likely reason.
When they hook it up to Cerebras it's going to be a head-exploding moment.
I encourage people to try. You can even timebox it and come up with some simple things that might look initially insufficient but that discomfort is actually a sign that there's something there. Very similar to moving from not having unit/integration tests for design or regression and starting to have them.
Looking at the Opus model card I see that they also have by far the highest score for a single model on ARC-AGI-2. I wonder why they didn't advertise that.
We're in the 2400 baud era for coding agents and I for one look forward to the 56k era around the corner ;)
they forgot to add “Can’t wait to see what you do with it”
GPT-5.3-Codex dominates terminal coding with a roughly 12% lead (Terminal-Bench 2.0), while Opus 4.6 retains the edge in general computer use by 8% (OSWorld).
Anyone knows the difference between OSWorld vs OSWorld Verified?
OSWorld is the full 369-task benchmark. OSWorld Verified is a ~200-task subset where humans have confirmed the eval scripts reliably score success/failure — the full set has some noisy grading where correct actions can still get marked wrong.
Scores on Verified tend to run higher, so they're not directly comparable.
[shell_environment_policy]
inherit = "all"
experimental_use_profile = true
[shell_environment_policy.set]
NVM_DIR = "[redacted]"
PATH = "[redacted]"
I.e. `eval "$(/Users/max/.local/bin/mise activate zsh)"` in `.zprofile` and `.zshrc`
Then Codex will respect whatever node you've set as default, e.g.:
mise install node@24
mise use -g node@24
Codex might respect your project-local `.nvmrc` or `mise.toml` with this setup, but I'm not certain. I was just happy to get Codex to not use a version of node installed by brew (as a dependency of some other package).

Serial usecases ("fix these syntax errors") will go on Cerebras and get 10x faster.
Deep usecases ("solve Riemann hypothesis") will become massively parallel and go on slower inference compute.
Teams will stitch both together because some workflows go through stages of requiring deep parallel compute ("scan my codebase for bugs and propose fixes") followed by serial compute ("dedupe and apply the 3 fixes, resolve merge conflict").
This is hilarious lol
In case you missed it. For example:
Nvidia's $100 billion OpenAI deal has seemingly vanished - Ars Technica
https://arstechnica.com/information-technology/2026/02/five-...
Specifically this paragraph is what I find hilarious.
> According to the report, the issue became apparent in OpenAI’s Codex, an AI code-generation tool. OpenAI staff reportedly attributed some of Codex’s performance limitations to Nvidia’s GPU-based hardware.
At what point will LLMs be autonomously self creating new versions of themselves?
May I at least understand what it has "written"? AI help is good, but it shouldn't replace real programmers completely. I've had enough of copy-pasting code I don't understand. What if one day the AI falls down and there are no real programmers left to write the software? AI for help is good, but I don't want AI to write whole files into my project. Then something may break and I won't know what's broken. I've experienced it many times already: I told the AI to write something for me, and the code was not working at all. It was compiling normally but the program was bugged. Or when I was making a bigger project with ChatGPT only, it was mostly working, but after a while, as I prompted for more and more things, everything got broken.
What if you want to write something very complex now that most people don't understand? You keep offering more money until someone takes the time to learn it and accomplish it, or you give up.
I mean, there are still people that hammer out horseshoes over a hot fire. You can get anything you're willing to pay money for.
Am I better off buying 1 month of Codex, Claude, or Antigravity?
I want to have the agent continuously recompile and fix compile errors in a loop until all the bugs from switching to f32 are gone.
Between Codex and Claude, Codex will have much more generous limits for the same price, especially if you use top-of-the-line models (although for your task, Sonnet might actually be good enough).
I'm wanting to do it on an entire programming language made in rust: https://github.com/uiua-lang/uiua
Because there are no float32 array languages in existence today
My goal is to do it within the usage I get from a $20 monthly plan.
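The driver loop for that can be very dumb. A rough sketch, with the agent invocation left as a placeholder (not a real CLI) and a hard cap so it can't burn the whole month's quota:

    import subprocess

    def run_agent(prompt: str) -> None:
        # Placeholder: swap in however you invoke your agent non-interactively.
        subprocess.run(["my-agent", "exec", prompt], check=False)

    for attempt in range(50):  # hard cap on fix rounds
        check = subprocess.run(["cargo", "check", "--all-targets"],
                               capture_output=True, text=True)
        if check.returncode == 0:
            print(f"clean build after {attempt} fix rounds")
            break
        run_agent("Fix these Rust compile errors from the f32 migration, "
                  "changing as little as possible:\n" + check.stderr[-8000:])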
Reliable knowledge cutoff: May 2025, training data cutoff: August 2025
Seems to be slower/thinks longer.
> We are working to safely enable API access soon.
Anyone know if it is possible to use this model with opencode with the plus subscription?
[0]: https://opencode.ai/docs/ecosystem/#:~:text=Use%20your%20Cha...
[1]: https://github.com/numman-ali/opencode-openai-codex-auth
This week, I'm all local though, playing with opencode and running qwen3 coder next on my little spark machine. With the way these local models are progressing, I might move all my llm work locally.
However, when I use the 5.2codex model, I've found it to be very slow and worse (hard to quantify, but I preferred straight-up 5.2 output).
I really do wonder what the chain here is. Did Sam see the Opus announcement and DM someone a minute later?
GPT-4o vs. Google I/O (May 2024): OpenAI scheduled its "Spring Update" exactly 24 hours before Google’s biggest event of the year, Google I/O. They launched GPT-4o voice mode.
Sora vs. Gemini 1.5 Pro (Feb 2024): Just two hours after Google announced its breakthrough Gemini 1.5 Pro model, Sam Altman tweeted the reveal of Sora (text-to-video).
ChatGPT Enterprise vs. Google Cloud Next (Aug 2023): As Google began its major conference focused on selling AI to businesses, OpenAI announced ChatGPT Enterprise.
Not sure why everyone stays focused on getting it from Anthropic or OpenAI directly when there are so many places to get access to these models and many others for the same or less money.
What you can't do is pretend opencode is claude code to make use of that specific claude code subscription.
This really is a non-argument.
Are you really hitting limits, or are you turned off by the fact you think you will?
On Microsoft Foundry I can see the new Opus 4.6 model right now, but GPT-5.3 is nowhere to be seen.
I have a pre-paid account directly with OpenAI that has credits, but if I use that key with the Codex CLI, it can't access 5.3 either.
The press release very prominently includes this quote: "GPT‑5.3-Codex was co-designed for, trained with, and served on NVIDIA GB200 NVL72 systems. We are grateful to NVIDIA for their partnership."
Sounds like OpenAI's ties with their vendors are fraying while at the same time they're struggling to execute on the basics like "make our own models available to our own coding agents", let alone via third-party portals like Microsoft Foundry.
https://gist.github.com/simonw/a6806ce41b4c721e240a4548ecdbe...
Also, there is no reason for OpenAI and Anthropic to be trying to one-up each other's releases on the same day. It is hell for the reader.
Enterprise customers will happily pay even $100/mo subscriptions and it has a clear value proposition that can be decently verified.
Meanwhile the prompt: Crop this photo of my passport
In my experience, you can only use Gemini structured outputs for the most trivial of schemas. No integer literals, no discriminated unions and many more paper cuts. So at least for me, it was completely unusable for what I do at work.
On the upside, they seem to have fixed it: https://blog.google/innovation-and-ai/technology/developers-...
[0]: https://platform.openai.com/docs/guides/function-calling#con...
One thing that pisses me off is this widespread misunderstanding that you can just fall back to function calling (Anthropic's function calling accepts JSON schema for arguments), and that it's the same as structured outputs. It is not. They just dump the JSON schema into the context without doing the actual structured outputs. Vercel's AI SDK does that and it pisses me off because doing that only confuses the model and prefilling works much better.
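To make the difference concrete, a simplified sketch against OpenAI's chat completions API (check the current SDK docs, the exact shape may have shifted): structured outputs constrain decoding to the schema, whereas a schema merely pasted into the context is just a suggestion the model can ignore.

    from openai import OpenAI

    client = OpenAI()

    schema = {
        "type": "object",
        "properties": {"sentiment": {"type": "string",
                                     "enum": ["pos", "neg", "neutral"]}},
        "required": ["sentiment"],
        "additionalProperties": False,
    }

    # Structured outputs: the server constrains generation, so the reply is
    # guaranteed to parse against the schema.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": "Classify: 'this release is great'"}],
        response_format={"type": "json_schema",
                         "json_schema": {"name": "sentiment", "strict": True,
                                         "schema": schema}},
    )
    print(resp.choices[0].message.content)
    # By contrast, an SDK that merely dumps `schema` into the prompt or tool
    # definition gives no such guarantee -- the model can still emit anything.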
BTW, loser is spelled with a single o.
For downvoters, you must be naive to think these companies are not surveilling each other through various means.
With Codex (5.3), the framing is an interactive collaborator: you steer it mid-execution, stay in the loop, course-correct as it works.
With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.
that feels like a reflection of a real split in how people think llm-based coding should work...
some want tight human-in-the-loop control and others want to delegate whole chunks of work and review the result
Interested to see if we eventually see models optimize for those two philosophies and 3rd, 4th, 5th philosophies that will emerge in the coming years.
Maybe it will be less about benchmarks and more about different ideas of what working-with-ai means
> With Opus 4.6, the emphasis is the opposite: a more autonomous, agentic, thoughtful system that plans deeply, runs longer, and asks less of the human.
Ain't the UX the exact opposite? Codex thinks much longer before it gives you back the answer.
https://openspec.dev/
I see around 30 t/s
This might be better with the new teams feature.
When they ask approval for a tool call, press down til the selector is on "No" and press tab, then you can add any extra instructions
Another thing that annoys me is the subagents never output durable findings unless you explicitly tell their parent to prompt the subagent to “write their output to a file for later reuse” (or something like that anyway)
I have no idea how but there needs to be ways to backtrack on context while somehow also maintaining the “future context”…
Having a human in the loop eliminates all the problems that LLMs have, and continuously reviewing small'ish chunks of code works really well in my experience.
It saves so much time having Codex do all the plumbing so you can focus on the actual "core" part of a feature.
LLMs still (and I doubt that changes) can't think and generalize. If I tell Codex to implement 3 features, he won't stop and find a general solution that unifies them unless explicitly told to. This makes it kinda pointless for the "full autonomy" approach, since effectively code quality and abstractions completely go down the drain over time. That's fine if it's just prototyping or "throwaway" scripts, but for bigger codebases where longevity matters it's a dealbreaker.
It's not a waste of time, it's a responsibility. All things need steering, even humans -- there's only so much precision that can be extrapolated from prompts, and as the tasks get bigger, small deviations can turn into very large mistakes.
There's a balance to strike between micro-management and no steering at all.
Most prompts we give are severely information-deficient. The reason LLMs can still produce acceptable results is because they compensate with their prior training and background knowledge.
The same applies to verification: it's fundamentally an information problem.
You see this exact dynamic when delegating work to humans. That's why good teams rely on extremely detailed specs. It's all a game of information.
If it really knows better, then fire everyone and let the agent take charge. lol
Also what are you even proposing/advocating for here?
This meta-state-of-company context is just as capturable as anything else with the right lines of questioning and spyware and UI/UX to elicit it.
You might be able to get away without the review step for a bit, but eventually (and not long) you will be bitten.
Some of the things it occasionally does:
- Ignores conventions (even when emphasized in the CLAUDE.md)
- Decides to just not implement tests if it spins out on them too much (it tells you, but only as it happens, and that scrolls by pretty quick)
- Writes badly performing code (N+1)
- Does more than you asked (in a bad way, changing UIs or adding cruft)
- Makes generally bad assumptions
I'm not trying to be overly negative, but in my experience to date, you still need to babysit it. I'm interested though in the idea of using multiple models to have them perform independent reviews to at least flag spots that could use human intervention / review.
Every mistake is a chance to fix the system so that mistake is less likely or impossible.
I rarely fix anything in real time - you review, see issues, fix them in the spec, reset the branch back to zero and try again. Generally, the spec is the part I develop interactively, and then set it loose to go crazy.
This feels, initially, incredibly painful. You're no longer developing software, you're doing therapy for robots. But it delivers enormous compounding gains, and you can use your agent to do significant parts of it for you.
Or, really, hacking in "learning", building your knowhow-base.
> But it delivers enormous compounding gains, and you can use your agent to do significant parts of it for you.
Strong yes to both, so strong that it's curious Claude Code, Codex, Claude Cowork, etc., don't yet bake in an explicit knowledge evolution agent curating and evolving their markdown knowledge base:
https://github.com/anthropics/knowledge-work-plugins
Unlikely to help with benchmarks. Very likely to improve utility ratings (as rated by outcome improvements over time) from teams using the tools together.
For those following along at home:
This is the return of the "expert system", now running on a generalized "expert system machine".
This sounds like never. Most businesses are still shuffling paper and couldn’t give you the requirements for a CRUD app if their lives depended on it.
You’re right, in theory, but it’s like saying you could predict the future if you could just model the universe in perfect detail. But it’s not possible, even in theory.
If you can fully describe what you need to the degree ambiguity is removed, you’ve already built the thing.
If you can’t fully describe the thing, like some general “make more profit” or “lower costs”, you’re in paper clip maximizer territory.
Trying to get my company to realize this right now.
Probably the most efficient way to work, would be on a video call including the product person/stakeholder, designer, and me, the one responsible for the actual code, so that we can churn through the now incredibly fast and cheap implementation step together in pure alignment.
You could probably do it async but it’s so much faster to not have to keep waiting for one another.
That could easily be automated.
specifically, the GPT-5.3 post explicitly leans into "interactive collaborator" language and steering mid-execution
OpenAI post: "Much like a colleague, you can steer and interact with GPT-5.3-Codex while it’s working, without losing context."
OpenAI post: "Instead of waiting for a final output, you can interact in real time—ask questions, discuss approaches, and steer toward the solution"
Claude post: "Claude Opus 4.6 is designed for longer-running, agentic work — planning complex tasks more carefully and executing them with less back-and-forth from the user."
I don’t think there’s something deeply philosophical in here, especially as Claude Code is pushing stronger for asking more questions recently, introduced functionality to “chat about questions” while they’re asked, etc.
On further prompting it did the next step and terminated early again after printing how it would proceed.
It's most likely just a bug in GitHub Copilot, but it seems weird to me that they add models that clearly don't even work with their agentic harness.
I haven’t used Codex but use Claude Code, and the way people (before today) described Codex to me was like how you’re describing Opus 4.6
So it sounds like they’re converging toward “both these approaches are useful at different times” potentially? And neither want people who prefer one way of working to be locked to the other’s model.
This feels wrong, I can't comment on Codex, but Claude will prompt you and ask you before changing files, even when I run it in dangerous mode on Zed, I can still review all the diffs and undo them, or you know, tell it what to change. If you're worried about it making too many decisions, you can pre-prompt Claude Code (via .claude/instructions.md) and instruct it to always ask follow up questions regarding architectural decisions.
Sometimes I go out of my way to tell Claude DO NOT ASK ME FOR FOLLOW UPS JUST DO THE THING.
I guess it's also quite interesting that the way they are framing these projects is the opposite of how people currently perceive them, and I guess that may be a conscious choice...
I usually want the codex approach for code/product "shaping" iteratively with the ai.
Once things are shaped and common "scaling patterns" are well established, then for things like adding a front end (which is constantly changing, more views) then letting the autonomous approach run wild can *sometimes* be useful.
I have found that codex is better at remembering when I ask to not get carried away...whereas claude requires constant reminders.
This is true, but I find that Codex thinks more than Opus. That's why 5.2 Codex was more reliable than Opus 4.5
I would much rather work with things like the Chat Completion API than any frameworks that compose over it. I want total control over how tool calling and error handling works. I've got concerns specific to my business/product/customer that couldn't possibly have been considered as part of these frameworks.
Whether or not a human needs to be tightly looped in could vary wildly depending on the specific part of the business you are dealing with. Having a purpose-built agent that understands where additional verification needs to occur (and not occur) can give you the best of both worlds.
I’ve had both $200 plans and now just have Max x20 and use the $20 ChatGPT plan for an inferior Codex.
My experience (up until today) has always been that Codex acts like that one Sr Engineer that we all know. They are kind of a dick. And will disappear into a dark hole and emerge with a circle when you asked for a pentagon. Then let you know why edges are bad for you.
And yes, Anthropic is pivoting hard into everything agentic. I bet it’s not too long before Claude Code stops differentiating models. I had Opus blow 750k tokens on a single small task.
There are hundreds of people who upload videos of Codex 5.2 running for hours unattended and coming back with full commits
I mean Opus asks a lot if he should run things, and each time you can tell it to change. And if that's not enough you can always press esc to interrupt.