One interesting thing is the number of internal thought leaders who swear by the Flash models over the Pro models. Whether they're right doesn't really matter; the interesting bit to me is that we've reached a point where "better" models are not necessarily more useful, and that faster models plus more work on the harnesses may be the better trade-off.
I've seen people outside Google favoring the Flash Gemini models over the Pro ones.
There are also some benchmarks where flash models have higher scores, so yes, apparently speed does matter.
As for actually solving problems, ignoring the VS Code extension aspect, I find all three premier models to be excellent coding agents for my purposes.
I'd say I'm surprised by it, but uh
Most of them were vibecoded in days, so what do you expect? And new versions just add features, they never fix the old cruft.
Probably there would be some money to be made if someone actually takes the time to write a good agent harness.
This is a bunch of gabagoo. Wrong on so many levels, it's not even worth reading further.
a) Google has agentic coding in both Antigravity and CLI forms. While it's not at the level of Claude Code + Opus, it's still decent.
b) Google has their own versions of models trained on internal code.
c) Google has Claude in Vertex, and they can most definitely set it up in secure zones (like they do for their clients), so they'd be able to use Claude (at cost) within their own projects.
Hoping they can figure it out sooner rather than later.
If internal staff aren't happy with the tools they build, that should typically drive improvements to those tools.
He made a follow-up after the pushback by GDM.
Google’s businesses are very broad and durable. But Google being the only company in the world without access (except for GDM+labs) to a competent coding agent will take a toll.
We’ll see how long Google can hold out hoping for GDM to create something that is competitive.
I'm guessing that within 6 months Google will give up on coding and finally let their devs use Claude/Codex.
This isn’t a security problem, this is a GDM issue with GDM’s promises being far beyond their ability.
Do we have other examples of AI being used to improve LLMs, apart from the creation of synthetic data and the testing of the models?
A more efficient transformer just costs less to run.
"AI improving AI" would be if one generation of AI designed a next-gen AI that was fundamentally more capable (not just faster/cheaper) than itself. A reptilian brain that could autonomously design a mammalian brain.
Even when hooked up into a smart harness like AlphaEvolve, I don't think LLMs have the creativity to do this, unless the next-gen architecture is hiding in plain sight as an assemblage of parts that an LLM can be coaxed into predicting.
More likely it'll take a few more steps of human innovation, steps towards AGI, before we have an AI capable of autonomous innovation rather than just prompted mashup generation.
Yes, last year when they revealed AlphaEvolve, it turned out they had used a previous Gemini model to improve kernels used in training this generation's models, netting them a 1% faster training run. Not much, but still.
There could still be hard constraints that make a singularity intractable, or a time horizon so long that it's not practical, right?
This is the thing to look for in 2027, imho. All the big AI labs have big projects working on research agents, also specifically into improving AI (duh) and I expect a lot of that to get out of the experimental phases this year.
Next year they actually get to do a lot of work and I think we will see the first big effective architectural change co-invented by AI.
It's a simple harness around Opus, but with tight integration with Hugging Face infra, so the agent can read papers, test code, and launch experiments.
Re: hyperparameter tuning and autoresearch: https://news.ycombinator.com/item?id=47444581
Parameter-free LLMs would be cool
There are only 3 companies doing this to date: Google, Sakana AI and Autohand AI.
It often feels like they do not want me to develop applications for corporate clients using their Vertex API. It is just such a shame, given that their models were so great for document analysis etc.
In the past, we used a wrapper that round-robined across multiple projects to get enough quota. Luckily, many of our workloads are workflow-style tasks, so we can simply keep retrying on 429s.
Fun fact: for one of their services, I think it was Stitch, I noticed that my paid key kept hitting quota limits while the free one worked fine. That blew my mind.
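For the curious, the wrapper was roughly this shape (the project IDs, the stubbed client call, and the backoff numbers are all made up for illustration):

```python
import itertools
import random
import time

# Hypothetical project IDs; each GCP project gets its own Vertex quota.
PROJECTS = ["proj-a", "proj-b", "proj-c"]
_project_cycle = itertools.cycle(PROJECTS)


class QuotaExceeded(Exception):
    """Stand-in for whatever 429 error the real client library raises."""


def call_model(project_id: str, prompt: str) -> str:
    # Placeholder for the actual Vertex call; a real version would build
    # a client pinned to project_id. Stubbed so the sketch runs standalone.
    if random.random() < 0.3:
        raise QuotaExceeded("429: quota exceeded")
    return f"response from {project_id}"


def generate(prompt: str, max_attempts: int = 10) -> str:
    """Round-robin across projects, backing off and retrying on 429s."""
    for attempt in range(max_attempts):
        project = next(_project_cycle)
        try:
            return call_model(project, prompt)
        except QuotaExceeded:
            # Workflow-style tasks tolerate latency, so just wait and retry.
            time.sleep(min(2 ** attempt, 30))
    raise RuntimeError("all projects out of quota")


print(generate("Summarize this document."))
```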
We generally avoid any Google AI for the most part because it's so unreliable.
I can't read the Nature paper about DeepConsensus, but the summary doesn't really explain what role AlphaEvolve played in improving DeepConsensus. It would be nice to be able to read about what it actually did, and whether it got there through traditional or novel methods.
How do I access AlphaEvolve?
[0] https://github.com/algorithmicsuperintelligence/openevolve
AE brings diversity from the genetic-algorithms community to large-scale, optimized deep learning and RL models.
It is a mandatory step for moving forward. The approach is clean and simple, while generic.
The only caveat is the per-problem definition of the MAP-Elites dimensions. But surely this will get tackled somehow over the next few years.
If you don't know about MAP-Elites, go look up Jean-Baptiste Mouret's work and talks; it's both very interesting and universal.
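If you just want the core idea without the talks, here's a minimal toy sketch of MAP-Elites (the objective, descriptor, and grid size are all invented for illustration): instead of keeping one global best, it keeps the best solution per cell of a behavior-descriptor grid.

```python
import random

GRID = 10          # bins per descriptor dimension
archive = {}       # (bin_x, bin_y) -> (fitness, solution)

def fitness(x):
    # Toy objective with its peak at (0.3, 0.7).
    return -(x[0] - 0.3) ** 2 - (x[1] - 0.7) ** 2

def descriptor(x):
    # The per-problem part: map a solution to its niche in the grid.
    return (min(int(x[0] * GRID), GRID - 1),
            min(int(x[1] * GRID), GRID - 1))

def add(x):
    cell, f = descriptor(x), fitness(x)
    # Keep x only if it beats the current elite of its own niche.
    if cell not in archive or f > archive[cell][0]:
        archive[cell] = (f, x)

# Seed randomly, then keep mutating randomly chosen elites.
for _ in range(100):
    add([random.random(), random.random()])
for _ in range(10_000):
    _, parent = random.choice(list(archive.values()))
    child = [min(max(v + random.gauss(0, 0.1), 0.0), 1.0) for v in parent]
    add(child)

best_f, best_x = max(archive.values())
print(f"{len(archive)} niches filled; best fitness {best_f:.4f} at {best_x}")
```

The diversity comes from the archive: mediocre-but-different solutions survive in their own niches and keep feeding mutations, which is exactly what a single-best optimizer loses.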
- 2021-2024 was Denial
- 2024-2025 was Anger and Bargaining
- 2026 seems to be some combo of Anger, Bargaining, and Acceptance, depending mostly on your class/age
Ah good, we're getting closer and closer to Venus, Inc. every day. /s
Humans get bored, impatient, or run out of time, and so often give up at what they perceive to be a decent "local minimum". Early verification harnesses that used GPT-4 to optimize robot reward functions succeeded largely because the LLM just kept going (link below). As long as grinding through the same evaluation infrastructure is too boring for a human, this remains an agent skill.
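The shape of such a harness is simple enough to sketch. Everything below is a stand-in (in the real thing, propose_candidate would be an LLM call and evaluate an expensive simulator or benchmark run); the point is just that the loop never gets bored:

```python
import random

def propose_candidate(best: str, score: float) -> str:
    # Placeholder for the LLM call: given the current best reward
    # function (or kernel), ask for a mutated variant.
    return f"candidate-{random.randint(0, 10**6)}"

def evaluate(candidate: str) -> float:
    # Placeholder for the expensive part: a simulator rollout,
    # a benchmark run, a test suite, etc.
    return random.random()

best, best_score = "seed", evaluate("seed")
# The whole trick: the loop never gets bored, impatient, or busy.
for _ in range(1000):
    candidate = propose_candidate(best, best_score)
    score = evaluate(candidate)
    if score > best_score:
        best, best_score = candidate, score

print(f"best score after 1000 tedious iterations: {best_score:.4f}")
```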
In a sentence: these foundation models are really good at optimizing extremely high-level, extremely well-defined problem spaces (i.e., multiplying matrices faster). In Antirez's case, it's "make Redis faster".
There have been two reactions: "Oh, it would never work for me" and "I have seen months of my life accomplished in an hour", and I think they're both right. I think we should be excited for Antirez (who has since been popping off [1]), and I think the rest of us can rest easy knowing that LLMs can't (and maybe were never meant to) tackle the tacit-knowledge-filled, human-system-centric, ambiguously-defined-problem-space jobs most mortals work.
[0] https://antirez.com/news/158 [1] https://antirez.com/news/164
I don't believe that anymore, to be honest. Models are starting to get good at ambiguity: Claude Code now asks me when something is ambiguous. Soon, all meetings will be recorded, transcribed, and stored in a well-indexed place for agents to search when faced with ambiguity (free startup idea here!). If they can ask you now, they'll be able to search for the answers themselves once that's possible. In fact, they already do it if you have a well-documented Notion/Confluence; it's just that almost nobody has one.
It's probably harder to RL for "identify ambiguity" than RL'ing for performance algorithms, sure, but it's not impossible and it's in the works. It's just a matter of time now.
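To make the "well-indexed place" concrete, here's a toy sketch (a keyword index standing in for embeddings plus a vector DB; the transcripts are invented): the agent calls a search tool instead of interrupting a human.

```python
from collections import defaultdict

# Toy transcript store: meeting id -> transcript text. A real system
# would use embeddings and a vector DB; keywords keep the sketch short.
transcripts = {
    "2026-01-10 roadmap sync": "We agreed the export feature ships as CSV only.",
    "2026-01-17 design review": "Auth stays on the legacy service until Q3.",
}

index = defaultdict(set)  # word -> meeting ids containing it
for meeting, text in transcripts.items():
    for word in text.lower().split():
        index[word.strip(".,")].add(meeting)

def resolve_ambiguity(query: str) -> list[str]:
    # The tool an agent calls when the spec is ambiguous: return
    # transcript lines mentioning any of the query terms.
    hits = set()
    for word in query.lower().split():
        hits |= index.get(word.strip(".,"), set())
    return [f"{m}: {transcripts[m]}" for m in sorted(hits)]

# "Should the export be CSV or JSON?" -> the agent searches, not asks.
print(resolve_ambiguity("export csv json"))
```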
That's fair, and something I've observed too. I wish I had written "the rest of us shouldn't freak out and quit software today".
But here's another data point: at the biotech I work for, writing good code has never been the bottleneck. I actually told my boss that a paid Claude subscription wouldn't be much more valuable than the free one, because even if it took every piece of code or algorithm we've ever written and 10x-ed the hell out of them, we'd still be bottlenecked by the biology and physics, which dictate that we wait 24 days for our histology assay pipeline.
I have a hunch most fields outside of software are this way. And I'm personally not planning to quit anytime soon.
We were doing that over at Vowel a few years back, unfortunately it didn't pan out because you're competing directly against Zoom, Google Meet, Microsoft Teams, etc. They are all (slowly) catching up to where we were as a scrappy startup 4 years ago.
It was truly game-changing to have all of your meetings in an easily searchable database. Even as a human.
It works really well.
Maybe this rephrase will help: the proposed solution is to render all knowledge explicit.
Full transparency has a cost, and we cannot afford it.
Slack is kinda there with Salesforce: you can already do a lot on Agentforce and in Slackbot, but the two aren't integrated just yet, and Slackbot doesn't support group chats/channels. One interesting aspect of this will be: who takes precedence, boss, client, analyst, or developer?
Also, we are seeing a cultural shift around this as well. People now bring "AI notetakers" to Zoom calls without even asking for your permission. People are already acting like privacy laws don't exist anymore, which will make it even easier for the AI lobby to take them down, just like piracy normalized copyright infringement and opened the path to the current rulings around "fair training".
What if (when?) (AI-assisted) research moves AI beyond LLMs? Do you think that can't happen?
I'm pretty sure money is not going to be the blocker.
[0] https://hai.stanford.edu/ai-index/2026-ai-index-report
LeCun argues that most human reasoning is grounded in the physical world, not language, and that AI world models are necessary to develop true human-level intelligence. “The idea that you’re going to extend the capabilities of LLMs [large language models] to the point that they’re going to have human-level intelligence is complete nonsense,” he said. [0]
[0] https://www.wired.com/story/yann-lecun-raises-dollar1-billio...
AI is hands down the most researched topic in CS departments. Of the 10 largest companies (by market cap), only 3 aren't balls-deep in AI R&D. The fastest growing (private or public) companies by revenue are also almost all companies focused primarily on AI (Anthropic, OpenAI, xAI, Scale AI, Nvidia).
And the money isn't even the most important part. It's all about mindshare and collective research time. The architectural concepts can be researched and developed on top of open models, so even individual, relatively poor researchers unaffiliated with any institution can make breakthroughs.
Even the computing required for the legendary "Attention is all you need" paper could probably be recreated on con-/prosumer hardware in a month's time.
Realistically, one could build an AI capable of reasoning (i.e., recurrent loops with branches) using very basic models that fit on a 3090, with a multi-agent configuration along the lines of https://github.com/gastownhall/gastown. Nobody has done it yet because we don't know how many agents are required or what their prompts should look like.
The fundamental philosophical problem is whether such a configuration can be arrived at through training, or whether AI agents have to go through equivalent "evolution epochs" in a simulated environment to get there. Either way, those prompts and models have to be information-agnostic.
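FWIW, the coordination shape is easy to sketch even with the model stubbed out; the open questions are exactly the role list and the prompts, both of which are invented below:

```python
import random

ROLES = ["planner", "critic", "worker"]  # the unknowns: how many, which prompts

def local_model(role: str, prompt: str) -> str:
    # Placeholder for a small model that fits on a 3090 (served via
    # llama.cpp, vLLM, etc.). Stubbed so the sketch runs standalone.
    return f"[{role}] take on: {prompt[:50]}"

def reason(task: str, max_rounds: int = 5) -> str:
    state = task
    for _ in range(max_rounds):
        # Branch: each agent reacts to the shared state independently.
        branches = [local_model(role, state) for role in ROLES]
        # Recur: a judge folds the branches back into one shared state.
        state = local_model("judge", " | ".join(branches))
        if random.random() < 0.3:  # a real stop test would be model-decided
            break
    return state

print(reason("Prove that the sum of two even numbers is even."))
```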
1. Amazing, you just squeezed out a 1% efficiency gain.
2. You idiot, you just spent an hour trying to troubleshoot a hallucinated API.
On average, it's really hard to tell which one is going to win here.
Imagine going back to 2020 and telling people that in 6 years they'd be able to spend $200 a month to spin up $2mm worth of GPUs at full throttle to respond to their emails. None of this makes sense.
LLMs are a "complicated solution" in the sense that they're expensive. Once you know what they're capable of, you can scale them down to something less expensive. There's usually a way.
Also, an important advantage of LLMs over other approaches is that it's easy to improve them by finding better ways of prompting them. Those prompting strategies can then get hard-coded into the models to make them more efficient. Rinse and repeat. Similarly, you can produce curated data to make them better in certain areas like programming or mathematics.
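A minimal sketch of that rinse-and-repeat loop (the model call is stubbed and the strategy invented): log the strategy-augmented answers against the bare questions, and the pairs become fine-tuning data that bakes the strategy into the weights.

```python
import json

def model(prompt: str) -> str:
    # Placeholder for any LLM call; stubbed so the sketch runs.
    return "worked answer for: " + prompt.splitlines()[-1]

# The "better way of prompting": cheap to find, needs no retraining.
STRATEGY = "Think step by step, list edge cases, then answer:\n"

def improved(question: str) -> str:
    return model(STRATEGY + question)

# Pair the *bare* question with the strategy-augmented answer; the
# resulting pairs are fine-tuning data, so the deployed model no
# longer needs the long prompt at inference time.
dataset = [{"prompt": q, "completion": improved(q)}
           for q in ["What is 17 * 24?", "Reverse the string 'abc'."]]
print(json.dumps(dataset, indent=2))
```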
A statement all but guaranteed to look incredibly short-sighted by 2030.
The real question to me is whether the system can pay for itself. Economics is racing against efficiency gains, and it's anyone's guess which wins.