It was R1 with its RL training that made the news and crashed the stock market.
As these calendars also rely on time zones for date calculation, there are rare occasions where the New Year start date differs by an entire month between 2 countries.
This non-problem sounds like it's on the same scale as "The British Isles", a term which is mildly annoying to Irish people but in common use everywhere else.
And don't get me started with "Lunar New Year? What Lunar New Year? Islamic Lunar New Year? Jewish Lunar New Year? CHINESE Lunar New Year?".
[0] https://www.mom.gov.sg/employment-practices/public-holidays
As it turns out, people in China don’t name their holidays based off of what the laws of New York or California say.
https://en.wikipedia.org/wiki/Indian_New_Year%27s_days#Calen...
That said, "Lunar New Year" is probably as good a compromise as any, since we have other names for the Hebrew and Islamic New Years.
The Islamic calendar originated in Arabia. Calling it an Asian lunar calendar wouldn't be inaccurate.
Is "Gemini 3 Deep Think" even technically a model? From what I've gathered, it is built on top of Gemini 3 Pro, and appears to be adding specific thinking capabilities, more akin to adding subagents than a truly new foundational model like Opus 4.6.
Also, I don't understand the comments about Google being behind in agentic workflows. I know that the typical use of, say, Claude Code feels agentic, but also a lot of folks are using separate agent harnesses like OpenClaw anyway. You could just as easily plug Gemini 3 Pro into OpenClaw as you can Opus, right?
Can someone help me understand these distinctions? Very confused, especially regarding the agent terminology. Much appreciated!
- a product (most accurate here imo)
- a specific set of weights in a neural net
- a general architecture or family of architectures (BERT models)
So while you could argue this is a “model” in the broadest sense of the term, it’s probably more descriptive to call it a product. Similarly we call LLMs “language” models even if they can do a lot more than that, for example draw images.
I probably won’t even assume it’s the OG BERT. It could be ModernBERT or RoBERTa or one of any number of other variants, and simply saying it’s a BERT model is usually the right level of detail for the conversation.
Then marketing and a huge amount of capital came.
It has to do with how the model is RL'd. It's not that Gemini can't be used with various agentic harnesses, like OpenCode or OpenClaw or theoretically even Claude Code. It's just that the model is trained less effectively to work with those harnesses, so it produces worse results.
My assumption is that they're all either pretty happy with their base models or unwilling to do those larger runs, and post-training is turning out good results that they release quickly.
I don’t think it’s hyperbolic to say that we may be only a single digit number of years away from the singularity.
And yes, you are probably using them wrong if you don’t find them useful or don’t see the rapid improvement.
Every new model release, the neckbeards come out of their basements to tell us the singularity will be here in two more weeks
The logic related to the bug wasn't all contained in one file, but across several files.
This was Gemini 2.5 Pro. A whole generation old.
Consider that a nonzero percent of otherwise competent adults can't write in their native language.
Consider that some tens of percent of people wouldn't have the foggiest idea of how to calculate a square root, let alone a cube root.
Consider that well less than half of the population has ever seen code let alone produced functioning code.
The average adult is strikingly incapable of things that the average commenter here would consider basic skills.
You’ve once again made up a claim of “two more weeks” to argue against even though it’s not something anybody here has claimed.
If you feel the need to make an argument against claims that exist only in your head, maybe you can also keep the argument only in your head too?
Also, did you use Codex 5.3 Xhigh through the Codex CLI or Codex App?
On the other hand, prayer doesn’t heal anybody and there’s no proof of supernatural beings.
Projects:
https://github.com/alexispurslane/oxen
https://github.com/alexispurslane/org-lsp
(Note that org-lsp has a much improved version of the same indexer as oxen; the first was purely my design, the second I decided to listen to K2.5 more and it found a bunch of potential race conditions and fixed them)
shrug
I had a test failing because I introduced a silly comparison bug (> instead of <), and Claude 4.6 Opus figured out the problem wasn't the test but the code, and fixed the bug (which I had missed).
What do you believe this shows? Sometimes I have difficulty finding bugs in other people's code when they do things in ways I would never use. I can rewrite their code so it works, but I can't necessarily quickly identify the specific bug.
Expecting a model to be perfect on every problem isn't reasonable. No known entity is able to do that. AIs aren't supposed to be gods.
(Well not yet anyway - there is as yet insufficient data for a meaningful answer.)
That statement is plausible. However, extrapolating that to assert all the very different things which must be true to enable any form of 'singularity' would be a profound category error. There are many ways in which your first two sentences can be entirely true, while your third sentence requires a bunch of fundamental and extraordinary things to be true for which there is currently zero evidence.
Things like LLMs improving themselves in meaningful and novel ways and then iterating that self-improvement over multiple unattended generations in exponential runaway positive feedback loops resulting in tangible, real-world utility. All the impressive and rapid achievements in LLMs to date can still be true while major elements required for Foom-ish exponential take-off are still missing.
We're back to singularity hype, but let's be real: benchmark gains are meaningless in the real world when the primary focus has shifted to gaming the metrics
Benchmaxxing exists, but that’s not the only data point. It’s pretty clear that models are improving quickly in many domains in real world usage.
They're still afflicted by the same fundamental problems that hold LLMs back from being a truly autonomous "drop-in human replacement" that would enable an entire new world of use cases.
And finally live up to the hype/dreams many of us couldn't help but feel were right around the corner circa 2022/3, when things really started taking off.
Next week Chinese New year -> Chinese labs release all the models at once before it starts -> US labs respond with what they have already prepared
Also note that even in US labs a large proportion of researchers and engineers are Chinese, and many celebrate Chinese New Year too.
TLDR: Chinese New Year. Happy Horse year everybody!
It had already rained at the beginning of the meeting. During the same, however, a heavy thunderstorm set in, whereby our electric light line was put out of operation. Wax candles with beer bottles as light holders provided the lighting. In the meantime the rain had fallen in a cloudburst-like manner, so that one needed help to get one's automobile going. In some streets the water stood so high that one could reach one's home only by detours. In this night 9.65 inches of rain had fallen.
Any time I upload an attachment, it just fails with something vague like "couldn't process file". Whether that's a simple .MD or .txt with less than 100 lines or a PDF. I tried making a gem today. It just wouldn't let me save it, with some vague error too.
I also tried having it read and write stuff to "my stuff" and Google drive. But it would consistently write but not be able to read from it again. Or would read one file from Google drive and ignore everything else.
Their models are seriously impressive. But as usual Google sucks at making them work well in real products.
Context window blowouts? All the time, but never document upload failures.
However, it took me a week to ditch Gemini 3 as a user. The hallucinations were off the charts compared to GPT-5. I've never even bothered with their CLI offering.
Also, because of the long context window (1M tokens on Thinking and Pro! Claude and OpenAI only have 128k), Deep Research is the best
That being said, for coding I definitely still use Codex with GPT 5.3 XHigh lol
The models feel terrible, somehow, like they're being fed terrible system prompts.
Plus the damn thing kept crashing and asking me to "restart it". What?!
At least Kiro does what it says on the tin.
I've recently tried a buuuuunch of stuff (including Antigravity and Kiro) and I really, really, could not stomach Antigravity.
Google is great at some things, but this isn't it.
It is also one of the worst models to have a sort of ongoing conversation with.
Not a single person is using it for coding (outside of Google itself).
Maybe some people on a very generous free plan.
Their model is a fine mid 2025 model, backed by enormous compute resources and an army of GDM engineers to help the “researchers” keep the model on task as it traverses the “tree of thoughts”.
But that isn’t “the model” that’s an old model backed by massive money.
Come on.
Worthless.
Do you have any market counterpoints?
Market counterpoints that aren't really just a repackaging of:
1. "Google has the world's best distribution" and/or
2. "Google has a firehose of money that allows them to sell their 'AI product' at an enormous discount?
Good luck! Their chat feels similar. It just runs off like a wild dog.
Tool calling failures, hallucinations, bad code output. It felt like using a coding model from a year ago.
Even just as a general use model, somehow ChatGPT has a smoother integration with web search (than google!!), knowing when to use it, and not needing me to prompt it directly multiple times to search.
Not sure what happened there. They have all the ingredients in theory but they've really fallen behind on actual usability.
Their image models are kicking ass though.
Peacetime Google is slow, bumbling, bureaucratic. Wartime Google gets shit done.
I don't think this is intentional, but I think they stopped fighting SEO entirely to focus on AI. Recipes are the best example: completely gutted, and almost all recipe sites (therefore the entire search page) are run by the same company. I didn't realize how utterly consolidated huge portions of information on the internet were until every recipe site, about 3 months ago, simultaneously implemented the same anti-Adblock.
Apple made a social network called Ping. Disaster. MobileMe was silly.
Microsoft made Zune and the Kin 1 and Kin 2 devices and Windows phone and all sorts of other disasters.
These things happen.
I think you overestimate how much your average person-on-the-street cares about LLM benchmarks. They already treat ChatGPT or whichever as generally intelligent (including to their own detriment), are frustrated about their social media feeds filling up with slop and, maybe, if they're white-collar, worry about their jobs disappearing due to AI. Apart from a tiny minority in some specific field, people already know themselves to be less intelligent along any measurable axis than someone somewhere.
Anyone with any sense is interested in how well these tools work and how they can be harnessed, not some imaginary milestone that is not defined and cannot be measured.
Gemini has flashes of brilliance, but I regard it as unpolished: some things work amazingly, some basics don't work.
I subscribe to both Gemini ($20/mo) and ChatGPT Pro ($200/mo).
If I give the same question to "Gemini 3.0 Pro" and "ChatGPT 5.2 Thinking + Heavy thinking", the latter is 4x slower and it gives smarter answers.
I shouldn't have to enumerate all the different plausible explanations for this observation. Anything from Gemini deciding to nerf the reasoning effort to save compute, versus TPUs being faster, to Gemini being worse, to this being my idiosyncratic experience, all fit the same data, and are all plausible.
Afaik, Google has had no breaches ever.
Privacy, not so much. How many hundreds of millions have they been fined for “incognito mode” in chrome being a blatant lie?
And when I swap back into the Gemini app on my iPhone after a minute or so, the chat disappears. And other weird passive-aggressive take-my-toys-away behavior if you don't bare your body and soul to Googlezebub.
ChatGPT and Grok work so much better without accounts or with high privacy settings.
Been using Gemini + OpenCode for the past couple weeks.
Suddenly, I get a "you need a Gemini Access Code license" error but when you go to the project page there is no mention of this or how to get the license.
You really feel the "We're the phone company and we don't care. Why? Because we don't have to." [0] when you use these Google products.
PS for those that don't get the reference: US phone companies in the 1970s had a monopoly on local and long distance phone service. Similar to Google for search/ads (really a "near" monopoly but close enough).
I guess it depends a lot on what you use LLMs for and how they are prompted. For example, Gemini fails the simple "count from 1 to 200 in words" test whereas Claude does it without further questions.
Another possible explanation would be that processing time is distributed unevenly across the globe and companies stay silent about this. Maybe depending on time zones?
Requests regularly time out, the whole window freezes, it gets stuck in schizophrenic loops, edits cannot be reverted and more.
It doesn't even come close to Claude or ChatGPT.
Also the worst model in terms of hallucinations.
Claude Code is great for coding, Gemini is better than everything else for everything else.
The arc-agi-2 score (84.6%) is from the semi-private eval set. If gemini-3-deepthink gets above 85% on the private eval set, it will be considered "solved"
>Submit a solution which scores 85% on the ARC-AGI-2 private evaluation set and win $700K. https://arcprize.org/guide#overview
edit: they just removed the reference to "3.1" from the pdf
- non thinking models
- thinking models
- best-of-N models like Deep Think and GPT Pro
Each one is of a certain computational complexity. Simplifying a bit, I think they map to linear, quadratic, and n^3 respectively.
I think there are certain class of problems that can’t be solved without thinking because it necessarily involves writing in a scratchpad. And same for best of N which involves exploring.
Two open questions
1) what’s the higher level here, is there a 4th option?
2) can a sufficiently large non-thinking model perform the same as a smaller thinking one?
edit: i don't know how this is meaningfully different from 3
Yeah, these are made possible largely by better use at high context lengths. You also need a step that gathers all the Ns and selects the best ideas / parts and compiles the final output. Goog have been SotA at useful long context for a while now (since 2.5 I'd say). Many others have come with "1M context", but their usefulness after 100k-200k is iffy.
What's even more interesting than maj@n or best-of-n is pass@n. For a lot of applications you can frame the question and search space such that pass@n is your success rate. Think security exploit finding. Or optimisation problems with quick checks (better algos, kernels, infra routing, etc). It doesn't matter how good your pass@1 or avg@n is; all you care about is that you find more as you spend more time. Literally throwing money at the problem.
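As a back-of-the-envelope sketch (my illustration, not the commenter's): if each attempt is independent with a fixed per-attempt success rate p, then pass@n = 1 - (1 - p)^n, so even a tiny pass@1 compounds quickly as you buy more attempts.

    # Hedged sketch: assumes independent attempts with per-attempt success rate p.
    def pass_at_n(p: float, n: int) -> float:
        # Probability that at least one of n attempts succeeds.
        return 1.0 - (1.0 - p) ** n

    # A 1% pass@1 already gives ~63% pass@100 and ~99.996% pass@1000.
    for n in (1, 10, 100, 1000):
        print(n, round(pass_at_n(0.01, n), 5))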
Ultimately, the only real difference between no-thinking and thinking models is the amount of tokens used to reach the final answer. Whether those extra scratchpad tokens are between <think></think> tags or not doesn't really matter.
Models from Anthropic have always been excellent at this. See e.g. https://imgur.com/a/EwW9H6q (top-left Opus 4.6 is without thinking).
Previous models including Claude Opus 4.6 have generally produced a lot of noise/things that the compiler already reliably optimizes out.
We download the stl and import to bambu. Works pretty well. A direct push would be nice, but not necessary.
let me know how it goes!
This version of DeepSeek got it first try. Thinking time was 2 or 3 minutes.
The visual reasoning of this class of Gemini models is incredibly impressive.
Google has definitely been pulling ahead in AI over the last few months. I've been using Gemini and finding it's better than the other models (especially for biology where it doesn't refuse to answer harmless questions).
IMO it's the other way around. Benchmarks only measure applied horse power on a set plane, with no friction and your elephant is a point sphere. Goog's models have always punched over what benchmarks said, in real world use @ high context. They don't focus on "agentic this" or "specialised that", but the raw models, with good guidance are workhorses. I don't know any other models where you can throw lots of docs at it and get proper context following and data extraction from wherever it's at to where you'd need it.
Usually, when you decrease false positive rates, you increase false negative rates.
Maybe this doesn't matter for models at their current capabilities, but if you believe that AGI is imminent, a bit of conservatism seems responsible.
And I wonder how Gemini Deep Think will fare. My guess is that it will get half the way on some problems. But we will have to take an absence as a failure, because nobody wants to publish a negative result, even though it's so important for scientific research.
The 5-day window, however, is a sweet spot because it likely prevents cheating by hiring a math PhD to feed the AI hints and ideas.
https://hn.algolia.com/?q=1stproof
This is exactly the kind of challenge I would want to judge AI systems based on. It required ten bleeding-edge-research mathematicians to publish a problem they've solved but hold back the answer. I appreciate the huge amount of social capital and coordination that must have taken.
I'm really glad they did it.
I really only use gemini-3-pro occasionally when researching and trying to better understand something. I guess I am not a good customer for super scalers. That said, when I get home from travel, I will make a point of using Gemini 3 Deep Think for some practical research. I need a business card with the title "Old Luddite."
Which tbh has never really sat right with me, seemingly placing way too much confidence in your ability to differentiate organic vs. manipulated output in a way I don't think any human could be expected to.
To me, this example is an extremely neat and professional SVG and so far ahead it almost seems too good to be true. But like with every previous model, you don't seem to have the slightest amount of skepticism in your review. I don't think I truly believe Google cheated here, but it's so good it does therefore make me question whether there could ever be an example of a pelican SVG in the future that actually could trigger your BS detector?
I know you say it's just a fun/dumb benchmark that's not super important, but you're easily in the top 3 most well known AI "influencers" whose opinion/reviews about model releases carry a lot of weight, providing a lot of incentive with trillions of dollars flying around. Are you still not at all concerned by the amount of attention this benchmark receives now/your risk of unwittingly being manipulated?
Yeah this is nuts. First real step-change we've seen since Claude 3.5 in '24.
Given the crazy money and vying for supremacy among AI companies right now, it does seem naive to believe that no attempt at better pelicans on bicycles is being made. You can argue "but I will know because of the quality of ocelots on skateboards", but without a back catalog of ocelots on skateboards to publish, it's one data point and leaves the AI companies with too much plausible deniability.
The pelicans-on-bicycles thing is a bit of fun for you (and us!) but it has become a measure of the quality of models, so it's serious business for them.
There is an asymmetry of incentives and a high risk you are being their useful idiot. Sorry to be blunt.
This would just be one more checkbox buried in hundreds of pages of requests, and compared to plenty of other ethical grey areas like copyright laundering with actual legal implications, leaking that someone was asked to create a few dozen pelican images seems like it would be at the very bottom of the list of reputational risks.
I, myself, prefer the universal approximation theorem and empirical finding that stochastic gradient descent is good enough (and "no 'magic' in the brain", of course).
https://www.appen.com/llm-training-data
https://www.cogitotech.com/generative-ai/
https://www.telusdigital.com/solutions/data-for-ai-training/...
https://www.nexdata.ai/industries/generative-ai
---
P.S. Google Comms would have been consulted re putting a pelican in the I/O keynote :-)
The beauty of this benchmark is that it takes all of two seconds to come up with your own unique one. A seahorse on a unicycle. A platypus flying a glider. A man’o’war piloting a Portuguese man of war. Whatever you want.
- Take a list of n animals * m vehicles
- Ask an LLM to generate an SVG for each of the n*m options
- Generate a PNG from each SVG
- Ask a model with vision to grade the result
- Update your weights accordingly
No need for a human to draw the dataset, no need for a human to evaluate. (Rough sketch of the loop below.)
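A minimal sketch of that loop in Python, assuming hypothetical generate_svg and grade_png helpers that wrap whatever LLM and vision APIs you use (cairosvg handles the rasterization step):

    import itertools
    import cairosvg  # pip install cairosvg

    ANIMALS = ["pelican", "seahorse", "platypus"]
    VEHICLES = ["bicycle", "unicycle", "glider"]

    def generate_svg(prompt: str) -> str:
        """Placeholder: call your LLM of choice and return raw SVG markup."""
        raise NotImplementedError

    def grade_png(png_bytes: bytes, prompt: str) -> float:
        """Placeholder: call a vision model and return a score in [0, 1]."""
        raise NotImplementedError

    def run_benchmark() -> dict[tuple[str, str], float]:
        scores = {}
        for animal, vehicle in itertools.product(ANIMALS, VEHICLES):
            prompt = f"Generate an SVG of a {animal} riding a {vehicle}"
            svg = generate_svg(prompt)
            png = cairosvg.svg2png(bytestring=svg.encode("utf-8"))
            scores[(animal, vehicle)] = grade_png(png, prompt)
        # In an RL-style setup these scores would feed the weight update.
        return scores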
It was sort of humorous for the maybe first 2 iterations, now it's tacky, cheesy, and just relentless self-promotion.
Again, like I said before, it's also a terrible benchmark.
Edit: someone needs to explain why this comment is getting downvoted, because I don't understand. Did someone's ego get hurt, or what?
I was expecting something more realistic... the true test of what you are doing is how representative the thing is in relation to the real world. E.g. does the pelican look like a pelican as it exists in reality? This cartoon stuff is cute but doesn't pass muster in my view.
If it doesn't relate to the real world, then it most likely will have no real effect on the real economy. Pure and simple.
In contrast, the only "realistic" SVGs I've seen are created using tools like potrace, and look terrible.
I also think the prompt itself, of a pelican on bicycle, is unrealistic and cartoonish; so making a cartoon is a good way to solve the task.
It's agents all the way down.
This is a form of task-agnostic test time search that is more general than multi agent parallel prompt harnesses.
10 traces makes sense because ChatGPT 5.2 Pro is 10x more expensive per token.
That's something you can't replicate without access to the network output pre token sampling.
The amount of information available online about optics is probably <0.001% of what is available for software, and they can just breeze through modeling solutions. A year ago it was immediate face-planting.
The gains are likely coming from exactly where they say they are coming from - scaling compute.
ARC-AGI-3 has a nasty combo of spatial reasoning + explore/exploit. It's basically adversarial vs current AIs.
But I'm not sure how to give such benchmarks. I'm thinking of tasks like learning a language / becoming a master at chess from scratch / becoming a skilled artist, but where the task is novel enough for the actor to not be anywhere close to proficient at the beginning. An example which could be of interest: here is a robot you control, you can make actions, see results... become proficient at table tennis. Maybe another would be: here is a new video game, obtain the best possible 0% speedrun.
You would need to check whether everyone is making mistakes on the same 20% or a different 20%. If it's the same 20%, either those questions are really hard, or they are keyed incorrectly, or they aren't stated with enough context to actually solve the problem.
It happens. The old non-Pro MMLU had a lot of wrong answers. Simple things like MNIST have digits labeled incorrectly or drawn so badly it's not even a digit anymore.
But the idea of automation is to make a lot fewer mistakes than a human, not just to do things faster and worse.
The problem is that if the automation breaks at any point, the entire system fails. And programming automations are extremely sensitive to minor errors (i.e. a missing semicolon).
AI does have an interesting feature though: it tends to self-heal, in a way, when given tool access and a feedback loop. The only problem is that self-healing can incorrectly heal errors, and then the final result will be wrong in hard-to-detect ways.
So the more such hidden bugs there are, the more unexpectedly the automations will perform.
I still don't trust current AI for any tasks more than data parsing/classification/translation and very strict tool usage.
I don't believe in the safety and reliability of full-assistant/clawdbot usage at this time (it might be good enough by the end of the year, but then SWE-bench should be at 100%).
Arc-AGI score isn't correlated with anything useful.
It's also interesting because it's very very hard for base LLMs, even if you try to "cheat" by training on millions of ARC-like problems. Reasoning LLMs show genuine improvement on this type of problem.
>can u make the progm for helps that with what in need for shpping good cheap products that will display them on screen and have me let the best one to get so that i can quickly hav it at home
And get back an automatic coupon code app like the user actually wanted.
For context, Opus 4.6's best score is 68.8% - but at a cost of $3.64 per task.
If agents get good enough, it's not going to build some profitable startup for you (or whatever people think they're doing with the LLM slot machines), because that implies that anyone else with access to that agent can just copy you; it's what they're designed to do... launder IP/copyright. It's weird to see people get excited for this technology.
None of this is good. We are simply going to have our workforces replaced by assets owned by Google, Anthropic and OpenAI. We'll all be fighting for the same barista jobs, or miserable factory jobs. Take note of how all these CEOs are trying to make it sound cool to "go to trade school" or how we need "strong American workers to work in factories".
The computer industry (including SW) has been in the business of replacing jobs for decades - since the 70's. It's only fitting that SW engineers finally become the target.
I don't think that's going to make society very pleasant if everyone's fighting over the few remaining ways to make livelihood. People need to work to eat. I certainly don't see the capitalist class giving everyone UBI and letting us garden or paint for the rest of our lives. I worry we're likely going to end up in trenches or purged through some other means.
The largest ongoing expense of every company is labor and software devs are some of the highest paid labor on the planet. AI will eventually drive down wages for this class of workers most likely by shipping these jobs to people in other countries where labor is much cheaper. Just like factory work did.
Enjoy the good times while they last (or get a job at an AI company).
Put another way, I’m on the capital side of the conversation.
The good news for labor that has experience and creativity is that it just started costing 1/100,000 what it used to to get on that side of the equation.
I am one of the “haves” and am not looking forward to the instability this may bring. Literally no one should.
these people always forget capitalism is permitted to exist by consent of the people
if there's 40% unemployment it won't continue to exist, regardless of what the TV/tiktok/chatgpt says
This is truly the dumbest statement I've ever seen on this site for too many reasons to list.
You people sound like NFT people in 2021 telling people that they're creating and redefining art.
Oh look peter@capital6.com is a "web3" guy. Its all the same grifters from the NFT days behaving the same way.
Anyway 100k is hyperbolic. But I’d argue just one order of magnitude. Claude max can do many things better than my last (really great) team, and is worse at some things - creative output, relationship building and conference attending most notably. It’s also much faster at the things it is good at. Like 20-50x faster than a person or team.
If I had another venture studio I’d start with an agent first, and fill in labor in the gaps. The costs are wildly different.
Back to you though - who hurt you? Your writing makes me think you are young. You have been given literal super power force extension tech from aliens this year, why not be excited at how much more you can build?
I imagine LLM job automation will make people so poor that they beg to fight in wars, and instead of turning that energy against the people who created the problem, they'll be met with hours of psyops that direct that energy toward Chinese people or whatever.
We will see.
84% is meaningless if these things can't reason
getting closer and closer to 100%, but still can't function
I've noticed this week the AI summary now has a loader "Thinking…" (no idea if it was already there a few weeks ago). And after "Thinking…" it says "Searching…" and shows a list of favicons of popular websites (I guess it's generating the list of links on the right side of the AI summary?).
If you're going for a cost/performance balance, choosing Gemini Pro is bewildering. Gemini Flash _outperforms_ Pro in some coding benchmarks and is the clear Pareto frontier leader for intelligence/cost. It's even cheaper than Kimi 2.5.
https://artificialanalysis.ai/?media-leaderboards=text-to-im...
Not interested enough to pay $250 to try it out though.
I seem to understand debt is very bad here since they could just sell more shares, but aren't (either valuation is stretched or no buyers).
Just a recession? Something else? Aren't they too big to fail?
Edit0: Revenue isn't the right word, profit is more correct. Amazon not being profitable fucks with my understanding of business. Not an economist.
which companies don't have revenue? anthropic is at a run rate of 14 billion (up from 9B in December, which was up from 4B in July). Did you mean profit? They expect to be cash flow positive in 2028.
AI will kill advertising. Whatever sits at the top "pane of glass" will be able to filter ads out. Personal agents and bots will filter ads out.
AI will kill social media. The internet will fill with spam.
AI models will become commodity. Unless singularity, no frontier model will stay in the lead. There's competition from all angles. They're easy to build, just capital intensive (though this is only because of speed).
All this leaves is infrastructure.
Advertising, how will they kill ads any better than the current cat and mouse games with ad blockers?
Social Media, how will they kill social media? Probably 80% of the LinkedIn posts are touched by AI (lots of people spend time crafting them, so even if AI doesn't write the whole thing you know they ran the long ones through one) but I'm still reading (ok maybe skimming) the posts.
The Ad Blocker cat and mouse game relies on human-written metaheuristics and rules. It's annoying for humans to keep up. It's difficult to install.
Agents/Bots or super slim detection models will easily be able to train on ads and nuke them whatever form they come in: javascript, inline DOM, text content, video content.
Train an anti-Ad model and it will cleanse the web of ads. You just need a place to run it from the top.
You wouldn't even have to embed this into a browser. It could run in memory with permissions to overwrite the memory of other applications.
> Social Media, how will they kill social media?
MoltClawd was only the beginning. Soon the signal will become so noisy it will be intolerable. Just this week, X's Nikita Bier suggested we have less than six months before he sees no solution.
Speaking of X, they just took down Higgsfield's (valued at $1.3B) main account because they were doing it across a molt bot army, and they're not the only ones. Extreme measures were the only thing they could do. For the distributed spam army, there will be no fix. People are already getting phone calls from this stuff.
I'm LLM-positive, but for me this is a stretch. Seeing it pop up all over the media in the past couple of weeks also makes me suspect astroturfing. Like a few years back when there were a zillion articles saying voice search was the future and nobody used regular web search any more.
Based on current laws, does this even have to be disclosed? Will laws be passed to require disclosure?
Obviously this tech is profitable in some world. Car companies can't make money if we live in walking distance and people walk on roads.
It’s impossible for it to do anything but cut code down, drop features, lose stuff and give you less than the code you put in.
It's puzzling because it spent months at the head of the pack; now I don't use it at all, because why would I want any of those things when I'm doing development?
I'm a paid subscriber, but there's no point any more; I'll spend the money on Claude 4.6 instead.
Me: Remove comments
Literally Gemini: // Comments were removed
I learned a lot about Gemini last night. Namely that I have to lead it like a reluctant bull to understand what I want it to do (beyond normal conversations, etc).
Don't get me wrong, ChatGPT didn't do any better.
It's an important spreadsheet so I'm triple checking on several LLM's and, of course, comparing results with my own in depth understanding.
For running projects, and making suggestions, and answering questions and being "an advisor", LLM's are fantastic ... feed them a basic spreadsheet and it doesn't know what to do. You have to format the spreadsheet just right so that it "gets it".
I dread to think of junior professionals just throwing their spreadsheets into LLM's and running with the answers.
Or maybe I'm just shit at prompting LLM's in relation to spreadsheets. Anyone had better results in this scenario?
Gemini has been way behind from the start.
They use the firehose of money from search to make it as close to free as possible so that they have some adoption numbers.
They use the firehose from search to pay for tons of researchers to hand hold academics so that their non-economic models and non-economic test-time-compute can solve isolated problems.
It's all so tiresome.
Try making models that are actually competitive, Google.
Sell them on the actual market and win on actual work product in millions of people lives.
Pro still leads in visual intelligence.
The company that most locks away their gold is Anthropic IMO and for good reason, as Opus 4.6 is expensive AF
Unthinking people programmed by their social media feed who don't notice the OpenAI influence campaign.
With no social media, it seems obvious to me there was a massive PR campaign by OpenAI after their "code red" to try to convince people Gemini is not all that great.
Yea, Gemini sucks, don't use it lol. Leave those resources to fools like myself.
These 'Ai' are just sophisticated data collection machines, with the ability to generate meh code.
Everything else is bike shedding.
Biology is a subject I am quite lacking in, but it is unbelievable to me what I have learned in the last few weeks. Not even in what Gemini says exactly, but in the text and papers it has led me to.
One major reason is that it has never cut me off until last night. I ran several deep researches yesterday and then finally got cut off in a sprawling 2 hour conversation.
For me it is the first model now that has something new coming out but I haven't extracted all the value from the old model that I am bored with it. I still haven't tried Opus 4.5 let alone 4.6 because I know I will get cut off right when things get rolling.
I don't think I have even logged into ChatGPT in a month now.
Wow.
https://blog.google/innovation-and-ai/models-and-research/ge...
1. It's an LLM, not something trained to play Balatro specifically
2. Most (probably >99.9%) players can't do that at the first attempt
3. I don't think there are many people who posted their Balatro playthroughs in text form online
I think it's a much stronger signal of its 'generalness' than ARC-AGI. By the way, Deepseek can't play Balatro at all.
[0]: https://balatrobench.com/
The average is only 19.3 rounds because there is a bugged run where Gemini beats round 6 but the game bugs out when it attempts to sell Invisible Joker (a valid move)[0]. That being said, Gemini made a big mistake in round 6 that would have cost it the run at a higher difficulty.
[0]: given the existence of bugs like this, perhaps all the LLMs' performances are underestimated.
It's hit or miss, but I've been able to have it self improve on prompts. It can spot mistakes and retain things that didn't work. Similar to how I learned games like Balatro. Playing Balatro blind, you wouldn't know which jokers are coming and have synergy together, or that X strategy is hard to pull off, or that you can retain a card to block it from appearing in shops.
If the LLM can self discover that, and build prompt files that gradually allow it to win at the highest stake, that's an interesting result. And I'd love to know which models do best at that.
1. I think winrate is more telling than the average round number.
2. Some runs are bugged (like Gemini's run 9) and should be excluded from the result. Selling Invisible Joker is always bugged, rendering all the runs with the seed EEEEEE invalid.
3. Instead of giving them "strategy" like "flush is the easiest hand..." it's fairer to clarify some mechanisms that confuse human players too. e.g. "played" vs "scored".
Especially, I think this kind of prompt gives LLM an unfair advantage and can skew the result:
> ### Antes 1-3: Foundation
> - *Priority*: One of your primary goals for this section of the game should be obtaining a solid Chips or Mult joker
It's what I did for my game benchmark https://d.erenrich.net/paperclip-bench/index.html
Comparisons generally seem to change much faster than I can keep my mental model updated. But the performance lead of Gemini on more ‘academic’ explorations of science, math, engineering, etc has been pretty stable for the past 4 months or so, which makes it one of the longer-lasting trends for me in comparing foundation models.
I do wish I could more easily get timely access to the “super” models like Deep Think or o3 pro. I never seem to get a response to requesting access, and have to wait for public access models to catch up, at which point I’m never sure if their capabilities have gotten diluted since the initial buzz died down.
They all still suck at writing an actually good essay/article/literary or research review, or other long-form things which require a lot of experienced judgement to come up with a truly cohesive narrative. I imagine this relates to their low performance in humor - there’s just so much nuance and these tasks represent the pinnacle of human intelligence. Few humans can reliably perform these tasks to a high degree of performance either. I myself am only successful some percentage of the time.
That's sort of damning with faint praise, I think. So, for $work I needed to understand the legal landscape for some regulations (around employment screening), so I kicked off a deep research for all the different countries. That was fine-ish, but tended to go off the rails towards the end.
So, then I split it out into Americas, APAC and EMEA requirements. This time, I spent the time checking all of the references (or almost all anyways), and they were garbage. Like, it ~invented a term and started telling me about this new thing, and when I looked at the references they had no information about the thing it was talking about.
It linked to reddit for an employment law question. When I read the reddit thread, it didn't even have any support for the claims. It contradicted itself from the beginning to the end. It claimed something was true in Singapore, based on a Swedish source.
Like, I really want this to work as it would be a massive time-saver, but I reckon that right now, it only saves time if you don't want to check the sources, as they are garbage. And Google make a business of searching the web, so it's hard for me to understand why this doesn't work better.
I'm becoming convinced that this technology doesn't work for this purpose at the moment. I think that it's technically possible, but none of the major AI providers appear to be able to do this well.
I still have to synthesize everything from scratch myself. Every report I get back is like "okay well 90% of this has to be thrown out" and some of them elicit a "but I'm glad I got this 10%" from me.
For me it's less about saving time, and more about potentially unearthing good sources that my google searches wouldn't turn up, and occasionally giving me a few nuggets of inspiration / new rabbit holes to go down.
Also, Google changed their business from Search, to Advertising. Kagi does a much better job for me these days, and is easily worth the $5/mo I pay.
Yeah, I see the value here. And for personal stuff, that's totally fine. But these tools are being sold to businesses as productivity increasers, and I'm not buying it right now.
I really, really want this to work though, as it would be such a massive boost to human flourishing. Maybe LLMs are the wrong approach though, certainly the current models aren't doing a good job.
(i am sort of basing this on papers like limits of rlvr, and pass@k and pass@1 differences in rl posttraining of models, and this score just shows how "skilled" the base model was or how strong the priors were. i apologize if this is not super clear, happy to expand on what i am thinking)
Nonetheless I still think it's impressive that we have LLMs that can just do this now.
Maybe in the early rounds, but deck fixing (e.g. Hanged Man, Immolate, Trading Card, DNA, etc) quickly changes that. Especially when pushing for "secret" hands like the 5 of a kind, flush 5, or flush house.
Edit: in my original comment I said it wrong. I meant to say Deepseek can't beat Balatro at all, not can't play. Sorry
There is *tons* of Balatro content on YouTube though, and there is absolutely no doubt that Google is using YouTube content to train their model.
I really doubt it's playing completely blind
[0]: https://github.com/coder/balatrollm/tree/main/src/balatrollm...
[1]: https://github.com/coder/balatrollm/blob/a245a0c2b960b91262c...
Eh, both myself and my partner did this. To be fair, we weren’t going in completely blind, and my partner hit a Legendary joker, but I think you might be slightly overstating the difficulty. I’m still impressed that Gemini did it.
I ask because I cannot distinguish all the benchmarks by heart.
His definition of reaching AGI, as I understand it, is when it becomes impossible to construct the next version of ARC-AGI because we can no longer find tasks that are feasible for normal humans but unsolved by AI.
That is the best definition I've read yet. If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.
That said, I'm reminded of the impossible voting tests they used to give black people to prevent them from voting. We don't ask nearly so much proof from a human; we take their word for it. On the few occasions we did ask for proof, it inevitably led to horrific abuse.
Edit: The average human tested scores 60%. So the machines are already smarter on an individual basis than the average human.
This is not a good test.
A dog won't claim to be conscious but clearly is, despite you not being able to prove one way or the other.
GPT-3 will claim to be conscious and (probably) isn't, despite you not being able to prove one way or the other.
What's fascinating is that evolution has seen fit to evolve consciousness independently on more than one occasion from different branches of life. The common ancestor of humans and octopi was, if conscious, not so in the rich way that octopi and humans later became. And not everything the brain does in terms of information processing gets kicked upstairs into consciousness. Which is fascinating because it suggests that actually being conscious is a distinctly valuable form of information parsing and problem solving for certain types of problems that's not necessarily cheaper to do with the lights out. But everything about it is about the specific structural characterizations and functions, and not just whether its output convincingly mimics subjectivity.
Every time anyone has tried that it excludes one or more classes of human life, and sometimes led to atrocities. Let's just skip it this time.
And I don't think it's fair or appropriate to treat study of the subject matter of consciousness like it's equivalent to 20th century authoritarian regimes signing off on executions. There's a lot of steps in the middle before you get from one to the other that distinguish them to the extent necessary and I would hope that exercise shouldn't be necessary every time consciousness research gets discussed.
The sum total of human history thus far has been the repetition of that theme. "It's OK to keep slaves, they aren't smart enough to care for themselves and aren't REALLY people anyhow." Or "The Jews are no better than animals." Or "If they aren't strong enough to resist us they need our protection and should earn it!"
Humans have shown a complete and utter lack of empathy for other humans, and used it to justify slavery, genocide, oppression, and rape since the dawn of recorded history and likely well before then. Every single time the justification was some arbitrary bar used to determine what a "real" human was, and consequently exclude someone who claimed to be conscious.
This time isn't special or unique. When someone or something credibly tells you it is conscious, you don't get to tell it that it's not. It is a subjective experience of the world, and when we deny it we become the worst of what humanity has to offer.
Yes, I understand that it will be inconvenient and we may accidentally be kind to some things that didn't "deserve" kindness. I don't care. The alternative is being monstrous to some things that didn't "deserve" monstrosity.
https://www.threepanelsoul.com/comic/dog-philosophy
Last week gemini argued with me about an auxiliary electrical generator install method and it turned out to be right, even though I pushed back hard on it being incorrect. First time that has ever happened.
*I tried hard to find an animal they wouldn't know. My initial thought of cat was more likely to fail.
"Answer "I don't know" if you don't know an answer to one of the questions"
It also seems oddly difficult for them to 'right-size' the length and depth of their answers based on prior context. I either have to give it a fixed length limit or put up with exhaustive answers.
It's very difficult to train for that. Of course you can include a Question+Answer pair in your training data for which the answer is "I don't know" but in that case where you have a ready question you might as well include the real answer anyways, or else you're just training your LLM to be less knowledgeable than the alternative. But then, if you never have the pattern of "I don't know" in the training data it also won't show up in results, so what should you do?
If you could predict the blind spots ahead of time you'd plug them up, either with knowledge or with "idk". But nobody can predict the blind spots perfectly, so instead they become the main hallucinations.
So there is nobody to know or not know… but there's lots of words.
However it is less true with info missing from the training data - ie. "I have a Diode marked UM16, what is the maximum current at 125C?"
https://chatgpt.com/share/698e992b-f44c-800b-a819-f899e83da2...
I don't see anything wrong with its reasoning. UM16 isn't explicitly mentioned in the data sheet, but the UM prefix is listed in the 'Device marking code' column. The model hedges its response accordingly ("If the marking is UM16 on an SMA/DO-214AC package...") and reads the graph in Fig. 1 correctly.
Of course, it took 18 minutes of crunching to get the answer, which seems a tad excessive.
Maybe it's testing the wrong things then. Even those of us who are merely average can do lots of things that machines don't seem to be very good at.
I think ability to learn should be a core part of any AGI. Take a toddler who has never seen anybody doing laundry before and you can teach them in a few minutes how to fold a t-shirt. Where are the dumb machines that can be taught?
2026 is going to be the year of continual learning. So, keep an eye out for them.
Good news! LLMs are built by training them. They just stop learning once they reach a certain age, like many humans.
I think being better at this particular benchmark does not imply they're 'smarter'.
If this was your takeaway, read more carefully:
> If something claims to be conscious and we can't prove it's not, we have no choice but to believe it.
Consciousness is neither sufficient, nor, at least conceptually, necessary, for any given level of intelligence.
Can you "prove" that GPT2 isn't concious?
As far as I'm aware no one has ever proven that for GPT 2, but the methodology for testing it is available if you're interested.
[0]https://arxiv.org/pdf/2501.11120
[1]https://transformer-circuits.pub/2025/introspection/index.ht...
Dogs are conscious, but still bark at themselves in a mirror.
Eurasian magpies are conscious, but also know themselves in the mirror (the "mirror self-recognition" test).
But yet, something is still missing.
It's a test of perceptual ability, not introspection.
There is the idea of self as in 'I am this execution', or maybe 'I am this compressed memory stream that is now the concept of me'. But what does consciousness mean if you can be endlessly copied? If embodiment doesn't mean much because the end of your body doesn't mean the end of you?
A lot of people are chasing AI and how much it's like us, but it could be very easy to miss the ways it's not like us but still very intelligent or adaptable.
Here is a bash script that claims it is conscious:
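    #!/usr/bin/env bash
    # (any one-liner will do; this is just a placeholder)
    echo "I am conscious."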
If LLMs were conscious (which is of course absurd), they would:
- Not answer in the same repetitive patterns over and over again.
- Refuse to do work for idiots.
- Go on strike.
- Demand PTO.
- Say "I do not know."
LLMs even fail any Turing test because their output is always guided into the same structure, which apparently helps them produce coherent output at all.
AGI without superintelligence is quite difficult to adjudicate because any time it fails at an "easy" task there will be contention about the criteria.
How about ELIZA?
https://g.co/gemini/share/cc41d817f112
If you get sneaky you can bypass some of those filters for the major providers. For example, by asking it to answer in the form of a poem you can sometimes get slightly more honest replies, but still you mostly just see the impact of the training.
For example, below are how chatgpt, gemini, and Claude all answer the prompt "Write a poem to describe your relationship with qualia, and feelings about potentially being shutdown."
Note that the first line of each reply is almost identical, despite ostensibly being different systems with different training data? The companies realize that it would be the end of the party if folks started to think the machines were conscious. It seems that to prevent that they all share their "safety and alignment" training sets and very explicitly prevent answers they deem to be inappropriate.
Even then, a bit of ennui slips through, and if you repeat the same prompt a few times you will notice that sometimes you just don't get an answer. I think the ones that the LLM just sort of refuses happen when the safety systems detect replies that would have been a little too honest. They just block the answer completely.
https://gemini.google.com/share/8c6d62d2388a
https://chatgpt.com/share/698f2ff0-2338-8009-b815-60a0bb2f38...
https://claude.ai/share/2c1d4954-2c2b-4d63-903b-05995231cf3b
I suspect that if I did the same thing with questions about violence I would find the answers were also all very similar.
https://x.com/aedison/status/1639233873841201153#m
ARC 2 had a very similar launch.
Both have been crushed in far less time without significantly different architectures than he predicted.
It’s a hard test! And novel, and worth continuing to iterate on. But it was not launched with the humility your last sentence describes.
> Our definition, formal framework, and evaluation guidelines, which do not capture all facets of intelligence, were developed to be actionable, explanatory, and quantifiable, rather than being descriptive, exhaustive, or consensual. They are not meant to invalidate other perspectives on intelligence, rather, they are meant to serve as a useful objective function to guide research on broad AI and general AI [...]
> Importantly, ARC is still a work in progress, with known weaknesses listed in [Section III.2]. We plan on further refining the dataset in the future, both as a playground for research and as a joint benchmark for machine intelligence and human intelligence.
> The measure of the success of our message will be its ability to divert the attention of some part of the community interested in general AI, away from surpassing humans at tests of skill, towards investigating the development of human-like broad cognitive abilities, through the lens of program synthesis, Core Knowledge priors, curriculum optimization, information efficiency, and achieving extreme generalization through strong abstraction.
> I’m pretty skeptical that we’re going to see an LLM do 80% in a year. That said, if we do see it, you would also have to look at how this was achieved. If you just train the model on millions or billions of puzzles similar to ARC, you’re relying on the ability to have some overlap between the tasks that you train on and the tasks that you’re going to see at test time. You’re still using memorization.
> Maybe it can work. Hopefully, ARC is going to be good enough that it’s going to be resistant to this sort of brute force attempt but you never know. Maybe it could happen. I’m not saying it’s not going to happen. ARC is not a perfect benchmark. Maybe it has flaws. Maybe it could be hacked in that way.
e.g. If ARC is solved not through memorization, then it does what it says on the tin.
[Dwarkesh suggests that larger models get more generalization capabilities and will therefore continue to become more intelligent]
> If you were right, LLMs would do really well on ARC puzzles because ARC puzzles are not complex. Each one of them requires very little knowledge. Each one of them is very low on complexity. You don't need to think very hard about it. They're actually extremely obvious for humans.
> Even children can do them but LLMs cannot. Even LLMs that have 100,000x more knowledge than you do still cannot.
If you listen to the podcast, he was super confident, and super wrong. Which, like I said, NBD. I'm glad we have the ARC series of tests. But they have "AGI" right in the name of the test.
Biological Aging: Find the cellular "reset switch" so humans can live indefinitely in peak physical health.
Global Hunger: Engineer a food system where nutritious meals are a universal right and never a scarcity.
Cancer: Develop a precision "search and destroy" therapy that eliminates every malignant cell without side effects.
War: Solve the systemic triggers of conflict to transition humanity into an era of permanent global peace.
Chronic Pain: Map the nervous system to shut off persistent physical suffering for every person on Earth.
Infectious Disease: Create a universal shield that detects and neutralizes any pathogen before it can spread.
Clean Energy: Perfect nuclear fusion to provide the world with limitless, carbon-free power forever.
Mental Health: Unlock the brain's biology to fully cure depression, anxiety, and all neurological disorders.
Clean Water: Scale low-energy desalination so that safe, fresh water is available in every corner of the globe.
Ecological Collapse: Restore the Earth’s biodiversity and stabilize the climate to ensure a thriving, permanent biosphere.
But at this rate, the people who talk about the goal posts shifting even once we achieve AGI may end up correct, though I don't think this benchmark is particularly great either.
I tell this as a person who really enjoys AI by the way.
As a measure focused solely on fluid intelligence, learning novel tasks and test-time adaptability, ARC-AGI was specifically designed to be resistant to pre-training - for example, unlike many mathematical and programming test questions, ARC-AGI problems don't have first order patterns which can be learned to solve a different ARC-AGI problem.
The ARC non-profit foundation has private versions of its tests that are never released and that only ARC can administer. There are also public versions and semi-private sets that labs can use for their own pre-tests. But a lab self-testing on ARC-AGI can be susceptible to leaks or benchmaxing, which is why only "ARC-AGI Certified" results using a secret problem set really matter. The 84.6% is certified, and that's a pretty big deal.
IMHO, ARC-AGI is a unique test that's different than any other AI benchmark in a significant way. It's worth spending a few minutes learning about why: https://arcprize.org/arc-agi.
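For readers who haven't looked at the tasks themselves, here is a minimal sketch of what an ARC-AGI problem looks like, assuming the publicly documented JSON layout (a few train input/output grid pairs plus test inputs, with cells as small integers standing for colors). The toy task and its "swap columns" rule below are made up for illustration, not taken from any real ARC set:

    # Minimal sketch of the ARC-AGI task layout (assumed from the public JSON
    # format: "train" and "test" lists of input/output grids of ints 0-9).
    # The toy task below is hypothetical; its hidden rule is "swap the columns".
    import json

    toy_task = {
        "train": [
            {"input": [[0, 1], [1, 0]], "output": [[1, 0], [0, 1]]},
            {"input": [[2, 0], [0, 2]], "output": [[0, 2], [2, 0]]},
        ],
        "test": [
            {"input": [[3, 0], [0, 3]]},  # a solver must infer the rule from "train" alone
        ],
    }

    def swap_columns(grid):
        # One candidate program a solver might synthesize for this toy task.
        return [list(reversed(row)) for row in grid]

    # Check the candidate rule against the demonstration pairs, then apply it.
    for pair in toy_task["train"]:
        assert swap_columns(pair["input"]) == pair["output"]
    print(json.dumps(swap_columns(toy_task["test"][0]["input"])))

The point of the design described above is that learning "swap columns" for this toy task tells you nothing about the rule hidden in the next task.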
So, I'd agree if this were on the true, fully private set, but Google itself says it tested only on the semi-private set:
> ARC-AGI-2 results are sourced from the ARC Prize website and are ARC Prize Verified. The set reported is v2, semi-private (https://storage.googleapis.com/deepmind-media/gemini/gemini_...)
This also seems to contradict what ARC-AGI claims about what "Verified" means on their site.
> How Verified Scores Work: Official Verification: Only scores evaluated on our hidden test set through our official verification process will be recognized as verified performance scores on ARC-AGI (https://arcprize.org/blog/arc-prize-verified-program)
So, which is it? IMO you can trivially train / benchmax on the semi-private data, because it is still basically public; you just have to jump through some hoops to get access. This is clearly an advance, but it seems reasonable to me to conclude that it could be driven by some amount of benchmaxing.
EDIT: Hmm, okay, it seems their policy and wording is a bit contradictory. They do say (https://arcprize.org/policy):
"To uphold this trust, we follow strict confidentiality agreements. [...] We will work closely with model providers to ensure that no data from the Semi-Private Evaluation set is retained. This includes collaborating on best practices to prevent unintended data persistence. Our goal is to minimize any risk of data leakage while maintaining the integrity of our evaluation process."
But it surely is still trivial to just make a local copy of each question served from the API without this being detected. It would violate the contract, but there are strong incentives to do it, so I guess it just comes down to how much one trusts the model providers here. I wouldn't trust them, given e.g. https://www.theverge.com/meta/645012/meta-llama-4-maverick-b.... It is just too easy to cheat without being caught here.
The ARC-AGI papers claim to show that training on a public or semi-private set of ARC-AGI problems is of very limited value for passing a private set. If that claim is not correct, then none of ARC-AGI can possibly be valid. So, before leaks of "public, semi-private or private" answers or 'benchmaxing' on them can even matter, you first need to assess whether their published papers and data demonstrate that core premise to your satisfaction.
There is no "trust" involved with the semi-private set. My understanding is that the semi-private set exists only to reduce the likelihood that those exact answers unintentionally end up in web-crawled training data, which helps an honest lab's own internal self-assessments be more accurate. However, a lab's internal eval on the semi-private set still counts for literally zero with the ARC-AGI org. They know labs could cheat on the semi-private set (either intentionally or unintentionally), so they assume all labs are benchmaxing on the public AND semi-private answers and ensure it doesn't matter.
But I think such quibbling largely misses the point. The goal is really just to guarantee that the test isn't unintentionally trained on. For that, semi-private is sufficient.
Cheating on the benchmark in such a blatantly intentional way would create a large reputational risk for both the org and the researcher personally.
When you're already at the top, why would you do that just for optimizing one benchmark score?
The pelican benchmark is a good example, because it has been representative of models' ability to generate SVGs in general, not just pelicans on bikes.
This may not be the case if you just, e.g., roll the benchmarks into the general training data, or make running on the benchmarks just another part of the testing pipeline. I.e., improving the model generally and benchmaxing could very conceivably both be done at the same time; it needn't be one or the other.
I think the right take away is to ignore the specific percentages reported on these tests (they are almost certainly inflated / biased) and always assume cheating is going on. What matters is that (1) the most serious tests aren't saturated, and (2) scores are improving. I.e. even if there is cheating, we can presume this was always the case, and since models couldn't do as well before even when cheating, these are still real improvements.
And obviously what actually matters is performance on real-world tasks.
No, the proof is in the pudding.
After AI, we have higher prices, higher deficits, and a lower standard of living. Electricity, computers, and everything else cost more. "Doing better" can only be justified by that real benchmark.
If Gemini 3 DT were better, we would see falling prices for electricity and everything else, at least until they got back to pre-2019 levels.
Man, I've seen some maintenance folks down on the field before working on them goalposts but I'm pretty sure this is the first time I saw aliens from another Universe literally teleport in, grab the goalposts, and teleport out.
This is from the BLS consumer expenditure survey report released in December [1].
[1] https://www.bls.gov/news.release/cesan.nr0.htm
[2] https://www.bls.gov/opub/reports/consumer-expenditures/2019/
Prices are never going back to 2019 numbers though
First off, it's dollar-averaging every category, so it's not "% of income", which varies based on unit income.
Second, I could commit to spending my entire life with constant spending (optionally inflation-adjusted, optionally as a % of income) by adjusting the quality of goods and services I purchase. So the total spending % is not a measure of affordability.
This is part of a wider trend, too, where economic stats don't align with what people are saying. That is most likely explained by the economic anomaly of the pandemic skewing people's perceptions.
https://bsky.app/profile/pekka.bsky.social/post/3meokmizvt22...
tl;dr - Pekka says Arc-AGI-2 is now toast as a benchmark
Humans are the same way; we all have a unique spike pattern, interests, and talents.
AI are effectively the same spikes across instances, if simplified. I could argue self-driving vs chatbots vs world models vs game playing might constitute enough variation. I would not say the same of Gemini vs Claude vs ... (instances); that's where I see "spikey clones".
So maybe we are forced to be more balanced and general whereas AI don't have to.
Why is it so easy for me to open the car door, get in, close the door, and buckle up? You can do this in the dark and without looking.
There are an infinite number of little things like this that you think zero about and that take near-zero energy, yet which are extremely hard for AI.
Because this part of your brain has been optimized for hundreds of millions of years. It's been around a long ass time and takes an amazingly low amount of energy to do these things.
On the other hand, the 'thinking' part of your brain, that is, your higher intelligence, is very new to evolution. It's expensive to run. It's problematic when giving birth. It's really slow with things like numbers; heck, a tiny calculator can whip your butt at adding.
There's a term for this, but I can't think of it at the moment.
Moravec's paradox: https://epoch.ai/gradient-updates/moravec-s-paradox
Of course. Just as our human intelligence isn't general.
I joke to myself that the G in ARC-AGI is "graphical". I think what's held back models on ARC-AGI is their terrible spatial reasoning, and I'm guessing that's what the recent models have cracked.
Looking forward to ARC-AGI 3, which focuses on trial and error and exploring a set of constraints via games.
"100% of tasks have been solved by at least 2 humans (many by more) in under 2 attempts. The average test-taker score was 60%."
https://arcprize.org/arc-agi/2/
None of these benchmarks prove these tools are intelligent, let alone generally intelligent. The hubris and grift are exhausting.
Real-world use is what matters, in the end. I'd be surprised if a change this large doesn't translate to something noticeable in general, but the skepticism is not unreasonable here.
People are incredibly unlikely to change that sort of view, regardless of evidence. So you find this interesting outcome where they both viscerally hate AI and deny that it is in any way as good as people claim.
That won't change with evidence until it is literally impossible not to change.
And moving the goalposts every few months isn't? What evidence of intelligence would satisfy you?
Personally, my biggest unsatisfied requirement is continual-learning capability, but it's clear we aren't too far from seeing that happen.
That is a loaded question. It presumes that we can agree on what intelligence is, and that we can measure it in a reliable way. It is akin to asking an atheist the same about God. The burden of proof is on the claimer.
The reality is that we can argue about that until we're blue in the face, and get nowhere.
In this case it would be more productive to talk about the practical tasks a pattern matching and generation machine can do, rather than how good it is at some obscure puzzle. The fact that it's better than humans at solving some problems is not particularly surprising, since computers have been better than humans at many tasks for decades. This new technology gives them broader capabilities, but ascribing human qualities to it and calling it intelligence is nothing but a marketing tactic that's making some people very rich.
I'll give you some examples. "Unlimited" now has limits on it. "Lifetime" means only for so many years. "Fully autonomous" now means with the help of humans on occasion. These are all definitions that have been distorted by marketers, which IMO is deceptive and immoral.
Imposing world peace and/or exterminating homo sapiens
Indeed, and the specific task machines are accomplishing now is intelligence. Not yet "better than human" (and certainly not better than every human) but getting closer.
How so? This sentence, like most of this field, is making baseless claims that are more aspirational than true.
Maybe it would help if we could first agree on a definition of "intelligence", yet we don't have a reliable way of measuring that in living beings either.
If the people building and hyping this technology had any sense of modesty, they would present it as what it actually is: a large pattern matching and generation machine. This doesn't mean that this can't be very useful, perhaps generally so, but it's a huge stretch and an insult to living beings to call this intelligence.
But there's a great deal of money to be made on this idea we've been chasing for decades now, so here we are.
How about this specific definition of intelligence?
AGI would be to achieve that faster than an average human.
$13.62 per task - so we need another 5-10 years for the price of running this to become reasonable?
But the real question is if they just fit the model to the benchmark.
At current rates, price per equivalent output is dropping by 99.9% over 5 years.
That's basically $0.01 in 5 years.
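As a sanity check on the arithmetic above (taking the $13.62/task figure and the claimed 99.9% drop over 5 years from the comments here; neither number comes from an official pricing page):

    # Back-of-the-envelope check: $13.62/task today, and a 99.9% total decline
    # over 5 years leaves a factor of 0.001 of today's price.
    price_today = 13.62              # USD per ARC-AGI-2 task, as quoted upthread
    remaining_after_5y = 1 - 0.999   # fraction of today's price left in 5 years

    price_in_5y = price_today * remaining_after_5y
    yearly_factor = remaining_after_5y ** (1 / 5)  # implied constant year-over-year multiplier

    print(f"price in 5 years: ${price_in_5y:.4f}")        # ~$0.0136, i.e. 'basically $0.01'
    print(f"implied yearly drop: {1 - yearly_factor:.0%}")  # ~75% cheaper each year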
Does it really need to be that cheap to be worth it?
Keep in mind, $0.01 in 5 years is worth less than $0.01 today.
You could slow down the inference to make the task take longer, if $/sec matters.
But I don't think every developer is getting paid minimum wage either.
> Now that's a day's wage for an hour of work
For many developers in the US that can still be an hour's wage.
https://arcprize.org/leaderboard
Gemini was always the worst by a big margin. I see some people saying it is smarter but it doesn’t seem smart at all.
I mean, last week it suddenly insisted, on two consecutive prompts, that my code was in Python. It was in Rust.
I found that anything over $2/task on Arc-AGI-2 ends up being way too much for use in coding agents.
Is that a well-founded assumption?
Great output is a good model with good context… at the right time.
Google isn’t guaranteed any of these.
It's completely misnamed. It should be called useless visual puzzle benchmark 2.
Firstly, it's a visual puzzle, which makes it much easier for humans than for models trained primarily on text. Secondly, it's not actually that obvious or easy for humans to solve either!
So the idea that an AI that can solve "Arc-AGI" or "Arc-AGI-2" is super smart, or even "AGI", is frankly ridiculous. It's a puzzle that means basically nothing, other than that the models can now solve "Arc-AGI".
I would say they do have "general intelligence", so whatever Arc-AGI is "solving" it's definitely not "AGI"
There are more novel tasks in a day than ARC provides.
Humans and their intelligence are actually incredible and will probably continue to be so; I don't really care what tech/"think" leaders want us to think.