Towards Autonomous Mathematics Research
104 points by gmays 2 days ago | 55 comments

dash2 17 hours ago
One thing that is maybe underestimated is the very large amount of applied mathematics that is nowhere near the level of frontier mathematics research. For example lots of economics includes a theory component which is usually trivially simple by the standards of mathematics – undergraduate level at best. I tried ChatGPT on the appendix of this paper and it correctly reproduced the results (and the exposition was more elegant):

https://pubmed.ncbi.nlm.nih.gov/35790706/

reply
umairnadeem123 21 hours ago
interesting that they call out success cases as rare, that's honestly the most useful part. are people here seeing a better hit rate from tighter decomposition + verifier loops, or mostly just more compute?

also curious where failures cluster most: search, formalization, or proof checking?

reply
amiune 2 days ago
Perfect match for this test: https://arxiv.org/abs/2602.05192
reply
dang 19 hours ago
Discussed here: First Proof - https://news.ycombinator.com/item?id=46924591 - Feb 2026 (122 comments)
reply
bwfan123 2 days ago
reply
noosphr 2 days ago
This is what everyone who uses llms regularly expected. Good results require a human in the loop and the internet is so big that just about everything has been done there by someone. Most often you.
reply
paulpauper 2 days ago
"...well as model outputs at this https URL."

Had no idea it was possible to put a live url in the abstract of an arxiv listing

reply
measurablefunc 2 days ago
I still don't get how achieving 96% on some benchmark means it's a super genius but that last 4% is somehow still out of reach. The people who constantly compare robots to people should really ponder how a person who manages to achieve 90% on some advanced math benchmark still misses that last 10% somehow.
reply
bee_rider 2 days ago
This feels like a maybe interesting position, but I don’t really follow what you mean. Is it possible to just state it directly? Asking us to ponder is sort of vague.

These math LLMs seem very different from humans. A person has a specialty. An LLM that was as skilled as, say, a middling PhD recipient (not superhuman), but also was that skilled in literally every field, maybe somebody could argue that’s superhuman (“smarter” than any one human). By this standard a room full of people or an academic journal could also be seen as superhuman. Which is not unreasonable, communication is our superpower.

reply
sdenton4 2 days ago
Yeah - it's interesting where the edge is. In theory, an llm trained in everything should be more ready to make cross-field connections. But doing that well requires a certain kind of translation and problem selection work which is hard even for humans. (I would even say, beyond PhD level - knowing which problem is worth throwing PhD students at is the domain of professors... And many of them are bad at it, as well.)

On the human side, mathematical silos reduce our ability to notice opportunities for cross-silo applications. There should be lots of opportunity available.

reply
botusaurus 2 days ago
do you think Terence Tao can solve any math problem in the world that is solvable by another mathematician?
reply
arthurcolle 10 hours ago
Probably
reply
mlpoknbji 3 hours ago
Obviously not
reply
Joel_Mckay 2 days ago
Humans have heuristic biases, and intuition often doesn't succeed with the unknown.

https://en.wikipedia.org/wiki/List_of_cognitive_biases

LLM are good at search, but plagiarism is not "AI".

Leonhard Euler discovered many things by simply trying proofs everyone knew were impossible at the time. Additionally, folks like Isaac Newton and Gottfried Leibniz simply invented new approaches to solve general problems.

The folks that assume LLM are "AI"... also are biased to turn a blind eye to clear isomorphic plagiarism in the models. Note too, LLM activation capping only reduces aberrant offshoots from the expected reasoning model's behavioral vector (it can never be trusted.) Thus, it will spew nonsense when faced with some unknown domain search space.

Most exams do not have ambiguous or unknown contexts in the answer key, and a machine should score 100% matching documented solutions without fail. However, LLM would also require >75% of our galaxy energy output to reach 1 human level intelligence error rates in general.

YC has too many true believers with "AI" hype, and it is really disturbing. =3

https://www.youtube.com/watch?v=X6WHBO_Qc-Q

reply
botusaurus 2 days ago
> However, LLM would also require >75% of our galaxy energy output to reach 1 human level intelligence error rates in general.

citation needed

reply
Joel_Mckay 2 days ago
The activation capping effect on LLM behavior is available in this paper:

https://www.anthropic.com/research/assistant-axis

The estimated energy consumption versus error rate is likely projected from agent test and hidden-agent coverage.

You are correct, in that such a big number likely includes large errors itself given models change daily. =3

reply
botusaurus 2 days ago
ok, your quote was over generalized, you meant "current LLM need..." and not "any conceivable LLM"

although the word "energy" does not appear on that page, not sure where you get the galaxy energy consumption from

reply
Joel_Mckay 24 hours ago
In general, "any conceivable LLM" was the metric based on current energy usage trends within the known data-centers peak loads (likely much higher due to municipal NDA.) A straw-man argument on whether it is asymptotic or not is irrelevant with numbers that large. For example, 75% of our galaxy's energy output... now only needing 40% total output... does not correct a core model design problem.

LLM are not "AI", and unlikely ever will be due to that cost... but Neuromorphic computing is a more interesting area of study. =3

reply
whattheheckheck 2 days ago
Humans also spew nonsense when faced with some unknown domain search space
reply
Joel_Mckay 2 days ago
Indeed, the list of human cognitive biases was posted above.

The activation capping effect on LLM behavior is available in this paper:

https://www.anthropic.com/research/assistant-axis

This data should already have been added to the isomorphic plagiarism machine models.

Some seem to want to bury this thread, but I think you are hilarious. =3

reply
tug2024 2 days ago
[dead]
reply
engelo_b 2 days ago
[dead]
reply
naths88 2 days ago
This is not really true. It can become very difficult for mathematicians to review the work of their peers. Many times, mistakes also evade scrupulous verification, and the test of time/of many is the best test.
reply
ndriscoll 2 days ago
Why would you not have the bot write in a formal language (e.g. Lean) and then just typecheck it? Then you only need to decide that your definitions were interesting. LLMs are now extremely good at programming given a compiler/typechecker, so I'd expect them to be good at formal math as well. It's nearly (if not precisely) the exact same activity.
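
A tiny illustration of the "write it in Lean and typecheck it" workflow, as a sketch only: the theorem name `add_comm_example` is made up for illustration, and `Nat.add_comm` is the standard Lean 4 core lemma. If the proof term did not have the stated type, the checker would reject the file, so the human only has to judge whether the statement itself is interesting.

```lean
-- Statement: addition on natural numbers is commutative.
-- The proof term is checked by Lean's kernel; a wrong term fails to typecheck.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```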
reply
Agingcoder 2 days ago
That’s what they usually do, as I understand it: the llm generates a proof in Lean, and the proof checker verifies it.
reply
measurablefunc 2 days ago
You're confusing yourself w/ fancy words like "proof space". The LLM is not doing any kind of traversal in any meaningful sense of the word b/c the "proof" is often just grammatically coherent gibberish whereas an actual traversal in an actual space of proofs would never land on incorrect proofs.
reply
joshuaissac 2 days ago
My reading of their comment is that a proof space is a concept where a human guesses that a proof of some form q exists, and the AI searches a space S(q) where most points may be not valid proofs, but if there is a valid proof, it will hopefully be found.

So it is not a space of proofs in the sense that everything in a vector space is a vector. More like a space of sequences of statements, which have some particular pattern, and one of which might be a proof.

reply
measurablefunc 2 days ago
So it's not a proof space then. It's some computable graph where the edges are defined by standard autoregressive LLM single step execution & some of the vertices can be interpreted by theorem provers like Lean, Agda, Isabelle/HOL, Rocq, etc. That's still not any kind of space of proofs. Actually specifying the real logic of what is going on is much less confusing & does not lead readers astray w/ vague terms like proof spaces.
reply
nivcmo 2 days ago
[dead]
reply
tug2024 2 days ago
[dead]
reply
jodytornado 9 hours ago
[dead]
reply
allreduce 9 hours ago
You talk like an LLM and the entire concept seems to come from one. What value do you think others are going to get from work you didn't come up with and didn't bother to fully understand yourself?
reply
random3 8 hours ago
Plus “different” website (website has a screen recording taken by phone) plus a link to the “technical white paper” and a demo - only the public sale date of the token is missing
reply
u1hcw9nx 2 days ago
>The results of this paper should not be interpreted as suggesting that AI can consistently solve research-level mathematics questions. In fact, our anecdotal experience is the opposite: success cases are rare, and an apt intuition for autonomous capabilities (and limitations) may currently be important for finding such cases. The papers (ACGKMP26; Feng26; LeeSeo26) grew out of spontaneous positive outcomes in a wider benchmarking effort on research-level problems; for most of these problems, no autonomous progress was made.
reply
getnormality 2 days ago
The ridiculous resources being thrown at this, and the ability through RLVR to throw gigatons of spaghetti at the wall to see what sticks, should make it very clear just how incredibly inefficient frontier AI reasoning is - however spectacular it may be that it can reason at this level at all.
reply
asdff 2 days ago
Long term though, AI will win out. The thing is that you can improve capability. You can make the context window bigger. You can throw more compute at it. Improve efficiency of chips. Throw more power at it. And indeed, that has worked so far to turn the gpts of 2017 into the gpts of 2026 that can actually do stuff.

Meanwhile, human thoughtpower cannot really be improved. Once the tipping point is reached where computers exceed humans, humans will never be able to catch up by definition.

Humans can also only maintain so much contextual information and scope. They can only learn so much in the time scale they have to get up to speed. They can only do so much within the timescale of their own mental peak before they fall off and go senile or die. While these limits are bound by evolution, they change on the orders of thousands of generations, and require strong selection for these changes at that.

The turtle has marched far already, but the hare in the speeding car they continually improve is not far behind. Efficiency doesn't matter. What is inefficient now will be trivial to parallelize and scale in the future, as it's always been in the history of compute. We'd have to engage in something like the Bene Gesserit breeding program if we are to have human thoughtpower be competitive against compute in the future.

reply
getnormality 2 days ago
You're presupposing an answer to what is actually the most interesting question in AI right now: does scaling continue at a sufficiently favorable rate, and if so, how?

The AI companies and their frontier models have already ingested the whole internet and reoriented economic growth around data center construction. Meanwhile, Google throttles my own Gemini Pro usage with increasingly tight constraints. The big firms are feeling the pain on the compute side.

Substantial improvements must now come from algorithmic efficiency, which is bottlenecked mostly by human ingenuity. AI-assisted coding will help somewhat, but only with the drudgery, not the hardest parts.

If we ask a frontier AI researcher how they do algorithmic innovation, I am quite sure the answer will not be "the AI does it for me."

reply
asdff 2 days ago
Of course it continues. Look at the investment in hardware going on. Even with no algorithmic efficiency improvement that is just going to force power out of the equation just like a massive inefficient V8 engine with paltry horsepower per liter figures.
reply
getnormality 2 days ago
I believe it continues, but I don't know if the rate is that favorable. Today's gigawatt-hungry models that can cost $10-100 per task or more to run... still can't beat Pokémon without a harness. And Pokémon is far from one task.

I believe AGI is probably coming, but not on a predictable timeline or via blind scaling.

reply
asdff 2 days ago
The harness can be iterated upon (1).

I don't think the sci-fi definition of AGI is happening soon, but something more boring may arrive in the meantime that is perhaps nearly as destructive to life as we know it as knowledge workers today. That is, still using a human, but increasingly fewer humans of lower and lower skill as the models are able to output more and more complete solutions. And naturally, there are no geographic or governmental barriers to protect employment in this sector, or physical realities that demand the jobs take place in a certain place of the world. This path forward is ripe for offshoring to the lowest-cost internet-connected labor available, long term. Other knowledge work professions like lawyer or doctor set up legal moats to protect their field and compensation decades ago, whereas there is nothing similar to protect the domestic computer science engineer.

By all means they are on this trajectory already. You often see comments on here from developers who say something along the lines of the models years ago needing careful oversight, now they are able to trust them to do more of the project accurately with less oversight as a result. Of course you will find anecdotes either way, but as the years go on I see more and more devs reporting useful output from these tools.

1. https://news.ycombinator.com/item?id=46988596

reply
getnormality 24 hours ago
In my experience, AI enables smart people to do their best work while automating zero-quality work like SEO spam that no humans should have been doing in the first place. I have yet to see anything that I would remotely call tragic.
reply
margorczynski 2 days ago
> legal moats to protect their field

I wonder how they hold up when there's a big enough benefit of using AI over human work. Like how are politicians to explain these moats to the masses when your AI doctor costs 10x less and according to a multitude of studies is much better at diagnosis?

Or in law? I've read China is pushing AI judges because people weren't happy with the impartiality of the human ones. I think in general people overestimate how much these legal moats are worth in the long run.

reply
asdff 4 hours ago
One might ask how they explain the moats already. A nurse can do plenty of what a doctor does. One questions if a law partner is really producing 10x the work of a new law grad to justify that hourly difference. Same is true for banking; all that money spent on salary, bonus, stock options, converted to luxury homes, products, and services, is surely a waste compared to the "efficiency" one might get out of a math post doctoral researcher clearing only $54k a year in academia. All examples of a field carving out a safe and luxurious harbor for themselves, protected by various degrees of regulation and cartel behavior, that has been practiced long enough now so as to be an unremarkable and widely accepted part of the field.
reply
measurablefunc 22 hours ago
Who handles the liability when the AI makes a catastrophic error in your diagnosis?
reply
margorczynski 9 hours ago
Insurance? Some general fund run by the government? There are a lot of options and the ones making the law can change it as they see fit.
reply
measurablefunc 32 minutes ago
So profits go to the AI company but the liability is socialized? Where is the logic in your proposal?
reply
jwpapi 22 hours ago
Honestly I'm not even sure how much of the last 12 months was model improvement, or whether it was mainly harness improvement. It feels to me like I could’ve done the same stuff with 4, if I were able to split every task into multiple subtasks with perfect prompts. So to me it could totally be that there is an inner harnessing happening that accounts for the recent improvements, but then I ask myself: is this maybe the same with our own intelligence?
reply
p1esk 20 hours ago
> The AI companies and their frontier models have already ingested the whole internet

Have the frontier models been trained on the whole of youtube?

reply
umairnadeem123 21 hours ago
this is the key tension imo. do you think labs are underinvesting in eval infra because scaling headlines are easier to sell?

also curious what would change your mind first: a clear algorithmic breakthrough, or just sustained cost/latency drops from systems work?

reply
amelius 2 days ago
You are forgetting that the current approach to AI may lead to a flat asymptote that still lies well below human capabilities.
reply
p1esk 21 hours ago
The AI I’m using (gpt5.2) is already vastly more capable than me in pretty much any mental task - even in my domain of expertise. I will be surprised if I still have my job one year from now.

And robotics field advances pretty fast too. I will be surprised if personal humanoid robots that can do any physical task (plumbing, cooking, etc) won’t appear within 5 years.

reply
FuckButtons 23 hours ago
You’re pre-supposing that we can actually afford to just keep throwing more compute at the problem.

Moore's law is long dead, leading edge nodes are getting ever more expensive, and the most recent generation of tensor silicon is not significantly better in terms of flops/watt than the previous generation.

Given that model performance has consistently trended log linear with compute thrown at the problem, there must be a point at which it is no longer economically viable to throw more flops at the problem.
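
To make the log-linear point concrete, here is a rough sketch under an assumed functional form. The performance measure $P$, the compute budget $C$, and the constants $a$ and $b$ are all hypothetical placeholders, not fitted values from any paper.

```latex
P(C) = a + b \log C, \qquad b > 0
\quad\Longrightarrow\quad
P(C') - P(C) = \Delta \iff C' = C \, e^{\Delta / b}
```

Under that assumption, every fixed gain $\Delta$ multiplies the required compute by the same factor $e^{\Delta/b}$, so $n$ such gains cost a factor of $e^{n\Delta/b}$: linear progress, exponential bill.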

reply
__MatrixMan__ 23 hours ago
You seem to have a very one-dimensional perspective on "human thoughtpower".
reply
thereitgoes456 2 days ago
I credit them for acknowledging their limitations and not actively trying to be misleading. Unlike a certain other company in the space.
reply
noosphr 2 days ago
I've been at this longer than most.

After three major generations of models the "intuition" I've built isn't about what AI can do, but about what a specific model family can do.

No one cares what the gotchas in gpt3 are because it's a stupid model. In two years no one will care what they were for gpt5 or Claude 4 for the same reason.

We currently have the option of wasting months of our lives to get good at a specific model, or burn millions to try and get those models to do things by themselves.

Neither option is viable long term.

reply
CuriouslyC 2 days ago
My philosophy is to try and model the trajectories of these systems and build rigging around where the curve is flat (e.g. models have been producing big balls of mud since the beginning and this hasn't improved meaningfully). Models also have a strong mean bias that I don't expect to go away any time soon.

Trying to outsmart the models at core behaviors over time is asking to re-learn the bitter lesson though.

reply
noosphr 20 hours ago
The issue is that finding where the curve is flat requires either intuition or a lot of money.

The contrapositive of the bitter lesson is that any hand crafted system will over any market meaningful time scale outperform a system using data and compute.

reply
pjbk 21 hours ago
It seems that common sense is very difficult to program. Perhaps because we don't really know how to properly define it or what an encoding of it would look like.

All of these models keep trying to convince me they can solve the Post Correspondence Problem.

reply
umairnadeem123 21 hours ago
yeah this resonates. do you think model churn is getting faster than teams' ability to build stable evals?

have you found any eval set that survives model generations, or does every major release force you to rewrite harness + rubrics?

reply
noosphr 21 hours ago
Dynamic evals of simple tasks are the only ones that matter, the closer to your target domain the better. If the model can't get simple tasks right it won't be able to get complex tasks right.

Finding parse trees given a grammar and a sentence. Paths between vertices in a graph. Asking it to do a search and replace on a document. Anything that you can scale and test against an algorithm's answer automatically.
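
A minimal sketch of what such a dynamic eval could look like for the graph-path case. Everything here is an illustrative assumption: the function names `random_graph`, `shortest_path` and `grade` are made up, and the model call itself is left out (whatever produces `answer` is up to you).

```python
import random
from collections import deque


def random_graph(n, p, rng):
    """Return an undirected graph as an adjacency dict over vertices 0..n-1."""
    adj = {v: set() for v in range(n)}
    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:
                adj[u].add(v)
                adj[v].add(u)
    return adj


def shortest_path(adj, src, dst):
    """Reference answer: BFS shortest path, or None if dst is unreachable."""
    parent = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = parent[u]
            return path[::-1]
        for v in sorted(adj[u]):
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return None


def grade(adj, src, dst, answer):
    """Check a claimed path (or None for 'no path') against the graph."""
    reference = shortest_path(adj, src, dst)
    if answer is None or reference is None:
        # Correct only if both agree that no path exists.
        return answer is None and reference is None
    if not answer or answer[0] != src or answer[-1] != dst:
        return False
    # Any valid path is accepted, not only the shortest one.
    return all(b in adj[a] for a, b in zip(answer, answer[1:]))
```

Because `random_graph` takes its own `random.Random` instance, each run can draw fresh instances (so nothing can be memorized) while a fixed seed still makes a failure reproducible. Scaling `n` up gives the difficulty knob.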

reply