> Instead of taking a stab in the dark, Leanstral rolled up its sleeves. It successfully built test code to recreate the failing environment and diagnosed the underlying issue with definitional equality. The model correctly identified that because def creates a rigid definition requiring explicit unfolding, it was actively blocking the rw tactic from seeing the underlying structure it needed to match.
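For anyone who hasn't hit this before, here is a minimal Lean 4 sketch of the behaviour the quote describes. The definition `myAdd` is hypothetical and not taken from the Leanstral post. `rw` matches syntactically, so it cannot see the `+` hidden behind a `def` until the definition is unfolded.

```lean
-- Hypothetical definition, for illustration only.
def myAdd (a b : Nat) : Nat := a + b

example (a b : Nat) : myAdd a b = myAdd b a := by
  -- `rw [Nat.add_comm]` on its own fails here: the goal shows no `_ + _` to match.
  unfold myAdd
  -- The goal is now `a + b = b + a`, which `rw` can rewrite and close.
  rw [Nat.add_comm]
```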
Otherwise in some cases, you get this issue [0].
[0] https://sketch.dev/blog/our-first-outage-from-llm-written-co...
LLMs are good at writing tests in my experience.
Europeans don't want to be dependent, and they are giving away for free what US investors planned to charge for at a 90% margin.
Amazing! What a blast. Thank you for your service (this first 100M$ burned to POC GPT1 and from here, we are so good to go)
If I do not accept that level of independence but want more, I need to buy what's on OVH, Scaleway, Ionos etc. or host my own, but that usually means even smaller, worse models or a lot of investment.
Nevertheless, the "band" that Mistral occupies for economic success is very narrow. Basically just people who need independence "on paper" but not really. Because if I'm searching for actual independence, there's no way I could give them money at the moment for one of their products and it making sense, cause none of their plans are an actual independence-improvement over, let's say, Amazon Bedrock.
I really really want to support them, but it must make economic sense for my company, too, and it doesn't.
The key is to avoid blackmail. Remember Oracle with DBs: people learned not to build on top of irreplaceable stuff.
Also, they're listing CoreWeave as an inference provider in the "EEA" area, but CoreWeave is of course also a US company. Even if they have their data center physically in the EU, it must be considered open access for the USA due to the CLOUD Act.
https://trust.mistral.ai/subprocessors
If what you say is true, they have a communications problem and they need to fix that urgently. Right now, this is why they don't get my business. Others will have made the same decision based on their own subprocessor list.
Or did you mean, they're like, right now building it and plan to move there, but it's not up yet?
Sounds like a worthy challenge for this community; mind giving actual examples to see what others can suggest?
Right now, we see a lot of business experts in enterprises tempted to use AI to implement business logic so they don't have to wait for (or pay) software experts. Would this kind of technology help these users any time soon?
My current theory is that the real breakthrough for these non-developers will only happen when they can actually verify the result themselves without needing another expert in the loop. But I don't see that with formal validation anytime soon.
Do I overlook something?
Many comments here point out that Mistral's models are not keeping up with other frontier models - this has been my personal experience as well. However, we need more diversity of model alignment techniques and companies training them - so any company taking this seriously is valuable.
This model is specifically trained on this task and significantly[1] underperforms opus.
Opus costs about 6x more.
Which seems... totally worth it based on the task at hand.
[1]: based on the total spread of tested models
Most Copilot customers use Copilot because Microsoft has been able to pinky promise some level of control for their sensitive data. That's why many don't get to use Claude or Codex or Mistral directly at work and instead are forced through their lobotomised Copilot flavours.
Remember, as of yet, companies haven't been able to actually measure the value of LLMs ... so it's all in the hands of Legal to choose which models you can use based on marketing and big words.
That would also help to reduce our dependency on American Hyperscalers, which is much needed given how untrustworthy the US is right now. (And also hostile towards Europe as their new security strategy lays out)
The AI Act absolutely befuddled me. How could you release relatively strict regulation for a technology that isn't really being used yet and is in the early stages of development? How did they not foresee this kneecapping AI investment and development in Europe? If I were a tinfoil hat wearer I'd probably say that this was intentional sabotage, because this was such an obvious consequence.
Mistral is great, but they haven't kept up with Qwen (at least with Mistral Small 4). Leanstral seems interesting, so we'll have to see how it does.
Speaking as someone who's been doing stats and ML for a while now, the AI act is pretty good. The compliance burden falls mostly on the companies big enough to handle it.
The foundation model parts are stupid though.
It's not an excuse. Anybody with half a working brain should've been able to tell that this was going to happen. You can't regulate a field in its infancy and expect it to ever function.
>The compliance burden falls mostly on the companies big enough to handle it.
You mean it falls on anyone that tries to compete with a model. There's a random 10^25 FLOPS compute rule in there. The B300 does 2500-3750 TFLOPS at fp16. 200 of these can hit that compute number in 6 months, which means that in a few years time pretty much every model is going to hit that.
And if somebody figures out fp8 training then it would only take 10 of these GPUs to hit it in 6 months.
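A quick back-of-the-envelope check of the fp16 figure, as a sketch only. It assumes 30-day months and 100% utilization of the quoted peak throughput (real training runs get well below peak), and it reads the 10^25 threshold as total operations over the run.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600  # assumption: 30-day months
THRESHOLD = 1e25                    # total floating-point operations


def total_flop(num_gpus, tflops_per_gpu, months, utilization=1.0):
    """Total floating-point operations performed over the whole run."""
    flop_per_second = num_gpus * tflops_per_gpu * 1e12 * utilization
    return flop_per_second * months * SECONDS_PER_MONTH


# The quoted per-GPU range of 2500-3750 TFLOPS, 200 GPUs, 6 months.
low = total_flop(200, 2500, 6)
high = total_flop(200, 3750, 6)
print(f"{low:.2e} to {high:.2e} FLOP (threshold {THRESHOLD:.0e})")
```

Under those assumptions the range straddles the threshold: the low end of the quoted throughput falls a bit short, the high end goes slightly over.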
The copyright rule and having to disclose what was trained on also means that it will be impossible to have enough training data for an EU model. And this even applies to people that make the model free and open weights.
I don't see how it is possible for any European AI model to compete. Even if these restrictions were lifted it would still push away investors because of the increased risk of stupid regulation.
Still, the more interesting comparison would be against something such as Codex.
I think it would still be fine for the legs and on battery for relatively short loads: https://www.notebookcheck.net/Apple-MacBook-Pro-M5-2025-revi...
But 40 degrees and 30W of heat is a bit more than comfortable if you run the agent continuously.
Most people I know who use agents for building software have tried to switch to local development, and every single time they switched back to Claude/Codex.
It's just not worth it. The models are that much better and continue to get released / improve.
And it's much cheaper unless you're doing like 24/7 stuff.
Even on the $200/m plan, that's cheaper than buying a $3k dgx or $5k m4 max with enough ram.
Not to mention you can no longer use your laptop as a laptop as the power draw drains it - you'd need to host separately and connect
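The break-even point for those numbers, as a rough sketch. It only divides the quoted hardware prices by the quoted plan price, and ignores electricity, resale value, and any difference in model quality.

```python
def break_even_months(hardware_cost, monthly_subscription):
    """Months of subscription fees that add up to the hardware price."""
    return hardware_cost / monthly_subscription


dgx_months = break_even_months(3000, 200)     # the $3k box vs the $200/m plan
m4_max_months = break_even_months(5000, 200)  # the $5k laptop vs the $200/m plan
print(dgx_months, m4_max_months)
```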
I understand the value proposition of the frontier cloud models, but we're not as far off from self-hosting as you think, and it's becoming more viable for domain-specific models.
Not sure I really understand the comparisons though. They emphasize the cost savings relative to Haiku, but Haiku kinda sucks at this task, and Leanstral is worse? If you're optimizing for correctness, why would "yeah it sucks but it's 10 times cheaper" be relevant? Or am I misunderstanding something?
On the promising side, Opus doesn't look great at this benchmark either — maybe we can get better than Opus results by scaling this up. I guess that's the takeaway here.
I do like agents (like Claude Code), but I don't consider myself to be vibe coding when I use them. Either I'm using a language/framework I know and check every step. OR I'm learning, checking every step and asking for explanations.
I tried vibe coding, and really dislike the feeling I have when doing it. It feels like building a house, but without caring about it, and just using whatever tech. Sure I may have moisture problems later, but it's a throwaway house anyway. That's how I feel about it. Maybe I have a wrong definition.
Maybe it's good to not use "vibe coding" as a synonym for programming with agent assistance. Just to protect our profession. Like: "Ah you're vibing" (because you have Claude Code open), "No, I'm using CC to essentially type faster and prevent syntax errors and get better test coverage, maybe to get some smart solutions without deep research. But I understand and vouch for every loc here. 'We are not the same.'"
No, I feel the same. I vibe-coded a few projects and after a few weeks I just threw them away, ultimately I felt I just wasted my time and wished I could get it back to do something useful.
I see a whole spectrum between those two. I typically alternate between "writing code manually and asking AI for code examples" (ChatGPT coding), and "giving AI specific instructions like, write a function blarg that does foo".
The latter I call Power Coding, in the sense of power armor, because you're still in control and mostly moving manually, but you're much stronger and faster.
I like this better than "tell agent to make a bunch of changes and come back later" because first of all it doesn't break flow (you can use a smaller model for such fine-grained changes so it goes very fast -- it's "realtime"), and second, you don't ever desync from the codebase and need to spend extra time figuring out what the AI did. Each change is sanity-checked as it comes in.
So you stay active, and the code stays slop-free.
I don't hear a lot of people doing this though? Maybe we just don't have good language for it.
Interesting thought. I guess we don't really, vibe coding is too powerful a term. But perhaps just call it LLM assisted programming? Where we used to do Stack Overflow assisted programming. LLM assisted programming is more focused, goes faster. But since you're wandering around less I guess you learn less, you're exposed to less new information, some of it was helpful in unexpected ways. Now you have to make learning a specific part of your flow, and that takes discipline/time. But is well worth it imho. Actually, for me it's the only way to enjoy it.
So, most homebuilders (in the US) unfortunately.
It came at great cost though, I hated the process of learning and the execution. I was less than happy for some years. But I feel even more uncomfortable vibe-home-improving than I do vibe-coding. The place is starting to look nice now though.
But then the Lean4 specification effectively becomes the software artifact.
And we're sort of back to square 1. How do you verify a Lean4 spec is correct (and that it describes what needs to be built in the first place) without human review?
Specifications are smaller than the full code, just as high level code is smaller than the functionally equivalent assembly. As we ascend the abstraction ladder the amount of reading a human needs to do decreases. I don't think this should really count as "back to square 1".
binary => hexadecimal instructions
hexadecimal => assembly language
assembly => portable, "high-level" languages (C, FORTRAN, COBOL, etc.)
HLLs => 3GLs (BASIC, C++, Pascal, Java, C#, JavaScript, etc.)
3GLs => 4GLs/DSLs/RADs and "low-code/no-code"[0]
Among the RADs is Microsoft Visual Basic, which along with WinForms and SQL was supposed to make business programmers nearly obsolete, but instead became a new onramp into programming. In particular, I'd like to highlight UML, which was supposed to mostly obsolete programming through auto-generated code from object-oriented class diagrams.[1] The promise was that "business domain experts" could model their domain via visual UML tooling, and the codegen would handle it from there. In practice, UML-built applications became maintenance nightmares.
In every one of these examples, the artifact that people made "instead of programming" became the de-facto programming language, needing to be maintained over time, abstracted, updated, consumed behind APIs, etc. -- and programmers had to be called in to manage the mess.
It's interesting that Spec4 can be auto-generated, then used to generate code. My question is - what do you do when you have (a) consumers depending on a stable API, and (b) requests for new features? Maybe hand the job to Claude Code or a human developer with a suite of unit tests to guarantee API compatibility, but at that point we're back to an agent (LLM or human) doing the work of programming, with the Spec4 code as the programming language being updated and maintained.
[0] https://en.wikipedia.org/wiki/Fourth-generation_programming_...
A formal spec in Lean is typically 10-50x shorter than the code it proves correct. More importantly, Lean's type checker is itself a small, trusted kernel (~10k lines) that has been scrutinized by the PL community for years. So you're not trusting the agent — you're trusting the kernel.
The practical workflow isn't "agent writes spec + code." It's: human writes spec (the hard creative part), agent generates proof that code satisfies spec, Lean kernel mechanically checks the proof. The agent can hallucinate all it wants in step 2 — if the proof doesn't typecheck, it gets rejected deterministically.
The real bottleneck is step 1: writing good specs requires domain expertise. But that's exactly where humans should stay in the loop. It's a much better division of labor than reviewing thousands of lines of generated code.
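The shape of that loop, as a toy sketch. `propose` and `accepts` are hypothetical stand-ins: in a real setup the first would call a model and the second would run the Lean checker on the candidate proof. The point is only that nothing the proposer says matters unless the checker accepts it.

```python
from typing import Callable, Optional


def first_accepted(
    propose: Callable[[int], str],
    accepts: Callable[[str], bool],
    max_attempts: int,
) -> Optional[str]:
    """Return the first candidate the checker accepts, or None if all fail."""
    for attempt in range(max_attempts):
        candidate = propose(attempt)
        if accepts(candidate):
            return candidate
    return None


# Toy stand-ins: the "agent" guesses, the "kernel" accepts exactly one answer.
guesses = ["sorry", "by simp", "by omega"]
proof = first_accepted(
    propose=lambda i: guesses[i],
    accepts=lambda p: p == "by omega",
    max_attempts=len(guesses),
)
print(proof)
```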
> I'm interested to see what it is in the age of LLMs or similar future tools. I suspect a future phase change might be towards disregarding how easy it is for humans to work with the code and instead focus on provability, testing, perhaps combined with token efficiency.
> Maybe Lean combined with Rust shrunk down to something that is very compiler friendly. Imagine if you could specify what you need in high level language and instead of getting back "vibe code", you get back proven correct code, because that's the only kind of code that will successfully compile.
But indeed, if the spec includes 8.7k of C code, that is problematic. If you cannot look at the theorem and see that it is what you mean, that is a problem. That is why abstraction is so important; your ultimate spec should not include C-code, that is just too low-level.
Knowing whether those theorems are the right theorems for the problem can be as difficult as understanding the implementation itself. Hence the example of seL4, where the number of theorems exceeds the lines of code in the original implementation and the formal model is large.
It's my experience that most people doing formal methods have seen cases where they actually proved something slightly different than what they intended to. This usually involves an unintentional assumption that isn't generally true.
What value does this add to the conversation? I’m not seeing it: am I missing something? It comes across as a kind of insult.
They made a good point in my opinion! (The “Uhm no” part got it off on the wrong foot, I will admit.) But even if you felt annoyed or didn’t agree with the point, it was substantive and moved the conversation forward. I’m here for the (genuine) questions and (constructive) debate and (civil) pushback.
I like to welcome new users before they take too much of a beating. That can come later when they are too invested to leave and/or when morale needs improving.
So welcome! Bring a helmet, and don’t stop disagreeing.
The average quality of an AI announcement is that of a Memecoin. Lots of graphs, meandering text and no substance.
I can’t even convince most developers to use model checkers. Far more informal than a full proof in Lean. Still highly useful in many engineering tasks. People prefer boxes and arrows and waving their hands.
Anyway, I don’t know that I’d want to have a system vibe code a proof. These types of proofs, I suspect, aren’t going to be generated to be readable, elegant, and well understood by people. Like the programs these systems generate, they will look plausible.
And besides, you will still need a human to review the proof and make sure it’s specifying the right things. This doesn’t solve that requirement.
Although I have thought that it would be useful to have a system that could prove trivial lemmas in the proof. That would be very neat.
It does actually significantly boost performance. There was an article on here about it recently, I'll see if I can find it.
Edit: https://news.ycombinator.com/item?id=44630724
They found the more different the models were (the less overlap in correctly solved problems), the more it boosted the score.
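A small sketch of why overlap matters, with made-up problem sets (not data from the linked article). It assumes a verifier can tell which model's answer is correct, so the combined system solves the union of what each model solves.

```python
def combined_rate(solved_a, solved_b, total):
    """Fraction of problems solved by at least one of the two models."""
    return len(solved_a | solved_b) / total


TOTAL = 10
model_a = {0, 1, 2, 3, 4}          # solves 5 of 10
similar_model = {0, 1, 2, 3, 5}    # also 5 of 10, mostly the same problems
different_model = {5, 6, 7, 8, 9}  # also 5 of 10, no overlap with model_a

print(combined_rate(model_a, similar_model, TOTAL))
print(combined_rate(model_a, different_model, TOTAL))
```

Both partners have the same individual score, but only the one with little overlap moves the combined score much.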
We saw last year that it's remarkably easy to bypass safety filters by fine-tuning GPT, even when the fine-tuning seems innocuous. e.g. the paper about security research finetuning (getting the model to add vulnerabilities) producing misaligned outputs in other areas. It seems like it flipped some kind of global evil neuron. (Maybe they can freeze that one during finetuning? haha)
Found it: Emergent Misalignment
Remember, language is what we make it. Dictionaries are useful catalogs of usage but we make the judgment calls.
* Even with the process, much is not well understood! / The ethics of releasing an open weights model at some capability level is a separate discussion.
It's certainly less mature when it comes to verified programming, but its appeal to mathematicians (rather than formal methods experts) has earned it much respect.
Could definitely be interesting for having another model run over the codebase when looking for improvements
I actively use gemini-3.1-pro-preview, claude-4.6-opus-high, and gpt-5.3-codex as well. I prefer them all for different reasons, however I usually _start_ with mistral if it's an option.
It's funny because I just took a break from it to read some hn and found this post.
I was surprised: even tho it was the cheapest option (against other small models from Anthropic) it performed the best in my benchmarks.
I've also used Devstral Small to make a simple raytracer[5][6] (it was made using the "classic" chat by copy/pasting code, not any agentic approach and i did fix bits of it in the process) and a quick-and-dirty "games database" in Python+Flask+Sqlite for my own use (mainly a game backlog DB :-P).
I also use it to make various small snippets, have it generate some boilerplate stuff (e.g. i have an enum in C and want to write a function that prints names for each enum value or have it match a string i read from a json file with the appropriate enum value), "translate" between languages (i had it recently convert some matrix code that i had written in Pascal into C), etc.
[0] https://i.imgur.com/f4OrNI5.png
[1] https://i.imgur.com/Zac3P4t.png
[2] https://i.imgur.com/jPYYKCd.png
[3] https://i.imgur.com/WZGfCdq.png
[4] https://i.imgur.com/ytYkyQW.png
[5] https://i.imgur.com/FevOm0o.png (screenshot)
[6] https://app.filen.io/#/d/e05ae468-6741-453c-a18d-e83dcc3de92... (C code)
Works really well. Extracts companies you have dealt with, people, topics, events, locations, financial transactions, bills, etc.
it clearly and demonstrably does not. in fact, from eyeballing their chart Qwen, Kimi, and GLM scale linearly whereas Leanstral does not. But this is not surprising because Alibaba, Moonshot, and Zhipu have hundreds of employees each and hundreds of millions of dollars of investment each.
Model               Cost ($)   Score
..
Claude Opus         1,650      39.6
..
Leanstral pass@8    145        31.0
Leanstral pass@16   290        31.9
Mistral seems to focus on a different market than the others. Their best model is meh, their best ASR model locally is either rather slow compared to Parakeet on similar languages, or not as good for others (like qwen ASR).
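Taking the three rows of the cost/score table above at face value, a quick sketch of dollars per score point. This is a crude ratio, not a metric from the announcement, and it says nothing about whether the last few points are worth more than the first ones.

```python
# (cost in $, score) copied from the table rows quoted above.
rows = {
    "Claude Opus": (1650, 39.6),
    "Leanstral pass@8": (145, 31.0),
    "Leanstral pass@16": (290, 31.9),
}

cost_per_point = {name: cost / score for name, (cost, score) in rows.items()}

for name, value in cost_per_point.items():
    print(f"{name}: ${value:.1f} per point")
```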
Side note: Lean seems quite unreadable with tons of single letter variable names. Part of it is me being unaccustomed with it, but still.
There are two compatible and important (but different) questions in play:
1. Is a program correct relative to a formal specification?
2. Is the formal specification what we mean/want?
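A tiny Lean 4 sketch of how the two questions come apart. `badSort` and its spec are hypothetical: the proof for question 1 goes through, while the answer to question 2 is obviously no.

```lean
-- Hypothetical "sorting" function that just throws its input away.
def badSort (_ : List Nat) : List Nat := []

-- Question 1: the program provably meets this (too weak) specification.
theorem badSort_length_le (xs : List Nat) :
    (badSort xs).length ≤ xs.length := by
  unfold badSort
  exact Nat.zero_le _

-- Question 2: nobody who asked for a sort meant "any list that is not longer".
```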
*: Worth asking: “Was that other person necessarily wrong? Or perhaps they are discussing a different aspect or framing?” AKA: “be curious and charitable”. I’m not going to link to the specific threads, but they have happened / are happening. Le Sigh.
I tried it out myself, I let AI add action transitions through the code, like: // A -> B: some description. Then I validate via a test that every action transition defined in my model is also defined somewhere commented in code, and other way around that every comment exists in the model.
Finally, I let AI write model check queries on particular properties. If I notice a particular bug, then I ask AI to analyze the model and the model check queries on why it could happen, and ask to strengthen it.
It sounds like a lot of effort, but I got it working in a half hour.
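The comment-versus-model check described above could look roughly like this. It is a sketch under assumptions: the `// A -> B: description` comment format is taken from the comment, but the state names, the sample source, and the idea of representing the model as a set of pairs are all made up for illustration.

```python
import re

# Matches comments of the form "// A -> B: some description".
TRANSITION = re.compile(r"//\s*(\w+)\s*->\s*(\w+)\s*:")


def transitions_in_source(source):
    """Collect every (from_state, to_state) pair mentioned in comments."""
    return {(m.group(1), m.group(2)) for m in TRANSITION.finditer(source)}


def compare(model, source):
    """Return (transitions missing from the code, transitions missing from the model)."""
    in_code = transitions_in_source(source)
    return model - in_code, in_code - model


source = """
// Idle -> Running: user pressed start
start();
// Running -> Done: work finished
finish();
"""
model = {("Idle", "Running"), ("Running", "Done"), ("Done", "Idle")}

missing_in_code, missing_in_model = compare(model, source)
print(missing_in_code, missing_in_model)
```

A test would then simply assert that both sets are empty.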
TDD, verification, whatever your tool; verification suites of all sorts accrue over time into a very detailed repository of documentation of how things are supposed to work that, being executable, puts zero tokens in the context when the code is correct.
It’s more powerful than reams upon reams of markdown specs. That’s because it encodes details, not intent. Your intent is helpful at the leading edge of the process, but the codified result needs shoring up to prevent regression. That’s the area software engineering has always ignored because we have gotten by on letting teams hold context in their heads and docs.
As software gets more complex we need better solutions than “go ask Jim about that, bloke’s been in the code for years”.
Be careful here - make sure you encode the right details. I've seen many cases where the tests are encoding the details of how it was implemented and not what it is intended to do. This means that you can't refactor anything because your tests are enforcing a design. (refactor is changing code without deleting tests, the trick is how can you make design changes without deleting tests - which means you have to test as much as possible at a point where changing that part of the design isn't possible anyway)
As part of the proper testing strategy, you will have tests that cover individual behavior of a small block/function (real "unit" tests), tests that cover integration points only up to the integration itself, and a small number of end-to-end or multi-component integration tests.
Only the last category should stay mostly unchanged under refactoring, depending on the type of refactor you are doing.
Integration tests will obviously be affected when you are refactoring the interfaces between components, and unit tests will be affected when you are refactoring the components themselves. Yes, you should apply the strategy that keeps it under incremental reverse TDD approach (do the refactor and keep the old interface, potentially by calling into new API from the old; then in second step replace use of old API as well, including in tests).
Tests generally define behavior and implementation in a TDD approach: it'd be weird if they do not need changing at all when you are changing the implementation.
I'm not against TDD or verification-first development, but I don't think writing that as code is the end-goal. I'll concede that there's millions of lines of tests that already exist, so we should be using those as a foundation while everything else catches up.
The scientific approach is theory driven, not test driven. Understanding (and the power that gives us) is the goal.
At the risk of stretching the analogy, the LLM's internal representation is that theory: gradient-descent has tried to "explain" its input corpus (+ RL fine-tuning), which will likely contain relevant source code, documentation, papers, etc. to our problem.
I'd also say that a piece of software is a theory too (quite literally, if we follow Curry-Howard). A piece of software generated by an LLM is a more-specific, more-explicit subset of its internal NN model.
Tests, and other real CLI interactions, allow the model to find out that it's wrong (~empiricism); compared to going round and round in chain-of-thought (~philosophy).
Of course, test failures don't tell us how to make it actually pass; the same way that unexpected experimental/observational results don't tell us what an appropriate explanation/theory should be (see: Dark matter, dark energy, etc.!)
Vibing gives you something like the geocentric model of the solar system. It kind of works but it's much more complicated and hard to work with.
I guess the current wave is going to give us Software Development Epicycles (SDEC?)
* All analogies are "wrong", some analogies are useful
Obviously the author has to do much work in selecting the correct bits from this baggage to get a structure that makes useful predictions, that is to say predictions that reproduce observable facts. But ultimately the theory comes from the author, not from the facts; it would be hard to imagine how one can come up with a theory that doesn't fit all the facts known to an author if the theory truly "emanated" from the facts in any sense strict enough to matter.
I disagree. Having tests (even if the LLM wrote them itself!) gives the model some grounding, and exposes some of its inconsistencies. LLMs are not logically-omniscient; they can "change their minds" (next-token probabilities) when confronted with evidence (e.g. test failure messages). Chain-of-thought allows more computation to happen; but it doesn't give the model any extra evidence (i.e. Shannon information; outcomes that are surprising, given its prior probabilities).
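To put a number on "surprising, given its prior probabilities": the standard definition of self-information is minus the base-2 log of the probability. A sketch, with made-up priors for how likely the model thought a test failure was.

```python
import math


def surprisal_bits(p):
    """Shannon self-information, in bits, of an outcome with probability p."""
    return -math.log2(p)


# An outcome the model was already sure of carries no information.
print(surprisal_bits(1.0))
# A coin-flip outcome carries one bit.
print(surprisal_bits(0.5))
# A test failure the model gave a 1% chance carries several bits.
print(surprisal_bits(0.01))
```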
Don’t like the layout? Let’s reroll! Back to the generative kitchen agent for a new one! ($$$)
The big labs will gladly let you reroll until you’re happy. But software - and kitchens - should not be generated in a casino.
A finished software product - like a working kitchen - is a fractal collection of tiny details. Keeping your finished software from falling apart under its own weight means upholding as many of those details as possible.
Like a good kitchen, a few differences are all that stand between software that works and software that’s hell. In software, the probability that an agent will get 100% of the details right is very, very small.
Details matter.
People metaphorically do that all the time when designing rooms, in the form of endless browsing of magazines or Tik Tok or similar to find something they like instead of starting from first principles and designing exactly what they want, because usually they don't know exactly what they want.
A lot of the time we'd be happier with a spec at the end of the process than at the beginning. A spec that ensures the current understanding of what is intentional vs. what is an accident we haven't addressed yet is nailed down would be valuable. Locking it all down at the start, on the other hand, is often impossible and/or inadvisable.
Spec is an overloaded term in software :) because there are design specs (the plan, alternatives considered etc) and engineering style specs (imagine creating a document with enough detail that someone overseas could write your documentation from it while you’re building it)
Those need distinct names or we are all at risk of talking past each other :)
I’ve been experimenting with a small sparse-regression system that infers governing equations from raw data, and it can produce a lot of plausible candidates quickly. The hard part is filtering out the ones that look right but violate underlying constraints.
For example, it recovered the Sun’s rotation (~25.1 days vs 27 actual) from solar wind data, but most candidate equations were subtly wrong until you enforced consistency checks.
Feels like systems that treat verification as the source of truth (not just an afterthought) are the ones that will actually scale.
They are embracing property-based specifications and testing à la Haskell's QuickCheck: https://kiro.dev
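For readers who haven't seen the QuickCheck style: you state a property that should hold for all inputs and let the machine hunt for a counterexample. Below is a hand-rolled sketch using only the standard library. It is not how Kiro or QuickCheck work internally, and it does no shrinking of counterexamples.

```python
import random


def check_property(prop, gen, trials=200, seed=0):
    """Return a counterexample to `prop`, or None if every trial passed."""
    rng = random.Random(seed)
    for _ in range(trials):
        value = gen(rng)
        if not prop(value):
            return value
    return None


def random_list(rng):
    """Generate a random list of small integers, possibly empty."""
    return [rng.randint(-100, 100) for _ in range(rng.randint(0, 20))]


def sort_is_idempotent(xs):
    # True property: sorting twice is the same as sorting once.
    return sorted(sorted(xs)) == sorted(xs)


def reverse_is_identity(xs):
    # False property: only holds for palindromic lists.
    return list(reversed(xs)) == xs


print(check_property(sort_is_idempotent, random_list))
print(check_property(reverse_is_identity, random_list))
```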
Then, already in formal methods territory, refinement types (e.g. Dafny, Liquid Haskell) are great and less complex than dependent types (e.g. Lean, Agda).
I can think of some strawmen: for example, prove a state machine in Lean, then port the proven version to Dart? But I'm not familiar enough with Lean to know if that's like saying "prove moon made of cheese with JavaScript, then deploy to the US mainframe"
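For a sense of scale, proving a property of a small state machine in Lean 4 can be as short as the sketch below. The traffic light is a hypothetical example. Porting the proven version to Dart would still be a manual step that the proof says nothing about.

```lean
-- Hypothetical three-state machine.
inductive Light where
  | red | green | yellow
  deriving DecidableEq, Repr

def Light.next : Light → Light
  | .red => .green
  | .green => .yellow
  | .yellow => .red

-- Property: three steps always bring the light back to where it started.
theorem Light.next_cycle (l : Light) : l.next.next.next = l := by
  cases l <;> rfl
```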
if you can get a model to quickly translate a relevant subset of your code to lean to find tricky bugs and map lean fixes back to your codebase space, you've got yourself a huge unlock. (spoiler alert: you basically can, today)
Before you commented, I started poking at what you described for 15 minutes, then forgot about it and fell asleep. Now I remembered, and I know it's viable and IIUC it's almost certainly going to make a big difference in my work practice moving forward. Cheers.
(One way Lean or Rocq could help you directly, though, would be if you coded your program in it and then compiled it to C via their built-in support for it. Such is very difficult at the moment, however, and in the industry is mostly reserved for low-level, high-consequence systems.)
What do you mean? It's a nice and simple language. Way easier to get started than OCaml or Haskell for example. And LLMs write programs in Lean4 with ease as well. Only issue is that there are not as many libraries (for software, for math proofs there is plenty).
But for example I worked with Claude Code and implemented a shell + most of unix coreutils in like a couple of hours. Claude did some simple proofs as well, but that part is obvs harder. But when the program is already in Lean4, you can start moving up the verification ladder piece by piece.
Require Import String.
Definition hello: string := "Hello world!".
Print hello.
hello = String (Ascii.Ascii false false false true false false true false) (String (Ascii.Ascii true false true false false true true false) (String (Ascii.Ascii false false true true false true true false) (String (Ascii.Ascii false false true true false true true false) (String (Ascii.Ascii true true true true false true true false) (String (Ascii.Ascii false false false false false true false false) (String (Ascii.Ascii true true true false true true true false) (String (Ascii.Ascii true true true true false true true false) (String (Ascii.Ascii false true false false true true true false) (String (Ascii.Ascii false false true true false true true false) (String (Ascii.Ascii false false true false false true true false) (String (Ascii.Ascii true false false false false true false false) EmptyString))))))))))) : string
I used to think that the only way we would be able to trust AI output would be by leaning heavily into proof-carrying code, but I've come to appreciate the other approaches as well.
If someone posted a breakthrough in cryptographic verification and the top comment was "yeah, unit tests are great," we'd all recognize that as missing the point. I don't think it's unrelated, I think it's almost related, which is worse, because it pattern-matches onto agreement while losing the actual insight.