It's also worth noting that the approach given in the link also benefits Sonnet and Opus. Just not as much - they are more forgiving - but put them in a harness that allows for various verification and repair and they too end up producing much better results than the "raw" model. And it's not clear that a harness around MiniMax, Kimi, or Qwen can measure up then.
I use those models a lot, and hope to use them more as my harnesses get better at discriminating which tasks they are cost effective for, but it's not straightforward to cost optimize this.
If I cared about running everything locally, then sure, it's amazing you can get to those kinds of results at all.
https://www.msn.com/en-us/money/other/ai-startup-backed-by-m...
ETA: reminds me of biology, too. In life, it turns out the simpler some functional component looks, the more stupidly overcomplicated it is if you look at it under a microscope.
Isn't this link an argument against the point you are making?
https://marginlab.ai/trackers/claude-code/ tries to track Claude Opus performance on SWE-Bench-Pro, but since they only sample 50 tasks per day, the confidence intervals are very wide. (This was submitted 2 months ago https://news.ycombinator.com/item?id=46810282 when they "detected" a statistically significant deviation, but that was because they used the first day's measurement as the baseline, so at some point they had enough samples to notice that this was significantly different from the long-term average. It seems like they have fixed this error by now.)
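To put a rough number on "very wide", here is a minimal sketch. It assumes each task is an independent pass/fail trial and uses the plain normal approximation for a proportion. The 56% rate is the baseline quoted elsewhere in this thread, used only as an illustrative value, and pooling 30 days of samples is my own assumption, not something the tracker is known to do.

```python
import math

def pass_rate_interval(pass_rate: float, n_tasks: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% interval for a pass rate measured on n_tasks (normal approximation)."""
    std_err = math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)
    return pass_rate - z * std_err, pass_rate + z * std_err

# One day of sampling: 50 tasks at an illustrative 56% pass rate.
low, high = pass_rate_interval(0.56, 50)
print(f"n=50:   {low:.3f} .. {high:.3f}")

# Hypothetical pooling of 30 days of samples (assumes the underlying rate did not move).
low_pooled, high_pooled = pass_rate_interval(0.56, 50 * 30)
print(f"n=1500: {low_pooled:.3f} .. {high_pooled:.3f}")
```

With only 50 tasks the interval spans well over ten points on either side, so a 53% day cannot be told apart from a 56% baseline on its own.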
And today is a 53% pass rate vs. a baseline 56% pass rate. That's a huge difference. If we recall what Anthropic originally promised a "max 5" user https://github.com/anthropics/claude-code/issues/16157#issue... -- which they've since removed from their site...
50-200 prompts. That's an extra 1-6 "wrong solutions" per 5 hours ... and you have to get a lot of wrong answers to arrive at a wrong solution.
Opus 4.6 is available on the $20 plan too
$200 dollars + VAT is half of my rent.
I know HN is not a good place to rant on this subject, but I'm often flabbergasted about the number of people here who live in a bubble with regard to the price of tech. Or just prices in general.
I remember someone who said a few years ago (I'm paraphrasing): "You could just use one of the empty rooms in your house!". It was so outlandish I believed it was a joke at first.
EDIT: "not", minor grammar
I think I am in the middle. I can afford $200/m but it'd be a brainer. And I don't pay that as I barely use home AI enough to warrant it.
I am also amazed at the richer end of HN but now I realize I am privileged. Earned it? Like fuck I did. Lucky to be born a geek in late 20c. I'd be useless as a middle ages guy.
Subscriptions are definitely middle-class targeted. $20/month is not much for the value provided, at least not in the western world.
But if by "rich" you just mean "westerners", then in this sense, the same is and has always been true for computing in general.
So like if you want to start a business of any sort the AI sub is still peanuts.
AI is a car, or a dog, or a mild social life, or a utility bill level of cost. And that's for the level needed for a sane typical developer. (AI maximalists need 250k/y, let them slop it out)
It is not a Cessna, an infinity pool or a 1 month vacation.
Last year, at first, $200 seemed crazy. Now that I’m getting addicted to coding agents, not so much. Some companies are paying API rates for AI for employees, and it’s a lot more than $200/mo. It seems like funny money, and I’m not sure it’ll last.
Even the US has places with cheap rent/housing. The downside is that there's no (well-paying) work nearby.
You’re technically correct, btw, rental housing is a market and is subject to market forces, meaning what people are willing to pay. I’m just not so sure about framing rent as being lower priority than other necessities. And rent prices have been increasing faster than other necessities, and faster than income, so that might be a confounding factor in your argument.
Still, my initial reaction above is due to the fact that in the US and in Europe in most large cities, the average rent is north of $1000/mo.
So yes, you describe a situation that I feel like a lot of people here don't understand is not the norm.
I compared the subscription with my rent precisely because it's easier to compare: with your numbers it would be like paying from $600 up to $1500 / month. Pretty hard to justify.
But it's not all that relevant to this conversation. It's not like this is the first time economic inequality is a thing.
It's just as relevant to me factoring in your salary the next time I go buy a car.
Also, I think it's relevant to the conversation.
You replied to someone who said that "you" (undirected pronoun, I suppose) can't afford the SOTA, and you said that the $200/month Anthropic subscription comes with a ton of usage. So I interpreted it as a general statement. It wasn't what you meant?
I'm a bit lost about who you're talking to/about in your first comment: the person you respond to, a general statement for everyone reading, or yourself?
Going back to the $20 plan, yes, I agree it's much more accessible.
I didn't talk about it because I've seen a lot of comments here, on blogs, on social media about how a $200 subscription for Claude is a no brainer. And it got on my nerves, so I wanted to tell how much money it can be. To you (and it was misguided reading your answers), and to concerned HN commenters in general.
If it's that I'm not working, well, I'm employed.
If it's that I'm not working enough to have this money... Well, we still go back to the bubble. Not everywhere in the world can you easily find a job that pays you enough, even if you accept to work more. And the employer will not accept to give developers a $200/month subscription, even less for personal use.
If it's that I'm not working enough and I should go freelancing to work as much as I want and get rich (I'm extrapolating). Well, you're right, I could do that. But (at least at first), I would work a lot more for much less money. And even if I become a recognized freelancer, it doesn't change the fact that I'll earn less money compared to the baseline of SF, or even the USA in the tech sector in general. So, bubble again. I could also, like someone said, put the tokens cost into my hourly/daily rate, but I'll be much more expensive than other freelancers.
Also, but that's a "me case" compared to my previous points, health issues can greatly affect how much work you can do.
Do you have any evidence of that? I think the OPs are assuming this as a premise so their logic is probably valid but may not be sound logic for you.
Instinctively, if we suppose all the newbie freelancers without any reputation start with the same lowest rate possible to be competitive, passing additional cost to my client will mechanically increase my rate, putting me at a disadvantage in getting any work. And with the difference in monetary value for the same price of tokens, the rate delta is higher.
It's a simplified model of the world, but it feels like simple economic rules.
I assume the comment I'm referring to was written by someone who is already established and for whom the token cost passed on is lower relative to my environment.
Sorry, no. You live in the bubble, the people you think are living in a bubble are actually doing the very opposite and taking advantage of the lack of bubbles in our globally connected world.
Today, basically anyone can sell any bullshit to billions of people around the world. We’ve never lived in less of a bubble.
$200/month is, but you don't need that for anything except beyond-casual use of coding agents.
I didn't say "only SF can afford $200/mo".
I'd not use it over pure Claude Code because I am at heart a coder and I want the raw terminal experience and there's some features missing from the "Code" tab in Claude Desktop, but just saying "a subscription to code", just goes to show how out of touch that person already is, and that's what resistance does to you when you try to resist making use of any kind of modern tooling or technology.
200 USD/month is a number only really affluent programmers (e.g. in the Silicon Valley) can perhaps pay easily.
Not true, I live in USA PNW and my last remote job paid $12k/mo. I have been jobless for over a month now (currently waiting for the next HN "who wants to be hired"), but I still have enough savings to easily afford to continue that plan for a while.
I don't think it really has to do with affluence but more the job market and economy you're in. Countries with lower salaries or higher costs of living will have less buying power.
But I run an AI SaaS and we do offer Opus 4.6, too. Our use case is not nearly as token intensive as something like coding so we are still able to offer it with a good profit margin.
Also you can run OpenClaw with your CC subscription. It's what I do.
Edit: I'm not using the term of art, I mean it literally cannot make them money.
https://aibenchy.com/compare/minimax-minimax-m2-7-medium/moo...
I'm not saying it's bad, but it's definitely different than the others.
Yuck. At that point don't publish a benchmark, explains why their results are useless too.
-
Edit since I'm not able to reply to the below comment:
"I want structured output from a model that supports structured output but will not enable structured output, nor ask for an existing format like XML or JSON" is not really an interesting thing to benchmark, and that's reflected in how you have Gemini 2.5 Flash beating GPT-5.4.
I really hope no one reads that list and thinks it's an AI leaderboard in any generalizable sense.
Even when using structured output, sometimes you want to define how the data should be displayed or formatted, especially for cases like chat bots, article writing, tool usage, calling external APIs, parsing documents, etc.
Most models get this right. Also, this is just one failure mode of Claude.
I don't even need to debate if the benchmark is useful, it doesn't pass a sniff test: GPT-5.4 is not worse than Gemini 2.5 Flash in any way that matters to most users. In your benchmark it's meaningfully worse.
Needless to say, benchmarks are limited and impressions vary widely by problem domain, harness, written language, and personal preference (simplicity vs detail, tone, etc.). If personal experience is the only true measure, as with wine, solving this discovery gap is an interesting challenge (LLM sommelier!), even if model evolution eventually makes the choice trivial. (I prefer Gemini 3 for its wide knowledge, Sonnet 4.6 for balance, and GLM-5 for simplicity.)
I badly want to shift more of my work to them, and I'm finding ways of shifting more lower-level loads to them regularly, but they're really not there yet for anything complex.
The problem with doing this is cost. Constantly testing a lot of models on a large dataset can get really costly.
I was thinking that tokens spent in such case could also be an interesting measure, but some agent can do small useful refactoring. Although prompt could specify to do the minimal change required to achieve the goal.
I use MiniMax daily, mostly for coding tasks, using pi-coding-agent mostly.
> The down sides emerge pretty fast: much higher reasoning token use, slower outputs, and degradation that is palpable.
I don't care about token use, I pay per request in my cheap coding plan. I didn't notice slower outputs, it's even faster than Anthropic. Degradation is there for long sessions with long contexts, but that also happens with Anthropic models.
> Sadly, you do get what you pay for right now. However that doesn’t prevent you from saving tons through smart model routing, being smart about reasoning budgets, and using max output tokens wisely. And optimize your apps and prompts to reduce output tokens.
Exactly. For my use case, I get 1500 API requests every 5 hours for 10€ monthly. I never hit the limit, even during the intensive coding sessions.
What I notice is, while Opus and Sonnet feel better for synthetic benchmarks, it doesn't matter in the real world. I never put so much effort into coming up with a perfect problem spec like the ones in benchmarks. I don't craft my prompts for hours expecting the LLM to one-shot a working program for me. And that's exactly what all those benchmarks are doing. And that's where Anthropic tools shine in comparison to cheaper Chinese models.
When it comes to the real world, where I put my half-baked thoughts in broken English in a prompt and execute 20 prompts in half an hour, the difference between Opus, Sonnet, and MiniMax is minimal, if at all. There, I don't want to think about costs and token savings and switching between different Anthropic models. I just use MiniMax, and that's it.
Yes, MiniMax sometimes gets stuck. Then I switch to Opus to unblock it. But the same happens if I use Opus the whole session. It gets stuck eventually, and model switch is sometimes required to get a fresh perspective on the problem.
The only difference is, using Opus or Sonnet quickly eats up my budget, while with MiniMax I have basically unlimited usage (for my coding use case) for 10€ per month.
If I was starting new projects I'd pay for a better model, but honestly I don't really know any different.
I've not ever used Claude and people seem to rave about it. Maybe it's good, but I doubt it's $200/month good.
When I hit issues with these lower models I think hard about creating the right tooling - agnostic to the harness - and I feel like maybe it's more work but I can carry those tools to any setup going forward. That's how it was in the early Linux days so why change what clearly works?
These small Chinese companies don't always have access to serious hardware.
It does have some kind of horrible context consistency problem though, if you ask it to rewrite something verbatim it'll inject tiny random changes everywhere and potentially break it. That's something that other SOTA models haven't done for at least two years now and is a real problem. I can't trust it to do a full rewrite, just diffs.
I doubt Kimi would do well with most harnesses. Its outputs are pretty chaotic in terms of formatting, but the intelligence is definitely there.
They're all slop when the complexity is higher than a mid-tech intermediate engineer though.
This right here. Value prop quickly goes out the window when you're building anything novel or hard. I feel that I'm still spending the same amount of time working on stuff, except that now I'm also spending money on models.
> DeepSeek V3.2 Reasoning 86.2% ~$0.002 API, single-shot
> ATLAS V3 (pass@1-v(k=3)) 74.6% ~$0.004 Local electricity only, best-of-3 + repair pipeline
1) That is relatively very slow.
2) Can also be done, simpler even, with SoTA models over API.
Being reliant on a service means you have to share whatever you're working on with the service, and the service provider decides what you can do, and can make changes to their terms of service on a whim.
If locally running models can get to the point where they can be used as a daily driver, that solves the problem.
Can you explain what that means?
Local model enthusiasts often assume that running locally is more energy efficient than running in a data center, but fail to take the economies of scale into account.
It is a well known 101 truism in /r/Localllama that local is rarely cheaper, unless run batched - then it is massively, 10x cheaper indeed.
> I think they mean that the DeepSeek API charges are less than it would cost for the electricity to run a local model.
Because it is hosted in China, where energy is cheap. In ex-USSR where I live it is inexpensive too, and keeping in mind that whole winter I had to use small space heater, due to inadequacy of my central heating, using local came out as 100% free.
Our peak import rate is 3x higher than our solar export rate. In other words, we’d need to sell 3 kWh of energy to offset the cost of using 1 kWh at peak.
We’re currently in the process of accepting a quote for home batteries. The rates here highly incentivise maximising self-use.
Note that while a local chatbot user will mostly be using batch-size = 1, it's not going to be true if they are running an agentic framework, so the gap is going to narrow or even reverse.
Cool work though, really excited for the potential of slimming down models.
Perhaps these things aren't well represented in the training data for these open models? Every local model I've tried (minimax2.5, GLM-4.7, Qwen3, 3.5 and -coder variants) spends so much time trying to get something syntactically sensible and accepted by the compiler that when they've finished they barely seem to have any "momentum" left to actually solve the problems, as pretty much anything but the most trivial change ends up in another loop of actually trying to get it working again, often losing the intent of that change in the process.
My fear is that the solution here, having multiple instances all making the same changes for later comparison, would spend a huge amount of time beating its head against compiler errors, types, memory allocation (NO DON'T JUST SPRINKLE IN A FEW MORE RAW "new" KEYWORDS DAMMIT) before it even gets to the "logic".
Having plenty of local GPU power I'd love to be able to actually use that, and I'm already wary about some of the training data use and its interactions with the license of the code I'm "sending" to the cloud models...
I know from first-hand experience that at least a couple of the SOTA providers use third-party providers for supervised finetuning with instructions that are heavily geared towards a specific set of languages as well. But of course the base dataset from the major providers is likely to be sufficiently better that it matters less, and the big models are good enough at carrying over training that it at least seems like extra training on the core languages they care about at least somewhat carries over (you see this with natural language too - they do really well for many minor languages that make up a minuscule proportion of the training data).
(I won't say much more regarding the SFT/RLHF work due to NDAs - plural; I know who one of the providers is; I don't know who the one or more others are as the intermediary I did some work for obscured it well enough that I couldn't really violate the NDA even if I wanted to)
But this can be pretty slow since you have to run the code in an isolated environment, check the outputs, wait for it to finish. Doing that for every candidate quickly adds up. So ATLAS has another shortcut for avoiding unnecessary testing. Instead of simply generating solutions and testing all of them, it tries to predict which one is most likely correct before running any tests.
ATLAS also asks the model for an embedding of what it just wrote which acts as a fingerprint. Two similar pieces of code will produce similar fingerprints. A well-written, confident solution will produce a different fingerprint than a confused, buggy one.
These fingerprints get fed into a separate, much smaller neural network called the Cost Field. This little network was trained ahead of time on examples where they already knew which solutions were correct and which were wrong. It learned to assign a score to each fingerprint. Correct solutions get a low score and incorrect ones get a high one.
So the process is to generate multiple solutions, get their fingerprints, score each one, and pick the lowest. Only that one gets tested. The Cost Field picks correctly about 88% of the time according to the repo.
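A minimal sketch of that select-then-test flow, for anyone who wants to see the shape of it. Everything here is a stand-in: the `Candidate` class, the linear `cost_score`, the made-up weights and the toy embeddings are my assumptions, not ATLAS's actual Cost Field or API. A real harness would also run the test in a sandbox rather than with `exec`.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

@dataclass
class Candidate:
    code: str
    embedding: Sequence[float]  # the "fingerprint" of the generated code

def cost_score(embedding: Sequence[float], weights: Sequence[float]) -> float:
    """Hypothetical stand-in for the Cost Field: lower score means 'more likely correct'."""
    return sum(w * x for w, x in zip(weights, embedding))

def select_and_test(
    candidates: Sequence[Candidate],
    weights: Sequence[float],
    run_tests: Callable[[str], bool],
) -> tuple[Candidate, bool]:
    """Score every candidate, then run the expensive tests only on the lowest-scoring one."""
    best = min(candidates, key=lambda c: cost_score(c.embedding, weights))
    return best, run_tests(best.code)

def run_tests(code: str) -> bool:
    # Toy test runner; a real one would execute the code in an isolated environment.
    namespace: dict = {}
    exec(code, namespace)
    return namespace["add"](2, 3) == 5

# Toy candidates with invented two-dimensional embeddings.
candidates = [
    Candidate("def add(a, b): return a - b", [0.9, 0.1]),
    Candidate("def add(a, b): return a + b", [0.2, 0.3]),
    Candidate("def add(a, b): return a * b", [0.7, 0.6]),
]
weights = [1.0, 0.5]  # made-up weights standing in for a trained scorer

best, passed = select_and_test(candidates, weights, run_tests)
print(best.code, passed)
```

The point of the design is that scoring is cheap and testing is slow, so only one test run is paid per problem when the scorer picks well.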
Depends how you define efficiency. The power use of this rig is a lot less than the large data centers that serve trillion parameter models. The page suggests that the final dollar cost per request is an order of magnitude lower than the frontier models charge.
Another interesting approach could be to use this set up with a language like Clojure or Common Lisp which facilitates interactive development. If you could hook up the agent directly to a REPL in a running program, then it could run tests with a lot less overhead.
So it seems like it's a difficulty classifier for task descriptions written in English.
This is then used to score embeddings of Python code, which is a completely different distribution.
Presumably it's going to look at a simple solution, figure out it lands kinda close to simple problems in embedding space and pass it.
But none of this helps you solve harder problems, or distinguish between a simple solution which is wrong, and a more complex solution which is correct.
It does because hallucinations and low confidence share characteristics in the embedding vector which the small neural network learns to recognize. And the fact that it continuously learns based on the feedback loop is pretty slick.
If the author actually wanted to explain his project he should have started with something along the lines of "Inference-time learning is the act of updating model parameters while you are generating tokens. Inference time learning is cost prohibitive for LLMs due to the need to update billions of parameters. However, what if updating billions of parameters wasn't necessary to begin with? What if you could instead have a much smaller model that merely scores a bunch of candidate output tokens? That model could be small enough for inference time learning to become viable and that's exactly what ATLAS does to achieve a 74.6% pass rate in LiveCodeBench and thereby outperforms Claude Sonnet with a small 14B open weight model that can be run locally on your $500 GPU."
This would have primed the reader to know what to look for. Instead you got this insurmountable wall of distractions.
Example: "combining constraint-driven generation, energy-based verification, self-verified iterative refinement, and adaptive routing"
That's a very long sequence of unexplained buzzwords that could mean absolutely anything.
I think the notion of a one-size-fits-all model is a bit like a sports car: just getting the biggest/fastest/best one is overkill; you use bigger models when needed. But they use a lot of resources and cost you a lot. A lot of AI work isn't solving important math or algorithm problems. Or leet coding exercises. Most AI work is mundane plumbing work, summarizing, a bit of light scripting/programming, tool calling, etc. With skills and guard rails, you actually want agents to follow those rather than get too creative. And you want them to work relatively quickly and not overthink things. Latency is important. You can actually use guard rails to decide when to escalate to bigger models and when not to.
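A rough sketch of the "guard rails decide when to escalate" idea, assuming the guard rail is something mechanical like "the output must parse as JSON". The `route` function and both stub models are hypothetical placeholders; real model calls would go where the stubs are.

```python
import json
from typing import Callable

def is_valid_json(text: str) -> bool:
    """Example guard rail: accept the draft only if it parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def route(
    prompt: str,
    small_model: Callable[[str], str],
    large_model: Callable[[str], str],
    passes_guardrails: Callable[[str], bool],
) -> tuple[str, str]:
    """Try the cheap model first; escalate only when its draft fails the guard rail."""
    draft = small_model(prompt)
    if passes_guardrails(draft):
        return "small", draft
    return "large", large_model(prompt)

# Stub models for illustration only.
def small_stub(prompt: str) -> str:
    return "not json at all" if "hard" in prompt else '{"summary": "ok"}'

def large_stub(prompt: str) -> str:
    return '{"summary": "handled by the big model"}'

easy = route("summarize this", small_stub, large_stub, is_valid_json)
hard = route("a hard task", small_stub, large_stub, is_valid_json)
print(easy)
print(hard)
```

The cost is one wasted small-model call on the escalated requests, which is usually cheap next to sending everything to the big model.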
0: Because the only way to get cache locality out of a LLM is to batch invocations. A centralized system where the server handles thousands of invocations at the same time only needs a tiny fraction of the total memory throughput as having all of those invocations run locally on different machines would.
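Back-of-the-envelope version of that footnote, under a deliberately simplified model: a dense model streams all of its weights from memory once per decode step regardless of batch size, and KV-cache traffic is ignored. The 20 GB weight size and the request counts are illustrative assumptions.

```python
def weight_gb_read_per_token(weight_gb: float, batch_size: int) -> float:
    """Weight traffic attributed to each generated token when a decode step is shared by a batch."""
    return weight_gb / batch_size

WEIGHT_GB = 20.0  # illustrative model size, not a measured figure

# One local user, batch size 1: every token pays for a full pass over the weights.
local = weight_gb_read_per_token(WEIGHT_GB, 1)

# A server batching 64 requests per step: the same pass is shared 64 ways.
batched = weight_gb_read_per_token(WEIGHT_GB, 64)

print(f"local:   {local} GB per token")
print(f"batched: {batched} GB per token")

# Total traffic for 1000 requests each producing one token:
total_local = 1000 * local        # 1000 separate machines, one full pass each
total_server = WEIGHT_GB * 1      # one shared pass, if all 1000 fit in a single batch
print(total_local, total_server)
```

Mixture-of-experts models and KV-cache reads change the exact numbers, but the direction of the effect is the same.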
I hope you are not going to say, "to avoid a global recession or depression caused by the popping of the AI bubble". That would be unnecessary and harmful (in its second-order effects), and governments do have advisors who are competent enough in economics to advise against such a move.
In the UK the first bank to go, Northern Rock, was simply taken over by the government. The shareholders got nothing. The bailout of Lloyds bank required the government taking a 40% stake. This is the way to go - if you need a bailout there should be a cost to the shareholders. Otherwise you are just privatising profit and nationalising risk.
Not that UK regulation was great all round or the bailout perfect. It certainly failed to prevent the crisis which could have been done (no doubt the same applies in many countries). I looked at Northern Rock's accounts some time (a year, maybe?) before the crisis and was horrified by their reliance on interbank lending. It was obvious they could not cope with a rise in rates.
If the AI labs become very influential and powerful, Washington might nationalize them, but that would be very different from bailing them out because they have become unprofitable and cannot attract additional investment from the private sector.
With the recent OpenAI deal with the government I am certain they would throw tons of money at OpenAI if it got real bad. But with the upcoming IPO where they are expected to be valued at $840b, we would be a LONG way from them needing a bailout. Well past this current admin.
GM on the other hand should have been left to die.
However, I was obliquely referring to the open transactionality and patronage encouraged by the current administration, and how the AI / big tech players have, with few exceptions, gleefully joined in.
Unless they run out of money for bribes, I think it's inevitable that current government will bend over backwards to prop them up.
That's why huge concessions nobody asked for were made to the AI industry in the Big Beautiful Bill.
I don't get the financial motive for someone to keep funding these open-weight model training programs other than just purposefully trying to kill the big AI providers.
This will crush OpenAI.
Note: I am not talking about coding here - it will take a while longer but when it is optimized to the bone and LLM output has stabilized, you will be running that too on local hardware. Cost will come down for Claude and friends too but why pay 5 when you can have it for free?
In this theory, can you explain why Apple has announced it’s paying Google for Gemini too?
Eventually, this may be true. This autumn? Highly unlikely.
I, too, was interested because I am always eager to use local models in my claw-like. It looks like this could be useful for an async portion of the harness but it wouldn’t work in interactive contexts.
Very cool ensemble of techniques, particularly because they’re so accessible. I think I will use this form for reusable portions of web browsing functionality in my personal agent.
There seems to be at least some detail on that point.
Edit : The 8GB seems to hit this price but 16 not so much.
One expensive and hard lesson we will learn over time is that you can't compress generality beyond a point.
> ATLAS scores are from 599 LCB tasks using the full V3 pipeline (best-of-3 + Lens selection + iterative repair) on a frozen 14B quantized model or "pass@k-v(k=3)". Competitor scores are single-shot pass@1 (zero-shot, temperature 0) from Artificial Analysis on 315 LCB problems -- not the same task set, so this is not a controlled head-to-head.
Instead of following the LiveCodeBench methodology, it's a harness that spins up a sandbox and spends a long time testing and refining the solution. If you did the same for Sonnet, GPT5.4, or other models they would also get significantly higher scores and they'd do it faster.
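For reference, the kind of test-and-repair loop being described is model-agnostic, which is why the comparison matters. A minimal sketch, with a fake model standing in for any real one; `solve_with_repair`, `fake_model` and the toy test are all hypothetical, and a real harness would sandbox the execution instead of using `exec`.

```python
from typing import Callable, Optional

def solve_with_repair(
    generate: Callable[[Optional[str]], str],
    run_tests: Callable[[str], tuple[bool, Optional[str]]],
    max_attempts: int = 3,
) -> Optional[str]:
    """Generate code, test it, and feed failures back to the model until it passes or gives up."""
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        code = generate(feedback)
        ok, feedback = run_tests(code)
        if ok:
            return code
    return None

# Stand-in for a model: wrong on the first try, correct once it sees feedback.
def fake_model(feedback: Optional[str]) -> str:
    if feedback is None:
        return "def square(x): return x + x"
    return "def square(x): return x * x"

def run_tests(code: str) -> tuple[bool, Optional[str]]:
    namespace: dict = {}
    exec(code, namespace)  # toy runner; not isolated
    got = namespace["square"](3)
    if got == 9:
        return True, None
    return False, f"square(3) returned {got}, expected 9"

solution = solve_with_repair(fake_model, run_tests)
print(solution)
```

Swapping `fake_model` for any API-backed model gives that model the same extra attempts, so a score earned this way is not comparable to a single-shot pass@1.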
The AI-coded README is also full of signs of vibecoded slop like the discoveries that some of the complex structures implemented were not actually being used or contributing anything to the output.
It looks like your card has 16GB VRAM? Start with Qwen 3.5 9B Unsloth GGUFs (UD-Q6_K_XL) and branch out from there.
I can't imagine trying to use this model on either GPU for real work. I can use much bigger and faster models on the $3 Chutes subscription or $10 OpenCode Go subscription.
Even so, I am still excited. I don't feel like there was even a model worth using with a tool like OpenCode 6 to 9 months ago. I like the way things are heading, and I am looking forward to seeing how capable coding models of this size are in another 6 to 9 months!
That doesn’t mean the 9070XT can’t do AI stuff, quite the opposite. ROCm gets better all the time. There are many AI workloads you can do on AMD cards.
Is it a card I would choose if I was primarily working on AI? Absolutely not. But it is the card I own and it’s been a great value for gaming.
I have a 9070 XT, which has 16GB of VRAM. My understanding from reading around a bunch of forums is that the smallest quant you want to go with is Q4. Below that, the compression starts hurting the results quite a lot, especially for agentic coding. The model might eventually start missing brackets, quotes, etc.
I tried various AI + VRAM calculators but nothing was as on the point as Huggingface's built-in functionality. You simply sign up and configure in the settings [1] which GPU you have, so that when you visit a model page, you immediately see which of the quants fits in your card.
From the open source models out there, Qwen3.5 is the best right now. unsloth produces nice quants for it and even provides guidelines [2] on how to run them locally.
The 6-bit version of Qwen3.5 9B would fit nicely in your 6700 XT, but at 9B parameters, it probably isn't as smart as you would expect.
Which model have you tried locally? Also, out of curiosity, what is your host configuration?
[1]: https://huggingface.co/settings/local-apps [2]: https://unsloth.ai/docs/models/qwen3.5
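If you want a quick sanity check before visiting a model page, the weight size is roughly parameters times bits per weight divided by 8. This is a sketch only: real GGUF quants mix bit widths across tensors, so actual files come out somewhat larger than the nominal figure, and the 2 GB headroom for KV cache and runtime buffers is my own assumption.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Nominal weight size in GB (10^9 bytes): params * bits / 8."""
    return params_billion * bits_per_weight / 8

def fits_in_vram(params_billion: float, bits_per_weight: float,
                 vram_gb: float, headroom_gb: float = 2.0) -> bool:
    """True if the nominal weights plus an assumed headroom fit in the card's VRAM."""
    return weights_gb(params_billion, bits_per_weight) + headroom_gb <= vram_gb

size_9b_q6 = weights_gb(9, 6)             # 9B model at a nominal 6 bits per weight
fits_12gb = fits_in_vram(9, 6, 12)        # a 12 GB card such as the 6700 XT
fits_35b_16gb = fits_in_vram(35, 4, 16)   # 35B at a nominal 4 bits on a 16 GB card

print(size_9b_q6, fits_12gb, fits_35b_16gb)
```

A model that fails this check can still run with part of the weights offloaded to system RAM, just slower.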
Either that or just load up Qwen3.5-35B-A3B-Q4_K_S. I'm serving it at about 40-50t/s on a 4070RTX Super 12GB + 64GB of RAM. The weights are 20.7GB + KV Cache (which should be lowered soon with the upcoming addition of TurboQuant).
The reasoning is great in opus, unbeatable at the moment.
I understand what you mean, it becomes disappointing on more niche or specific work. It’s honestly a good thing to see these models are not really intelligent yet.
I use it for reviewing existing code, specifically for a components-based framework for Godot/GDScript at [0]. You can view the AGENTS.md and see that it's a relatively simple enough project: Just for 2D games and fairly modular so the AI can look at each file/class individually and have to cross-reference maybe 1-3 dependencies/dependents at most at any time during a single pass.
I've been using Codex, and it's helped me catch a lot of bugs that would have taken a long time on my own to even notice at all. Most of my productivity and the commits from the past couple months are thanks to that.
Claude on the other hand, oh man… It just wastes my time. It's had way more gaffes than Codex, on the exact same code and prompts.
I still get really mad at AI sometimes and I am not sure whether I could use AI for coding full time.
(Codex broke my git a few days ago.)
I use Codex regularly and Claude is shit in comparison, from its constant "Oops you're right!!" backtracking to its crap Electron app (if their AI is so good why can't they make a fucking native app for each OS?)
Hell right freakin now I asked it to implement something and got a weird "Something went wrong" API error
Maybe you're too easily frustrated. Or your existing code reads like your comments.
I haven't had any such frustrations with Codex
Claude is specially annoying because of their submarining and people thinking it's the best
and other comments further back in my history
> none of their issues warrant a tantrum on a public forum
I don't get frustrated if a problem is genuinely difficult to solve and the product creator is trying their best,
I get frustrated when a problem has been solved by other similar products but a specific creator or provider refuses to follow suit and fix their shit.
Claude's Electron app vs. Codex's native app is one such example right off the first impression of both products.
I recently refactored our whole app from hard deletes to soft deletes. There are obviously various ways to skin this particular cat, but the way I chose needed all our deletions updated and also needed queries updating to exclude soft deleted rows, except in specific circumstances (e.g., admins restoring accidentally deleted data).
Of course, this is not hard to do manually but is is a bloody chore and tends toward error prone. But the agent made short work of it, for which I was very grateful.
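For anyone unfamiliar with the pattern, a minimal sketch of the `deleted_at` flavour of soft delete. The system described above is Rails; this uses Python's built-in `sqlite3` purely for illustration, and the `posts` table and helper names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT, deleted_at TEXT)")
conn.executemany("INSERT INTO posts (title) VALUES (?)", [("first",), ("second",)])

def soft_delete(conn: sqlite3.Connection, post_id: int) -> None:
    """Mark the row as deleted instead of removing it."""
    conn.execute("UPDATE posts SET deleted_at = datetime('now') WHERE id = ?", (post_id,))

def restore(conn: sqlite3.Connection, post_id: int) -> None:
    """Admin-style restore: clear the deletion marker."""
    conn.execute("UPDATE posts SET deleted_at = NULL WHERE id = ?", (post_id,))

def live_titles(conn: sqlite3.Connection) -> list[str]:
    """Every ordinary query now has to exclude soft-deleted rows."""
    rows = conn.execute("SELECT title FROM posts WHERE deleted_at IS NULL ORDER BY id")
    return [row[0] for row in rows]

soft_delete(conn, 1)
after_delete = live_titles(conn)

restore(conn, 1)
after_restore = live_titles(conn)

print(after_delete, after_restore)
```

The chore is the `WHERE deleted_at IS NULL` clause: it has to be added to every existing read path, which is exactly the sort of repetitive, error-prone edit described above.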
You know your system better than me for sure, a random commenter on a website :-D your comment just shocked me out of my daze enough for my brain to say "but I always move the record to another table rather than soft delete" and i felt compelled to give unsolicited and likely wrong opinion.
More than that, though: lots of queries for reporting, and the like, suddenly need to use JOINs. Same for admin use cases where we want them to be able to see archived and live data in a unified view. The conclusion I came to is that it doesn't really eliminate complexity for us: it just moves it elsewhere.
Totally valid approach though. I'd also considered different views for live versus archived (or live+archived) data. Again, it solves some issues, but moves complexity elsewhere.
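To make the trade-off concrete, a rough sketch of the move-to-another-table approach, again in Python with sqlite3 and with made-up names (`posts`, `posts_archive`, `archive`, `unified_view`). Live queries stay clean, but the unified admin view now has to stitch two tables together. I've used UNION ALL here, which is an assumption about how you would combine them; reporting queries that reference archived rows would need similar treatment.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT NOT NULL);
CREATE TABLE posts_archive (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    archived_at TEXT NOT NULL
);
INSERT INTO posts (title) VALUES ('first'), ('second'), ('third');
""")

def archive(conn, post_id):
    # Copy then delete inside one transaction so the row is never
    # lost or duplicated if something fails halfway.
    with conn:
        conn.execute(
            "INSERT INTO posts_archive (id, title, archived_at) "
            "SELECT id, title, datetime('now') FROM posts WHERE id = ?",
            (post_id,),
        )
        conn.execute("DELETE FROM posts WHERE id = ?", (post_id,))

def unified_view(conn):
    # The admin view of live plus archived data needs both tables.
    return conn.execute("""
        SELECT id, title, 0 AS archived FROM posts
        UNION ALL
        SELECT id, title, 1 AS archived FROM posts_archive
        ORDER BY id
    """).fetchall()

archive(conn, 2)
live_titles = [r[0] for r in conn.execute("SELECT title FROM posts ORDER BY id")]
combined = unified_view(conn)
```

So the WHERE clause disappears from the everyday queries and reappears as extra plumbing wherever archived data still matters.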
The other key point: it's a Ruby on Rails system so the moment you start doing funky stuff with separate tables or views, whilst it is doable, you lose a lot of the benefits of Active Record and end up having to do a lot more manual lifting. So, again, this sort of played against the alternatives.
As I say, not to diss other approaches: in a different situation I might have chosen one of them.
My conclusion - not for the first time - is that soft delete obviously adds some level of irreducible complexity to an application or system versus hard delete no matter how you do it. Whether or not that extra complexity is worth it very much depends on the application and your user/customer base.
For some people, just the ability to restore deleted rows from backup would be enough - and in other cases it's been enough for me - but that is always a bit of a faff so not a great fit if you're optimising for minimal support overhead and rapid turnaround of any issues that do arise.
It depends whether you reliably control all the DB client code, of course.
It then turns into a slowly-growing problem if you never ever clean up the soft-deleted records, but just being able to gain auditability nearly immediately is usually well worth kicking the can down the road.
Disclaimer: I'm the founder.
I don't write code by hand any more, neither at work, nor for side projects.
I work mostly in Rust and TypeScript at a developer tools company.
I think because it's so specifically sharpened to stab at the software developer, my compatriot, one of the foremost primary populations here, rather than just an overall shitty human insult -- and timed to do so when the person opens up in an honest dialogue about what they're doing.
But good news: every large software house I've talked to in the past two years is touching AI. As tragic as that is for a multitude of good reasons surrounding the workforce/copyright/ip/human-laziness/loss-of-skill/etc, that means imric is going to be outside of software, by their own rules, in totality in just a few short years!
Happy days!
Funny, others seem more hurt by it.
> AI might take your job.
I'm not the one "grieving the loss of his career". :)
We have a high standard for code review, static verification, and tests.
The fact that the code isn't hand-rolled artisanal code, and is generated by AI now, has so far turned out to have no impact on product quality or bugs reported.
So, which company is it again?
(a) the dev has no idea what the agent is doing (b) the dev gives overly broad instructions.
If you give it specific enough tasks (not to the point where it's writing single functions, but a general class description), you're on a good track.
However, anyone who uses an LLM must remain aware of the limitations of this method.
There are many features of a program that cannot be tested exhaustively and which must be guaranteed by its design. When you do not understand the structure of a program very well, it may be difficult to decide what must be tested.
With performance, confidence in what an LLM produces is even lower, because it is unlikely to know whether you have really reached the limit set by the hardware. Beating a previously existing program does not prove anything, because most existing programs are likely to perform far below what is possible.
In many cases you just want performance that is good enough, not the best attainable, so you can be content with your LLM-generated program. But you must not fool yourself by believing that this is really the best that can be done.
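A toy calculation of that last point, in Python. All the numbers are invented, including the 20 GB/s peak memory bandwidth, and the assumption that the task is memory-bound is mine. It only shows that a 2x speedup over an old program and "close to the hardware limit" are separate questions.

```python
def throughput(bytes_moved, seconds):
    # Achieved bytes per second for one run.
    return bytes_moved / seconds

def fraction_of_bound(bytes_moved, seconds, peak_bytes_per_second):
    # How much of the (assumed) hardware limit the run actually reached.
    return throughput(bytes_moved, seconds) / peak_bytes_per_second

# Hypothetical measurements: both programs process the same 8 GB of data.
DATA_BYTES = 8e9
OLD_SECONDS = 4.0      # previously existing program
NEW_SECONDS = 2.0      # LLM-generated replacement
PEAK_BANDWIDTH = 20e9  # assumed hardware limit in bytes per second

speedup = OLD_SECONDS / NEW_SECONDS
new_fraction = fraction_of_bound(DATA_BYTES, NEW_SECONDS, PEAK_BANDWIDTH)

print(f"speedup over old program: {speedup:.1f}x")
print(f"fraction of assumed hardware bound: {new_fraction:.0%}")
```

With these made-up numbers the new program is twice as fast as the old one and still uses only a fifth of the assumed bandwidth, so the comparison against the old program says nothing about how much headroom is left.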
It's amazing! Saves hours of work!
I create the basic Helm configs, settings, etc., and when there is a conflict or something is not working I let an agent fix it!