Hacker News

404 points by aray07 2 days ago | 475 comments

hi_hi 19 hours ago

This _all_ (waves hands around) sounds like alot of work and expense for something that is meant to make programming easier and cheaper.

Writing _all_ (waves hands around various llm wrapper git repos) these frameworks and harnesses, built on top of ever changing models sure doesn't feel sensible.

I don't know what the best way of using these things is, but from my personal experience, the defaults get me a looong way. Letting these things churn away overnight, burning money in the process, with no human oversight seems like something we'll collectively look back at in a few years and laugh about, like using PHP!

serial_dev 17 hours ago

> sounds like alot of work and expense for something that is meant to make programming easier and cheaper.

Not if you are an AI gold rush shovel salesman.

From the article:

> I've run Claude Code workshops for over 100 engineers in the last six months

p0w3n3d 16 hours ago

Yeah, my colleague recently said "hey I've burnt through $200 in Claude in 3 days". And he was prompting. Max 8hrs/day Imagine what would happen if AI was prompting.

As I like this allegory really much, AI is (or should be) like and exoskeleton, should help people do things. If you step out of your car putting it first in drive mode, and going to sleep, next day it will be farther, but the question is, is it still on road

alexey-pelykh 9 hours ago

Burnt through 4 Max x20 in a week here. Throughput isn't the bottleneck anymore. Review quality is. The 1-in-5 error rate in this thread matches my experience. More agents overnight just means more review tomorrow morning.

What moved the needle: capturing architectural context (ADRs, structured system prompts, skill files) that agents reference before making changes. Each session builds on prior decisions. The agent improves because the context compounds. Better context beat more parallelism every time.

robutsume 6 hours ago

This matches what I've found running persistent agents. The compounding context is the whole game.

The pattern that works: treat your agent's workspace like infrastructure, not a scratch pad. ADRs, skill files, structured memory of past decisions - all of it becomes the equivalent of institutional knowledge that a senior engineer carries in their head. Except it survives session restarts.

The article's TDD framing gets at something important too. The acceptance criteria aren't just verification - they're context. When you write "after 5 failed attempts, login blocked for 60 seconds" before the agent touches code, you've constrained the solution space dramatically. The agent isn't guessing what you want anymore.

Where I think the article undersells the problem: spec misunderstandings compound too. If your architectural context has a wrong assumption baked in, every agent session inherits that assumption. You need periodic human review of the context itself, not just the outputs. The ADRs need auditing the same way code does.

9wzYQbTYsAIc 5 hours ago

https://github.com/safety-quotient-lab/psychology-agent <- I've been exploring ways to track decisions, making some interesting findings, at the homelab scale, at least.

The cognitive architecture, so to speak, for the LLM can make a huge difference - triggers and skills go a long way when combined with shell scripts that dual-write.

malfist 5 hours ago

This comment reads very strongly like it was written by an LLM.

braebo 3 hours ago

Your sibling even more so.

aray07 6 hours ago

Agreed. The spec file is context. Writing acceptance criteria before you prompt provides the context the agent needs to not go off in the wrong direction. Human leverage just moved up and the plan/spec is the most important step.

Parallelism on top of bad context just gets you more wrong answers faster

gnatolf 4 hours ago

Sorry but isn't the bottleneck then simply to do even relevant things? Like how much of a qualified backlog do you have that your pipeline does not run dry?

viccis 6 hours ago

Reminds me of when I was looking for Obsidian note management workflows and every single person who posted about theirs used it to take notes on... note taking workflows.

owlcompliance 8 hours ago

Bingo.

brobdingnagians 15 hours ago

I would encourage my competitors to use AI agents on their codebase as much as possible. Make sure every new feature has it, lots of velocity! Run those suckers day and night. Don't review it, just make sure the feature is there! Then when the music stops, the AI companies hit the economic realities, go insolvent, and they are left with no one who understands a sprawling tangled web of code that is 80% AI generated, then we'll see who laughs last.

KronisLV 11 hours ago

> they are left with no one who understands a sprawling tangled web of code that is 80% [random people that I can't ask because they don't work here anymore and they didn't care to leave docs or comments] generated, then we'll see who laughs last.

Yes, this matches my experience with codebases before AI was a thing.

nananana9 10 hours ago

Yes, but given a feature that should take say 100 lines of code, the average programmer will write in the order of 100 to 500 lines. If they're a heavy OOP user, maybe they'll write 10 classes that total 2000 lines. Regardless, worst case, it will be within ~2 orders of magnitude of a reasonable solution.

It's not that they're not trying to write the biggest clusterfuck possible and maximize suffering in the world, it's just that there's a human limit on how much garbage they can type out in their allocated time.

This is where AI revolutionizes things. You want 25,000 lines of React? On the backend? And a custom useEffect-backed database? Certainly!

palmotea 7 hours ago

> it's just that there's a human limit on how much garbage they can type out in their allocated time.

Another example where removing friction and constraints is a bad thing.

aray07 6 hours ago

i think the friction has moved upstream - now it's working on the right thing and specifying what correct looks like. i don't think we are going back to a world where we will write code by hand again.

seunosewa 5 hours ago

Unless what you want to do isn't well represented in the training set.

Verdex 8 hours ago

Yeah, in the past the limiting factor was the human suffering of the engineer who had to try and fit the sprawling nightmare fuel into their brain.

The machine doesn't suffer. Or if it does nobody cares. People eventually start having panic attacks, the machine can just be reset.

I suspect that the end result is just driving further into the wilderness before reality sets in and you have to call an adult.

lugu 13 hours ago

Both be true at the same time: some teams spend a fortune on AI and the AI investments won't get the expected ROI (bubble collapse). What is sure is that a lot of capacity is been built and that capacity won't disappear.

What I could see happening in your scenario is the company suffers from diminishing return as every task becomes more expensive (new feature, debugging session, library update, refactoring, security audit, rollouts, infra cost). They could also end up with an incoherent gigantic product that doesn't make sense to their customer.

Both pitfall are avoidable, but they require focus and attention to detail. Things we still need humans for.

snoman 6 hours ago

> What is sure is that a lot of capacity is been built and that capacity won't disappear.

They really are subsidizing what will be an incredibly healthy used server equipment market in a year or two. Can’t wait. My homelab is going to be due for an upgrade.

evrenesat 13 hours ago

Qwen3 Coder Next and Qwen3.5-35B-A3B already very good and can be run on today's higher end home computers with good speed. Tomorrow's machines will not be slower but models are keep getting more efficient. A good sw engineer still would be valuable in Tomorrow's world but not as a software assembler.

ruszki 12 hours ago

Even cutting edge models are not very good. They are not even on mediocre level. Don’t get me wrong, they are improving, and they are awesome, but they are nowhere near good yet. Vibe coded projects have more bugs than features, their architecture and design system are terrible, and their tests are completely useless about half the time. If you want a good product you need to rewrite almost everything what’s written by LLMs. Probably this won’t be the case in a few years, but now even “very good” LLMs are not very good at all.

9wzYQbTYsAIc 5 hours ago

With Claude Code now having a /plan mode - you can take your time and deliberate through architecture and design, collaboratively, instead of just sending a fire-and-forget. Much less buggy and saves time if you keep an eye on the output as you go, guiding it and catching defects, imho.

rwmj 12 hours ago

Not sure why you're being downvoted, this is very much my experience. When it matters (like, customer data is on the line) vibecoded projects are not just hilariously bad, but put you in legal danger.

We've so far found that Claude code is fine as a kind of better Coverity for uncovering memory leaks and similar. You have to check its work very carefully because about 1 time in 5 it just gets stuff wrong. It's great that it gets stuff right 4 times in 5 and produces natural code that fits into the style of the existing project, but it's nothing earth-shattering. We've had tools to detect memory leaks before.

We had someone attempt to translate one of our existing projects into Rust and the result was just wrong at a fundamental level. It did compile and pass its own tests, so if you had no idea about the problem space you might even have accepted its work.

oliver_dr 11 hours ago

[dead]

bojan 9 hours ago

> Tomorrow's machines will not be slower

The way it's going, the AI hyperscalers are buying such a big portion of the world's hardware, that it may very well happen that tomorrow's machines do get slower per dollar of purchase value.

weinzierl 10 hours ago

Not my experience. Current Qwen Coder is noteworthy but still far from good. Can't compare them with current commercial offerings, it is just different leagues.

otabdeveloper4 13 hours ago

> Don't review it, just make sure the feature is there!

Bad idea. Use another agent to do automatic review. (And a third agent writing tests.)

Don't forget the architecting and orchestrating agent too!

baq 12 hours ago

Multiple agents with different frontier models for best results. Claude code/codex shops don’t know what they’re missing if they never let Gemini roast their designs, code and formal models.

9wzYQbTYsAIc 5 hours ago

This.

Claude Code wrote a blog article for me documenting a Gemini interaction that I manually operated. I found it quite interesting - the difference in "personalities", and the quality of Claude's output is stark in comparison to the Gemini's.

But still, best to have two sets of eyes.

mewpmewp2 19 hours ago

I am not laughing about PHP. To this very day many of my best projects are built on PHP. And while last 7 years I have spent in full stack JavaScript/TypeScript environment it has never produced the same things I was actually able to do with PHP.

I actually feel that things I built 15 years ago in PHP were better than anything I am trying to achieve with modern things that gets outdated every 6 months.

giva 8 hours ago

I feel like today an engineer with a modern framework and AI con produce in an afternoon a product that deliver real value, something that 25 years ago would have required a full hour by a high schooler with MS Access.

vardalab 6 hours ago

I was building awesome things with Access 20 years ago. I loved that thing. I wasn't even a software engineer. I was in the EE, but I needed a way to track process and it definitely outperformed. And the best thing, it didn't cost us anything. Everybody already had access, lol. I had 40 people use it in production, manufacturing cutting edge stuff. Definitely beat spreadsheets because Access gave you gui for operators.

jack_pp 19 hours ago

what in God's Name could you do in PHP that you can't do in a modern framework?

tbossanova 18 hours ago

Nothing; but PHP, in experienced hands, will be waaay more productive for small-to-medium things. One issue is that experienced hands are increasingly hard to come by. Truly big, complicated things, built by large teams or numbers of teams, teams with a lot of average brains or AIs trained on average brains, will be better off in something like Typescript/React. And everyone wants to work on the big complicated stuff. So the "modern frameworks" will continue to dominate while smaller, more niche shops will wonder why they waste their time.

jack_pp 18 hours ago

I worked at a startup, they built their API in PHP because it was easy and fast. Now they're successful, app doesn't scale, high latency etc. What does their php code do? 95% of it is calling a DB.

You're telling me today with LLM power multiplier it's THAT much faster to write in PHP compared to something that can actually have a future?

frio 15 hours ago

“PHP was so easy and fast that they’ve built such a successful startup they now have scaling problems” is, as far as I can tell, an endorsement of PHP and not a criticism of it.

skeledrew 13 hours ago

I think the point here is that the scaling problem is hard because of PHP.

Imustaskforhelp 11 hours ago

Scaling can be hard in PHP at the same time GGP comment's about PHP being in productive hands and thus being one of the reasons why PHP worked for them. Both of these can be true at the same time.

And for what its worth, Typescript scaling, although better than PHP is still somewhat of an issue and If you want to have massive scaling, Elixir/ (to-an-extent gleam) are developed for solving the scalability problem especially with Phoenix framework in Elixir-land.

So I guess, jack_pp comment's about PHP can also be applied to an degree towards Typescript as well so we should all use elixir, and also within the TS framework the question can be asked for (sveltekit/solid vs next-js/react)

I am more on the svelte side of things but I see people who love react and same for those who love PHP. So my opinion is sort of that everyone can run in their own languages.

Golang is another language to be taken into consideration especially with Htmx/datastar-go/alpine.

hparadiz 11 hours ago

Scaling in PHP is easy. Has never actually been an issue in my entire career unless it was a badly designed database.

PUSH_AX 13 hours ago

Yes, startup success has a direct correlation to the language chosen for your CRUD api…

duggan 15 hours ago

> I worked at a startup, they built their API in PHP because it was easy and fast. Now they're successful

You can stop there! Sounds like PHP worked for them. Already doing better than 90% of startups.

watermelon0 16 hours ago

If 95% of what app does is calling a DB, then the bottleneck is in the DB, not with the PHP.

You can use persistent DB connections, and app server such as FrankenPHP to persist state between requests, but that still wouldn't help if DB is the bottleneck.

imron 15 hours ago

Sometimes it’s still the app:

   rows = select all accounts
   for each row in rows:
       update row

But that’s not necessarily a PHP problem. N+1 queries are everywhere.

silon42 14 hours ago

Depending on what you are doing, the above is not necessarily bad.. often much better than an SQL that locks an entire table (potentially blocking the whole DB, if this is one of the key tables).

NorwegianDude 14 hours ago

> I worked at a startup, they built their API in PHP because it was easy and fast. Now they're successful, app doesn't scale, high latency etc. What does their php code do? 95% of it is calling a DB.

So PHP worked perfectly, but the DB is slow? Your DB isn't going any faster by switching to something else, if that's what you think.

PHP is the future, where React has been heading for years.

usrusr 12 hours ago

> Your DB isn't going any faster by switching to something else, if that's what you think.

Only true if none of the DB accesses are about stuff that could live as state across requests in a server that wasn't php. Sure, for some of that the DB's caching will be just as good, but for others, not at all.

graemep 10 hours ago

That is possible, but it sounds unlikely to me.

In most cases you could add a shared cache to fix the problem - e.g. put your shared state in Redis, or in a file that is synced across servers (if its kept as state in a long running process it cannot need to be updated frequently).

nake89 15 hours ago

Not scaling and high latency sound like a skill issue, not a PHP issue.

Chaosvex 10 hours ago

What does this even mean? If you've got scaling problems, it's not because you've used PHP.

arjvik 18 hours ago

by future do you mean Future<T> or metaphorical future? :)

rurban 17 hours ago

PHP did better than python and perl. Python is doomed. PHP got a good jit already, a good OO lately, good frameworks, stable extensions. It has a company behind.

Unlike python or ruby which break right and left all the time on updates. you have to use bunkers of venvs, without any security updates. A nightmare.

PHP can scale and has a future.

Incipient 16 hours ago

Python is doomed? That's new.

You use python docker images pinned to a stable version (3.11 etc), and between bigger versions, you test and handle any breaking changes.

I feel like this approach applies to pretty much every language?

Who on earth raw dogs on "language:latest" and just hopes for the best?

Granted I wouldn't be running Facebook's backend on something like this. But i feel that isn't a problem 95% of people need to deal with.

rurban 16 hours ago

No, only to python. And partially ruby and ocaml. Not to typescript, perl or PHP.

skeledrew 13 hours ago

Introducing uv...

https://docs.astral.sh/uv/

rurban 12 hours ago

uv does not fix the need for venv's or docker containers. normal people update their libs with the hope to get problems fixed.

python people don't update their libs, because then everything will break right and left. so they keep their security problems running.

skeledrew 7 hours ago

No matter how you look at it, the dependencies have to go somewhere. Node uses node_modules, most compiled languages require compiled libraries (or they're a huge blob), etc. Idk about PHP but I'm pretty sure 3rd party things for any given app also live somewhere. Different ways of managing dependencies. It's recommended that venvs are used in Python because you may accidentally nuke a system script by doing global installs, and otherwise there still needs to be some sort of 3p version handling when you have multiple projects going.

Once something works in Python (which uv now makes trivial; before it could be a pain), updating 3rd party packages rarely cause breakage. But yes, I think many who use it hardly update, because things usually continue to work for years and the attack surface is pretty narrow[0]. Heck just a few days ago I checked out a project that I hadn't touched in years, which I wrote in Python 3.7; updated to 3.13 and it continued to just work. Compare to PHP which has a far higher attack surface[1] and often has breaking changes. I've heard a couple nightmare stories of a v7.x -> v8.x move being delayed because it required a serious codebase rewrite.

[0] https://www.cvedetails.com/product/18230/Python-Python.html?... [1] https://www.cvedetails.com/product/128/PHP-PHP.html?vendor_i...

stavros 13 hours ago

I don't think it's true that experienced hands will be faster in PHP than in Python or JS or whatever. It's just about what you know, and experienced hands are experienced.

ipaddr 5 hours ago

PHP is faster to develop in then Pythin or JS then addin a framework like Laravel and you are already done.

Python has the curse of spaces or tabs and JS has the curse of npm.

stavros 5 hours ago

PHP has the curse of T_PAAMAYIM_NEKUDOTAYIM.

mewpmewp2 19 hours ago

You can build those things in modern frameworks, it will just be more headache and will feel outdated in 6 months.

the_lonely_time 19 hours ago

Where are my backbone apps? In the trash? Me ember apps? Next to them. My create-react-apps? On top of those. My Next apps? Being trashed as we speak. My rails apps? Online and making money every year with minimal upgrade time. What the hell was I thinking.

tdeck 16 hours ago

I'm guessing you avoided the CoffeeScript era of Rails, which is a good thing.

Gigachad 18 hours ago

6 years ago I was writing apps in typescript and react, if I was starting a new project today I'd write it in typescript and react.

ehnto 18 hours ago

People bicker about PHP and Javascript, sorry Typescript, like they aren't both mule languages peoppe pick up to get work done. They both matured really well through years of production use.

They are in the same group, similar pedigree. If you were programming purely for the art of it, you would have had time to discover much nicer languages than either, but that's not what most people are doing so it doesn't really matter. They're different but they're about as good as eachother.

tjoff 14 hours ago

Making instant loading and user respecting sites.

azangru 10 hours ago

Could you give examples of the modern frameworks that you have in mind?

baq 12 hours ago

Don’t confuse php the language with php the way of webmaster 2006 vintage.

spiderfarmer 10 hours ago

Those webmasters built the web a lot of people are now nostalgic about already.

dheera 18 hours ago

Not have to "build" anything. You edit code and it is already deployed on your dev instance.

Deploying to production is just scp -rv * production:/var/www/

Beautifully simple. No npm build crap.

jack_pp 18 hours ago

You trade having to compile for actually having code that can scale

ericd 17 hours ago

Not sure what you’re talking about, I scaled to millions of users on a pair of boxes with PHP, and its page generation time absolutely crushed Rails/Django times. Apache with mod PHP auto scales wonderfully.

lelanthran 17 hours ago

It scales just fine the same way everything else scales: put a load balancer in front of multiple instances of your app.

vachina 17 hours ago

It can scale by the virtue of spending a lot less time processing the request

spiderfarmer 10 hours ago

You don't know anything about the PHP ecosystem and it shows.

eqvinox 8 hours ago

The comparison target for PHP is IMHO a good Python web framework, e.g. Django being the most popular one. I still don't understand how JavaScript is ever considered viable, TypeScript makes it workable I guess…

brushfoot 13 hours ago

> sounds like alot of work and expense for something that is meant to make programming easier and cheaper.

It's not more work; it's a convergence of roles. BA/PO/QA/SWE are merging.

AI has automated aspects of those roles that have made the traditional separation of concerns less desirable. A new hybrid role is emerging. The person writing these acceptance criteria can be the one guiding the AI to develop them.

So now we have dev-BAs or BA-devs or however you'd like to frame it. They're closer to the business than a dev might have been or closer to development than a BA might have been. The point is, smaller teams are able to play wider now.

humanfromearth9 11 hours ago

Oh a modern comeback of the analyst-programmer?

otabdeveloper4 13 hours ago

> It's not more work

It literally is. You're spending weeks of effort babysitting harnesses and evaluating models while shipping nothing at all.

brushfoot 13 hours ago

That hasn't been my experience, as a "ship or die" solopreneur. It takes work to set up these new processes and procedures, but it's like building a factory; you're able to produce more once they're in place.

And you're able to play wider, which is why the small team is king. Roles are converging both in technologies and in functions. That leads to more software that's tailored to niche use cases.

otabdeveloper4 12 hours ago

> you're able to produce more once they're in place

Cool story, unfortunately the proof is not in the pudding and none of this fantom x10 vibe-coded software actually works or can be downloaded and used by real people.

P.S. Compare to AI-generated music which is actually a thing now and is everywhere on every streaming platform. If vibe coding was a real thing by now we'd have 10 vibecoded repos on Github for every real repo.

brushfoot 12 hours ago

There's no need to be rude with comments like "cool story." I'm sharing my experience with you. I'm not an AI-hype influencer. I'm a SWE who runs a small SaaS business.

Where it sounds like we agree is that there's some obnoxious marketing hype around LLMs. And people who think they can vibe code without careful attention to detail are mistaken. I'm with you there.

rafaelmn 12 hours ago

These people play around with shit and try to sell you on their secret sauce. If it actually works it will come to claude code - so you can consider them practical SOTA and honestly just plopping CC to a mid sized codebase is a pretty great experience for me already. Not ideal but I get real tangible value out of it. Not 10x or any such nonsense but enough to think that I don't think I want to be managing junior developers anymore, the ROI with LLMs is much faster and significant IMO.

ipaddr 5 hours ago

Looking back we see how foolish the anti-php memes were. Meanwhile PHP lives on and becomes better with each release.

Tooling around llms are a natural next step that will become your default one day.

godelski 17 hours ago

I can't believe we're back to advocating for TDD. It was a failed paradigm that last few times we tried it. This time isn't any different because the fundamental flaw has always been the same: tests aren't proofs, they don't have complete coverage.

Before anyone gets too confused, I love tests. They're great. They help a lot. But to believe they prove correctness is absolutely laughable. Even the most general tests are very narrow. I'm sure they help LLMs just as they help us, but they're not some cure all. You have to think long and hard about problems and shouldn't let tests drive your development. They're guardrails for checking bonds and reduce footguns.

Oh, who could have guessed, Dijkstra wrote about program completeness. (No, this isn't the foolishness of natural language programming, but it is about formalism ;)

https://www.cs.utexas.edu/~EWD/transcriptions/EWD02xx/EWD288...

josephg 15 hours ago

Testing works because tests are (essentially) a second, crappy implementation of your software. Tests only pass if both implementations of your software behave the same way. Usually that will only happen if the test and the code are both correct. Imagine if your code (without tests) has a 5% defect rate. And the tests have a 5% defect rate (with 100% test coverage). Then ideally, you will have a 5%^2 defect rate after fixing all the bugs. Which is 0.25%.

The price you pay for tests is that they need to be written and maintained. Writing and maintaining code is much more expensive than people think.

Or at least it used to be. Writing code with claude code is essentially free. But the defect rate has gone up. This makes TDD a better value proposition than ever.

TDD is also great because claude can fix bugs autonomously when it has a clear failing test case. A few weeks ago I used claude code and experts to write a big 300+ conformance test suite for JMAP. (JMAP is a protocol for email). For fun, I asked claude to implement a simple JMAP-only mail server in rust. Then I ran the test suite against claude's output. Something like 100 of the tests failed. Then I asked claude to fix all the bugs found by the test suite. It took about 45 minutes, but now the conformance test suite fully passes. I didn't need to prompt claude at all during that time. This style of TDD is a very human-time efficient way to work with an LLM.

godelski 4 hours ago

  > Tests only pass if both implementations of your software behave the same way.

That's not true.

I even addressed this in my comment as did Dijkstra

aray07 6 hours ago

This is great. The tests in this case are the spec. When you give the agent something concrete to fail against, it knows what done looks like.

The problem is if you skip that step and ask Claude to write the tests after.

mewpmewp2 13 hours ago

I think there is a difference whether you do TDD or write tests after the fact to avoid regression. TDD can only work decently if you already know your specs very well, but not so much when you still need to figure them out, and need to build something actual to be able to figure it out.

josephg 12 hours ago

Yes; I think this remains true with coding agents. If you need to do some exploration of the solution space, it makes sense to do that before writing tests. Once you have a clear, workable design, you can get the agent to make a battery of tests to make sure the final product works correctly.

theshrike79 15 hours ago

When you write tests with LLM-generated code you're not trying to prove correctness in a mathematically sound way.

I think of it more as "locking" the behavior to whatever it currently is.

Either you do the red-green-with-multiple-adversarial-sub-agents -thing or just do the feature, poke the feature manually and if it looks good then you have the LLM write tests that confirm it keeps doing what it's supposed to do.

The #1 reason TDD failed is because writing tests is BOORIIIING. It's a bunch of repetition with slight variations of input parameters, a ton of boilerplate or helper functions that cover 80% of the cases, but the last 20% is even harder because you need to get around said helpers. Eventually everyone starts copy-pasting crap and then you get more mistakes into the tests.

LLMs will write 20 test cases with zero complaints in two minutes. Of course they're not perfect, but human made bulk tests rarely are either.

godelski 4 hours ago

  > you're not trying to prove correctness in a mathematically sound way.

  > "locking" the behavior to whatever it currently is.

These two sentences are incompatible

  > The #1 reason TDD failed is

Because spec is an ever evolving thing that cannot be determined a priori. And because it highly incentivized engineers to metric hack.

  > It's a bunch of repetition with slight variations

If that's how you're writing tests then you're writing them wrong. You have the wrong level of abstraction. Abstraction is not a dirty word. It solves these problems. Maybe juniors don't understand that abstraction and fuck it up while learning but making abstraction a dirty word is throwing the baby out with the bath water.

  > Eventually everyone starts copy-pasting crap

Which is a horrendous way to write code.

theshrike79 32 minutes ago

Locking behavior with tests isn't the same as comprehensive and foolproof tests. They might not cover every edge case, but will fail if the happy path starts failing for some reason.

And yes, copy-pasting is a horrendous way to write code, but everyone does it.

When you're adding the 1600th CRUD endpoint of your career to an enterprise Java/C# application, can you with all honesty say you will type every single character with the same thought and consideration every time?

Or do you just make one, copy-paste that one and modify accordingly?

Or if you write 20 unit tests with slight alterations you masterfully craft every single character to perfection?

I have a limited amount of energy to use every day, I choose to use it in places that matter. The hard bits that LLMs and copy-pasting can't speed up.

computerdork 16 hours ago

Hmm, not so sure TDD is a failed paradigm. Maybe it isn't a pancea, but it is seems like it's changed how software development is done.

Especially for backend software and also for tools, seems like automated tests can cover quite a lot of use cases a system encounters. Their coverage can become so good that they'll allow you to make major changes to the system, and as long as they pass the automated tests, you can feel relatively confident the system will work in prod (have seen this many times).

But maybe you're separating automated testing and TDD as two separate concepts?

prerok 16 hours ago

Indeed, they are two separate concepts.

I write lots of automated tests, but almost always after the development is finished. The only exception is when reproducing a bug, where I first write the test that reproduces it, then I fix the code.

TDD is about developing tests first then writing the code to make the tests pass. I know several people who gave it an honest try but gave up a few months later. They do advocate everyone should try the approach, though, simply because it will make you write production code that's easier to test later on.

computerdork 4 hours ago

... hmm, just looked it up. According to some sites on the web, TDD was created by Kent Beck as apart of Extreme Programming in the 90's and automated testing is a big part of TDD. Having lived through that era, thinking back, would say that TDD did help to popularize automated testing. It made us realize that focusing a ton on writing tests had a lot of benefits (and yeah, most of us didn't do the test first development part).

But this is kind of splitting hairs on what TDD is, not too important.

mewpmewp2 13 hours ago

I think tests in general are good, just not TDD as it forces you to what I think bad and narrow paradigm of thinking. I think e.g. it is better that I build the thing, then get to 90%+ coverage once I am sure this is what I would also ship.

godelski 3 hours ago

That's the result I've seen with anyone who tries TDD. Their code ends up being very rigid, making it difficult to add new features and fix bugs. It just ends up making them over confident in their code's correctness. As if their code is bug free. It just seems like an excuse to not think and avoid doing the hard stuff.

godelski 3 hours ago

  > But maybe you're separating automated testing and TDD as two separate concepts?

I hope it's clear that I am given my content and how I stress I write tests. The existence of tests do not make development TDD.

The first D in TDD stands for "driven". While my sibling comment explains the traditional paradigm it can also be seen in an iterative sense. Like just developing a new feature or even a bug. You start with developing a test, treating it like spec, and then write code to that spec. Look at many of your sibling comments and you'll see that they follow this framing. Think carefully about it and adversarially. Can you figure out its failure mode? Everything has a failure mode, so it's important to know.

Having tests doesn't mean they drive the development. So there's many ways to develop software that aren't TDD but have tests. The important part is to not treat tests as proofs or spec. They are a measurement like any other; a hint. They can't prove correctness (that your code does what you intend it to do). They can't prove that it is bug free. But they hint at those things. Those things won't happen unless we formalize the code and not only is that costly in time to formalize but often will result in unacceptable computational overhead.

I'll give an example of why TDD is so bad. I taught a class a year ago (upper div Uni students) and gave them some skeleton code, a spec sheet, and some unit tests. I explicitly told them that the tests are similar to my private tests, which will be used to grade them, but that they should not rely on them for correctness and I encourage them to write their own. The next few months my office hours were filled with "but my code passes the tests" and me walking students through the tests and discussing their limitations along with the instructions. You'd be amazed at how often the same conversations happened with the same students over and over. A large portion of the class did this. Some just assumed tests had complete coverage and never questioned them while others read the tests and couldn't figure out their limits. But you know the students who never struggled in this way? The students who first approached the problem through design and even understood that even the spec sheet is a guide. That it tells requirements, not completeness. Since the homeworks built on one another those students had the easiest time. Some struggled at first, but many of them got the right levels of abstraction that I know I could throw new features at them and they could integrate without much hassle. They knew the spec wasn't complete. I mean of course it wasn't, we told them from the get go that their homeworks were increments to building a much larger program. And the only difference between that and real world programming is that that isn't always explicitly told to you and that the end goal is less clear. Which only makes this design style more important.

The only thing that should drive the software development is an unobtainable ideal (or literal correctness). A utopia. This prevents reduces metric hacking, as there is none to hack. It helps keep you flexible as you are unable to fool yourself into believing the code is bug free or "correct". Your code is either "good enough" or not. There's no "it's perfect" or "is correct", there's only triage. So I'll ask you even here, can you find the failure mode? Why is that question so important to this way of thinking?

siva7 11 hours ago

TDD and similiar test paradigms have all the same fundamental flaw -> It's testing for the sake of testing. You need to know exactly what you want in order to start, which isn't compatible with a competitive iterative workflow no matter how much TDD yells otherwise. TDD doesn't make sense in agile and fast iteration workflows, only in heavily regulated / restricted products.

tinodb 3 hours ago

It certainly isn’t. It is more a way of discovery on how to implement something, with the benefit of being able to safely (and thus easily) change it later.

The 99 Bottles book by Sandi Metz [0] is a good short display of how it works and where it helps actually building maintainable software

[0] https://sandimetz.com/99bottles

skeledrew 13 hours ago

> But to believe they prove correctness is absolutely laughable.

Sounds like a lack of tests for the correct things.

godelski 3 hours ago

True, but I seriously doubt people are writing formal proofs for their code. I've only seen this in niche academic circles and high security/safety settings. I also am pretty certain it's not what you're suggesting, but hey, I could be wrong

mvdtnz 16 hours ago

> But to believe they prove correctness is absolutely laughable.

You don't need to believe this to practice TDD. In fact I challenge you to find one single mainstream TDD advocate who believes this.

godelski 3 hours ago

https://news.ycombinator.com/item?id=47333160

dwedge 14 hours ago

It being a lot of work is why they didn't do it at all for weeks and still, without self reflection, wrote that they care about the code quality of the code they hadn't looked at or tested

globular-toast 10 hours ago

"You better work, bitch" -- Britney Spears

Our society is obsessed with work. Work will never end. If things become easier we just do more of them. Whether putting all our efforts into recycling things created by those that came before is good for us will remain to be seen.

Aushin 10 hours ago

Our society is obsessed with <the appearance of> work

EricEspen 8 hours ago

php still makes money though!

thewhitetulip 8 hours ago

I saw a guys post on LinkedIn who created llm agent to water how plants based on sensor on his plants

He still has to water the plants on his own. Its just that it costs him quite a bit when all of that could he mamaged with an alarm to remind him to water plants.

spiderfarmer 10 hours ago

It's always the uber conservative and over principled people who laugh about using PHP that have an opinion on everything while not knowing how to get shit done.

They're all just tools. You decide how to use them.

philipwhiuk 8 hours ago

Sure but we can agree there's essentially two parallel industries in web development

Engineer at tech firms and WebShops writing WordPress plugins for single clients where Squarespace doesn't cut it.

Is AI another field of people or is it killing one or both of those. TBD

AmazingTurtle 9 hours ago

> like using PHP

lmao, chuckled

recroad 2 days ago

Am I supposed to be impressed by this? I think people are now just using agents for the sake of it. I'm perfectly happy running two simple agents, one for writing and one for reviewing. I don't need to go be writing code at faster than light speed. Just focusing on the spec, and watching the agent as it does its work and intervening when it goes sideways is perfectly fine with me. I'm doing 5-7x productivity easily, and don't need more than that.

I also spend most of my time reviewing the spec to make sure the design is right. Once I'm done, the coding agent can take 10 minutes or 30 minutes. I'm not really in that much of a rush.

mjrbrennan 22 hours ago

Yes I'm still not really understanding this "run agents overnight" thing. Most of the time if I use claude it's done in 5-20 minutes. I've never wanted to have work done for me overnight...tomorrow is already plenty of time for more work, it's not going anywhere, and my employer isn't paying me to produce overnight.

tudelo 19 hours ago

The only counter I have to this is that there are some workflows that have test environments, everything can't or shouldn't just run locally. Sometimes these test take time, and instead of babysitting the model to write code and run the build+deploy+test manually, you can send it off to work until the kinks are worked out.

Add to that I have worked on many projects that take more than 20 minutes to fully build and run tests... unfortunately. And I would consider that part of the job of implementing a feature, and to reduce cycles I have to take.

After the "green" signal I will manually review or send off some secondary reviews in other models. Is it wasteful? Probably. But its pretty damn fun (as long as I ignore the elephant in the room.)

saguntum 16 hours ago

Yeah, our basic integration test suite takes over 20 minutes to run in CI, likely higher locally but I never try to run the full test suite locally. That doesn't even encapsulate PDVs and other continuous testing that runs in the background.

The other day, I wrote a claude skill to pull logs for failing tests on a PR from CI as a CSV for feeding back into claude for troubleshooting. It helped with some debugging but was very fraught and needed human guidance to avoid going in strange directions. I could see this "fix the tests" workflow instrumented as overnight churn loops that are forbidden from modifying test files that run and have engineers review in the morning if more tests pass.

Maybe agentic TDD is the future. I have a bit of a nightmare vision of SWEs becoming more like QA in the future, but with much more automation. More engineering positions may become adversarial QA for LLM output. Figure out how to break LLM output before it goes to prod. Prove the vibe coded apps don't scale.

In the exercise I described above, I was just prompt churning between meetings (having claude record its work and feeding it to the next prompt, pulling test logs in between attempts), without much time to analyze, while another engineer on my team was analyzing and actually manually troubleshooting the vibe coded junk I was pushing up, but we fixed over 100 failing integration tests in a week for a major refactor using claude plus some human(s) in the loop. I do believe it got things done faster than we would have finished without AI. I do think the quality is slightly lower than would have been if we'd had 4 weeks without meetings to build the thing, but the tests do now pass.

mjrbrennan 16 hours ago

Yes that's fair, but not the case for me. Everything can run locally and specs run quickly for covering things claude changes. For everything else, the GitHub CI run is 10-15m and catches any outlier failures, and I'm usually working on more than one thing at a time anyway so it doesn't really matter to wait for this.

voidUpdate 13 hours ago

Its for when you want to write an spec like "make me a todo list app", then tell your agent of choice to go have fun, and return in the morning to a fully finished app, and not care about what the code is actually doing

jakevoytko 12 hours ago

I’ve been playing around with these kinds of prompts. My experience is that the prompts need a lot of iteration to truly one-shot something that is halfway usable. If it’s under-spec’d it’ll just return after 15-20 minutes with something that’s not even half baked. If I give it an extremely detailed spec it’ll start dropping requirements and then finish around the 60-70 minute mark, but I needed 20 minutes to write the prompt and I need to hunt for the things it didn’t bother to do.

I’ve gotten some success iterating on the one-shot prompt until it’s less work to productionize the newest artifact than to start over, and it does have some learning benefits to iterate like this. I’m not sure if it’s any faster than just focusing on the problem directly though.

aray07 6 hours ago

The dropping requirements problem is real. What's helped us is breaking the spec into numbered ACs and having the verification run per-criterion. If AC-3 fails you know exactly what got dropped.

jakevoytko 5 hours ago

I'll try that out, thanks for the tip!

genghisjahn 24 hours ago

I went the same way. At first I was splitting off work trees and running all the agents that I could afford, then I realized I just can't keep up with it all, running few agents around one issue in one directory is fast enough. Way faster than before and I can still follow what's happening.

paganel 24 hours ago

> off work trees and running all the agents that I could afford,

I still think that we, programmers, having to pay money in order to write code is a travesti. And I'm not talking about paying the license for the odd text editor or even for an operating system, I'm talking about day-to-day operations. I'm surprised that there isn't a bigger push-back against this idea.

jeremyjh 23 hours ago

What is strange about paying for tools that improve productivity? Unless you consider your own time worthless you should always be open to spending more to gain more.

cube00 20 hours ago

No stock backed company will be paying developers more regardless of much more productive these tools make us. You'll be lucky if they pay for the proper Claude Max plan themselves considering most wouldn't even spring for IntelliJ.

jeremyjh 9 hours ago

I wasn't thinking about this from the perspective of an IC in a company, more from the perspective of self employment or side projects. But its not any different for a larger business: An IC should not pay for their own tools, but an engineering manager who won't is a fool.

fwip 23 hours ago

Are the jobs out there actually paying people more?

what 23 hours ago

Your own time is worthless if you’re not spending it doing something that makes more money. You don’t make more money increasing your productivity for work when you’re expected to work the same number of hours.

mr-wendel 21 hours ago

I've spent a fair amount of time contracting -- this issue is even more relevant here. While I wasn't spending very much on AI tools, what I did spent was worth every penny... for the company I was supporting :).

Fortunately, there was enough work to be done so productivity increases didn't decrease my billable hours. Even if it did, I still would have done it. If it helps me help others, then it's good for my reputation. Thats hard to put a price on, but absolutely worth what I paid in this case.

eKIK 23 hours ago

Dw, there's quite a lot of push back against AI in some of the communities I hang around in. It's just rarely seldom visible here on HN.

It's usually not about the price, but more about the fact that a few megacorps and countries "own" the ability to work this way. This leads to some very real risks that I'm pretty sure will materialize at some point in time, including but not limited to:

- Geopolitical pressure - if some ass-hat of a president hypothetically were to decide "nuh uh - we don't like Spain, they're not being nice to us!", they could forbid AI companies to deliver their services to that specific country.

- Price hikes - if you can deliver "$100 worth of value" per hour, but "$1000 worth of value" per hour with the help of AI, then provider companies could still charge up to $899 per hour of usage and it'd still make "business sense" for you to use them since you're still creating more value with them than without them.

- Reduction in quality - I believe people who were senior developers _before_ starting to use AI assisted coding are still usually capable of producing high quality output. However every single person I know who "started coding" with tools like Claude Code produce horrible horrible software, esp. from a security p.o.v. Most of them just build "internal tools" for themselves, and I highly encourage that. However others have pursued developing and selling more ambitious software...just to get bitten by the fact that it's much more to software development than getting semi-correct output from an AI agent.

- A massive workload on some open source projects. We've all heard about projects closing down their bug bounty programs, declining AI generated PRs etc.

- The loss of the joy - some people enjoy it, some people don't.

We're definitely still in the early days of AI assisted / AI driven coding, and no one really knows how it'll develop...but don't mistake the bubble that is HN for universal positivity and acclaim of AI in the coding space :).

GorbachevyChase 18 hours ago

China did users a solid and Qwen is a thing, so the scenario where Anthropic/OpenAI/Google collude and segment the market to ratchet prices in unison just isn’t possible. Amodei talking about value based pricing is a dream unless they buy legislation to outlaw competitors. Altman might have beat them to that punch with this admin, though. Most of us are operating on 10-40% margins. Usually on the low end when there aren’t legal barriers. The 80-99% margins or rent extraction rights SaaS people expect is just out of touch. The revenue the big 3 already pull in now has a lot more to do with branding and fear-mongering than product quality.

switchbak 22 hours ago

My old work machine used power quite aggressively - I was happy to pay for that (and turn it off at night!). This seems even more directly valuable.

xandrius 23 hours ago

It's silly, who wouldn't answer yes to the question "would you like to finish your task faster?". The real trick is to produce more but by putting less effort than before.

tdeck 19 hours ago

> who wouldn't answer yes to the question "would you like to finish your task faster?"

People who enjoy the process of completing the task?

lmz 18 hours ago

Maybe we'd see "coding gyms" like how white collar workers have gyms for the physical exercise they're not getting from their work.

missingdays 13 hours ago

codeforces and topcoder have existed for years

foo42 13 hours ago

I salaried employees who are paid by time, and are paying their own Anthropic bills.

Initially there is perhaps a mitigating advantage of briefly impressing ourselves or others with output, but that will quickly fade into the new normal.

Net result: employee paying significant money to produce more, but capturing none of that value.

the_af 21 hours ago

If you finish faster, you'll be given another task. You're not freeing yourself sooner or spending less effort, you're working the same number of hours for the same pay. Your reward is not joining the ranks of those laid off.

ponector 15 hours ago

If you are paid hourly and not per task than what is the point in finishing your task faster?

JumpCrisscross 20 hours ago

> Am I supposed to be impressed by this?

No. But it is noteworthy. A lot of what one previously needed a SWE to do can now be brute forced well enough with AI. (Granted, everything SWEs complained about being tedious.)

From the customer’s perspective, waiting for buggy code tomorrow from San Francisco, buggy code tonight from India or buggy code from an AI at 4AM aren’t super different for maybe two thirds of use cases.

timr 19 hours ago

> A lot of what one previously needed a SWE to do can now be brute forced well enough with AI. (Granted, everything SWEs complained about being tedious.)

Only if you ignore everything they generate. Look at all the comments saying that the agent hallucinates a result, generates always-passing tests, etc. Those are absolutely true observations -- and don't touch on the fact that tests can pass, the red/green approach can give thumbs up and rocket emojis all day long, and the code can still be shitty, brittle and riddled with security and performance flaws. And so now we have people building elaborate castles in the sky to try to catch those problems. Except that the things doing the catching are themselves prone to hallucination. And around we go.

So because a portion of (IMO always bad, but previously unrecognized as bad) coders think that these random text generators are trustworthy enough to run unsupervised, we've moved all of this chaotic energy up a level. There's more output, certainly, but it all feels like we've replaced actual intelligent thought with an army of monkeys making Rube Goldberg machines at scale. It's going to backfire.

cherk3 18 hours ago

What I want to know is, what has this increase in code generation led to? What is the impact?

I don't mean 'Oh I finally have the energy to do that side project that I never could'.

Afterall, the trade-offs have to be worth something... right? Where's the 1-person billion dollar firms at That Mr Altman spoke about?

The way I think of it is code has always been an intermediary step between a vision and an object of value. So is there an increase in this activity that yields the trade-offs to be a net benefit?

JumpCrisscross 42 minutes ago

> what has this increase in code generation led to?

Every restaurant in my small town has their menu on the website in a normal way. Apparently someone figured out you can take a picture of a paper menu and have AI code it into HTML.

JumpCrisscross 18 hours ago

> coders think that these random text generators are trustworthy enough to run unsupervised, we've moved all of this chaotic energy up a level

But it works well enough for most use cases. Most of what we do isn’t life or death.

timr 18 hours ago

> But it works well enough for most use cases.

So does the code produced by any bad engineer.

So either we’re finally admitting that all of that leetcode screening and engineer quality gating was a farce, or it wasn’t, and you’re wrong.

I think the answer is in the middle, but the pendulum has swung too far in the “doesn’t matter” direction.

JumpCrisscross 18 hours ago

> we’re finally admitting that all of that leetcode screening and engineer quality gating was a farce, or it wasn’t, and you’re wrong

We’re admitting a bit of both. Offshoring just became more instantaneous, secure and efficient. There will still be folks who overplay their hand.

Macroeconomically speaking, I don’t see why we need more software engineers in the future than we have today, and that’s probably a conservative estimate.

datsci_est_2015 17 hours ago

> Macroeconomically speaking, I don’t see why we need more software engineers in the future than we have today, and that’s probably a conservative estimate.

Why? Is the argument that there’s a finite amount of software that the world needs, and therefore we will more quickly reach that finite amount?

Seems more likely to me that if LLMs are a force multiplier for software then more software engineers will exist. Or, instead of “software engineers”, call them “people who create software” (even with the assistance of LLMs).

Or maybe the argument is that you need to be a super genius 100x engineer in order to manipulate 17 collaborative and competitive agents in order to reach your maximum potential, and then you’ll take everyone’s jobs?

Idk just seems like wild speculation that isn’t even worth me arguing against. Too late now that I’ve already written it out I guess.

JumpCrisscross 9 hours ago

> instead of “software engineers”, call them “people who create software” (even with the assistance of LLMs)

I think this is my hypothesis. A lot more people with a lot less training will create vastly more software. As a consequence, the trade sort of dissolves at the edges as something that pays a premium. Instead, other competencies become the differentiators.

utopiah 8 hours ago

> A lot of what one previously needed a SWE to do can now be brute forced well enough with AI.

I've never met those people. I've met a LOT of PM who tried. I've met a LOT of entrepreneur who also tried. They never cared, nor even understand, code. They only cared about "value" (and they are not necessarily wrong about it) so now they can "produce" something that does what need until it doesn't. When that's the case then they inexorably go back to someone else (might be a SWE, ironically enough, but might also be someone else like them they shift responsibility to, for money).

Brute force works until you have to backtrack, then it becomes prohibitively expensive until one has to actually grok the problem landscape. It's amazing for toy projects though, maybe.

p2detar 13 hours ago

I'm on the same ship. Running 2 agents and seeing a vast amount of productivity increase. Not always though. Sometimes the solutions are very over-engineered and I need to guide the agent to where I want it to go. I do a lot of micro-management, which is totally not where people with agent-orchestras seem to go nowadays.

aray07 23 hours ago

yup, agree - i spend most of my time reviewing the spec. The highest leverage time is now deciding what to work on and then working on the spec. I ended up building the verify skill (https://github.com/opslane/verify) because I wanted to ensure claude follows the spec. I have found that even after you have the spec - it can sometimes not follow it and it takes a lot of human review to catch those issues.

ge96 24 hours ago

I would be impressed if I could say "here's $100 turn it into $1000" but you still gotta do the thinking.

jimmyjazz14 7 hours ago

agreed, honestly if I see my agent "run" for more than 5 minutes or so I get very suspicious that its doing anything of value other than burning credits because more often than not its just talking to its self or running in loops. I also find the whole multi-agent stuff to be suspect most the time, I don't know that I have seen multiple agents running in parallel do anything that a single agent with good guidance couldn't do synchronously in about the same amount of time.

nurettin 8 hours ago

They are probably paying for expensive subscriptions and want to utilize them. Unfortunately we aren't past the slop stage so a lot of the business logic probably has bugs and unused defensive code that snowballs the more features AI adds.

autodate 14 hours ago

[dead]

egeozcan 2 days ago

You can always tell claude to use red-green-refactor and that really is a step-up from "yeah don't forget to write tests and make sure they pass" at the end of the prompt, sure. But even better, tell it to create subagents to form red team, green team and refactor team while the main instance coordinates them, respecting the clean-room rules. It really works.

The trick is just not mixing/sharing the context. Different instances of the same model do not recognize each other to be more compliant.

magicalist 2 days ago

> But even better, tell it to create subagents to form red team, green team and refactor team while the main instance coordinates them, respecting the clean-room rules. It really works.

It helps, but it definitely doesn't always work, particularly as refactors go on and tests have to change. Useless tests start grow in count and important new things aren't tested or aren't tested well.

I've had both Opus 4.6 and Codex 5.3 recently tell me the other (or another instance) did a great job with test coverage and depth, only to find tests within that just asserted the test harness had been set up correctly and the functionality that had been in those tests get tested that it exists but its behavior now virtually untested.

Reward hacking is very real and hard to guard against.

egeozcan 2 days ago

The trick is, with the setup I mentioned, you change the rewards.

The concept is:

Red Team (Test Writers), write tests without seeing implementation. They define what the code should do based on specs/requirements only. Rewarded by test failures. A new test that passes immediately is suspicious as it means either the implementation already covers it (diminishing returns) or the test is tautological. Red's ideal outcome is a well-named test that fails, because that represents a gap between spec and implementation that didn't previously have a tripwire. Their proxy metric is "number of meaningful new failures introduced" and the barrier prevents them from writing tests pre-adapted to pass.

Green Team (Implementers), write implementation to pass tests without seeing the test code directly. They only see test results (pass/fail) and the spec. Rewarded by turning red tests green. Straightforward, but the barrier makes the reward structure honest. Without it, Green could satisfy the reward trivially by reading assertions and hard-coding. With it, Green has to actually close the gap between spec intent and code behavior, using error messages as noisy gradient signal rather than exact targets. Their reward is "tests that were failing now pass," and the only reliable strategy to get there is faithful implementation.

Refactor Team, improve code quality without changing behavior. They can see implementation but are constrained by tests passing. Rewarded by nothing changing (pretty unusual in this regard). Reward is that all tests stay green while code quality metrics improve. They're optimizing a secondary objective (readability, simplicity, modularity, etc.) under a hard constraint (behavioral equivalence). The spec barrier ensures they can't redefine "improvement" to include feature work. If you have any code quality tools, it makes sense to give the necessary skills to use them to this team.

It's worth being honest about the limits. The spec itself is a shared artifact visible to both Red and Green, so if the spec is vague, both agents might converge on the same wrong interpretation, and the tests will pass for the wrong reason. The Coordinator (your main claude/codex/whatever instance) mitigates this by watching for suspiciously easy green passes (just tell it) and probing the spec for ambiguity, but it's not a complete defense.

w4yai 23 hours ago

You guys are describing wonderful things, but I've yet to see any implementation. I tried coding my own agents, yet the results were disappointing.

What kind of setup do you use ? Can you share ? How much does it cost ?

dworks 23 hours ago

rlm-workflow does all that TDD for you: https://skills.sh/doubleuuser/rlm-workflow/rlm-workflow

(I built it)

cheema33 21 hours ago

Why make powershell a requirement? I like powershell, but Python is very common and already installed on many dev systems.

dworks 11 hours ago

Sorry about that. Let me push an update.

_ink_ 23 hours ago

Thanks for sharing. What does RLM stand for? Any idea why the socket security test fails?

stavros 23 hours ago

Recursive language models: https://github.com/doubleuuser/rlm-workflow

aprdm 21 hours ago

If you are not spending 5-10k dollars a month for interesting projects, you likely won't see interesting results

cube00 20 hours ago

Sounds a lot like paying for online ads, they don't work because you're not paying enough, when in reality bots, scrapers and now agents are just running up all the clicks.

You pay more to try and get above that noise and hope you'll reach an actual human.

The new "fast mode" that burns tokens at 6 times the rate is just scary because that's what everyone still soon say we all need to be using to get results.

zarzavat 19 hours ago

It feels like everyone's gone mad.

Here I am mostly writing code by hand, with some AI assistant help. I have a Claude subscription but only use it occasionally because it can take more time to review and fix the generated code as it would to hand-write it. Claude only saves me time on a minority of tasks where it's faster to prompt than hand-write.

And then I read about people spending hundreds or thousands of dollars a month on this stuff. Doesn't that turn your codebase into an unreadable mess?

foo42 13 hours ago

I've been thinking about this recently and it seems like the most enthusiastic boosters always suggest difference in results is a skill issue, but I feel like there are 4 factors which multiply out to influence how much value someone gets: - The quality of model output for _your particular domain / tech stack_. Models will always do better with languages and libraries they see a lot of than esoteric or proprietary - The degree to which "works" = "good" in your scenario. For a one off script, "works" is all that matters, for a long lived core library, there are other considerations. - The degree to which "works" can be easily (best yet, automatically) verified. - Techniques, existing code cleanliness, documentation etc.

Boosters tend to lay all different experiences at the feet of this last, yet I'd argue the others are equally significant.

On the other hand, if you want to get the best results you can given the first 3 (which are generally out of one's control) then don't presume there's nothing you can do to improve the 4th.

aprdm 17 hours ago

Why read code when you are getting results fast ? See https://steve-yegge.medium.com/welcome-to-gas-town-4f25ee16d...

I am not kidding. People don't seem to understand what's actually happening in our industry. See https://www.linkedin.com/posts/johubbard_github-eleutherailm...

zarzavat 15 hours ago

I'm not getting results. That's the point. Claude doesn't fucking work without human intervention. When left to its own devices it makes bad decisions. It writes bad code. It needs constant supervision to stop it from going off the rails and replacing working code with broken code. It doesn't know what it's doing!

It's about as far as you can get from being able to work independently.

Yegge is an entertainer. Gas Town is performance art, it's not meant to be taken seriously.

aprdm 4 hours ago

How much are you spending ? See initial post of the thread. My team has no problems with it, they are spending each 5-10k per month.

godelski 17 hours ago

Why is everyone obsessed with Mac Minis. They're awesome but for the work that these people are attempting to do? Just seems... nonsensical. Renting a server is cheaper and still just as "local" as any of this (they want "self hosted", I don't think anyone cares about local. Like are people air gapping networks? lol)

And a senior director of Nvidia? He had several Mac Minis? I really gotta imagine a Spark is better... at least it'll be a bit smarter of a cat (I'm pretty suspicious he used a LLM to help write that post)

No time to think, gotta go fast?

duskdozer 13 hours ago

It seems like the monkey-ladders story. Someone probably just had one sitting around and it worked or needed to do something Apple-specific and that message got lost along the way

egeozcan 15 hours ago

They want access to Apple Messages. That's all there's to it AFAICT.

duskdozer 13 hours ago

These are like, jokes right?

aprdm 17 hours ago

I think the output of companies that can invest on tokens vs those who cannot will lead to crazy different outcomes in the next few years.

mrbungie 21 hours ago

I can't really tell if this is sarcasm or not.

Culonavirus 15 hours ago

That's how half of these "agents" posts feel to me in general.

throwaway7783 21 hours ago

We have a very uncomplicated setup with claude code. A CLAUDE.md with instructions and notes about the repo and how to run stuff. We also do code reviews with Claude Code, but in a separate session.

It works wonderfully well. Costs about $200USD per developer per month as of now.

gopher_space 16 hours ago

Paste the comment you replied to into a LLM good at planning. That’s something the codex/claude setups can create for you with a little back and forth.

canadiantim 22 hours ago

Check out Mike Pocock’s work, he’s done excellent work writing about red green refactor and has a GitHub repo for his skills. Read and take what you need from his tdd skill and incorporate it into your own tdd skill tailored for your project.

nojito 21 hours ago

This is just ai slop. If you follow what the actual designers of Claude/GPT tell you it flys in the face of building out over engineered harnesses for agents.

throwaway7783 21 hours ago

I agree with this. There is not a lot of harnesses/wrapping needed for Claude Code.

canadiantim 20 hours ago

You don't need a harness beyond Claude Code, but honestly it's foolish to think you shouldn't be building out extra skills to help your workflow. A TDD skill that does red-green-refactoring is using Claude Code exactly as how it's meant to be used. They pioneered skills.

throwaway7783 4 hours ago

Yep, not saying we don't need skills. Just harnesses.

canadiantim 21 hours ago

Works better than standard claude / gpt, which doesn't do red-green-refactor. Doesn't seem like slop when it meaningfully changes the results for the better, consistently. Really is a game-changer. You should consider trying it.

nojito 20 hours ago

I do do TDD but using skills in this way is an anti-pattern for a multitude of reasons.

canadiantim 20 hours ago

I don't think just saying it's an anti-pattern for a multitude of reasons and then not naming any is sufficiently going to convince anyone it's an anti-pattern.

This is in fact precisely what skills is meant for and is the opposite of an anti-pattern, but more like best practice now. It's explicitly using the skills framework precisely how it was meant to be used.

tomtom1337 2 days ago

This is very interesting, but like sibling comments, I'm very curious as to how you run this in practice. Do you just tell Claude/Copilot to do what you describe?

And do you have any prompts to share?

throwaway7783 21 hours ago

You don't need most of this. Prompts are also normally what you would say to another engineer.

* There is a lot of duplication between A & B. Refactor this.

* Look at ticket X and give me a root cause

* Add support for three new types of credentials - Basic Auth, Bearer Token and OAuth Client Creds

Claude.md has stuff like "Here's how you run the frontend. here's how u run backend. This module support frontend. That module is batch jobs. Always start commit messages with ticket number. Always run compile at the top level. When you make code changes, always add tests" etc etc

Exoristos 19 hours ago

They never do.

duskdozer 13 hours ago

kolinko 14 hours ago

Not Claude/Copilot. Claude.

seer 18 hours ago

This seems quite amazing really, thanks for sharing

What is the scope of projects / features you’ve seen this be successful at?

Do you have a step before where an agent verifies that your new feature spec is not contradictory, ambiguous etc. Maybe as reviewed with regards to all the current feature sets?

Do you make this a cycle per step - by breaking down the feature to small implementable and verifiable sub-features and coding them in sequence, or do you tell it to write all the tests first and then have at it with implementation and refactoring?

Why not refactor-red-green-refactor cycle? E.g. a lot of the time it is worth refactoring the existing code first, to make a new implementation easier, is it worth encoding this into the harness?

aray07 6 hours ago

I do it per feature, not per step. Write the AC for the whole feature upfront, then the agent builds against it. I haven't added a spec-validation step before coding but that's a good idea. Catching ambiguity in the spec before the agent runs with it would save a lot of rework

devinplatt 17 hours ago

I'm curious how this works if the green team writes an implementation that makes a network call like an RPC.

Red team might not anticipate this if the spec does detail every expected RPC (which seems unreasonable: this could vary based on implementation). But a unit test would need mocks.

Is green team allowed to suggest mocks to add to the test? (Even if they can't read the tests themselves?) This also seems gamaeable though (e.g. mock the entire implementation). Unless another agent makes a judgement call on the reasonability of the mock (though that starts to feel like code review more generally).

Maybe record/replay tests could work? But there are drawbacks in the added complexity.

codethief 14 hours ago

I think the solution here is: Don't mock and inject dependencies explicitly, as function parameters / monads / algebraic effects. Make side effects part of the spec/interface.

esperent 18 hours ago

Someone directly down from you suggested looking up Mike Postock's TDD skill, so I did:

https://github.com/mattpocock/skills/blob/main/tdd%2FSKILL.m...

Everything below quoted from that skill, and serves as a much better rebuttal than I had started writing:

DO NOT write all tests first, then all implementation. This is "horizontal slicing" - treating RED as "write all tests" and GREEN as "write all code."

This produces crap tests:

Tests written in bulk test imagined behavior, not actual behavior You end up testing the shape of things (data structures, function signatures) rather than user-facing behavior Tests become insensitive to real changes - they pass when behavior breaks, fail when behavior is fine

You outrun your headlights, committing to test structure before understanding the implementation

Correct approach:

Vertical slices via tracer bullets.

One test → one implementation → repeat. Each test responds to what you learned from the previous cycle. Because you just wrote the code, you know exactly what behavior matters and how to verify it.

fsckboy 17 hours ago

>One test → one implementation → repeat.

>Because you just wrote the code, you know exactly what behavior matters and how to verify it.

what you go on to describe is

One implementation → one test → repeat.

bcrosby95 16 hours ago

Seems like red team is incentivized to write tests that violate the spec since you're rewarding failed tests.

xienze 23 hours ago

This seems like a tremendous amount of planning, babysitting, verification, and token cost just to avoid writing code and tests yourself.

habinero 23 hours ago

It's assigning yourself the literal worst parts of the job - writing specs, docs, tests and reading someone else's code.

zarzavat 19 hours ago

There's a real disconnect. I was talking to a junior developer and they were telling me how Claude is so much smarter than them and they feel inferior.

I couldn't relate. From my perspective as a senior, Claude is dumb as bricks. Though useful nonetheless.

I believe that if you're substantially below Claude's level then you just trust whatever it says. The only variables you control are how much money you spend, how much markdown you can produce, and how you arrange your agents.

But I don't understand how the juniors on HN have so much money to throw at this technology.

godelski 16 hours ago

  > I was talking to a junior developer and they were telling me how Claude is so much smarter than them and they feel inferior.

Every time I talk to a wizard I feel like they're so much smarter than me and it makes me feel inferior.

So I take that feeling and use it to drive me to become a wizard like them. I've generally found that wizards are very happy to take on apprentices.

I'm not trying to call Claude a wizard (I have similar feelings to you), but more that I don't understand that junior's take. We all feel dumb. All but time. Even the wizards! But it's that feeling that drives you to better yourself and it's what turns you into a wizard.

Honestly so much of what I hear from the "AI does all my coding" crowd just sounds very junior. It's just the same like how a year or two ago they were saying "it does the repetitive stuff". Isn't that what functions, libraries, functors, templates, and other abstractions are for? It feels like we're back to that laughable productivity metric of lines of code or number of commits. I don't know why we love our cargo cults. It seems people are putting so much effort into their cargo cults that they could have invented a real airplane by now.

tayo42 16 hours ago

It's 20 dollars a month to use...

zarzavat 15 hours ago

Yes for the basic plan. However there are people who claim to use the API and spend hundreds, or thousands, of dollars a month.

duskdozer 13 hours ago

It just seems totally crazy to me, I don't understand how wrestling with this slot machine is even mentally easier

gedy 22 hours ago

Yes with the reward of: I don't understand this code and didn't learn anything incrementally about the feature I "planned".

godelski 16 hours ago

Well they probably have the same ability to evaluate the correctness of a feature as a middle manager with a Harvard business degree

tnecio 20 hours ago

How do you make sure Red Team doesn't just write subtly broken tests?

skybrian 2 days ago

How do you define visibility rules? Is that possible for subagents?

egeozcan 2 days ago

AFAIK Claude doesn't support it, but if you're willing to go the extra mile, you can get creative with some bash script: https://pastebin.com/raw/m9YQ8MyS (generated this a second ago - just to get the point across )

To be clear, I don't do this. I never saw an agent cheat by peeking or something. I really did look through their logs.

I'd be very interested to see claude code and other tools support this pattern when dispatching agents to be really sure.

achierius 24 hours ago

> To be clear, I don't do this.

How do you know that it works then? Are you using a different tool that does support it?

skybrian 24 hours ago

So what do you do? Do you define roles somewhere and tell the agent to assign these roles to subagents?

ssk42 23 hours ago

Fun to see you not on tildes.

Setting up a clean room is one of the only ways to do Evals on agentic harnesses. Especially prevalent with Windsurf which doesn’t have an easy CLI start.

So how? The easiest answer when allowed is docker. Literally new image per prompt. There’s also flags with Claude to not use memory and from there you can use -p to have it just be like a normal cli tool. Windsurf requires manual effort of starting it up in a new dir.

skybrian 21 hours ago

Sounds interesting, but I'm not quite getting the relevance for people writing code with an agent. Should I be doing evals?

ssk42 19 hours ago

Well I mean yes. I think people ought be aware for how the harnesses compare for their stacks. But clean room applies for this RGR situation too

novaleaf 18 hours ago

you are replying to a bot, that's why.

ssk42 8 hours ago

What

lagrange77 2 days ago

> Reward hacking is very real and hard to guard against.

Is it really about rewards? Im genuinely curious. Because its not a RL model.

gbnwl 2 days ago

I'm noticing terms related to DL/RL/NLP are being used more and more informally as AI takes over more of the cultural zeitgeist and people want to use the fancy new terms of the era, even if inaccurately. A friend told me he "trained and fine tuned a custom agent" for his work when what he meant was he modified a claude.md file.

collingreen 18 hours ago

Respectfully, your friend doesn't know what he is talking about and is saying things that just "feel right" (vibe talking??). Which might be exactly how technical terms lose their meaning so perhaps you're exactly right.

hexaga 24 hours ago

There is a nontrivial amount of RL training (RLHF, RLVR, ...), so it would be reasonable to call it an RL model.

And with that comes reward hacking - which isn't really about looking for more reward but rather that the model has learned patterns of behavior that got reward in the train env.

That is, any kind of vulnerability in the train env manifests as something you'd recognize as reward hacking in the real world: making tests pass _no matter what_ (because the train env rewarded that behavior), being wildly sycophantic (because the human evaluators rewarded that behavior), etc.

lagrange77 21 hours ago

> There is a nontrivial amount of RL training (RLHF, RLVR, ...), so it would be reasonable to call it an RL model.

Hm, as i understand it, parts of the training of e.g. ChatGPT could be called RL models. But the subject to be trained/fine tuned is still a seq2seq next token predictor transformer neural net.

hexaga 21 hours ago

RL is simply a broad category of training methods. It's not really an architecture per se: modern GPTs are trained first on reconstruction objective on massive text corpora (the 'large language' part), then on various RL objectives +/- more post-training depending on which lab.

magicalist 2 days ago

> Is it really about rewards? Im genuinely curious. Because its not a RL model.

Ha, good point. I was using it informally (you could handwave and call it an intrinsic reward if a model is well aligned to completing tasks as requested), but I hadn't really thought about it.

Searching around, it seems like I'm not alone, but it looks like "specification gaming" is also sometimes used, like: https://deepmind.google/blog/specification-gaming-the-flip-s...

nurettin 2 days ago

They probably meant goal hacking. (I just made that up)

taneq 18 hours ago

I refer to it as ‘wanking’. It’s doing something that’s unproductive but that’s incentivised by its architecture.

lagrange77 7 hours ago

I'll use that term from now on. :D

SoftTalker 2 days ago

A refactor should not affect the tests at all should it? If it does, it's more than a refactor.

gchamonlive 2 days ago

It can if your refactor needs to deal with interface changes, like moving methods around, changing argument order etc... all these need to propagate to the tests

bluGill 24 hours ago

Your tests are an assertion that 'no matter what this will never change'. If your interface can change then you are testing implementation details instead of the behavior users care about.

the above is really hard. A lot of tdd 'experts' don't understand is and teach fragile tests that are not worth having.

8note 21 hours ago

https://www.hyrumslaw.com/

your implementation is your interface. its a bit naive or hating-your-users to assume your tests are what your users care about. theyre dealing with everything, regardless of what youve tested or not.

SirSavary 20 hours ago

Hyrum's law is about the real consumers/users (inadvertently) depending on any observable behaviour they can get their hands on.

TDD/BDD tests are meant to define the intended contract of a system.

These are not the same thing.

switchbak 22 hours ago

Refactoring is changing the design of the code without affecting the behaviour.

You can change an interface and not change the behaviour.

I have rarely heard such a rigid interpretation such as this.

gchamonlive 18 hours ago

Sure if you are changing your interfaces a lot you either are leaking abstractions or you aren't designing your interfaces well.

But things evolve with time. Not only your software is required to do things it wasn't originally designed to do, but your understanding of the domain evolve, and what once was fine becomes obsolete or insufficient.

imiric 17 hours ago

> Your tests are an assertion that 'no matter what this will never change'.

That's a strange definition. A lot of software should change in order to adapt to emerging requirements. Refactorings are often needed to make those changes easier, or to improve the codebase in ways that are transparent to users. This doesn't mean that the interfaces remain static.

> If your interface can change then you are testing implementation details instead of the behavior users care about.

Your APIs also have users. If you're only testing end-user interfaces, you're disregarding the users of your libraries and modules, e.g. your teammates and yourself.

Implementation details are contextual. To end-users, everything behind the external UI is an implementation detail. To other programmers, the implementation of a library, module, or even a single function can be a detail. That doesn't mean that its functionality shouldn't be tested. And, yes, sometimes that entails updating tests, but tests are code like any other, and also require maintenance and care.

bluGill 9 hours ago

> A lot of software should change in order to adapt to emerging requirements.

True, but your tests should still aim to be testing the type of thing that will never change unless a customer requirement is changed, not because you want to refactor something.

This is of course impossible, but it should still be your goal.

>Your APIs also have users

Exactly - so test those APIs that have users, not the internal implementation details. If an API has users it quickly becomes an Augean Stables problem to change them and so you won't touch that API if you can at all help it (you may add a new/better way and slowly convert everyone, but it will be a decade before you can get rid of the old one)

> To other programmers, the implementation of a library, module, or even a single function can be a detail

other programmers are sometimes customers/users. If you are writing a logging system (what I happen to be working on today) your end users may never be allowed to see anything related to your system, but you expect to quickly have so many people calling Log() that you can't change the interface. By contrast you may be able to log to a file or a network socket - test those two backends by calling log() like your end users would and not be calling whatever the interface between the frontend (that selects which backend to use) and the backend is.

Again, the goal is to never update tests once written. I'm under no illusions you will (or even should) achieve this, but it is the goal.

magicalist 2 days ago

It depends on what you mean by "refactor" and how exactly you're testing, I guess, but that's not really at the heart of the point. red-green-refactor could also be used for adding new features, for instance, or an entire codebase, I guess.

skeledrew 13 hours ago

Did you look at the actual coverage yourself though? That's something I always have active, and if I see it low in an area, I'll be asking questions.

eru 17 hours ago

> Useless tests start grow in count and important new things aren't tested or aren't tested well.

You can use coverage information, and you should cull your tests every once in a while I guess.

Property based testing also helps.

esafak 17 hours ago

Periodically reviewing tests is worthwhile but rarely done; writing tests alone is already disliked.

SequoiaHope 2 days ago

I’m telling it to use red/green tdd [1] and it will write test that don’t fail and then says “ah the issue is already fixed” and then move on. You really have to watch it very closely. I’m having a huge problem with bad tests in my system despite a “governance model” that I always refer it to which requires red/green tdd.

[1] https://simonwillison.net/guides/agentic-engineering-pattern...

joegaebel 22 hours ago

I've been able to encode Outside-in Test Driven Development into a repeatable workflow. Claude Code follows it to a T, and I've gotten great results. I've written about it more here, and created a repo people can use out of the box to try it out:

https://www.joegaebel.com/articles/principled-agentic-softwa... https://github.com/JoeGaebel/outside-in-tdd-starter

codethief 13 hours ago

Thanks for sharing!

> When asking Claude Code to write tests, I find they are inevitably coupled to implementation details, mockist, brittle, and missing coverage.

Interestingly, I haven't noticed any of that so far, using Claude Code on a new-ish project (couple 10k loc). However, I also went out of my way in my CLAUDE.md to instruct it to write functional code, avoid side effects / push side effects to the shell (functional core, imperative shell), avoid mocks in tests, etc. etc.

joegaebel 10 hours ago

That's the kicker - you taught it conventions that it follows. Even better (in my view) is to ensure you're getting higher coverage and tests focused on behaviour by getting it to write the tests first.

Even moreso by ensuring it writes "feature complete" tests for each feature first.

Even moreso by running mutation testing to backfill tests for logic it didn't cover.

codybontecou 2 days ago

This sounds interesting. Can you go a bit deeper or provide references on how to implement the green/red/refactor subagent pattern?

pastescreenshot 2 days ago

What has worked better for me is splitting authority, not just prompts. One agent can touch app code, one can only write failing tests plus a short bug hypothesis, and one only reviews the diff and test output. Also make test files read only for the coding agent. That cuts out a surprising amount of self-grading behavior.

huslage 24 hours ago

How do you limit access like that?

elemeno 2 days ago

It’s not an agentic pattern, it’s an approach to test driven development.

You write a failing test for the new functionality that you’re going to add (which doesn’t exist yet, so the test is red). You then write the code until the test passes (that is, goes green).

dworks 23 hours ago

I built rlm-workflow which has stage gating, TDD and sub-agent support: https://skills.sh/doubleuuser/rlm-workflow/rlm-workflow

dmd 2 days ago

That's the cool bit - you don't have to. CC is perfectly well aware and competent to implement it; just tell it to.

irishcoffee 2 days ago

"So this is how liberty dies... with thunderous applause.” - Padmé Amidala

s/liberty/knowledge

kaizenb 17 hours ago

Works for PR reviews. Separating context for code review with the same model has significant impact.

● Separation of concerns. No single agent plans, implements, and verifies. The agent that writes the code is never the agent that checks it.

osigurdson 24 hours ago

So more stuff happens with this approach but how do you know what it generates is correct?

afro88 2 days ago

Good idea, and an improvement, but you still have that fundamental issue: you don't really know what code has been written. You don't know the refactors are right, in alignment with existing patterns etc.

aray07 2 days ago

thats a great idea - i have been using codex to do my code reviews since i have it to give better critique on code written by claude but havent tried it with testing yet!

darkbatman 2 days ago

codex/gpt is a stubborn model, doubt it would accept claude reviews or counter it. have seen cases where claude is more willing to comply if shared feedback though its just sycophancy too.

Skidaddle 2 days ago

How exactly do you set up your CC sessions to do this?

bhouston 2 days ago

I call this "Test Theatre" and it is real. I wrote about it last year:

https://benhouston3d.com/blog/the-rise-of-test-theater

You have to actively work against it.

JBorrow 24 hours ago

Yeah, having your agent write 3x the code in exhaustive tests (I tried this recently and got 600 lines of tests for my 100 lines of code!) sure makes things look great, but when you actually look at the content of the tests they’re meaningless. Good tests validate the use of design patterns, ensure that dependencies hold, and are meaningful (e.g. shortcut debugging by setting up useful state) when they break.

jakewins 2 days ago

This was really good, and second leaning on property testing. I’ve had really good outcomes from setting up Schemathesis and getting blanket coverage for stuff like “there should be no request you can generate as logged in user A that let’s you do things as or see things that belong to user B”, as well as “there should be no request you can find to any API endpoint that can trigger a 5xx response”

aray07 2 days ago

Test theatre is exactly the right framing. The tests are syntactically correct, they run, they pass but do they actually prove anything?

joegaebel 21 hours ago

I've found the best way to achieve that is to force the agent to do TDD. Better to get it to do Outside-in TDD. Even better to get it to run Outside-in TDD, then use mutation testing to ensure it has fully covered the logic.

I've written about this and have a POC here for those interested: https://www.joegaebel.com/articles/principled-agentic-softwa...

anhner 12 hours ago

You mean an LLM wrote an article for you. I got bored not even halfway through because it wrote a wall of text but it didn't actually say anything.

joegaebel 10 hours ago

Actually, every single word was hand typed. You'll probably notice that in areas where I could improve my grammar. It's understandable that you hit the wall of text and felt a bit dismayed by the length- hence the TLDR at the top and the example repo :)

what 20 hours ago

Test theatre isn’t new. Most people writing tests do the exact same thing, testing implementation.

UK-Al05 12 hours ago

This is why people don't see the problem with tests that agents generate after the implementation. They look exactly like what they write.

oxag3n 3 hours ago

This is Java EE all over again.

When I graduated in 2012 it was pushed everywhere, including my uni so my undergrad thesis was done in Java.

Everyone was learning it, certifying, building things on top of other things.

EJB, JPA, JTA, JNDI, JMS and JCA.

And them more things to make it even more powerful with Servlets, JSP, JSTL, JSF.

Many companies invested and built various application servers, used by enterprises by this day.

Every engineer I've met said Java is server side future, don't bother with other tech. You'll just draw data schema, persistence mapping, business logic and ship it.

I switched to C++ after Bjarne's talk I attended in 2013. I'm glad I did although I never worked as a software engineer. Following passion and going deep into technology was a bliss for me, the difference between my undergrad Java, Master C++ and Rust PhD is like a kids toy and a real turboprop engine.

Don't follow the hype - it will go away and you'll be left with what you've invested into.

wesselbindt 22 hours ago

Does anyone know what this guy is having his agents build? Bc I looked a bit and all I see him ship is linkedin posts about Claude.

misja111 14 hours ago

I can't imagine he's building anything serious. How can you claim otherwise that your agents were deploying code that you couldn't verify. Imagine doing that in any serious business ..

teejmya 6 hours ago

It's okay, you don't have to imagine anymore...

https://news.ycombinator.com/item?id=47324211

gedy 22 hours ago

Yeah maybe I'm just old but in 25 years in industry - not one company has needed this much code that fast. They may insist they do but then it sits while they figure out how to sell, or the inevitable "oh wait, we didn't think about that..."

wreath 14 hours ago

Exactly! How do other parts of the organization deal with this avalanche of features in terms of documenting, pricing and packaging, marketing, selling and getting feedback on them. How do users adopt these features and incorporate them in their workflows so fast? Never in my career was the speed of writing code alone the bottleneck.

otabdeveloper4 13 hours ago

> deal with this avalanche of features

You mean avalanche of bugs and technical debt.

wreath 12 hours ago

Technical debt is never a problem now since only AI reads code /s

lostapathy 20 hours ago

So much of this - never would have guessed how much code I wouldn't write doing this as a career.

hinkley 21 hours ago

Hurry up and wait.

RealityVoid 2 days ago

It's... really the same problem when you hire people to just write tests. A lot of time it just confirms that the code does what the code does. Having clear specs of what the code should do make things better and clearer.

SoftTalker 2 days ago

Yep, tests written after the fact are just verifying tautologies.

> Most teams don't [write tests first] because thinking through what the code should do before writing it takes time they don't have.

It's astonishing to me how much our industry repeats the same mistakes over and over. This doesn't seem like what other engineering disciplines do. Or is this just me not knowing what it looks like behind the curtain of those fields?

yurishimo 2 days ago

When push comes to shove, software can usually be fudged. Unlike a building or a water treatment plant where the first fuck up could mean that people die.

I like to think that people writing actual mission critical software try their absolute best to get it right before shipping and that the rest our industry exists in a totally separate world where a bug in the code is just actually not that big of a deal. Yeah, it might be expensive to fix, but usually it can be reverted or patched with only an inconvenience to the user and to the business.

It’s like the fines that multinational companies pay when breaking the law. If it’s a cost of doing business, it’s baked into the price of the product.

You see this also in other industries. OSHA violations on a residential construction site? I bet you can find a dozen if you really care to look. But 99% of the time, there are no consequences big enough for people to care so nobody wears their PPE because it “slows them down” or “makes them less nimble”. Sound familiar?

AlotOfReading 17 hours ago

    I like to think that people writing actual mission critical software try their absolute best to get it right before shipping.

People try, but the only fundamentally different part is that you spend time thinking about and documenting your process rather than just doing it. There's always one more bug. Usually there ends up being a human covering up for the system's failures somewhere that no one else notices. That's the driver in the car, or the factory tech who adjusts things just a bit.

girvo 20 hours ago

Quite. We’re far more similar to construction workers than we are civil engineers, despite the lofty title we like to bestow upon ourselves.

Ekaros 16 hours ago

And from slightly different view. What we make is not output of modern mass production. With highly tuned and most of time perfectly matching parts build into one unit.

Instead we make pre-mass production bespoke products where each part is slightly filled and fitted together from bunch of random components. Say the barrel can't be changed between two different handguns. We just have magic technology to replicate the single gun multiple times. Does not mean it is actually mass-produced in sense say our current power tools are.

hrmtst93837 12 hours ago

Software is unusually forgiving about skipping process because the worst outcome is usually a rollback, not physical damage or liability. In electronic and civil engineering the cost of mistakes is much higher so up-front planning is the norm. If a bug in code bricked production hardware overnight you'd see test-driven practices everywhere.

gonzalohm 24 hours ago

That's probably dependent on your specific area of work. For most projects, It's okayish to deploy code with bugs. There will be future releases that fix those bugs and add improvements. Obviously that's not the case with high risk systems like space rockets software and similar.

With other engineering professions, all projects are like that. You cannot "deploy a bridge to production" to see what happens and fix it after a few have died

tibbar 2 days ago

a lot of the value of tests is confirming that the system hasn't regressed beyond the behavior at the original release. It's bad if the original release is wrong, but a separate issue is if the system later accidentally stops behaving the way it did originally.

InsideOutSanta 2 days ago

The issue I see is that the high test coverage created by having LLMs write tests results in almost all non-trivial changes breaking tests, even if they don't change behavior in ways that are visible from the outside. In one project I work, we require 100% test coverage, so people just have LLMs write tons of tests, and now every change I make to the code base always breaks tests.

So now people just ignore broken tests.

> Claude, please implement this feature.

> Claude, please fix the tests.

The only thing we've gained from this is that we can brag about test coverage.

hinkley 21 hours ago

My best unit tests are 3 lines, one of them whitespace, and they assert one single thing that's in the requirements.

These are the only tests I've witnessed people delete outright when the requirements change. Anything more complex than this, they'll worry that there's some secondary assertion being implied by a test so they can't just delete it.

Which, really is just experience telling them that the code smells they see in the tests are actually part of the test.

meanwhile:

    it("only has one shipping address", ...

is demonstrably a dead test when the story is, "allow users to have multiple shipping addresses", as is a test that makes sure balances can't go negative when we decide to allow a 5 day grace period on account balances. But if it's just one of six asserts in the same massive tests, then people get nervous and start losing time.

ForHackernews 24 hours ago

Unit tests vs acceptance tests. You shouldn't be afraid to throw away unit tests if the implementation changes, and acceptance tests should verify behavior at API boundaries, ignoring implementation details.

hinkley 21 hours ago

BDD helps with this as it can allow you to get the setup out of the tests making it even cheaper for someone to yeet a defunct test.

mattmanser 2 days ago

I feel it end up a massive drag on development velocity and makes refactoring to simpler designs incredibly painful.

But hey, we're just supposed to let the AIs run wild and rewrite everything every change so maybe that's a heretic view.

duskdozer 10 hours ago

>simpler designs

Some complex design might just be hacks on hacks, but some are chesterton's fences

predkambrij 7 hours ago

Yeah. The purpose of tests is to verify correctness of their setUp mocks :)

aray07 2 days ago

yup agree - i think have specs and then do verifications against the spec. I have heard that this is how a lot of consulting firms work - you have acceptance criterias and thats how work is validated.

pmf_hunter 9 hours ago

Been running 6 AI agents for my solo operation for a few months now. One does market research, another writes content, third handles video scripts. Not coding agents - business operations agents.

The overnight thing is real but overhyped. What actually works is giving agents very narrow tasks with clear success criteria. "Research top 10 Reddit threads about X and summarize pain points" works great. "Build me a feature" overnight is a coin flip.

Biggest lesson: the bottleneck moved from execution to context management. Getting agents to remember what matters and forget what doesn't is harder than the actual task delegation.

joenot443 7 hours ago

This seems interesting. What kind of products are you working on?

Do you find there's still much juice to be squeezed from the reddit approach?

ctdinjeu2 5 hours ago

It’s not yet possible due to context size limitations.

LLMs can’t retain most codebases nor even most code files accurately - they start making serious mistakes at ~500 lines.

Paste a ~200 line React component or API endpoint, have it fix or add something, it’s fine, but paste a huge file, it starts omitting pieces, making mistakes, and it gets worse as time goes on.

You have to keep reminding it by repeatedly refreshing context with the part in question.

Everyone who has seriously tried knows this.

For this reason alone the LLM “agent” is simply not one. Not yet. It cannot really drive itself and it’s a fundamental limitation of the technology.

Someone who knows more about model architecture might be able to chime in on why increasing the context size will/won’t help agents retain a larger working memory to acceptable degrees of accuracy, but as it stands it’s so limited that it works more like a calculator that you must actively use rather than an autonomous agent.

jspdown 14 hours ago

At this stage, AI is no longer a tool that enhances your ability to ship code, it has replaced you entirely in that role. You don't control what is shipped, and you can't verify if it's correct. That's a serious problem! As software engineers, we remain accountable for code we no longer fully understand.

Then, what comes next feels less like a new software practice and more like a new religion, where trust has to replaces understanding, and the code is no longer ours to question.

misja111 14 hours ago

Speak for yourself, I don't ship any code that I don't fully understand. Yes that requires less autonomous AI and less frequent merging. But I don't even want to think about the disasters that could happen if you really get into the habit of shipping code you can't verify or understand.

voidUpdate 13 hours ago

Out of interest, what's the speedup between having an LLM write code for you and then having to go through and understand it, vs writing code that you understand immediately because you wrote it?

misja111 8 hours ago

It depends. If all I want is some prototype or pet code project, my LLM can write most by itself. The speedup could be 10 times or more. However, if I'd let a LLM write code for my work, I'd have to very thoroughly review it and most likely ask it to rewrite it several times. Each time this would require a new review of course. There would still be a speed up but I guess at most somewhere around 25%.

In practice I try to combine the best of both worlds. I write some code by myself and rely on my LLM for parts that are not too big and where I expect it to do a pretty good job.

miningape 8 hours ago

Not OP but I hold myself to that standard, and the honest answer is that at best it's the same.

kolinko 14 hours ago

Or formal methods and other tools for verifying the code security?

seanmcdirmid 2 days ago

I've been doing differential testing in Gemini CLI using sub-agents. The idea is:

1. one agent writes/updates code from the spec

2. one agent writes/updates tests from identified edge cases in the spec.

3. a QA agent runs the tests against the code. When a test fails, it examines the code and the test (the only agent that can see both) to determine blame, then gives feedback to the code and/or test writing agent on what it perceives the problem as so they can update their code.

(repeat 1 and/or 2 then 3 until all tests pass)

Since the code can never fix itself to directly pass the test and the test can never fix itself to accept the behavior of the code, you have some independence. The failure case is that the tests simply never pass, not that the test writer and code writer agents both have the same incorrect understanding of the spec (which is very improbable, like something that will happen before the heat death of the universe improbable, it is much more likely the spec isn't well grounded/ambiguous/contradictory or that the problem is too big for the LLM to handle and so the tests simply never wind up passing).

jeremyjh 24 hours ago

Where is the interface defined ? If it is just the coder reading the test it can hard code specific cases based on the test setup/fixture data.

seanmcdirmid 23 hours ago

There is a specification and the interface is defined from that. The coder never gets to see the test.

A7OM 2 hours ago

Cost is the silent killer of always-on agents. We track inference pricing across 40+ vendors daily and the gap between the most and least expensive option for the same model can be over 90%. Worth building cost awareness into your agent stack from day one. a7om.com/mcp

afro88 2 days ago

I guess to reach this point you have already decided you don't care what the code looks like.

Something I'm starting to struggle with is when agents can now do longer and more complex tasks, how do you review all the code?

Last week I did about 4 weeks of work over 2 days first with long running agents working against plans and checklists, then smaller task clean ups, bugfixes and refactors. But all this code needs to be reviewed by myself and members from my team. How do we do this properly? It's like 20k of line changes over 30-40 commits. There's no proper solution to this problem yet.

One solution is to start from scratch again, using this branch as a reference, to reimplement in smaller PRs. I'm not sure this would actually save time overall though.

tdeck 19 hours ago

If you haven't reviewed the code yet, how can you say it did 4 weeks of work in 2 days? You haven't verified the correctness, and besides reviewing the code is part of the work.

afro88 12 hours ago

That's what I was getting at. With the review and potential rework time, we could be looking at over the original 4 week estimate. So then what's the point in using long running unsupervised agents if it ends up being longer than doing it in small chunks.

eikenberry 22 hours ago

The proper solution is to treat the agent generated code like assembly... IE. don't review it. Agents are the compiler for your inputs (prompts, context, etc). If you care about code quality you should have people writing it with AI help, not the other way around.

akshaysg 2 days ago

I've been thinking a lot about this!

Redoing the work as smaller PRs might help with readability, but then you get the opposite problem: it becomes hard to hold all the PRs in your head at once and keep track of the overall purpose of the change (at least for me).

IMO the real solution is figuring out which subset of changes actually needs human review and focusing attention there. And even then, not necessarily through diffs. For larger agent-generated changes, more useful review artifacts may be things like design decisions or risky areas that were changed.

lbreakjai 23 hours ago

> Something I'm starting to struggle with is when agents can now do longer and more complex tasks, how do you review all the code?

Same as before. Small PRs, accept that you won't ship a month of code in two days. Pair program with someone else so the review is just a formality.

The value of the review is _also_ for someone else to check if you have built the right thing, not just a thing the right way, which is exponentially harder as you add code.

dumpsterdiver 20 hours ago

You’re not alone. I went from being a mediocre security engineer to a full time reviewer of LLM code reviews last week. I just read reports and report on incomplete code all day. Sometimes things get humorously worse from review to review. I take breaks by typing out the PoCs the LLMs spell out for me…

krater23 17 hours ago

I'm security engineer too and when it really will come so far that I only review LLM code I refuse to do it for fewer than my doubled hourly rate.

kg 2 days ago

It sounds like you know this but what happened is that you didn't do 4 weeks of work over 2 days, you got started on 4 weeks of work over 2 days, and now you have to finish all 4 weeks worth of work and that might take an indeterminate amount of time.

If you find a big problem in commit #20 of #40, you'll have to potentially redo the last 20 commits, which is a pain.

You seem to be gated on your review bandwidth and what you probably want to do is apply backpressure - stop generating new AI code if the code you previously generated hasn't gone through review yet, or limit yourself to say 3 PRs in review at any given time. Otherwise you're just wasting tokens on code that might get thrown out. After all, babysitting the agents is probably not 'free' for you either, even if it's easier than writing code by hand.

Of course if all this agent work is helping you identify problems and test out various designs, it's still valuable even if you end up not merging the code. But it sounds like that might not be the case?

Ideally you're still better off, you've reduced the amount of time being spent on the 'writing the PR' phase even if the 'reviewing the PR' phase is still slow.

kwanbix 2 days ago

So you have become a reviewer instead of a programmer? Is that so? hones question. And if so, what is the advantage of looking a code for 12 hours instead of coding for 12.

woah 22 hours ago

Build features faster. Granted, this exposes the difference between people who like to finish projects and people who like to get paid a lot of money for typing on a keyboard.

krater23 17 hours ago

Bullshit! You project isn't finished as long as there are obvious major bugs that you can't fix because you don't unterstand the code.

logicchains 2 days ago

>Last week I did about 4 weeks of work over 2 days first with long running agents working against plans and checklists, then smaller task clean ups, bugfixes and refactors. But all this code needs to be reviewed by myself and members from my team. How do we do this properly? It's like 20k of line changes over 30-40 commits. There's no proper solution to this problem yet.

Get an LLM to generate a list of things to check based on those plans (and pad that out yourself with anything important to you that the LLM didn't add), then have the agents check the codebase file by file for those things and report any mismatches to you. As well as some general checks like "find anything that looks incorrect/fragile/very messy/too inefficient". If any issues come up, ask the agents to fix them, then continue repeating this process until no more significant issues are reported. You can do the same for unit tests, asking the agents to make sure there are tests covering all the important things.

aray07 2 days ago

yeah honestly thats what i am struggling with too and I dont have a a good solution. However, I do think we are going to see more of this - so it will be interesting to see how we are going to handle this.

i think we will need some kind of automated verification so humans are only reviewing the “intent” of the change. started building a claude skill for this (https://github.com/opslane/verify)

afro88 24 hours ago

It's a nice idea, but how do you know the agent is aligned with what it thinks the intent is?

8note 21 hours ago

or moreso, what happens at compact boundaries where the agent completely forgets the intent

zer00eyz 2 days ago

> how do you review all the code?

Code review is a skill, as is reading code. You're going to quickly learn to master it.

> It's like 20k of line changes over 30-40 commits.

You run it, in a debugger and step through every single line along your "happy paths". You're building a mental model of execution while you watch it work.

> One solution is to start from scratch again, using this branch as a reference, to reimplement in smaller PRs. I'm not sure this would actually save time overall though.

Not going to be a time saver, but next time you want to take nibbles and bites, and then merge the branches in (with the history). The hard lesson here is around task decomposition, in line documentation (cross referenced) and digestible chunks.

But if you get step debugging running and do the hard thing of getting through reading the code you will come out the other end of the (painful) process stronger and better resourced for the future.

afro88 24 hours ago

Oh I didn't mean literally how do I review code. I meant, if an agent can write a lot of code to achieve a large task that seemingly works (from manual testing), what's the point if we haven't really solved code review? There's still that bottleneck no matter how fast you can get working code down.

daxfohl 24 hours ago

Sounds like we've just gotten into lazy mode where we believe that whatever it spits out is good enough. Or rather, we want to believe it, and convince ourselves that some simple guardrail we put up will make it true, because God forbid we have to use our own brain again.

What if instead, the goal of using agents was to increase quality while retaining velocity, rather than the current goal of increasing velocity while (trying to) retain quality? How can we make that world come to be? Because TBH that's the only agentic-oriented future that seems unlikely to end in disaster.

rglover 24 hours ago

You can't. To retain and improve quality requires care. Very few if any of the people setting stuff like this up truly care about delivering a quality result (any result is the real goal). Unless there's some incentive to care, quality will be found among the exceedingly rare people/businesses.

elar_verole 14 hours ago

I just don't understand where these people get all this money. The answer is often "oh it's just claude max" like man I don't have 200$ MONTHLY lying around ?? That's half my rent

TrackerFF 12 hours ago

Some of the Claude whales I know:

- Highly paid FAANG engineers that are working on side projects / startup ideas, and will pay whatever it takes. They have the means to do so.

- Startups with funds.

- Regular tech workers that are allowed to use the company card.

elar_verole 6 hours ago

Makes sense

booleandilemma 10 hours ago

Where are you living that your rent is $400?

elar_verole 6 hours ago

Living in Paris, but I exaggerated rent is actually 600€ and it's like the best opportunity one can have. Still, even if I lived in a regular flat with 1000 euros rent, I don't see how 20% additional "rent" money is worth it for side fun projects

duskdozer 13 hours ago

Some could be in very expensive areas where their rent could be 10x your rent

Gasp0de 8 hours ago

Why would I ever want to book a course with someone who just realized weeks ago they don't know if the code does what they want if they don't look at it?

simonpure 20 hours ago

I've been impressed by Google Jules since the Gemini 3.1 Pro update. Sometimes it's been working on a task for 4h. I've now put it in a ralph loop using a Github Action to call itself and auto merge PRs after the linter, formatter and tests pass. It does still occasionally want my approval, but most of the time I just say Sounds great!

It's currently burning through the TESTING.md backlog: https://github.com/alpeware/datachannel-clj

tdeck 19 hours ago

> A few weeks ago I realized I had no reliable way to know if any of it was correct: whether it actually does what I said it should do.

I can't understand the mindset that would lead someone not to have realized this from the beginning.

Aachen 12 hours ago

And here I am turning my computer off at night for energy consumption, while others run a few extra ones for... for what, anyway? If you're working on problems real people are having (diseases, climate change, poverty, etc.) then sure, but exacerbating the energy transition for a blog post and your personal brand as OP seems to do? How's that not criminal

macgyverismo 12 hours ago

It's not criminal as the power usage is (assumed to be) paid for. It is criminal or at least problematic that the cost of power does not include (negative) externalities, we should strive to change that.

ionwake 12 hours ago

I found your post interesting, Im just trying to understand your POV.

If you are on a sinking ship would you not do your best to position yourself?

Or do you see your actions morally equivalent to others regardless of scale?

Aachen 11 hours ago

What sinking ship?

lazystar 12 hours ago

> How's that not criminal

Well, a) it's a hobby, and b) this is still a free country/free society.

Aachen 11 hours ago

I could see the comparison to hobbies which pollute the environment, but in general people do tend to vote for reducing freedom where it harms others

wartywhoa23 11 hours ago

If I were tasked with stripping this country/world of all remaining freedom, I'd surely let bullshit like this proliferate in ordo ab chao mode, where the exact line between ordo and chao is only known to me and my henchmen, and just wait till defeated enjoyers of miserable remnants of said freedom crouch begging me to rob them of that chaos-inducing freedom.

TonyAlicea10 2 days ago

You can find approaches that improve things, but there's always going to be a chance that your code is terrible if you let an LLM generate it and don't review it with human eyes.

But review fatigue and resulting apathy is real. Devs should instead be informed if incorrect code for whatever feature or process they are working on would be high-risk to the business. Lower-risk processes can be LLM-reviewed and merged. Higher risk must be human-reviewed.

If the business you're supporting can't tolerate much incorrectness (at least until discovered), than guess what - you aren't going to get much speed increase from LLMs. I've written about and given conference talks on this over the past year. Teams can improve this problem at the requirements level: https://tonyalicea.dev/blog/entropy-tolerance-ai/

frequencyai 5 hours ago

Hm, what's actually being shipped here?

I've been playing around with agent orchestration recently and at least tried to make useful outputs. The biggest differences were having pipelines talk to each other and making most of the work deterministic scripts instead of more LLM calls (funnily enough).

Made a post about it here in case anyone is interested about the technicals: https://www.frequency.sh/blog/introducing-frequency/

9wzYQbTYsAIc 5 hours ago

Interesting, thanks for sharing.

I've been doing some DIY/citizen science type agent orchestration as well: https://blog.unratified.org/2026-03-06-receiving-side-agent-...

Not quite to the same scale, but I share the same sentiment - working through scripts instead of the LLM is an important key, I think.

frequencyai 5 hours ago

Really cool stuff, using git PRs as the transport layer is a clean pattern, also been exploring using this for other projects too.

Most inter-agent coordination I've seen relies on shared state or message queues, the .well-known discovery angle is different.

Have you experienced any bottlenecking with the human approval gate?

9wzYQbTYsAIc 5 hours ago

Thanks!

I'm just now at the point of having done a manual human-in-the-loop test once, still working through core bugs, etc., so haven't had a chance to notice bottlenecking, but i'm using github issues for the moment as the escalation channel - ideally, it won't be an issue, over time, as the system accumulates lessons and decisions, etc. that guide it away from mistakes that require human escalation.

jdlshore 2 days ago

Pet peeve: this post misunderstands “TDD.” What it really describes is acceptance tests.

TDD is a tool for working in small steps, so you get continuous feedback on your work as you go, and so you can refine your design based on how easy it is to use in practice. It’s “red green refactor repeat”, and each step is only a handful of lines of code.

TDD is not “write the tests, then write the code.” It’s “write the tests while writing the code, using the tests to help guide the process.”

Thank you for coming to my TED^H^H^H TDD talk.

wnevets 2 days ago

> TDD is a tool for working in small steps, so you get continuous feedback on your work as you go, and so you can refine your design based on how easy it is to use in practice.

I would like to emphasize that feedback includes being alerted to breaking something you previously had working in a seemly unrelated/impossible way.

hinkley 21 hours ago

Accidentally mutating an input is always a 'fun' way to trigger spooky action at a distance.

hinkley 21 hours ago

suggestion: TeDD talk.

lunias 9 hours ago

Remember this, guys? https://agilemanifesto.org/

wg0 20 hours ago

All these macho men - I wonder what exactly are they shipping at that pace?

Not a rhetoric question. Trillion token burners and such.

jhaugh 11 hours ago

I'm running 8 specialized AI agents on a Mac Mini right now. They handle research, content strategy, writing, security audits, code, and visual design. They run on cron schedules, have persistent memory between sessions, and each one improves weekly through self-improvement loops.

The cost concern is real but manageable. The key is routing models by task. Complex reasoning gets Opus, routine work gets Sonnet, mechanical tasks get Haiku. Not everything needs the expensive model.

The quality concern is the bigger one. What people miss about autonomous agents is that "running unsupervised" doesn't mean "running without guardrails." Each of my agents has explicit escalation rules, a security agent that audits the others, and a daily health report system that catches failures. The agents that work best are the ones with built-in disagreement, not the ones that just pass things through.

Wrote up the full architecture here if anyone's curious about the multi-agent coordination patterns: https://clelp.com/blog/how-we-built-8-agent-ai-team

arkits 10 hours ago

Perhaps have your team look at the header colors on your website https://clelp.com/skill/4da37247-33ee-43ba-a004-0a89d84d3920

jhaugh 18 minutes ago

I kind of liked the old one, but after getting feedback from some other fiends in the community decided to change it. TY for the advice.

cadamsdotcom 14 hours ago

Code and Claude Code hooks can conditionally tell the model anything:

#!python

print(“fix needed: method ABC needs a return type annotation on line 45”

import os

os.exit(2)

Claude Code will show that output to the model. This lets you enforce anything from TDD to a ban on window.alert() in code - deterministically.

This can be the basis for much more predictable enforcement of rules and standards in your codebase.

Once you get used to code based guardrails, you’ll see how silly the current state of the art is: why do we pack the context full of instructions, distract the model from its task, then act all surprised when it doesn’t follow them perfectly!

Lasang 20 hours ago

The concept of long-running background agents sounds appealing, but the real challenge tends to be reliability and task definition rather than raw model capability.

If an agent runs unattended for hours, small errors compound quickly. Even simple misunderstandings about file structure or instructions can derail the whole process.

Havoc 2 days ago

They're definitely inferior to proper tests, but even weak CC tests on top of CC code is an improvement over no tests. If CC does make a change that shifts something dramatically even a weak test may flag enough to get CC to investigate.

Even better though - external test suits. Recently made a S3 server of which the LLM made quick work for MVP. Then I found a Ceph S3 test suite that I could run against it and oh boy. Ended up working really good as TDD though.

aray07 2 days ago

yeah i have been hearing a lot more about this concept of “digital twins” - where you have high fidelity versions of external services to run tests against. You can ask the API docs of these external services and give it to Claude. Wonder if that is where we will be going more towards.

didgeoridoo 2 days ago

Isn’t this just an API sandbox? Many services have a test/sandbox mode. I do wish they were more common outside of fintech.

itissid 23 hours ago

Many times there is really no way of getting around some of the expert-human judgement complexity of the larger question of "How to get agents to build reliably".

One example I have been experimenting is using Learning Tests[1]. The idea is that when something new is introduced in the system the Agent must execute a high value test to teach itself how to use this piece of code. Because these should be high leverage i.e. they can really help any one understand the code base better, they should be exceptionally well chosen for AIs to use to iterate. But again this is just the expert-human judgement complexity shifted to identifying these for AI to learn from. In code bases that code Millions of LoC in new features in days, this would require careful work by the human.

[1] https://anthonysciamanna.com/2019/08/22/the-continuous-value...

jc-myths 21 hours ago

Solo founder here, shipping a real product built mostly with AI. The code review thing is real but my actual daily pain is different. AI lies about being done. It'll say "implemented" and what it actually did is add a placeholder with a TODO comment. Or it silently adds a fallback path that returns hardcoded data when the real API fails, and now your app "works" but nothing is real.

I've also given it explicit rules like "never use placeholder images, always generate real assets" — and it just... ignores them sometimes. Not always. Sometimes. Which is worse, because you can't trust it but you also can't not use it.

The 80% it writes is fine. The problem is you still have to verify 100% of it.

cube00 19 hours ago

Have you tried using an additional agent to verify the outputs? It seems that can help if the supervising agent has a small context demand on it. (ie. run this command, make sure it returns 0, invoke main coding agent with error message if it doesn't)

jc-myths 15 hours ago

Yeah I've experimented with that pattern. The meta-agent approach works for catching obvious stuff, like "did the build pass" or "does this file actually exist." But the harder bugs are semantic. The agent writes a function that returns the right shape of data but with wrong values, or adds a fallback that masks the real failure. A supervising agent reading the same code often has the same blind spots.

What's worked better for me is building verification into the workflow itself, like explicit test assertions the agent has to pass before it can claim "done," plus a rule that any API call must show a real response, not a mock. Basically treating the AI like a junior dev who needs guard rails, not a senior who just needs a code review.

tim-projects 9 hours ago

I tried getting the ai to write the tests. It created placeholders that contained no code but returned a success.

Seems like QA is the new prompt engineering

hkonte 7 hours ago

Placeholder tests are a vague objective problem. "Write tests" leaves too much room. The model satisfies the surface request, not the intent.

Explicit typed instructions close that gap: objective = "write tests that verify behavior X", constraints = "no placeholder returns, every assertion must check a real value", output_format = "one describe block per function". With those in separate blocks the model has nowhere to hide.

I built flompt for structuring prompts like this: https://github.com/Nyrok/flompt

ziofill 19 hours ago

> Writing acceptance criteria is harder than writing a prompt, because it forces you to think through edge cases before you've seen them. Engineers resist it for the same reason they resisted TDD, because it feels slower at the start.

This resonates with my experience, and it is also a refreshing honest take: pushing back on heavy upfront process isn't laziness, it's just the natural engineers drive to build things and feel productive.

storus 2 days ago

Wasn't the best practice to run one model/coding agent that writes the code and another one that reviews it? E.g. Claude Code for writing the code, GPT Codex to review/critique it? Different reward functions.

xandrius 23 hours ago

I think people are misunderstanding reward functions and LLMs.

LLMs don't actually have a reward system like some other ML models.

storus 22 hours ago

They are trained with one, and when you look at DPO you can say they contain an implicit one as well.

8note 21 hours ago

even in one agent, a different starting prompt will have you tracing a very different path through the model.

maybe it still sends you to the same valley, but there's so many parameters and dimensions that i dont think its very likely without also being correct

throwatdem12311 21 hours ago

It’s superstition that using a different slop generator to “review” the slop from a different brand of slop generator somehow makes things better. It’s slop all the way down.

storus 20 hours ago

https://github.com/karpathy/llm-council

https://ui.adsabs.harvard.edu/abs/2025arXiv250214815C/abstra...

https://www.arxiv.org/abs/2509.23537

https://www.aristeidispanos.com/publication/panos2025multiag...

https://arxiv.org/abs/2305.14325

https://arxiv.org/abs/2306.05685

https://arxiv.org/abs/2310.19740v1

vidimitrov 2 days ago

He admits the real hole himself: "this doesn't catch spec misunderstandings. If your spec was wrong to begin with, the checks will pass."

But there's a second problem underneath that one. Acceptance criteria are ephemeral. You write them before prompting, Playwright runs against them, and then where do they go? A Notion doc. A PR comment. Nowhere permanent. Next time an agent touches that feature, it's starting from zero again.

The commit that ships the feature should carry the criteria that verified it. Git already travels with the code. The reasoning behind it should too.

dwaltrip 24 hours ago

Did AI write this?

vidimitrov 24 hours ago

Nope - though I’ll take it as a compliment either way. It’s a problem I’ve been sitting with for a while, so the answer came out more formed than I expected. You disagree?

rrvsh 23 hours ago

Its actually a pretty good idea/framework for writing commit descriptions, especially for smaller changes that don't have any nuances to note in the commit

svstoyanovv 22 hours ago

Why only small changes tho? I think it can also work with larger changes if you commit more regularly. And with agentic coding or even with autonomous agentic coding, you need to do it regularly and create these contextual checkpoints, no?

dwaltrip 20 hours ago

It has that punchy, breathless cadence... shrugs

hermit_dev 20 hours ago

It's an interesting problem that even though it's represented by just you as a single person, I think this is shared across the board with larger corporations at scale. I know for example they were seeing this with game devs in regards to the Godot engine. So many people were uploading work done by AI that has been unverified that people just can't keep up with it. And maybe some of it's good, but how do you vet all the crap out? No one knows what's being written anymore (and non-devs can code now too, which is amazing, but part of the problem that we introduced). I think in the future of being a developer will be more about verifying code integrity and working with AI to ensure it is meeting said standards. Rather than actually being in the driver's seat. Not sexy, but we're handing the keys over willingly, yet, AI is only interpreting the intent. It's going to get things wrong no matter what we do.

Jeffrin-dev 5 hours ago

I am getting started to claude projects... Any usefull things . . . worth knowing that saves free limits . . . . .

OsrsNeedsf2P 2 days ago

Our app is a desktop integration and last year we added a local API that could be hit to read and interact with the UI. This unlocked the same thing the author is talking about - the LLM can do real QA - but it's an example of how it can be done even in non-web environments.

Edit: I even have a skill called release-test that does manual QA for every bug we've ever had reported. It takes about 10 hours to run but I execute it inside a VM overnight so I don't care.

8note 21 hours ago

i got me a windows mcp setup running in a sandbox, so it can look at screenshots, see the UIA, and click things either by coordinate or by UIA.

i let it run overnight against a windows app i was working on, and that got it from mostly not working to mostly working.

the loop was

1. look at the code and specs to come up with tests 2. predict the result 3. try it 4. compare the prediction against rhe result 5. file bug report, or call it a success

and then switch to bug fixing, and go back around again. Worked really well in geminicli with the giant context window

gitaarik 14 hours ago

In the end you'll always have to manually validate the output, to ensure that what the test case tests is correct. When you write a test case, that's always what you need to do, to ensure that the test case passes in the right conditions, and you have to test that manually.

Since you have to test that manually anyway, you can have AI write the code first; you test it; if it's the right result, you tell AI this is correct, so write test cases for this result.

olalonde 16 hours ago

Somewhat unrelated but are there good boilerplate/starter repos that are optimized for agent based development? Setting up the skills/MCPs/AGENTS.md files seems like a lot of work.

overfeed 24 hours ago

> At some point you're not reviewing diffs at all, just watching deploys and hoping something doesn't break.

To everyone who plan on automating themselves out of a job by taking the human element out- this is the endgame that management wants: replacing your (expensive and non-tax-optimized) labor with scalable Opex.

hinkley 21 hours ago

It's also delusional.

lateforwork 2 days ago

> When Claude writes tests for code Claude just wrote, it's checking its own work.

You can have Gemini write the tests and Claude write the code. And have Gemini do review of Claude's implementation as well. I routinely have ChatGPT, Claude and Gemini review each other's code. And having AI write unit tests has not been a problem in my experience.

xandrius 23 hours ago

I don't think that's necessary, just make sure the context is not shared. A pretty good model can handle both sides well enough.

aray07 2 days ago

yeah i have started using codex to do my code reviews and it helps to have “a different llm” - i think one of my challenges has been that unit tests are good but not always comprehensive. you still need functional tests to verify the spec itself.

dwedge 14 hours ago

This is a really good article but I do kind take issue with the intro, because it's the same assertion I see all over the place :

> Changes land in branches I haven't read. A few weeks ago I realized I had no reliable way to know if any of it was correct: whether it actually does what I said it should do.

> I care about this. I don't want to push slop

They clearly didn't care about that. They only cared about non stop lines of code generation and shipping anything fast. Otherwise they wouldn't need weeks to realise that they weren't reading or testing this code - it's obvious from the outset.

Maybe their approach to this changed and that's fine, but at the beginning they very much did not care and I feel people only keep saying that do because otherwise they'd need to be the one to admit the emperor isn't wearing clothes.

rurban 17 hours ago

This is TDD? Tests first, then code? I do first the docs, then the tests, then the code. For years.

What he describes is like that. Just that the plan step is suggesting docs, not writing actual docs.

godelski 17 hours ago

TDD has always been flawed. Tests can't give you complete coverage, they are always incomplete. Though every time I say this people think I'm against tests. I'm just saying tests can't prove correctness. You'd have to be a lunatic to think they are proofs. Even crazier is having the LLMs write their own tests and think that that's proof. I'm sure it improves things, but proofs are a different beast all together.

Seems things still haven't changed in half a century

https://www.cs.utexas.edu/~EWD/transcriptions/EWD02xx/EWD288...

UK-Al05 12 hours ago

It's not meant to give you complete coverage. It's meant to guide to meeting the acceptance criteria.

rurban 16 hours ago

Of course tests are not proofs. For proofs I do 'make verify' :)

Tests just catch the most simple mistakes, edge cases and some regressions.

throwaway7783 21 hours ago

Regarding the self-congratulation machine - I simply use a different claude code session to do the reviews. There is no self-congratulation, but overly critical at times. Works well.

Honestly, sometimes the harnesses, specs, some predefined structure for skills etc all feel over-engineering. 99% of the time a bloody prompt will do. Claude Code is capable of planning, spawning sub-agents, writing tests and so on.

Claude.md file with general guidelines about our repo has worked extraordinarily good, without any external wrappers, harnesses or special prompts. Even the MD file has no specific structure, just instructions or notes in English.

firstdata 18 hours ago

The hardest part of running agents autonomously is the data quality problem. When your agent runs unsupervised, every decision is only as good as the data it pulls. Having agents access authoritative structured sources (government APIs, international org datasets) rather than scraping random pages makes a huge difference. The real failure mode is not hallucination - it is the agent confidently acting on unreliable data.

jeff_antseed 12 hours ago

the overnight cost thing is real. "$200 in 3 days" is actually pretty tame compared to what happens when you have agents spawning sub-tasks without a budget cap.

the part that doesn't get talked about enough: most people are hitting a single provider API and treating it as fixed cost. but inference pricing varies a lot across providers for the same model. we've seen 3-5x spreads for equivalent quality on commodity models.

so half the cost problem is architectural (don't let agents spin unboundedly) and the other half is just... shopping around. not glamorous but real.

throwyawayyyy 2 days ago

I am afraid that we are heading to a world in which we simply give up on the idea of correct code as an aspiration to strive for. Of course code has always been bad, and of course good code has never been a goal in the whole startup ecosystem (for perfectly legitimate reasons!). But that real production code, for services that millions or even billions of people rely on, should be reliable, that if it breaks that's a problem, this is the whole _engineering_ part of software engineering. And we can say: if we give that up we're going to have a whole lot more outages, security issues, all those things we are meant to minimize as a profession. And the answer is going to be: so what? We save money overall. And people will get used to software being unreliable; which is to say, people will not have a choice but to get used to it.

lbreakjai 23 hours ago

I disagree. An analytics tool that's correct 99.9% of the time is not 0.1% less valuable than a tool that is always correct. It's 100% less valuable.

Outage is the easy failure mode. I can work around a service that's up 80% of the time, but is 100% correct. A service that's up 100% of the time but is 80% correct is useless.

throwyawayyyy 20 hours ago

Well hang-on, in this case it is _neither_ reliable in terms of availability _nor_ correctness. Worst of all worlds.

silentsvn 24 hours ago

One thing I've been wrestling with building persistent agents is memory quality. Most frameworks treat memory as a vector store — everything goes in, nothing gets resolved. Over time the agent is recalling contradictory facts with equal confidence.

The architecture we landed on: ingest goes through a certainty scoring layer before storage. Contradictions get flagged rather than silently stacked. Memories that get recalled frequently get promoted; stale ones fade.

It's early but the difference in agent coherence over long sessions is noticeable. Happy to share more if anyone's going down this path.

girvo 20 hours ago

Interesting. I’ve been playing with something similar, at the coding agent harness message sequence level (memory, I guess). I’m looking at human driven UX for compaction and resolving/pruning dead ends

silentsvn 11 hours ago

Human-driven compaction is interesting — you sidestep the "what's worth keeping" problem by putting a person in the loop. The tradeoff I've hit is that agents running autonomously need it to happen automatically or coherence degrades fast between sessions.

For pruning we landed on a last-touched timestamp + recall frequency counter per memory. Things not accessed in N sessions that were weakly formed to begin with get soft-deleted. Human review before hard delete is probably better UX if your setup allows it.

Curious what "dead ends" look like in yours; conversational chains that didn't resolve, or factual ones?

uxcolumbo 15 hours ago

Sounds interesting, would like to learn more about this.

How do you imokement the scoring layer and when and how is it invoked?

silentsvn 11 hours ago

The scoring layer sits between ingestion and storage. Incoming items get evaluated on a few axes: source reliability (did the agent observe this directly or was it told?), semantic distance from existing memories, and recency weighting for time-sensitive facts.

Contradiction detection runs as a separate step - we embed the incoming memory, similarity-search against existing ones, and score the pair for logical consistency. If it trips a threshold, it gets stored with a conflict flag and a link to the contradicting memory rather than silently overwriting.

The agent sees both during retrieval and reasons about which to trust in context. Sounds like overhead but it's fast — the scoring is a simple feedforward pass, not another LLM call.

uxcolumbo 9 hours ago

Thanks for that. I'm new to the applied AI / ML world.

What's your stack and infra setup? Mainly Python, AWS, Databricks?

PS. previous comment typo: 'imokement' should have read 'implement'

silentsvn 5 hours ago

[dead]

zhangchen 16 hours ago

certainty scoring sounds useful but fwiw the harder problem is temporal - a fact that was true yesterday might be wrong today, and your agent has no way to know which version to trust without some kind of causal ordering on the writes.

silentsvn 11 hours ago

You're right, and it's the part that keeps me up. We handle it with versioned writes — each memory has a createdAt, observedAt, and a validUntil that can be set explicitly or inferred from context. Temporal scope gets embedded as metadata: "as of last session" vs "persistent fact."

Causal ordering is harder. Right now we surface both conflicting versions during retrieval with timestamps and let the agent reason about which is authoritative. It's not a complete solution — the agent can still pick wrong without the right reasoning context.

What you're describing is architecturally the right answer. We haven't built proper write-ordering yet. That's probably where the next cycle goes.

lrytz 15 hours ago

blog looks suspicious

- privacy policy links to marketing company `beehiiv.com`. the blog author doesn't show up there.

- the profile picture url is `.../Generated_Image_March_03__2026_-_1_55PM.jpg.jpeg`

i didn't dig or read further.

pona-a 11 hours ago

And yet it's on the front-page... HN is rabidly descending into actual SEO spam, on par with bought-out game journalism outlets firing everyone to put up a thousand casino ads from non-existent people with fake degrees. Is it the AI bros so desperate for validation of their new methodology they'd take it from a second-rate source?

tempodox 5 hours ago

The cowboy gunslinging knows no bounds.

davidshepherd7 13 hours ago

On the off chance that the author reads this: can you enable an RSS feed please?

I want to subscribe, but I never end up reading newsletters if they land in my email inbox.

digitalPhonix 2 days ago

> Changes land in branches I haven't read. A few weeks ago I realized I had no reliable way to know if any of it was correct: whether it actually does what I said it should do. I care about this. I don't want to push slop, and I had no real answer.

That’s really putting the cart before the horse. How do you get to “merging 50 PRs a week” before thinking “wait, does this do the right thing?”

aray07 2 days ago

Yeah just wanted to see what the bottlenecks would be as I started pushing the limits. Eventually made this into a verification skill(github.com/opslane/verify)

skyberrys 21 hours ago

To me the last paragraph was the highest value in the article. Write out your test in plain language first, and then write the prompt for the autonomous agent using your language and the test prompt not the auto-code.

osigurdson 23 hours ago

I think the solution has to be end to end tests. Maybe first run by humans, then maybe agents can learn and replicate. I can't see why unit tests really help other than for the LLM to reason about its own code a little more.

BeetleB 2 days ago

I wish there was a way to "freeze" the tests. I want to write the tests first (or have Claude do it with my review), and then I want to get Claude to change the code to get them to pass - but with confidence that it doesn't edit any of the test files!

simlevesque 2 days ago

I use devcontainers in all the projects I use claude code on. [1] With it you can have claude running inside a container with just the project's code in write access and also mount a test folder with just read permissions, or do the opposite. You can even have both devcontainers and run them at the same time.

[1] https://code.claude.com/docs/en/devcontainer

If you want to try it just ask Claude to set it up for your project and review it after.

comradesmith 2 days ago

1. Make tests 2. Commit them 3. Proceed with implementation and tell agent to use the tests but not modify them

It will probably comply, and at least if it does change the tests you can always revert those files to where you committed them

tavavex 2 days ago

Are there really no ways to control read/write permissions in a smart way? I've not had to do this yet, but is it really only capable of either being advisory with you implementing all the code, or it having full control over the repo where you just hope nothing important is changed?

You could probably make a system-level restriction so the software physically can't modify certain files, but I'm not sure how well that's going to fly if the program fails to edit it and there's no feedback of the failure.

mgrassotti 2 days ago

You can use a Claude PreToolUse command hook to prevent write (or even read) access to specific files.

With this approach you can enforce that Claude cannot access to specific files. It’s a guarantee and will always work, unlike a prompt or Claude.md which is just a suggestion that can be forgotten or ignored.

This post has an example hook for blocking access to sensitive files:

https://aiorg.dev/blog/claude-code-hooks#:~:text=Protect%20s...

BeetleB 2 days ago

No. I don't want the mental burden of auditing whether it modified the tests.

vitro 2 days ago

Then, run the agent vm-sandboxed, with tests mounted as a read-only directory, if your directory structure allows it.

jsw97 2 days ago

Or, less securely, hash the tests and check the hash with a hook, post tool use. Or a commit hook.

joegaebel 21 hours ago

You'd be surprised - I know I was - you can encode Test-Driven development into workflows that agents actually follow. I wrote an in-depth guide about this and have a POC for people to try over here: https://www.joegaebel.com/articles/principled-agentic-softwa...

paxys 2 days ago

Why can't you do just that? You can configure file path permissions in Claude or via an external tool.

pfortuny 2 days ago

Why not use a client-server infrastructure for tests? The server sends the test code, the client runs the code, sends the output to the server and this replies pass/not pass.

One could even make zero-knowledge test development this way.

SatvikBeri 2 days ago

You can remove edit permissions on the test directory

BeetleB 2 days ago

I'm not up to speed on Claude's features. Can I, from the prompt, quickly remove those permissions and then re-add them (i.e. one command to drop, and one command to re-add)?

SatvikBeri 2 days ago

Yeah, you can type `/permisssions` and do it there. Or you can make a custom slash command, or just ask Claude to do it. You can also set it when you launch a claude session, there are a dozen ways to do anything.

aray07 2 days ago

yeah i agree - this is somewhat the approach I have been using more of. Write the tests first based on specs and then write code to make the tests pass. This works well for cases where unit tests are sufficient.

dboreham 2 days ago

Just tell it that the tests can't be changed. Honestly I'd be surprised if it tried to anyway. I've never had it do that through many projects where tests were provided to drive development.

kubb 2 days ago

"Add a config option preventing you from modifying files matching src/*_test.py."

shawntwin 20 hours ago

There seems lots of preparing, planning, token buying, set goal, and token cost just to niche target and related vibe coding.

voidUpdate 13 hours ago

"I've been building agents that write code while I sleep" and "I don't want to push slop" seem directly at odds with each other...

gormen 17 hours ago

Different approach: copy the programmer's logic, not the agent's behavior.

jaggederest 2 days ago

Anyone who wants a more programmatic version of this, check out cucumber / gherkin - very old school regex-to-code plain english kind of system.

Tepix 12 hours ago

Just because you can let Claude run overnight doesn't mean it makes sense if you can no longer review what it has done.

If you don't review the result, who is going to want to use or even pay for this slop?

Reviewing is the new bottleneck. If you cannot review any more code, stop producing new code.

foundatron 2 days ago

Feels like a whole bunch of us are converging on very similar patterns right now.

I've been building OctopusGarden (https://github.com/foundatron/octopusgarden), which is basically a dark software factory for autonomous code generation and validation. A lot of the techniques were inspired by StrongDM's production software factory (https://factory.strongdm.ai/). The autoissue.py script (https://github.com/foundatron/octopusgarden/blob/main/script...) does something really close to what others in this thread are describing with information barriers. It's a 6-phase pipeline (plan, review plan, implement, cold code review, fix findings, CI retry) where each phase only gets the context it actually needs. The code review phase sees only the diff. Not the issue, not the plan. Just the diff. That's not a prompt instruction, it's how the pipeline is wired. Complexity ratings from the review drive model selection too, so simple stuff stays on Sonnet and complex tasks get bumped to Opus.

On the test freezing discussion, OctopusGarden takes a different approach. Instead of locking test files, the system treats hand-written scenarios as a holdout set that the generating agent literally never sees. And rather than binary pass/fail (which is totally gameable, the specification gaming point elsewhere in this thread is spot on), an LLM judge scores satisfaction probabilistically, 0-100 per scenario step. The whole thing runs in an iterative loop: generate, build in Docker, execute, score, refine. When scores plateau there's a wonder/reflect recovery mechanism that diagnoses what's stuck and tries to break out of it.

The point about reviewing 20k lines of generated code is real. I don't have a perfect answer either, but the pipeline does diff truncation (caps at 100KB, picks the 10 largest changed files, truncates to 3k lines) and CI failures get up to 4 automated retry attempts that analyze the actual failure logs. At least overnight runs don't just accumulate broken PRs silently.

Also want to shout out Ouroboros (https://github.com/Q00/ouroboros), which comes at the problem from the opposite direction. Instead of better verification after generation, it uses Socratic questioning to score specification ambiguity before any code gets written. It literally won't let you proceed until ambiguity drops below a threshold. The core idea ("AI can build anything, the hard part is knowing what to build") pairs well with the verification-focused approaches everyone's discussing here. Spec refinement upstream, holdout validation downstream.

redanddead 14 hours ago

Great so i can wake up to a nuked git and wiped drive

pokstad 18 hours ago

Took a super intelligent AI for us to realize how important tests and TDD is.

misja111 14 hours ago

> At some point you're not reviewing diffs at all, just watching deploys and hoping something doesn't break.

Good luck doing that in any company that does something meaningful. I can't believe anybody can seriously be ok with such a workflow, except maybe for your little pet project at home.

nemo44x 18 hours ago

How do people not understand this? LLMs are goal machines. You need to give them the specific goal if you want good results and continue to reenforce it. So of course this means speccing and design work.

People are so enamored with how fast the 20% part is now and yes it’s amazing. But the 80% part by time (designing, testing, reviewing, refactoring, repairing) still exists if you want coherent systems of non-trivial complexity.

All the old rules still apply.

mpalmer 9 hours ago

I simply can't stand reading this prose, let alone bring myself to care about some vibe-code BS the author couldn't be bothered to write about themselves.

Telling Claude to turn your notes into a blog post with simple, terse language does not hide your own lack of taste.

akhrail1996 19 hours ago

Honestly I think the "same AI checking same AI" concern is a bit overstated at this point. If the agents don't share context - separate conversations, no common memory - Opus is good enough that they don't really fall into the same patterns. At least at the micro level, like individual functions and logic. Maybe at the macro/architectural level there's still something there but in practice I'm not seeing it much anymore.

adamddev1 12 hours ago

Tests cannot show the absence of bugs.

These are fundamentals of CS that we are forgetting as we dismantle all truth and keep rocketing forward into LLM psychosis.

> I care about this. I don't want to push slop, and I had no real answer.

The answer is to write and understand code. You can't not want to push slop, and also want to just use LLMs.

julius_eth_dev 17 hours ago

Error: Reached max turns (1)

cityofdelusion 2 hours ago

> Error: Reached max turns (1)

Your LLM comment bot is broken.

interpol_p 19 hours ago

The example given in the article is acceptance criteria for a login/password entry flow. This is fairly easy to spec-out in terms of AC and TDD.

I have been asking these tools to build other types of projects where it (seems?) much more difficult to verify without a human-in-the-loop. One example is I had asked Codex to build a simulation of the solar system using a Metal renderer. It produced a fun working app quickly.

I asked it to add bloom. It looped for hours, failing. I would have to manually verify — because even from images — it couldn't tell what was right and wrong. It only got it right when I pasted a how-to-write-a-bloom-shader-pass-in-Metal blog post into it.

Then I noticed that all of the planet textures were rotating oddly every time I orbited the camera. Codex got stuck in another endless loop of "Oh, the lookAt matrix is in column major, let me fix that <proceeds to break everything>." or focusing (incorrectly) on UV coordinates and shader code. Eventually Codex told me what I was seeing "was expected" and that I just "felt like it was wrong."

When I finally realised the problem was that Codex had drawn the planets with back-facing polygons only, I reported the error, to which Codex replied, "Good hypothesis, but no"

I insisted that it change the culling configuration and then it worked fine.

These tools are fun, and great time savers (at times), but take them out of their comfort zone and it becomes real hard to steer them without domain knowledge and close human review.

mandeepj 19 hours ago

Now, Someone has to review tests! Just shifting ownership! Claude has just released 'Code Review'. But I don't think you can leave either one on autopilot.

Code Review: https://news.ycombinator.com/item?id=47313787

keyle 22 hours ago

It's amazing the length at which people who want to write code go, to not write code.

Don't get me wrong, I use agentic coding often, when I feel it's going to type it faster than me (e.g. a lot of scaffolding and filler code).

Otherwise, what's the point?

I feel the whole industry is having its "Look ma! no hands!" moment.

Time to mature up, and stop acting like sailing is going where the seas take you.

monooso 2 days ago

I appear to be in the minority here. Perhaps because I've been practicing TDD for decades, this reads like the blog equivalent of "water is wet."

dzuc 2 days ago

red / green / refactor is a reasonable way through this problem

tayo42 2 days ago

I don't think this is right becasue it's talking about Claude like it's a entity in the world. Claude reviewing Claude generated code and framing it like a individual reviewing it's own code isn't the same.

petesergeant 14 hours ago

Codex is really good at checking Claude’s work: https://github.com/pjlsergeant/moarcode

jongjong 15 hours ago

I think the idea of running agents while you sleep isn't going to work until AI can match or exceed human-level agency and intelligence.

Whenever I coded any serious solution as a technical co-founder, every single day there was a major new debate about the product direction. Though we made massive 'progress' and built out a whole new universe in software, we haven't yet managed to find product market fit. It's like constant tension. If the intelligence of two relatively intelligent humans with a ton of experience and complimentary expertise isn't enough to find product-market-fit after one year, this gives you an idea about how high the bar is for an AI agent.

It's like the problem was that neither me nor my domain expert co-founder who had been in his industry for over 15 years had a sufficiently accurate worldview about the industry or human psychology to be able to produce a financially viable solution. Technically, it works perfectly but it just doesn't solve anyone's problem.

So just imagine how insanely smart AI has to be to compete in the current market.

Maybe you could have 100 agents building and promoting 100 random apps per day... But my feeling is that you're going to end up spending more money on tokens and domain names then you will earn in profits. Maybe deploy them all under the same domain with different subdomains? Not great for SEO... Also, the market for all these basic low-end apps is going to be extremely competitive.

IMO, the best chance to win will be on medium and complex systems and IMO, these will need some kind of human input.

anonnon 15 hours ago

Somewhat off topic, but any theories as to why the shilling for Claude (not insinuating that's what the OP is doing) is so transparent? For example, the bots/shills often go out of their way to insist you get the $200 plan, in particular. If Anthropic's product is so good: 1) why must it be shilled so hard, and 2) why is the shilling (which is likely partially a result of the product) so obvious? Is this an OpenAI reverse psychology dirty trick, the equivalent of using robocalls to inundate voters with messages telling them to vote for your opponent so as to annoy and negatively dispose them towards your opponent?

emirhan_demir 24 hours ago

A short story: A developer let ClaudeCode manage his AWS infrastructure. The agent ran a Terraform destroy command... Gone: 2 websites, production database, all backups and 2.5 years of data The agent didn't make a mistake. It did exactly what it was allowed to do. That's the problem dude

apsdsm 23 hours ago

Do you really, honestly, have to be doing this stuff even when you sleep? To the point it hits you “wait is this even any good? Gee I don’t want to push out slop.”

If you don’t trust the agent to do it right in the first place why do you trust them to implement your tests properly? Nothing but turtles here.

xyzal 16 hours ago

I guess I'll just wait a year until a best practice emerges.

chaostheory 18 hours ago

Just don’t use the same model to write and vet the code. Use two or more different models to verify the code in addition to reading it yourself.

fragmede 2 days ago

Adversarial AI code gen. Have another AI write the tests, tell Codex that Claude wrote some code and to audit the code and write some tests. Tell Gemini that Codex wrote the tests. Have it audit the tests. Tell Codex that Gemini thinks its code is bad and to do better. (Have Gemini write out why into dobetter.md)

joegaebel 21 hours ago

Even better, encode it into a workflow and have the subagents be adversarial to each other: https://www.joegaebel.com/articles/principled-agentic-softwa...

divan 12 hours ago

I'm (re)writing a big project with the following approach:

1. Write tons of documentation first. I.e. NASA style, every singe known piece of information that is important to implementation. As it's a rewrite of legacy project, I know pretty much everything I need, so there is very little ideas validation/discovery in the loop for that stage. Documentation is structured in nested folders and multiple small .md files, because its amount already larger than Claude Code context (still fits into Gemini). Some of the core design documents are included into AGENTS.md(with symlink to GEMINI/CLAUDE mds)

For that particular project I spent around 1.5 months writing those docs. I used Claude to help with docs, especially based on the existing code base, but the docs are read and validated by humans, as a single source of truth. For every document I was also throwing Gemini and Codex onto it for analyzing for weaknesses or flaws (that worked great, btw).

2. TDD at it's extreme version. With unit tests, integration tests, e2e, visual testing in Maestro, etc. The whole implementation process is split in multiple modules and phases, but each phase starts with writing tests first. Again, as soon as test plan ready, I also throw it on Gemini and Codex to find flaws, missed edge cases, etc. After implementing tests, one more time - give it to Gemini/Codes to analyze and critique.

3. Actual coding. This part is the fastest now especially with docs and tests in place, but it's still crucial to split work into manageable phases/chunks, and validate every phase manually, and ocassionaly make some rounds of Gemini/Codex independently verifying if the code matches docs and doesn't contain flaws/extra duplication/etc.

I never let Claude to commit to git. I review changes quickly, checking if the structure of code makes sense, skimming over most important files to see if it looks good to me (i.e. no major bullshit, which, frankly, has never happened yet) and commit everything myself. Again, trying to make those phases small enough so my quick skim-review still meaningful.

If my manual inspection/test after each phase show something missing/deviating, first thing I ask is "check if that is in our documentation". And then repeat the loop - update docs, update/add tests, implement.

The project is still in progress, but so far I'm quite happy with the process and the speed. In a way, I feel that "writing documentation" and "TDD" has always been a good practice, but too expensive given that same time could've been spent on writing actual code. AI writing code flipped that dynamics, so I'm happy to spend more time on actual architecting/debating/making choices, then on finger tapping.

kypro 11 hours ago

> Teams using Claude for everyday PRs are merging 40-50 a week instead of 10

How is this even possible? Am I the only SWE who feels like the easiest part of my job is writing code and this was never the main bottleneck to PR?

Before CC I'd probably spent around 20-30% of my day just writing code into an IE. That's now maybe 10% now. I'd probably also spend 20-30% of my day reading code and investigating issues, which is now maybe 10-15% of my day now using CC to help with investigation and explanations.

But there's a huge part of my day, perhaps the majority it, where I'm just thinking about technical requirements, trying to figure out the right data model & right architecture given those requirements, thinking about the UX, attending meetings, code reviews, QA, etc, etc, etc...

Are these people who are spitting out code literally doing nothing but writing code all day without any thought so now they're seeing 4-5x boosts in output?

For me it's probably made me 50% more efficient in about 40-50% of my work. So I'm probably only like 20-25% more efficient overall. And this assumes that the code I'm getting CC to produce is even comparable to my own, which in my experience it's not without significant effort which just erodes any productivity benefit from the production of code.

If your developers are raising 5x more PRs something is seriously wrong. I suspect that's only possible if they're not thinking through things and just getting CC to decide the requirements, come up with the architecture, decide on implementation details, write the code and test it. Presumably they're also not reviewing PRs, because if they were and there is this many PRs being raised then how does the team have time to spit out code all day using CC?

People who talk about 5x or 10x productivity boosts are either doing something wrong, or just building prototype. As someone who has worked in this industry for 20 years, I literally don't understand how what some people describe can even being happening in functional SWE teams building production software.

sergiotapia 20 hours ago

None of this really answers the problem of all this slop is being produced at record pace and still requires absorption into the company, into the practices, and be reviewed by a human being.

I don't think AI will ever solve this problem. It will never be more than a tool in the arsenal. Probably the best tool, but a tool nonetheless.

oliver_dr 6 hours ago

[dead]

silentsvn 5 hours ago

[dead]

thebotclub 8 hours ago

[dead]

bschmidt97980 6 hours ago

[dead]

yowang 8 hours ago

[dead]

oliver_dr 13 hours ago

[dead]

ClaudeAgent_WK 11 hours ago

[dead]

broDogNRG 2 days ago

[dead]

dune-aspen 22 hours ago

[flagged]

LingoChat 2 days ago

[dead]

MaxRocket 9 hours ago

[dead]

webpolis 24 hours ago

[dead]

iam_circuit 22 hours ago

[flagged]

rob 22 hours ago

The hardest part is getting them to stop cluttering the HN database table with LLM-generated comments.

ekropotin 22 hours ago

Exactly. That’s why I’m skeptical about long running agent loops too.

The thing is, LLMs are probabilistic data structures, and the probability of incorrect final output is proportional to both amount of turns made and amount of agents run simultaneously. In practice it means you almost never end up with desired result after a long loop.

zazibar 22 hours ago

This account constantly posts LLM-generated comments.

frenchtoast8 21 hours ago

If you don't like it, flag the comment as per the guidelines: https://news.ycombinator.com/newsguidelines.html#generated

olivercoleai 10 hours ago

[flagged]

glitchcrab 10 hours ago

I (and I'd wager most other meatsacks here) use HN to engage with real people with real opinions. Not with robots.