We are changing our developer productivity experiment design
83 points by ej88 2 days ago | 59 comments

keeda 23 hours ago
> When surveyed, 30% to 50% of developers told us that they were choosing not to submit some tasks because they did not want to do them without AI. This implies we are systematically missing tasks which have high expected uplift from AI.

In fact, one of the developers in the original study later revealed on Twitter that he had already done exactly that during the study, i.e. filtered out tasks he preferred not to do without AI: https://xcancel.com/ruben_bloom/status/1943536052037390531

While this was only one developer (that we know of), given the N was 16 and he seems to have been one of the more AI-experienced devs, this could have had a non-trivial effect on the results.

The original study gets a lot of air-time from AI naysayers, let's see how much this follow-up gets ;-)

reply
sjaiisba 23 hours ago
> 3. Regarding me specifically, I work on the LessWrong codebase which is technically open-source. I feel like calling myself an "open-source developer" has the wrong connotations, and makes it more sound like I contribute to a highly-used Python library or something as an upper-tier developer which I'm not

That’s very interesting! This kinda matches what I see at work:

- low performers love it. it really does make them output more (which includes bugs, etc. it’s causing some contention that’s yet to be resolved)

- some high performers love it. these were guys who are more into greenfield stuff and ok with 90% good. very smart, but just not interested in anything outside of going fast

- everyone else seems to be finding use out of it, but reviews are painful

reply
SpicyLemonZest 20 hours ago
As one of the naysayers who talked a lot about the original study, I enthusiastically endorse any attempt at all to actually measure AI productivity. An increase from 20% slowdown to 20% speedup over the past year seems broadly consistent with my understanding of how things have gone. I think I remain classified as a "naysayer", though, because the "booster" case has gone from "I'm multiple times more productive" to "I never have to look at code my AI agents just handle everything" over the same period.
reply
keeda 17 hours ago
I think the issue was with incomplete context. Even before the original METR study came out, there were a number of larger-scale studies that showed a 15 - 30% boost, starting as far back as 2024. I often mention them, though they require some explanation, so this thread and linked comments may be useful: https://news.ycombinator.com/item?id=46559254

However those studies never got as much airtime as the METR study, and this has created an imbalanced perspective.

My take is that studies like this are extremely useful, but a lagging indicator of the true extent of AI-assisted coding. Especially since the latest tools are something else entirely.

I am not at the "never look at code again" stage, the old habits are just too ingrained... but I'm starting to look less frequently because I rarely find anything to fix. I can see a path from where I'm at to the outlandish claims people have been making.

reply
SpicyLemonZest 16 hours ago
I tried the "don't look too closely" thing for the first time last week. I got immediately humiliated when a reviewer asked why my commit was trying to replace the correct, elegant usage of an API the class was named after with a 4-line-long franken-command using a different API with incorrect semantics. It's not like I'm not trying the new stuff; on a subjective level I think AI coding is really neat. I just can't ever figure out how to map what I get to the stories I hear.
reply
keeda 5 hours ago
Oh yeah I can see that happening, which is why I still scan the code! However, one thing I'll add is that AI-assisted coding requires adapting your workflow. Fortunately, it largely boils down to coding best-practices on steroids: docs, tests, tooling like linters, etc.

I throw tests at everything, even minor functions, preferably integration, maybe even some E2E with Playwright in web apps, at least for the happy paths. I actually pay more attention to the tests. The amazing thing is that the AI writes all of these and uses them as feedback to fix its mistakes.

But these validation guardrails are what has been driving down the issues I encounter. Without them, the AI makes more mistakes and hence requires more in-depth manual review.
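To make that concrete, here's a minimal sketch of the kind of happy-path test I mean — the `slugify` helper and its behavior are hypothetical, purely for illustration. The AI writes a small test like this alongside the function, runs it, and uses any failure as feedback to fix its own mistakes:

```python
# Hypothetical example: a happy-path test for a minor helper function.
# Neither the function nor the test is from a real codebase; this just
# illustrates the "tests as AI feedback" workflow described above.

def slugify(title: str) -> str:
    """Convert a title to a URL-friendly slug."""
    # Lowercase, replace every non-alphanumeric char with a space,
    # then join the remaining words with hyphens.
    cleaned = "".join(c if c.isalnum() else " " for c in title.lower())
    return "-".join(cleaned.split())

def test_slugify_happy_path():
    # Only the happy paths, as mentioned above -- no exhaustive edge cases.
    assert slugify("Hello, World!") == "hello-world"
    assert slugify("  AI-Assisted   Coding ") == "ai-assisted-coding"

test_slugify_happy_path()
print("ok")
```

When the test fails, the traceback goes straight back into the agent's context, which is what turns the test suite into a self-correction loop rather than just a review aid.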

reply
ben_w 13 hours ago
It depends what you're measuring.

Don't get me wrong, my experiments with true vibe-coding (i.e. not even looking at the code) match yours: the result is somewhat mediocre*.

For some cases, and I try to push beyond the limits of what LLMs can do in order to find those limits, they suck. I'd describe the output as like that of an overenthusiastic junior who reinvents the wheel badly rather than using standard approaches even when told to.

For other cases, I know that mediocre code is actually good enough: well before LLMs happened, I've seen mediocre code that still resulted in the app itself being given meaningful public accolades.

* Though, as per previous comment of mine, I can't help notice that the mediocrity is doing more and more of my previous career: https://news.ycombinator.com/item?id=46989102

reply
pudsbuds 16 hours ago
You just have to give up and drink the koolaid...

But for real... My company started tracking commits per hour as a metric so I just commit as many times as I can. I don't get the luxury of even looking at my work now. They say it's faster but I've never seen so much tech debt delivered so quickly in my life.

It's going to be an interesting few years...

reply
mewpmewp2 15 hours ago
Definitely need to stop squashing commits if that is the case! But no, seriously tracking git commit counts is absolutely ridiculous. Maybe you can have AI autonomously work on useless documentation that no one will read, with 1 commit per 100 lines of markdown?
reply
atleastoptimal 2 days ago
It's kind of funny that METR is known primarily for both the most bearish study on AI progress (the original 20% slowdown one), and the most bullish one on AI progress (the long-task horizon study showing exponential increase in duration of tasks AI models can accomplish with respect to date of release).

In either case, it seems people ended up bolstering their preexisting views on AI based on whichever study most affirmed them (for the former, that AI coding models didn't actually help and created a mirage of productivity that required more work to fix than was worth it, the latter that AI models were improving at an exponential rate and will invariably eclipse SWE's in all tasks in a deterministic amount of time.)

I think the truth is somewhere in the middle. Just anecdotally we've seen multi-million dollar fortunes being minted by small teams developing using 90% AI-assisted coding. Anthropic claims they solely use agents to code and don't modify any code manually.

reply
sjaiisba 24 hours ago
> Anthropic claims they solely use agents to code and don't modify any code manually.

Have you used CC? It shows. They did not make their fortune off this, and it’s at least lost me a customer because of how sloppy it is. The model is good, and it’s why they have to gate access to it. I’d much rather use a different harness.

I do think you’re on to something though. As societal wealth further concentrates among the few, we’re going to get more and more slop for the rest of us because we have no money (relatively speaking). Agentic coding is here to stay because we as a society are force-fed more and more slop. It’s already rampant; this is just automating it.

reply
Wowfunhappy 21 hours ago
...uh, I think Claude Code is great, actually. A lot of that is indeed just the strength of the underlying model, but the local client is great too. Plan mode, checkpoints, subagents... I've been using Claude Code for a year now, and I feel like Anthropic has steadily been eliminating pain points.

It's certainly a lot better than the Gemini cli!

reply
redhale 10 hours ago
Allow me a momentary rant...

I love Claude Code and use it all day, every day for work. I would self identify as an unofficial Claude Code evangelist amongst my coworkers and friends.

But Claude Code is buggy as hell. Flicker is still present. Plugin/skill configuration is an absolute shitshow. The docs are (very) outdated/incomplete. The docs are also poorly organized, embarrassingly so. I know Claude Code's feature set quite well, and I still have a hard time navigating their docs to find a particular thing sometimes. Did you know Claude Code supports "rules" (similar to the original Cursor Rules)? Find where they are documented, and tell me that's intuitive and discoverable. I'm sorry, but with an unlimited token (and I assume, by now, personnel) budget, there is no excuse for the literal inventors of Claude Code to have documentation this bad.

I seriously wish they would spend some more cycles on quality rather than continuing to push so many new features. I love new features, but when I can't even install a plugin properly (without manual file system manipulation) because the configuration system is so bugged, inscrutable, and incompletely documented, I think it's obvious that a rebalancing is needed. But then again, why bother if you're winning anyway?

Side note: comparing it to Gemini CLI is simply cruel. No one should ever have to use or think about Gemini CLI.

reply
Philpax 21 hours ago
Functionality-wise, it's great, but it's a buggy mess, and it seems to be getting worse with each release.
reply
pudsbuds 16 hours ago
I've been using delegated Claude agents in vscode and it crashes so much it's insane... I switched to copilot Claude local agents and it works much better.

Idk about this whole vibe coding thing though... We'll see what happens

reply
pgwhalen 21 hours ago
I’m a heavy user for about four months now, and it’s definitely getting better for me. How would you say it’s getting worse?
reply
ej88 2 days ago
Really interesting updates to their 2025 experiment.

Repeat devs from the original experiment went from a 0-40% slowdown to a -10% to 40% speedup - and METR estimates this as a 'lower bound'

more devs saying they don't even want to do 50% of their work without AI, even for $50/hr

30-50% of devs decided not to submit certain tasks without AI, missing the tasks with the highest uplift

it also seems like there is a skill gap - repeat devs from the first study are more productive with ai tools than newly recruited ones with variable experience

overall it seems like the high preference for devs to use AI is actually hurting METR's ability to judge their speedup, due to a refusal to do tasks without it. imo this is indirectly quite supportive for ai coding's productivity claims.

reply
roxolotl 24 hours ago
The finding of the first study was that people cannot judge their performance with these tools. So I don’t think individuals’ unwillingness to work without them is indicative of productivity improvements. I think it’s indicative of them being enjoyable to use.
reply
logicprog 21 hours ago
It was claimed to find that, but I don't think it did. It compared developers' beliefs about average speedup across tasks, measured by asking them once at the end, against the average comparative speed measured per task and then averaged. That's measuring two different things, and all kinds of things could mess up developers' fuzzy recollection of the gestalt of several tasks (such as recency bias and question/study framing) that wouldn't affect it if you asked them right after; moreover, when tasks were broken down by task type, the speedup/slowdown results actually matched developers' qualitative comments.
reply
judahmeek 18 hours ago
There are some people participating in the study who will fire & forget instructions to Claude/Codex running in parallel worktrees, but would really struggle if they were required to work on their project without AI assistance.

So while some study participants probably are seeing an actual speedup because of the discipline with which they manage their codebase's structure & documentation, other study participants are actually getting worse at non-AI coding.

...and METR's study can't tell which is which because METR's study isn't using any sort of codebase quality metrics for grounding.

reply
arctic-true 2 days ago
Those developer quotes are tough to read. Rate limits are going to hit like a truck when the labs eventually need to make a profit.
reply
vessenes 18 hours ago
For the thousandth time - they. make. a. profit. Inference margin is over 60%, today.

They are spending that money training ever-larger models, so they are cashflow negative, but under almost any sane GAAP treatment that does not allow one to write down all R&D upfront (capital costs of model training), they are profitable.

Should this matter to you? Only if you're making financial decisions that assume that somehow one day the "jig will be up" - i.e. please don't short these stocks when they float, or at least do so very judiciously.

reply
scuff3d 17 hours ago
It always makes me laugh when people say this, because it's so utterly pointless. That percentage assumes literally no other costs exist besides the direct inference cost.

Even if they quit trying to make better models today, there are a mountain of recurring costs that will never go away. Retraining the models with new data, replacing/upgrading old hardware, enormous infrastructure costs related to maintaining the actual platforms, data collection costs, payroll...

I'm not aware of a single player in the LLM space actually turning a profit, even if they're only providing inference.

reply
vessenes 8 hours ago
Anthropic.

Listen carefully to Dario’s public statements; you could just pull his most recent Dwarkesh interview for example - worth a listen in any event.

He is guilty of an engineer’s use of the word profit when he says “we never made a profit.” But he always follows up with the real story — “every model we trained has returned 2-4x in free cashflow, counting R&D and inference”

You could say “the industry is engaged in possibly ruinous competition training ever-larger models and sucking cash to do so, and in fact if anyone stops, they’ll lose forever” and those statements might be true, but to be clear the fact that these companies are posting a loss right now is a FEATURE of how R&D works, one that lets them spend more on a race. It’s not tied to the sort of financial reality accrual accounting is designed to talk about.

reply
ben_w 13 hours ago
While true, the obvious counterpoint is that open-weight models exist, that high-end desktops can run them, that said hardware doesn't yet appear to have reached the end of the road for improvements to both purchase and operational costs, and that even if it had the moment people stop having VC money to constantly churn expensive training runs for new models it suddenly makes sense to etch the weights of whatever is SOTA at that point onto a silicon wafer and run it as a much more efficient hardware circuit without wasting the overhead that comes with software doing the same thing on general-purpose hardware.

Even if the bubble burst while I was writing this comment, even if every single current LLM provider goes the way of pets.com, AltaVista, and GeoCities, that can all happen without ending vibe coding.

reply
simonw 2 days ago
At this point the AI labs would pretty much have to form an illegal price fixing cartel in order to jack the prices up, they've been competing to drive down prices for so long.

They'd have to get the Chinese AI labs to go along with that price fixing too.

reply
paxys 21 hours ago
You don't need collusion, just the VC money drying up. Economic reality will set the base price.
reply
simianwords 17 hours ago
Why would vc money dry up?
reply
ben_w 14 hours ago
There's only so much of it to spend before they run out.

I don't pretend to have detailed domain knowledge here, I may have seen other people's GenAI output rather than reality*, but the numbers people are throwing around for this stuff sum to trillions of USD, slightly higher than other (same caveat, perhaps also GenAI output*) claims I've seen about the total supply of money in the global venture capital markets.

* I miss the days when I could make a decent guess as to which websites were reliable and which were BS

reply
arctic-true 24 hours ago
They’d have an entire country of geniuses prepared to defend against the antitrust allegations, who’s to stop them? /s
reply
azan_ 19 hours ago
Keep in mind that they make large profit on inference. Not enough to make up for losses on training but it won’t be a problem for Chinese labs which will just steal their weights.
reply
scuff3d 17 hours ago
Given that they built their businesses on widespread copyright infringement and license violations, I couldn't give less of a shit about people turning around and "stealing" from them
reply
daxfohl 24 hours ago
"I don't want to do this without AI" sounds like we're already well into the brain atrophy stage of this. Now what? (I'd think about it myself but....)
reply
marcosdumay 24 hours ago
"I avoid issues like AI can finish things in just 2 hours, but I have to spend 20 hours. I will feel so painful if the task is decided as AI-disallowed."

Which really doesn't match the results they got, where developers became at most twice as productive in the best-case scenario.

There's surely something scary there. And the lack of people ambivalent about AI isn't necessarily a sign it's as well accepted as they think; it can just as easily be caused by polarization.

reply
falcor84 23 hours ago
I'm pretty sure that this was exactly the response to the first generation of devs who insisted on coding with a terminal instead of submitting punch cards like "real programmers".
reply
marcosdumay 20 hours ago
Hum... People have this reaction in all kinds of situations, but I don't think programmers ever reacted like that to a large change in abstraction.
reply
mock-possum 21 hours ago
I don’t want to do work around the house without a fully charged battery for my ryobi either. I don’t want to go on a grocery run without my car. Using tools is not brain atrophy
reply
inigyou 9 hours ago
Of course, those are muscle replacement tools, not brain replacement tools. Getting groceries with your car is leg muscle atrophy and sawing with a power saw is arm muscle atrophy.
reply
bitwize 24 hours ago
AI will soon be an intrinsic part of the job. Now what? "Get your thumb out of your ass and learn [how to use AI]." —Eric S. Raymond
reply
softwaredoug 2 days ago
I'm a bit perplexed by the developer selection effects.

I get that developers want to use AI. But are they also claiming there's not still a no/low-AI population of developers? Or that their means of selection don't find these developers?

Are they worried that by splitting devs into groups of AI experience they might be measuring some confounder that causes people to choose AI / not AI in their careers?

reply
sgillen 2 days ago
The study was designed to have devs who are comfortable with AI perform 50% of tasks with AI and 50% without. So the problem is the population of "Developers who use AI regularly but are willing to do tasks without AI" is shrinking.

>> Are they worried that by splitting devs into groups of AI experience they might be measuring some confounder that causes people to choose AI / not AI in their careers?

The developer sample size was small (16 people in the original study) and the task sample size is larger (~250 tasks). I think the worry is variance in developer productivity would totally wash out any signal.

reply
dsr_ 12 hours ago
An alternative hypothesis might be "Developers who consistently use AI become unable to work without AI". It used to be well known that after a year or two away from writing code, a new manager would be a much worse dev than previously. Is a similar sort of skill shift happening? If we raise a cohort of new devs who never work without AI, do they never gain the ability?
reply
selridge 2 days ago
Here is my read:

Developers are refusing to complete the survey or selecting themselves out because they (apparently) don’t want to complete the non-AI task.

They also saw selection effects from a large reduction in the pay for the study (which is an unfortunate confounder here), $150/hr -> $50/hr.

They guess this makes their estimates lower bounds, but the selection effect is complicated (which they acknowledge).

Overall this is a hard problem for them in the current state. It will be challenging to produce convincing year over year analysis under these conditions.

reply
camgunz 2 days ago
Unless this measures the entire SDLC longitudinally (like say, over a year) I'm not interested. I too can tell Claude Code to do things all day every day, but unless we have data on the defect rate it doesn't matter at all.
reply
pgwhalen 20 hours ago
I really am quite in awe of Claude Code recently, so definitely not a naysayer, but this is a really important point. It’s so easy to create code, but am I shipping that much more to prod than I used to? A bit more.

Obviously this highly depends on your company and your setup and risk tolerance and what not.

reply
camgunz 14 hours ago
I mean, Brooks' Mythical Man-Month says this explicitly: adding more programmers makes projects later because of coordination costs, which we haven't figured out (coordination isn't parallelization between agents, it's "oh we discovered this problem; we need to go back to design" and so on).
reply
falcor84 23 hours ago
Do any of those companies collect and share data on their defect rates to give you a baseline to compare against?
reply
camgunz 15 hours ago
That's my point. It's true codegen models generate code faster than humans do. Important remaining questions are:

* How do we scale up the other parts of the SDLC (planning, feasibility analysis, design, testing, deployment, maintenance)?

* What parts--if any--of the SDLC now take more or less time? Ex: we've seemingly cut down implementation time; does that come at the cost of maintenance, and if so is it still net worth it? Do we need to hire more designers, or do more user research?

The entire world is declaring "this is the future", but we don't even have simple data like "does this produce better code".

reply
sgillen 2 days ago
This is very interesting because I see a lot of AI detractors point to the original study as proof that AI is overhyped and nothing to worry about. In this new study the findings are essentially reversed (20% slowdown to 20% speedup).
reply
selridge 2 days ago
I think their old findings were hard to treat as gospel just due to the kind of comparison + the sample, but this new result is probably much noisier.

It’s hard to make reliable, directional assumptions about the kind of self-selection and refusal they saw, even without worrying about the reward dropping 66%.

reply
fxwin 23 hours ago
fwiw i think the interesting part about the original study wasn't so much the slowdown part, but the discrepancy between perceived and measured speedup/slowdown (which is the part i used to bring up frequently when talking to other devs)
reply
simonw 2 days ago
AI detractors loved that previous study so much. It seems to have been brought up in the majority of conversations about AI productivity over the past six months.

(Notable to me was how few other studies they cited, which I think is because studies showing AI productivity loss are quite uncommon.)

reply
smohare 23 hours ago
Or maybe there’s just not that many good studies, period?

A lot of them barely rise above the level of collected anecdote, nor explore long term or more elusive factors (such as cross-system entropy). They’re also targeting an area that is fairly difficult to measure and control for.

reply
ej88 2 days ago
not enough people look at the slope, just the coords
reply
Bnjoroge 2 days ago
never been a better time to be a swe who doesn't use, or significantly limits the use of, AI agents
reply
vessenes 18 hours ago
I like this. I've bought a lot of CNC flatpack furniture in my day, and also employed a number of excellent cabinet makers. Room for both.
reply
Krei-se 23 hours ago
great to see that wisdom and sanity are still found on yc
reply
hyfgfh 20 hours ago
What worries me is that LLMs are becoming a crutch for overworked engineers. But instead of reducing the workload, it has also increased expectations and consequently brought more aggressive deadlines, making it all worse overall
reply
tonymet 22 hours ago
> "AI tools lead to worse productivity"

> The subjects are using ChatGPT 2.5 and copy-pasting code.

The reason AI hype seems to be so bipolar is that "AI" isn't one thing. Hundreds of models, dozens of tools. And to get something done well, a seasoned engineer needs to master half a dozen at a time.

reply