What's Anthropic's optimization target? Getting you the right answer as fast as possible! The variability in agent output is working against that goal, not serving it. If they could make it right 100% of the time, they would, and the "slot machine" nonsense would disappear entirely. On capped plans, both you and Anthropic are incentivized to minimize interactions, not maximize them. That's the opposite of a casino. It's ... alignment (of a sort).
An unreliable tool that the manufacturer is actively trying to make more reliable is not a slot machine. It's a tool that isn't finished yet.
I've been building a space simulator for longer than some of the people diagnosing me have been programming. I built things obsessively before LLMs. I'll build things obsessively after.
The pathologizing of "person who likes making things chooses making things over Netflix" requires you to treat passive consumption as the healthy baseline, which is obviously a claim nobody in this conversation is bothering to defend.
What makes you believe this? The current trend among all major providers seems to be: get you to spin up as many agents as possible, so that you get billed more and their request counts go up.
> Slot machines have variable reward schedules by design
LLMs from all major providers are tuned with RLHF, which optimizes them in ways we don't entirely understand to keep you engaged.
These are incredibly naive assumptions. Anthropic/OpenAI/etc don't care whether you get your answer quickly; they care that you keep paying and that all their numbers go up. They aren't doing this as a favor to you, and there's no reason to believe these systems are optimized in your interest.
> I built things obsessively before LLMs. I'll build things obsessively after.
The core argument of the "gambling hypothesis" is that many of these people aren't really building things. To be clear, I certainly don't know if this is true of you in particular, it probably isn't. But just because this doesn't apply to you specifically doesn't mean it's not a solid argument.
Well stated
Simply, cut-throat competition. Given that multiple nations are funding different AI labs, quality of output and speed are among the most important things.
The goal of any business is, in principle, profit; by your terms, all of them are misaligned.
The fact of the matter is that customers are receiving value, and that value has been a good proxy for which companies will grow to be successful and which will fail.
> You're just being pedantic and cynical.
What he means is that your point does not align with his narrow world view, and he's labelling you as a pedant and a cynic to justify writing off your opinion altogether.
It's a projection of his fragile world view. Don't take it personally.
Private companies will turn towards the best, fastest, cheapest (or some average of them). Country borders don’t really matter. All labs are fighting to get the best thing out to the public for that reason, because winning comes with money, status, prestige, and actually changing the world. Incentives like these are rare.
Winner takes what exactly? They can rip off react apps quicker than everyone else? How terrifying.
That said, the universe doesn't obligate us to treat the cosmos as all about competition. Cooperation is always a viable path, often with far greater long-term benefits at scale.
Competition is superfluous, self-inflicted masochism.
Assuming that is still true, then they absolutely have an incentive to keep your tokens/requests to the absolute minimum required to solve your problem and wow you.
I was surprised when I saw that Cursor added a feature to set the number of agents for a given prompt. I figured it might be a performance thing - fan out complex tasks across multiple agents that can work on the problem in parallel and get a combined solution. I was extremely disappointed when I realized it's just "repeat the same prompt to N separate agents, let each one take a shot and then pick a winner". Especially when some tasks can run for several minutes, rapidly burning through millions of tokens per agent.
At that point it's just rolling dice. If an agent goes so far off-script that its result is trash, I would expect that to mean I need to rework the instructions and context I gave it, not that I should try the same thing again and hope that entropy fixes it. But editing your prompt offline doesn't burn tokens, so it's not what makes them money.
The best-of-N feature is a bit like rolling N dice instead of one. But it can be quite useful if you use different models with different strengths and weaknesses (e.g. Claude/GPT-5/Gemini), rather than assigning all to N instances of Claude, for example. I like to use this feature in ask mode when diving into a codebase, to get an explanation a few different ways.
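For anyone curious, the fan-out pattern itself is simple. Here's a rough Python sketch; `call_model` and `score_candidate` are hypothetical placeholders for whatever provider APIs and judging strategy you actually use, not any vendor's real interface:

```python
from concurrent.futures import ThreadPoolExecutor

MODELS = ["claude", "gpt-5", "gemini"]  # assumed model identifiers for illustration

def call_model(model: str, prompt: str) -> str:
    """Placeholder for a real API call to the given provider."""
    raise NotImplementedError

def score_candidate(answer: str) -> float:
    """Placeholder: e.g. run tests, lint, or ask another model to judge."""
    raise NotImplementedError

def best_of_n(prompt: str) -> str:
    # Fan the same prompt out to N different models in parallel...
    with ThreadPoolExecutor(max_workers=len(MODELS)) as pool:
        answers = list(pool.map(lambda m: call_model(m, prompt), MODELS))
    # ...then keep whichever candidate scores best.
    return max(answers, key=score_candidate)
```

The interesting design choice is the scoring step: with different models you get genuinely different perspectives to compare, whereas N copies of the same model mostly just re-rolls the dice.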
Now companies are fighting for the attention of a finite number of customers, so they keep their prices in line with those around them.
I remember when Google started with PPC - because few companies were using it, it cost a fraction of recent prices.
And the other issue to solve is a future lack of electricity for land data centers. If everyone wants to use LLMs but data center capacity is finite due to available power, token prices can go up. But IMHO devs will find innovative, less energy-demanding approaches to tokens, so token prices will probably stay low.
Gemini increased the same Flash's price by something like 5x IIRC when it got better.
I think branding is the entire game.
My illiterate, LLM-addict cousin is convinced that Claude is the answer to the ultimate question of life, the universe, and everything.
Criticisms of the code he (read: Claude) generates are not relevant to him -- Claude is the most intelligent being to ever exist, therefore, to critique its output is a naive waste of breath.
They charge what the market will bear.
If "what the market will bear" is lower than the cost of production then they will stop offering it.
Intermittent variable rewards, whether produced by design or merely as a byproduct, will induce compulsive behavior, no matter the optimization target. This applies to Claude.
Does this mean I should not garden because it's a variable reward? Of course not.
Sometimes I will go out fishing and I won't catch a damn thing. Should I stop fishing?
Obviously no.
So what's the difference? What is the precise mechanism here that you're pointing at? Because by that logic, "sometimes life is disappointing" is a reason to do nothing. And yet.
Anthropic's optimization target is getting you to spend tokens, not produce the right answer. It's to produce an answer plausible enough but incomplete enough that you'll continue to spend as many tokens as possible for as long as possible. That's about as close to a slot machine as I can imagine. Slot rewards are designed to keep you interested as long as possible, on the premise that you _might_ get what you want, the jackpot, if you play long enough.
Anthropic's game isn't limited to a single spin either. The small wins (small prompts with well-defined answers) prop you up through the big losses (trying to one-shot a whole production-grade program).
The majority of us are using their subscription plans with flat rate fees.
Their incentive is the precise opposite of what you say. The less we use the product, the more they benefit. It's like a gym membership.
I think all of the gambling-addiction analogies in this thread are so strained that I can't take them seriously. The basic facts aren't even consistent with the real situation.
I swear this whole conversation is motivated reasoning from AI holdouts who so desperately want to believe everybody else is getting scammed by a gambling scheme that they don't stop and think about the situation rationally. Insofar as Claude is dominant, it's only because Claude works the best. There is meaningful competition in this market; as soon as Anthropic drops the ball they'll be replaced.
Not so for functionally code-illiterate addicts. If they need to maintain the illusion that they know how to write code (at the level of their coworkers), they will easily pay $200 to keep the machine spinning.
https://apxml.com/models/glm-5
To run GLM-5 you need access to many, many consumer grade GPUs, or multiple data center level GPUs.
>They will likely get cheaper to run over time as well (better hardware).
Unless they magically solve the problem of chip scarcity, I don't see this happening. VRAM is king, and to have more of it you have to pay a lot more. Let's use the RTX 3090 as an example. This card is ~6 years old now, yet it still runs you around $1.3k. If you wanted to run GLM-5 at I4 quantization (the lowest listed in the link above) with a 32k context window, you would need *32 RTX 3090s*. At ~$1.3k each, that's roughly $42k you'd be spending on obsolete silicon. If you wanted to run this on newer hardware, you could reasonably expect to multiply that number by 2.
Also, how much bang for the buck do those 3090s actually give you compared to enterprise-grade products?
they want me to not spend tokens. that way my subscription makes money for them rather than costing them electricity and degrading their GPUs
If you're on anything but their highest tier, it's not altogether unreasonable for them to optimize for the greatest number of plan upgrades (people who decide they need more tokens) while minimizing cancellations (people frustrated by the number of tokens they need). On the highest tier, this sort of falls apart but it's a problem easily solved by just adding more tiers :)
Of course, I don't think this is actually what's going on, but it's not irrational.
Understood.
> they want me to not spend tokens.
No, they want you to expand your subscription. Maybe buy 2x subscriptions.
I mean this only works if Anthropic is the only game in town. In your analogy if anyone else builds a casino with a higher payout then they lose the game. With the rate of LLM improvement over the years, this doesn't seem like a stable means of business.
So, if there's a way to get people addicted to AI conversations, that's an excellent way to make money even if you are way behind your competitors, as addicted buyers are much more loyal that other clients.
or a hobbyist gardener?
Dealing with organic and natural systems will, most of the time, have a variable reward. The real issue comes from systems and services designed to only be accessible through intermittent variable rewards.
Oh, and don't confuse Claude's artifacts working most of the time with Anthropic actually optimizing for that. They're optimizing to ensure token usage; i.e., LLMs have been fine-tuned to default to verbose responses. Verbose responses are impressive to less experienced developers, make certain types of errors (e.g., improper typing) easier to spot, and make you use more tokens.
This is an incorrect understanding of intermittent variable reward research.
Claims that it "will induce compulsive behavior" are not consistent with the research. Most rewards in life are variable and intermittent and people aren't out there developing compulsive behavior for everything that fits that description.
There are many counter-examples, such as job searching: It's clearly an intermittent variable reward to apply for a job and get a good offer for it, but it doesn't turn people into compulsive job-applying robots.
The strongest addictions to drugs also have little to do with being intermittent or variable. Someone can take a precisely measured abuse-threshold dose of a drug on a strict schedule and still develop compulsions to take more. Compulsions at a level that eclipse any behavior they'd encounter naturally.
Intermittent variable reward schedules can be a factor in increasing anticipatory behavior and rewards, but claiming that they "will induce compulsive behavior" is a severe misunderstanding of the science.
The variability in, e.g., soccer kicks or basketball throws is also there, but clearly there is a skill element and a potential for progress. Same with many other activities. Coding with LLMs is not so different. There are clearly ways you can do it better, and it's not pure randomness.
There is absolutely no incentive to do that, for any of these companies. The incentive is to make the model just bad enough you keep coming back, but not so bad you go to a competitor.
We've already seen this play out. We know Google made their search results worse to drive up ad revenue. The exact same incentives are at play here, only worse.
IF I USE FEWER TOKENS, ANTHROPIC GETS MORE MONEY! You are blindly pattern matching to "corporation bad!" without actually considering the underlying structure of the situation. I believe there's a phrase for this to do with probabilistic avians?
You’re right on the money: the important thing to look at are the incentive structures.
Basically all tech companies from the post-great-financial-crisis expansion (Google, post-Ballmer Microsoft, Twitter, Instagram, Airbnb, Uber, etc.) started off user-friendly but all eventually converged towards their investment incentive structure.
One big exception is Wikipedia. Not surprising since it has a completely different funding model!
I’m sure Anthropic is super user-friendly now, while they are focused on expansion and the founding devs still have concentrated political sway. It will eventually converge on its incentive structures to extract profit for shareholders, like all other companies.
And the incentive is even stronger for the lower tiers. They want answers to be just good enough to keep you using it, but bad enough that you're pushed towards buying the higher tier.
https://www-cdn.anthropic.com/58284b19e702b49db9302d5b6f135a...
(cmd-f "slot machine")
Are you totally sure they are not measuring/optimizing engagement metrics? Because at least I can bet OpenAI is doing that with every product they have to offer.
That is a generous interpretation. Might be correct. But they don't make as much money if you quickly get the right answer. They make more money if you spend as many tokens as possible being on that "maybe next time" hook.
I'm not saying they're actually optimizing for that. But Charlie Munger said, "Show me the incentives, and I'll show you the outcome."
This didn't used to be the case, so I assume that it must be intentional.
"Should I add oregano to brown beans or would that not taste good?"
"Great instinct! Based on your interests in building new apps and learning new languages, you are someone who enjoys discovering new things, and it makes sense that you'd want to experiment with new flavor profiles as well. Your combination of oregano and brown beans is a real fusion of Italian and Mexican food, skillfully synthesizing these two cultures.
Here's a list of 5 random unrelated spices you can also add to brown beans:
Also, if you want to, I can create a list of other recipes that incorporate oregano. Just say the words "I am hungry" and I will get right back to it!"
Also, random side note, I hate ChatGPT asking me to "say the word" or "repeat the sentence". Just ask me if I want it and then I say yes or no, I am not going to repeat "go oregano!" like some sort of magic keyphrase to unlock a list of recipes.
The analogy was too strained to make sense.
Despite being framed as a helpful plea to gambling addicts, I think it’s clear this post was actually targeted at an anti-LLM audience. It’s supposed to make the reader feel good for choosing not to use them by portraying LLM users as poor gambling addicts.
Even a perfect LLM will not be able to produce perfect outputs because humans will never put in all the context necessary to zero-shot any non-trivial query. LLMs can't read your mind and will always make distasteful assumptions unless driven by users without any unique preferences or a lot of time on their hands to ruminate on exactly how they want something done.
I think it will always be mostly boring back-and-forth until the jackpot comes. Maybe future generations will align their preferences with the default LLM output instead of human preferences in that domain, though.
yeah I think the bluesky embed is much more along the lines of what I'm experiencing than the OP itself.
I found it interesting that Google removed the "summary cards," supposedly "to improve user experience," yet the AI overview was added back.
I suspect the AI overview is much more influenceable by advertisement money than the summary cards were.
This is subtly different. It's not clear that the people depicted like making things, in the sense of enjoying the process. The narrative is about LLMs fitting into the already-existing startup culture. There's already a blurry boundary between "risky investment" and "gambling", given that most businesses (of all types, not just startups) have a high failure rate. The socially destructive characteristic identified here is: given more opportunity to pull the handle on the gambling machine, people are choosing to do that at the expense of other parts of their life.
But yes, this relies on a subjective distinction between "building, but with unpredictable results" and "gambling, with its associated self-delusions".
Wait, what? Anthropic makes money by getting you to buy and expend tokens. The last thing they want is for you to get the right answer as fast as possible. They want you to sometimes get the right answer unpredictably, but with enough likelihood that this time will work that you keep hitting Enter.
In an environment where providers are almost entirely interchangeable and the tiniest of perceived edges (because there's still no benchmark unambiguously judging which model is "better") makes or breaks user retention, I just don't see how it's not ludicrous on its face that any LLM provider would be incentivized to give unreliable answers at some high-enough probability.
I think their greater argument was to highlight how agentic coding is eroding work life balance, and that companies are beginning to make that the norm.
Ideas are a dime a dozen, now proofs of concept are a load of tokens a dozen.
If Dave the developer is paying, Dave is incentivized to optimize token use along with Anthropic (for the different reasons mentioned).
If Dave's employer, Earl, is paying and is mostly interested in getting Dave to work more, then what incentive does Dave have to minimize tokens? He's mostly incentivized by Earl to produce more code, and now also by Anthropic's accidentally variable-reward coding system, to code more... ?
Trust me, we all feel like the house is our friend until it isn't!
To the bluesky poster's point: Pulling out a laptop at a party feels awkward for most; pulling out your phone to respond to claude barely registers. That’s what makes it dangerous: It's so easy to feel some sense of progress now. Even when you’re tired and burned out, you can still make progress by just sending off a quick message. The quality will, of course, slip over time; but far less than it did previously.
Add in a weak labor market and people feel pressure to stay working all the time. Partly because everyone else is (and nobody wants to be at the bottom of the stack ranking), and partly because it’s easier than ever to avoid hitting a wall by just "one more message". Steve Yegge's point about AI vampires rings true to me: A lot of coworkers I’ve talked to feel burned out after just a few months of going hard with AI tools. Those same people are the ones working nights and weekends because "I can just have a back-and-forth with Claude while I'm watching a show now".
The likely result is the usual pattern for increases in labor productivity. People who can’t keep up get pushed out, people who can keep up stay stuck grinding, and companies get to claim the increase in productivity while reducing expenses. Steve's suggestion of shorter workdays sounds nice in theory, but I would bet significant amounts of money the 40-hour work week remains the standard for a long time to come.
This isn't generally true at all. The "all tech companies are going to 996" meme comes up a lot here but all of the links and anecdotes go back to the same few sources.
It is very true that the tech job market is competitive again after the post-COVID period where virtually nobody was getting fired and jobs were easy to find.
I do not think it's true that the median or even 90th percentile tech job is becoming so overbearing that personal time is disappearing. If you're at a job where they're trying to normalize overwork as something everyone is doing, they're just lying to you to extract more work.
It starts with people who feel they’ve got more to lose (like those supporting a family) working extra to avoid looking like a low performer, whether that fear is reasonable or not. People aren’t perfectly rational, and job-loss anxiety makes them push harder than they otherwise would. Especially now, when "pushing harder" might just mean sending chat messages to claude during your personal time.
Totally anecdotal (strike 1), and I'm at a FAANG which is definitely not the median tech job (strike 2), but it’s become pretty normal for me to come back Monday to a pile of messages sent by peers over the weekend. A couple years ago even that was extremely unusual; even if people were working on the weekend they at least kept up a facade that they weren't.
It's more like being hooked on a slot machine which pays out 95% of the time because you know how to trick it.
(I saw "no actual evidence pointing to these improvements" with a footnote and didn't even need to click that footnote to know it was the METR thing. I wish AI holdouts would find a few more studies.)
Steve Yegge of all people published something the other day that has similar conclusions to this piece - that the productivity boost for coding agents can lead to burnout, especially if companies use it to drive their employees to work in unsustainable ways: https://steve-yegge.medium.com/the-ai-vampire-eda6e4f07163
Yeah I really feel that!
I recently learned the term "cognitive debt" for this from https://margaretstorey.com/blog/2026/02/09/cognitive-debt/ and I think it's a great way to capture this effect.
I can churn out features faster, but that means I don't get time to fully absorb each feature and think through its consequences and relationships to other existing or future features.
But from what I've seen validating both my own and others' coding agents' outputs, I'd estimate a much lower percentage (Data Engineering/Science work). And, oh boy, some colleagues are hooked on generating no matter the quality. Workslop is a very real phenomenon.
I was really impressed with how it parsed the structured checklist. I was not at all impressed by how it digested the paper. Lots of disguised errors.
After I read your comment, I gave Codex 5.3 the task of setting up an E2E testing skeleton for one of my repos, using Playwright. It worked for probably 45 minutes and in the end failed miserably: out of the five smoke tests it created, only two of them passed. It gave up on the other three and said they will need “further investigation”.
I then stashed all of that code and gave the exact same task to Opus 4.5 (not even 4.6), with the same prompt. After 15 mins it was done. Then I popped Codex’s code from the stash and asked Opus to look at it to see why three of the five tests Codex wrote didn’t pass. It looked at them and found four critical issues that Codex had missed. For example, it had failed to detect that my localhost uses https, so the E2E suite’s API calls from the Vue app kept failing. Opus also found that the two passing tests were actually invalid: they checked for the existence of a div with #app and simply assumed it meant the Vue app booted successfully.
This is probably the dozenth comparison I’ve done between Codex and Opus. I think there was only one scenario where Codex performed equally well. Opus is just a much better model in my experience.
I think it's much more preferable to pick the most reliable one and use it as the primary model, and think of others as fallbacks for situations where it struggles.
see how perplexity does it: https://www.perplexity.ai/hub/blog/introducing-model-council
There's also this article on hbr.org https://hbr.org/2026/02/ai-doesnt-reduce-work-it-intensifies...
This is a real thing, and it looks like classic addiction.
Claude Code wasting my time with nonsense output one in twenty times seems roughly correct. The rest of the time it's hitting jackpots.
Right but the <100% chance is actually why slot machines are addictive. If it pays out continuously the behaviour does not persist as long. It's called the partial reinforcement extinction effect.
“It’s not like a slot machine, it’s like… a slot machine… that I feel good using”
That aside if a slot machine is doing your job correctly 95% of the time it seems like either you aren’t noticing when it’s doing your job poorly or you’ve shifted the way that you work to only allow yourself to do work that the slot machine is good at.
I think you are mistaken on what the "payout" is. There's only one reason someone is working all hours and during a party and whatnot: it's to become rich and powerful. The payout is not "more code", it's a big house, fast cars, beautiful women etc. Nobody can trick it into paying out even 1% of the time, let alone 95%.
Not everyone gets hooked on those, but I do. I've played a bunch of those long-winded idle games, and it looks like a slight addiction. I would get impatient that it takes so long to progress, and it would add anxiety to e.g. run this during breaks at work, or just before going to sleep. "Just one more click".
And to be perfectly honest, it seems like the artificial limits of Anthropic (5-hour session limits) tap into a similar mechanism. I spend less time on non-programming hobbies since I got myself a subscription.
If you are unfamiliar with the various ways that naive code would fail in production, you could be fooled into thinking generated code is all you need.
If you try to hold the hand of the coding agents to bring code to a point where it is production ready, be prepared for a frustrating cycle of models responding with ‘Fixed it!’ while only having introduced further issues.
Tests, linting, guidance in response to key events (Claude Code hooks are great for this), automatically passing the agent’s code plan to another model invocation and then passing back whatever feedback that model has on the plan so you don’t have to point out the same flaws over and over... custom scripts that iterate over your codebase for antipatterns (they can walk the AST or be regex-based - ask your agent to write them!)
Codify everything you’re looping back to your agent about and make it a guardrail. Give your agent the tools it needs to give itself grounding.
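As one concrete example of the "custom scripts for antipatterns" idea above, here's a minimal regex-based sketch; the patterns and the `src/` path are illustrative assumptions, not anything from the parent comment:

```python
import re
import sys
from pathlib import Path

# Illustrative antipatterns; tailor these to whatever you keep correcting the agent on.
ANTIPATTERNS = {
    r"except\s*:\s*$": "bare except clause",
    r"\bprint\(": "stray print() left in library code",
    r"#\s*TODO": "unresolved TODO",
}

def scan(root: str = "src") -> int:
    findings = 0
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            for pattern, message in ANTIPATTERNS.items():
                if re.search(pattern, line):
                    print(f"{path}:{lineno}: {message}")
                    findings += 1
    return findings

if __name__ == "__main__":
    # Non-zero exit lets a hook or CI step feed the findings back to the agent.
    sys.exit(1 if scan() else 0)
```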
An agent without guardrails or grounding is like a person unconnected to their senses: disconnected from the world, all you do is dream - in a dream anything can happen, there’s nothing to ensure realism. When you look at it that way it’s a miracle coding agents produce anything useful at all :)
My paraphrase of their caveats:
- experts on their own open source proj are not representative of most software dev
- measuring time undervalues trading time for effort
- tools are noticeably better than they were a year ago when the study was conducted
- it really does take months of use to get the hang of it (or did then, less so now)
Before you respond to these points, please look at the full study’s treatment of the caveats! It’s fantastic, and it’s clear almost no one citing the study actually read it.
[0]: https://metr.org/blog/2025-07-10-early-2025-ai-experienced-o...
And to another point: work life balance is a huge challenge. Burnout happens in all departments, not just engineering. Managers can get burnout just as easily. If you manage AI agents, you'll just get burnout from that too.
It’s funny because this is what we do already at many jobs, but now it's just telling a computer to tell a computer what to do. A higher level of abstraction.
What's wild to me is that there's a whole other segment of people that treat tokens as, I dunno, some kind of malicious gatekeeping to the magical program generator. Some kind of endorphin rush of extracting functional code from a naive and poorly formed idea.
To the former group, the gambling metaphor is flatly ridiculous. The AI is a tool and tokens are your allocation for tool time. To the latter, someone is trying to stifle you and strangle your creativity behind arbitrary limits.
I don't know how to feel about this other than uneasy and worried.
The difference is that in gambling 'the house always wins', but in our case we do make progress towards our goal of conquering the world with our newly minted apps.
The situation where this comparison holds is when vibe coding leads nowhere and you don't accomplish anything but just burn through tokens.
watch as the hysteria passes, and just like the tv scare, nobody cares anymore in roughly 20 years or so. shame on all of you
how do you block video on your PC? or do you literally mean audiovisual information broadcast onto actual television sets is the evil?
In short, read a book.
- "you get hypnotized by the tube into a passive state of consumption"
- "Watch their slack jawed faces....and their attention stays glued"
both statements apply equally to books. read here if you don't believe me.
https://engines.egr.uh.edu/talks/what-people-said-about-book...
you've got a case of the feelies my friend
What? Your vibe coded slop is just going to be competing with someone else's vibe coded slop.
I do think it can be addictive, but there are many things that are addictive that aren't gambling.
I think a better analogy is something like extreme sport, where people can get addicted to the point it can be harmful.
Maybe someone can show me how you're supposed to do it, because I have seen no evidence that AI can write code at all.
When it works for pure generation it's beautiful, when it doesn't it's ruinous enough to make me take two steps back. I'll have another go at getting with all the pure agentic rage everyone's talking about soon enough.
Step 2: download Zed and paste in your API Key
Step 3: Give detailed instructions to the assistant, including writing ReadMe files on the goal of the project and the current state of the project
Step 4: stop the robot when it's making a dumb decision
Step 5: keep an eye on context size and start a new conversation every time you're half full. The more stuff in the context the dumber it gets.
I spent about 500 dollars and 16 hours of conversation to get an MVP static marketplace [0], a ruby app that can be crawled into static (and js-free!) files, without writing a single line of code myself, because I don't know ruby. This included a rather convoluted data import process, loading the database from XML files of a couple different schemas.
Only thing I had to figure out on my own was how to upload the 140,000 pages to cloudflare free tier.
Yeah I can't stop myself when I'm about to make a dumb decision, just look at my github repo. I ported Forth to a 1980s sampler and wrote DSP code on an 8-bit Arduino.
How am I going to stop a robot making dumb decisions?
Also, this all sounds like I'm doing a lot of skivvy work typing stuff in (which I hate) and not actually writing much code (which is the bit I like).
It is at this point where you can say “NONONO YOU ABSOLUTE DONKEY stop that we just want a FastAPI endpoint!!” And it will go “You’re absolutely right, I was over complicating this!”
I did waste about 20 minutes trying to do a recursive link following crawl (to write each rendered page to file), because Opus wanted to write a ruby task to do it. It wasn’t working so I googled it and found out link following is a built in feature of cURL…
If I wasn't already convinced that agentic tools were slot machines, here's a very strong argument in favor of that theory…
1. If you don't use it soon enough, they keep it (shame on them, do the things you need to in order to be a money transmitter, you have billions of dollars)
2. Pay-go with billing warning and limits. You can use Claude like this through Google VertexAI
when it's actually writing code it's pretty hands-off, unless you need to course correct to point it in a better direction
We’re well past the need to retry the same prompt multiple times in order to get working code. The models with their harnesses are properly agentic now, they can find the right context, make a plan, write the code, run the tests and fix the bugs with little to no intervention from a human.
The hardest part now is keeping up with them when it comes to approving the deliverables and updating the architecture and spec as new things are discovered by using the software. Not new bugs but corrections to your own assumptions you had before the feature was built.
The hard part is almost entirely management.
That’s something to seriously think about.
Touché.
I really cannot tell
But that's for personal pleasure. This post is receding from the concerns about "token anxiety," about the addiction to tokens. This post is about work culture & late-capitalism anxiety, about possible pressures & systems society might impose.
I reflect a lot on "AI doesn't reduce the work, it intensifies it." https://hbr.org/2026/02/ai-doesnt-reduce-work-it-intensifies... The spirit of this really nails something core for me. We coders especially get help with so much of the menial work now. Which means we spend a lot more time making intense analyses and critiques, and much more time doing the hard thought work of "is what we have here as good as it can be". Finding new references or patterns to feed back into the AI to steer already-working implementations towards better outcomes.
And my heart tells me that corporations & work life as we know it are almost universally just really awful about supporting reflective contemplative work like this. Work wants output. It doesn't want you sit in a hammock and think about it. But increasingly I tell you the key to good successful software is Hammock Driven Development. It's time to use our brains more, in quiet reflection. https://github.com/matthiasn/talk-transcripts/blob/master/Hi...
996 sounds like garbage on its own, as a system of toil. But I also very much respect an idea of continuous work, one that also intersperses rest throughout the day. Doing some chores or going to the supermarket or playing with the kid can be an incredibly good way to let your preconscious sift through the big gnarly problems about. The response to the intensity of what we have, to me, speaks of a need to spread out the work, to de-concentrate it, to build in more than hammock time. I was on the fence about whether the traditional workday deserved to survive before AI hit, and my feels about it being a gross mismatch have massively intensified since.
As I started my post with, I personally have a much more positive experience, with what, yes, feels like a token addiction. But it doesn't feel like an anxiety. It feels like the greatest, most exciting adventure, far beyond what I had ever hoped for in life. This is wildly fun, going far, far further out than I had ever hoped to get to see. I'm not "anxiously" pulling the lever arm on the token machine, I'm just thrilled to get to do it. To have time to reflect and decide, I have 3-8 things going at once (and probably double that back-burnered but open, on Niri rows!) to let myself make slower decisions, to analyze, while keeping the things that can safely move forwards moving forwards.
That also seems like something worker-exploitative late capitalism is mostly hot garbage at too! Companies really try to reduce in-flight activities. Sprint planning is about crafting deliberate work. But our freedom and agency here far outstrip these dusty old practices. It is anxiety-inducing to be so powerful, so capable, and to have a bureaucracy that constrains and confines, that wants only narrow windows of our use.
Also, shame on Tim Kellogg for not God damned linking the actual post he was citing. Garbagefire move. https://writing.nikunjk.com/p/token-anxiety https://news.ycombinator.com/item?id=47021136
I _kind_ of get this if we're talking about working on big, important, world-changing problems. If it's another SaaS app or something like that, I find it pretty depressing.
When people talk about leaving their agents to run overnight, what are those agents actually doing? The limited utility I've had using agent-supported software development requires a significant amount of hand-holding, maybe because I'm in an industry with limited externally available examples to build a model off of (though all of the specifications are public, I've yet to see an agent build an appropriate implementation).
So it's much more transactional...I ask, it does something (usually within seconds), I correct, it iterates again...
What sort of tasks are people putting these agents to? How are people running 'multiple' of these agents? What am I missing here?
I might run 3-4 Claude sessions because that's the only way to have "multiple chats" to e.g. ask unrelated things. Occasionally a task takes long enough to keep multiple sessions busy, but that's rather rare, and if it happens it's because the agent runs a long-running task like the whole test suite.
The story of running multiple agents to build full features in parallel... doesn't really add up in my experience. It kinda works for a bit if you have a green field project where the complexity is still extremely low.
However once you have a feature interaction matrix that is larger than say 3x3 you have to hand hold the system to not make stupid assumptions. Or you prompt very precisely but this also takes time and prevents you from ever running into the parallel situation.
The feature interaction matrix size is my current proxy "pseudo-metric" for when agentic coding might work well and at which abstraction level.
But so far that doesn't change the reality - I can't find any opportunities to let an agent run for more than 30 minutes at best, and parallel agents just seem to confuse each other.
E.g. imagine building a google docs clone where you have different formatting options. Claude would happily build bold and italic for you but if afterwards you add headings, tables, colors, font size, etc. It would just produce a huge if/else tree instead of building a somewhat sensible text formatting abstraction.
Tbf I wouldn't actually know how to build this myself but e.g. bold and italic work together but if you add a "code block" thing that should probably not work with font color and putting a table inside that also makes no sense.
Claude might get some of these interactions intuitively correct but at some point you'll have so many NxM interactions between features that it just forgets half of them and then the experience becomes sloppy and crashes on all edge cases.
The point of good software engineering is to simplify the matrix to something that you can keep arguing about e.g. classify formatting options into categories and then you only have to argue and think about how those categories interact.
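To make that concrete, here is a minimal illustrative sketch of what "arguing at the category level" instead of per-option if/else branches could look like; the categories and compatibility rules are invented for illustration, not taken from the parent comment:

```python
from enum import Enum, auto

class Category(Enum):
    INLINE_STYLE = auto()   # bold, italic, font color, font size
    BLOCK = auto()          # headings, code blocks, tables

# Each formatting option belongs to exactly one category.
OPTION_CATEGORY = {
    "bold": Category.INLINE_STYLE,
    "italic": Category.INLINE_STYLE,
    "font_color": Category.INLINE_STYLE,
    "code_block": Category.BLOCK,
    "table": Category.BLOCK,
}

# Interactions are decided per pair of categories (a 2x2 table here),
# not per pair of options (an ever-growing NxM if/else tree).
CATEGORY_COMPATIBLE = {
    (Category.INLINE_STYLE, Category.INLINE_STYLE): True,
    (Category.INLINE_STYLE, Category.BLOCK): False,  # e.g. no font color inside a code block
    (Category.BLOCK, Category.BLOCK): False,         # e.g. no table inside a code block
}

def compatible(a: str, b: str) -> bool:
    """Can formatting options a and b be combined on the same content?"""
    ca, cb = OPTION_CATEGORY[a], OPTION_CATEGORY[b]
    key = (ca, cb) if (ca, cb) in CATEGORY_COMPATIBLE else (cb, ca)
    return CATEGORY_COMPATIBLE[key]
```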
This is the kind of thing LLMs just aren't really good at if the problem space isn't in the training data already => doing anything remotely novel. And I haven't seen it improve at this either over the releases.
Maybe this kind of engineering will eventually be dead because claude can just brute force the infinitely growing if/else tree and keep it all in context but that does not seem very likely to me. So far we still have to think of these abstraction levels ourselves and then for the sub-problems I can apply agentic coding again.
Just need to make sure that Claude doesn't breach these abstractions, which it also happily does to take short cuts btw.
More seriously, what in the world "novel" physics device did you invent?
You "invented" ("Designed") a "device" "using physics", and nobody has designed that "device" before, making it novel.
"From first principles" is a fun statement because people like Aristotle also thought they were reasoning from "first principles" and look how far it got them. The entire point of science is that "first principles" are actually not something we have access to, so we should instead prioritize what literally happens and can be observed. It's not possible as far as we know to trick mother nature into giving us the answer we want rather than the real answer.
Did you ever actually build or test this "device"?
Outside that I'm juggling 2-3 sessions at most with nothing staying unattended for more than 10 minutes.
I came from embedded, where I wasn't able to use agents very effectively for anything other than quick round trip iterative stuff. They were still really useful, but I definitely could never envision just letting an agent run unattended.
But I recently switched domains into vaguely "fullstack web" using very popular frameworks. If I spend a good portion of my day going back and forth with an agent, working on a detailed implementation plan that spawns multiple agents, there is seemingly no limit* to the scope of the work they are able to accurately produce. This is because I'm reading through the whole plan and checking for silly gotchas and larger implementation mistakes before I let them run. It's also great because I can see how the work can be parallelized at certain parts, but blocked at others, and see how much work can be parallelized at once.
Once I'm ready, I can usually let it start with not even the latest models, because the actual implementation is so straightforwardly prompted that it gets it close to perfectly right. I usually sit next to it and validate it while it's working, but I could easily imagine someone letting it run overnight to wake up to a fresh PR in the morning.
Don't get me wrong, it's still more work than just "vibing" the whole thing, but it's _so_ much more efficient than actually implementing it, especially when it's a lot of repetitive patterns and boilerplate.
* I think the limit is how much I can actually keep in my brain and spec out in a well thought out manner that doesn't let any corner cases through, which is still a limit, but not necessarily one coming from the agents. Once I have one document implemented, I can move on to the next with my own fresh mental context which makes it a lot easier to work.
It's because you're not writing it, you adopted the role of Project Manager or Chief Engineer. How much cognitive debt are you accumulating?
Hope it helps!
This is definitely a way to keep those who wear Program and Project manager hats busy.
As I build with agents, I frequently run into new issues that aren't in scope for the task I'm on and would cause context drift. I have the agent create a GitHub issue with a short problem description and keep going on the current task. In another terminal I spin up a new agent and just tell it "investigate GH issue 123" and it starts diving in, finds the root cause, and proposes a fix. Depending on what parts of the code the issue fix touches and what other agents I've got going, I can have 3-4 agents more or less independently closing out issues/creating PRs for review at a time. The agents log their work in a work log - what they did, what worked, what didn't, problems they encountered using tools - and about once a day I have an agent review the work log and update the AGENTS.md with lessons learned.
If you have a loop set up, e.g., using OpenClaw or a Ralph loop, you can stretch that out further.
I would suggest that when you get to that point really, you want some kind of adversarial system set up with code reviews (e.g., provided by CodeRabbit or Sourcery) and automation to feed that back into the coding agent.
Is it possible? Yes, I've had success with having a model output a 100 step plan that tried to deconflict among multiple agents. Without re-creating 'Gas town', I could not get the agents to operate without stepping on toes. With _me_ as the grand coordinator, I was able to execute and replicate a SaaS product (at a surface level) in about 24hrs. Output was around 100k lines of code (without counting css/js).
Who can prove that it works correctly, though? An AI enthusiast will say "as long as you've got test coverage blah blah blah". Those who have worked on large-scale products know that tests passing is basically the bare minimum. So you smoke test it, hope you've got all the paths, and toss it up and try to collect money from people? I don't know. If _this_ is the future, it will collapse under the weight of garbage code, security and privacy breaches, and who knows what else.
Providing material for attention-grabbing headlines and blog posts, primarily. Can't (in good conscience, at least) claim you had an agent running all night if you didn't actually run an agent all night.
He is building a trading automation for personal use. In his design he gets a message on whatsapp/signal/telegram and approves/rejects the trade suggestion.
To define specifications for this, he defined multiple agents (a quant, a data scientist, a principal engineer, and trading experts - “warren buffett”, “ray dalio”) and let the agents run until they reached a consensus on what the design should be. He said this ran for a couple of hours (so not strictly overnight) after he went to sleep; in the morning he read and amended the output (10s of pages equivalent) and let it build.
This is not a strictly-defined coding task, but there are now many examples of emerging patterns where you have multiple agents supporting each other, running tasks in parallel, correcting/criticising/challenging each other, until some definition of “done” has been satisfied.
That said, personally my usage is much like yours - I run agents one at a time and closely monitor output before proceeding, to avoid finding a clusterfuck of bad choices built on top of each other. So you are not alone my friend :-)
Do some people just create complete SaaSlop apps with it overnight? Of course, just put together a plan (by asking the LLM to write the plan) with everything you want the app to do and let it run.
I can see the utility in creating very simple web-based tools where there's a monstrous wealth of public resources to build a model off of, but even the most recent models provided by Anthropic, OpenAI, or MSFT never seem to get things quite perfect. And every time I find an error I'm left wondering what other bugs I'm not catching.
I think of my agents like golems from Discworld: they are defined by their script. Adding texture to them improves the results, so I usually keep a running tally of what they have worked on and add that to the header. Each one is a prompt in a folder that a script loops over and sends to Gemini (spawning an agent and moving on to the next golem script).
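The loop itself can be tiny. A hedged sketch of the idea in Python, where the golems/ folder layout and the agent command are placeholders for whatever CLI you actually use, not the parent's real setup:

```python
import subprocess
from pathlib import Path

GOLEM_DIR = Path("golems")       # one prompt file per "golem"
AGENT_CMD = ["my-agent-cli"]     # placeholder: substitute your actual agent CLI here

def run_golems() -> list[subprocess.Popen]:
    procs = []
    for script in sorted(GOLEM_DIR.glob("*.md")):
        prompt = script.read_text()
        # Spawn an agent for this golem's prompt, then immediately move on to the next one.
        proc = subprocess.Popen(AGENT_CMD, stdin=subprocess.PIPE, text=True)
        proc.stdin.write(prompt)
        proc.stdin.close()
        procs.append(proc)
    return procs

if __name__ == "__main__":
    for proc in run_golems():
        proc.wait()  # optionally collect results once every golem has finished
```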
I was also curious to see if this could be used for developing some small games. Whenever I would run into a problem I couldn't be bothered to solve, or needed a variety of something, I would let a few LLMs work on it so in the morning I had something to bounce off. I had pretty good success with this for RTS games and shooting games, where variety is well documented and creativity is allowed. I imagine there could be a use here; I've been calling it dredging, because I imagine myself casting a net down into the slop to find valuables.
I did have an idea where all my sites and UI would be checked against some UI heuristic like Oregon State's inclusivity heuristic but results have been mixed so far. The initial reports are fine, the implementation plans are ok but it seems like the loop of examine, fix, examine... has too much drift? That does seem solvable but I have a concern that this is like two lines that never touch but get closer as you approach infinity.
There is some usefulness in running these guys all night but I'm still figuring out when its useful and when its a waste of resources.
In my case, I built a small API that claude can call to get tasks. I update the tasks on my phone.
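For anyone wanting to try the same pattern, here is a minimal sketch of what such a task API could look like. FastAPI is chosen arbitrarily, and the endpoints and fields are my own guesses for illustration, not the parent's actual design:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Task(BaseModel):
    id: int
    description: str
    status: str = "queued"  # queued -> in_progress -> done

# In-memory store; a phone-side UI would add tasks here via POST /tasks.
tasks: dict[int, Task] = {}

@app.post("/tasks")
def add_task(task: Task) -> Task:
    tasks[task.id] = task
    return task

@app.get("/tasks/next")
def next_task() -> Task | None:
    # The coding agent polls this endpoint to pick up its next queued task.
    return next((t for t in tasks.values() if t.status == "queued"), None)

@app.post("/tasks/{task_id}/status")
def update_status(task_id: int, status: str) -> Task:
    # The agent reports progress; the human reviews the result elsewhere.
    tasks[task_id].status = status
    return tasks[task_id]
```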
The assumption is that you have a semi-well-structured codebase already (ours is 1M LOC C#). You have to use languages with strong typing and a strict compiler. You have to force claude to frequently build the code (hence the CPU cores + RAM + NVMe requirement).
If you have multiple machines doing work, have a single one as the master and give claude ssh access to the others, and it can configure them and invoke work on them directly. The use case for this is when you have a beefy Proxmox server with many smaller containers (think .NET + Debian). Give the main server access to all the "worker servers". Let claude document this infrastructure too and the different roles each machine plays. Soon you will have a small ranch of AIs doing different things, on different branches, making pull requests and putting feedback back into the task manager for me to upvote or downvote.
Just try it. It works. Your mind will be blown what is possible.
We only give it very targeted tasks, no broad strokes. We have a couple of "prompt" templates, which we select when creating tasks. The new opus model one shots about 90% of tasks we throw at it. Getting a ton of value from diagnostic tasks, it can troubleshoot really quickly (by ingesting logs, exceptions, some db rows).
You can draw the line wherever you want. :) Personally, I wish I'd built a new gaming rig a year ago so I could mess with local models and pay all these same costs.
Generate material for yet another retarded twitter hype post.