Evaluating AGENTS.md: are they helpful for coding agents?
192 points by mustaphah 2 days ago | 154 comments

deaux 16 hours ago
I read the study. I think it does the opposite of what the authors suggest - it's actually vouching for good AGENTS.md files.

> Surprisingly, we observe that developer-provided files only marginally improve performance compared to omitting them entirely (an increase of 4% on average), while LLM- generated context files have a small negative effect on agent performance (a decrease of 3% on average).

This "surprisingly", and the framing seems misplaced.

For the developer-made ones: 4% improvement is massive! 4% improvement from a simple markdown file means it's a must-have.

> while LLM- generated context files have a small negative effect on agent performance (a decrease of 3% on average)

This should really be "while the prompts used to generate AGENTS files in our dataset..". It's a proxy for prompts, who knows if the ones generated through a better prompt show improvement.

The biggest usecase for AGENTS.md files is domain knowledge that the model is not aware of and cannot instantly infer from the project. That is gained slowly over time from seeing the agents struggle due to this deficiency. Exactly the kind of thing very common in closed-source, yet incredibly rare in public Github projects that have an AGENTS.md file - the huge majority of which are recent small vibecoded projects centered around LLMs. If 4% gains are seen on the latter kind of project, which will have a very mixed quality of AGENTS files in the first place, then for bigger projects with high-quality .md's they're invaluable when working with agents.

reply
nielstron 14 hours ago
Hey thanks for your review, a paper author here.

Regarding the 4% improvement for human written AGENTS.md: this would be huge indeed if it were a _consistent_ improvement. However, for example on Sonnet 4.5, performance _drops_ by over 2%. Qwen3 benefits most and GPT-5.2 improves by 1-2%.

The LLM-generated prompts follow the coding agent recommendations. We also show an ablation over different prompt types, and none have consistently better performance.

But ultimately I agree with your post. In fact we do recommend writing good AGENTS.md, manually and targetedly. This is emphasized for example at the end of our abstract and conclusion.

reply
vidarh 14 hours ago
Without measuring quality of output, this seems irrelevant to me.

My use of CLAUDE.md is to get Claude to avoid making stupid mistakes that will require subsequent refactoring or cleanup passes.

Performance is not a consideration.

If anything, beyond CLAUDE.md I add agent harnesses that often increase the time and tokens used many times over, because my time is more expensive than the agents.

reply
_joel 13 hours ago
CLAUDE.md isn't a silver bullet either, I've had it lose context a couple of questions deep. I do like GSD[1] though, it's been a great addition to the stack. I also use multiple, different LLMs as a judge for PRs, which captures a load of issues too.

[1] https://github.com/gsd-build/get-shit-done

reply
yorwba 11 hours ago
In this context, "performance" means "does it do what we want it to do" not "does it do it quickly". Quality of output is what they're measuring, speed is not a consideration.
reply
vidarh 2 hours ago
The point is that whether it does what you tell it in a single iteration is less important then whether it avoids stupid mistakes. Any serious use will put it in a harness.
reply
sdenton4 11 hours ago
You're measuring binary outcomes, so you can use a beta distribution to understand the distribution of possible success rates given your observations, and thereby provide a confidence interval on the observed success rates. This week help us see whether that 4% success rate is statistically significant, or if it is likely to be noise.
reply
bee_rider 10 hours ago
I’ve only ever gotten, like, slight wording suggestions from reviewers. I wish they would write things like this instead—it is possibly meaningful and eminently do-able (doesn’t even require new data!).
reply
sdenton4 7 hours ago
Taking a slightly closer look at the paper, you've got K repositories and create a set of test cases within each repository, totaling 130-ish tests. There may be some 'repository-level' effects - ie, tasks may be easier in some repo's than others.

Modeling the overall success rate then requires some hierarchical modeling. You can consider each repository as a weighted coin, and each test within a repository as flip of that particular coin. You want to estimate the overall probability of getting heads, when choosing a coin at random and then flipping it.

Here's some Gemini hints on how to proceed with getting the confidence interval using hierarchical bayes: https://gemini.google.com/corp/app/e9de6a12becc57f6

(Still no need for further data!)

reply
regularfry 10 hours ago
> Regarding the 4% improvement for human written AGENTS.md: this would be huge indeed if it were a _consistent_ improvement. However, for example on Sonnet 4.5, performance _drops_ by over 2%. Qwen3 benefits most and GPT-5.2 improves by 1-2%.

Ok so that's interesting in itself. Apologies if you go into this in the paper, not had time to read it yet, but does this tell us something about the models themselves? Is there a benchmark lurking here? It feels like this is revealing something about the training, but I'm not sure exactly what.

reply
nielstron 7 hours ago
It could... but as pointed out by other the significance is unclear and per-model results have even less samples than the benchmark average. So: maybe :)
reply
deaux 9 hours ago
Thank you for turning up here and replying!

> The LLM-generated prompts follow the coding agent recommendations. We also show an ablation over different prompt types, and none have consistently better performance.

I think the coding agent recommended LLM-generated AGENTS.md files are almost without exception really bad. Because the AGENTS.md, to perform well, needs to point out the _non_-obvious. Every single LLM-generated AGENTS.md I've seen - including by certain vendors who at one point in time out-of-the-box included automatic AGENTS.md generation - wrote about the obvious things! The literal opposite of what you want. Indeed a complete and utter waste of tokens that does nothing but induce context rot.

I believe this is because creating a good one consumes a massive amount of resources and some engineering for any non-trivial codebase. You'd need multiple full-context iterations, and a large number of thinking tokens.

On top of that, and I've said this elsewhere, most of the best stuff to put in AGENTS.md is things that can't be inferred from the repo. Things like "Is this intentional?", "Why is this the case?" and so on. Obviously, the LLM nor a new-to-the-project human could know this or add them to the file. And the gains from this are also hard to capture by your performance metric, because they're not really about the solving of issues, they're often about direction, or about the how rather than the what.

As for the extra tokens, the right AGENTS.md can save lots of tokens, but it requires thinking hard about them. Which system/business logic would take the agent 5 different file reads to properly understand, but can we summarize in 3 sentences?

reply
nielstron 7 hours ago
Yes that's a great summary and I agree broadly.

Note with different prompt types I refer to different types of meta-prompts to generate the AGENTS.md. All of these are quite useless. Some additional experiments not in the paper showed that other automated approaches are also useless ("memory" creating methods, broadly speaking).

reply
c0rleyma 2 hours ago
I will read the paper, but I am curious if the methods promoted by eng/researchers at openai for models like codex 5.2/5.3 work? ie, is having a separate agent look at recent agent sessions and deduce problems the agents ran into and update agents.md (or more likely, the indexed docs referenced in an agents.md) actually helpful? A priori that seems like the main kind of meta prompting/harness you might expect to work more robustly.
reply
giancarlostoro 12 hours ago
> The biggest usecase for AGENTS.md files is domain knowledge that the model is not aware of and cannot instantly infer from the project. That is gained slowly over time from seeing the agents struggle due to this deficiency.

This. I have Claude write about the codebase because I get tired of it grepping files constantly. I rather it just know “these files are for x, these files have y methods” and I even have it breakdown larger files so it fits the entire context window several times over.

Funnily enough this makes it easier for humans to parse.

reply
belval 11 hours ago
My pet peeve with AI is that it tends to work better in codebase where humans do well and for the same reason.

Large orchestration package without any tests that relies on a bunch of microservices to work? Claude Code will be as confused as our SDEs.

This in turns lead to broader effort to refactor our antiquated packages in the name of "making it compatible with AI" which actually means compatible with humans.

reply
giancarlostoro 7 hours ago
In my opinion it’s not just compatible with AI its code that now fits in your head. Lots of famous “we can rewrite it later” remarks throughout my career… Well the AI can rewrite it, and now you can understand it.

Always make it write out a plan, write out unit tests that match the codebase as-is, and if adjusted are only changed in how they call the code in the future, giving you confidence that the rewrite didn't break core logic.

reply
anamexis 8 hours ago
Why is that a pet peeve, though? Seems like a win/win.
reply
SerCe 12 hours ago
In Theory There Is No Difference Between Theory and Practice, While In Practice There Is.

In large projects, having a specific AGENTS.md makes the difference between the agent spending half of its context window searching for the right commands, navigating the repo, understanding what is what, etc., and being extremely useful. The larger the repository, the more things it needs to be aware of and the more important the AGENTS.md is. At least that's what I have observed in practice.

reply
bootsmann 16 hours ago
This reads a lot like bargaining stage. If agentic AI makes me a 10 times more productive developer, surely a 4% improvement is barely worth the token cost.
reply
koiueo 15 hours ago
> If agentic AI makes me a 10 times more productive

I'm not sure what you are suggesting exactly, but wanted to highlight this humongous "if".

reply
zero_k 14 hours ago
It's not only about the token cost! It's also my TIME cost! Much-much more expensive than tokens, it turns out ;)
reply
staticassertion 15 hours ago
If something makes you 10x as effective and then you improve that thing by 4%...
reply
croes 15 hours ago
10x is that quantity or quality?
reply
Ragnarork 15 hours ago
Also, "perceived" or "real"?
reply
zero_k 14 hours ago
Honestly, the more research papers I read, the more I am suspicious. This "surprisingly" and other hyperbole is just to make reviewers think the authors actually did something interesting/exciting. But the more "surprises" there are in a paper, the more I am suspicious of it. Often such hyperbole ought to be at best ignored, at worst the exact opposite needs to be examined.

It seems like the best students/people eventually end up doing CS research in their spare time while working as engineers. This is not the case for many other disciplines, where you need e.g. a lab to do research. But in CS, you can just do it from your basement, all you need is a laptop.

reply
MITSardine 12 hours ago
Well, you still need time (and permission from your employer)! Research is usually a more than full time job on its own.
reply
pgt 13 hours ago
4% is yuuuge. In hard projects, 1% is the difference between getting it right with an elegant design or going completely off the rails.
reply
Arifcodes 14 hours ago
The study measures the wrong thing. Task completion ("does the PR pass tests?") is a narrow proxy for what AGENTS.md actually helps with in production.

I run a system with multiple AI agents sharing a codebase daily. The AGENTS.md file doesn't exist to help the agent figure out how to fix a bug. It exists to encode tribal knowledge that would take a human weeks to accumulate: which directory owns what, how the deploy pipeline works, what patterns the team settled on after painful debates. Without it, the agent "succeeds" at the task but produces code that looks like it was written by someone who joined the team yesterday. It passes tests but violates every convention.

The finding that context files "encourage broader exploration" is actually the point. I want the agent to read the testing conventions before writing tests. I want it to check the migration patterns before creating a new table. That costs more tokens, yes. But reverting a merged PR that used the wrong ORM pattern costs more than 20% extra inference.

reply
Gigachad 12 hours ago
What are you putting in the file? When I’ve looked at them they just looked like a second readme file without the promotional material in a typical GitHub readme.
reply
Arifcodes 11 hours ago
The useful stuff is different from a README. A README tells humans how to use the project. An AGENTS.md tells the AI how to work on it.

Mine typically includes:

- Build/test commands that aren't obvious from package.json (e.g. "run migrations before tests") - Architecture decisions that would take the agent 10 minutes to reverse-engineer ("auth goes through middleware X, not controller Y") - Known gotchas ("don't touch the legacy billing module, it's being replaced next sprint") - Deploy process specifics ("push to main auto-deploys staging, prod needs a manual tag") - Coding conventions that aren't in the linter ("we use Result types for errors, never throw")

The ones that look like READMEs are indeed useless. The good ones read more like the notes you'd give a new senior engineer on their first day. Stuff that's obvious to the team but invisible to an outsider.

reply
francisofascii 10 hours ago
> The good ones read more like the notes you'd give a new senior engineer on their first day.

If the notes are meant for new developers, wouldn't those notes go into the actual readme.md file?

reply
CuriouslyC 11 hours ago
Seems like that is knowledge a human would want too, and you are dumping it in the agents file for lack of a clearer place to put it.
reply
0x696C6961 12 hours ago
That's basically all it is. It's a readme file that is guaranteed to be read. So the agent doesn't spend 10 minutes trying to re-configure the toolchain because the first command it guessed didn't work.
reply
pamelafox 18 hours ago
This is why I only add information to AGENTS.md when the agent has failed at a task. Then, once I've added the information, I revert the desired changes, re-run the task, and see if the output has improved. That way, I can have more confidence that AGENTS.md has actually improved coding agent success, at least with the given model and agent harness.

I do not do this for all repos, but I do it for the repos where I know that other developers will attempt very similar tasks, and I want them to be successful.

reply
viraptor 17 hours ago
You can also save time/tokens if you see that every request starts looking for the same information. You can front-load it.
reply
sebazzz 17 hours ago
Also take the randomness out of it. Sometimes the agent executing tests one way, sometimes the other way.
reply
Maxion 12 hours ago
I've found https://github.com/casey/just to be very very useful. Allows to bind common commands simple smaller commands that can be easily referenced. Good for humans too.
reply
NicoJuicy 16 hours ago
Don't forget to update it regularly then
reply
averrous 17 hours ago
Agree. I also found out that rule discovery approach like this perform better. It is like teaching a student, they probably have already performed well on some task, if we feed in another extra rule that they already well verse at, it can hinder their creativity.
reply
imiric 17 hours ago
That's a sensible approach, but it still won't give you 100% confidence. These tools produce different output even when given the same context and prompt. You can't really be certain that the output difference is due to isolating any single variable.
reply
pamelafox 17 hours ago
So true! I've also setup automated evaluations using the GitHub Copilot SDK so that I can re-run the same prompt and measure results. I only use that when I want even more confidence, and typically when I want to more precisely compare models. I do find that the results have been fairly similar across runs for the same model/prompt/settings, even though we cannot set seed for most models/agents.
reply
ChrisGreenHeur 15 hours ago
same with people, no matter what info you give a person you cant be sure they will follow it the same every time
reply
amluto 2 days ago
My personal experience is that it’s worthwhile to put instructions, user-manual style, into the context. These are things like:

- How to build.

- How to run tests.

- How to work around the incredible crappiness of the codex-rs sandbox.

I also like to put in basic style-guide things like “the minimum Python version is 3.12.” Sadly I seem to also need “if you find yourself writing TypeVar, think again” because (unscientifically) it seems that putting the actual keyword that the agent should try not to use makes it more likely to remember the instructions.

reply
mlaretallack 2 days ago
I also try to avoid negative instructions. No scientific proof, just a feeling the same as you, "do not delete the tmp file" can lead too often to deleting the tmp file.
reply
strokirk 17 hours ago
It’s like instructing a toddler.
reply
justanothersys 16 hours ago
i definitely have gone so far as to treat my llm readable docs in this way and have found it very effective
reply
hnbad 16 hours ago
I recall that early LLMs had the problem of not understanding the word "not", which became especially evident and problematic when tasked with summarizing text because the summary would then sometimes directly contradict the original text.

It seems that that problem hasn't really been "fixed", it's just been paved over. But I guess that's the ugly truth most people tend to forget/deny about LLMs: you can't "fix" them because there's not a line of code you can point to that causes a "bug", you can only retrain them and hope the problem goes away. In LLMs, every bug is a "heisenbug" (or should that be "murphybug", as in Murphy's Law?).

reply
joquarky 9 hours ago
Same thing happens for humans:

"Don't think of a green elephant"

Alan Watts talked of this concept where the harder you try to suppress a thought or sensation, the more mental energy you give it, making it stronger.

reply
likium 18 hours ago
For TypeVar I’d reach for a lint warning instead.
reply
amluto 9 hours ago
Then little toddler LLM will announce something like “I implemented what you requested and we’re all done. You can run the lint now.” And I’ll reply “do it yourself.”

I can only assume that everyone reporting amazing success with agent swarms and very long running tasks are using a different harness than I am :)

reply
bonesss 15 hours ago
I also have felt like these kinds of efforts at instructions and agent files have been worthwhile, but I am increasingly of the opinion that such feelings represent self-delusion from seeing and expecting certain things aided by a tool that always agrees with my, or its, take on utility. The agent.md file looks like it’d work, it looks how you’d expect, but then it fails over and over. And the process of tweaking is pleasant chatting with supportive supposed insights and solutions, which means hours of fiddling with meta-documentation without clear rewards because of partial adherence.

The papers conclusions align with my personal experiments at managing a small knowledge base with LLM rules. The application of rules was inconsistent, the execution of them fickle, and fundamental changes in processing would happen from week-to-week as the model usage was tweaked. But, rule tweaking always felt good. The LLM said it would work better, and the LLM said it had read and understood the instructions and the LLM said it would apply them… I felt like I understoood how best to deliver data to the LLMs, only to see recurrent failures.

LLMs lie. They have no idea, no data, and no insights into specific areas, but they’ll make pleasant reality-adjacent fiction. Since chatting is seductive, and our time sense is impacted by talking, I think the normal time versus productivity sense is further pulled out of ehack. Devs are notoriously bad at estimating where they’re using time, long feedback loops filled with phone time and slow ass conversation don’t help.

reply
avhception 17 hours ago
When an agent just plows ahead with a wrong interpretation or understanding of something, I like to ask them why they didn't stop to ask for clarification. Just a few days ago, while refactoring minor stuff, I had an agent replace all sqlite-related code in that codebase with MariaDB-based code. Asked why that happened, the answer was that there was a confusion about MariaDB vs. sqlite because the code in question is dealing with, among other things, MariaDB Docker containers. So the word MariaDB pops up a few times in code and comments.

I then asked if there is anything I could do to prevent misinterpretations from producing wild results like this. So I got the advice to put an instruction in AGENTS.md that would urge agents to ask for clarification before proceeding. But I didn't add it. Out of the 25 lines of my AGENTS.md, many are already variations of that. The first three:

- Do not try to fill gaps in your knowledge with overzealous assumptions.

- When in doubt: Slow down, double-check context, and only touch what was explicitly asked for.

- If a task seems to require extra changes, pause and ask before proceeding.

If these are not enough to prevent stuff like that, I don't know what could.

reply
Sevii 17 hours ago
Are agents actually capable of answering why they did things? An LLM can review the previous context, add your question about why it did something, and then use next token prediction to generate an answer. But is that answer actually why the agent did what it did?
reply
gas9S9zw3P9c 17 hours ago
It depends. If you have an LLM that uses reasoning the explanation for why decisions are made can often be found in the reasoning token output. So if the agent later has access to that context it could see why a decision was made.
reply
kgeist 11 hours ago
LLMs often already "know" the answer starting from the first output token and then emulate "reasoning" so that it appeared as if it came to the conclusion through logic. There's a bunch of papers on this topic. At least it used to be the case a few months ago, not sure about the current SOTA models.
reply
nrds 6 hours ago
Wait, that's not right, let me think through this more carefully...
reply
Kubuxu 16 hours ago
Reasoning, in majority of cases, is pruned at each conversation turn.
reply
DonHopkins 14 hours ago
The cursor-mirror skill and cursor_mirror.py script lets you search through and inschpekt all of your chat histories, all of the thinking bubbles and prompts, all of the context assembly, all of the tool and mcp calls and parameters, and analyze what it did, even after cursor has summarized and pruned and "forgotten" it -- it's all still there in the chat log and sqlite databases.

cursor-mirror skill and reverse engineered cursor schemas:

https://github.com/SimHacker/moollm/tree/main/skills/cursor-...

cursor_mirror.py:

https://github.com/SimHacker/moollm/blob/main/skills/cursor-...

  The German Toilet of AI

  "The structure of the toilet reflects how a culture examines itself." — Slavoj Zizek

  German toilets have a shelf. You can inspect what you've produced before flushing. French toilets rush everything away immediately. American toilets sit ambivalently between.

  cursor-mirror is the German toilet of AI.

  Most AI systems are French toilets — thoughts disappear instantly, no inspection possible. cursor-mirror provides hermeneutic self-examination: the ability to interpret and understand your own outputs.

  What context was assembled?
  What reasoning happened in thinking blocks?
  What tools were called and why?
  What files were read, written, modified?

  This matters for:

  Debugging — Why did it do that?
  Learning — What patterns work?
  Trust — Is this skill behaving as declared?
  Optimization — What's eating my tokens?

  See: Skill Ecosystem for how cursor-mirror enables skill curation.
----

https://news.ycombinator.com/item?id=23452607

According to Slavoj Žižek, Germans love Hermeneutic stool diagnostics:

https://www.youtube.com/watch?v=rzXPyCY7jbs

>Žižek on toilets. Slavoj Žižek during an architecture congress in Pamplona, Spain.

>The German toilets, the old kind -- now they are disappearing, but you still find them. It's the opposite. The hole is in front, so that when you produce excrement, they are displayed in the back, they don't disappear in water. This is the German ritual, you know? Use it every morning. Sniff, inspect your shits for traces of illness. It's high Hermeneutic. I think the original meaning of Hermeneutic may be this.

https://en.wikipedia.org/wiki/Hermeneutics

>Hermeneutics (/ˌhɜːrməˈnjuːtɪks/)[1] is the theory and methodology of interpretation, especially the interpretation of biblical texts, wisdom literature, and philosophical texts. Hermeneutics is more than interpretive principles or methods we resort to when immediate comprehension fails. Rather, hermeneutics is the art of understanding and of making oneself understood.

----

Here's an example cursor-mirror analysis of an experiment with 23 runs with four agents playing several turns of Fluxx per run (1 run = 1 completion call), 1045+ events, 731 tool calls, 24 files created, 32 images generated, 24 custom Fluxx cards created:

Cursor Mirror Analysis: Amsterdam Fluxx Championship -- Deep comprehensive scan of the entire FAFO tournament development:

amsterdam-flux CURSOR-MIRROR-ANALYSIS.md:

https://github.com/SimHacker/moollm/blob/main/skills/experim...

amsterdam-flux simulation runs:

https://github.com/SimHacker/moollm/tree/main/skills/experim...

reply
mkesper 12 hours ago
Just an update re German toilets: No toilet set up in the last 30 years (I know of) uses a shelf anymore. This reduces water usage by about 50% per flush.
reply
DonHopkins 12 hours ago
But then what do you have to talk about all day??!
reply
Onavo 14 hours ago
Well, the entire field of explainable AI has mostly thrown in the towel..
reply
bananapub 13 hours ago
of course not, but it can often give a plausible answer, and it's possible that answer will actually happen to be correct - not because it did any - or is capable of any - introspection, but because it's token outputs in response to the question might semi-coincidentally be a token input that changes the future outputs in the same way.
reply
bandrami 17 hours ago
Isn't that question a category error? The "why" the agent did that is that it was the token that best matched the probability distribution of the context and the most recent output (modulo a bit of randomness). The response to that question will, again, be the tokens that best match the probability distribution of the context (now including the "why?" question and the previous failed attempt).
reply
tibbar 17 hours ago
if the agent can review its reasoning traces, which i think is often true in this era of 1M token context, then it may be able to provide a meaningful answer to the question.
reply
bandrami 17 hours ago
Wait, no, that's the category error I'm talking about. Any answer other than "that was the most likely next token given the context" is untrue. It is not describing what actually happened.
reply
tibbar 16 hours ago
I think this statement is on the same level as "a human cannot explain why they gave the answer they gave because they cannot actually introspect the chemical reactions in their brain." That is true, but a human often has an internal train of thought that preceded their ultimate answer, and it is interesting to know what that train of thought was.

In the same way, it is often quite instructive to know what the reasoning trace was that preceded an LLM's answer, without having to worry about what, mechanically, the LLM "understood" about the tokens, if this is even a meaningful question.

reply
bandrami 16 hours ago
But it's not a reasoning trace. Models could produce one if they were designed to (an actual stack of the calls and the states of the tensors with each call, probably with a helpful lookup table for the tokens) but they specifically haven't been made to do that.
reply
rocqua 16 hours ago
When you put an LLM in reasoning mode, it will approximately have a conversation with itself. This mimics an inner monologue.

That conversation is held in text, not in any internal representation. That text is called the reasoning trace. You can then analyse that trace.

reply
bandrami 16 hours ago
Unless things have changed drastically in the last 4 months (the last time I looked at it) those traces are not stored but reconstructed when asked. Which is still the same problem.
reply
ehsanu1 15 hours ago
They aren't necessarily "stored" but they are part of the response content. They are referred to as reasoning or thinking blocks. The big 3 model makers all have this in their APIs, typically in an encrypted form.

Reconstruction of reasoning from scratch can happen in some legacy APIs like the OpenAI chat completions API, which doesn't support passing reasoning blocks around. They specifically recommend folks to use their newer esponses API to improve both accuracy and latency (reusing existing reasoning).

reply
tibbar 15 hours ago
For a typical coding agent, there are intermediate tool call outputs and LLM commentary produced while it works on a task and passed to the LLM as context for follow up requests. (Hence the term agent: it is an LLM call in a loop.) You can easily see this with e.g. Claude Code, as it keeps track of how much space is left in the context and requires "context compaction" after the context gradually fills up over the course of a session.

In this regard, the reasoning trace of an agent is trivially accessible to clients, unlike the reasoning trace of an individual LLM API call; it's a higher level of abstraction. Indeed, I implemented an agent just the other day which took advantage of this. The OP that you originally replied to was discussing an agentic coding process, not an individual LLM API call.

reply
bandrami 14 hours ago
Well, right, I see those reasoning stages in reasoning models with Ollama and if you ask it what its reasoning was after the fact what it says is different than what it said at the time.
reply
tibbar 4 hours ago
I can't speak to your specific set up, but it sounds like you're halfway there if you can access the previous traces? All anyone can ask for is "show me the traces that led up to this point"; the "why did you do this" is a notational convenience for querying that data. If your set up isn't summarizing those traces correctly, then that sounds like a specific bug in the context or model quality, but the point is that the traces exist and are queryable in the first place, however you choose to do that.

(I am still primarily talking about agent traces, like the original OP, not internal reasoning blocks for a particular LLM call, though - which may or may not be available in context afterwards.)

In particular, asking "why" isn't a category error here, although there's only a meaningful answer if the model has access to the previous traces in its context, which is sometimes true and sometimes not.

reply
dash2 16 hours ago
There can be higher- and lower-level descriptions of the same phenomenon. when the kettle boils, it’s because the water molecules were heated by the electric element, but it’s also because I wanted a cup of tea.
reply
ChrisGreenHeur 15 hours ago
the llm has no wants
reply
mikkupikku 9 hours ago
If the reason the LLM retroactively invents for it's previous mistakes is still useful for getting the LLM to not make that kind of mistake again, then the distinction you're driving at doesn't matter.
reply
rafaelmn 16 hours ago
> Any answer other than "that was the most likely next token given the context" is untrue.

"Because the matrix math resulted in the set of tokens that produced the output". "Because the machine code driving the hosting devices produced the output you saw". "Because the combination of silicon traces and charges on the chips at that exact moment resulted in the output". "Because my neurons fired in a particular order/combination".

I don't see how your statement is any more useful. If an LLM has access to reasoning traces it can realistically waddle down the CoT and figure out where it took a wrong turn.

Just like a human does with memories in context - does't mean that's the full story - your decision making is very subconscious and nonverbal - you might not be aware of it, but any reasoning you give to explain why you did something is bound to be an incomplete story, created by your brain to explain what happened based on what it knows - but there's hidden state it doesn't have access to. And yet we ask that question constantly.

reply
ChrisGreenHeur 15 hours ago
well, do you want something useful or something true?

the word why is used to get something true.

reply
rocqua 16 hours ago
If you want to be pedantic about it you could phrase it as follows.

When the LLM was in reasoning mode, in the reasoning context it often expressed statement X. Given that, and the relevance of statement X to the taken action. It seems likely that the presence of statement X in the context contributed to this action. Besides, the presence of statement X in the reasoning likely means that given the previous context embeddings of X are close to the context.

Hence we think that the action was taken due to statement X.

And that output could have come from an LLM introspecting it's own reasoning.

I don't think that phrasing things so pedanticaly is worth the extra precision though. Especially not for the statement that inspecting the reasoning logs of sn LLM can help give insight on why an LLM acted a certain way.

reply
tomashubelbauer 16 hours ago
Just this morning I have run across an even narrower case of how AGENTS.md (in this case with GPT-5.3 Codex) can be completely ignored even if filled with explicit instructions.

I have a line there that says Codex should never use Node APIs where Bun APIs exist for the same thing. Routinely, Claude Code and now Codex would ignore this.

I just replaced that rule with a TypeScript-compiler-powered AST based deterministic rule. Now the agent can attempt to commit code with banned Node API usage and the pre-commit script will fail, so it is forced to get it right.

I've found myself migrating more and more of my AGENTS.md instructions to compiler-based checks like these - where possible. I feel as though this shouldn't be needed if the models were good, but it seems to be and I guess the deterministic nature of these checks is better than relying on the LLM's questionable respect of the rules.

reply
iamflimflam1 15 hours ago
Not that much different from humans.

We have pre-commit hooks to prevent people doing the wrong thing. We have all sorts of guardrails to help people.

And the “modern” approach when someone does something wrong is not to blame the person, but to ask “how did the system allow this mistake? What guardrails are missing?”

reply
MITSardine 12 hours ago
I wonder if some of these could be embedded in the write tool calls?
reply
geraneum 17 hours ago
> So I got the advice to put an instruction in AGENTS.md that would urge agents to ask for clarification before proceeding.

You may want to ask the next LLM versions the same question after they feed this paper through training.

reply
lebuin 16 hours ago
It seems like LLMs in general still have a very hard time with the concepts of "doubt" and "uncertainty". In the early days this was very visible in the form of hallucinations, but it feels like they fixed that mostly by having better internal fact-checking. The underlying problem of treating assumptions as truth is still there, just hidden better.
reply
avhception 16 hours ago
Doubt and uncertainty is left for us humans.
reply
hnbad 16 hours ago
LLMs are basically improv theater. If the agent starts out with a wildly wrong assumption it will try to stick to it and adapt it rather than starting over. It can only do "yes and", never "actually nevermind, let me try something else".

I once had an agent come up with what seemed like a pointlessly convoluted solution as it tried to fit its initial approach (likely sourced from framework documentation overemphasizing the importance of doing it "the <framework> way" when possible) to a problem for which it to me didn't really seem like a good fit. It kept reassuring me that this was the way to go and my concerns were invalid.

When I described the solution and the original problem to another agent running the same model, it would instantly dismiss it and point out the same concerns I had raised - and it would insist on those being deal breakers the same way the other agent had dimissed them as invalid.

In the past I've often found LLMs to be extremely opinionated while also flipping their positions on a dime once met with any doubt or resistance. It feels like I'm now seeing the opposite: the LLM just running with whatever it picked up first from the initial prompt and then being extremely stubborn and insisting on rationalizing its choice no matter how much time it wastes trying to make it work. It's sometimes better to start a conversation over than to try and steer it in the right direction at that point.

reply
sensanaty 11 hours ago
I really hate that the anthropomorphizing of these systems has successfully taken hold in people's brains. Asking it why it did something is completely useless because you aren't interrogating a person with a memory or a rationale, you’re querying a statistical model that is spitting out a justification for a past state it no longer occupies.

Even the "thinking" blocks in newer models are an illusion. There is no functional difference between the text in a thought block and the final answer. To the model, they are just more tokens in a linear sequence. It isn't "thinking" before it speaks, the "thought" is the speech.

Treating those thoughts as internal reflection of some kind is a category error. There is no "privileged" layer of reasoning happening in the silicon that then gets translated into the thought block. It’s a specialized output where the model is forced to show its work because that process of feeding its own generated strings back into its context window statistically increases the probability of a correct result. The chatbot providers just package this in a neat little window to make the model's "thinking" part of the gimmick.

I also wouldn't be surprised if asking it stuff like this was actually counter productive, but for this I'm going off vibes. The logic being that by asking that, you're poisoning the context, similar to how if you try generate an image by saying "It should not have a crocodile in the image", it will put a crocodile into the image. By asking it why it did something wrong, it'll treat that as the ground truth and all future generation will have that snippet in it, nudging the output in such a way that the wrong thing itself will influence it to keep doing the wrong thing more and more.

reply
Bolwin 7 hours ago
You're entirely correct in that it's a different model with every message, every token. There's no past memory for it to reference.

That said it can still be useful because you have a some weird behavior and 199k tokens of context, with no idea where the info is that's nudging it to do the weird thing.

In this case you can think of it less as "why did you do this?" And more "what references to doing this exist in this pile of files and instructions?"

reply
bavell 10 hours ago
Agreed. I wish more people understood the difference between tokens, embeddings, and latent space encodings. The actual "thinking" if you can call it that, happens in latent space. But many (even here on HN) believe the thinking tokens are the thoughts themselves. Silly meatbags!
reply
Majromax 10 hours ago
Thinking happens in latent space, but the thinking trace is then the projection of that thinking onto tokens. Since autoregressive generation involves sampling a specific token and continuing the process, that sampling step is lossy.

However, it is a genuine question whether the literal meanings of thinking blocks are important over their less-observable latent meanings. The ultimate latent state attributable to the last-generated thinking token is some combination of the actual token (literal meaning) and recurrent thinking thus far. The latter does have some value; a 2024 paper (https://arxiv.org/abs/2404.15758) noted that simply adding dots to the output allowed some models to perform more latent computation resulting in higher-skill answers. However, since this is not a routine practice today I suspect that genuine "thinking" steps have higher value.

Ultimately, your thesis can be tested. Take the output of a reasoning model inclusive of thinking tokens, then re-generate answers with:

1. Different but semantically similar thinking steps (i.e. synonyms, summarization). That will test whether the model is encoding detailed information inside token latent space.

2. Meaningless thinking steps (dots or word salad), testing whether the model is performing detailed but latent computation, effectively ignoring the semantic context of

3. A semantically meaningful distraction (e.g. a thinking trace from a different question)

Look for where performance drops off the most. If between 0 (control) and 1, then the thinking step is really just a trace of some latent magic spell, so it's not meaningful. If between 1 and 2, then thinking traces serve a role approximately like a human's verbalized train of thought. If between 2 and 3 then the role is mixed, leading back to the 'magic spell' theory but without the 'verbal' component being important.

reply
Majromax 10 hours ago
> I really hate that the anthropomorphizing of these systems has successfully taken hold in people's brains. Asking it why it did something is completely useless because you aren't interrogating a person with a memory or a rationale, you’re querying a statistical model that is spitting out a justification for a past state it no longer occupies.

"Thinking meat! You're asking me to believe in thinking meat!"

While next-token prediction based on matrix math is certainly a literal, mechanistic truth, it is not a useful framing in the same sense that "synapses fire causing people to do things" is not a useful framing for human behaviour.

The "theory of mind" for LLMs sounds a bit silly, but taken in moderation it's also a genuine scientific framework in the sense of the scientific method. It allows one to form hypothesis, run experiments that can potentially disprove the hypothesis, and ultimately make skillful counterfactual predictions.

> By asking it why it did something wrong, it'll treat that as the ground truth and all future generation will have that snippet in it, nudging the output in such a way that the wrong thing itself will influence it to keep doing the wrong thing more and more.

In my limited experience, this is not the right use of introspection. Instead, the idea is to interrogate the model's chain of reasoning to understand the origins of a mistake (the 'theory of mind'), then adjust agents.md / documentation so that the mistake is avoided for future sessions, which start from an otherwise blank slate.

I do agree, however, that the 'theory of mind' is very close to the more blatantly incorrect kind of misapprehension about LLMs, that since they sound humanlike they have long-term memory like humans. This is why LLM apologies are a useless sycophancy trap.

reply
seanmcdirmid 7 hours ago
> Asking it why it did something is completely useless because you aren't interrogating a person with a memory or a rationale, you’re querying a statistical model that is spitting out a justification for a past state it no longer occupies.

Asking it why it did something isn’t useless, it just isn’t fullproof. If you really think it’s useless, you are way too heavily into binary thinking to be using AI.

Perfect is the enemy of useful in this case.

reply
sensanaty 4 hours ago
I genuinely fail to see the usefulness, though, it seems counterproductive to me to do this kinda stuff. In my experience I just throw out the whole chat/session as soon as I notice it's starting to repeat mistakes/start doing stupid shit consistently, the few times I've tried interrogating it I could immediately tell all it was doing is, for lack of a better word, being a sycophant and aping my words back at me.
reply
mustaphah 12 hours ago
This is like trying to fix hallucination by telling LLM not to hallucinate.
reply
delaminator 11 hours ago
so many times have ended up here :

"You're absolutely correct. I should have checked my skills before doing that. I'll make sure I do it in the future."

reply
prodigycorp 15 hours ago
LLMs are generally bad at writing non-noisy prompts and instructions. It's better to have it write instructions post hoc. For instance, I paste this prompt into the end of most conversations:

  If there’s a nugget of knowledge learned at any point in this conversation (not limited to the most recent exchange), please tersely update AGENTS.md so future agents can access it. If nothing durable was learned, no changes are needed. Do not add memories just to add memories.
  
  Update AGENTS.md **only** if you learned a durable, generalizable lesson about how to work in this repo (e.g., a principle, process, debugging heuristic, or coding convention). Do **not** add bug- or component-specific notes (for example, “set .foo color in bar.css”) unless they reflect a broader rule.
  
  If the lesson cannot be stated without referencing a specific selector or file, skip the memory and make no changes. Keep it to **one short bullet** under an appropriate existing section, or add a new short section only if absolutely necessary.

It hardly creates rules, but when it does, it affects rules in a way that positively affects behavior. This works very well.

Another common mistake is to have very long AGENTS.md files. The file should not be long. If it's longer than 200 lines, you're certainly doing it wrong.

reply
joquarky 9 hours ago
> If nothing durable was learned, no changes are needed.

Off topic, but oh my god if you don't do this, it will always do the thing you conditionally requested it to do. Not sure what to call this but it's my one big annoyance with LLMs.

It's like going to a sub shop and asking for just a tiny bit of extra mayo and they heap it on.

reply
Bolwin 7 hours ago
Llms generally seem trained with the assumption that if you mention it, you want it.

I don't think the instruction following benches test for this much and I don't know how you'd measure it well

reply
medler 2 days ago
Quite a surprising result: “across multiple coding agents and LLMs, we find that context files tend to reduce task success rates compared to providing no repository context, while also increasing inference cost by over 20%.”
reply
tartakovsky 18 hours ago
Well, task == Resolving real GitHub Issues

Languages == Python only

Libraries (um looks like other LLM generated libraries -- I mean definitely not pure human: like Ragas, FastMCP, etc)

So seems like a highly skewed sample and who knows what can / can't be generalized. Does make for a compelling research paper though!

reply
nielstron 17 hours ago
Hey, paper author here. We did try to get an even sample - we include both SWE-bench repos (which are large, popular and mostly human-written) and a sample of smaller, more recent repositories with existing AGENTS.md (these tend to contain LLM written code of course). Our findings generalize across both these samples. What is arguably missing are small repositories of completely human-written code, but this is quite difficult to obtain nowadays.
reply
menaerus 16 hours ago
Why stick to python-only repositories though?
reply
troupo 16 hours ago
To reduce the number of variables to account for. To be able to finish the paper this year, and not the next century. To work with a familiar language and environments. To use a language heavily represented in the training data.

I mean, it's not that hard to understand why.

reply
menaerus 16 hours ago
[flagged]
reply
troupo 13 hours ago
All research is conducted in constraints. It's not hard to understand those constraints by simply thinking.

Besides, one could actually open the research, and scroll to section 5 where they acknowledge the need to expand beyond Python:

--- start quote ---

5. Limitations and Future Work

While our work addresses important shortcomings in the literature, exciting opportunities for future research remain.

# Niche programming languages

The current evaluation is focused heavily on Python. Since this is a language that is widely represented in the training data, much detailed knowledge about tooling, dependencies, and other repository specifics might be present in the models’ parametric knowledge, nullifying the effect of context files. Future work may investigate the effect of context files on more niche programming languages and toolchains that are less represented in the training data, and known to be more difficult for LLMs

--- end quote ---

reply
menaerus 8 hours ago
You still did not answer my question and you're still being a d*ck. I understand now why - because you have no idea what I am talking about.
reply
bootsmann 15 hours ago
> Libraries (um looks like other LLM generated libraries -- I mean definitely not pure human: like Ragas, FastMCP, etc)

How does this invalidate the result? Aren't AGENTS.md files put exactly into those repos that are partly generated using LLMs?

reply
locknitpicker 17 hours ago
I think that is a rather fitting approach to the problem domain. A task being a real GitHub issue is a solid definition by any measure, and I see no problem picking language A over B or C.

If you feel strongly about the topic, you are free to write your own article.

reply
rmnclmnt 18 hours ago
Yesterday while i was adding some nitpicks to a CLAUDE.md/AGENTS.md file, I thought « this file could be renamed CONTRIBUTING.md and be done with it ».

Maybe I’m wrong but sure feels like we might soon drop all of this extra cruft for more rationale practices

reply
nielstron 17 hours ago
Exactly my thoughts... the model should just auto ingest README and CONTRIBUTING when started.
reply
rmnclmnt 14 hours ago
And that makes total sense. Honestly working since a few days with Opus 4.6, it really feels like a competent coworker, but need some explicit conventions to follow … exactly when onboarding a new IC! So i think there is a bright light to be seen: this will force having proper and explicit contribution rules and conventions, both for humans and robots
reply
delaminator 11 hours ago
You could have claude --init create this hook and then it gets into the context at start and resume

Or create it in some other way

    {
      "hookSpecificOutput": {
        "hookEventName": "SessionStart",
        "additionalContext": "<contents of your file here>"
      }
    }
I thought it was such a good suggestion that I made this just now and made it global to inject README at startup / resume / post compact - I'll see how it works out

https://gist.github.com/lawless-m/fa5d261337dfd4b5daad4ac964...

    #!/bin/bash
    # ~/.claude/hooks/inject-readme.sh

    README="$(pwd)/README.md"

    if [ -f "$README" ]; then
      CONTENT=$(jq -Rs . < "$README")
      echo "{\"hookSpecificOutput\" :{\"hookEventName\":\"SessionStart\",\"additionalContext\":${CONTENT}}}"
      exit 0
    else
      echo "README.md not found" >&2
      exit 1
    fi
with this hook

    {
      "hooks": {
        "SessionStart": [
          {
            "matcher": "startup|clear|compact",
            "hooks": [
              {
                "type": "command",
                "command": "~/.claude/hooks/inject-readme.sh"
              }
            ]
          }
        ]
      }
    }
reply
delaminator 8 hours ago
Unlike other content - what you put in here survives compacting
reply
gordonhart 10 hours ago
Exactly, it's the same documentation any contributor would need, just actually up-to-date and pared down to the essentials because it's "tested" continuously. If I were starting out on a new codebase, AGENTS.md is the first place I'd look to get my bearings.
reply
Arifcodes 9 hours ago
The study measures task completion on SWE-bench style issues, but that misses the main reason most of us write these files. I don't use AGENTS.md to help the model solve GitHub issues faster. I use it to stop the agent from doing dumb things that waste my time later.

Things like: don't use TypeVar in new code, always run migrations through our wrapper, never modify the shared proto files without updating the generated code. These are guardrails, not performance optimizers. The study's framing around "task success rate" misses that the value is in reducing the cleanup work after the agent "succeeds."

The finding that context files encourage "broader exploration" actually supports this. I want the agent to check more files and run more tests, even if it costs 20% more tokens. Tokens are cheap. Debugging a subtle regression the agent introduced because it didn't know about an invariant in the codebase is not.

reply
pajtai 2 days ago
I'd be interested to see results with Opus 4.6 or 4.5

Also, I bet the quality of these docs vary widely across both human and AI generated ones. Good Agents.md files should have progressive disclosure so only the items required by the task are pulled in (e.g. for DB schema related topics, see such and such a file).

Then there's the choice of pulling things into Agents.md vs skills which the article doesn't explore.

I do feel for the authors, since the article already feels old. The models and tooling around them are changing very quickly.

reply
deaux 15 hours ago
Agree that progressive disclosure is fantastic, but

> (e.g. for DB schema related topics, see such and such a file).

Rather than doing this, put another AGENTS.md file in a DB-related subfolder. It will be automatically pulled into context when the agent reads any files in the file. This is supported out of the box by any agent worth its salt, including OpenCode and CC.

IMO static instructions referring an LLM to other files are an anti-pattern, at least with current models. This is a flaw of the skills spec, which refers to creating a "references" folder and such. I think initial skills demos from Anthropic also showed this. This doesn't work.

reply
gordonhart 10 hours ago
> This is supported out of the box by any agent worth its salt, including OpenCode and CC.

I thought Claude Code didn't support AGENTS.md? At least according to this open issue[0], it's still unsupported and has to be symlinked to CLAUDE.md to be automatically picked up.

[0] https://github.com/anthropics/claude-code/issues/6235

reply
deaux 10 hours ago
You're right, for CC it's "nested CLAUDE.md files". The support I meant was about the "automatic inclusion in context upon sibling-or-child file touch" feature, rather than the name of the file.
reply
gordonhart 10 hours ago
Fair, I was hoping there was a feature that I was missing. Minor papercut to have to include harness-specific files/symlinks in your repo but it's probably a temporary state until the tools and usage patterns are more settled.
reply
deaux 9 hours ago
Nah, this is intentional by Anthropic, out of the top 20 coding agents 19 support AGENTS.md (fake numbers but I've seen someone else go through them). It's just a dumb IE6-style strategy.
reply
prodigycorp 14 hours ago
This is probably the best comment in the thread. I've totally forgotten about nested AGENTS.md files, gonna try implementation today.
reply
deaux 9 hours ago
If you have for example a monorepo, then you'll probably want a super lean top-level one - could be <15 lines - and then one per app. In those, only stuff that applies to the app as a whole. Then feature-specific context can be put at the level of the feature - hopefully your codebase is structured by domain rather than layer! The feature-level ones too, IMO, should usually be <15 lines. I just checked one of ours, it's 80 (GPT-5) tokens. It's basically answering potential "is this intentional?" questions - things that an LLM (or fresh human) can't possibly know the answer to because they're product decisions that aren't expressed in code. Tribal knowledge that would be in a doc somewhere. For 99% of decisions it's not needed, but there's that 1% where we've made a choice that goes against the cookie-cutter grain. If we don't put that in an AGENTS file, then every single time it's relevant there's a good chance it will make a wrong assumption. Or sometimes, a certain mechanic is inferable from the code, but it would take 10 different file reads to figure out something that is core to how the feature works, and takes 2 sentences to explain. Then it just saves a whole lot of time.

It does depend on the domain. If you're developing the logic for a game, you'll need more of them and they'll be longer.

Another advantage of this split is that because they're pulled into context at just the right time, the attention layer generally does a better job of putting sufficient importance on it during that part of the task, compared to if it were in the project-level AGENTS file that was loaded at the very top of the conversation.

reply
dpkirchner 2 days ago
Progressive disclosure is good for reducing context usage but it also reduces the benefit of token caching. It might be a toss-up, given this research result.
reply
deaux 16 hours ago
Those are different axes - quality vs money.

Progressive disclosure is invaluable because it reduces context rot. Every single token in context influences future ones and degrades quality.

I'm also not sure how it reduces the benefit of token caching. They're still going to be cached, just later on.

reply
kkapelon 17 hours ago
It is still baffling to me why we need AGENTS.md

Any well-maintained project should already have a CONTRIBUTING.md that has good information for both humans and agents.

Sometimes I actually start my sessions like this "please read the contributing.md file to understand how to build/test this project before making any code changes"

reply
benreesman 17 hours ago
If the harnesses had a simple system prompt "read repository level markdown and honor the house style"?

Think of the agent app store people's children man, it would be a sad Christmas.

reply
CharlieDigital 11 hours ago
Just symlink CONTRIBUTING as AGENTS
reply
kkapelon 11 hours ago
This only works on systems that support symlinks. It also pollutes the root folder with more files.

I understand the sentiment, but it is really strange that the people that are pushing for agents.md haven't seen https://contributing.md/

Is it even mentioned at GitHub docs https://docs.github.com/en/communities/setting-up-your-proje...

reply
GBintz 12 hours ago
we've been running AGENTS.md in production on helios (https://github.com/BintzGavin/helios) for a while now.

each role owns specific files. no overlap means zero merge conflicts across 1800+ autonomous PRs. planning happens in `.sys/plans/{role}/` as written contracts before execution starts. time is the mutex.

AGENTS.md defines the vision. agents read the gap between vision and reality, then pull toward it. no manager, no orchestration.

we wrote about it here: https://agnt.one/blog/black-hole-architecture

agents ship features autonomously. 90% of PRs are zero human in the loop. the one pain point is refactors. cross-cutting changes don't map cleanly to single-role ownership

AGENTS.md works when it encodes constraints that eliminate coordination. if it's just a roadmap, it won't help much.

reply
bavell 10 hours ago
Might be some interesting nuggets in your article but my eyes rolled so hard at this part I had to stop reading:

"The system does not assign tasks.

It defines gravity."

Helios looks cool though!

reply
energy123 18 hours ago
Their definition of context excludes prescriptive specs/requirements files. They are only talking about a file that summarizes what exists in the codebase, which is information that's otherwise discoverable by the agent through CLI (ripgrep, etc), and it's been trained to do that as efficiently as possible.

Also important to note that human-written context did help according to them, if only a little bit.

Effectively what they're saying is that inputting an LLM generated summary of the codebase didn't help the agent. Which isn't that surprising.

reply
MITSardine 15 hours ago
I find it surprising. The piece of code I'm working on is about 10k LoC to define the basic structures and functionality and I found Claude Code would systematically spend significant time and tokens exploring it to add even basic functionality. Part of the issue is this deals with a problem domain LLMs don't seem to be very well trained on, so they have to take it all in, they don't seem to know what to look for in advance.

I went through a couple of iterations of the CLAUDE.md file, first describing the problem domain and library intent (that helped target search better as it had keywords to go by; note a domain-trained human would know these in advance from the three words that comprise the library folder name) and finally adding a concise per-function doc of all the most frequently used bits. I find I can launch CC on a simple task now, without it spending minutes reading the codebase before getting started.

reply
tumetab1 12 hours ago
That's also my experience.

The article is interesting but I think it deviates from a common developer experience as many don't work on Python libraries, which likely heavily follow patterns that the model itself already contains.

reply
nielstron 17 hours ago
Hey, a paper author here :) I agree, if you know well about LLMs it shouldn't be too surprising that autogenerated context files are not helping - yet this is the default recommendation by major AI companies which we wanted to scrutinize.

> Their definition of context excludes prescriptive specs/requirements files.

Can you explain a bit what you mean here? If the context file specifies a desired behavior, we do check whether the LLM follows it, and this seems generally to work (Section 4.3).

reply
eknkc 18 hours ago
I only put things when the LLM gets something wrong and I need to correct it. Like “no, we create db migrations using this tool” kind of corrections. So far it made them behave correctly in those situations.
reply
theLiminator 18 hours ago
I'd take any paper like this with a grain of salt. I imagine what holds true for models in time period X could drastically be different just given a little more time.

Doesn't mean it's not worth studying this kind of stuff, but this conclusion is already so "old" that it's hard to say it's valid anymore with the latest batch of models.

reply
nielstron 17 hours ago
This is life of an LLM researcher. We literally ran the last experiments only a month ago on what were the latest models back then...
reply
climike 13 hours ago
[dead]
reply
flatcoke 11 hours ago
I use AGENTS.md daily for my personal AI setup. The biggest win is giving the agent project-specific context — things like deployment targets, coding conventions, and what not to do. Without it, the agent makes generic assumptions that waste time.
reply
einrealist 13 hours ago
What is the purpose of an AGENTS.md file when there are so many different models? Which model or version of the model is the file written for? So much depends on assumptions here. It only makes sense when you know exactly which model you are writing for. No wonder the impact is 'all over the place'.
reply
climike 13 hours ago
[dead]
reply
mindwok 16 hours ago
In my experience AGENTS.md files only save a bit of time, they don't meaningfully improve success. Agents are smart enough to figure stuff out on their own, but you can save a few tool calls and a bit of context by telling them how to build your project or what directories do what rather than letting it stumble its way there.
reply
BlueHotDog2 11 hours ago
I've found that even documenting non-obvious dependencies between tasks can significantly improve agent performance and reduce debugging time
reply
4b11b4 17 hours ago
This paper shoulda just done a study on elixirs usage_rules?

https://github.com/ash-project/usage_rules

reply
ozim 16 hours ago
I pretty much add to my prompt bunch of stuff, with AGENST.md or any file I can just add one line "hey read up that file".
reply
rmunn 16 hours ago
If I understand the paper correctly, the researchers found that AGENTS.md context files caused the LLMs to burn through more tokens as they parsed and followed the instructions, but they did not find a large change in the success rate (defined by "the PR passes the existing unit tests in the repo").

What wasn't measured, probably because it's almost impossible to quantify, was the quality of the code produced. Did the context files help the LLMs produce code that matched the style of the rest of the project? Did the code produced end up reasonably maintainable in the long run, or was it slop that increased long-term tech debt? These are important questions, but as they are extremely difficult to assign numbers to and measure in an automated way, the paper didn't attempt to answer them.

reply
reconnecting 13 hours ago
Check the logs, no one really requests AGENTS.md from the server.
reply
mikkupikku 13 hours ago
The only thing I use CLAUDE.md for is explaining the purpose and general high level design principles of the project so I don't have to waste my time reiterating this every time I clear the context. Things like this is a file manager, the deliverable must always be a zipapp, Wayland will never be supported.

I added these to that file because otherwise I will have to tell claude these things myself, repeatedly. But the science says... Respectfully, blow it out your ass.

reply
DaanDL 10 hours ago
I chuckled at "Wayland will never be supported" :-D
reply
sensanaty 12 hours ago
Most of these AI-guiding "techniques" seem more like reading into tea leaves to me than anything actually useful.

Even with the latest and greatest (because I know people will reflexively immediately jump down my throat if I don't specify that, yes, I've used Opus 4.6 and Gemini 3 Pro etc. etc. etc. etc., I have access to all of the models by way of work and use them regularly), my experience has been that it's basically a crapshoot that it'll listen to a single one of these files, especially in the long run with large chats. The amount of times I still have to tell these things to not generate React in my Vue codebase that has literally not a single line of JSX anywhere and instructions in every single possible file I can put it in to NOT GENERATE FUCKING REACT CODE makes me want to blow my brains out every time it happens. In fact it happened to me today with the supposed super intelligence known as Opus 4.6 that has 18 trillion TB of context or whatever in a fresh chat when I asked for a quick snippet I needed to experiment with.

I'm not even paying for this crap (work is) and I still feel scammed approximately half the time, and can't help but think all of these suggestions are just ways to inflate token usage and to move you into the usage limit territory faster.

reply
Foobar8568 11 hours ago
Claude/Opus 4.6 Can you add a console.log in food XYZ?

No problem, x agents, hundreds/closed to one million token usage to add a line of code.

Gemini 3 : can you review the commit A (console.log one ) you have made the most significant change in your 200kloc code base, this key change will allow you to get great insight into your software.

Codex : I have reviewed your change, you are missing tests and integration tests.

But I fully agree, overall I feel there are a lot of tea leaves readers online and LinkedIn.

reply
Razengan 9 hours ago
I think they can be helpful for humies too: the act of writing the instructions and describing your stuff in a clear way, and also reading it later.
reply
0xbadcafebee 17 hours ago
Research has shown that most earlier "techniques" to get better LLM response no longer work and are actively harmful with modern models. I'm so grateful that there's actual studies and papers about this and that they keep coming out. Software developers are super cargo culty and will do whatever the next guy does (and that includes doing whatever is suggested in research papers)
reply
rmunn 16 hours ago
Software developers don't have to be cargo-culty... if they're working on systems that are well-documented or are open-source (or at least source-available) so that you can actually dig in to find out how the system works.

But with LLMs, the internals are not well-documented, most are not open-source (and even if the model and weights are open-source, it's impossible for a human to read a grid of numbers and understand exactly how it will change its output for a given input), and there's also an element of randomness inherent to how the LLM behaves.

Given that fact, it's not surprising to find that developers trying to use LLMs end up adding certain inputs out of what amounts to superstition ("it seems to work better when I tell it to think before coding, so let's add that instruction and hopefully it'll help avoid bad code" but there's very little way to be sure that it did anything). It honestly reminds me of gambling fallacies, e.g. tabletop RPG players who have their "lucky" die that they bring out for important rolls. There's insufficient input to be sure that this line, which you add to all your prompts by putting it in AGENTS.md, is doing anything — but it makes you feel better to have it in there.

(None of which is intended as a criticism, BTW: that's just what you have to do when using an opaque, partly-random tool).

reply
octoclaw 10 hours ago
[dead]
reply
kittbuilds 13 hours ago
[dead]
reply
AlexYzhov 15 hours ago
[dead]
reply
szundi 16 hours ago
[dead]
reply
imiric 17 hours ago
Many of the practices in this field are mostly based on feelings and wishful thinking, rather than any demonstrable benefit. Part of the problem is that the tools are practically nondeterministic, and their results can't be compared reliably.

The other part is fueled by brand recognition and promotion, since everyone wants to make their own contribution with the least amount of effort, and coming up with silly Markdown formats is an easy way to do that.

EDIT: It's amusing how sensitive the blue-pilled crowd is when confronted with reality. :)

reply