The commenter who says "add obfuscation and success drops to zero" is right, but that's also the wrong approach imo. The experiment isn't claiming AI can defeat a competent attacker. It's asking whether AI agents can replicate what a skilled reverse engineering (RE) specialist does on an unobfuscated binary. That's a legitimate, deployable use case (internal audit, code review, legacy binary analysis) even if it doesn't cover adversarial-grade malware.
The more useful framing: what's the right threat model? If you're defending against script kiddies and automated tooling, AI-assisted RE might already be good enough. If you're defending against targeted attacks by people who know you're using AI detection, the bar is much higher and this test doesn't speak to it.
What would actually settle the "ready for production" question: run the same test with the weakest obfuscation that matters in real deployments (import hiding, string encoding), not adversarial-grade obfuscation. That's the boundary condition.
Let's say instead of reversing, the job was to pick apples. Let's say an AI can pick all the apples in an orchard in normal weather conditions, but add overcast skies and success drops to zero. Is this, in your opinion, still a skilled apple picking specialist?
I’ve been using Ghidra to reverse engineer Altium’s file format (at least the Delphi parts) and it’s insane how effective it is. Models are not quite good enough to write an entire parser from scratch but before LLMs I would have never even attempted the reverse engineering.
I definitely would not depend on it for security audits but the latest models are more than good enough to reverse engineer file formats.
They can make diagrams for you, give you an attack surface mapping, and dig for you while you do more manual work. As you work on an audit you will often find things of interest in a binary or code base that you want to investigate further. LLMs can often blast through a code base or binary finding similar things.
I like to think of it like a swiss army knife of agentic tools to deploy as you work through a problem. They won't balk at some insanely boring task, and that can give you a real speed up. The trick is not to fall into the trap of trying to get too much out of an LLM; if you do, you end up pouring time into your LLM setup and not getting good results. I think that is the LLM productivity trap. But if you have a reasonable subset of "skills" / "agents" you can deploy for various auditing tasks, it can absolutely speed you up some.
Also, when you have scale problems, just throw an LLM at it. Even low quality results are a good sniff test. Some of the time I just throw an LLM at a code review thing for a codebase I came across and let it work. I also love asking it to make me architecture diagrams.
Are people sharing these somewhere?
Using the agent and seeing where it gets stuck, then creating a workflow/skill/whatever for how to overcome that issue, will also help you understand what scenarios the agents and models are currently having a hard time with.
You'll also end up with fewer workflows/skills, and ones that you understand, so you can steer things and rewrite them when you inevitably have to change something.
Still, Ghidra's most painful limitation was how extremely slow it is with Go binaries. We had to exclude that example from the benchmark.
I tried a few approaches: https://github.com/jtang613/GhidrAssistMCP (the hardest to set up), Ghidra analyzeHeadless (GPT-5.2-Codex worked well with it!), and PyGhidra (my go-to). Did you check which works best?
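For reference, a minimal headless run looks roughly like this (a sketch; the post-script name ExportDecompiledC.java is made up, substitute whatever script you actually use):

# analyze the binary in a throwaway project, run a post-analysis script, then clean up
analyzeHeadless /tmp/ghidra-proj audit \
  -import ./target.bin \
  -scriptPath ./ghidra_scripts \
  -postScript ExportDecompiledC.java \
  -deleteProject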
I mean, very likely (especially with an explicit README for AI, https://github.com/akiselev/ghidra-cli/blob/master/.claude/s...) your approach is more convenient to use with AI agents.
That said, it should also be easier for a human to follow along with the agent, and Claude Code seems to have an easier time with discovery than with stuffing all the tool definitions into the context.
It would be interesting to have an experiment where these models are able to test exploits, though their alignment may not allow that to happen. Perhaps combining models can lead to that kind of testing: the better models identify issues and write up "how to verify" tests, and the "misaligned" models actually carry out the testing and report back to the better models.
THIS is the takeaway. These tools are allowing *adjacency* to become a powerful guiding indicator. You don't need to be a reverser, you can just understand how your software works and drive the robot to be a fallible hypothesis generator in regions where you can validate only some of the findings.
this is why human oversight in the loop matters so much even when using frontier models. the model is a hypothesis generator, not the decision maker. i've found the same thing building content pipelines -- the expensive models (opus 4.6 etc) genuinely produce better first-pass output, but you still can't trust them end to end. the moment you remove human review the quality craters in subtle ways you only notice later.
the multi-agent approach someone mentioned above (one model flags, another validates) is interesting but adds complexity. simpler to just have the human be the validation layer.
Perhaps it would make sense to provide LLMs with some strategy guides written in .md files.
Let's say you tell it that there might be small backdoors. You've now primed the LLM to search that way (even using "may"). You've passed information about the test to the test taker!
So we have a new variable! Is the success only due to the hint? How robust is that prompt? Does subtle wording dramatically change the output? Do "may", "does", "can", and "might" work but "May", "cann", or anything else fail? Have you, the prompter, unintentionally conveyed something important about the test?
I'm sure you can prompt-engineer your way to greater success, but by doing so you also greatly expand the complexity of the experiment and consequently make your results far less robust.
Experimental design is incredibly difficult due to all the subtleties. It's a thing most people frequently fail at (including scientists) and even more frequently fool themselves into believing stronger claims than the experiment can yield.
And before anyone says "but humans", yeah, same complexity applies. It's actually why human experimentation is harder than a lot of other things. There's just far more noise in the system.
But could you get success? Certainly. I mean you could tell it exactly where the backdoors are. But that's not useful. So now you got to decide where that line is and certainly others won't agree.
But when we're trying to share results, "a talented engineer sat with the thread and wrote tests/docs/harnesses to guide the model" is less impressive than "we asked it and it figured it out," even though the latter is how real work will happen.
It creates this perverse scenario (which is no one's fault!) where we talk about one-shot performance but one-shot performance is useful in exactly 0 interesting cases.
Sometimes it feels like it's not dissimilar to spending 4 hours to automate a 10-minute task that I thought I'd need forever but ended up using just once in the past 5 months. But sometimes I unlock something that saves a huge amount of time and can be reused in many steps of other projects.
Even where it works, it is quite hard to specify human strategic thinking in a way that an AI will follow.
Open-source GitHub: https://github.com/QuesmaOrg/BinaryAudit
I had been searching for a good benchmark that provided some empirical evidence of this sycophancy, but I hadn't found much. Measuring false positives when you ask the model to complete a detection related task may be a good way of doing that.
> Claude Opus 4.6 found it… and persuaded itself there is nothing to worry about
> Even the best model in our benchmark got fooled by this task.
That is quite strange. Because it seems almost as if a human is required to make the AI tools understand this.
Your approach, however, makes a lot of sense if you are ready to have your own custom or fine-tuned model.
A bad actor already has most of the work done.
The code is open-source; you can run it yourself using Harbor Framework:
git clone git@github.com:QuesmaOrg/BinaryAudit.git
export OPENROUTER_API_KEY=...
harbor run --path tasks --task-name lighttpd-* --agent terminus-2 --model openrouter/anthropic/claude-opus-4.6 --model openrouter/google/gemini-3-pro-preview --model openrouter/openai/gpt-5.2 --n-attempts 3
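If you just want a quick smoke test before committing to the full matrix, something along these lines should work (same flags as above, one model, one attempt; I haven't verified whether Harbor needs the task glob quoted, I just quote it to keep the shell from expanding it):

harbor run --path tasks --task-name 'dropbear-*' --agent terminus-2 --model openrouter/anthropic/claude-opus-4.6 --n-attempts 1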
Please open a PR if you find something interesting, though our domain experts spend a fair amount of time looking at the trajectories.
At the same time, tasks can differ, and not all things that work best end-to-end are the same as the ones that are good for a typical, interactive workflow.
We used the Terminus 2 agent, as it is the default used by Harbor (https://harborframework.com/) and we wanted to stay unbiased. Very likely other agent frameworks will change the results.
Too bad the author didn't really share the agents they were using so we can't really test this ourselves.
e.g. an intentional weakness in systemd + udev + binfmt magic when used together == authentication and mandatory access control bypass. Each weakness reviewed individually just looks like benign sub-optimal code.
Is there code that does something completely different than its comments claim?
Or put another way, each of these three through three hundred applications or services by themselves may be intended to perform x,y,z functions but when put together by happy coincidence they can perform these fifty-million other unintended functions including but not limited to bypassing authentication, bypassing mandatory access controls, avoiding logging and auditing, etc... oh and it can automate washing your dishes, too.
"Depending on the length of the piece of code" is probably the most honest answer right now.
I wonder if a hybrid approach would work better: use AI to flag suspicious sections, then have a human reverser focus only on those. Kind of like how SAST tools work for source code - nobody expects them to catch everything, but they narrow down where to look.
Optimising a model for a certain task, via fine-tuning (aka post-training), can lead to loss of performance on other tasks. People want codex to "generate code" and "drive agents" and so on. So oAI fine-tuned for that.
Lol.
> Gemini 3 Pro supposedly “discovered” a backdoor.
Yup, sounds typical for Gemini...it tends to lie.
Very good article. Sounds super useful to apply its findings and improve LLMs.
On a similar note... reverse engineering is now accessible to the public. Tons of old software is now easy to RE. Are software companies having issues with this?
They may have not noticed an improvement, but it doesn't mean there isn't any.
In fact, this is what authors said themselves: "However, this approach is not ready for production. Even the best model, Claude Opus 4.6, found relatively obvious backdoors in small/mid-size binaries only 49% of the time. Worse yet, most models had a high false positive rate — flagging clean binaries." So I'm not sure if we're even discussing the same article.
I also don't see a comparison with any other methodology. What is the success rate of ./decompile binary.exe | grep -E "(exec|system).*/bin/sh"? What is the success rate of state-of-the-art alternative approaches?
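For concreteness, even a naive strings-based baseline like the one below would be a useful point of comparison, if only to see its false-positive rate (purely illustrative, not something the benchmark ran; the samples/ path is made up):

# crude baseline: flag any binary whose strings mention a shell and exec-style calls
for bin in samples/*; do
  if strings "$bin" | grep -Eq '/bin/(sh|bash)' &&
     strings "$bin" | grep -Eq '(execve|system|popen)'; then
    echo "SUSPICIOUS: $bin"
  fi
done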
This is detecting the pattern of an anomaly in language associated with malicious activity, which is not impressive for an LLM.
The tasks here are entry level. So we are impressed that some AI models are able to detect some patterns, while looking just at binary code. We didn't take it for granted.
For example, only a few models understand Ghidra and Radare2 tooling (Opus 4.5 and 4.6, Gemini 3 Pro, GLM 5) https://quesma.com/benchmarks/binaryaudit/#models-tooling
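"Understanding the tooling" here means coming up with command sequences like this on its own (illustrative, not taken from an actual trajectory; the binary path and symbol are examples):

# analyze, list functions, then disassemble the password-auth routine in one shot
r2 -q -c 'aaa; afl; pdf @ sym.svr_auth_password' ./dropbear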
We consider it a starting point for AI agents being able to work with binaries. Other people discovered the same - see https://x.com/ccccjjjjeeee/status/2021160492039811300 and https://news.ycombinator.com/item?id=46846101.
There is a long way ahead from "OMG, AI can do that!" to an end-to-end solution.
Our example instruction is here: https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/lig...
> However, [the approach of using AI agents for malware detection] is not ready for production.
Then the methodology does not support that. It's "the approach of using AI agents for malware detection with next to zero documentation or guidance is not ready for production."
Agree it is a good test to try, but there are huge benefits to being able to understand (better yet, recreate) 0-conf tests.
The question we asked is if they can solve a problem autonomously, with instructions that would be clear for a reverse engineering specialist.
That said, I found these useful for many binary tasks - just not (yet) the end-to-end ones.
With a longer and more detailed prompt (while still keeping the prompt completely non-specific to a particular type of malware/backdoor), the AI could most likely solve the problem autonomously much better.
What level of autonomy, though? At some point a human has to fire them off, so it's already kind of shaky what that means here. What about providing a bunch of manuals in a directory and including "There are manuals in manuals/ you can browse to learn more." in the prompt? If they get the hint, is that still "autonomous"?
see:
- https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/dns...
- https://github.com/QuesmaOrg/BinaryAudit/blob/main/tasks/dro...
The second one is more impressive. I'd like to see the reasoning trace.
Before even looking at the binary, Claude announces it will “look at the authentication functions, especially password checking logic which is a common backdoor target.” It finds the password checking function (svr_auth_password) using strings. And that is the function they decided to backdoor.
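Concretely, that opening move amounts to something like this (my reconstruction, not the actual trajectory; the binary path is assumed):

# grep the binary for auth-related names to locate the password-check routine
strings -a ./dropbear | grep -iE 'auth|passw'
# if symbols aren't stripped, nm gets you the function directly
nm ./dropbear | grep svr_auth_password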
I’m experienced with reverse engineering but not experienced with these kinds of CTF-type challenges, so it didn’t occur to me that this function would be a stereotypical backdoor target…
They have a different task (dropbear-brokenauth2-detect) which puts a backdoor in a different function, and zero agents were able to find that one.
On the original task (dropbear-brokenauth-detect), in their runs, Claude reports the right function as backdoored 2 out of 3 times, but it also reports some function as backdoored 2 out of 2 times in the control experiment (dropbear-brokenauth-detect-negative), so it might just be getting lucky. The benchmark seemingly only checks whether the agent identifies which function is backdoored, not the specific nature of the backdoor. Since Claude guessed the right function in advance, it could hallucinate any backdoor and still pass.
But I don’t want to underestimate Claude. My run is not finished yet. Once it’s finished, I’ll check whether it identified the right function and, if so, whether it actually found the backdoor.