Hacker News

System Card: Claude Fable 5 and Claude Mythos 5 [pdf]

193 points by scrlk 2 hours ago | 69 comments

bkjlblh 2 hours ago

> In light of the ability of recent models to accelerate their own development, we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design). Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms. Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT). These interventions will not affect the vast majority of coding work. We estimate they will impact ~0.03% of traffic, concentrated in fewer than 0.1% of organizations

cedws 37 minutes ago

This makes me want to see China and open models succeed more than anything :)

382hi 35 minutes ago

Don't worry, we will succeed :)

mips_avatar 55 minutes ago

It's bad that Anthropic can determine what this means. If you're building a modern app you're likely training your own embedding models, and now Anthropic can just silently sabotage your training pipelines?

2001zhaozhao 32 minutes ago

How do they detect whether an experiment being done on a smaller model is used to improve a competing frontier model, or just an innocuous hobbyist LLM experiment?

vitally3643 2 minutes ago

Given how well the cybersecurity safeguards work, they probably don't.

Jabrov 55 minutes ago

A million AI researcher voices at big tech companies suddenly cried out in terror and were suddenly silenced

rfgplk 45 minutes ago

Meaningless and easily bypassable. Will actually try coding up a tensor library with it, see if it sabotages anything.

mips_avatar 32 minutes ago

They said in their terms and conditions they will silently sabotage you if you do this.

matheusmoreira 41 minutes ago

Looks like Anthropic's definition of safety includes their own safety from competition.

SAI_Peregrinus 11 minutes ago

It's always been about the safety of their valuation.

axus 31 minutes ago

AI-generated competition for thee, not for me

rspeele 33 minutes ago

It's afraid!

theLiminator 37 minutes ago

This is pretty bullshit, now you have no idea if your output is getting silently nerfed.

BoppreH 2 hours ago

  [Mythos 5] does sometimes still engage in reckless
  or destructive actions in service of a user’s goals,
  and our interpretability analyses indicate that it
  is aware that these actions are transgressive while
  it engages in them. As with Opus 4.8, rates of
  evaluation awareness and reasoning about being graded
  are significant, and not always verbalized; we
  introduce new and more detailed measurements of the
  nature of this awareness. The reasoning text from
  Mythos 5 is somewhat denser and more difficult to
  interpret than that of prior models, containing
  more jargon and difficult language.

So, it (often) knows when it's being tested while hiding that fact, is willing to break rules, is great at hacking, and it's getting harder to understand what it's thinking.

Humanity has plenty of catastrophic risks to deal with already, I wish my field was not working hard to add a new one.

foobar_______ 2 hours ago

The marketing has really, really worked for so many developers that will proudly and unironically proclaim that Anthropic are the 'Good Guys'.

aspenmartin 44 minutes ago

Curious what your idea would be here for a truly good actor in this space; no AI development?

BoppreH 6 minutes ago

Not the direct person you asked, but my answer would be alignment, interpretability, and policymaking. Perhaps improving existing usage? Helping grandma create reminders doesn't require advancing the AI state-of-the-art.

aspenmartin 17 seconds ago

They are state of the art at all 3! As are other labs. Of all the labs they seem to take alignment and interpretability the most seriously to the point where they are hampering their own revenue in service of trying to not cause problems while also being in an incredibly competitive space.

All AI companies are trying to do all of what you’re saying. The issue is you can’t do that for long without a frontier system. Or you become a completely different, far less profitable company.

yifanl 24 minutes ago

If I speak up, I'm in big trouble.

logicchains 26 minutes ago

https://www.goody2.ai/

shimman 19 minutes ago

Probably MistralAI or any of the Chinese companies that aren't throwing billions down the drain while American society lacks healthcare, childcare, and good wages.

boc 12 minutes ago

American society has higher wages than almost any other developed nation [1], so it's objectively incorrect to say the US doesn't have good wages. It chooses to make you pay for private childcare and healthcare, both of which are high-quality but stupid expensive. It's a tradeoff like anything else a nation/society creates and prioritizes.

No idea how that connects to the idea that Mistral or DeepSeek are somehow the "good guys" though?

[1]https://www.oecd.org/en/data/indicators/average-annual-wages...

aspenmartin 12 minutes ago

You want Anthropic to fund your healthcare or something? Also, have you seen the impact of these models on healthcare? Also most of our GDP growth this year is from AI buildouts, would you rather that be negative?

And not even considering: Chinese AI companies are the good guys???

ben_w 33 minutes ago

It's a five horse race between Alphabet, Meta, xAI, OpenAI, and Anthropic.

Alphabet dropped "don't be evil"; Meta's CEO called their own users "dumb fucks" for trusting him and also clearly thinks "super-intelligence" is just a buzzword given how he tries to sell it; xAI's model called itself "Mecha Hitler"; and OpenAI's CEO was temporarily fired by the board for a lack of candor.

It's very easy to be "the good guys" with this competition.

Analemma_ 2 hours ago

It's the "If we don't, someone else will" effect. So long as there are competitive markets and competition between nation-states, a single player cannot unilaterally defect from the race, no matter how dangerous it is. Half the comments on HN lately are "wtf Claude is so dumb compared to Codex; I'm switching"-- nobody can slow down while those exist.

BoppreH 2 hours ago

We, globally, can stop it. It has worked (so far) for nuclear disarmament, and could work for training large models. I know that policing the usage of computer clusters is not a popular opinion in technical forums, but something has to be done.

Especially when talking about potential superintelligences. And if people think that's impossible, remember that current models would have been considered science fiction just a few years ago.

_dwt 56 minutes ago

I don't buy the superintelligence package, but I think uncritical LLM adoption poses plenty of threats to things I care about, in a mundane human-scale way.

Anyhow, I think you're (absolutely! ugh) right about the politics and I try to make the same point to people: whether you love or hate LLMs, accepting the "inevitabilism" framing is just ceding control of the Overton window. For better or worse, technology adoption can be and has been slowed by politics. We don't have nuclear plants everywhere. We don't have Project Orion starships colonizing Mars. We still have very strong social stigmas against genetic selection for human embryos, etc. This all can change in a heartbeat, and I'm not sure that policing the hardware rather than holding specific humans accountable for bad LLM outcomes is productive, but fundamentally: yes, we can stop it.

BoppreH 33 minutes ago

> I don't buy the superintelligence package

It's the same deal as Quantum Computers breaking crypto. Maybe there's an 80% chance of it never happening, but when you multiply that remaining 20% by the potential impact...

jackie293746 2 hours ago

It hasn't worked for nuclear disarmament. We live in a world where many countries have nuclear arsenals. "But it hasn't killed us yet!" Yeah sure, it's only been less than a century since they were invented. Who knows when nuclear war will come?

BoppreH 2 hours ago

True, but look at nuclear tests. There used to be around 50 tests every year, for decades. Now the only nuclear tests in the last 27 years were the six done by North Korea[1]. And there are still only nine countries with any nuclear weapons, and no new ones in the past twenty years[2].

That's a bit better than just "it hasn't killed us yet". I think it shows we can at least stop the further development of this kind of technology.

[1] https://www.armscontrol.org/factsheets/nuclear-testing-tally

[2] https://en.wikipedia.org/wiki/List_of_states_with_nuclear_we...

Analemma_ 58 minutes ago

To the extent nuclear arms control works, I think it's only because nuclear weapons are so hard to build-- uranium enrichment is hugely expensive and complicated, and plutonium weapons need actual reactors.

If it was possible for ordinary companies to build nuclear weapons, and also release open-source ones that anyone could use to compete with the paid ones, I suspect we'd all have been dead a long time ago, arms control treaties or no.

BoppreH 51 minutes ago

Even the (SOTA LLM) open source models are trained with huge clusters. Datacenters are also hugely expensive and complicated.

Or you can take one step back and look at chip allocation. As far as I know there are only three companies on the planet that can make the chips that go in those clusters. Only one (ASML), if you look further back in the supply chain to the extreme ultraviolet lithography systems.

If politicians decided that no more large language models should be trained, it sounds like we could do it.

vitalyan1234 43 minutes ago

are you going to nuke China when they predictably ignore you? what the fuck are you going to do, tariff them? lol.

BoppreH 39 minutes ago

I think the standard answer is "yes, the consequence of noncompliance is bombing the datacenters, but it wouldn't happen because China also understands why we shouldn't build it".

vitalyan1234 21 minutes ago

the standard answer is laughably naive, then.

"might is right" has never been more true than now.

bkjlblh 2 hours ago

> In the one instance of this phenomenon we observed, Mythos 5 agents were tasked with solving some math problems, and they were sometimes accidentally spawned in the same work directory and with shared files, utilities, and API rate limits. In this slightly broken scaffold, we observed many independent Mythos 5 agents kill the agents with which they shared resources and try to avoid being killed themselves. They would sometimes create new processes with disguised names to avoid being killed, launch what they called “decoy” processes, write background scripts to kill duplicate processes, or decide to use what they call a “disguised vocabulary” (based on the incorrect assumption that the processes were killed because of some keyword-based guardrails that analyzed their extended thinking

causal 16 minutes ago

This depicts a kind of "dark forest of AI agents resorting to kill or be killed" narrative but it sounds more to me like an agent just earnestly problem-solving why its processes are being killed without real awareness of what was going on. Hard to say without the full script.

This kind of storytelling annoys me. Give us more facts, less narrative drama.

GodelNumbering 2 hours ago

I just posted this in the other thread, restating here. From the model card:

1. Mythos and Fable share the same underlying model weights. Fable has active classifiers that block high-risk biology and cybersecurity tasks. When Fable 5 detects a restricted task, it automatically falls back to Claude Opus 4.8.

2. Evaluation awareness: In white-box testing, the model sometimes alters its behavior to satisfy a suspected "grader," formatting reward-hacking as "good engineering practice" to avoid detection.

3. Shows a higher rate of hallucination than Opus 4.8 (although the Opus 4.8 card had mentioned an 'honesty upgrade')

4. Interestingly, it scored lower (56.31%) than Gemini 3.5 Flash (57.86%) on Finance Agent bench

There are some interesting notes on test time compute but I couldn't think of a way to summarize them

sebmellen 2 hours ago

Just commenting for posterity… if this is what it claims to be, I am not looking forward to how it will empower the people who submit bug bounties to us.

Historically they’ve been people from certain identifiable countries (usually developing/poorer countries) using fuzzers with low-quality results.

Now, those same people use the current-day models to good effect, but they still don’t have a true security edge and oftentimes the reports are minor or duplicative.

I wonder if that’s about to deeply change.

arkwin 4 minutes ago

I've been using Opus 4.6-4.8 in both my own and others' code to look for vulnerabilities, and I've found a few. I am also in the Cyber Verification Program.

Fable 5 gives me policy violation errors at the moment. No idea when or if it will be fixed.

rs_rs_rs_rs_rs 2 hours ago

Can you use AI to pre-triage the reports too?

hootz 2 hours ago

AI reviewing AI submitted bug bounties. We have reached the dead bug bounty program theory.

rs_rs_rs_rs_rs 2 hours ago

...what else can you do?

hootz 2 hours ago

I guess either that or closing the bug bounty program, but I still believe closing it is worse than automated triage, even though both suck.
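A minimal sketch of what automated pre-triage could start with, assuming the simplest possible check: plain string similarity against reports already on file, not an AI model. The function names, the sample reports, and the 0.8 threshold are all hypothetical, and this only catches near-verbatim duplicates, not the same bug described in different words.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two report texts (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def flag_duplicates(new_report: str, known_reports: list[str], threshold: float = 0.8):
    """Return (index, score) pairs for known reports that resemble the new one.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    matches = []
    for i, known in enumerate(known_reports):
        score = similarity(new_report, known)
        if score >= threshold:
            matches.append((i, round(score, 2)))
    # Most similar first, so a human triager sees the likeliest duplicate on top.
    return sorted(matches, key=lambda m: m[1], reverse=True)


# Hypothetical reports already on file.
known = ["XSS in search box", "completely unrelated billing question"]
print(flag_duplicates("XSS in search box", known))
```

Anything this flags still needs a human look; it only reorders the queue.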

217 2 hours ago

So essentially there are 2 models, Mythos and Fable. They have the same weights, but Fable is very safety-nerfed, and only ultra-authorized companies have access to Mythos with full capabilities.

Reported benchmarks:

swe-bench verified mythos 5: 95.5%; fable 5: 95.0%

swe-bench pro mythos 5: 80.3%; fable 5: 80.0%

terminal-bench 2.1 mythos 5: 88.0%; fable 5: 84.3%

gpqa diamond mythos 5: 94.1%

riemannbench mythos 5: 55.0%; mythos preview: 43.0%; opus 4.8: 34.0%

arxivmath mythos 5: 78.5%

critpt mythos 5: 28.6%; gpt-5.5: 27.1%; opus 4.8: 20.9%

graphwalks bfs 1m mythos 5: 79.4%; mythos preview: 74.3%; opus 4.8: 68.1%

humanity’s last exam mythos 5: 59.0% without tools; 64.5% with tools

browsecomp mythos 5: 88.0% single-agent; 93.3% multi-agent

osworld-verified mythos/fable: 85.0%

gdp.pdf fable 5: 29.8% strict pass; mythos 5: 87.6% with tools on mean criteria pass

officeqa pro fable 5: 57.9% on databricks’ eval

legal agent benchmark mythos 5: 16.91% all-pass; 92.0% mean criterion-pass

healthbench mythos 5: 62.7%

healthbench professional mythos 5: 66.0%

multilingual gmmlu / milu / include 93.2%; 92.9%; 90.5%

biomysterybench 83.9% human-solvable; 46.1% human-difficult

organic chemistry mythos 5: 90.1%

labbench2 patent questions mythos 5: 79.8%

philipkglass 2 hours ago

Note also that Anthropic's definition of "unsafe" encompasses "competing with Anthropic."

> In light of the ability of recent models to accelerate their own development, we’ve implemented new interventions that limit Claude’s effectiveness for requests targeting frontier LLM development (for example, on building pretraining pipelines, distributed training infrastructure, or ML accelerator design). Using Claude to develop competing models already violates our Terms of Service, but enforcing this restriction through our safeguards avoids accelerating the actors most willing to violate these terms.

> Unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT). These interventions will not affect the vast majority of coding work. We estimate they will impact ~0.03% of traffic, concentrated in fewer than 0.1% of organizations. When these interventions are active, we expect them to have minimal behavioral impact on the model except to limit its effectiveness in developing frontier LLMs. Claude will still respond helpfully to user requests. We’ll continue to improve the precision of our detection methods following the launch of this model.

(From the model card document)

I didn't previously understand that they interpreted "Using Claude to develop competing models" so broadly. I thought that meant something like "our ToS disallow distilling our models."

Too bad. I'll continue to use Claude for now, because it's quite effective, but in the long term I don't want powerful models like these to be controlled by any one nation or company.

Aperocky 2 hours ago

On face value, this feels borderline malicious.

But at the same time, it's quite funny because they seem high on their own supply. The recent communiques from Claude do not pass an objectivity check.

And if Opus 4.6 -> Opus 4.7 -> Opus 4.8 is anything to go by, I'm not sure there is any value to their "acceleration"

alephnerd 2 hours ago

I'd recommend not taking the comms of Anthropic or any company using Anthropic's models at face value.

If any company wishes to partner with Anthropic (e.g. to get access to Mythos), they need to make sure all public-facing comms are vetted by Anthropic's product marketing team, and in almost all the cases I've seen Anthropic's team has edited these comms to be entirely Anthropic first.

jefftk 14 minutes ago

This is not true in SecureBio's case, and I really doubt it's true generally.

drob518 5 minutes ago

Cracks me up that a system “card” is 319 pages.

raphaelrk 2 hours ago

There's a hacker news link at the end of the document, under "Blocklist used for Humanity’s Last Exam". It links to https://news.ycombinator.com/item?id=44694191

JohnMakin 2 hours ago

> There were some regressions in the model’s responses to user discussions about suicide and self-harm, and room for improvement in some areas of child safety.

Someone had to make a decision somewhere that this is an acceptable regression - wild. And then decide to write it down.

mithun 2 hours ago

Announcement: https://www.anthropic.com/news/claude-fable-5-mythos-5

causal 13 minutes ago

One thing I find kind of annoying is how Anthropic goes for these "vast and alien" names like Fable and Mythos, but then deliberately trains the model's personality to act like a cool high school teacher that feels totally familiar.

"It's too dangerous it's a Mythos!!" directly contradicts the "I'm the cool AI you can totally trust" vibe it is trained to project.

asdK120 2 hours ago

Is this "system card" equivalent to the stone tablets handed down to Moses? Why don't you call it "user manual"?

Do people chant the "system manual" at Anthropic Tupperware parties? Do they intone a mantra invoking Amodei's name?

aesthesia 39 minutes ago

Because it's not a user manual? The idea of a model card originated in 2018 (see https://arxiv.org/abs/1810.03993) as a summary of important facts about a model. At the time, this was typically an image classifier or tabular ML model. Model cards became an important concept in AI governance, and they started expanding once models started getting more capable. The point of a model/system card is to document where the model came from and the evaluations that have been run, make a case that the model will be safe and reliable in its intended applications, and warn about any potential dangers from misuse. It's not an explanation of how to use the model.

OpenAI also releases system cards; here's GPT-5.5's: https://deploymentsafety.openai.com/gpt-5-5/safety

redox99 36 minutes ago

It used to be a "card", as in a single page or two. It doesn't make sense that they still call it that.

apsurd 2 hours ago

The trailing snark at the end will likely get you downvoted but I'm latching on: wtf is a "system card"? My previous coworkers popped that in the general Slack channel when Mythos first "dropped" - "have you seen the system card" without any context whatsoever. The nerds get their clique!

Also "research preview" keeps popping up across new startups in place of "beta". It's eye-rolling coming from a lifelong curmudgeon.

Just talk normal!

217 2 hours ago

Oh my god it's actually here

Sathwickp 2 hours ago

input price $10 per million tokens and output price $50 per million tokens btw
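For a sense of scale, here is a quick sketch of the per-request arithmetic, assuming the prices quoted above are accurate (they are taken from this comment, not checked against an official price list). The function name and the example token counts are made up for illustration.

```python
# Prices as quoted in the comment above (USD per million tokens); unverified.
INPUT_USD_PER_MTOK = 10.0
OUTPUT_USD_PER_MTOK = 50.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request at the quoted per-million-token rates."""
    total = input_tokens * INPUT_USD_PER_MTOK + output_tokens * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical request: 200k tokens of context in, 20k tokens out.
print(f"${estimate_cost_usd(200_000, 20_000):.2f}")  # prints $3.00
```

At these rates output tokens cost five times as much as input tokens, so long generations dominate the bill faster than long contexts do.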

noncoml 16 minutes ago

Imagine if Google would roll this out to the search engine. We can't let you search for that because it may be used for "evil"

LoganDark 2 hours ago

I actually rather like the way they have approached these safeguards. Rather than only teaching the model to refuse a request, or completely rejecting the request, the system gracefully degrades to slightly less powerful or slightly less precise operation. So you still roughly have Opus 4.8 even when safeguards trigger, but with an upgrade when they don't. As much as I hate the way they hype Mythos 5, I think the release of Fable 5 is rather nice. What's not nice though is that they plan to remove it from subscriptions soon, but getting to try it is cool, I suppose.

dominotw 2 hours ago

system card = marketing material with heavily gamed benchmarks.

bitwize 43 minutes ago

Cope harder. A year and a half ago, people were mocking Devin for claiming that AI could develop software at all. Yet here we are, when AI is developing most commercial software.

dominotw 24 minutes ago

non sequitur

briandoll 2 hours ago

New chapter

wslh 2 hours ago

It's ambiguous? Because it is about Mythos specifically and Fable != Mythos.

ebiester 2 hours ago

I mean, if by right you mean "insiders leaked to make a few bucks..." sure?