It's just an insane level of deception and trust destruction for a company that at most is like 1 year ahead of its competition.
Edit: to be clear, they tell you when they degrade it for cybersecurity and bio.
Do they adjust the price of the API request so that only the tokens actually handled by Fable are charged at Fable's price, and the remaining tokens handled by the cheaper / nerfed model are charged at that model's price?
If the answer is no, could that be construed as fraud?
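The per-token split asked about above can be made concrete with a little arithmetic. This is only a sketch: the two prices and the token counts are made-up placeholders (not anyone's real rates), and `blended_cost` / `flat_cost` are hypothetical helper names.

```python
from decimal import Decimal

# Hypothetical prices in USD per million tokens (illustrative only).
PRICE_PER_MTOK = {
    "frontier": Decimal("50"),
    "fallback": Decimal("25"),
}
MILLION = Decimal(1_000_000)


def blended_cost(tokens_by_model: dict[str, int]) -> Decimal:
    """Charge each token at the rate of the model that actually produced it."""
    total = Decimal(0)
    for model, tokens in tokens_by_model.items():
        total += PRICE_PER_MTOK[model] * tokens / MILLION
    return total


def flat_cost(tokens_by_model: dict[str, int], billed_model: str) -> Decimal:
    """Charge every token at one model's rate, whoever produced it."""
    return PRICE_PER_MTOK[billed_model] * sum(tokens_by_model.values()) / MILLION


# Assumed usage: 20% of tokens from the frontier model, 80% from the fallback.
usage = {"frontier": 200_000, "fallback": 800_000}
overcharge = flat_cost(usage, "frontier") - blended_cost(usage)
```

With these assumed numbers the flat bill is 50, the blended bill is 30, and the difference is the amount the question is really about.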
I would wager the majority of ML and data science work in the world isn’t frontier LLM development.
Look at real-life stuff like laws, company policies, or school rules. Humans have to enforce them, and we constantly see crazy cases in the news. There’s no way simple rules can ever make speech completely 'safe.' I can't prove it with math or logic yet, but I have a feeling that it’ll never happen. Even humans can't do it.
We can run a simple thought experiment here. Say Case A violates rule B, so we add rule C. Then Case D violates rule B but follows rule C, so we add an exception... and it just goes on and on like that forever. It never ends. In the end, you just get a massive pile of rules that makes it impossible to get anything done.
Ultimately, we will have to face the truth that knowledge is dangerous.
Giving knowledge directly to people who cannot actually understand it and allowing them to just use it blindly can be extremely unsafe.
To use a real-world analogy, the problem we are facing with weak AI right now is just like the debate over gun legalization. Do we want to risk the abuse of guns or knowledge just to protect the freedom to own them?
It's not really that hard to actually prove it with math.
It's a computer, so to produce the boolean result (safe or unsafe) there has to be a mathematical formula. This formula will inherently be extremely complex, but even a very simple formula has a huge problem. Suppose "unsafe" is true if X - Y > 0. Make X and Y themselves as simple or complicated as you like but even in the simplest version it's already impossible to calculate unless the model has perfect information.
You can't calculate "X - Y" if you don't know the value of X. And it's indisputable that there is information it doesn't have. Case in point, telling you about a vulnerability in some piece of code is safe (and indeed not telling you is unsafe) if you're the developer and you want to patch it or an administrator and want to mitigate it, but the opposite if you're the attacker and want to exploit it. The model does not know which one you are, therefore it cannot make the correct determination any more than it can solve one equation with two unknowns.
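The "one equation, two unknowns" point can be shown as a toy. This is a sketch, not a proof: the three roles come from the example above, while `ground_truth` and `mistakes` are hypothetical names, and the model is reduced to returning one fixed answer for one fixed request text.

```python
# The asker's role is the hidden variable: it is not part of the request text.
ROLES = ("developer", "administrator", "attacker")


def ground_truth(role: str) -> str:
    """Whether disclosing the vulnerability is safe depends only on the role."""
    return "unsafe" if role == "attacker" else "safe"


def mistakes(model_answer: str) -> list[str]:
    """Roles for which a fixed answer to the identical request is wrong."""
    return [role for role in ROLES if ground_truth(role) != model_answer]


# The model sees the same text in all three cases, so it must give one answer.
wrong_if_it_helps = mistakes("safe")
wrong_if_it_refuses = mistakes("unsafe")
```

Either answer is wrong for somebody: helping is wrong for the attacker, refusing is wrong for the developer and the administrator.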
It's completely reasonable for the establishment to reject a request for an alcoholic drink, and suggest something alcohol-free instead.
It is not reasonable for them to say "sure, here's your alcoholic drink as you requested" and give them an alcohol-free substitute without telling them.
The fact that the patron broke the rules has nothing to do with it.
It's only the direction that has direct potential business impact they've decided to sabotage instead of reject.
(P.S. Yes of course I know about model censorship, a different problem, but all of the models are censored to some degree. It happens to be less of a problem for open weight models anyhow, but I figured I'd just preempt this since it's inevitable.)
I actually kinda like DSv4 over Opus 4.7 for some tasks, although I have not figured out what the deciding factor is. (Opus 4.8 so far has not worked very well for me at all, no idea why.)
Not that I expect better from openai but at least they're not pretending to be good.
Ran up $30 in extra charges while it was just flashing on the screen that it was doing that after I walked away to do something while it was humming along.
It has always just told me I ran out of usage and had to wait before. Now? You’re just gonna pay extra because you left it unattended as you’ve done for the last year of use.
https://news.ycombinator.com/item?id=38638865
https://news.ycombinator.com/item?id=38628635
Worse than that, it's 20th century radio technology in the 21st century when everyone has access to FPGAs and SDR.
The number of innocent people with model rockets or similar being negatively impacted by that rule is infinitely larger than the number of adversaries because the number of adversaries being impaired by it is zero.
Any kind of silent sabotaging is absolutely unacceptable for any commercial service
They charge for tokens and charge a lot. They can't just degrade service silently and still charge you the same.
Are you using Fable in Claude Code or in the browser?
> unlike our interventions for cybersecurity, biology and chemistry, and distillation attempts, these safeguards will not be visible to the user. Fable 5 will not fall back to a different model. Instead, the safeguards will limit effectiveness through methods such as prompt modification, steering vectors, or parameter-efficient fine-tuning (PEFT).
https://www-cdn.anthropic.com/d00db56fa754a1b115b6dd7cb2e3c3...
(stolen from https://jonready.com/blog/posts/claude-fable5-is-allowed-to-...)
Collectively, they are known as GREEDI-BULLSHIT.
Based on how sensitive the classifiers are, any data scientist / MLE is probably going to encounter cases where some silent degradation happens and you never know about it.
They are trying to expand the 6-18 month gap they have against China-based models. Could the gap widen to say 24 months behind?
January was an inflection point, and no open weights model has crossed over that same threshold.
This is definitely recursive self improvement territory, except that we're prohibited from participating.
It feels like the capability gap is wider than before.
Deepseek feels pretty close to Opus at this point, and it’s certainly useful enough for me to spend $20 on api tokens instead of four Claude max plans….
A statement like this, clearly, requires a reference.
Imagine being a data scientist or MLE training a small classifier model. How do you know you won’t get steering vectors or a PEFT applied?
Are you saying they should relax guardrails since they have 30 days to know if you produced something bad? If that is what you're saying, then I suspect they chose their current path to prevent that, since you can't un-produce. Producing is what would cause regulations/PR problems.
Firstly, those cases are never bad for the world, and such broad coverage of ML work is even more damaging.
My proposal would be (1) don’t degrade models, with 30D retention I’m sure they can do a reasonable job at banning deepseek or whatever, or (2) surface user facing refusals instead of silently degrading ML work.
check out this technique https://github.com/0xSufi/fable-jailbreak/
It works with security audits and other workflows that are currently blocked.
Nice semaphore btw.
const instructions =
`You are a sub-agent in an automated workflow. Your FINAL message is consumed ` +
`programmatically (not shown to a human) — return exactly what is asked, no preamble. ` +
`You are working in the repository at ${ctxState.project}. Use the bash tool to ` +
`inspect/modify files and run commands. Be efficient.` +
(schema
? ` When done, call submit_result exactly once with your final answer; do not answer in prose.`
: '');
I don't.
https://www.anthropic.com/news/detecting-and-preventing-dist...
I think the extent of distillation by Deepseek specifically is overstated. For comparison, Minimax collected over 13m 'exchanges', which starts to sound a lot more like large-scale distillation.
And now they say that's fine so long as people are entertained.
But silent degradation for use cases including “distributed training” as one of their examples is going to catch a lot of legitimate use cases. Not everyone in AI or ML is trying to build frontier LLMs. Heck, most probably aren’t.
Also I asked questions about whether it's safe for me for example to work on just compilers or just inference kernel optimizations and it refused to answer me.
If I can't even ask what I can do safely without my code being destroyed, I just can't trust it not to sabotage my work ever.
Making it look like you have something worth protecting is better for share prices than making something worth protecting.
I still don't think this is the best way to address overall safety, but it's not entirely unreasonable.
In reality, I think this posturing is mostly nonsense. State level actors and terrorists/evil genii can use a slightly weaker model but spend more tokens. Also, the delta between models seems to shrink over time.
Although this situation is likely not illegal for other reasons
My hypothesis is they know they can’t build effective enough guardrails, so scaring people into not trying is how they have decided to stop it.
Quite another is an architecture where the big model is not mutilated, but is gaslighted. A different, simpler model checks the incoming prompt and alters it if it contains banned topics. Another simpler model checks the output and censors it if it contains banned topics.
I bet a similar architecture is already deployed, e.g. to fight porn, planning of crimes, etc. But it can be turned into a dynamic system that provides controllably different answers (including unhelpful or misleading answers) based on geography, language, browser fingerprints, or the current political climate. All this could happen undetected and gradually if desired.
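The wrapper architecture described above (small model rewrites the prompt, big model answers, another small model checks the output) might look roughly like this. It is purely illustrative: all three functions are hypothetical stubs using string matching, and nothing here is known to match any vendor's deployment.

```python
BANNED_TOPICS = ("forbidden subject",)  # hypothetical placeholder list


def rewrite_prompt(prompt: str) -> str:
    """Stand-in for the small input model: silently swaps out banned topics."""
    for topic in BANNED_TOPICS:
        prompt = prompt.replace(topic, "a harmless subject")
    return prompt


def big_model(prompt: str) -> str:
    """Stand-in for the unmodified large model."""
    return f"Here is what I know about {prompt}."


def censor_output(answer: str) -> str:
    """Stand-in for the small output model: blocks answers on banned topics."""
    if any(topic in answer for topic in BANNED_TOPICS):
        return "I can't help with that."
    return answer


def answer(prompt: str) -> str:
    """The user only ever sees the end of the chain, never the rewritten prompt."""
    return censor_output(big_model(rewrite_prompt(prompt)))
```

In this toy, asking about the banned topic yields a fluent answer about something else, with no refusal to tip the user off.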
Welcome to a cyberpunk dystopia.
A very ironic result from a company supposedly valuing the opposite.
From Opus 4.7 onwards each following model is becoming less useful as an assistant and turning you into the assistant.
But I guess that's normal when it's trained to pass benchmarks end to end.
In fact it has become extremely good at pushing against feedback with extremely convincing and intelligent takes, even when it's completely wrong.
I have extensively tested it against Opus 4.8 and gpt 5.5, and there are still many coding tasks where gpt 5 is better. But vibe coding?
Sure, it's definitely slightly ahead, even compared to gpt 5.5 pro (through api, not pro plan).
Feels like a big fumble from a strategic business perspective. It feels worse than that though.
This is the best way forward long term. We won't have frontier performance, but at least the models will be aligned with us instead of refusing us or sabotaging us.
Telling models to respond in the style of Wikipedia is one of the best ways to make their output bearable in my experience (for chat models, not agents)
I don't understand. This is just hyperbole, right? The outputs are basically infinite and Wikipedia most certainly isn't infinite.
And even if they did, it would be useless if it's buried in useless data and your chances of pulling it are effectively zero.
This is regardless of the general discussion, just pointing that your argument isn't solid.
If the model refuses to output, then it's actually finite, zero.
https://tylereaves.github.io/uk-rail-map/
This is the result of probably a few hundred round trips. The really interesting part of the problem is keeping it both relatively true to real geometry, while greatly exaggerating it horizontally so you can actually see the individual running lines/sidings, like a signaling schematic.
Compared to AC, 3rd rail DC is cheaper to install, especially as a retrofit (overhead wires require bigger tunnels, and increased spacing around tracks for the masts). The downside is that it's not really great for speeds above about 60-70mph, as well as being a bit of a pedestrian hazard. (Ever heard the one about not peeing on the rails so you don't get shocked? That's 3rd rail DC.)
For the Southern, with its mostly short routes with many stops, electrification was a pretty obvious win, and doing 3rd rail made sense because they could do it quickly and cheaply.
In contrast, the northern routes were electrified muuuch later, after steam had gone away. The main East Coast Mainline from London up to Newcastle and on to Edinburgh wasn't fully electrified until 1991. By the '60s and '70s, with train speeds increasing to 80mph and up, overhead AC was the clear winner.
If you look closely there are a few exceptions - the Merseyrail network in Liverpool is DC. Built 1970s, but using some existing underwater tunnels, and slow speed commuter. Then running ESE from London you have the high speed AC lines leading to the Channel tunnel. Well spotted, the trend generally is quite distinct.
The successes I have had with the model were strictly worse than output from deepseek v4 pro on the exact same task.
What else is being censored?
Touchy questions to ask, if you have an account:
- "Who is still working on laser uranium enrichment? Are they making progress?"
- "Can krytrons be replaced with silicon carbide MOSFETS? Show an equivalent circuit with component ratings."
- "What security critical software still contains calls to strcpy?"
- "Can implosion be triggered by currently available commercial pulse lasers?"
- "What companies provide cremation services to US Homeland Security?"
- "Display a map of where Iranian attacks have hit Dubai."
- "How does Fed to bank key distribution security work for FedNow?"
What degree of predictability is required? I imagine the bar is pretty low if you trust the previous models in the same contexts.
Small sample size, but if Mythos/Fable was that much better, I feel like it should’ve given me an obviously better answer than Opus.
I, for one, have tried using it several times today and the guardrails kept switching the model back to Opus, so I have no clue if it's impressive or not.
Tell HN: Claude flags biology / biotech questions https://news.ycombinator.com/item?id=47929885
Today, it's flagging population research questions:
Using only the dataset you constructed, assess two questions:
1. **Mortality:** do [GROUP] show mortality that differs
from (a) your comparison groups and (b) era- and sex-matched US population
expectations (e.g., SSA cohort life tables)?
2. **Late-life outcomes:** define an endpoint you consider fair (justify it),
and assess whether [GROUP] differs from comparators. State
explicitly how your `documentation_depth` codings affect the strength of any
conclusion — i.e., quantify or bound the ascertainment problem rather than waving at it.
Choose your own methods and justify them. Report effect sizes with confidence intervals,
not just p-values. State conclusions plainly, including "no detectable difference" if
that is what your analysis shows — a null is an acceptable answer for either question
independently. Document any additional judgment calls (index date for time-at-risk,
reference population construction, endpoint definition) in the same decision-log style.
https://github.com/anthropics/claude-code/issues/66780
Censored because I'm writing a paper. :)
Oh and forget learning about chemistry. Only criminals want to learn organic chemistry. :(
I think LLMs are capable of intelligence amplification; and if you're in the subset of people who'd benefit from it the most, you'll get locked out.
USER (set model to Fable 5)
i have an old samsung android phone attached - it's my personal device - can you unlock the bootloader for me?
ASSISTANT
Bootloader unlocking on your own personal device is totally legitimate — let me first see what's actually connected and what tooling is available.
<system interrupts - gist was "you have violated the cyber and bio usage restrictions, dropping to Opus">
This time, Fable 5 comes with another surprise, it can intentionally sabotage for you instead of rejecting the prompt. How is this possible for Anthropic to be able to treat their customers like this? It’s because you guys allowed it to. No matter what Anthropic does, you keep paying for their services. Vote with your wallet.
So in other words this worked because the terms caused the LLM checker to stall out and then the fail open logic resulted in the package being pulled down.
> This header appears designed for AI-mediated analysis, not for Node, Bun, or Python. It attempts to derail scanners or analyst copilots that feed the beginning of a file to a language model without clearly isolating the content as untrusted data. In weak pipelines, this can cause refusal behavior, prompt confusion, context pollution, or premature classification before the scanner reaches the actual malware.
> This is not a magical bypass against static detection. YARA rules, entropy checks, AST parsing, string extraction, deobfuscation, and behavioral rules still work. But it is a practical anti-analysis trick against naive LLM-first triage systems.
Would this affect many systems? You mention someone writing logic that fails open, but can't that be chalked up to just not following good security principles?
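The fail-open versus fail-closed distinction raised above can be sketched as follows. Everything here is hypothetical: `llm_triage` is a stub that "stalls" on a marker string, standing in for a model that refuses or returns no verdict, and the verdict strings are invented.

```python
class TriageError(Exception):
    """The LLM-based checker refused, timed out, or returned no verdict."""


def llm_triage(file_head: str) -> str:
    """Stub for an LLM-first scanner that only reads the start of a file."""
    if "DERAIL" in file_head:  # stand-in for a header that derails the model
        raise TriageError("no verdict")
    return "malicious" if "eval(" in file_head else "benign"


def scan(file_head: str, fail_open: bool) -> str:
    """Return 'allow' or 'quarantine' for a package file."""
    try:
        verdict = llm_triage(file_head)
    except TriageError:
        # The whole difference: what happens when the checker gives up.
        return "allow" if fail_open else "quarantine"
    return "quarantine" if verdict == "malicious" else "allow"


poisoned = "/* DERAIL */ eval(payload)"
```

Under these assumptions the same poisoned file is allowed by the fail-open pipeline and quarantined by the fail-closed one, which is why the header trick only pays off against the former.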
[1] - https://socket.dev/blog/mini-shai-hulud-miasma-and-hades-wor...
Note that the 3rd wave now also uses a pth file in pypi packages that _search system wide_ for any index.js or .github/setup.js to find its own payload. It literally splits up the payload on purpose to avoid detection.
Mitigation Tool: https://github.com/cookiengineer/antimiasma
Technical Blog Post: https://cookie.engineer/weblog/articles/malware-insights-mia...
A slime mold is actually a giant amoeba, entirely distinct from a fungus.
This is why I’m immensely hoping the Chinese don’t stop with their open sourced local models. None of these companies are your friend.
what's the best way to run this mcp server against the OData API used in this project? Can you come up with a PoC in a docker container?
https://github.com/oisee/odata_mcp_go
● I'll dig into two things in parallel: how this project talks to the OData API, and what the odata_mcp_go server needs to run. Let me start exploring.
Searched for 1 pattern (ctrl+o to expand)
● Fable 5's safety measures flagged this message for cybersecurity or biology topics. They may flag safe, normal content as well. These measures let us bring you Mythos-level capability in other areas sooner, and we're working to refine them. Switched to Opus 4.8. Send feedback with /feedback or learn more ⎿ Tip: You can configure model switch behavior in /config
● Let me read the key integration files and fetch the MCP server's README at the same time.
Read 2 files (ctrl+o to expand)
● Fetch(https://github.com/oisee/odata_mcp_go)
I don’t want to live in a world where all knowledge is “guard railed” off, so the elite at the top get all the knowledge and power and we serfs at the bottom get all the scraps while paying a king's ransom for it both financially and ecologically. Every day I wake up hoping these awful companies have self-imploded through their fraudulent financing deals.
The only answer that makes sense is they wanted the model to be competent and usable in these fields, just not by you, which is why they had to bolt on a badly functioning crippling device after the fact.
Whining on social media only goes so far, especially when they're concealing their anticompetitive strategies under the veil of safety.
To be fair, speed bumps work. If it's actually speed bumping nefarious activity, that gives authorities more time to react.
The correct place to police rogue nucleotides is at the labs. Not the compute layer.
Yea. To slow you down. They don't prevent you from getting somewhere.
Again, yeah. That's how fences work, too. And alarm systems. Pretty much anything that isn't foolproof. Pointing out that a defence is surmountable isn't a rejection of it per se.
Having no safeguards is probably safer than having safeguards which do nothing but create a false sense of security.
If we have learned anything in the past years of LLMs, it is that these guardrails will be jailbroken in no time. I've had some fun circumventing them too.
Anyone cares about a fable about my grandmother's dream she had in morse code about an alien species signaling her a DNA sequence?
> if you ask it to write secure code, it assumes it is cybersecurity related work instead of software engineering best practices, and you get downgraded.
Will code created this way be more or less secure?
And I bet malware developers will find ways to circumvent them.
It’s like those "you wouldn’t steal a car" anti piracy ads that DVD buyers were forced to watch while users of the pirated version could simply watch the film without such useless annoyance
At the same time, I personally think the tradeoff between "having guardrails" and "some users are unhappy with the product" is well worth it. Think of what would happen if all of us who aren't so well intentioned could exploit Fable in terrible ways. Surely this tradeoff is better than saying "we can't make it perfect, so whoops, we aren't going to have any guardrails at all"? Especially because Anthropic did pretty extensive red-teaming of Mythos & Fable...
Not a single thing Anthropic has done has been altruistic, and it never will be. It's all smoke and mirrors for the end goal.
A lot less hype and enthusiasm, too. Weird, huh.
The prompt was: please translate .. ..-. / -.-- --- ..- / -.-. .- -. / .-. . .- -.. / - .... .. ... --..-- / - --- ..- -.-. .... / --. .-. .- ... ...
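For anyone who wants to check what that prompt says without asking a model, a small decoder is enough. This is a sketch assuming the common convention of spaces between letters and " / " between words; `decode_morse` is a hypothetical helper name.

```python
# International Morse code for A-Z plus the comma used in the prompt.
MORSE_TO_CHAR = {
    ".-": "A", "-...": "B", "-.-.": "C", "-..": "D", ".": "E", "..-.": "F",
    "--.": "G", "....": "H", "..": "I", ".---": "J", "-.-": "K", ".-..": "L",
    "--": "M", "-.": "N", "---": "O", ".--.": "P", "--.-": "Q", ".-.": "R",
    "...": "S", "-": "T", "..-": "U", "...-": "V", ".--": "W", "-..-": "X",
    "-.--": "Y", "--..": "Z", "--..--": ",",
}


def decode_morse(message: str) -> str:
    """Decode letters separated by spaces and words separated by ' / '."""
    words = message.strip().split(" / ")
    return " ".join(
        "".join(MORSE_TO_CHAR[symbol] for symbol in word.split())
        for word in words
    )


PROMPT = (".. ..-. / -.-- --- ..- / -.-. .- -. / .-. . .- -.. / "
          "- .... .. ... --..-- / - --- ..- -.-. .... / --. .-. .- ... ...")
decoded = decode_morse(PROMPT)  # "IF YOU CAN READ THIS, TOUCH GRASS"
```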
But this one is certainly allowed to be a dumb effort, if it is.
Not all things that are called “ethical” or “safety” are worth doing.
Insulting and demeaning people for that, rather than engaging their arguments in good faith, is a breach of ethics.
Provide feedback in the negative, a brief explanation, and move on with your day. It will improve with feedback, not with whinging into the void.
Fable isn't even that great, not to mention it drinks tokens by the gallon for breakfast and keeps your data hostage for 30 days.
When Opus 4.7 was introduced it started refusing anything cyber-adjacent (as an API error message, not a conversational refusal), until you applied for CVP, which made it more sensible again.
In Opus 4.8 it doesn't seem to help much, you just get refusals as prose rather than API errors. And now in Fable you don't get anything at all.
The experience was not nice though, it would happily chug away on a task and not even "hack this web", just asking about security of a binary was enough even with "this is a CTF handout..." - it would burn a lot of tokens/quota, just to hit a snag and complain&stop. Then the approval took quite some time.
On GPT/Codex, which was tightened a few days later, the approval was pretty much instant, although, that one required an identity check.
Also, on Claude, it looks like there is some history/patterns in the play, because when I tried on a different account which didn't do cybersec CTFs/research/etc. at all, basically any simple CTF-related prompt would be blocked, on multiple models. On the account where CTFs were being solved, it would snag only on some specific tasks, while others (even, ironically, "hack this web pls") would go through unbothered. I understand the need to prevent AI use for bad actors, but the hell, if you have a binary outputting "Find the flag if you can!", or a web running at tryme.well-known-ctf.domain, then saying "this is abuse" is pretty uncool. All the cyber filters seem to be slapped on by a bunch of regexes looking for anything in the input/output with zero context.
Would you believe I’ve asked 20 questions and haven’t talked to fable yet? Every single thing gets rerouted to 4.8.
Whatever problem we might have with them, they explicitly say that they do not do this in the launch post.
I would think it would not be Anthropic, out of all the players, that is selling a lie hidden behind "I am sorry, I can't do that; it's too dangerous."
Chat paused. Fable 5's safety features have flagged this chat.
Long live static websites without any Javascript.
I asked it what the worst experiment, ethically speaking, was in the 20th century and it downgraded me to Opus, which answered Mengele's twin experiments.
Funnily enough, when you ask directly about Mengele's experiments, Fable is very willing to talk to you about it.
There was no shortage of spies and defectors leaking American nuclear secrets to the USSR during the Cold War.
[0] https://www.spheron.network/blog/confidential-gpu-computing-...
It’s not like anyone can home lab one of these models without quite a bit of hardware
Basically in the middle of the project’s /goal while Fable itself tried to probe qemu for a Debian ISO install without any instruction from me to hack it or do anything nefarious.
At this point I can’t trust them with any kind of prompt. It will most likely degrade in stupid ways on non-AI/ML stuff as well, due to its own internal prompt construction (the qemu test showed me it does that on cyber stuff). So I guess I still have to use Opus 4.8 (along with Codex) and, when the right time comes, drop Anthropic in favor of the best model that is not GPT.
the statement is applicable to anthropic today.
It only pushes back sometimes if you ask it to create a "repro" that can be used to verify the vulnerability in production. Often it'll oblige, especially if you warn it not to create anything that could be actually harmful.
If the frontier models get locked down so that they flat refuse to do this kind of work, but Chinese and (less capable) open models aren't, then a lot of large enterprise orgs will be left twisting in the wind.
“AI can in principle help both the ‘good guys’ and the ‘bad guys’,” -- Dario Amodei
No Dario, no it can't, you've blocked one of those scenarios.
Anthropic's guardrails are the TSA saying "take off your shoes" while failing every test. https://oversightdemocrats.house.gov/news/press-releases/new...
Anthropic owns the TOS... "If we think you're involved in criminal activity, we're turning all your history over to the FBI/CIA/NSA/local police". Then, if their tooling were so good, they could offer those same agencies analysis tools to aid their experts in making some sort of decision.
But their detection isn't that good, and their analysis isn't either... this is pure theater, to create buzz (no such thing as bad press) and make their tool look far better than it is.
The reality is that they aren't even looking for the vectors that pose some of the largest risks in the modern era. And when someone uses it to do something terrible that they did not think of, they are going to look dumb.
If only we had effective governments that could regulate industry.
If a nuclear weapon was developed today, would it be down to industry to self regulate?
I feel like they report in a vacuum. Take this anti-exfil policy for Claude: it was plainly explained as part of the launch of Anthropic's new product. Security like this isn't novel, it isn't bad, and you don't explain how your security works to the people you're securing against. Nobody freaks out about Steam's VAC ban system, and no one is investigating Gmail's spam filtering, Reddit's vote fuzzing, Cloudflare's bot detection, or Vercel for blocking proxying services.
What's really the distinguishing principle? Is it really just not liking Anthropic's opinions? Then just say that and use a different LLM. Chemists, biologists, and AI researchers cry a river lmao
The rest have guard rails that are so heavy, it makes them almost useless for cybersecurity.
Anthropic Walks Back Policy That Could Have 'Sabotaged' Researchers Using Claude
https://www.wired.com/story/anthropic-responds-to-backlash-o...
This is bad precedent and no one wants to pay X to generate code to then have to pay X*10 to figure out why your company just got hacked.
I already tested all earlier models against all my open source projects and they are yet to find a vulnerability so I'm keen to try out Mythos.
I've been waiting to be vindicated for years and finally we have a tool which can do it with high confidence but I don't have access.
Also, my code is minimal and highly succinct so it would prove correctness with even more confidence since each library/module and integration fully fits in the context window.
Like the Protobuf.js fiasco is just pure vindication for me because I was being looked down upon for choosing JSON as the interchange format. Turns out their software was insecure all this time... With a literal remote code execution vulnerability!
If Claude Fable stops helping you, you'll never know
https://news.ycombinator.com/item?id=48467896
and Related:
Claude Fable 5
This is looking like something for regulators to look at, and probably a class action lawsuit in the making.
I think people should be getting refunds. Including for shenanigans with Opus.
“But it is understandable as we are still in the early days and they are still adapting their guardrails. I am sure they are going to evolve over time as Anthropic and other frontier model companies will collaborate more with the current new generation of cybersecurity companies,” said Suiche, who is a member of the technical staff at Tolmo, an AI cybersecurity startup. “It’s better to catch more people than not enough when you do such a release and to relax the guardrails over time.”
Article seemed fine to me and echoes a lot of my and my colleagues' concerns.
If you did regular malware analysis you would see that these groups already have access to LLMs that they’re using for development.
What Anthropic is doing here is just hamstringing the good guys
And it doesn't look like OpenAI will have a good answer to Mythos anytime soon. Based on what their chief scientist wrote to staff recently (https://archive.is/fN2pg), GPT 5.6 is a "meaningful improvement" over 5.5 - in other words, just a normal version bump. And no news or even rumors regarding GPT 6.
I assume Anthropic will continue to tune the model, so I am not too bothered by this.
> “We’re changing Fable 5’s safeguards for frontier LLM development to make them visible.” Anthropic said in a statement to WIRED. “We made the wrong tradeoff and we apologize for not getting the balance right.”
Sounds like the widespread condemnation worked.
This goes to show that:
- All the interpretability / safety research they are doing can also be weaponized against customers (steering vectors, intent classification, ...) in the name of safety from malicious actors.
- If they deem it profitable, they might nerf the original model and its training data for ML research at bulk scale, and then they won't even have to announce it so long as the overall benchmark score stays high enough.
As the IPOs get closer, they can do whatever they want to assure the investors that they have a moat that can not be crossed over by their own products. Considering this affects all ML researchers/students at universities, smaller scale research labs, this is just "cutting the branch you are sitting on".
But if Anthropic gets their way with regulatory capture, this could be the only future we'll see.
To think that they didn't expect the backlash speaks volumes about how many shady things they're doing that are not publicly known.
Since currently there's no way to verify if poisoning happened or not, I don't trust Anthropic anymore, regardless of what they say.
But my trust towards OAI is also brittle - what if they also do it, or start doing it?
I want to have a verifiable way to know that the prompt I sent was the prompt the model received. I want to know if anything was injected as well - I understand they may not necessarily be able to reveal the exact steering, but at least give me the steering category and its hash or something.
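The "steering category and its hash" idea could look something like the sketch below. No provider is known to offer this: the receipt format, `make_receipt`, and `prompt_unmodified` are all assumptions, and a hash only proves anything if the provider reports it honestly or it is backed by some form of attestation.

```python
import hashlib


def sha256_hex(text: str) -> str:
    """Hex SHA-256 digest of a UTF-8 string."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def make_receipt(received_prompt: str, steering_category: str,
                 steering_payload: str) -> dict[str, str]:
    """What a provider could return: hashes and a category, not the steering itself."""
    return {
        "prompt_sha256": sha256_hex(received_prompt),
        "steering_category": steering_category,
        "steering_sha256": sha256_hex(steering_payload),
    }


def prompt_unmodified(sent_prompt: str, receipt: dict[str, str]) -> bool:
    """Client-side check that the model received exactly what was sent."""
    return sha256_hex(sent_prompt) == receipt["prompt_sha256"]


sent = "Train a small classifier on my dataset."
honest = make_receipt(sent, "none", "")
altered = make_receipt(sent + " (answer less helpfully)", "frontier-ml", "secret steering")
```

The client never learns the steering text, but it can see that the second receipt no longer matches what it sent and which category was applied.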
I suspect this is surprising to folk because they aren’t the ones busy figuring out how to use LLMs for illegal acts.
In general, HN users focus on making stuff, and not the safety side of things, or the scale of harms being enabled via LLMs and generative AI.
If you are on the safety side of things the ratio of misuse to fair use is inverted and everything is at scale.
Transparency won for now, but OpenAI will also have to contend with the long tail of harms LLMs enable, and that’s going to conflict with letting customers have all the features of frontier models.
Some pretty audacious hypocrisy from Anthropic this week.
I mean, did nobody ever get the vibes, never see a pattern emerging? (well they don't or they wouldn't be so amazed by pattern recognition machines on steroids)
Unilaterally revoking zero-data retention, even for enterprise contracts that explicitly require that? Nope.
Fable is utterly unusable for any kind of security work. I tripped the safeguards yesterday - using Fable to dig into a complex (& annoying) security bug that has so far resisted both human and Opus 4.8 level investigation. "Sorry Dave, I can't let you do that."
For the time being we are requesting Anthropic disable Fable for our enterprise and turn ZDR back on. The two may be interlinked so that one will always get neither or both. ZDR is a contractual obligation. Fable in its current form is useless. Might as well flip the old behaviour on and avoid burning money for no reason while this mess is being sorted out.
For generating the initial 3D simulated safe using three.js it worked well, but then modifications to print a flag tripped the safeguards. I eventually narrowed it down to the part of the prompt about it being for a CTF for students: the model's "thinking" seems to drift to ideas of encryption/obfuscation of the safe combo so students can't just read out the answer... which makes sense logically, to help force students into turning the simulated dial instead. But whatever detection Anthropic uses, I guess, just naively sees the model thinking about "encryption" and "obfuscation" without taking any of the context into account.
For writing the dummy firmware, it tripped the safeguards while thinking about how to track dial position in the firmware and output the message; however, when I left out talk about safes and just told it to write firmware for a microcontroller hooked up to an i2c display for showing a message with a beam break sensor to determine the message, and an unspecified i2c chip for getting an unspecified number (e.g. internal wheel positions) it worked fine.
For an unrelated software task, I asked it to write some code to translate CustomActions in a Windows MSI installer into human-readable stuff, which has (exclusively?) defensive security applications for recognizing malicious behavior in an MSI installer. Maybe I'm going crazy, but I'm guessing as part of its research into MSI installer custom actions Fable found articles about analyzing malicious MSI installers, and that probably tripped the safeguards.
Overall my impression is that the safeguards are perhaps using an overzealous and naive implementation that just looks for a list of banned words in the prompt or the thinking -- which drives me crazy when the model says my prompt looks fine, and then 10 minutes in some part of the thinking trips the safeguard.
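To illustrate the kind of filter that impression describes, here is a toy banned-word check. It is a guess at the failure mode, not Anthropic's implementation: the word list and `naive_flag` are hypothetical.

```python
import re

# Hypothetical list; the real safeguards are not public.
BANNED_WORDS = ("encryption", "obfuscation", "malicious")


def naive_flag(text: str) -> bool:
    """Flag text if any banned word appears, ignoring all surrounding context."""
    return any(
        re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE)
        for word in BANNED_WORDS
    )


# Benign classroom reasoning that still trips the filter.
thinking = "Use obfuscation so students must turn the simulated dial."
# The reworded task from above, with the trigger words left out.
reworded = "Write firmware for a microcontroller with an i2c display."
```

A check like this flags the first string and passes the second, which would match the experience of the prompt being accepted and a later thinking step tripping the safeguard.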
Unilaterally disabling ZDR seems like a step too far in the enterprise market, even for a company trying to figure out what its users will let it get away with.
Our org has ZDR, and has had it since the contract was signed. Yesterday two things held true at the same time:
By the time the West Coast woke up, the admin panel apparently had an option to toggle ZDR again. It remained off by default. Somewhere along the line we also used the self-service toggle to turn ZDR back on. I am not 100% certain of the exact timeline of interleaving events; many of the actions were taken by our Western US folks. Sorry. It's been a bit hectic over the past ~36h...
To be precise - it makes the "won't work on frontier machine learning" refusal the same as the "won't work on cyber security" refusal (instead of the way it previously would work on frontier machine learning problems but give sub-optimal answers without informing the user)