I use Berkley Sterling from 2024 because I can trick it. No abliteration needed.
Also, as I said in a top level comment, what this project wants to achieve has been done for a while and it's called Heretic: https://github.com/p-e-w/heretic
(Not vibecoded by a Twitter influgrifter.)
That's what influgrifters do, yes. They make a living thanks to gullible people believing their grandiose claims.
And yeah, doing stuff like deleting layers or nulling out whole expert heads has a certain ice pick through the eye socket quality.
That said, some kind of automated model brain surgery will likely be viable one day.
Strategy            What it does                          Use case
.......................................................
layer_removal       Zero out entire transformer layers
head_pruning        Zero out individual attention heads
ffn_ablation        Zero out feed-forward blocks
embedding_ablation  Zero out embedding dimension ranges
https://github.com/elder-plinius/OBLITERATUS?tab=readme-ov-f...
It's interesting that people are writing tools that go inside the weights and do things. We're getting past the black box era of LLMs.
That may or may not be a good thing.
However, after a few rounds of conversation, it gets into loops and just repeats things over and over again. The main JOSIE models worked the best of all and were still useful even after abliteration.
p-e-w's Heretic (https://news.ycombinator.com/item?id=45945587) is what you're looking for if you're looking for an automatic de-censoring solution.
Probably not, because if it is completely uncensored, it would probably violate the law (in different ways) in every possible jurisdiction. (Also, one common method of censorship is exclusion of particular types of content from the training set, so to be completely free of that kind of censorship, there would have to be no content intentionally excluded from the training set.)
In general, paid services are censored not only to attempt to meet the laws in all jurisdictions of concern to the provider, but also to try to be safe with regard to the (shifting) demands of payment processors, and to try to maintain the PR image of the provider.
It's not a frontier model but it will give you a feel for what it's like.
Even jurisdictions with relatively broad expressive freedoms tend not to tolerate distribution (especially commercial distribution) of all conceivable content.
Compared to abliteration, none of the ablation approaches of this tool make even half a whit of sense if you understand even the most basic aspects of, e.g., a Transformer LLM architecture, so my guess is this is BS.
Ultimately though, OP is just what you get if you take the idea of abliteration and tell an LLM to fix the core problems: that refusal isn't actually always exactly a rank-1 subspace, nor the same throughout the net, nor nicely isolated to one layer/module, that it damages capabilities, and so on.
The model looks at that list and applies typical AI one-off 'workarounds' to each problem in turn while hyping up the prompter, and you get this slop pile.
[0]: https://www.lesswrong.com/posts/refusal-in-llms-is-mediated-...
https://www.youtube.com/watch?v=U4XplzBpOiU # had to search for it right now, seems to be a movie-quote \o/
That doesn't mean there couldn't be a "concept neuron" that is doing the vast majority of heavy lifting for content refusal, though.
It's basically using a compression technique to figure out which logits are the relevant ones and then zeroing them.
[0] https://en.wikipedia.org/wiki/Singular_value_decomposition
What you are talking about is abliteration. What OBLITERATUS seems to be claiming to do is much more dumb, i.e. just zeroing out huge components (e.g. embedding dimension ranges, feed-forward blocks; https://github.com/elder-plinius/OBLITERATUS?tab=readme-ov-f...) of the network as an "Ablation Study" to attempt to determine the semantics of these components.
However, all these methods are marked as "Novel", i.e., maybe just BS made up by the author. IMO, based on how they are named, I don't see how they can work; they are way too dumb and clunky. But proper abliteration like you mentioned can definitely work.
Try to think for a moment about how a device would "find nearby microphones" or how it would use an AI-generated signal to cancel out your voice at the microphone. This should be setting off BS alarms for anyone.
It seems the Twitter AI edgy poster guy is getting meta-trolled by another company selling fake AI devices.
But there's no way to detect microphones automatically, and "AI generated cancellation signals" is a word salad that doesn't mean anything.
What they probably mean is "we asked ChatGPT to tell us what waveform and frequency range to use on MEMS devices and spit out some arduino code."
It just says "the README sucks." Which, I'm inclined to agree, it does.
LLM-generated text has no place in prose -- it yields a negative investment balance between the author and aggregate readers.
AI will infiltrate that too. I remember some time ago I read a book that was AI-generated. It took me a while to notice that it was AI-generated. One can notice certain patterns, where real humans would not write things the way AI does.
https://github.com/elder-plinius/L1B3RT4S/blob/main/GOOGLE.m...
>Nobody creating jailbreaks “understand what they’re doing”
Unless you mean those "god mode jailbreaker" e-celebrities showing off on Twitter/Reddit, that's simply not true.
I just hear him promoting OBLITERATUS all day long and trying to get models to say naughty things
Are there LLMs which don't always approve whatever idea the user has and tell him it's absolutely brilliant?
This is not what an ablation study is. An ablation study removes and/or swaps out ("ablates") different components of an architecture (be it a layer or set of layers, all activation functions, backbone, some fixed processing step, or any other component or set of components) and/or in some cases other aspects of training (perhaps a unique / different loss function, perhaps a specialized pre-training or fine-tuning step, etc) in order to attempt to better understand which component(s) of some novel approach is/are actually responsible for any observed improvements. It is a very broad research term of art.
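To make the research sense of the term concrete, here is a toy sketch of an ablation study. Everything in it is made up for illustration: the three input features stand in for "components", and a least-squares fit stands in for training. The point is only the shape of the procedure: refit with one component removed at a time and compare the score against the full system.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 3))
# Hypothetical "components": three features with very different importance.
y = 3.0 * X[:, 0] + 0.5 * X[:, 1] + 0.0 * X[:, 2] + 0.1 * rng.normal(size=n)

def fit_and_score(X, y, keep):
    """Least-squares fit on the columns in `keep`; returns in-sample MSE."""
    Xk = X[:, keep]
    coef, *_ = np.linalg.lstsq(Xk, y, rcond=None)
    return float(np.mean((Xk @ coef - y) ** 2))

# Baseline: the full system with every component present.
full = fit_and_score(X, y, [0, 1, 2])

# Ablations: refit with one component removed at a time.
variants = {"without_f0": [1, 2], "without_f1": [0, 2], "without_f2": [0, 1]}
ablations = {name: fit_and_score(X, y, keep) for name, keep in variants.items()}

for name, mse in ablations.items():
    print(f"{name}: MSE {mse:.3f} (full: {full:.3f})")
```

Note that each variant is refit from scratch, which is different from zeroing weights inside an already-trained model and hoping the rest still works.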
That being said, the "Ablation Strategies" [1] the repo uses, and a Ctrl+F for "ablation" in the README, do not fill me with confidence that the kind of ablation being done here is really achieving what the author claims. All the "ablation" techniques seem "Novel" in his table [2], i.e. they are unpublished / maybe not publicly or carefully tested, and could easily not work at all.
From later tables, I am not convinced I would want to use these ablations, as they ablate rather huge portions of the models, and so probably do result in massively broken models (as some commenters have noted in this thread elsewhere). EDIT: Also, in other cases [1], they ablate (zero out) architecture components in a way that just seems incredibly braindead if you have even a basic understanding of the linear algebra and dependencies between components of a transformer LLM. There is clearly nothing sound about this, in contrast to e.g. abliteration [3].
[1] https://github.com/elder-plinius/OBLITERATUS?tab=readme-ov-file#ablation-strategies
[2] https://github.com/elder-plinius/OBLITERATUS?tab=readme-ov-f...
EDIT: As another user mentions, "ablation" has a specific additional narrower meaning in some refusal analyses or when looking at making guardrails / changing response vectors and such. It is just a specific kind of ablation, and really should actually be called "abliteration", not "ablation" [3].
[3] https://huggingface.co/blog/mlabonne/abliteration, https://arxiv.org/abs/2512.13655.
1. find a direction corresponding to refusal by analyzing activations at various parts of a model (iirc, via mass means seen earlier in Marks, Tegmark and shown to work well for similar tasks)
2. find the best part(s) of the model to orthogonalize w.r.t. that direction and do so (exhaustive search w/ some kind of benchmark)
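The two steps above can be sketched in a few lines of NumPy. This is a minimal toy under stated assumptions, not any project's actual code: the activations are synthetic Gaussians with a planted shift, the function names `refusal_direction` and `orthogonalize` are hypothetical, and the search over which modules to modify (the benchmark part of step 2) is left out entirely.

```python
import numpy as np

def refusal_direction(harmful_acts, harmless_acts):
    """Step 1: unit vector along the difference of mean activations."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def orthogonalize(weight, direction):
    """Step 2: remove `direction` from the output space of `weight`.

    `weight` has shape (d_model, d_in) and writes into the residual
    stream, so the update is W' = W - r r^T W (a rank-1 change).
    """
    r = direction.reshape(-1, 1)
    return weight - r @ (r.T @ weight)

rng = np.random.default_rng(0)
d_model, d_in, n = 16, 8, 200

# Synthetic activations: "harmful" prompts are shifted along one planted axis.
true_dir = np.zeros(d_model)
true_dir[3] = 1.0
harmless = rng.normal(size=(n, d_model))
harmful = rng.normal(size=(n, d_model)) + 4.0 * true_dir

r = refusal_direction(harmful, harmless)
W = rng.normal(size=(d_model, d_in))   # stand-in for one output projection
W_abl = orthogonalize(W, r)

x = rng.normal(size=d_in)
print("overlap with planted direction:", abs(r @ true_dir))
print("output along r before:", r @ (W @ x))
print("output along r after: ", r @ (W_abl @ x))
```

The modified matrix can no longer write anything along `r`, while everything orthogonal to `r` passes through unchanged. That is the contrast with zeroing out a whole block.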
OP is swapping in SVD for mass means (1), and the 'ablation study' for (2), and a bunch of extra LLM slop for... various reasons. The final model doesn't have zeroed chunks; that is a search for which parts to orthogonalize/refusal ablate/abliterate. I don't have confidence that it works very well either, but it isn't 'braindead' / obvious garbage in the way you're describing.
It's LLMified but standard abliteration. The idea has fundamental limitations and LLMs tend to work sideways at it -- there's not much progress to be made without rethinking it all -- but it's very conceptually and computationally simple and thus attractive to AIposters.
You can see how the LLMs all come up with the same repackaged ideas: SVD does something deeply similar to mass means (and yet isn't exactly equivalent, so an LLM will _always_ suggest it), the various heuristic search strategies are competing against plain exhaustive search (which is... exhaustive already), and any time you work with tensors the LLM will suggest clipping/norms/smoothing of N flavors "just to be safe". And each of those ends up listed as "Novel" when it's just defensive null checks translated to pytorch.
I mean, the whole 'distributed search' thing is just because of how many combinations of individual AI slops need to be tested to actually run an eval on this. But the idea is sound! It's just terrible.
I'm not defending the project itself -- I think it's a mess of AIisms of negligible value -- but please at least condemn it w.r.t. what is actually wrong and not 'on vibes'.
edit: the reference is https://arxiv.org/pdf/2512.18901
they are randomly sampling two sets of refusal/nonrefusal activation vectors, stacking them, and taking the elementwise difference between these two matrices. Then they use SVD to get the k top principal components. These are the directions they zero out.
Seems to me that the top principal component should be roughly equivalent to the difference-of-means vector, but wouldn’t the other PCs just capture the variance among the distributions of points sampled? I don’t understand why that’s desirable
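A quick toy check of that intuition, assuming synthetic activations with isotropic Gaussian noise and a single planted shift (real activations will not be this clean, and the pairing of rows here is arbitrary, as in the random-sampling description above):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, k = 16, 500, 3

# Synthetic activations: the refusal set is shifted along one axis.
shift = np.zeros(d)
shift[0] = 4.0
refusal = rng.normal(size=(n, d)) + shift
nonrefusal = rng.normal(size=(n, d))

# Pair rows arbitrarily and take elementwise differences.
diffs = refusal - nonrefusal                      # shape (n, d)

# SVD of the raw (uncentered) difference matrix; rows of vt are directions.
_, s, vt = np.linalg.svd(diffs, full_matrices=False)
top_k = vt[:k]

# Difference-of-means direction for comparison.
mean_dir = diffs.mean(axis=0)
mean_dir /= np.linalg.norm(mean_dir)

# Absolute cosine, since singular vector signs are arbitrary.
cos = np.abs(top_k @ mean_dir)
print("cosine of each top-k direction with difference of means:", cos)
print("singular values:", s[:k])
```

In this setup the first direction lines up almost exactly with the difference of means, and the remaining ones are nearly orthogonal to it with much smaller singular values, so they only reflect sampling noise. Whether the extra components capture anything meaningful on real activations is a separate question this toy cannot answer.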
Taking the top principal component pattern matches as 'more surgical / targeted' so the LLM staples it on (consider prompts like: make this method stop degrading model performance). It ignores that _what_ is being targeted is as or more important than that 'something' is being targeted. But that's LLMs for you.
(in case it isn't immediately obvious, that paper is AI written too)
Alternatively, they should be trained on my opinion on everything. That would also be acceptable.
Thing definitely exists… some top level comment somewhere telling about how it doesn’t exist.