LLMs are incredibly useful but I'm not sure about this statement.
It is proposing stuff that I haven't seen before, but I don't know whether it is new or creative relative to the entirety of collective human knowledge.
To some extent. It's not clear where specifically the boundaries are, but it seems to fail to approach problems in ways that aren't embedded in the training set. I certainly would not put money on it solving an arbitrary logical problem.
https://genai-showdown.specr.net/image-editing
There's been a lot of progress there; it's just that an LLM that's best for, say, coding isn't also going to be the best for image editing.
I keep explaining to my peers, friends and family that what is actually happening inside an LLM has nothing to do with consciousness or agency, and that the term AI is just completely overloaded right now.
Just like we have machines that can do "math", and they do so artificially.
Or "logic", and they do so artificially.
I assume we'll drop the "artificial" part in my lifetime, since there's nothing truly artificial about it (just like math and logic), since it's really just mechanical.
No one cares that transistors can do math or logic, and it shouldn't bother people that transistors can predict next tokens either.
AI in pop culture doesn't mean that at all. Most people's impression of AI before the LLM craze came from some form of media based on Asimov's laws of robotics. Now that LLMs have taken over the world, they can define AI as anything they want.
What makes you think natural brains are doing something so different from LLMs?
I consider it highly plausible that confabulation is inherent to scaling intelligence. In order to run computation on data that due to dimensionality is computationally infeasible, you will most likely need to create a lower dimensional representation and do the computation on that. Collapsing the dimensionality is going to be lossy, which means it will have gaps between what it thinks is the reality and what is.
"Many small errors" makes a presumption about LLM confabulation/hallucination that seems unwarranted. Pre-LLM humans (and our computers) have managed vast nuclear arsenals, bioweapons research, and ubiquitous global transport - as a few examples - without any catastrophic mistakes, so far. What can we reasonably expect as a likely worst case scenario if LLMs replace all the relevant expertise and execution?
I am watching people trust LLM-based analysis and actions 100% of the time without checking.
I think we need to start rejecting anthropomorphic statements like this out of hand. They are lazy, typically wrong, and are always delivered as a dismissive defense of LLM failure modes. Anything can be anthropomorphized, and it's always problematic to do so - that's why the word exists.
This rhetorical technique always follows the form of "this LLM behavior can be analogized in terms of some human behavior, thus it follows that LLMs are human-like" which then opens the door to unbounded speculation that draws on arbitrary aspects of human nature and biology to justify technical reasoning.
In this case, you've deliberately conflated a technical term of art (LLM confabulation) with the concept of human memory confabulation and used that as a foundation to argue that confabulation is thus inherent to intelligence. There is a lot that's wrong with this reasoning, but the most obvious is that it's a massive category error. "Confabulation" in LLMs and "confabulation" in humans have basically nothing in common; they are comparable only in an extremely superficial sense. To then go on to suggest that confabulation might be inherent to intelligence isn't even really a coherent argument, because you've created ambiguity in the meaning of the word confabulate.
No. LLMs do not confabulate; they bullshit. There is a big difference. AIs do not care, cannot care, have no capacity to care about the output. String tokens in, string tokens out. Even if they have all the data perfectly recorded they will still fail to use it for a coherent output.
> Collapsing the dimensionality is going to be lossy, which means it will have gaps between what it thinks is the reality and what is.
Confabulation has to do with degradation of biological processes and information storage.
There is no equivalent in an LLM. Once the data is recorded it will be recalled exactly the same, down to the bit. An LLM representation is immutable. You can download a model 1000 times, run it for 10 years, etc. and the data is the same. The closest you get is storing the data on a faulty disk, but that is not why LLM output is so awful; that would be a trivial problem to solve with current technology (like having a RAID and a few checksums).
The neat thing about LLMs is they are very general models that can be used for lots of different things. The downside is they often make incorrect predictions, and what's worse, it isn't even very predictable to know when they make incorrect predictions.
So, they can't lie, but they can (and, in fact, exclusively do) bullshit.
Isn't "caring" a necessary pre-requisite for bullshitting? One either bullshits because they care, or don't care, about the context.
I haven't seen any counter examples, so you may give some examples to start with.
I'm extremely skeptical that all of life evolved intelligence to be closer to truth only for us to digitize intelligence and then have the opposite happen. Makes no sense.
Fitness is effective truth prediction, appropriately scoped.
A frog doesn't need to understand quantum physics to catch a fly. But if the frog's model of fly movement were trained on lies, it would have a model that predicts poorly, it wouldn't catch flies, and it would die.
There is another level to this in that the more complex and changing the environment the more beneficial a wider scoped model / understanding of truth.
However, if you are going to lean fully into Hoffman and accept that by default consciousness constructs rather than approximates reality, I think we will have to agree to disagree. Personally I subscribe to Karl Friston's free energy principle.
Now imagine a high-skilled software engineer with dementia coding safety-critical software...
[0] https://www.medicalnewstoday.com/articles/confabulation-deme...
Is it something we want to emulate?
It's like saying, computation requires nonzero energy. Is that a feature or a bug? Neither, it's irrelevant, because it's a physical constant of the universe that computation will always require nonzero energy.
If confabulation is a physical constant of intelligence, then like energy per computation, all we can do is try to minimize it, while knowing it can never go to zero.
Are you seriously making the argument that AI "hallucinations" are comparable to, and interchangeable with, mistakes, omissions and lies made by humans?
You understand that calling AI errors "hallucinations" and "confabulations" is a metaphor to relate them to human language? The technical term would be "mis-prediction", which suddenly isn't something humans ever do when talking, because we don't predict words, we communicate with intent.
To be fair, I've known humans who are like this as well.
Neuroplasticity is hard to simulate in a few hundred thousand tokens.
I think for a while the test was passed. Then we learned the hallmark characteristics of these models, and now most of us can easily differentiate. That said -- these models are programmed specifically to be more helpful, more articulate, more friendly, and more verbose than people, so that may not be a fair expectation. Even so, I think if you took all of that away, you'd be able to differentiate the two, it just might take longer.
How many humans seriously have the attention span to have a million "token" conversation with someone else and get every detail perfect without misremembering a single thing?
But sure, let's say it doesn't. If you interact with someone day after day, you'll eventually hit a million tokens. Add some audio or images and you will exhaust the context much much faster.
However, I'll grant you that Turing's original imitation game (text only, human typist, five minutes) is probably pretty close, and that's impressive enough to call intelligence (of a sort). Though modern LLMs tend to manifest obvious dead giveaways like "you're absolutely right!"
I am not trying to be snarky; I used to think that intelligence was intrinsically tied to or perhaps identical with language, and found deep and esoteric meaning in religious texts related to this (i.e. "in the beginning was the Word"; logos as soul as language-virus riding on meat substrate).
The last ~three years of LLM deployment have disabused me of this notion almost entirely, and I don't mean in a "God of the gaps" last-resort sort of way. I mean: I see the output of a purely-language-based "intelligence", and while I agree humans can make similar mistakes/confabulations, I overwhelmingly feel that there is no "there" there. Even the dumbest human has a continuity, a theory of the world, an "object permanence"... I'm struggling to find the right description, but I believe there is more than language manipulation to intelligence.
(I know this is tangential to the article, which is excellent as the author's usually are; I admire his restraint. However, I see exemplars of this take all over the thread so: why not here?)
An LLM is a statistical next token machine trained on all stuff people wrote/said. It blends texts together in a way that still makes sense (or no sense at all).
Imagine you made a super simple program which would answer yes/no to any question by generating a random number. It would get things right 50% of the time. You can then fine-tune it to say yes more often to certain keywords and no to others.
Just with a bunch of hardcoded paths you'd probably fool someone into thinking that this AI has superhuman predictive capabilities.
This is what it feels like is happening. Sure, it's not that simple, but you can code a base GPT in an afternoon.
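To make the thought experiment above concrete, here is a minimal sketch of such a yes/no program. The keyword table and its probabilities are made up for illustration; nothing here is meant to describe how a real model is built.

```python
import random

# Hypothetical "fine-tuning": probability of answering "yes" when a keyword
# appears in the question. These values are invented for the example.
KEYWORD_YES_PROB = {
    "sun": 0.95,
    "unicorn": 0.05,
}


def answer(question: str, rng: random.Random) -> str:
    """Answer "yes" or "no" with a coin flip, biased by the first matching keyword."""
    p_yes = 0.5  # baseline: a fair coin
    lowered = question.lower()
    for keyword, prob in KEYWORD_YES_PROB.items():
        if keyword in lowered:
            p_yes = prob
            break
    return "yes" if rng.random() < p_yes else "no"


def yes_rate(question: str, trials: int = 10_000, seed: int = 0) -> float:
    """Fraction of "yes" answers over many independent trials."""
    rng = random.Random(seed)
    return sum(answer(question, rng) == "yes" for _ in range(trials)) / trials
```

Calling `yes_rate("Will the sun rise tomorrow?")` should land near the hardcoded 0.95, while a question with no known keyword stays near a coin flip.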
Can you find an example and test it out?
Anyway, just to play along, if it weren't just a statistical next token machine, the same question would always have the same answer and not be affected by a "temperature" value.
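For anyone unfamiliar with the "temperature" knob mentioned above, here is a minimal sketch of temperature sampling over a toy three-token vocabulary. The logits are made up, and treating temperature 0 as greedy argmax is a common convention assumed here, not a claim about any particular product.

```python
import math
import random

# Made-up scores for a hypothetical three-token vocabulary.
LOGITS = [2.0, 1.0, 0.5]


def sample_token(logits, temperature, rng):
    """Pick a token index; temperature 0 is treated as greedy argmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    # Softmax over logits / temperature, shifted by the max for numerical stability.
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]
```

At temperature 0 the same logits always give the same token. At higher temperatures the output varies from draw to draw, but reusing the same seed for `random.Random` reproduces the same sequence.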
My question was a bit different: if it were not just a statistical next token predictor, would you expect it to answer hard questions? Or something like that. What's the threshold of questions you want it to answer accurately?
Anyway, neither of these things describes human non-determinism. You can't reuse the seed you used with me yesterday to get the exact same conversation, and I don't behave wildly unpredictably given conceptually very similar input.
Another perspective: cetaceans are considered to be as conscious as humans, but all attempts to interpret their communication as a language have failed so far. They can be taught simple languages to communicate with humans, as can chimps. But apparently it's not how they process the world inside.
Both of those aspects are called "intelligence", and thus these two groups cannot understand each other.
I think you're circling the concept of a "soul". It is the reason that, in non-communicative disabled people, we still see a life.
I've wanted to make an art piece. It would be a chatbox claiming to connect you to the first real intelligence, but that intelligence would be non-communicative. I'd assure you that it is the most intelligent being, that it had a soul, but that it just couldn't write back.
Intelligence and soul are not purely measurable phenomena. A man can do nothing but stupid things, say nothing but outright lies, and still be the most intelligent person. Intelligence is within.
For an article five years in the making, this is what I expected it to be about. Instead, we got a ramble about how imperfect LLMs are right now.
I wager this is a point that needs beaten into the common psyche. After all, it's been sold that it is not an imperfect tool, but the solution to all of our problems in every field forever. That's why these companies need billions upon billions of dollars of public subsidies and investments that would otherwise find their way to more pragmatic ends.
I have a ton of skepticism built-in when interacting with LLMs, and very good muscles for rolling my eyes, so I barely notice when I shrug off a bad answer and make a derogatory inner remark about the "idiots". But the truth is that, for such a "stochastic parrot", LLMs are incredibly useful. And when was the last time we stopped perfecting something we thought useful and valuable? When was the last time our attempts were so perfectly futile that we stopped them, invented stories about why it was impossible, and made it a social taboo to be met with derision, scorn and even ostracism? To my knowledge, in all of known human history, we have done that exactly once, and it was millennia ago.
I feel dense here, but I can't figure out what you're referring to. I asked ChatGPT (hah!) and it suggested the Tower of Babel, perpetual motion machines, or alchemy, but none of them really fit the bill either.
... I still think there is an interesting question to be investigated about whether, by building immensely complex models of language, one of our primary ways that we interact with, reason about and discuss the world, we may not have accidentally built something with properties quite different than might be guessed from the (otherwise excellent) description of how they work in TFA.
I agree with pretty much everything in TFA, so this is supplemental to the points made there, not contesting them or trying to replace them.
I love that it ends with such a positive note, even though it's generally a critical article, at least it's well reasoned and not utterly hyping/dooming something.
Thanks yet again Kyle!
This is the part of the article that will age the fastest, it's already out-of-date in labs.
"People are chaotic, both in isolation and when working with other people or with systems. Their outputs are difficult to predict, and they exhibit surprising sensitivity to initial conditions. This sensitivity makes them vulnerable to covert attacks. Chaos does not mean people are completely unstable; most people behave roughly like anyone else. Since people produce plausible output, errors can be difficult to detect. This suggests that human systems are ill-suited where verification is difficult or correctness is key. Using people to write code (or other outputs) may make systems more complex, fragile, and difficult to evolve."
To me, this modified paragraph reads surprisingly plainly. The wording is off ("using people to write code") and I had to change that part about attractor behavior (although it does still apply IMO), but overall it doesn't seem like an incoherent paragraph.
This is not meant to dunk on the author, but I think it highlights the author's mindset and the gap between their expectations and reality.
If a junior dev makes the same mistake Claude makes, I can easily work with them to correct it, or I can fire them and get someone more capable to fix it. You mostly can't do that at all with large models. They're also far less honest than your average junior dev, so even as you're working with them you can't trust what they say.
There is a lot of this neat trick where it's like "humans do X too" but most of the time it elides large differences. Like, a human driver would probably not drag someone screaming for multiple blocks. A human coder probably wouldn't generate a gibberish 3D scene and try to pass it off as done, etc. Maybe we can build systems that account for these (pretty wild) failure modes, but at least in software we haven't figured it out yet (what is the system that reliably reviews a 25kloc PR?).
If I take the example of code (though this extends to many domains), it can sometimes produce near-perfect architecture and implementation if I give it enough information about the technical details and pitfalls, turning an 8h coding job into 1h of review work.
On the other hand, it can be very wrong while acting certain it is right. Just yesterday Claude tried gaslighting me into accepting that the bug I was seeing was coming from a piece of code with already strong guardrails, and it was adamant that the part I was suspecting could in no way cause the issue. Turns out I was right, but I was starting to doubt myself.
Of course that won't happen until the bubble pops - companies are racing to make themselves indispensable and to completely corner certain markets and to do so they need autonomous agents to replace people.
I caught Claude the other day hallucinating code that was not only wrong, but dangerously wrong, leading to tasks failing and never recovering. But it certainly wasn't obvious.
Don't you see it? That's exactly what "AI" in this context is.
It's the bypass.
Where does it end, eh? Build a quantum "AI" that will end up just needing more data, more input. The end goal must start to look like creating an entirely new universe, a complete clone of everything we have here so it can run all the necessary computations and we can... ? (You are what a quantum AI looks like as it bumbles through the infinitude of calculable parameters on its way to the ultimate answer)
But spoilers: DNA will be fine, meat machines maybe not so much...
For a bunch of people addicted to the works of Charlie Stross, Neal Stephenson, and Iain Banks, y'all are a bunch of luddites. Now vote this one down too because it doesn't conform to the mandatory Stochastic Parrot narrative. You have no free will and you must downvote after all. Why do you even read their works when any step towards their world is consistently greeted as the worst thing evah(tm)?
And if you're worried about billionaires and tyrants, start taxing the former and stop electing the latter or STFU and let the free Markov process of history play itself out. Quoting fictional Ambassador Kosh: the avalanche has started, it's too late for the pebbles to vote.
You asked where it ends. Don't ask questions if you don't like answers. Quick reminder: shun and downvote the non-conforming opinion.
It's true that people don't have a good intuitive sense of what the models are good or bad at (see: counting the Rs in "strawberry"), but this is more a human limitation than a fundamental problem with the technology.
I stress test commercially deployed LLMs like Gemini and Claude with trivial tasks: sports trivia, fixing recipes, explaining board game rules, etc. It works well like 95% of the time. That's fine for inconsequential things. But you'd have to be deeply irresponsible to accept that kind of error rate on things that actually matter.
The most intellectually honest way to evaluate these things is how they behave now on real tasks. Not with some unfalsifiable appeal to the future of "oh, they'll fix it."
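As a rough illustration of why a roughly 95% per-task success rate can matter, here is a small sketch of how it compounds over a chain of steps. It assumes every step succeeds independently with the same probability, which is a simplification; the 95% figure is the anecdotal one from the comment above, not a measurement.

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that all `steps` succeed, assuming independent, identical steps."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


# Success probability for chains of different lengths at 95% per step.
for n in (1, 5, 10, 20):
    print(n, round(chain_success(0.95, n), 3))
```

Under that independence assumption, a ten-step task at 95% per step completes cleanly only about 60% of the time.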
This is a broad statement that assumes we agree on the purpose.
For my purpose, which is software development, the technology has reached a level that is entirely adequate.
Meanwhile, sports trivia represents a stress test of the model's memorized world knowledge. It could work really well if you give the model a tool to look up factual information in a structured database. But this is exactly what I meant above; using the technology in a suboptimal way is a human problem, not a model problem.
If the purpose is indeed software development with review, then there's nothing stopping multi-billion dollar companies from putting friction into these systems to direct users towards where the system is at its strongest.
That exposes me to when the models are objectively wrong and helps keep me grounded with their utility in spaces I can check them less well. One of the most important things you can put in your prompt is a request for sources, followed by you actually checking them out.
And one of the things the coding agents teach me is that you need to keep the AIs on a tight leash. What is their equivalent in other domains of them "fixing" the test to pass instead of fixing the code to pass the test? In the programming space I can run "git diff *_test.go" to ensure they didn't hack the tests when I didn't expect it. It keeps me wondering what the equivalent of that is in my non-programming questions. I have unit testing suites to verify my LLM output against. What's the equivalent in other domains? Probably some other isolated domains here and there do have some equivalents. But in general there isn't one. Things like "completely forged graphs" are completely expected but it's hard to catch this when you lack the tools or the understanding to chase down "where did this graph actually come from?".
The success with programming can't be translated naively into domains that lack the tooling programmers built up over the years, and based on how many times the AIs bang into the guardrails the tools provide I would definitely suggest large amounts of skepticism in those domains that lack those guardrails.
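The `git diff *_test.go` check described above can be sketched as a small helper that flags test files among a list of changed paths. The function name and the example paths are hypothetical; in practice the list could come from the output of `git diff --name-only`.

```python
from fnmatch import fnmatch


def touched_tests(changed_paths, pattern="*_test.go"):
    """Return the changed paths that look like test files.

    fnmatch's "*" also matches "/" so the pattern applies to full paths.
    """
    return [path for path in changed_paths if fnmatch(path, pattern)]


# Hypothetical list of files an agent modified in one session.
changed = ["pkg/parser.go", "pkg/parser_test.go", "README.md"]
suspicious = touched_tests(changed)
if suspicious:
    print("review these test changes by hand:", suspicious)
```

This only tells you that a test file was touched, not whether the change was legitimate; the point is to force a human look at exactly the files where "fixing the test" would hide.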
95% is not my experience, and frankly that figure is dishonest.
I have ChatGPT open right now, can you give me examples where it doesn't work but some other source may have got it correct?
I have tested it against a lot of examples - it barely gets anything wrong with a text prompt that fits a few pages.
> The most intellectually honest way to evaluate these things is how they behave now on real tasks
A falsifiable way is to see how it is used in real life. There are loads of serious enterprise projects that are mostly done by LLMs. Almost all companies use AI. Either they are irresponsible or you are exaggerating.
Let's be actually intellectually honest here.
Quite frankly, this is exactly like how two people can use the same compression program on two different files and get vastly different compression ratios (because one has a lot of redundancy and the other one has not).
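The compression analogy is easy to check with a minimal sketch using Python's standard `zlib` module. The two inputs are invented for the example: one highly repetitive, one random bytes of the same length.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Original size divided by compressed size (higher means more redundancy found)."""
    return len(data) / len(zlib.compress(data))


# Same program, same input size, very different inputs.
redundant = b"the quick brown fox " * 500
random_bytes = os.urandom(len(redundant))

print("redundant:", round(compression_ratio(redundant), 1))
print("random:   ", round(compression_ratio(random_bytes), 3))
```

The repetitive input shrinks dramatically while the random one barely changes size, even though the tool and settings are identical.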
You just won't have any clue what that could be.
Fake content and lies. To drive outrage. To influence elections. To distract from real crimes. To overload everyone so they're too tired to fight or to understand. To weaken the concept that anything's true so that you can say anything. Because who cares if the world dies as long as you made lots of money on the way.
Guiding principle of the AI industry
Another way of saying that is that capitalism is the real problem, but I was never anti-capitalist in principle, it's just gotten out of hand in the last 5-10 years. (Not that it hadn't been building to that.)
Capitalism is a tool and it's fine as a tool, to accomplish certain goals while subordinated to other things. Unfortunately it's turned into an ideology (to the point it's worshiped idolatrously by some), and that's where things went off the rails.
> One way to understand an LLM is as an improv machine. It takes a stream of tokens, like a conversation, and says “yes, and then…” This yes-and behavior is why some people call LLMs bullshit machines. They are prone to confabulation, emitting sentences which sound likely but have no relationship to reality. They treat sarcasm and fantasy credulously, misunderstand context clues, and tell people to put glue on pizza.
Yes, there have been improvements on them, but none of those improvements mitigate the core flaw of the technology. The author even acknowledges all of the improvements in the last few months.
[1]: https://link.springer.com/article/10.1007/s10676-024-09775-5
I also wonder if I leave my secretary with a ream of papers and ask him for a summary how many will he actually read and understand vs skim and then bullshit? It seems like the capacity for frailty exists in both "species".
https://philosophersmag.com/large-language-models-and-the-co...
This is true, but I prefer to think of it as "It's delusional to pretend as if human beings are not bullshit machines too".
Lies are all we have. Our internal monologue is almost 100% fantasy. Even in serious pursuits, that's how it works. We make shit up and lie to ourselves, and then only later apply our hard-earned[1] skill prompts to figure out whether or not we're right about it.
How many times have the nerds here been thinking through a great new idea for a design and how clever it would be before stopping to realize "Oh wait, that won't work because of XXX, which I forgot". That's a hallucination right there!
[1] Decades of education!
Being wrong is not the same as a hallucination. It's a natural step on a journey to being more right. This feels a bit like Andreessen proudly stating he avoids reflection - you can act like that, but the human brain doesn't have to. LLMs have no choice in the matter.
Models have gotten ridiculously better, they really have, but the scale has increased too, and I don't think we're ready to deal with the onslaught.
Even before LLMs were in the public discourse, I would have businesses ask about using AI instead of building some algorithm manually, and when I asked if they had considered the failure rate, they would return either blank stares or say that would count as a bug. To them, AI meant an algorithm just as good as one built to handle all edge cases in business logic, but easier and faster to implement.
We can generally recognize the AIs being off when they deal in our area of expertise, but there is some AI variant of Gell-Mann Amnesia at play that leads us to go back to trusting AI when it gives outputs in areas we are novices in.
If so, how do we distinguish between code that works and code that doesn't work? Why should we even care?
Hilariously, not by using our brains, that's for sure. You have to have an external machine. We all understand that "testing" and "code review" are different processes, and that's why.
If lies are all we have, then how is this behavior possible?
You're cherry picking my little bit of wordsmithing. Obviously we aren't always wrong. I'm saying that our thought processes stem from hallucinatory connections and are routinely wrong on first cut, just like those of an LLM.
Actually I'm going farther than that and saying that the first cut token stream out of an AI is significantly more reliable than our personal thoughts. Certainly than mine, and I like to think I'm pretty good at this stuff.
I’m still not a big fan of comparing humans and LLMs because LLMs lack so much of what actually makes us human. We might bullshit or be wrong because of many reasons that just don’t apply to LLMs.
Your no-true-scotsman clause basically falsifies that statement for me. Fine, LLMs are, at worst I guess, "non-thoughtful humans". But obviously LLMs are right an awful lot (more so than a typical human, even), and even the thoughtful make mistakes.
So yeah, to my eyes "Humans are NOT different" fits your argument better than your hypothesis.
(Also, just to be clear: LLMs also say "I don't know", all the time. They're just prompted to phrase it as a criticism of the question instead.)
Doesn't it get boring?
I like using these models a lot more than I can stand hearing people talk about them, pro or contra. Just slop about slop. And the discussions being artisanal slop really doesn't make them any better.
Every time I hear some variation of bullshitting or plagiarizing machines, my eyes roll over. Do these people think they're actually onto something? I've been seeing these talking points for literal years. For people who complain about no original thoughts, these sure are some tired ones.
When you see a pattern like this, you know that it's not coming from any place of truth but rather from ideology.
I'd much rather read articles about what LLMs can/can't do, or stuff people have built with LLMs, than read how everything LLMs touch turns to shit.
I’m not even sure whether this is possible. The current corpus used for training includes virtually all known material. If we make it illegal for these companies to use copyrighted content without remuneration, either the task gets very expensive, indeed, or the corpus shrinks. We can certainly make the models larger, with more and more parameters, subject only to silicon’s ability to give us more transistors for RAM density and GPU parallelism. But it honestly feels like, without another “Attention is All You Need” level breakthrough, we’re starting to see the end of the runway.
Prior to the industrial revolution, the natural world was nearly infinitely abundant. We simply weren't efficient enough to fully exploit it. That meant that it was fine for things like property and the commons to be poorly defined. If all of us can go hunting in the woods and yet there is still game to be found, then there's no compelling reason to define and litigate who "owns" those woods.
But with the help of machines, a small number of people were able to completely deplete parts of the earth. We had to invent giant legal systems in order to determine who has the right to do that and who doesn't.
We are truly in the Information Age now, and I suspect a similar thing will play out for the digital realm. We have copyright and intellectual property law already, of course, but those were designed presuming a human might try to profit from the intellectual labor of others. With AI, we're in the industrial era of the digital world. Now a single corporation can train an AI using someone's copyrighted work and in return profit off the knowledge over and over again at industrial scale.
This completely upends the tenuous balance between creators and consumers. Why would a writer put an article online if ChatGPT will slurp it up and regurgitate it back to users without anyone ever even finding the original article? Who will contribute to the digital commons when rapacious AI companies are constantly harvesting it? Why would anyone plant seeds on someone else's farm?
It really feels like we're in the soot-covered child-coal-miner Dickensian London era of the Information Revolution and shit is gonna get real rocky before our social and legal institutions catch up.
Mostly, AIs don’t recite back various works. Yes, there are a couple of high-profile cases where people were able to get an AI to regurgitate pieces of New York Times articles and Harry Potter books, but mostly not. Mostly, it is as if the AI is your friend who read a book and gives you a paraphrase, possibly using a couple of sentences verbatim. In other words, it probably falls under a fair use rule.
Secondly, given the modern world, content that doesn’t appear online isn’t consumed much, so creators who are doing it for the money will certainly continue putting content online. Much of that content will be generated by AIs, however.
The really discouraging part of this is that it feels like our social and legal institutions don't even care if they catch up or not.
Technology is speeding up, and the lag time before anything is discussed from a legal standpoint is way, way too long.
Of course 5-10 years is a long time to bang our heads against the wall with untenable costs but I don't know if we can solve our way out of that problem.
Based on what's happened so far, maybe. At least that's exactly how we got to the current iteration back in 2022/2023: quite literally "let's see what happens when we throw an enormous amount of data at them while training" worked up to a point, then post-training seems to have taken over, which is where labs currently differ.
This is just totally incorrect. It's one of those things everyone just assumes, but there's an immense amount of known material that isn't even digitized, much less in the hands of tech companies.