Models aren't trained across their context; the context is their short-term memory at runtime, right? It has nothing to do with training. They are trained on a static dataset.
If your model only ever sees 8K-token samples during training, it won't be as good at a 128K context length as it would be if you had trained on samples ranging from 8K to 128K.
When you train, you teach the model to, among other things, 'self-attend' to the input vector, ultimately projecting that vector into a large embedding space.
Thought experiment: if 99% of the time the last 100,000 entries of your vector were zero, how likely is it that you'd end up with high-quality embeddings after doing gradient descent on those outputs?
That’s what the paper is referring to.
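To make that concrete, here is a toy numpy sketch (my own illustration, not something from the paper): for a linear projection y = Wx with a squared-error loss, the gradient with respect to W is an outer product with x, so the weight columns paired with entries of x that are always zero receive exactly zero gradient.

    import numpy as np

    # Toy setup: y = W @ x, loss = 0.5 * ||y - t||^2.
    # dL/dW = outer(y - t, x), so columns of W that only ever
    # multiply zero entries of x get zero gradient and never train.
    rng = np.random.default_rng(0)
    d_in, d_out = 8, 4
    W = rng.normal(size=(d_out, d_in))

    x = rng.normal(size=d_in)
    x[d_in // 2:] = 0.0            # "the last entries are almost always zero"
    t = rng.normal(size=d_out)

    y = W @ x
    grad_W = np.outer(y - t, x)    # shape (d_out, d_in)

    print(grad_W[:, d_in // 2:])   # all zeros: no signal for those weights

The analogy to context length: positions that are rarely populated during training get correspondingly little gradient signal.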
I noticed that the longer a chat gets, the more unpredictable the model's behavior becomes (and I think that's still a common jailbreak technique too).
(I think it might also have something to do with RoPE, but that's beyond me.)
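For anyone curious, here is a rough sketch of the RoPE idea (using the split-halves convention some implementations use; the original formulation rotates interleaved pairs): each pair of query/key dimensions is rotated by an angle proportional to the token position, so positions far beyond anything seen in training produce rotation angles the model never learned to deal with.

    import numpy as np

    def rope(x, positions, base=10000.0):
        # x: (seq_len, d) with d even; positions: (seq_len,) numpy array
        d = x.shape[-1]
        half = d // 2
        freqs = base ** (-np.arange(half) / half)      # per-pair frequencies
        angles = positions[:, None] * freqs[None, :]   # (seq_len, half)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:, :half], x[:, half:]
        return np.concatenate([x1 * cos - x2 * sin,
                               x1 * sin + x2 * cos], axis=-1)

As far as I understand, context-extension tricks like position interpolation or NTK-style scaling mostly amount to rescaling those angles so that long contexts land back in the range the model was actually trained on.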
Or let's say it differently: the LLM is trained on static data, but in the process it also learns the capability of handling its context.
Kimi introduced this (https://github.com/MoonshotAI/Attention-Residuals), but I'm pretty sure closed labs like Google have had something like this for a while.
Shockingly, we seem to have found a self-attention mechanism of that quality; it just has the sad property of growing at O(N^2), where N is the context length.
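For concreteness, this is the naive version in numpy; the N x N score matrix is where the quadratic cost lives (a sketch that ignores masking, batching, and multiple heads):

    import numpy as np

    def naive_attention(Q, K, V):
        # Q, K, V: (N, d). The scores matrix is N x N, so both memory
        # and compute grow quadratically with the context length N.
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                    # (N, N)
        scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
        return weights @ V                               # (N, d)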
Nvidia uses ML for fine-tuning and architecting their chips. This might be one use case.
Another one would be to put EVERYTHING from your company into the context window. It would be easier to create 'THE' model for every company or person. It might also be safer than training a model on your data, because you don't end up with a model that contains all your data, only memory.
But maybe that’s enough tokens to feed an entire lifetime of user behaviour in for the digital twin dystopia?
Current approaches require fancy tricks to fit tokens into memory, and they spread attention thinner over larger numbers of tokens. The new approach tries to keep everything in a single shared memory and process the tokens in parallel using multiple GPUs.
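I'm only guessing at what "process the tokens in parallel" means here, but attention does split naturally over queries once every worker can see the full K and V, which is presumably where the single shared memory comes in. A rough sketch (hypothetical names, not anyone's actual system):

    import numpy as np

    def attention_rows(Q_shard, K, V):
        # Each output row needs only its own query plus the full K and V,
        # so query rows can be computed independently on different devices.
        d = Q_shard.shape[-1]
        s = Q_shard @ K.T / np.sqrt(d)
        s -= s.max(axis=-1, keepdims=True)
        w = np.exp(s)
        w /= w.sum(axis=-1, keepdims=True)
        return w @ V

    # Conceptual multi-GPU split: shard the queries, share K and V.
    N, d, devices = 8, 4, 2
    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
    shards = [attention_rows(q, K, V) for q in np.array_split(Q, devices)]
    out = np.concatenate(shards)   # same result as doing it in one piece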
I know Yann LeCun is trying to do a completely different architecture and I think that's expected to take 2-3 years before showing commercial results, right? Is that why they're finding it quicker to change the hardware?
Yann LeCun has been very wrong in the past about LLMs.[0] The approach he wants to take is to train on sensor data from the physical world. I think it's going to fail because there's a near-infinite amount of physical data, down to how particles behave under Schrödinger's equation. The signal-to-noise ratio is too low. My guess is that they'll need orders of magnitude more compute to even get something useful, but they do not have more compute than OpenAI and Anthropic. In other words, I think LLMs will generate revenue as a stepping stone for OpenAI and Anthropic, such that they will be the ones who ultimately train the AI that LeCun dreams of.
[0] https://old.reddit.com/r/LovingAI/comments/1qvgc98/yann_lecu...
People, researchers, investors, etc. probably also want to see what would be possible, and someone has to do it.
I can also imagine that an inference-optimized system like this could split the context between different requests if a single request doesn't need to use the full context.
Could also be that they have internal use cases which require this amount of context.