Models aren't trained across their context; the context is their short-term memory at runtime, right? It has nothing to do with training. They are trained on a static dataset.
If your model only ever sees 8K-token samples during training, it won't be as good at a 128K context length as it would be if you had trained on samples ranging from 8K to 128K.
When you train, you teach the model to, among other things, 'self-attend' to the input vector, ultimately projecting that vector into a large embedding space.
Thought experiment: if 99% of the time the last 100,000 entries of your vector were zero, how likely is it that you'd end up with high-quality embeddings after doing gradient descent on those outputs?
That’s what the paper is referring to.
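To make that concrete, here is a toy numpy sketch (my own illustration, not something from the paper): for a linear projection y = Wx with a squared-error loss, the gradient with respect to W is an outer product with x, so the weight columns paired with entries of x that are always zero receive exactly zero gradient.

    import numpy as np

    # Toy setup: y = W @ x, loss = 0.5 * ||y - t||^2.
    # dL/dW = outer(y - t, x), so columns of W that only ever
    # multiply zero entries of x get zero gradient and never train.
    rng = np.random.default_rng(0)
    d_in, d_out = 8, 4
    W = rng.normal(size=(d_out, d_in))

    x = rng.normal(size=d_in)
    x[d_in // 2:] = 0.0            # "the last entries are almost always zero"
    t = rng.normal(size=d_out)

    y = W @ x
    grad_W = np.outer(y - t, x)    # shape (d_out, d_in)

    print(grad_W[:, d_in // 2:])   # all zeros: no signal for those weights

The analogy to context length: positions that are rarely populated during training get correspondingly little gradient signal.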
I noticed that the longer a chat gets, the more unpredictable the model's behavior becomes (and I think that's still a common jailbreak technique too).
(I think it might also have something to do with RoPE, but that's beyond me.)
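For anyone curious, here is a rough sketch of the RoPE idea (using the split-halves convention some implementations use; the original formulation rotates interleaved pairs): each pair of query/key dimensions is rotated by an angle proportional to the token position, so positions far beyond anything seen in training produce rotation angles the model never learned to deal with.

    import numpy as np

    def rope(x, positions, base=10000.0):
        # x: (seq_len, d) with d even; positions: (seq_len,) numpy array
        d = x.shape[-1]
        half = d // 2
        freqs = base ** (-np.arange(half) / half)      # per-pair frequencies
        angles = positions[:, None] * freqs[None, :]   # (seq_len, half)
        cos, sin = np.cos(angles), np.sin(angles)
        x1, x2 = x[:, :half], x[:, half:]
        return np.concatenate([x1 * cos - x2 * sin,
                               x1 * sin + x2 * cos], axis=-1)

As far as I understand, context-extension tricks like position interpolation or NTK-style scaling mostly amount to rescaling those angles so that long contexts land back in the range the model was actually trained on.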
Or let's say it differently: the LLM is trained on static data, but in the process it also learns the capability of handling its context.
Kimi introduced this (https://github.com/MoonshotAI/Attention-Residuals), but I'm pretty sure closed labs like Google have had something like this for a while.
Shockingly, we seem to have found a self-attention mechanism of that quality; it just has the sad property of growing at O(N^2), where N is the context length.
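For concreteness, this is the naive version in numpy; the N x N score matrix is where the quadratic cost lives (a sketch that ignores masking, batching, and multiple heads):

    import numpy as np

    def naive_attention(Q, K, V):
        # Q, K, V: (N, d). The scores matrix is N x N, so both memory
        # and compute grow quadratically with the context length N.
        d = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d)                    # (N, N)
        scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
        weights = np.exp(scores)
        weights /= weights.sum(axis=-1, keepdims=True)   # softmax over keys
        return weights @ V                               # (N, d)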
Nvidia uses ML for fine-tuning and architecting their chips. This might be one use case.
Another one would be to put EVERYTHING from your company into the context window. It would be easier to create 'THE' model for every company or person. It might also be safer than training a model on your data, because you don't end up with a model that contains all your data, only memory.
But maybe that’s enough tokens to feed an entire lifetime of user behaviour in for the digital twin dystopia?
Current approaches require fancy tricks to fit tokens into memory, and they spread attention thinner over larger numbers of tokens. The new approach tries to keep everything in a single shared memory and process the tokens in parallel using multiple GPUs.
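I'm only guessing at what "process the tokens in parallel" means here, but attention does split naturally over queries once every worker can see the full K and V, which is presumably where the single shared memory comes in. A rough sketch (hypothetical names, not anyone's actual system):

    import numpy as np

    def attention_rows(Q_shard, K, V):
        # Each output row needs only its own query plus the full K and V,
        # so query rows can be computed independently on different devices.
        d = Q_shard.shape[-1]
        s = Q_shard @ K.T / np.sqrt(d)
        s -= s.max(axis=-1, keepdims=True)
        w = np.exp(s)
        w /= w.sum(axis=-1, keepdims=True)
        return w @ V

    # Conceptual multi-GPU split: shard the queries, share K and V.
    N, d, devices = 8, 4, 2
    rng = np.random.default_rng(0)
    Q, K, V = (rng.normal(size=(N, d)) for _ in range(3))
    shards = [attention_rows(q, K, V) for q in np.array_split(Q, devices)]
    out = np.concatenate(shards)   # same result as doing it in one piece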
I know Yann LeCun is trying to do a completely different architecture and I think that's expected to take 2-3 years before showing commercial results, right? Is that why they're finding it quicker to change the hardware?
Yann LeCun has been very wrong in the past about LLMs.[0] The approach he wants to take is to train on sensor data from the physical world. I think it's going to fail because there's a near-infinite amount of physical data, down to how particles behave under Schrödinger's equation. The signal-to-noise ratio is too low. My guess is that they'll need orders of magnitude more compute to even get something useful, but they do not have more compute than OpenAI and Anthropic. In other words, I think LLMs will generate revenue as a stepping stone for OpenAI and Anthropic, such that they will be the ones who ultimately train the AI that LeCun dreams of.
[0] https://old.reddit.com/r/LovingAI/comments/1qvgc98/yann_lecu...
People, researchers, investors, etc. probably also want to see what would be possible, and someone has to do it.
I can also imagine that an inference-optimized system like this could split the context between different requests if a single request doesn't need to use the full context.
Could also be that they have internal use cases which require this amount of context.