I don't think they're going to downsize, though; I think the big players are just going to use the freed-up memory for more workflows or larger models, because the big players want to scale up. It's an arms race for the best models.
My experience doesn't disagree, at least. I've been using Qwen for coding locally a bit. It is much better than I thought it would be. But also still falls short in some obvious ways compared to the frontiers.
The demand for memory isn't going to go down, we'll just be able to do more with the same amount of memory.
Given that increasing model size doesn't yield proportional increases in intelligence, there is a world where these datacenters don't have a positive ROI if we ever make these models even a fraction as efficient as the human brain.
I think that either investors were extremely skittish that the stocks might crash and jumped at the first sign of trouble (creating a self-fulfilling prophecy) or they were trading on non-public information and analysts who don't have access to said information are reading too much into the temporal coincidence of the Google Research blog highlighting this paper.
Constantly sitting around trying to solve problems that nobody has made headway on for hundreds of years. Or inventing theorems around 15th century mysticism that won't be applicable for hundreds of years.
Now if you'll excuse me I need to multiply some numbers by 3 and divide them by 2 ... I'm so close guys.
This part sounds especially cool. I did not think about this application when reading the other articles about TurboQuant. It would be cool to have access to this performance optimization for local RAG.
Compute, bytes of RAM used, bytes in the model, bytes accessed per iteration, bytes of data used for training.

You can trade the balance between these if you can find another way to do things; extreme quantisation is just one direction to try. KANs were aiming for more compute and fewer parameters. The recent optimisation projects have been pushing at these various properties. Sometimes gains in one come at the cost of another, but that needn't always be the case.
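As a back-of-envelope illustration of just one of those axes, here is a minimal sketch (with purely hypothetical model sizes, not tied to any particular system) of how "bytes in model" moves as you trade parameter count against quantisation bit-width:

    # Hypothetical sizes only: bytes in model ~= params * bits / 8
    # (weights alone, ignoring KV cache and activations).
    def model_bytes(n_params: float, bits_per_weight: float) -> float:
        return n_params * bits_per_weight / 8

    for n_params, bits in [(70e9, 16), (70e9, 4), (70e9, 1.58), (7e9, 16)]:
        gb = model_bytes(n_params, bits) / 1e9
        print(f"{n_params/1e9:>4.0f}B params @ {bits:>5} bits -> ~{gb:,.0f} GB of weights")

    # A 70B model drops from ~140 GB at fp16 to ~35 GB at 4-bit: same parameter
    # count, very different "bytes of RAM used" and "bytes accessed per iteration".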
Can we please start talking about this in that context? We already know what TurboQuant will do to DRAM demand. We already know what it will do to context windows. There is no need to speculate. There is no need to panic sell stocks.
Unfortunately, nobody at the big companies knows exactly which math will win, so the competition won't end.

So researchers will try one solution, then another, and so on, until they either find something good enough, or until semiconductor production (Moore's Law) supplies enough hardware to run current models fast enough.

I believe somebody already has the silver bullet, the ideal AI algorithm that will lead us all to AGI once it is scaled up at some big company, but that knowledge is not obvious at the moment.
I am no expert, so this is a shallow take, but I think today's LLMs have already reached their limit, and general AGI will only be possible if the model is living in the moment, i.e., retraining every minute or so, and paired with a much smaller device that can observe its surroundings, like a robot or such.
Instead of a KV cache, I have an idea of using LoRAs: a central LLM left unchanged by learning, surrounded by a dozen to thousands of LoRAs kept orthogonal to each other, each competing via weights and retrained every minute or so. The LLM (since it's effectively recurrent anyway) provides a "summarize what your state and goal is at this moment" signal, and the LoRAs are trained on that summary along with all the observations and, say, inputs from the users. The output of the LoRAs feeds back into the LLM, which decides the weights for further LoRA training.
Anyways, I am just thinking there needs to be a structure change of some kind.
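For what it's worth, here is a very rough toy sketch of that frozen-core-plus-LoRA-swarm idea. Everything in it is a hypothetical stand-in (the CoreModel/LoraAdapter names, the toy sizes, the uniform mixing weights); it only shows the shape of the loop, not a working agent:

    import numpy as np

    DIM, RANK, N_ADAPTERS = 512, 8, 12   # toy sizes, assumptions only

    class LoraAdapter:
        """Low-rank delta (B @ A) applied on top of the frozen core weights."""
        def __init__(self, rng):
            self.A = rng.standard_normal((RANK, DIM)) * 0.01
            self.B = np.zeros((DIM, RANK))   # starts as a no-op delta

        def delta(self):
            return self.B @ self.A

    class CoreModel:
        """Frozen 'central LLM' stand-in: a single fixed linear map."""
        def __init__(self, rng):
            self.W = rng.standard_normal((DIM, DIM)) / np.sqrt(DIM)

        def forward(self, x, adapters, mix):
            # mix: the weights the core assigns to each adapter's contribution
            W_eff = self.W + sum(m * a.delta() for m, a in zip(mix, adapters))
            return np.tanh(W_eff @ x)

    rng = np.random.default_rng(0)
    core = CoreModel(rng)
    adapters = [LoraAdapter(rng) for _ in range(N_ADAPTERS)]

    # One "minute" of the loop: the core summarizes its state, the adapters would be
    # retrained on (summary + observations), and the core re-weights the adapters.
    observation = rng.standard_normal(DIM)
    mix = np.ones(N_ADAPTERS) / N_ADAPTERS   # placeholder for learned mixing weights
    state_summary = core.forward(observation, adapters, mix)
    print("state summary norm:", float(np.linalg.norm(state_summary)))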
The models are still very stupid atm; something needs to change.
The real question, though, is how close we are to the point where the pressure is more for efficiency than for capability. Anecdotally, I think it's a ways off. Right now the general vibe I get is that people feel AI is very impressive for how cheap it is to use, which suggests to me that a lot of users would be very willing to pay more for more capable models. So the tipping point where AI hardware demand might slow down seems a ways off.
The problem with AI is that it's not obvious what the upper limit of capability demand might be. And until or if we get there, there will always be demand for the more capable models that run on centralized computing resources. Even if at some point I'm able to run a model on my local desktop that's equivalent to current Claude Opus, if what Anthropic is offering as a service is significantly better in a way that matters to my use case, I will still want to use the SaaS one.
Only if it's competitively priced. You wouldn't want to use the SaaS if the breakeven on an investment in local instances is a matter of months.
Right now people are shelling out for Claude Code and similar because for $200/m they can consume $10k/m of tokens. If you were actually paying $10k/m, then it makes sense to splurge $20k-$30k on a local instance.
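A quick sanity check on that breakeven arithmetic (all figures are the illustrative ones from the comment above, not real pricing):

    subscription_per_month = 200      # what the user actually pays
    api_value_per_month    = 10_000   # what the same token volume would cost at API rates
    local_rig_cost         = 25_000   # midpoint of the $20k-$30k local-instance estimate

    # Against full API pricing, the rig pays for itself in a few months:
    print(local_rig_cost / api_value_per_month, "months to breakeven vs API rates")   # 2.5

    # Against the subsidized $200/m subscription, it basically never does:
    print(local_rig_cost / subscription_per_month, "months vs the subscription")      # 125.0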
If that's the plan (there is no plan) then it expires at some point, because it's a spiral and such spirals always bottom out.
Of course they will - if that happens all these AI token providers won't have a use for all that hardware they bought. You'll be buying used H100s and H200s off eBay for pennies on the dollar.
Oh, it gets worse than that. The money that set all of this in motion for OpenAI was borrowed from Japanese banks at cheap interest rates (by SoftBank, for the Stargate project), and the Japanese banks can do that because of Japanese people and Japanese companies; on top of that, the collateral is stock whose value is inflated by people investing their hard-earned money into the markets.

So in a way they are using real, hard-earned money to fund all of this; they are using your money to basically attack you behind your back.
I once wrote a really long comment about the shaky finances of Stargate; I'll suggest it here: https://news.ycombinator.com/item?id=47297428
That's ridiculous, "infinite money" isn't a thing. They will spend as much as they can not because they want to keep local solutions out, but because it enables them to provide cheaper services and capture more of the market. We all eventually benefit from that.
My reading of GP is that he was being sarcastic - "infinite amounts of circular fake money" is probably a reference to these circular deals going on.
If A hands B a $100 investment, and B then hands that $100 back to A to purchase hardware, then on paper A's equity in B is $100, plus A has $100 of revenue (from B), which gives A total assets of $200.
Obviously it has to be shuffled more thoroughly, but that's the basic idea that I thought GP was referring to.
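A toy version of that round trip, just to make the double counting concrete (purely illustrative numbers):

    a = {"cash": 100, "equity_in_b": 0, "revenue": 0}
    b = {"cash": 0, "hardware": 0}

    # Step 1: A invests $100 in B.
    a["cash"] -= 100; a["equity_in_b"] += 100; b["cash"] += 100

    # Step 2: B spends that same $100 buying hardware from A.
    b["cash"] -= 100; b["hardware"] += 100; a["cash"] += 100; a["revenue"] += 100

    # A ends up back at its original cash, yet now also shows $100 of equity and
    # $100 of revenue on paper, even though only one $100 ever existed.
    print(a)  # {'cash': 100, 'equity_in_b': 100, 'revenue': 100}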
Though I'm not an expert, maybe my understanding of the memory allocation is wrong.
The power efficiency alone is a strong enough pressure to use centralized model providers.
My 3090 running 24b or 32b models is fun, but I know I'm paying way more per token in electricity, on top of lower quality tokens.
It's fun to run them locally, but for anything actually useful it's cheaper to just pay API prices currently.
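A rough back-of-envelope version of that electricity math; every number here (power draw, throughput, electricity rate) is an assumption for illustration, not a measurement:

    gpu_watts         = 350    # assumed 3090 draw under load
    tokens_per_second = 30     # assumed throughput for a ~30B-class quantized model
    dollars_per_kwh   = 0.30   # assumed electricity rate

    tokens_per_kwh = tokens_per_second * 3600 / (gpu_watts / 1000)
    local_cost_per_million = 1_000_000 / tokens_per_kwh * dollars_per_kwh
    print(f"local electricity cost: ~${local_cost_per_million:.2f} per 1M tokens")  # ~$0.97

    # Hosted pricing for comparably sized open models is often well under a dollar
    # per 1M output tokens, so the electricity alone can already cost more per token,
    # before accounting for the quality gap.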
Then we can make them even bigger.
But what if small models become "good enough"? What if, for most intents and purposes, they are good enough?
There are some people here and on r/localllama whom I have seen run small models, sometimes even several of them at once to solve and iterate quickly, and then have a larger model plug in and fix anything remaining.
This would still mean that larger/SOTA models have some demand, but I don't think the demand would be nearly as large as people expect. We all still kind of feel like different models are good for different tasks, and a good recommendation is to benchmark different models against your own use cases, since sometimes there are small models that are good within your particular domain and worth having in your toolset.
It's simple: then we'll make our intents and purposes bigger.
They say prostitution is the oldest industry of all. We know how to achieve human-level intelligence quite well. The outstanding challenge is figuring out how to produce an energy efficient human-level intelligence.
Not many people today would settle for models comparable to what was SOTA two years ago.

To run models locally and get results as good as the models running in data centers, we need both efficiency gains and for AI improvement to hit a wall.

Neither of those conditions seems likely to come true in the near future.
It doesn't; it induces demand. Why? Because there are always more than enough people with cars who will fill those lanes.
PS: This doesn't mean that better public transportation could deliver more bang for the buck than the n-th additional car lane. But never ever have I heard from anybody that they chose to buy a car or use an existing car more often because an additional lane has been built.
https://en.wikipedia.org/wiki/Induced_demand#cite_note-vande...
[1] http://www.incompleteideas.net/IncIdeas/BitterLesson.html
Isn't that a classic tit-for-tat decision, heading for a loss?
Excellence and prestige are valuable too. You get that expensive ML at a small discount, plus public/professional perception, etc. Judging from Google's public communication, which isn't completely sociopathic, they know this war isn't won in one night, and they are the only sustainably funded company in the competition. Surely their business is at risk, but they can either go rampant or focus. They decided to focus.
We flagged these issues to the authors before submission. They acknowledged them, but chose not to fix them. The paper was later accepted and widely promoted by Google, reaching tens of millions of views.
We're speaking up now because once a misleading narrative spreads, it becomes much harder to correct. We've written a public comment on OpenReview (https://openreview.net/forum?id=tO3ASKZlok).
We would greatly appreciate your attention and help in sharing it."
https://x.com/gaoj0017/status/2037532673812443214
The main breakthrough [rotating by an orthogonal matrix so that important outliers get averaged across more dimensions] comes from RaBitQ. Sounds like the RaBitQ team was much more involved, and earlier, and the TurboQuant paper very deliberately tries to avoid crediting and acknowledging RaBitQ.
My understanding is that the efficacy of these methods isn't in dispute; what TurboQuant did was take the method already being used in vector databases, adapt it for transformers, and pass it off more as a new invention than an adaptation.
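For anyone curious what that rotation trick looks like in practice, here is a minimal generic sketch (not the RaBitQ or TurboQuant algorithm itself, just the underlying idea): multiply by a random orthogonal matrix before crude low-bit quantization so a single outlier coordinate gets spread across many dimensions:

    import numpy as np

    rng = np.random.default_rng(0)
    d = 64

    # Random orthogonal matrix via QR decomposition.
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

    def quantize_1bit(v):
        """Crude sign quantizer with a single per-vector scale."""
        scale = np.mean(np.abs(v))
        return np.sign(v) * scale

    # A vector with one dominant outlier coordinate.
    x = rng.standard_normal(d)
    x[0] = 25.0

    # Quantize directly vs. quantize in the rotated basis, then rotate back.
    err_plain   = np.linalg.norm(x - quantize_1bit(x))
    err_rotated = np.linalg.norm(x - Q.T @ quantize_1bit(Q @ x))
    print(f"reconstruction error, no rotation: {err_plain:.2f}")
    print(f"reconstruction error, rotated:     {err_rotated:.2f}")

Because the rotation smears the outlier's energy across all 64 coordinates, the single shared scale fits much better and the rotated reconstruction error comes out noticeably lower.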
https://openreview.net/forum?id=tO3ASKZlok