It's prefill; slow prefill kills agentic workloads dead.
If you have 100,000 tokens at ~150tok/s per the OP, you're looking at:
You have: 100000 / (150/s)
You want: hms
11 min + 6.6666667 sec
Which is quite a wait indeed.
This is also a problem for all of the Mac local LLMs. Macs are a great way to get a lot of high-bandwidth memory, but their compute is very far behind current-gen dedicated GPUs. Some of the expensive Mac Studio setups allow you to run very large models with usable tokens/s, but you can be waiting a long time for it to get to the point of generating those tokens.
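For anyone who wants to plug in their own numbers, the prefill wait above can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate: it assumes, as the comment does, that the ~150 tok/s figure applies to prompt processing, and the function name `prefill_wait` is made up for illustration.

```python
def prefill_wait(prompt_tokens: int, prefill_tok_per_s: float) -> tuple[int, float]:
    """Return (minutes, seconds) spent processing the prompt before the first output token."""
    total_seconds = prompt_tokens / prefill_tok_per_s
    minutes, seconds = divmod(total_seconds, 60)
    return int(minutes), seconds


# Numbers from the comment above: 100,000 prompt tokens at ~150 tok/s.
minutes, seconds = prefill_wait(100_000, 150)
print(f"{minutes} min + {seconds:.1f} sec")  # 11 min + 6.7 sec
```

Doubling the prefill rate halves the wait, which is why prompt-processing speed matters so much for long agentic contexts.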
Also, the cheap HPE pulls on eBay need some proprietary HPE magic to work, and I have yet to see anyone figure that out.
I've run some multi-vendor Frankenstein setups before and sometimes it even works, so I'm curious to hear about your experience with it.
In any event, not all of us have a unique writing style worth preserving just like not all of us can write clear and clean code. Just saying.
I feel like writing could use a similar harness, where it attempts to minimally reword the author's sentences, perhaps just tweaking grammar, spelling, etc. In the coding example I think the human code would be nearly unchangeable and the LLM would pivot around it, but in the writing example I think the human writing would have to be more mutable. I imagine it would be a configurable setting.
I've not really seen a system which focuses on this human<->LLM loop, but it feels interesting to me.
I’m really in the “who gives a shit” camp on something like this. A lot of people probably have an LLM punch up a blog post. It is good at turning bullet points and notes into prose, fixing run-ons, etc. Maybe I’m naive but I trust that the kind of person who posts a clearly noncommercial post like this on HN gives a crap enough that they read the final draft and confirmed it isn’t inaccurate.
This pearl-clutching about the mere use of AI regardless of how responsible or appropriate the use is, seems like a professor in 1985 throwing an essay back in a student’s face as “this was obviously printed from a computer and not typewritten like a PROPER essay! I can tell just by looking at it!”
- In 2017, the V100 was a ~$10,000 GPU. I believe there was a PCIe version, but this one is probably so cheap because SXM2 is going to be harder to use;
- A 5090 has 1800GB/s of internal memory bandwidth (compared to 900GB/s in the 9-year-old GPU). Of course a 5090 is substantially more expensive;
- A 5090 has ~21k CUDA cores vs ~5k;
- The current $10k NVidia GPU is the RTX 6000 Pro w/ 96GB of VRAM. It has slightly more CUDA cores but is otherwise pretty much just a 5090. This is unsurprising. NVidia uses VRAM for market segmentation.
Consider this: in 5-10 years, the trillions spent on AI data centers will likewise be sold for scrap most likely. That's how short the runway is for OpenAI and Anthropic to recover that investment.
Anyway, I'm kind of impressed the author managed to get this all to work. I don't think it even would've occurred to me that someone had made an SXM2 adapter, particularly because it's not even used anymore. Like props to whoever did that.
Even more interesting: it'll devalue all of SaaS and the entire US tech sector.
We might have just shot our most valuable non-AI tech products in the foot.
The resulting economic crash will affect everyone; we're (IMHO) looking at a dotcom-bust-level wipeout. And many SaaS and other companies run asset-lean (i.e. they have no server hardware because that's all cloud, no real estate because it's all either WeWork or conventionally rented), margin-lean (the VC business model requires that, as the basic recipe is to achieve market domination by burning cash) and cash-lean (often enough, there's less than a quarter of expenses in the bank accounts).
All that "lean-ness" looks great on an investor's quarterly release sheet: no massive amounts of wealth tied up in assets and no cash sitting around in bank accounts that could be released to investors as dividends or, if it comes from third parties, costs the company interest... but it prevents resiliency against crises.
Counterpoint: the fiber buildout during the dotcom boom. That crashed the economy pretty hard when the bubble burst, but we are still benefitting from all the dark fiber that was arranged for and built out back in that era. A lot of today's ISPs were able to grab up that fiber after the bust for cents on the dollar.
Assume that OpenAI and Anthropic go bust, which at least one of them likely will, and possibly a fair few of the datacenters that are under construction will also collapse. Someone will be able to snatch these physical assets again for cents on the dollar and run open-weight models on them or train new ones.
The problem isn't (and no, this is not an AI tell, everything I write here got typed on a 2022 M2 MBA by hand) the assets, they will be put up for productive usage, just as with any other large bankruptcy or bubble in history. The problem is the "IOU" that is being passed from one hand to the next like a hot potato. Assuming a recovery of, maybe, 20% after the collapse, at 1.6 trillion dollars of assets under management by some kind of private investment/debt we're looking at about 1.3 trillion dollars in valuation that is going to be wiped out.
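The write-down figure above is simple arithmetic, sketched here so the inputs are easy to vary. The 1.6 trillion dollars and the 20% recovery are the comment's own assumed numbers, not data; the function name `wiped_out` is hypothetical.

```python
def wiped_out(assets_usd: float, recovery_rate: float) -> float:
    """Value lost if only `recovery_rate` of the assets' book value is recovered."""
    return assets_usd * (1.0 - recovery_rate)


# Inputs taken from the comment above: $1.6T of assets, ~20% recovery.
loss = wiped_out(1.6e12, 0.20)
print(f"{loss / 1e12:.2f} trillion dollars")  # 1.28 trillion dollars
```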
And given that a lot of the investment market is actually backed by pension funds... this is going to be a bloodbath. Not only will there be a lot of people laid off in addition to the layoffs we already saw "due to AI", but when the pension funds and thus their payouts collapse? We'll see retirees flooding the employment markets who just try to make a living, rendering the situation for everyone else even worse. Flipping burgers used to be a gig for students, these days students compete with people of all ages desperate to survive - and thus desperate to undercut others in wages.
Another problem will be the capacity buildout in the semiconductor industry. It's already heading toward an oligopoly after numerous boom-bust cycles: you only have two and a half GPU chip vendors (NV, AMD, Intel), two general-purpose CPU vendors (Intel and AMD - I exclude Apple because they do not sell their CPUs to any third party and ARM because 99% of non-Apple ARM chips do not go towards servers, desktops and laptops), three RAM manufacturers (Samsung, SK Hynix, Micron) and two and a half physical chip manufacturers (TSMC, Samsung, Intel). When the AI bubble bursts, it will take one hell of an effort to prevent at least one actor from going bankrupt.
[1] https://prospect.org/2025/11/19/ai-bubble-bigger-than-you-th...
2X NVIDIA Tesla V100 32GB NVLink Water Cooled X99 E5-2686v4 AI Workstation PC
Item Quantity
Intel Xeon E5-2686 v4 CPU 1
2U CPU Cooler 1
Jingyue X99 Motherboard 1
DDR3 Memory 32GB
SSD 480GB
AMD Radeon R5 240 4K Display Card 1
NVIDIA Tesla V100 32GB SXM2 GPU 2
NVLink SXM2 Dual-GPU Baseboard 1
Corsair Water Cooling System 2
850W Bronze Power Supply 1
Dual-GPU 300G NVLink SXM2 Baseboard 1
8654 Data Cable 2
8654 to PCIe Adapter Card 1
The thought of throwing away working cards sounds so bizarre to me. I can't believe companies would dispose of them in a landfill like that; they are at least worth giving away for reuse.
sigh
Had to stop there. Annoying. I can't stand AI use for writing. It makes any otherwise great article feel so disingenuous.
Because humans write exactly like this /s
The project is still very cool, but it’s a little less enjoyable to read when everything sounds the same. It would be just as annoying for people to manually write in a corporate/marketing style, because humanity is what makes the small web interesting.
Not from individual human content, that's for sure - maybe MLM marketing copy? Sleazy 4AM ads?
I mean, every time this response comes up, I keep asking the person to point at something written prior to 2022 that gets 80%+ on the LLM detectors, and yet no one can find anything.
Maybe you, postalrat, can find something written in this style that was published prior to 2022.
If the way you thought was to run a bunch of if statements, generate content, then feed that content back to get a "score" of what seems the most plausible, run the if statements again, and adjust / merge responses, then you would write similarly. The recognizable cadence of LLM-generated content is pretty clearly the result of a lot of if statements being fused together.
Isn't a Raspberry Pi with 16GB of RAM $300 now?
I don't think this is a fair characterization of the situation. I use frontier models via API pre-paid tokens every single day, and I can barely rack up $100 per month. The fact that we figured out how to burn double this in 20 minutes is impressive, but I don't think it reflects the reality that many are experiencing right now. There are some exceptionally gluttonous approaches to harnessing LLMs that I think are serving as convenient straw men in these discussions.
Paying for the API will almost always be more economical than self-hosting equivalent infrastructure. I am not against self-hosting, but the article suggests a primarily economic motivation for this effort. If you are consuming fewer than 10^9 tokens per month, I really don't think it's worth your time to try to compete with the hyperscalers. Most of the money is to be found in the integration of this technology with existing businesses.