Still, this is a great idea, and one I hope takes off. I think there's a good argument that the future of AI is in locally-trained models for everyone, rather than relying on a big company's own model.
One thought: The ability to conveniently get this onto a 240v circuit would be nice. Having to find two different 120v circuits to plug this into will be a pain for many folks.
* this section written by me typing on keyboard *
* this section produced by AI *
And usually both exist in documents and lengthy communications. The first gets across exactly what I wanted, with exactly my intention, and then I can attach an AI appendix ten times as long with helpful indexing and references.
Are references helpful when they're generated? The reader could've generated them themselves. References would be helpful if they were personal references of stuff you actually read and curated. The value then would be getting your taste. References from an AI may well be good-looking nonsense.
Yes, sometimes this is true, but not always.
Note, it's not one prompt (there isn't really "one prompt" any more; prompt engineering is such a 2023-2024 thing), or purely unreviewed output. It's curated output that was created by AI but iterated with me, since it accompanies and has to match my intention. And most of the time I don't directly prompt the agent any more; I go through a layer of agent management that injects more context into the agents that actually work on it.
Maybe their volume is low enough that well-intentioned but poor-quality PRs can be politely (or otherwise, depending on culture) disregarded, and the method of generation is not important.
Then you could focus fire, like the script kiddies did with DDoS in the old days, on fixing whatever preferred issues you have.
Fundamentally, it looks like they are shipping consumer off-the-shelf hardware in a custom box.
Or they could be the server-edition 6000s that just have a heatsink and rely on the case to drive air through them; those are 600W cards.
(I work for an electrical contractor so my sense of ease might be overcorrecting)
If it shipped with like 4090+ (for a higher price) it’d be more tempting.
Stopped due to rising GPU prices:
https://x.com/__tinygrad__/status/1983917797781426511
Wouldn't there be a massive duplication of effort in that case? It'll be interesting to see how the costs play out. There are security benefits to think about as well in keeping things local-first.
Old construction in the US sometimes did this intentionally (so old, the house didn't have grounds; or to "pass" an inspection and sell a place), but if a licensed electrician sees this they have to fix it.
I'm dealing with a 75 year old house that's set up this way. The primary issue this is causing is that a 50 amp circuit for their HVACs is taking a shorter path to ground inside the house instead of in the panel.
As a result the 50 amp circuit has blown through several of the common 20 amp grounds and neutrals and left them with dead light fixtures and outlets, because they're bridged all over the place.
If an HVAC or two can do this, I'd advise against it for your 3200 watt AI rig.
In the EU, you don't want to try to energize your ground. They use step-down transformers or power supplies capable of taking 115-250V (their systems are 240-250V across the load and neutral lines, not 120V across the load and neutral like ours).
In the US, you're talking about energizing your ground plane with 120V, and I don't want to call that safe… but it's REALLY NOT SAFE to make yourself the shortest path to ground on, say, a wet bathroom floor, with 220-250V.
I can’t tell what practice you’re referring to. Are you perhaps referring to older wiring that connects large appliances to a neutral and two hots but no ground, e.g. NEMA 10-30R receptacles? Those indeed suck and are rather dangerous. Extra dangerous if the neutral wiring is failing or undersized anywhere.
But even NEMA 10-30R receptacles are still 120V RMS phase-to-ground. (And, bizarrely, there’s an entire generation of buildings where you might find proper 4-conductor wiring to the dryer outlet and a 10-30R installed — you can test the wiring and switch to 14-30R without any rewiring.)
The exception for residential wiring is when the neutral feed from the utility transformer fails, in which case you may have 240V phase-to-phase with the actual Earth floating somewhere in the middle (via the service’s ground connection), which can result in phase-to-neutral and phase-to-ground measured anywhere in the house varying from 0 to 240V RMS.
> wet bathroom floor
A GFCI receptacle adds a considerable degree of safety and can be installed with arbitrarily old wiring. It’s even permitted by code to install one with no ground connection as long as you label it appropriately — look it up in your local code.
I believe that’s kinda naughty.
It works, but it energizes your ground plane and people do get mildly shocked. That's making me a little nervous.
So holes have been drilled in ceilings and walls, and single-wire neutrals or grounds have been fished down the walls, repeatedly, by yours truly. But there's still at least one "GFCI" outlet that's wired this way, and they're balking at getting an electrician back out here for it.
Bridging neutral to ground because the neutral line's dead, uh, "works," to be technical, but whoever did this moved on years ago and heaven only knows how many outlets or fixtures this was done in. I'm just finding out one by one as someone goes "hey, this stopped working!" and you pull it and the neutral or ground blew like a fuse.
So that's my whole point: this is an extremely bad idea for a 3200 watt computer.
Yes, as I discover them they're all getting snipped, blank wall plated, and marked with a Dymo labeler as hazards that need to be remediated.
I don't work here, I just live here, and have kind of a slummy owner who doesn't want to do anything about any of it and doesn't care if the plumbing or electrical works.
But they paid some guy like $4000 to install a totally unnecessary subpanel that's bridging conflicting phases into the same circuits, because he didn't figure out this was what was going on. My God. I would have fixed the whole house for $1000. It's a miracle this hovel hasn't burned to the ground yet.
I’m putting up with it for now but should probably bail before it does.
In Europe, you could plug the two power supplies into an appropriately sized 240V circuit.
In an apartment you can't rewire, you could set it up in your kitchen, which in the modern US code should have two separate 20A circuits. You will need to put it to sleep while you use appliances.
But this is re: European 240/250V, which is 240V between load and neutral.
I'd say don't energize either system's ground plane, but, really, don't do this in the EU.
So basically you need a brand new circuit run if you don't have two 120V circuits next to each other. But if you're spending $65k on a single machine, an extra grand for an electrician to run conduit should be peanuts. While you're at it I would def add a whole-home GFCI, lightning/EMI arrestor, and a UPS at the outlet, so one big shock doesn't send $65k down the toilet.
He's not saying you would use it as two separate 120v circuits sharing a ground but rather as a single 240v circuit. His point is that it's easy to rewire for 240v since it's the same as all the other wiring in your house just with both poles exposed.
Of course you do have to run a new wire rather than repurpose what's already in the wall since you need the entire circuit to yourself. So I think it's not as trivial as he's making out.
But then at that wattage you'll also want to punch an exhaust fan in for waste heat so it's not like you won't already be making some modifications.
Can confirm.
The only place where there's isolation is stuff like USB ports to avoid dangerous ground loop currents.
That said I believe the PSU itself provides full isolation and won't backfeed so using two on separate circuits should (maybe?) be safe. Although if one circuit tripped the other PSU would immediately be way over capacity. Hopefully that doesn't cause an extended brownout before the second one disables itself.
No need for separate circuits, just use a double adapter.
Oh wait, I get it, it's bike shedding.
I have no idea who would buy this. Maybe if you think Vera Rubin is three years out? But NV ships, man, they are shipping.
This is already solved by running LM Studio on a normal computer.
If you compare tokens/kWh efficiency, then my math has the Mac Studio being about 1.5x more efficient.
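For what it's worth, here's the shape of that math as a quick sketch; the tok/s and wall-power numbers below are illustrative assumptions, not measurements:

    # Back-of-the-envelope tokens-per-kWh comparison (Python).
    # The tok/s and watt figures are assumed; substitute your own measurements.
    def tokens_per_kwh(tokens_per_sec, watts):
        # tokens/sec * 3600 sec/hour, divided by kilowatts drawn at the wall
        return tokens_per_sec * 3600 / (watts / 1000)

    mac_studio = tokens_per_kwh(tokens_per_sec=25, watts=200)    # assumed
    tinybox    = tokens_per_kwh(tokens_per_sec=150, watts=1800)  # assumed
    print(f"Mac Studio: {mac_studio:,.0f} tokens/kWh")  # 450,000
    print(f"tinybox:    {tinybox:,.0f} tokens/kWh")     # 300,000
    print(f"ratio:      {mac_studio / tinybox:.1f}x")   # 1.5x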
Only if you upload your data to that cloud server you rented. Then, by definition, you are.
> AWS and GCP are copying all private data on servers?
Every computer copies data when moving it. Several times, in fact. Through network card buffers, switches, system memory, disk caches, and finally to some form of semi-permanent storage.
I don't have to think Amazon is stealing my data to be aware that Amazon S3 buckets containing privileged information are routinely found open. I don't have to think that Google is spying on me to know that operating equipment my business owns on prem and does not share requires me to trust fewer people and less complex systems than doing the same work from the cloud.
You are very quick to make foolish assumptions and assign them to others.
Has this guy never worked on a B2B product before? Nobody is going to order a $10 million piece of infrastructure through your website's order form. And they are definitely going to want to negotiate something, even if it's just a warranty. And you'll do it because they're waving a $10 million check in your face.
The tone of this website is arrogant to the point of being almost hostile. The guy behind this seems to think that his name carries enough weight to dictate terms like this, among other things like requiring candidates to have already contributed to his product to even be considered for a job. I would be extremely surprised if anyone except him thinks he's that important.
Besides a lot of self-congratulatory pats on the back for how elegant it is. Honestly, when I read it, it looked as confusing as all the other ML libraries. Not actually simple like Karpathy's stuff.
All that to say, I do really want it to succeed. They should probably hire some practical engineers and not just guys and gals congratulating themselves on how elegant and awesome they are.
> Can you fill out this supplier onboarding form?
That's very important context, as anyone who has been asked to fill out a supplier onboarding form (hi) will attest.
> we don't offer any customization to the box or ordering process
Every B2B deal of that size that I've ever seen requires at least weeks of meetings between the customer and vendor, in which every detail is at least discussed if not negotiated. That would certainly constitute a "customization" to this guy's prescribed ordering process, which is to "Buy it now" [1] through the website at the stated price like you're ordering a jar of peanuts on Amazon. This is not "framing", it's what the guy said. If it isn't what he meant then he needs to fix his copy.
[1] Yes, there is an actual "Buy it now" button for a $65,000 business purchase that takes you to a page that looks just like a Stripe form. There isn't even a textbox for delivery instructions. Wild.
On a website where we frequently talk about disruptive business models, this whole attitude kinda stinks.
Sure, I guess. Far more likely that they won't succeed, and it will be because of their pointless refusal to cooperate with others. I'm curious why you think we should "disrupt" companies putting a little due diligence into massive purchases.
> On a website where we frequently talk about disruptive business models, this whole attitude kinda stinks.
I could say the same thing about making a comment like this on a website where groupthink is rightfully mocked.
First encounter with geohot eh?
> 20,000 lbs
> concrete slab
Huge-scale IT systems are typically delivered in one or more 42/44u cabinets, and are designed to be installed on raised floors.
I mean I'm sure lots of companies do this in practice because tickets for higher-paying customers naturally get prioritized, but directly stating your intention to do it on your home page is hilarious.
This guy desperately needs a marketing intern to look over his copy. Or hell, anyone who knows how to talk to humans.
There's nothing remotely unusual about being selective about who to bring on as a new customer, or how, in B2B sales.
Their preference is for more simplicity than normal; many businesses make it much harder.
I mean, you're not wrong: buying enterprise software from Oracle or Microsoft or Salesforce is pure pain.
But nobody expects buying niche hardware from a tiny vendor to involve the usual 128 pre/post sale meetings and 256 hours of professional services.
Also, the relevant VPs buying these things usually do understand the difference between AMD and Nvidia stacks really well. Like, really-really well.
There are certain quirks of this platform's user base that always make me laugh. For example, HNers absolutely love to imply something condescending about the other guy's workplace in order to make their point.
Watch this, I can do it too: Working with managers who make $65,000 (or $10 million) purchases with no more due diligence than reading a marketing page and clicking "Buy it now" is not the flex you think it is.
And I honestly see almost no correlation between the amount of negotiation involved and the value received.
Some of the most useful things we've integrated were either free or meant that only the "buy it now" button had to be clicked.
Some of the absolutely worst systems I had to work with were purchased after making a call to that "let us know" number.
This tinygrad guy is mostly saying that he doesn't have the time for enterprise blah-blah. I am not sure he can organise enterprise sales with this attitude, but I can definitely relate to it!
The YouTube rap video of geohot telling the Sony lawyers suing him to blow him is still up.
His style of dealing with corporate matters is certainly unconventional
Edit: found a third party referencing the claim but it doesn't belong in the title here I think:
Meet the World’s Smallest ‘Supercomputer’ from Tiiny AI; A Machine Bold Enough to Run 120B AI Models Right in the Palm of Your Hand
https://wccftech.com/meet-the-worlds-smallest-supercomputer-...
Now I'm wondering if the HN title was submitted by some AI bot that couldn't tell the difference.
I think Tinygrad should think about recycling. Are they planning ahead in this regard? Is anyone? My thought is that if there were a central database of who owns what and where, then at least when the recycling tech becomes available, people will know where to source their specific trash (and even pay for it). Having a database like that in the first place could even fuel the industry.
I'm almost sure it's possible to custom build a machine as powerful as their red v2 within a $9k budget. And have a lot of fun along the way.
So for programming, dollars spent on context are probably worth more than dollars spent on inference speed.
$12,000, $65,000, $10,000,000.
The town near my hometown has 650-800 houses (according to ChatGPT).
crazy.
A typical home just consumes rather little energy, now that LED lighting and heat-pump cooling/heating have become the norm.
We're not all solidly middle-class (especially in Southern and Eastern Europe), and as such we cannot afford those heat pumps. But we'll have to eat the increased energy costs brought by insane server configurations like the ones from the article, so, yay!!!
Do you live in a deprived rural village in a very poor country? Because you can't even run a heater and the oven with 3kW.
With 6 GPUs you have to deal with risers, PCIe retimers, dual PSUs, and a custom case, so the value proposition there was much better IMO.
I'm currently shopping for offline hardware and it is very hard to estimate the performance I will get before dropping $12K. I would love to have a baseline I can count on, e.g. always getting at least 40 tok/s running GPT-OSS-120B using Ollama on Ubuntu out of the box.
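If you want to measure that baseline yourself, here's a minimal sketch against Ollama's local HTTP API (the model tag below is an assumption; check `ollama list` for what you actually have pulled):

    # Measure decode tok/s via Ollama's local HTTP API (Python).
    # Assumes Ollama is running on its default port with the model pulled.
    import requests

    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": "gpt-oss:120b",  # assumed tag; adjust to your install
            "prompt": "Explain the CAP theorem in one paragraph.",
            "stream": False,
        },
        timeout=600,
    )
    data = resp.json()
    # eval_count = generated tokens; eval_duration is in nanoseconds
    tok_per_s = data["eval_count"] / (data["eval_duration"] / 1e9)
    print(f"decode: {tok_per_s:.1f} tok/s")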
Not revolutionary in any way, but nice. Unless I'm missing something here?
It's funny though... we're using DeepSeek now for features in our service, and based on our customer type we thought they would be completely against sending their data to a third party. We thought we'd have to do everything locally. But they seem OK with DeepSeek, which is practically free. And the few customers that still worry about privacy may not justify such a high price point.
If private inference is actually non-negotiable, then sure, put GPUs in your colo and enjoy the infra pain, vendor weirdness, and the meeting where finance learns what those power numbers meant.
"likely" doesn't inspire much confidence. Surely, they have those numbers, and if it was, they'd publicize the comparisons.
Can they, or someone else, give more details as to the workloads on which PyTorch is more than 2x slower than what the hardware provides? Most of the papers use standard components, and I assume PyTorch is already pretty performant at implementing them, at 50+% of the extractable performance from typical GPUs.
If they mean more esoteric stuff that requires writing custom kernels to get good performance out of the chips, then that's a different issue.
How do you test/generate these numbers?
* RAM - $1500 - Crucial Pro 128GB Kit (2x64GB) DDR5 RAM, 5600MHz CP2K64G56C46U5, up to 4 sticks for 128GB or 256GB, Amazon
* GPU - $4700 - RTX Pro 5000 48GB, Microcenter
* CPU/Mobo bundle - $1100 - AMD Ryzen 7 9800X3D, MSI X870E-P Pro, ditch the 32GB RAM, Microcenter
* Case - $220, Hyte Y70, Microcenter
* Cooler - $155, Arctic Cooling Liquid Freezer III Pro, top-mount it, Microcenter
* PSU - $180, RM1000x, Microcenter
* SSD - $400 - Samsung 990 Pro 2TB Gen 4 NVMe M.2
* Fans - $100 - 6x 120mm fans, 1x 140mm fan, of your choice
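A quick tally of that list (prices as quoted, before tax/shipping):

    # Sum of the listed parts.
    parts = {
        "RAM": 1500, "GPU": 4700, "CPU/Mobo bundle": 1100, "Case": 220,
        "Cooler": 155, "PSU": 180, "SSD": 400, "Fans": 100,
    }
    print(f"Total: ${sum(parts.values()):,}")  # Total: $8,355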
Look into models like Qwen 3.5
This is certainly not the most effective use of $7k for running local LLMs.
The answer is a 16" M5 Max 128GB for $5k. You can run much bigger models than your setup while being an awesome portable machine for everything else.
In terms of GPU memory bandwidth (models fitting in the ~48GB of RTX 5000 Pro card), the RTX card I described above has over 2x the bandwidth of an M5 Max.
If leveraging system RAM (the 128GB-256GB outside the GPU) to run larger models, then the memory bandwidth is ~6x slower than M5 Max.
For models fitting in the ~48GB RTX memory, like dense Qwen3.5 27B models, the RTX will be 2-4x faster than M5 Max. For models that don't fit in the 48GB RTX memory, the M5 Max will be 5-20x faster.
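A crude way to see why the crossover is so sharp is to treat decode as purely memory-bound: per-token time is the time to stream the active weights from wherever they live. The bandwidth figures below are rough assumptions for illustration, not measured specs:

    # Memory-bound decode model: per-token time = bytes streamed / bandwidth,
    # split between fast memory (VRAM or unified) and slow memory (DDR5).
    # All GB/s numbers are illustrative assumptions.
    def tok_per_s(gb_per_token, fast_gb, fast_bw, slow_bw):
        fast = min(gb_per_token, fast_gb)
        slow = gb_per_token - fast
        return 1 / (fast / fast_bw + slow / slow_bw)

    # Dense model streaming 30 GB of weights per token (fits in 48 GB VRAM):
    print(tok_per_s(30, fast_gb=48,  fast_bw=1300, slow_bw=80))  # RTX: ~43 tok/s
    print(tok_per_s(30, fast_gb=128, fast_bw=500,  slow_bw=500)) # M5 Max: ~17 tok/s

    # Model streaming 80 GB per token (RTX spills 32 GB to system RAM):
    print(tok_per_s(80, fast_gb=48,  fast_bw=1300, slow_bw=80))  # RTX: ~2.3 tok/s
    print(tok_per_s(80, fast_gb=128, fast_bw=500,  slow_bw=500)) # M5 Max: ~6.3 tok/s

The exact ratios depend on the assumed numbers, but the direction matches: once the RTX spills past its VRAM, the slow pool dominates the per-token time.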
Also worth considering future upgrades: Do you plan to throw away the machine in a few years, or pick up multiple used RTX 6000 Pro cards when people start ditching them?
https://marketplace.nvidia.com/en-us/enterprise/personal-ai-...
A small joke at this week's GTC was that the "BOGOD" discount was to sell them at $4K each...
Machines with the 4xx chips are coming next month so maybe wait a week or two.
It's soldered LPDDR5X with AMD Strix Halo... sglang and llama.cpp can handle that pretty well these days. And it's, you know, half the price, and you're not locked into the Nvidia ecosystem.
Mac Studio or Mac Mini, depending on which gives you the highest amount of unified memory for ~$5k.
I’m pretty curious to see any benchmarks on inference on VRAM vs UM.
Raptor Lake + 5080: 380.63 GB/s
Raptor Lake (CPU for reference): 20.41 GB/s
GB10 (DGX Spark): 116.14 GB/s
GH200: 1697.39 GB/s
This is a "eh, it works" benchmarks, but should give you a feel for the relative performance of the different systems.In practice, this means I can get something like 55 tokens a sec running a larger model like gpt-oss-120b-Q8_0 on the DGX Spark.
So for an LLM, inference is relatively slow because of that bandwidth, but you can load much bigger, smarter models than you could on any consumer GPU.
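As a rough cross-check on that 55 tok/s figure: memory-bound decode speed is about bandwidth divided by bytes read per generated token. gpt-oss-120b is MoE, so only the active parameters stream each token; the ~2 GB figure below is an assumption, not a published number:

    # tok/s ceiling ~= memory bandwidth / bytes read per generated token.
    bandwidth_gb_s = 116.14  # GB10 / DGX Spark, from the list above
    active_gb = 2.0          # assumed: active MoE weights + KV read per token
    print(f"~{bandwidth_gb_s / active_gb:.0f} tok/s ceiling")  # ~58 tok/s
    # Close to the observed 55 tok/s, i.e. decode is bandwidth-limited.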
Nowadays I find most things work fine on Arm. Sometimes something needs to be built from source which is genuinely annoying. But moving from CUDA to ROCm is often more like a rewrite than a recompile.
Isn't everyone* in this segment just using PyTorch for training, or wrappers like Ollama/vllm/llama.cpp for inference? None have a strict dependency on CUDA. PyTorch's AMD backend is solid (for supported platforms, and Strix Halo is supported).
* enthusiasts whose budget is in the $5k range. If you're vendor-locked to CUDA, Mac Mini and Strix Halo are immediately ruled out.
For $5K one can get a desktop PC with an RTX 5090, which has 3x more compute but 4x less VRAM, so depending on the workload it may be a better option.
I could swear I filed a GitHub issue asking about the plans for that but I don't see it. Anyway I think he mentioned it when explaining tinygrad at one point and I have wondered why that hasn't got more attention.
As far as boxes, I wish that there were more MI355X available for normal hourly rental. Or any.
Obviously any Turing machine can run any size of model, so the "120B" claim doesn't mean much. What actually matters is speed, and I just don't believe this can be speedy enough on models that my $5000 5090-based PC is too slow for or lacks enough VRAM for.
120B could run, but I wouldn't want to be the person who had to use it for anything.
To be fair, the 120B claim doesn't appear on the webpage. I don't know where it came from, other than the person who submitted this to HN
Also, nobody is comparing this box to a $10M Nvidia rack-scale deployment. They're comparing it to putting all of the same parts into their Newegg basket and putting it together themselves.
A single box with those specs, without having to build/configure it yourself (the red and green), I could see being useful if you had money but not time to build/configure/etc.
The point is that they care now.
But even in the AMD stack (things like ck and aiter), consumer cards are not even second-class citizens. They are a distant third at best. If you just want to run vllm with the latest model, if you can get it running at all there are going to be paper cuts all along the way, and even then the performance won't be close to what you could be getting out of the hardware.
The boxes look cool, but how good are they really? The cheapest box seems pricey at $12k for what is essentially a few gaming GPUs. I don't see why you couldn't make that like half the price. You could do a PC/server build that's much, much faster for way less. Size doesn't matter if it's more than twice the price, I think...
The more expensive box at least has real processing GPUs, but AFAIK also not very popular ones; this one seems maybe more fairly priced (there seems to be a big difference in bang for buck between these???).
The third one suggested looks like a joke.
Don't get me wrong, this seems like a really cool idea. But I don't see it taking off, as the prices are corporate but the product seems more home use.
Maybe in time they will find a better balance. I do respect the fact that the component market now is sour as hell and making good products with stable prices is pretty much impossible.
I'd love one of these machines someday, maybe when I am less poor, or when they are xD.
(Love the styling of everything. This is the most critical I could be from a dumb consumer perspective, which I totally am btw.)
He's an interesting guy. Seems to be one who does things the way he thinks is right, regardless of corporate profits.
720x RDNA5 AT0 XL 25,920 GB VRAM 23,040 GB System RAM
~ $10 Million
Who is the target market here?
But let's be real, $12k is kinda pushing it. What kind of people are gonna spend $65k or even $10M (lmao WTAF) on a boutique thing like this? I don't think these kinds of things go in datacenters (happy to be corrected), and they are way too expensive (and probably way too HOT) to just go in a home or even an office "closet".
I had the same feeling as throwadem when reading this. Your comment clarifies what they meant by "everyone".
I'm not sure what tinygrad is but I assume the markup is because the customer is making a conscious choice to support the tinygrad project. But what's unusual is there is apparently no reason whatsoever to buy this hardware, even if you plan on using tinygrad exclusively for your project. At least with System76 hardware I get (in theory) first class support for Pop!_OS.
Sorry, what? Is this just a scam?
But let's be clear: the risks are the same whether you are wiring money through Western Union or through any other bank. Once you wire the money you do not have the same protections as with other payment mechanisms, and if you don't get the product as described, you are likely out your money. This is in contrast to other forms of payment like credit cards, where you are protected: with a credit card you can issue a chargeback to the seller and get your money back in the case of fraud. With a wire transfer you cannot.
There's a lot there that makes sense and, I think, needs to be considered. But a lot just seems to be out of the blue, included without connection, in my view. Feels like maybe they're in-group messages that I don't understand. How this is framed as against democracy is unclear to me, and revolting. I do think we must grapple with the world as it is, and this post is strongly in that area, but letting fear be the dominant ruling emotion is one of the main definitions of conservatism, and its use here to scare us sounds bad.
Did he take down the video because of embarrassment or did he fear negative impact on his sales?
And his politics are a derivative of Great Man Theory, and his positions on things like democracy follow from that. This idea, also espoused by some of the VC/tech elite like Peter Thiel, is that singular hardworking genius individuals can change the world on their own, and everyone not in this top 0.1% is a borderline NPC.
They do this both because of their genius/hard work, and also because they are willing to break the rules set forth by the bottom 99.9%.
I'm starting to call this ideology Authoritarian Techno-Libertarianism. It's a deliberately oxymoronic name, because these "Great Men" are definitely trying to change the world; i.e., they are trying to impose their goals and values on the world without getting the buy-in of other people.
That's the "authoritarian" part. And the "libertarian" part is that they are going about this imposition of their will on the world by doing it all themselves, through their own hard work.
Think "Person invents a world changing technology, that some people thing is bad, and just releases it open source for anyone to use". AI models are a great example, in fact. Once that technology is out there the genie cannot be put back into the bottle and a ton of people are going to lose their jobs, ect.
A disdain for democracy follows directly from things like this. You don't wait for people to vote to allow you to change the world by inventing something new. You just do it and watch the results.
I think all these wildly successful neo-feudalists get increasingly emboldened the more they get away with bigger and bigger social infractions.
It's also clear that they haven't experienced an environment with extreme inequality; it's not safe for anyone there! They think the NPC plebs will continue to follow "the rules" in perpetuity, without considering that rule-following is a direct result of the stability they are actively undermining. They clearly don't read enough history.
Since when did our perception of "tiny" blow out of size in tech? Is it the influence of "hello world" Electron apps consuming 100MB of memory while idle setting the new standard? Anyway, being an AI bro seems like an expensive hobby...
Literally the line above that
> All bounties paid out at my (geohot) discretion. Code must be clean and maintainable without serious hacks.
No thanks. If you want to try before you buy, have your candidates do a paid test project. Founders need to stop acting like it's a privilege to work for them. Any talent worth hiring has plenty of other options that will treat them with respect.
I'm running a 70B model now that's okay, but it's still fairly tight. And I've got 16GB more VRAM than the red v2.
I'm also confused why this is 12U. My whole rig is 4U.
The green v2 has better GPUs. But for $65k, I'd expect a much better CPU and 256GB of RAM. It's not like a Threadripper 7000 is going to break the bank.
I'm glad this exists but it's... honestly pretty perplexing
I imagine that's because they are buying a single SKU for the shell/case. I imagine their answer to your question would be: In order to keep prices low and quality high, we don't offer any customization to the server dimensions
I used to own a Dell PowerEdge for my home office, but those fans, even on the minimal setting, kept me up at night.
The thing that's less useful is the 64GB VRAM/128GB system RAM config; even the large MoE models only need ~20B for the router, and the rest of the VRAM is essentially wasted (mixing experts between VRAM and system RAM has basically no performance benefit).
But yeah, 4x Blackwell 6000s are ~32-36k, not sure where the other $30k is going.
edit: Found your comment about /r/localllama, but if you have anything more to add I'm still very interested.
A 120B model cannot fit on 4 x 24GB GPUs at full quantization.
Either you're confusing this with the 20B model, or you have 48GB modded 3090s.
EDIT: Either they edited that to say "quad 3090s", or I just missed it the first time.
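For reference, the weights-only arithmetic behind that (a rough sketch; KV cache and runtime overhead only make it worse):

    # Weights-only VRAM check for a 120B-parameter model on quad 24 GB cards.
    params_b = 120    # billions of parameters
    vram_gb = 4 * 24  # 96 GB total
    for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
        need_gb = params_b * bytes_per_param
        verdict = "fits" if need_gb <= vram_gb else "does not fit"
        print(f"{name}: ~{need_gb:.0f} GB -> {verdict} in {vram_gb} GB")
    # fp16 (240 GB) and int8 (120 GB) don't fit; only ~4-bit (60 GB) does,
    # and that's before KV cache and activations.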
Check out what other people are getting. You're welcome.
https://www.reddit.com/r/LocalLLaMA/comments/1nunq7s/gptoss1...
https://www.reddit.com/r/LocalLLaMA/comments/1p4evyr/most_ec...
I was considering picking up a couple of the 48 gig 4090/3090s on an upcoming trip to China, but I just ended up getting one of the Max-Q's. But maybe the token throughput would still be higher with the 4090 route? Impressive numbers with those 3090s!
What's the rig look like that's hosting all that?
I don't see the 120B claim on the page itself. Unless the page has been edited, I think it's something the submitter added.
I agree, though. The only way you're running 120B models on that device is either extreme quantization or by offloading layers to the CPU. Neither will be a good experience.
These aren't a good value buy unless you compare them to fully supported offerings from the big players.
It's going to be hard to target a market where most people know they can put together the exact same system for thousands of dollars less and have it assembled in an afternoon. RTX 6000 96GB cards are in stock at Newegg for $9000 right now which leaves almost $30,000 for the rest of the system. Even with today's RAM prices it's not hard to do better than that CPU and 256GB of RAM when you have a $30,000 budget.
Can't you offload KV to system RAM, or even storage? It would make it possible to run with longer contexts, even with some overhead. AIUI, local AI frameworks include support for caching some of the KV in VRAM, using an LRU policy, so the overhead would be tolerable.
With that said, people are trying to extend VRAM into system RAM or even NVMe storage, but as soon as you hit the PCI bus with the high bandwidth layers like KV cache, you eliminate a lot of the performance benefit that you get from having fast memory near the GPU die.
Only useful for prefill (given the usual discrete-GPU setup; iGPU/APU/unified memory is different and can basically be treated as VRAM-only, though a bit slower), since the PCIe bus becomes a severe bottleneck as soon as you offload more than a tiny fraction of the memory workload to system memory/NVMe. For decode, you're better off running entire layers (including expert layers) on the CPU, which local AI frameworks support out of the box. (CPU-run layers can in turn offload model parameters/KV cache to storage as a last resort. But if you offload too much to storage (insufficient RAM cache), that dominates the overhead and basically everything else becomes irrelevant.)
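To put numbers on why offloading the KV cache hurts decode so badly, here's a rough sizing sketch; the model shape is an illustrative GQA-transformer assumption, not any specific model:

    # Rough KV-cache sizing: K and V, per layer, per KV head, per position.
    layers, kv_heads, head_dim = 48, 8, 128  # assumed model shape
    ctx = 128_000                            # context length in tokens
    bytes_per_elem = 2                       # fp16
    kv_bytes = 2 * layers * kv_heads * head_dim * ctx * bytes_per_elem
    print(f"KV cache at {ctx:,} tokens: ~{kv_bytes / 1e9:.0f} GB")  # ~25 GB
    # Decode re-reads the whole cache every token: at ~32 GB/s over PCIe
    # that's ~0.8 s/token just for KV, vs ~25 ms/token from ~1 TB/s VRAM.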