Show HN: 1-Bit Bonsai, the First Commercially Viable 1-Bit LLMs
101 points by PrismML 6 hours ago | 44 comments

drob518 44 seconds ago
I’m really curious how this scales up. Bonsai delivers an 8B model in 1.15 GB. How large would a 27B or 35B model be? Would it still retain the accuracy of those large models?
reply
jjcm 3 hours ago
1 bit with a FP16 scale factor every 128 bits. Fascinating that this works so well.

I tried a few things with it. Got it driving Cursor, which in itself was impressive - it handled some tool usage. Via cursor I had it generate a few web page tests.

On a Monte Carlo simulation of pi, it got the logic correct but failed to build an interface to start the test. Requesting changes mostly worked, but it left behind some stray symbols that caused things to fail. Required a bit of manual editing.

Tried a Simon Willison pelican as well - very abstract, not recognizable at all as a bird or a bicycle.

Pictures of the results here: https://x.com/pwnies/status/2039122871604441213

There doesn't seem to be a demo link on their webpage, so here's a llama.cpp running on my local desktop if people want to try it out. I'll keep this running for a couple hours past this post: https://unfarmable-overaffirmatively-euclid.ngrok-free.dev

reply
najarvg 3 hours ago
Thanks for sharing the link to your instance. Was blazing fast in responding. Tried throwing a few things at it with the following results:

1. Generate an R script to take a city and country name, find its lat/long, and map it using ggmaps. Generated a pretty decent script (could be more optimal but impressive for the model size) with warnings about using geojson if possible.

2. Generate a latex script to display the gaussian integral equation - generated a (I think) non-standard version using probability distribution functions instead of the general version, but I still give it points for that. Gave explanations of the formula and parameters, as well as instructions on how to compile the script using BASH etc.

3. Generate a latex script to display the euler identity equation - this one it nailed.

Strongly agree that the knowledge density is impressive for a 1-bit model with such a small size and blazing fast response.

reply
jjcm 3 hours ago
> Was blazing fast in responding.

I should note this is running on an RTX 6000 pro, so it's probably at the max speed you'll get for "consumer" hardware.

reply
najarvg 3 hours ago
I must add that I also tried out the standard "should I walk or drive to the carwash 100 meters away for washing the car" and it made the usual error of suggesting a walk given the distance, health reasons, etc. But then this does not claim to be a reasoning model, and I did not expect this to be answered correctly, not even remotely. Even previous-generation larger reasoning models struggle with this.
reply
rjh29 2 hours ago
It reminds me of very early ChatGPT, with mostly correct answers but some nonsense. Given its speed, it might be interesting to run it through a 'thinking' phase where it double-checks its answers, and/or to use search grounding, which would make it significantly more useful.
reply
andai 28 minutes ago
Thanks. Did you need to use Prism's llama.cpp fork to run this?
reply
adityashankar 3 hours ago
here's the google colab link, https://colab.research.google.com/drive/1EzyAaQ2nwDv_1X0jaC5... since the ngrok link likely got ddosed by the number of individuals coming along
reply
jjcm 3 hours ago
Good call. Right now though traffic is low (1 req per min). With the speed of completion I should be able to handle ~100x that, but if the ngrok link doesn't work defo use the google colab link.
reply
adityashankar 3 hours ago
The link didn't work for me personally, but that may be a bandwidth issue with me fighting for a connection in the EU
reply
uf00lme 3 hours ago
The speed is impressive. I wish it could be set up for something similar to speculative decoding.
reply
hmokiguess 3 hours ago
wow that was cooler than I expected, curious to embed this for some lightweight semantic workflows now
reply
kent8192 15 minutes ago
Oh, boy. This good tool hates my LM Studio... The following message appears when I run Bonsai in my LM Studio. I think my settings have done something wrong.

```
Failed to load the model Error loading model. (Exit code: null). Please check the settings and try loading the model again.
```
reply
andai 24 minutes ago
Does anyone know how to run this on CPU?

Do I need to build their llama.cpp fork from source?

Looks like they only offer CUDA builds on the release page. I think those might support CPU mode, but they refuse to even run without CUDA installed. Seems a bit odd to me, I thought the whole point was supporting low end devices!

reply
wild_egg 2 hours ago
Don't have a GPU so tried the CPU option and got 0.6t/s on my old 2018 laptop using their llama.cpp fork.

Then found out they didn't implement AVX2 for their Q1_0_g128 CPU kernel. Added that and getting ~12t/s which isn't shabby for this old machine.

Cool model.

reply
UncleOxidant 42 minutes ago
what did you have to do to add the AVX2 support?
reply
plombe 23 minutes ago
Interesting post. Curious to know how they arrived at intelligence density = Negative log of the model's error rate divided by the model size.
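
For anyone who wants to play with that definition, here is a minimal sketch of it as worded above. The log base (natural log), the size unit (GB), and treating error rate as 1 minus the average benchmark score are all assumptions on my part, not something taken from the whitepaper. The function name is made up.

```python
import math

def intelligence_density(error_rate: float, size_gb: float) -> float:
    """-log(error rate) / model size, as described in the comment above.

    Assumptions: natural log, size in GB, error_rate as a fraction in (0, 1].
    """
    if not 0.0 < error_rate <= 1.0:
        raise ValueError("error_rate must be in (0, 1]")
    if size_gb <= 0.0:
        raise ValueError("size_gb must be positive")
    return -math.log(error_rate) / size_gb

# Inputs quoted elsewhere in this thread: a 70.5 average score at 1.15 GB,
# versus 79.3 for a model taken to be 16x larger (that size is an assumption).
bonsai = intelligence_density(1 - 0.705, 1.15)
baseline = intelligence_density(1 - 0.793, 1.15 * 16)
print(f"bonsai: {bonsai:.3f}, baseline: {baseline:.3f}")
```

Because size sits in the denominator, a small model can score far higher on this metric even with a noticeably worse error rate.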
reply
Geee 14 minutes ago
What is model's error rate?
reply
alyxya 4 hours ago
I expect the trend of large machine learning models to go towards bits rather than operating on floats. There's a lot of inefficiency in floats because the weights are typically something like normally distributed, which makes storage and computation inefficient when most values are clustered in a small range. The foundation of neural networks may be rooted in real valued functions, which are simulated with floats, but float operations are just bitwise operations underneath. The only issue is that GPUs operate on floats and standard ML theory works over real numbers.
reply
_fw 3 hours ago
What’s the trade-off? If it’s smaller, faster and more efficient - is it worse performance? A layman here, curious to know.
reply
kvdveer 3 hours ago
Their own (presumably cherry-picked) benchmarks put their models near the 'middle of the market' models (llama3 3b, qwen3 1.7b), not competing with Claude, ChatGPT, or Gemini. These are not models you'd want to interact with directly, but they can be very useful for things like classification, simple summarization, or translation tasks.

These models are quite impressive for their size: even an older raspberry pi would be able to handle these.

There's still a lot of use for this kind of model.

reply
adityashankar 3 hours ago
If you look at their whitepaper (https://github.com/PrismML-Eng/Bonsai-demo/blob/main/1-bit-b...) you'll notice that it does have some tradeoffs due to model intelligence being reduced (page 10)

The average of MMLU Redux, MuSR, GSM8K, HumanEval+, IFEval, and BFCLv3 for this model is 70.5, compared to 79.3 for Qwen3. That said, the model is also 16x smaller and 6x faster on a 4090, so it is a pretty respectable tradeoff.

I'd be interested in fine tuning code here personally

reply
bilsbie 2 hours ago
I can’t see how this is possible. You’re losing so much information.
reply
keyle 2 hours ago
Extremely cool!

Can't wait to give it a spin with ollama. It would be helpful if ollama could list it as a model.

reply
Archit3ch 3 hours ago
Doesn't Jevons paradox dictate larger 1-bit models?
reply
wmf 2 hours ago
Yeah, hopefully they release >100B models.
reply
volume_tech 2 hours ago
the speed is not just about storage -- at 1-bit you are reading roughly 16x less data from DRAM per forward pass compared to FP16. on memory-bandwidth-constrained hardware that is usually the actual bottleneck, so the speedup scales pretty directly. the ac
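
A back-of-envelope version of that argument, as a rough sketch: if decoding is memory-bandwidth-bound, each generated token has to stream roughly all the weights from memory once, so tokens per second is capped near bandwidth divided by model size. The bandwidth value below is a placeholder, not a measurement, and the model ignores KV cache, activations, and compute. Function names are made up.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    return bandwidth_gb_s / model_size_gb

def bandwidth_bound_speedup(baseline_bits: float, quant_bits: float) -> float:
    """Ratio of bytes read per forward pass: baseline vs. quantized weights."""
    return baseline_bits / quant_bits

# FP16 vs. pure 1-bit weights, and vs. 1.125 bits once group scales are counted.
print(bandwidth_bound_speedup(16, 1))
print(bandwidth_bound_speedup(16, 1.125))

# Placeholder bandwidth of 100 GB/s with a 1.15 GB model file.
print(max_tokens_per_second(100.0, 1.15))
```

Counting the per-group scales, the ideal ratio comes out a bit under 16x, and real kernels will land below that bound.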
reply
syntaxing 4 hours ago
Super interesting, building their llama cpp fork on my Jetson Orin Nano to test this out.
reply
ariwilson 2 hours ago
Very cool and works pretty well!
reply
onlyrealcuzzo 2 hours ago
I'm fascinated by these smaller models.

The amount of progress they've been making is incredible.

Is anyone following this space more closely? Is anyone predicting performance at certain parameter sizes will plateau soon?

Unlike the frontier models, these don't seem to be showing much sign of slowing down.

reply
yodon 4 hours ago
Is Bonsai 1 Bit or 1.58 Bit?
reply
woadwarrior01 4 hours ago
1-bit g128 with a shared 16-bit scale for every group. So, effectively 1.125 bit.
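
The arithmetic behind that figure, as a quick sketch: each group of 128 one-bit weights carries one 16-bit scale, so the overhead is 16/128 bits per weight. The size estimates assume every parameter is stored this way and that the parameter counts are round numbers. Real model files also carry metadata and possibly some higher-precision tensors, so treat these as rough lower bounds. Function names are made up.

```python
def effective_bits(group_size: int = 128, weight_bits: int = 1, scale_bits: int = 16) -> float:
    """Bits per weight once the shared per-group scale is amortized."""
    return weight_bits + scale_bits / group_size

def approx_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

bpw = effective_bits()
print(f"effective bits per weight: {bpw}")
for n_params in (8e9, 27e9, 35e9):
    size = approx_size_gb(n_params, bpw)
    print(f"{n_params / 1e9:.0f}B params -> ~{size:.2f} GB")
```

By this estimate an 8B model lands a little above 1.1 GB, in line with the 1.15 GB quoted for Bonsai, and 27B or 35B models would be roughly 3.8 GB and 4.9 GB.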
reply
marak830 2 hours ago
It's been a hell of a morning for llama heads - first this, then the claude drop and turboquant.

I'm currently setting this one up. If it works well with a custom LoRA on top, I'll be able to run two at once for my custom memory management system :D

reply
hatthew 3 hours ago
I feel like it's a little disingenuous to compare against full-precision models. Anyone concerned about model size and memory usage is surely already using at least an 8 bit quantization.

Their main contribution seems to be hyperparameter tuning, and they don't compare against other quantization techniques of any sort.

reply
OutOfHere 3 hours ago
How do I run this on Android?
reply
najarvg 2 hours ago
Pocket Pal is what I've seen used before. Although recently heard about "Off Grid" but not read any reviews about it or tried it personally so caveat emptor. Will see if the community has other suggestions
reply
stogot 4 hours ago
What is the value of a 1 bit? For those that do not know
reply
fgfarben 11 minutes ago
I can port it to an FPGA and so can you.
reply
jacquesm 4 hours ago
That you can process many operations with a single instruction.
reply
SwellJoe 4 hours ago
0 or 1
reply
jjcm 3 hours ago
Technically not in this case, or not effectively. The 0 or 1 correspond to a FP16 scaling factor for each group of 128 bits. The value fluctuates between each group of 128.
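
Roughly what that looks like, as a toy sketch: each group stores 128 sign bits plus one FP16 scale, and a weight decodes to either +scale or -scale. The bit-to-sign mapping (1 means +scale, 0 means -scale), the bit order, and picking the scale as the group's mean absolute value are assumptions for illustration, not taken from Prism's Q1_0_g128 format.

```python
import struct
from typing import List, Tuple

GROUP_SIZE = 128

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def quantize_group(weights: List[float]) -> Tuple[int, float]:
    """Pack one group into (128 sign bits held in an int, FP16 scale)."""
    if len(weights) != GROUP_SIZE:
        raise ValueError(f"expected {GROUP_SIZE} weights")
    # Assumed scale choice: mean absolute value of the group.
    scale = to_fp16(sum(abs(w) for w in weights) / GROUP_SIZE)
    bits = 0
    for i, w in enumerate(weights):
        if w >= 0:
            bits |= 1 << i  # assumed mapping: bit 1 means a positive weight
    return bits, scale

def dequantize_group(bits: int, scale: float) -> List[float]:
    """Decode each bit to +scale or -scale."""
    return [scale if (bits >> i) & 1 else -scale for i in range(GROUP_SIZE)]

weights = [0.5 if i % 2 == 0 else -1.5 for i in range(GROUP_SIZE)]
bits, scale = quantize_group(weights)
restored = dequantize_group(bits, scale)
print(scale, restored[:4])
```

So the stored bit only picks the sign. The magnitude comes from the group's scale, which is why the effective values differ from one group of 128 to the next.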
reply
trebligdivad 4 hours ago
Speed and density.
reply
zephyrwhimsy 2 hours ago
Cursor and similar AI-native IDEs are interesting not because of the AI itself, but because they demonstrate that the IDE paradigm is not settled. There is room for fundamental rethinking of how developers interact with codebases.
reply
tacotime 44 minutes ago
"Don't post generated comments or AI-edited comments. HN is for conversation between humans."

https://news.ycombinator.com/newsguidelines.html#generated

reply