I tried a few things with it. Got it driving Cursor, which was impressive in itself - it handled some tool usage. Via Cursor I had it generate a few web page tests.
On a Monte Carlo simulation of pi, it got the logic correct but failed to build an interface to start the test. Requesting changes mostly worked, but it left behind some stray symbols that caused things to fail, so a bit of manual editing was required.
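For reference, the underlying task is simple - this is my own minimal sketch of a Monte Carlo pi estimate (not the model's actual output), so the function name and structure here are just illustrative:

```python
import random

def estimate_pi(samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling random points in the unit square and
    counting how many land inside the quarter circle of radius 1."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    # (quarter-circle area) / (square area) = pi / 4
    return 4.0 * inside / samples

print(estimate_pi(1_000_000))  # converges toward 3.14159... as samples grow
```

The estimate's error shrinks roughly as 1/sqrt(samples), so a million points is enough for two or three correct digits.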
Tried Simon Willison's pelican-on-a-bicycle test as well - very abstract, not recognizable at all as either a bird or a bicycle.
Pictures of the results here: https://x.com/pwnies/status/2039122871604441213
There doesn't seem to be a demo link on their webpage, so here's llama.cpp running on my local desktop if people want to try it out. I'll keep this running for a couple of hours past this post: https://unfarmable-overaffirmatively-euclid.ngrok-free.dev
Strongly agree that the knowledge density is impressive for a 1-bit model of such a small size, and the responses are blazing fast.
I should note this is running on an RTX 6000 Pro, so it's probably the max speed you'll get on "consumer" hardware.
Do I need to build their llama.cpp fork from source?
Looks like they only offer CUDA options on the releases page; I think those builds might support CPU mode, but they refuse to even run without CUDA installed. Seems a bit odd to me - I thought the whole point was supporting low-end devices!
Then I found out they didn't implement AVX2 for their Q1_0_g128 CPU kernel. I added that and am getting ~12 t/s, which isn't shabby for this old machine.
Cool model.
These models are quite impressive for their size: even an older Raspberry Pi could handle them.
There are still a lot of uses for this kind of model.
The average of MMLU Redux, MuSR, GSM8K, HumanEval+, IFEval, and BFCLv3 for this model is 70.5, compared to 79.3 for Qwen3. That said, the model is also 16x smaller and 6x faster on a 4090, so it's a pretty respectable tradeoff.
I'd be interested in the fine-tuning code here, personally.
Can't wait to give it a spin with Ollama; it would be helpful if Ollama listed it as a model.
The amount of progress they've been making is incredible.
Is anyone following this space more closely? Is anyone predicting that performance at these parameter sizes will plateau soon?
Unlike the frontier models, these don't seem to be showing many signs of slowing down.
I'm currently setting this one up. If it works well with a custom LoRA on top, I'll be able to run two at once for my custom memory-management system :D
Their main contribution seems to be hyperparameter tuning, and they don't compare against other quantization techniques of any kind.