Can use this library:
https://github.com/antirez/qwen-asr
https://github.com/antirez/voxtral.c
Qwen-asr can easily transcribe live radio (see README) on any random laptop. It looks like we are going to see really cool things in local inference, now that automatic programming makes it a lot simpler to create solid pipelines for new models in C, C++, Rust, ..., in a matter of hours.
The thing that sold me on Voxtral Realtime over Whisper-based models for dictation is the causal encoder. Text streaming in as you speak rather than appearing after you stop is a fundamentally different UX. On M1 Pro with a 4-bit quant through voxmlx it feels responsive enough for natural dictation, though I haven't done proper latency benchmarks yet.
Integrating voxtral.c as a backend is on my roadmap; compiling to a single native binary makes it much easier to bundle into a macOS app than a Python-based backend.
100%. I don’t understand how people are able to compromise on this.
Another problem is too much abstraction at the input-spec level. The other day I asked Claude to generate a few classes. When reviewing the code I noticed it doing a full scan for ranges over one giant set. This would bring my backend to a halt. After I pointed it out, Claude smartened up and started with a lower_bound() call. When there are no people around to notice such things, what do you think we are going to end up with?
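To make the difference concrete, here is a minimal sketch in Python, using bisect on a sorted list as the analogue of C++ lower_bound() on an ordered set (the numbers and function names are illustrative, not from the original code):

```python
import bisect

# Sorted list standing in for a large ordered index
# (the Python analogue of a C++ std::set for range queries).
ids = list(range(0, 90_000, 3))

def count_range_scan(lo, hi):
    # Full scan: O(n) per query -- the pattern the generated code used.
    return sum(1 for x in ids if lo <= x < hi)

def count_range_bisect(lo, hi):
    # Binary search: O(log n) to locate each boundary -- the bisect
    # equivalent of the lower_bound() call the model switched to.
    return bisect.bisect_left(ids, hi) - bisect.bisect_left(ids, lo)
```

Both return the same counts, but on a set with millions of elements queried in a hot path, the O(n) version is exactly the kind of silent performance bug a human reviewer catches and a spec-level description never mentions.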
Now, on abstraction, I am with you: I foresee a more formal way to give specifications, one better suited to natural language as input, or even proper mathematics, than the languages we have been using thus far.
Naturally we aren't there yet.
But we were. COBOL ;)
On a more serious note: sure, we need a spec-development IDE that an LLM would compile to a language of choice (or even print as an ASIC). It would still not prevent that lower_bound thing from happening, and there will be no people left to find out why.
That's why I'm still holding on to a bulky Core 2 Duo Management Engine-free Fujitsu workstation, for when personal computing finally goes underground again.
https://github.com/kitlangton/Hex
This is now my standard way to speak to coding agents.
I used to use Handy but Hex is even faster. Last I checked, Handy has stuttering issues but Hex doesn’t.
How does this compare?
I played around with it this week. When you enable advanced mode and point the post-transcription AI model at your own server, one that mimics a minimal ChatGPT-compatible API, you can use it to modify the output. You can even return an empty string if you notice the transcript was really meant to trigger something else ("turn the lights on"); if you return an empty string, it won't inject keypresses.
So one gets the best of both worlds: transcription for dictation and transcription to trigger events.
If only I could let it listen constantly and react to voice, with no push-to-talk, that would be nice.
Maybe this project here could be used for that.
Also, this seems to support streaming transcription.
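The post-transcription hook described above can be sketched as a tiny stdlib HTTP server mimicking a minimal OpenAI-style chat-completions endpoint. Everything here is an assumption for illustration: the trigger phrases, the port, and the exact request shape Hex sends are hypothetical, not taken from Hex's docs.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical trigger phrases; anything matching gets swallowed.
COMMANDS = ("turn the lights on", "turn the lights off")

def postprocess(transcript: str) -> str:
    """Return "" for voice commands so no keypresses are injected;
    otherwise pass the dictated text through unchanged."""
    if transcript.strip().lower().rstrip(".!") in COMMANDS:
        # ...fire the actual event here (home-automation call, etc.)...
        return ""
    return transcript

class Handler(BaseHTTPRequestHandler):
    # Minimal stand-in for a ChatGPT-compatible /v1/chat/completions endpoint:
    # read the last message, post-process it, reply in the expected JSON shape.
    def do_POST(self):
        body = json.loads(self.rfile.read(int(self.headers["Content-Length"])))
        transcript = body["messages"][-1]["content"]
        reply = {"choices": [{"message": {"role": "assistant",
                                          "content": postprocess(transcript)}}]}
        data = json.dumps(reply).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

# To run: HTTPServer(("127.0.0.1", 8080), Handler).serve_forever()
```

The interesting part is only postprocess(): dictation passes through untouched, while recognized commands return an empty string so the client types nothing.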
There's https://kyutai.org/stt, which is very low latency. But it doesn't seem as hackable.
What it does:
- Runs 7 model families: offline transcription (CTC, RNNT, TDT, TDT-CTC), streaming (EOU, Nemotron), and speaker diarization (Sortformer)
- Word-level timestamps
- Streaming transcription from microphone input
- Speaker diarization detecting up to 4 speakers