I also struggle to find the word I want as I've gotten older and slower, and this fast-voice GPT just ends up frustrating me more than helping. I have to sit there and think out the whole sentence in my head before I say anything -- not very natural.
I think the solution is handling pauses more intelligently rather than a higher-latency protocol. With low latency you can interrupt, and the bot can immediately stop rambling.
I also think it spends most of its IQ on sounding good rather than thinking about the problem: "Yeah, absolutely, I can see why you'd like to…" etc. This is likely because it's on a timer, and maybe voice is more expensive to process? Text responses spend more time on the task.
The actual implementation is at fault. I had some luck instructing the model to respond only with "Mhm" until I've explicitly finished my thought and asked it a question. That makes this much less of an issue.
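Paraphrasing from memory (the exact wording is whatever you put in your custom instructions), the instruction was something like:

    Until I explicitly say I'm done and ask you a question, reply
    with "Mhm" and nothing else. Once I ask, answer normally.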
But I've decided that their voice mode is completely unusable for a different reason: the model feels incredibly dumb to interact with, keeps repeating and re-phrasing what I said, ends every single answer with a "hook" that makes the whole interaction idiotically robotic, completely ignores instructions when you ask it to stop that, and - most importantly - doesn't feel helpful for brainstorming. I was genuinely surprised by how bad it is in practice; this should be their killer app, but the model feels terribly tuned.
Surely the number refers to total ChatGPT users overall, and the fraction of those who use voice features is considerably smaller, is it not?
That's the kind of number that informs business decisions, like how much hardware and software optimization to throw at the problem.
Pipecat's smart-turn model is really good for end-of-turn detection on top of VAD - https://huggingface.co/pipecat-ai/smart-turn-v3
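Rough shape of how you'd run it with onnxruntime - the model filename, input/output names, and preprocessing here are guesses, so treat this as a sketch and check the model card for the real details:

    import numpy as np
    import onnxruntime as ort

    # Assumption: an ONNX export that takes a window of 16 kHz mono
    # float32 audio and returns a single turn-completion probability.
    sess = ort.InferenceSession("smart-turn-v3.onnx")
    name = sess.get_inputs()[0].name

    def turn_is_complete(audio_16k: np.ndarray, threshold=0.5) -> bool:
        # audio_16k: mono float32 PCM at 16 kHz
        outputs = sess.run(None, {name: audio_16k[np.newaxis, :]})
        return float(np.squeeze(outputs[0])) > threshold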
https://github.com/zarldev/zarl & https://www.zarl.dev/posts/hal-by-any-other-name
I've found that LiteRT-LM has a much lower DRAM footprint than Ollama. I've also made tons of optimizations in the code - e.g., you can do quite a bit with a 16k context window for a voice assistant while keeping the memory footprint reasonable, so I track token usage and auto-compact the history after a while. I use sub-agents and only do deep-think calls through them, so their context is kept separate. In a multi-turn conversation, if Gemma 4 directly processes audio input, the KV cache fills up within a few turns, so I route everything through Whisper instead.
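The compaction itself is nothing fancy - roughly this shape (a simplified sketch; count_tokens and summarize stand in for whatever tokenizer and cheap summarization call you use):

    # Sketch of auto-compaction: keep a running token count and, once
    # it crosses a threshold, fold the oldest turns into a summary.
    MAX_TOKENS = 16_384
    COMPACT_AT = int(MAX_TOKENS * 0.75)

    def maybe_compact(history, token_count, count_tokens, summarize):
        if token_count < COMPACT_AT:
            return history, token_count
        # Keep the system prompt and the last few turns verbatim;
        # everything in between gets replaced by a one-shot summary.
        head, middle, tail = history[:1], history[1:-4], history[-4:]
        summary = {"role": "system",
                   "content": "Summary of earlier turns: " + summarize(middle)}
        new_history = head + [summary] + tail
        return new_history, sum(count_tokens(m) for m in new_history)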
Also, by far the biggest optimization is a 3-stage producer-consumer architecture. LiteRT-LM streams tokens and I split them into sentences. A synthesizer thread then converts each sentence to audio via Kokoro TTS, and the main thread plays the audio chunks sequentially. There's a parallel barge-in monitor thread. https://github.com/pncnmnp/strawberry/blob/main/main.py#L446
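Stripped down, the pipeline looks something like this (a minimal sketch, not the actual code - see the link for the real version):

    import queue, re, threading

    sentences = queue.Queue()   # stage 1 -> stage 2
    audio = queue.Queue()       # stage 2 -> stage 3
    barge_in = threading.Event()

    def producer(token_stream):
        # Stage 1: accumulate streamed tokens, emit whole sentences.
        buf = ""
        for tok in token_stream:
            buf += tok
            while (m := re.search(r"[.!?]\s", buf)):
                sentences.put(buf[:m.end()].strip())
                buf = buf[m.end():]
        if buf.strip():
            sentences.put(buf.strip())
        sentences.put(None)  # end-of-stream sentinel

    def synthesizer(tts):
        # Stage 2: turn each sentence into a PCM chunk via TTS.
        while (s := sentences.get()) is not None:
            audio.put(tts(s))
        audio.put(None)

    def player(play_chunk):
        # Stage 3 (main thread): play chunks until done or barged in.
        while (chunk := audio.get()) is not None and not barge_in.is_set():
            play_chunk(chunk)

The win is that TTS for sentence N runs while sentence N-1 is still playing, so perceived latency is just time-to-first-sentence.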
I did not want to use openWakeWord or Picovoice because they limit which wake word you can choose. The alternative was to train a model of my own. So I created my own wake word detection pipeline using Whisper Tiny - it works surprisingly well: https://github.com/pncnmnp/strawberry/blob/main/main.py#L143...
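The core of it is just transcribing a short rolling window and fuzzy-matching the result (simplified sketch; the real pipeline is at the link above):

    import difflib
    import numpy as np
    import whisper  # pip install openai-whisper

    model = whisper.load_model("tiny")

    def heard_wake_word(window: np.ndarray, wake_word="strawberry") -> bool:
        # window: ~1-2 s of mono float32 audio at 16 kHz
        text = model.transcribe(window, fp16=False)["text"].lower()
        # Fuzzy match so "straw berry" / "strawbery" still trigger.
        return any(
            difflib.SequenceMatcher(None, w, wake_word).ratio() > 0.8
            for w in text.split()
        )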
Also, I have VAD going with smart turn v3 (like I mentioned above) + I use browser/websocket for AEC + Barge-in (https://github.com/pncnmnp/strawberry/blob/main/audio_ws.py).
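The trick is that the browser's getUserMedia does the echo cancellation for free; the server side is just a small websocket loop, roughly like this (a sketch assuming raw PCM frames over the socket; feed_vad is a hypothetical stand-in for the downstream detection hook):

    import asyncio
    import websockets  # pip install websockets

    async def handle(ws):
        # Recent websockets versions pass just the connection.
        # The browser captures mic audio with echoCancellation: true,
        # so frames arriving here are already echo-cancelled, which is
        # what makes barge-in detection during TTS playback feasible.
        async for frame in ws:          # raw PCM bytes per message
            feed_vad(frame)             # hypothetical downstream hook

    async def main():
        async with websockets.serve(handle, "localhost", 8765):
            await asyncio.Future()      # run forever

    asyncio.run(main())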
I'm using the MacBook's built-in microphones for this, though, and I haven't fully tested it with other microphones. I've been ironing out the rough edges daily. I should write a quick blog post on this too.
Still, it's worth keeping in mind that these are no longer frontier models, unlike when they were released.
(Please Sam, if you read this, release the new realtime audio models)
- openai is wrong. almost all of the issues they described are issues with libwebrtc, not with webrtc, kubernetes, network architecture, etc. the clue was when they said "the conventional one-port-per-session WebRTC model."
- there are no alternatives worth trying. everything else open source in the ecosystem, like pion, coturn, stunner, is too immature.
- libwebrtc is the only game in town.
- they haven't discovered libwebrtc feature flags or how it works with candidates, which directly fix a bunch of the latency issues they're describing. the right feature flag can instantly reduce latency for free, versus paying for twilio-style network traversal solutions
- 99% of low latency voice END USERS will be in a network situation that can eliminate relays, transceivers, etc. it is totally first class on kubernetes. but you have to know something :)
this is the first time i'm experiencing gell-mann amnesia with openai! look, those guys are brilliant, but there's hardly anyone in the world doing this stuff correctly.
Even on the client side there are things like libpeer that hit targets libwebrtc can't.
Yet another reason not to consider anything like that for low-latency networking. Golang (or Rust and C++) is unmatched for this use case.
Node.js's initial release was May 27, 2009
Golang's initial release was November 10, 2009
They're different, yes, but it's not like one of them had years of a head start on the other.
lol, we definitely didn't need to know there are 900M weekly users for this post. I mean, yeah, there are a lot of users and they serve globally - that's relevant. But this is just pulling out your biggest stat because you can. How many voice users you have would actually be relevant and interesting, but - to baselessly speculate on motivation here - that might be a number that doesn't add as much fuel to an upcoming IPO as reminding people you're almost at a billion users does.
[0] https://github.com/pion/webrtc
[1] https://webrtcforthecurious.com
I read parts of it a while ago when I had an idea about using WebRTC data channels to pass data from databases to browser clients via a CLI. Your book made me understand that it's probably not a great fit for my use case, so I just used a centralized control plane and websockets instead.
I still feel like there's something fun to be done with WebRTC data channels + zero-copy Apache Arrow ArrayBuffers + DuckDB-WASM, but I haven't figured it out yet.
To me, Go code looks like somebody vomited stuff into the root dir and I have to wade through it every time. No namespacing. Nothing.
Go gives you all the tools to organise your code as much as you want (e.g. vanity domain names: with meta tags you can import a package as aol.com/hello even if the code actually lives at github.com/xha/src/pkg/aol).
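For anyone who hasn't seen it: the page at the vanity domain just serves a go-import meta tag pointing the toolchain at the real repo (the repo URL below is made up for the example):

    <meta name="go-import" content="aol.com/hello git https://github.com/xha/hello">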
And you're right, fanboys exist in every language. But changing the argument with whataboutism is a bit reductive.