But a few weeks ago someone on HN pointed me to Hex, which also supports Parakeet-V3, and incredibly enough, it is even faster than Handy because it's a native macOS-only app that leverages CoreML/Neural Engine for extremely quick transcriptions. Long ramblings transcribed in under a second!
It's now my favorite fully local STT for macOS:
My comment on this from a month back: https://news.ycombinator.com/item?id=46637040
I think the biggest difference between FreeFlow and Handy is that FreeFlow implements what Monologue calls "deep context", where it post-processes the raw transcription with context from your currently open window.
This fixes misspelled names if you're replying to an email, makes sure technical terms are spelled right, and so on.
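For illustration, here's a rough sketch of what a post-processing pass like that could look like. This is my reconstruction, not FreeFlow's actual code; it assumes the openai Python SDK (Groq exposes an OpenAI-compatible endpoint, so the same client works with the base URL swapped), and the model name is a placeholder:

```python
# Rough sketch of a "deep context" post-processing pass (my reconstruction,
# not FreeFlow's actual code). Assumes the openai SDK.
from openai import OpenAI

client = OpenAI()  # or OpenAI(base_url="https://api.groq.com/openai/v1", api_key=...)

def post_process(raw_transcript: str, window_text: str) -> str:
    prompt = (
        "Fix this dictated transcript. Use the on-screen context to correct "
        "misspelled names and technical terms. Return only the fixed text.\n\n"
        f"On-screen context:\n{window_text}\n\n"
        f"Transcript:\n{raw_transcript}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; any fast instruct model works
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content.strip()
```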
The original hope for FreeFlow was for it to use all local models like Handy does, but with local models the post-processing step pushed the pipeline to 5-10 seconds, versus <1 second with Groq.
Thank you for making Handy! It looks amazing, and I wish I'd found it before making FreeFlow.
You can go to Settings > Run Logs in FreeFlow to see the full pipeline run for each request, with the exact prompt and LLM response, so you can see exactly what is sent and returned.
pretty sure it's awesome - sorry OP about mentioning another project, we're all learning here :)
Surprisingly, it produced a better output (at least I liked its version) than the recommended but heavy model (Parakeet V3 @ 478 MB).
just bought the one-time licence. this is the future of AI pricing - local models and one-time fee.
F12 -> sox for recording -> temp.wav -> faster-whisper -> pbcopy -> notify-send to know what’s happening
https://github.com/sathish316/soupawhisper
I found a Linux version with a similar workflow and forked it to build the Mac version. It took less than 15 minutes of asking Claude to modify it to my needs.
F12 Press → arecord (ALSA) → temp.wav → faster-whisper → xclip + xdotool
https://github.com/ksred/soupawhisper
Thanks to faster-whisper and quantized local models, I use it everywhere I was previously using SuperWhisper: Docs, Terminal, etc.
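The whole loop is small enough to sketch in a few lines of Python. This is a minimal sketch, not the repo's actual code; it assumes sox and pbcopy are on PATH and faster-whisper is installed, and omits the F12 hotkey wiring:

```python
# Minimal sketch of the record -> transcribe -> clipboard loop.
import subprocess
from faster_whisper import WhisperModel

WAV = "/tmp/temp.wav"

# Record from the default mic until Ctrl-C (stand-in for the F12 toggle).
try:
    subprocess.run(["sox", "-d", "-r", "16000", "-c", "1", WAV])
except KeyboardInterrupt:
    pass

# A quantized model keeps CPU transcription fast.
model = WhisperModel("small", device="cpu", compute_type="int8")
segments, _ = model.transcribe(WAV)
text = " ".join(seg.text.strip() for seg in segments)

# pbcopy on macOS; on Linux, swap in: xclip -selection clipboard
subprocess.run(["pbcopy"], input=text.encode())
print(text)
```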
Edit: Ah, but Parakeet I think isn't available for free. Still a very worthwhile single-purchase app!
I had previously used Hyprnote to record meetings in this way (and I still use it as a backup; it's a great free option), but the prompt to record when a meeting starts and the better transcription offered by MacWhisper make for a much better experience.
Been a power user of SuperWhisper and Wispr Flow for a long time and eventually decided to unify those flows - memos & dictations, everything is a file and local first, BYOK
If you look at how authors dictate their works (which they have done for millennia), just getting the words written down is only the first step, and it's by far the easiest. I have been helping build a tool, https://bookscribe.ai, that not only does the transcription but can then post-process it to make it actually usable for longer-form content.
I built https://github.com/bwarzecha/Axii to keep EVERYTHING local and be fully open source - it can easily be used at any company. No data is sent anywhere.
If you are willing to use a service for transcriptions, Mistral (which is also European) works rather nicely if they support your language https://docs.mistral.ai/capabilities/audio_transcription#tra...
https://github.com/PawelAdamczuk/blah
Mine was only tested on an Arc GPU (the acceleration works nicely through Vulkan). It hooks into the Win32 API and simulates key presses, so it works in various non-obvious contexts.
The top feature is the per-app custom settings - you can pick different models and instructions for different apps and websites.
- I use the Parakeet fast model when working with Claude Code (VS Code app).
- And I use a smart one when I draft notes in Obsidian. I have a prompt to clean up my rambling and format the result with proper Markdown; very convenient.
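In spirit, that per-app routing is just a lookup on the focused window. A hypothetical sketch with invented names (not blah's real config format):

```python
# Hypothetical sketch of per-app routing: pick a model and post-processing
# prompt based on the focused window title. Names are invented for illustration.
APP_PROFILES = {
    "Visual Studio Code": {"model": "parakeet-v3", "prompt": None},  # fast, raw text
    "Obsidian": {
        "model": "smart-model",
        "prompt": "Clean up the rambling and format the result as Markdown.",
    },
}
DEFAULT_PROFILE = {"model": "parakeet-v3", "prompt": None}

def profile_for(window_title: str) -> dict:
    for app_name, profile in APP_PROFILES.items():
        if app_name in window_title:
            return profile
    return DEFAULT_PROFILE
```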
One more cool thing is that it allows me to use LLMs with audio input modalities directly (not as text post-processing): e.g., it sends the audio to Gemini and prompts it to transcribe, format, etc., in one run. I find it a bit slow to work with CC, but it is the absolute best model in terms of accuracy, understanding, and formatting. It is the only model I trust to understand what I meant and produce the correct result, even when I use multiple languages, tech terms, etc.
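That one-shot audio flow looks roughly like this; a sketch assuming the google-generativeai SDK, with a placeholder model name:

```python
# Sketch of the audio-in-one-run flow: Gemini gets the audio itself plus a
# single prompt to transcribe AND format, instead of post-processing text.
import google.generativeai as genai

genai.configure(api_key="YOUR_KEY")
model = genai.GenerativeModel("gemini-1.5-pro")  # placeholder model name

audio = genai.upload_file("recording.wav")
resp = model.generate_content([
    audio,
    "Transcribe this recording, remove filler words, and format the result "
    "as clean Markdown. Preserve technical terms and code identifiers.",
])
print(resp.text)
```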
https://github.com/kitlangton/Hex
for me it strikes the balance of good, fast, and cheap for everyday transcription. macwhisper is overkill, superwhisper too clever, and handy too buggy. hex fits just right for me (so far)
My take for X11 Linux systems. Small and low dependency except for the model download.
Chatterbox TTS (from Resemble AI) does the voice generation, WhisperX gives word-level timestamps so you can click any word to jump, and FastAPI ties it all together with SSE streaming so audio starts playing before the whole thing is done generating.
There's a ~5s buffer up front while the first chunk generates, but after that each chunk streams in faster than realtime. So playback rarely stalls.
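The streaming part is the interesting bit. A minimal sketch of the SSE endpoint in FastAPI, where generate_chunks is a hypothetical stand-in for the Chatterbox TTS call:

```python
# Minimal sketch of an SSE streaming endpoint: each audio chunk is pushed to
# the client as soon as it's generated, so playback can start early.
import base64
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def generate_chunks(text: str):
    # Placeholder: yield raw audio bytes per chunk as the TTS model produces them.
    yield b"\x00\x01"

@app.get("/speak")
def speak(text: str):
    def events():
        for chunk in generate_chunks(text):
            payload = json.dumps({"audio": base64.b64encode(chunk).decode()})
            yield f"data: {payload}\n\n"  # one SSE frame per audio chunk
        yield "data: [DONE]\n\n"
    # text/event-stream lets the client start playback before generation ends.
    return StreamingResponse(events(), media_type="text/event-stream")
```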
It took about 4 hours today... wild.
And then I set the button right below that as the Enter key, so it feels mostly hands-off the keyboard.
Greg Priest-Dorman [0][1] had other physical issues such that he had to regularly switch between sitting, standing, and walking during his workday. His solutions included (in part) some very specialized keypads, but STT might well have been another solution for someone with similar needs.
Another fellow on my team refuses to write/type anything other than pure code to solve issues at work, but will absolutely talk for hours on end about designs, considerations, issues, what-have-you, so we're actively trying to get him to adopt an STT-based workflow for knowledge transfer, writing tickets/bugs, etc.
[0]: https://computerhistory.org/profile/greg-priest-dorman/
[1]: https://www.cs.vassar.edu/people/priestdo/wearables/top
If you do that, the total pipeline takes too long for the UX to be good (5-10 seconds per transcription instead of <1s). I also had concerns around battery life.
Some day!
It’s free and offline
My only issue with it was that it cut off the words [I'm using] at the beginning and obviously it doesn't enter paragraph breaks. It took about 25 seconds to transcribe all of that on a 10th gen i7 laptop processor.
If they could also type out what was said while you're still talking (streaming the text live), it would be pretty perfect.
https://github.com/corlinp/voibe
I do see the name has since been taken by a paid service... shame.
The native app uses Parakeet (v2 or v3) on iOS.
It would be nice if below 0 it had a -1 option to keep all recordings.
Just use handy: https://github.com/cjpais/Handy
Won't be free when xAI starts charging.
Too bad that tool no longer seems to be developed. Looking for something similar. But it's really nice to see what's possible with local models.
The downside is that I couldn't get it to segment for different speakers. The consensus seemed to be to use a separate tool.
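For reference, a commonly suggested "separate tool" for this is a diarization pipeline like pyannote.audio. A sketch, assuming a Hugging Face token with access to the gated pretrained model:

```python
# Sketch of speaker segmentation with pyannote.audio; the pretrained pipeline
# is gated behind a Hugging Face token.
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="hf_...",  # placeholder token
)
diarization = pipeline("meeting.wav")

# Who spoke when; these turns can be merged with transcript timestamps.
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{turn.start:6.1f}s - {turn.end:6.1f}s  {speaker}")
```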
By "any GPU" you mean a physical, dedicated GPU card, right?
That's not a small requirement, especially on Macs.
https://news.ycombinator.com/item?id=46640855
I find the model works surprisingly well and in my opinion surpasses all other models I've tried. Finally a model that can mostly understand my not-so-perfect English and handle language switching mid-sentence (compare that to Gemini's voice input, which is literally THE WORST: it always tries to transcribe in the wrong language, and even when the language is correct it produces the utmost crap imaginable).