> TADA takes a different path. Instead of compressing audio into fewer fixed-rate frames of discrete audio tokens, we align audio representations directly to text tokens — one continuous acoustic vector per text token. This creates a single, synchronized stream where text and speech move in lockstep through the language model.
So basically just concatenating the audio vectors without compression or discretization?
I haven’t read the full paper yet (I know, I should before commenting), but this explanation puzzles me.
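My naive reading of that quoted paragraph, sketched in Python. Every name, shape, and the fusion-by-addition choice here is my guess, not from the paper:

```python
# Naive sketch of "one continuous acoustic vector per text token".
# All shapes and the additive fusion are assumptions; TADA may fuse differently.
import random

random.seed(0)
d_model, n_tokens = 8, 5  # hypothetical LM width and text sequence length

# One embedding row per text token, plus one continuous acoustic vector
# per token -- no discretization, no separate fixed-rate audio frames.
text_emb = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_tokens)]
acoustic = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_tokens)]

# A single synchronized stream: text and speech "move in lockstep",
# so the fused sequence is exactly as long as the text alone.
fused = [[t + a for t, a in zip(trow, arow)]
         for trow, arow in zip(text_emb, acoustic)]

assert len(fused) == n_tokens and len(fused[0]) == d_model
```

If that reading is right, there's nothing to "concatenate" lengthwise at all: the audio never adds sequence positions, it only enriches the existing text positions.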
I like me a good rabbit hole that's interesting and also digs into stereotypes.
Turns out, like many memes, it's not just that. It's (also?) a normal speech pattern, used by different genders, ages, and social groups, in many languages.
This doesn't mean that vocal fry isn't used as social signaling. But complaining about it, well, isn't that social signaling too?
Geoff Lindsey - Vocal Fry: what it is, who does it, and why people hate it! - https://www.youtube.com/watch?v=Q0yL2GezneU
Could you maybe trick it into thinking it was continuing a sample for an assistant use case if the sample was generic enough?
I appreciate them being honest about it though because otherwise I might spend two days trying to make it work.
All that said, I think it's likely this has been built and trained only on Nvidia hardware. The ones at home, such as Apple's, don't count for serious workflows that must run reliably.
Of course, the big interest these days is in cloud-based assistants, where synthesizing on the server and piggybacking on the rest of the answer is quite reasonable.
If you want to run this on a server to pipe the generated speech to a remote user (live, or generated ahead of time to send at the appropriate moment) and your servers don't have GPUs, then you either have to change your infrastructure, use the CPU, or not bother.
Renting GPU access on cloud systems can be more expensive than CPU, especially if you only need GPU processing for occasional tasks. Spinning up a VM to serve a request and then tearing it down is rarely as quick as cloud providers' advertising suggests, so you end up keeping things alive longer than strictly needed, meaning the quoted spot-pricing rates are lower than what you actually end up paying.
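A rough back-of-the-envelope version of that trade-off. All the prices below are made-up placeholders, not real cloud rates; the point is the shape of the comparison, not the numbers:

```python
# Hypothetical numbers only -- illustrating why a warm GPU VM loses to
# pay-per-use CPU at low utilisation. None of these are real prices.
gpu_hourly = 1.20           # keeping a GPU VM warm, $/hour (made up)
cpu_cost_per_sec = 0.00005  # pay-per-use CPU billing, $/second (made up)
cpu_secs_per_request = 3.0  # assumed synthesis time per request on CPU
requests_per_hour = 200     # sparse, bursty traffic

warm_gpu_cost = gpu_hourly  # you pay whether requests arrive or not
cpu_cost = requests_per_hour * cpu_secs_per_request * cpu_cost_per_sec

# At low utilisation the pay-per-use path wins easily; the warm GPU
# only pays off once it's kept busy for most of the hour.
print(f"warm GPU: ${warm_gpu_cost:.2f}/hr  pay-per-use CPU: ${cpu_cost:.2f}/hr")
```

With these placeholder rates the idle cost of the warm GPU dominates until traffic gets dense enough to saturate it, which matches the "occasional tasks" situation above.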
Even if utilisation weren't a metric, "efficient" can be interpreted in so many ways as to be pointless to try and apply in the general case. I consider any model I can foist into a Lambda function "efficient" because of secondary concerns you simply cannot meaningfully address with GPU hardware at present (elasticity and manageability for example). That it burns more energy per unit output is almost meaningless to consider for any kind of workload where Lambda would be applicable.
It's the same for any edge-deployed software, where "does it run on CPU?" translates to "does the general-purpose user have a snowball's chance in hell of running it?" Having to depend on 4GB of CUDA libraries to run a utility fundamentally changes the nature and applicability of any piece of software.
A few years ago we had smaller cuts of Whisper running at something like 0.5x real time on CPU, and people struggled along anyway. Now we have Nvidia's speech model family comfortably exceeding 2x real time on older processors, with a far better word error rate. Which would you prefer to deploy to an edge device? Which improves the total number of addressable users? Turns out we never needed GPUs for this problem in the first place; the model architecture mattered all along, as did the question, "does it run on CPU?".
It's not even clear-cut when discussing raw achievable performance. With a CPU-friendly speech model living in a Lambda, no GPU configuration will come close to the achievable peak throughput for the same level of investment. Got a year-long audio recording to process once a year? Slice it up and Lambda will happily chew through it at 500 or 1000x real time.
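The arithmetic behind that throughput claim, with illustrative numbers. The worker count and per-worker speed are assumptions for the sketch, not measurements:

```python
# Illustrative fan-out arithmetic; concurrency and per-worker speed are
# assumptions, not benchmarks.
per_worker_rtf = 2.0   # each CPU worker transcribes ~2x real time
workers = 500          # hypothetical concurrent Lambda invocations

aggregate_rtf = per_worker_rtf * workers   # 1000x real time in aggregate

audio_hours = 24 * 365                     # a year-long recording
wall_clock_hours = audio_hours / aggregate_rtf
print(f"{audio_hours} h of audio in ~{wall_clock_hours:.1f} h of wall clock")
```

This ignores chunking overhead and any cross-chunk context the model might want at the boundaries, but it shows why embarrassingly parallel batch transcription favors cheap, elastic CPU workers.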
Also, for inference (and not training) there are other ways to efficiently do matmuls besides the GPU. You might want to look up Apple's undocumented AMX CPU ISA, and also the thing vendors call the "Neural Engine" in their marketing (its capabilities and the term's specific meaning vary broadly from vendor to vendor).
For small 1-3B parameter transformers like TADA, both of these options are much more energy-efficient than GPU inference.
The "Anger Speech" has an obvious lisp (Maybe a homage to Elmer Fudd?). But I hear a similar, but more subtle, speech impediment in the "Adoration Speech". The "Fearful Speech" might have a slight warble to it. And the "Long Speech" is difficult to evaluate because the speaker has vocal fry to an extent that I find annoying.
Was it trained on Sam Altman?