> TADA takes a different path. Instead of compressing audio into fewer fixed-rate frames of discrete audio tokens, we align audio representations directly to text tokens — one continuous acoustic vector per text token. This creates a single, synchronized stream where text and speech move in lockstep through the language model.
So basically just concatenating the audio vectors without compression or discretization?
I haven’t read the full paper yet (I know, I should before commenting), but this explanation puzzles me.
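My naive reading of that quoted paragraph, sketched in Python. Every name, shape, and the fusion-by-addition choice here is my guess, not from the paper:

```python
# Naive sketch of "one continuous acoustic vector per text token".
# All shapes and the additive fusion are assumptions; TADA may fuse differently.
import random

random.seed(0)
d_model, n_tokens = 8, 5  # hypothetical LM width and text sequence length

# One embedding row per text token, plus one continuous acoustic vector
# per token -- no discretization, no separate fixed-rate audio frames.
text_emb = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_tokens)]
acoustic = [[random.gauss(0, 1) for _ in range(d_model)] for _ in range(n_tokens)]

# A single synchronized stream: text and speech "move in lockstep",
# so the fused sequence is exactly as long as the text alone.
fused = [[t + a for t, a in zip(trow, arow)]
         for trow, arow in zip(text_emb, acoustic)]

assert len(fused) == n_tokens and len(fused[0]) == d_model
```

If that reading is right, there's nothing to "concatenate" lengthwise at all: the audio never adds sequence positions, it only enriches the existing text positions.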
I like me a good rabbit hole that's interesting and also digs into stereotypes.
Turns out, like many memes, it's not just that. It's (also?) a normal speech pattern, used by different genders, ages, and social groups, in many languages.
This doesn't mean that vocal fry isn't used as social signaling. But complaining about it, well, isn't that social signaling too?
Geoff Lindsey - Vocal Fry: what it is, who does it, and why people hate it! - https://www.youtube.com/watch?v=Q0yL2GezneU
Could you maybe trick it into thinking it was continuing a sample for an assistant use case if the sample was generic enough?
I appreciate them being honest about it though because otherwise I might spend two days trying to make it work.
All that said, I think it's likely this has been built and trained only on Nvidia hardware. The ones at home, such as Apple's, don't count for serious workflows that must run reliably.
Of course, the big interest these days is in cloud-based assistants, where synthesizing on the server and piggybacking on the rest of the answer is quite reasonable.
If you want to run this on a server to pipe the generated speech to a remote user (live, or generated ahead of time to send at the appropriate moment) and your servers don't have GPUs, then you either have to change your infrastructure, use the CPU, or not bother.
Renting GPU access on cloud systems can be more expensive than CPU, especially if you only need GPU processing for occasional tasks. Spinning up a VM to serve a request and then tearing it down is rarely as quick as cloud providers' advertising suggests, so you end up keeping things alive longer than strictly needed, meaning the quoted spot-pricing rates are lower than what you actually end up paying.
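A rough back-of-the-envelope version of that trade-off. All the prices below are made-up placeholders, not real cloud rates; the point is the shape of the comparison, not the numbers:

```python
# Hypothetical numbers only -- illustrating why a warm GPU VM loses to
# pay-per-use CPU at low utilisation. None of these are real prices.
gpu_hourly = 1.20           # keeping a GPU VM warm, $/hour (made up)
cpu_cost_per_sec = 0.00005  # pay-per-use CPU billing, $/second (made up)
cpu_secs_per_request = 3.0  # assumed synthesis time per request on CPU
requests_per_hour = 200     # sparse, bursty traffic

warm_gpu_cost = gpu_hourly  # you pay whether requests arrive or not
cpu_cost = requests_per_hour * cpu_secs_per_request * cpu_cost_per_sec

# At low utilisation the pay-per-use path wins easily; the warm GPU
# only pays off once it's kept busy for most of the hour.
print(f"warm GPU: ${warm_gpu_cost:.2f}/hr  pay-per-use CPU: ${cpu_cost:.2f}/hr")
```

With these placeholder rates the idle cost of the warm GPU dominates until traffic gets dense enough to saturate it, which matches the "occasional tasks" situation above.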
Even if utilisation weren't a metric, "efficient" can be interpreted in so many ways as to be pointless to try and apply in the general case. I consider any model I can foist into a Lambda function "efficient" because of secondary concerns you simply cannot meaningfully address with GPU hardware at present (elasticity and manageability for example). That it burns more energy per unit output is almost meaningless to consider for any kind of workload where Lambda would be applicable.
It's the same for any edge-deployed software, where "does it run on CPU?" translates to "does the general-purpose user have a snowball's chance in hell of running it?" Having to depend on 4GB of CUDA libraries to run a utility fundamentally changes the nature and applicability of any piece of software.
A few years ago we had smaller cuts of Whisper running at something like 0.5x real time on CPU, and people struggled along anyway. Now we have Nvidia's speech model family comfortably exceeding 2x real time on older processors, with a far better word error rate. Which would you prefer to deploy to an edge device? Which improves the total number of addressable users? Turns out we never needed GPUs for this problem in the first place; the model architecture mattered all along, as did the question, "does it run on CPU?".
It's not even clear-cut when discussing raw achievable performance. With a CPU-friendly speech model living in a Lambda, no GPU configuration will come close to the achievable peak throughput for the same level of investment. Got a year-long audio recording to process once a year? Slice it up and Lambda will happily chew through it at 500 or 1000x real time.
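The arithmetic behind that throughput claim, with illustrative numbers. The worker count and per-worker speed are assumptions for the sketch, not measurements:

```python
# Illustrative fan-out arithmetic; concurrency and per-worker speed are
# assumptions, not benchmarks.
per_worker_rtf = 2.0   # each CPU worker transcribes ~2x real time
workers = 500          # hypothetical concurrent Lambda invocations

aggregate_rtf = per_worker_rtf * workers   # 1000x real time in aggregate

audio_hours = 24 * 365                     # a year-long recording
wall_clock_hours = audio_hours / aggregate_rtf
print(f"{audio_hours} h of audio in ~{wall_clock_hours:.1f} h of wall clock")
```

This ignores chunking overhead and any cross-chunk context the model might want at the boundaries, but it shows why embarrassingly parallel batch transcription favors cheap, elastic CPU workers.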
Also, for inference (and not training) there are other ways to efficiently do matmuls besides the GPU. You might want to look up Apple's undocumented AMX CPU ISA, and also the thing vendors call the "Neural Engine" in their marketing (its capabilities and the term's specific meaning vary broadly from vendor to vendor).
For small 1-3B parameter transformers like TADA, both of these options are much more energy-efficient than GPU inference.
The "Anger Speech" has an obvious lisp (Maybe a homage to Elmer Fudd?). But I hear a similar, but more subtle, speech impediment in the "Adoration Speech". The "Fearful Speech" might have a slight warble to it. And the "Long Speech" is difficult to evaluate because the speaker has vocal fry to an extent that I find annoying.
Was it trained on Sam Altman?