4TB of voice samples just stolen from 40k AI contractors at Mercor
67 points by Oravys 5 hours ago | 16 comments

eqvinox 60 minutes ago
The only data that cannot be stolen or leaked is data that doesn't exist. Hard lesson for both users and companies.

Germans (because of course) have a word for this: "Datensparsamkeit". Being frugal with your data.

reply
wlesieutre 46 minutes ago
I miss the pre-LLM days when you could make a decent argument that having any unnecessary data was just a liability. Now all anybody thinks is “more data for the AI!”
reply
CincinnatiMan 38 minutes ago
Were you not around for the Big Data heyday a decade ago?
reply
varispeed 30 minutes ago
Once thumb drives became large enough to fit most datasets, it stopped being Big Data. Just normal data.
reply
citrin_ru 26 minutes ago
Data hoarding predates LLMs. There were other machine learning methods that also needed data for training.
reply
Forgeties79 13 minutes ago
“Before LLMs there was _____”

I see this whenever an LLM's impact is assessed. We know. The issue is scale, and the ability of smaller and smaller groups (down to individuals) to execute at that scale.

Fake news always existed. Now one dude in India can flood multiple sock puppet media accounts with right wing content/images (actual example) at a scale previously unimaginable.

reply
Oravys 5 hours ago
Author here. Wrote this after watching Lapsus$ post the Mercor archive on their leak site earlier this month. The thing that struck me is the combination: voice samples paired with ID document scans. Most breaches leak one or the other. This one ships a deepfake-ready kit. Tried to keep the writeup practical: what an attacker can actually do with this combo (banking voiceprint bypass, Arup-style video calls, insurance fraud), and a 5-step checklist for the contractors who were in the dump.

Happy to discuss the forensic detection side: AudioSeal watermarks, AASIST anti-spoofing, and how the detection landscape changes once voice biometrics start leaking at scale.
reply
embedding-shape 17 minutes ago
I wonder how many of the current text-to-speech ML models have large parts of leaked or "stolen" data in their training data? Almost none of the TTS releases seem to talk about exactly where they get their training data from, for some reason. I also wonder if we'll see an explosion in SOTA TTS in ~6 months from now.
reply
hirako2000 11 minutes ago
It's already there. And keeps moving.

Even have a nice UI on top.

https://voicebox.sh/

reply
john_strinlai 15 minutes ago
>Set up a verbal codeword with family and finance contacts. Pick a phrase that has never been spoken on a recording and never typed in chat. Brief the people who handle money on your behalf. If a call ever asks for a transfer, the codeword is mandatory.

good luck with this. most finance people deal with hundreds to thousands of clients. they obviously can't remember everyone's codeword. commonly used finance systems aren't set up to securely store these codewords, and they don't have processes or policies in place to implement or adhere to any sort of codeword verification.
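To be fair, the storage half is a solved problem: treat the codeword exactly like a password. A minimal sketch of what that would look like (the `enroll`/`verify` helpers are hypothetical, not from any real finance system):

```python
import hashlib
import hmac
import os

def enroll(codeword: str) -> tuple[bytes, bytes]:
    """Hash a client's codeword like a password: random per-client salt + scrypt."""
    salt = os.urandom(16)
    digest = hashlib.scrypt(codeword.encode(), salt=salt, n=2**14, r=8, p=1)
    return salt, digest

def verify(codeword: str, salt: bytes, digest: bytes) -> bool:
    """Re-derive the hash and compare in constant time."""
    candidate = hashlib.scrypt(codeword.encode(), salt=salt, n=2**14, r=8, p=1)
    return hmac.compare_digest(candidate, digest)

salt, digest = enroll("correct horse battery staple")
print(verify("correct horse battery staple", salt, digest))  # True
print(verify("wrong phrase", salt, digest))                  # False
```

the hard part, as you say, is the process side: getting thousands of agents to actually demand the phrase before moving money.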

>Rotate where voiceprints are still in use. [...] Do that now, ideally from a new recording in a different acoustic environment than the leaked sample.

would this even have an effect? I have never heard of "rotating" a voice print. isn't the whole point of a voice print that you can't really change it? if simply switching your environment completely changes your voice print, that would make voice prints utterly useless to begin with.

reply
amarcheschi 33 minutes ago
I've been doing similar things on a different platform because as a uni student the pay is kinda nice, but I limit myself to tasks without voice/video and just mouse/keyboard input for reinforcement learning/data tagging. No way I'm trusting these companies or the companies they contract the work out to.
reply
VladVladikoff 45 minutes ago
Man that’s pretty shitty that Mercor tricked 40k contractors, and then did a poor job of securing their data. There should be stronger consequences for stuff like this.
reply
Havoc 40 minutes ago
I love how the check if you're affected involves giving a voice sample to whatever the fuck that website is
reply
josefritzishere 51 minutes ago
This kind of event is the best argument against needless data hoarding. But it would help if the law provided for some kind of consequence for this sort of negligence.
reply
throw0101c 39 minutes ago
"My voice is my passport. Verify me."

:)

reply
jacquesm 51 minutes ago
You could have seen this coming a mile away. So far I have gotten away with never uploading my ID or interacting with one of those companies (though one idiot working for some VC thought it was OK to sign a document on my behalf by uploading my signature!! never mind a bit of fraud), but it is getting harder and harder. Banks and in some cases even governments forcing you to send data to these operators is a very bad idea. But hey, who ever got hurt by some security theater?

I had to open a bank account for a company here a few years ago, right on the cusp of this happening, and they still had the option to come by in person with the proper documentation, which I did. Now it is all outsourced.

These companies are the fattest targets and they're run by incompetents. You should assume that anything you give them will eventually be part of some hack.

reply