A PDF that changes based on who is reading
31 points by SarthakGaud 2 hours ago | 13 comments

gpvos 51 minutes ago
I would suggest changing the title to the actual title of the article: Adaptive PDFs.

Assuming the program works, the PDF will not actually look different to me than to anyone else looking at it, so there is nothing that "changes based on who is reading". It is just that text extraction, a wholly different (and much fuzzier) process than viewing the PDF, and something that the same person can do, will now return structured (Markdown) text. (One might say the PDF changes based on how you are reading it.) A great idea, IMHO.

reply
dredmorbius 14 minutes ago
Email the mods: <https://news.ycombinator.com/item?id=40493683>.

hn@ycombinator.com

reply
mc32 8 minutes ago
Having slightly different versions would certainly help identify leakers of certain kinds of documents. That would be of interest to some kinds of organizations or departments within organizations.
reply
Tomte 7 minutes ago
> LaTeX, Chrome's print-to-PDF, most export tools don't produce tags

LaTeX is actually one of the best ways to create tagged PDF: https://latex3.github.io/tagging-project/tagging-status/ and https://www.overleaf.com/learn/latex/An_introduction_to_tagg...

reply
gnunicorn 44 minutes ago
Just because everything is a potential threat vector now: doesn't this also mean you could easily put AI-specific malicious instructions into the PDF that a regular human would never notice?

Like the "white text between the lines that only appears when copy-pasted" hack that some professors have been putting in their exercises so that students' output includes pink elephants and stuff. But worse. Just thinking of an electricity bill PDF you provide as proof of address to some company that uses an LLM to extract that address and pre-process that doc. But instead we can command it to do something else that a regular human wouldn't ever notice...

Just a thought

reply
al_hag 16 minutes ago
In the US, publicly funded organizations are required to code their PDF with semantic structure to support machine access by screen readers and other assistive technologies [1], [2].

Given the low adherence to accessibility standards, e.g. in academic publishing [3], it would be marvelous if LLM parsing needs created a commercial incentive for comparable structured access.

[1] https://www.section508.gov/create/pdfs/common-tags-and-usage...

[2] https://pdfa.org/resource/tagged-pdf-best-practice-guide-syn...

[3] https://arxiv.org/html/2410.03022v1

reply
Xotic007 11 minutes ago
Cool, but it's relying on every extractor honoring that replacement-text property, which you said yourself is hit or miss. So it's clean Markdown until someone runs it through a tool that ignores the property, quietly gets the messy version, and has no idea that happened.
reply
jheimark 53 minutes ago
This looks really interesting. Optimizing for humans vs. agents feels like the new wave of Desktop vs. Mobile (where mobile won) - agents are going to win even faster.

Where is the repo? It's mentioned but I can't find it.

reply
jheimark 52 minutes ago
reply
gpvos 41 minutes ago
Looks like it, the author's name matches.
reply
jexp 40 minutes ago
Hasn’t it been possible since forever to put machine-readable source information into PDF metadata? It’s more a problem of the tools and programs generating the PDFs.

We spend millions turning structured information into PDFs and billions extracting the same data from a printer rendering language.

reply
neonmagenta 21 minutes ago
Exactly. But we have no real coordination or uniform application in how we're creating PDFs across all these programs, so we always end up with a fun mix of what will and won't be static, scalable, searchable.
reply
iLoveOncall 35 minutes ago
I'd be more interested in the contrary. A PDF that ensures it's only readable by humans.

I guess the exact same technique can actually be used.

reply