I rendered 1,418 confusables over 230 fonts. Most aren't confusable to the eye
105 points by paultendo 3 days ago | 51 comments

jonhohle 10 hours ago
About 20 years ago I used Cyrillic confusables to watermark internal documentation that was being leaked by a disgruntled customer service employee. The document would dynamically render and include the employee ID encoded as bits in the text. It survived copy/paste to plain text well.

I did run into some issues in early versions when characters in Linux commands or visible web addresses were replaced. Fortunately the source docs were HTML, and it was easy to exclude code or pre nodes when rendering.

I thought this was so clever, but the leaker was never caught using it, to the best of my knowledge.
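
A minimal sketch of the kind of scheme described above, not the original implementation. It assumes a hand-picked, hypothetical table of Latin/Cyrillic lookalikes, a 16-bit ID, and source text that contains no Cyrillic of its own. Each eligible Latin letter carries one bit: left alone for 0, swapped for its Cyrillic twin for 1.

```python
# Hypothetical carrier table; a real scheme would only use pairs verified
# to render identically in the fonts the readers actually see.
LATIN_TO_CYRILLIC = {
    "a": "\u0430", "c": "\u0441", "e": "\u0435", "o": "\u043e", "p": "\u0440",
}
CYRILLIC_TO_LATIN = {cyr: lat for lat, cyr in LATIN_TO_CYRILLIC.items()}


def embed_id(text, employee_id, bits=16):
    """Hide `employee_id` in the first `bits` carrier letters of `text`."""
    payload = [(employee_id >> i) & 1 for i in range(bits)]
    out = []
    used = 0
    for ch in text:
        if used < bits and ch in LATIN_TO_CYRILLIC:
            # Bit 1: swap in the Cyrillic twin. Bit 0: keep the Latin letter.
            out.append(LATIN_TO_CYRILLIC[ch] if payload[used] else ch)
            used += 1
        else:
            out.append(ch)
    if used < bits:
        raise ValueError("text has too few carrier letters for the ID")
    return "".join(out)


def extract_id(text, bits=16):
    """Recover the ID by reading which carrier positions were swapped."""
    value = 0
    seen = 0
    for ch in text:
        if seen >= bits:
            break
        if ch in CYRILLIC_TO_LATIN:
            value |= 1 << seen
            seen += 1
        elif ch in LATIN_TO_CYRILLIC:
            seen += 1
    if seen < bits:
        raise ValueError("text has too few carrier letters for the ID")
    return value


text = "please open the account page and escalate to a coordinator"
stamped = embed_id(text, 40321)
recovered = extract_id(stamped)
```

Skipping code and pre nodes, as mentioned above, would happen before this step: only the prose text nodes would be passed to `embed_id`.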

reply
sigwinch 8 hours ago
We did this with variations of white space characters.
reply
apothegm 3 days ago
Maybe not at super large font sizes. But even lowercase i and l are easy enough to confuse at a glance mid-word in most sans-serif fonts, not to mention uppercase I and lowercase l. You don’t even need “confusable” glyphs to create a domain name that will stand up to a casual visual confirmation from a busy user in a phishing context.
reply
hinkley 3 days ago
Every Albert, Alfred, or Alphonso who goes by “Al” getting confused with bots right now…
reply
rithdmc 12 hours ago
I often ask my friend Alan to review what I've created, so I can tell people it has been enhanced with Al.
reply
thih9 16 hours ago
Perhaps there are people named “Alexa” who started using “Al” after Amazon’s launch. Talk about bad luck.
reply
tliltocatl 3 days ago
I used to read "Weird Al" as "AI" even before the LLM craze.
reply
LorenPechtel 3 hours ago
I recently spent way too much time on a bug that only showed up in a large data set. (Turned out a walker had a problem with certain leaf patterns.) Put a trap on a string that looked unique--even after I had actually found the problem and fixed it, the trap still couldn't find the offending text. Sans serif, l vs I.
reply
dec0dedab0de 9 hours ago
we used to mess with our friends by making AIM screen names that looked identical, or super close, to theirs, then messaging other friends in the group. Or going into chat and saying things like "im a big dumb idiot"

This was like 1998-2003, and non technical people were doing it too. I think I am the only one from that friend group who would even consider that as something to watch out for.

reply
ordu 16 hours ago
But what about 'Ы'? It looks like 'bl', doesn't it? 'Ы' is one codepoint and one glyph, though 'bl' is a sequence of two letters. I believe that the method described will miss such things. Cyrillic also has 'Ю', I suppose it is possible to design a font that makes it look like 'lO'? Are there any fonts like this in the wild?
reply
Sharlin 11 hours ago
Yes, it's one of the things listed in the limitations section.
reply
noname120 8 hours ago
This other article from the same author is more interesting: https://paultendo.github.io/posts/unicode-confusables-nfkc-c...
reply
vivid242 15 hours ago
Thanks for the effort!

I'm always intrigued by how the German FE-Schrift ("fälschungserschwerende Schrift", "more-difficult-to-forge font") chooses shapes for characters that make it hard for them to be turned into one another (like a 3 into an 8 or so):

https://en.wikipedia.org/wiki/FE-Schrift

reply
Terr_ 15 hours ago
As a youth in the DOS era, I was always enamored of fonts like OCR-A. There is some overlap between the problems of "make it easy to distinguish" and "make it hard to maliciously corrupt", although I can imagine some cases where they might be in conflict, especially if adding ink is asymmetrically easier than removing or covering it.

https://en.wikipedia.org/wiki/OCR-A

reply
schiffern 6 hours ago
See also: Chinese banking (anti-fraud) numerals.

https://en.wikipedia.org/wiki/Chinese_numerals#Financial_num...

reply
rob74 14 hours ago
What I have always wondered about with FE-Schrift: they painstakingly made all glyphs distinguishable, but completely f'ed it up with V and Y: the "stalk" of the Y is vertical and so short that they're very easy to confuse. They could have made the "stalk" slanted, or even curved like in lowercase "g", and most people would have still recognized it as a "Y"...
reply
wongarsu 13 hours ago
A slanted stalk might have made it too close to an X with a removed lower left arm. But a curved stalk does seem like it would have been an easy improvement
reply
albert_e 10 hours ago
> 82 pairs are pixel-identical

> a string like “аpple.com” with Cyrillic а (U+0430) is pixel-identical to “apple.com” in 40+ fonts. The user, the browser’s address bar, and any visual review process all see the same pixels. This is not theoretical. It is a measured property of the font files shipping on every Mac.

Current implementations of "Computer Use" Agentic AI tools mostly use visuals -- screenshotting of a computer screen and interpreting it.

These pixel-identical character pairs will be a straight failure mode for those automations and could possibly be a threat vector if crafted well.

reply
pitched 10 hours ago
I don’t think a human could tell the difference either. This will make phishing emails much more effective.
reply
jeroenhd 14 hours ago
An interesting attempt, Claude. However, your prompt is missing an important step to measure effectiveness against humans: wait 40-60 years for your vision to degrade naturally, and check the confusables again, preferably on a small phone screen. Bonus points if you can find someone with visual disabilities from birth. Obviously most attacks aren't pixel-perfect, but that's not the point, all you need to confuse are human eyes.

Things like the Fraktur characters are obvious mismatches in any font I know, I do wonder why they're on the list.

reply
staticassertion 11 hours ago
Staring in keratoconus
reply
keeda 7 hours ago
I'm not an expert, I've just been "vibe-R&D"-ing computer vision for a bit now, but I'll guarantee you SSIM is not suitable for this purpose. I've been dabbling in basically this area (comparing small, potentially low-resolution images) and SSIM produces a lot of false negatives and some false positives.

I would recommend template matching using normalized cross-correlation (TM_CCOEFF_NORMED in opencv.)

Also this paper from Nvidia critically scrutinizing SSIM may be relevant: https://research.nvidia.com/publication/2020-07_Understandin...
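
For reference, a plain-Python sketch of the score that suggestion refers to. This is not the OpenCV call itself: it assumes two grayscale glyph bitmaps of identical size, given as nested lists, in which case zero-mean normalized cross-correlation (what TM_CCOEFF_NORMED computes for a single window) reduces to the Pearson correlation of the pixel values. The function name `ncc` and the tiny test bitmaps are illustrative only.

```python
import math


def ncc(image_a, image_b):
    """Zero-mean normalized cross-correlation of two same-size bitmaps.

    Returns a value in [-1, 1]; 1 means identical up to brightness and
    contrast, -1 means one is the inverse of the other.
    """
    xs = [p for row in image_a for p in row]
    ys = [p for row in image_b for p in row]
    if len(xs) != len(ys):
        raise ValueError("bitmaps must have the same number of pixels")

    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)

    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    if den == 0:
        # At least one bitmap is completely flat, so correlation is undefined;
        # report "no similarity" rather than dividing by zero.
        return 0.0
    return num / den


plus = [[0, 1, 0], [1, 1, 1], [0, 1, 0]]
inverted = [[1 - p for p in row] for row in plus]
blank = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]

same_score = ncc(plus, plus)
inverse_score = ncc(plus, inverted)
blank_score = ncc(plus, blank)
```

With real glyph renders the bitmaps would first need to be cropped and aligned to a common size, which is its own source of error.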

reply
bawolff 8 hours ago
That's super interesting, but at the same time, i think the primary concern is not if they are literally the same but if a user is likely to confuse them in a small font you don't have control over in a place they are not likely to pay attention to (e.g. address bar).

Like even if the two characters look quite different, if they both look like the same letter in different fonts that is a problem. It doesn't matter if you can tell the difference between the glyphs in a side by side comparison. What matters is what letter the user interprets the glyph as.

reply
Grom_PE 16 hours ago
0 and O, and l and I that look the same in a single font is a crime of modern typography.

Also, I remember 8x16 VGA font that came with KeyRus had some slight differences between Cyrillic and Latin lookalikes, that brought some strange sense of comfort when reading, and especially typing the letter c, because its Cyrillic lookalike is located on the same key.

reply
leoedin 14 hours ago
The font the arduino editor uses renders l and 1 exactly the same. Utter madness in a beginner programming context.
reply
chii 17 hours ago
> A domain using only Cyrillic characters that happen to spell a Latin word (like “аpple” in all-Cyrillic) may still render in the address bar’s font and look identical.

that is very interesting.

I imagine the browser could take some context clues and switch rendering to puny code if the locale of the user is nowhere near a cyrillic region. But that is only going to patch some edge cases and miss others.

Ideally, the solution is password managers everywhere, which don't have this vulnerability, instead of relying on human eyes to visually recognize web urls, which is inherently vulnerable.

reply
bojan 14 hours ago
> I imagine the browser could take some context clues and switch rendering to puny code if the locale of the user is nowhere near a cyrillic region.

Anyone reading this - please, please, please do not make any assumptions based on the end-user's geography.

Signed, someone who can cross 3 national and 4 language borders within a few hours of driving.

reply
jdranczewski 15 hours ago
The article mentions this only briefly, but browsers already do this kind of heuristic protection! See https://en.wikipedia.org/wiki/IDN_homograph_attack#Defending... or https://chromium.googlesource.com/chromium/src/+/main/docs/i... for a Chrome-specific blog post.

I think the lack of exploration of the context around the problem and current mitigations is an issue with the article - it spends a lot of time talking about the possible threat, but very little time on whether the attack is actually practical with modern mitigations.
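
To make the idea concrete, here is a deliberately crude sketch of a mixed-script check. It is not Chrome's actual policy, which is considerably more involved. As an assumption for illustration, it approximates a character's script by the first word of its Unicode name (for example "CYRILLIC" or "LATIN") and falls back to Punycode for any label that mixes scripts. The function names are made up for this example.

```python
import unicodedata


def scripts_in_label(label):
    """Return the rough set of scripts used by the letters in one DNS label."""
    scripts = set()
    for ch in label:
        if ch.isdigit() or ch == "-":
            continue  # digits and hyphens are shared by every script
        # Approximation: first word of the Unicode character name.
        scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts


def display_form(domain):
    """Show mixed-script labels as Punycode, everything else as typed."""
    shown = []
    for label in domain.split("."):
        if len(scripts_in_label(label)) > 1:
            shown.append("xn--" + label.encode("punycode").decode("ascii"))
        else:
            shown.append(label)
    return ".".join(shown)


plain = display_form("apple.com")
spoofed = display_form("\u0430pple.com")  # Cyrillic а followed by Latin "pple"
```

A label written entirely in Cyrillic passes this check untouched, which is exactly the whole-script case quoted from the article; catching that needs extra rules beyond mixed-script detection.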

reply
olsondv 11 hours ago
Not to mention it would only apply to clicking spoofed links. Unless the keyboard mapping was compromised, those letters won’t be typed.
reply
alterom 15 hours ago
>> A domain using only Cyrillic characters that happen to spell a Latin word (like “аpple” in all-Cyrillic) may still render in the address bar’s font and look identical

Here you go:

https:// аррlе.соm

(using English "l" and "m" here, Russian м looks different)

reply
nnevatie 10 hours ago
Hmm, is SSIM a good metric for comparing fonts? I'd imagine it isn't ideal, as fonts are mostly textureless and SSIM has no concept of glyph identity or typographic intent.
reply
keeda 7 hours ago
You're right, it's not. I just posted this comment: https://news.ycombinator.com/item?id=47182655
reply
lallysingh 8 hours ago
I think we'll have to start configuring our client tools (e.g. browser, email client, etc) to render domain names with annotations for different character classes. E.g. our native character set is a standard color (blue/black) and then other character sets would have to stand out (purple background?).
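
A plain-text sketch of that idea, with bracketed tags standing in for a background color. It assumes the reader's "native" script is Latin and, as a rough approximation, takes a letter's script from the first word of its Unicode name. The function name `annotate` is hypothetical.

```python
import unicodedata


def annotate(text, native="LATIN"):
    """Replace every letter outside the native script with a visible tag."""
    out = []
    for ch in text:
        if not ch.isalpha():
            out.append(ch)  # dots, digits and hyphens pass through unchanged
            continue
        script = unicodedata.name(ch, "UNKNOWN").split()[0]
        if script == native:
            out.append(ch)
        else:
            out.append(f"[{script} U+{ord(ch):04X}]")
    return "".join(out)


clean = annotate("apple.com")
flagged = annotate("\u0430pple.com")  # leading letter is Cyrillic а
```

Accented Latin letters such as ü count as Latin here, so they would not be flagged by this particular sketch.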
reply
q0uaur 7 hours ago
i'm pretty sure Mox (email server with included webui written in Go) does that - at least the Umlauts in mails i get from Hetzner seem to always stand out.

it also defaults to not loading HTML in emails, which i love. really opened my eyes to how dumb it really is to just accept all kinds of dynamic content in unknown messages. (kinda same as how the modern web relies on remote code execution to work)

reply
dangond 9 hours ago
Good read (as is the next article in the series), but you can tell it hasn't been proofread due to "paypa׀.com" being described as a danger. Maybe in a different font than the website's, but in that case, maybe this should have been rendered out.
reply
serial_dev 11 hours ago
Was it a demo site? The font looks very wonky, not sure if I should copy-paste from it.
reply
Oarch 3 days ago
This is really cool. I loved the technical breakdown and side by side comparisons. Surprised to hear that Microsoft and MacOS default fonts didn't score so well!
reply
recursivecaveat 16 hours ago
This seems misguided. The fact that 'ρ' isn't a pixel for pixel match for 'p' doesn't mean they're not confusable. The threat model is not being unable to solve a spot-the-difference puzzle. Unless you are familiar with every pixel of your system fonts, and carefully scrutinize every character on your screen, the lack of an exact match in jρmorgan[.]com in a URL is going to do very little for you. There are many English characters that have multiple totally distinct ways to write them, so you can have two 'a' variants that are distinct but equally 'normal' looking. I guess if you get an LLM to write your blog posts they don't have to make much sense to begin with.
reply
Sharlin 11 hours ago
To be fair, the correlation threshold they used was 0.7 for confusable, and 0.3-0.7 for contextually confusable. But I definitely would have liked to see some examples of glyph pairs at around 0.5 correlation. And at small font sizes realistic in actual threat scenarios.
reply
rustyhancock 10 hours ago
Oof, I couldn't get far in this; the font is giving me motion sickness somehow.

Was that the intention?

reply
doctorpangloss 18 hours ago
well, you didn't really do anything, did you? Claude Code rendered these things and wrote the blog post haha

> "This is not theoretical. It is a measured property of the font files shipping on every Mac."

some patterns of speech are so recognizably LLM, i am convinced that the AI detection startups have a very strong chance to succeed on text.

reply
deaux 17 hours ago
Going off on a bit of a tangent here..

> some patterns of speech are so recognizably LLM, i am convinced that the AI detection startups have a very strong chance to succeed on text.

The problem for them is the market. Those who actually want to buy AI detection tools usually want the impossible - detecting any kind of AI-written text, or even AI-written-human-edited text.

You're right in that many HN articles (not going to comment on this one specifically) are very easy to detect. But that's just because these article writers are too lazy to even use any of the plethora of tools that remove the smells automatically, or tools that write without them in the first place (I've made such a tool myself), or even just adjusting the prompt to write in a different style that avoids them.

Most people who would be interested in paying for AI detection tools want them to detect all of the above cases too, which is of course impossible.

reply
roywiggins 8 hours ago
I don't know how people read this sort of LLM output without their eyes glazing over and tuning it out. Every blog post authored or substantially edited by Claude sounds the same sort of vaguely pompous and stilted, surely people are bored of it by now? But apparently not.
reply
jcynix 16 hours ago
Yes, some patterns of speech are recognizable … The "That's LLM generated" pattern is one of those. And while I can understand the motivation behind this, I find it more irritating now than LLM texts, if these contain useful information, which make me curious.

This text made me curious, I liked the approach the author has taken. And it made me think how I would do it. My first idea would be to use ImageMagick to render text and then use ImageMagick's https://imagemagick.org/script/compare.php to somehow calculate the risk of confounding glyphs.

So: Don't be snarky? Maybe we need another rule here, to limit comments on "LLM style" https://news.ycombinator.com/newsguidelines.html

reply
aronhegedus 17 hours ago
However it was written, it’s a useful and well structured article. I thought it was a good read
reply
alterom 15 hours ago
I mean, no shit Sherlock, Cyrillic letters being indistinguishable from English ones is what Russian speakers have been using to get around braindead keyword сеnsоrshір¹ forever, same way kids type "de@th" on TikTok to avoid automoderation.

Most of the added value in this article can be summed up by saying that the Cyrillic glyphs are identical to the similar English ones in the fonts that the author looked at (which isn't true for all fonts), and the author didn't find many other such examples.

_______

¹ Try matching that word with "censorship" for fun
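
For anyone who wants to try it, a small sketch of how a filter could fold such lookalikes back to Latin before comparing. The table is a hand-picked, hypothetical subset and nowhere near the full Unicode confusables data; the test word is built from explicit escapes, so it may not use exactly the same code points as the word above.

```python
# Hand-picked Cyrillic-to-Latin folds for illustration only.
FOLD = str.maketrans({
    "\u0430": "a",  # Cyrillic а
    "\u0441": "c",  # Cyrillic с
    "\u0435": "e",  # Cyrillic е
    "\u0456": "i",  # Cyrillic і (Ukrainian/Belarusian)
    "\u043e": "o",  # Cyrillic о
    "\u0440": "p",  # Cyrillic р
    "\u0445": "x",  # Cyrillic х
    "\u0443": "y",  # Cyrillic у
})


def skeleton(text):
    """Lowercase the text and fold known lookalikes to their Latin twins."""
    return text.lower().translate(FOLD)


def looks_like(candidate, target):
    """True if the two strings are equal once lookalikes are folded."""
    return skeleton(candidate) == skeleton(target)


# "censorship" with c, e, o, i and p replaced by Cyrillic lookalikes.
disguised = "\u0441\u0435ns\u043ersh\u0456\u0440"
naive_match = disguised == "censorship"
folded_match = looks_like(disguised, "censorship")
```

A plain string comparison misses the disguised word, while the folded comparison catches it, at the cost of also folding legitimate Cyrillic text.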

reply
tstrimple 17 hours ago
[flagged]
reply
tuwtuwtuwtuw 17 hours ago
Maybe not. I checked OP's blog and he seems to be putting up 2-3 longer posts per day. Since it is LLM content, I have no idea whether it's mainly hallucinations or based on facts. So what did I learn from reading the article? Maybe nothing, maybe it's just made up.
reply
pmontra 17 hours ago
If you have a Mac you can follow the steps at the end of the post and reproduce the results https://paultendo.github.io/posts/confusable-vision-visual-s...

I don't have a Mac.

reply
Cool_Caribou 16 hours ago
Why are all the descending letters truncated in the titles? Not sure if it's a css glitch or terrible font choice. A bit ironic on an article about fonts.
reply
sheept 16 hours ago
It appears to be part of the font[0]. It looks a bit weird, but display fonts usually can get away with being more eccentric.

[0]: https://fonts.google.com/specimen/Syne

reply
arlattimore 17 hours ago
This is very cool, impressive piece of work Paul.
reply