Graphing how the 10k* most common English words define each other
100 points by wyattsell 4 days ago | 24 comments

xtiansimon 12 minutes ago
Careful. This is starting to sound like Saussurean semiotics.
will_critchlow 55 minutes ago
This seems to be a common rabbit hole to go down. My version was trying to run PageRank on the directed graph: https://x.com/willcritchlow/status/2013598472562168056
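
For anyone curious what that looks like, here is a minimal sketch of PageRank by plain power iteration, in pure Python rather than any particular library. The `pagerank` helper, the toy `definitions` data, and the edge direction (each word points at the words used to define it) are all assumptions for illustration, not the method from the linked post.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Plain power-iteration PageRank over {node: set_of_successors}."""
    nodes = set(edges) | {t for targets in edges.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing edges is spread evenly.
        dangling = sum(rank[node] for node in nodes if not edges.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source, targets in edges.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Toy data (hypothetical): each word points at the words used to define it.
definitions = {
    "puppy": {"young", "dog"},
    "kitten": {"young", "cat"},
    "dog": {"animal"},
    "cat": {"animal"},
}

ranks = pagerank(definitions)
top_word = max(ranks, key=ranks.get)
print(top_word)
```

With edges pointing from a word to its defining words, rank accumulates on words that many definitions lean on, directly or indirectly.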
anigbrowl 13 hours ago
It's a common problem to get excited about networks, build a large one, and then be stuck with an unapproachable hairball. If you want to explore network structure, consider using tools like quadrilateral Simmelian backbones, which can provide an opinionated look at what matters in the network.
Someone 11 hours ago
One could also try to use a different set of definitions better suited to such a visualization.

The Oxford Advanced Learner’s dictionary has an appendix called “Defining Vocabulary”. It says:

“In order to make the dictionary definitions easy to understand, we have written them using only the words in the following list.

[…]

Occasionally it has been necessary to use in a definition a word not in the list. When such a word occurs it is shown in SMALL CAPITAL LETTERS.”

I estimate that list has about 3,500 words.

⇒ If you base your network on that dictionary or one carefully constructed like that, the graph could have a central core of about 3,500 nodes with the other words circling around it.

Making a good visualization still would be a challenge, of course.

tomstuart 12 hours ago
I had to look this up: https://doi.org/10.7155/jgaa.00370
MrDrDr 8 hours ago
I remember thinking about this when the semantic web was first being discussed. If you think of it from the perspective of a child, your first 'foundational' words are learned through direct experience. Then, while you continue to learn words this way, you can also use the words you 'know' to define secondary or tertiary terms that you have no direct experience of. I'd like to see a graph like this with someone's take on the minimum number of necessary foundational words and how that graph would look.
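
One way to make the 'foundational words' idea concrete is a fixed-point check: start from a candidate set of directly learned words and keep adding any word whose whole definition is already known. This is only a sketch. The `learnable_words` helper and the toy `definitions` are hypothetical, and it tests a candidate set rather than searching for the smallest one.

```python
def learnable_words(definitions, foundational):
    """Return every word that can be learned from `foundational` by definition alone."""
    known = set(foundational)
    changed = True
    while changed:
        changed = False
        for word, used in definitions.items():
            # A word becomes known once every word in its definition is known.
            if word not in known and used <= known:
                known.add(word)
                changed = True
    return known


# Toy data (hypothetical): word -> set of words used in its definition.
definitions = {
    "puppy": {"young", "dog"},
    "dog": {"animal"},
    "young": {"not", "old"},
    "old": {"not", "young"},  # circular pair: one of the two must be learned directly
}

known = learnable_words(definitions, {"animal", "not", "old"})
print(sorted(known))
```

Dropping "old" from the starting set leaves "young", "old", and "puppy" unreachable, which is the kind of circularity a minimal foundational set has to break.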
euroderf 7 hours ago
> If you think of it from the perspective of a child, your first 'foundational' words are learned through direct experience.

And lest we (or AGI) forget, there's qualia in the foundations.

avidiax 15 hours ago
If you like this, you would probably enjoy Princeton WordNet. Unfortunately, they have stopped developing it.

You can still browse it a bit online with some 3rd party sites: https://en-word.net/

jaen 9 hours ago
The page literally credits "Open English Wordnet" (based on it) in the sidebar :)

(the link is broken though, it should be https://github.com/globalwordnet/english-wordnet)

reubenmorais 12 hours ago
This reminds me of the classic "Growing a Language" talk by Guy Steele: https://www.youtube.com/watch?v=_ahvzDzKdB0
WillAdams 8 hours ago
Nice! Reminds me a bit of "WordWeb" which is still around:

https://wordweb.info/free/

which also uses WordNet:

https://en.wikipedia.org/wiki/WordNet

(which this is also using)

which was developed by Princeton w/ DARPA money as an early investigation into AI and so forth.

sspehr 9 hours ago
There are some surprises, like the word 'r'.
breakingcups 10 hours ago
It seems broken. The word "knows" only connects to the word "operator"
codeflo 10 hours ago
It's likely that "knows" has no separate definition, but is used in some definition of "operator". If so, then "operator" should probably connect to "know", and "knows" shouldn't appear in the graph at all. But calling that edge case "broken" is a bit harsh, I think.
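
A rough sketch of that suggestion: fold inflected tokens into their headword before adding an edge, so a definition that uses "knows" links to "know". The `headword` and `build_edges` helpers, the crude suffix stripping, and the toy definitions are all assumptions for illustration. A real pipeline would more likely use a proper lemmatizer.

```python
def headword(token, headwords):
    """Map a token to a dictionary headword, or None if nothing matches."""
    if token in headwords:
        return token
    # Crude suffix stripping; a real lemmatizer handles far more cases.
    for suffix in ("es", "s", "ed", "ing"):
        if token.endswith(suffix) and token[: -len(suffix)] in headwords:
            return token[: -len(suffix)]
    return None


def build_edges(definitions):
    """Return (word, used_word) edges with inflected forms folded into headwords."""
    headwords = set(definitions)
    edges = set()
    for word, text in definitions.items():
        for token in text.lower().split():
            target = headword(token, headwords)
            if target is not None and target != word:
                edges.add((word, target))
    return edges


# Toy data (hypothetical glosses, not taken from WordNet).
definitions = {
    "operator": "someone who knows how to run a machine",
    "know": "be aware of a fact",
    "run": "make a machine work",
    "machine": "a device that does work",
    "work": "activity directed at a result",
}

edges = build_edges(definitions)
print(sorted(edges))
```

Tokens that match no headword are dropped rather than added as nodes, so "knows" never appears in the graph on its own.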
yubainu 4 hours ago
It is an intriguing and aesthetically profound approach.
castral 3 days ago
It's an interesting visualization for sure, but I don't really know what I can take away from it. Is it useful for something?
h4ch1 3 days ago
You can look at this as how small sets of a primitive lexicon give rise to a larger, more complex language. At least that's how I interpret it.
rhelz 3 days ago
Beautiful! Thank you!
theodpHN 4 days ago
Very neat. What software is being used to construct/display the graph?
wyattsell 4 days ago
Glad you like it. NetworkX for creating the graph and the layout; then SigmaJS for displaying it.
readthenotes1 16 hours ago
"Is", "be", and "the" don't show up in the search box.

What am I missing?

Cyphase 15 hours ago
Other words too, e.g. "from".

My first thought was that the creator used a search library that filters common words by default, but the search code is all in the page and doesn't do that.

My second thought was that the 10k word corpus doesn't include those most common words. But it does.

Then I realized that the creator filtered them out. The page does say "7931 words", and the title here on HN says "10k* most common". The original corpus has exactly 10,000 words.

https://github.com/first20hours/google-10000-english/blob/d0...

The first 21 include all four we've mentioned:

the, of, and, to, a, in, for, is, on, that, by, this, with, i, you, it, not, or, be, are, from

wyattsell 15 hours ago
The reason for this (in hindsight, I should probably have added a note to the site) is that WordNet doesn't include definitions for these words in its corpus. This is why the count is less than 10,000: anything that WordNet doesn't have a definition for isn't included. I left a nod to this in the asterisk, but I realise now that I didn't explain it anywhere.

From the old Princeton WordNet FAQ page (https://wordnet.princeton.edu/frequently-asked-questions):

> WordNet only contains "open-class words": nouns, verbs, adjectives, and adverbs. Thus, excluded words include determiners, prepositions, pronouns, conjunctions, and particles.

I suppose I could have included them as source nodes (only outgoing), but I think they would have ended up connecting to a whole bunch of definitions, while not providing much in the way of interest.
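
The filtering described above can be sketched roughly like this. It is a sketch under assumptions: `keep_defined`, the sample word list, and the `glosses` lookup are hypothetical stand-ins, and the real site presumably queries WordNet rather than a hand-written dict.

```python
def keep_defined(common_words, glosses):
    """Split a frequency list into words with a gloss and words without one."""
    kept = [w for w in common_words if w in glosses]
    dropped = [w for w in common_words if w not in glosses]
    return kept, dropped


# Hypothetical stand-ins: a slice of a frequency list and a tiny gloss lookup.
common_words = ["the", "of", "and", "time", "from", "year", "people"]
glosses = {
    "time": "a continuous period in which events happen",
    "year": "a period of twelve months",
    "people": "human beings in general",
}

kept, dropped = keep_defined(common_words, glosses)
print(len(common_words), "->", len(kept), "kept; dropped:", dropped)
```

Run over the full 10,000-word list, a filter of this kind would account for the page showing fewer words than the corpus contains.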

oxonia 11 hours ago
Yet "tc" does?