That inspired me — if AI can rewrite a whole language stack that fast, I wanted to try building a programming language from scratch with AI assistance.
I've also been noticing growing global interest in Korean language and culture, and I wondered: what would a programming language look like if every keyword was in Hangul (the Korean writing system)?
Han is the result. It's a statically-typed language written in Rust with a full compiler pipeline (lexer → parser → AST → interpreter + LLVM IR codegen).
It supports arrays, structs with impl blocks, closures, pattern matching, try/catch, file I/O, module imports, a REPL, and a basic LSP server.
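To make the first stage of that pipeline concrete, here is a minimal lexer sketch in Python. It is not Han's actual Rust code. The keyword set uses only 함수 and 만약, the two keywords named elsewhere in this thread; the token categories, the `lex` function, and the sample input are assumptions for illustration.

```python
import re

# Only these two keywords are taken from the thread; a real keyword table is larger.
KEYWORDS = {"함수", "만약"}

# One token per match: a number, a name (Hangul counts as word characters), or an operator.
TOKEN_RE = re.compile(
    r"\s*(?:(?P<NUMBER>\d+)|(?P<NAME>[^\W\d]\w*)|(?P<OP>[-+*/<>=(){},:]))"
)


def lex(source: str) -> list[tuple[str, str]]:
    """Turn source text into (kind, text) pairs, tagging known keywords."""
    source = source.rstrip()
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at offset {pos}")
        kind = match.lastgroup
        text = match.group(kind)
        if kind == "NAME" and text in KEYWORDS:
            kind = "KEYWORD"
        tokens.append((kind, text))
        pos = match.end()
    return tokens


tokens = lex("함수 더하기(a, b)")
```

The parser would then consume these pairs to build the AST; nothing about the lexer changes because the keywords happen to be Hangul.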
This is a side project, not a "you should use this instead of Python" pitch. Feedback on language design, compiler architecture, or the Korean keyword choices is very welcome.
Learning the Korean alphabet (Hangul) can be done quite quickly; it has only about as many "letters" as the English alphabet!
Remembering the words is a bit more difficult though, especially if you don't know a similar language. I have been using Anki and my own app for that: https://game.tolearnkorean.com/
Meanwhile, "Korean writing is so easy and logical you can learn it in no time at all" has become a meme to the point where I suspect the number of people who've been exposed to the meme and don't remember a single character might be larger than the number of Koreans who've heard about the tongue shape thing and still remember it.
Also, ㄹ is obviously anatomically impossible for human tongues. It does however closely resemble similar letters in some Brahmic scripts. I'm partial to ʼPhags-pa ꡙ https://en.wikipedia.org/wiki/Origin_of_Hangul#%CA%BCPhags-p...
Nouns translate fairly naturally, but standalone verb commands in English need more care. In English, a verb like "find" can stand alone, but in Korean a verb usually needs an ending, and different endings can sound quite different or awkward depending on context. For example, "find" could become 찾다, 찾기, or 찾음, but those are not interchangeable.
Plural forms are also tricky. English distinguishes strongly between singular and plural, but Korean usually does not. Explicit plurals like “단어들” often sound unnatural unless the individuality of each item is important. "단어목록" feels the same way.
Overall, this is a very interesting project with real potential. I think it could become even stronger if it considers the structural differences between English and Korean, rather than treating it as simple keyword substitution.
I self-taught programming quite early in my life, way before I had a good command of the English language. I've read books in my native language, talked on programming forums in my native language. In the end the "english" in programming languages is just a handful of keywords, and it didn't hinder me one bit that I had no idea "int" stands for "integer".
Of course, I started by writing code like "bool es_primo(int numero)" (in my language), but there's nothing in C that says identifiers must be English; it's just convention. The standard library and packages would be a problem nowadays, but back then standard libraries were thin, and a name like "strcpy" is obscure anyway. The real hard part was always learning how to program and design properly.
And for more advanced topics, documentation and learning materials available only in English are a HUGE problem for ESL speakers, because one has to actually read and understand them. But this is not something a programming language can help with.
Also, in most languages you can already name variables/classes/members using any Unicode letters. So only the "if/for/while" keywords and stdlib classes remain English. It makes little sense to translate those.
When Toss, a Korean unicorn startup, announced that they would start using Korean for variable names within financial contexts, it sparked significant debate and a wide range of reactions among Korean programmers.
Hangul's phonetic symbolic design: https://news.ycombinator.com/item?id=47382219
Korean plural forms: https://news.ycombinator.com/item?id=47386312
Your comment on how LLM tokenizers shorten common inputs in training data; Korean is more visually compact but suffers from poor token compression: https://news.ycombinator.com/item?id=47381843
Hangul keyboard layout - so cool that the layout is split between consonant and vowel hands and forms rhythmic harmony while typing: https://news.ycombinator.com/item?id=47382081
I suppose this is again why I keep telling everyone that I would like the world to have a single language: it would reduce wars and miscommunication, and bring everyone closer. But of course, the other point of view is that it reduces culture. I think it would turn out like UK/US English or Spanish: the same language with variations, but everyone can understand each other.
I can't imagine what would have happened if Python or JS had been fragmented into X different languages because of egos, with everyone deciding to create their own language instead of collaborating. I don't think we would be where we are today; AIs probably would not be around, since we would be struggling to understand so many different programming languages.
# def two_sum(arr: list[int], target: int) -> list[int]:
펀크 투섬(아래이: 목록[정수], 타개트: 정수) -> 목록[정수]:
    # n = len(arr)
    ㄴ = 길이(아래이)
    # start, end = 0, n - 1
    시작, 끝 = 0, ㄴ - 1
    # while start < end:
    동안 시작 < 끝:
Code would be more compact, allowing things like more descriptive keywords, e.g.
AbstractVerifiedIdentityAccountFactory vs 실명인증계정생성, but we'd lose out on the nice upper/lowercase distinction. I hear that information processing speed is nearly the same across all languages regardless of density, though, so in terms of processing speed it may not make much difference.
It never really took off. I think because computers already require users to read and type Latin letters in lots of other situations, and it's not that hard to learn what a few keywords mean, so you might as well stick with the English keywords everyone else is using.
Changing syntax doesn't change the surrounding world. Unless you plan to translate half of pip and npm you mostly end up with a teaching language or a local curiosity.
Technical people already have to make concessions to deal with ascii chars and English in computing by the time they use a terminal, so the upside of changing any one thing kinda peters out.
One distinction though: Han uses actual Korean words, not transliterations. 함수 means "function" in Korean, 만약 means "if" — they're real words Korean speakers already know.
Your example uses transliterations like 펀크 and 아래이 which would look odd to a Korean reader. That difference matters for readability.
There are probably a lot of reasons why non-English-speaking programmers stick with English keywords, beyond just language/tooling support. Learning new keywords is already part of learning a programming language, and much of the documentation and resources available for languages and libraries are only in English. ASCII-only strings are still ubiquitous in software, like URLs and usernames. And in international teams, English is the go-to lingua franca.
Could this change with LLMs? Maybe, but most code in their training data is in English, so LLMs likely work most effectively in English.
It’s fun to look at your code samples, have absolutely no clue what any of it means, and think about just how many non-English-speaking programmers must have felt that way looking at our all-English programming languages.
Except lisp: that’s just inscrutable symbols like cond and cons and car and cadr and a bunch of parens! :-)
I actually tested this with GPT-4o's tokenizer, and the result was the opposite — Korean keywords average 2-3 tokens vs 1 for English. A fibonacci program in Han takes 88 tokens vs 54 in Python.
The reason comes down to how LLM tokenizers work. They use BPE (Byte Pair Encoding), which starts with raw bytes and repeatedly merges the most frequent pairs into single tokens. Since training data is predominantly English, words like `function` and `return` appear billions of times and get merged into single tokens.
Korean text appears far less frequently, so the tokenizer doesn't learn to merge Hangul syllables — it falls back to splitting each character into 2-3 byte-level tokens instead.
It's a tokenizer training bias, not a property of Hangul itself. If a tokenizer were trained on a Korean-heavy corpus, `함수` could absolutely become a single token too.
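The merge behaviour described above can be reproduced with a toy byte-level BPE trainer. This is a minimal sketch, not GPT-4o's tokenizer: the two tiny corpora, the merge budgets, and the function names are all made up for illustration.

```python
from collections import Counter


def to_bytes(word: str) -> tuple[bytes, ...]:
    """Start from raw UTF-8 bytes, one symbol per byte."""
    return tuple(bytes([b]) for b in word.encode("utf-8"))


def merge_pair(symbols: tuple[bytes, ...], pair: tuple[bytes, bytes]) -> tuple[bytes, ...]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[bytes, bytes]]:
    """Repeatedly merge the most frequent adjacent pair across the corpus."""
    words = Counter(to_bytes(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges


def encode(word: str, merges: list[tuple[bytes, bytes]]) -> tuple[bytes, ...]:
    """Apply the learned merges in training order."""
    symbols = to_bytes(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols


# English-heavy corpus: every merge is spent on "function".
english_heavy = train_bpe(["function"] * 1000 + ["함수"], num_merges=7)
# Korean-heavy corpus: the same algorithm now merges the Hangul bytes.
korean_heavy = train_bpe(["함수"] * 1000 + ["function"], num_merges=5)
```

With the English-heavy merges, `function` collapses to one token while `함수` stays at six byte-level tokens (three per syllable). With the Korean-heavy merges, `함수` becomes a single token, which is the training-bias point in miniature.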
So no efficiency benefit today. But it was a fun exploration, and Korean speakers can read the code like natural language. It could also be a fun way for people learning Korean to practice reading Hangul in a different context — every keyword is a real Korean word with meaning.
For Korean readers the character systems look quite different, but I can see how it's hard to parse visually without familiarity.
As you said, syntax highlighting helps a lot — there's a colored screenshot at the top of the README showing how it looks in practice.
I had a similar idea: train an LLM in Serbian, and even create a new encoding, https://github.com/topce/YUTF-8, inspired by YUSCII. I did not have the time or money ;-) Great that you succeeded. The idea is that if you train on Serbian text encoded in YUTF-8 (not UTF-8), a prompt in Serbian would take fewer tokens than one in English. Serbian Cyrillic characters are also 1 byte in YUTF-8 instead of 2 in UTF-8. Serbian is phonetic, so we never ask how you spell something. It has both Latin and Cyrillic letters.
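For reference, the UTF-8 side of that comparison is easy to measure. A small sketch (the helper name and sample strings are mine; it says nothing about how YUTF-8 itself encodes text):

```python
def bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes used per character."""
    return len(text.encode("utf-8")) / len(text)


samples = {
    "ASCII letters": "abc",
    "Serbian Cyrillic letters": "ђжћчш",
    "Serbian Latin letters with diacritics": "đžćčš",
    "Hangul syllables": "함수",
}

for label, text in samples.items():
    print(f"{label}: {bytes_per_char(text):.1f} bytes per character")
```

A byte-level tokenizer starts from these byte counts, so any script outside ASCII begins with a handicap before a single merge is learned.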
Even without retraining BPE from scratch, starting with YUTF-8 and measuring how existing tokenizers handle it would already be a worthwhile experiment.
Hope you find the time to build it, good luck!
https://huggingface.co/spaces/lapa-llm/lapa
Best tokenizer for the Ukrainian language
Thanks to a SOTA method for tokenizer adaptation developed by Mykola Haltiuk as part of this project, it was possible to replace 80,000 tokens out of 250,000 with Ukrainian ones without loss of model quality, thus making Lapa LLM the fastest model for working with the Ukrainian language. Compared to the original Gemma 3, for working with Ukrainian, the model requires 1.5 times fewer tokens, thus performing three times fewer computations to achieve better results.
The main benefit of Korean actually comes from the fact that the language itself fits perfectly into the standard 26 alphabet keys and is laid out in such a way that lets you type ridiculously fast. The consonant letters are always situated in the left half and the vowels are in the right half of the keyboard. This means it is extremely easy to train muscle memory because you’re mostly alternating keystrokes on your left hand and right hand.
Anecdotally I feel like when I’m typing in English, each half of my brain needs to coordinate more compared to when I’m typing in Korean: my right brain only needs to remember the consonant positions for my left hand, and my left brain only needs to remember the vowel positions.
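The left/right alternation can be roughly estimated in a few lines. This sketch takes the simplified model from the comment above (every consonant on the left hand, every vowel on the right) and counts each jamo as one keystroke; real layouts need Shift for double consonants and two keys for compound vowels, so treat the numbers as an approximation. The function names are mine.

```python
import unicodedata


def hand_sequence(text: str) -> str:
    """Map each jamo of a Hangul string to 'L' (consonant) or 'R' (vowel)."""
    hands = []
    # NFD splits each precomposed syllable into its conjoining jamo.
    for ch in unicodedata.normalize("NFD", text):
        cp = ord(ch)
        if 0x1100 <= cp <= 0x1112 or 0x11A8 <= cp <= 0x11C2:
            hands.append("L")  # leading or trailing consonant
        elif 0x1161 <= cp <= 0x1175:
            hands.append("R")  # vowel
    return "".join(hands)


def alternation_rate(sequence: str) -> float:
    """Fraction of consecutive keystrokes that switch hands."""
    if len(sequence) < 2:
        return 0.0
    switches = sum(a != b for a, b in zip(sequence, sequence[1:]))
    return switches / (len(sequence) - 1)


seq = hand_sequence("한글")
rate = alternation_rate(seq)
```

For 한글 the sequence is `LRLLRL`: the only same-hand step is the final consonant of one syllable followed by the initial consonant of the next.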
People started moving away from using difficult-to-type Hanja as soon as the typewriter was introduced. As computerization progressed, the transition naturally continued until Hanja was phased out of most documents. Even so, it has only been about 40 years since Hanja disappeared from everyday life.
That said, a few things do lean into Korean specifically:
- Method names are real Korean verbs: .추가() (add), .삭제() (delete), .분리() (split)
- Error messages are in Korean
- The REPL prompt is 한> and exit command is 나가기 (literally "go out")
Good question — it pushed me to think about what makes this more than just s/function/함수/g.
I replied that for Japanese at least, probably due to symbol input being too tedious. However I think it's worth mentioning a potential mitigation, and maybe even an advantage.
As a mitigation, full-width symbols could be used instead, which are typically the default in Japanese input. Japanese itself is also fixed-width so if done across the board the language itself becomes fixed-width, not just by virtue of a font selection.
As an advantage, some logical symbols, Greek letters, and other rare characters are easy to input in Japanese mode, and this could lend itself to a more symbol-heavy language design. I already take advantage of this myself with Δ, φ, and τ, used selectively in a few projects. Symbols with easy entry may differ by OS, but here are a few other examples that could be useful:
≠, ≡, ∞, ∴, λ, θ, α, β, ・, °, ※
And my other point is that != is _harder_ to type in Japanese input mode because you constantly have to manage full-width vs half-width input.
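One way a language could sidestep the full-width/half-width juggling is to fold full-width ASCII variants before lexing. A minimal sketch, assuming a hypothetical `fold_fullwidth` preprocessing step (a real lexer would want to leave string literals and comments untouched):

```python
# Full-width forms U+FF01..U+FF5E mirror ASCII 0x21..0x7E at a fixed offset.
FULLWIDTH_TO_ASCII = {cp: cp - 0xFEE0 for cp in range(0xFF01, 0xFF5F)}
# The ideographic space U+3000 maps to a plain space.
FULLWIDTH_TO_ASCII[0x3000] = 0x20


def fold_fullwidth(source: str) -> str:
    """Replace full-width ASCII variants with their half-width equivalents."""
    return source.translate(FULLWIDTH_TO_ASCII)


# Full-width "x != 1" as typed in Japanese input mode, written with escapes.
sample = "\uff58\u3000\uff01\uff1d\u3000\uff11"
folded = fold_fullwidth(sample)
```

Characters outside that range, such as ≠ or λ, pass through unchanged, so a symbol-heavy design could still treat them as distinct operators.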
https://github.com/farant/rhubarb/blob/main/include/latina.h
edit: oh, maybe you can’t do full unicode. that’s too bad!
Error messages, REPL, LSP hover docs are all in Korean. You can't get that from #define 만약 if.
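To illustrate the limit of pure keyword substitution, here is a sketch of the `#define` approach done in Python instead of C. The keyword table is hypothetical: 함수 and 만약 are mentioned in this thread, while 반환 ("return") is my assumption. It does not reflect how Han is implemented.

```python
import io
import tokenize

# Hypothetical mapping; only 함수 and 만약 are taken from the thread.
KEYWORDS = {"함수": "def", "만약": "if", "반환": "return"}


def translate(source: str) -> str:
    """Swap Korean keyword tokens for Python ones; names and strings are untouched."""
    tokens = tokenize.generate_tokens(io.StringIO(source).readline)
    swapped = [
        (
            tok.type,
            KEYWORDS.get(tok.string, tok.string)
            if tok.type == tokenize.NAME
            else tok.string,
        )
        for tok in tokens
    ]
    return tokenize.untokenize(swapped)


HAN_LIKE = (
    "함수 나누기(a, b):\n"
    "    만약 b == 1:\n"
    "        반환 a\n"
    "    반환 a / b\n"
)

namespace = {}
exec(translate(HAN_LIKE), namespace)
divide = namespace["나누기"]

try:
    divide(1, 0)
except ZeroDivisionError as err:
    # The diagnostic comes from the host language, so it is still in English.
    message = str(err)
```

The code runs, but the runtime error text is whatever the host language produces, which is the gap the comment above is pointing at.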
anecdotally it is also interesting to use with ai because apparently it is "harder to be on autopilot" based on a huge pre-existing corpus of code when you write it in a different language. could activate different reasoning regions somehow.
(i just appreciate what can be trivially accomplished in c even if it's kind of janky after spending way too much time in the JS preprocessor mines...)
Perhaps look to APL for efficient ways to represent math concepts/structures?
Korean has its own pure Korean words (순우리말) as its foundation, and borrowed some Chinese-origin vocabulary on top of that.
Hangul was specifically created so people wouldn't need to learn Chinese characters.
So Han's keywords use native Korean words where possible — it fits the spirit of Hangul itself.
\begin{quotation}
\emph{The beginning of wisdom is to call things by their right names.}
--- \textsc{Confucius}
\end{quotation}
Very glad /u/faitswulff mentioned Wenyan (though I'm bummed that there are only simplified Chinese and Japanese translations).
But in practice, breaking syllables into jamo would make keywords less readable, which goes against Hangul's design goal. And considering how AI-assisted coding works today, fully named descriptive keywords actually reduce errors — LLMs perform better with explicit, unambiguous tokens than with cryptic symbol compositions.
So Han leans toward more descriptive Korean keywords rather than shorter symbolic ones. Readability over brevity.
Interesting direction to think about though — thanks for the question.
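For anyone curious what "breaking syllables into jamo" looks like mechanically, precomposed Hangul syllables follow a fixed arithmetic layout in Unicode starting at U+AC00, so they can be split without a lookup table per syllable. A small sketch (the `to_jamo` name is mine):

```python
# Jamo in Unicode's syllable-composition order.
CHO = "ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ"  # 19 initial consonants
JUNG = "ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ"  # 21 vowels
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 finals + none

SYLLABLE_BASE = 0xAC00
SYLLABLE_COUNT = 19 * 21 * 28


def to_jamo(text: str) -> str:
    """Split each precomposed Hangul syllable into its individual jamo."""
    out = []
    for ch in text:
        index = ord(ch) - SYLLABLE_BASE
        if 0 <= index < SYLLABLE_COUNT:
            lead, rest = divmod(index, 21 * 28)
            vowel, tail = divmod(rest, 28)
            out.append(CHO[lead] + JUNG[vowel] + JONG[tail])
        else:
            out.append(ch)  # not a Hangul syllable, keep as-is
    return "".join(out)


print(to_jamo("함수"))
```

Seeing 함수 spelled out as ㅎㅏㅁㅅㅜ makes the readability cost fairly obvious: the syllable blocks are what Korean readers actually recognize.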
If X then Y, sure. While X, do Y? Maybe. For X equals Y, X equals Z, X is incremented, do A? Hardly. Match X case Y1 Z1 case Y2 Z2? Definitely not
Native English speakers have a small leg up understanding the vocabulary, not the syntax.
* Really hope this isn't unforgivably offensive.
This seems like a reasonably good security measure too
Non-ascii encodings are harder to program in due to the need to switch in and out of input methods.
That said, some languages like Arabic and Japanese (and possibly Korean and Hindi) lend themselves naturally to VSO token ordering, which maps directly to LISP syntax, so it's unfortunate that there isn't a lot of interest in this. It would be lots of fun. Maybe agents will make this possible!
Here are some interesting examples.
- https://github.com/nasser/--- (Arabic)
- https://honoka.nukenin.jp/Introduction/Loop.html (Japanese)
- https://github.com/wenyan-lang/wenyan (Chinese, which is SVO like English)
Amazing though!
Rather than just translating keywords, it lets you write code that actually uses Korean grammar. For example, "10을 5로 나누고 출력하다" (literally "10 by 5 divide and print") outputs "2".
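As a rough idea of what parsing that sentence involves, here is a toy Python sketch that handles only this one sentence shape: a number with the object particle 을/를, a number with 로/으로, then the verbs 나누고 and 출력하다. It is purely illustrative and is not how the language mentioned above is actually implemented.

```python
import re

# <number>을/를 <number>로/으로 나누고 출력하다
DIVIDE_AND_PRINT = re.compile(r"^(\d+)[을를]\s+(\d+)(?:으로|로)\s+나누고\s+출력하다$")


def run(sentence: str) -> str:
    """Evaluate a 'divide A by B and print' sentence and return the printed text."""
    match = DIVIDE_AND_PRINT.match(sentence.strip())
    if match is None:
        raise ValueError(f"unsupported sentence: {sentence!r}")
    dividend, divisor = int(match.group(1)), int(match.group(2))
    quotient, remainder = divmod(dividend, divisor)
    # Show whole results without a trailing ".0".
    result = quotient if remainder == 0 else dividend / divisor
    return str(result)


output = run("10을 5로 나누고 출력하다")
```

Note that the operands come first and the verbs last, so the particles do the work that argument position does in English-style syntax.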
You might already know this, but there's also a Korean programming language called 'Yaksok'. Here's a 2048 written entirely in Korean: https://github.com/yaksok/yaksok/blob/master/code_examples/2...
For example, "library" is pronounced "tu-shu-guan" in Chinese, "do-seo-gwan" in Korean, and "to-sho-kan" in Japanese. All three can be written with the same characters, "圖書館". In modern Korea, though, people use Hangul, so very few Koreans actually know how to write "library" in Chinese characters. In Japan, Chinese characters are still heavily used, but for difficult ones, they often write kana alongside them as a reading aid.
It's very much like how Latin "universitas" became "university" in English, "universidad" in Spanish, and "università" in Italian.
These aren't an indication of a shared vocabulary or ancestry, just loanwords for concepts that were novel and scientific by Victorian standards.
a big one: hanja (kr) kanji (jp) both are 漢字