SectorC: A C Compiler in 512 bytes (2023)
369 points by valyala 2 days ago | 79 comments

layer8 24 hours ago
If this implementation had existed in the 1980s, the C standard would have a rule that different tokens hashing to the same 16-bit value invoke undefined behavior, and optimizing compilers in the 2000s would simply optimize such tokens away to a no-op. ;)
reply
RodgerTheGreat 20 hours ago
"you don't have -wTokenHashCollision enabled! it's your own foolish ignorance that triggered UB; the spec is perfectly clear!"
reply
fredrikholm 11 hours ago
Hey stop it with the ad hominems!
reply
xorvoid 24 hours ago
Too real! LMAO
reply
mati365 2 days ago
Oh, it looks like my X86-16 boot sector C compiler that I made recently [1]. Writing boot sector games has a nostalgic magic to it, when programming was actually fun and showed off your skills. It's a shame that the AI era has terribly devalued these projects.

[1] https://github.com/Mati365/ts-c-compiler

reply
guenthert 13 hours ago
Er, what? The article describes a compiler for a not-quite-C programming language which fits entirely in 512B. Your project, if I see this correctly, can optionally produce code meant to execute as boot sector.

Both interesting projects, but other than the words 'boot sector', 'C' and 'compiler', I don't see a similarity.

reply
w4yai 18 hours ago
> when programming was actually fun and showed off your skills

Oh no. Now more people are able to do what I do. I'm not special anymore.

reply
mlsu 18 hours ago
Seems like this is facetious but to me, “I’m not special” is a pretty valid thing to be sad about.
reply
tgv 17 hours ago
The two dos in "do what I do" do absolutely not carry the same meaning.
reply
xorvoid 2 days ago
I may be the author.. enjoy! It was an absolute blast making this!
reply
veltas 2 days ago
This is very nice. I'm currently writing a minimalist C compiler although my goal isn't fitting in a boot sector, it's more targeted at 8-bit systems with a lot more room than that.

This is a great demonstration of how simple the bare bones of C are, which I think is one reason I and many others find it so appealing despite how Spartan it is. C really evolved from B which was a demake of Fortran, if Ken Thompson is to be trusted.

reply
JamesTRexx 2 days ago
Would it shrink, and by how much, if if, while, and for were replaced by a simple goto routine? (After all, in assembly there is only jmp and no other fancy jump instruction, I assume.)

And PS, it's "chose your own adventure". :-) I love minimalism.

reply
SAI_Peregrinus 2 days ago
What fancy jumps are present in assembly depends on the CPU architecture. But there are always conditional jumps, like JNZ that jumps if the Zero flag isn't set.
reply
MobiusHorizons 15 hours ago
The “fancy jump” is the branch instruction. As far as I know all ISAs have them. Even rv32i which is famously minimal has several branch instructions in addition to two forms of unconditional jump. Branches are typically used to construct if / for / while as well as && and || (because of short circuiting) and ternary (although some architectures may have special instructions for that that may or may not be faster than branches depending on the exact model). Without it you would have to use computed goto with a destination address computed without conditional execution using constant time techniques.
reply
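A minimal sketch of the "without conditional execution" idea mentioned above: choosing between two values with a bit mask instead of a branch. The function name `select_u32` is hypothetical, and a compiler is still free to emit a branch or a conditional move for this, so treat it as an illustration of the technique rather than a guarantee about generated code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: returns if_true when cond != 0, else if_false,
 * using only arithmetic and bitwise operations in the source. */
static uint32_t select_u32(uint32_t cond, uint32_t if_true, uint32_t if_false) {
    /* (cond | -cond) has its top bit set exactly when cond != 0. */
    uint32_t nonzero = (cond | (0u - cond)) >> 31;  /* 0 or 1 */
    uint32_t mask = 0u - nonzero;                   /* all zeros or all ones */
    return (if_true & mask) | (if_false & ~mask);
}
```

The same mask trick could be applied to two candidate jump targets before an indirect jump, which is roughly what a branch-free "computed goto" would have to do.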
dzaima 23 hours ago
It only does if & while, not for. A goto in a single-pass thing would need separate handling for forwards vs backwards jumps, which involves keeping track of data per name (in a form where you can tell when it's not yet set; whereas if/while data is freely held in recursion stack). And you'd still need to handle at least `if ( expr ) goto foo;` to do any conditionals at all.
reply
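A rough sketch of why if/while are cheap in a single pass: the location of the unresolved forward jump can live in a local variable of the recursive compile function, so no per-name table is needed. Everything here is an assumption made for illustration (the opcodes `OP_JZ` and `OP_NOP`, the buffer, and the function names); it is not SectorC's code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { OP_JZ = 0x01, OP_NOP = 0x90 };  /* hypothetical opcodes */

static uint8_t code[64];  /* output buffer */
static size_t pos = 0;    /* next free byte in the buffer */

static void emit8(uint8_t b) {
    assert(pos < sizeof code);
    code[pos++] = b;
}

/* Emit a conditional jump whose 16-bit displacement is not known yet.
 * Returns the offset of the two placeholder bytes. */
static size_t emit_jump_hole(void) {
    emit8(OP_JZ);
    size_t hole = pos;
    emit8(0);
    emit8(0);
    return hole;
}

/* Fill in a previously emitted placeholder (little-endian). */
static void patch16(size_t hole, uint16_t v) {
    code[hole] = (uint8_t)(v & 0xFF);
    code[hole + 1] = (uint8_t)(v >> 8);
}

/* Compile `depth` nested if-statements. The hole for each level is a
 * local variable, so the call stack itself tracks pending jumps. */
static void compile_if(int depth) {
    size_t hole = emit_jump_hole();  /* forward target unknown here */
    emit8(OP_NOP);                   /* stand-in for the body */
    if (depth > 1) {
        compile_if(depth - 1);       /* nested statement */
    }
    /* Displacement is relative to the byte after the placeholder. */
    patch16(hole, (uint16_t)(pos - (hole + 2)));
}
```

A forward `goto foo;` has no such natural home for its hole: the label may appear anywhere later, so the compiler would need a table keyed by label name that records whether the label has been seen yet.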
direwolf20 23 hours ago
It's "choose your own adventure"
reply
globalnode 23 hours ago
That's the most important thing I noticed about the article, apart from the Forth tokenising ideas.
reply
einpoklum 2 days ago
An interesting use case - for the compiler as-is or for the essential idea of barely-C - might be in bootstrapping chains, i.e. starting from tiny platform-specific binaries one could verify the disassembly of, and gradually building more complex tools, interpreters, and compilers, so that eventually you get to something like a version of GCC and can then build an entire OS distribution.

Examples:

https://github.com/cosinusoidally/mishmashvm/

and https://github.com/cosinusoidally/tcc_bootstrap_alt/

reply
ahazred8ta 21 hours ago
Related: the stage0/stage1 series of hex-to-c compiler bootstrapping tools https://github.com/oriansj/stage0?tab=readme-ov-file and OTCC https://bellard.org/otcc/
reply
teo_zero 2 days ago
It would be interesting to understand what non-toy programs can be coded in this subset of C. For example, could tcc be rewritten in this dialect?
reply
direwolf20 23 hours ago
https://bootstrapping.miraheze.org/wiki/Main_Page

(Why does the referenced short story remind me of "There Is No Antimemetics Division"?)

reply
wzbtoolbox 14 hours ago
This is the kind of project that reminds you how far removed modern development is from the actual machine. We pile abstractions on abstractions until "Hello World" needs 200MB of node_modules, and then someone fits a C compiler in 512 bytes.

Not saying we should all write boot sector code, but reading through projects like this is genuinely humbling. Great educational resource too.

reply
lock1 7 hours ago
This kind of comment reminds me of how broad "software development" is.

On other HN posts, they're stating something like "software development is dead", "LLM as a compiler", "Do you read compiled assembly?", and so on.

Meanwhile, other posts like this one show huge mechanical sympathy and literally read and write the assembly directly.

reply
riedel 2 days ago
Beautiful, but make sure to quickly add 2023 to the title.

Discussed at the time: https://news.ycombinator.com/item?id=36064971

reply
dang 21 hours ago
Thanks! Macroexpanded:

SectorC: A C Compiler in 512 bytes - https://news.ycombinator.com/item?id=36064971 - May 2023 (80 comments)

reply
gjvc 10 hours ago
why? and why "quickly"?
reply
mojuba 2 days ago
Compare that to the C compiler in 100,000 lines written by Claude in two weeks for $20,000 (I think it was posted on HN just yesterday).
reply
vidarh 2 days ago
It's a fun comparison, but with the notable difference that that one can compile the Linux kernel and generate code for multiple different architectures, while this one can only compile a small proportion of valid C. It's a great project, but it's not so much a C compiler, as a compiler for a subset of C that allows all programs this compiler can compile to also be compiled by an actual C compiler, but not vice versa.
reply
d_silin 2 days ago
But can it compile "Hello, World" example from its own README.md?

https://github.com/anthropics/claudes-c-compiler/issues/1

reply
Retr0id 2 days ago
It's fascinating how few people read past the issue title
reply
fooker 22 hours ago
And this is exactly why coding with AI is not-so-slowly taking over.

Most people think they are more capable than they actually are.

reply
vidarh 2 days ago
Noticed the part where all it requires is to actually have the headers in the right location?
reply
d_silin 2 days ago
"The location of Standard C headers do not need to be supplied to a conformant compiler."

From https://news.ycombinator.com/item?id=46920922 discussion.

reply
vidarh 2 days ago
And it doesn't for the compiler in question either, as long as the headers exist in the places it looks for them. No compiler magically knows where the headers are if you haven't placed them in the right location.
reply
Retr0id 2 days ago
stddef.h (et al) should be shipped by the compiler itself, and so it should know where it is. But they rely on gcc for it, hence it doesn't always know where to look. Seems totally fine for a prototype.
reply
vidarh 2 days ago
Especially given they're not shipping anything. The GCC binaries can't find misplaced or not installed headers either.
reply
josefx 5 hours ago
Shipping GPL headers that explicitly state that they are part of GCC with a creative commons licensed compiler would probably make a lot of people rather unhappy, possibly even lawyers.
reply
d_silin 2 days ago
Would you accept the same quality of implementation from a human team?
reply
dzaima 23 hours ago
I've certainly encountered clang & gcc not finding or just not having header files a good couple times. Mostly around cross-compilation, but there was a period of time for which clang++ just completely failed to find any C++ headers on my system.
reply
fooker 22 hours ago
Yes, clang is famously in this category.

If you copy the clang binary to a random place in your filesystem, it will fail to compile programs that include standard headers.

reply
vidarh 2 days ago
A compiler that can't magically know how to find headers that don't exist in the expected directory?

Yes, that is the case for pretty much every compiler. I suppose you could build the headers into the binary, but nobody does that.

reply
tekne 23 hours ago
Consider: content-addressed headers.
reply
vidarh 12 hours ago
Then you might as well embed the headers, since in that case you can't update the compiler and headers separately anyway.
reply
IshKebab 11 hours ago
I guess you've heard of https://www.unison-lang.org/
reply
HendrikHensen 14 hours ago
Noticed the part where the exact instructions from the Readme were followed and it didn't work?
reply
vidarh 13 hours ago
So we've gone from implications that the compiler didn't work down to a missing or unclear description of a dependency in a README (note that following the instructions worked for others).
reply
mojuba 23 hours ago
Well I'm pretty sure the author can make a compliant C compiler in a few more sectors.
reply
vidarh 9 hours ago
I mean we know it can be done in little space, given the many tiny C compilers. I think what is most interesting about this one is exactly the creative shortcuts. It's an interesting design space for e.g. bootstrapping to impose extra restrictions.
reply
sanufar 2 days ago
The way hashing is used for tokens and for making a pseudo symbol table is such an elegant idea.
reply
fix4fun 2 days ago
I think the same. Really nice project and good trick with hashing tokens.

PS. There are 21 bytes left (21 * 0x00 - from 0x01e0 to 0x01fd). Maybe something can be packed there ;)

reply
avadodin 14 hours ago
I actually "shipped" a parser using the symbols' hash (as the only identifier) for a test tool once. Hopefully, the users never used enough symbols to cause a 32-bit collision.
reply
benj111 9 hours ago
I've had the idea before. Was never quite brave enough to do it. It's elegant until it isn't....
reply
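A small sketch of the "elegant until it isn't" failure mode: if the hash is the only identifier, two different names that hash alike silently share one variable. The hash below is an atoi-style 16-bit fold in the spirit of the article; the table and the names `token_hash`, `set_var`, and `get_var` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

static int16_t vars[65536];  /* one slot per possible 16-bit hash */

/* atoi-style fold with no digit validation, wrapping at 16 bits. */
static uint16_t token_hash(const char *s) {
    uint16_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        unsigned char c = (unsigned char)*s;
        n = (uint16_t)(n * 10u + (unsigned)(c - '0'));
    }
    return n;
}

/* The hash is the only key: names themselves are never stored. */
static void set_var(const char *name, int16_t v) { vars[token_hash(name)] = v; }
static int16_t get_var(const char *name) { return vars[token_hash(name)]; }
```

With this particular hash, `ab` and `bX` both map to 540 ('a' - '0' is 49, 'b' - '0' is 50, 'X' - '0' is 40, so 49*10 + 50 == 50*10 + 40), so writing one overwrites the other with no diagnostic.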
shikaan 15 hours ago
Such a great read! Reminds me of the bootsector OS I made some time ago [1]

Maybe it's time to equip it with a C compiler...

[1]: https://github.com/shikaan/osle

reply
drob518 4 hours ago
Brilliant! I love the stealing of Forth ideas to power this. Forth’s minimalism is highly underrated.
reply
alittlebee 5 hours ago
This is really beautiful (I feel like this sort of project is outsider art), thank you for sharing.
reply
kreelman 19 hours ago
There seems to be a good amount of interest for a boot sector compiler!!

If you're running on Linux, adjust the qemu call to use alsa rather than coreaudio.

I generated a pull request for this on Github. If the author is happy enough with my verbose shell scripting style :-) it might get included.

reply
fooker 22 hours ago
This is so cool!

Fun fact, Tiny C Compiler was derived from such a C compiler submitted to the International Obfuscated C Code Contest.

https://www.ioccc.org/2001/bellard/index.html

reply
xorvoid 20 hours ago
Further Fun fact, that submission was called OTCC. I reverse engineered it and that provided inspiration for SectorC.

https://xorvoid.com/otcc_deobfuscated.html https://github.com/xorvoid/otcc_deobfuscated

reply
pseudohadamard 18 hours ago
Meh, I did an entire awk interpreter in two lines:

  #!/bin/sh
  echo "awk: bailing out" >&2
reply
hgs3 6 hours ago
Great read. It would be neat to see a mini operating system under 1 kb of code.
reply
zahlman 15 hours ago
> Big Insight #2 is that atoi() behaves as a (bad) hash function on ordinary text. It consumes characters and updates a 16-bit integer.

I could have sworn I remembered atoi() being defined to return 0 for invalid input (i.e. text not representing an integer in base ten).

reply
MobiusHorizons 8 hours ago
That would be true of one using a libc, but in a boot sector you only have the BIOS, so the atoi being referenced is the one defined in C near the beginning of the article.
reply
zahlman 6 hours ago
Ah, I somehow skipped over that exact code block on first read.
reply
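For anyone else who skipped that block: a minimal sketch of an atoi-style loop that does no digit validation, which is what lets it double as a (bad) hash on ordinary text. This is an illustration under assumptions, not the article's exact code; the name `atoi_hash16` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: every byte is folded in as if it were a decimal
 * digit, and the accumulator wraps at 16 bits. Unlike libc atoi(), it
 * never stops at or rejects non-digit characters. */
static uint16_t atoi_hash16(const char *s) {
    uint16_t n = 0;
    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        unsigned char c = (unsigned char)*s;
        n = (uint16_t)(n * 10u + (unsigned)(c - '0'));
    }
    return n;
}
```

On real digit strings it behaves like atoi ("123" gives 123), while on a keyword such as "int" it just produces some 16-bit number (6388 here) that can be compared against a precomputed constant.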
userbinator 20 hours ago
C-subset, to be precise; but microcomputer C compilers were in the tens of KB range, for one that can actually compile real C.
reply
DeathArrow 16 hours ago
For me it's not interesting because it fits in 512 bytes, it's interesting because it's very simple. I think it would be a great introduction to learning about compilers.
reply
SeanSullivan86 2 days ago
Why is it called a C Compiler if it's a subset of C?
reply
userbinator 16 hours ago
[flagged]
reply
perching_aix 12 hours ago
Why is your visceral reaction to frame it as a quest for truth versus a great suppression of truth? Everything alright up there?

Literal second sentence in the article, in case it wasn't incredibly obvious to people anyways:

> It supports a subset of C that is large enough to write real and interesting programs.

I'm all for more boring headlines, but this characterization is ridiculous.

reply
userbinator 4 hours ago
I've had enough of headlines that overpromise and underdeliver. It's essentially false advertising. It's not like the word "subset" would put it over the length limit.
reply
wbsun 19 hours ago
Nice, now you can dd it to your boot sector and ... Wait, it is 2026, there are 1000 ways of booting and memory mapping on so-called unified ARM architecture @,@
reply
NooneAtAll3 2 days ago
> I wrote a fairly straight-forward and minimalist lexer and it took >150 lines of C code

was it supposed to be "<150"?

reply
owalt 2 days ago
They're saying the naive implementation was more than 150 lines of C code (300-450 bytes), i.e. too big.
reply
EGreg 2 days ago
Reminds me of Allegro SizeHack where we made games in 10KB - but we were using C and Allegro library!

https://www.oocities.org/trentgamblin/sizehack/entries.html#...

reply
gonzus 2 days ago
Lacking support for structs, I think this is too minimalistic to be called "a C compiler".
reply
pilord314 24 hours ago
you bootstrap it into a library you can include optionally, duh
reply
benj111 9 hours ago
Weren't structs a fairly late addition to C?

And anyway, isn't that kind of missing the point? 512 bytes isn't much. Your comment is nearly a fifth of that budget.

reply
userbinator 16 hours ago
[flagged]
reply
perching_aix 12 hours ago
> but it seems there are others here who don't want to speak of the truth

Or you know, just didn't get hung up on the blatantly obvious thing not being explicitly disclaimed right in the title, only in the preamble?

reply
userbinator 4 hours ago
Not telling the whole truth, little-by-little, this is how honesty crumbles.
reply
kayo_20211030 2 days ago
Nice. Very K&R-ish. Not a bad thing.
reply