Making Deep Learning Go Brrrr from First Principles
35 points by tosh 3 hours ago | 14 comments
jdw64 2 hours ago
Right now, all I know how to do is pull models from Hugging Face, but someday I want to build my own small LLM from scratch
replykflansburg 41 minutes ago
If you aren't already aware, Karpathy has several videos that could get you there in a few hours https://www.youtube.com/@AndrejKarpathy
replymax-amb 30 minutes ago
If you want a written resource I have a blog post about the mathematics behind building a feed forward from scratch, https://max-amb.github.io/blog/the_maths_behind_the_mlp/. Kinda focuses on translation from individual components to matrix operations.
replynoosphr 2 hours ago
>For example, getting good performance on a dataset with deep learning also involves a lot of guesswork. But, if your training loss is way lower than your test loss, you're in the "overfitting" regime, and you're wasting your time if you try to increase the capacity of your model.
replyappplication 2 hours ago
Generally, posting a link-only reply without further elaboration comes across as a bit rude. Are you providing support for the above point? Refuting it? You felt compelled to comment, a few words to indicate what you’re actually trying to say would go a long way.
replynoosphr 2 hours ago
>We show that a variety of modern deep learning tasks exhibit a "double-descent" phenomenon where, as we increase model size, performance first gets worse and then gets better.
replyForceBru 18 minutes ago
Right, isn't double descent one of the reasons why modern Extremely Large Language Models work at all? I think I heard somewhere that basically all today's "smart" (reasoning, solving math problems, etc) LLMs are trained in the "double descent" territory (whatever this means, I'm not entirely sure).
reply
wild
"The I7-4770K and preform 20k more Flops than C++" is an equally sensible statement (i.e. not)
The rest will be from "python float" (e.g. not from numpy) to C, which gives you already 2 to 3 order of magnitude difference, and then another 2 to 3 from plan C to optimized SIMD.
See e.g. https://github.com/Avafly/optimize-gemm for how you can get 2 to 3 order of magnitude just from C.