The Problem: I found that learning resources for modern data engineering are often fragmented, scattered across hundreds of Medium articles and disjointed tutorials. It's hard to piece everything together into a coherent system.
The Solution: I decided to open-source my learning notes and build them into a structured book. My goal is to help developers fast-track their learning curve.
Key Features:
LLM-Centric: Focuses on data pipelines specifically designed for LLM training and RAG systems.
Scenario-Based: Instead of just listing tools, I compare different methods/architectures based on specific business scenarios (e.g., "When to use Vector DB vs. Keyword Search").
Hands-on Projects: Includes full code for real-world implementations, not just "Hello World" examples.
This is a work in progress, and I'm treating it as "Book-as-Code". I would love to hear your feedback on the roadmap or any "anti-patterns" I might have included!
Check it out:
Online: https://datascale-ai.github.io/data_engineering_book/
GitHub: https://github.com/datascale-ai/data_engineering_book
> The "Modern Data Stack" (MDS) is a hot concept in data engineering in recent years, referring to a cloud-native, modular, decoupled combination of data infrastructure
https://github.com/datascale-ai/data_engineering_book/blob/m...
Later parts are better and more to the point though: https://github.com/datascale-ai/data_engineering_book/blob/m...
Edit: perhaps I judged too early. The RAG section isn't bad either: https://github.com/datascale-ai/data_engineering_book/blob/m...
Oil[0] is fairly useless without being refined as well. Perhaps: "Data is the new oil, you need to refine it"?
I hope xx123122 won't mind my mentioning that they emailed us about this post, which originally got caught in a spam filter. I invited them to post a comment giving the background to the project but they probably haven't seen my reply yet. Hopefully soon, given that the post struck a chord!
Edit: they did, and I've moved that post to the toptext.
We've found keyword search (BM25) often beats semantic search for specific entity names/IDs, while vectors win on concepts. Do you cover hybrid search patterns/re-ranking in the book? That seems to be where most production systems end up.
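To make that concrete, here's a minimal sketch of reciprocal rank fusion (RRF), one common way to combine the two ranked lists; the doc IDs and the conventional k=60 constant are purely illustrative, not from the book:

    from collections import defaultdict

    def reciprocal_rank_fusion(ranked_lists, k=60):
        """Fuse several ranked result lists (e.g., BM25 and vector
        search) into one ranking. k=60 is the usual smoothing constant."""
        scores = defaultdict(float)
        for ranking in ranked_lists:
            for rank, doc_id in enumerate(ranking, start=1):
                scores[doc_id] += 1.0 / (k + rank)
        return sorted(scores, key=scores.get, reverse=True)

    # Toy usage: doc IDs as ranked by each retriever.
    bm25_hits   = ["doc_7", "doc_2", "doc_9"]  # exact entity/ID matches
    vector_hits = ["doc_2", "doc_4", "doc_7"]  # conceptual matches
    print(reciprocal_rank_fusion([bm25_hits, vector_hits]))
    # doc_2 and doc_7 rise to the top, since both retrievers agree on them

A cross-encoder re-ranker typically sits on top of the fused list after that.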
How is it possible that a Chinese publication gets to the top of HN?
We are pleasantly surprised by the warm reception. We know the project (and our English localization) is still a work in progress, but we are committed to improving it to meet the high standards of the HN community. We'll keep shipping updates!
I am a complete novice in training LLMs, and have been trying to train a novel architecture for Python code generation, using Apple Silicon.
I've been a bit frustrated, to be honest, that the data tools don't seem to have any focus on code; their modalities are generic text and images. And for synthetic data generation I would love to use EBNF-constrained outputs, but SGLang is not available on macOS. So I feel a bit stuck: downloading a large corpus of Python code, running into APFS issues, sharding, custom classifying, custom cleaning, custom mixing, etc. Maybe I've missed a tool, but I'm surprised there aren't pre-tagged, pre-categorized, pre-filtered datasets for code where I can just tune the curriculum/filters to feed into training.
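For what it's worth, my current stopgap is a streaming filter; a rough sketch assuming a Hugging Face corpus with a "content" field (the dataset name and data_dir layout are guesses; substitute whatever Python corpus you actually pulled):

    import ast
    from datasets import load_dataset  # pip install datasets

    # Stream so the full corpus never has to land on disk at once.
    # The dataset name, data_dir layout, and "content" field are
    # assumptions; swap in whatever code corpus you downloaded.
    ds = load_dataset("bigcode/the-stack-dedup",
                      data_dir="data/python",
                      split="train", streaming=True)

    def is_clean_python(example):
        src = example["content"]
        if len(src) > 100_000:        # crude size cutoff
            return False
        try:
            ast.parse(src)            # drop files that don't parse
        except (SyntaxError, ValueError):
            return False
        return True

    for ex in ds.filter(is_clean_python).take(3):
        print(ex["content"][:80])

The sharding and mixing are still manual from there, but at least each cleaning pass is a one-liner per filter.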