I'm not familiar with how most out of the box RAG systems categorize data, but with a database you can index content literally in any way you want. You could do it like a filesystem with hierarchy, you could do it tags, or any other design you can dream up.
The search can be keyword, like grep, or vector, like RAG, or use the ranking algorithms traditional text search uses (TF-IDF, BM25), or a combination of them. You don't have to use just the top X ranked documents; you could, just like grep, evaluate all results past whatever matching threshold you have.
Search is an extremely rich field with a ton of very good established ways of doing things. Going back to grep and a file system is going back to ... I don't know, the 60s level of search tech?
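For concreteness, here's a minimal from-scratch BM25 ranker - a toy sketch with made-up documents, not any particular search library's API - showing the kind of "established way of doing things" I mean:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against query terms with the BM25 formula."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # document frequency: how many docs contain each term
    df = Counter()
    for d in docs:
        for t in set(d):
            df[t] += 1
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(score)
    return scores

docs = [
    "the cat sat on the mat".split(),
    "dogs and cats living together".split(),
    "grep searches plain text files".split(),
]
print(bm25_scores(["cat", "mat"], docs))  # only the first doc scores > 0
```

Swap in a keyword pass, a vector pass, or both, and fuse the scores - that's the "combination" option.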
Empirically, agents (especially the coding CLIs) seem to be doing so much better with files, even if the tooling around them is less than ideal.
With other custom tools they instantly lose 50 IQ points, if they even bother using the tools in the first place.
E.g. for wikipedia the logical unit would likely be an article. For a book, maybe it's a chapter, or maybe it's a paragraph. You need to design the system around your content and feed the LLM an appropriate logically related set of data.
Oh but they do. These CLI agents are trained and specifically tuned to work with the filesystem. It’s not about the content or how it’s actually stored, it’s about the familiar access patterns.
I can’t begin to tell you how many times I’ve seen a coding agent figure out it can get some data directly from the filesystem instead of a dedicated, optimized tool it was specifically instructed to use for this purpose.
You basically can’t stop these things from messing with files, it’s in their DNA. You block one shell command, they’ll find another. Either revoke shell access completely or play whackamole. You cannot believe how badly they want to work with files.
They do. I highly suggest not trying to derive LLMs' behaviors (in your mind) from first principles, but actually using them.
The way I think of it, the main characteristic of agentic search is just that the agent can execute many types of adhoc queries
It’s not about a file system
As I understood it early RAG systems were all about performing that search for the agent - that’s what makes that approach “non agentic”
But when I have a database that has both embeddings and full text and you can query against both of those things and I let the agent execute whatever types of queries it wants - that’s “agentic search” in my book
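A toy sketch of what I mean - one store with both text and embeddings, one entry point, and the agent picks the query type per call. The corpus, 2-d vectors, and function names here are all made up for illustration:

```python
import math

# Hypothetical corpus holding both raw text and precomputed embeddings.
DOCS = [
    {"id": 1, "text": "reset your password from the settings page", "emb": [0.9, 0.1]},
    {"id": 2, "text": "billing and invoices are under the account tab", "emb": [0.1, 0.9]},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def search(kind, query):
    """One tool, two query types -- the agent chooses which per call."""
    if kind == "fulltext":
        return [d["id"] for d in DOCS if query.lower() in d["text"]]
    if kind == "vector":  # here `query` is already an embedding
        return [d["id"] for d in sorted(DOCS, key=lambda d: -cosine(query, d["emb"]))]
    raise ValueError(kind)

print(search("fulltext", "password"))  # [1]
print(search("vector", [0.2, 0.8]))    # [2, 1]
```

The point is the agent gets to decide, per question, whether a keyword hit or a similarity ranking is the right move.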
I'm working on a related challenge which is mounting a virtual filesystem with FUSE that mirrors my Mac's actual filesystem (over a subtree like ~/source), so I can constrain the agents within that filesystem, and block destructive changes outside their repo.
I have it so every repo has its own long-lived agent. They do get excited and start changing other repos, which messes up memory.
I didn't want to create a system user per repo because that's obnoxious, so I created a single claude system user, and I am using the virtual file system to manage permissions. My gmail repo's agent can for instance change the gmail repo and the google_auth repo, but it can't change the rag repo.
Edit: I'm publishing it here. It's still under development. https://github.com/sunir/bashguard
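The permission model boils down to something like this toy check (repo names borrowed from my example above; the policy map and function are hypothetical, the real enforcement happens in the FUSE layer):

```python
# Map each agent to the set of repos it may write to.
# The FUSE layer would consult this before allowing a mutating op.
POLICY = {"gmail": {"gmail", "google_auth"}}

def may_write(agent: str, path: str) -> bool:
    """Allow writes only inside the repos granted to this agent."""
    repo = path.lstrip("/").split("/", 1)[0]
    return repo in POLICY.get(agent, set())

print(may_write("gmail", "/gmail/src/main.py"))     # True
print(may_write("gmail", "/google_auth/token.py"))  # True
print(may_write("gmail", "/rag/index.py"))          # False
```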
Putting Chroma behind a FUSE adapter was my initial thought when I was implementing this but it was way too slow.
I think we would also need to optimize grep even if we had a FUSE mount.
This was easier in our case because we didn't need 100% POSIX compatibility for our read-only docs use case - the agent only used a subset of bash commands to traverse the docs anyway. This also avoids any extra infra overhead or maintenance of EC2 nodes/sandboxes that the agent would have to use.
The "ai" npm package includes a root-level docs folder containing .mdx versions of the docs from their site, specific to the version of the package. Their intended AI-assisted developer experience is that people discover and install their ai-sdk skill (via their npx skills tool, which supports discovery and install of skills from most any provider, not just Vercel). The SKILL.md instructs the agent to explicitly ignore all knowledge that may have been trained into its model, and to first use grep to look for docs in node_modules/ai/docs/ before searching the website.
https://github.com/vercel/ai/blob/main/skills/use-ai-sdk/SKI...
But the idea of spinning up a whole VM to use unix IO primitives is way overkill. Makes way more sense to let the agent spit out unix-like tool calls and then use whatever your prod stack uses to do IO.
Self-guided "grep on a filesystem" often beats RAG because it allows the LLM to run "closed loop" and iteratively refine its queries until it obtains results. Self-guided search loop is a superset of what methods like reranking try to do.
I don't think vector search and retrieval is dead, but the old-fashioned RAG is. Vector search would have to be reengineered to fit into the new agentic workflows, so that the advantages of agentic LLMs can compound with that of vector search - because in current day "grep vs RAG" matchups, the former is already winning on the agentic merits.
"Optimize grep-centric search" is a surprisingly reasonable stopgap in the meanwhile.
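The "closed loop" bit is the whole trick. A toy version of the loop, with a fake two-file corpus and a deliberately naive relaxation rule (drop the last term and retry) standing in for whatever refinement the model actually does:

```python
# Fake corpus: filename -> contents.
CORPUS = {
    "auth.md": "token refresh happens every 15 minutes",
    "billing.md": "invoices are generated monthly",
}

def grep(pattern):
    """Files whose contents contain the pattern, like a plain grep -l."""
    return [name for name, body in CORPUS.items() if pattern in body]

def search_loop(terms):
    """Try the full phrase first, then relax the query until something hits."""
    attempts = []
    while terms:
        pattern = " ".join(terms)
        hits = grep(pattern)
        attempts.append((pattern, hits))
        if hits:
            return hits, attempts
        terms = terms[:-1]  # drop the last term and retry
    return [], attempts

hits, attempts = search_loop(["token", "refresh", "hourly"])
print(hits)           # ['auth.md'] -- found on the second, relaxed attempt
print(len(attempts))  # 2
```

A one-shot retriever gets exactly one chance at the same query; the loop gets as many as it needs.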
    python -c "
    import json, wire, pathlib
    d = json.loads((pathlib.Path(wire.__file__).parent / 'assets/search_index.json').read_text())
    [print(e['title'], e['url']) for e in d if 'QUERY' in (e.get('body','') + e.get('title','')).lower()]
    "

    python -c "
    import json, wire, pathlib
    d = json.loads((pathlib.Path(wire.__file__).parent / 'assets/search_index.json').read_text())
    [print(e['body']) for e in d if e.get('url','') == 'PATH']
    "
My biggest success is a Roslyn method that takes a .NET solution and converts it into a SQLite database with Files, Lines, Symbols, and References tables. I've found this approach to perform substantially better than a flat, file-based setup (i.e., like what Copilot provides in Visual Studio), especially for very large projects. 100+ megs of source is no problem. The relational model enables some really elegant [non]queries that would otherwise require bespoke reflection tooling or a lot more tokens consumed.
Is the Roslyn method called as part of the build/publish?
In practice this allows me to combine multiple, complex data sources with a constant number of tools. I can add a whole new database and not add a new tool. My prompts are effectively empty aside from metadata around the handful of tools it has access to.
This only seems to perform well with powerful models right now. I've only seen it work with GPT5.x. But, when it does work it works at least as well as a human given access to the exact same tools. The bootstrapping behavior is extremely compelling. The way the LLM probes system tables, etc.
The tasks this provides the most uplift for are the hardest ones. Being able to make targeted queries over tables like references and symbols dramatically reduces the number of tokens we need to handle throughout. Fewer tokens means fewer opportunities for error.
AgentFS https://agentfs.ai/ https://github.com/tursodatabase/agentfs
Which sounds like a great idea, except that it uses NFS instead of FUSE (note that macFUSE now has an FSKit backend, so FUSE seems like the best solution for both Mac and Linux).
Am I crazy or is 850,000/month of anything...not really that much? Where are you spending all your CPU cycles and memory usage?
> ChromaFs is built on just-bash by Vercel Labs (shoutout Malte!), a TypeScript reimplementation of bash that supports grep, cat, ls, find, and cd
Oh.. I see.
Curious about the latency though. RAG is one round trip: embed query, fetch chunks, generate. This approach seems like it needs multiple LLM calls to navigate the tree before it can answer. How many hops does it typically take, and did you have to do anything special to keep response times reasonable?
But in the end, I would expect that you could add a skill / instructions on how to use chromadb directly
To be honest, I have no idea what chromadb is or how it works. But building an overlay FS seems like quite a lot of work.
That is why grep still beats it for code.
I generated visual schematic of every stage of the pipeline - https://vectree.io/c/retrieval-augmented-generation-embeddin...
We were bitten by our own nomenclature.
Just a small variation in chosen acronym ... may have wrought a different outcome.
Different ways to find context are welcome, we have a long way to go!
$70k?
How about if we round off one zero? Give us $7,000.
That number still seems to be very high.
It being dedicated, there are no limits on session lifetime, and it'd run 16 of those sessions no problem, so the real price should be around ~$70/year for that load.
I find it very hard to believe that a human designed their process around a "Daytona Sandbox" (whatever the fuck that is) at 100x markup over simply renting a VPS (a DO droplet is what, $6/m? $5/m?) and either containerising it or using FreeBSD with jails.
I'm looking at their entire design and thinking that, if I needed to do some stuff like this, I'd either go with a FUSE-based design or (more flexible) perform interceptions using LD_PRELOAD to catch exec, spawn, open, etc.
What sort of human engineer comes up with this sort of approach?
I don't know. There is that "just-bash" thing in typescript which they call "a reimplementation of bash that supports cat and cd".
The problem they solve, I think, is translating one query language (that of find and ripgrep) into that of their existing "db". The approach is hilarious of course.
It's "beyond engineering" :)
I also vibed a brainstorming note with my knowledge base system. The initial prompt:

"""when I read "We replaced RAG with a virtual filesystem for our AI documentation assistant (mintlify.com)" title on HackerNews - the discussion is about RAG, filesystems, databases, graphs - but maybe there is something more fundamental in how we structure the systems so that the LLM can find the information needed to answer a question. Maybe there is nothing new - people had elaborate systems in libraries even before computers - but maybe there is something. Semantic search sounds useful - but knowing which page to return might be nearly as difficult as answering the question itself - and what about questions that require synthesis from many pages? Then we have distillation - a table of contents is a kind of distillation targeting the task of search."""

Then I added a few more comments and the LLM linked the note with the other pages in my KB. I am documenting this because there were many voices saying LLM-generated content shouldn't be posted and that a prompt alone would be enough. IMHO the prompt is not enough, because the thought was also grounded in the whole theory I gathered in the KB. And that is also kind of on topic here.

Anyway - here is the vibed note: https://zby.github.io/commonplace/notes/charting-the-knowled...
They did not replace RAG, because they are still using chunks and embeddings. What they changed is the interface.
https://huggingface.co/docs/smolagents/en/examples/rag
> Agentic RAG: A More Powerful Approach
>
> We can overcome these limitations by implementing an Agentic RAG system - essentially an agent equipped with retrieval capabilities. This approach transforms RAG from a rigid pipeline into an interactive, reasoning-driven process.
The innovation of the blogpost is in the retrieval step.
I would have used FUSE if it got to that point, as then it is an actual filesystem.
Am I the only one who read this and thought this is fucking insane? Who in their right mind would even consider spinning up a virtual machine and cloning a repo on every search query? And if all you need is a real filesystem why would you emulate a filesystem on top of a database (Chroma)? If you need a filesystem just use an actual filesystem! This sounds like insane gymnastics just to fit a “serverless” workflow. 850,000 searches a month (less than 1 request per second) sounds like something a single raspberry pi or Mac Mini could handle.
Not to be "that guy" [0], but (especially for users who aren't already in ChromaDB) -- how would this be different for us from using a RAM disk?
> "ChromaFs is built on just-bash ... a TypeScript reimplementation of bash that supports grep, cat, ls, find, and cd. just-bash exposes a pluggable IFileSystem interface, so it handles all the parsing, piping, and flag logic while ChromaFs translates every underlying filesystem call into a Chroma query."
It sounds like the expected use-case is that agents would interact with the data via standard CLI tools (grep, cat, ls, find, etc), and there is nothing Chroma-specific in the final implementation (do I have that right?).
The author compares the speeds against the Chroma implementation vs. a physical HDD, but I wonder how the benchmark would compare against a Ramdisk with the same information / queries?
I'm very willing to believe that Chroma would still be faster / better for X/Y/Z reason, but I would be interested in seeing it compared, since for many people who already have their data in a hierarchical tree view, I bet there could be some massive speedups by mounting the memory directories in RAM instead of HDD.
Congratulations, you just reinvented Plan 9. I think we're going to end up reinventing a lot of things in computing that we discovered and then forgot about because Apple/Microsoft/Google couldn't monetize them, "because AI". And I don't know how to feel about that.
We’re rediscovering forms of search we’ve known about for decades. And it turns out they’re more interpretable to agents.
https://softwaredoug.com/blog/2026/01/08/semantic-search-wit...
The directory hierarchy is already a human-curated knowledge graph. We just forgot that because we got excited about vector math.
Everything is based on the metadata stored with chunks, just allowing the agent to navigate that metadata through ls, cd, find and grep.
I'll let you guess what the difference is between code and loosely structured text...
Your comment however definitely breaches several of them.
We started with LLMs when everyone in search was building question answering systems. Those architectures look like the vector DB + chunking we associate with RAG.
Agents' ability to call tools, using any retrieval backend, calls that into question.
We really shouldn’t start RAG with the assumption we need that. I’ll be speaking about the subject in a few weeks.
https://maven.com/p/7105dc/rag-is-the-what-agentic-search-is...
Now, the pendulum on that general concept seems to be swinging the opposite direction where a lot of those people just figured out that you don't need embeddings. That's true, but I'd suggest that people don't overindex on thinking that means embeddings are not actually useful or valuable. Embeddings can be downright magical in what you can build with them, they're just one more tool at your disposal.
You can mix and match these things, too! Indexing your documents into semantically nested folders for agents to peruse? Try chunking and/or summarizing each one, and putting the vectors in sidecar files, or even Yaml frontmatter. Disks are fast these days, you can rip through a lot of files indexed like that before you come close to needing something more sophisticated.
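A sketch of the sidecar idea - here with JSON sidecars rather than YAML frontmatter to keep it stdlib-only, and with made-up filenames and toy 2-d vectors in place of real embeddings:

```python
import json, math, pathlib, tempfile

# Each document gets a neighbouring .vec.json sidecar holding its embedding.
root = pathlib.Path(tempfile.mkdtemp())
(root / "intro.md").write_text("welcome to the project")
(root / "intro.md.vec.json").write_text(json.dumps([0.9, 0.1]))
(root / "billing.md").write_text("how invoices work")
(root / "billing.md.vec.json").write_text(json.dumps([0.1, 0.9]))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def nearest(query_vec):
    """Rip through every sidecar on disk and rank docs by similarity."""
    scored = []
    for vec_file in root.glob("*.vec.json"):
        vec = json.loads(vec_file.read_text())
        doc_name = vec_file.name.replace(".vec.json", "")
        scored.append((cosine(query_vec, vec), doc_name))
    return [name for _, name in sorted(scored, reverse=True)]

print(nearest([0.8, 0.2]))  # ['intro.md', 'billing.md']
```

No index server, no database - just files next to files, which is exactly the level of sophistication a lot of corpora need.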
Likely due to the rise in popularity of semantic search via LLM embeddings, which for some reason became the main selling point for RAG. Meanwhile keyword search has existed for decades.
I am active in fandoms and want to create a search where someone can ask "what was that fanfic where XYZ happened?" and get an answer back in the form of links to fanfiction that are responsive.
This is a RAG system, right? I understand I need an actual model (that's something like ollama), the thing that trawls the fanfiction archive and inserts whatever it's supposed to insert into one of these vector DBs, and I need a front-facing thing I write, that takes a user query, sends it to ollama, which can then search the vector DB and return results.
Or something like that.
Is it a RAG system that solves my use case? And if so, what software might I go about using to provide this service to me and my friends? I'm assuming it's pretty low in resource usage since it's just text indexing (maybe indexing new stuff once a week).
The goal is self-hosting. I don't wanna be making monthly payments indefinitely for some silly little thing I'm doing for me and my friends.
I am just a stay at home dad these days and don't have anyone to ask. I'm totally out the tech game for a few years now. I hope that you could respond (or someone else could), and maybe it will help other people.
There's just so many moving parts these days that I can't even hope to keep up. (It's been rather annoying to be totally unable to ride this tech wave the way I've done in the past; watching it all blow by me is disheartening).
The important part here is that you now don’t have to compare strings anymore (like looking for occurrences of the word "fanfiction" in the title and content), but instead you can perform arbitrary mathematical operations to compare query embeddings to stored embeddings: 1 is closer to 3 than 7, and in the same way, fanfiction is closer to romance than it is to biography. Now, if you rank documents by that proximity and take the top 10 or so, you end up with the documents most similar to your query, and thus the most relevant.
That is the R in RAG; the A as in Augmentation happens when, before forwarding the search query to an LLM, you also add all results that came back from your vector database with a prefix like "the following records may be relevant to answer the users request", and that brings us to G like Generation, since the LLM now responds to the question aided by a limited set of relevant entries from a database, which should allow it to yield very relevant responses.
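To make the three letters concrete, here's a toy end-to-end flow - the store, the fake 2-d "embeddings", and the prompt wording are all invented for illustration, and a stub stands in for the actual model:

```python
import math

# Fake vector store: (text, embedding) pairs.
STORE = [
    ("Fic where the crew bakes bread", [0.9, 0.2]),
    ("Fic where dragons attend law school", [0.1, 0.95]),
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def retrieve(query_vec, k=1):
    """R: rank stored entries by proximity to the query embedding."""
    ranked = sorted(STORE, key=lambda item: -cosine(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]

def augment(question, passages):
    """A: prepend the retrieved records to the user's question."""
    context = "\n".join(passages)
    return ("The following records may be relevant:\n"
            f"{context}\n\nQuestion: {question}")

prompt = augment("What was that fic with the dragons?", retrieve([0.2, 0.9]))
print(prompt)
# G: this prompt would now go to the LLM for generation.
```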
I hope this helps :-)
You can run Claude Code against a local instance of ~recent Ollama just fine, and it'll do the teaching job perfectly well using (say) Qwen 3.5.
Doesn't even need to be one of the large models, one of the mid-size ones that fit in ~16GB of ram when given 128k+ context size should be fine.
Paying $20/m sounds like overkill. I have tabs open for all of the most well-known AI chatbots. Despite trying my hardest, I find it impossible to exhaust the free options just by learning.
Hell, just on the chatbots alone, small projects can be vibe-coded too! No $20/m necessary.
Sure, but that wasn't what you recommended Codex for, was it?
>>> Honestly, just from this question, I think you know enough that I’d go spend $20/month for a subscription to Codex, Claude Code, or Cursor, and ask them to teach you all this.
I think this turned out to be one of those lessons about premature optimization. It didn't need to be as complex as what people initially assumed. Perhaps with older models it would have been a different story.
Why would the size of your docs have any bearing on whether or not the chunking process works? That makes no sense. Unless of course they're operating on the document entirely in memory which seems not very bright unless you're very confident of the maximum size of document you're going to be dealing with.
(I implemented a RAG process from scratch a few weeks ago, having never done so before. For our use case it's actually not that hard. Not trivial, but not that hard. I realise there are now SaaS RAG solutions but we have almost no budget and, in any case, data residence is a huge concern for us, and to get control of that you generally have to go for the expensive Enterprise tier.)
Not all problems have to be solved. We just fell back to using older, more proven technology, started with the simplest implementation and iterated as needed, and the result was great.
Less useful in other contexts, unless you move away from traditional chunked embeddings and into things like graphs where the relationships provide constraints as much as additional grounding
It's basically the same thing as Google's inverted index, which is how Google search works.
Nothing new under the sun :)
1: https://github.com/VectifyAI/PageIndex
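The inverted index itself fits in a few lines - this is the generic textbook structure, not Google's actual implementation:

```python
from collections import defaultdict

# Toy corpus.
docs = {
    "a.txt": "grep beats rag for code",
    "b.txt": "rag uses embeddings",
    "c.txt": "code search with grep",
}

# Invert it: map each term to the set of documents containing it.
index = defaultdict(set)
for name, body in docs.items():
    for term in body.split():
        index[term].add(name)

def lookup(*terms):
    """Documents containing every query term (intersect posting lists)."""
    postings = [index[t] for t in terms]
    return sorted(set.intersection(*postings)) if postings else []

print(lookup("grep", "code"))  # ['a.txt', 'c.txt']
```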
https://x.com/wibomd/status/1818305066303910006
Pixar got this right in Ralph Wrecks The Internet.
https://x.com/wibomd/status/1827067434794127648
It's obvious from that sentence that these guys neither understand RAG nor realize that the solution to their agentic problem didn't need any of these further abstractions, including vector or grep
Having said that, generally agree that keyword searching via rg and using the folder structure is easier and better.
It depends on the task, no? Codebase RAG for example has arguably a different setup than text search. I wonder how much the FS-"native" embedding would help.