We replaced RAG with a virtual filesystem for our AI documentation assistant
395 points by denssumesh 3 days ago | 148 comments

softwaredoug 2 days ago
The real thing I think people are rediscovering with filesystem-based search is that there's a type of semantic search that isn't embedding-based retrieval: one that looks more like how a librarian organizes books onto shelves by domain.

We’re rediscovering forms of search we’ve known about for decades. And it turns out they’re more interpretable to agents.

https://softwaredoug.com/blog/2026/01/08/semantic-search-wit...

reply
neuzhou 2 days ago
Agreed. I've been working on a codebase with 400+ Python files and the difference is stark. With embedding-based RAG, the agent kept pulling irrelevant code snippets that happened to share vocabulary. Switched to just letting the agent browse the directory tree and read files on demand -- it figured out the module structure in about 30 seconds and started asking for the right files by path.

The directory hierarchy is already a human-curated knowledge graph. We just forgot that because we got excited about vector math.

reply
jimmySixDOF 2 days ago
There are a lot of methods in IR/RAG that maintain structure as metadata, used in hybrid fusion to augment search. Graph databases are an extreme form, but some RAG pipelines extract the metadata and embed it together with the chunk. In the specific case of code, other layered approaches like ColGrep (late interaction) show promise... the point is that most search, most of the time, will benefit from a combination approach, not a silver bullet.
reply
jimbokun 2 days ago
Just like the approach in the article.

Everything is based on the metadata stored with chunks, just allowing the agent to navigate that metadata through ls, cd, find and grep.

reply
jahala 18 hours ago
I built tilth (https://github.com/jahala/tilth) for much this reason. Couldn't be bothered with RAG, but the agents kept using too many tokens -- and too many turns -- finding what they needed. So I combined ripgrep and tree-sitter and some fiddly bits, and now agents find things faster and with ~40% less token use (benchmarked).
reply
siva7 2 days ago
> Switched to just letting the agent browse the directory tree and read files on demand -- it figured out the module structure in about 30 seconds

Guess what the difference is between code and loosely structured text...

reply
huflungdung 2 days ago
[dead]
reply
pertymcpert 2 days ago
[flagged]
reply
phs318u 2 days ago
The parent may or may not be AI-generated or AI-edited. As such, it MAY breach one of the HN commenting guidelines.

Your comment however definitely breaches several of them.

reply
holoduke 2 days ago
indeed. moltbook vibes
reply
mikkupikku 2 days ago
I'd rather read a hundred comments like that than one more like yours.
reply
wielebny 2 days ago
Someone simply assumed at some point that RAG must be based on vector search, and everyone followed.
reply
softwaredoug 2 days ago
It’s something of a historical accident

We started with LLMs when everyone in search was building question answering systems. Those architectures look like the vector DB + chunking we associate with RAG.

Agents' ability to call tools, using any retrieval backend, calls that into question.

We really shouldn’t start RAG with the assumption we need that. I’ll be speaking about the subject in a few weeks

https://maven.com/p/7105dc/rag-is-the-what-agentic-search-is...

reply
TeMPOraL 2 days ago
Right. R in RAG stands for retrieval, and for a brief moment initially, it meant just that: any kind of tool call that retrieves information based on query, whether that was web search, or RDBMS query, or grep call, or asking someone to look up an address in a phone book. Nothing in RAG implies vector search and text embeddings (beyond those in the LLM itself), yet somehow people married the acronym to one very particular implementation of the idea.
reply
macNchz 2 days ago
Yeah there's a weird thing where people would get really focused on whether something is "actually doing RAG" when it's pulling in all sorts of outside information, just not using some kind of purpose built RAG tooling or embeddings.

Now, the pendulum on that general concept seems to be swinging the opposite direction where a lot of those people just figured out that you don't need embeddings. That's true, but I'd suggest that people don't overindex on thinking that means embeddings are not actually useful or valuable. Embeddings can be downright magical in what you can build with them, they're just one more tool at your disposal.

You can mix and match these things, too! Indexing your documents into semantically nested folders for agents to peruse? Try chunking and/or summarizing each one, and putting the vectors in sidecar files, or even Yaml frontmatter. Disks are fast these days, you can rip through a lot of files indexed like that before you come close to needing something more sophisticated.
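The sidecar idea is easy to prototype. Here's a minimal sketch; the hashed bag-of-words `embed` is a toy stand-in for a real embedding model, and the `.vec.json` naming is just one arbitrary convention:

```python
import json, math, hashlib, pathlib

DIM = 64

def embed(text):
    # Toy stand-in for a real embedding model: normalized hashed bag-of-words.
    v = [0.0] * DIM
    for word in text.lower().split():
        h = int(hashlib.md5(word.encode()).hexdigest(), 16)
        v[h % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def index_file(path, chunk_size=200):
    # Write one sidecar .vec.json next to the document, one vector per chunk.
    text = path.read_text()
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sidecar = path.parent / (path.name + ".vec.json")
    sidecar.write_text(json.dumps(
        [{"offset": i * chunk_size, "vec": embed(c)} for i, c in enumerate(chunks)]
    ))

def search(root, query, top_k=3):
    # Brute-force cosine scan over every sidecar file under root.
    q = embed(query)
    hits = []
    for sidecar in root.rglob("*.vec.json"):
        for entry in json.loads(sidecar.read_text()):
            score = sum(a * b for a, b in zip(q, entry["vec"]))
            hits.append((score, str(sidecar), entry["offset"]))
    return sorted(hits, reverse=True)[:top_k]
```

A linear scan like this is exactly the "disks are fast" point: no vector DB needed until the corpus gets large.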

reply
viktor_von 2 days ago
> yet somehow people married the acronym to one very particular implementation of the idea.

Likely due to the rise in popularity of semantic search via LLM embeddings, which for some reason became the main selling point for RAG. Meanwhile keyword search has existed for decades.

reply
oceansky 2 days ago
I'm still using the old definition, never got the memo.
reply
adfm 2 days ago
That’s OK. Most got ReST wrong, too.
reply
rafterydj 2 days ago
Stuck it on my calendar, looking forward to it.
reply
KPGv2 2 days ago
You seem like someone who knows what they're doing, and I understand the theoretical underpinnings of LLMs (math background), but I have little kids that were born in 2016 and so the entire AI thing has left me in the dust. Never any time to even experiment.

I am active in fandoms and want to create a search where someone can ask "what was that fanfic where XYZ happened?" and get an answer back in the form of links to fanfiction that are responsive.

This is a RAG system, right? I understand I need an actual model (that's something like ollama), the thing that trawls the fanfiction archive and inserts whatever it's supposed to insert into one of these vector DBs, and I need a front-facing thing I write, that takes a user query, sends it to ollama, which can then search the vector DB and return results.

Or something like that.

Is it a RAG system that solves my use case? And if so, what software might I go about using to provide this service to me and my friends? I'm assuming it's pretty low in resource usage since it's just text indexing (maybe indexing new stuff once a week).

The goal is self-hosting. I don't wanna be making monthly payments indefinitely for some silly little thing I'm doing for me and my friends.

I am just a stay at home dad these days and don't have anyone to ask. I'm totally out the tech game for a few years now. I hope that you could respond (or someone else could), and maybe it will help other people.

There's just so many moving parts these days that I can't even hope to keep up. (It's been rather annoying to be totally unable to ride this tech wave the way I've done in the past; watching it all blow by me is disheartening).

reply
9dev 2 days ago
In the definition of RAG discussed here, that means the workflow looks something like this (simplified for brevity): When you send your query to the server, it will first normalise the words, then convert them to vectors, or embeddings, using an embedding model (there are also plain stochastic mechanisms to do this, but today most people mean a purpose-built LLM). An embedding is essentially an array of numeric coordinates in a high-dimensional space, so [1, 2.522, …, -0.119]. It can now use that to search a database of arbitrary documents with pre-generated embeddings of their own. This usually happens when inserting them into the database, and follows the same process as your search query above, so every record in the database has its own, discrete set of embeddings to be queried during searches.

The important part here is that you now don’t have to compare strings anymore (like looking for occurrences of the word "fanfiction" in the title and content), but instead you can perform arbitrary mathematical operations to compare query embeddings to stored embeddings: 1 is closer to 3 than 7, and in the same way, fanfiction is closer to romance than it is to biography. Now, if you rank documents by that proximity and take the top 10 or so, you end up with the documents most similar to your query, and thus the most relevant.

That is the R in RAG; the A as in Augmentation happens when, before forwarding the search query to an LLM, you also add all results that came back from your vector database with a prefix like "the following records may be relevant to answer the users request", and that brings us to G like Generation, since the LLM now responds to the question aided by a limited set of relevant entries from a database, which should allow it to yield very relevant responses.
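That ranking step can be sketched in a few lines. The vectors below are made-up toy values standing in for real model embeddings:

```python
import math

def cosine(a, b):
    # Cosine similarity: dot product of the two vectors over their magnitudes.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

# Pre-generated document embeddings (toy values for illustration).
docs = {
    "fanfic_about_dragons": [0.9, 0.1, 0.0],
    "romance_oneshot":      [0.7, 0.6, 0.1],
    "biography_of_turing":  [0.0, 0.2, 0.9],
}

def top_k(query_vec, k=2):
    # Rank documents by proximity to the query embedding, keep the top k.
    ranked = sorted(docs, key=lambda d: cosine(query_vec, docs[d]), reverse=True)
    return ranked[:k]
```

The top-k results are what get prepended to the LLM prompt in the augmentation step.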

I hope this helps :-)

reply
johnathandos 2 days ago
I think the example you give is a little backwards — a RAG system searches for relevant content before sending anything to the LLM, and includes any content retrieved this way in the generative prompt. User query -> search -> results -> user query + search results passed in same context to LLM.
reply
senordevnyc 2 days ago
Honestly, just from this question, I think you know enough that I’d go spend $20/month for a subscription to Codex, Claude Code, or Cursor, and ask them to teach you all this. I bet if you put in your comment verbatim with Opus 4.6 and went back and forth a bit, it could help you figure out exactly what you need and build a first version in a couple hours. Seriously, if you know the fundamentals and can poke and prod, these tools are amazing for helping expand your knowledge base. And constraints like how much you want to pay are excellent for steering the models. Seriously, just try it!
reply
justinclift 20 hours ago
You don't need to pay an external crowd for that.

You can run Claude Code against a ~recent local instance of Ollama just fine, and it'll do the teaching job perfectly well using (say) Qwen 3.5.

It doesn't even need to be one of the large models; one of the mid-size ones that fit in ~16GB of RAM when given 128k+ context should be fine.

reply
lelanthran 2 days ago
> Honestly, just from this question, I think you know enough that I’d go spend $20/month for a subscription to Codex, Claude Code, or Cursor, and ask them to teach you all this.

Paying $20/m sounds like overkill. I have tabs open for all of the most well-known AI chatbots, and despite trying my hardest, I haven't been able to exhaust the free options just by learning.

Hell, just on the chatbots alone, small projects can be vibe-coded too! No $20/m necessary.

reply
senordevnyc 23 hours ago
Yeah, but when it comes to actually building stuff, using Codex is night and day different from using ChatGPT.
reply
lelanthran 21 hours ago
> Yeah, but when it comes to actually building stuff, using Codex is night and day different from using ChatGPT.

Sure, but that wasn't what you recommended Codex for, was it?

>>> Honestly, just from this question, I think you know enough that I’d go spend $20/month for a subscription to Codex, Claude Code, or Cursor, and ask them to teach you all this.

reply
safety1st 2 days ago
We were given a demo of a vector based approach, and it didn't work. They said our docs were too big and for some reason their chunking process was failing. So we ended up using a good old fashioned Elastic backend because that's what we know, and simply forwarding a few of these giant documents to the LLM verbatim along with the user's question. The results have been great, not a single complaint about accuracy, results are fast and cheap using OpenAI's micro models, Elastic is mature tech everyone understands so it's easy to maintain.

I think this turned out to be one of those lessons about premature optimization. It didn't need to be as complex as what people initially assumed. Perhaps with older models it would have been a different story.

reply
bartread 2 days ago
> They said our docs were too big and for some reason their chunking process was failing.

Why would the size of your docs have any bearing on whether or not the chunking process works? That makes no sense. Unless of course they're operating on the document entirely in memory which seems not very bright unless you're very confident of the maximum size of document you're going to be dealing with.

(I implemented a RAG process from scratch a few weeks ago, having never done so before. For our use case it's actually not that hard. Not trivial, but not that hard. I realise there are now SaaS RAG solutions but we have almost no budget and, in any case, data residence is a huge concern for us, and to get control of that you generally have to go for the expensive Enterprise tier.)

reply
safety1st 2 days ago
I agree it makes no sense. The whole point of chunking is to handle large documents. If your chunking system fails because a document is too big, that seems like a pretty glaring omission. I just chalked it up to the tech being new and novel and therefore having more bugs/people not fully understanding how it worked/etc. It was a vendor and they never gave us more details.

Not all problems have to be solved. We just fell back to using older, more proven technology, started with the simplest implementation and iterated as needed, and the result was great.

reply
ivanovm 2 days ago
I don't think this was a simple assumption. LLMs used to be much dumber! GPT-3-era LLMs were not good at grep, they were not that good at recovering from errors, and they were not good at making follow-up queries over multiple turns of search. Multiple breakthroughs in code generation, tool use, and reasoning had to happen on the model side to make vector-based RAG look like unnecessary complexity.
reply
bluegatty 2 days ago
It was the terminology that did that more than anything. The term 'RAG' just has a lot of consequential baggage. Unfortunately.
reply
darkteflon 2 days ago
Certainly a lot of blog posts followed. Not sure that “everyone” was so blinkered.
reply
morkalork 2 days ago
Doesn't have to be tho, I've had great success letting an agent loose on an Apache Lucene instance. Turns out LLMs are great at building queries.
reply
graemefawcett 2 days ago
RAG is like when you want someone to know something they're not quite getting so you yell a bit louder. For a workflow that's mainly search based, it's useful to keep things grounded.

Less useful in other contexts, unless you move away from traditional chunked embeddings and into things like graphs where the relationships provide constraints as much as additional grounding

reply
woah 2 days ago
My intuition is that since AI assistants are fictional characters in a story being autocompleted by an LLM, mechanisms that are interpretable as human interactions with language and appear in the pretraining data have a surprising advantage over mechanisms that are more like speculation about how the brain works or abstract concepts.
reply
reactordev 2 days ago
This is also why LLMs get 80% of the way there and crap out on logic. They were trained on all the open source abandonware on GitHub.
reply
andai 2 days ago
I spent a while working on a retrieval system for LLMs and ended up reinventing a concordance (which is like an index).

It's basically the same thing as Google's inverted index, which is how Google search works.

Nothing new under the sun :)
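For anyone who hasn't built one: a toy inverted index is only a few lines, and Boolean queries fall out as set operations. A simplified sketch, nothing like Google's actual implementation:

```python
from collections import defaultdict

# Concordance / inverted index: each term maps to the set of documents
# that contain it. A Boolean AND query is then just a set intersection.
index = defaultdict(set)

def add_document(doc_id, text):
    for term in set(text.lower().split()):
        index[term].add(doc_id)

def query_and(*terms):
    sets = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*sets) if sets else set()
```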

reply
czhu12 2 days ago
Similar effort with PageIndex [1], which basically creates a table-of-contents-like tree. Then an LLM traverses the tree to figure out which chunks are relevant for the context in the prompt.

1: https://github.com/VectifyAI/PageIndex

reply
manunamz 23 hours ago
Exactly. Traditional library science truly captured deep patterns of information architecture.

https://x.com/wibomd/status/1818305066303910006

Disney got this right in Ralph Breaks the Internet.

https://x.com/wibomd/status/1827067434794127648

reply
khalic 2 days ago
This kind of circles back to ontological NLP, which used knowledge representation as a primitive for language processing. There is _a ton_ of work in that direction.
reply
softwaredoug 2 days ago
Exactly. And LLMs supervised by domain experts unlock a lot of capabilities to help with these types of knowledge organization problems.
reply
siva7 2 days ago
> Our documentation was already indexed, chunked, and stored in a Chroma database to power our search, so we built ChromaFs

It's obvious from that sentence that these guys neither understand RAG nor realized that the solution to their agentic problem didn't need any of these further abstractions, vector or grep included.

reply
stingraycharles 2 days ago
Aren’t most successful RAGs using a combination of embedding similarity + BM25 + reranking? I thought there were very few RAGs that only did pure embedding similarity, but I may be mistaken.
reply
rao-v 2 days ago
I've got to say, people also seem to be missing really simple tricks with RAG that help. Using longer chunks and appending the file path to the chunk makes a big difference.

Having said that, generally agree that keyword searching via rg and using the folder structure is easier and better.
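The path-appending trick, sketched (the chunk size and separator here are arbitrary choices, not from the comment):

```python
# Prefix each chunk with its source path before indexing/embedding, so
# directory and file names (often the best topical signal) participate
# in matching.
def make_chunks(path, text, size=500):
    return [f"{path}\n\n{text[i:i + size]}" for i in range(0, len(text), size)]
```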

reply
3abiton 2 days ago
> I got to say people also seem to be missing really simple tricks with RAG that help. Using longer chunks and appending the file path to the chunk makes a big difference.
>
> Having said that, generally agree that keyword searching via rg and using the folder structure is easier and better.

It depends on the task, no? Codebase RAG, for example, arguably has a different setup than text search. I wonder how much an FS-"native" embedding would help.

reply
danelliot 2 days ago
[dead]
reply
skeptrune 2 days ago
I think it's cool that LLMs can effectively do this kind of categorization on the fly at relatively large scale. When you give the LLM tools beyond just "search", it really is effectively cheating.
reply
baby 2 days ago
Yep, I was using RAG for all sorts of stuff and now moved everything to just rg+fd+cd+ls, much faster, easier, etc.
reply
_boffin_ 2 days ago
And next, we’ll get to tag based file systems
reply
UltraSane 2 days ago
Inverted indexes have the major advantage of supporting Boolean operators.
reply
risyachka 2 days ago
More and more often you see "new discoveries" that are very old concepts. The only discovery that usually happens is that the author discovers the concept for himself. But nowadays it's apparently essential to post it as if you'd discovered something new.
reply
whattheheckheck 2 days ago
Turns out the millions of people in knowledge work aren't librarians and they wing shit everywhere.
reply
tensor 2 days ago
This is one of the most confusing claims I've seen in a long time. Grep and friends over files would be the equivalent of old-fashioned keyword search, whereas most RAG uses vector search. But everything else they claim about a file system just suggests that they don't know anything about databases.

I'm not familiar with how most out-of-the-box RAG systems categorize data, but with a database you can index content literally any way you want. You could do it like a filesystem with hierarchy, you could do it with tags, or any other design you can dream up.

The search can be keyword, like grep, or vector, like RAG, or use the ranking algorithms traditional text search uses (tf-idf, BM25), or a combination of them. You don't have to use just the top X ranked documents; you could, just like grep, evaluate all results past whatever matching threshold you have.

Search is an extremely rich field with a ton of very good established ways of doing things. Going back to grep and a file system is going back to ... I don't know, the 60s level of search tech?
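For the curious, tf-idf (which BM25 refines with term saturation and length normalization) fits in a short function. A simplified sketch, not any particular library's implementation:

```python
import math
from collections import Counter

def tfidf_scores(query, docs):
    # Score each document against the query: term frequency in the document,
    # weighted by a smoothed inverse document frequency across the corpus.
    n = len(docs)
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    df = Counter()  # document frequency per term
    for toks in tokenized.values():
        df.update(set(toks))
    scores = {}
    for d, toks in tokenized.items():
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            if term in tf:
                idf = math.log((n + 1) / (df[term] + 1)) + 1  # smoothed idf
                score += (tf[term] / len(toks)) * idf
        scores[d] = score
    return scores
```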

reply
brap 2 days ago
I get what you’re saying, and you’re right, however I can also see where they’re coming from:

Empirically, agents (especially the coding CLIs) seem to be doing so much better with files, even if the tooling around them is less than ideal.

With other custom tools they instantly lose 50 IQ points, if they even bother using the tools in the first place.

reply
tensor 2 days ago
Sorry, this still makes no sense. LLMs don't care about files. The way most coding systems work is that they simply provide the whole file to the LLM rather than a subset of it. That's just a choice in how you implement your RAG search system and database. In this case the "record" is big: a file. No doubt that works for code, but it's nonsensical outside of it.

E.g. for wikipedia the logical unit would likely be an article. For a book, maybe it's a chapter, or maybe it's a paragraph. You need to design the system around your content and feed the LLM an appropriate logically related set of data.

reply
brap 2 days ago
>LLMs don't care about files.

Oh but they do. These CLI agents are trained and specifically tuned to work with the filesystem. It’s not about the content or how it’s actually stored, it’s about the familiar access patterns.

I can’t begin to tell you how many times I’ve seen a coding agent figure out it can get some data directly from the filesystem instead of a dedicated, optimized tool it was specifically instructed to use for this purpose.

You basically can’t stop these things from messing with files, it’s in their DNA. You block one shell command, they’ll find another. Either revoke shell access completely or play whackamole. You cannot believe how badly they want to work with files.

reply
raincole 2 days ago
> LLMs don't care about files

They do. I highly suggest not try to derive LLMs' behaviors (in your mind) from first principles, but actually use them.

reply
darkteflon 2 days ago
Yeah, some of the uplift people are anecdotally seeing from “just using the filesystem” is, imo, on account of how difficult it is to take a principled approach to pre-chunking when implementing other approaches.
reply
girvo 2 days ago
They've been RLHF'd to the nth degree around working with *nix tools and filesystems, in practice.
reply
pertymcpert 2 days ago
They do care about files. They also care about how you express yourself, your tone, all sorts of seemingly unimportant details.
reply
pjm331 2 days ago
Yeah I’ve had a lot of success with agentic search against a database.

The way I think of it, the main characteristic of agentic search is just that the agent can execute many types of ad hoc queries.

It’s not about a file system

As I understood it early RAG systems were all about performing that search for the agent - that’s what makes that approach “non agentic”

But when I have a database that has both embeddings and full text and you can query against both of those things and I let the agent execute whatever types of queries it wants - that’s “agentic search” in my book
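A minimal sketch of that setup, assuming SQLite with FTS5 (standard CPython builds include it). The single read-only query function is the hypothetical tool surface handed to the agent; a vector column could sit alongside for hybrid queries:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Full-text index over the docs; the agent composes its own queries against it.
conn.executescript("""
    CREATE VIRTUAL TABLE docs USING fts5(title, body);
    INSERT INTO docs VALUES
        ('intro',  'retrieval augmented generation with vector search'),
        ('howto',  'configuring grep and ripgrep for code search'),
        ('design', 'database schema design notes');
""")

def run_query_tool(sql):
    """The single tool handed to the agent: ad hoc, read-only queries."""
    if not sql.lstrip().lower().startswith("select"):
        raise ValueError("read-only tool: SELECT only")
    return conn.execute(sql).fetchall()
```

The agent can then issue whatever `SELECT ... MATCH ...` it likes, including BM25-ranked ones via `ORDER BY rank`.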

reply
darkteflon 2 days ago
Absolutely, agentic search is much more robust to the specific implementation details of your search setup (data quality issues, too) than the early one-shot approaches were. Anyone watching Claude Code work can see this for themselves.
reply
thefourthchime 2 days ago
I didn't get into the details too much, but I kept thinking, why isn't he just having an agent discover things from various data sources? I've had much better success with that.
reply
jimbokun 24 hours ago
Isn’t this the approach described in the article?
reply
dboreham 22 hours ago
Also odd in that most filesystems implement directories and file names as...a database. You can use a filesystem as a database but you're not being as clever as you thought.
reply
sunir 2 days ago
I am really enjoying this renaissance in CLI world applications. There's so much possible.

I'm working on a related challenge which is mounting a virtual filesystem with FUSE that mirrors my Mac's actual filesystem (over a subtree like ~/source), so I can constrain the agents within that filesystem, and block destructive changes outside their repo.

I have it so every repo has its own long-lived agent. They do get excited and start changing other repos, which messes up memory.

I didn't want to create a system user per repo because that's obnoxious, so I created a single claude system user, and I am using the virtual file system to manage permissions. My gmail repo's agent can for instance change the gmail repo and the google_auth repo, but it can't change the rag repo.

Edit: I'm publishing it here. It's still under development. https://github.com/sunir/bashguard

reply
slp3r 2 days ago
This feels like massive overengineering just to bypass naive chunking. Emulating a POSIX shell in TS on top of ChromaDB to do hierarchical search is going to destroy your TTFT. Every ls and grep the agent decides to run is a separate inference cycle. You're just trading RAG context-loss for severe multi-step latency
reply
stuaxo 2 days ago
Could totally have FUSE over the chunks, and then there is no shell emulation.
reply
girvo 2 days ago
I'll be honest that's what I expected to read about!
reply
MeetRickAI 2 days ago
[dead]
reply
Galanwe 2 days ago
I am not familiar with the tech stack they use, but from an outsider point of view, I was sort of expecting some kind of fuse solution. Could someone explain why they went through a fake shell? There has to be a reason.
reply
skeptrune 2 days ago
100% agree a FUSE mount would be the way to go given more time and resources.

Putting Chroma behind a FUSE adapter was my initial thought when I was implementing this but it was way too slow.

I think we would also need to optimize grep even if we had a FUSE mount.

This was easier in our case because we didn't need 100% POSIX compatibility for our read-only docs use case; the agent only used a subset of bash commands to traverse the docs anyway. This also avoids any extra infra overhead or maintenance of EC2 nodes/sandboxes that the agent would have to use.

reply
darkteflon 2 days ago
Did you guys look at Firecracker-based options such as E2B and Fly.io? We’ve had positive early results on latency, but yeah … too early to tell where we end up on cost.
reply
skeptrune 2 days ago
Yea we did and actually use Daytona for another product, but it would have been too slow here.
reply
readitalready 2 days ago
Yah, my Claude Code agents run a ton of Python and bash scripts. You're probably missing out on a lot of tool-use cases without full POSIX compatibility.
reply
skeptrune 2 days ago
agreed. hopefully we can get there soon
reply
Galanwe 2 days ago
Makes sense, thanks for clarifying!
reply
nlawalker 2 days ago
Relative to making docs accessible to AI via filesystem tools, I've been looking around to see what kinds of patterns SDK authors are using to get AI coding agents to use the freshest documentation, and Vercel is doing something interesting with their AI SDK that I haven't seen elsewhere (although maybe I just haven't looked hard enough).

The "ai" npm package includes a root-level docs folder containing .mdx versions of the docs from their site, specific to the version of the package. Their intended AI-assisted developer experience is that people discover and install their ai-sdk skill (via their npx skills tool, which supports discovery and install of skills from most any provider, not just Vercel). The SKILL.md instructs the agent to explicitly ignore all knowledge that may have been trained into its model, and to first use grep to look for docs in node_modules/ai/docs/ before searching the website.

https://github.com/vercel/ai/blob/main/skills/use-ai-sdk/SKI...

reply
pboulos 2 days ago
I think this is a great approach for a startup like Mintlify. I do have skepticism around how practical this would be in some of the “messier” organisations where RAG stands to add the most value. From personal experience, getting RAG to work well in places where the structure of the organisation and the information contained therein is far from hierarchical or partition-able is a very hard task.
reply
khalic 2 days ago
The use case is well defined here, let’s not jump the gun. Text search, like with code, is a relatively simple problem compared to intrinsic semantic content in a book for example. I think the moral here is that RAG is not a silver bullet, the claude code team came to the same conclusion.
reply
pboulos 2 days ago
I agree with your assessment.
reply
dominotw 2 days ago
> the claude code team came to the same conclusion.

github copilot uses rag

reply
skeptrune 2 days ago
Modern OCR tooling is quite good. If the knowledge you are adding into your search database is able to be OCR'd then I think the approach we took here is able to be generalized.
reply
GandalfHN 2 days ago
[flagged]
reply
seanlinehan 2 days ago
This is definitely the way. There are good use cases for real sandboxes (if your agent is executing arbitrary code, you'd better have it do so in an air-gapped environment).

But the idea of spinning up a whole VM just to use unix IO primitives is way overkill. It makes way more sense to let the agent spit out unix-like tool calls and then use whatever your prod stack uses to do IO.

reply
skeptrune 2 days ago
100% agree. However, if there were no resource tradeoffs, then a FUSE mount would probably be the way to go.
reply
benlm 24 hours ago
We use both a virtual file system and RAG; they each excel in different areas. The trick with RAG is the quality of the data: we use an LLM to chunk into semantically cohesive sections, as well as to generate metadata (including fact triples and links to other related chunks in the document) for every chunk and for the document as a whole. We use Voyage contextual embeddings to then embed each chunk with the document and chunk metadata. Works incredibly well. At retrieval time the agent can follow chunk links if needed, as well as analyze the raw file in the vfs. High-quality instruction-based reranking helps a lot too! We are often looking over tens of thousands of documents, and it'd be very inefficient to have our agents analyze just the vfs without RAG.
reply
benlm 24 hours ago
Our vfs is pretty powerful too, though: it is all backed by Postgres and then projected into files/directories for our agents. They get basic grep etc., but also optimized FTS tools for BM25, jq, and preview tools that show representative slices of large documents. All on top of Pydantic AI.
reply
verall 21 hours ago
Hey, I tried to check out your website but I'm getting a Cloudflare error page, code 520.
reply
ACCount37 2 days ago
Traditional RAG is a poor fit for this generation of LLMs, because it doesn't fit the "agentic tool use" workflow at all.

Self-guided "grep on a filesystem" often beats RAG because it allows the LLM to run "closed loop" and iteratively refine its queries until it obtains results. A self-guided search loop is a superset of what methods like reranking try to do.

I don't think vector search and retrieval is dead, but old-fashioned RAG is. Vector search would have to be reengineered to fit into the new agentic workflows, so that the advantages of agentic LLMs can compound with those of vector search - because in current-day "grep vs RAG" matchups, the former is already winning on the agentic merits.

"Optimize grep-centric search" is a surprisingly reasonable stopgap in the meanwhile.
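The closed loop is roughly this shape. Here `agentic_search` stubs the model out with a fixed list of candidate queries; a real system would ask the LLM to propose the next refinement after inspecting each round of results:

```python
import re

# Tiny in-memory stand-in for a codebase the agent can grep over.
CORPUS = {
    "auth.py": "def login(user): ...",
    "db.py":   "def connect(dsn): ...",
}

def grep(pattern):
    # Return every file whose contents match the pattern.
    return {f: t for f, t in CORPUS.items() if re.search(pattern, t)}

def agentic_search(candidate_queries, max_turns=5):
    # Closed loop: try a query, inspect results, refine, repeat.
    # The candidate list stands in for the LLM proposing refinements.
    for term in candidate_queries[:max_turns]:
        hits = grep(term)
        if hits:  # the model would inspect hits and decide to stop or refine
            return term, hits
    return None, {}
```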

reply
emson 2 days ago
This is interesting, as there is definitely a middle ground for agent memory. On the openclaw side you have a single MEMORY.md file; on the other, you have RAG and GraphRAG. I wonder if agent memory should be more nuanced? When an agent learns something, how should it promote or demote these memory blocks - you don't want a trading agent memorising a bad trading pattern, for example. Also, the agent might want to recall semantically similar memories, but it might also want to retrieve block relationships or groups of blocks for different purposes. We've been exploring all these concepts with "elfmem" (sELF improving MEMory): https://github.com/emson/elfmem Would love your feedback!
reply
chelm 23 hours ago
RAG provided me no way to read the content myself. I now integrate the knowledge into a static page that I can read and edit myself in Markdown. Similar to MkDocs. But after I edit the content or remove elements that are no longer true, I build a JSON file and tell the agent how to query this source.

python -c "
import json, wire, pathlib
d = json.loads((pathlib.Path(wire.__file__).parent / 'assets/search_index.json').read_text())
[print(e['title'], e['url']) for e in d if 'QUERY' in (e.get('body','') + e.get('title','')).lower()]
"

python -c "
import json, wire, pathlib
d = json.loads((pathlib.Path(wire.__file__).parent / 'assets/search_index.json').read_text())
[print(e['body']) for e in d if e.get('url','') == 'PATH']
"

https://wire.wise-relations.com/use-cases/replace-rag/

reply
jdthedisciple 2 days ago
But SQLite is notoriously 35% faster than the filesystem [0], so why not use that?

[0] https://news.ycombinator.com/item?id=14550060

reply
bob1029 2 days ago
SQLite + GPT5.4 works very well for me.

My biggest success is a Roslyn method that takes a .NET solution and converts it into a SQLite database with Files, Lines, Symbols, and References tables. I've found this approach to perform substantially better than a flat, file-based setup (i.e., like what Copilot provides in Visual Studio), especially for very large projects. 100+ megs of source is no problem. The relational model enables some really elegant [non]queries that would otherwise require bespoke reflection tooling or a lot more tokens consumed.
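The extraction itself would be C# Roslyn code, but the relational shape can be sketched with Python's sqlite3 (table names taken from the comment; the columns and sample query are hypothetical):

```python
import sqlite3

# Schema mirroring the comment: Files, Lines, Symbols, References.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Files(id INTEGER PRIMARY KEY, path TEXT);
CREATE TABLE Lines(file_id INT, line_no INT, text TEXT);
CREATE TABLE Symbols(id INTEGER PRIMARY KEY, name TEXT, file_id INT, line_no INT);
CREATE TABLE "References"(symbol_id INT, file_id INT, line_no INT);
""")
con.execute("INSERT INTO Files VALUES (1, 'Billing/Invoice.cs'), (2, 'Api/Controller.cs')")
con.execute("INSERT INTO Symbols VALUES (10, 'Invoice.Total', 1, 42)")
con.execute('INSERT INTO "References" VALUES (10, 2, 7)')

# "Find every call site of a symbol" becomes one query instead of
# bespoke reflection tooling or a token-heavy file scan.
rows = con.execute("""
    SELECT f.path, r.line_no
    FROM "References" r
    JOIN Symbols s ON s.id = r.symbol_id
    JOIN Files f ON f.id = r.file_id
    WHERE s.name = 'Invoice.Total'
""").fetchall()
```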

reply
deskamess 21 hours ago
Interesting... what is the use case for the AI that is querying it? Is it developing additional features for integration with your app, or do you have some other use case - code review, audit, debugging, etc.? For AI developing against an API, I would think an OpenAPI JSON file would do the trick.

Is the Roslyn method called as part of the build/publish?

reply
superjan 24 hours ago
Is this something you can share in more detail? Did you document a skill for the LLM to use? And with what tasks do you see most improvement?
reply
bob1029 5 hours ago
The point of this is to reduce a complex tool surface to a single SQL query tool without losing the richness of the underlying representation.

In practice this allows me to combine multiple, complex data sources with a constant number of tools. I can add a whole new database and not add a new tool. My prompts are effectively empty aside from metadata around the handful of tools it has access to.

This only seems to perform well with powerful models right now; I've only seen it work with GPT5.x. But when it does work, it works at least as well as a human given access to the exact same tools. The bootstrapping behavior is extremely compelling: the way the LLM probes system tables, etc.

The tasks this provides the most uplift for are the hardest ones. Being able to make targeted queries over tables like References and Symbols dramatically reduces the number of tokens we need to handle throughout. Fewer tokens means fewer opportunities for error.
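A minimal sketch of that single-tool setup; `run_sql` is a hypothetical name for the one tool the model sees, and the read-only guard here is deliberately simplistic:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE docs(id INTEGER PRIMARY KEY, title TEXT)")
con.execute("INSERT INTO docs VALUES (1, 'Getting started')")

def run_sql(query: str) -> list[tuple]:
    # The single tool exposed to the model. A real deployment would use
    # a read-only connection or an authorizer, not a prefix check.
    if not query.lstrip().lower().startswith("select"):
        raise ValueError("read-only tool")
    return con.execute(query).fetchall()

# Bootstrapping: the model can discover the schema itself by probing
# system tables, so adding a new database adds zero new tools.
schema = run_sql("SELECT name, sql FROM sqlite_master WHERE type='table'")
rows = run_sql("SELECT title FROM docs")
```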

reply
tomComb 2 days ago
And Turso has built a Virtual Filesystem on top of their SQLite.

AgentFS https://agentfs.ai/ https://github.com/tursodatabase/agentfs

Which sounds like a great idea, except that it uses NFS instead of FUSE (note that macFUSE now has an FSKit backend, so FUSE seems like the best solution for both Mac and Linux).

reply
kenforthewin 2 days ago
I don't get it - everybody in this thread is talking about the death of vector DBs and files being all you need. The article clearly states that this is a layer on top of their existing Chroma db.
reply
jimbokun 24 hours ago
I assumed they had other use cases where vector search was required.
reply
dominotw 2 days ago
what value is chromadb adding in that setup
reply
skeptrune 2 days ago
Yeah, ChromaDB is not the point - multiple data storage solutions work.
reply
kenforthewin 2 days ago
I see... so you're not using the vectors at all. Where are the evaluations showing this ChromaFs approach performs better than vectors?
reply
skeptrune 2 days ago
Working on publishing those, but publishing benchmarks requires a lot of attention to detail so it will likely be a bit longer.
reply
petcat 24 hours ago
> At 850,000 conversations a month, even a minimal setup (1 vCPU, 2 GiB RAM, 5-minute session lifetime) would put us north of $70,000 a year

Am I crazy or is 850,000/month of anything...not really that much? Where are you spending all your CPU cycles and memory usage?

> ChromaFs is built on just-bash by Vercel Labs (shoutout Malte!), a TypeScript reimplementation of bash that supports grep, cat, ls, find, and cd

Oh.. I see.

reply
vlmutolo 23 hours ago
If you give every agent an isolated container to use, you’re going to be paying for the reserved memory while the container is active, even if the agent isn’t doing anything.
reply
tylergetsay 2 days ago
I don't understand the additional complexity of mocking bash when they could just provide grep, ls, find, etc. as tools to the LLM.
reply
skeptrune 2 days ago
I agree that would have been the way to go given more time and resources. However, setting up a FUSE mount would have taken significantly longer and required additional infrastructure.
reply
wahnfrieden 2 days ago
Agents are trained on bash grep/ls/find, not on tool-calling grep/ls/find.
reply
MeetRickAI 2 days ago
[dead]
reply
kangraemin 2 days ago
This is essentially tool use with a filesystem interface — the LLM decides what to read instead of a retrieval pipeline choosing for it. Clean idea, and it sidesteps the chunking problem entirely.

Curious about the latency though. RAG is one round trip: embed query, fetch chunks, generate. This approach seems like it needs multiple LLM calls to navigate the tree before it can answer. How many hops does it typically take, and did you have to do anything special to keep response times reasonable?

reply
jimbokun 24 hours ago
In their case it was competing with cloning an entire repo before starting a session, which was taking tens of seconds.
reply
namxam 2 days ago
And you did not teach it to access Chroma directly because there is no adapter? Or because it is so much better at using FS tooling?

But in the end, I would expect that you could add a skill / instructions on how to use ChromaDB directly.

To be honest, I have no idea what ChromaDB is or how it works. But building an overlay FS seems like quite a lot of work.

reply
dmix 2 days ago
This puts a lot of LLM in front of the information discovery. That would require far more sophisticated prompting and guardrails. I'd be curious to see how people architect an LLM->document approach with tool calling, rather than RAG->reranker->LLM. I'm also curious what the response times are like since it's more variable.
reply
skeptrune 2 days ago
Hmmm, the post is an attempt to explain that Mintlify migrated from embedding-retrieval->reranker->LLM to an agent loop with access to call POSIX tools as it desires. Perhaps we didn't provide enough detail?
reply
dmix 2 days ago
That matches what I'm curious about. Where an LLM is doing the bulk of information discovery and tool calling directly. Most simpler RAGs have an LLM on the frontend mostly just doing simpler query clean up, subqueries and taxonomy, then again later to rerank and parse the data. So I'd imagine the prompting and guardrails part is much more complicated in an agent loop approach, since it's more powerful and open ended.
reply
shaial 2 days ago
The title says you replaced RAG, but ChromaFs is still querying Chroma on every command — you replaced RAG's interface, not RAG itself. Which is actually the more interesting finding: the retrieval was never the bottleneck, the abstraction was. Agents don't need better search. They need `grep`.
reply
functional_dev 2 days ago
Exactly - embeddings destroy information: exact keywords, acronyms, etc. get squashed into floats.

That is why grep still beats them for code.

I generated visual schematic of every stage of the pipeline - https://vectree.io/c/retrieval-augmented-generation-embeddin...

reply
dangoldbj 2 days ago
I think the interesting bit here is that filesystems give the model something it can actually operate on (ls, grep, etc), not just query.
reply
bluegatty 2 days ago
RAG should never have been represented as a context tool, but rather as just vector querying - a variation of search/query - and that's it.

We were bitten by our own nomenclature.

Just a small variation in chosen acronym ... may have wrought a different outcome.

Different ways to find context are welcome, we have a long way to go!

reply
skeptrune 2 days ago
agreed!
reply
jiusanzhou 2 days ago
Clever use of just-bash to avoid the sandbox cold-start problem. The key insight here is that agents don't need a real filesystem — they need a familiar interface backed by whatever storage you already have. We're seeing the same pattern in coding agents: directory hierarchy turns out to be a surprisingly effective knowledge graph that LLMs navigate better than embedding-based retrieval, mostly because they've been heavily trained on shell interactions.
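A toy illustration of that pattern: the agent-facing commands are ordinary ls/cat/grep, but the "filesystem" is a dict standing in for whatever store already holds the data (this is a sketch, not ChromaFs's actual code):

```python
# Paths are keys in a backing store; there is no real filesystem.
STORE = {
    "docs/api/auth.md": "Use bearer tokens.",
    "docs/api/errors.md": "429 means rate limited.",
    "docs/guides/intro.md": "Welcome.",
}

def ls(prefix: str) -> list[str]:
    # List immediate children of a "directory" by splitting stored keys.
    prefix = prefix.rstrip("/") + "/"
    children = {k[len(prefix):].split("/")[0] for k in STORE if k.startswith(prefix)}
    return sorted(children)

def cat(path: str) -> str:
    return STORE[path]

def grep(pattern: str, prefix: str = "") -> list[str]:
    # Return paths whose content matches; in ChromaFs each of these calls
    # would instead be translated into a database query.
    return sorted(k for k, v in STORE.items() if k.startswith(prefix) and pattern in v)
```

Swapping the dict for a database client changes none of the interface the agent sees, which is the point: familiar commands, arbitrary storage.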
reply
mandeepj 2 days ago
> even a minimal setup (1 vCPU, 2 GiB RAM, 5-minute session lifetime) would put us north of $70,000 a year based on Daytona's per-second sandbox pricing ($0.0504/h per vCPU, $0.0162/h per GiB RAM)

$70k?

how about if we round off one zero? Give us $7000.

That number still seems to be very high.

reply
all2 2 days ago
At that point I would buy an old mini PC off of ebay and just put it on my desk.
reply
lstodd 2 days ago
Hm. I think a dedicated 16-core box with 64 GB of RAM can be had for under $1000/year.

It being dedicated, there are no limits on session lifetime, and it'd run 16 of those sessions no problem, so the real price should be around ~$70/year for that load.

reply
lelanthran 2 days ago
It looks to me like someone spent a long back-and-forth with an LLM refining a design - everything they wrote screams "over-engineered, lots of moving parts, creating tiny little sub-problems that need to then be solved".

I find it very hard to believe that a human designed their process around a "Daytona Sandbox" (whatever the fuck that is) at 100x markup over simply renting a VPS (a DO droplet is what, $6/m? $5/m?) and either containerising it or using FreeBSD with jails.

I'm looking at their entire design and thinking that, if I needed to do some stuff like this, I'd either go with a FUSE-based design or (more flexible) perform interceptions using LD_PRELOAD to catch exec, spawn, open, etc.

What sort of human engineer comes up with this sort of approach?

reply
lstodd 18 hours ago
> What sort of human engineer comes up with this sort of approach?

I don't know. There is that "just-bash" thing in TypeScript which they call "a reimplementation of bash that supports cat and cd".

The problem they solve, I think, is translating one query language (that of find and ripgrep) into that of their existing "db". The approach is hilarious, of course.

It's "beyond engineering" :)

reply
samixg 3 hours ago
cool, but where're the benchmarks?
reply
maille 2 days ago
Let's say I want a free, local (or free-tier-LLM), simple solution to search information mostly from my emails and a little bit from text, doc, and pdf files. Are there any tools I should try so that Ollama or Gemini can reply using my own knowledge base?
reply
ghywertelling 2 days ago
https://onyx.app/

This could be useful.

reply
maille 2 days ago
Are you using it? I will definitely give it a shot - any pointers to online resources would be appreciated.
reply
zbyforgotpass 2 days ago
I don't know - we are discussing techniques (information in files, in a semantic database, in a relational database) as if there were one approach that could dominate all information access. But finding the right information is not one task: if the needed information is a summary of expenses over a period of time, the best source will be a relational database; if it is who heads the HR department in a particular company, it can probably be easily found on the company intranet pages (which are a kind of graph database). It does not really matter much whether the searcher is a human or an LLM - there are some differences in speed, in usable context length, and in the fact that LLMs are amnesiac - but these are just parameters. The task for humans is immensely complicated, there is no single architecture, and there will not be one for LLMs.

I also vibed a brainstorming note with my knowledge base system. The initial prompt: """when I read "We replaced RAG with a virtual filesystem for our AI documentation assistant (mintlify.com)" title on HackerNews - the discussion is about RAG, filesystems, databases, graphs - but maybe there is something more fundamental in how we structure the systems so that the LLM can find the information needed to answer a question. Maybe there is nothing new - people had elaborate systems in libraries even before computers - but maybe there is something. Semantic search sounds useful - but knowing which page to return might be nearly as difficult as answering the question itself - and what about questions that require synthesis from many pages? Then we have distillation - an table of content is a kind of distillation targeting the task of search. """ Then I added a few more comments and the llm linked the note with the other pages in my kb. I am documenting that - because there were many voices against posting LLM generated content and that a prompt will be enough. IMHO the prompt is not enough - because the thought was also grounded in the whole theory I gathered in the KB. And that is also kind of on topic here. Anyway - here is the vibed note: https://zby.github.io/commonplace/notes/charting-the-knowled...

reply
dust42 2 days ago
If grep and ls do the trick, then sure, you don't need RAG/embeddings. But you also don't need an LLM: a full-text search in a database will be faster and use fewer resources.
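For reference, the database-side full-text search described here is only a few lines with SQLite's FTS5 (assuming your Python build ships with FTS5 enabled, which CPython's usually does):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE VIRTUAL TABLE docs USING fts5(title, body)")
con.executemany(
    "INSERT INTO docs VALUES (?, ?)",
    [("Auth", "Rotate API keys every 90 days."),
     ("Billing", "Invoices are emailed monthly.")],
)

# BM25-ranked keyword search: no embeddings or LLM in the loop.
rows = con.execute(
    "SELECT title FROM docs WHERE docs MATCH ? ORDER BY bm25(docs)",
    ("keys",),
).fetchall()
```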
reply
nithril 2 days ago
The HN headline is misleading: it does not reflect the actual article title, which is much closer to what they truly did.

They did not replace RAG - they are still using chunks and embeddings. What they changed is the interface.

reply
tschellenbach 2 days ago
I think generally we are going from vector-based search to agentic tool use and hierarchy-based systems like skills.
reply
ghywertelling 2 days ago
Agents doing retrieval has been around for quite a while

https://huggingface.co/docs/smolagents/en/examples/rag

> Agentic RAG: A More Powerful Approach
> We can overcome these limitations by implementing an Agentic RAG system - essentially an agent equipped with retrieval capabilities. This approach transforms RAG from a rigid pipeline into an interactive, reasoning-driven process.

The innovation of the blogpost is in the retrieval step.

reply
skeptrune 2 days ago
Vector search has moved from a "complete solution" to just one tool among many which you should likely provide to an agent.
reply
kjgkjhfkjf 2 days ago
Seems like it would be simpler to give the agent tools to issue ChromaDB (or SQL) queries directly, rather than giving the LLM unix-like tools that are converted into queries under the hood using a complicated proprietary setup.
reply
r1290 2 days ago
Agree with this. Is the current state of things just because LLMs are trained on filesystem interactions? And a year from now do we move to the DB?
reply
stuaxo 2 days ago
Oh that's funny, I just built a RAG, and exposing the files inside the database as files seemed like the next logical step.

I would have used Fuse if it got to that point as then it is an actual filesystem.

reply
znnajdla 24 hours ago
> The obvious way to do this is to just give the agent a real filesystem. Most harnesses solve this by spinning up an isolated sandbox and cloning the repo. We already use sandboxes for asynchronous background agents where latency is an afterthought, but for a frontend assistant where a user is staring at a loading spinner, the approach falls apart. Our p90 session creation time (including GitHub clone and other setup) was ~46 seconds.

Am I the only one who read this and thought this is fucking insane? Who in their right mind would even consider spinning up a virtual machine and cloning a repo on every search query? And if all you need is a real filesystem why would you emulate a filesystem on top of a database (Chroma)? If you need a filesystem just use an actual filesystem! This sounds like insane gymnastics just to fit a “serverless” workflow. 850,000 searches a month (less than 1 request per second) sounds like something a single raspberry pi or Mac Mini could handle.

reply
ahstilde 20 hours ago
buy off the shelf: https://archil.com/
reply
fudged71 2 days ago
Since the solution here doesn't appear to be open source, I think you can get something similar by asking your agents to take AgentFS and replace the DB with ChromaDB.
reply
HanClinto 2 days ago
> "The agent doesn't need a real filesystem; it just needs the illusion of one. Our documentation was already indexed, chunked, and stored in a Chroma database to power our search, so we built ChromaFs: a virtual filesystem that intercepts UNIX commands and translates them into queries against that same database. Session creation dropped from ~46 seconds to ~100 milliseconds, and since ChromaFs reuses infrastructure we already pay for, the marginal per-conversation compute cost is zero."

Not to be "that guy" [0], but (especially for users who aren't already in ChromaDB) -- how would this be different for us from using a RAM disk?

> "ChromaFs is built on just-bash ... a TypeScript reimplementation of bash that supports grep, cat, ls, find, and cd. just-bash exposes a pluggable IFileSystem interface, so it handles all the parsing, piping, and flag logic while ChromaFs translates every underlying filesystem call into a Chroma query."

It sounds like the expected use-case is that agents would interact with the data via standard CLI tools (grep, cat, ls, find, etc), and there is nothing Chroma-specific in the final implementation (? Do I have that right?).

The author compares the speed of the Chroma implementation against a physical HDD, but I wonder how the benchmark would compare against a ramdisk with the same information/queries.

I'm very willing to believe that Chroma would still be faster / better for X/Y/Z reason, but I would be interested in seeing it compared, since for many people who already have their data in a hierarchical tree view, I bet there could be some massive speedups from mounting the directories in RAM instead of on an HDD.

[0] - https://news.ycombinator.com/item?id=9224
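A rough harness for that comparison: point the corpus directory at a RAM-backed mount (e.g. /dev/shm on Linux) and at a disk path, then time the same scan on each (the substring scan is a stand-in for the real agent tool calls):

```python
import pathlib, tempfile, time

def build_corpus(root: pathlib.Path, n: int = 200) -> None:
    # Synthetic docs so the harness is self-contained.
    for i in range(n):
        (root / f"doc_{i}.md").write_text(f"chunk {i} about topic {i % 7}")

def timed_grep(root: pathlib.Path, needle: str) -> tuple[float, list[str]]:
    # Time a naive grep over every file under root.
    start = time.perf_counter()
    hits = sorted(p.name for p in root.glob("*.md") if needle in p.read_text())
    return time.perf_counter() - start, hits

# Run once with root on /dev/shm and once on a disk directory,
# then compare the elapsed times.
root = pathlib.Path(tempfile.mkdtemp())
build_corpus(root)
elapsed, hits = timed_grep(root, "topic 3")
```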

reply
skeptrune 2 days ago
We would also be super interested to see that comparison. I agree that there isn't a specific reason why Chroma would be required to build something like this.
reply
siliconc0w 2 days ago
I'm working on a filesystem for agents for similar reasons - https://clawfs.dev - lmk if your team would like an invite..
reply
devops000 2 days ago
Why not a simple full-text search in Postgres?
reply
r1290 2 days ago
Right. Give the agent permissions to search across certain tables. I wonder why reinventing file-based access for the DB side is the current fad?
reply
jrm4 2 days ago
Is this related to that thing where somehow the entire damn world forgot about the power of boolean (and other precise) searching?
reply
bitwize 2 days ago
What if... each agent had its own virtual file system, and anything the agent needed to access was accessible as files in the filesystem?

Congratulations, you just reinvented Plan 9. I think we're going to end up reinventing a lot of things in computing that we discovered and then forgot about because Apple/Microsoft/Google couldn't monetize them, "because AI". And I don't know how to feel about that.

reply
badgersnake 2 days ago
So you did GraphRAG but your graph is a filesystem tree.
reply
yieldcrv 2 days ago
I love the multipronged attack on RAG

RIP RAG: lasted one year as a skillset that recruiters would list on job descriptions, collectively shut down by industry professionals

reply
ctxc 2 days ago
haha, sweet. One of the cooler things I've read lately
reply