One interesting thing I got in the replies is the Unison language (content-addressed functions; a function is defined by its AST). I also recommend checking out the Dion language demo (an experimental project which stores the program as an AST).
In general I think there's a missing piece between text and storage. Structural editing is likely a dead end, writing text seems superior, but storage format as text is just fundamentally problematic.
I think we need a good bridge that allows editing via text, but storage like a structured database (I'd go as far as to say a relational database, maybe). This would unlock a lot of IDE-like features for simple programmatic usage, or manipulating language semantics in some interesting ways, but the challenge is of course how to keep the mapping between the textual input and the stored structure in shape.
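To make that concrete, here's a rough sketch (Python, using sqlite3 and the stdlib ast module; the schema and column names are made up for illustration) of what "code stored as a relational database" could look like: parse once, then answer IDE-ish questions with plain queries.

```
import ast
import sqlite3

# Minimal sketch: store a parsed Python module as rows in SQLite.
# The table layout is illustrative, not a real tool.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE nodes (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER,
        kind      TEXT,   -- e.g. 'FunctionDef', 'Name'
        name      TEXT    -- identifier, if the node has one
    )
""")

def store(node, parent_id=None):
    ident = getattr(node, "name", None) or getattr(node, "id", None)
    cur = conn.execute(
        "INSERT INTO nodes (parent_id, kind, name) VALUES (?, ?, ?)",
        (parent_id, type(node).__name__, ident),
    )
    node_id = cur.lastrowid
    for child in ast.iter_child_nodes(node):
        store(child, node_id)

store(ast.parse("def greet(who):\n    return 'hello ' + who\n"))

# An "IDE-like feature" as a plain query: list every function definition.
print(conn.execute("SELECT id, name FROM nodes WHERE kind = 'FunctionDef'").fetchall())
```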
> Dion language demo (experimental project which stores program as AST).
Michael Franz [1] invented slim binaries [2] for the Oberon System. Slim binaries were program (or module) ASTs compressed with some kind of LZ-family algorithm. At the time they were much smaller than Java's JAR files, despite JAR being a ZIP archive.
[1] https://en.wikipedia.org/wiki/Michael_Franz#Research
[2] https://en.wikipedia.org/wiki/Oberon_(operating_system)#Plug...
I believe that this storage format is still in use in Oberon circles.
Yes, I am that old, I even correctly remembered Franz's last name. I thought then he was and still think he is a genius. ;)
The Dion project was more about the user interface to the programming language, and about unifying tools to use the AST (or typed AST?) as the source of truth instead of text, and what that unlocks.
Dion demo is here: https://vimeo.com/485177664
Their system allows for intermediate states with errors. If that erroneous state can be stored to disk, they are using a storage representation that is equivalent to text. If the erroneous state cannot be stored, that makes the Dion system much less usable, at least for me.
They also deliberately avoided the pitfalls of languages like C. That is their prerogative, but I'd like to see how they would extend their concepts of a user interface to the programming language and of unifying tools around the (typed) AST to C or, forgive me, C++, and what that would unlock.
Also, there is an interesting approach based on error-correcting parsers: https://www.cs.tufts.edu/comp/150FP/archive/doaitse-swierstr...
A much-extended version is available in Haskell on Hackage: https://hackage.haskell.org/package/uu-parsinglib
As it allows monadic parsing combinators, it can parse context-sensitive grammars such as C.
It would be interesting to see whether their demonstration around 08:30, where Visual Studio is unable to recover from an error properly, could be improved with error correction.
Of course, text is so universal and allows for so many ways of editing that it's hard to give up. On the other hand, while text is great for input, it comes with overhead and some core issues (most are already in the article, but I'm writing them down anyway):
1. Substitutions such as renaming a symbol, where ensuring the correctness of the operation pretty much requires having parsed the text into a graph representation first, or letting go of the guarantee of correctness and performing a plain text search/replace (see the sketch after this list).
2. Alternative representations requiring full and correct re-parsing such as:
- overview of flow across functions
- viewing graph based data structures, of which there tend to be many in a larger application
- imports graph and so on...
3. Querying structurally equivalent patterns when they have multiple equivalent textual representations and search in general being somewhat limited.
4. Merging changes and diffs comes with fewer guarantees than merging graphs or trees.
5. Correctness checks that ensure the validity of the program itself, such as detecting cyclic imports, all happen at build time, unless the IDE effectively maintains a duplicate program graph continuously parsed from the changes, one that is not equivalent to the eventual execution model.
6. Execution and build speed is also a permanent overhead as applications grow when using text as the source. Yes, parsing methods are quite fast these days and the hardware is far better, but having a correct program graph is always faster than parsing, creating & verifying a new one.
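Here's the sketch promised in point 1, a toy illustration with Python's stdlib ast module standing in for "a graph representation" (the variable names are made up): a safe rename needs parsed structure, while a plain text search/replace would also rewrite the string literal.

```
import ast

# Rename the variable "count" to "total" without touching anything else.
# A naive text search/replace on "count" would also rewrite the string below.
source = 'count = 1\nprint("count:", count)\n'

class Rename(ast.NodeTransformer):
    def visit_Name(self, node):
        if node.id == "count":
            node.id = "total"
        return node

tree = Rename().visit(ast.parse(source))
print(ast.unparse(tree))
# total = 1
# print('count:', total)   <- the string literal stays intact
```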
I think input as text is a must-have to start with no matter what, but what if the parsing step was performed immediately on stop symbols rather than later, and merged with the program graph immediately rather than during a separate build step? Or what if it was like a "staging" step? E.g., write a separate function that gets parsed into the program model immediately, try executing it, and then merge it into the main program graph later, after all the necessary checks that ensure the main program graph remains valid. I think it'd be more difficult to learn, but having these operations and a program graph as a database would give us so much when it comes to editing, verifying and maintaining more complex programs.
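A very rough sketch of what such a "staging" step could look like, assuming a Python-ish setting; `program_graph`, `stage` and `merge` are hypothetical names, not anything that exists:

```
import ast
import builtins

# Hypothetical flow: parse a single function as soon as it is written,
# check it against the existing program graph, and only then merge it in.
program_graph = {"double": {"callees": set()}}   # pretend this is the stored program

def stage(source):
    func = ast.parse(source).body[0]             # parsed immediately, not at build time
    calls = {n.func.id for n in ast.walk(func)
             if isinstance(n, ast.Call) and isinstance(n.func, ast.Name)}
    return func.name, calls

def merge(name, calls):
    known = set(program_graph) | set(dir(builtins))
    missing = calls - known
    if missing:
        raise ValueError(f"refusing to merge, unresolved names: {missing}")
    program_graph[name] = {"callees": calls}     # the main graph stays valid by construction

merge(*stage("def quadruple(x):\n    return double(double(x))\n"))
print(sorted(program_graph))                     # ['double', 'quadruple']
```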
I think this is the way to go, kind of like on GitHub, where you write Markdown in the comments, but that is only used for input; after that it's merged into the system, all code-like constructs (links, references, images) are resolved, and from then on you interact with the higher-level concept (the rendered comment with links and images).
For programming languages, Unison does this: you write one function at a time in something like a REPL, and functions are saved in a content-addressed database.
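Roughly, the idea is that a function's identity is a hash of its syntax tree rather than its name or file location. A naive sketch in Python (Unison's actual scheme additionally normalizes names away, which this does not):

```
import ast
import hashlib

# Content addressing, crudely: hash the AST dump, use the hash as the key.
def address(source):
    return hashlib.sha256(ast.dump(ast.parse(source)).encode()).hexdigest()

a = address("def add(x, y):\n    return x + y\n")
b = address("def add(x, y):\n    return x + y\n")
print(a == b)                          # True: same definition, same address
store = {a: "def add(x, y): ..."}      # a "database" keyed by content, not by name
```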
> Or what if it was like a "staging" step?
Yes, and I guess it'd have to go even deeper. The system should be able to represent a broken program (in an edited state), so conceptually it has to be something like a structured database for code which separates the user input from the stored semantic representation and the final program.
IDEs like IntelliJ already build a program model like this and incrementally update it as you edit; they just have to work very hard to do it, and that model is imperfect.
There are a million issues to solve with this, though. It's a hard problem.
I guess the most used one is the styles editor in Chrome DevTools, and that one is only really useful for small tweaks; even just adding new properties is already a pretty frustrating experience.
[edit] Otherwise I agree that structural editing à la IDE shortcuts is useful; I use that a lot.
In all seriousness this is being done. By me.
I would say structural editing is not a dead end, because, as you mention, projects like Unison and Smalltalk show us that storing structures is compatible with having syntax.
The real problem is that we need a common way of storing parse tree structures, so that we can build a semantic editor that works on the syntax of many programming languages.
[edit] At the level of code within a function, at least.
And you can build nearly any VCS of your dreams while still using Git as the storage backend, as it is a database of linked snapshots plus metadata. Bonus benefit: it will work with existing tooling.
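For example (a sketch, to be run inside an existing repository): git's plumbing commands let you write arbitrary blobs, trees and commits directly, without ever going through a working directory of text files. The layout stored below is made up, but the commands are ordinary git.

```
import subprocess

# Thin wrapper around git plumbing (hash-object, mktree, commit-tree).
def git(*args, stdin=None):
    return subprocess.run(["git", *args], input=stdin, text=True,
                          capture_output=True, check=True).stdout.strip()

blob = git("hash-object", "-w", "--stdin", stdin="return x + y")
tree = git("mktree", stdin=f"100644 blob {blob}\tbody\n")
commit = git("commit-tree", "-m", "store an AST fragment as git objects", tree)
print(commit)   # a normal commit object, readable by any existing git tooling
```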
The whole article is "I don't know how git works, let's make something from scratch"
"Git is a file system. We need a database for the code"
Which begs the sequitur: "A database is just files in the file system. We need a database for the database"
Also I checked the author out and can confirm that they know how git works in detail.
Then you can use an alternative diff (which is pluggable, of course) to compare those ASTs, and quickly too.
Hell, you could generate those ASTs on the server, keeping normal git clients compatible, just unable to use this feature.
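One way to lean on git's pluggable diff machinery today is a textconv driver, which only changes how diffs are displayed but already gives you AST-shaped diffs for ordinary source files. A sketch (the driver name `pyast` and the script name are arbitrary):

```
# ast_dump.py -- a minimal textconv filter so "git diff" compares AST dumps, not lines.
# Wiring (driver name is arbitrary):
#   echo '*.py diff=pyast' >> .gitattributes
#   git config diff.pyast.textconv "python3 ast_dump.py"
import ast
import sys

# git calls this script with the path of the file version to convert.
with open(sys.argv[1]) as f:
    tree = ast.parse(f.read())

# One node per line, so git's line-based diff effectively becomes an AST diff.
print(ast.dump(tree, indent=2))
```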
I don't think there's a limit in git. The structure might be a bit deep for git and thus some things might be unoptimized, but the shape is the same.
Tree.
This is an interesting idea for how to reuse more of git's infrastructure, but it wouldn't be backwards compatible in the traditional sense either. If you checked out the contents of that repo you'd get every node in the syntax tree as a file, and let's just say that syntax nodes as directories aren't going to be compatible with any existing tools.
But even if I wanted to embrace it I still think I'd hit problems with the assumptions baked into the `tree` object type in git. Directories use a fundamentally different model than syntax trees do. Directories tend to look like `<Parent><Child/></>` while syntax trees tend to look like `<Person> child: <Person /> </>`. There's no room in git's `tree` objects to put the extra information you need, and eventually the exercise would just start to feel like putting a square peg in a round hole.
Instead of learning that I should use exactly git's data structure to preserve compatibility, I think my learning should be that a successful structure needs to be well-suited to the purpose it is being used for.
But the git directory entry contains:
* a type (this one is quite limited, so I'm not sure how well that could be (ab)used)
* a name
* a pointer to the content
Which is exactly what an AST entry has.
I'm sure you could abuse a git `tree` to squish in the extra data, but my point was just that you'd have to, because a directory doesn't have a name that's separate from the name its parent uses to point to it. An AST node has both a name that its parent uses to point to it and a named identity, e.g.:
```
<BinaryExpression>
  left: <Number '2' />
  op: <'+'>
  right: <Number '2' />
</>
```
So my point is that to fit this into git you'd have to do something funky like make a folder called `left_Number`, and my question about this is the same question as I have in the first place about creating a folder on disk named `Number` whose contents are only the digit `2`. Since every existing tool will present the information as overwhelming amounts of nonsense compared to what users are used to seeing, has any compatibility at all been created? What was the point?
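To make the funkiness concrete, here's what that encoding would look like if you generated the entry names from Python's own AST (a throwaway sketch; `tree_entries` is a made-up helper, and Python calls the node `Constant` rather than `Number`):

```
import ast

# Squeeze the edge label and the node kind into the single string
# a git tree entry gives us for a name, i.e. the "left_Number" hack.
def tree_entries(node):
    for field, child in ast.iter_fields(node):
        if isinstance(child, ast.AST):
            yield f"{field}_{type(child).__name__}"

expr = ast.parse("2 + 2", mode="eval").body    # a BinOp node
print(list(tree_entries(expr)))
# ['left_Constant', 'op_Add', 'right_Constant'] -- readable, but no existing
# tool that lists this "directory" will make any sense of it.
```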
I also see the need to check out files as an aspect of Git that relates purely to its integration with editors through flat text files. But if git were more of a database than a filesystem, it's fair to assume that you'd prefer to integrate database access directly into the IDE.
If your version control system is not compatible with GitHub, it will be dead on arrival. The value of allowing people to gradually adopt a new solution cannot be overstated. There is also value in being compatible with existing git integrations or scripts in projects' build systems.
Curious what the author thinks though, looks like it's posted by them.
IMO Git is not an unassailable juggernaut - I think if a new SCM came along and it had a frontend like GitHub and a VSCode plugin, that alone would be enough for many users to adopt it (barring users who are heavy customers of GitHub Actions). It's just that nobody has decided to do this, since there's no money in it and most people are fine with Git.
The dynamic that I think balances this out is the "inherited expectations" dynamic. You'll end up winning a lot of people by offering them easy migration, but then you'll find it difficult or impossible to satisfy them because they just want the exact same product they were using but somehow magically cheaper and better.
My goal is to avoid the violent, priority-changing growth that comes with direct interoperability, and just ride the regular bell-shaped adoption curve.
Right now I want to connect with early adopters who are eager to look at a new technology precisely because they're crashing up against the more immovable limitations of the old, and who are therefore willing to seek out and accept new things, even if it means accepting change. That acceptance of change is key to creating expectations I can actually satisfy, so that my growth actually drives more growth.
It shouldn't be hard to build a bijective mapping between a file system and AST.
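Half of such a mapping is indeed a small amount of code. Here's a throwaway sketch (Python, names made up) that writes an AST out as a directory tree; the inverse walk is symmetric, though a real bijection needs care with value encoding and child ordering:

```
import ast
from pathlib import Path

# Each AST node becomes a directory, each child edge a subdirectory named
# "<index>_<field>_<kind>", and leaf values become small files.
def dump(node, path: Path):
    path.mkdir(parents=True, exist_ok=True)
    for i, (field, value) in enumerate(ast.iter_fields(node)):
        if isinstance(value, ast.AST):
            dump(value, path / f"{i}_{field}_{type(value).__name__}")
        elif isinstance(value, list):
            for j, item in enumerate(value):
                if isinstance(item, ast.AST):
                    dump(item, path / f"{i}_{field}_{j}_{type(item).__name__}")
        elif value is not None:
            (path / f"{i}_{field}").write_text(repr(value))

dump(ast.parse("x = 1 + 2"), Path("/tmp/ast_as_files/Module"))
```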
The cli really isn't the greatest either way. But there's lots of infrastructure to make the sharing work reasonably well.
...to my mind, such a thing could only be language-specific, and the model for C# is probably something similar to Roslyn's interior (it keeps "trivia" nodes separate but still models the using section as a list for some reason). Having it all in a queryable database would be glorious for change analysis.
It still needs to be language-aware to know which token grammar to use, but syntax highlighting as a field has a relatively well defined shared vocabulary of output token types, which lends to some flexibility in changing the language on the fly with somewhat minimal shifts (particularly things like JS to TS where the base grammars share a lot of tokens).
I didn't do much more with it than generate simple character-based diffs that seemed like improvements of comparative line-based diffs, but I got interesting results in my experiments and beat some simple benchmarks in comparing to other character-based diff tools of the time.
(That experiment was done in the context of darcs exploring character-based diffs as a way to improve its CRDT-like source control. I still don't think darcs has the proposed character-based patch type. In theory, I could update the experiment and attempt to use it as a git mergetool, but I don't know if it provides as many benefits as a git mergetool than it might in a patch theory VCS like darcs or pijul.)
I saw a diff tool that marked only the characters that changed. That would work here.
You can build whatever you want on top to help your AI agents. That would be actually beneficial so that we stop feeding raw text to this insane machinery for once.
Please no.
All you had to do was access it over VPN and you could go on vacation because nobody would expect you to get anything done!
The brokenness was always what got me. I had to believe either that they had no unit tests, or that thousands of them were failing and they released it anyway, because it was so fragile that it would have been impossible to test it and not notice easily broken things.
Whereas doing DB queries to navigate code would be quite unfamiliar.
Can anyone explain this one? I use monorepos everyday and although tools like precommit can get a bit messy, I've never found git itself to be the issue?
This subject deserves a study of its own, but big-big-tech tends to use other things than git.
Not a result of git.
Business continuity (no uncontrolled external dependencies) and corporate security teams wanting to be able to scan everything. Also wanting to update everyone's dependencies when they backport something.
Once you have those requirements, most of the benefits of multi-repo / round-tripping over releases just don't hold anymore.
The entanglement can be stronger, but if teams build clean APIs, it's no harder than removing it from a cluster of individual repositories. That might be a pretty load-bearing "if", though.
Monorepos are one such VCS, which personally I don't like, but that's just me. Otherwise there are plenty of large organizations that manage lots of git repositories in various ways.
Replacing git is a lot like saying we should replace Unix. Like, yeah, it's got its problems, but we're kind of stuck with it.
You can also tell that monorepos don't scale because, if you kept consolidating over many generations, all the code in the world would eventually be in just one or two repos. Those repos would be so massive that breaking off a little independent piece to work on would be crucial to making any progress.
That's why the alternative to monorepos is multirepos. Git handles multirepos with its submodules feature. Submodules are a great idea in theory, offering git repos the same level of composability in your deps that a modern package manager offers. But unfortunately submodules are so awful in practice that people cram all their code into one repo just to avoid having to use the submodule feature for the exact thing it was meant to be used for...
You can do all kinds of workarounds and sandboxes, but it would be nice for git to support more modularity.
If he knew how to use it, he'd be annoyed at some edge cases.
If he knew how it works, he'd know the storage subsystem is flexible enough to implement any kind of new VCS on top of it. The storage format doesn't need to change to improve/replace the user facing part
If you want to create a new tool for version control, go for it. Then see how it fares in benchmarks (for end-to-end tools) or by vox populi - whether people use your new tool/skill/workflow.
Monorepo and large binary file support is the way. A good cross-platform virtual file system (VFS) is necessary; a good open source one doesn't exist today.
Ideally it comes with a copy-on-write system for cross-repo blob caching. But I suppose that’s optional. It lets you commit toolchains for open source projects which is a dream of mine.
Not sure I agree that LSP-like features need to be built in. That feels wrong. That's just a layer on top.
Do think that agent prompts/plans/summaries need to be a first class part of commits/merges. Not sure the full set of features required here.
Well... I used SVN before that and it was way worse.
And clearly OP hasn't heard of VSS...
HN is a tiny bubble. The majority of the world's software engineers are barely using source control, don't do code reviews, don't have continuous build systems, don't have configuration controlled release versions, don't do almost anything that most of HN's visitors think are the basic table stakes just to conduct software engineering.
Wasn't this done by IBM in the past? Rational Rose something?
To help the agents understand the codebase, we indexed our code into a graph database using an AST, allowing the agent to easily find linked pages, features, databases, tests, etc from any one point in the code, which helped it produce much more accurate plans with less human intervention and guidance. This is combined with semantic search, where we've indexed the code based on our application's terminology, so when an agent is asked to investigate a task or bug for a specific feature, it'll find the place in the code that implements that feature, and can navigate the graph of dependencies from there to get the big picture.
We provide these tools to the coding agents via MCP and it has worked really well for us. Devs and QAs can find the blast radius of bugs and critical changes very quickly, and the first draft quality of AI generated plans requires much less feedback and corrections for larger changes.
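For anyone curious what "indexing code into a graph via an AST" looks like at its smallest: this isn't our system (which uses a graph database plus MCP), just a toy illustration with Python's ast module and an in-memory edge list.

```
import ast

# Build a tiny dependency graph (imports and calls) from one module's source.
# The module name and example code are made up for illustration.
def index(source, module):
    edges = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.ImportFrom):
            edges.append((module, "imports", node.module))
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            edges.append((module, "calls", node.func.id))
    return edges

graph = index(
    "from billing import charge\n\ndef checkout(cart):\n    return charge(cart.total)\n",
    "shop.checkout",
)
print(graph)   # [('shop.checkout', 'imports', 'billing'), ('shop.checkout', 'calls', 'charge')]
```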
In our case, I doubt that a general purpose AST would work as well. It might be better than a simple grep, especially for indirect dependencies or relationships. But IMO, it'd be far more interesting to start looking at application frameworks or even programming languages that provide this direct traversability out of the box. I remember when reading about Wasp[0] that I thought it would be interesting to see it go this way, and provide tooling specifically for AI agents.
[0] https://wasp.sh/
The interesting side effect is that AI tools get this traversability for free. When business logic and infrastructure declarations live in the same code, an AI agent doesn't need a separate graph database or MCP tool to understand what a service depends on or what infrastructure it needs. It's all in the type signatures. The agent generates standard TypeScript or Go, and the framework handles everything from there to running in production.
Our users see this work really well with AI agents as the agent can scaffold a complete service with databases and pub/sub, and it's deployable immediately because the framework already understands what the code needs.