Its business is underpinned by pre-AI assumptions about usage that, based on its recent instability, I suspect are being invalidated by surges in AI-produced code and commits.
I'm worried, at some point, they'll be forced to take an unpopular stance and either restrict free usage tiers or restrict AI somehow. I'm unsure how they'll evolve.
GitHub is still code-centric, with issues and discussions being auxiliary/supporting features around the code. At some point those will become the frontline features, and the code will become secondary.
Specifications accurate enough to describe the exact behaviors are basically equivalent to code, including in length, so you are really just changing languages (and current LLM tech is not on course to be able to handle specifications that large).
Higher-level specifications (the ones that make sense) leave some details and assumptions to the implementation, so you cannot safely ignore the implementation itself, and you cannot recreate it easily (each LLM build could change the details and the little assumptions).
So yeah, while I agree that documentation and specifications are more and more important in the AI world, I don't see the path to the conclusions you are drawing
Not saying that you are wrong, necessarily. But I think it's still a pretty broad presumption.
Or instead, is it mistakes being made migrating to Azure, rather than Azure being the actual problem? Changing providers can be difficult, especially if you relied on any proprietary services from the old provider.
Making big changes, like swapping the tech that underpins your product, while still actively developing that product means a lot of things in a complicated system changing at once, which is usually a recipe for problems.
Incidentally, I think that is part of the current problem with AI-generated code. It's a fire hose of changes into systems that were never designed for their existing rate of change, or were barely holding together at it. AI is able to produce perfectly acceptable code at times, but the churn is high, and the more code, the more churn.
Yeah... my career hasn't been that long but I've only ever worked on one system that wasn't held together by duct-tape and a lot that were way more complicated than they needed to be.
It even sounds silly when you say it this way.
Assuming just text, deduplication, and not being dumb about storage patterns, our range is 40-100TB, and that's probably too high by 10x. 100TB means that the average repo is 100KB, too.
Nearly every arcade machine and pre-2002 console is available as a software "spin" that's <20TB.
How big was "every song on spotify"? 400TB?
The Eye is somewhere between a quarter and half a petabyte.
Wikipedia is ~100GB. It may be more now; I haven't checked. But the raw DB with everything you need to display the text contained in Wikipedia is 50-100GB, and most of that is the markup - that is, not "information for us" but "information for the computer".
Common Crawl, with over 1.97 billion web pages in their archive: 345TB.
We do not believe this has anything to do with the "queries per second" or "writes per second" on the platform. Ballpark, GitHub probably smooths out to around ten thousand queries per second, median. I'd have guessed less, but then again I worked on a photography website database one time that was handling 4000 QPS all day long between two servers, 15 years ago.
P.S. Just for fun I searched GitHub for `#!/bin/bash` and it returned 15.3 million code results. Assume you replace just that string with 2 bytes instead of 12: you save ~153MB on disk. That's compression; but how many files are duplicated? I don't mean forks with no activity, but different projects? Also I don't care to discern the median bash-script byte length on GitHub, but ballparked at 1000 chars/bytes, mean, that's ~16GB on disk for just bash scripts :-)
I have ~593 .sh files that Everything can see: 322 are 1KB or less, 100 are 1-2KB, 133 are 2-10KB, and the rest (38) are >10KB. Of the 1KB ones, a random sample shows they cluster such that the mean is ~500B.
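The back-of-envelope math above can be sketched in a few lines. All the inputs (15.3M search hits, a 12-byte shebang, ~1000-byte mean script size) are the rough guesses from these comments, not measured values:

```python
# Back-of-envelope for the bash-script storage numbers above.
# All inputs are rough guesses from the thread, not measurements.
BASH_FILES = 15_300_000      # GitHub code-search hits for "#!/bin/bash"
SHEBANG_BYTES = 12           # len("#!/bin/bash\n")
TOKEN_BYTES = 2              # hypothetical 2-byte replacement token
MEAN_SCRIPT_BYTES = 1_000    # ballpark mean script size

shebang_savings_mb = BASH_FILES * (SHEBANG_BYTES - TOKEN_BYTES) / 1e6
total_gb = BASH_FILES * MEAN_SCRIPT_BYTES / 1e9

print(f"shebang savings: ~{shebang_savings_mb:.0f} MB")   # ~153 MB
print(f"all bash scripts: ~{total_gb:.1f} GB")            # ~15.3 GB
```

Even granting an order of magnitude of slop in the mean size, bash scripts alone are a rounding error against a 40-100TB estimate for all of GitHub's text.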
My worry is for the business and how they structure pricing. GitHub is able to provide the free services they do because at some point they did the math on what a typical free tier does before they grow into a paid user. They even did the math on what paid users do, so they know they'll still make money when charging whatever amount.
My hunch is AI is a multiplier on usage numbers, which increases OpEx, which means it's eating into GH's assumptions on margin. They will either need to accept a smaller margin, find other ways to shrink OpEx, or restructure their SKUs. The Spotifies and YouTubes of the world hosting other media formats have it harder than them, but they are able to offset the cost of operation by running ads. Can you imagine having to watch a 20 second ad before you can push?
https://newsletter.betterstack.com/p/how-github-reduced-repo...
maybe sourced from this tweet?
https://x.com/github/status/1569852682239623173
Edit: though maybe that data doesn't count as your "just text" data.
Is there some command a git administrator can issue to see granular statistics, or is "du -sh" the best we can get?
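For what it's worth, `git count-objects -v` (a real plumbing command) is somewhat more granular than `du -sh`: it reports loose and packed object counts plus sizes in KiB. A small sketch that shells out to it; the helper names here are my own, not standard tooling:

```python
import subprocess

def parse_count_objects(text: str) -> dict:
    """Parse `git count-objects -v` output ("key: value" lines) into ints."""
    stats = {}
    for line in text.strip().splitlines():
        key, _, value = line.partition(": ")
        stats[key] = int(value)
    return stats

def repo_object_stats(repo_path: str = ".") -> dict:
    """Return object counts and sizes (sizes are in KiB) for one repo."""
    out = subprocess.run(
        ["git", "-C", repo_path, "count-objects", "-v"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_count_objects(out)

# Example of the output format being parsed (values invented):
sample = "count: 12\nsize: 48\nin-pack: 3410\npacks: 1\nsize-pack: 21544\n"
stats = parse_count_objects(sample)
print(stats["size-pack"])  # pack size in KiB -> 21544
```

GitHub also publishes `git-sizer` if you want deeper per-object-type breakdowns, though neither tool gives you fleet-wide statistics across an org.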
0: I'm assuming a site-rip that only fetches the files equivalent to clicking the "zip download" button - not the releases, wikis, images, workers, gists, etc.
Also I would guess there would be copy-on-write and other such optimizations at Github. It's unlikely that when you fork a repo, somewhere on a disk the entire .git is being copied (but even if it was, it's not that expensive).
That's not to say they don't have people who can build good things. They built the standard for code distribution after all. But you can't help but recognize so much of it is duct taped together to ship instead of crafted and architected with intent behind major decisions that allow the small shit to just work. If you've ever worked on a similar project that evolved that way, you know the feeling.
But also, GitHub profiles and repos were at one point a window into specific developers - like a social site for coders. Now it's suffering from the same problem that social media sites suffer from - AI-slop and unreliable signals about developers. Maybe that doesn't matter so much if writing code isn't as valuable anymore.
Oh no, who would think about the big corporations? How is Micro$lop going to survive? /s
If you need to host git + a nice gui (as opposed to needing to promote your shit) Forgejo is free software.
Also, I wouldn't say GitHub is a corporate attempt to own git... GitHub is a huge part of why Git is as popular as it is these days, and GitHub started as a small startup.
Of course, you can absolutely say Microsoft bought GitHub in an attempt to own git, but I think you are really underselling the value of the community parts of GitHub.
What's more interesting to me is that Claude dramatically lowers the barrier to _testing_, not just writing code. I can mass-generate edge case tests that I'd never bother writing manually. The result is higher-quality solo repos that look "abandoned" by star count.
Is anyone tracking test coverage or CI pass rates for AI-assisted repos vs traditional ones? That seems like a much more useful signal than stars.
The more interesting question to me isn't "are AI-assisted repos less starred" but "are people building more stuff that's useful only to themselves." That feels like a good outcome — software that was previously only economically viable to write at scale is now worth writing for an audience of one.
Programming has long succumbed to influencer dynamics and is subject to the same critiques as any other kind of pop creation. Popular restaurants, fashion, movies - these aren't carefully crafted boundary pushing masterpieces.
Pop books are hastily written and usually derivative. Pop music is the same, as is pop art. Popular podcasts and YouTube channels are usually just people hopping unprepared on a hot mic and pressing record.
Nobody is reading a PhD thesis or a scholarly journal on the bus.
The markers for the popularity of pop works are fairly independent from the quality of their content. It's the same dynamics as the popular kid at school.
So pop programming follows this exact trend. I don't know why we expect humans to behave foundationally differently here.
As someone who is involved in academia, I can attest that most of my colleagues (including myself) do in fact read quite a few papers on buses (and trams - can't forget those)
Stars get bought all the time. I've been around startup scene and this is basically part of the playbook now for open core model. You throw your code up on GitHub, call it open source, then buy your stars early so it looks like people care. Then charge for hosted or premium features.
There's a whole market for it too. You can literally pay for stars, forks, even fake activity. Big star count makes a project look legit at a glance, especially to investors or people who don't dig too deep. It feeds itself. More people check it out, more people star it just because others already did.
I have 60-ish repos, vast majority are zero star, one or two with a star or two, one with 25-ish. It’s a signal to me of interest in and usage of that project.
Doesn’t mean stars are perfect, or can’t be gamed, or anything in a universally true generalization sense. But also not meaningless.
They are bookmarks. It is a way to bookmark a repo, and while it might correlate with quality, it isn't a measure of it.
When I started reading commit data, it became painfully apparent that a very large number of repos are tests, demos, or tutorials. If you have at least 1 star, that excludes most of those - unless you starred it yourself. Having 2 stars excludes the projects that are self-starred.
Starring is also quite common with my friends and colleagues as a way to find repos again later, so there is some use to it, but I agree it's not a perfect indicator of utility or quality.
It’s just way cheaper to spin up repos now — lots of these are probably one-and-done.
Whatever reaction you have to this, know that my internal reaction and yours were probably close.
In hindsight the headline was a bit more sensational than I meant it to be!
Agentic coding is not about creating software, it's about solving the problems we used to need software to solve directly.
The only reason I put my agentic code in a repo is so that I can version-control changes. I don't have any intention of sharing that code with other people, because it wouldn't be useful for them. If people want to solve a similar problem to mine, they're much better off making their own solution.
I'm not at all surprised that most Claude-linked output is in low-star repos. The only Claude repos I even bother sharing are those that are basically used as context stores to help other people get up to speed faster with their own CC work.
At 2 months old: nearly a 1GB repo, 24M LOC, 52K commits.
https://github.com/thomaspryor/Broadwayscore
Polished site: https://broadwayscorecard.com/
Someone might want to tell the author to ask Claude what a database is typically used for...
That's the kind of "highlight" from a review when you use AI to extract/summarize content instead of asking a real human editor to do the job.
Substantive transformation of AI output via human creativity can be copyrighted, but if you're sticking to Claude commits, that's AI output.
And if that's not what you are saying, then how are you determining that prompts to an AI are not copyrighted by the author of the prompt? The results are nothing more than a derivative work of the prompt. So you are faced with having to determine whether the prompts themselves, or in combination, are copyrightable. Depending on the prompt they may or may not be, but you can't apply a blanket rule here.
(Notwithstanding that Claude inserts itself explicitly as co-author and the author is listed on the commit as well)
> The Copyright Office affirms that existing principles of copyright law are flexible enough to apply to this new technology, as they have applied to technological innovations in the past. It concludes that the outputs of generative AI can be protected by copyright only where a human author has determined sufficient expressive elements. This can include situations where a human-authored work is perceptible in an AI output, or a human makes creative arrangements or modifications of the output, but not the mere provision of prompts.
People have tried over and over again to register copyright for AI output they supplied the prompts for. For example, in one instance[1], someone prompted an AI with over 600 cumulative iterations to arrive at the final product, and it wasn't accepted by the Copyright Office.
[1] https://www.copyright.gov/rulings-filings/review-board/docs/...
You are going to have to prove that things Claude stamps as co-authored are not the work of an "assisting instrument". It's certainly true that some vibe-coded one-shot thing might not apply.
I would also note that the applicant applying for copyright in your linked case explicitly refused guidance and advice from the examiner. That could well be because the creation of that specific work was not shaped much by that artist's efforts.
I wouldn't read too much into that when discussing a GitHub repo. It really will depend on how the user is using Claude and their willingness to demonstrate which parts they contributed themselves. You need to remember that copyright extends to plays and other works of performance. Everything the copyright office is saying in your linked ruling suggests that an AI-implementation of a human-design is copyrightable.
For example, machine-translating a book doesn't create a new copyright in the new translation, but that new translation would still inherit the copyright in the original book.
One used Lenovo micro PC (size of a book) from eBay will serve you well.
The main convenience of Github for me is the ability to send preprepared prompts to Claude through its web interface or the mobile app and have it write or revise a batch of dictionary entries in the repository. I can then confirm the results on the built website, which is hosted on Github Pages, and request changes or reverts to Claude when necessary. Each prompt takes ten to thirty minutes to carry out and I run a dozen or more a day, and it is very convenient to be able to do that prompting and checking wherever I am.
When I have Claude make changes to the codebase, I find that I need to pay closer attention to the process. I can't do that while sitting in a restaurant or taking a walk, like I can with the prompting for dictionary-entry writing. The next time I start a mostly vibe-coded project, I'll look into Forgejo.
I used Claude code to build a custom notes application for my specific requirements.
It’s not perfect, but I barely invested 10 hours in it and it does almost everything I could have asked for, plus some really cool stuff that mostly just works after one iteration. I’ll probably open source the code at some point, and I fully expect the project to have less than two stars.
Still, I have my application.
For anyone that’s interested in taking a look, my terrible landing page is at rayvroberts.com
Auto updates don’t work quite right just yet. You have to manually close the app after the update downloads, because it is still sandboxed from when I planned to distribute via the Mac App Store. Rejected in review because users bring their own Claude key.
The framing assumes github repos are supposed to be products.
For libraries: still probably mostly useful for personal code bases, but also for developers with enough experience to modularize their development efforts, even on personal or niche projects.
I am bothered by huge vibe coded projects, for example like OpenClaw, that have huge amounts of code that has not undergone serious review, refactoring, etc.
I asked him, how many people are using any of them? He told me it's just him.
But, I've developed a dozen or so projects with Claude code. I am meant to be the only user.
I am maintaining a homelab setup (a homelab production environment, really) with a few dozen services, a combination of open-source ones and my own closed-source ones.
I had tons of ideas of how to set things up. It evolved naturally, so changing things was hard. Progress was quite slow.
Now, I have a pretty much ideal end state: it runs on auto-pilot, version bumps are mostly managed by Renovate, and ingress is properly isolated and secured (to the extent I know how).
I was able to achieve things in that time that I wouldn't have otherwise. I skipped parts I did not care about and let LLMs drive those changes under supervision. I spent more time on the things I did care about and was interested in learning.
Yeah, most of my LLM code is sitting closed source and that's by design.
If you go on another account and ask an LLM about your projects, you'll pretty much get all the code you wrote using the LLM again.
That's my gripe with LLMs: most of my friends across the globe are working on similar projects. They are 90% the same. You are burning tokens thinking you are doing some new innovation.
I pretty much Google to see whether things exist; I don't build them.
I'd like to see open-source projects in spaces where nothing exists, like a good CAD for 3D printing, but nobody is building all that.
The idea with Claude writing code, for the most part, is that everyone can write the software that they need. Software for an audience of one. GitHub is just a place for it to live beyond my computer.
Why will I want to promote it or get stars?
GitHub stars are very much the textbook example of where you'd expect to find a Pareto distribution.
I disabled all the attribution. I find it noisy, and if something is broken, I'm blaming someone, not Claude.
- 98% of humans' repos have <2 stars
Claude is 5 times smarter than humans!
The math is a bit of a stretch, but the correlation still holds up.
I have never cared about LinkedIn or GitHub stars or any of those bullshit metrics (obviously because I don't score very highly in them), and am enjoying exploring a million things at the speed of thought; get left outside, if it suits you. Smart and flexible people have no trouble using it, and it's amazing.
I'd rather measure how much I've learnt and created recently compared to before, and get ready for some sobering shit, because us experienced old dudes can judge good code from bad pretty well.
Unfortunately that type of analysis would take a bit more work, but I think the repo info and commit messages could probably be used to do that.
The 50B lines across those low-star repos isn't just an interesting metric about usage patterns. It's a significant amount of unreviewed code sitting in public repositories. Stars were never a quality signal, but they were at least a proxy for "someone other than the author looked at this." That selection effect disappears entirely when the build cost drops to near zero.
It is interesting to see a flip in attitude toward GitHub.
What a dumb metric to focus on.
As if it's to prevent the species from over-indexing on a particular set of behaviors.
Like how divisive films such as "Signs", "Cloud Atlas", and even "The Last Jedi" are loved by some and utterly reviled by others.
While that's kind of a silly case, maybe it's not just some random statistical fluke, but actually a function of the species at a population level to keep us from over-indexing and suboptimizing in some local minima or exploring some dangerous slope, etc.
Came across this via this Show HN post yesterday: https://news.ycombinator.com/item?id=47501348
What percentage of GitHub activity goes to GitHub repos with less than 2 stars? I would guess it's close to the same number.
This likely tripled the amount of stars I have.
workers on the management track
(But I don't use AI on them.)
The problem is that this title is editorialized, and the fact is cherry-picked. Why not =0? Why not >1000? This is just a dashboard; it highlights "Interesting Observations", but star statistics are not among them.
If anything, the fact that this is what he arrived at, even when starting with the opposite position, is proof of the validity of this result.
- visibility
- popularity (technical, domain, persona)
- genuine utility
- novelty
...
There are also plenty of super high utility repos that are widely used (often indirectly), but don't have a lot of stars, or even a meagre amount.
Also there is the issue of star != star, because it's not granular.
It's similar to upvotes on general social media platforms. Everyone somewhat likes cute cats doing funny things, but only a few people appreciate something more niche that is far more impactful, useful, or entertaining (or requires some effort to consume), and those who do value it very highly. Yet the same person might use the same score (a single upvote) for a cat video and for a video they value much more.
stars : uniq(k)
1 : 14946505
10 : 1196622
100 : 213026
1000 : 28944
10000 : 1847
100000 : 20
1 : 14946505
10 : 1196622
100 : 213026
1000 : 28944
10000 : 1847
100000 : 100
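Taking the first set of counts above at face value, a quick least-squares fit on log-log axes gives a rough power-law slope for the tail. This is nothing rigorous, just a sanity check that the distribution looks Pareto-ish:

```python
import math

# (star threshold, repos with at least that many stars), from the counts above
counts = [(1, 14_946_505), (10, 1_196_622), (100, 213_026),
          (1_000, 28_944), (10_000, 1_847), (100_000, 20)]

xs = [math.log10(s) for s, _ in counts]
ys = [math.log10(n) for _, n in counts]
n = len(xs)
mx, my = sum(xs) / n, sum(ys) / n

# Ordinary least-squares slope of log10(count) against log10(stars)
slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
print(f"log-log slope ~ {slope:.2f}")  # roughly -1.1: a heavy, Pareto-like tail
```

A straight line on log-log axes with slope near -1 is exactly what the "textbook Pareto distribution" comment upthread would predict.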
Interestingly, there are 21.37B commits on GitHub, implying 104 additions per commit. Per the dashboard, Claude is linked to 20.81M commits and 50.44B additions, or 2,424 additions/commit. So additions per commit for Claude-linked repos are higher, and the gap is actually wider for repos with 0-1 stars (2,568 additions/commit for Claude vs. 91 for all of GitHub). None of this is a smoking gun, but it aligns with the intuition that Claude is producing enormous amounts of code. TBD whether it is 'adding value'.
Would be appreciative of anyone who verifies/invalidates this. https://play.clickhouse.com/ https://ghe.clickhouse.tech/#clickhouse-demo-access
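The per-commit arithmetic in the comment above is easy to re-check; the dashboard figures themselves (20.81M Claude-linked commits, 50.44B additions, 104 additions/commit for GitHub overall) are taken on trust here:

```python
# Figures quoted in the comment above, taken on trust from the dashboard.
claude_additions = 50.44e9
claude_commits = 20.81e6
github_additions_per_commit = 104  # stated for GitHub overall

claude_rate = claude_additions / claude_commits
print(f"Claude-linked additions/commit: ~{claude_rate:.0f}")                 # ~2424
print(f"ratio vs GitHub overall: ~{claude_rate / github_additions_per_commit:.0f}x")  # ~23x
```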
I mean, it’s an indicator. Just not a definitive—or individually sufficient—one.
https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect
https://en.wikipedia.org/wiki/Dunning%E2%80%93Kruger_effect#...
If you read your own reference (not the picture, but where you took it from on Wikipedia) really really carefully, you might be able to tell why it so perfectly applies to you
The person with little knowledge overestimates their capability, and the person who actually knows how complicated [the thing] is usually isn't as confident that they have mastered it.
Your take on that makes absolutely no sense
But the claim above was that having low confidence is correlated with higher skill, i.e., that skill and confidence are anti-correlated. The chart does not show that. The lowest data point for confidence is the point on the left of the chart, which is also the data point corresponding to the people with the least competence. Having low confidence is not evidence that you're secretly an expert. Confidence and competence are still positively correlated according to that chart.
The Dunning-Kruger effect is not so strong that there are scores of novices convinced they are experts in a field. But in your case, I admit the data may not tell the full story.
"Baloney Detection Kit"
https://www.youtube.com/watch?v=aNSHZG9blQQ
Best regards =3
> In popular culture, the Dunning–Kruger effect is sometimes misunderstood as claiming that people with low intelligence are generally overconfident, instead of denoting specific overconfidence of people unskilled at particular areas.
Dunning-Kruger has also been discredited, with the suggestion that its authors may have been overconfident themselves:
The Dunning-Kruger Effect Is Probably Not Real (2020) https://www.mcgill.ca/oss/article/critical-thinking/dunning-...
Debunking the Dunning‑Kruger effect – the least skilled people know how much they don’t know, but everyone thinks they are better than average (2023) https://theconversation.com/debunking-the-dunning-kruger-eff...
The study's conclusion inferred that the skills needed to be effective at some task are the same skills needed to correctly evaluate whether you are actually proficient at that task.
Or put another way: the <5% of the population who are narcissists will, by their nature, become evasive when their egos are perceived as threatened. Thus they often pose a challenge in a team setting, as compulsive lying or LLM turd-polishing is orthogonal to most real-world tasks.
People are not as unique as they like to believe, and spotting problems is trivial after you meet around 3000 people. Best to avoid the nonsense, and get outside to enjoy life. Have a great day =3
https://arxiv.org/abs/2505.02151
It's good to raise people's expectations of themselves
The study's conclusion inferred that the skills needed to be effective at some task are the same skills needed to correctly evaluate whether you are actually proficient at that task.
https://arxiv.org/abs/2505.02151
If the data implies another explanation is more applicable, then I'd be interested in the primary papers/studies the editorialized opinion seems to have omitted. =3
Most people figure out this scam very early in life, but some cling to terrible jobs for unfathomable reasons. =3
The answer to such questions is always that, given their circumstances, they have no realistic choice not to.
This is very obvious, and it's frustrating to continually see people pretend otherwise.
If folks expect someone else to solve problems for them, then 100% of the time people end up unhappy. The old idea of loyalty buying a 30-year career with vertical movement died sometime in the 1990s.
An ikigai chart will help narrow down why people are unhappy:
https://stevelegler.com/2019/02/16/ikigai-a-four-circle-mode...
Even if folks are not thinking about doing a project, I still highly recommend this crash course in small-business contracts:
https://www.youtube.com/watch?v=jVkLVRt6c1U
Rule #24: The lawyer's "strategic truth" is to never lie, but also to avoid voluntarily disclosing information that may help opponents.
Best of luck =3
In other words, plot the percentage or average metric, not the absolute metric.
e.g. the number of lotto winners per thousand people living in that grid cell, starred repos as a percentage of all repos, per-capita alcohol consumption, average screen time, etc.
Edit: unless, of course, the point of the heatmap is to show the population distribution itself, in which case the metric would be the number of people per square kilometer or some such.
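A minimal sketch of the normalization being suggested; all the numbers here are invented purely for illustration:

```python
# Per-cell raw counts vs. per-capita rates (all numbers invented).
cells = {
    "dense city cell": {"lotto_winners": 30, "population": 1_000_000},
    "rural cell":      {"lotto_winners": 3,  "population": 50_000},
}

for name, c in cells.items():
    per_thousand = c["lotto_winners"] / c["population"] * 1_000
    print(f"{name}: raw={c['lotto_winners']}, per 1000 people={per_thousand:.3f}")

# Raw counts make the city look "luckier" (30 vs 3), but per capita the
# rural cell wins: 0.060 vs 0.030 winners per thousand people.
```

The same division applies to repos: stars per repo in a cohort, rather than total stars, is what makes cohorts of different sizes comparable.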
Personally I think comparing github stars is always going to be a fraught metric.
If the answer wasn't in the hundreds of requests per second, I wasn't interested in the job.
I found jobs at ad-tech companies; the pay wasn't any good, but the challenges were immense.
Most people write code that will hardly ever be run by other people, or even attract any customers.
Value and use are not always synonymous.