Hacker News

End of an era for me: no more self-hosted git

285 points by dzulp0d 2 days ago | 193 comments

I cut traffic to my Forgejo server from about 600K request per day to about 1000: https://honeypot.net/2025/12/22/i-read-yann-espositos-blog.h...

1. Anubis is a miracle.

2. Because most scrapers suck, I require all requests to include a shibboleth cookie, and if they don’t, I set it and use JavaScript to tell them to reload the page. Real browsers don’t bat an eye at this. Most scrapers can’t manage it. (This wasn’t my idea; I link to the inspiration for it. I just included my Caddy-specific instructions for implementing it.)

QuiDortDine 2 days ago

I remember back when Anubis came out, some naysayers on here were saying it wouldn't work for long because the scrapers would adapt. Turns out careless, unethical vibecoders aren't very competent.

tuhgdetzhh 2 days ago

I still think it is just a matter of time until scrapers catch up. There are more and more scrapers that spin up an full blown chromium.

kstrauser 2 days ago

It seems inevitable, but in the mean time, that's vastly more expensive than running curl in a loop. In fact, it may be expensive enough that it cuts bot traffic down to a level I no longer care about defending against. Like GoogleBot had been crawling my stuff for years without breaking the site. If every bot were like that, I wouldn't care.

raw_anon_1111 20 hours ago

Serious question, in 2026 you can actually have a successful crawler with just curl? I just had to create one for a customer - for their own site - and nothing would have worked without using Chromium.

kstrauser 20 hours ago

Probably not for most sites. Example of a site where it'd likely work: a blog made with a static site generator. Example of one where it wouldn't: darn near anything made with React.

franga2000 7 hours ago

It works for the majority of things a text mining scraper would care to scrape. It's not just static sites but also any CMS like wordpress, as well as many JS apps that have server-side rendering. SPA-only sites aren't that common anymore, especially for things like blogs, news and text-based social media.

solid_fuel 2 days ago

Cool, if they're running full blown chromium maybe the next step can be mining bitcoin on any pages served to bots.

hxtk 2 days ago

Even that functions as a sort of proof of work, requiring a commitment of compute resources that is table stakes for individual users but multiplies the cost of making millions of requests.

gruez 2 days ago

AFAIK you can bypass it with curl because there's an explicit whitelist for it, no need for a headful browser.

cantalopes 2 days ago

Well it's a race, just like security. And as long as anubis is in the front, all looks bright

Elfener 2 days ago

> Turns out careless, unethical vibecoders aren't very competent.

Well they are scraping web pages from a git forge, where they could just, you know, clone the repo(s) instead.

wolfi1 2 days ago

"Turns out careless, unethical vibecoders aren't very competent." well, they rely on AI, don't they? and AI is trained with already existing bad code, so why should the outcome be different?

xorcist 2 days ago

> I set it and use JavaScript to tell them to reload the page

While throwing out all users who opt-in to javascript, using Noscript or uBlock or something like it, may be acceptable collateral damage to you, it might be good to keep in mind that this plays right into Big Adtech's playbook. They spend over two decades to normalize the behavior of running a hundred or more programs of untrusted origin on every page load, and to treat users to opt-in to running code in a document browser with suspicion. Not everyone would like to hand over that power to them on a silver platter with a neat little bow on top.

kstrauser 2 days ago

Oh please. That ship has sailed. I'm marginally sympathetic to people who don't run JavaScript on their browsers for a variety of reasons, but they've deliberately opted out of the de facto modern web. JS is as fundamental to current design as CSS. If you turn it off, things might work, but almost no one is testing that setup, nor should they reasonably be expected to.

This has zero to do with Adtech for 99.99% of uses, either. Web devs like to write TypeScript and React because that's a very pleasant tech stack for writing web apps, and it's not worth the effort for them to support a deliberately hamstrung browser for < 0.1% of users (according to a recent Google report).

See also: feel free to disable PNG rendering, but I'm not going to lift a finger to convert everything to GIFs.

forgotmypw17 2 days ago

There are many reasons to accommodate non-JS users beyond accommodating people who have intentionally disabled it, and most of them are in accessibility territory.

Be careful with using percentages for your arguments, because this is not that different from saying that 99.99% of people don't need wheelchair access.

greiskul 24 hours ago

This used to be true, but now I don't think it is anymore. Modern frameworks and modern screen readers have no issue with acessibility.

Some survey from WebAIM found that 99.3% of screen reader users have JavaScript enabled.

So... are they really in accessibility territory still? Only people I still see complaining about Javascript being required are people that insist the web should just be static documents with hyperlinks like it was in the early 90s.

Can you find a modern source with valid reasons for accomodating non-JS users?

forgotmypw17 8 hours ago

Slow/lossy connections: JS may not load, but site still works.

Users that prefer non-animated pages and disable JS for this reason.

Users who prioritize security.

Users of older devices in which your JS can trigger errors. Yes, these exist. Not everyone can upgrade their older device. Many people do not even have their own device to use.

shakna 2 days ago

> JS is as fundamental to current design as CSS.

I think this hits the crux of the trend fairly well.

And is why I have so many workarounds to shitty JS in my user files.

Because I can't see your CSS, either.

chipsrafferty 24 hours ago

Yet you use CSS on your own website?

shakna 23 hours ago

Yup. I do. And JS, too.

Because neither are _required_ for anything. There is a well-specified data tree.

Progressive enhancement is not some sign of conflict in my reasoning. It is a demonstration of it.

qwery 2 days ago

the recent google report claimed that less than 0.1% of users have javascript disabled ... like for every website, or just some, or?

your PNG/GIF thing is nonsense (false equivalence, at least) and seems like deliberate attempt to insult

> I'm marginally sympathetic

you say that as if they've done some harm to you or anyone else. outside of these three words, you actually seem to see anyone doing this as completely invalid and that the correct course of action is to act like they don't exist.

kstrauser 2 days ago

It would be literally impossible to know whether a user disabled JavaScript on another site, so I'm going to say that they meant that for their own sites.

> you say that as if they've done some harm to you or anyone else.

I was literally responding to someone referring to themselves as "collateral damage" and saying I'm playing into "Big Adtech's playbook". I explained why they're wrong.

> the correct course of action is to act like they don't exist.

Unless someone is making a site that explicitly targets users unwilling or unable to execute JavaScript, like an alternative browser that disables it by default or such, mathematically, yes, that's the correct course of action.

raw_anon_1111 20 hours ago

You mean all 3?

I could care less about serving users who don’t want to enable JS in 2026. They aren’t worth my development times

chipsrafferty 24 hours ago

Not enough people opt out of using js for it to matter to anyone. If a page doesn't work because you have js disabled, get over it

draw_down 2 days ago

[dead]

7bit 15 hours ago

Those poor users. If they want to remove the fourth wheel from their car, they will bump into some issues. Who cares.

albuic 6 hours ago

I do but shitty web devs don't

trumpdong 12 hours ago

IIRC 2 is included in the "go-away" program which is similar to Anubis but without the PoW

mqus 2 days ago

I "solved" this by adding a fail2ban rule for everyone accessing specific commits (no one does that 3 times in a row) and then blocking the following ASs completely (just too many IPs coming from those, feel free to look them up yourself): 136907 23724 9808 4808 37963 45102. And after that: sweet silence.

How to block ASs? Just write a small script that queries all of their subnets once (even if it changes, its not so much to have an impact) and add them to a nft set (nft will take care of aggregating these into continouus blocks). Then just make nft reject requests from this set.

jacquesm 24 hours ago

- 136907 Huawei

- 23724 China Telco

- 9808 China Mobile

- 4808 China Unicom

- 37963 Alibaba

- 45102 Alibaba tech

You may want to add this list as well:

https://docs.aws.amazon.com/vpc/latest/userguide/aws-ip-rang...

data-ottawa 2 days ago

Does anyone know what's the deal with these scrapers, or why they're attributed to AI?

I would assume any halfway competent LLM driven scraper would see a mass of 404s and stop. If they're just collecting data to train LLMs, these seem like exceptionally poorly written and abusive scrapers written the normal way, but by more bad actors.

Are we seeing these scrapers using LLMs to bypass auth or run more sophisticated flows? I have not worked on bot detection the last few years, but it was very common for residential proxy based scrapers to hammer sites for years, so I'm wondering what's different.

embedding-shape 2 days ago

I just threw up a public Forjego instance for some lightweight collaboration. About 2 minutes after the certificate was created, I'm guessing they picked up the instance from the transparency logs for certificates, and started going through every commit and so on from the two repositories I had added.

Watched it for a while, thinking eventually it'd end. It didn't, seemed like Claudebot and GPTBot (which was the only two I saw, but could have been forged) went over the same URLs over and over again. They tried a bunch of search queries too at the same time.

The day after I got tired of seeing it so added a robot.txt forbidding any indexing. Waited a few hours, saw that they were still doing the same thing, so threw up basic authentication with `wiki:wiki` as the username:password basically, wrote the credentials on the page where I linked it and as expected they stopped trying after that.

They don't seem to try to bypass anything, whatever you put in front will basically defeat them except blocking them by user-agent, then they just switch to a browser-like user-agent instead, which is why I went the "trivial basic authentication" path instead.

Wasn't really an issue, just annoying when they try to masquerade as normal users. Had the same issue with a wiki instance, added rate limits and eventually they seemingly backed off more than my limits were set too, so I guess they eventually got it. Just checked the logs and seems they've stopped trying completely.

Seemingly it seems like people who are paying for their hosting by usage (which never made sense to me) is the ones hard hit by this. I'm hosting my stuff on a VPS, and don't understand what the big issue is, worst case scenario I'd add more aggressive caching and it wouldn't be an issue anymore.

rozab 2 days ago

I had the same issue when I first put up my gitea instance. The bots found the domain through cert registration in minutes, before there were any backlinks. GPTbot, ClaudeBot, PerplexityBot, and others.

I added a robots.txt with explicit UAs for known scrapers (they seem to ignore wildcards), and after a few days the traffic died down completely and I've had no problem since.

Git frontends are basically a tarpit so are uniquely vulnerable to this, but I wonder if these folks actually tried a good robots.txt? I know it's wrong that they ignore wildcards, but it does seem to solve the issue

stefanka 2 days ago

Where does one find a good robots.txt? Are there any well maintained out there?

dmit 2 days ago

https://github.com/ai-robots-txt/ai.robots.txt

skrtskrt 2 days ago

Cloudflare actually has this as a free tier feature so even if you don't want to use it for your site you can just setup a throwaway domain on Cloudflare and periodically copy the robots.txt they generate from your scraper allow/block preferences, since they'll be keeping up to date with all the latest.

trillic 2 days ago

I will second a good robots.txt. Just checked my metrics and < 100 requests total to my git instance in the last 48 hours. Completely public, most repos are behind a login but there are a couple that are public and linked.

bob1029 2 days ago

> I wonder if these folks actually tried a good robots.txt?

I suspect that some of these folks are not interested in a proper solution. Being able to vaguely claim that the AI boogeyman is oppressing us has turned into quite the pastime.

embedding-shape 2 days ago

> Being able to vaguely claim that the AI boogeyman is oppressing us has turned into quite the pastime.

FWIW, you're literally in a comment thread where GP (me!) says "don't understand what the big issue is"...

Fabricio20 2 days ago

Since you had the logs for this, can you confirm the IP ranges they were operating from? You mention "Claudebot and GPTBot" but I'm guessing this is based off of the user-agent presented by the scrapers and could easily be faked to shift blame. I genuinely doubt Anthropic and such would be running scrapers that are this badly written/implemented, it doesnt make economic sense. I'd love to see some of the web logs from this if you'd be willing to share! I feel like this is just some of the old scraper bots now advertising themselves as AI bots to shift blame into the AI companies.

Tharre 2 days ago

There are a bit too many IPs to list but from my logs they're always of the form 74.7.2XX.* for GPTBot, matching OpenAIs published ip ranges[0].

So yes, they are definitely running scrapers that are this badly written.

Also old scraper bots trying to disguise themselves as GPTBot seems wholly unproductive, they're try to immitate users, not bots.

[0] https://openai.com/gptbot.json

embedding-shape 2 days ago

> but I'm guessing this is based off of the user-agent presented by the scrapers and could easily be faked to shift blame

Yes, hence the "which was the only two I saw, but could have been forged".

> I'd love to see some of the web logs from this if you'd be willing to share!

Unfortunately not, I'm deleting any logs from the server after one hour, and also don't even log the full IP. I took a look now and none of the logs that still exists are from any user agent that looks like one of those bots.

trumpdong 12 hours ago

One idea is to not block them, but return a plausible looking page that's not the normal page, then it won't be detected as a block.

Imustaskforhelp 2 days ago

Huh, I had a gitea instance in the public web on one of my netcup vps's. I didn't set any logs and was using cloudflare tunnels (with a custom bash script which makes cf tunnels expose PORT SUBDOMAIN).

Maybe its time for me to go ahead and start it again with logs to see if there are any logs.

I will maybe test it in all three 1) With CF tunnels + AI Block, 2) Only CF tunnels, 3) On a static IP directly. Maybe you can try the experiment too and we can compare our findings (also saying because I am lazy and I had misconfigured that cf tunnel so when it quit, I was too lazy to restart the vps given I just use it as a playground and just wanted to play around self hosting but maybe I will do it again now)

simonw 2 days ago

I would love to understand this.

Just a few years ago badly behaved scrapers were rare enough not to be worth worrying about. Today they are such a menace that hooking any dynamic site up to a pay-to-scale hosting platform like Vercel or Cloud Run can trigger terrifying bills on very short notice.

"It's for AI" feels like lazy reasoning for me... but what IS it for?

One guess: maybe there's enough of a market now for buying freshly updated scrapes of the web that it's worth a bunch of chancers running a scrape. But who are the customers?

SCHiM 2 days ago

The bar to ingest unstructured data into something usable was lowered, causing more people to start doing it.

Used to be you needed to implement some papers to do sentiment analysis. Reasonably high bar to entry. Now anyone can do it, the result: more people doing scraping (in less competent scrapers too).

devsda 2 days ago

For whatever reason, legislation is lax right now if you claim the purpose of scraping is for AI training even for copyrighted material.

May be everyone is trying to take advantage of the situation before law eventually catches up.

Imustaskforhelp 2 days ago

> For whatever reason, legislation is lax right now if you claim the purpose of scraping is for AI training even for copyrighted material

I think the reason is that America & China for the most part are also in AI arms race combined with an AI bubble and neither side would wish to lose literally any percieved advantage to them no matter the cost on others.

Also there is an immense lobbying effort against senators who propose for a stricter AI regulation.

https://www.youtube.com/watch?v=DUfSl2fZ_E8 [What OpenAI doesn't want you to know]

It's actually a great watch. Highly recommended because a lot of talks about regulations does feel to me as mirrors and smoke.

ben_w 8 hours ago

I'm currently on a free trial of ChatGPT, and one new thing I didn't regularly see before* is that longer tasks perform a lot of web searches when generating non-trivial results.

I wonder if this is part of it? It's not (just) DDOS by crawlers, it's DDOS by the users themselves triggering (albeit indirectly) far more requests than a human normally would? I've seen that happen in a different context, over a decade ago now.

* old models would do this sometimes when you ask for whatever the "deep research" mode was called, but this now seems to happen a lot more and involve a lot more fetches

zerocrates 2 days ago

I would say there's a couple aspects.

The crawlers for the big famous names in AI are all less well behaved and more voracious than say, Googlebot. Though this is all somewhat muddied by companies that ran the former "good" crawlers all also being in the AI business and sometimes trying to piggyback on people having allowed or whitelisted their search crawling User-Agent, mostly this has settled a little where they're separating Googlebot from GoogleOther, facebookexternalhit from meta-externalagent, etc. This was an earlier "wave" of increased crawling that was obviously attributable to AI development. In some cases it's still problematic but this is generally more manageable.

The other stuff, the ones that are using every User-Agent under the sun and a zillion datacenter IPs and residential IPs and rotate their requests constantly so all your naive and formerly-ok rate-based blocking is useless... that stuff is definitely being tagged as "for AI" on the basis of circumstantial evidence. But from the timing of when it seemed to start, the amount of traffic and addresses, I don't have any problem guessing with pretty high confidence that this is AI. To your question of "who are the customers"... who's got all the money in the world sloshing around at their fingertips and could use a whole bunch of scraped pages about ~everything? Call it lazy reasoning if you'd like.

How much this traces back ultimately to the big familiar brand names vs. would-be upstarts, I don't know. But a lot of sites are blocking their crawlers that admit who they are, so would I be surprised to see that they're also paying some shady subcontractors for scrapes and don't particularly care about the methods? Not really.

Tharre 2 days ago

> Does anyone know what's the deal with these scrapers, or why they're attributed to AI?

You don't really need to guess, it's obvious from the access logs. I realize not everyone runs their own server, so here are a couple excerpts from mine to illustrate:

- "meta-externalagent/1.1 +https://developers.facebook.com/docs/sharing/webmasters/craw...)"

- "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; +claudebot@anthropic.com)"

- "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36"

- "Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.3; +https://openai.com/gptbot)"

- [...] (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot)"

And to give a sense of scale, my cgit instance recieved 37 212 377 requests over the last 60 days, >99% of which are bots. The access.log from nginx grew to 12 GiB in those 60 days. They scrape everything they can find, indiscriminately, including endpoints that have to do quite a bit of work, leading to a baseline 30-50% CPU utilization on that server right now.

Oh, and of course, almost nothing of what they are scraping actually changed in the last 60 days, it's literally just a pointless waste of compute and bandwidth. I'm actually surprised that the hosting companies haven't blocked all of them yet, this has to increase their energy bills substantially.

Some bots also seem better behaved then others, OpenAI alone accounts for 26 million of those 37 million requests.

everybodyknows 2 days ago

Following your link above, https://openai.com/gptbot

> ChatGPT-User is not used for crawling the web in an automatic fashion. Because these actions are initiated by a user, robots.txt rules may not apply.

So, not AI training in this case, nor any other large-batch scraping, but rather inference-time Retrieval Augmented Generation, with the "retrieval" happening over the web?

Tharre 21 hours ago

Those would have the user agent "ChatGPT-User" though, and I barely see those. The majority comes from "GPTBot" like in my excerpt above, which makes it pretty clear that it's used for some sort of training:

"GPTBot is used to make our generative AI foundation models more useful and safe. It is used to crawl content that may be used in training our generative AI foundation models. Disallowing GPTBot indicates a site’s content should not be used in training generative AI foundation models."

groby_b 2 days ago

Likely, at least for some. I've caught various chatbots/CLI harnesses more than once inspecting a github repo file by file (often multiple times, because context rot)

But the sheer volume makes it unlikely that's the only reason. It's not like everybody has constantly questions bout the same tiny website.

jillesvangurp 2 days ago

Using an LLM to ponder responses for requests is way too costly and slow. Much easier to just use the shotgun approach and fire off a lot of requests and deal with whatever bothers to respond.

This btw is nothing new. Way back when I still used wordpress, it was quite common to see your server logs filling up with bots trying to access endpoints for commonly compromised php thingies. Probably still a thing but I don't spend a lot of time looking at logs. If you run a public server, dealing with maliciously intended but relatively harmless requests like that is just what you have to do. Stuff like that is as old as running stuff on public ports is.

And the offending parties writing sloppy code that barely works is also nothing new.

AI opportunism certainly has added a bit of opportunistic bot and scraper traffic but it doesn't actually change the basic threat model in any fundamental way. Previously version control servers were relatively low value things to scrape. But code just became interesting for LLMs to train on.

Anyway, having any kind of thing responding on any port just invites opportunistic attempts to poke around. Anything that can be abused for DOS purposes might get abused for exactly that. If you don't like that, don't run stuff on public servers or protect them properly. Yes this is annoying and not necessarily easy. Cloud based services exist that take some of that pain away.

Logs filling up with 404, 401, or 400 responses should not kill your server. You might want to implement some logic that tells repeat offenders 429 (too many requests). A bit heavy handed but why not. But if you are going to run something that could be used to DOS your server, don't be surprised if somebody does that.

arnarbi 2 days ago

> why they're attributed to AI?

I don’t think they mean scrapers necessarily driven by LLMs, but scrapers collecting data to train LLMs.

everforward 2 days ago

I think it’s a) volume of scrapers, and b) desire for _all_ content instead of particular content, and c) the scrapers are new and don’t have the decades of patches Googlebot et al do.

5 years ago there were few people with an active interest in scraping ForgeJo instances and personal blogs. Now there are a bajillion companies and individuals getting data to train a model or throw in RAG or whatever.

Having a better scraper means more data, which means a better model (handwavily) so it’s a competitive advantage. And writing a good, well-behaved distributed scraper is non-trivial.

dirkc 2 days ago

I'm hazarding a guess that there are many AI startups that focus on building datasets with the aim to sell those datasets. Still doesn't make total sense, since doing it badly would only hurt them, but maybe they don't really care about the product / outcome, they're just capturing their bit of the AI goldrush?

sh34r 20 hours ago

I guess the AI companies finally figured out they’re supposed to buy their stolen datasets from a shell company spun up by the most unsavory character within two degrees of the CEO. Every CEO has a drug dealer, and every CEO drug dealer knows the greasy grey hat dude running a data laundry “startup.” The VCs usually know some private equity dons who run the same racket to do bust out fraud, too.

It’s truly unbelievable that OpenAI and Anthropic were so sloppy. Pirating all that copyrighted media and not even bothering to hide behind one layer of indirection. Amateurs.

So yeah… it’s what, five years’ worth of pent up demand for organized crime, hitting the market everywhere all at once? I’m surprised the request volume isn’t higher!

golem14 22 hours ago

You can always prune it down later (so the thinking goes, no doubt)

M95D 2 days ago

I stopped trying to understand. Encountering a 404 on my site leads directly to a 1 year ban.

embedding-shape 2 days ago

Damn, as someone who sometimes navigate by guessing URLs and rewriting them manually in the address bar, I hope more don't start doing this, I probably see at least one self-inflicted 404 per day at least.

M95D 2 days ago

Why would you do that?

embedding-shape 2 days ago

Faster. Wanna know the pricing? $domain/pricing. What's this company about? $domain/about. Switch to another Google account? Change the 1 to a 2 in the URL. I guess mostly to avoid the mouse ultimately.

tasuki 2 days ago

Sounds like you're keeping all your URLs alive forever? Commendable!

M95D 2 days ago

Not so many...

And there are tools to scan for dead links.

drivers99 2 days ago

Can you scan my bookmarks? :) edit: i.e. if someone has a bookmark to a page on your site and it goes 404, then they are blocked for a year. You have no ability to scan it because it's a file on their local system.

M95D 2 days ago

Oh, now I understand.

I never removed anything, but I'll keep this in mind for the future.

octoberfranklin 2 days ago

They're rotating through huge pools of residential IP addresses.

M95D 2 days ago

The 2GB RAM didn't fill up with banned addresses, but YMMV.

mitxela 12 hours ago

[dead]

themafia 2 days ago

There's value to be had in ripping the copyright off your stuff so someone else can pass it off as their stuff. LLMs have no technical improvements so all they can do is throw more and more stolen data into it and hope it, somehow, crosses a nebulous "threshold" where it suddenly becomes actually profitable to use and sell.

It's a race to the bottom. What's different is we're much closer to the bottom now.

hsuduebc2 2 days ago

I’m guessing, but I think a big portion of AI requests now come from agents pulling data specifically to answer a user’s question. I don’t think that data is collected mainly for training now but are mostly retrieved and fed into LLMs so they can generate the response. Thus so many repeated requests.

danaris 2 days ago

> If they're just collecting data to train LLMs, these seem like exceptionally poorly written and abusive scrapers written the normal way, but by more bad actors.

Right, this is exactly what they are.

They're written by people who a) think they have a right to every piece of data out there, b) don't have time (or shouldn't have to bother spending time) to learn any kind of specifics of any given site and c) don't care what damage they do to anyone else as they get the data they crave.

(a) means that if you have a robots.txt, they will deliberately ignore it, even if it's structured to allow their bots to scrape all the data more efficiently. Even if you have an API, following it would require them to pay attention to your site specifically, so by (b), they will ignore that too—but they also ignore it because they are essentially treating the entire process as an adversarial one, where the people who hold the data are actively trying to hide it from them.

Now, of course, this is all purely based on my observations of their behavior. It is possible that they are, in fact, just dumb as a box of rocks...and also don't care what damage they do. (c) is clearly true regardless of other specific motives.

octoberfranklin 2 days ago

I don't think it has anything to do with LLMs.

I think the big cloud companies (AWS) figured out that they could scrape compute-intensive pages in order to drive up their customers' spend. Getting hammered? Upgrade to more-expensive instances. Not using cloud yet? We'll force you to.

The other possibility is cloudflare punishing anybody who isn't using it.

Probably a combination of these two things. Whoever's behind this has ungodly supplies of cheap bandwidth -- more than any AI company does. It's a cloud company.

yellowapple 5 hours ago

> Whoever's behind this has ungodly supplies of cheap bandwidth -- more than any AI company does. It's a cloud company.

Most of the major cloud companies are themselves also AI companies, so I don't think the “cloud companies are artificially driving up compute spend” hypothesis is mutually exclusive with the “AI companies are doing a very bad job at scraping” hypothesis.

moebrowne 2 days ago

This kind of thing can be mitigated by not publishing a page/download for every single branch, commit and diff in a repo.

Make only the HEAD of each branch available. Anyone who wants more detail has to clone it and view it with their favourite git client.

For example https://mitxela.com/projects/web-git-sum (https://git.mitxela.com/)

yellowapple 5 hours ago

I suspect bog-standard per-IP rate limiting would also mitigate this, no?

PaulDavisThe1st 2 days ago

Alternatively, from the nginx config file for git.ardour.org:

   location ~ commit/* {
        return 404;
    }

Imustaskforhelp 2 days ago

I got another interesting idea from this and another comment but what if we combine this with ssh git clients/websites with the normal ability.

maybe something like https://ssheasy.com/ or similar could also be used? or maybe even a gotty/xterm instance which could automatically ssh/get a tui like interface.

I feel as if this would for all scrapers be enough?

bandie91 2 days ago

i'm working on something similar: instead of web-based ssh client, it's a web-based git client UI - you can "checkout" repos, browse commits, tree, read individual files, etc. with no server-side code at all; git objects are fetched and parsed on client-side. first target is the dumb-http git protocol, so people can host git repos on static websites, and visitors don´t need to clone by a local git client to peek in.

https://bandie91.github.io/dumb-http-git-browser-js-app/ui.h...

HumanOstrich 2 days ago

Lately it seems like every time I start reading through a comment with a bunch of incoherent word and idea salad, I look up and there's your username.

Imustaskforhelp 15 hours ago

This is really not the type of legacy that I want to leave behind on hackernews but then again, I have been vocal that I just write what I think. Literally. It has its flaws but I am not sugar coating it.

Sometimes I am unable to explain myself but the thing is that I write on HN to point out of some idea, some discussion. It's better written here than lost and yes most of my ideas might be incoherent but they make perfect sense to me in the moment, its quite hard to explain.

It seems that you have made an opinion & judgement about me and that's okay. I don't wish to change it.

I suppose while writing your comment, you must have been quite pissed to write it. Sorry for pissing you off in such sense, That wasn't my intention but I do hope that you can realize that your comment comes across as rude and quite frankly, I don't know how to respond to it and I don't want to throw myself to this level or continue in an argumentative tone.

We are more common than different actually. I suppose we both love open source and might share many hobbies. The difference is small when you think about it.

I hope that instead of fighting on our differences, we can work with our agreements. Teach me instead of such tone for I am interested in learning & let's hope that both of us and everyone can make a better future for the world & everyone living in it :D

Have a nice day, my friend. Hope the future's good for ya!

So answer me this, what's your favourite open source project and why? and I will answer mine when you respond later :]

HumanOstrich 10 hours ago

Well that's the best reply I've seen to someone being cranky on the Internet. I'm not even mad.

Take care.

soulofmischief 9 hours ago

Great response.

> Teach me instead of such tone for I am interested in learning

Here are some notes:

Run on sentences and lack of punctuation make your writing hard to follow; brevity can be effective.

For each sentence, choose a subject, verb, predicate, proposition, etc. to form a single clause, but don't compound multiple such clauses into a single sentence. Break sentences up with punctuation so that the eye rests more easily when scanning. Eye fatigue is a real thing that good writers know how to manage. Contractions can also help clean up the noise.

It's okay to occasionally have compound sentences, such as this one, but too many of those leave your reader's head spinning.

It's fine and encouraged to write your initial draft in stream-of-consciousness form as you have, but an editing pass would make a worthwhile difference for slightly more effort. You do well at breaking up ideas and sentences into new paragraphs, but within those paragraphs it can be hard to keep up.

As an example, your first sentence could be rewritten from

  This is really not the type of legacy that I want to leave behind on hackernews but then again, I have been vocal that I just write what I think. Literally. It has its flaws but I am not sugar coating it.

  This isn't the type of legacy I want to leave behind on Hacker News. I prefer to write in a stream-of-consciousness style. This approach has its flaws, but it feels more natural to me.

Notice I trimmed some unnecessary words such as "really", split up a sentence, removed an unnecessary conjunction, added a comma before the "but" since the sentence contains two independent clauses.

I replaced "I am not sugar coating it" with what I feel is closer to your intended communication. "I'm not sugar coating it" is directed towards the reader and might be interpreted as antagonistic, whereas "it feels more natural to me" is directed towards yourself and can't be misconstrued.

I also compacted the phrase, "but then again, I have been vocal that I just write what I think" to "I prefer to write in a stream-of-consciousness style". The original phrase turns the reader around a bit, it takes a moment to derive intent.

The second phrase reads in a balanced way, `subject -> predicate -> verb -> preposition -> adjective -> noun`. One main clause and a complement, compared to three entire separate clauses in the original phrase. The second phrase flows down well hierarchically, and is easy to follow, while the original phrase turns the reader around and causes real, measurable fatigue when interpreting your communication.

Does this help?

kristjank 2 days ago

Is there a way to block it by shibboleth? Curious, since the recent Google hack where you add -(n-word) to the end of your query so the AI automatically shuts down works like a charm.

j-bos 2 days ago

Alternatively Anthropic has ANTHROPIC_MAGIC_STRING_TRIGGER_REFUSAL_1FAEFB6177B4672DEE07F9D3AFC62588CCD2631EDCF22E8CCC1FB35B501C9C86 baked into their models. Maybe OpenAI and other have equivalents.

oybng 8 hours ago

Thanks works great

wasmainiac 2 days ago

I Did not know this! Thanks man!

mschuster91 2 days ago

I have no doubt that it works and it's hilarious that it works, but is there a way that does not involve my Google search history look like I've applied for a KKK membership?

krick 2 days ago

So, what's up with these bots, why am I hearing about that so often lately? I mean, DDoS atacks aren't a new thing, and, honestly, this is pretty much the reason why Cloudflare even exists, but I'd expect OpenAI bots (or whatever this is now) to be a little bit easier to deal with, no? Like, simply having resonable aggressive fail2ban policy? Or do they really behave like a botnet, where each request comes from different IP from a different network? How? Why? What is this thing?

recursivecaveat 2 days ago

I doubt it's OpenAI. Maaaybe somebody who sells to OpenAI, but probably not. I think they're big enough to do this mostly in-house and properly. Before AI only big players would want a scrape of the entire internet, they could write quality bots, cooperate, behave themselves, etc. Now every 3rd tier lab wants that data and a billion startups want to sell it, so it's a wild west of bad behavior and bad implementations. They do use residential IP sets as well.

reppap 2 days ago

Stop just making up excuses for these companies. Other comments on this story have showed the bots are using openai user agents and making requests from openai owned ip ranges.

shimman 24 hours ago

It's like how we "can't" stop spam callers when telecoms know exactly who is calling who, they just don't want to implement any protocols that benefit society because they rather make money while fucking over everyone.

mikepavone 24 hours ago

As someone with a self-hosted Mercurial instance dealing with this, I will say that the big names (OpenAI included, but not exclusively them) generally at least use proper user-agents and respect robots.txt, but they are still needlessly aggressive compared to traditional search indexers.

There are also scrapers that are hiding behind normal browser user agents. When I looked at IP ranges, at least some of them seemed to be coming from data centers in China.

esseph 2 days ago

The dirty secret is a lot of them come through "residential proxies", aka backdoored home routers, iot devices with shitty security, etc. Basically the scrapers who are often also third party, go to these "companies" and buy access to these "residential proxies". Some are more... considerate than others.

Why? Data. Every bit of it is it might be valuable. And not to sound tin foil hatty, but we are getting closer to a post-quantum time (if we aren't already ).

tigerlily 2 days ago

How can I detect if my router is backdoored, or being used as a residential proxy?

mzajc 2 days ago

I'm dealing with such attack, so if you'd like, you can send me IPv4 addresses, and I'll grep my logs for them. Email address is on the website linked on my profile.

As for what you can do on your own, it really depends on your network. OpenWRT routers can run tcpdump, so you can check for suspicious connections or DNS requests, but it gets really hard to tell if you have lots of cloud-tethered devices at home. IoT, browser extensions, and smartphone applications are the usual suspects.

thesuitonym 2 days ago

The most surefire way would be to put a device between your router and your ONT/modem to capture the packets and see what requests are being sent. It's not complicated but it IS a lot of information to sift through.

Your router may have the ability to log requests, but many don't, and even if yours does, if you're concerned the device may be compromised, how can you trust the logs?

BUT, with all that said, these attacks are typically not very sophisticated. Most of the time they're searching for routers at 192.168.1.1 with admin/admin as the login credentials. If you have anything else set, you're probably good from 97% of attackers (This number is entirely made up, but seriously that percentage is high). You can also check for security advisories on your model of router. If you find anything that allows remote access, assume you're compromised.

---

As a final note, it's more likely these days that the devices running these bots are IoT devices and web browsers with malicious javascript running.

12_throw_away 2 days ago

> How can I detect if my router is backdoored, or being used as a residential proxy?

Aside from the obvious smoke tests (are settings changing without your knowledge? Does your router expose access logs you can check?), I'm not sure there's any general purpose way to check, but 2 things you can do are:

1. search for your router's model number to see if it's known to be vulnerable, and replace it with a brand-new reputable one if so (and don't buy it from Amazon).

2. There are vendors out there selling "residential proxy IP databases", (e.g., [1]) no idea how good they are, but if you have a stable public IP address you could check whether you're on that.

[1] https://ipinfo.io/data/residential-proxy

reincoder 10 hours ago

I work for IPinfo. We track close to a hundred resproxy providers. So, if OP's router is compromised, the device IPs will likely be flagged.

From what I know, whenever a router is backdoored or a resproxy SDK gains access to a device to use their bandwidth, the access to that pool of devices is often shared among multiple resproxy vendors. Many resproxy vendors do not have their own SDKs for their services.

Also, as far as I know, not many resproxy operators manage their sim farms or hardware pools. It is mostly based on compromised devices or SDK access.

kimos 2 days ago

If it’s legit you can ask your ISP if they sell use of your hardware. Or just don’t use the provided hardware and instead BYO router or modem or media converter or whatever.

But I think what OP is implying is insecure hardware being infected by malware and access to that hardware sold as a service to disreputable actors. For that buy a good quality router and keep it up to date.

teeklp 2 days ago

So you don't know? Why respond?

oblio 2 days ago

Don't be rude.

knighttt 8 hours ago

[dead]

the_biot 2 days ago

Has this actually been investigated and proven to be true? I see allegations, but no facts really.

It seems to me to be just as likely that people are installing LLM chatbot apps that do the occasional bit of scraping work on the sly, covered by some agreed EULA.

Symbiote 2 days ago

Another likely source is "free" VPN tools, or tools for streaming TV (especially football or other pay-to-view stuff). The tool can make a little money proxying requests at the same time.

I can't provide evidence as it's close to impossible to separate the AI bots using residential proxies from actual users, and their IPs are considered personal data. But as the other reply shows, it's easy enough to find people selling this service.

esseph 2 days ago

Seriously, go to Google.

Search for: "residential proxy" ai data scraping.

Start reading through thousands of articles.

the_biot 2 days ago

That's the worst thing I've seen all week. The DDoS networks of 20 years ago, now out in the open and presented as real business.

Thanks for the info, wish I didn't know :-(

mitxela 12 hours ago

[dead]

karel-3d 2 days ago

it isn't that hard to just buy a bunch of sim cards and put them in a modem and use that. it's good enough as a residential proxy. source: I did that before, when I worked on plaid-like thing.

someonebaggy 11 hours ago

[dead]

wseqyrku 2 days ago

> this is pretty much the reason why Cloudflare even exists,

You said it yourself. If you're selling a cure, you might as well start a plague.

devsda 2 days ago

At this point, I think we should look at implementing filters that send different response when AI bots are detected or when the clients are abusive. Not just simple response code but one that poisons their training data. Preferably text that elaborates on the anti consumer practices of tech companies.

If there is a common text pool used across sites, may be that will get the attention of bot developers and automatically force them to backdown when they see such responses.

fennec-posix 2 days ago

https://anubis.techaro.lol/docs/admin/honeypot/overview The Anubis scraper protection has this as a feature. Just sends garbage if something falls into a trap.

Vexs 2 days ago

You know, I reckon if you serve up smut or instructions on bomb creation or something they stop hammering you...

Imustaskforhelp 2 days ago

I think that they actually do. I remember either some discussion (so a HN post) or a HN comment actually talking about it. Oh I should've favourited it but yes this (sort of) actually works (Maybe someone can test it?)

Borg3 2 days ago

Oh poor soul :) I had the same problem. And I solved it easly. I pulled out stuff from Internet, keeping only VPN overlay network..

The future is dark I mean.. Darknets.. For people by people. Where you can deal with bad actors.. Wake up! and starting networking :)

vachina 2 days ago

Scrapers are relentless but not DDoS levels in my experience.

Make sure your caches are warm and responses take no more than 5ms to construct.

mzajc 2 days ago

I'm also dealing with a scraper flood on a cgit instance. These conclusions come from just under 4M lines of logs collected in a 24h period.

- Caching helps, but is nowhere near a complete solution. Of the 4M requests I've observed 1.5M unique paths, which still overloads my server.

- Limiting request time might work, but is more likely to just cause issues for legitimate visitors. 5ms is not a lot for cgit, but with a higher limit you are unlikely to keep up with the flood of requests.

- IP ratelimiting is useless. I've observed 2M unique IPs, and the top one from the botnet only made 400 well-spaced-out requests.

- GeoIP blocking does wonders - just 5 countries (VN, US, BR, BD, IN) are responsible for 50% of all requests. Unfortunately, this also causes problems for legitimate users.

- User-Agent blocking can catch some odd requests, but I haven't been able to make much use of it besides adding a few static rules. Maybe it could do more with TLS request fingerprinting, but that doesn't seem trivial to set up on nginx.

Imustaskforhelp 2 days ago

Quick question but do these bots which you mention are from a 24H period but how long will this "attack" continue for?

Because this is something which is happening continuously & i have observed so many HN posts like these (Anubis iirc was created by its creator out of such frustration too). Git servers being scraped to the point of its effectively an DDOS.

mzajc 2 days ago

Yes, the attack is continuous. The rate fluctuates a lot, even within a day. It's definitely an anomaly, because eg. from 2025-08-15 to 2025-10-05 I saw zero days with more than 10k requests. Here's a histogram of the past 2 weeks plus today.

  2026-01-28     21'460
  2026-01-29     27'770
  2026-01-30     53'886
  2026-01-31    100'114  #
  2026-02-01    132'460  #
  2026-02-02     73'933
  2026-02-03    540'176  #####
  2026-02-04    999'464  #########
  2026-02-05    134'144  #
  2026-02-06  1'432'538  ##############
  2026-02-07  3'864'825  ######################################
  2026-02-08  3'732'272  #####################################
  2026-02-09  2'088'240  ####################
  2026-02-10    573'111  #####
  2026-02-11  1'804'222  ##################

bcrl 2 days ago

It's plausible that the AI companies have given up storing data for training runs and just stream it off the Internet directly now. It's probably cheaper to stream than buying more SSDs and HDDs from a supply constrained supply chain at this point.

longor1996 14 hours ago

That this is a plausible explanation is... beyond horrifying to me.

Imustaskforhelp 2 days ago

Thanks for sharing the data, This unpredictability and everything is even more suspicious.

Thoughts on having an ssh server with https://github.com/charmbracelet/soft-serve instead?

watermelon0 2 days ago

Great, now we need caching for something that's seldom (relatively speaking) used by people.

Let's not forget that scrapers can be quite stupid. For example, if you have phpBB installed, which by defaults puts session ID as query parameter if cookies are disabled, many scrapers will scrape every URL numerous times, with a different session ID. Cache also doesn't help you here, since URLs are unique per visitor.

kimos 2 days ago

You’re describing changing the base assumption for software reachable on the internet. “Assume all possible unauthenticated urls will be hit basically constantly”. Bots used to exist but they were rare traffic spikes that would usually behave well and could mostly be ignored. No longer.

DarmokJalad1701 2 days ago

I have a self-hosted Gitea instance behind a Cloudflare Tunnel protected by CloudFlare Access. Zero issues. Obviously not "public", but it is accessible from the internet with a simple login.

snorremd 2 days ago

I've recently been setting up web servers like Forgejo and Mattermost to service my own and friends' needs. I ended up setting up Crowdsec to parse and analyse access logs from Traefik to block bad actors that way. So when someone produces a bunch of 4XX codes in a short timeframe I assume that IP is malicious and can be banned for a couple of hours. Seems to deter a lot of random scraping. Doesn't stop well behaved crawlers though which should only produce 200-codes.

I'm actually not sure how I would go about stopping AI crawlers that are reasonably well behaved considering they apparently don't identify themselves correctly and will ignore robots.txt.

lowdude 2 days ago

There was a comment in a different thread that suggested they may respect the robots.txt for the most part, but may ignore wildcards: https://news.ycombinator.com/item?id=46975726

Maybe this is worth trying out first, if you are currently having issues.

V__ 2 days ago

If possible block I would block by country first. Even on public websites I block Russia/China by default and that reduced port scans etc.

On "private" services where I or my friends are the only users, I block everything except my country.

gbanfalvi 15 hours ago

Someone should just come up with some middleware that produces jumbled-up responses. Your website will still match the keywords from search engines but be toxic to ai crawlers

GuestFAUniverse 2 days ago

Fail2ban has decent jails for Apache httpd. And writing a rule that matches requests to nonexistent resources is very easy -- one-liners + time based threshold. Basically you could ban differently according to the http errors they cause (e.g. bots on migrated resources: many 404 within a minute, Slowloris is visible as a lot of 408).

GuestFAUniverse 2 days ago

Jails for other web servers are obviously as easy.

reactordev 2 days ago

Ugh, exposing it with cgit is why.

Put it all behind an OAuth login using something like Keycloak and integrate that into something like GitLab, Forgejo, Gitea if you must.

However. To host git, all you need is a user and ssh. You don’t need a web ui. You don’t need port 443 or 80.

PaulDavisThe1st 2 days ago

Using gitea does not help if you goal is to allow non-auth'ed read-only access to the repo from a web browser. The scrapers use that to hit up every individual commit, over and over and over.

We used nginx config to prevent access to individual commits, while still leaving the "rest" of what gitea makes available read-only for non-auth'ed access unaffected.

kstrauser 2 days ago

Every commit. Every diff between 2 different commits. Every diff with different query parameters. Git blame for each line of each commit.

Imagine a task to enumerate every possible read-only command you could make against a Git repo, and then imagine a farm of scrapers running exactly one of them per IP address.

Ugh.

PaulDavisThe1st 2 days ago

Ugh Ugh Ugh ... and endless ughs, when all they needed was "git clone" to get the whole thing and spend as much time and energy as they wanted analyzing it.

reactordev 2 days ago

Yuk…

   http {
       # ... other http settings
       limit_req_zone $binary_remote_addr zone=mylimit:10m rate=10r/s;
       # ...
   }


    server {
        # ... other server settings
        location / {
            limit_req zone=mylimit burst=20 nodelay;
            # ... proxy_pass or other location-specific settings
        }
    }

Rate limit read-only access at the very least. I know this is a hard problem for open source projects that have relied on web access like this for a while. Anubis?

diath 18 hours ago

Easier said than done, I have 700k requests from bots in my access.log coming from 15k different IP addresses.

:: ~/website ‹master*› » rg '(GPTBot|ClaudeBot|Bytespider|Amazonbot)' access.log | awk '{print $1}' | sort -u | wc -l

15163

PaulDavisThe1st 2 days ago

We used fail2ban to do rate limiting first. It wasn't adequate.

reactordev 2 days ago

Ooof, maybe a write up is in order? An opinioned blog post? I'd love to know more.

PaulDavisThe1st 21 hours ago

As noted by others, the scrapers do not seem to respond to rate limiting. When you're being hit by 10-100k different IP's per hour and they don't respond to rate limiting, rate limiting isn't very effective.

JohnTHaller 2 days ago

The Chinese AI scrapers/bots are killing quite a bit of the regular web now. YisouSpider absolutely pummeled my open source project's hosting for weeks. Like all Chinese AI scrapers, it ignores robots.txt. So forget about it respecting a Crawl-delay. If you block the user agent, it would calm down for a bit, then it would just come back again using a generic browser user agent from the same IP addresses. It does this across 10s of thousands of IPs.

mono442 2 days ago

Just block the whole China, India and similar countries.

kevin_thibedeau 2 days ago

Start blocking /16s.

Lerc 2 days ago

I presume people have logs that indicate the source for them to place blame on AI scrapers. Is anybody making these available for analysis so we can see exactly who is doing this?

JohnTHaller 2 days ago

The big nasty AI bots use 10s of thousands of IPs distributed all over China

notachatbot123 2 days ago

Millions and all over the world

krick 2 days ago

So... just blacklist all China IPs? I assume China isn't the primary market for most of complaining site-owners.

yellowapple 5 hours ago

I wonder if shoving

> 动态网自由门天安門天安门法輪功李洪志 Free Tibet 六四天安門事件 The Tiananmen Square protests of 1989 天安門大屠殺 The Tiananmen Square Massacre 反右派鬥爭 The Anti-Rightist Struggle 大躍進政策 The Great Leap Forward 文化大革命 The Great Proletarian Cultural Revolution 人權 Human Rights 民運 Democratization 自由 Freedom 獨立 Independence 多黨制 Multi-party system 台灣臺灣 Taiwan Formosa 中華民國 Republic of China 西藏土伯特唐古特 Tibet 達賴喇嘛 Dalai Lama 法輪功 Falun Dafa 新疆維吾爾自治區 The Xinjiang Uyghur Autonomous Region 諾貝爾和平獎 Nobel Peace Prize 劉暁波 Liu Xiaobo 民主言論思想反共反革命抗議運動騷亂暴亂騷擾擾亂抗暴平反維權示威游行李洪志法輪大法大法弟子強制斷種強制堕胎民族淨化人體實驗肅清胡耀邦趙紫陽魏京生王丹還政於民和平演變激流中國北京之春大紀元時報九評論共産黨獨裁專制壓制統一監視鎮壓迫害侵略掠奪破壞拷問屠殺活摘器官誘拐買賣人口遊進走私毒品賣淫春畫賭博六合彩天安門天安门法輪功李洪志 Winnie the Pooh 劉曉波动态网自由门

into the headers of every response would be enough to kill off the worst-offending traffic?

esseph 2 days ago

A lot of compromised home devices and cheap servers proxying traffic, from all over the world.

Lerc 2 days ago

If that is the case how can you determine the reason for the activity?

esseph 2 days ago

Some fake user agent, some tell you who they are. Or.. do they?

Here-in is the problem. And if you block them, you risk blocking actual customers.

Lerc 2 days ago

If they are using appropriated hardware, what possible reason could there be for them saying who they are?

esseph 2 days ago

Three different "companies" normally:

1. The residential proxies

2. Scrapers, on behalf of or as an agent of the data buyer

3. Data buyer (ai training)

Scrapers are buying from residential proxies, giving the data buyer a bit of a shield/deniability.

The scrapers don't want to get outright blocked if they can avoid it, otherwise they have nothing to sell.

thesuitonym 2 days ago

I don't understand. Your HTTPS server was being hammered so you stopped serving Git? That doesn't make any sense at all, if it's a private server, why not just turn off the web frontend?

jotaen 2 days ago

The post says their web frontend was public.

ptman 2 days ago

Maybe put the git repos on radicle?

t312227 2 days ago

hello,

as always: imho. (!)

idk ... i just put a http basic-auth in front of my gitweb instance years ago.

if i really ever want to put git-repositories into the open web again i either push them to some portal - github, gitlab, ... - or start thinking about how to solve this ;))

just my 0.02€

madduci 2 days ago

I've put everything behind a Wireguard Server, so if I need something, I can access to it through VPN and AI can't do anything

t312227 16 hours ago

hello,

as always: imho. (!)

btw. thanks for the downvote.

its for sure better to kill your own infrastructure because of some AI crawlers - buhuuuu ... bad bots!! - than to solve your problem with a stupid simple but effective solution.

just as an idea: if i had to host public repositories i would think about how to disable costly operations - searches etc. - for anonymous access ... like github did.

just my 0.02€

Joel_Mckay 2 days ago

Some run git over ssh, and a domain login for https:// permission manager etc.

Also, spider traps and 42TB zip of death pages work well on poorly written scrapers that ignored robots.txt =3

hattmall 2 days ago

Can we not charge for access? If I have a link, that says "By clicking this link you agree to pay $10 for each access" then sending the bill?

simonw 2 days ago

Cloudflare launched a product to do that last year: https://blog.cloudflare.com/introducing-pay-per-crawl/

I have no idea if it actually works as advertised though. I don't think I've heard from anyone trying it.

M95D 2 days ago

Send it where?

anarticle 2 days ago

I use a private gitlab that was setup by claude, have my own runners and everything. It's fine. I have my own little home cluster, net storage compute around 2.5k. Go NUCs, cluster, don't look back.

bigbuppo 2 days ago

Just another example of AI and its DoSaaS ruining things for everyone. The AI bros just won't accept "NO" for an answer.

sdf2erf 2 days ago

[dead]

Amol-917 2 days ago

[flagged]

oceanplexian 2 days ago

[flagged]

QuiDortDine 2 days ago

Not sure why you're talking like OP pissed in your cheerios. They are a victim of a broken system, it shouldn't be on them to spend more effort protecting their stuff from careless-to-malicious actors.

simonw 2 days ago

A varnish cache won't help you if you're running something like a code forge where every commit has its own page - often more than one page, there's the page for the commit and then the page for "history from this commit" and a page for every one of the files that existed in the repo at the time of that commit...

Then a poorly written crawler shows up and requests 10,000s of pages that haven't been requested recently enough to be in your cache.

I had to add a Cloudflare Captcha to the /search/ page of my blog because of my faceted search engine - which produces may thousands of unique URLs when you consider tags and dates and pagination and sort-by settings.

And that's despite me serving ever page on my site through a 15 minute Cloudflare cache!

Static only works fine for sites that have a limited number of pages. It doesn't work for sites that truly take advantage of the dynamic nature of the web.

ninjin 2 days ago

Exactly. The problem is that by their very nature some content has to be dynamically generated.

Just to add further emphasis as to how absurd the current situation is. I host my own repositories with gotd(8) and gotwebd(8) to share within a small circle of people. There is no link on the Internet to the HTTP site served by gotwebd(8), so they fished the subdomain out of the main TLS certificate. I am getting hit once every few seconds for the last six or so months by crawlers ignoring the robots.txt (of course) and wandering aimlessly around "high-value" pages like my OpenBSD repository forks calling blame, diff, etc.

Still managing just fine to serve things to real people, despite me at times having two to three cores running at full load to serve pointless requests. Maybe I will bother to address this at some point as this is melting the ice caps and wearing my disks out, but for now I hope they will choke on the data at some point and that it will make their models worse.

anonnon 2 days ago

Well, it's heartening to know that AI is making your life at least somewhat less enjoyable.

aguacaterojo 2 days ago

How would a LAMP stack help his git server?

anonnon 2 days ago

Your post is pure victim-blaming, as well as normalizing an exploitative state of affairs (being aggressively DDOSed by poorly-behaved scrapers run by Big Tech that only take and never give back, unlike pre-AI search engines, which previously, at least, would previously send you traffic) that was unheard of until just a few years ago.

october8140 2 days ago

You could put it behind Cloudflare and block all AI.

CuriouslyC 2 days ago

Does this author have a big pre-established audience or something? Struggling to understand why this is front-page worthy.

jaunt7632 2 days ago

A healthy front page shouldn’t be a “famous people only” section. If only big names can show up there, it’s not discovery anymore, it’s just a popularity scoreboard.

fouc 2 days ago

because he's unable to self-host git anymore because AI bots are hammering it to submit PRs.

self-hosting was originally a "right" we had upon gaining access to the internet in the 90s, it was the main point of the hyper text transfer protocol.

geerlingguy 2 days ago

Also converting the blog from something dynamic to a static site generator. I made the same switch partly for ease of maintenance, but a side benefit is it's more resilient to this horrible modern era of scrapers far outnumbering legitimate traffic.

It's painful to have your site offline because a scraper has channeled itself 17,000 layers deep through tag links (which are set to nofollow, and ignored in robots.txt, but the scraper doesn't care). And it's especially annoying when that happens on a daily basis.

Not everyone wants to put their site behind Cloudflare.

tanduv 2 days ago

sorry if i missed it, but the original post doesn't say anything about PRs... the bots only seem to be scraping the content

fouc 2 days ago

oh you're right, I read "pointless requests" as "PRs", oops!

ares623 2 days ago

Well the fact that this supposed nobody is overwhelmed by AI scrapers should speak a lot about the issue no?

bibimsz 2 days ago

the era of mourning has begun

Jaxkr 2 days ago

The author of this post could solve their problem with Cloudflare or any of its numerous competitors.

Cloudflare will even do it for free.

denkmoon 2 days ago

Cool, I can take all my self hosted stuff and stick it behind centralised enterprise tech to solve a problem caused by enterprise tech. Why even bother?

FeteCommuniste 2 days ago

"Cause a problem and then sell the solution" proves a winning business strategy once more.

Shorel 2 days ago

Cloudflare seems to be taking over all of the last mile web traffic, and this extreme centralization sounds really bad to me.

We should be able to achieve close to the same results with some configuration changes.

AWS / Azure / Cloudflare total centralization means no one will be able to self host anything, which is exactly the point of this post.

the_fall 2 days ago

They don't. I'm using Cloudflare and 90%+ of the traffic I'm getting are still broken scrapers, a lot of them coming through residential proxies. I don't know what they block, but they're not very good at that. Or, to be more fair: I think the scrapers have gotten really good at what they do because there's real money to be made.

esseph 2 days ago

Probably more money in scraping than protection...

simonw 2 days ago

Cloudflare won't save you from this - see my comment here: https://news.ycombinator.com/item?id=46969751#46970522

aspenmayer 19 hours ago

Parent of your comment became [flagged][dead], which broke your in-context link.

A direct link works, however:

https://news.ycombinator.com/item?id=46970522

Semaphor 2 days ago

For logging, statistics etc. we have the Cloudflare bot protection on the standard paid level, ignore all IPs not from Europe (rough geolocation), and still have over twice the amount of bots that we had ~2 years ago.

rubiquity 2 days ago

The scrapers should use some discretion. There are some rather obvious optimizations. Content that is not changing is less likely to change in the future.

JohnTHaller 2 days ago

They don't care. It's the reason they ignore robots.txt and change up their useragents when you specifically block them.

overgard 2 days ago

I'm pretty sure scrapers aren't supposed to act as low key DOS attacks

isodev 2 days ago

I think the point of the post was how something useless (AI) and its poorly implemented scrapers is wrecking havoc in a way that’s turning the internet into a digital desert.

That Cloudflare is trying to monetise “protection from AI” is just another grift in the sense that they can’t help themselves as a corp.

fouc 2 days ago

you don't understand what self-hosting means. self-hosting means the site is still up when AWS and Cloudflare go down.