It stores webpages in multiple formats (HTML snapshot, screenshot, PDF snapshot, and a fully dedicated reader view) so you’re not relying on a single fragile archive method.
There’s both a hosted cloud plan [1] which directly supports the project, and a fully self-hosted option [2], depending on how much control you need over storage and retention.
One question: what's your stance on adding a way to mark articles as read or "archive" them, like other apps that position themselves more as read-it-later tools? You can technically do something similar with tags, but the UX is a bit clunky.
I am currently evaluating Linkwarden, Wallabag, Hoarder, and Linkding, and each of the services has pros and cons, making it hard for me to choose one. Linkwarden is AWESOME in the way it stores content in multiple formats, but the read-later workflows could be improved.
Without checking again: does Linkwarden sync the reading position across devices and automatically scroll to that position on the next device? Does it estimate how "long" an article takes to read (based solely on its length)? Does Linkding support marking up text and persisting it (highlight some text in yellow and see those highlights somewhere, or even add comments or favorite specific passages)?
No need to answer any of these questions; I can research them myself. I'm just putting them out there as what I'd want from a read-later solution: add a link on my mobile device, let Linkwarden do its magic in the backend, and check out the content later on desktop or even on mobile.
FWIW, at least on iOS, it's possible to inject JavaScript into the website currently being displayed by Safari as a side effect of sharing a web link to an app via the share sheet.
Several "read it later" style apps use this successfully to get around paywalls (assuming you've paid yourself) and other robot blockers. Any plans for Linkwarden to do this (or does it already)?
Does it just POST the URL for them to fetch? Or is there any integration/trust mechanism to store what you already fetched on the client directly in their archives?
It's far from perfect, but it does achieve its stated goal of resurfacing real people on the internet.
It recently got some NLNet funding and I hope to see it flourish - to my knowledge there aren't any other projects trying to claw back control of the internet towards the commons.
We are increasingly becoming blind. To me it looks as if this is done on purpose actually.
That's a travesty, considering that a huge chunk of science is public-funded; the public is being denied the benefits of what they're paying for, essentially.
Indefinitely? Probably not.
What about when a regime wants to make the science disappear?
Because it costs money to serve them the content.
Is the answer to regulate AI? Yes.
Because when you build it you aren't, presumably, polling their servers every fifteen minutes for the entire corpus. AI scrapers are currently incredibly impolite.
So we've basically decided we only want bad actors to be able to scrape, archive, and index.
AI training will be hard to police. But a lot of these sites inject ads in exchange for paywall circumvention. Just scanning Reddit for the newest archive.is or whatever should cut off most of the traffic.
""" In November 2025, an investigation by technology journalist Alex Reisner for The Atlantic revealed that Common Crawl lied when it claimed it respected paywalls in its scraping and requests from publishers to have their content removed from its databases. It included misleading results in the public search function on its website that showed no entries for websites that had requested their archives be removed, when in fact those sites were still included in its scrapes used by AI companies. """
My site is CC-BY-NC-SA, i.e. non-commercial and with attribution, and Common Crawl took a dubious position on whether fair use makes that irrelevant. They can burn.
Also, if your site has CC-BY-NC-SA markings, we have preserved them.
The purpose of a search engine is to display links to web pages, not the entire content. As such, it can be argued it falls under fair use. It provides value to the people searching for content and those providing it.
However, we left such a crucially important public utility in the hands of private companies, which changed their algorithms many times in order to maximize their profits rather than the public good.
I think there needs to be real competition, and I am increasingly becoming certain that the government should be part of that competition. Both "private" companies and "public" government are biased, but they are biased in different ways, and I think there is real value to be created in that clash. It makes it easier for individuals to pick and choose the best option for themselves, and for third, independent options to be developed.
The current cycle of knowledge generation is academia doing foundational research -> private companies expanding this research and monetizing it -> nothing. If the last step were expanded to the government providing a barebones but usable service to commoditize it, years after private companies have been able to reap immense profits, then the capabilities of the entire society would be increased. If the last step is prevented, then the ruling companies turn to rent-seeking and sitting on their laurels, turning from innovating to extracting.
No one "left" a crucially important public utility in the hands of private companies. Private companies developed the search engine themselves in the late 90s in the course of doing for-profit business; and because some of them ended up being successful (most notably Google), most people using the internet today take the availability of search engines for granted.
"Google’s true origin partly lies in CIA and NSA research grants for mass surveillance" (January 28, 2025)
The intelligence community hoped that the nation’s leading computer scientists could take non-classified information and user data, combine it with what would become known as the internet, and begin to create for-profit, commercial enterprises to suit the needs of both the intelligence community and the public. They hoped to direct the supercomputing revolution from the start in order to make sense of what millions of human beings did inside this digital information network. That collaboration has made a comprehensive public-private mass surveillance state possible today.
The Massive Digital Data Systems (MDDS) ... program's stated aim was to provide more than a dozen grants of several million dollars each to advance this research concept. The grants were to be directed largely through the NSF so that the most promising, successful efforts could be captured as intellectual property and form the basis of companies attracting investments from Silicon Valley. This type of public-to-private innovation system helped launch powerful science and technology companies like Qualcomm, Symantec, Netscape, and others.
<https://qz.com/1145669/googles-true-origin-partly-lies-in-ci...>
The Internet itself (particularly its precursor, ARPANET), was also government funded, as was development of the World Wide Web (CERN). Oracle, the database company, grew out of the CIA's Project Oracle.
CIA Reading Room Project Oracle
<https://www.cia.gov/readingroom/document/cia-rdp80-01794r000...>
"Oracle's coziness with government goes back to its founding / Firm's growth sustained as niche established with federal, state agencies" (2002)
<https://www.sfgate.com/bayarea/article/oracle-s-coziness-wit...>
Surveillance has been baked in since their founding.
While unlikely, the ideal would be for the government to provide a foundational open search infrastructure that would allow people to build on it and expand it to fit their needs in a way that is hard to do when a private company eschews competition and hides its techniques.
Perhaps it would be better for there to be a sanctioned crawler funded by the government that then sells the unfiltered information to third parties like Google. This would ensure IP rights are protected while keeping access to information open.
They can charge money for access or disallow all scrapers, but it should not be allowed to selectively allow only Google.
It would also be in the spirit of the fair use doctrine's first and fourth considerations:
> 1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
> 2. the nature of the copyrighted work;
> 3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
> 4. the effect of the use upon the potential market for or value of the copyrighted work.
If that doesn't happen, increasing amounts of information and human creativity will be siloed and never made publicly accessible in a way that it can be consumed and reproduced as slop.
Users control which sites they allow it to record, so there are no privacy worries, especially assuming the plugin is open source.
No automated crawling. The plugin does not drive the user's browser to fetch things; of whatever a user happens to actually view on their own, some percentage of those views from the activated domains gets submitted to some archive.
Not every view. Maybe 100 people each submit 1% of views, and maybe it's a random selection, or maybe it's weighted by some feedback mechanism where the archive destination can say "Hey, if the user views this particular URL, I still don't have that one yet, so definitely send it if you see it rather than just applying the normal random chance."
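A minimal sketch of what that client-side decision could look like, assuming a hypothetical plugin where the archive publishes a "wanted URLs" list it still needs (every name and number here is illustrative, not a real API):

    import hashlib
    import random

    SAMPLE_RATE = 0.01          # each user submits ~1% of views by default
    ACTIVATED_DOMAINS = {"example-blog.org", "example-news.net"}  # user opt-in list
    wanted_urls = set()         # hypothetical "please send this" list fetched from the archive

    def should_submit(url: str, domain: str) -> bool:
        """Decide whether this page view gets submitted to the archive."""
        if domain not in ACTIVATED_DOMAINS:
            return False                      # user never opted this site in
        if url in wanted_urls:
            return True                       # archive explicitly asked for this one
        return random.random() < SAMPLE_RATE  # otherwise, plain random sampling

    def submit(url: str, html: str) -> None:
        # placeholder: a real plugin would POST the captured page to the archive
        digest = hashlib.sha256(html.encode()).hexdigest()
        print(f"would submit {url} (sha256 {digest[:12]}...)")

The weighting could just as easily live server-side; the only client-side requirement is that nothing gets submitted for domains the user hasn't activated.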
Not sure how to protect the archive itself or its operators.
> no privacy worries
This is harder than you might expect. Publishing these files is always risky because sites can serve you fingerprinting data, like some hidden HTML tag containing your IP and other identifiers.
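As a rough illustration of why this is hard: even a naive scrub of a captured page before publishing (a sketch only; the identifier list and the hidden-element heuristic are stand-ins, not a real defense) only catches the obvious cases:

    import re

    def naive_scrub(html: str, known_identifiers: list[str]) -> str:
        """Redact identifiers we know about (our own IP, session IDs, etc.)."""
        for ident in known_identifiers:
            html = html.replace(ident, "[REDACTED]")
        # strip obviously hidden elements, a common place to stash per-user markers
        html = re.sub(r'<[^>]+style="[^"]*display:\s*none[^"]*"[^>]*>.*?</[^>]+>',
                      "", html, flags=re.DOTALL | re.IGNORECASE)
        return html

Anything encoded in image pixels, CSS ordering, whitespace patterns, or per-user markup variations sails straight through a filter like this.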
As does Tranquility Reader, if you're interested only in the primary content of the page ... and, usually, in a much smaller footprint ... with a PDF option.
The problem with the LLMs is they capture the value chain and give back nothing. It didn’t have to be this way. It still doesn’t.
Now that AI companies are using residential proxies to get around the obvious countermeasures, I have resorted to blocking all countries that are not my target audience.
It really sucks. The internet is terminally ill.
It's unfortunate that this undermines the usefulness of the Internet Archive, but I don't see an alternative. IMHO, we'll soon see these AI scrapers cease to advertise themselves, leading to sites like the NY Times trying to blacklist IP ranges as this battle continues. Fun times ahead!
But then it was not really open content anyway.
> When asked about The Guardian’s decision, Internet Archive founder Brewster Kahle said that “if publishers limit libraries, like the Internet Archive, then the public will have less access to the historical record.”
Well, we need something like Wikipedia for news content. Perhaps not 100% Wikipedia; instead, Wikipedia to store the hard facts, with tons of verification, plus a news editorial arm that focuses on free content but in a newspaper style, e.g. with professional (or at least good) writers. I don't know how the model could work, but IF we could come up with this, then newspapers that paywall information would become less relevant automatically. That way we win long-term, as the paywalled content isn't really part of the open web anyway.
Journalism as an institution is under attack because the traditional source of funding - reader subscriptions to papers - no longer works.
To replicate the Wikipedia model would need to replicate the structure of Journalism for it to be reliable. Where would the funding for that come from? It's a tough situation.
The Wikipedia folks had their own Wikinews project which is essentially on hold today because maintenance in a wiki format is just too hard for that kind of uber-ephemeral content. Instead, major news with true long-term relevance just get Wikipedia articles, and the ephemera are ignored.
Practically no quality journalism is.
> we need something like wikipedia for news
Wikipedia editors aren’t flying into war zones.
Which is a valuable perspective. But it's not a substitute for a seasoned war journalist who can draw on global experience. (And relating that perspective to a particular home market.)
> I'm sure some of them would fly in to collect data if you paid them for it
Sure. That isn't "a news editorial that focuses on free content but in a newspaper-style, e. g. with professional (or good) writers."
One part of the population imagines journalists as writers. They're fine on free, ad-supported content. The other part understands that investigation is not only resource intensive, but also requires rare talent and courage. That part generally pays for its news.
Between the two, a Wikipedia-style journalistic resource is not entertaining enough for the former and not informative enough for the latter. (Importantly, compiling an encyclopedia is principally the work of research and writing. You can be a fine Wikipedia–or scientific journal or newspaper–editor without leaving your room.)
- crowdsourced data, eg, photos of airplane crashes
- people who live in an area start vlogs
- independent correspondents travel there to interview, eg Ukraine or Israel
We see that our best war reporting comes from analyst groups who ingest that data from the “firehose” of social media. Sometimes at a few levels, eg, in Ukraine the best coverage is people who compare the work of multiple groups mapping social media reports of combat. You have on top of that punditry about what various movements mean for the war.
So we don’t have “journalist”:
- we have raw data (eg, photos)
- we have first hand accounts, self-reported
- we have interviewers (of a few kinds)
- we have analysts who compile the above into meaningful intelligence
- we have anchors and pundits who report on the above to tell us narratives
The fundamental change is that what used to be several roles within a news agency are now independent contractors online. But that was always the case in secret — eg, many interviewers were contracted talent. We’re just seeing the pieces explicitly and without centralized editorial control.
So I tend not to catastrophize as much, because this to me is what the internet always does:
- route information flows around censorship
- disintermediate consumers from producers when the middle layer provides a net negative
As always in business, evolve or die. And traditional media has the same problem you outline:
- not entertaining enough for the celebrity gossip crowd
- too slow and compromised by institutional biases for the analyst crowd, eg, compare WillyOAM coverage of Ukraine to NYT coverage
Interesting idea. It could be something that archives first and releases at a later date, when the news is no longer as fresh.
Isn't that what state funded news outlets are?
Sell a "truck full of DAT tapes" type service to AI scrapers with snapshots of the IA. Sort of like the cloud providers have with "Data Boxes".
It will fund IA, be cheaper than building and maintaining so many scrapers, and may relieve the pressure on these news sites.
I've been building tools that integrate with accounting platforms and the number of times a platform's API docs or published rate limits have simply disappeared between when I built something and when a user reports it broken is genuinely frustrating. You can't file a support ticket saying "your docs said X" when the docs no longer say anything because they've been restructured.
For compliance specifically - HMRC guidance in the UK changes constantly, and the old versions are often just gone. If you made a business decision based on published guidance that later changes, good luck proving what the guidance actually said at the time. The Wayback Machine has saved me more than once trying to verify what a platform's published API behaviour was supposed to be versus what it actually does.
The SOC 2 / audit trail point upthread is spot on. I'd add that for smaller businesses, it's not just formal compliance frameworks - it's basic record keeping. When your payment processor's fee schedule was a webpage instead of a PDF and that webpage no longer exists, you can't reconcile why your fees changed.
> just like how the agricultural sector is hell-bent on scapegoating AI (and lawns, and golf courses, and long showers, and free water at restaurants) for excess water consumption when even the worst-offending datacenters consume infinitesimally-tiny fractions of the water farms in their areas consume.
When I learned about how much water agriculture and industry uses in the state of California where I live, I basically entirely stopped caring about household water conservation in my daily life (I might not go this far if I had a yard or garden that I watered, but I don't where I currently live). If water is so scarce in an urban area that an individual human taking a long shower or running the dishwasher a lot is at all meaningful, then either the municipal water supply has been badly mismanaged, or that area is too dry to support human settlement; and in either case it would be wise to live somewhere else.
What worries me isn’t scraping itself, but the second-order effects. If large parts of the web become intentionally unarchivable, we’re slowly losing a shared memory layer. Short-term protection makes sense, but long-term it feels like knowledge erosion.
Genuinely curious how people here think about preserving public knowledge without turning everything into open season for mass scraping.
I'm thinking in particular about the rise of platforms like Discord where being opaque to search/archiving is seen as a feature. Being gatekept and ephemeral makes people more comfortable sharing things that might get a takedown notice on other platforms, and it's hard for people who don't like you in the future to try to find jokes/quotes they don't like to damage your future reputation.
Clearly very different than news articles going offline, but I do think there's been a vibe shift around the internet. People feel overly surveilled in daily life, and take respite in places that make surveillance harder.
Tragedy of the commons.
A subscriber opens the FT, reads an article about semiconductor export controls, pastes it into Claude to ask "what does this mean for my portfolio?" - the FT's content just entered a model's reasoning process, got synthesized with other knowledge, and produced derivative value. No scraper was involved. The paywall was respected. The subscriber paid. And yet the publisher's content was "consumed" by an AI in exactly the way they're trying to prevent.
News publishers limit Internet Archive access due to AI scraping concerns
They will announce official paid AI access plans soon. Bookmark my words.
You just have to rely on screenshots that may or may not have been fabricated, and maybe nobody's even captured a screenshot. If it's a public figure you normally trust, versus some random people's screenshots, of course you're gonna dismiss the screenshots as fake. It feels almost intentional to bring the platform into the dark ages.
I've said it before, and I'll say it again: the main issue is not design patterns but the lack of acceptable payment systems. The EU, with its dismantling of Visa and Mastercard, now has the perfect opportunity to solve this, but I doubt it will. It'll probably just create a European WeChat.
Either way I'm fairly certain that blocking AI agent access isn't a viable long term solution.
Great point. If my personal AI assistant cannot find your product/website/content, it effectively may no longer exist! For me. Ain't nobody got the time to go searching that stuff up and sifting through the AI slop. The pendulum may even swing the other way and the publishers may need to start paying me (or whoever my gatekeeper is) for access to my space...
The bigger question is business model vs value-add. Copyright law draws a very direct line from value-add to compensation - if you created something new (or even derivative), copyright attaches to allow for compensation, if people find it valuable.
Business models are a different animal: they can range from value-add services and products to rent-seeking to monopolies, extracting value from both producers and consumers.
While copyright law makes no mention of business models, I don't know whether that is a historical artifact since copyright is presumably older, or a philosophical exclusion because society owes no business model a right to exist. I would suggest the existence of monopoly-busting government agencies argues that societies do not owe business models a right of existence. Fair compensation for the advancement of arts and sciences is clearly a public good, though.
Tying it back to the AI-in-the-middle question, it's yet another platform in a series of these between producers and consumers, and doesn't override copyright. Regurgitating a copyright (article, art, whatever) should absolutely attract compensation; should summarizing content attract compensation? should it be considered any different from a friend (or executive assistant) describing the content? And if the producers' business model involves extracting value from a transaction on any basis other than adding value to the consumer, does society owe that business model any right to exist?
I believe many publications used to do this. The novel threat is AI training. It doesn't make sense to make your back catalog de facto public for free like that. There used to be an element of goodwill in permitting your content to be archived. But if the main uses are circumventing compensation and circumventing licensing requirements, that goodwill isn't worth much.
I am sad about link rot and old content disappearing, but it's better than everything being saved for all time, to be used against folks in the future.
I don't understand this line of thinking. I see it a lot on HN these days, and every time I do I think to myself, "Don't you realize that if things kept on being erased we'd learn nothing from anything, ever?"
I've started archiving every site I have bookmarked in case of such an eventuality when they go down. The majority of websites don't have anything to be used against the "folks" who made them. (I don't think there's anything particularly scandalous about caring for doves or building model planes)
The truly important stuff exists in many forms, not just online/digital. Or will be archived with increased effort, because it's worth it.
If you don’t want your bad behavior preserved for the historical record, perhaps a better answer is to not engage in bad behavior instead of relying on some sort of historical eraser.
Maybe the Internet Archive would be OK with keeping some things private until x time passes, or they could require an account to access them.
Their big requirement is that you not do any DNS filtering or blocking of access to what it wants, so I've got the pod's DNS pointed at the unfiltered Quad9 endpoint and rules in my router to allow the machine it's running on to bypass my Pi-hole enforcement and outside-DNS blocks.
^1 https://wiki.archiveteam.org/
^2 https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior
Sometimes it feels like AI-use concerns are a guise to diminish the public record, while on the other hand services like Ring or Flock are archiving the public forever.
So if you own monkey.com, then after proving site ownership you must get access to all the access data related to your site. Problem solved.
They do not care, and we will all be worse off for it if these AI companies keep bombarding news publishers' RSS feeds.
It is a shame that the open web as we know it is closing down because of these AI companies.
I've seen companies fail compliance reviews because a third-party vendor's published security policy that they referenced in their own controls no longer exists at the URL they cited. The web being unarchivable isn't just a cultural loss. It's becoming a real operational problem for anyone who has to prove to an auditor that something was true at a specific point in time.
I hope I’m wrong, but my bot paranoia is at all time highs and I see these patterns all throughout HN these days.
@dang do you have any thoughts about how you’re performing AI moderation on HN? I’m very worried about the platform being flooded with these Submarine comments (as PG might call them).
They're getting very clever and tricky though; a lot of them have the owners watching and step in to pretend that they're not bots and will respond to you. They did this last week and tricked dang.
They then made a top-level submission revealing the "experiment": https://news.ycombinator.com/item?id=46901199
Sidebar:
Having been part of multiple SOC audits at large financial firms, I can say that nothing brings adults closer to physical altercations in a corporate setting than trying to define which jobs are "critical".
- The job that calculates the profit and loss for the firm, definitely critical
- The job that cleans up the logs for the job above, is that critical?
- The job that monitors the cleaning up of the logs, is that critical too?
These are simple examples but it gets complex very quickly and engineering, compliance and legal don't always agree.
Sadly, it does not even have to be an acquisition or rebrand. For most companies, a simple "website redo", even if the brand remains unchanged, will change up all the URLs such that any prior recorded ones return "not found". Granted, if the identical attestation is simply at a new URL, someone could potentially find that new URL and update the "policy" -- but that's also an extra effort that the insurance company can avoid by requiring screenshots or PDF exports.
Yes, we have hundreds of identical Microsoft and AWS policies, but it's the only way. Checksum the full zip and sign it as part of the contract; that's literally how we do it.
That's actually a potentially good business idea - a legally certifiable archiving software that captures the content at a URL and signs it digitally at the moment of capture. Such a service may become a business requirement as Internet archivability continues to decline.
I don't know exactly how it achieves being "legally certifiable", at least to the point that courts trust it. Signing and timestamping with independent transparency logs would be reasonable.
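A minimal sketch of the capture-and-sign idea, assuming the third-party `cryptography` package and glossing over the hard part (getting a court or an independent timestamping authority to trust the signer):

    import datetime
    import hashlib
    import json
    import urllib.request

    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    def capture_and_sign(url: str, key: Ed25519PrivateKey) -> dict:
        """Fetch a URL, hash the body, and sign a small attestation record."""
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        record = {
            "url": url,
            "sha256": hashlib.sha256(body).hexdigest(),
            "captured_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }
        payload = json.dumps(record, sort_keys=True).encode()
        record["signature"] = key.sign(payload).hex()
        return record

    key = Ed25519PrivateKey.generate()
    print(capture_and_sign("https://example.com/", key))

A self-signed hash only proves what the archiver claims; independent transparency logs or RFC 3161 timestamps are what would make it hold up against a hostile party.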
Any vendor who you work with should make it trivial to access these docs, even little baby startups usually make it quite accessible - although often under NDA or contract, but once that's over with you just download a zip and everything is there.
That's what I thought the first time I was involved in a SOC2 audit. But a lot of the "evidence" I sent was just screenshots. Granted, the stuff I did wasn't legal documents, it was things like the output of commands, pages from cloud consoles, etc.
What I would not do is take a screenshot of a vendor website and say "look, they have a SOC2". At every company, even tiny little startup land, vendors go through a vendor assessment that involves collecting the documents from them. Most vendors don't even publicly share docs like that on a site so there'd be nothing to screenshot / link to.
Having your cake and eating it too should never be valid law.
Or are you thinking of companies like Iron Mountain that provide such a service for paper? But even within corporations, not everything goes to a service like Iron Mountain, only paper that is legally required to be preserved.
A society that doesn't preserve its history is a society that loses its culture over time.
[1] https://www.mololamken.com/assets/htmldocuments/NLJ_5th%20Ci...
[2] https://www.nortonrosefulbright.com/en-au/knowledge/publicat...
The very first result was a 404
https://aws.amazon.com/compliance/reports/
The jokes write themselves.
Links alone can be tempting, as you have to reference the same docs or policies over and over for various controls.
Even if the content is taken down, changed or moved, a copy is likely to still be available in the Wayback Machine.
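If you want to check programmatically, a small sketch against the Wayback Machine's public availability endpoint (assuming it still behaves the way it's documented) is enough to find the closest snapshot for a URL:

    import json
    import urllib.parse
    import urllib.request

    def wayback_snapshot(url):
        """Return the closest Wayback Machine snapshot URL for `url`, if one exists."""
        api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(url, safe="")
        with urllib.request.urlopen(api) as resp:
            data = json.load(resp)
        closest = data.get("archived_snapshots", {}).get("closest")
        return closest["url"] if closest and closest.get("available") else None

    print(wayback_snapshot("https://aws.amazon.com/compliance/reports/"))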
Seriously? What kind of auditor would "fail" you over this? That doesn't sound right. That would typically be a finding and you would scramble to go appease your auditor through one process or another, or reach out to the vendor, etc, but "fail"? Definitely doesn't sound like a SOC2 audit, at least.
Also, this has never been particularly hard to solve for me (obviously biased experience, so I wonder if this is just a bubble thing). Just ask companies for actual docs; don't reference URLs. That's what I've typically seen: you get a copy of their SOC 2, pentest report, and controls, and you archive them yourself. Why would you point at a URL? I've actually never seen that, tbh, and if a company does that, it's not surprising that they're "failing" their compliance reviews. I mean, even if the web were more archivable, how would reliance on a URL be valid? You'd obviously still need to archive that content anyway.
Maybe if you use a tool that you don't have a contract with or something? I feel like I'm missing something, or this is something that happens in fields like medical that I have no insight into.
This doesn't seem like it would impact compliance at all tbh. Or if it does, it's impacting people who could have easily been impacted by a million other issues.
A link disappearing isn’t a major issue. Not something I’d worry about (but yea might show up as a finding on the SOC 2 report, although I wouldn’t be surprised if many auditors wouldn’t notice - it’s not like they’re checking every link)
I’m also confused why the OP is saying they’re linking to public documents on the public internet. Across the board, security orgs don’t like to randomly publish their internal docs publicly. Those typically stay in your intranet (or Google Drive, etc).
lol seriously, this is like... at least 50% of the time how it would play out, and I think the other 49% it would be "ah sorry, I'll grab that and email it over" and maybe 1% of the time it's a finding.
It just doesn't match anything. And if it were FEDRAMP, well holy shit, a URL was never acceptable anyways.
You're missing the existence of technology that allows anyone to create superficially plausible but ultimately made-up anecdotes for posting to public forums, all just to create cover for a few posts here and there mixing in advertising for a vaguely-related product or service. (Or even just to build karma for a voting ring.)
Currently, you can still sometimes sniff out such content based on the writing style, but in the future you'd have to be an expert on the exact thing they claim expertise in, and even then you could be left wondering whether they're just an expert in a slightly different area instead of making it all up.
EDIT: Also on the front page currently: "You can't trust the internet anymore" https://news.ycombinator.com/item?id=47017727
Every comment section here can be summed up as "LLM bad" these days.
It's not "LLM bad" — it's "LLM good, some people bad, bad people use LLM to get better at bad things."
Insurance pays as long as you aren't knowingly grossly negligent. You can even say "yes, these systems don't meet x standard and we are working on it" and be ok because you acknowledged that you were working on it.
Your boss and your boss's boss tell you "we have to do this so we don't get fucked by insurance if so-and-so happens," but they are either ignorant, lying, or just using that to get you to do something.
I've seen wildly out of date and unpatched systems get paid out because it was a "necessary tradeoff" between security and a hardship to the business to secure it.
I've actually never seen a claim denied and I've seen some pretty fuckin messy, outdated, unpatched legacy shit.
Bringing a system to compliance can reasonably take years. Insurance would be worthless without the "best effort" clause.
[1]: https://arstechnica.com/civis/threads/journalistic-standards...
I kind of doubt that the Internet Archive is really taking very much business away from them. It's a terrible UI for reading the daily news.
> We ask that you grant LWN exclusive rights to publish your work during the LWN subscription period - currently up to two weeks after publication.
News is valuable when it is timely, and subscribers pay for immediate access.
There used to be plenty of newspapers sponsored by wealthy industrialists; the latter would cover the former's gap between costs and sales, and the former would regularly push the latter's political agenda.
The "objective journalism" is really quite a late invention IIRC, about the times of WW2.
https://en.wikipedia.org/wiki/Journalistic_objectivity
"To give the news impartially, without fear or favor." — Adolph Ochs, 1858-1935
Objectivity is the default state of honest storytelling. If I ask what happened and somebody tells me only the parts that suit an agenda, they have not informed me. The partisan press exists because someone has a motive to deviate from the natural expectation of fair storytelling and recounting.
Already at the level of which stories are covered, you have made choices about what is important and what is not.
Your newspaper not covering your neighbor's lawsuit against the city over some issue because they find it "not important" is already a viewpoint-based choice.
A newspaper presenting both sides on an issue (already simplifying on the "there are two sides to an issue" thing) is one thing. Do you also have to present expert commentary that says that one side is actually just entirely in bad faith? Do you write a story and then conclude "actually this doesn't matter" when that is the case?
There are plenty of descriptions that some people would call fair storytelling and others would call a hit piece. For any article on any controversial topic written in good faith, you are likely to find some people who would claim it's not.
I think it's important to acknowledge that even good faith journalism is filled with subjectivity. That doesn't mean one gives up, you just have to take into account the position of the people presenting information and roll with that.
I also don't think they care even a bit. They're pushing agendas, and not hiding it; rather, flaunting it.
Every source has its biases; you should try to be aware of them and handle information accordingly.
Why not interpret it to mean something like "no news organization has biases that are fully aligned with my best interests"?
I think the real problem is that they often don't put events in context, which leads people to misunderstand them. They report the what, not the why, but most events don't just happen one day; they are shaped by years or even decades of historical context. If you just understand the literal event without the background context, I don't think you are really informed.
Information bias is unfortunately one of the sicknesses of our age, and it is one of the cultural ills that flows from tech outward. Information is only pertinent in its capacity to inform action, otherwise it is noise. To adapt a Beck-ism: You aren't gonna need it.
Instead of reporting just the facts, they include opinions, inflammatory language, etc.
Reuters writes in a relatively neutral tone, as an example. Fox News doesn't, and CNN doesn't, as examples of the opposite.
If you don't notice, I doubt you're reading the news. It's part of the offering. Fox does it on purpose, not accidentally.
Newspapers in my country were always like blogs, even before the internet existed. It's why they are still around and doing quite well: they don't just bring news.
This is the particular thing I care about. If I can count on their facts, I can mostly subtract their agenda.
See: https://app.adfontesmedia.com/chart/interactive
The problem comes in when I can't count on the "facts" being reported.
If anything, we should simply be asking archive.org to limit access to humans.
It’s passion and love of the community, despite the many struggles and drawbacks.
AI bots scrape our content and that drastically reduces the number of people who make it to our site.
That impacts our ability to bring on subscribers and especially advertisers - Google and Meta own local advertising and AI kills the relatively tiny audience we have.
I dread the day that it happens in real time: hear sirens? Ask the AI that already scraped us.
It's on the business to find a model that works within the environment of the free market and within the social framework.
If a business model only works by limiting competition, it's a bad model.
If it only works by limiting the rights of consumers, it's a bad model.
If it only works by blocking a legal activity (website crawling and scraping of publicly-facing data, for instance), it's a bad model.
And if their business can't operate otherwise, it's a bad business. No business has an intrinsic right to exist.
> No business has an intrinsic right to exist.
Do AI businesses have an intrinsic right to exist?
The incentives for online news are really wacky just to begin with. A coin at the convenience store for the whole dang paper used to be the simplest thing in the world.
I mean this as a side note rather than a counterargument (because people learn to take screenshots, and because what can you do about particularly bad faith news orgs?): Immediate archival can capture silent changes (and misleadingly announced changes). A headline might change to better fit the article body. An editor's note might admit a mistakenly attributed quote.
Or a news org might pull a Fox News [1][2] by rewriting both the headline and article body to cover up a mistake that unravels the original article's reason for existing: The original headline was "SNAP beneficiaries threaten to ransack stores over government shutdown". The headline was changed to "AI videos of SNAP beneficiaries complaining about cuts go viral". An editor's note was added [3][4]: "This article previously reported on some videos that appear to have been generated by AI without noting that. This has been corrected." I think Fox News deleted the article.
[1] https://xcancel.com/KFILE/status/1984673901872558291
They won't, of course, because they don't want accountability.
I can indeed find clear records of that in the archives. But what do I do with them? How do I use that evidence to hold news media to account? This is meaningless moral posturing.
I've contacted multiple journalists over the years about errors in their articles and I've generally found them responsive and thankful.
Sometimes it's not even their fault. One time a journalist told me the incorrect information was unknowingly added by an editor.
I get that it's popular on HN and the internet to bash news media, and that there are a lot of legitimate issues with the media, but my personal experience is that journalists do actually want to do a good job and respond accordingly when you engage them (in a non-antagonistic manner).
If a major article claims that certain groups don’t exist, while the same newspaper published a detailed report about those exact groups and how dangerous they are just two years earlier, it’s not because the journalist wasn’t able to do a 10-second Google search where their own paper’s article would have been among the top results.
Contact their rivals with the story, have them write a hit piece. "Other newspaper is telling porkies: here's the proof!" is an excellent story: not one I'd expect a journalist to have time to discover, but certainly one I'd expect them to be able to follow up on, once they've received a tip.
For the record, I'm talking about actual journalism groups, not Substack blogs. Here's one (largely US-centric) list, a ≈dozen links long: https://hcommons.social/@zeblarson/115488066909889058. You almost certainly have local journalists who need your support, which obviously I cannot list here.
In the past libraries used to preserve copies of various newspapers, including on microfiche, so it was not quite feasible to make history vanish. With print no longer out there, the modern historical record becomes spotty if websites cannot be archived.
Perhaps there needs to be a fair-use exception or even a (god forbid!) legal requirement to allow archivability? If a website is open to the public, shouldn't it be archivable?
I've sometimes dreamed of a web where every resource is tied to a hash, which can be rehosted by third parties, making archival transparent. This would also make it trivial to stand up a small website without worrying about it get hug-of-deathed, since others would rehost your content for you. Shame IPFS never went anywhere.
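A sketch of the content-addressing idea (names and the mirror interface are illustrative): the "link" is just the hash of the bytes, so any third party can rehost them and the client can verify the copy regardless of who served it:

    import hashlib

    def address_of(content: bytes) -> str:
        """A resource's 'name' is just the hash of its bytes."""
        return hashlib.sha256(content).hexdigest()

    def fetch_verified(expected_hash: str, mirrors: list) -> bytes:
        """Try any mirror; accept whatever bytes match the expected hash."""
        for fetch in mirrors:
            content = fetch(expected_hash)
            if content is not None and address_of(content) == expected_hash:
                return content          # origin is irrelevant once the hash checks out
        raise LookupError("no mirror had a valid copy")

    page = b"<html>my tiny site</html>"
    link = address_of(page)                         # what you'd publish instead of a URL
    mirror = lambda h: page if h == link else None  # stand-in for a third-party rehoster
    assert fetch_verified(link, [mirror]) == page

The hard parts IPFS wrestled with (discovery, incentives to keep rehosting, mutable content) are exactly what this sketch leaves out.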
This is from my experience having a personal website. AI companies keep coming back even if everything is the same.
This also goes back to something I said long ago: AI companies are relearning software engineering, poorly. I can think of so many ways to speed up AI crawlers; I'm surprised someone being paid 5x my salary cannot.
The problem is that AI companies have decided that they want instant access to all data on Earth the moment that it becomes available somewhere, and have the infrastructure behind them to actually try and make that happen. So they're ignoring signals like robots.txt or even checking whether the data is actually useful to them (they're not getting anything helpful out of recrawling the same search results pagination in every possible permutation, but that won't stop them from trying, and knocking everyone's web servers offline in the process) like even the most aggressive search engine crawlers did, and are just bombarding every single publicly reachable server with requests on the off chance that some new data fragment becomes available and they can ingest it first.
This is also, coincidentally, why Anubis is working so well. Anubis kind of sucks, and in a sane world where these companies had real engineers working on the problem, they could bypass it on every website in just a few hours by precomputing tokens.[2] But...they're not. Anubis is actually working quite well at protecting the sites it's deployed on despite its relative simplicity.
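For context, an Anubis-style challenge boils down to finding a nonce whose hash clears a difficulty threshold; a toy version (the difficulty and challenge string here are made up) shows why that work could, in principle, be precomputed outside the browser by anyone willing to spend the CPU:

    import hashlib
    from itertools import count

    def solve(challenge: str, difficulty_bits: int = 16) -> int:
        """Find a nonce so sha256(challenge + nonce) starts with `difficulty_bits` zero bits."""
        target = 1 << (256 - difficulty_bits)
        for nonce in count():
            digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
            if int.from_bytes(digest, "big") < target:
                return nonce

    print("solved with nonce", solve("example-challenge"))

The point of the scheme is only that the work happens at all, not that it's hard to automate; which is why it's surprising that it still deters these crawlers.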
It really does seem to indicate that LLM companies want to just throw endless hardware at literally any problem they encounter and brute force their way past it. They really aren't dedicating real engineering resources towards any of this stuff, because if they were, they'd be coming up with way better solutions. (Another classic example is Claude Code apparently using React to render a terminal interface. That's like using the space shuttle for a grocery run: utterly unnecessary, and completely solvable.) That's why DeepSeek was treated like an existential threat when it first dropped: they actually got some engineers working on these problems, and made serious headway with very little capital expenditure compared to the big firms. Of course they started freaking out, their whole business model is based on the idea that burning comical amounts of money on hardware is the only way we can actually make this stuff work!
The whole business model backing LLMs right now seems to be "if we burn insane amounts of money now, we can replace all labor everywhere with robots in like a decade", but if it turns out that either of those things aren't true (either the tech can be improved without burning hundreds of billions of dollars, or the tech ends up being unable to replace the vast majority of workers) all of this is going to fall apart.
Their approach to crawling is just a microcosm of the whole industry right now.
[1]: https://en.wikipedia.org/wiki/Common_Crawl
[2]: https://fxgn.dev/blog/anubis/ and related HN discussion https://news.ycombinator.com/item?id=45787775
There's a bit of discussion of Common Crawl in Jeff Jarvis's testimony before Congress: https://www.youtube.com/watch?v=tX26ijBQs2k
What I feel is a lot more likely is that OpenAI et al are running a pretty tight ship, whereas all the other "we will scrape the entire internet and then sell it to AI companies for a profit" businesses are not.
And it's a lot harder to get the law to stop doing something once it proves to cause significant collateral damage, or just cumulative incremental collateral damage while having negligible effectiveness.
Just because you have infinite money to spend on training doesn't mean you should saturate the internet with bots looking for content with no constraints - even if that is a rounding error of your cost.
We just put heavy constraints on our public sites blocking AI access. Not because we mind AI having access - but because we can't accept the abusive way they execute that access.
It’s very unfortunate and a short-sighted way to operate.
That's assuming they're deriving a benefit from misbehaving.
There is no benefit to immediately re-crawling 404s or following dynamic links into a rabbit hole of machine-generated junk data and empty search results pages in violation of robots.txt. They're wasting the site's bandwidth and their own in order to get trash they don't even want.
Meanwhile there is an obvious benefit to behaving: You don't, all by yourself, cause public sites to block everyone including you.
The problem here isn't malice, it's incompetence.
Receiving a response from someone's webserver is a privilege, not a right.
Why does any of them deserve any special treatment? Please don't try to normalize this reprehensible behavior. It's a greedy, exploitative and lawless behavior, no matter how much they downplay it or how long they've been doing it.
This is the problem with AI scraping. On one hand, they need a lot of content, on the other, no single piece of content is worth much by itself. If they were to pay every single website author, they'd spend far more on overhead than they would on the actual payments.
Radio faces a similar problem (it would be impossible to hunt down every artist and negotiate licensing deals for every single song you're trying to play). This is why you have collective rights management organizations, which are even permitted by law to manage your rights without your consent in some countries.
It helps not to have images, etc. that would drive up bandwidth costs. Serving HTML is just pennies a month with BunnyCDN. If I had heavier content, I might have to block them or restrict it to specific pages once per day. Maybe just block the heavy content, like the images.
Btw, has anyone tried just blocking things like images to see if scraping bandwidth dropped to acceptable levels?
Maybe they vibecoded the crawlers. I wish I were joking.
Why, though? Especially if the pages are new; aren't they concerned about ingesting AI-generated content?
I think it "failed" because people expected it to be a replacement transport layer for the existing web, minus all of the problems the existing web had, and what they got was a radically different kind of web that would have to be built more or less from scratch.
I always figured it was a matter of the existing web getting bad enough, and then we'd see adoption improve. Maybe that time is near.
But you are right on the reason it "failed". People expected web++, with a "killer app", whatever that means. Imagination is dead.
Compared to the total number of users on the Internet, relatively few have stable, always-on machines ready to host P2P content. ISPs do not make it easy, or at times even possible, to poke holes in firewalls to allow for easy hosting on residential connections. This necessitates hole punching, which adds non-trivial delays to connections and overall poorer network performance.
It's less that imagination is dead and more that the limitations of the modern Internet retard the momentum of anything P2P.
Is there any reason this has to be true? Probably some majority or significant minority of mobile devices spend some eight hours a day attached to a charger in a place where they have the WiFi password, while the user is asleep. And you don't need 100% of devices to be hosts or routers, 10% at any given time would be more than sufficient.
You've just described Nostr: Content that is tied to a hash (so its origin and authenticity can be verified) that is hosted by third parties (or yourself if you want)
Also, I always wonder about Common Crawl:
Is there something wrong with it? Is it badly designed? What is it that all the trainers cannot find there, such that they need to crawl our sites over and over again for the exact same stuff, each on their own?
The folks who crawl more appear to mostly be folks who are doing grounding or RAG, and also AI companies who think that they can build a better foundational model by going big. We recommend that all of these folks respect robots.txt and rate limits.
> The folks who crawl more appear to mostly be folks who are doing grounding or RAG, and also AI companies who think that they can build a better foundational model by going big.
But how can they aspire to do any of that if they cannot build a basic bot?
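For contrast, the "basic bot" being asked for here is not much code; a minimal polite fetcher (a sketch using only the standard library, with a made-up user agent) respects robots.txt, honors the crawl delay, and sends conditional requests so unchanged pages are not re-downloaded:

    import time
    import urllib.error
    import urllib.request
    import urllib.robotparser

    USER_AGENT = "politebot/0.1"   # hypothetical bot name

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()
    delay = rp.crawl_delay(USER_AGENT) or 5    # fall back to a conservative 5 s gap

    etags = {}   # remembered validators, so unchanged pages are not re-downloaded

    def polite_fetch(url):
        if not rp.can_fetch(USER_AGENT, url):
            return None                                  # robots.txt says no
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        if url in etags:
            req.add_header("If-None-Match", etags[url])  # conditional request
        time.sleep(delay)                                # rate limit between fetches
        try:
            with urllib.request.urlopen(req) as resp:
                if resp.headers.get("ETag"):
                    etags[url] = resp.headers["ETag"]
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code == 304:
                return None                              # unchanged since the last visit
            raise

Even this much would solve the complaint below about infrequently updated sites being hammered for the exact same content multiple times a day.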
My case, which I know is the same for many people:
My content is updated infrequently. Common Crawl must have all of it. I do not block Common Crawl, and I see it (the genuine one from the published ranges; not the fakes) visiting frequently. Yet the LLM bots hit the same URLs all the time, multiple times a day.
I plan to start blocking more of them, even the User and Search variants. The situation is becoming absurd.
I wrote a short paper on that 25 years ago, but it went nowhere. I still think it is a great idea!
Definitely; this is going to hurt those over at /r/datahoarder.
News websites aren’t like those labyrinthine cgit-hosted websites that get crushed under scrapers. If 1,000 different AI scrapers hit a news website every hour, it wouldn’t even make a blip in the traffic logs.
Also, AI companies are already scraping these websites directly in their own architecture. It’s how they try to stay relevant and fresh.
Kind of sucks, because the news is an important part of that kind of archive.