It stores webpages in multiple formats (HTML snapshot, screenshot, PDF snapshot, and a fully dedicated reader view) so you’re not relying on a single fragile archive method.
There’s both a hosted cloud plan [1] which directly supports the project, and a fully self-hosted option [2], depending on how much control you need over storage and retention.
One question: what's your stance on adding a way to mark articles as read or "archive" them, like other apps that position themselves more as read-it-later tools? You can technically do something similar with tags, but the UX is a bit clunky.
I am currently evaluating Linkwarden, Wallabag, Hoarder, and Linkding, and each of the services has pros and cons, making it hard for me to choose one. Linkwarden is AWESOME in the way it stores content in multiple formats, but the read-later workflows could be improved.
Without checking again: does Linkwarden sync the reading position across devices and automatically scroll to that position on the next device? Does it estimate how "long" an article takes to read (based solely on its length)? Does Linkding support marking up text and persisting it (highlight some text in yellow and see those highlights somewhere, or even add comments or favorite specific passages)?
No need to answer any of these questions; I can research them myself. I'm just putting them out there as what I'd want from a read-later solution: add a link on my mobile device, let Linkwarden do its magic in the backend, and check out the content later on desktop or even on mobile.
FWIW, at least on iOS, it's possible to inject JavaScript into the website currently being displayed by Safari as a side effect of sharing a web link to an app via the share sheet.
Several "read it later" style apps use this successfully to get around paywalls (assuming you've paid yourself) and other robot blockers. Any plans for Linkwarden to do this (or does it already)?
Does it just POST the URL for them to fetch? Or is there any integration/trust mechanism to store what you already fetched on the client directly in their archives?
It's far from perfect, but it does achieve its stated goal of resurfacing real people on the internet.
It recently got some NLNet funding and I hope to see it flourish - to my knowledge there aren't any other projects trying to claw back control of the internet towards the commons.
We are increasingly becoming blind. To me it looks as if this is done on purpose actually.
That's a travesty, considering that a huge chunk of science is public-funded; the public is being denied the benefits of what they're paying for, essentially.
Indefinitely? Probably not.
What about when a regime wants to make the science disappear?
Because it costs money to serve them the content.
Is the answer to regulate AI? Yes.
Because when you build it you aren't, presumably, polling their servers every fifteen minutes for the entire corpus. AI scrapers are currently incredibly impolite.
So we've basically decided we only want bad actors to be able to scrape, archive, and index.
AI training will be hard to police. But a lot of these sites inject ads in exchange for paywall circumvention. Just scanning Reddit for the newest archive.is or whatever should cut off most of the traffic.
""" In November 2025, an investigation by technology journalist Alex Reisner for The Atlantic revealed that Common Crawl lied when it claimed it respected paywalls in its scraping and requests from publishers to have their content removed from its databases. It included misleading results in the public search function on its website that showed no entries for websites that had requested their archives be removed, when in fact those sites were still included in its scrapes used by AI companies. """
My site is CC-BY-NC-SA, i.e. non-commercial and with attribution, and Common Crawl took a dubious position on whether fair use makes that irrelevant. They can burn.
Also, if your site has CC-BY-NC-SA markings, we have preserved them.
The purpose of a search engine is to display links to web pages, not the entire content. As such, it can be argued it falls under fair use. It provides value to the people searching for content and those providing it.
However, we left such a crucially important public utility in the hands of private companies, which changed their algorithms many times in order to maximize their profits rather than the public good.
I think there needs to be real competition, and I am increasingly becoming certain that the government should be part of that competition. Both "private" companies and "public" government are biased, but they are biased in different ways, and I think there is real value to be created in that clash. It makes it easier for individuals to pick and choose the best option for themselves, and for third, independent options to be developed.
The current cycle of knowledge generation is academia doing foundational research -> private companies expanding this research and monetizing it -> nothing. If the last step were expanded to the government providing a barebones but usable service to commoditize it, years after private companies have been able to reap immense profits, then the capabilities of the entire society would be increased. If the last step is prevented, then the ruling companies turn to rent-seeking and sitting on their laurels, turning from innovating to extracting.
No one "left" a crucially important public utility in the hands of private companies. Private companies developed the search engine themselves in the late 90s in the course of doing for-profit business; and because some of them ended up being successful (most notably Google), most people using the internet today take the availability of search engines for granted.
"Google’s true origin partly lies in CIA and NSA research grants for mass surveillance" (January 28, 2025)
The intelligence community hoped that the nation’s leading computer scientists could take non-classified information and user data, combine it with what would become known as the internet, and begin to create for-profit, commercial enterprises to suit the needs of both the intelligence community and the public. They hoped to direct the supercomputing revolution from the start in order to make sense of what millions of human beings did inside this digital information network. That collaboration has made a comprehensive public-private mass surveillance state possible today.
The Massive Digital Data Systems (MDDS) ... program's stated aim was to provide more than a dozen grants of several million dollars each to advance this research concept. The grants were to be directed largely through the NSF so that the most promising, successful efforts could be captured as intellectual property and form the basis of companies attracting investments from Silicon Valley. This type of public-to-private innovation system helped launch powerful science and technology companies like Qualcomm, Symantec, Netscape, and others.
<https://qz.com/1145669/googles-true-origin-partly-lies-in-ci...>
The Internet itself (particularly its precursor, ARPANET), was also government funded, as was development of the World Wide Web (CERN). Oracle, the database company, grew out of the CIA's Project Oracle.
CIA Reading Room Project Oracle
<https://www.cia.gov/readingroom/document/cia-rdp80-01794r000...>
"Oracle's coziness with government goes back to its founding / Firm's growth sustained as niche established with federal, state agencies" (2002)
<https://www.sfgate.com/bayarea/article/oracle-s-coziness-wit...>
Surveillance has been baked in since their founding.
While unlikely, the ideal would be for the government to provide a foundational open search infrastructure that would allow people to build on it and expand it to fit their needs in a way that is hard to do when a private company eschews competition and hides its techniques.
Perhaps it would be better for there to be a sanctioned crawler funded by the government that then sells the unfiltered information to third parties like Google. This would ensure IP rights are protected while keeping access to information open.
They can charge money for access or disallow all scrapers, but it should not be allowed to selectively allow only Google.
It would also be in the spirit of the fair use doctrine's first and fourth considerations:
> 1. the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;
> 2. the nature of the copyrighted work;
> 3. the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
> 4. the effect of the use upon the potential market for or value of the copyrighted work.
If that doesn't happen, increasing amounts of information and human creativity will be siloed and never made publicly accessible in a way that it can be consumed and reproduced as slop.
Users control which sites they allow it to record, so there are no privacy worries, especially assuming the plugin is open source.
No automated crawling. The plugin does not drive the user's browser to fetch things; of whatever a user happens to actually view on their own, some percentage of those views from the activated domains gets submitted to some archive.
Not every view. Maybe 100 people each submit 1% of views, and maybe it's a random selection, or maybe it's weighted by some feedback mechanism where the archive destination can say "Hey, if the user views this particular URL, I still don't have that one yet, so definitely send it if you see it rather than just applying the normal random chance."
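A minimal sketch of what that client-side decision could look like, assuming a hypothetical plugin where the archive publishes a "wanted URLs" list it still needs (every name and number here is illustrative, not a real API):

    import hashlib
    import random

    SAMPLE_RATE = 0.01          # each user submits ~1% of views by default
    ACTIVATED_DOMAINS = {"example-blog.org", "example-news.net"}  # user opt-in list
    wanted_urls = set()         # hypothetical "please send this" list fetched from the archive

    def should_submit(url: str, domain: str) -> bool:
        """Decide whether this page view gets submitted to the archive."""
        if domain not in ACTIVATED_DOMAINS:
            return False                      # user never opted this site in
        if url in wanted_urls:
            return True                       # archive explicitly asked for this one
        return random.random() < SAMPLE_RATE  # otherwise, plain random sampling

    def submit(url: str, html: str) -> None:
        # placeholder: a real plugin would POST the captured page to the archive
        digest = hashlib.sha256(html.encode()).hexdigest()
        print(f"would submit {url} (sha256 {digest[:12]}...)")

The weighting could just as easily live server-side; the only client-side requirement is that nothing gets submitted for domains the user hasn't activated.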
Not sure how to protect the archive itself or its operators.
> no privacy worries
This is harder than you might expect. Publishing these files is always risky because sites can serve you fingerprinting data, like some hidden HTML tag containing your IP and other identifiers.
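As a rough illustration of why this is hard: even a naive scrub of a captured page before publishing (a sketch only; the identifier list and the hidden-element heuristic are stand-ins, not a real defense) only catches the obvious cases:

    import re

    def naive_scrub(html: str, known_identifiers: list[str]) -> str:
        """Redact identifiers we know about (our own IP, session IDs, etc.)."""
        for ident in known_identifiers:
            html = html.replace(ident, "[REDACTED]")
        # strip obviously hidden elements, a common place to stash per-user markers
        html = re.sub(r'<[^>]+style="[^"]*display:\s*none[^"]*"[^>]*>.*?</[^>]+>',
                      "", html, flags=re.DOTALL | re.IGNORECASE)
        return html

Anything encoded in image pixels, CSS ordering, whitespace patterns, or per-user markup variations sails straight through a filter like this.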
As does Tranquility Reader, if you're interested only in the primary content of the page ... and, usually, in a much smaller footprint ... with a PDF option.
The problem with the LLMs is they capture the value chain and give back nothing. It didn’t have to be this way. It still doesn’t.
Now that AI companies are using residential proxies to get around the obvious countermeasures, I have resorted to blocking all countries that are not my target audience.
It really sucks. The internet is terminally ill.
It's unfortunate that this undermines the usefulness of the Internet Archive, but I don't see an alternative. IMHO, we'll soon see these AI scrapers cease to advertise themselves, leading to sites like the NY Times trying to blacklist IP ranges as this battle continues. Fun times ahead!
But then it was not really open content anyway.
> When asked about The Guardian’s decision, Internet Archive founder Brewster Kahle said that “if publishers limit libraries, like the Internet Archive, then the public will have less access to the historical record.”
Well, we need something like Wikipedia for news content. Perhaps not 100% Wikipedia; instead, Wikipedia to store the hard facts, with tons of verification, plus a news editorial arm that focuses on free content but in a newspaper style, e.g. with professional (or at least good) writers. I don't know how the model could work, but IF we could come up with this, then newspapers that paywall information would become less relevant automatically. That way we win long-term, as the paywalled content isn't really part of the open web anyway.
Journalism as an institution is under attack because the traditional source of funding - reader subscriptions to papers - no longer works.
To replicate the Wikipedia model would need to replicate the structure of Journalism for it to be reliable. Where would the funding for that come from? It's a tough situation.
The Wikipedia folks had their own Wikinews project which is essentially on hold today because maintenance in a wiki format is just too hard for that kind of uber-ephemeral content. Instead, major news with true long-term relevance just get Wikipedia articles, and the ephemera are ignored.
Practically no quality journalism is.
> we need something like wikipedia for news
Wikipedia editors aren’t flying into war zones.
Which is a valuable perspective. But it's not a substitute for a seasoned war journalist who can draw on global experience. (And relating that perspective to a particular home market.)
> I'm sure some of them would fly in to collect data if you paid them for it
Sure. That isn't "a news editorial that focuses on free content but in a newspaper-style, e. g. with professional (or good) writers."
One part of the population imagines journalists as writers. They're fine on free, ad-supported content. The other part understands that investigation is not only resource intensive, but also requires rare talent and courage. That part generally pays for its news.
Between the two, a Wikipedia-style journalistic resource is not entertaining enough for the former and not informative enough for the latter. (Importantly, compiling an encyclopedia is principally the work of research and writing. You can be a fine Wikipedia–or scientific journal or newspaper–editor without leaving your room.)
- crowdsourced data, eg, photos of airplane crashes
- people who live in an area start vlogs
- independent correspondents travel there to interview, eg Ukraine or Israel
We see that our best war reporting comes from analyst groups who ingest that data from the “firehose” of social media. Sometimes at a few levels, eg, in Ukraine the best coverage is people who compare the work of multiple groups mapping social media reports of combat. You have on top of that punditry about what various movements mean for the war.
So we don’t have “journalist”:
- we have raw data (eg, photos)
- we have first hand accounts, self-reported
- we have interviewers (of a few kinds)
- we have analysts who compile the above into meaningful intelligence
- we have anchors and pundits who report on the above to tell us narratives
The fundamental change is that what used to be several roles within a news agency are now independent contractors online. But that was always the case in secret — eg, many interviewers were contracted talent. We’re just seeing the pieces explicitly and without centralized editorial control.
So I tend not to catastrophize as much, because this to me is what the internet always does:
- route information flows around censorship
- disintermediate consumers from producers when the middle layer provides a net negative
As always in business, evolve or die. And traditional media has the same problem you outline:
- not entertaining enough for the celebrity gossip crowd
- too slow and compromised by institutional biases for the analyst crowd, eg, compare WillyOAM coverage of Ukraine to NYT coverage
Interesting idea. It could be something that archives first and releases at a later date, when the news is no longer as fresh.
Isn't that what state funded news outlets are?
Sell a "truck full of DAT tapes" type service to AI scrapers with snapshots of the IA. Sort of like the cloud providers have with "Data Boxes".
It will fund IA, be cheaper than building and maintaining so many scrapers, and may relieve the pressure on these news sites.
I've been building tools that integrate with accounting platforms and the number of times a platform's API docs or published rate limits have simply disappeared between when I built something and when a user reports it broken is genuinely frustrating. You can't file a support ticket saying "your docs said X" when the docs no longer say anything because they've been restructured.
For compliance specifically - HMRC guidance in the UK changes constantly, and the old versions are often just gone. If you made a business decision based on published guidance that later changes, good luck proving what the guidance actually said at the time. The Wayback Machine has saved me more than once trying to verify what a platform's published API behaviour was supposed to be versus what it actually does.
The SOC 2 / audit trail point upthread is spot on. I'd add that for smaller businesses, it's not just formal compliance frameworks - it's basic record keeping. When your payment processor's fee schedule was a webpage instead of a PDF and that webpage no longer exists, you can't reconcile why your fees changed.
> just like how the agricultural sector is hell-bent on scapegoating AI (and lawns, and golf courses, and long showers, and free water at restaurants) for excess water consumption when even the worst-offending datacenters consume infinitesimally-tiny fractions of the water farms in their areas consume.
When I learned about how much water agriculture and industry uses in the state of California where I live, I basically entirely stopped caring about household water conservation in my daily life (I might not go this far if I had a yard or garden that I watered, but I don't where I currently live). If water is so scarce in an urban area that an individual human taking a long shower or running the dishwasher a lot is at all meaningful, then either the municipal water supply has been badly mismanaged, or that area is too dry to support human settlement; and in either case it would be wise to live somewhere else.
What worries me isn’t scraping itself, but the second-order effects. If large parts of the web become intentionally unarchivable, we’re slowly losing a shared memory layer. Short-term protection makes sense, but long-term it feels like knowledge erosion.
Genuinely curious how people here think about preserving public knowledge without turning everything into open season for mass scraping.
I'm thinking in particular about the rise of platforms like Discord where being opaque to search/archiving is seen as a feature. Being gatekept and ephemeral makes people more comfortable sharing things that might get a takedown notice on other platforms, and it's hard for people who don't like you in the future to try to find jokes/quotes they don't like to damage your future reputation.
Clearly very different than news articles going offline, but I do think there's been a vibe shift around the internet. People feel overly surveilled in daily life, and take respite in places that make surveillance harder.
Tragedy of the commons.
A subscriber opens the FT, reads an article about semiconductor export controls, pastes it into Claude to ask "what does this mean for my portfolio?" - the FT's content just entered a model's reasoning process, got synthesized with other knowledge, and produced derivative value. No scraper was involved. The paywall was respected. The subscriber paid. And yet the publisher's content was "consumed" by an AI in exactly the way they're trying to prevent.
News publishers limit Internet Archive access due to AI scraping concerns
They will announce official paid AI access plans soon. Bookmark my words.
You just have to rely on screenshots that may or may not have been fabricated, and maybe nobody's even captured a screenshot. If it's a public figure you normally trust, versus some random people's screenshots, of course you're gonna dismiss the screenshots as fake. It feels almost intentional to bring the platform into the dark ages.
I've said it before, and I'll say it again: the main issue is not design patterns but the lack of acceptable payment systems. The EU, with its dismantling of Visa and Mastercard, now has the perfect opportunity to solve this, but I doubt it will. It'll probably just create a European WeChat.
Either way I'm fairly certain that blocking AI agent access isn't a viable long term solution.
Great point. If my personal AI assistant cannot find your product/website/content, it effectively may no longer exist! For me. Ain't nobody got the time to go searching that stuff up and sifting through the AI slop. The pendulum may even swing the other way and the publishers may need to start paying me (or whoever my gatekeeper is) for access to my space...
The bigger question is business model vs value-add. Copyright law draws a very direct line from value-add to compensation - if you created something new (or even derivative), copyright attaches to allow for compensation, if people find it valuable.
Business models are a different animal: they can range from value-add services and products to rent-seeking to monopolies, extracting value from both producers and consumers.
While copyright law makes no mention of business models, I don't know whether that is a historical artifact since copyright is presumably older, or a philosophical exclusion because society owes no business model a right to exist. I would suggest the existence of monopoly-busting government agencies argues that societies do not owe business models a right of existence. Fair compensation for the advancement of arts and sciences is clearly a public good, though.
Tying it back to the AI-in-the-middle question, it's yet another platform in a series of these between producers and consumers, and doesn't override copyright. Regurgitating a copyright (article, art, whatever) should absolutely attract compensation; should summarizing content attract compensation? should it be considered any different from a friend (or executive assistant) describing the content? And if the producers' business model involves extracting value from a transaction on any basis other than adding value to the consumer, does society owe that business model any right to exist?
I believe many publications used to do this. The novel threat is AI training. It doesn't make sense to make your back catalog de facto public for free like that. There used to be an element of goodwill in permitting your content to be archived. But if the main uses are circumventing compensation and circumventing licensing requirements, that goodwill isn't worth much.
I am sad about link rot and old content disappearing, but it's better than everything being saved for all time, to be used against folks in the future.
I don't understand this line of thinking. I see it a lot on HN these days, and every time I do I think to myself, "Don't you realize that if things kept on being erased we'd learn nothing from anything, ever?"
I've started archiving every site I have bookmarked in case of such an eventuality when they go down. The majority of websites don't have anything to be used against the "folks" who made them. (I don't think there's anything particularly scandalous about caring for doves or building model planes)
The truly important stuff exists in many forms, not just online/digital. Or will be archived with increased effort, because it's worth it.
If you don’t want your bad behavior preserved for the historical record, perhaps a better answer is to not engage in bad behavior instead of relying on some sort of historical eraser.
Maybe the Internet Archive would be OK with keeping some things private until x time passes, or they could require an account to access them.
Their big requirement is that you not do any DNS filtering or blocking of access to what it wants, so I've got the pod's DNS pointed at the unfiltered Quad9 endpoint and rules in my router to allow the machine it's running on to bypass my Pi-hole enforcement and outside-DNS blocks.
^1 https://wiki.archiveteam.org/
^2 https://wiki.archiveteam.org/index.php/ArchiveTeam_Warrior
Sometimes it feels like AI-use concerns are a guise to diminish the public record, while on the other hand services like Ring or Flock are archiving the public forever.
So if you own monkey.com, then after proving site ownership you must get access to all the access data related to your site. Problem solved.
They do not care, and we will all be worse off for it if these AI companies keep bombarding news publishers' RSS feeds.
It is a shame that the open web as we know it is closing down because of these AI companies.
I've seen companies fail compliance reviews because a third-party vendor's published security policy that they referenced in their own controls no longer exists at the URL they cited. The web being unarchivable isn't just a cultural loss. It's becoming a real operational problem for anyone who has to prove to an auditor that something was true at a specific point in time.
I hope I’m wrong, but my bot paranoia is at all time highs and I see these patterns all throughout HN these days.
@dang do you have any thoughts about how you’re performing AI moderation on HN? I’m very worried about the platform being flooded with these Submarine comments (as PG might call them).
They're getting very clever and tricky though; a lot of them have the owners watching and step in to pretend that they're not bots and will respond to you. They did this last week and tricked dang.
They then made a top-level submission revealing the "experiment": https://news.ycombinator.com/item?id=46901199
Sidebar:
Having been part of multiple SOC audits at large financial firms, I can say that nothing brings adults closer to physical altercations in a corporate setting than trying to define which jobs are "critical".
- The job that calculates the profit and loss for the firm, definitely critical
- The job that cleans up the logs for the job above, is that critical?
- The job that monitors the cleaning up of the logs, is that critical too?
These are simple examples but it gets complex very quickly and engineering, compliance and legal don't always agree.
Sadly, it does not even have to be an acquisition or rebrand. For most companies, a simple "website redo", even if the brand remains unchanged, will change up all the URLs such that any prior recorded ones return "not found". Granted, if the identical attestation is simply at a new URL, someone could potentially find that new URL and update the "policy" -- but that's also an extra effort that the insurance company can avoid by requiring screenshots or PDF exports.
Yes, we have hundreds of identical Microsoft and AWS policies, but it's the only way. Checksum the full zip and sign it as part of the contract; that's literally how we do it.
That's actually a potentially good business idea - a legally certifiable archiving software that captures the content at a URL and signs it digitally at the moment of capture. Such a service may become a business requirement as Internet archivability continues to decline.
I don't know exactly how it achieves being "legally certifiable", at least to the point that courts trust it. Signing and timestamping with independent transparency logs would be reasonable.
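A minimal sketch of the capture-and-sign idea, assuming the third-party `cryptography` package and glossing over the hard part (getting a court or an independent timestamping authority to trust the signer):

    import datetime
    import hashlib
    import json
    import urllib.request

    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    def capture_and_sign(url: str, key: Ed25519PrivateKey) -> dict:
        """Fetch a URL, hash the body, and sign a small attestation record."""
        with urllib.request.urlopen(url) as resp:
            body = resp.read()
        record = {
            "url": url,
            "sha256": hashlib.sha256(body).hexdigest(),
            "captured_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        }
        payload = json.dumps(record, sort_keys=True).encode()
        record["signature"] = key.sign(payload).hex()
        return record

    key = Ed25519PrivateKey.generate()
    print(capture_and_sign("https://example.com/", key))

A self-signed hash only proves what the archiver claims; independent transparency logs or RFC 3161 timestamps are what would make it hold up against a hostile party.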
Any vendor who you work with should make it trivial to access these docs, even little baby startups usually make it quite accessible - although often under NDA or contract, but once that's over with you just download a zip and everything is there.
That's what I thought the first time I was involved in a SOC2 audit. But a lot of the "evidence" I sent was just screenshots. Granted, the stuff I did wasn't legal documents, it was things like the output of commands, pages from cloud consoles, etc.
What I would not do is take a screenshot of a vendor website and say "look, they have a SOC2". At every company, even tiny little startup land, vendors go through a vendor assessment that involves collecting the documents from them. Most vendors don't even publicly share docs like that on a site so there'd be nothing to screenshot / link to.
Having your cake and eating it too should never be valid law.
Or are you thinking of companies like Iron Mountain that provide such a service for paper? But even within corporations, not everything goes to a service like Iron Mountain, only paper that is legally required to be preserved.
A society that doesn't preserve its history is a society that loses its culture over time.
[1] https://www.mololamken.com/assets/htmldocuments/NLJ_5th%20Ci...
[2] https://www.nortonrosefulbright.com/en-au/knowledge/publicat...
The very first result was a 404
https://aws.amazon.com/compliance/reports/
The jokes write themselves.
Links alone can be tempting, as you have to reference the same docs or policies over and over for various controls.
Even if the content is taken down, changed or moved, a copy is likely to still be available in the Wayback Machine.
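If you want to check programmatically, a small sketch against the Wayback Machine's public availability endpoint (assuming it still behaves the way it's documented) is enough to find the closest snapshot for a URL:

    import json
    import urllib.parse
    import urllib.request

    def wayback_snapshot(url):
        """Return the closest Wayback Machine snapshot URL for `url`, if one exists."""
        api = "https://archive.org/wayback/available?url=" + urllib.parse.quote(url, safe="")
        with urllib.request.urlopen(api) as resp:
            data = json.load(resp)
        closest = data.get("archived_snapshots", {}).get("closest")
        return closest["url"] if closest and closest.get("available") else None

    print(wayback_snapshot("https://aws.amazon.com/compliance/reports/"))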
Seriously? What kind of auditor would "fail" you over this? That doesn't sound right. That would typically be a finding and you would scramble to go appease your auditor through one process or another, or reach out to the vendor, etc, but "fail"? Definitely doesn't sound like a SOC2 audit, at least.
Also, this has never been particularly hard to solve for me (obviously biased experience, so I wonder if this is just a bubble thing). Just ask companies for actual docs; don't reference URLs. That's what I've typically seen: you get a copy of their SOC 2, pentest report, and controls, and you archive them yourself. Why would you point at a URL? I've actually never seen that, tbh, and if a company does that, it's not surprising that they're "failing" their compliance reviews. I mean, even if the web were more archivable, how would reliance on a URL be valid? You'd obviously still need to archive that content anyway.
Maybe if you use a tool that you don't have a contract with or something? I feel like I'm missing something, or this is something that happens in fields like medical that I have no insight into.
This doesn't seem like it would impact compliance at all tbh. Or if it does, it's impacting people who could have easily been impacted by a million other issues.
A link disappearing isn’t a major issue. Not something I’d worry about (but yea might show up as a finding on the SOC 2 report, although I wouldn’t be surprised if many auditors wouldn’t notice - it’s not like they’re checking every link)
I’m also confused why the OP is saying they’re linking to public documents on the public internet. Across the board, security orgs don’t like to randomly publish their internal docs publicly. Those typically stay in your intranet (or Google Drive, etc).
lol seriously, this is like... at least 50% of the time how it would play out, and I think the other 49% it would be "ah sorry, I'll grab that and email it over" and maybe 1% of the time it's a finding.
It just doesn't match anything. And if it were FEDRAMP, well holy shit, a URL was never acceptable anyways.
You're missing the existence of technology that allows anyone to create superficially plausible but ultimately made-up anecdotes for posting to public forums, all just to create cover for a few posts here and there mixing in advertising for a vaguely-related product or service. (Or even just to build karma for a voting ring.)
Currently, you can still sometimes sniff out such content based on the writing style, but in the future you'd have to be an expert on the exact thing they claim expertise in, and even then you could be left wondering whether they're just an expert in a slightly different area instead of making it all up.
EDIT: Also on the front page currently: "You can't trust the internet anymore" https://news.ycombinator.com/item?id=47017727
Every comment section here can be summed up as "LLM bad" these days.
It's not "LLM bad" — it's "LLM good, some people bad, bad people use LLM to get better at bad things."
Insurance pays as long as you aren't knowingly grossly negligent. You can even say "yes, these systems don't meet x standard and we are working on it" and be ok because you acknowledged that you were working on it.
Your boss and your boss's boss tell you "we have to do this so we don't get fucked by insurance if so-and-so happens," but they are either ignorant, lying, or just using that to get you to do something.
I've seen wildly out of date and unpatched systems get paid out because it was a "necessary tradeoff" between security and a hardship to the business to secure it.
I've actually never seen a claim denied and I've seen some pretty fuckin messy, outdated, unpatched legacy shit.
Bringing a system to compliance can reasonably take years. Insurance would be worthless without the "best effort" clause.
[1]: https://arstechnica.com/civis/threads/journalistic-standards...
I kind of doubt that the Internet Archive is really taking very much business away from them. It's a terrible UI for reading the daily news.
> We ask that you grant LWN exclusive rights to publish your work during the LWN subscription period - currently up to two weeks after publication.
News is valuable when it is timely, and subscribers pay for immediate access.
There used to be plenty of newspapers sponsored by wealthy industrialists; the latter would cover the former's gap between costs and sales, and the former would regularly push the latter's political agenda.
The "objective journalism" is really quite a late invention IIRC, about the times of WW2.
https://en.wikipedia.org/wiki/Journalistic_objectivity
"To give the news impartially, without fear or favor." — Adolph Ochs, 1858-1935
Objectivity is the default state of honest storytelling. If I ask what happened and somebody tells me only the parts that suit an agenda, they have not informed me. The partisan press exists because someone has a motive to deviate from the natural expectation of fair storytelling and recounting.
Already at the level of which stories are covered, you have made choices about what is important and what is not.
Your newspaper not covering your neighbor's lawsuit against the city over some issue because they find it "not important" is already a viewpoint-based choice.
A newspaper presenting both sides on an issue (already simplifying on the "there are two sides to an issue" thing) is one thing. Do you also have to present expert commentary that says that one side is actually just entirely in bad faith? Do you write a story and then conclude "actually this doesn't matter" when that is the case?
There are plenty of descriptions that some people would call fair storytelling and others would call a hit piece. For any article on any controversial topic written in good faith, you are likely to find some people who would claim it's not.
I think it's important to acknowledge that even good faith journalism is filled with subjectivity. That doesn't mean one gives up, you just have to take into account the position of the people presenting information and roll with that.
I also don't think they care even a bit. They're pushing agendas, and not hiding it; rather, flaunting it.
Every source has its biases; you should try to be aware of them and handle information accordingly.
Why not interpret it to mean something like "no news organization has biases that are fully aligned with my best interests"?
I think the real problem is that they often don't put events in context, which leads people to misunderstand them. They report the what, not the why, but most events don't just happen one day; they are shaped by years or even decades of historical context. If you just understand the literal event without the background context, I don't think you are really informed.
Information bias is unfortunately one of the sicknesses of our age, and it is one of the cultural ills that flows from tech outward. Information is only pertinent in its capacity to inform action, otherwise it is noise. To adapt a Beck-ism: You aren't gonna need it.
Instead of reporting just the facts, they include opinions, inflammatory language, etc.
Reuters writes in a relatively neutral tone, as an example. Fox News doesn't, and CNN doesn't, as examples of the opposite.
If you don't notice, I doubt you're reading the news. It's part of the offering. Fox does it on purpose, not accidentally.
Newspapers in my country were always like blogs, even before the internet existed. It's why they are still around and doing quite well: they don't just bring news.
This is the particular thing I care about. If I can count on their facts, I can mostly subtract their agenda.
See: https://app.adfontesmedia.com/chart/interactive
The problem comes in when I can't count on the "facts" being reported.
If anything, we should simply be asking archive.org to limit access to humans.
It’s passion and love of the community, despite the many struggles and drawbacks.
AI bots scrape our content and that drastically reduces the number of people who make it to our site.
That impacts our ability to bring on subscribers and especially advertisers - Google and Meta own local advertising and AI kills the relatively tiny audience we have.
I dread the day that it happens in real time: hear sirens? Ask the AI that already scraped us.
It's on the business to find a model that works within the environment of the free market and within the social framework.
If a business model only works by limiting competition, it's a bad model.
If it only works by limiting the rights of consumers, it's a bad model.
If it only works by blocking a legal activity (website crawling and scraping of publicly-facing data, for instance), it's a bad model.
And if their business can't operate otherwise, it's a bad business. No business has an intrinsic right to exist.
> No business has an intrinsic right to exist.
Do AI businesses have an intrinsic right to exist?
The incentives for online news are really wacky just to begin with. A coin at the convenience store for the whole dang paper used to be the simplest thing in the world.
I mean this as a side note rather than a counterargument (because people learn to take screenshots, and because what can you do about particularly bad faith news orgs?): Immediate archival can capture silent changes (and misleadingly announced changes). A headline might change to better fit the article body. An editor's note might admit a mistakenly attributed quote.
Or a news org might pull a Fox News [1][2] by rewriting both the headline and article body to cover up a mistake that unravels the original article's reason for existing: The original headline was "SNAP beneficiaries threaten to ransack stores over government shutdown". The headline was changed to "AI videos of SNAP beneficiaries complaining about cuts go viral". An editor's note was added [3][4]: "This article previously reported on some videos that appear to have been generated by AI without noting that. This has been corrected." I think Fox News deleted the article.
[1] https://xcancel.com/KFILE/status/1984673901872558291
They won't, of course, because they don't want accountability.
I can indeed find clear records of that in the archives. But what do I do with them? How do I use that evidence to hold news media to account? This is meaningless moral posturing.
I've contacted multiple journalists over the years about errors in their articles and I've generally found them responsive and thankful.
Sometimes it's not even their fault. One time a journalist told me the incorrect information was unknowingly added by an editor.
I get that it's popular on HN and the internet to bash news media, and that there are a lot of legitimate issues with the media, but my personal experience is that journalists do actually want to do a good job and respond accordingly when you engage them (in a non-antagonistic manner).
If a major article claims that certain groups don’t exist, while the same newspaper published a detailed report about those exact groups and how dangerous they are just two years earlier, it’s not because the journalist wasn’t able to do a 10-second Google search where their own paper’s article would have been among the top results.
Contact their rivals with the story, have them write a hit piece. "Other newspaper is telling porkies: here's the proof!" is an excellent story: not one I'd expect a journalist to have time to discover, but certainly one I'd expect them to be able to follow up on, once they've received a tip.
For the record, I'm talking about actual journalism groups, not Substack blogs. Here's one (largely US-centric) list, a ≈dozen links long: https://hcommons.social/@zeblarson/115488066909889058. You almost certainly have local journalists who need your support, which obviously I cannot list here.
In the past libraries used to preserve copies of various newspapers, including on microfiche, so it was not quite feasible to make history vanish. With print no longer out there, the modern historical record becomes spotty if websites cannot be archived.
Perhaps there needs to be a fair-use exception or even a (god forbid!) legal requirement to allow archivability? If a website is open to the public, shouldn't it be archivable?
I've sometimes dreamed of a web where every resource is tied to a hash, which can be rehosted by third parties, making archival transparent. This would also make it trivial to stand up a small website without worrying about it get hug-of-deathed, since others would rehost your content for you. Shame IPFS never went anywhere.
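A sketch of the content-addressing idea (names and the mirror interface are illustrative): the "link" is just the hash of the bytes, so any third party can rehost them and the client can verify the copy regardless of who served it:

    import hashlib

    def address_of(content: bytes) -> str:
        """A resource's 'name' is just the hash of its bytes."""
        return hashlib.sha256(content).hexdigest()

    def fetch_verified(expected_hash: str, mirrors: list) -> bytes:
        """Try any mirror; accept whatever bytes match the expected hash."""
        for fetch in mirrors:
            content = fetch(expected_hash)
            if content is not None and address_of(content) == expected_hash:
                return content          # origin is irrelevant once the hash checks out
        raise LookupError("no mirror had a valid copy")

    page = b"<html>my tiny site</html>"
    link = address_of(page)                         # what you'd publish instead of a URL
    mirror = lambda h: page if h == link else None  # stand-in for a third-party rehoster
    assert fetch_verified(link, [mirror]) == page

The hard parts IPFS wrestled with (discovery, incentives to keep rehosting, mutable content) are exactly what this sketch leaves out.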
This is from my experience having a personal website. AI companies keep coming back even if everything is the same.
This also goes back to something I said long ago: AI companies are relearning software engineering, poorly. I can think of so many ways to speed up AI crawlers; I'm surprised someone being paid 5x my salary cannot.
The problem is that AI companies have decided that they want instant access to all data on Earth the moment that it becomes available somewhere, and have the infrastructure behind them to actually try and make that happen. So they're ignoring signals like robots.txt or even checking whether the data is actually useful to them (they're not getting anything helpful out of recrawling the same search results pagination in every possible permutation, but that won't stop them from trying, and knocking everyone's web servers offline in the process) like even the most aggressive search engine crawlers did, and are just bombarding every single publicly reachable server with requests on the off chance that some new data fragment becomes available and they can ingest it first.
This is also, coincidentally, why Anubis is working so well. Anubis kind of sucks, and in a sane world where these companies had real engineers working on the problem, they could bypass it on every website in just a few hours by precomputing tokens.[2] But...they're not. Anubis is actually working quite well at protecting the sites it's deployed on despite its relative simplicity.
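For context, an Anubis-style challenge boils down to finding a nonce whose hash clears a difficulty threshold; a toy version (the difficulty and challenge string here are made up) shows why that work could, in principle, be precomputed outside the browser by anyone willing to spend the CPU:

    import hashlib
    from itertools import count

    def solve(challenge: str, difficulty_bits: int = 16) -> int:
        """Find a nonce so sha256(challenge + nonce) starts with `difficulty_bits` zero bits."""
        target = 1 << (256 - difficulty_bits)
        for nonce in count():
            digest = hashlib.sha256(f"{challenge}{nonce}".encode()).digest()
            if int.from_bytes(digest, "big") < target:
                return nonce

    print("solved with nonce", solve("example-challenge"))

The point of the scheme is only that the work happens at all, not that it's hard to automate; which is why it's surprising that it still deters these crawlers.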
It really does seem to indicate that LLM companies want to just throw endless hardware at literally any problem they encounter and brute force their way past it. They really aren't dedicating real engineering resources towards any of this stuff, because if they were, they'd be coming up with way better solutions. (Another classic example is Claude Code apparently using React to render a terminal interface. That's like using the space shuttle for a grocery run: utterly unnecessary, and completely solvable.) That's why DeepSeek was treated like an existential threat when it first dropped: they actually got some engineers working on these problems, and made serious headway with very little capital expenditure compared to the big firms. Of course they started freaking out, their whole business model is based on the idea that burning comical amounts of money on hardware is the only way we can actually make this stuff work!
The whole business model backing LLMs right now seems to be "if we burn insane amounts of money now, we can replace all labor everywhere with robots in like a decade", but if it turns out that either of those things aren't true (either the tech can be improved without burning hundreds of billions of dollars, or the tech ends up being unable to replace the vast majority of workers) all of this is going to fall apart.
Their approach to crawling is just a microcosm of the whole industry right now.
[1]: https://en.wikipedia.org/wiki/Common_Crawl
[2]: https://fxgn.dev/blog/anubis/ and related HN discussion https://news.ycombinator.com/item?id=45787775
There's a bit of discussion of Common Crawl in Jeff Jarvis's testimony before Congress: https://www.youtube.com/watch?v=tX26ijBQs2k
What I feel is a lot more likely is that OpenAI et al are running a pretty tight ship, whereas all the other "we will scrape the entire internet and then sell it to AI companies for a profit" businesses are not.
And it's a lot harder to get the law to stop doing something once it proves to cause significant collateral damage, or just cumulative incremental collateral damage while having negligible effectiveness.
Just because you have infinite money to spend on training doesn't mean you should saturate the internet with bots looking for content with no constraints - even if that is a rounding error of your cost.
We just put heavy constraints on our public sites blocking AI access. Not because we mind AI having access - but because we can't accept the abusive way they execute that access.
It’s very unfortunate and a short-sighted way to operate.
That's assuming they're deriving a benefit from misbehaving.
There is no benefit to immediately re-crawling 404s or following dynamic links into a rabbit hole of machine-generated junk data and empty search results pages in violation of robots.txt. They're wasting the site's bandwidth and their own in order to get trash they don't even want.
Meanwhile there is an obvious benefit to behaving: You don't, all by yourself, cause public sites to block everyone including you.
The problem here isn't malice, it's incompetence.
Receiving a response from someone's webserver is a privilege, not a right.
Why does any of them deserve any special treatment? Please don't try to normalize this reprehensible behavior. It's a greedy, exploitative and lawless behavior, no matter how much they downplay it or how long they've been doing it.
This is the problem with AI scraping. On one hand, they need a lot of content, on the other, no single piece of content is worth much by itself. If they were to pay every single website author, they'd spend far more on overhead than they would on the actual payments.
Radio faces a similar problem (it would be impossible to hunt down every artist and negotiate licensing deals for every single song you're trying to play). This is why you have collective rights management organizations, which are even permitted by law to manage your rights without your consent in some countries.
It helps not to have images, etc. that would drive up bandwidth costs. Serving HTML is just pennies a month with BunnyCDN. If I had heavier content, I might have to block them or restrict it to specific pages once per day. Maybe just block the heavy content, like the images.
Btw, has anyone tried just blocking things like images to see if scraping bandwidth dropped to acceptable levels?
Maybe they vibecoded the crawlers. I wish I were joking.
Why, though? Especially if the pages are new; aren't they concerned about ingesting AI-generated content?
I think it "failed" because people expected it to be a replacement transport layer for the existing web, minus all of the problems the existing web had, and what they got was a radically different kind of web that would have to be built more or less from scratch.
I always figured it was a matter of the existing web getting bad enough, and then we'd see adoption improve. Maybe that time is near.
But you are right on the reason it "failed". People expected web++, with a "killer app", whatever that means. Imagination is dead.
Compared to the total number of users on the Internet, relatively few have stable, always-on machines ready to host P2P content. ISPs do not make it easy, or at times even possible, to poke holes in firewalls to allow for easy hosting on residential connections. This necessitates hole punching, which adds non-trivial delays to connections and overall poorer network performance.
It's less that imagination is dead and more that the limitations of the modern Internet retard the momentum of anything P2P.
Is there any reason this has to be true? Probably some majority or significant minority of mobile devices spend some eight hours a day attached to a charger in a place where they have the WiFi password, while the user is asleep. And you don't need 100% of devices to be hosts or routers, 10% at any given time would be more than sufficient.
You've just described Nostr: Content that is tied to a hash (so its origin and authenticity can be verified) that is hosted by third parties (or yourself if you want)
Also, I always wonder about Common Crawl:
Is there something wrong with it? Is it badly designed? What is it that all the trainers cannot find there, such that they need to crawl our sites over and over again for the exact same stuff, each on their own?
The folks who crawl more appear to mostly be folks who are doing grounding or RAG, and also AI companies who think that they can build a better foundational model by going big. We recommend that all of these folks respect robots.txt and rate limits.
> The folks who crawl more appear to mostly be folks who are doing grounding or RAG, and also AI companies who think that they can build a better foundational model by going big.
But how can they aspire to do any of that if they cannot build a basic bot?
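For contrast, the "basic bot" being asked for here is not much code; a minimal polite fetcher (a sketch using only the standard library, with a made-up user agent) respects robots.txt, honors the crawl delay, and sends conditional requests so unchanged pages are not re-downloaded:

    import time
    import urllib.error
    import urllib.request
    import urllib.robotparser

    USER_AGENT = "politebot/0.1"   # hypothetical bot name

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://example.com/robots.txt")
    rp.read()
    delay = rp.crawl_delay(USER_AGENT) or 5    # fall back to a conservative 5 s gap

    etags = {}   # remembered validators, so unchanged pages are not re-downloaded

    def polite_fetch(url):
        if not rp.can_fetch(USER_AGENT, url):
            return None                                  # robots.txt says no
        req = urllib.request.Request(url, headers={"User-Agent": USER_AGENT})
        if url in etags:
            req.add_header("If-None-Match", etags[url])  # conditional request
        time.sleep(delay)                                # rate limit between fetches
        try:
            with urllib.request.urlopen(req) as resp:
                if resp.headers.get("ETag"):
                    etags[url] = resp.headers["ETag"]
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code == 304:
                return None                              # unchanged since the last visit
            raise

Even this much would solve the complaint below about infrequently updated sites being hammered for the exact same content multiple times a day.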
My case, which I know is the same for many people:
My content is updated infrequently. Common Crawl must have all of it. I do not block Common Crawl, and I see it (the genuine one from the published ranges; not the fakes) visiting frequently. Yet the LLM bots hit the same URLs all the time, multiple times a day.
I plan to start blocking more of them, even the User and Search variants. The situation is becoming absurd.
I wrote a short paper on that 25 years ago, but it went nowhere. I still think it is a great idea!
Definitely; this is going to hurt those over at /r/datahoarder.
News websites aren’t like those labyrinthine cgit-hosted websites that get crushed under scrapers. If 1,000 different AI scrapers hit a news website every hour, it wouldn’t even make a blip in the traffic logs.
Also, AI companies are already scraping these websites directly in their own architecture. It’s how they try to stay relevant and fresh.
Kind of sucks, because the news is an important part of that kind of archive.