Hacker News

Knowledge Should Not Be Gated

69 points by nezhar 11 hours ago | 51 comments

dofm 5 hours ago

We are at the breathless-but-low-information-posts-about-plain-text-formats point in the cycle.

Herring 4 hours ago

This comment is another example of "Nobody ever gets credit for fixing problems that never happened."

There's a massive push to add unnecessary complexity to everything out there, because complexity pays all our bills.

dofm 4 hours ago

Oh I don't disagree. I'm not saying it's wrong. It's always right at some point in the cycle.

I'm just saying it is cyclical. Databases => plain text => single-file databases, repeat. shared hosting => dedicated hosting => vms => jamstack, repeat, etc.

Can't sell complexity without oversimplicity or simplicity without overcomplexity.

But this is quite a long blog post, with typical blog flourishes, about not very much.

bonoboTP 5 hours ago

AI generated article.

Diti 4 hours ago

Yep. Also no author – an actual writer would have taken credit. I flagged the article but it did nothing.

drunken_thor 6 hours ago

SDKs/libs, especially open source SDKs, were never about gated knowledge. They were about the providing company making it as easy as possible for you to integrate. You would not need to know the idiosyncrasies behind API retries, paging, rate limits, auth flow, and on and on. The third party developers needed a resource, they call a method and get it. Open source libraries especially are about pooling knowledge, not gating it. This is propaganda for pooling that knowledge inside a service you have to pay to use, and instead of developers all using and improving the same codebase together, they have to spend money to rewrite the same code repeatedly. This is AI companies further trying to undercut open source because it’s free.

rightbyte 6 hours ago

It seems beyond naive, rather malicious, to upload any useful private data to SaaS LLMs.

Like, you are letting them data mine your business. Why are corporations not panicking over this?

__MatrixMan__ 3 hours ago

When you say private, I assume you mean proprietary. This isn't about HIPAA or PII but rather about trade secrets and the like. Companies are not panicking because no humans are driving that panic, and we aren't driving that panic because either:

- we haven't thought about it deeply, or

- we've thought about it deeply enough to understand that humans don't benefit when companies act as data gatekeepers.

In neither case are we likely to raise the alarm. If we let them, companies will play zero sum games over "intellectual property" ad infinitum while humans get nothing useful out of the relationship. We're better off when they compete on execution and worse off when they compete in court re: the ownership of abstractions, so there's no reason to encourage the latter sort of behavior.

m11a 6 hours ago

Most corporations likely have zero data retention agreements with LLM providers, at least for API usage.

(Sure, you could be sceptical on whether the LLM provider is upholding that, but I personally do trust them. The trust betrayal if ZDR wasn't actually ZDR would be too great and commercially damaging for them to lie.)

failbuffer 2 hours ago

These firms were completely fine with mass copyright infringement. And the temptation to keep data would be great, especially as they fight for every bit of technical advantage in a market that "wants" to be commodified.

davkan 13 minutes ago

Well, they abused fair use and pretended LLM learning was the same as human learning. Enough gray area to risk court. A lot less gray area when there are signed contracts involved saying they won’t, I think.

dataflow 5 hours ago

> (Sure, you could be sceptical on whether the LLM provider is upholding that, but I personally do trust them. The trust betrayal if ZDR wasn't actually ZDR would be too great and commercially damaging for them to lie.)

Is actual ZDR verbiage in contracts more specific and limited in scope than what we see advertised publicly ("...except where needed to comply with law or combat misuse" in Anthropic's case)? Because those seem pretty damn vague and large enough holes to drive trucks through.

m11a 4 hours ago

It depends on the model provider. OpenAI's is very limited and precisely written.

Plus, open-source models hosted on SaaS inference providers tend to come with a strong ZDR agreement too.

lukewarm707 5 hours ago

to combat misuse, we must store and read all prompts and responses. ;)

to comply with the law, we must send to the police our detections of illegal activity >:|

a guy subpoenaed your chats, i guess we stored them (oops) so now it's illegal to destroy it...

prodigycorp 6 hours ago

because corporations are using providers with ZDR in the contract. If OAI or any of the cloud providers violate this they're getting sued to oblivion.

dofm 5 hours ago

The problem is that there is an enormous, nearly unignorable incentive to work around it. So they will.

As the customer base becomes more and more corporate (which it will), they end up with disproportionately more customers whose experiences cannot be used to train the model to make it better for those customers.

Either way, corporate customers cannot leech off the training from consumers handing over their personal data forever; there aren't enough specialists in that training set to improve the models with no loss of corporate trust.

Betrayal of their trust is inevitable.

WarmWash 5 hours ago

Conspiracies are for the chronically online

rightbyte 3 hours ago

It is rather vague whether you count Sam Altman et al. as "chronically online".

dofm 5 hours ago

This is not a conspiracy theory. It's futurology, maybe, but pretty basic stuff at that.

At some point, where does the training advantage for specialist LLMs come from, if not progressively encroaching on customer data for the benefit of equivalent customers?

coffeefirst 5 hours ago

These are the same people who performed the largest scale breach of copyright in history on the theory that they could get away with it.

I’m not making any accusations, but we should not underestimate their tolerance for legal and financial risk.

It may be a little paranoid to insist on self hosting based on that, but I’m not so sure that it’s crazy.

WarmWash 5 hours ago

It has been ruled that training on copyrighted material is not a breach of copyright unless you subverted payment for it.

Which they did do, but the scale is relatively minuscule compared to the full dataset.

dofm 5 hours ago

Trade secrecy is all anyone has left. It's not paranoid at all. You would hope that most serious companies have a tier of corporate knowledge protection that is somewhere between Coca-Cola/KFC herbs recipe secrecy and Stringer Bell's note-taking exhortation: "is you giving the LLM notes on our unique advantage?"

api 5 hours ago

Imagine someone comes to you and says: "You must remove your door locks. Anyone can come into your house any time. You also need cameras across most of your house. But in exchange, magic elves will do all of your home chores: washing, dishes, folding laundry, cleaning, minor home repairs. All of this will be done for pennies on the dollar compared to any current option."

How many people would take it?

I know I'd actually be tempted. Con: total loss of privacy. Pro: it folds laundry, and I f'ing loathe laundry with the intensity of a billion suns.

Every business has similar trade-offs they'd be tempted to take.

rightbyte 4 hours ago

Sounds like a children's story about not making a deal with the Devil?

The implied part the children already know from other stories is:

The magic elves have a recorded history of laughing at their customers when they are on the toilet, hitting on their husbands/wives, and misleading their children into worshipping the elvendom and wandering off into the forest.

The story ends in some sort of catharsis for the protagonist when the elves go one step too far. In the happy-ending variant, the one Disney makes a version of, it is not too late.

api 3 hours ago

I don't disagree, but history shows that people (and businesses!) can and will make such trades. Look at the privacy nightmare, and it's not just individuals. Large companies put all their Crown Jewels into things like Google Drive, gmail, Dropbox, OneDrive, etc. Even governments do this. The only payoff is convenience. That's how little people and even businesses value their privacy.

TeMPOraL 2 hours ago

> The only payoff is convenience. That's how little people and even businesses value their privacy.

Well that, and after 2+ decades of this, we can pretty much conclusively tell it worked out great for them. They were - and are - absolutely right to "make such trades".

Yes, data leaks sometimes happen. Sometimes they even make noise in the news. And... that's about the end of it. There are no tomb raiders stealing "crown jewels" and "secret sauces" and outcompeting companies on their own turf[0] - instead, there are many success stories of systems, products and businesses that wouldn't be created if not for the ability to outsource data and document processing to cloud services.

[0] - Except the Chinese, but that's not really about stealing secrets or private data - just that owning the factories lets you iterate on hardware faster (+ it helps to have some healthy disregard for "intellectual property").

em-bee 4 hours ago

i believe that in the future technology will be so advanced that protection of privacy is impossible. the only way to counter that is education to respect people's privacy and very harsh punishments for violations.

i also believe that we will live in a post scarcity world, which means profit is no longer interesting, so any business case for invading your privacy will go away and therefore it will only happen for personal interest.

the key in any case will be education, because without it abuse will be rampant and progress will halt because everyone is going to be suspicious of everyone else.

toofy 4 hours ago

> i believe that in the future technology will be so advanced that protection of privacy is impossible. the only way to counter that is…

i’m not sure why so many of us have fallen into this… “there is no other future” thing…

there are other options. plenty of them. there is no singular solution. we could always just say “no”. and that’s that. that would be one option.

why do we feel like there is no other way? why are we afraid to say “nah”?

em-bee 2 hours ago

i didn't say that there is no other possible future. it's just that this is what i believe is the most likely. science fiction presents plenty of other ideas. if you believe any of those are more likely, please share.

here is my argument:

technology will not stop advancing. for good and for bad. we will not one day realize that we should stop progressing tech and switch to an amish approach to technology. highly unlikely.

other scifi futures involve end of the world scenarios. in my opinion those are not interesting, because if they happen even with survivors humanity is mostly over anyways, so i am not entertaining those. humanity will survive in large numbers.

another option is absolute corporate control. we are kind of heading towards that, but if things go really bad then people will revolt. and either the people win or we have another end of the world scenario. you only need to look at china. if they are too strict, people start protesting. so absolute corporate control is not going to happen.

the last one i can think of is a multi class society. that too is unlikely as we have been going away from that over the last century and i do not believe we'll ever go back.

as you can see from these options, education has the only good outcome. therefore it is the most likely one.

education IS the way of saying "no". it's teaching our children to not do that. saying no to certain tech development is unlikely because there is no tech that only has bad use. not even technology that enables weapons. education is the only way to stop people from abusing technology for bad things.

skeledrew 2 hours ago

There is no saying "no" if you want to continue participating in modern society. But you can always try to find some place off the grid and live the rest of your days almost primitively.

Calazon 3 hours ago

I suspect because some of these things would require us all to say no, which is difficult to coordinate and enforce.

It's not impossible, but it's not as easy as "just" saying no, as a society.

em-bee 2 hours ago

right, and for all to be able to say no requires education.

api 3 hours ago

I'd say I generally agree that the privacy problem has no techno-fix and must be solved by regulation.

For example, we could extend HIPAA-style fines for leaked personal data to other forms of intimate data like location, biometrics, local documents, private chats, etc.

Leak someone's location history? That'll be one $$$ fine per incident where an incident is one person data point.

This at the very least converts this kind of data from an asset into a potential liability, incentivizing companies to not collect it, not hold it long, or thoroughly anonymize and aggregate it and then discard specifics.

skeledrew 2 hours ago

I'm 100% taking it. I keep nothing worth stealing in the house.

MelonUsk 6 hours ago

Yep, knowledge should not be gated:

Imagine Google search without any links or sources named

This is the “modern” AI chatbot:

It never mentions the training data it used, in fact has no idea what it used (often FB, Reddit and partisan websites)

Update: I added the reply about after the fact Googling chatbots do - it’s different

dmortin 6 hours ago

Specifically in Google AI Overview I always see links to sites where the information is sourced from.

Or at least some of the sites, if the same info is sourced from 100 pages then it only shows 2 or 3, maybe the ones with the biggest PageRanks.

MelonUsk 6 hours ago

Yep, that’s true

But those links are Googled after the model started to answer, they are not the links to the training data

Imagine an artificial “librarian” that read all the books and spits hallucinated quotes for you

But doesn’t let you enter the library, open a single book or even see the sources for those hallucinated quotes

But instead Googles some sources based on hallucinations after generating them ;-)

It’s better than nothing but you can Google them, too, while training data (the library) is completely hidden from you, even the public domain parts of it - zero attribution

dmortin 5 hours ago

There should be at least some correlation. When building the model they give more weight to some pages (e.g. Wikipedia) which have bigger trust (pagerank?). And when they provide links in answers, those matches are listed first which have better pagerank for the query.

So if it sources something in Wikipedia, it is more likely to provide Wikipedia as a trusted source for it.

The problem is when an answer is hallucinated, false, it may provide a source for it which contains the invalid info.

MelonUsk 5 hours ago

Yep, a few non-profits work on direct training data attribution:

OlmoTrace, Guide Labs with Clarity and a few more

Labs train the model with attribution baked-in and they say the bigger the model - the more interpretable it becomes

Pretty sure it’s the future

skeledrew 3 hours ago

I've been using org[0] for the past 10 years to store knowledge, project write-ups, notes, etc and love it. As I get deeper into AI I'm keeping that, like recently I created an org-edit tool to manage especially issues in the project write-ups. 1 file with all my several hundred projects accumulated over the years, and the value has only grown although it's becoming harder for me to personally consume; I'll likely just create a couple commands that improve the browsing experience. But I continue to love and prefer that single large file (actually several: personal knowledge, projects, business and an inbox) to many individual files. And it's synced across my devices, where esp on mobile I can access it all via Orgzly[1].

[0] https://orgmode.org/ [1] https://orgzlyrevived.com/

mikewarot 3 hours ago

>Personal wikis always died for the same reason.

Mine (WikidPad) died when I switched to Linux, and learned that breaking changes to WxPython rendered it worthless, as none of the dialog boxes functioned after that point. Sure, the source from 2012 was available, but my Wiki really wasn't.

Eventually this forced me back to Windows... but it was too late, now I'm back on Windows, and still don't trust WikidPad.

>Keeping them current was tedious, and humans hate tedium. But the tedium is the one thing language models are immune to. They will happily re-link, re-summarize, and reconcile contradictions across a hundred files without complaining.

Yeah, and you're going to trust the LLM to reliably maintain this? Not a wise choice.

I really wish we had a reliable way to annotate and interlink documents using hypertext. Unfortunately, HTML doesn't actually let you mark up (annotate) hypertext.

We still, 81 years later, don't have a Memex! 8(

skybrian 3 hours ago

If you're maintaining Markdown with a coding agent then you need linting tools to do things like checking for broken links. Without consistency checks, it will often make mistakes. The more internal consistency checks you can do, the better.

This is good practice anyway, and a coding agent can help write the tools.

Now that we have coding assistance, we can even be more ambitious. A common, language-independent test suite would be more useful than Markdown for generating an SDK and then verifying that it matches the spec. So I don't think plain Markdown is the best way.
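To make the broken-link check mentioned above concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the function name `find_broken_links` is made up for this example, it only looks at inline `[text](target)` links in `*.md` files, and it treats a link as broken when the relative target file does not exist on disk. It does not validate external URLs or heading anchors.

```python
import re
from pathlib import Path

# Matches inline Markdown links like [text](target) or [text](target "title").
# Image links ![alt](src) are matched as well, which is fine for this check.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)[^)]*\)")

# Anything starting with a URI scheme (https:, mailto:, ...) is treated as external.
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def find_broken_links(root):
    """Return (file, target) pairs for relative links whose target is missing.

    Hypothetical helper for illustration; `root` is the directory to scan.
    """
    root = Path(root)
    broken = []
    for md_file in sorted(root.rglob("*.md")):
        text = md_file.read_text(encoding="utf-8")
        for target in LINK_RE.findall(text):
            # Skip external URLs and same-page anchors; only check local files.
            if SCHEME_RE.match(target) or target.startswith("#"):
                continue
            # Drop any "#section" suffix before checking the file itself.
            path_part = target.split("#", 1)[0]
            if not (md_file.parent / path_part).exists():
                broken.append((str(md_file.relative_to(root)), target))
    return broken
```

A check like this could run after every agent edit, with the edit rejected whenever the returned list is non-empty.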

Philpax 4 hours ago

The title suggested a far more interesting piece than the actual post. Alas.

internet2000 6 hours ago

Information wants to be free! I remember when that was the rallying cry of hackers. I miss those days.

sghiassy 5 hours ago

“Hack the Planet!” — Hackers

toofy 4 hours ago

those days are still here, people just want to know where the information is coming from.

the rallying cry from hackers has never ever been “information wants to be free from sources”

and hackers have also never implied “information from a dipshit should hold the exact same weight as an expert”

yet somehow both of those is the world we’re running towards as fast as we can.

golly_ned 3 hours ago

Has anyone figured out why anyone would bother adopting the google 'open knowledge format'?

Normally I expect a set of tooling to be built on top of any open format. Value-adds and interoperability. Instead I just see a way to organize markdown files.

hx8 4 hours ago

I don't understand why Open Knowledge Format improves interoperability, which is the main claim for its value. These LLMs are obviously advanced enough to navigate other MD file organization schemas like Obsidian, or other text files like Emacs Org.

5701652400 5 hours ago

now that any software/knowledge is copyable given sufficient cash and AIs, gating knowledge might be the only thing that protects your business. otherwise you do not have business.

jdw64 5 hours ago

Personally, I think the ability to distinguish between all the knowledge that's overflowing is becoming a characteristic of the current establishment. In reality, the number of sites where you can get good information is extremely limited. It feels like we're in an era where discernment matters more.

Most of it is just misinformation, after all. People say knowledge shouldn't be restricted, but now we have the opposite problem. There's so much information that just skimming through it takes too much time. On top of that, as we shift from text to video, getting information has become even harder. Compared to text, YouTube videos feel like they have much lower information density. I've heard that the TikTok generation's text literacy is declining, but maybe that's actually a social adaptation to process as much data as possible from low-density sources

In that sense, the efficiency of RAG ultimately comes down to what kind of good knowledge you're feeding into the AI.

nephihaha 6 hours ago

Sadly it has been during most of human history. I think the establishment resents the masses becoming over educated. The 1990s internet had a wealth of views and information on it. Now you can only access approved sources via search engines thanks to scaremongering, and have CloudFlare monitoring everything you do.