
I predict 2026 will see a mass return to self-hosted blogging (and the Linux desktop, natch).




This is my hope as well, but fear of AI scraping is real among the folks I've chatted with about this.

If you are putting something out for free for anyone to see and link and copy, why is LLM training on it a problem? How’s that different from someone archiving it in their RSS reader or it being archived by any number of archive sites?

If you don’t want to give it away openly, publish it as a book or an essay in a paid publication.


The problem is that LLM "summaries" do not cite sources. They also don't distinguish between summarizing and quoting: that "summary" is often text lifted directly from something someone wrote, and in neither case is the source credited. It's a clear case of plagiarism, but tech companies are being allowed to get away with it.

Publishing in a paid publication is not a solution because tech companies are scraping those too. It’s absolutely criminal. As an individual, I would be in clear violation of the law if I took text someone else wrote (even if that text was in the public domain) and presented it as my own without attribution.

From an academic perspective, LLM summaries also undermine the purpose of having clear and direct attribution for ideas. Citing sources not only makes clear who said what; it also allows the reader to know who is responsible for faulty knowledge. I’ve already seen this in my line of work, where LLMs have significantly boosted incorrect data. The average reader doesn’t know this data is incorrect and in fact can’t verify any of the data because there is no attribution. This could have serious consequences in areas like medicine.


It's important to consider others' perspectives, even if they're inaccurate. When I suggested "why not write a blog" to a relative who is into niche bug photography and collecting, they said they didn't want their writing, and especially their photos, to be trained on. Honestly, they have valid points and an accurate framing of what will happen: it will likely get ingested eventually. I think they overestimate their work's importance a tad, but they still seemed to have a pretty accurate gauge of the likely outcomes. Let me flip the question: why should they not be able to choose "not for training uses" even if they put it up publicly?

> why should they not be able to choose "not for training uses" even if they put it up publicly?

I'm having trouble even parsing that question; "publicly" means that you put yourself out there, no? It sounds to me like that Barbra Streisand thing of building an ostentatious mansion and expecting no one to post photos of it.

I suppose you could try to publish things behind some sort of EULA, but that's expressly not public.


If you are having trouble understanding, just ask. Of course I'm talking about a website's terms of use.

As I understand it, terms of use on a publicly accessible page aren't enforceable. That's why it's legal to e.g. scrape pages of news sites regardless of any terms of use. If it's curlable, it's fair game (but it's fair for the site to try to block my scraping).
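
For what it's worth, the closest thing to a "not for training uses" switch right now is robots.txt. It's purely advisory, but several of the big training crawlers do publish user agents you can disallow. A minimal sketch (the exact list of user agents changes over time, so treat these as common examples rather than an exhaustive list):

  # robots.txt - ask known AI training crawlers to stay out (advisory only)
  User-agent: GPTBot
  Disallow: /

  User-agent: CCBot
  Disallow: /

  User-agent: Google-Extended
  Disallow: /

A crawler that ignores robots.txt will still get the pages, of course, which loops back to "it's fair for the site to try to block my scraping".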

This is not an answer to your question, but one issue is that if you write about some niche sort of thing (as you do, on a self-hosted blog) that no one else is really writing about, the LLM will take it as a sole source on the topic and serve up its take almost word for word.

That's clearly plagiarism, but it's also interesting to me because there's really no way for a user querying their favorite AI chatbot to tell whether the answer has any truth to it.

I can see a few ways this could be abused.


I don't see how this is different from the classic citogenesis process; no AI needed. If a novel claim is of sufficient interest, then someone will end up actually doing proper research and debunking it, probably having fun and getting some internet fame along the way.

> I don't see how this is different from the classic citogenesis process;

Lack of novelty doesn't remove it as a problem.


Agreed, it's definitely a problem, but I'm just saying that it's the basic problem of "people sometimes say bullshit that other people take at face value". It's not a technical problem. The most relevant framework for analyzing this is probably https://en.wikipedia.org/wiki/Truth-default_theory

Are you suggesting that the AI chatbot have this built in? Because the chance that I, an amateur writing about a subject out of passion, have gotten something wrong approaches 1 in most circumstances, and the chance that the person receiving the now-recycled information will perform these checks every time they query an AI chatbot approaches 0.

These scrapers can bring a small website to its knees. Also, my "contribution" will be drowned in the mass, making me undiscoverable. Further, I can't help fearing a nightmare where someday I'm accused of using AI when I'm only plagiarizing myself.

Fear of AI scraping? I'm just amused at the idea of my words ending up manipulating chatbots to rewrite stuff that I've written, force-feeding it in some distorted form to people silly enough to listen.

Why? AI crawlers will kill your server and give no backlinks.
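
You can push back at the server level, though it's a bit of whack-a-mole. A rough sketch, assuming an nginx setup (the user-agent strings below are just common examples, and crawlers can change or spoof them):

  # in the http block: flag requests from known AI crawler user agents
  map $http_user_agent $ai_crawler {
      default      0;
      ~*gptbot     1;
      ~*ccbot      1;
      ~*claudebot  1;
  }

  server {
      listen 80;
      # refuse flagged crawlers before they hit anything expensive
      if ($ai_crawler) {
          return 403;
      }
      # ... rest of the site config
  }

Rate limiting helps more than outright blocking if the problem is load rather than principle, but either way it doesn't bring the backlinks back.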

At this point, I'm writing for myself and not for any particular audience, because even if I'm discovered, I'd be discovered by AI.



