Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: anthk <anthk@openbsd.home>
Newsgroups: comp.misc
Subject: Re: bad bot behavior
Date: Mon, 12 May 2025 06:24:45 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 32
Message-ID: <slrn101ue1g.198p.anthk@openbsd.home.localhost>
References: <vrc2r4$2okrp$1@dont-email.me> <vrc8qm$2tkq5$1@dont-email.me> <20250318182006.00006ae3@dne3.net>
Injection-Date: Mon, 12 May 2025 08:24:45 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="cc7a24369ce4842ffee3a5ba377d97da"; logging-data="1014028"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+tzbhHTU57ue/8soSDLNxD"
User-Agent: slrn/1.0.3 (OpenBSD)
Cancel-Lock: sha1:KQ8Fc6xv7OEQmw40qIaatWrKVlo=
Bytes: 2193

On 2025-03-18, Toaster <toaster@dne3.net> wrote:
> On Tue, 18 Mar 2025 12:00:07 -0500
> D Finnigan <dog_cow@macgui.com> wrote:
>
>> On 3/18/25 10:17 AM, Ben Collver wrote:
>> > Please stop externalizing your costs directly into my face
>> > ==========================================================
>> > March 17, 2025 on Drew DeVault's blog
>> >
>> > Over the past few months, instead of working on our priorities at
>> > SourceHut, I have spent anywhere from 20-100% of my time in any
>> > given week mitigating hyper-aggressive LLM crawlers at scale.
>>
>> This is happening at my little web site, and if you have a web site,
>> it's happening to you too. Don't be a victim.
>>
>> Actually, I've been wondering where they're storing all this data;
>> and how much duplicate data is stored from separate parties all
>> scraping the web simultaneously, but independently.
>
> But what can be done to mitigate this issue? Crawlers and bots ruin
> the internet.
>

GZip bombs + fake links = profit. Remember that gzipped web pages are
standard; even lynx can decompress .gz files natively.

Also, MegaHAL/Hailo under Perl: feed it nonsense, then put some
non-visible content in a robots.txt-disallowed directory, full of
Markov-chain-generated gibberish and gzip bombs.
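
Rough sketch of the idea, in Perl since Hailo is a Perl module. The
trap/ path, the file names, the sizes and the toy order-1 Markov chain
are all just illustrative; a real setup would use Hailo/MegaHAL for the
babble and the web server config for the rest.

#!/usr/bin/env perl
# Pre-generate a gzip "bomb" (a tiny .gz that inflates to something
# large) plus a page of Markov-chain babble to park under a
# robots.txt-disallowed path, e.g.:
#
#   User-agent: *
#   Disallow: /trap/
#
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

mkdir 'trap' unless -d 'trap';

# 1. The bomb: 10 MiB of zero bytes compresses to a few KiB.
#    Serve the .gz as-is with "Content-Encoding: gzip" so the client
#    does the inflating.
my $zeros = "\0" x (10 * 1024 * 1024);
gzip \$zeros => 'trap/bomb.html.gz', Level => 9
    or die "gzip failed: $GzipError\n";

# 2. Cheap order-1 Markov babble from any seed text (seed.txt here);
#    Hailo/MegaHAL would do a much better job, this is the bare idea.
my @words = split ' ', do {
    local $/;
    open my $fh, '<', 'seed.txt' or die "seed.txt: $!";
    <$fh>;
};
die "need a bigger seed.txt\n" if @words < 2;

my %next;
push @{ $next{ $words[$_] } }, $words[ $_ + 1 ] for 0 .. $#words - 1;

my $w = $words[ int rand @words ];
my @out;
for (1 .. 400) {
    push @out, $w;
    my $choices = $next{$w} or last;
    $w = $choices->[ int rand @$choices ];
}

open my $out, '>', 'trap/babble.html' or die "trap/babble.html: $!";
print {$out} "<html><body><p>@out</p></body></html>\n";
close $out;

Then sprinkle some links to /trap/ that humans never see (hidden or
unstyled anchors); polite crawlers that honour robots.txt never fetch
them, the greedy ones get a diet of gibberish and decompression work.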