Admins: Set up Anubis ASAP!

Scrubbles@poptalk.scrubbles.tech · 22 days ago

Admins: Set up Anubis ASAP!

hendrik@palaver.p3x.de · edit-2 22 days ago

To be honest, I’m not really a fan of fighting the AI scraper war. It’s ultimately the same dynamics which turned the open internet into what it is today… 10 years ago I could just look up stuff… google whether my Thinkpad supports more or faster RAM than what the specs said, and I’d find some nice Reddit thread. And now it’s all walled off, information isn’t shared freely anymore, and I’m having a hard time. And we do the same thing.

But… We do need to fight. I just wish there was some solution which doesn’t add to the enshittification of the internet. I’ve been hit by them as well and the database load made the entire server grind to a halt. I’ve added some firewall rules and deny lists to my reverse proxy to specifically block the AI companies and so far it’s looking good. I’ll try to postpone solutions like Anubis until it’s unavoidable. And I guess it won’t work for everything. We need machine-readable information, APIs, servers to talk to each other…

But that’s just my 2 cents and I’m going to change my opinion and countermeasures once necessary.

Scrubbles@poptalk.scrubbles.tech · 22 days ago

This still lets scrapers through, but it’s more of an artificial throttle. There are several knobs and dials it looks like I can turn to make it more or less permissive, but I got a say that over 90% of my traffic was coming from not farms, and I a small instance admin was paying for all of that with database queries and compute. I think this is a good middle ground. You can access it, if you’re willing to really try to get at it. From what I see most scrapers give up immediately rather than spending on compute

hendrik@palaver.p3x.de · 22 days ago

Hmmm. I mean my instance is small, almost all of my traffic comes from federation. There will be like 50 requests in the log to forward some upvotes, and then one or two from a user or crawler. Most of them seem to behave, it was just Alibaba and Tencent who did proper DDoS attacks on me. I’ve blocked most of their ASNs and so far it seems to do 100% what I was trying to do. I suppose the minor stream of requests from South America and other places of the world isn’t humans either, but they mostly read articles and that’s kind of a cheap request on PieFed and it’s not a lot. I’ll keep an eye on it, maybe that’s an alternative solution. Though currently it’s a manual process to look up the address ranges and write the firewall rules. It should probably be automated in a way before random people adopt it.

Scrubbles@poptalk.scrubbles.tech · 22 days ago

I thought so too with mine, divide your traffic though into the API vs Web requests. If you have a small instance most of your traffic should be federation traffic hitting the api endpoints, and posts. Scraping traffic will be GETs and requesting web content, not API content. That’s what I noticed, that the vast, vast majority of my traffic wasn’t federation at all, but web scrapes. Granted my instance has been around for almost 3 years now and I’m sure most of the bot farms know I exist.

hendrik@palaver.p3x.de · edit-2 22 days ago

Interesting. For me on a 16 month old PieFed instance, mostly used by me, 95% of the total incoming traffic is federation. With some countermeasures activated. In the last 3 days I got 450k requests from Lemmy instances, 7k from PieFed, 10k from Mastodon, 30k from PeerTube and 30k have a browser User Agent string so those are users or crawlers. That’s roughly 5%. 5k of those have bot or crawler in the User agent, that’s 1% of my requests. And I did 1k requests from my home IP address. Fun fact: 52% of all incoming traffic is just to federate with lemmy.world.

Scrubbles@poptalk.scrubbles.tech · 22 days ago

I’m not surprised, but I have noticed a lot of the bots are fake user agents as well. For example, for my instance I know for a fact no one else was using my instance when I was testing this, and I kept getting User Agent requests from Safari. Which I would be surprised if my users were using Apple too, but knowing they weren’t on was a huge driver. I want to dive in and see how Anubis knows this, or if they were just tested and failed or didn’t bother to complete the challenge. So I’m curious when you remove POSTs and api/federation endpoint calls what your traffic looks like

hendrik@palaver.p3x.de · edit-2 22 days ago

I don’t know how you’re reading this, in case you didn’t know, a lot of browsers have the word “Safari” in their user agent string.

My Vanadium browser identifies as: “Mozilla/5.0 (Linux; Android 10; K) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/142.0.0.0 Mobile Safari/537.36”
So there’s Safari in it, but it’s a Chromium based browser on Android. And none of the information is correct. It’s Android 16, not 10 and Vanadium or Chromium isn’t even in it. Also doesn’t use Mozilla’s engine nor AppleWebkit. It is version 142.something however.

My Firefox (LibreWolf) identifies as: “Mozilla/5.0 (X11; Linux x86_64; rv:144.0) Gecko/20100101 Firefox/144.0” that doesn’t include anything else.

But the desktop Chromium does the same thing again: “Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36”

Scrubbles@poptalk.scrubbles.tech · 22 days ago

That’s absolutely true, looking at my logs there are definitely some weird ones:

Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b7) Gecko/20100101 Firefox/4.0b7

Which… if that’s firefox is very out of date, and Windows 6.1 is Windows 7. Which I could believe that people are posting from Windows 7 compared to 10 or 11, but sus. A lot of them are kind of weird combinations. Anubis auto flagged that one as bot/huawei-cloud, but I’m really curious as to how or why it did. It’s not authentic, that’s for sure, but how does it know it isn’t?

Another one is Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.64 Safari/537.36 Edg/101.0.1210.47, is completely valid, but from an older Windows version and Edge, like if it was a snapshot from a few years ago. Idk it’s all very interesting. That one was also flagged as bot/huawei-cloud.

hendrik@palaver.p3x.de · edit-2 22 days ago

Well, pretty much all browsers fake their user agent strings for a long time now. That’s for privacy reasons. Because malicious actors can single you out with all that information, the combination of IP, operating system and version and exact version of the browser and used libraries. Also they don’t want to advertise if you skimped on updates and are vulnerable for exploits. And they fake more because via Javascript a website can get your screen size, resolution and all kinds of details. That’s business as usual. These pieces of information still serve some purpose, occasionally… But they’re more a relic from the distant past when the internet was an entirely different place without an advertisement and surveillance economy.

And sure, shady bots fake them as well. It’s rare to see correct information with anything. The server to server communication in the Fediverse for example advertises the correct name and version number. Maybe some apps as well if they’re not concerned with servers being malicious.

sga@piefed.social · 22 days ago

Earlier, scrapers were not harming a website much. maybe 1 or search engine hits a day, and just a handful of search engines meant that they would not consume much resources. they would scrape your data, but just so search would work (otherwise all search would just work on your website’s title). this did not harm the websites’ business model (if they had any).

it went downhill when google started giving dires “answers” to search queries. mostly based on wiki, but occasionaly other stuff like reddit. this was often just relevant bit from “high match” website and while it reduced some traffic to those webistes, it was not as bad as it is today - where all most all traffic is from scraping bots, and many just have stopped visiting sites directly.