What are Fedi Admins doing to block Meta scrapers?

The Nexus of Privacy@lemmy.blahaj.zone · edit-2 4 months ago

What are Fedi Admins doing to block Meta scrapers?

Jeena@piefed.jeena.net · edit-2 4 months ago

The only thing I’ve been doing on my blog (not on my piefed instance yet, but probably should) was user agent filtering:

if ($http_user_agent ~* (SemrushBot|AhrefsBot|PetalBot|YisouSpider|Amazonbot|VelenPublicWebCrawler|DataForSeoBot|Expanse,\ a\ Palo\ Alto\ Networks\ company|BacklinksExtendedBot|ClaudeBot|OAI-SearchBot)) {
return 403;
}

The Nexus of Privacy@lemmy.blahaj.zone · 4 months ago

Thanks! Does it seem like that’s affective, or are you getting the feel that that the bots are just changing user agent to get aroud it?

Jeena@piefed.jeena.net · 4 months ago

I got the list from a friend who checks his logs every now and then and adds new not names there.

Rimu@piefed.social · edit-2 4 months ago

There are no PieFed instances in that list. Maybe because Meta is blocked in the default PieFed robots.txt or maybe PieFed is too obscure.

The robots.txt on Mastodon and Lemmy is basically useless.

The Mbin robots.txt is massive but does not block Meta’s crawler so presumably it is not being kept up to date.

Any fedi devs reading this: add these

User-agent: meta-externalagent  
User-agent: Meta-ExternalAgent  
User-agent: meta-externalfetcher  
User-agent: Meta-ExternalFetcher  
User-agent: TikTokSpider  
User-agent: DuckAssistBot  
User-agent: anthropic-ai  
Disallow: /

rhythmisaprancer@piefed.social · 4 months ago

@originalucifer@moist.catsweat.com in case you are interested