How to combat large amounts of AI scrapers
Drunk & Root@sh.itjust.works to Selfhosted@lemmy.world · English · 3 months ago · 54 comments

Every time I check the nginx logs it's more scrapers than I can count, and I could not find any good open source solutions.

daniskarma@lemmy.dbzer0.com · 3 months ago
How do you know it's "AI" scrapers? I've had my server up since before AI was a thing. It's totally normal to get thousands of bot hits and to get scraped. I use CrowdSec to mitigate it, but you will always get bot hits.

Sheldan@lemmy.world · 3 months ago
Some of them are at least honest and identify themselves in the user agent.

krakenfury@lemmy.sdf.org · 3 months ago
Is ignoring robots.txt considered "honest"?

Reply: That's not what I was talking about.

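For the crawlers that do announce themselves, a user-agent check in nginx is usually enough to turn them away. A minimal sketch, assuming the bot names below are the ones showing up in your logs (the list and the 403 response are placeholders to adjust):

    # nginx, http {} context: flag known self-identifying AI crawler user agents
    map $http_user_agent $block_ai {
        default          0;
        ~*GPTBot         1;   # OpenAI
        ~*ClaudeBot      1;   # Anthropic
        ~*CCBot          1;   # Common Crawl
        ~*Bytespider     1;   # ByteDance
        ~*PerplexityBot  1;   # Perplexity
    }

    server {
        # ... existing listen / server_name / location config ...

        # reject flagged user agents before the request reaches the backend
        if ($block_ai) {
            return 403;
        }
    }

This only helps against the bots that are honest about who they are, which is the point being made above.
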
Drunk & Root@sh.itjust.works (OP) · 3 months ago
Bot hits I don't care about; my issue is when I see the same IP querying every file on 3 resource-intensive sites millions of times.

daniskarma@lemmy.dbzer0.com · 3 months ago
Do you have a proper robots.txt file? Do they do weird things like invalid URLs or invalid POST attempts? Weird user agents? Millions of hits from the same IP sounds much more like vulnerability probing than a crawler. If that's the case, fail2ban or CrowdSec. It should be easy to set up a rule to ban an inhuman number of hits per second on certain resources.

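A per-IP rate limit along those lines can also live directly in nginx rather than in fail2ban or CrowdSec. A minimal sketch, assuming a frontend reverse-proxied on 127.0.0.1:8080 (the zone name, rate, and burst values are placeholders to tune against real traffic):

    # nginx, http {} context: track clients by IP, allow roughly 5 requests/second each
    limit_req_zone $binary_remote_addr zone=perip:10m rate=5r/s;

    server {
        location / {
            # absorb short bursts, answer everything beyond that with 429
            limit_req zone=perip burst=20 nodelay;
            limit_req_status 429;
            proxy_pass http://127.0.0.1:8080;  # hypothetical frontend backend
        }
    }

fail2ban or a CrowdSec bouncer can then watch the access log for repeated 429s from the same IP and ban it at the firewall instead of answering every single request.
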
Drunk & Root@sh.itjust.works (OP) · 3 months ago
Since it's the frontends I run that are getting scraped, it's the robots.txt bundled with those frontends.

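For reference, a bundled robots.txt like that can usually be extended with the commonly published AI crawler names; a sketch (only the polite crawlers respect it, which is why the nginx rules above are still worth having):

    # robots.txt: ask self-identifying AI crawlers to stay away entirely
    User-agent: GPTBot
    User-agent: ClaudeBot
    User-agent: CCBot
    User-agent: Google-Extended
    User-agent: Bytespider
    Disallow: /

    # everything else keeps normal access
    User-agent: *
    Disallow:
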