I have a problem, which is: My websites (a #Wordpress site and a #MediaWiki installation) are slow as hell.
So I need to identify the cause.
-
2/ Okay, I think I might already have some ideas.
My latest #Apache log has 26,694 lines.
In these 26,694 lines, I have:
- 10,724 access requests from "https://developers.facebook.com/docs/sharing/webmasters/crawler"
- 4,562 access requests from "https://developer.amazon.com/support/amazonbot"
- 3,316 access requests from "https://openai.com/gptbot"
So yeah, I suspect these are the #LLM crawling bots from #Facebook , #Amazon , and #OpenAI , which jointly account for more than half the traffic - and they are hogging the more resource-intensive functions, like "Recent Changes" on my wiki.
Fuck those fuckers for causing outages on my websites.
Any suggestions on how to block them? (No snark, please - I _am_ new at this.)
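(For anyone who wants to reproduce counts like these, a rough sketch - assuming the standard Apache "combined" log format and a log file named access.log, both of which depend on your host:)
# Count requests whose User-Agent string mentions each crawler's documentation URL
grep -c "developers.facebook.com/docs/sharing/webmasters/crawler" access.log
grep -c "developer.amazon.com/support/amazonbot" access.log
grep -c "openai.com/gptbot" access.log
# Rank all user agents by request count (the User-Agent is field 6 in the combined format)
awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -rn | head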
@juergen_hubert This might be a good place to start:
GitHub - ai-robots-txt/ai.robots.txt: A list of AI agents and robots to block (https://github.com/ai-robots-txt/ai.robots.txt)
I am not an expert, but I am happy to try and answer any questions you might have.
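(For what it's worth, a minimal robots.txt in the spirit of that list might look like the sketch below. The user-agent tokens are the ones the vendors document for their crawlers - treat the exact names as something to verify - and the real ai.robots.txt list is much longer. It also only helps against bots that choose to honor robots.txt.)
# Block a few known AI crawlers site-wide
User-agent: GPTBot
User-agent: Amazonbot
User-agent: FacebookBot
User-agent: meta-externalagent
Disallow: /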
-
Thanks! I will fiddle around with those and see if anything works.
-
1/ I have a problem, which is: My websites (a #Wordpress site and a #MediaWiki installation) are slow as hell.
So I need to identify the cause. The problem is that I don't know nearly as much about website administration as I ought to.
I contacted the support people at my website provider, who looked at my (Apache) logs and suggested that my Wordpress site might be suffering from a "pingback xmlrpc attack". I applied the proposed remedy, which made things a little better. But I don't know enough about reading website logs to identify such problems myself, and I ought to.
So what I am trying to say is: Is there some kind of beginner's guide for reading website logs, identifying malicious traffic, and knowing what to do about it?
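(For context, the usual quick remedy for XML-RPC pingback abuse is to block access to WordPress's xmlrpc.php - not necessarily the exact fix the host proposed. A minimal .htaccess sketch, assuming Apache 2.4:)
# Deny all requests to WordPress's xmlrpc.php
<Files xmlrpc.php>
    Require all denied
</Files>
(Caveat: the WordPress mobile apps and the Jetpack plugin talk to xmlrpc.php, so setups that rely on those need an exception.)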
@juergen_hubert Wonder if you're getting scraped by AI harvesting bots. Can your site host tell you if you are/are not? And if it's AI bots scraping for LLMs, is the host doing anything to block them?
-
@femme_mal I took a closer look, and I am _definitely_ scraped by AI harvesting bots.
-
@juergen_hubert It's hard to say if that's the culprit without knowing more. 17,000 requests from a bot sounds bad, but if they're spread out over a week or a month or whatever, they may not be enough to be causing performance problems. (You'll usually see performance problems from volumes of requests at consistently high levels over a sustained period.)
There are tips I could give you for hardening WordPress against these types of requests that wouldn't require any sysadmin work. But if you want to harden multiple separate applications, like WP and MediaWiki, that gets more complicated.
(1/?)
-
@juergen_hubert If you can download your access logs from your hosting provider, GoAccess (https://goaccess.io/) is a handy free tool for analyzing them quickly. It can put together simple charts that show you who's hitting your site, when, and from where. These can be useful for identifying spikes in traffic from different sources, which you can then block.
(2/?)
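(A minimal way to run it, assuming the downloaded log is named access.log and uses the standard "combined" format:)
# Generate a self-contained HTML report you can open in a browser
goaccess access.log --log-format=COMBINED -o report.html
# Or browse the same data interactively in the terminal
goaccess access.log --log-format=COMBINED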
-
@juergen_hubert If you want to block traffic to multiple applications on a single machine, you're going to need either the ability to modify your web server software's configuration (which many hosts don't allow) or a piece of software called a "web application firewall" (WAF).
A WAF sits between the public web and your applications, filtering and throttling traffic before it reaches them. It gives you one central way to block or rate-limit entire domains.
Many hosting companies integrate with Cloudflare, which offers a basic, free WAF as a service. So that might be something to talk to your host about.
(3/?)
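(As a rough illustration, a custom rule in Cloudflare's free WAF blocking two of the bots named above could use an expression along these lines, with the action set to Block - the exact field names and what the free plan allows are worth checking in their docs:)
(http.user_agent contains "GPTBot") or (http.user_agent contains "Amazonbot")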
-
@juergen_hubert On Apache, for a quick fix, if your host gives you the ability to use .htaccess files (which modify Apache's configuration), you could put lines in each site's .htaccess like the following (on Apache 2.4 a negated Require has to sit inside a <RequireAll> block - see the sketch after this post):
Require not host <host.example.com>
The downside is that it's on you to keep up with the domains the crawlers are coming from, and they change. A WAF lets you just say "throttle anyone who shows up too much."
You'd also have to keep the list up to date in two places: the .htaccess for WP and the one for MediaWiki.
.htaccess syntax is also finicky. If you don't know what you're doing, I wouldn't mess with it.
(4/?)
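(The sketch mentioned above - on Apache 2.4 a negated Require only takes effect inside a <RequireAll> block, and the example.com hostnames are placeholders for whichever crawler hosts you decide to block:)
<RequireAll>
    # Allow everyone by default...
    Require all granted
    # ...except clients whose hostname (reverse DNS) matches these
    Require not host crawler.example.com
    Require not host another-crawler.example.net
</RequireAll>
(Matching on the User-Agent string instead, via SetEnvIf or mod_rewrite, is another common variant, since many crawlers are easier to identify by user agent than by hostname.)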
-
@juergen_hubert There's probably more to be said, but I've gone on for far too long already.
Hope this was at least helpful. If you want to talk further, feel free to @ me either here or in DMs. Can't promise I can solve your problem, but I'm happy to help however I can.
~ fin ~
(5/5)
-
Thanks - you have given me a lot to think about!