I have a problem, which is: My websites (a #Wordpress site and a #MediaWiki installation) are slow as hell.
So I need to identify the cause.
-
2/ Okay, I think I might already have some ideas.
My latest #Apache log has 26,694 lines.
In these 26,694 lines, I have:
- 10,724 access requests from "https://developers.facebook.com/docs/sharing/webmasters/crawler"
- 4,562 access requests from "https://developer.amazon.com/support/amazonbot"
- 3,316 access requests from "https://openai.com/gptbot"
So yeah, I suspect these are the #LLM crawling bots from #Facebook , #Amazon , and #OpenAI , which jointly account for more than half the traffic - and they are hogging the more resource-intensive functions, like "Recent Changes" on my wiki.
Fuck those fuckers for causing outages on my websites.
Any suggestions on how to block them? (No snark, please - I _am_ new at this.)
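(For anyone who wants to reproduce counts like these, a rough sketch - assuming the standard Apache "combined" log format and a log file named access.log, both of which depend on your host:)
# Count requests whose User-Agent string mentions each crawler's documentation URL
grep -c "developers.facebook.com/docs/sharing/webmasters/crawler" access.log
grep -c "developer.amazon.com/support/amazonbot" access.log
grep -c "openai.com/gptbot" access.log
# Rank all user agents by request count (the User-Agent is field 6 in the combined format)
awk -F'"' '{print $6}' access.log | sort | uniq -c | sort -rn | head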
@juergen_hubert This might be a good place to start:
GitHub - ai-robots-txt/ai.robots.txt: A list of AI agents and robots to block (https://github.com/ai-robots-txt/ai.robots.txt)
I am not an expert, but I am happy to try and answer any questions you might have.
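(For what it's worth, a minimal robots.txt in the spirit of that list might look like the sketch below. The user-agent tokens are the ones the vendors document for their crawlers - treat the exact names as something to verify - and the real ai.robots.txt list is much longer. It also only helps against bots that choose to honor robots.txt.)
# Block a few known AI crawlers site-wide
User-agent: GPTBot
User-agent: Amazonbot
User-agent: FacebookBot
User-agent: meta-externalagent
Disallow: /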
-
Thanks! I will fiddle around with those and see if anything works.
-
1/ I have a problem, which is: My websites (a #Wordpress site and a #MediaWiki installation) are slow as hell.
So I need to identify the cause. The problem is that I don't know nearly as much about website administration as I ought to.
I contacted the support people at my website provider, who looked at my (Apache) logs and suggested that my Wordpress site might be suffering from a "pingback xmlrpc attack". I applied the proposed remedy, which made things a little better. But I don't know enough about reading website logs to identify such problems myself, and I ought to.
So what I am trying to say is: Is there some kind of beginner's guide for reading website logs, identifying malicious traffic, and knowing what to do about it?
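(For context, the usual quick remedy for XML-RPC pingback abuse is to block access to WordPress's xmlrpc.php - not necessarily the exact fix the host proposed. A minimal .htaccess sketch, assuming Apache 2.4:)
# Deny all requests to WordPress's xmlrpc.php
<Files xmlrpc.php>
    Require all denied
</Files>
(Caveat: the WordPress mobile apps and the Jetpack plugin talk to xmlrpc.php, so setups that rely on those need an exception.)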
@juergen_hubert Wonder if you're getting scraped by AI harvesting bots. Can your site host tell you if you are/are not? And if it's AI bots scraping for LLMs, is the host doing anything to block them?
-
@femme_mal I took a closer look, and I am _definitely_ scraped by AI harvesting bots.
-
@juergen_hubert It's hard to say if that's the culprit without knowing more. 17,000 requests from a bot sounds bad, but if they're spread out over a week or a month or whatever, they may not be enough to be causing performance problems. (You'll usually see performance problems from volumes of requests at consistently high levels over a sustained period.)
There are tips I could give you for hardening WordPress against these types of requests that wouldn't require any sysadmin work. But if you want to harden multiple separate applications, like WP and MediaWiki, that gets more complicated.
(1/?)
-
@juergen_hubert If you can download your access logs from your hosting provider, GoAccess (https://goaccess.io/) is a handy free tool for analyzing them quickly. It can put together simple charts that show you who's hitting your site, when, and from where. These can be useful for identifying spikes in traffic from different sources, which you can then block.
(2/?)
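(A minimal way to run it, assuming the downloaded log is named access.log and uses the standard "combined" format:)
# Generate a self-contained HTML report you can open in a browser
goaccess access.log --log-format=COMBINED -o report.html
# Or browse the same data interactively in the terminal
goaccess access.log --log-format=COMBINED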
-
@juergen_hubert If you want to block traffic to multiple applications on a single machine, you're going to need either the ability to modify your web server software's configuration (which many hosts don't allow) or a piece of software called a "web application firewall" (WAF).
A WAF sits between the public web and your applications, filtering and throttling traffic before it reaches them. It gives you one central way to block or rate-limit entire domains.
Many hosting companies integrate with Cloudflare, which offers a basic, free WAF as a service. So that might be something to talk to your host about.
(3/?)
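(As a rough illustration, a custom rule in Cloudflare's free WAF blocking two of the bots named above could use an expression along these lines, with the action set to Block - the exact field names and what the free plan allows are worth checking in their docs:)
(http.user_agent contains "GPTBot") or (http.user_agent contains "Amazonbot")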
-
@juergen_hubert On Apache, for a quick fix, if your host gives you the ability to use .htaccess files (which modify Apache's configuration), you could put lines in each site's .htaccess like the following (on Apache 2.4 a negated Require has to sit inside a <RequireAll> block - see the sketch after this post):
Require not host <host.example.com>
The downside is that it's on you to keep up with the domains the crawlers are coming from, and they change. A WAF lets you just say "throttle anyone who shows up too much."
You'd also have to keep the list up to date in two places: the .htaccess for WP and the one for MediaWiki.
.htaccess syntax is also finicky. If you don't know what you're doing, I wouldn't mess with it.
(4/?)
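(The sketch mentioned above - on Apache 2.4 a negated Require only takes effect inside a <RequireAll> block, and the example.com hostnames are placeholders for whichever crawler hosts you decide to block:)
<RequireAll>
    # Allow everyone by default...
    Require all granted
    # ...except clients whose hostname (reverse DNS) matches these
    Require not host crawler.example.com
    Require not host another-crawler.example.net
</RequireAll>
(Matching on the User-Agent string instead, via SetEnvIf or mod_rewrite, is another common variant, since many crawlers are easier to identify by user agent than by hostname.)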
-
@juergen_hubert There's probably more to be said, but I've gone on for far too long already.
Hope this was at least helpful. If you want to talk further, feel free to @ me either here or in DMs. Can't promise I can solve your problem, but I'm happy to help however I can.
~ fin ~
(5/5)
-
Thanks - you have given me a lot to think about!