• zod000@lemmy.dbzer0.com

    Most AI crawlers don’t respect robots.txt files, but this info might be useful for other forms of blocking.

    • Vittelius@feddit.orgOP

      The repo, despite its name, doesn’t only contain a robots.txt. It also has files for popular reverse proxies to block crawlers outright.
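
      Roughly, the nginx variant boils down to something like this (my own sketch with a couple of example user agents, not the repo's actual file, which covers a much longer list):

      # the map directive has to live in the http{} block
      map $http_user_agent $ai_robot {
          default          0;
          "~*GPTBot"       1;
          "~*ClaudeBot"    1;
          "~*Bytespider"   1;
      }

      server {
          # drop any request whose user agent matched above
          if ($ai_robot) {
              return 403;
          }
      }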

      • zod000@lemmy.dbzer0.com

        That was kind of the point of my comment, since the name didn’t indicate that. Also, many of the tools companies use won’t or can’t consume these files but could still make use of the info. I’m in exactly that situation, so I wanted people to know it could still be worth their time to take a look.

    • Ulrich@feddit.org

      robots.txt doesn’t do any sort of blocking. It’s nothing more than a request. This is active blocking.

      Although I’m not sure how successful it will be, given the determination of these bots.

      • zod000@lemmy.dbzer0.com

        A few of them are quite good at randomizing their user-agent and using a large number of IP blocks. I’ve not had a fun time trying to limit them.

        • Ulrich@feddit.org

          Yeah dude, they’re extremely malicious and not even trying to hide it anymore. They don’t give a fuck that they’re DDOSing the entire internet.

  • db0@lemmy.dbzer0.com

    That’s pretty sweet, but just be aware that a lot of bots are bad actors and don’t advertise a proper user agent, so you also have to block by IP. Blocking all Alibaba server IPs is a good start.
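
    For nginx that mostly comes down to deny rules, something like this sketch (the CIDRs below are placeholder documentation ranges, not real Alibaba blocks, so you'd have to look those up yourself):

    # works in the http, server, or location context
    # 203.0.113.0/24 and 198.51.100.0/24 are RFC 5737 placeholder ranges
    deny  203.0.113.0/24;
    deny  198.51.100.0/24;
    allow all;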

    • plz1@lemmy.world

      This is an nginx reverse proxy configuration. It’s not passive like robots.txt, but they probably named it that way in solidarity with the intent of robots.txt. You’re on point about Alibaba, though, and those ranges could probably be added to this nginx blocking strategy fairly easily. Anubis is still probably a better solution, since it doesn’t depend on LLM bots sending an honest user agent.

  • hendrik@palaver.p3x.de

    Many AI crawlers don’t identify themselves properly; they fake the User-Agent and pretend to be Google Chrome or something like that. So this is bound to only deal with the ones that somehow behave, while doing nothing about the really bad ones. And in my experience, those can make enough requests per second to get an average server into trouble. At least that’s what happened to mine.
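
    About the only generic lever left for those is rate limiting. A rough nginx sketch, with made-up numbers you'd want to tune to your own traffic:

    # in the http{} block: bucket clients by IP at roughly 5 requests/second each
    limit_req_zone $binary_remote_addr zone=crawlers:10m rate=5r/s;

    server {
        location / {
            # allow short bursts, answer the overflow with 429
            limit_req zone=crawlers burst=20 nodelay;
            limit_req_status 429;
        }
    }

    Though per-IP limits only go so far once a crawler spreads the load across whole address blocks, as mentioned elsewhere in the thread.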

  • alecsargent@lemmy.zip

    If you are using Hugo, use this robots.txt template, which updates automatically on every build:

    {{- $url := "https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/refs/heads/main/robots.txt" -}}
    {{- $resource := resources.GetRemote $url -}}
    {{- with try $resource -}}
      {{ with .Err }}
        {{ errorf "%s" . }}
      {{ else with .Value }}
        {{- .Content -}}
      {{ else }}
        {{ errorf "Unable to get remote resource %q" $url }}
      {{ end }}
    {{ end -}}
    
    Sitemap: {{ "sitemap.xml" | absURL }}
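
    (Assuming the usual Hugo wiring: save the template as layouts/robots.txt and set enableRobotsTXT = true in your site configuration so it actually gets rendered.)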
    

    Optionally, lead rogue bots to poisoned pages:

    {{- $url := "https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/refs/heads/main/robots.txt" -}}
    {{- $resource := resources.GetRemote $url -}}
    {{- with try $resource -}}
      {{ with .Err }}
        {{ errorf "%s" . }}
      {{ else with .Value }}
        {{- printf "%s\n%s\n\n" "User-Agent: *" "Disallow: /train-me" }}
        {{- .Content -}}
      {{ else }}
        {{ errorf "Unable to get remote resource %q" $url }}
      {{ end }}
    {{ end -}}
    
    Sitemap: {{ "sitemap.xml" | absURL }}
    

    Check out how to poison your pages for rogue bots in this article.

    The repo was deleted and the Internet Archive copy was excluded.

    I use Quixotic and a Python script to poison the pages and I included those in my site update script.

    It’s all cobbled together in amateur fashion from the deleted article, but it’s honest work.