• zod000@lemmy.dbzer0.com

    Most AI crawlers don’t respect robots.txt files, but this info might be useful for other forms of blocking.

    • Vittelius@feddit.orgOP

      The repo, despite its name, doesn’t only contain a robots.txt. It also has files for popular reverse proxies to block crawlers outright.
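
      Roughly, the nginx variant boils down to something like this (my own sketch with a couple of example user agents, not the repo's actual file, which covers a much longer list):

      # the map directive has to live in the http{} block
      map $http_user_agent $ai_robot {
          default          0;
          "~*GPTBot"       1;
          "~*ClaudeBot"    1;
          "~*Bytespider"   1;
      }

      server {
          # drop any request whose user agent matched above
          if ($ai_robot) {
              return 403;
          }
      }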

      • zod000@lemmy.dbzer0.com

        That was kind of the point of my comment, since the name didn’t indicate that. Also, many of the tools companies use won’t or can’t consume these files but could still make use of the info. I’m in exactly that situation, so I wanted people to know it could still be worth their time to take a look.

    • Ulrich@feddit.org

      robots.txt doesn’t do any sort of blocking. It’s nothing more than a request. This is active blocking.

      Although I’m not sure how successful it will be, given the determination of these bots.

      • zod000@lemmy.dbzer0.com

        A few of them are quite good at randomizing their user-agent and using a large number of IP blocks. I’ve not had a fun time trying to limit them.

        • Ulrich@feddit.org

          Yeah dude, they’re extremely malicious and not even trying to hide it anymore. They don’t give a fuck that they’re DDOSing the entire internet.

  • db0@lemmy.dbzer0.com

    That’s pretty sweet, but just be aware that a lot of bots are bad actors and don’t advertise a proper user agent, so you also have to block by IP. Blocking all Alibaba server IPs is a good start.
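
    For nginx that mostly comes down to deny rules, something like this sketch (the CIDRs below are placeholder documentation ranges, not real Alibaba blocks, so you'd have to look those up yourself):

    # works in the http, server, or location context
    # 203.0.113.0/24 and 198.51.100.0/24 are RFC 5737 placeholder ranges
    deny  203.0.113.0/24;
    deny  198.51.100.0/24;
    allow all;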

    • plz1@lemmy.world

      This is an nginx reverse proxy configuration. It’s not passive like robots.txt, but they probably named it that way in solidarity with the intent of robots.txt. You’re on point about Alibaba, though, and those ranges could probably be added to this nginx blocking strategy fairly easily. Anubis is still probably a better solution, since it doesn’t depend on LLM bots sending an honest user agent.

  • hendrik@palaver.p3x.de

    Many AI crawlers don’t identify themselves properly; they fake the User-Agent and pretend to be Google Chrome or something like that. So this is bound to only deal with the ones that somehow behave, while doing nothing about the really bad ones. And in my experience, those can make enough requests per second to get an average server into trouble. At least that’s what happened to mine.
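
    About the only generic lever left for those is rate limiting. A rough nginx sketch, with made-up numbers you'd want to tune to your own traffic:

    # in the http{} block: bucket clients by IP at roughly 5 requests/second each
    limit_req_zone $binary_remote_addr zone=crawlers:10m rate=5r/s;

    server {
        location / {
            # allow short bursts, answer the overflow with 429
            limit_req zone=crawlers burst=20 nodelay;
            limit_req_status 429;
        }
    }

    Though per-IP limits only go so far once a crawler spreads the load across whole address blocks, as mentioned elsewhere in the thread.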

  • alecsargent@lemmy.zip

    If you are using Hugo, use this robots.txt template, which updates automatically on every build:

    {{- $url := "https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/refs/heads/main/robots.txt" -}}
    {{- $resource := resources.GetRemote $url -}}
    {{- with try $resource -}}
      {{ with .Err }}
        {{ errorf "%s" . }}
      {{ else with .Value }}
        {{- .Content -}}
      {{ else }}
        {{ errorf "Unable to get remote resource %q" $url }}
      {{ end }}
    {{ end -}}
    
    Sitemap: {{ "sitemap.xml" | absURL }}
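
    (Assuming the usual Hugo wiring: save the template as layouts/robots.txt and set enableRobotsTXT = true in your site configuration so it actually gets rendered.)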
    

    Optionally, lead rogue bots to poisoned pages:

    {{- $url := "https://raw.githubusercontent.com/ai-robots-txt/ai.robots.txt/refs/heads/main/robots.txt" -}}
    {{- $resource := resources.GetRemote $url -}}
    {{- with try $resource -}}
      {{ with .Err }}
        {{ errorf "%s" . }}
      {{ else with .Value }}
        {{- printf "%s\n%s\n\n" "User-Agent: *" "Disallow: /train-me" }}
        {{- .Content -}}
      {{ else }}
        {{ errorf "Unable to get remote resource %q" $url }}
      {{ end }}
    {{ end -}}
    
    Sitemap: {{ "sitemap.xml" | absURL }}
    

    Check out how to poison your pages for rogue bots in this article.

    The repo was deleted and the Internet Archive copy was excluded.

    I use Quixotic and a Python script to poison the pages and I included those in my site update script.

    It’s all cobbled together in amateur fashion from the deleted article, but it’s honest work.