Build a robots.txt file from a data file

Recently, I was reviewing how I generate my blog, and I noticed that my site seems to get a lot of bot traffic. I would rather not see this traffic, so I investigated how I might limit it. After a bit of searching and thinking, this is what I have decided on for now.

The Idea

How do I prevent AI bots from scanning my pages and using up my bandwidth, while still allowing search engines to index them?

Clearly, the answer is the robots.txt file. But editing this file by hand is prone to mistakes. The better option is to have some sort of configuration file that is easy to edit and understand.

Since this is a Hugo-based site, I can create a file in my data directory and then write some templating code to generate robots.txt from it.
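Concretely, only two files are involved, both relative to the site root (the names below match what this post creates):

```text
data/
  robots.yaml    # the data file describing the crawler rules
layouts/
  robots.txt     # the template Hugo uses to render the final robots.txt
```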

Check your config file

There are a couple of things to check in your Hugo configuration file before getting to the meat of the solution.

  1. What format are data files stored in? Checking my configuration, I found that data files are stored in YAML format.

    metaDataFormat = "yaml"
    
  2. Check that robots.txt generation is turned on. Again, check the Hugo configuration file for:

    enableRobotsTXT = true
    

Create the data file used to generate the robots.txt file

Create a new file in your local data directory called robots.yaml and add the following content:

groups:
  - comment: "Block Meta AI Training Crawlers"
    user_agents:
      - meta-externalagent
      - Meta-ExternalAgent
      - FacebookBot
      - Meta-ExternalFetcher
    disallow:
      - /

  - comment: "Allow link preview crawler"
    user_agents:
      - facebookexternalhit
    allow:
      - /

  # Final item in the list
  - comment: "Default rules"
    user_agents:
      - "*"
    disallow:
      - /dist/
    allow:
      - /

This is the simple config that I have decided to go with for now. Note that I do not use the allow directive in every group.

My configuration contains three groups.

  1. Robots that I do not want to access my site at all.
  2. Robots that I want to allow access to my site (robots that do not use my site for AI training).
  3. Default rules for all other robots.

Create the robots.txt template used by Hugo to generate the actual robots.txt file

Create a new file in your layouts directory called robots.txt and add the following content:

{{- $data := site.Data.robots -}}
  {{- range $data.groups }}
    {{- if .comment }}
# {{ .comment }}
    {{- end }}

    {{- range .user_agents }}
User-agent: {{ . }}
    {{- end }}
    {{- range .disallow }}
Disallow: {{ . }}
    {{- end }}
    {{- range .allow }}
Allow: {{ . }}
    {{- end }}

{{ end }}
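If you want to see the transformation without running Hugo, here is a plain-Python sketch of what the template does. The groups structure mirrors robots.yaml (written as a dict literal to avoid a YAML dependency); this is an illustration, not part of the Hugo build:

```python
# Sketch of the template logic: walk each group, emitting the optional
# comment, then the User-agent, Disallow, and Allow lines in order.
groups = [
    {"comment": "Block Meta AI Training Crawlers",
     "user_agents": ["meta-externalagent", "Meta-ExternalAgent",
                     "FacebookBot", "Meta-ExternalFetcher"],
     "disallow": ["/"]},
    {"comment": "Allow link preview crawler",
     "user_agents": ["facebookexternalhit"],
     "allow": ["/"]},
    {"comment": "Default rules",
     "user_agents": ["*"],
     "disallow": ["/dist/"],
     "allow": ["/"]},
]

def render(groups):
    lines = []
    for group in groups:
        if group.get("comment"):
            lines.append(f"# {group['comment']}")
        for agent in group.get("user_agents", []):
            lines.append(f"User-agent: {agent}")
        for path in group.get("disallow", []):
            lines.append(f"Disallow: {path}")
        for path in group.get("allow", []):
            lines.append(f"Allow: {path}")
        lines.append("")  # blank line between groups
    return "\n".join(lines)

print(render(groups))
```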

The generated robots.txt file

When Hugo builds your site, it will use the template above and the data from robots.yaml to generate the actual robots.txt file.

This is the final robots.txt file that will be generated:

# Block Meta AI Training Crawlers
User-agent: meta-externalagent
User-agent: Meta-ExternalAgent
User-agent: FacebookBot
User-agent: Meta-ExternalFetcher
Disallow: /


# Allow link preview crawler
User-agent: facebookexternalhit
Allow: /


# Default rules
User-agent: *
Disallow: /dist/
Allow: /

Add Sitemap to robots.txt

You should probably also add a sitemap to your robots.txt file. You can either hardcode the URL or use something like the following in the template:

Sitemap: {{ .Site.BaseURL }}/sitemap.xml

This will tell search engines where to find your sitemap. Make sure the path matches your site’s URL structure; in particular, if your baseURL already ends with a trailing slash, the leading slash here will produce a double slash.

Caveats

The sitemap might not be generated if you have customised your [outputs] configuration. This was an issue for me, since I had modified my [outputs] configuration.
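I have not dug into every interaction here, but if you do customise [outputs], it is worth checking that each entry still lists every output format you rely on. As a hedged sketch only (the exact format names depend on your setup and Hugo version), a customised section looks something like:

```toml
# hugo.toml -- sketch only; compare against Hugo's defaults for your version
[outputs]
  home = ["html", "rss"]
```

If the sitemap stopped appearing after a change like this, comparing your list against Hugo's default output formats is a good first step.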

Caveats 2

As always, the robots.txt file is only a suggestion. Well-behaved crawlers respect it, but some bots have been intentionally coded to ignore it.