Robots.txt is the standard name for a file using the robots exclusion protocol, which allows you to tell robots (such as web crawlers) what to do with your site. Robots can and sometimes do ignore these rules, but a good number of them do respect it, and many crawlers identify themselves by name, so you can exclude specific crawlers while allowing others.
Robots.txt can tell robots what to do with different parts or all of your site, and can also address specific robots individually. For example, you can tell a specific robot to ignore your site while allowing all other robots to crawl it, or you can tell all robots to ignore a specific part of your site while allowing them to crawl the rest. Beyond allowing or disallowing access, robots.txt can't tell robots to do much else, with the exception of the Sitemap directive, which tells robots where to find your sitemap; the sitemap can then tell robots about all the pages on your site along with other optional information. I have a separate guide on sitemaps here.
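If you do want to use the Sitemap directive, it's just a single line in your robots.txt pointing at the full URL of your sitemap. As a minimal sketch, assuming your sitemap is a file named sitemap.xml at the root of your site (swap in your own domain for example.com):
Sitemap: https://example.com/sitemap.xml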
What robots.txt is also not capable of is preventing all bots from ever accessing your site. While many bots do respect robots.txt, especially those of larger corporations like Google, that doesn't mean every bot ever will. In a static site context like Neocities, where you don't have access to a backend, there's not much you can really do beyond trusting bots to respect the directives you declare, but dynamic sites can do more to prevent bots from accessing them, such as using captchas or a login system to make it harder for bots to reach the site's contents. Just because not all bots respect robots.txt doesn't mean it's useless though, just that it's not a foolproof way of protecting your site and its contents.
Another thing to note is that robots.txt does not prevent your site from being indexed by search engines. Your site can still be found and indexed to show up in search engine results; disallowing it in robots.txt only stops respecting robots from reading its content, not from listing it. If you want to prevent your site from being indexed by search engines, you would instead use the meta tag <meta name="robots" content="noindex"> in the head of each HTML page on your site. I plan to make a separate guide on this and other meta tags in the future.
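As a minimal sketch of where that tag would go (the page title here is just a placeholder), the head of each page would contain something like:
<head>
  <title>My page</title>
  <meta name="robots" content="noindex">
</head>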
To use a robots.txt, you simply need to create a text file named robots.txt (this is case sensitive, so make sure it's all in lowercase) and place it at the root (top directory) of your site. Inside the text file you can then start to write your rules. Rules are written in groups, and at the start of each group you need to specify which robots the following rules are for; these are the User-agents of the group. You can either have multiple lines naming the specific User-agents you want to address, or, if you want a rule to apply to all robots that respect robots.txt, you can use an asterisk (*) in place of a User-agent name. After you have specified the User-agents you want to address, you can write the rules for those User-agents. There are two directives you'd use for this: Allow and Disallow. Each rule goes on a new line and is written as a directive followed by a colon (:) and then the path you want the rule to apply to. For example, if you wanted to disallow all robots from accessing any path on your site, the group would look like this:
User-agent: *
Disallow: /
If you then wanted to allow a specific robot to access all of your site, you could create a new group after the previous one where you explicitly allow that robot. For example, the internet archiver (ia_archiver), the Internet Archive's crawler, respects robots.txt, so if you wanted to allow it to access and crawl all of your site so it can be included in their Wayback Machine archive, you could do so like this:
User-agent: ia_archiver
Allow: /
You can also mix and match these rules to your liking, with multiple groups for different robots and multiple rules within each group. For example, if you wanted to allow all robots to access your site except for one specific path, but still allow the internet archiver to access that path, you could do so like this:
User-agent: *
Allow: /
Disallow: /path/to/directory/
User-agent: ia_archiver
Allow: /path/to/directory/
Ultimately the answer here depends entirely on you, and on how you want your site to be treated and by which bots. Disallowing all bots from all of your site may be extreme for most people, while for some it may be exactly the desired effect, save for explicitly allowing only certain bots, like the internet archiver, to access their site. Others may want to protect certain parts of their site, such as a directory containing personal photographs or art that they don't want robots to access, while still allowing access to the HTML content of their site.
As an example of a robots.txt, here's mine, with comments (lines starting with #) to explain what each group does:
# Disallow all robots from accessing some of my assets' directories
User-agent: *
Disallow: /assets/fonts/
Disallow: /assets/images/
# Allow the internet archiver to access my whole site (overriding the disallow rules for the assets' directories above)
User-agent: ia_archiver
Allow: /
# Disallow some common bots associated with AI and machine learning from accessing my entire site
User-agent: CCBot
User-agent: ChatGPT-User
User-agent: GPTBot
User-agent: Google-Extended
User-agent: Omgilibot
User-agent: FacebookBot
Disallow: /