How robots.txt Files Work


A robots.txt file lets a website tell bots which parts of the site shouldn’t be crawled.

If a website has a robots.txt file, it’s always in the same place: at the root of the domain.

So for www.google.com, it’s located here:

https://www.google.com/robots.txt
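Because the file always lives at the root, you can derive its URL from any page URL. Here is a minimal sketch using Python’s standard urllib.parse module (the page URL below is just an illustration):

    from urllib.parse import urlparse

    def robots_txt_url(page_url: str) -> str:
        """Return the robots.txt URL for whatever site a page lives on."""
        parts = urlparse(page_url)
        # robots.txt always sits at the root of the scheme + host combination
        return f"{parts.scheme}://{parts.netloc}/robots.txt"

    print(robots_txt_url("https://www.google.com/search?q=robots"))
    # -> https://www.google.com/robots.txt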

What robots.txt Does

Bots are supposed to avoid fetching URLs that the robots.txt file disallows, but they don’t always behave nicely.

Blocking a page or section of a website with robots.txt won’t necessarily stop search engines from displaying it in the search results. It’s more like a meek suggestion to bots, “please don’t fetch these URLs, but there is nothing I can do to stop you”.
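To make that concrete, here is a rough sketch (not any particular crawler’s real code) of how a well-behaved bot might consult robots.txt before fetching a page, using Python’s built-in urllib.robotparser. The user agent name "ExampleBot" is made up, and a misbehaving bot would simply skip this check:

    from urllib.robotparser import RobotFileParser

    # A polite crawler reads the site's robots.txt first...
    parser = RobotFileParser()
    parser.set_url("https://www.google.com/robots.txt")
    parser.read()

    # ...and only fetches a URL if the rules allow it for its user agent.
    url = "https://www.google.com/search?q=test"
    if parser.can_fetch("ExampleBot", url):
        print("allowed to fetch", url)
    else:
        print("robots.txt asks us not to fetch", url)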

The Four Forms of Sites

As mentioned above, a robots.txt file will always be found at the root of a website. So to find the file on google.com, you would visit google.com/robots.txt.

The full URL is:

https://www.google.com/robots.txt

If you visit any of the other forms, Google will redirect you to the HTTPS, WWW version of the site.

Google considers these forms to be four different websites, so a robots.txt file on one of them won’t affect URLs on the other forms:

http://google.com
https://google.com
http://www.google.com
https://www.google.com

So a robots.txt file at http://google.com/robots.txt would have different effects from one at https://www.google.com/robots.txt, because Google considers them to be different sites.
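One way to see why these count as separate sites: each combination of protocol and host is a different origin, and each origin gets its own robots.txt. A small Python illustration:

    from urllib.parse import urlparse

    forms = [
        "http://google.com",
        "https://google.com",
        "http://www.google.com",
        "https://www.google.com",
    ]

    # Each scheme + host pair is a distinct site with its own robots.txt.
    for form in forms:
        parts = urlparse(form)
        print(f"{parts.scheme}://{parts.netloc}/robots.txt")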

(The best practice is to redirect three of those to a single version, which we’ll cover later. Google already does this, so in practice there is only one robots.txt file.)
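If you fetch robots.txt with a client that follows redirects, you end up at whichever form the site redirects to. A quick sketch using Python’s urllib.request, which follows HTTP redirects by default:

    from urllib.request import urlopen

    # urlopen follows HTTP redirects automatically, so requesting the
    # robots.txt of one form of the site may land on another form.
    with urlopen("http://google.com/robots.txt") as response:
        print(response.geturl())   # the final URL after any redirects
        rules = response.read().decode("utf-8", errors="replace")
        print(rules[:200])         # the first few lines of the file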

A full robots.txt tutorial is coming soon. In the meantime, check out Google’s documentation to learn about the syntax of the file.

Takeaways

Things you should remember from this section:

The robots.txt file always lives at the root of the domain, e.g. https://www.google.com/robots.txt.

robots.txt is a request, not a lock: bots are supposed to obey it, but blocking a URL won’t necessarily keep it out of search results.

The HTTP/HTTPS and WWW/non-WWW combinations count as four different sites, and a robots.txt file on one of them doesn’t affect the others.

