How robots.txt Files Work

A robots.txt file allows a website to tell bots which parts of the site shouldn’t be crawled.

If a website has a robots.txt file, it’s always in the same place: at the root of the domain.

So for www.google.com, it’s located here:

https://www.google.com/robots.txt
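
Because the location is fixed, the robots.txt address can be worked out from any page URL on the same site. The following is a minimal sketch using Python's standard library; the function name `robots_txt_url` and the sample search URL are made up for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the site that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, swap the path for /robots.txt,
    # and drop any query string or fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


print(robots_txt_url("https://www.google.com/search?q=robots"))
# prints: https://www.google.com/robots.txt
```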

What robots.txt Does

Bots are supposed to avoid fetching URLs that the robots.txt file disallows, but they don’t always behave nicely.

Blocking a page or section of a website with robots.txt won’t necessarily stop search engines from displaying it in the search results. The file only asks bots not to crawl, so a blocked URL can still be listed if other pages link to it. It’s more like a meek suggestion to bots: “Please don’t fetch these URLs, but there is nothing I can do to stop you.”
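
To make the "suggestion" idea concrete, here is a sketch of what a well-behaved bot might do before fetching a URL, using Python's standard `urllib.robotparser` module. The rules, the `ExampleBot` name, and the example.com URLs are all hypothetical and are not taken from any real site. Note that the check happens entirely on the bot's side; nothing on the server enforces it.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules for illustration; not a real site's robots.txt.
EXAMPLE_RULES = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
# Parse the rules from memory instead of downloading them.
parser.parse(EXAMPLE_RULES.splitlines())


def may_fetch(url: str, user_agent: str = "ExampleBot") -> bool:
    """Return True if the parsed rules allow user_agent to fetch url."""
    return parser.can_fetch(user_agent, url)


print(may_fetch("https://www.example.com/private/report.html"))  # False
print(may_fetch("https://www.example.com/public/index.html"))    # True
```

A bot that never runs a check like this can still request the disallowed URL, which is why robots.txt is a request rather than a barrier.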

The Four Forms of Sites

As mentioned above, a robots.txt file will always be found at the root of a website. So to find the file on google.com, you would visit google.com/robots.txt.

The full URL is:

https://www.google.com/robots.txt

Google will redirect you to the HTTPS, WWW version of its site.

Google considers these forms to be four different websites, so a robots.txt file on one of them won’t affect URLs on the other forms:

  • http://google.com/ — no-HTTPS, no-www
  • http://www.google.com/ — no-HTTPS, yes-www
  • https://google.com/ — yes-HTTPS, no-www
  • https://www.google.com/ — yes-HTTPS, yes-www

So a robots.txt file at http://google.com/robots.txt would have different effects from one at https://www.google.com/robots.txt, because Google considers them to be different sites.
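
One way to picture the four forms is to treat the protocol and hostname together as the identity of a site. The sketch below follows that assumption; the helper names `site_key` and `same_robots_txt` are invented for illustration and do not describe how Google implements it.

```python
from urllib.parse import urlsplit


def site_key(url: str) -> tuple:
    """Identify a site by its (scheme, host) pair."""
    parts = urlsplit(url)
    return (parts.scheme.lower(), parts.netloc.lower())


def same_robots_txt(url_a: str, url_b: str) -> bool:
    """Return True if both URLs fall under the same robots.txt file."""
    return site_key(url_a) == site_key(url_b)


forms = [
    "http://google.com/",
    "http://www.google.com/",
    "https://google.com/",
    "https://www.google.com/",
]

# Each form yields a different key, so each would need its own robots.txt.
print(len({site_key(form) for form in forms}))  # 4
print(same_robots_txt("http://google.com/a", "https://www.google.com/a"))  # False
```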

(The best practice is to redirect three of those forms to the fourth, which we will practice later. Google does this, so only one robots.txt file is in effect.)

A full robots.txt tutorial is coming soon. In the meantime, check out Google’s documentation to learn about the syntax of the file.

Takeaways

Things you should remember from this section:

  • The robots.txt file can be used to politely ask bots not to crawl parts of a site.
  • Bots don’t have to pay attention to the robots.txt file.
  • Even if a site blocks Google with robots.txt, Google might still list the blocked pages in the search results. (Other ways to block search engines are coming next.)
  • Every website has four possible forms, and they are all considered different sites by Google. The forms are based on HTTPS vs. no-HTTPS and WWW vs. no-WWW.
