How robots.txt Files Work
A robots.txt file lets a website tell bots which parts of the site shouldn’t be crawled.
If a website has a robots.txt file, it’s always in the same place: at the root of the domain.
So for www.google.com, it’s located here:
https://www.google.com/robots.txt
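Because the file always lives at the root, you can derive its URL from any page on the site. Here is a minimal sketch using Python’s standard library (the helper name `robots_url` is just for illustration):

```python
from urllib.parse import urlsplit, urlunsplit

def robots_url(page_url):
    # robots.txt always lives at the root of the scheme + host,
    # so drop the path, query, and fragment of the page URL.
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))

print(robots_url("https://www.google.com/search?q=test"))
# → https://www.google.com/robots.txt
```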
What robots.txt Does
Bots are supposed to avoid fetching URLs that the robots.txt file disallows, but they don’t always behave nicely.
Blocking a page or section of a website with robots.txt won’t necessarily stop search engines from displaying it in the search results. It’s more like a meek suggestion to bots: “please don’t fetch these URLs, but there is nothing I can do to stop you.”
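To see what a well-behaved bot actually does with these rules, here is a sketch using Python’s standard-library `urllib.robotparser` against a hypothetical robots.txt for example.com (the domain and rules are made up for illustration):

```python
from urllib.robotparser import RobotFileParser

# An example robots.txt, parsed in memory (no network fetch needed).
rules = """
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# A polite bot checks each URL before fetching it.
print(parser.can_fetch("*", "https://example.com/private/page"))  # False
print(parser.can_fetch("*", "https://example.com/public/page"))   # True
```

Note that nothing in this check is enforced: a misbehaving bot can simply skip the `can_fetch` call and fetch the URL anyway.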
The Four Forms of Sites
As mentioned above, a robots.txt file will always be found at the root of a website. So to find the file on google.com, you would visit google.com/robots.txt.
The full URL is:
https://www.google.com/robots.txt
Google will redirect you to the HTTPS, WWW version of its site.
Google considers these forms to be four different websites, so a robots.txt file on one of them won’t affect URLs on the other forms:
- http://google.com/ — no-HTTPS, no-www
- http://www.google.com/ — no-HTTPS, yes-www
- https://google.com/ — yes-HTTPS, no-www
- https://www.google.com/ — yes-HTTPS, yes-www
So a robots.txt file at http://google.com/robots.txt would have different effects from one at https://www.google.com/robots.txt, because Google considers them to be different sites.
(The best practice is to redirect three of those forms to a single canonical version, which we will practice later. Google redirects them, so in effect there is only one file.)
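The four forms are just the combinations of scheme and host. A short sketch enumerating them:

```python
from itertools import product

# The four possible forms of one site; Google treats each as a
# separate website with its own robots.txt.
forms = [
    f"{scheme}://{host}/robots.txt"
    for scheme, host in product(["http", "https"], ["google.com", "www.google.com"])
]
for url in forms:
    print(url)
```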
A full robots.txt tutorial is coming soon. In the meantime, check out Google’s documentation to learn about the syntax of the file.
Takeaways
Things you should remember from this section:
- The robots.txt file can be used to politely ask bots not to crawl parts of a site.
- Bots don’t have to pay attention to the robots.txt file.
- Even if a site blocks Google with robots.txt, Google still might list the blocked pages in the search results. (Other ways to block search engines are coming next.)
- Every website has four possible forms, and they are all considered different sites by Google. The forms are based on HTTPS vs. no-HTTPS and WWW vs. no-WWW.