The robots.txt specification was created by consensus in June 1994 as a way to instruct robots (search engine spiders/web crawlers) not to index certain areas of a web site. The protocol for these rules was simple: define the user agent (the identity of the robot), the keyword DISALLOW, and the relative path to the folder or document to restrict.
User-agent: *
Disallow: /
This example blocks all (*) robots from the root of the web site and all its sub-folders, effectively blocking access to your entire site. This single-keyword approach to blocking made it impossible to allow access to a folder inside a blocked one, so a more verbose set of rules was required. For example, if you wanted to restrict every robot from indexing all parts of your website except the Public folder, you might define a robots.txt like this:
User-agent: *
Disallow: /admin
Disallow: /books
Disallow: /customers
Disallow: /ecomm
Disallow: /notes
Disallow: /training
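To see how a crawler might evaluate these prefix rules, here is a minimal sketch using Python's standard urllib.robotparser module. The crawler name "ExampleBot" and the two test paths are assumptions made up for illustration; they do not come from any real site.

```python
from urllib.robotparser import RobotFileParser

# The rules from the example above, one directive per line.
RULES = """\
User-agent: *
Disallow: /admin
Disallow: /books
Disallow: /customers
Disallow: /ecomm
Disallow: /notes
Disallow: /training
"""

parser = RobotFileParser()
parser.parse(RULES.splitlines())

# "ExampleBot" is a hypothetical crawler name used only for illustration.
admin_ok = parser.can_fetch("ExampleBot", "/admin/login.html")
public_ok = parser.can_fetch("ExampleBot", "/public/index.html")

print(admin_ok)   # False: the path starts with the blocked prefix /admin
print(public_ok)  # True: no Disallow rule matches /public
```

Note that any new folder added to the site stays crawlable until a matching Disallow line is added for it.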
The New Allow Directive
One of the new keywords added to solve this issue is ALLOW. This has the exact opposite effect of the existing DISALLOW keyword.
To improve our previous example using the ALLOW keyword:
User-agent: *
Disallow: /
Allow: /public
This is not only easier to read and maintain, but it also applies automatically to any new folders added.
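A path such as /public/index.html matches both rules, so the crawler has to decide which one wins. Implementations differ on this; one common approach, described in RFC 9309, is that the longest matching rule takes precedence and ALLOW wins a tie. The sketch below illustrates that approach only. The function name is_allowed and the rule tuples are hypothetical, not part of any library.

```python
def is_allowed(path, rules):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, prefix) pairs. The longest matching
    prefix wins; on equal length, "allow" beats "disallow".
    """
    best = None
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            candidate = (len(prefix), directive == "allow")
            if best is None or candidate > best:
                best = candidate
    # No matching rule means the path is allowed by default.
    return True if best is None else best[1]


RULES = [("disallow", "/"), ("allow", "/public")]

print(is_allowed("/public/index.html", RULES))  # True
print(is_allowed("/admin/login.html", RULES))   # False
print(is_allowed("/brand-new-folder/", RULES))  # False
```

Some older parsers instead apply the first matching rule in file order, so it is worth testing the file against the crawlers you care about.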
Wildcard Url Patterns
One of the most powerful new features added to the protocol is the use of the wildcard '*' character in url paths. This allows you to identify url paths with a pattern, in which the wildcard character stands for any number of characters. For example,
# Block access to all pdf files.
Disallow: *.pdf$
# Block access to image folders from the Public folder
Disallow: /public/*/images/*
# Allow access to all folders beginning with the word client
Allow: /public/client*
The dollar sign '$' above specifies this string should appear at the end of the url path.
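One way to picture how these patterns behave is to translate them into regular expressions. The following is a minimal sketch, assuming '*' matches any run of characters, a trailing '$' anchors the end of the path, and everything else is a literal prefix match. The helper names pattern_to_regex and matches, and the sample paths, are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape the literal pieces and join them with '.*' for each wildcard.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + ("$" if anchored else ""))


def matches(pattern, path):
    # re.match anchors at the start of the path, like a prefix rule.
    return pattern_to_regex(pattern).match(path) is not None


print(matches("*.pdf$", "/docs/report.pdf"))                            # True
print(matches("*.pdf$", "/docs/report.pdf?download=1"))                 # False
print(matches("/public/*/images/*", "/public/acme/images/logo.png"))    # True
print(matches("/public/client*", "/public/clients/acme"))               # True
print(matches("/public/client*", "/public/contact"))                    # False
```

The second call shows the effect of '$': without it, the pattern would also match urls that merely contain ".pdf" followed by other characters.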
Sitemap Location
A sitemap is an xml file defining all of the pages available on your site as well as some other meta information on these files. You can now define the location of your sitemap in the robots.txt file as follows:
Sitemap: http://mysite.com/sitemap.xml
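A crawler written in Python could read this line with the standard urllib.robotparser module, whose site_maps() method is available from Python 3.8 onwards. This is only a sketch: the Disallow rule in the sample text is an assumption added to make the file look realistic.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
Sitemap: http://mysite.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# site_maps() returns a list of sitemap urls, or None if none are declared.
sitemaps = parser.site_maps()
print(sitemaps)  # ['http://mysite.com/sitemap.xml']
```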
Crawl-Delay for Yahoo and MSN Robots Only
This new keyword allows you to define the delay in seconds between each fetch request for a file on your site. This can be used if your site is under load from a robot indexing it and the load is affecting your visitors.
User-agent: Slurp
Crawl-delay: 2.0
This instructs Yahoo to leave 2 seconds between each request.
User-agent: MSNBot
Crawl-delay: 0.5
This instructs Microsoft Live Search to leave half a second between each request (2 requests per second).
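The relationship between the delay and the request rate is simply rate = 1 / delay. As a rough illustration, the sketch below reads the Crawl-delay value for each user agent as a fractional number of seconds and converts it to requests per second. The function crawl_delays is a hypothetical helper written for this example, not part of any crawler, and it assumes every Crawl-delay value is a valid number.

```python
def crawl_delays(robots_txt):
    """Map each user agent (lower-cased) to its Crawl-delay in seconds."""
    delays = {}
    agents = []       # user agents of the group currently being read
    in_rules = False  # True once the current group has seen a rule line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent after rules starts a new group
                agents = []
                in_rules = False
            agents.append(value.lower())
        else:
            in_rules = True
            if field == "crawl-delay":
                for agent in agents:
                    delays[agent] = float(value)
    return delays


ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 2.0

User-agent: MSNBot
Crawl-delay: 0.5
"""

delays = crawl_delays(ROBOTS_TXT)
for agent, seconds in delays.items():
    print(agent, seconds, "seconds,", 1 / seconds, "requests per second")
```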
Improve your SEO with Robots.txt
Using these new features of the robots.txt protocol, you can help the search engines index content on your site, ultimately improving your site's SEO (search engine optimization).
Many sites have the option to view the page in a print friendly manner. You probably do not want a user to click on the print friendly version from the search results in Google or have the engines detect duplicate content within your site. Using a wildcard, you can simply restrict access to the print version of a page, for example:
User-agent: *
Disallow: *.aspx?printer=true
I personally find it very annoying when a blog is referenced in the SERPs (search engine results pages) using a page-number url. For example, a page of a fictitious blog, www.mytechblob.com/page14, has been indexed and appears in the SERPs. This page was indexed as a snapshot in time of the content on page 14. As more posts are added to the blog, this indexed page becomes stale. It makes much more sense to index only the articles linked from this page, rather than the snippet of each post shown on the summary page. You could block a robot from indexing content on these pages with:
User-agent: *
Disallow: /page*
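Before deploying patterns like the two in this section, it can help to check them against a handful of real urls from your site. The sketch below does this with a small hypothetical helper, rule_matches, which treats '*' as "any run of characters" and everything else as a literal prefix. The sample paths are made up for illustration.

```python
import re


def rule_matches(pattern, url_path):
    """Check a robots.txt pattern (with '*' wildcards) against a url path."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, url_path) is not None


# Print-friendly versions are blocked, the normal page is not.
print(rule_matches("*.aspx?printer=true", "/article.aspx?printer=true"))  # True
print(rule_matches("*.aspx?printer=true", "/article.aspx"))               # False

# Paginated summary pages are blocked...
print(rule_matches("/page*", "/page14"))            # True
# ...but so is anything else whose path starts with /page.
print(rule_matches("/page*", "/pages/about.html"))  # True
print(rule_matches("/page*", "/post/hello-world"))  # False
```

The fourth call is the one to watch: a broad prefix such as /page* blocks every url that begins with those characters, so check that no content you want indexed lives under a similar path.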