Adventures in search and .net
# Monday, June 30, 2008

The robots.txt specification was created by consensus in June 1994 as a way to instruct robots (search engine spiders and web crawlers) not to index certain areas of a web site. The protocol for these rules was simple: define the user agent (the identity of the robot), followed by the keyword DISALLOW and the relative path of the folder or document to restrict.

User-agent: *
Disallow: /

This example blocks all (*) robots from the root of the web site and all of its sub-folders, effectively blocking access to your entire site. Because DISALLOW was the only keyword, there was no way to re-allow access to a folder inside a blocked one, so opening up a single folder required a more verbose set of rules. For example, if you wanted to stop every robot from indexing any part of your website except the Public folder, you might define a robots.txt like this:

User-agent: *
Disallow: /admin
Disallow: /books
Disallow: /customers
Disallow: /ecomm
Disallow: /notes
Disallow: /training
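
To check how a crawler would read these rules, here is a minimal sketch using Python's standard urllib.robotparser module. The crawler name "ExampleBot" and the sample paths are assumptions made for illustration; the rules are the ones from the example, written as a single group for all robots.

```python
from urllib.robotparser import RobotFileParser

# Rules from the example above, written as one group for all robots.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin
Disallow: /books
Disallow: /customers
Disallow: /ecomm
Disallow: /notes
Disallow: /training
"""


def build_parser(robots_txt):
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "ExampleBot" is a hypothetical crawler name; it falls under the "*" group.
print(parser.can_fetch("ExampleBot", "/public/index.html"))
print(parser.can_fetch("ExampleBot", "/admin/users.aspx"))
```

Each Disallow value is treated as a path prefix, so a rule for /admin also covers a path such as /administrator.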

The New Allow Directive

One of the new keywords added to solve this issue is ALLOW. This has the exact opposite effect of the existing DISALLOW keyword.

To improve our previous example using the ALLOW keyword:

User-agent: *
Disallow: /
Allow: /public

This is not only easier to read and maintain, it also automatically covers any new folders added later.
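
The post does not say which rule wins when both an ALLOW and a DISALLOW match the same path. The sketch below assumes the most specific (longest) matching rule wins, which is how Google documents its crawler's behaviour. Parsers that apply the first matching rule instead (Python's urllib.robotparser is one) would need the Allow line listed before the Disallow line. The function name and rule representation are hypothetical.

```python
def is_allowed(path, rules):
    """Decide whether `path` may be crawled.

    `rules` is a list of (directive, prefix) tuples, where directive is
    "allow" or "disallow". Assumption: the longest matching prefix wins,
    and an allow rule wins a tie.
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        if not path.startswith(prefix):
            continue
        longer = len(prefix) > best_len
        tie_for_allow = len(prefix) == best_len and directive == "allow"
        if longer or tie_for_allow:
            best_len = len(prefix)
            allowed = directive == "allow"
    return allowed


# The two rules from the example above.
RULES = [("disallow", "/"), ("allow", "/public")]

print(is_allowed("/public/index.html", RULES))
print(is_allowed("/admin/users.aspx", RULES))
```

A folder created later, for example /newfolder, matches only the Disallow: / rule and so is blocked without any change to the file.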

Wildcard URL Patterns

One of the most powerful new features added to the protocol is the use of the wildcard '*' character in URL paths. This allows you to identify URL paths with a pattern, where the wildcard character stands for any number of characters. For example,

# Block access to all pdf files.
Disallow: /*.pdf$
# Block access to image folders under the Public folder
Disallow: /public/*/images/*
# Allow access to all folders beginning with the word client
Allow: /public/client*

The dollar sign '$' in the first rule specifies that the preceding string must appear at the end of the URL path.
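
One way to reason about these patterns is to translate them into regular expressions. The following is a minimal sketch, not any search engine's actual matcher: it assumes '*' matches any run of characters, a trailing '$' anchors the pattern to the end of the URL, and everything else is literal. The helper names are hypothetical.

```python
import re


def pattern_to_regex(pattern):
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with '.*' for every '*'.
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    # Without '$' the pattern is a prefix match; with '$' it must reach the end.
    return re.compile("^" + body + (r"\Z" if anchored else ""))


def matches(pattern, path):
    """Return True when the URL path (plus query string) matches the pattern."""
    return pattern_to_regex(pattern).match(path) is not None


print(matches("/*.pdf$", "/docs/report.pdf"))
print(matches("/*.pdf$", "/docs/report.pdf?download=1"))
print(matches("/public/*/images/*", "/public/gallery/images/logo.png"))
print(matches("/public/client*", "/public/clients/acme"))
```

The same helper can be used to try out any other wildcard rule before publishing it in a live robots.txt.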

Sitemap Location

A sitemap is an XML file listing the pages available on your site, along with some meta information about them. You can now declare the location of your sitemap in the robots.txt file as follows:

Sitemap: http://mysite.com/sitemap.xml
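
As a quick check that the Sitemap line is picked up, here is a small sketch using the site_maps() method of Python's urllib.robotparser (available from Python 3.8). The surrounding rules in the sample file and the helper name are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# Sample file: the Sitemap line from above plus one assumed rule group.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin

Sitemap: http://mysite.com/sitemap.xml
"""


def find_sitemaps(robots_txt):
    """Return every sitemap URL declared in the robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # site_maps() returns None when no Sitemap line is present.
    return parser.site_maps() or []


print(find_sitemaps(ROBOTS_TXT))
```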

Crawl-Delay for Yahoo and MSN Robots Only

This new keyword allows you to define the delay in seconds between each fetch request for a file on your site. This can be used if your site is under load because a robot is indexing it, which in turn impacts your visitors.

User-agent: Slurp
Crawl-delay: 2.0

This instructs Yahoo to leave 2 seconds between each request.

User-agent: MSNBot
Crawl-delay: 0.5

This instructs Microsoft Live Search to leave half a second between each request (2 requests per second).
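
To illustrate how a crawler might read these values, the sketch below parses Crawl-delay lines by hand so that fractional values such as 0.5 are kept. It is a simplified reading of the file format under stated assumptions (groups start at a User-agent line, '#' starts a comment), and the function name is hypothetical.

```python
def parse_crawl_delays(robots_txt):
    """Map each lower-cased user agent to its Crawl-delay in seconds."""
    delays = {}
    agents = []        # user agents of the group currently being read
    in_rules = False   # True once the group has a non User-agent line
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                agents, in_rules = [], False
            agents.append(value.lower())
        else:
            in_rules = True
            if field == "crawl-delay":
                try:
                    seconds = float(value)
                except ValueError:
                    continue  # ignore malformed values
                for agent in agents:
                    delays[agent] = seconds
    return delays


ROBOTS_TXT = """\
User-agent: Slurp
Crawl-delay: 2.0

User-agent: MSNBot
Crawl-delay: 0.5
"""

delays = parse_crawl_delays(ROBOTS_TXT)
print(delays)
```

Dividing 1 by the delay gives the maximum request rate, so a delay of 0.5 seconds corresponds to at most 2 requests per second.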

Improve your SEO with Robots.txt

Using these new features of the robots.txt protocol, you can help the search engines index content on your site, ultimately improving your site's SEO (search engine optimization).

Many sites offer the option to view a page in a print-friendly layout. You probably do not want a user to land on the print-friendly version from the search results in Google, or have the engines detect duplicate content within your site. Using a wildcard, you can simply restrict access to the print version of a page, for example:

User-agent: *
Disallow: /*.aspx?printer=true


I personally find it very annoying when a blog appears in the SERPs (search engine results pages) under one of its paging URLs. For example, the page www.mytechblob.com/page14 of a fictitious blog has been indexed and appears in the SERPs. This page was indexed as a snapshot in time of the content on page 14. As more posts are added to the blog, the indexed page becomes stale. It makes much more sense to index only the articles linked from this page, rather than the snippet of each post shown on the summary page. You could block robots from indexing these pages with:

User-agent: *
Disallow: /page*
Sunday, June 29, 2008 11:17:41 PM (Pacific Standard Time, UTC-08:00)
Search
# Wednesday, June 25, 2008

Expiring Cached Resources

Web sites rely heavily on stylesheets for appearance and structure. CSS has the power to transform a plain page into something much more impressive. When you release a new build of your web application, you need to ensure your visitors view your web site using the correct stylesheet. If their browser renders your pages using an older cached copy of your stylesheet, pages may render with unexpected results. You cannot guarantee that every browser and proxy server will expire its cached copy of your stylesheet, even though the file has changed.

The common trick to force browsers to fetch a fresh copy of resource files (stylesheets, javascript files and images) is to append a custom querystring parameter to the file name. For example:

<link media="all" rel="stylesheet" type="text/css" href="http://www.aspectsearch.com/style.css?20080626" />

As the url to the resource is different to the one being held in the cache, the browser will request a new copy.  This value after the question mark can be anything, from a release date, to an assembly build version number.  As you release a new version, this value changes and forces the browser to retrieve a new copy of the file.

Stylesheet Inclusion with Theming

ASP.NET 2.0 provides a convenient way to group together templates to modify the look and feel of your web site.  Themes and skinning provide a set of resources to change the appearance of your site, including images, stylesheets and server control overrides. 

When you create a new theme, you can add one or more stylesheets in the root folder for the theme. ASP.NET will automatically render LINK tags to include all of these stylesheets, in filename order. This automatic inclusion is convenient, but because ASP.NET generates the tags, you cannot simply append a querystring parameter to them in your markup.

Control Adapters to the Rescue

A control adapter allows you to customize the rendering of a server control, modifying the markup or behavior for specific browsers. In our case we are going to hook into the rendering of the LINK tag.

You register a control adapter by creating a file in the App_Browsers folder of your web site. Below is the Link.browser file used to register the control adapter:

<browsers>
  <browser refID="Default">
    <controlAdapters>
      <adapter controlType="System.Web.UI.HtmlControls.HtmlLink" adapterType="Aspect.LinkControlAdapter" />
    </controlAdapters>
  </browser>
</browsers>

Now we need to implement some code in the LinkControlAdapter class to append our querystring parameter. 

For our purposes, the only method of the ControlAdapter class we need to override is Render.

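using System.Web.UI;
using System.Web.UI.Adapters;

// Declare this class in the Aspect namespace to match adapterType in Link.browser.
public class LinkControlAdapter : ControlAdapter
{
   // Wrap the writer so every attribute of the LINK tag passes through LinkControlTextWriter.
   protected override void Render(HtmlTextWriter writer)
   {
      base.Render(new LinkControlTextWriter(writer));
   }
}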

We encapsulate the render logic in our own text writer class.


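using System;
using System.IO;
using System.Web.UI;

public class LinkControlTextWriter : HtmlTextWriter
{
   public LinkControlTextWriter(HtmlTextWriter writer) : base(writer)
   {
      // Write straight to the underlying writer rather than through the wrapped HtmlTextWriter.
      base.InnerWriter = writer.InnerWriter;
   }

   public LinkControlTextWriter(TextWriter writer) : base(writer)
   {
      base.InnerWriter = writer;
   }

   public override void WriteAttribute(string name, string value, bool fEncode)
   {
      // Only touch href values that point at a local stylesheet with no querystring yet.
      if (name == "href" && value != null
          && value.IndexOf('?') < 0
          && value.EndsWith(".css", StringComparison.OrdinalIgnoreCase)
          && IsLocal(value))
      {
         value += "?" + GetVersionNumber();
      }

      base.WriteAttribute(name, value, fEncode);
   }

   // A url is treated as local unless it starts with an absolute http or https scheme.
   private bool IsLocal(string url)
   {
      return !url.StartsWith("http://", StringComparison.OrdinalIgnoreCase)
          && !url.StartsWith("https://", StringComparison.OrdinalIgnoreCase);
   }

   // Placeholder version; replace with a release date or an assembly build number.
   private string GetVersionNumber()
   {
      return "20080628";
   }
}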

During the WriteAttribute method we ensure the attribute being rendered is the href. If the value does not already have a querystring and it is for a local CSS stylesheet file, a version number is appended to it. You can replace the mechanism for generating the version number as required.

Wednesday, June 25, 2008 12:31:05 PM (Pacific Standard Time, UTC-08:00)
ASP.NET

Disclaimer
The opinions expressed herein are my own personal opinions and do not represent my employer's view in any way.

© Copyright 2009
Gavin Sansom