URL Rewriting using ASP.NET for SEO

2012-02-29 18:21  mleader1


Introduction

URL rewriting is the process of hiding a complex, parameterised, query string based URL such as http://www.somedomain.com/showproduct.aspx?id=12345&otherid=67890 behind a flat, often verbose URL that contains no query string parameters, such as http://www.somedomain.com/products/some-product-name.html. The idea is that the flat URL is requested from the server, and internally we determine the required parameters, then call the original query string based URL.

The main reason you might consider doing this is for SEO (Search Engine Optimisation) purposes. Search engines generally don't like query string parameters in URLs: they often indicate a dynamically changing page, they make the site harder to index because the same page can appear under many different URLs (usually containing convoluted, unhelpful data), and they are not considered "friendly" to the consumer. I am going to write specifically about rewriting flat HTML page resources to ASPX scripts with queries.

The first problem we encounter is that by default IIS is configured to handle all requests for *.htm and *.html resources itself, and will simply respond with a “404 Page Not Found” error if the requested file does not exist. Since we want to rewrite *.html pages, we need to tell IIS to forward all requests for these resources to ASP.NET and not handle them itself. I prefer to route *.html through ASP.NET and leave *.htm alone, but you can forward whatever resources you like – you can even forward all requests to ASP.NET if you prefer. You can do this in IIS6 by going to the website properties, clicking the “Configuration” button on the “Home Directory” tab (or “Virtual Directory” tab for an application virtual folder), and adding a “mapping” for the extensions you want to route to the ASP.NET ISAPI extension. This is usually located at C:\Windows\Microsoft.NET\Framework\v2.0.50727\aspnet_isapi.dll but you can always find and copy the properties for the *.aspx mapping if you are unsure. You also need to un-check the option that verifies the requested file exists; otherwise IIS will reject requests for friendly URLs that have no physical file behind them.

Once the request gets through to ASP.NET, we can perform URL rewriting by using the HttpContext.RewritePath method. This method essentially changes the original request information passed from IIS to a different URL. There are a variety of places where the rewriting can be done, including the global.asax application handler, an HTTP Module, or an HTTP Handler.

Unlike other articles which generally refer more to the process of mapping friendly URLs to backend URLs using a configuration file and regular expressions, this article is designed to simply outline the basic foundations necessary to build a robust URL rewriting system in ASP.NET using a variety of approaches; I have implemented a simple stub method for the HTTP module and handler that can be expanded on as your requirements dictate.

Rewriting using global.asax

The global.asax file allows us to handle application and session level events, and resides in the root folder of the application. We can implement simple URL rewriting using the Application_BeginRequest event handler of this file, which is called each time a new request is sent to ASP.NET from IIS for handling:

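void Application_BeginRequest(object sender, EventArgs e)
{
    HttpApplication app = sender as HttpApplication;

    // Look for the friendly page name anywhere in the requested path.
    if (app.Request.Path.IndexOf("FriendlyPage.html") >= 0)
    {
        // Hand the request to the real page, supplying the query string it expects.
        app.Context.RewritePath("/UnfriendlyPage.aspx?SomeQuery=12345");
    }
}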

In the above code snippet, we rewrite any requests for the FriendlyPage.html page to the UnfriendlyPage.aspx page with the query string SomeQuery=12345. As the request progresses through the pipeline, it will now use the newly rewritten resource instead of the original one.

Obviously this is a very simple, hardcoded example of rewriting the URL, and it does not take into account application paths and so on. Usually, rewriting would not be performed as hardcoded entries in the global.asax handler, but rather would be done inside a purpose built HTTP Module, or using an HTTP Handler.

As explained earlier, the above example will only work when IIS has been configured to send *.html resource requests to ASP.NET, but the example would work just as well for any type of request, including a directory (if IIS has been configured with a * mapping) or indeed, requests for other aspx pages.
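
One way to take application paths into account is to convert the request path to an application-relative one before matching it. The sketch below does this with plain string handling so the logic is easy to follow; the AppPaths class and its method names are hypothetical and not part of the original code. Inside ASP.NET itself, the Request.AppRelativeCurrentExecutionFilePath property already provides a path in this "~/" form.

```csharp
using System;

// Hypothetical helper (an illustration, not the article's code): pure string
// logic showing what an application-aware match has to do.
public static class AppPaths
{
    // Converts an absolute request path such as "/shop/FriendlyPage.html" into an
    // application-relative one ("~/FriendlyPage.html") for an app rooted at "/shop".
    // Returns null when the request path lies outside the application.
    public static string ToAppRelative(string requestPath, string applicationPath)
    {
        // Normalise the root so that "/" and "/shop/" both act as clean prefixes.
        string root = applicationPath.TrimEnd('/');

        if (requestPath.Equals(root, StringComparison.OrdinalIgnoreCase))
        {
            return "~/";
        }
        if (requestPath.StartsWith(root + "/", StringComparison.OrdinalIgnoreCase))
        {
            return "~" + requestPath.Substring(root.Length);
        }
        return null;
    }

    // True when the app-relative path names exactly the given page.
    public static bool IsPage(string appRelativePath, string pageName)
    {
        return string.Equals(appRelativePath, "~/" + pageName,
                             StringComparison.OrdinalIgnoreCase);
    }
}
```

Matching on the whole application-relative path, rather than searching for the page name anywhere in the URL, also stops a request such as /shop/NotAFriendlyPage.html from being rewritten by accident.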

Rewriting using an HTTP Module

An HTTP Module is a class which implements the IHttpModule interface.  Essentially it requires two methods to be implemented:

  • Init which is used to hook up pipeline events that the module is interested in handling
  • Dispose to release any allocated resources

URL Rewriting via an HTTP Module works in a very similar way to the global.asax approach shown earlier. HTTP Modules are integrated into the processing pipeline of an ASP.NET application by defining them in the web.config file. ASP.NET will automatically load and instantiate any defined modules, and call their Init() methods. The Init() method can be used to subscribe to other events in the request pipeline.

HTTP Modules are generally executed sequentially, one after the other in the order they are specified in the web.config file, and their methods are called before the equivalent events in global.asax.
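
As a rough sketch of what such a definition might look like for the IIS6 setup described in this article, the fragment below registers a module class named UrlRewritingModule (the class used in this section). The type attribute assumes the class sits in the App_Code folder with no namespace; a class compiled into its own assembly would need a fully qualified "Namespace.Class, Assembly" value instead.

```xml
<configuration>
  <system.web>
    <httpModules>
      <!-- Assumption: UrlRewritingModule lives in App_Code and has no namespace -->
      <add name="UrlRewritingModule" type="UrlRewritingModule" />
    </httpModules>
  </system.web>
</configuration>
```

Under the IIS7 integrated pipeline, module registrations live in the system.webServer/modules section rather than system.web/httpModules.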

The following snippet shows how a typical (very simple) HTTP module could be written:

using System;
using System.Web;

public class UrlRewritingModule : IHttpModule
{
    public UrlRewritingModule()
    {
    }

    public String ModuleName
    {
        get
        {
            return "UrlRewritingModule";
        }
    }

    const string ORIGINAL_PATH = "OriginalRequestPathInfo";

    public void Init(HttpApplication application)
    {
        application.AuthorizeRequest += new EventHandler(application_AuthorizeRequest);
        application.PreRequestHandlerExecute +=
            new EventHandler(application_PreRequestHandlerExecute);
        // HttpContext.Items is per-request state, so there is nothing to reset here;
        // the original path is recorded in application_AuthorizeRequest.
    }

    void application_PreRequestHandlerExecute(object sender, EventArgs e)
    {
        HttpApplication app = sender as HttpApplication;
        String strOriginalPath = app.Context.Items[ORIGINAL_PATH] as String;
        if (strOriginalPath != null && strOriginalPath.Length > 0)
        {
            app.Context.RewritePath(strOriginalPath);
        }
    }

    void application_AuthorizeRequest(object sender, EventArgs e)
    {
        HttpApplication app = sender as HttpApplication;
        String strVirtualPath = "";
        String strQueryString = "";
        MapFriendlyUrl(app.Context, out strVirtualPath, out strQueryString);

        if (strVirtualPath.Length>0)
        {
            app.Context.Items[ORIGINAL_PATH] = app.Request.Path;
            app.Context.RewritePath(strVirtualPath, String.Empty, strQueryString);
        }
    }

    void MapFriendlyUrl(HttpContext context,
        out String strVirtualPath, out String strQueryString)
    {
        strVirtualPath = ""; strQueryString = "";

        // TODO: This routine should examine the context.Request properties and implement
        //       an appropriate mapping system.
        //
        //       Set strVirtualPath to the virtual path of the target aspx page.
        //       Set strQueryString to any query strings required for the page.

        if (context.Request.Path.IndexOf("FriendlyPage.html") >= 0)
        {
            // Example hard coded mapping of "FriendlyPage.html"
            // to "UnfriendlyPage.aspx"
            strVirtualPath = "~/UnfriendlyPage.aspx";
            strQueryString = "FirstQuery=1&SecondQuery=2";
        }
    }

    public void Dispose()
    {
    }
}

As you can see, the application_AuthorizeRequest method is doing exactly the same job as the routine in the global.asax Application_BeginRequest handler. It is actually possible to subscribe to the BeginRequest event on the HTTP Module, but it is generally better to use AuthorizeRequest instead, as that takes into account Forms Authentication, which may perform a redirection to acquire login details. If this happens and the URL has already been rewritten by the BeginRequest event, the consumer will be sent back to the rewritten page rather than the original friendly page. By doing the rewriting inside the AuthorizeRequest event, we ensure the Forms Authentication subsystem will return us to the friendly resource name instead.

To perform the mapping, I have implemented a stub method MapFriendlyUrl whose job it is to work out how the requested resource needs to be rewritten. This is entirely dependent on your setup. In this example, I use a quick and dirty hard coded test for any request for “FriendlyPage.html” and simply map this to “UnfriendlyPage.aspx?FirstQuery=1&SecondQuery=2”. Obviously you need to complete this method to do what you want making sure the “strVirtualPath” and “strQueryString” out parameters are completed. If the resource cannot be mapped, the method returns an empty path.
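
As one possible way of filling in the stub, the sketch below maps a family of friendly URLs with a regular expression instead of a hard coded page name. Everything specific in it is an assumption made for illustration: the FriendlyUrlRules class, the ~/products/ URL shape, the ShowProduct.aspx page and the name query parameter are not part of the original code. It takes an application-relative path and follows the same convention as the stub, leaving both out parameters empty when nothing matches.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical rule-based mapper; the rule, page name and parameter name are
// illustrative assumptions only.
public static class FriendlyUrlRules
{
    // "~/products/some-product-name.html" -> "~/ShowProduct.aspx?name=some-product-name"
    static readonly Regex ProductRule = new Regex(
        @"^~/products/(?<name>[a-z0-9\-]+)\.html$",
        RegexOptions.IgnoreCase | RegexOptions.Compiled);

    // Returns true and fills both out parameters when the path matches a rule;
    // otherwise returns false and leaves both as empty strings.
    public static bool TryMap(string appRelativePath,
        out string virtualPath, out string queryString)
    {
        virtualPath = "";
        queryString = "";

        Match m = ProductRule.Match(appRelativePath);
        if (!m.Success)
        {
            return false;
        }

        virtualPath = "~/ShowProduct.aspx";
        // Escape the captured value so it is safe to place in a query string.
        queryString = "name=" + Uri.EscapeDataString(m.Groups["name"].Value);
        return true;
    }
}
```

A real system would probably hold several such rules, perhaps loaded from a configuration file, and try each in turn until one matches.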

You might have noticed the additional call to the RewritePath method in the PreRequestHandlerExecute handler. The reason that we do this is so that any form post backs which happen in the target page post back to the original friendly URL, and not the ugly rewritten one. In the above example, all requests for “FriendlyPage.html” are rewritten to “UnfriendlyPage.aspx?FirstQuery=1&SecondQuery=2”. If we omitted the second rewrite in the PreRequestHandlerExecute stage, any post backs on the UnfriendlyPage.aspx page would post back to UnfriendlyPage.aspx and not FriendlyPage.html.

There is one interesting side effect however: although the page path itself is restored, any queries used in the rewritten page will still appear as part of the post back reference. There are some interesting ideas about solving this, including overriding the rendering of the action attribute of the page's <form> tag using a CSS Control Adapter, but we can get around this problem much more cleanly by using an HTTP Handler instead of an HTTP Module. This allows us to use the same double RewritePath trick, but at a better point in the pipeline.

The other point of interest is that we use the HttpContext.Items state management property to store and share key/value pairs between separate framework calls to the HTTP module event handlers. This property is also useful for passing state information between HTTP modules and HTTP handlers. In this case, we need to record the original path in order to retrieve it later in the pipeline, and rewrite it back.

Rewriting using an HTTP Handler

An HTTP handler is a class which implements the IHttpHandler interface, and is designed to implement a custom response to a specific type of resource request sent to the ASP.NET engine. The application's web.config file contains mappings that indicate which handler should deal with which resources, and for what HTTP verbs. For example, ASP.NET knows to use the default page handler whenever a request for a *.aspx resource is received. This is in contrast to HTTP Modules, which get invoked for all requests, leaving the module to decide whether it wants to do anything with each one.
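
A handler mapping of this kind might look roughly like the fragment below, which routes every HTTP verb for *.html to a handler class named UrlRewriter (the name used for the handler in this section). As a sketch for the IIS6 setup described in this article, it assumes the class is in the App_Code folder with no namespace; otherwise the type attribute needs a "Namespace.Class, Assembly" value.

```xml
<configuration>
  <system.web>
    <httpHandlers>
      <!-- Assumption: UrlRewriter lives in App_Code and has no namespace -->
      <add verb="*" path="*.html" type="UrlRewriter" />
    </httpHandlers>
  </system.web>
</configuration>
```

The verb attribute can be narrowed (for example to "GET,POST") if the handler should only see certain kinds of request.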

In order to do some rewriting for *.html resources, we need to create a handler that deals with *.html requests. The IHttpHandler interface requires us to implement two members:

  • ProcessRequest, a method which is called by the ASP.NET Framework when an appropriate request needs handling and must implement the appropriate response.
  • IsReusable, a property which returns a Boolean flag indicating whether the same handler instance can be used for multiple requests or not.

One common mistake that people make when implementing rewriting using an HTTP Handler is that they simply call the HttpContext.RewritePath method to rewrite the request and expect that somehow that will work. The crucial thing to remember about an HTTP Handler is that it must actually handle the request in order to provide a response. Simply rewriting a *.html request to a *.aspx request works within an HTTP Module or the global.asax handler because the request information is rewritten before the appropriate handler is selected, so the ASP.NET Framework goes on to invoke the correct handler (i.e. the *.aspx default page handler). Once a handler has been invoked, however, it is responsible for emitting the right response itself.

To implement an HTTP Handler to handle *.html resources, we need to make use of the default page handler and forward the rewritten request to it once we have rewritten the original URL. Fortunately, we can use the PageParser.GetCompiledPageInstance method to return an instance of the default page handler for a specific resource. Given an instance of the default page handler for the target aspx resource, we can then directly invoke its ProcessRequest method from within the ProcessRequest method of our own handler.

Because we are using the GetCompiledPageInstance method to return an instance of the page, based on the target aspx page required, we do not actually need to use the RewritePath method to alter the requested page, only the query strings. Below is a simple HTTP Handler implementation:

using System;
using System.Web;
using System.Web.UI;

public class UrlRewriter : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        // Map the friendly URL to the back-end one..
        String strVirtualPath = "";
        String strQueryString = "";
        MapFriendlyUrl(context, out strVirtualPath, out strQueryString);

        if(strVirtualPath.Length>0)
        {
            // Apply the required query strings to the request
            context.RewritePath(context.Request.Path, string.Empty, strQueryString);

            // Now get a page handler for the ASPX page required, using this context.
            Page aspxHandler = (Page)PageParser.GetCompiledPageInstance(
                strVirtualPath, context.Server.MapPath(strVirtualPath), context);

            // Execute the handler..
            aspxHandler.PreRenderComplete +=
                new EventHandler(AspxPage_PreRenderComplete);
            aspxHandler.ProcessRequest(context);
        }
    }

    void MapFriendlyUrl(HttpContext context,
        out String strVirtualPath, out String strQueryString)
    {
        strVirtualPath = ""; strQueryString = "";

        // TODO: This routine should examine the
        // context.Request properties and implement
        //       an appropriate mapping system.
        //
        //       Set strVirtualPath to the virtual path of the target aspx page.
        //       Set strQueryString to any query strings required for the page.

        if (context.Request.Path.IndexOf("FriendlyPage.html") >= 0)
        {
            // Example hard coded mapping of "FriendlyPage.html"
            // to "UnfriendlyPage.aspx"

            strVirtualPath = "~/UnfriendlyPage.aspx";
            strQueryString = "FirstQuery=1&SecondQuery=2";
        }
    }

    void AspxPage_PreRenderComplete(object sender, EventArgs e)
    {
        // Strip the mapped queries again so post backs target the clean friendly URL.
        HttpContext.Current.RewritePath(HttpContext.Current.Request.Path,
            String.Empty, String.Empty);
    }

    public bool IsReusable
    {
        get
        {
            return true;
        }
    }
}

The first thing the handler does is to determine the target aspx page required to handle the request, and determine any query strings that need to be passed to that target page. For simplicity, I have not included any specific mapping implementation – that is up to you and can be as simple as having hard coded pages or can include look ups to a configuration file containing regular expressions. The MapFriendlyUrl method simply ensures that the “strVirtualPath” out parameter is set to the virtual path of the target aspx page, and the “strQueryString” out parameter is set to the query strings required for the target script.

Next we can use the HttpContext.RewritePath method to rewrite just the query strings (since the path to the original friendly URL is already correct). We then create an instance of the default page handler for the target aspx page. The reason that we do this is so that the aspx page when it executes will see all the required query parameters, but will still see the URL as the friendly one – we don't need to change the request path to the aspx page since we are manually invoking it ourselves.

Before we actually process the request, we hook up the PreRenderComplete page event handler. This event fires when the page has finished creating all of its controls, pagination has concluded, viewstate is ready to be written, and the final HTML is ready to be emitted. Hooking up to this event gives us a chance to do the double RewritePath trick (similar to what we did in the HTTP module approach shown earlier). The PreRenderComplete handler simply calls HttpContext.RewritePath to remove the queries that were added by the last rewrite (essentially reversing it). This has the effect of ensuring the target that is emitted for post backs is the fully friendly URL and, unlike the HTTP module implementation, it will not include any of the additional query strings that were added by the mapping.

While the handler will now successfully rewrite *.html friendly pages to the required *.aspx pages with queries, it still has a few problems:

  • Session state is not available within the aspx target page

This is something of a problem for most people who use session state within their aspx pages, but is actually very straightforward to fix. By adding the IReadOnlySessionState or IRequiresSessionState interface to the class, we can gain read only or read/write access respectively. We don't need to implement anything differently in the handler since these interfaces are simply “marker” interfaces which expose no methods, but they signal to ASP.NET to enable session state access when the handler is consumed.

  • Requests for actual *.html files can no longer be served.

This may or may not be a problem to you. If you would still like actual static *.html files to be served, then we need to ensure that the handler does this for us. Don't forget that we have instructed IIS to forward all requests for *.html resources to us for processing. We can solve this by writing code at the start (or end) of the handler that checks to see if the request is for a real page, and if so serve it.

  • The friendly URLs cannot contain any query strings themselves.

This is probably not an issue since the main reason for rewriting is to remove queries from the URL. However, it might be useful internally to be able to call the friendly page with programmatic queries, to maintain the belief that the flat *.html page is the resource being served rather than having to revert to the undesirable *.aspx page. We need to make changes to record the original queries and restore them in the PreRenderComplete page event handler. We also need to ensure that the mapped queries are cleanly “merged” with the requested ones.

  • The handler needs to gracefully handle page not found errors as required.

Depending on how your mapping system works, there may be times when a friendly page specified does not actually relate to anything. You could choose to redirect the response to an error page, or your aspx page may respond accordingly, but if not the handler should really be written to respond with an appropriate message and return a 404 response code. If you don't do this, you could end up with the handler simply returning a blank response.

Building a Better HTTP Handler

Below is a full implementation of an HTTP Handler that rewrites *.html resources to *.aspx pages with query strings, taking into account the points made previously.

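using System;
using System.IO;
using System.Web;
using System.Web.SessionState;
using System.Web.UI;

public class BetterUrlRewriter : IHttpHandler, IRequiresSessionState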
{
    const string ORIGINAL_PATHINFO = "UrlRewriterOriginalPathInfo";
    const string ORIGINAL_QUERIES = "UrlRewriterOriginalQueries";

    public void ProcessRequest(HttpContext context)
    {
        // Check to see if the specified HTML file actually exists and serve it if so..
        String strReqPath = context.Server.MapPath(
            context.Request.AppRelativeCurrentExecutionFilePath);
        if (File.Exists(strReqPath))
        {
            context.Response.WriteFile(strReqPath);
            context.Response.End();
            return;
        }

        // Record the original request PathInfo and
        // QueryString information to handle graceful postbacks
        context.Items[ORIGINAL_PATHINFO] = context.Request.PathInfo;
        context.Items[ORIGINAL_QUERIES] = context.Request.QueryString.ToString();

        // Map the friendly URL to the back-end one..
        String strVirtualPath = "";
        String strQueryString = "";
        MapFriendlyUrl(context, out strVirtualPath, out strQueryString);

        if(strVirtualPath.Length>0)
        {
            // To ensure that any query strings passed in the original request are
            // preserved, we merge them into the mapped queries now, taking care not
            // to overwrite any keys which have been set by the mapping. Keys are
            // compared as whole names rather than as substrings of the query string.
            System.Collections.Specialized.NameValueCollection mergedQueries =
                HttpUtility.ParseQueryString(strQueryString);
            foreach (string strOriginalQuery in context.Request.QueryString.AllKeys)
            {
                // A value-only query such as "?flag" has a null key; skip it.
                if (strOriginalQuery != null && mergedQueries[strOriginalQuery] == null)
                {
                    mergedQueries[strOriginalQuery] =
                        context.Request.QueryString[strOriginalQuery];
                }
            }
            strQueryString = mergedQueries.ToString();

            // Apply the required query strings to the request
            context.RewritePath(context.Request.Path, string.Empty, strQueryString);

            // Now get a page handler for the ASPX page required, using this context.
            Page aspxHandler = (Page)PageParser.GetCompiledPageInstance(
                strVirtualPath, context.Server.MapPath(strVirtualPath), context);

            // Execute the handler..
            aspxHandler.PreRenderComplete +=
                new EventHandler(AspxPage_PreRenderComplete);
            aspxHandler.ProcessRequest(context);
        }
        else
        {
            // No mapping was found - emit a 404 response.

            context.Response.StatusCode = 404;
            context.Response.ContentType = "text/plain";
            context.Response.Write("Page Not Found");
            context.Response.End();
        }
    }

    void MapFriendlyUrl(HttpContext context,
        out String strVirtualPath, out String strQueryString)
    {
        strVirtualPath = ""; strQueryString = "";

        // TODO: This routine should examine the context.Request properties and implement
        //       an appropriate mapping system.
        //
        //       Set strVirtualPath to the virtual path of the target aspx page.
        //       Set strQueryString to any query strings required for the page.

        if (context.Request.Path.IndexOf("FriendlyPage.html") >= 0)
        {
            // Example hard coded mapping of "FriendlyPage.html"
            // to "UnfriendlyPage.aspx"

            strVirtualPath = "~/UnfriendlyPage.aspx";
            strQueryString = "FirstQuery=1&SecondQuery=2";
        }
    }

    void AspxPage_PreRenderComplete(object sender, EventArgs e)
    {
        // We need to rewrite the path replacing the original tail and query strings..
        // This happens AFTER the page has been loaded and setup
        // but has the effect of ensuring
        // postbacks to the page retain the original un-rewritten pages URL and queries.

        HttpContext.Current.RewritePath(HttpContext.Current.Request.Path,
            HttpContext.Current.Items[ORIGINAL_PATHINFO].ToString(),
            HttpContext.Current.Items[ORIGINAL_QUERIES].ToString());
    }

    public bool IsReusable
    {
        get
        {
            return true;
        }
    }
}

Firstly, you will notice that in addition to the IHttpHandler interface, we also specify the IRequiresSessionState interface. This ensures we get read/write access to the session state during the page lifecycle.

The first thing that we do in the ProcessRequest method is to check to see if the requested *.html resource is actually a request for a real HTML resource. You may not want to check for this depending on your requirements or may want to do it after checking the mappings, but I prefer to include it here so that while virtual HTML pages which require rewriting will work using the rewriting engine, any requests for actual HTML files which really do exist at the requested location take priority and can still be served.

Next, we use the HttpContext.Items key/value state management collection to save the current request query strings and tail so we can then pull them back out in the PreRenderComplete page handler. After calling the MapFriendlyUrl method to perform the mapping work, we then write some additional code that merges any queries specified in this request with the ones required for the mapping. This is done in such a way that the mapping queries specified take priority and are not overwritten with the same query parameter from the request – but of course this could be changed as required.
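
To make the merge rule concrete, here is a small standalone sketch of the same idea using plain strings. The QueryMerge class is hypothetical and not part of the handler; it assumes both inputs are raw query strings without the leading "?", does no URL encoding or decoding, and keeps only the first occurrence of each key. Keys are compared as whole names, ignoring case, so a request key such as Query is not mistaken for the mapped key FirstQuery.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration of "mapped queries take priority" merging.
public static class QueryMerge
{
    // Merges two raw query strings (no leading '?'). Pairs from 'mapped' are
    // emitted first; a pair from 'requested' is appended only when its key has
    // not been seen already. The first occurrence of each key wins.
    public static string Merge(string mapped, string requested)
    {
        List<string> parts = new List<string>();
        HashSet<string> seen = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        foreach (string source in new string[] { mapped, requested })
        {
            if (string.IsNullOrEmpty(source)) continue;

            foreach (string pair in source.Split('&'))
            {
                if (pair.Length == 0) continue;

                // The key is everything before the first '=', or the whole pair.
                int eq = pair.IndexOf('=');
                string key = eq >= 0 ? pair.Substring(0, eq) : pair;

                if (seen.Add(key))
                {
                    parts.Add(pair);
                }
            }
        }
        return string.Join("&", parts.ToArray());
    }
}
```

For example, merging the mapped string FirstQuery=1&SecondQuery=2 with a requested string of Query=9&firstquery=5 keeps both mapped values, appends Query=9, and drops the conflicting firstquery=5.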

Finally, if we fail to get a mapping, then we take this as being an invalid request and we respond by serving back a simple 404 response, and terminate. You may choose to show a somewhat better response than I have implemented here, or show a site map or whatever.

Happy Rewriting!

Points of Interest

Hopefully someone will find this article of interest in helping to implement URL rewriting in web projects using one of the solutions shown. While this article does not implement (or attempt to implement) the code to map friendly to unfriendly URLs, it does provide a stub implementation to expand on and hopefully will provide the foundations for going forward.

What If I Don't Have Access to IIS?

The information presented here specifically relates to rewriting *.html resources, which means that you need the ability to add a mapping to IIS to forward *.html requests to ASP.NET. However, there is no reason that you can't use exactly the same process for rewriting any resource type that you want, providing the request gets to the ASP.NET ISAPI extension. If your hosting solution does not offer you the ability to add IIS mappings and you otherwise don't have access to IIS, you can still make use of the rewriting techniques illustrated here by using another resource type that you know will be handled. The extension *.ashx, for example, is a prime candidate to use instead of *.html as *.ashx requests are automatically mapped to ASP.NET.
