URL Rewriting using ASP.NET for SEO
2012-02-29 18:21 mleader1
Introduction
URL Rewriting is the process of hiding a complex parameterised query string based URL such as http://www.somedomain.com/showproduct.aspx?id=12345&otherid=67890 behind a series of flat, often verbose URLs which do not contain any query string parameters, such as http://www.somedomain.com/products/some-product-name.html. The idea is that the flat URL is requested from the server, and internally we determine the required parameters, then call the original query string based URL.
The main reason you might consider doing this is for SEO (Search Engine Optimisation) purposes. Search engines generally don't like query string parameters in URLs: they often indicate a dynamically changing page, they make indexing harder because the same page can appear under many different URLs (usually containing convoluted, unhelpful data), and they are not considered "friendly" to the consumer. I am going to write specifically about rewriting flat HTML page resources to ASPX scripts with queries.
The first problem we encounter is that by default IIS is configured to handle all requests for *.htm and *.html resources itself, and will simply respond with a “404 Page Not Found” error if the requested file does not exist. Since we want to rewrite *.html pages, we need to tell IIS to forward all requests for these resources to ASP.NET and not handle them itself. I prefer to route *.html through ASP.NET and leave *.htm alone, but you can forward whatever resources you like – you can even forward all requests to ASP.NET if you prefer. You can do this in IIS6 by going to the website properties, clicking the “Configuration” button on the “Home Directory” tab (or “Virtual Directory” tab for an application virtual folder), and adding a “mapping” for the extensions you want to route to the ASP.NET ISAPI extension. This is usually located at C:\Windows\Microsoft.NET\Framework\v2.0.50727\aspnet_isapi.dll but you can always find and copy the properties for the *.aspx mapping if you are unsure. You also need to ensure you un-check the option that verifies the requested file exists; otherwise IIS will reject requests for friendly URLs that have no physical file before ASP.NET ever sees them.
Once the request gets through to ASP.NET, we can perform URL rewriting by using the HttpContext.RewritePath method. This method essentially changes the original request information passed from IIS to a different URL. There are a variety of places where the rewriting can be done, including the global.asax application handler, an HTTP Module, or an HTTP Handler.
Unlike other articles which generally refer more to the process of mapping friendly URLs to backend URLs using a configuration file and regular expressions, this article is designed to simply outline the basic foundations necessary to build a robust URL rewriting system in ASP.NET using a variety of approaches; I have implemented a simple stub method for the HTTP module and handler that can be expanded on as your requirements dictate.
Rewriting using global.asax
The global.asax file allows us to handle application and session level events, and resides in the root folder of the application. We can implement simple URL rewriting using the Application_BeginRequest event handler of this file, which is called each time a new request is sent to ASP.NET from IIS for handling:
void Application_BeginRequest(object sender, EventArgs e)
{
    HttpApplication app = sender as HttpApplication;
    if (app.Request.Path.IndexOf("FriendlyPage.html") > 0)
    {
        app.Context.RewritePath("/UnfriendlyPage.aspx?SomeQuery=12345");
    }
}
In the above code snippet, we rewrite any requests for the FriendlyPage.html page to the UnfriendlyPage.aspx page with the query string SomeQuery=12345. As the request progresses through the pipeline, it will now use the newly rewritten resource instead of the original one.
Obviously this is a very simple, hardcoded example of rewriting the URL, and it does not take into account application paths and so on. Usually, rewriting would not be performed as hardcoded entries in the global.asax handler, but rather would be done inside a purpose built HTTP Module, or using an HTTP Handler.
As explained earlier, the above example will only work when IIS has been configured to send *.html resource requests to ASP.NET, but the example would work just as well for any type of request, including a directory (if IIS has been configured with a * mapping) or indeed, requests for other aspx pages.
Rewriting using an HTTP Module
An HTTP Module is a class which implements the IHttpModule interface. Essentially it requires two methods to be implemented:
- Init, which is used to hook up pipeline events that the module is interested in handling
- Dispose, to release any allocated resources
URL Rewriting via an HTTP Module works in a very similar way to the global.asax approach shown earlier. HTTP Modules are integrated into the processing pipeline of an ASP.NET application by defining them in the web.config file. ASP.NET will automatically load and instantiate any defined modules, and call their Init() methods. The Init() method can be used to subscribe to other events in the request pipeline.
HTTP Modules are generally executed sequentially, one after the other in the order they are specified in the web.config file, and their methods are called before the equivalent events in global.asax.
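As a rough sketch of the registration step described above, a module is typically listed in the httpModules section of web.config when running under IIS6 (or IIS7 in classic mode). The name attribute is an arbitrary label, and the type value assumes a module class called UrlRewritingModule that has no namespace and is compiled from the site's App_Code folder; a class in a separate assembly would need the "Namespace.ClassName, AssemblyName" form instead.

```xml
<configuration>
  <system.web>
    <httpModules>
      <!-- name is an arbitrary label; type assumes a namespace-less class in App_Code -->
      <add name="UrlRewritingModule" type="UrlRewritingModule" />
    </httpModules>
  </system.web>
</configuration>
```

Modules listed here run in the order in which their add elements appear.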
The following snippet shows how a typical (very simple) HTTP module could be written:
using System;
using System.Web;

public class UrlRewritingModule : IHttpModule
{
    // Key under which the original (friendly) path is kept in HttpContext.Items.
    const string ORIGINAL_PATH = "OriginalRequestPathInfo";

    public UrlRewritingModule()
    {
    }

    public String ModuleName
    {
        get
        {
            return "UrlRewritingModule";
        }
    }

    public void Init(HttpApplication application)
    {
        // HttpContext.Items is per-request state, so there is nothing to
        // initialise here; we only subscribe to the pipeline events.
        application.AuthorizeRequest += new EventHandler(application_AuthorizeRequest);
        application.PreRequestHandlerExecute +=
            new EventHandler(application_PreRequestHandlerExecute);
    }

    // Restores the friendly path just before the page handler executes,
    // so that post backs target the friendly URL.
    void application_PreRequestHandlerExecute(object sender, EventArgs e)
    {
        HttpApplication app = sender as HttpApplication;
        String strOriginalPath = app.Context.Items[ORIGINAL_PATH] as String;
        if (strOriginalPath != null && strOriginalPath.Length > 0)
        {
            app.Context.RewritePath(strOriginalPath);
        }
    }

    // Rewrites the friendly URL to the back-end page and remembers the
    // original path for the second rewrite above.
    void application_AuthorizeRequest(object sender, EventArgs e)
    {
        HttpApplication app = sender as HttpApplication;
        String strVirtualPath = "";
        String strQueryString = "";
        MapFriendlyUrl(app.Context, out strVirtualPath, out strQueryString);
        if (strVirtualPath.Length > 0)
        {
            app.Context.Items[ORIGINAL_PATH] = app.Request.Path;
            app.Context.RewritePath(strVirtualPath, String.Empty, strQueryString);
        }
    }

    void MapFriendlyUrl(HttpContext context,
        out String strVirtualPath, out String strQueryString)
    {
        strVirtualPath = ""; strQueryString = "";
        // TODO: This routine should examine the context.Request properties and implement
        // an appropriate mapping system.
        //
        // Set strVirtualPath to the virtual path of the target aspx page.
        // Set strQueryString to any query strings required for the page.
        if (context.Request.Path.IndexOf("FriendlyPage.html") >= 0)
        {
            // Example hard coded mapping of "FriendlyPage.html"
            // to "UnfriendlyPage.aspx"
            strVirtualPath = "~/UnfriendlyPage.aspx";
            strQueryString = "FirstQuery=1&SecondQuery=2";
        }
    }

    public void Dispose()
    {
    }
}
As you can see, the application_AuthorizeRequest method is doing exactly the same job as the routine in the global.asax Application_BeginRequest handler. It is actually possible to subscribe to the BeginRequest event in the HTTP Module, but it is generally better to use AuthorizeRequest instead as that takes into account Forms Authentication, which may perform a redirection to acquire login details. If this happens and the URL has already been rewritten in the BeginRequest event, the consumer will be sent back to the rewritten page rather than the original friendly page. By doing the rewriting inside the AuthorizeRequest event, we ensure the Forms Authentication subsystem will return us to the friendly resource name instead.
To perform the mapping, I have implemented a stub method MapFriendlyUrl whose job it is to work out how the requested resource needs to be rewritten. This is entirely dependent on your setup. In this example, I use a quick and dirty hard coded test for any request for “FriendlyPage.html” and simply map this to “UnfriendlyPage.aspx?FirstQuery=1&SecondQuery=2”. Obviously you need to complete this method to do what you want, making sure the “strVirtualPath” and “strQueryString” out parameters are completed. If the resource cannot be mapped, the method returns an empty path.
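One possible way to fill in such a mapping stub is an ordered list of regular expression rules. The sketch below is only an illustration: the FriendlyUrlMap class, its TryMap method, the ShowProduct.aspx target and the URL patterns are all hypothetical names chosen for the example, and it assumes C# 3.0 or later for the collection initialiser.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper (not part of the module above): maps a friendly path
// to a target page and query string using an ordered list of regex rules.
public static class FriendlyUrlMap
{
    private sealed class Rule
    {
        public readonly Regex Pattern;
        public readonly string VirtualPath;
        public readonly string QueryTemplate;

        public Rule(string pattern, string virtualPath, string queryTemplate)
        {
            Pattern = new Regex(pattern, RegexOptions.IgnoreCase | RegexOptions.Compiled);
            VirtualPath = virtualPath;
            QueryTemplate = queryTemplate;
        }
    }

    // Example rules only; the URLs and parameter names are assumptions.
    private static readonly List<Rule> Rules = new List<Rule>
    {
        new Rule(@"/products/(?<id>\d+)-[^/]*\.html$", "~/ShowProduct.aspx", "id=${id}"),
        new Rule(@"/FriendlyPage\.html$", "~/UnfriendlyPage.aspx", "FirstQuery=1&SecondQuery=2")
    };

    // Returns true and fills the out parameters when a rule matches;
    // otherwise returns false and leaves both out parameters empty.
    public static bool TryMap(string requestPath, out string virtualPath, out string queryString)
    {
        virtualPath = "";
        queryString = "";
        foreach (Rule rule in Rules)
        {
            Match match = rule.Pattern.Match(requestPath);
            if (match.Success)
            {
                virtualPath = rule.VirtualPath;
                // Match.Result expands ${name} group references in the template.
                queryString = match.Result(rule.QueryTemplate);
                return true;
            }
        }
        return false;
    }
}
```

Inside MapFriendlyUrl, the hard coded test could then be replaced by a call such as FriendlyUrlMap.TryMap(context.Request.Path, out strVirtualPath, out strQueryString). In a real site the rules would more likely be loaded from a configuration file than compiled in.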
You might have noticed the additional call to the RewritePath method in the PreRequestHandlerExecute handler. The reason that we do this is so that any form post backs which happen in the target page post back to the original friendly URL, and not the ugly rewritten one. In the above example, all requests for “FriendlyPage.html” are rewritten to “UnfriendlyPage.aspx?FirstQuery=1&SecondQuery=2”. If we omitted the second rewrite in the PreRequestHandlerExecute stage, any post backs on the UnfriendlyPage.aspx page would post back to UnfriendlyPage.aspx and not FriendlyPage.html.
There is one interesting side effect however; although the page itself is restored, any queries used in the rewritten page will still appear as part of the post back reference. There are some interesting ideas about solving this, including overriding the rendering of the action attribute of the page's <form> tag using a CSS Control Adapter, but we can also get around this problem much more cleanly by using an HTTP Handler instead of an HTTP Module, which allows us to use the same double RewritePath trick, but at a better point in the pipeline.
The other point of interest is that we use the HttpContext.Items state management property to store and share key/value pairs between separate framework calls to the HTTP module event handlers. This property is also useful for passing state information between HTTP modules and HTTP handlers. In this case, we need to record the original path in order to retrieve it later in the pipeline, and rewrite it back.
Rewriting using an HTTP Handler
An HTTP handler is a class which implements the IHttpHandler interface, and is designed to implement a custom response to a specific type of resource request sent to the ASP.NET engine. The application's web.config file contains mappings that indicate which handler should deal with which resources, and for what HTTP verbs. For example, ASP.NET knows to use the default page handler whenever a request for a *.aspx resource is received. This is in contrast to HTTP Modules, which get invoked for all requests, leaving us to determine inside the module whether we want to do something with each one.
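To make the idea of a handler mapping concrete, the fragment below sketches how a custom handler for *.html requests might be registered under IIS6 (or IIS7 in classic mode). The type value is an assumption: it stands for a hypothetical handler class named UrlRewriter with no namespace, compiled from the App_Code folder.

```xml
<configuration>
  <system.web>
    <httpHandlers>
      <!-- verb="*" accepts every HTTP verb; type assumes a namespace-less class in App_Code -->
      <add verb="*" path="*.html" type="UrlRewriter" />
    </httpHandlers>
  </system.web>
</configuration>
```

This mapping only takes effect for requests that IIS actually forwards to ASP.NET, so the IIS extension mapping for *.html is still required.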
In order to do some rewriting for *.html resources, we need to create a handler that deals with *.html requests. The IHttpHandler interface requires us to implement two members:
- ProcessRequest, which is called by the ASP.NET Framework when an appropriate request needs handling and must implement the appropriate response.
- IsReusable, which returns a Boolean flag indicating whether the same handler can be used for multiple requests or not.
One common mistake that people make when implementing rewriting using an HTTP Handler is that they simply call the HttpContext.RewritePath method to rewrite the request and expect that somehow that will work. The crucial thing to remember about an HTTP Handler is that it must actually handle the request in order to provide a response. Simply rewriting a *.html request to a *.aspx request works within an HTTP Module or the global.asax handler because the request information is rewritten before the appropriate handler is selected, so the ASP.NET Framework simply invokes the correct handler (i.e. the *.aspx default page handler). Once the handler has been invoked, it is responsible for emitting the right response.
To implement an HTTP Handler to handle *.html resources, we need to make use of the default page handler and forward the rewritten request to it once we have rewritten the original URL. Fortunately, we can use the PageParser.GetCompiledPageInstance method to return an instance of the default page handler for a specific resource. Given an instance of the default page handler for the target aspx resource, we can then directly invoke its ProcessRequest method from within the ProcessRequest method of our own handler.
Because we are using the GetCompiledPageInstance method to return an instance of the page, based on the target aspx page required, we do not actually need to use the RewritePath method to alter the requested page, only the query strings. Below is a simple HTTP Handler implementation:
using System;
using System.Web;
using System.Web.UI;

public class UrlRewriter : IHttpHandler
{
    public void ProcessRequest(HttpContext context)
    {
        // Map the friendly URL to the back-end one..
        String strVirtualPath = "";
        String strQueryString = "";
        MapFriendlyUrl(context, out strVirtualPath, out strQueryString);
        if (strVirtualPath.Length > 0)
        {
            // Apply the required query strings to the request
            context.RewritePath(context.Request.Path, string.Empty, strQueryString);
            // Now get a page handler for the ASPX page required, using this context.
            Page aspxHandler = (Page)PageParser.GetCompiledPageInstance
                (strVirtualPath, context.Server.MapPath(strVirtualPath), context);
            // Execute the handler..
            aspxHandler.PreRenderComplete +=
                new EventHandler(AspxPage_PreRenderComplete);
            aspxHandler.ProcessRequest(context);
        }
    }

    void MapFriendlyUrl(HttpContext context,
        out String strVirtualPath, out String strQueryString)
    {
        strVirtualPath = ""; strQueryString = "";
        // TODO: This routine should examine the
        // context.Request properties and implement
        // an appropriate mapping system.
        //
        // Set strVirtualPath to the virtual path of the target aspx page.
        // Set strQueryString to any query strings required for the page.
        if (context.Request.Path.IndexOf("FriendlyPage.html") >= 0)
        {
            // Example hard coded mapping of "FriendlyPage.html"
            // to "UnfriendlyPage.aspx"
            strVirtualPath = "~/UnfriendlyPage.aspx";
            strQueryString = "FirstQuery=1&SecondQuery=2";
        }
    }

    // Removes the mapped queries again so the rendered post back target
    // is the plain friendly URL.
    void AspxPage_PreRenderComplete(object sender, EventArgs e)
    {
        HttpContext.Current.RewritePath(HttpContext.Current.Request.Path,
            String.Empty, String.Empty);
    }

    public bool IsReusable
    {
        get
        {
            return true;
        }
    }
}
The first thing the handler does is to determine the target aspx page required to handle the request, and determine any query strings that need to be passed to that target page. For simplicity, I have not included any specific mapping implementation – that is up to you and can be as simple as having hard coded pages or can include look ups to a configuration file containing regular expressions. The MapFriendlyUrl method simply ensures that the “strVirtualPath” out parameter is set to the virtual path of the target aspx page, and the “strQueryString” out parameter is set to the query strings required for the target script.
Next we can use the HttpContext.RewritePath method to rewrite just the query strings (since the path to the original friendly URL is already correct). We then create an instance of the default page handler for the target aspx page. The reason that we do this is so that the aspx page when it executes will see all the required query parameters, but will still see the URL as the friendly one – we don't need to change the request path to the aspx page since we are manually invoking it ourselves.
Before we actually process the request, we hook up the PreRenderComplete page event handler. This event fires when the page has finished creating all of its controls, pagination has concluded, viewstate is ready to be written, and the final HTML is ready to be emitted. Hooking up to this event gives us a chance to do the double RewritePath trick (similar to what we did in the HTTP module approach shown earlier). The PreRenderComplete handler simply calls HttpContext.RewritePath to remove the queries that were added by the last rewrite (essentially reversing it). This has the effect of ensuring the target that is emitted for post backs is the fully friendly URL and, unlike the HTTP module implementation, it will also not include any of the additional query strings that were added on afterwards.
While the handler will now successfully rewrite *.html friendly pages to the required *.aspx pages with queries, it still has a few problems:
- Session state is not available within the aspx target page.
This is something of a problem for most people who use session state within their aspx pages, but is actually very straightforward to fix. By adding the IReadOnlySessionState or IRequiresSessionState interface to the class, we can gain read only or read/write access respectively. We don't need to implement anything differently in the handler since these interfaces are simply “marker” interfaces which expose no methods, but they signal to ASP.NET to enable session state access when the handler is consumed.
- Requests for actual *.html files can no longer be served.
This may or may not be a problem to you. If you would still like actual static *.html files to be served, then we need to ensure that the handler does this for us. Don't forget that we have instructed IIS to forward all requests for *.html resources to us for processing. We can solve this by writing code at the start (or end) of the handler that checks to see if the request is for a real page, and if so serves it.
- The friendly URLs cannot contain any query strings themselves.
This is probably not an issue since the main reason for rewriting is to remove queries from the URL. However, it might be useful internally to be able to call the friendly page with programmatic queries, to maintain the belief that the flat *.html page is the resource being served rather than having to revert to the undesirable *.aspx page. We need to make changes to record the original queries and restore them in the PreRenderComplete page event handler. We also need to ensure that the mapped queries are cleanly “merged” with the requested ones.
- The handler needs to gracefully handle page not found errors as required.
Depending on how your mapping system works, there may be times when a friendly page specified does not actually relate to anything. You could choose to redirect the response to an error page, or your aspx page may respond accordingly, but if not the handler should really be written to respond with an appropriate message and return a 404 response code. If you don't do this, you could end up with the handler simply returning a blank response.
Building a Better HTTP Handler
Below is a full implementation of an HTTP Handler that rewrites *.html resources to *.aspx pages with query strings, taking into account the points made previously.
using System;
using System.IO;
using System.Web;
using System.Web.SessionState;
using System.Web.UI;

public class BetterUrlRewriter : IHttpHandler, IRequiresSessionState
{
    const string ORIGINAL_PATHINFO = "UrlRewriterOriginalPathInfo";
    const string ORIGINAL_QUERIES = "UrlRewriterOriginalQueries";

    public void ProcessRequest(HttpContext context)
    {
        // Check to see if the specified HTML file actually exists and serve it if so..
        String strReqPath = context.Server.MapPath
            (context.Request.AppRelativeCurrentExecutionFilePath);
        if (File.Exists(strReqPath))
        {
            context.Response.WriteFile(strReqPath);
            context.Response.End();
            return;
        }
        // Record the original request PathInfo and
        // QueryString information to handle graceful postbacks
        context.Items[ORIGINAL_PATHINFO] = context.Request.PathInfo;
        context.Items[ORIGINAL_QUERIES] = context.Request.QueryString.ToString();
        // Map the friendly URL to the back-end one..
        String strVirtualPath = "";
        String strQueryString = "";
        MapFriendlyUrl(context, out strVirtualPath, out strQueryString);
        if (strVirtualPath.Length > 0)
        {
            foreach (string strOriginalQuery in context.Request.QueryString.Keys)
            {
                // Value-only entries such as "?flag" have a null key; skip them.
                if (strOriginalQuery == null) continue;
                // To ensure that any query strings passed in the original request
                // are preserved, we append these to the new query string now,
                // taking care not to add any keys which have been set by the mapping.
                // The leading "&" makes the test match whole key names only, so that
                // "id" is not mistaken for part of "otherid".
                string strExisting = "&" + strQueryString.ToLower();
                if (strExisting.IndexOf("&" + strOriginalQuery.ToLower() + "=") < 0)
                {
                    // Request.QueryString returns decoded values, so re-encode them.
                    strQueryString += string.Format("{0}{1}={2}",
                        ((strQueryString.Length > 0) ? "&" : ""),
                        HttpUtility.UrlEncode(strOriginalQuery),
                        HttpUtility.UrlEncode(context.Request.QueryString[strOriginalQuery]));
                }
            }
            // Apply the required query strings to the request
            context.RewritePath(context.Request.Path, string.Empty, strQueryString);
            // Now get a page handler for the ASPX page required, using this context.
            Page aspxHandler = (Page)PageParser.GetCompiledPageInstance
                (strVirtualPath, context.Server.MapPath(strVirtualPath), context);
            // Execute the handler..
            aspxHandler.PreRenderComplete +=
                new EventHandler(AspxPage_PreRenderComplete);
            aspxHandler.ProcessRequest(context);
        }
        else
        {
            // No mapping was found - emit a 404 response.
            context.Response.StatusCode = 404;
            context.Response.ContentType = "text/plain";
            context.Response.Write("Page Not Found");
            context.Response.End();
        }
    }

    void MapFriendlyUrl(HttpContext context, out String strVirtualPath,
        out String strQueryString)
    {
        strVirtualPath = ""; strQueryString = "";
        // TODO: This routine should examine the context.Request properties and implement
        // an appropriate mapping system.
        //
        // Set strVirtualPath to the virtual path of the target aspx page.
        // Set strQueryString to any query strings required for the page.
        if (context.Request.Path.IndexOf("FriendlyPage.html") >= 0)
        {
            // Example hard coded mapping of "FriendlyPage.html"
            // to "UnfriendlyPage.aspx"
            strVirtualPath = "~/UnfriendlyPage.aspx";
            strQueryString = "FirstQuery=1&SecondQuery=2";
        }
    }

    void AspxPage_PreRenderComplete(object sender, EventArgs e)
    {
        // We need to rewrite the path replacing the original tail and query strings..
        // This happens AFTER the page has been loaded and set up
        // but has the effect of ensuring
        // postbacks to the page retain the original un-rewritten page's URL and queries.
        HttpContext.Current.RewritePath(HttpContext.Current.Request.Path,
            HttpContext.Current.Items[ORIGINAL_PATHINFO].ToString(),
            HttpContext.Current.Items[ORIGINAL_QUERIES].ToString());
    }

    public bool IsReusable
    {
        get
        {
            return true;
        }
    }
}
Firstly, you will notice that in addition to the IHttpHandler interface, we also specify the IRequiresSessionState interface. This ensures we get read/write access to the session state during the page lifecycle.
The first thing that we do in the ProcessRequest method is to check to see if the requested *.html resource is actually a request for a real HTML resource. You may not want to check for this depending on your requirements, or may want to do it after checking the mappings, but I prefer to include it here so that while virtual HTML pages which require rewriting will work using the rewriting engine, any requests for actual HTML files which really do exist at the requested location take priority and can still be served.
Next, we use the HttpContext.Items key/value state management collection to save the current request query strings and tail so we can then pull them back out in the PreRenderComplete page handler. After calling the MapFriendlyUrl method to perform the mapping work, we then write some additional code that merges any queries specified in this request with the ones required for the mapping. This is done in such a way that the mapping queries specified take priority and are not overwritten with the same query parameter from the request – but of course this could be changed as required.
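The merge rule described above can also be sketched in isolation, which makes it easier to test away from the handler. The QueryMerger class and its Merge method below are hypothetical names introduced for illustration; the sketch assumes both inputs are already URL-encoded "key=value&key=value" strings and that .NET 3.5 or later is available for HashSet. It compares whole key names, so a request key such as "id" is never confused with a mapped key such as "otherid".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: merges two "a=1&b=2" style query strings so that
// pairs from the mapping take priority over request pairs with the same key.
public static class QueryMerger
{
    public static string Merge(string mappedQuery, string requestQuery)
    {
        List<string> pairs = new List<string>();
        HashSet<string> mappedKeys = new HashSet<string>(StringComparer.OrdinalIgnoreCase);

        // Every mapped pair is kept, and its key is remembered.
        foreach (string pair in SplitPairs(mappedQuery))
        {
            mappedKeys.Add(KeyOf(pair));
            pairs.Add(pair);
        }

        // A request pair is kept only when the mapping did not set that key.
        foreach (string pair in SplitPairs(requestQuery))
        {
            if (!mappedKeys.Contains(KeyOf(pair)))
            {
                pairs.Add(pair);
            }
        }

        return string.Join("&", pairs.ToArray());
    }

    private static string[] SplitPairs(string query)
    {
        return string.IsNullOrEmpty(query)
            ? new string[0]
            : query.Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // The key is everything before the first '='; a pair without '=' is all key.
    private static string KeyOf(string pair)
    {
        int equalsAt = pair.IndexOf('=');
        return equalsAt < 0 ? pair : pair.Substring(0, equalsAt);
    }
}
```

In the handler, the request side could be supplied from context.Request.QueryString.ToString(), which is the same value the handler records for the later post back rewrite. Swapping the two arguments would make the request values win instead.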
Finally, if we fail to get a mapping, then we take this as being an invalid request and we respond by serving back a simple 404 response, and terminate. You may choose to show a somewhat better response than I have implemented here, or show a site map or whatever.
Happy Rewriting!
Points of Interest
Hopefully someone will find this article of interest in helping to implement URL rewriting in web projects using one of the solutions shown. While this article does not implement (or attempt to implement) the code to map friendly to unfriendly URLs, it does provide a stub implementation to expand on and hopefully will provide the foundations for going forward.
What If I Don't Have Access to IIS?
The information presented here specifically relates to rewriting *.html resources, which means that you need the ability to add a mapping to IIS to forward *.html requests to ASP.NET. However, there is no reason that you can't use exactly the same process for rewriting any resource type that you want, providing the request gets to the ASP.NET ISAPI extension. If your hosting solution does not offer you the ability to add IIS mappings and you otherwise don't have access to IIS, you can still make use of the rewriting techniques illustrated here, just using another resource type that you know will be handled. The extension *.ashx, for example, is a prime candidate to use instead of *.html, as *.ashx requests are automatically mapped.