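Refactoring a Legacy Project That Has Been Running for More Than 10 Years
In the second half of last year I took over the maintenance of an outsourced project. The project started around 2005 and uses a traditional three-tier architecture. Its main function is a web crawler that scrapes product data from various overseas e-commerce sites and stores it in the client's database. My refactoring of the project recently passed acceptance, and I would like to share how I approached it.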
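Phase 1: Get familiar with the project framework and understand how the project is run and maintained.
Tools: Microsoft Visual Studio 2005, SQL Server 2005, Axosoft OnTime Scrum, SVN
Development process: client provides the requirements document, coding, unit testing, UAT deployment, UAT testing, client deployment, QA testing
Project layering: the traditional three-tier architecture.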
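During this phase I found several problems:
- Many requirements documents had been lost
- The code logic did not match the requirements documents
- A large amount of duplicated code
- The regular expressions used to match data were all stored in the database, where they were hard to read and inconvenient to change
- Many of the regular expressions were overly complex, for example: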
<li\s*[^>]+list-view-*(?<css>[^"]*)"[^>]*>\s* <h[^>]*>\s*<a[\s\S]*?href=[\d\D]{1,100}?(?<=MLA-)(?<id>\d+)[^<]*>\s* (?<name>[\d\D]{0,500}?)</a>\s* (?:<a[^>]*>)?\s*<i\s*class="ch-icon-heart"></i>\s*</a>\s*</h\d+>\s* (?:<p\s*[^>]+list-view-item-subtitle">(?<subTitle>[\d\D]{0,5000}?)</p>)?\s* (?:<ul[^>]*>(?<subTitle2>[\d\D]{0,5000}?)</ul>)?\s* (?:<a\s*href=[^>]+>)?\s*(?:<im[\d\D]{1,200}?(?:-|_)MLA(?<photo>[^\.]+)[^>]+>)?\s*(?:</a>)?\s*(?:<img[\d\D]{1,200}?images/(?<photo2>[^\.]+)[^>]+>)?\s*(?:</a>)?\s* [\d\D]*? <\s*[^>]+price-info">\s* (?:<[^>]+price-info-cost">(?:[\d\D]*?)<strong\s*[^>]+price">\s*(?<currency>[^\d\&]*)(?: )?(?<price>\d+(?:.\d{3})* (?: .\d+)? ) \s*(?:<sup>(?<priceDecimales>\d*)</sup>\s*)? (?: \s*<span[^>]*>[^<]*</span>)? \s*</strong>\s*(?:</div>\s*)? (?:<strong\s*[^>]+price-info-auction">(?<type>[^<]*)</strong>)?\s* (?:<span\s*[^>]+price-info-auction-endTime">[^<\d]*?(?:(?<day>\d+)d)?\s*(?:(?<hour>\d+)h)? \s*(?:(?<minute>\d+)m)? \s* (?:(?<second>\d+)s)?\s*</span>\s*)?(?:</span>)?\s* (?:<span\s*[^>]+price-info-installments"><span\s*class=installmentsQuantity>(?<numberOfPayment>\d+)</span>[^<]+ <span\s*[^>]+price">\s*[^<]*?(?<pricePayment>\d+(?:.\d{3})* (?: .\d+)? )\s*<sup>(?<pricePaymentDecimales>[\d\D]{0,10}?)</sup>\s*</span>\s* </span>\s*)?|<[^>]*[^>]+price-info-cost-agreed">[^>]*</[^>]*>\s*)(?:</p>)?\s* [\d\D]*? (?:<ul\s*class="medal-list">\s*<li\s*[^>]+mercadolider[^>]*>(?<sellerBagde>[\d\D]{0,500}?)</li>\s*</ul>\s*)? <ul\s*[^>]+extra-info">\s*(?:<li\s*class="ch-ico\s*search[^>]+">[^<]*</li>\s*)? (?:<li\s*[^>]+mercadolider[^>]*>(?<sellerBagde>[\d\D]{0,500}?)</li>)?\s*(?:<!--\s*-->)?\s* (?:<li\s*[^>]+[^>]*(?:condition|inmobiliaria|concesionaria)">\s*(?:<strong>)?(?<condition>[^\d<]*?)(?:</strong>)?\s*</li>\s*)?\s* (?:<li\s*[^>]+"extra-info-sold">(?<bids>\d+)*[^<]*</li>\s*)? (?: <li\s*[^>]+[^>]*location">(?<location>[^<]*?)\s*</li>\s*(?:<li\s*class="free-shipping">[^<]*</li>\s*)? 
|<li>(?<location>[^<]*?)\s*</li>\s*)?(?:<li\s*class="free-shipping">[^<]*</li>\s*)? (?:</ul> |<li[^>]*>\s*Tel.?:\s*(?<phone>[^<]+)</li>) | <div\s*[^>]+item-[^>]*>\s*<h[^>]*>\s*<a\s*href=[\d\D]{1,100}?(?<=MLA-)(?<id>\d+)[^<]*>\s* (?<name>[\d\D]{0,500}?)</a>\s*</h3>\s* (?:[\d\D]*?)<li\s*[^>]+costs"><span\s*[^>]+price">\s*(?<currency>[^\d\&]*)(?: )?(?<price>\d+(?:.\d{3})* (?: .\d+)? ) \s*</span></li> (?:[\d\D]*?)(?:</ul> |<li[^>]*>\s*Tel.?:\s*(?:\ )*(?<phone>[^<]+)</li>)
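Phase 2: Complete all requirements documents and extract all regular expressions into files
A final step was added to the development process: updating the documentation. Once testing or maintenance is finished, the requirements document must be updated. All regular expressions were extracted into files, which reduces the SQL maintenance workload and the chance that a newcomer makes a mistake while maintaining SQL. Thanks to the efforts of QA and myself, more than 200 requirements documents were reorganized, which gives maintainers something to work from.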
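To make the idea of keeping patterns in files concrete, here is a minimal sketch. It assumes one `<name>.regex` file per pattern in a folder; the `PatternFileStore` class and that file layout are hypothetical and are not the project's real format (the real job later in this post loads an XML pattern file).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original project.
public static class PatternFileStore
{
    // Loads every "*.regex" file in the directory, keyed by file name without extension.
    public static Dictionary<string, Regex> LoadAll(string directory)
    {
        var patterns = new Dictionary<string, Regex>(StringComparer.OrdinalIgnoreCase);
        foreach (string path in Directory.GetFiles(directory, "*.regex"))
        {
            string name = Path.GetFileNameWithoutExtension(path);
            string text = File.ReadAllText(path);
            // IgnorePatternWhitespace allows line breaks and "# comments" inside the file,
            // which is what makes a pattern readable once it lives outside the database.
            patterns[name] = new Regex(
                text, RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
        }
        return patterns;
    }
}

public static class PatternFileStoreDemo
{
    public static void Main()
    {
        // Write a small demo pattern file into a fresh temporary folder.
        string dir = Path.Combine(Path.GetTempPath(), "patterns-" + Guid.NewGuid().ToString("N"));
        Directory.CreateDirectory(dir);
        File.WriteAllText(
            Path.Combine(dir, "Price.regex"),
            "<span[^>]*price[^>]*>          # opening tag of the price element\n" +
            "\\s*(?<price>\\d+(?:\\.\\d+)?)  # numeric value\n");

        Dictionary<string, Regex> patterns = PatternFileStore.LoadAll(dir);
        Match m = patterns["Price"].Match("<span class=\"price\"> 12.50</span>");
        Console.WriteLine(m.Groups["price"].Value); // prints 12.50
    }
}
```

Because the files are plain text, they can be reviewed and versioned in SVN together with the code instead of being edited through SQL.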
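Phase 3: Rework the data access layer
The goal was to get rid of the traditional data access layer code. I had planned to adopt Entity Framework, but after talking with the client it turned out they were more familiar with NHibernate, so I wrapped NHibernate in a Repository. This repository layer has little to do with domain-driven design; it is just an oversized DbHelper.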
// Before: hand-written ADO.NET call to a stored procedure
public void SaveInfoByCity(InfoByCity line, string config)
{
    SQLQuery query = new SQLQuery();
    query.CommandType = CommandType.StoredProcedure;
    query.CommandText = "HangZhou_InsertInfoByCity";

    SqlParameter[] parameters = new SqlParameter[7];
    parameters[0] = new SqlParameter("@City", line.City);
    parameters[1] = new SqlParameter("@AvailableUnits", line.AvailableUnits);
    parameters[2] = new SqlParameter("@AvailableSqm", line.AvailableSqm);
    parameters[3] = new SqlParameter("@ResAvailUnits", line.ResAvailUnits);
    parameters[4] = new SqlParameter("@ResAvailSqm", line.ResAvailSqm);
    parameters[5] = new SqlParameter("@ReservedUnits", line.ReservedUnits);
    parameters[6] = new SqlParameter("@ReservedSqm", line.ReservedSqm);

    SqlHelper.ExecuteNonQuery(
        ConnectionStringManager.GetConnectionString(CALLER_ASSEMBLY_NAME, config),
        query.CommandType, query.CommandText, parameters);
}
// After: the entity is handed to the repository wrapper
/// <summary>
/// SaveRestaurant
/// </summary>
/// <param name="restaurant"></param>
public void SaveRestaurant(Restaurant restaurant)
{
    restaurant.RunId = RunId;
    restaurant.RunDate = RunDate;
    restaurant.InsertUpdateDate = DateTime.Now;
    RepositoryHelper.CreateEntity(restaurant);
}
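Phase 4: Remove the large amount of duplicated code
Once abstracted, every job consists of just three parts: download, match, and save. I therefore wrapped a large number of common methods, which makes the code for each job simpler and easier to maintain.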
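The download, match, save split can be sketched as a small base class. This is only an illustration under my own assumptions: `CrawlJobBase` and `DemoLinkJob` are hypothetical names, and the download step is stubbed with a fixed string so that the example runs without network access.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical base class: it fixes the order download -> match -> save,
// so each concrete job only fills in the three steps.
public abstract class CrawlJobBase<TItem>
{
    // Runs the three steps for one URL and returns the number of saved items.
    public int Run(string url)
    {
        string page = Download(url);
        int saved = 0;
        foreach (TItem item in Extract(page))
        {
            Save(item);
            saved++;
        }
        return saved;
    }

    protected abstract string Download(string url);
    protected abstract IEnumerable<TItem> Extract(string page);
    protected abstract void Save(TItem item);
}

// Demo job: collects link targets from a canned page into an in-memory list.
public sealed class DemoLinkJob : CrawlJobBase<string>
{
    public readonly List<string> Saved = new List<string>();

    // Stubbed download: a real job would issue an HTTP request here.
    protected override string Download(string url)
    {
        return "<a href=\"/a\">A</a><a href=\"/b\">B</a>";
    }

    protected override IEnumerable<string> Extract(string page)
    {
        foreach (Match m in Regex.Matches(page, "href=\"(?<Url>[^\"]*)\""))
        {
            yield return m.Groups["Url"].Value;
        }
    }

    protected override void Save(string item)
    {
        Saved.Add(item);
    }
}

public static class CrawlJobDemo
{
    public static void Main()
    {
        var job = new DemoLinkJob();
        Console.WriteLine(job.Run("http://example.invalid/")); // prints 2
    }
}
```

Retry handling, logging, and page counting can then live in the base class once, instead of being copied into every job.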
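Phase 5: Change the matching approach
The project originally matched data purely with regular expressions. Some sites have fairly complex markup, which made the expressions far too large. Often a tiny change on the site was enough for the whole expression to stop matching anything, so maintaining the expressions was hard as well.
My first idea was to wrap the expressions in a tree structure and divide and conquer. After a trial period I found that maintaining and debugging tree-structured regular expressions was a nightmare, so I gave that up. I still believed, however, that splitting the page into ever smaller pieces and then matching inside each piece was the right direction, because it is easier to maintain. While thinking and searching I came across HtmlSelector: use HtmlSelector for DOM selection, then use regular expressions to match the details. Step by step this was wrapped into its current form. An example job and its pattern file follow.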
using System;
using System.Collections.Generic;
using System.Text;
using Majestic.Bot.Core;
using System.Diagnostics;
using Majestic.Util;
using Majestic.Entity.Shared;
using Majestic.Entity.ECommerce.Hungryhouse;
using Majestic.Dal.ECommerce;

namespace Majestic.Bot.Job.ECommerce
{
    public class Hungryhouse : JobRequestBase
    {
        private static string proxy;
        private static string userAgent;
        private static string domainUrl = "https://hungryhouse.co.uk/";
        private static string locationUrl = "https://hungryhouse.co.uk/takeaway";
        private int maxRetries;
        private int maxHourlyPageView;
        private HttpManager httpManager = null;
        private int pageCrawled = 0;

        /// <summary>
        /// This method is defined here primarily because we want to use the top-level
        /// class name as the logger name, so that even the base class logs using the
        /// logger defined by the derived class, not by the base class itself.
        /// </summary>
        /// <param name="row"></param>
        public override void Init(Majestic.Dal.Shared.DSMaj.Maj_vwJobDetailsRow row)
        {
            StackFrame frame = new StackFrame(0, false);
            base.Init(frame.GetMethod().DeclaringType.FullName, row);
        }

        /// <summary>
        /// Initializes the fields
        /// </summary>
        private void Initialize()
        {
            try
            {
                JobSettingCollection jobSettingCollection = base.GetJobSettings(JobId);
                proxy = jobSettingCollection.GetValue("proxy");
                userAgent = jobSettingCollection.GetValue("userAgent");
                maxRetries = jobSettingCollection.GetValue<int>("maxRetryTime", 3);
                maxHourlyPageView = jobSettingCollection.GetValue<int>("maxHourlyPageView", 4500);
                InithttpManager();
                InitPattern();
            }
            catch (Exception ex)
            {
                throw new MajException("Error initializing job " + m_sConfig, ex);
            }
        }

        /// <summary>
        /// Initialize the httpManager instance
        /// </summary>
        private void InithttpManager()
        {
            if (String.IsNullOrEmpty(proxy) || proxy.Equals("none"))
            {
                throw new Exception("proxy was not set! job ended!");
            }

            httpManager = new HttpManager(proxy, this.maxHourlyPageView,
                delegate(string page)
                {
                    if (page.Contains("macys.com is temporarily closed for scheduled site improvements"))
                    {
                        return false;
                    }
                    else
                    {
                        return ComUtil.CommonValidateFun(page);
                    }
                },
                this.maxRetries);

            httpManager.SetHeader("Upgrade-Insecure-Requests", "1");
            httpManager.AcceptEncoding = "gzip, deflate, sdch";
            httpManager.AcceptLanguage = "en-US,en;q=0.8";
            httpManager.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8";
            if (!string.IsNullOrEmpty(userAgent))
            {
                httpManager.UserAgent = userAgent;
            }
        }

        /// <summary>
        /// InitPattern
        /// </summary>
        private void InitPattern()
        {
            PatternContainerHelper.Load("Hungryhouse.pattern.xml");
        }

        /// <summary>
        /// The assembly entry point that controls the internal program flow.
        /// It is called by the Run() function in the base class
        /// <see cref="MajesticReader.Lib.JobBase"/>
        /// The program flow:
        /// 1. Get the job requests <see cref="MajesticReader.Lib.HitBoxJobRequest" /> based on JobId
        /// 2. For each request, get the input parameters
        /// 3. Retrieve the Html content
        /// 4. Identify and collect data based on the configuration settings for the request
        /// 5. Save collected data
        /// </summary>
        protected override void OnRun()
        {
            try
            {
                Initialize();
                int jobId = base.JobId;
                Log.RunId = base.RunId;
                HungryhouseDao.RunId = RunId;
                HungryhouseDao.RunDate = DateTime.Now;

                //get current job name
                string jobName = base.GetJobName();

                //Log start time
                Log.Info("Hungryhouse Started", string.Format(
                    "Job {0} - {1} Started at {2}", jobId, jobName, DateTime.Now));

                CollectLocation();

                //Log end time
                Log.Info("Hungryhouse Finished", string.Format(
                    "Job {0} - {1} Finished at {2}. {3} pages were crawled",
                    jobId, jobName, DateTime.Now, pageCrawled));
            }
            catch (Exception ex)
            {
                // This should have never happened. So it is "Unexpected"
                Log.Error("Unexpected/Unhandled Error", ex);
                throw new Exception("Unexpected/Unhandled Error", ex);
            }
        }

        /// <summary>
        /// CollectLocation
        /// </summary>
        private void CollectLocation()
        {
            Log.Info("Started Getting Locations", "Started Getting Locations");
            string page = DownloadPage(locationUrl);
            JobData locationData = ExtractData(page, "LocationArea", PatternContainerHelper.ToJobPatternCollection());
            JobDataCollection locationList = locationData.GetList();
            if (locationList.Count == 0)
            {
                Log.Warn("can not find locations", "can not find locations");
                return;
            }
            Log.Info("Locations", locationList.Count.ToString());

            foreach (JobData location in locationList)
            {
                string url = location.GetGroupData("Url").Value;
                string name = location.GetGroupData("Name").Value;
                if (string.IsNullOrEmpty(url) || string.IsNullOrEmpty(name))
                {
                    continue;
                }
                url = ComUtil.GetFullUrl(url, domainUrl);
                CollectRestaurant(name, url);
            }
            Log.Info("Finished Getting Locations", "Finished Getting Locations");
        }

        /// <summary>
        /// CollectRestaurant
        /// </summary>
        /// <param name="name"></param>
        /// <param name="url"></param>
        private void CollectRestaurant(string name, string url)
        {
            Log.Info("Started Getting Restaurant", string.Format("Location:{0},Url:{1}", name, url));
            string page = DownloadPage(url);
            JobData restaurantData = ExtractData(page, "RestaurantArea", PatternContainerHelper.ToJobPatternCollection());
            JobDataCollection restaurantList = restaurantData.GetList();
            if (restaurantList.Count == 0)
            {
                Log.Warn("can not find restaurant", string.Format("Location:{0},Url:{1}", name, url));
                return;
            }
            Log.Info("Restaurants", string.Format("Location:{0},Url:{1}:{2}", name, url, restaurantList.Count));

            foreach (JobData restaurant in restaurantList)
            {
                string tempUrl = restaurant.GetGroupData("Url").Value;
                string tempName = restaurant.GetGroupData("Name").Value;
                if (string.IsNullOrEmpty(tempUrl) || string.IsNullOrEmpty(tempName))
                {
                    continue;
                }
                tempUrl = ComUtil.GetFullUrl(tempUrl, domainUrl);
                CollectDetail(tempUrl, tempName);
            }
            Log.Info("Finished Getting Restaurant", string.Format("Location:{0},Url:{1}", name, url));
        }

        /// <summary>
        /// Collect detail
        /// </summary>
        /// <param name="url"></param>
        /// <param name="name"></param>
        private void CollectDetail(string url, string name)
        {
            string page = DownloadPage(url);
            Restaurant restaurant = new Restaurant();
            restaurant.Name = name;
            restaurant.Url = url;

            JobData restaurantDetailData = ExtractData(page, "RestaurantDetailArea", PatternContainerHelper.ToJobPatternCollection());
            restaurant.Address = restaurantDetailData.GetGroupData("Address").Value;
            restaurant.Postcode = restaurantDetailData.GetGroupData("Postcode").Value;
            string minimum = restaurantDetailData.GetGroupData("Minimum").Value;
            if (!string.IsNullOrEmpty(minimum) && minimum.ToLower().Contains("minimum"))
            {
                restaurant.Minimum = minimum;
            }

            try
            {
                HungryhouseDao.Instance.SaveRestaurant(restaurant);
            }
            catch (Exception ex)
            {
                Log.Error("Failed to save restaurant", url, ex);
            }
        }

        /// <summary>
        /// Downloads pages by taking sleeping time into consideration
        /// </summary>
        /// <param name="url">The url that the page is going to be downloaded from</param>
        /// <returns>The downloaded page from the specified url</returns>
        private string DownloadPage(string url)
        {
            string result = string.Empty;
            result = httpManager.DownloadPage(url);
            pageCrawled++;
            return result;
        }
    }
}
<?xml version="1.0"?>
<PatternContainer xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <Patterns>
    <!-- LocationArea -->
    <Pattern Name="LocationArea" Description="LocationArea" HtmlSelectorExpression=".CmsRestcatCityLandingLocations">
      <SubPatterns>
        <Pattern Name="Location" Description="Location" IsList="true" Field="Name,Url">
          <Expression>
            <![CDATA[
            <li[^>]*>\s*<a[^>]*href[^"]*"(?<Url>[^"]*)"[^>]*>\s*(?<Name>[^<]*)</a>
            ]]>
          </Expression>
        </Pattern>
      </SubPatterns>
    </Pattern>
    <!-- LocationArea -->
    <Pattern Name="RestaurantArea" Description="RestaurantArea" HtmlSelectorExpression=".CmsRestcatLanding.CmsRestcatLandingRestaurants.panel.mainRestaurantsList">
      <SubPatterns>
        <Pattern Name="Restaurant" Description="Restaurant" IsList="true" Field="Name,Url">
          <Expression>
            <![CDATA[
            <li[^>]*restaurantItemInfoName[^>]*>\s*<a[^>]*href[^"]*"(?<Url>[^"]*)"[^>]*>\s*<span>\s*(?<Name>[^<]*)</span>
            ]]>
          </Expression>
        </Pattern>
      </SubPatterns>
    </Pattern>
    <!-- RestaurantArea -->
    <Pattern Name="RestaurantDetailArea" Description="Restaurant Detail Area">
      <SubPatterns>
        <Pattern Name="Address" Description="Address" Field="Address" HtmlSelectorExpression="span[itemprop=streetAddress]" />
        <Pattern Name="Postcode" Description="Postcode" Field="Postcode" HtmlSelectorExpression="span[itemprop=postalCode]" />
        <Pattern Name="Minimum" Description="Minimum" Field="Minimum">
          <Expression>
            <![CDATA[
            <div[^>]*orderTypeCond[^>]*>\s*<p>[\s\S]*?<span[^>]*>\s*(?<Minimum>[^<]*)</span>
            ]]>
          </Expression>
        </Pattern>
      </SubPatterns>
    </Pattern>
  </Patterns>
</PatternContainer>
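
The two-step idea from Phase 5 (first narrow the page down to one DOM region, then run a small regular expression inside it) can also be shown in isolation. The sketch below is not the project's HtmlSelector-based implementation: as a stand-in it uses `System.Xml.Linq` on a well-formed XHTML snippet, and `TwoStepExtractor` is a hypothetical name. Real-world HTML is rarely well-formed XML, so a proper HTML parser would be needed in practice.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

// Hypothetical helper illustrating "select a region, then match details".
public static class TwoStepExtractor
{
    // Step 1: return the markup of the first element carrying the given CSS class.
    // Assumes the input is well-formed XML (a stand-in for a real DOM selector).
    public static string SelectByClass(string xhtml, string cssClass)
    {
        XDocument doc = XDocument.Parse(xhtml);
        XElement area = doc.Descendants().FirstOrDefault(e =>
            ((string)e.Attribute("class") ?? "").Split(' ').Contains(cssClass));
        return area == null ? "" : area.ToString(SaveOptions.DisableFormatting);
    }

    // Step 2: run a small regex inside the selected region only.
    // The pattern is the "Location" expression from the pattern file above.
    public static List<KeyValuePair<string, string>> ExtractLinks(string region)
    {
        var linkPattern = new Regex(
            @"<li[^>]*>\s*<a[^>]*href[^""]*""(?<Url>[^""]*)""[^>]*>\s*(?<Name>[^<]*)</a>");
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in linkPattern.Matches(region))
        {
            result.Add(new KeyValuePair<string, string>(
                m.Groups["Name"].Value.Trim(), m.Groups["Url"].Value));
        }
        return result;
    }
}

public static class TwoStepDemo
{
    public const string Page =
        "<html><body>" +
        "<ul class=\"nav\"><li><a href=\"/login\">Login</a></li></ul>" +
        "<div class=\"CmsRestcatCityLandingLocations\"><ul>" +
        "<li><a href=\"/london\">London</a></li>" +
        "<li><a href=\"/leeds\">Leeds</a></li>" +
        "</ul></div>" +
        "</body></html>";

    public static void Main()
    {
        string region = TwoStepExtractor.SelectByClass(Page, "CmsRestcatCityLandingLocations");
        foreach (KeyValuePair<string, string> link in TwoStepExtractor.ExtractLinks(region))
        {
            Console.WriteLine(link.Key + " -> " + link.Value);
        }
    }
}
```

The navigation link outside the selected region never reaches the regex, so the expression can stay short, and a layout change elsewhere on the page does not break it.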