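Build your cnblogs (博客园) blog directory with one click and download it locally
I recently read Wu Jun's The Beauty of Mathematics (数学之美). It is a good book, and I will leave it at that. After reading Chapter 9, on graph theory and web crawlers, I wanted to try writing one myself. I had always found crawlers impressive; search engines, after all, rely on them to fetch web pages. The first project that came to mind was building a directory (table of contents) for my own blog, which seemed small and simple enough, and that is how this article came about. Let me say up front that my implementation is simple and not technically sophisticated. Before reading on, you may want to look at my blog directory. The source code is shared below.
A brief introduction to how a web crawler works: given the address of a web page, download that page, analyze its content to extract every link in it, then download those pages and keep analyzing and downloading. In this way you can download a large number of pages on the Internet. The principle is that simple; a real implementation is much harder. I cannot go into depth here, so I will stick to the simple version.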
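The download, extract links, repeat loop described above can be sketched as a breadth-first traversal. This is only a minimal illustration, not the code used later in this article: the class name CrawlerSketch, the simplified link pattern (double-quoted absolute http/https links only) and the fetch delegate are all assumptions made for the example. Passing the downloader in as a delegate lets the loop be tried against in-memory pages before pointing it at a real site.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CrawlerSketch
{
    // Hypothetical, simplified pattern: only double-quoted absolute http(s) links.
    static readonly Regex LinkPattern =
        new Regex("href=\"(?<url>https?://[^\"#\\s]+)\"", RegexOptions.IgnoreCase);

    // Breadth-first crawl starting at startUrl, visiting at most maxPages pages.
    // fetch downloads one page and returns its HTML (null or empty on failure).
    public static List<string> Crawl(string startUrl, Func<string, string> fetch, int maxPages)
    {
        var visited = new HashSet<string> { startUrl }; // URLs already queued
        var order = new List<string>();                 // URLs in the order they were downloaded
        var queue = new Queue<string>();
        queue.Enqueue(startUrl);

        while (queue.Count > 0 && order.Count < maxPages)
        {
            string url = queue.Dequeue();
            order.Add(url);

            string html = fetch(url);
            if (string.IsNullOrEmpty(html)) continue;

            foreach (Match m in LinkPattern.Matches(html))
            {
                string link = m.Groups["url"].Value;
                // HashSet.Add returns false for a URL seen before, so no page is queued twice.
                if (visited.Add(link)) queue.Enqueue(link);
            }
        }
        return order;
    }
}
```

The visited set is what keeps the crawl from looping forever when pages link back to each other, and maxPages is a crude stop condition; a real crawler would also need politeness delays, robots.txt handling and URL normalization.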
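A brief analysis of the approach to building my blog directory: obtain the address and title of every article, then group the articles by category. The articles are in fact already categorized, so it is enough to get the list of categories and then fetch the articles under each one. That yields every article together with its category, and building the directory from there is easy.
Rejected idea 1: take the address of any one of my articles, download it, analyze the page to extract all the links in it, and filter them by some rules to keep only article addresses (that is, discard the useless links). Then follow those links crawler-style until all my articles are found. Since every article carries its category, the directory could be built quickly. However, cnblogs does not work the way I assumed. The downloaded article page is missing the section that links each article to its neighbors. That section ties all my articles together like a doubly linked list: knowing the address of one article, I could reach all of them through it. But the downloaded page simply does not contain it, so this approach, the one closest to a real crawler, was dropped.
Rejected idea 2: everyone's articles are listed page by page, so I could download those list pages and obtain all my articles. The problem is the same as above: frustratingly, the downloaded pages do not contain the article categories. I would have all the articles without knowing their categories, which is no use for building a directory. So this idea was dropped too.
There are of course many ways to meet such a simple requirement, so I settled on a method that is not very crawler-like. It is the one described above: get all the article categories, download the list of articles under each category, and build the directory. Getting the categories is simple. For my blog, for example:
Request this address: http://www.cnblogs.com/hlxs/mvc/blog/sidecolumn.aspx
Pass the parameter blogApp=hlxs (hlxs is my ID on cnblogs).
This returns all of my article categories. Fetching the articles under each category and building the directory is then straightforward. The whole process needs nothing more than a person's cnblogs ID, so I think calling it "one-click" is not an exaggeration.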
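Once the side-column HTML has been downloaded, picking out the category links amounts to keeping the anchors whose address contains "<ID>/category". The sketch below illustrates that filtering step only. The class name CategorySketch, the simplified anchor pattern (double-quoted href values) and the sample markup in the usage note are assumptions for illustration; the real side-column markup may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CategorySketch
{
    // Simplified anchor pattern (assumption): double-quoted href, any inner text.
    // RegexOptions.Singleline lets '.' match newlines inside the link text.
    static readonly Regex Anchor = new Regex(
        "<a\\b[^>]*?href=\"(?<url>[^\"]*)\"[^>]*>(?<text>.*?)</a>",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Returns (category name, category URL) pairs found in the side-column HTML.
    public static List<KeyValuePair<string, string>> ExtractCategories(string html, string userId)
    {
        var result = new List<KeyValuePair<string, string>>();
        foreach (Match m in Anchor.Matches(html))
        {
            string url = m.Groups["url"].Value;
            // Keep only links that point into this user's category section.
            if (url.IndexOf(userId + "/category", StringComparison.OrdinalIgnoreCase) >= 0)
            {
                result.Add(new KeyValuePair<string, string>(m.Groups["text"].Value.Trim(), url));
            }
        }
        return result;
    }
}
```

For example, given a fragment containing one link to https://www.cnblogs.com/hlxs/category/1.html and one link to an article under /hlxs/p/, ExtractCategories(html, "hlxs") would keep only the first.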
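If you would like to build your own blog directory, you can look at mine first. Building yours is simple: run the program and enter your cnblogs ID. It automatically generates a file named "我的博客目录.txt" ("My blog directory"); publish the contents of that file as HTML source.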
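The generated file is just HTML: one heading per category followed by a list of links. As a rough sketch of that output step (the class name DirectoryHtmlSketch and the bare markup without inline styles are assumptions; the full program below uses its own format strings), the directory could be assembled like this:

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

static class DirectoryHtmlSketch
{
    // postsByCategory maps a category name to its (title, URL) pairs.
    // Assumption: names and titles are plain text, not already HTML-encoded.
    public static string Build(IDictionary<string, List<KeyValuePair<string, string>>> postsByCategory)
    {
        var sb = new StringBuilder();
        foreach (var category in postsByCategory)
        {
            sb.AppendLine("<h2>" + WebUtility.HtmlEncode(category.Key) + "</h2>");
            sb.AppendLine("<ul>");
            foreach (var post in category.Value)
            {
                // Encode both the address and the title so that characters
                // such as & or < cannot break the generated markup.
                sb.AppendLine(string.Format(
                    "<li><a href='{0}' target='_blank'>{1}</a></li>",
                    WebUtility.HtmlEncode(post.Value),
                    WebUtility.HtmlEncode(post.Key)));
            }
            sb.AppendLine("</ul>");
        }
        return sb.ToString();
    }
}
```

If the titles are taken straight from downloaded HTML they may already be encoded, in which case they should be decoded first (or written as-is, as the original program does) to avoid double encoding.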
Source: https://www.cnblogs.com/hlxs/archive/2013/02/20/2918760.html
=======================================================================================
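Personal use
cnblogs has since been redesigned, and the approach above still uses the old URLs, for example the one for fetching post categories, which can no longer be reached. Looking at the page source, it is not hard to find code like the following: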
<div id="blog-sidecolumn"></div> <script>loadBlogSideColumn();</script> </div>
Trace the page in Chrome to see which path it requests to get the data. It turns out the data is now loaded via AJAX, from the following address:
https://www.cnblogs.com/hlxs/ajax/sidecolumn.aspx
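Try the path above and check whether it returns the category list.
In addition, the article list now lives under the p directory of the user's URL, for example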
https://www.cnblogs.com/hlxs/p/
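Check whether the post list can be reached there.
I also made small changes to the regular expressions in the code so that they tolerate line breaks, as follows:
Default: string regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>(?<text>(?:(?!</?a\b).)*)</a>";
Article categories: string regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n]?)(?<text>(?:(?!<\/?a\b).)*)([.\n]?)<\/a>";
Single article: string regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n \t]*)<span>(?<text>(?:(?!<\/).)*)(<\/span>)[\s.]*<\/a>";
Note that inside a character class the dot is a literal dot, so [.\n]? matches at most one dot or newline character, and [.\n \t]* matches a run of dots, newlines, spaces and tabs.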
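An alternative way to let a pattern cross line breaks is RegexOptions.Singleline, which makes an unbracketed dot match newline characters as well. The sketch below compares the two modes on a made-up anchor; the class name RegexNewlineSketch, the simplified pattern and the sample HTML are assumptions for illustration and are not the expressions used in the program.

```csharp
using System;
using System.Text.RegularExpressions;

static class RegexNewlineSketch
{
    // Simplified pattern (assumption): any <a ...> tag, then its inner text.
    const string Pattern = "<a\\b[^>]*>(?<text>.*?)</a>";

    // Returns the trimmed inner text of the first anchor, or null if nothing matches.
    public static string LinkText(string html, RegexOptions options)
    {
        Match m = Regex.Match(html, Pattern, options);
        return m.Success ? m.Groups["text"].Value.Trim() : null;
    }
}
```

With the input "<a href=\"/x\">\n  <span>Title</span>\n</a>", LinkText returns null under RegexOptions.None, because the dot stops at the first newline, and returns "<span>Title</span>" under RegexOptions.Singleline.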
Full code:
Version 1
Create a command-line project named GenerateDirectory and save the following code to Program.cs
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace GenerateDirectory
{
    using System.IO;
    using System.Net;
    using System.Security;
    using System.Text.RegularExpressions;
    using System.Threading;

    class Program
    {
        // Articles, keyed by category URL
        static Dictionary<string, List<DataItem>> archiveList = new Dictionary<string, List<DataItem>>();
        // Categories
        static List<DataItem> categoryList = new List<DataItem>();
        // User ID; the blog directory is built for this user
        static string userId = "hlxs";
        const string archive = "p";
        const string category = "category";

        static void Main(string[] args)
        {
            Console.WriteLine("请输入你的博客园ID后回车:"); // prompt: enter your cnblogs ID
            userId = Console.ReadLine();
            // Old category address, unreachable since the redesign:
            // "http://www.cnblogs.com/" + userId + "/mvc/blog/sidecolumn.aspx"
            string url = "http://www.cnblogs.com/" + userId + "/ajax/sidecolumn.aspx";
            string param = "blogApp=" + userId; // the category request needs the user ID
            Console.WriteLine("正在连接服务器,请稍候...");
            bool isOK = GetCategory(url, param); // fetch the categories
            if (!isOK)
            {
                // Invalid ID: print a joke countdown, then exit
                Console.Clear();
                Console.WriteLine("你输入的博客园ID不正确,系统将在10秒后爆炸,请注意安全!");
                Thread.Sleep(5000);
                for (int i = 5; i >= 1; i--)
                {
                    Thread.Sleep(1000);
                    Console.Clear();
                    Console.WriteLine(i);
                }
                Environment.Exit(0);
            }
            foreach (DataItem item in categoryList)
            {
                GetArchive(item.Value, string.Empty); // fetch the articles in each category
            }
            Console.Clear();
            CreateDir();
            Console.ReadKey();
        }

        // Generate the blog directory
        static void CreateDir()
        {
            IOHelper.DeleteFile(Environment.CurrentDirectory + "/我的博客目录.txt");
            string divFormat = "<h2 style='width:100%;float:left;margin-top:20px;background-color:#999999;color:White;padding-left:5px'>{0}</h2>";
            string aFormat = "<a href='{0}' target='_blank' title='{1}'>{1}</a>";
            string liFormat = "<li style='width:49%;float:left;line-height:30px;'>{0}</li>";
            string[] catelist = { "算法", "智力题", "C++", "读书", "分析", "C#", "Windows" }; // these categories are listed first
            for (int i = catelist.Length - 1; i >= 0; i--)
            {
                DataItem item = categoryList.FirstOrDefault(m => m.Text.Contains(catelist[i]));
                if (item != null)
                {
                    categoryList.Remove(item);
                    categoryList.Insert(0, item);
                }
            }
            foreach (DataItem categoryItem in categoryList)
            {
                Console.WriteLine(categoryItem.Text);
                WriteLine(string.Format(divFormat, categoryItem.Text));
                List<DataItem> list;
                if (archiveList.TryGetValue(categoryItem.Value, out list))
                {
                    WriteLine("<ul style='padding-top:10px;clear:both;list-style:none;'>");
                    foreach (DataItem archiveItem in list)
                    {
                        Console.WriteLine("\t" + archiveItem.Text);
                        WriteLine(string.Format(liFormat, string.Format(aFormat, archiveItem.Value, archiveItem.Text)));
                    }
                    WriteLine("</ul>");
                    Console.WriteLine();
                }
            }
            Console.WriteLine("\r\n\r\n博客园生成Html代码已经完成,请查看 我的博客目录.txt\r\n");
        }

        // Append one line to the output file
        private static void WriteLine(string content)
        {
            IOHelper.WriteLine(Environment.CurrentDirectory + "/我的博客目录.txt", content);
        }

        // Open the response stream for url; sends a POST when param is non-empty
        private static Stream GetStream(int timeout, string url, string param)
        {
            try
            {
                HttpWebRequest req = (HttpWebRequest)WebRequest.Create(url);
                req.Timeout = timeout; // the original accepted this parameter but never applied it
                if (!string.IsNullOrEmpty(param))
                {
                    req.Method = "POST";
                    req.ContentType = "application/x-www-form-urlencoded";
                    byte[] bs = Encoding.ASCII.GetBytes(param);
                    req.ContentLength = bs.Length;
                    using (Stream reqStream = req.GetRequestStream())
                    {
                        reqStream.Write(bs, 0, bs.Length);
                    }
                }
                HttpWebResponse HttpWResp = (HttpWebResponse)req.GetResponse();
                return HttpWResp.GetResponseStream();
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
                return null;
            }
        }

        // Download a page and return its content
        private static string GetContent(string url, string param)
        {
            using (Stream myStream = GetStream(24000, url, param))
            {
                if (myStream == null)
                {
                    return string.Empty;
                }
                using (StreamReader sr = new StreamReader(myStream, Encoding.UTF8))
                {
                    if (sr.Peek() > 0)
                    {
                        return sr.ReadToEnd();
                    }
                }
                return string.Empty;
            }
        }

        // Download url and return all matches of reg (or of the default link pattern)
        static MatchCollection Filter(string url, string param, string reg)
        {
            string content = GetContent(url, param);
            if (string.IsNullOrEmpty(content))
            {
                return null;
            }
            string regex = reg.Length > 0 ? reg : @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>(?<text>(?:(?!</?a\b).)*)</a>";
            Regex rx = new Regex(regex);
            return rx.Matches(content);
        }

        // Fetch the articles listed on one category page
        static void GetArchive(string url, string param)
        {
            string regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n \t]*)<span>(?<text>(?:(?!<\/).)*)(<\/span>)[\s.]*<\/a>";
            MatchCollection matchs = Filter(url, param, regex);
            if (matchs == null)
            {
                return;
            }
            foreach (Match m in matchs)
            {
                string curUrl = m.Groups["url"].Value;
                // Keep only links into the user's article directory, without fragments
                if (curUrl.IndexOf(userId + "/" + archive) >= 0 && curUrl.IndexOf('#') < 0)
                {
                    DataItem item = new DataItem { Text = m.Groups["text"].Value, Value = curUrl };
                    if (!archiveList.ContainsKey(url))
                    {
                        archiveList.Add(url, new List<DataItem> { item });
                    }
                    else if (archiveList[url].FirstOrDefault(archiveItem => archiveItem.Value == curUrl) == null)
                    {
                        archiveList[url].Add(item);
                    }
                }
            }
            DataItem curCategory = categoryList.FirstOrDefault(m => m.Value == url);
            if (curCategory != null)
            {
                Console.WriteLine(" " + curCategory.Text);
            }
        }

        // Fetch the article categories
        static bool GetCategory(string url, string param)
        {
            string regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n]?)(?<text>(?:(?!<\/?a\b).)*)([.\n]?)<\/a>";
            MatchCollection matchs = Filter(url, param, regex);
            if (matchs == null)
            {
                return false;
            }
            Console.Clear();
            Console.WriteLine("已完成您的博客分类:");
            foreach (Match m in matchs)
            {
                if (m.Groups["url"].Value.IndexOf(userId + "/" + category) >= 0)
                {
                    categoryList.Add(new DataItem { Text = m.Groups["text"].Value.Trim(), Value = m.Groups["url"].Value });
                }
            }
            return categoryList.Count > 0;
        }
    }

    class IOHelper
    {
        public static bool Exists(string fileName)
        {
            return !string.IsNullOrWhiteSpace(fileName) && File.Exists(fileName);
        }

        public static bool DeleteFile(string fileName)
        {
            if (Exists(fileName))
            {
                File.Delete(fileName);
                return true;
            }
            return false;
        }

        // Append one line of UTF-8 text to fileName
        public static bool WriteLine(string fileName, string content)
        {
            using (FileStream fileStream = new FileStream(fileName, FileMode.Append))
            {
                if (!fileStream.CanWrite)
                {
                    throw new SecurityException("文件fileName=" + fileName + "是只读文件不能写入!");
                }
                using (StreamWriter streamWriter = new StreamWriter(fileStream, Encoding.UTF8))
                {
                    streamWriter.WriteLine(content);
                }
                return true;
            }
        }
    }

    class DataItem
    {
        public string Text { get; set; }
        public string Value { get; set; }
    }
}
Version 2
Improvement: handles article lists that span multiple pages
using System; using System.Collections.Generic; using System.Linq; using System.Text; using System.Threading.Tasks; namespace GenerateDirectory { using System.IO; using System.Net; using System.Security; using System.Text.RegularExpressions; using System.Threading; class Program { static Dictionary<string, List<DataItem>> archiveList = new Dictionary<string, List<DataItem>>();//文章列表 static List<DataItem> categoryList = new List<DataItem>();//分类列表 static string userId = "hlxs";//用户ID,根据这个ID获取这个人的博客目录 const string archive = "p"; const string category = "category"; const string outFile = "我的博客目录.txt"; static void Main(string[] args) { Console.WriteLine("请输入你的博客园ID后回车:"); userId = Console.ReadLine(); string url = "http://www.cnblogs.com/" + userId + "/mvc/blog/sidecolumn.aspx";//获取博客分类地址 url = "http://www.cnblogs.com/" + userId + "/ajax/sidecolumn.aspx"; string param = "blogApp=" + userId;//获取你的博客分类需要你的ID Console.WriteLine("正在连接服务器,请稍候..."); bool isOK = GetCategory(url, param);//获取博客分类 if (!isOK) { Console.Clear(); Console.WriteLine("你输入的博客园ID不正确,系统将在10秒后爆炸,请注意安全!"); Thread.Sleep(5000); for (int i = 5; i >= 1; i--) { Thread.Sleep(1000); Console.Clear(); Console.WriteLine(i); } Environment.Exit(0); } foreach (DataItem item in categoryList) { GetArchive(item.Value, string.Empty);//获取分类中的博客 } Console.Clear(); CreateDir(); Console.ReadKey(); } //生成博客目录 static void CreateDir() { //IOHelper.DeleteFile(Environment.CurrentDirectory + "/" + outFile); IOHelper.DeleteFile(AppDomain.CurrentDomain.BaseDirectory + outFile); string divFormat = "<h2 style='width:100%;float:left;margin-top:20px;background-color:#999999;color:White;padding-left:5px'>{0}</h2>"; string aFormat = "<a href='{0}' target='_blank' title='{1}'>{1}</a>"; string liFormat = "<li style='width:49%;float:left;line-height:30px;'>{0}</li>"; WriteLine(string.Format(divFormat, "生成说明")); WriteLine($"<p>本页面是使用命令行工具{AppDomain.CurrentDomain.FriendlyName}程序生成。可参考一键构造你的博客园目录文章说明。</p>"); string[] catelist = { "算法", "智力题", 
"C++", "读书", "分析", "C#", "Windows" };//这些排在前面 for (int i = catelist.Length - 1; i >= 0; i--) { DataItem item = categoryList.FirstOrDefault(m => m.Text.Contains(catelist[i])); if (item != null) { categoryList.Remove(item); categoryList.Insert(0, item); } } foreach (DataItem categoryItem in categoryList) { Console.WriteLine(categoryItem.Text); WriteLine(string.Format(divFormat, categoryItem.Text)); List<DataItem> list; if (archiveList.TryGetValue(categoryItem.Value, out list)) { WriteLine("<ul style='padding-top:10px;clear:both;list-style:none;'>"); foreach (DataItem archiveItem in list) { Console.WriteLine("\t" + archiveItem.Text); WriteLine(string.Format(liFormat, string.Format(aFormat, archiveItem.Value, archiveItem.Text))); } WriteLine("</ul>"); Console.WriteLine(); } } Console.WriteLine($"\r\n\r\n博客园生成Html代码已经完成,请查看:{Environment.NewLine + AppDomain.CurrentDomain.BaseDirectory + outFile}\r\n"); } private static void WriteLine(string content) { IOHelper.WriteLine(Environment.CurrentDirectory + "/我的博客目录.txt", content); } private static Stream GetStream(int timeout, string url, string param) { try { HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(url); if (!string.IsNullOrEmpty(param)) { req.Method = "POST"; req.ContentType = "application/x-www-form-urlencoded"; byte[] bs = Encoding.ASCII.GetBytes(param); req.ContentLength = bs.Length; using (Stream reqStream = req.GetRequestStream()) { reqStream.Write(bs, 0, bs.Length); } } HttpWebResponse HttpWResp = (HttpWebResponse)req.GetResponse(); return HttpWResp.GetResponseStream(); } catch (Exception ex) { Console.WriteLine(ex.Message); return null; } } //获取页面的内容 private static string GetContent(string url, string param) { using (Stream myStream = GetStream(24000, url, param)) { if (myStream == null) { return string.Empty; } using (StreamReader sr = new StreamReader(myStream, Encoding.UTF8)) { if (sr.Peek() > 0) { return sr.ReadToEnd(); } } return string.Empty; } } static MatchCollection Filter(string url, 
string param, string reg) { string content = GetContent(url, param); if (string.IsNullOrEmpty(content)) { return null; } string regex = reg.Length > 0 ? reg : @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>(?<text>(?:(?!</?a\b).)*)</a>"; Regex rx = new Regex(regex); return rx.Matches(content); } static MatchCollection Filter(string pageContent, string reg) { string content = pageContent; if (string.IsNullOrEmpty(content)) { return null; } string regex = reg.Length > 0 ? reg : @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>(?<text>(?:(?!</?a\b).)*)</a>"; Regex rx = new Regex(regex); return rx.Matches(content); } //获取文章 static void GetArchive(string url, string param) { string regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n \t]*)<span>(?<text>(?:(?!<\/).)*)(<\/span>)[\s.]*<\/a>"; regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n \t]*)<span(.*)>(?<text>(?:(?!<\/).)*)(<\/span>)[\s.]*<\/a>"; string pContent = GetContent(url, param); MatchCollection matchs = Filter(pContent, regex); if (matchs == null) { return; } foreach (Match m in matchs) { string curUrl = m.Groups["url"].Value; if (curUrl.IndexOf(userId + "/" + archive) >= 0 && curUrl.IndexOf('#') < 0) { DataItem item = new DataItem { Text = m.Groups["text"].Value, Value = curUrl }; if (!archiveList.ContainsKey(url)) { archiveList.Add(url, new List<DataItem> { item }); } else { if (archiveList[url].FirstOrDefault(archiveItem => archiveItem.Value == curUrl) == null) { archiveList[url].Add(item); } } } } MatchCollection havePageCont = Filter(pContent, @"<div class=""pager"">"); if (havePageCont != null && havePageCont.Count > 0) { int pageMax = 0; MatchCollection pageCollection = Filter(pContent, url + @"\?page=(\d+)"); foreach (Match p in pageCollection) { int pNum = 0; int.TryParse(p.Groups[1].Value, out pNum); pageMax = pNum > pageMax ? 
pNum : pageMax; } for (int i = 2; i <= pageMax; i++) { GetArchiveSub(url + "?page=" + i, regex); } } DataItem curCategory = categoryList.FirstOrDefault(m => m.Value == url); if (curCategory != null) { Console.WriteLine(" " + curCategory.Text); } } static void GetArchiveSub(string url, string reg) { MatchCollection matchs = Filter(url, "", reg); if (matchs == null) { return; } foreach (Match m in matchs) { string curUrl = m.Groups["url"].Value; if (curUrl.IndexOf(userId + "/" + archive) >= 0 && curUrl.IndexOf('#') < 0) { DataItem item = new DataItem { Text = m.Groups["text"].Value, Value = curUrl }; string categoryUrl = url.Split('?')[0]; if (!archiveList.ContainsKey(categoryUrl)) { archiveList.Add(categoryUrl, new List<DataItem> { item }); } else { if (archiveList[categoryUrl].FirstOrDefault(archiveItem => archiveItem.Value == curUrl) == null) { archiveList[categoryUrl].Add(item); } } } } } //获取文章分类 static bool GetCategory(string url, string param) { string regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n]?)(?<text>(?:(?!<\/?a\b).)*)([.\n]?)<\/a>"; MatchCollection matchs = Filter(url, param, regex); if (matchs == null) { return false; } Console.Clear(); Console.WriteLine("已完成您的博客分类:"); foreach (Match m in matchs) { if (m.Groups["url"].Value.IndexOf(userId + "/" + category) >= 0) { categoryList.Add(new DataItem { Text = m.Groups["text"].Value.Trim(), Value = m.Groups["url"].Value }); } } return categoryList.Count == 0 ? 
false : true; } } class IOHelper { public static bool Exists(string fileName) { if (fileName == null || fileName.Trim() == "") { return false; } if (File.Exists(fileName)) { return true; } return false; } public static bool DeleteFile(string fileName) { if (Exists(fileName)) { File.Delete(fileName); return true; } return false; } public static bool WriteLine(string fileName, string content) { using (FileStream fileStream = new FileStream(fileName, FileMode.Append)) { lock (fileStream) { if (!fileStream.CanWrite) { throw new SecurityException("文件fileName=" + fileName + "是只读文件不能写入!"); } StreamWriter streamWriter = new StreamWriter(fileStream, Encoding.UTF8); streamWriter.WriteLine(content); streamWriter.Dispose(); streamWriter.Close(); return true; } } } } class DataItem { public string Text { get; set; } public string Value { get; set; } } }
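The pagination step in version 2 looks for links of the form <category URL>?page=N and then requests pages 2 through the largest N. A standalone sketch of that step follows. The class name PagerSketch and its method names are assumptions; unlike the listing above, it passes the category URL through Regex.Escape so that characters such as the dot are matched literally.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PagerSketch
{
    // Largest page number linked from the first page of a category (1 if there is no pager).
    public static int MaxPage(string html, string categoryUrl)
    {
        // Regex.Escape keeps '.' and other metacharacters in the URL literal.
        var pattern = new Regex(Regex.Escape(categoryUrl) + @"\?page=(\d+)");
        int max = 1;
        foreach (Match m in pattern.Matches(html))
        {
            int n;
            if (int.TryParse(m.Groups[1].Value, out n) && n > max) max = n;
        }
        return max;
    }

    // URLs of pages 2..max; page 1 is the category URL itself.
    public static List<string> RemainingPageUrls(string html, string categoryUrl)
    {
        var urls = new List<string>();
        int max = MaxPage(html, categoryUrl);
        for (int i = 2; i <= max; i++)
        {
            urls.Add(categoryUrl + "?page=" + i);
        }
        return urls;
    }
}
```

Each URL returned by RemainingPageUrls would then be downloaded and scanned with the same article pattern as the first page, with the results stored under the category URL without its query string.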
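Version 3
Improvement: can download a given user's posts to local files, and rewrites css, img and similar references so that they point to local copies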
using System; using System.Collections.Generic; using System.Linq; using System.Text; using System.Threading.Tasks; namespace GenerateDirectory { using System.IO; using System.Net; using System.Security; using System.Text.RegularExpressions; using System.Threading; class Program { static Dictionary<string, List<DataItem>> archiveList = new Dictionary<string, List<DataItem>>();//文章列表 static List<DataItem> categoryList = new List<DataItem>();//分类列表 static string userId = "hlxs";//用户ID,根据这个ID获取这个人的博客目录 const string archive = "p"; const string category = "category"; static string outDictionaryFile = "我的博客目录.txt.html"; static bool IsOutDictionary = true;//是否生成并保存博客园目录 static bool IsOutBlogFile = false;//是否生成并保存每篇博文 static string logFilePath = AppDomain.CurrentDomain.BaseDirectory + AppDomain.CurrentDomain.FriendlyName + DateTime.Now.ToString("yyyyMM") + ".log"; static void Main(string[] args) { try { IOHelper.WriteLog(logFilePath, $"【{DateTime.Now.ToString("yyyy-MM-dd HH:mm:ss")}】=================================================="); string tmp = ""; IOHelper.WriteLog(logFilePath, "请输入你的博客园ID后回车:"); userId = Console.ReadLine(); outDictionaryFile = userId + "/" + outDictionaryFile; IOHelper.WriteLog(logFilePath, "是否生成并保存博客园目录?(默认是true)"); tmp = Console.ReadLine(); IsOutDictionary = tmp.Length > 0 ? bool.Parse(tmp) : true; IOHelper.WriteLog(logFilePath, "是否生成并保存博客园目录:" + IsOutDictionary); IOHelper.WriteLog(logFilePath, "是否生成并保存每篇博文?(默认是false)"); tmp = Console.ReadLine(); IsOutBlogFile = tmp.Length > 0 ? 
bool.Parse(tmp) : false; IOHelper.WriteLog(logFilePath, "是否生成并保存每篇博文:" + IsOutBlogFile); } catch (Exception ex) { Console.BackgroundColor = ConsoleColor.DarkYellow; Console.ForegroundColor = ConsoleColor.White; IOHelper.WriteLog(logFilePath, "【Error】请输入错误,请检查输入的参数。"); Console.ResetColor(); Environment.Exit(0); } string url = "http://www.cnblogs.com/" + userId + "/mvc/blog/sidecolumn.aspx";//获取博客分类地址 url = "http://www.cnblogs.com/" + userId + "/ajax/sidecolumn.aspx"; string param = "blogApp=" + userId;//获取你的博客分类需要你的ID IOHelper.WriteLog(logFilePath, "正在连接服务器,请稍候..."); bool isOK = GetCategory(url, param);//获取博客分类 if (!isOK) { IOHelper.WriteLog(logFilePath, "你输入的博客园ID不正确,系统将在10秒后爆炸,请注意安全!"); Thread.Sleep(5000); for (int i = 5; i >= 0; i--) { Thread.Sleep(1000); Console.Clear(); IOHelper.WriteLog(logFilePath, i.ToString()); } Console.Clear(); IOHelper.WriteLog(logFilePath, "哈哈!"); Environment.Exit(0); } foreach (DataItem item in categoryList) { GetArchive(item.Value, string.Empty);//获取分类中的博客 } Console.Clear(); if (IsOutDictionary) CreateDir(); if (IsOutBlogFile) downBlog(); IOHelper.WriteLog(logFilePath, $"【{DateTime.Now.ToString("yyyy-MM-dd HH:mm:ss")}】程序执行完成!按任意键退出!"); Console.ReadKey(); } //生成博客目录 static void CreateDir() { //IOHelper.DeleteFile(AppDomain.CurrentDomain.BaseDirectory + outDictionaryFile); IOHelper.DeleteDir(AppDomain.CurrentDomain.BaseDirectory + outDictionaryFile); string divFormat = "<h2 style='width:100%;float:left;margin-top:20px;background-color:#999999;color:White;padding-left:5px'>{0}</h2>"; string liFormat = "<li style='width:49%;float:left;line-height:30px;'>{0} {1}</li>"; string aFormat = "<a href='{0}' target='_blank' title='{1}'>{1}</a>"; string aLocalFormat = "<a href='{0}' target='_blank' title='{1}'>{2}</a>"; WriteBlogList(string.Format(divFormat, "生成说明")); WriteBlogList($"<p>本页面是使用命令行工具{AppDomain.CurrentDomain.FriendlyName}程序生成。可参考一键构造你的博客园目录文章说明。</p>"); string[] catelist = { "算法", "智力题", "C++", "读书", "分析", "C#", "Windows" };//这些排在前面 
for (int i = catelist.Length - 1; i >= 0; i--) { DataItem item = categoryList.FirstOrDefault(m => m.Text.Contains(catelist[i])); if (item != null) { categoryList.Remove(item); categoryList.Insert(0, item); } } foreach (DataItem categoryItem in categoryList) { string categoryName = utilityHelper.HtmlDecode(categoryItem.Text); IOHelper.WriteLog(logFilePath, categoryName); WriteBlogList(string.Format(divFormat, categoryName)); List<DataItem> list; if (archiveList.TryGetValue(categoryItem.Value, out list)) { WriteBlogList("<ul style='padding-top:10px;clear:both;list-style:none;'>"); foreach (DataItem archiveItem in list) { string blogNam = utilityHelper.HtmlDecode(archiveItem.Text); IOHelper.WriteLog(logFilePath, "\t" + blogNam); var aHtml = string.Format(aFormat, archiveItem.Value, blogNam); var aLocalHtml = string.Format(aLocalFormat, categoryName + "/" + blogNam + ".html", blogNam, "本地"); WriteBlogList(string.Format(liFormat, aHtml, (IsOutBlogFile ? aLocalHtml : ""))); } WriteBlogList("</ul>"); Console.WriteLine(); } } IOHelper.WriteLog(logFilePath, $"\r\n\r\n博客园生成Html代码已经完成,请查看:{Environment.NewLine + AppDomain.CurrentDomain.BaseDirectory + outDictionaryFile}\r\n"); } //写入博客列表文件 private static void WriteBlogList(string content) { IOHelper.WriteLine(AppDomain.CurrentDomain.BaseDirectory + outDictionaryFile, content); } //写入博客内容到文件 private static void WriteBlogToFile(string CategoryName, string bLogName, string content) { string fPath = AppDomain.CurrentDomain.BaseDirectory + userId + "/" + CategoryName + "/" + bLogName + ".html"; if (!IOHelper.ExistsDir(fPath, true)) { Directory.CreateDirectory(Path.GetDirectoryName(fPath)); } IOHelper.WriteLine(fPath, content); } private static Stream GetStream(int timeout, string url, string param) { try { //HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(url); //FileWebRequest req = (FileWebRequest)HttpWebRequest.Create(url); WebRequest req = WebRequest.Create(url); if (!string.IsNullOrEmpty(param)) { req.Method = 
"POST"; req.ContentType = "application/x-www-form-urlencoded"; byte[] bs = Encoding.ASCII.GetBytes(param); req.ContentLength = bs.Length; using (Stream reqStream = req.GetRequestStream()) { reqStream.Write(bs, 0, bs.Length); } } //HttpWebResponse HttpWResp = (HttpWebResponse)req.GetResponse(); WebResponse HttpWResp = req.GetResponse(); return HttpWResp.GetResponseStream(); } catch (Exception ex) { IOHelper.WriteLog(logFilePath, $"request : {(url.Length > 100 ? url.Substring(0, 100) + "..." : url)}"); IOHelper.WriteLog(logFilePath, $"param : {param}"); Console.BackgroundColor = ConsoleColor.Red; //设置背景色 Console.ForegroundColor = ConsoleColor.White; IOHelper.WriteLog(logFilePath, "\t【Error】" + ex.Message); Console.ResetColor(); return null; } } //获取页面的内容 private static string GetContent(string url, string param) { using (Stream myStream = GetStream(24000, url, param)) { if (myStream == null) { return string.Empty; } using (StreamReader sr = new StreamReader(myStream, Encoding.UTF8)) { if (sr.Peek() > 0) { return sr.ReadToEnd(); } } return string.Empty; } } private static bool downResource(string url, string localFilePath) { bool ret = false; string tmpUrl = url.Split('?')[0]; string ext = IOHelper.GetFileExtension(tmpUrl); if (ext == ".html") { return ret; } using (Stream myStream = GetStream(24000, tmpUrl, "")) { if (myStream != null) { // 确保输入流的位置在起始点 if (myStream.CanSeek) myStream.Seek(0, SeekOrigin.Begin); if (!IOHelper.ExistsDir(localFilePath, true)) Directory.CreateDirectory(Path.GetDirectoryName(localFilePath)); using (FileStream fileStream = new FileStream(localFilePath, FileMode.Create, FileAccess.Write)) { myStream.CopyTo(fileStream); fileStream.Close(); } ret = true; } } return ret; } static MatchCollection Filter(string url, string param, string reg) { string content = GetContent(url, param); if (string.IsNullOrEmpty(content)) { return null; } string regex = reg.Length > 0 ? 
reg : @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>(?<text>(?:(?!</?a\b).)*)</a>"; Regex rx = new Regex(regex); return rx.Matches(content); } static MatchCollection Filter(string pageContent, string reg) { string content = pageContent; if (string.IsNullOrEmpty(content)) { return null; } string regex = reg.Length > 0 ? reg : @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>(?<text>(?:(?!</?a\b).)*)</a>"; Regex rx = new Regex(regex); return rx.Matches(content); } //获取单分类下的文章列表 static void GetArchive(string url, string param) { string regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n \t]*)<span>(?<text>(?:(?!<\/).)*)(<\/span>)[\s.]*<\/a>"; regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n \t]*)<span(.*)>(?<text>(?:(?!<\/).)*)(<\/span>)[\s.]*<\/a>"; string pContent = GetCategorySubList(url, param, regex); MatchCollection havePageCont = Filter(pContent, @"<div class=""pager"">"); if (havePageCont != null && havePageCont.Count > 0) { int pageMax = 0; MatchCollection pageCollection = Filter(pContent, url + @"\?page=(\d+)"); foreach (Match p in pageCollection) { int pNum = 0; int.TryParse(p.Groups[1].Value, out pNum); pageMax = pNum > pageMax ? 
pNum : pageMax; } for (int i = 2; i <= pageMax; i++) { //GetArchiveSub(url + "?page=" + i, regex); GetCategorySubList(url + "?page=" + i, "", regex); } } DataItem curCategory = categoryList.FirstOrDefault(m => m.Value == url); if (curCategory != null) { IOHelper.WriteLog(logFilePath, $"\t{curCategory.Text}"); } } static void GetArchiveSub(string url, string reg) { MatchCollection matchs = Filter(url, "", reg); if (matchs == null) { return; } foreach (Match m in matchs) { string curUrl = m.Groups["url"].Value; if (curUrl.IndexOf(userId + "/") >= 0 && curUrl.IndexOf('#') < 0) { DataItem item = new DataItem { Text = m.Groups["text"].Value, Value = curUrl }; string categoryUrl = url.Split('?')[0]; if (!archiveList.ContainsKey(categoryUrl)) { archiveList.Add(categoryUrl, new List<DataItem> { item }); } else { if (archiveList[categoryUrl].FirstOrDefault(archiveItem => archiveItem.Value == curUrl) == null) { archiveList[categoryUrl].Add(item); } } } } } //根据类别的url获取类别下的文章列表 static string GetCategorySubList(string url, string param, string reg) { string pContent = GetContent(url, param); MatchCollection matchs = Filter(pContent, reg); string urlKey = url.Split('?')[0]; if (matchs == null) { return null; } foreach (Match m in matchs) { string curUrl = m.Groups["url"].Value; if (curUrl.IndexOf(userId + "/") >= 0 && curUrl.IndexOf('#') < 0) { DataItem item = new DataItem { Text = m.Groups["text"].Value, Value = curUrl }; if (!archiveList.ContainsKey(urlKey)) { archiveList.Add(urlKey, new List<DataItem> { item }); } else { if (archiveList[urlKey].FirstOrDefault(archiveItem => archiveItem.Value == curUrl) == null) { if (!archiveList[urlKey].Exists(t => t.Value == curUrl)) archiveList[urlKey].Add(item); } } } } return pContent; } //获取文章分类 static bool GetCategory(string url, string param) { string regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n]?)(?<text>(?:(?!<\/?a\b).)*)([.\n]?)<\/a>"; MatchCollection matchs = Filter(url, param, regex); if (matchs == 
null) { return false; } Console.Clear(); IOHelper.WriteLog(logFilePath, "已完成您的博客分类:"); foreach (Match m in matchs) { if (m.Groups["url"].Value.IndexOf(userId + "/" + category) >= 0) { categoryList.Add(new DataItem { Text = m.Groups["text"].Value.Trim(), Value = m.Groups["url"].Value }); } } return categoryList.Count == 0 ? false : true; } //根据url获取某篇博文 static bool downBlog() { string localUserPath = AppDomain.CurrentDomain.BaseDirectory + userId; //IOHelper.DeleteDir(localUserPath); foreach (DataItem categoryItem in categoryList) { string categoryName = utilityHelper.HtmlDecode(categoryItem.Text); IOHelper.WriteLog(logFilePath, "=>" + categoryName); List<DataItem> list; if (archiveList.TryGetValue(categoryItem.Value, out list)) { foreach (DataItem archiveItem in list) { string blogNam = utilityHelper.HtmlDecode(archiveItem.Text); IOHelper.WriteLog(logFilePath, "\t=>" + blogNam); string pContent = GetContent(archiveItem.Value, ""); var bResource = getBlogResource(pContent); foreach (var item in bResource.Item1)//下载css资源 { string cssUrl = item.Split('?')[0]; string cssLocalPath = localUserPath + "\\css\\" + Path.GetFileName(cssUrl); if (!IOHelper.ExistsFile(cssLocalPath)) { if (cssUrl.IndexOf("//") == 0) cssUrl = "http:" + cssUrl; if (cssUrl.IndexOf("/") == 0 && cssUrl.Substring(1) != "/") cssUrl = "https://www.cnblogs.com" + cssUrl; var isDown = downResource(cssUrl, cssLocalPath); if (isDown) { pContent = pContent.Replace(item, cssLocalPath.Replace(localUserPath, "..")); } } else { pContent = pContent.Replace(item, cssLocalPath.Replace(localUserPath, "..")); } } foreach (var item in bResource.Item2)//下载img资源 { string imgUrl = item.Split('?')[0]; string imgLocalPath = localUserPath + $"\\{categoryName}\\imgs\\{Path.GetFileName(imgUrl)}"; bool isDown = false; if (!IOHelper.ExistsFile(imgLocalPath)) { if (imgUrl.IndexOf("//") == 0) imgUrl = "http:" + imgUrl; if (imgUrl.IndexOf("/") == 0 && imgUrl.Substring(1) != "/") imgUrl = "https://www.cnblogs.com" + imgUrl; isDown 
= downResource(imgUrl, imgLocalPath); if (isDown) pContent = pContent.Replace(item, imgLocalPath.Replace(localUserPath + "\\" + categoryName, ".")); } else { pContent = pContent.Replace(item, imgLocalPath.Replace(localUserPath + "\\" + categoryName, ".")); } } WriteBlogToFile(categoryName, blogNam, pContent); } Console.WriteLine(); } } return false; } //根据url获取博文的图片 static Tuple<List<string>, List<string>> getBlogResource(string blogContent) { List<string> cssList = new List<string>(); List<string> imgList = new List<string>(); string regCss = "<link[^>]*?href=\"(?<url>[^>]*?)\"[^>]*?>"; regCss = @"<link(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>"; MatchCollection cssMatchs = Filter(blogContent, regCss); if (cssMatchs != null) { foreach (Match m in cssMatchs) { cssList.Add(m.Groups["url"].Value); } } string regImg = @"<img(?:(?!src=).)*src=(['""]?)(?<url>[^""\s>]*)\1[^>]*>"; MatchCollection imgMatchs = Filter(blogContent, regImg); if (imgMatchs != null) { foreach (Match m in imgMatchs) { imgList.Add(m.Groups["url"].Value); } } return new Tuple<List<string>, List<string>>(cssList, imgList); } } class utilityHelper { public static string HtmlDecode(string htmlCode) { string tmp = WebUtility.HtmlDecode(htmlCode); tmp = tmp.Replace('\\', '_'); tmp = tmp.Replace('/', '_'); tmp = tmp.Replace(':', '_'); tmp = tmp.Replace('*', '_'); tmp = tmp.Replace('?', '_'); tmp = tmp.Replace(':', '_'); tmp = tmp.Replace('<', '_'); tmp = tmp.Replace('>', '_'); tmp = tmp.Replace('|', '_'); return tmp; } } class IOHelper { public static bool ExistsFile(string fileName) { if (fileName == null || fileName.Trim() == "") { return false; } if (File.Exists(fileName)) { return true; } return false; } public static bool ExistsDir(string localPath, bool isFile = false) { string dPath = localPath; if (isFile) { dPath = Path.GetDirectoryName(dPath); } return Directory.Exists(dPath); } public static bool DeleteFile(string fileName) { if (ExistsFile(fileName)) { File.Delete(fileName); return 
true; } return false; } public static bool DeleteDir(string dirPath) { string tmp = dirPath; if (!string.IsNullOrEmpty(Path.GetExtension(tmp))) { tmp = Path.GetDirectoryName(tmp); } var di = new DirectoryInfo(tmp); if (di.Exists) { di.Delete(true); return true; } else { return false; } } public static bool WriteLine(string fileName, string content) { if (!ExistsDir(fileName, true)) { Directory.CreateDirectory(Path.GetDirectoryName(fileName)); } using (FileStream fileStream = new FileStream(fileName, FileMode.Append)) { lock (fileStream) { if (!fileStream.CanWrite) { throw new SecurityException("文件fileName=" + fileName + "是只读文件不能写入!"); } StreamWriter streamWriter = new StreamWriter(fileStream, Encoding.UTF8); streamWriter.WriteLine(content); streamWriter.Dispose(); streamWriter.Close(); return true; } } } public static string GetFileExtension(string filePath) { return Path.GetExtension(filePath); } public static void WriteLog(string logFilePath, string msg) { Console.WriteLine(msg); WriteLine(logFilePath, msg); } } class DataItem { public string Text { get; set; } public string Value { get; set; } } }
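Before a css or img reference can be downloaded, it has to be turned into an absolute address. Version 3 does this with string checks; note that cssUrl.Substring(1) returns the whole remainder of the string rather than its second character, so the test against "/" does not do what it appears to intend. One alternative, sketched below under the assumption that the address of the page containing the reference is known, is to let System.Uri resolve the reference. The class and method names are hypothetical.

```csharp
using System;
using System.IO;

static class ResourceUrlSketch
{
    // Resolve a reference found in a page (absolute, protocol-relative "//host/...",
    // root-relative "/path" or relative "file.css") against the page's own address,
    // and drop any query string or fragment.
    public static string Resolve(string pageUrl, string resourceUrl)
    {
        var page = new Uri(pageUrl);
        if (resourceUrl.StartsWith("//"))
        {
            // Protocol-relative: reuse the scheme of the containing page.
            resourceUrl = page.Scheme + ":" + resourceUrl;
        }
        var absolute = new Uri(page, resourceUrl);
        return absolute.GetLeftPart(UriPartial.Path);
    }

    // File name to use for the local copy of a resource.
    public static string LocalFileName(string absoluteUrl)
    {
        return Path.GetFileName(new Uri(absoluteUrl).AbsolutePath);
    }
}
```

For instance, resolving "/skins/x/bundle.css?v=1" against https://www.cnblogs.com/hlxs/p/1.html gives https://www.cnblogs.com/skins/x/bundle.css, whose local file name is bundle.css. Two different resources with the same file name would still collide locally; that is a limitation of naming by file name alone, which version 3 shares.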
Version 4
Local links are now URL-encoded
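Why this matters: the directory page links to each downloaded post through a relative path of the form category/title.html, and category names such as "C#" or titles with spaces would break a raw `href` (a `#` starts a fragment, and a `+` in a path is a literal plus sign, not a space). Version 4 therefore passes both parts through `WebUtility.UrlEncode` and rewrites `+` as `%20`. A minimal sketch of that step follows; the class `LocalLink` and its method names are hypothetical and only mirror what `utilityHelper.UrlEncode` and `CreateDir` do in the listing.

```csharp
using System;
using System.Net;

static class LocalLink
{
    // Encode one path segment. WebUtility.UrlEncode writes a space as "+",
    // which is only a space in query strings, so it is rewritten as "%20".
    // A literal "+" in the input is already encoded as "%2B" at this point,
    // so the replacement cannot touch it.
    public static string EncodeSegment(string segment)
    {
        return WebUtility.UrlEncode(segment).Replace("+", "%20");
    }

    // Build the relative link used on the directory page:
    // <category>/<title>.html, with each part encoded separately
    // so that the "/" separator itself stays unencoded.
    public static string Build(string categoryName, string postTitle)
    {
        return EncodeSegment(categoryName) + "/" + EncodeSegment(postTitle) + ".html";
    }
}
```

For example, `LocalLink.Build("C#", "Hello World")` gives `C%23/Hello%20World.html`. `Uri.EscapeDataString` would probably work just as well here, since it writes spaces as `%20` directly.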
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Net;
using System.Security;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading;

namespace GenerateDirectory
{
    class Program
    {
        // Posts per category, keyed by the category URL.
        static Dictionary<string, List<DataItem>> archiveList = new Dictionary<string, List<DataItem>>();
        // All categories of the blog.
        static List<DataItem> categoryList = new List<DataItem>();
        // cnblogs user ID; the directory is built for this user.
        static string userId = "hlxs";
        const string archive = "p";
        const string category = "category";
        static string outCategoryFile = "我的博客目录.txt.html";
        static string outDir = userId;
        static bool IsOutDictionary = true;  // generate and save the blog directory page?
        static bool IsOutBlogFile = false;   // download and save every post?
        static string logFilePath = AppDomain.CurrentDomain.BaseDirectory + AppDomain.CurrentDomain.FriendlyName + DateTime.Now.ToString("yyyyMM") + ".log";

        // Fallback pattern used by Filter when no pattern is supplied: any <a href=...>text</a>.
        const string DefaultLinkRegex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>(?<text>(?:(?!</?a\b).)*)</a>";

        static void Main(string[] args)
        {
            try
            {
                IOHelper.WriteLog(logFilePath, $"【{DateTime.Now.ToString("yyyy-MM-dd HH:mm:ss")}】==================================================");
                string tmp = "";
                IOHelper.WriteLog(logFilePath, "Enter your cnblogs ID and press Enter:");
                userId = Console.ReadLine();
                outDir = $"LocalBlogs_{userId}_{DateTime.Now.ToString("yyyyMMdd")}";
                outCategoryFile = outDir + "/" + outCategoryFile;

                IOHelper.WriteLog(logFilePath, "Generate and save the blog directory? (default: true)");
                tmp = Console.ReadLine();
                IsOutDictionary = tmp.Length > 0 ? bool.Parse(tmp) : true;
                IOHelper.WriteLog(logFilePath, "Generate and save the blog directory: " + IsOutDictionary);

                IOHelper.WriteLog(logFilePath, "Download and save every post? (default: false)");
                tmp = Console.ReadLine();
                IsOutBlogFile = tmp.Length > 0 ? bool.Parse(tmp) : false;
                IOHelper.WriteLog(logFilePath, "Download and save every post: " + IsOutBlogFile);
            }
            catch (Exception)
            {
                Console.BackgroundColor = ConsoleColor.DarkYellow;
                Console.ForegroundColor = ConsoleColor.White;
                IOHelper.WriteLog(logFilePath, "【Error】Invalid input, please check the values you entered.");
                Console.ResetColor();
                Environment.Exit(0);
            }

            // Old category endpoint (no longer reachable after the cnblogs redesign):
            //   http://www.cnblogs.com/{userId}/mvc/blog/sidecolumn.aspx
            string url = "http://www.cnblogs.com/" + userId + "/ajax/sidecolumn.aspx";
            string param = "blogApp=" + userId; // the category request needs the user ID

            IOHelper.WriteLog(logFilePath, "Connecting to the server, please wait...");
            bool isOK = GetCategory(url, param); // fetch the blog categories
            if (!isOK)
            {
                IOHelper.WriteLog(logFilePath, "The cnblogs ID you entered is not valid. The system will explode in 10 seconds, stay safe!");
                Thread.Sleep(5000);
                for (int i = 5; i >= 0; i--)
                {
                    Thread.Sleep(1000);
                    Console.Clear();
                    IOHelper.WriteLog(logFilePath, i.ToString());
                }
                Console.Clear();
                IOHelper.WriteLog(logFilePath, "Haha!");
                Environment.Exit(0);
            }

            foreach (DataItem item in categoryList)
            {
                GetArchive(item.Value, string.Empty); // fetch the posts of each category
            }
            Console.Clear();

            if (IsOutDictionary) CreateDir();
            if (IsOutBlogFile) downBlog();

            IOHelper.WriteLog(logFilePath, $"【{DateTime.Now.ToString("yyyy-MM-dd HH:mm:ss")}】Done! Press any key to exit.");
            Console.ReadKey();
        }

        // Fetch the list of categories.
        static bool GetCategory(string url, string param)
        {
            string regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n]?)(?<text>(?:(?!<\/?a\b).)*)([.\n]?)<\/a>";
            MatchCollection matchs = Filter(url, param, regex);
            if (matchs == null)
            {
                return false;
            }
            Console.Clear();
            IOHelper.WriteLog(logFilePath, "Finished processing these categories:");
            foreach (Match m in matchs)
            {
                // Keep only links that point at one of this user's category pages.
                if (m.Groups["url"].Value.IndexOf(userId + "/" + category) >= 0)
                {
                    categoryList.Add(new DataItem { Text = m.Groups["text"].Value.Trim(), Value = m.Groups["url"].Value });
                }
            }
            return categoryList.Count > 0;
        }

        // Fetch the post list of a single category, following its pager if there is one.
        static void GetArchive(string url, string param)
        {
            // Earlier pattern, before the <span> element carried attributes:
            //   @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n \t]*)<span>(?<text>(?:(?!<\/).)*)(<\/span>)[\s.]*<\/a>"
            string regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n \t]*)<span(.*)>(?<text>(?:(?!<\/).)*)(<\/span>)[\s.]*<\/a>";
            string pContent = GetCategorySubList(url, param, regex);

            MatchCollection havePageCont = Filter(pContent, @"<div class=""pager"">");
            if (havePageCont != null && havePageCont.Count > 0)
            {
                // Find the highest page number mentioned in the pager links.
                int pageMax = 0;
                MatchCollection pageCollection = Filter(pContent, Regex.Escape(url) + @"\?page=(\d+)");
                foreach (Match p in pageCollection)
                {
                    int pNum = 0;
                    int.TryParse(p.Groups[1].Value, out pNum);
                    pageMax = pNum > pageMax ? pNum : pageMax;
                }
                // Page 1 has already been read above.
                for (int i = 2; i <= pageMax; i++)
                {
                    GetCategorySubList(url + "?page=" + i, "", regex);
                }
            }

            DataItem curCategory = categoryList.FirstOrDefault(m => m.Value == url);
            if (curCategory != null)
            {
                IOHelper.WriteLog(logFilePath, $"\t{curCategory.Text}");
            }
        }

        // Download one category page and record the posts found on it.
        // Returns the page content, or null when nothing could be read.
        static string GetCategorySubList(string url, string param, string reg)
        {
            string pContent = GetContent(url, param);
            MatchCollection matchs = Filter(pContent, reg);
            string urlKey = url.Split('?')[0]; // all pages of a category share one key
            if (matchs == null)
            {
                return null;
            }
            foreach (Match m in matchs)
            {
                string curUrl = m.Groups["url"].Value;
                if (curUrl.IndexOf(userId + "/") >= 0 && curUrl.IndexOf('#') < 0)
                {
                    DataItem item = new DataItem { Text = m.Groups["text"].Value, Value = curUrl };
                    List<DataItem> items;
                    if (!archiveList.TryGetValue(urlKey, out items))
                    {
                        items = new List<DataItem>();
                        archiveList.Add(urlKey, items);
                    }
                    // Skip posts that were already recorded for this category.
                    if (!items.Exists(t => t.Value == curUrl))
                    {
                        items.Add(item);
                    }
                }
            }
            return pContent;
        }

        // Generate the blog directory page.
        static void CreateDir()
        {
            // outCategoryFile has an extension, so DeleteDir removes its parent
            // directory (the whole output folder) before the page is regenerated.
            IOHelper.DeleteDir(AppDomain.CurrentDomain.BaseDirectory + outCategoryFile);

            string divFormat = "<h2 style='width:100%;float:left;margin-top:20px;background-color:#999999;color:White;padding-left:5px'>{0}</h2>";
            string liFormat = "<li style='width:49%;float:left;line-height:30px;'>{0} {1}</li>";
            string aFormat = "<a href='{0}' target='_blank' title='{1}'>{1}</a>";
            string aLocalFormat = "<a href='{0}' target='_blank' title='{1}'>{2}</a>";

            WriteBlogList(string.Format(divFormat, "About this page"));
            WriteBlogList($"<p>This page was generated by the command-line tool {AppDomain.CurrentDomain.FriendlyName}. See the article 一键构造你的博客园目录 for details.</p>");

            // Categories whose names contain one of these strings are listed first.
            // The strings are matched against the real category names, so they stay as-is.
            string[] catelist = { "算法", "智力题", "C++", "读书", "分析", "C#", "Windows" };
            for (int i = catelist.Length - 1; i >= 0; i--)
            {
                DataItem item = categoryList.FirstOrDefault(m => m.Text.Contains(catelist[i]));
                if (item != null)
                {
                    categoryList.Remove(item);
                    categoryList.Insert(0, item);
                }
            }

            foreach (DataItem categoryItem in categoryList)
            {
                string categoryName = utilityHelper.HtmlDecode(categoryItem.Text);
                IOHelper.WriteLog(logFilePath, categoryName);
                WriteBlogList(string.Format(divFormat, categoryName));

                List<DataItem> list;
                if (archiveList.TryGetValue(categoryItem.Value, out list))
                {
                    WriteBlogList("<ul style='padding-top:10px;clear:both;list-style:none;'>");
                    foreach (DataItem archiveItem in list)
                    {
                        string blogNam = utilityHelper.HtmlDecode(archiveItem.Text);
                        IOHelper.WriteLog(logFilePath, "\t" + blogNam);
                        var aHtml = string.Format(aFormat, archiveItem.Value, blogNam);
                        // New in version 4: the link to the local copy is URL-encoded.
                        var aLocalHtml = string.Format(aLocalFormat, utilityHelper.UrlEncode(categoryName) + "/" + utilityHelper.UrlEncode(blogNam) + ".html", blogNam, "local");
                        WriteBlogList(string.Format(liFormat, aHtml, (IsOutBlogFile ? aLocalHtml : "")));
                    }
                    WriteBlogList("</ul>");
                    Console.WriteLine();
                }
            }
            IOHelper.WriteLog(logFilePath, $"\r\n\r\nThe HTML for the blog directory has been generated, see: {Environment.NewLine + AppDomain.CurrentDomain.BaseDirectory + outCategoryFile}\r\n");
        }

        // Download every post together with its CSS and image resources.
        // The return value is not used by the caller.
        static bool downBlog()
        {
            string localUserPath = AppDomain.CurrentDomain.BaseDirectory + outDir;
            foreach (DataItem categoryItem in categoryList)
            {
                string categoryName = utilityHelper.HtmlDecode(categoryItem.Text);
                IOHelper.WriteLog(logFilePath, "-->" + categoryName);

                List<DataItem> list;
                if (archiveList.TryGetValue(categoryItem.Value, out list))
                {
                    for (int i = 0; i < list.Count; i++)
                    {
                        DataItem archiveItem = list[i];
                        string blogNam = utilityHelper.HtmlDecode(archiveItem.Text);
                        IOHelper.WriteLog(logFilePath, $"\tstart down : {categoryName}/{i + 1}-->{blogNam}");
                        string pContent = GetContent(archiveItem.Value, "");
                        var bResource = getBlogResource(pContent);

                        foreach (var item in bResource.Item1) // download CSS resources
                        {
                            string cssUrl = item.Split('?')[0];
                            // Protocol-relative URL ("//host/...") or site-relative URL ("/...").
                            if (cssUrl.StartsWith("//", StringComparison.Ordinal)) cssUrl = "http:" + cssUrl;
                            else if (cssUrl.StartsWith("/", StringComparison.Ordinal)) cssUrl = "https://www.cnblogs.com" + cssUrl;

                            string cssFileName = Path.GetFileNameWithoutExtension(cssUrl);
                            cssFileName = cssFileName.Length > 20 ? utilityHelper.GetMD5_16(cssFileName) : cssFileName;
                            string cssLocalPath = localUserPath + $"\\css\\{cssFileName}{Path.GetExtension(cssUrl)}";
                            if (!IOHelper.ExistsFile(cssLocalPath))
                            {
                                var isDown = downResource(cssUrl, cssLocalPath);
                                if (isDown)
                                {
                                    pContent = pContent.Replace(item, cssLocalPath.Replace(localUserPath, ".."));
                                }
                            }
                            else
                            {
                                pContent = pContent.Replace(item, cssLocalPath.Replace(localUserPath, ".."));
                            }
                        }

                        foreach (var item in bResource.Item2) // download image resources
                        {
                            string imgUrl = item.Split('?')[0];
                            if (imgUrl.StartsWith("//", StringComparison.Ordinal)) imgUrl = "http:" + imgUrl;
                            else if (imgUrl.StartsWith("/", StringComparison.Ordinal)) imgUrl = "https://www.cnblogs.com" + imgUrl;

                            string imgFileName = Path.GetFileNameWithoutExtension(imgUrl);
                            imgFileName = imgFileName.Length > 20 ? utilityHelper.GetMD5_16(imgFileName) : imgFileName;
                            string imgLocalPath = localUserPath + $"\\{categoryName}\\imgs\\{imgFileName}{Path.GetExtension(imgUrl)}";
                            bool isDown = false;
                            if (!IOHelper.ExistsFile(imgLocalPath))
                            {
                                isDown = downResource(imgUrl, imgLocalPath);
                                if (isDown)
                                    pContent = pContent.Replace(item, imgLocalPath.Replace(localUserPath + "\\" + categoryName, "."));
                            }
                            else
                            {
                                pContent = pContent.Replace(item, imgLocalPath.Replace(localUserPath + "\\" + categoryName, "."));
                            }
                        }

                        WriteBlogToFile(categoryName, blogNam, pContent);
                    }
                    Console.WriteLine();
                }
            }
            return false;
        }

        // Collect the stylesheet URLs (Item1) and image URLs (Item2) referenced by a post.
        static Tuple<List<string>, List<string>> getBlogResource(string blogContent)
        {
            List<string> cssList = new List<string>();
            List<string> imgList = new List<string>();

            // New in version 4: the pattern also captures rel, so that only
            // rel="stylesheet" links are downloaded. It expects rel to appear before href.
            string regCss = @"<link(?:(?!rel=).)*rel=(?<n2>['""]?)(?<rel>[^""\s]*)\1(?:(?!href=).)*href=\1(?<url>[^""\s>]*)\1*[^>]*>";
            MatchCollection cssMatchs = Filter(blogContent, regCss);
            if (cssMatchs != null)
            {
                foreach (Match m in cssMatchs)
                {
                    if (m.Groups["rel"].Value == "stylesheet")
                        cssList.Add(m.Groups["url"].Value);
                }
            }

            string regImg = @"<img(?:(?!src=).)*src=(['""]?)(?<url>[^""\s>]*)\1[^>]*>";
            MatchCollection imgMatchs = Filter(blogContent, regImg);
            if (imgMatchs != null)
            {
                foreach (Match m in imgMatchs)
                {
                    imgList.Add(m.Groups["url"].Value);
                }
            }
            return new Tuple<List<string>, List<string>>(cssList, imgList);
        }

        // Open the response stream for a URL. A non-empty param is sent as a form POST.
        // Returns null (after logging) when the request fails.
        private static Stream GetStream(int timeout, string url, string param)
        {
            try
            {
                WebRequest req = WebRequest.Create(url);
                req.Timeout = timeout;
                if (!string.IsNullOrEmpty(param))
                {
                    req.Method = "POST";
                    req.ContentType = "application/x-www-form-urlencoded";
                    byte[] bs = Encoding.ASCII.GetBytes(param);
                    req.ContentLength = bs.Length;
                    using (Stream reqStream = req.GetRequestStream())
                    {
                        reqStream.Write(bs, 0, bs.Length);
                    }
                }
                WebResponse HttpWResp = req.GetResponse();
                return HttpWResp.GetResponseStream();
            }
            catch (Exception ex)
            {
                IOHelper.WriteLog(logFilePath, $"request : {(url.Length > 100 ? url.Substring(0, 100) + "..." : url)}");
                IOHelper.WriteLog(logFilePath, $"param : {param}");
                Console.BackgroundColor = ConsoleColor.Red; // set the background color
                Console.ForegroundColor = ConsoleColor.White;
                IOHelper.WriteLog(logFilePath, "\t【Error】" + ex.Message);
                Console.ResetColor();
                return null;
            }
        }

        // Download a page and return its content as a string (empty on failure).
        private static string GetContent(string url, string param)
        {
            using (Stream myStream = GetStream(3000, url, param))
            {
                if (myStream != null)
                {
                    using (StreamReader sr = new StreamReader(myStream, Encoding.UTF8))
                    {
                        if (sr.Peek() > 0)
                        {
                            return sr.ReadToEnd();
                        }
                    }
                }
                return string.Empty;
            }
        }

        // Download a resource (CSS, image) to a local file. Returns true on success.
        private static bool downResource(string url, string localFilePath)
        {
            bool ret = false;
            // Drop the query string and any "!suffix" after the file name.
            string tmpUrl = url.Split('?')[0];
            tmpUrl = tmpUrl.Split('!')[0];
            string ext = IOHelper.GetFileExtension(tmpUrl);
            if (ext == ".html")
            {
                return ret;
            }
            using (Stream myStream = GetStream(3000, tmpUrl, ""))
            {
                if (myStream != null)
                {
                    // Make sure the input stream is positioned at its start.
                    if (myStream.CanSeek) myStream.Seek(0, SeekOrigin.Begin);
                    try
                    {
                        if (!IOHelper.ExistsDir(localFilePath, true))
                            Directory.CreateDirectory(Path.GetDirectoryName(localFilePath));
                        using (FileStream fileStream = new FileStream(localFilePath, FileMode.Create, FileAccess.Write))
                        {
                            myStream.CopyTo(fileStream);
                        }
                        ret = true;
                    }
                    catch (Exception ex)
                    {
                        IOHelper.WriteLog(logFilePath, $"url : {url}{Environment.NewLine}localFilePath : {localFilePath}");
                        IOHelper.WriteLog(logFilePath, "【Error】" + ex.ToString());
                    }
                }
            }
            return ret;
        }

        // Download a page and return all matches of the pattern in it (null if the page is empty).
        static MatchCollection Filter(string url, string param, string reg)
        {
            return Filter(GetContent(url, param), reg);
        }

        // Return all matches of the pattern in the given content (null if the content is empty).
        static MatchCollection Filter(string pageContent, string reg)
        {
            if (string.IsNullOrEmpty(pageContent))
            {
                return null;
            }
            string regex = reg.Length > 0 ? reg : DefaultLinkRegex;
            Regex rx = new Regex(regex);
            return rx.Matches(pageContent);
        }

        // Append a line to the blog directory file.
        private static void WriteBlogList(string content)
        {
            IOHelper.WriteLine(AppDomain.CurrentDomain.BaseDirectory + outCategoryFile, content);
        }

        // Write the content of a post to <outDir>/<category>/<title>.html.
        private static void WriteBlogToFile(string CategoryName, string bLogName, string content)
        {
            string fPath = AppDomain.CurrentDomain.BaseDirectory + outDir + "/" + CategoryName + "/" + bLogName + ".html";
            if (!IOHelper.ExistsDir(fPath, true))
            {
                Directory.CreateDirectory(Path.GetDirectoryName(fPath));
            }
            IOHelper.WriteLine(fPath, content);
        }
    }

    class utilityHelper
    {
        // URL-encode a path segment; spaces become "%20" instead of "+".
        public static string UrlEncode(string str)
        {
            string tmp = WebUtility.UrlEncode(str);
            return tmp.Replace("+", "%20");
        }

        // HTML-decode a title and replace characters that are not allowed in Windows file names.
        public static string HtmlDecode(string htmlCode)
        {
            string tmp = WebUtility.HtmlDecode(htmlCode);
            tmp = tmp.Replace('\\', '_');
            tmp = tmp.Replace('/', '_');
            tmp = tmp.Replace(':', '_');
            tmp = tmp.Replace('*', '_');
            tmp = tmp.Replace('?', '_');
            tmp = tmp.Replace('<', '_');
            tmp = tmp.Replace('>', '_');
            tmp = tmp.Replace('|', '_');
            tmp = tmp.Replace('"', '_');
            return tmp;
        }

        /// <summary>
        /// Returns the MD5 hash of the input as 32 hexadecimal characters.
        /// </summary>
        public static string GetMD5_32(string input)
        {
            System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create();
            byte[] data = md5.ComputeHash(System.Text.Encoding.Default.GetBytes(input));
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < data.Length; i++)
            {
                sb.Append(data[i].ToString("x2"));
            }
            return sb.ToString();
        }

        /// <summary>
        /// Returns the middle 16 hexadecimal characters of the MD5 hash.
        /// </summary>
        public static string GetMD5_16(string input)
        {
            return GetMD5_32(input).Substring(8, 16);
        }

        /// <summary>
        /// Returns 8 hexadecimal characters of the MD5 hash (positions 8 to 15).
        /// </summary>
        public static string GetMD5_8(string input)
        {
            return GetMD5_32(input).Substring(8, 8);
        }
    }

    class IOHelper
    {
        public static bool ExistsFile(string fileName)
        {
            if (fileName == null || fileName.Trim() == "")
            {
                return false;
            }
            return File.Exists(fileName);
        }

        // With isFile = true, localPath is a file path and its parent directory is checked.
        public static bool ExistsDir(string localPath, bool isFile = false)
        {
            string dPath = localPath;
            if (isFile)
            {
                dPath = Path.GetDirectoryName(dPath);
            }
            return Directory.Exists(dPath);
        }

        public static bool DeleteFile(string fileName)
        {
            if (ExistsFile(fileName))
            {
                File.Delete(fileName);
                return true;
            }
            return false;
        }

        // Delete a directory recursively. If the path has an extension it is treated
        // as a file path and the directory containing that file is deleted.
        public static bool DeleteDir(string dirPath)
        {
            string tmp = dirPath;
            if (!string.IsNullOrEmpty(Path.GetExtension(tmp)))
            {
                tmp = Path.GetDirectoryName(tmp);
            }
            var di = new DirectoryInfo(tmp);
            if (di.Exists)
            {
                di.Delete(true);
                return true;
            }
            return false;
        }

        // Append one line to a file, creating the file and its directory if needed.
        public static bool WriteLine(string fileName, string content)
        {
            if (!ExistsDir(fileName, true))
            {
                Directory.CreateDirectory(Path.GetDirectoryName(fileName));
            }
            using (FileStream fileStream = new FileStream(fileName, FileMode.Append))
            {
                if (!fileStream.CanWrite)
                {
                    throw new SecurityException("File fileName=" + fileName + " is read-only and cannot be written to!");
                }
                using (StreamWriter streamWriter = new StreamWriter(fileStream, Encoding.UTF8))
                {
                    streamWriter.WriteLine(content);
                }
                return true;
            }
        }

        public static string GetFileExtension(string filePath)
        {
            return Path.GetExtension(filePath);
        }

        // Write a message to the console and append it to the log file.
        public static void WriteLog(string logFilePath, string msg)
        {
            Console.WriteLine(msg);
            WriteLine(logFilePath, msg);
        }
    }

    class DataItem
    {
        public string Text { get; set; }
        public string Value { get; set; }
    }
}
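One more remark on the `rel` filter that version 4 adds in `getBlogResource`: the single regular expression expects `rel` to come before `href` inside the `<link>` tag, so a tag written the other way round would be missed. An alternative would be to match the tag first and then read its attributes one by one. The block below is only a sketch of that idea under some assumptions (attribute values are always quoted, and the class name `StylesheetLinks` is invented here); it is not what the program above does.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class StylesheetLinks
{
    // One whole <link ...> tag.
    private static readonly Regex LinkTag =
        new Regex(@"<link\b[^>]*>", RegexOptions.IgnoreCase);

    // One name="value" or name='value' pair inside a tag
    // (assumption of this sketch: values are always quoted).
    private static readonly Regex Attr =
        new Regex(@"(?<name>[\w-]+)\s*=\s*(?<q>['""])(?<value>.*?)\k<q>");

    // Returns the href of every <link> whose rel is "stylesheet",
    // regardless of the order in which the attributes appear.
    public static List<string> Extract(string html)
    {
        var urls = new List<string>();
        foreach (Match tag in LinkTag.Matches(html))
        {
            string rel = "", href = "";
            foreach (Match a in Attr.Matches(tag.Value))
            {
                string name = a.Groups["name"].Value.ToLowerInvariant();
                if (name == "rel") rel = a.Groups["value"].Value;
                else if (name == "href") href = a.Groups["value"].Value;
            }
            if (href.Length > 0 &&
                string.Equals(rel, "stylesheet", StringComparison.OrdinalIgnoreCase))
            {
                urls.Add(href);
            }
        }
        return urls;
    }
}
```

Given `<link rel="icon" href="/favicon.ico">` followed by `<link href="/css/blog.css" rel="stylesheet"/>`, `Extract` returns only `/css/blog.css`. Two small patterns are also easier to read than one long one, at the cost of scanning each tag twice.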
This article comes from cnblogs (博客园), author: jack_Meng. When reposting, please cite the original link: https://www.cnblogs.com/mq0036/p/12656911.html