Build your cnblogs (博客园) blog directory in one click and download it locally

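I recently read Wu Jun's The Beauty of Mathematics (数学之美). It's a good book, so I'll skip the small talk. After reading Chapter 9, on graph theory and web crawlers, I kept thinking how impressive crawlers are; after all, search engines use them to crawl web pages. So I wanted to write a simple crawler and try it out. The first thing that came to mind was building a directory for my own blog, which is about as small and simple as it gets. Hence this post, a quick share. Let me say up front that my implementation is very simple and has no real technical depth. Before reading on, you can take a look at my blog directory. The source code is shared.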

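A quick introduction to how a web crawler works: given a page URL, download that page, parse its content to get every link in it, then download those pages and keep parsing and downloading. That way you can download a great many pages on the internet. The principle is that simple; implementing it is not so easy. Since I can't go deep, I'll stick to the simple parts.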
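
The loop described above can be sketched in a few lines of C#. This is only a minimal illustration, not the code used later in this post: FakeWeb is a made-up in-memory stand-in for the internet, and CrawlSketch, ExtractLinks and Crawl are names invented for the example, so nothing is actually downloaded.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class CrawlSketch
{
    // Hypothetical stand-in for the network: URL -> HTML of that page.
    static readonly Dictionary<string, string> FakeWeb = new Dictionary<string, string>
    {
        { "/a", "<a href=\"/b\">B</a> <a href=\"/c\">C</a>" },
        { "/b", "<a href=\"/a\">A</a> <a href=\"/c\">C</a>" },
        { "/c", "<a href=\"/d\">D</a>" },
        { "/d", "no links here" },
    };

    // Pull every double-quoted href value out of a page with a deliberately simple regex.
    static IEnumerable<string> ExtractLinks(string html)
    {
        foreach (Match m in Regex.Matches(html, "href=\"([^\"]+)\""))
        {
            yield return m.Groups[1].Value;
        }
    }

    // Breadth-first crawl: "download" a page, collect its links, repeat for unseen links.
    public static List<string> Crawl(string start)
    {
        var visited = new HashSet<string> { start };
        var order = new List<string>();
        var queue = new Queue<string>();
        queue.Enqueue(start);
        while (queue.Count > 0)
        {
            string url = queue.Dequeue();
            order.Add(url);
            string html;
            if (!FakeWeb.TryGetValue(url, out html)) continue; // page could not be fetched
            foreach (string link in ExtractLinks(html))
            {
                if (visited.Add(link)) queue.Enqueue(link); // Add returns false for seen URLs
            }
        }
        return order;
    }

    static void Main()
    {
        Console.WriteLine(string.Join(" ", Crawl("/a"))); // prints "/a /b /c /d"
    }
}
```

The visited set is what keeps the crawl from looping forever when pages link back to each other, as /a and /b do here.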

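A quick look at the idea behind building my blog directory: get the URL and title of every post, then group the posts by category. Your posts are in fact already categorized, so all you need is the full list of your categories; then fetch the posts under each category, and you have every post together with its category. Building the blog directory from that is easy.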

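Rejected idea 1: take the URL of any one of my posts, download the post, parse the page to get all the links in it, filter them by some rule to keep only my post URLs (that is, drop the useless links), and then crawl my way to all of my posts. Since each post carries its category, the blog directory could be built quickly. But cnblogs doesn't work the way I imagined: the downloaded post is missing the part of the page that links a post to its neighbors. That part ties all my posts together like a doubly linked list, so knowing the URL of a single post would let me reach every post through this "doubly linked list". The downloaded page just doesn't contain it, so this approach, the one closest to a real crawler, was dropped.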

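Rejected idea 2: everyone's posts are listed page by page, so I could download those list pages and get all of my posts. But there is still a problem, for the same reason as above: annoyingly, the downloaded pages don't include the posts' categories. I'd have every post but no idea which category each one belongs to, so how could I build a directory? So this one was dropped too.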

 

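A requirement this simple can of course be met in many ways, so I went with a method that isn't very crawler-like. It's the one described above: get all the post categories, download the posts under each category, and build the blog directory. Getting my blog categories is easy; for example, my post categories are fetched like this: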

Request this URL: http://www.cnblogs.com/hlxs/mvc/blog/sidecolumn.aspx

Pass the parameter blogApp=hlxs (hlxs is my ID on cnblogs).

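That returns all of my post categories; then fetch the posts under each category, and building the blog directory is simple. The whole process only needs someone's cnblogs ID to build their blog directory, so calling it a one-click blog directory builder isn't an exaggeration.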

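If you want to build your own blog directory too, take a look at mine first. Building yours is simple: run the program and enter your cnblogs ID, and it automatically generates a file named 我的博客目录.txt ("My blog directory"). Publish the contents of that file as a post in HTML source mode.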

 

 

Source: https://www.cnblogs.com/hlxs/archive/2013/02/20/2918760.html

=======================================================================================

Personal usage notes

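cnblogs has since been redesigned, and the code above still uses the old URLs, for example the one for fetching the post categories; that page can no longer be accessed. Looking at the page source, it's not hard to find code like this: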

<div id="blog-sidecolumn"></div>
                    <script>loadBlogSideColumn();</script>
</div>

Just trace it in Chrome to see which path it requests to get the data. It appears the data is now loaded via ajax, as follows:

 https://www.cnblogs.com/hlxs/ajax/sidecolumn.aspx

Try the path above and see whether it returns the category list.
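
One way to try it from code rather than the browser is sketched below. It assumes the endpoint still accepts the blogApp form field sent via POST, the way the programs later in this post send it; that is an assumption carried over from the old URL, not something verified here. HttpClient is swapped in for HttpWebRequest to keep the sketch short, and SidebarRequestSketch, BuildUrl and FetchSidebarAsync are illustrative names.

```csharp
using System;
using System.Collections.Generic;
using System.Net.Http;
using System.Threading.Tasks;

static class SidebarRequestSketch
{
    // Build the sidebar URL for a cnblogs ID, following the path found above.
    public static string BuildUrl(string blogApp)
    {
        return "https://www.cnblogs.com/" + Uri.EscapeDataString(blogApp) + "/ajax/sidecolumn.aspx";
    }

    // POST blogApp=<id> as a form field and return the response body.
    // Assumption: the server accepts this request shape; it may require something else.
    public static async Task<string> FetchSidebarAsync(string blogApp)
    {
        var fields = new Dictionary<string, string> { { "blogApp", blogApp } };
        using (var client = new HttpClient())
        using (var form = new FormUrlEncodedContent(fields))
        using (HttpResponseMessage resp = await client.PostAsync(BuildUrl(blogApp), form))
        {
            resp.EnsureSuccessStatusCode(); // throws on a non-2xx status
            return await resp.Content.ReadAsStringAsync();
        }
    }

    static void Main()
    {
        // Only prints the URL; call FetchSidebarAsync("hlxs") to actually hit the network.
        Console.WriteLine(BuildUrl("hlxs"));
    }
}
```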

Also, the post list now lives under the p directory of the user's URL, for example:

https://www.cnblogs.com/hlxs/p/

See whether that returns the post list.

As for the regular expressions in the code, I modified them slightly to also match line breaks, as follows:

Default: string regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>(?<text>(?:(?!</?a\b).)*)</a>";
Post categories: string regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n]?)(?<text>(?:(?!<\/?a\b).)*)([.\n]?)<\/a>";
Single post: string regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n \t]*)<span>(?<text>(?:(?!<\/).)*)(<\/span>)[\s.]*<\/a>";
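
One thing worth knowing when adapting these patterns: inside a character class, `.` is a literal dot, so `[.\n]?` matches at most one dot or one newline, not an arbitrary run of characters. The sketch below runs the default pattern against made-up sample HTML to show it failing on link text that spans several lines, and then shows one possible alternative, RegexOptions.Singleline, which lets `.` match newlines. LinkRegexSketch and FirstText are names invented for this example, and the sample markup is not taken from a real cnblogs page.

```csharp
using System;
using System.Text.RegularExpressions;

static class LinkRegexSketch
{
    // The default pattern from this post, unchanged.
    const string DefaultPattern = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>(?<text>(?:(?!</?a\b).)*)</a>";

    // Return the trimmed text of the first link, or null when nothing matches.
    public static string FirstText(string html, RegexOptions options)
    {
        Match m = Regex.Match(html, DefaultPattern, options);
        return m.Success ? m.Groups["text"].Value.Trim() : null;
    }

    static void Main()
    {
        string oneLine = "<a href=\"/hlxs/category/1.html\">Algorithms</a>";
        string multiLine = "<a href=\"/hlxs/category/1.html\">\n  Algorithms\n</a>";

        // Link text on a single line: the default pattern is enough.
        Console.WriteLine(FirstText(oneLine, RegexOptions.None));
        // Link text wrapped in newlines: '.' stops at '\n', so there is no match.
        Console.WriteLine(FirstText(multiLine, RegexOptions.None) ?? "(no match)");
        // Singleline makes '.' match '\n' as well, so the same pattern now matches.
        Console.WriteLine(FirstText(multiLine, RegexOptions.Singleline));
    }
}
```

With Singleline the captured text includes the surrounding whitespace, which is why the sketch trims it, the same way GetCategory trims category names.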

 

Full code:

Version 1

Create a command-line project named GenerateDirectory and save the following code to Program.cs

namespace GenerateDirectory
{
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;
    using System.Net;
    using System.Security;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Threading;

    class Program
    {
        static Dictionary<string, List<DataItem>> archiveList = new Dictionary<string, List<DataItem>>();// posts, keyed by category URL
        static List<DataItem> categoryList = new List<DataItem>();// category list
        static string userId = "hlxs";// user ID; the blog directory is fetched for this ID
        const string archive = "p";
        const string category = "category";

        static void Main(string[] args)
        {
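            Console.WriteLine("Enter your cnblogs ID and press Enter:");
            userId = Console.ReadLine();
            // Old category URL, unreachable since the cnblogs redesign:
            // "http://www.cnblogs.com/" + userId + "/mvc/blog/sidecolumn.aspx"
            string url = "http://www.cnblogs.com/" + userId + "/ajax/sidecolumn.aspx";// URL for fetching the blog categories
            string param = "blogApp=" + userId;// fetching your categories requires your ID
            Console.WriteLine("Connecting to the server, please wait...");
            bool isOK = GetCategory(url, param);// fetch the blog categories
            if (!isOK)
            {
                Console.Clear();
                Console.WriteLine("The cnblogs ID you entered is invalid. The system will explode in 10 seconds, stay safe!");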
                Thread.Sleep(5000);
                for (int i = 5; i >= 1; i--)
                {
                    Thread.Sleep(1000);
                    Console.Clear();
                    Console.WriteLine(i);
                }
                Environment.Exit(0);
            }
            foreach (DataItem item in categoryList)
            {
                GetArchive(item.Value, string.Empty);// fetch the posts in this category
            }
            Console.Clear();
            CreateDir();
            Console.ReadKey();

        }

        // Generate the blog directory
        static void CreateDir()
        {
            IOHelper.DeleteFile(Environment.CurrentDirectory + "/我的博客目录.txt");
            string divFormat = "<h2 style='width:100%;float:left;margin-top:20px;background-color:#999999;color:White;padding-left:5px'>{0}</h2>";
            string aFormat = "<a href='{0}' target='_blank' title='{1}'>{1}</a>";
            string liFormat = "<li style='width:49%;float:left;line-height:30px;'>{0}</li>";

            string[] catelist = { "算法", "智力题", "C++", "读书", "分析", "C#", "Windows" };// categories whose names contain these are listed first
            for (int i = catelist.Length - 1; i >= 0; i--)
            {
                DataItem item = categoryList.FirstOrDefault(m => m.Text.Contains(catelist[i]));
                if (item != null)
                {
                    categoryList.Remove(item);
                    categoryList.Insert(0, item);
                }
            }

            foreach (DataItem categoryItem in categoryList)
            {
                Console.WriteLine(categoryItem.Text);
                WriteLine(string.Format(divFormat, categoryItem.Text));
                List<DataItem> list;
                if (archiveList.TryGetValue(categoryItem.Value, out list))
                {
                    WriteLine("<ul style='padding-top:10px;clear:both;list-style:none;'>");
                    foreach (DataItem archiveItem in list)
                    {
                        Console.WriteLine("\t" + archiveItem.Text);
                        WriteLine(string.Format(liFormat, string.Format(aFormat, archiveItem.Value, archiveItem.Text)));
                    }
                    WriteLine("</ul>");
                    Console.WriteLine();
                }
            }

            Console.WriteLine("\r\n\r\nThe HTML for the blog directory has been generated; see 我的博客目录.txt\r\n");
        }

        private static void WriteLine(string content)
        {
            IOHelper.WriteLine(Environment.CurrentDirectory + "/我的博客目录.txt", content);
        }

        private static Stream GetStream(int timeout, string url, string param)
        {
            try
            {
                //if (!url.StartsWith("http://", StringComparison.OrdinalIgnoreCase))
                //{
                //    url = "http://" + url;
                //}
                HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(url);
                req.Timeout = timeout;// apply the caller-supplied timeout (milliseconds), previously ignored
                if (!string.IsNullOrEmpty(param))
                {
                    req.Method = "POST";
                    req.ContentType = "application/x-www-form-urlencoded";
                    byte[] bs = Encoding.ASCII.GetBytes(param);
                    req.ContentLength = bs.Length;
                    using (Stream reqStream = req.GetRequestStream())
                    {
                        reqStream.Write(bs, 0, bs.Length);
                    }
                }

                HttpWebResponse HttpWResp = (HttpWebResponse)req.GetResponse();
                return HttpWResp.GetResponseStream();
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
                return null;
            }
        }

        // Fetch the content of a page
        private static string GetContent(string url, string param)
        {
            using (Stream myStream = GetStream(24000, url, param))
            {
                if (myStream == null)
                {
                    return string.Empty;
                }
                using (StreamReader sr = new StreamReader(myStream, Encoding.UTF8))
                {
                    if (sr.Peek() > 0)
                    {
                        return sr.ReadToEnd();
                    }
                }
                return string.Empty;
            }
        }

        static MatchCollection Filter(string url, string param, string reg)
        {
            string content = GetContent(url, param);
            if (string.IsNullOrEmpty(content))
            {
                return null;
            }
            string regex = reg.Length > 0 ? reg : @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>(?<text>(?:(?!</?a\b).)*)</a>";

            Regex rx = new Regex(regex);
            return rx.Matches(content);
        }

        // Fetch the posts of one category
        static void GetArchive(string url, string param)
        {
            string regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n \t]*)<span>(?<text>(?:(?!<\/).)*)(<\/span>)[\s.]*<\/a>";
            MatchCollection matchs = Filter(url, param, regex);
            if (matchs == null)
            {
                return;
            }
            foreach (Match m in matchs)
            {
                string curUrl = m.Groups["url"].Value;
                if (curUrl.IndexOf(userId + "/" + archive) >= 0 && curUrl.IndexOf('#') < 0)
                {
                    DataItem item = new DataItem { Text = m.Groups["text"].Value, Value = curUrl };
                    if (!archiveList.ContainsKey(url))
                    {
                        archiveList.Add(url, new List<DataItem> { item });
                    }
                    else
                    {
                        if (archiveList[url].FirstOrDefault(archiveItem => archiveItem.Value == curUrl) == null)
                        {
                            archiveList[url].Add(item);
                        }
                    }
                }
            }
            DataItem curCategory = categoryList.FirstOrDefault(m => m.Value == url);
            if (curCategory != null)
            {
                Console.WriteLine("    " + curCategory.Text);
            }
        }

        // Fetch the post categories
        static bool GetCategory(string url, string param)
        {

            string regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n]?)(?<text>(?:(?!<\/?a\b).)*)([.\n]?)<\/a>";
            MatchCollection matchs = Filter(url, param, regex);
            if (matchs == null)
            {
                return false;
            }
            Console.Clear();
            Console.WriteLine("Fetched your blog categories:");
            foreach (Match m in matchs)
            {
                if (m.Groups["url"].Value.IndexOf(userId + "/" + category) >= 0)
                {
                    categoryList.Add(new DataItem { Text = m.Groups["text"].Value.Trim(), Value = m.Groups["url"].Value });
                }
            }
            return categoryList.Count == 0 ? false : true;
        }

    }

    class IOHelper
    {

        public static bool Exists(string fileName)
        {
            if (fileName == null || fileName.Trim() == "")
            {
                return false;
            }
            if (File.Exists(fileName))
            {
                return true;
            }
            return false;
        }

        public static bool DeleteFile(string fileName)
        {
            if (Exists(fileName))
            {
                File.Delete(fileName);
                return true;
            }
            return false;
        }

        public static bool WriteLine(string fileName, string content)
        {
            using (FileStream fileStream = new FileStream(fileName, FileMode.Append))
            {
                lock (fileStream)
                {
                    if (!fileStream.CanWrite)
                    {
                        throw new SecurityException("File fileName=" + fileName + " is read-only and cannot be written!");
                    }
                    StreamWriter streamWriter = new StreamWriter(fileStream, Encoding.UTF8);
                    streamWriter.WriteLine(content);
                    streamWriter.Dispose();
                    streamWriter.Close();
                    return true;
                }
            }
        }
    }

    class DataItem
    {
        public string Text { get; set; }
        public string Value { get; set; }
    }
}

 

Version 2

Improvement: handles categories whose post lists are split across pages
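
The pagination idea can be sketched in isolation before reading the full listing: scan the first page of a category for links of the form "<category URL>?page=N" and take the largest N. The sketch below is a simplified illustration under assumptions, not the listing's exact code: the category URL and the pager markup are made up, and PagerSketch and MaxPage are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class PagerSketch
{
    // Return the largest N found in links of the form "<categoryUrl>?page=N", or 1 if none.
    public static int MaxPage(string html, string categoryUrl)
    {
        int max = 1;
        // Regex.Escape keeps '.' and other metacharacters in the URL literal.
        string pattern = Regex.Escape(categoryUrl) + @"\?page=(\d+)";
        foreach (Match m in Regex.Matches(html, pattern))
        {
            int n;
            if (int.TryParse(m.Groups[1].Value, out n) && n > max) max = n;
        }
        return max;
    }

    static void Main()
    {
        // Hypothetical category URL and pager markup, for illustration only.
        string url = "https://www.cnblogs.com/hlxs/category/1.html";
        string html = "<div class=\"pager\">"
            + "<a href=\"" + url + "?page=2\">2</a>"
            + "<a href=\"" + url + "?page=3\">3</a></div>";
        Console.WriteLine(MaxPage(html, url)); // prints 3
    }
}
```

Pages 2 through the returned maximum can then be fetched one by one and their posts filed under the same category URL.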

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace GenerateDirectory
{
    using System.IO;
    using System.Net;
    using System.Security;
    using System.Text.RegularExpressions;
    using System.Threading;

    class Program
    {
        static Dictionary<string, List<DataItem>> archiveList = new Dictionary<string, List<DataItem>>();// posts, keyed by category URL
        static List<DataItem> categoryList = new List<DataItem>();// category list
        static string userId = "hlxs";// user ID; the blog directory is fetched for this ID
        const string archive = "p";
        const string category = "category";
        const string outFile = "我的博客目录.txt";

        static void Main(string[] args)
        {
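            Console.WriteLine("Enter your cnblogs ID and press Enter:");
            userId = Console.ReadLine();
            // Old category URL, unreachable since the cnblogs redesign:
            // "http://www.cnblogs.com/" + userId + "/mvc/blog/sidecolumn.aspx"
            string url = "http://www.cnblogs.com/" + userId + "/ajax/sidecolumn.aspx";// URL for fetching the blog categories
            string param = "blogApp=" + userId;// fetching your categories requires your ID
            Console.WriteLine("Connecting to the server, please wait...");
            bool isOK = GetCategory(url, param);// fetch the blog categories
            if (!isOK)
            {
                Console.Clear();
                Console.WriteLine("The cnblogs ID you entered is invalid. The system will explode in 10 seconds, stay safe!");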
                Thread.Sleep(5000);
                for (int i = 5; i >= 1; i--)
                {
                    Thread.Sleep(1000);
                    Console.Clear();
                    Console.WriteLine(i);
                }
                Environment.Exit(0);
            }
            foreach (DataItem item in categoryList)
            {
                GetArchive(item.Value, string.Empty);// fetch the posts in this category
            }
            Console.Clear();
            CreateDir();
            Console.ReadKey();

        }

        // Generate the blog directory
        static void CreateDir()
        {
            //IOHelper.DeleteFile(Environment.CurrentDirectory + "/" + outFile);
            IOHelper.DeleteFile(AppDomain.CurrentDomain.BaseDirectory + outFile);
            string divFormat = "<h2 style='width:100%;float:left;margin-top:20px;background-color:#999999;color:White;padding-left:5px'>{0}</h2>";
            string aFormat = "<a href='{0}' target='_blank' title='{1}'>{1}</a>";
            string liFormat = "<li style='width:49%;float:left;line-height:30px;'>{0}</li>";
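            WriteLine(string.Format(divFormat, "About this page"));
            WriteLine($"<p>This page was generated by the command-line tool {AppDomain.CurrentDomain.FriendlyName}. See the post 'Build your cnblogs blog directory in one click' for details.</p>");

            string[] catelist = { "算法", "智力题", "C++", "读书", "分析", "C#", "Windows" };// categories whose names contain these are listed first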
            for (int i = catelist.Length - 1; i >= 0; i--)
            {
                DataItem item = categoryList.FirstOrDefault(m => m.Text.Contains(catelist[i]));
                if (item != null)
                {
                    categoryList.Remove(item);
                    categoryList.Insert(0, item);
                }
            }

            foreach (DataItem categoryItem in categoryList)
            {
                Console.WriteLine(categoryItem.Text);
                WriteLine(string.Format(divFormat, categoryItem.Text));
                List<DataItem> list;
                if (archiveList.TryGetValue(categoryItem.Value, out list))
                {
                    WriteLine("<ul style='padding-top:10px;clear:both;list-style:none;'>");
                    foreach (DataItem archiveItem in list)
                    {
                        Console.WriteLine("\t" + archiveItem.Text);
                        WriteLine(string.Format(liFormat, string.Format(aFormat, archiveItem.Value, archiveItem.Text)));
                    }
                    WriteLine("</ul>");
                    Console.WriteLine();
                }
            }

            Console.WriteLine($"\r\n\r\nThe HTML for the blog directory has been generated; see: {Environment.NewLine + AppDomain.CurrentDomain.BaseDirectory + outFile}\r\n");
        }

        private static void WriteLine(string content)
        {
            // Write to the same path that CreateDir deletes and reports
            IOHelper.WriteLine(AppDomain.CurrentDomain.BaseDirectory + outFile, content);
        }

        private static Stream GetStream(int timeout, string url, string param)
        {
            try
            {
                HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(url);
                req.Timeout = timeout;// apply the caller-supplied timeout (milliseconds), previously ignored
                if (!string.IsNullOrEmpty(param))
                {
                    req.Method = "POST";
                    req.ContentType = "application/x-www-form-urlencoded";
                    byte[] bs = Encoding.ASCII.GetBytes(param);
                    req.ContentLength = bs.Length;
                    using (Stream reqStream = req.GetRequestStream())
                    {
                        reqStream.Write(bs, 0, bs.Length);
                    }
                }

                HttpWebResponse HttpWResp = (HttpWebResponse)req.GetResponse();
                return HttpWResp.GetResponseStream();
            }
            catch (Exception ex)
            {
                Console.WriteLine(ex.Message);
                return null;
            }
        }

        // Fetch the content of a page
        private static string GetContent(string url, string param)
        {
            using (Stream myStream = GetStream(24000, url, param))
            {
                if (myStream == null)
                {
                    return string.Empty;
                }
                using (StreamReader sr = new StreamReader(myStream, Encoding.UTF8))
                {
                    if (sr.Peek() > 0)
                    {
                        return sr.ReadToEnd();
                    }
                }
                return string.Empty;
            }
        }

        static MatchCollection Filter(string url, string param, string reg)
        {
            string content = GetContent(url, param);
            if (string.IsNullOrEmpty(content))
            {
                return null;
            }
            string regex = reg.Length > 0 ? reg : @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>(?<text>(?:(?!</?a\b).)*)</a>";

            Regex rx = new Regex(regex);
            return rx.Matches(content);
        }


        static MatchCollection Filter(string pageContent, string reg)
        {
            string content = pageContent;
            if (string.IsNullOrEmpty(content))
            {
                return null;
            }
            string regex = reg.Length > 0 ? reg : @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>(?<text>(?:(?!</?a\b).)*)</a>";

            Regex rx = new Regex(regex);
            return rx.Matches(content);
        }


        // Fetch the posts of one category, following its pager if present
        static void GetArchive(string url, string param)
        {
            string regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n \t]*)<span>(?<text>(?:(?!<\/).)*)(<\/span>)[\s.]*<\/a>";
            regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n \t]*)<span(.*)>(?<text>(?:(?!<\/).)*)(<\/span>)[\s.]*<\/a>";
            string pContent = GetContent(url, param);
            MatchCollection matchs = Filter(pContent, regex);

            if (matchs == null)
            {
                return;
            }
            foreach (Match m in matchs)
            {
                string curUrl = m.Groups["url"].Value;
                if (curUrl.IndexOf(userId + "/" + archive) >= 0 && curUrl.IndexOf('#') < 0)
                {
                    DataItem item = new DataItem { Text = m.Groups["text"].Value, Value = curUrl };
                    if (!archiveList.ContainsKey(url))
                    {
                        archiveList.Add(url, new List<DataItem> { item });
                    }
                    else
                    {
                        if (archiveList[url].FirstOrDefault(archiveItem => archiveItem.Value == curUrl) == null)
                        {
                            archiveList[url].Add(item);
                        }
                    }
                }
            }

            MatchCollection havePageCont = Filter(pContent, @"<div class=""pager"">");
            if (havePageCont != null && havePageCont.Count > 0)
            {
                int pageMax = 0;
                // Escape the URL so '.' and similar characters are matched literally
                MatchCollection pageCollection = Filter(pContent, Regex.Escape(url) + @"\?page=(\d+)");
                foreach (Match p in pageCollection)
                {
                    int pNum = 0;
                    int.TryParse(p.Groups[1].Value, out pNum);
                    pageMax = pNum > pageMax ? pNum : pageMax;
                }
                for (int i = 2; i <= pageMax; i++)
                {
                    GetArchiveSub(url + "?page=" + i, regex);
                }
            }

            DataItem curCategory = categoryList.FirstOrDefault(m => m.Value == url);
            if (curCategory != null)
            {
                Console.WriteLine("    " + curCategory.Text);
            }
        }
        static void GetArchiveSub(string url, string reg)
        {
            MatchCollection matchs = Filter(url, "", reg);
            if (matchs == null)
            {
                return;
            }
            foreach (Match m in matchs)
            {
                string curUrl = m.Groups["url"].Value;
                if (curUrl.IndexOf(userId + "/" + archive) >= 0 && curUrl.IndexOf('#') < 0)
                {
                    DataItem item = new DataItem { Text = m.Groups["text"].Value, Value = curUrl };
                    string categoryUrl = url.Split('?')[0];
                    if (!archiveList.ContainsKey(categoryUrl))
                    {
                        archiveList.Add(categoryUrl, new List<DataItem> { item });
                    }
                    else
                    {
                        if (archiveList[categoryUrl].FirstOrDefault(archiveItem => archiveItem.Value == curUrl) == null)
                        {
                            archiveList[categoryUrl].Add(item);
                        }
                    }
                }
            }

        }


        // Fetch the post categories
        static bool GetCategory(string url, string param)
        {

            string regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n]?)(?<text>(?:(?!<\/?a\b).)*)([.\n]?)<\/a>";
            MatchCollection matchs = Filter(url, param, regex);
            if (matchs == null)
            {
                return false;
            }
            Console.Clear();
            Console.WriteLine("Fetched your blog categories:");
            foreach (Match m in matchs)
            {
                if (m.Groups["url"].Value.IndexOf(userId + "/" + category) >= 0)
                {
                    categoryList.Add(new DataItem { Text = m.Groups["text"].Value.Trim(), Value = m.Groups["url"].Value });
                }
            }
            return categoryList.Count == 0 ? false : true;
        }

    }

    class IOHelper
    {

        public static bool Exists(string fileName)
        {
            if (fileName == null || fileName.Trim() == "")
            {
                return false;
            }
            if (File.Exists(fileName))
            {
                return true;
            }
            return false;
        }

        public static bool DeleteFile(string fileName)
        {
            if (Exists(fileName))
            {
                File.Delete(fileName);
                return true;
            }
            return false;
        }

        public static bool WriteLine(string fileName, string content)
        {
            using (FileStream fileStream = new FileStream(fileName, FileMode.Append))
            {
                lock (fileStream)
                {
                    if (!fileStream.CanWrite)
                    {
                        throw new SecurityException("File fileName=" + fileName + " is read-only and cannot be written!");
                    }
                    StreamWriter streamWriter = new StreamWriter(fileStream, Encoding.UTF8);
                    streamWriter.WriteLine(content);
                    streamWriter.Dispose();
                    streamWriter.Close();
                    return true;
                }
            }
        }
    }

    class DataItem
    {
        public string Text { get; set; }
        public string Value { get; set; }
    }
}

 

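Version 3

Improvement: can download a given user's posts to local disk, rewriting CSS, images, and other resources to point at local copies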

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace GenerateDirectory
{
    using System.IO;
    using System.Net;
    using System.Security;
    using System.Text.RegularExpressions;
    using System.Threading;

    class Program
    {
        static Dictionary<string, List<DataItem>> archiveList = new Dictionary<string, List<DataItem>>();// posts, keyed by category URL
        static List<DataItem> categoryList = new List<DataItem>();// category list
        static string userId = "hlxs";// user ID; the blog directory is fetched for this ID
        const string archive = "p";
        const string category = "category";
        static string outDictionaryFile = "我的博客目录.txt.html";
        static bool IsOutDictionary = true;// whether to generate and save the blog directory
        static bool IsOutBlogFile = false;// whether to download and save each post
        static string logFilePath = AppDomain.CurrentDomain.BaseDirectory + AppDomain.CurrentDomain.FriendlyName + DateTime.Now.ToString("yyyyMM") + ".log";


        static void Main(string[] args)
        {
            try
            {
                IOHelper.WriteLog(logFilePath, $"【{DateTime.Now.ToString("yyyy-MM-dd HH:mm:ss")}】==================================================");
                string tmp = "";
                IOHelper.WriteLog(logFilePath, "Enter your cnblogs ID and press Enter:");
                userId = Console.ReadLine();
                outDictionaryFile = userId + "/" + outDictionaryFile;
                IOHelper.WriteLog(logFilePath, "Generate and save the blog directory? (default: true)");
                tmp = Console.ReadLine();
                IsOutDictionary = tmp.Length > 0 ? bool.Parse(tmp) : true;
                IOHelper.WriteLog(logFilePath, "Generate and save the blog directory: " + IsOutDictionary);
                IOHelper.WriteLog(logFilePath, "Download and save each post? (default: false)");
                tmp = Console.ReadLine();
                IsOutBlogFile = tmp.Length > 0 ? bool.Parse(tmp) : false;
                IOHelper.WriteLog(logFilePath, "Download and save each post: " + IsOutBlogFile);
            }
            catch (Exception ex)
            {
                Console.BackgroundColor = ConsoleColor.DarkYellow;
                Console.ForegroundColor = ConsoleColor.White;
                IOHelper.WriteLog(logFilePath, "【Error】Invalid input, please check the values you entered.");
                Console.ResetColor();
                Environment.Exit(0);
            }


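            // Old category URL, unreachable since the cnblogs redesign:
            // "http://www.cnblogs.com/" + userId + "/mvc/blog/sidecolumn.aspx"
            string url = "http://www.cnblogs.com/" + userId + "/ajax/sidecolumn.aspx";// URL for fetching the blog categories
            string param = "blogApp=" + userId;// fetching your categories requires your ID
            IOHelper.WriteLog(logFilePath, "Connecting to the server, please wait...");
            bool isOK = GetCategory(url, param);// fetch the blog categories
            if (!isOK)
            {
                IOHelper.WriteLog(logFilePath, "The cnblogs ID you entered is invalid. The system will explode in 10 seconds, stay safe!");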
                Thread.Sleep(5000);
                for (int i = 5; i >= 0; i--)
                {
                    Thread.Sleep(1000);
                    Console.Clear();
                    IOHelper.WriteLog(logFilePath, i.ToString());
                }
                Console.Clear();
                IOHelper.WriteLog(logFilePath, "Ha ha!");
                Environment.Exit(0);
            }
            foreach (DataItem item in categoryList)
            {
                GetArchive(item.Value, string.Empty);// fetch the posts in this category
            }
            Console.Clear();
            if (IsOutDictionary)
                CreateDir();
            if (IsOutBlogFile)
                downBlog();
            IOHelper.WriteLog(logFilePath, $"【{DateTime.Now.ToString("yyyy-MM-dd HH:mm:ss")}】Done! Press any key to exit.");
            Console.ReadKey();

        }



        // Generate the blog directory
        static void CreateDir()
        {
            //IOHelper.DeleteFile(AppDomain.CurrentDomain.BaseDirectory + outDictionaryFile);
            IOHelper.DeleteDir(AppDomain.CurrentDomain.BaseDirectory + outDictionaryFile);
            string divFormat = "<h2 style='width:100%;float:left;margin-top:20px;background-color:#999999;color:White;padding-left:5px'>{0}</h2>";
            string liFormat = "<li style='width:49%;float:left;line-height:30px;'>{0}&nbsp;&nbsp;{1}</li>";
            string aFormat = "<a href='{0}' target='_blank' title='{1}'>{1}</a>";
            string aLocalFormat = "<a href='{0}' target='_blank' title='{1}'>{2}</a>";
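            WriteBlogList(string.Format(divFormat, "About this page"));
            WriteBlogList($"<p>This page was generated by the command-line tool {AppDomain.CurrentDomain.FriendlyName}. See the post 'Build your cnblogs blog directory in one click' for details.</p>");

            string[] catelist = { "算法", "智力题", "C++", "读书", "分析", "C#", "Windows" };// categories whose names contain these are listed first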
            for (int i = catelist.Length - 1; i >= 0; i--)
            {
                DataItem item = categoryList.FirstOrDefault(m => m.Text.Contains(catelist[i]));
                if (item != null)
                {
                    categoryList.Remove(item);
                    categoryList.Insert(0, item);
                }
            }

            foreach (DataItem categoryItem in categoryList)
            {
                string categoryName = utilityHelper.HtmlDecode(categoryItem.Text);
                IOHelper.WriteLog(logFilePath, categoryName);
                WriteBlogList(string.Format(divFormat, categoryName));
                List<DataItem> list;
                if (archiveList.TryGetValue(categoryItem.Value, out list))
                {
                    WriteBlogList("<ul style='padding-top:10px;clear:both;list-style:none;'>");
                    foreach (DataItem archiveItem in list)
                    {
                        string blogNam = utilityHelper.HtmlDecode(archiveItem.Text);
                        IOHelper.WriteLog(logFilePath, "\t" + blogNam);
                        var aHtml = string.Format(aFormat, archiveItem.Value, blogNam);
                        var aLocalHtml = string.Format(aLocalFormat, categoryName + "/" + blogNam + ".html", blogNam, "local");
                        WriteBlogList(string.Format(liFormat, aHtml, (IsOutBlogFile ? aLocalHtml : "")));
                    }
                    WriteBlogList("</ul>");
                    Console.WriteLine();
                }
            }

            IOHelper.WriteLog(logFilePath, $"\r\n\r\nThe HTML for the blog directory has been generated; see: {Environment.NewLine + AppDomain.CurrentDomain.BaseDirectory + outDictionaryFile}\r\n");
        }

        // Append a line to the blog directory file
        private static void WriteBlogList(string content)
        {
            IOHelper.WriteLine(AppDomain.CurrentDomain.BaseDirectory + outDictionaryFile, content);
        }

        // Write the content of one post to a file
        private static void WriteBlogToFile(string CategoryName, string bLogName, string content)
        {
            string fPath = AppDomain.CurrentDomain.BaseDirectory + userId + "/" + CategoryName + "/" + bLogName + ".html";
            if (!IOHelper.ExistsDir(fPath, true))
            {
                Directory.CreateDirectory(Path.GetDirectoryName(fPath));
            }
            IOHelper.WriteLine(fPath, content);
        }


        private static Stream GetStream(int timeout, string url, string param)
        {
            try
            {
                //HttpWebRequest req = (HttpWebRequest)HttpWebRequest.Create(url);
                //FileWebRequest req = (FileWebRequest)HttpWebRequest.Create(url);
                WebRequest req = WebRequest.Create(url);
                req.Timeout = timeout;// apply the caller-supplied timeout (milliseconds), previously ignored
                if (!string.IsNullOrEmpty(param))
                {
                    req.Method = "POST";
                    req.ContentType = "application/x-www-form-urlencoded";
                    byte[] bs = Encoding.ASCII.GetBytes(param);
                    req.ContentLength = bs.Length;
                    using (Stream reqStream = req.GetRequestStream())
                    {
                        reqStream.Write(bs, 0, bs.Length);
                    }
                }
                //HttpWebResponse HttpWResp = (HttpWebResponse)req.GetResponse();
                WebResponse HttpWResp = req.GetResponse();
                return HttpWResp.GetResponseStream();

            }
            catch (Exception ex)
            {
                IOHelper.WriteLog(logFilePath, $"request :  {(url.Length > 100 ? url.Substring(0, 100) + "..." : url)}");
                IOHelper.WriteLog(logFilePath, $"param   :  {param}");
                Console.BackgroundColor = ConsoleColor.Red; // set the background color
                Console.ForegroundColor = ConsoleColor.White;
                IOHelper.WriteLog(logFilePath, "\t【Error】" + ex.Message);
                Console.ResetColor();
                return null;
            }
        }

        // Download a page and return its content as a string
        private static string GetContent(string url, string param)
        {
            using (Stream myStream = GetStream(24000, url, param))
            {
                if (myStream == null)
                {
                    return string.Empty;
                }
                using (StreamReader sr = new StreamReader(myStream, Encoding.UTF8))
                {
                    // ReadToEnd returns an empty string for an empty response, so no Peek check is needed
                    return sr.ReadToEnd();
                }
            }
        }

        private static bool downResource(string url, string localFilePath)
        {
            bool ret = false;
            string tmpUrl = url.Split('?')[0];
            string ext = IOHelper.GetFileExtension(tmpUrl);
            if (ext == ".html")
            {
                return ret;
            }
            using (Stream myStream = GetStream(24000, tmpUrl, ""))
            {
                if (myStream != null)
                {
                    // Make sure the input stream is positioned at the start
                    if (myStream.CanSeek)
                        myStream.Seek(0, SeekOrigin.Begin);
                    if (!IOHelper.ExistsDir(localFilePath, true))
                        Directory.CreateDirectory(Path.GetDirectoryName(localFilePath));
                    using (FileStream fileStream = new FileStream(localFilePath, FileMode.Create, FileAccess.Write))
                    {
                        myStream.CopyTo(fileStream);
                        fileStream.Close();
                    }
                    ret = true;
                }
            }
            return ret;
        }

        static MatchCollection Filter(string url, string param, string reg)
        {
            string content = GetContent(url, param);
            if (string.IsNullOrEmpty(content))
            {
                return null;
            }
            string regex = reg.Length > 0 ? reg : @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>(?<text>(?:(?!</?a\b).)*)</a>";

            Regex rx = new Regex(regex);
            return rx.Matches(content);
        }


        static MatchCollection Filter(string pageContent, string reg)
        {
            string content = pageContent;
            if (string.IsNullOrEmpty(content))
            {
                return null;
            }
            string regex = reg.Length > 0 ? reg : @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>(?<text>(?:(?!</?a\b).)*)</a>";

            Regex rx = new Regex(regex);
            return rx.Matches(content);
        }


        // Get the list of articles in a single category, following its pager if there is one
        static void GetArchive(string url, string param)
        {
            string regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n \t]*)<span>(?<text>(?:(?!<\/).)*)(<\/span>)[\s.]*<\/a>";
            regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n \t]*)<span(.*)>(?<text>(?:(?!<\/).)*)(<\/span>)[\s.]*<\/a>";
            string pContent = GetCategorySubList(url, param, regex);

            MatchCollection havePageCont = Filter(pContent, @"<div class=""pager"">");
            if (havePageCont != null && havePageCont.Count > 0)
            {
                int pageMax = 0;
                // Escape the URL so that '.' and other metacharacters in it are matched literally
                MatchCollection pageCollection = Filter(pContent, Regex.Escape(url) + @"\?page=(\d+)");
                foreach (Match p in pageCollection)
                {
                    int pNum = 0;
                    int.TryParse(p.Groups[1].Value, out pNum);
                    pageMax = pNum > pageMax ? pNum : pageMax;
                }
                for (int i = 2; i <= pageMax; i++)
                {
                    //GetArchiveSub(url + "?page=" + i, regex);
                    GetCategorySubList(url + "?page=" + i, "", regex);
                }
            }

            DataItem curCategory = categoryList.FirstOrDefault(m => m.Value == url);
            if (curCategory != null)
            {
                IOHelper.WriteLog(logFilePath, $"\t{curCategory.Text}");
            }
        }

        static void GetArchiveSub(string url, string reg)
        {
            MatchCollection matchs = Filter(url, "", reg);
            if (matchs == null)
            {
                return;
            }
            foreach (Match m in matchs)
            {
                string curUrl = m.Groups["url"].Value;
                if (curUrl.IndexOf(userId + "/") >= 0 && curUrl.IndexOf('#') < 0)
                {
                    DataItem item = new DataItem { Text = m.Groups["text"].Value, Value = curUrl };
                    string categoryUrl = url.Split('?')[0];
                    if (!archiveList.ContainsKey(categoryUrl))
                    {
                        archiveList.Add(categoryUrl, new List<DataItem> { item });
                    }
                    else
                    {
                        if (archiveList[categoryUrl].FirstOrDefault(archiveItem => archiveItem.Value == curUrl) == null)
                        {
                            archiveList[categoryUrl].Add(item);
                        }
                    }
                }
            }

        }

        // Given a category URL, collect the articles listed on that page and return the page content
        static string GetCategorySubList(string url, string param, string reg)
        {
            string pContent = GetContent(url, param);
            MatchCollection matchs = Filter(pContent, reg);
            string urlKey = url.Split('?')[0];
            if (matchs == null)
            {
                return null;
            }
            foreach (Match m in matchs)
            {
                string curUrl = m.Groups["url"].Value;
                if (curUrl.IndexOf(userId + "/") >= 0 && curUrl.IndexOf('#') < 0)
                {
                    DataItem item = new DataItem { Text = m.Groups["text"].Value, Value = curUrl };
                    if (!archiveList.ContainsKey(urlKey))
                    {
                        archiveList.Add(urlKey, new List<DataItem> { item });
                    }
                    else if (!archiveList[urlKey].Exists(t => t.Value == curUrl))
                    {
                        // Skip articles already recorded for this category
                        archiveList[urlKey].Add(item);
                    }
                }
            }
            return pContent;
        }


        // Get the article categories
        static bool GetCategory(string url, string param)
        {

            string regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n]?)(?<text>(?:(?!<\/?a\b).)*)([.\n]?)<\/a>";
            MatchCollection matchs = Filter(url, param, regex);
            if (matchs == null)
            {
                return false;
            }
            Console.Clear();
            IOHelper.WriteLog(logFilePath, "Finished fetching your blog categories:");
            foreach (Match m in matchs)
            {
                if (m.Groups["url"].Value.IndexOf(userId + "/" + category) >= 0)
                {
                    categoryList.Add(new DataItem { Text = m.Groups["text"].Value.Trim(), Value = m.Groups["url"].Value });
                }
            }
            return categoryList.Count > 0;
        }


        // Download every post in every category, together with its CSS and images, to local files
        static bool downBlog()
        {
            string localUserPath = AppDomain.CurrentDomain.BaseDirectory + userId;
            //IOHelper.DeleteDir(localUserPath);
            foreach (DataItem categoryItem in categoryList)
            {
                string categoryName = utilityHelper.HtmlDecode(categoryItem.Text);
                IOHelper.WriteLog(logFilePath, "=>" + categoryName);
                List<DataItem> list;
                if (archiveList.TryGetValue(categoryItem.Value, out list))
                {
                    foreach (DataItem archiveItem in list)
                    {
                        string blogNam = utilityHelper.HtmlDecode(archiveItem.Text);
                        IOHelper.WriteLog(logFilePath, "\t=>" + blogNam);
                        string pContent = GetContent(archiveItem.Value, "");
                        var bResource = getBlogResource(pContent);
                        foreach (var item in bResource.Item1) // download CSS resources
                        {
                            string cssUrl = item.Split('?')[0];
                            string cssLocalPath = localUserPath + "\\css\\" + Path.GetFileName(cssUrl);
                            if (!IOHelper.ExistsFile(cssLocalPath))
                            {
                                if (cssUrl.IndexOf("//") == 0)
                                    cssUrl = "http:" + cssUrl;
                                if (cssUrl.IndexOf("/") == 0 && cssUrl.Substring(1) != "/")
                                    cssUrl = "https://www.cnblogs.com" + cssUrl;

                                var isDown = downResource(cssUrl, cssLocalPath);
                                if (isDown)
                                {
                                    pContent = pContent.Replace(item, cssLocalPath.Replace(localUserPath, ".."));
                                }
                            }
                            else
                            {
                                pContent = pContent.Replace(item, cssLocalPath.Replace(localUserPath, ".."));
                            }
                        }
                        foreach (var item in bResource.Item2) // download image resources
                        {
                            string imgUrl = item.Split('?')[0];
                            string imgLocalPath = localUserPath + $"\\{categoryName}\\imgs\\{Path.GetFileName(imgUrl)}";
                            bool isDown = false;
                            if (!IOHelper.ExistsFile(imgLocalPath))
                            {
                                if (imgUrl.IndexOf("//") == 0)
                                    imgUrl = "http:" + imgUrl;
                                if (imgUrl.IndexOf("/") == 0 && imgUrl.Substring(1) != "/")
                                    imgUrl = "https://www.cnblogs.com" + imgUrl;
                                isDown = downResource(imgUrl, imgLocalPath);
                                if (isDown)
                                    pContent = pContent.Replace(item, imgLocalPath.Replace(localUserPath + "\\" + categoryName, "."));
                            }
                            else
                            {
                                pContent = pContent.Replace(item, imgLocalPath.Replace(localUserPath + "\\" + categoryName, "."));
                            }
                        }
                        WriteBlogToFile(categoryName, blogNam, pContent);

                    }
                    Console.WriteLine();
                }
            }

            return false;
        }

        // Extract the stylesheet URLs (Item1) and image URLs (Item2) referenced by a post's HTML
        static Tuple<List<string>, List<string>> getBlogResource(string blogContent)
        {
            List<string> cssList = new List<string>();
            List<string> imgList = new List<string>();
            string regCss = "<link[^>]*?href=\"(?<url>[^>]*?)\"[^>]*?>";
            regCss = @"<link(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>";
            MatchCollection cssMatchs = Filter(blogContent, regCss);
            if (cssMatchs != null)
            {
                foreach (Match m in cssMatchs)
                {
                    cssList.Add(m.Groups["url"].Value);
                }

            }


            string regImg = @"<img(?:(?!src=).)*src=(['""]?)(?<url>[^""\s>]*)\1[^>]*>";
            MatchCollection imgMatchs = Filter(blogContent, regImg);
            if (imgMatchs != null)
            {
                foreach (Match m in imgMatchs)
                {
                    imgList.Add(m.Groups["url"].Value);
                }

            }

            return new Tuple<List<string>, List<string>>(cssList, imgList);
        }

    }

    class utilityHelper
    {
        public static string HtmlDecode(string htmlCode)
        {
            string tmp = WebUtility.HtmlDecode(htmlCode);
            tmp = tmp.Replace('\\', '_');
            tmp = tmp.Replace('/', '_');
            tmp = tmp.Replace(':', '_');
            tmp = tmp.Replace('*', '_');
            tmp = tmp.Replace('?', '_');
            tmp = tmp.Replace('"', '_'); // ':' was already replaced above; '"' is also invalid in Windows file names
            tmp = tmp.Replace('<', '_');
            tmp = tmp.Replace('>', '_');
            tmp = tmp.Replace('|', '_');
            return tmp;
        }
    }


    class IOHelper
    {

        public static bool ExistsFile(string fileName)
        {
            if (fileName == null || fileName.Trim() == "")
            {
                return false;
            }
            if (File.Exists(fileName))
            {
                return true;
            }
            return false;
        }

        public static bool ExistsDir(string localPath, bool isFile = false)
        {
            string dPath = localPath;
            if (isFile)
            {
                dPath = Path.GetDirectoryName(dPath);
            }

            return Directory.Exists(dPath);
        }

        public static bool DeleteFile(string fileName)
        {
            if (ExistsFile(fileName))
            {
                File.Delete(fileName);
                return true;
            }
            return false;
        }

        public static bool DeleteDir(string dirPath)
        {
            string tmp = dirPath;
            if (!string.IsNullOrEmpty(Path.GetExtension(tmp)))
            {
                tmp = Path.GetDirectoryName(tmp);
            }
            var di = new DirectoryInfo(tmp);
            if (di.Exists)
            {
                di.Delete(true);
                return true;
            }
            else
            {
                return false;
            }

        }

        public static bool WriteLine(string fileName, string content)
        {
            if (!ExistsDir(fileName, true))
            {
                Directory.CreateDirectory(Path.GetDirectoryName(fileName));
            }
            using (FileStream fileStream = new FileStream(fileName, FileMode.Append))
            {
                if (!fileStream.CanWrite)
                {
                    throw new SecurityException("File fileName=" + fileName + " is read-only and cannot be written to!");
                }
                // The using block flushes and closes the writer; locking a local stream object had no effect
                using (StreamWriter streamWriter = new StreamWriter(fileStream, Encoding.UTF8))
                {
                    streamWriter.WriteLine(content);
                }
                return true;
            }
        }

        public static string GetFileExtension(string filePath)
        {
            return Path.GetExtension(filePath);
        }

        public static void WriteLog(string logFilePath, string msg)
        {
            Console.WriteLine(msg);
            WriteLine(logFilePath, msg);
        }
    }

    class DataItem
    {
        public string Text { get; set; }
        public string Value { get; set; }
    }

}
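
The listing above leans on two regular expressions: one that pulls the `url` and `text` of every `<a>` tag out of a downloaded page, and one that finds the largest `?page=N` value in a category's pager. The following is a minimal, standalone sketch of those two steps. The class name `LinkExtractionSketch`, the method names, and the sample HTML are made up for illustration and are not part of the original program; only the anchor pattern is copied from `Filter`.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class LinkExtractionSketch
{
    // Same default pattern as Filter: group "url" is the href value, "text" is the link text.
    private const string AnchorPattern =
        @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>(?<text>(?:(?!</?a\b).)*)</a>";

    // Returns (url, text) pairs for every anchor found in the HTML fragment.
    public static List<KeyValuePair<string, string>> ExtractLinks(string html)
    {
        var links = new List<KeyValuePair<string, string>>();
        if (string.IsNullOrEmpty(html))
        {
            return links;
        }
        foreach (Match m in Regex.Matches(html, AnchorPattern))
        {
            links.Add(new KeyValuePair<string, string>(
                m.Groups["url"].Value, m.Groups["text"].Value.Trim()));
        }
        return links;
    }

    // Returns the largest "?page=N" value in the HTML, or 1 when no pager link is present.
    public static int MaxPage(string html)
    {
        int max = 1;
        foreach (Match m in Regex.Matches(html ?? string.Empty, @"\?page=(\d+)"))
        {
            int n;
            if (int.TryParse(m.Groups[1].Value, out n) && n > max)
            {
                max = n;
            }
        }
        return max;
    }

    public static void Main()
    {
        // Hypothetical sample markup, not real cnblogs output.
        string html =
            "<a href=\"https://www.cnblogs.com/demo/category/1.html\">Algorithms</a>" +
            "<a href='https://www.cnblogs.com/demo/category/1.html?page=3'>3</a>";

        foreach (var link in ExtractLinks(html))
        {
            Console.WriteLine(link.Key + " -> " + link.Value);
        }
        Console.WriteLine("max page: " + MaxPage(html));
    }
}
```

Regex-based extraction like this is fragile against markup changes, which is why the author had to adjust the patterns after the cnblogs redesign; an HTML parser would be sturdier but adds a dependency.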

 

Version 4

Local link paths are now URL-encoded
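
Category names and post titles often contain spaces, `#`, `%`, or non-ASCII characters, so a relative link built from them may not resolve unless it is percent-encoded. The listing that follows calls `utilityHelper.UrlEncode` for this, but that helper's body is not shown in this part of the article. The sketch below is therefore only one possible way to do it, assuming `Uri.EscapeDataString` is acceptable; `LocalLinkSketch` and `EncodeRelativePath` are hypothetical names.

```csharp
using System;
using System.Linq;

public static class LocalLinkSketch
{
    // Percent-encodes each path segment separately so the "/" separators stay intact.
    public static string EncodeRelativePath(params string[] segments)
    {
        return string.Join("/", segments.Select(s => Uri.EscapeDataString(s)));
    }

    public static void Main()
    {
        // Hypothetical category name and post title.
        string href = EncodeRelativePath("C# notes", "a+b 100%") + ".html";
        Console.WriteLine(href);
    }
}
```

Encoding each segment on its own matters: encoding the whole path in one call would also escape the `/` between the category folder and the file name.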

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace GenerateDirectory
{
    using System.IO;
    using System.Net;
    using System.Security;
    using System.Text.RegularExpressions;
    using System.Threading;

    class Program
    {
        static Dictionary<string, List<DataItem>> archiveList = new Dictionary<string, List<DataItem>>(); // articles, keyed by category URL
        static List<DataItem> categoryList = new List<DataItem>(); // category list
        static string userId = "hlxs"; // user ID; the blog directory is built for this user
        const string archive = "p";
        const string category = "category";
        static string outCategoryFile = "我的博客目录.txt.html";
        static string outDir = userId;
        static bool IsOutDictionary = true; // whether to generate and save the cnblogs directory
        static bool IsOutBlogFile = false; // whether to download and save every blog post
        static string logFilePath = AppDomain.CurrentDomain.BaseDirectory + AppDomain.CurrentDomain.FriendlyName + DateTime.Now.ToString("yyyyMM") + ".log";


        static void Main(string[] args)
        {
            try
            {
                IOHelper.WriteLog(logFilePath, $"【{DateTime.Now.ToString("yyyy-MM-dd HH:mm:ss")}】==================================================");
                string tmp = "";
                IOHelper.WriteLog(logFilePath, "Enter your cnblogs ID and press Enter:");
                userId = Console.ReadLine();
                outDir = $"LocalBlogs_{userId}_{DateTime.Now.ToString("yyyyMMdd")}";
                outCategoryFile = outDir + "/" + outCategoryFile;
                IOHelper.WriteLog(logFilePath, "Generate and save the cnblogs directory? (default: true)");
                tmp = Console.ReadLine();
                IsOutDictionary = tmp.Length > 0 ? bool.Parse(tmp) : true;
                IOHelper.WriteLog(logFilePath, "Generate and save the cnblogs directory: " + IsOutDictionary);
                IOHelper.WriteLog(logFilePath, "Download and save every blog post? (default: false)");
                tmp = Console.ReadLine();
                IsOutBlogFile = tmp.Length > 0 ? bool.Parse(tmp) : false;
                IOHelper.WriteLog(logFilePath, "Download and save every blog post: " + IsOutBlogFile);
            }
            catch (Exception ex)
            {
                Console.BackgroundColor = ConsoleColor.DarkYellow;
                Console.ForegroundColor = ConsoleColor.White;
                IOHelper.WriteLog(logFilePath, "【Error】Invalid input, please check the values you entered.");
                Console.ResetColor();
                Environment.Exit(0);
            }


            // The old endpoint (/mvc/blog/sidecolumn.aspx) stopped working after the cnblogs redesign, so the ajax endpoint is used
            string url = "http://www.cnblogs.com/" + userId + "/ajax/sidecolumn.aspx";
            string param = "blogApp=" + userId; // the category request needs your blog ID
            IOHelper.WriteLog(logFilePath, "Connecting to the server, please wait...");
            bool isOK = GetCategory(url, param); // fetch the blog categories
            if (!isOK)
            {
                IOHelper.WriteLog(logFilePath, "The cnblogs ID you entered is incorrect. The system will explode in 10 seconds, stay safe!");
                Thread.Sleep(5000);
                for (int i = 5; i >= 0; i--)
                {
                    Thread.Sleep(1000);
                    Console.Clear();
                    IOHelper.WriteLog(logFilePath, i.ToString());
                }
                Console.Clear();
                IOHelper.WriteLog(logFilePath, "Haha!");
                Environment.Exit(0);
            }
            foreach (DataItem item in categoryList)
            {
                GetArchive(item.Value, string.Empty); // fetch the posts in this category
            }
            Console.Clear();
            if (IsOutDictionary)
                CreateDir();
            if (IsOutBlogFile)
                downBlog();
            IOHelper.WriteLog(logFilePath, $"【{DateTime.Now.ToString("yyyy-MM-dd HH:mm:ss")}】Finished! Press any key to exit.");
            Console.ReadKey();

        }



        // Get the article categories
        static bool GetCategory(string url, string param)
        {

            string regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n]?)(?<text>(?:(?!<\/?a\b).)*)([.\n]?)<\/a>";
            MatchCollection matchs = Filter(url, param, regex);
            if (matchs == null)
            {
                return false;
            }
            Console.Clear();
            IOHelper.WriteLog(logFilePath, "Finished fetching your blog categories:");
            foreach (Match m in matchs)
            {
                if (m.Groups["url"].Value.IndexOf(userId + "/" + category) >= 0)
                {
                    categoryList.Add(new DataItem { Text = m.Groups["text"].Value.Trim(), Value = m.Groups["url"].Value });
                }
            }
            return categoryList.Count > 0;
        }

        // Get the list of articles in a single category, following its pager if there is one
        static void GetArchive(string url, string param)
        {
            string regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n \t]*)<span>(?<text>(?:(?!<\/).)*)(<\/span>)[\s.]*<\/a>";
            regex = @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>([.\n \t]*)<span(.*)>(?<text>(?:(?!<\/).)*)(<\/span>)[\s.]*<\/a>";
            string pContent = GetCategorySubList(url, param, regex);

            MatchCollection havePageCont = Filter(pContent, @"<div class=""pager"">");
            if (havePageCont != null && havePageCont.Count > 0)
            {
                int pageMax = 0;
                // Escape the URL so that '.' and other metacharacters in it are matched literally
                MatchCollection pageCollection = Filter(pContent, Regex.Escape(url) + @"\?page=(\d+)");
                foreach (Match p in pageCollection)
                {
                    int pNum = 0;
                    int.TryParse(p.Groups[1].Value, out pNum);
                    pageMax = pNum > pageMax ? pNum : pageMax;
                }
                for (int i = 2; i <= pageMax; i++)
                {
                    GetCategorySubList(url + "?page=" + i, "", regex);
                }
            }

            DataItem curCategory = categoryList.FirstOrDefault(m => m.Value == url);
            if (curCategory != null)
            {
                IOHelper.WriteLog(logFilePath, $"\t{curCategory.Text}");
            }
        }

        // Given a category URL, collect the articles listed on that page and return the page content
        static string GetCategorySubList(string url, string param, string reg)
        {
            string pContent = GetContent(url, param);
            MatchCollection matchs = Filter(pContent, reg);
            string urlKey = url.Split('?')[0];
            if (matchs == null)
            {
                return null;
            }
            foreach (Match m in matchs)
            {
                string curUrl = m.Groups["url"].Value;
                if (curUrl.IndexOf(userId + "/") >= 0 && curUrl.IndexOf('#') < 0)
                {
                    DataItem item = new DataItem { Text = m.Groups["text"].Value, Value = curUrl };
                    if (!archiveList.ContainsKey(urlKey))
                    {
                        archiveList.Add(urlKey, new List<DataItem> { item });
                    }
                    else if (!archiveList[urlKey].Exists(t => t.Value == curUrl))
                    {
                        // Skip articles already recorded for this category
                        archiveList[urlKey].Add(item);
                    }
                }
            }
            return pContent;
        }

        // Generate the blog directory page
        static void CreateDir()
        {
            //IOHelper.DeleteFile(AppDomain.CurrentDomain.BaseDirectory + outDictionaryFile);
            IOHelper.DeleteDir(AppDomain.CurrentDomain.BaseDirectory + outCategoryFile);
            string divFormat = "<h2 style='width:100%;float:left;margin-top:20px;background-color:#999999;color:White;padding-left:5px'>{0}</h2>";
            string liFormat = "<li style='width:49%;float:left;line-height:30px;'>{0}  {1}</li>";
            string aFormat = "<a href='{0}' target='_blank' title='{1}'>{1}</a>";
            string aLocalFormat = "<a href='{0}' target='_blank' title='{1}'>{2}</a>";
            WriteBlogList(string.Format(divFormat, "About this page"));
            WriteBlogList($"<p>This page was generated by the command-line tool {AppDomain.CurrentDomain.FriendlyName}. See the article 一键构造你的博客园目录 for details.</p>");

            string[] catelist = { "算法", "智力题", "C++", "读书", "分析", "C#", "Windows" }; // categories whose names contain these strings are listed first
            for (int i = catelist.Length - 1; i >= 0; i--)
            {
                DataItem item = categoryList.FirstOrDefault(m => m.Text.Contains(catelist[i]));
                if (item != null)
                {
                    categoryList.Remove(item);
                    categoryList.Insert(0, item);
                }
            }

            foreach (DataItem categoryItem in categoryList)
            {
                string categoryName = utilityHelper.HtmlDecode(categoryItem.Text);
                IOHelper.WriteLog(logFilePath, categoryName);
                WriteBlogList(string.Format(divFormat, categoryName));
                List<DataItem> list;
                if (archiveList.TryGetValue(categoryItem.Value, out list))
                {
                    WriteBlogList("<ul style='padding-top:10px;clear:both;list-style:none;'>");
                    foreach (DataItem archiveItem in list)
                    {
                        string blogNam = utilityHelper.HtmlDecode(archiveItem.Text);
                        IOHelper.WriteLog(logFilePath, "\t" + blogNam);
                        var aHtml = string.Format(aFormat, archiveItem.Value, blogNam);
                        var aLocalHtml = string.Format(aLocalFormat, utilityHelper.UrlEncode(categoryName) + "/" + utilityHelper.UrlEncode(blogNam) + ".html", blogNam, "local");
                        WriteBlogList(string.Format(liFormat, aHtml, (IsOutBlogFile ? aLocalHtml : "")));
                    }
                    WriteBlogList("</ul>");
                    Console.WriteLine();
                }
            }

            IOHelper.WriteLog(logFilePath, $"\r\n\r\nThe HTML for the cnblogs directory has been generated; see: {Environment.NewLine + AppDomain.CurrentDomain.BaseDirectory + outCategoryFile}\r\n");
        }

        // Download every post in every category, together with its CSS and images, to local files
        static bool downBlog()
        {
            string localUserPath = AppDomain.CurrentDomain.BaseDirectory + outDir;
            //IOHelper.DeleteDir(localUserPath);
            foreach (DataItem categoryItem in categoryList)
            {
                string categoryName = utilityHelper.HtmlDecode(categoryItem.Text);
                IOHelper.WriteLog(logFilePath, "-->" + categoryName);
                List<DataItem> list;
                if (archiveList.TryGetValue(categoryItem.Value, out list))
                {
                    for (int i = 0; i < list.Count; i++)
                    {
                        DataItem archiveItem = list[i];
                        string blogNam = utilityHelper.HtmlDecode(archiveItem.Text);
                        IOHelper.WriteLog(logFilePath, $"\tstart down : {categoryName}/{i + 1}-->{blogNam}");
                        string pContent = GetContent(archiveItem.Value, "");
                        var bResource = getBlogResource(pContent);
                        foreach (var item in bResource.Item1) // download CSS resources
                        {
                            string cssUrl = item.Split('?')[0];
                            if (cssUrl.IndexOf("//") == 0)
                                cssUrl = "http:" + cssUrl;
                            if (cssUrl.IndexOf("/") == 0 && cssUrl.Substring(1) != "/")
                                cssUrl = "https://www.cnblogs.com" + cssUrl;

                            string cssFileName = Path.GetFileNameWithoutExtension(cssUrl);
                            cssFileName = cssFileName.Length > 20 ? utilityHelper.GetMD5_16(cssFileName) : cssFileName;
                            string cssLocalPath = localUserPath + $"\\css\\{cssFileName}{Path.GetExtension(cssUrl)}";

                            if (!IOHelper.ExistsFile(cssLocalPath))
                            {
                                var isDown = downResource(cssUrl, cssLocalPath);
                                if (isDown)
                                {
                                    pContent = pContent.Replace(item, cssLocalPath.Replace(localUserPath, ".."));
                                }
                            }
                            else
                            {
                                pContent = pContent.Replace(item, cssLocalPath.Replace(localUserPath, ".."));
                            }
                        }
                        foreach (var item in bResource.Item2) // download image resources
                        {
                            string imgUrl = item.Split('?')[0];
                            if (imgUrl.IndexOf("//") == 0)
                                imgUrl = "http:" + imgUrl;
                            if (imgUrl.IndexOf("/") == 0 && imgUrl.Substring(1) != "/")
                                imgUrl = "https://www.cnblogs.com" + imgUrl;

                            string imgFileName = Path.GetFileNameWithoutExtension(imgUrl);
                            imgFileName = imgFileName.Length > 20 ? utilityHelper.GetMD5_16(imgFileName) : imgFileName;
                            string imgLocalPath = localUserPath + $"\\{categoryName}\\imgs\\{imgFileName}{Path.GetExtension(imgUrl)}";

                            bool isDown = false;
                            if (!IOHelper.ExistsFile(imgLocalPath))
                            {
                                isDown = downResource(imgUrl, imgLocalPath);
                                if (isDown)
                                    pContent = pContent.Replace(item, imgLocalPath.Replace(localUserPath + "\\" + categoryName, "."));
                            }
                            else
                            {
                                pContent = pContent.Replace(item, imgLocalPath.Replace(localUserPath + "\\" + categoryName, "."));
                            }
                        }
                        WriteBlogToFile(categoryName, blogNam, pContent);
                    }
                    Console.WriteLine();
                }
            }

            return false;
        }

        // Extract the stylesheet URLs (Item1) and image URLs (Item2) referenced by a post's HTML
        static Tuple<List<string>, List<string>> getBlogResource(string blogContent)
        {
            List<string> cssList = new List<string>();
            List<string> imgList = new List<string>();
            string regCss = "<link[^>]*?href=\"(?<url>[^>]*?)\"[^>]*?>";
            regCss = @"<link(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>";
            regCss = @"<link(?:(?!rel=).)*rel=(?<n2>['""]?)(?<rel>[^""\s]*)\1(?:(?!href=).)*href=\1(?<url>[^""\s>]*)\1*[^>]*>";
            MatchCollection cssMatchs = Filter(blogContent, regCss);
            if (cssMatchs != null)
            {
                foreach (Match m in cssMatchs)
                {
                    if (m.Groups["rel"].Value == "stylesheet")
                        cssList.Add(m.Groups["url"].Value);
                }

            }


            string regImg = @"<img(?:(?!src=).)*src=(['""]?)(?<url>[^""\s>]*)\1[^>]*>";
            MatchCollection imgMatchs = Filter(blogContent, regImg);
            if (imgMatchs != null)
            {
                foreach (Match m in imgMatchs)
                {
                    imgList.Add(m.Groups["url"].Value);
                }

            }

            return new Tuple<List<string>, List<string>>(cssList, imgList);
        }

        private static Stream GetStream(int timeout, string url, string param)
        {
            Stream res = null;

            try
            {
                //HttpWebRequest req1 = (HttpWebRequest)HttpWebRequest.Create(url);
                //FileWebRequest req2 = (FileWebRequest)HttpWebRequest.Create(url);
                WebRequest req = WebRequest.Create(url);
                req.Timeout = timeout;
                if (!string.IsNullOrEmpty(param))
                {
                    req.Method = "POST";
                    req.ContentType = "application/x-www-form-urlencoded";
                    byte[] bs = Encoding.ASCII.GetBytes(param);
                    req.ContentLength = bs.Length;
                    using (Stream reqStream = req.GetRequestStream())
                    {
                        reqStream.Write(bs, 0, bs.Length);
                    }
                }
                //HttpWebResponse HttpWResp = (HttpWebResponse)req.GetResponse();
                WebResponse HttpWResp = req.GetResponse();
                res = HttpWResp.GetResponseStream();
                return res;

            }
            catch (Exception ex)
            {
                IOHelper.WriteLog(logFilePath, $"request :  {(url.Length > 100 ? url.Substring(0, 100) + "..." : url)}");
                IOHelper.WriteLog(logFilePath, $"param   :  {param}");
                Console.BackgroundColor = ConsoleColor.Red; // set the background color
                Console.ForegroundColor = ConsoleColor.White;
                IOHelper.WriteLog(logFilePath, "\t【Error】" + ex.Message);
                Console.ResetColor();
                return null;
            }
        }

        // Download the page and return its content as a string
        private static string GetContent(string url, string param)
        {
            using (Stream myStream = GetStream(3000, url, param))
            {
                if (myStream != null)
                {
                    using (StreamReader sr = new StreamReader(myStream, Encoding.UTF8))
                    {
                        if (sr.Peek() > 0)
                        {
                            string httpContent = sr.ReadToEnd();
                            myStream.Close();
                            return httpContent;
                        }
                    }
                }
                return string.Empty;
            }
        }

        private static bool downResource(string url, string localFilePath)
        {
            bool ret = false;
            // Strip the query string first, then any "!suffix" (e.g. an image-size marker).
            string tmpUrl = url.Split('?')[0];
            tmpUrl = tmpUrl.Split('!')[0];
            string ext = IOHelper.GetFileExtension(tmpUrl);
            if (ext == ".html")
            {
                return ret;
            }
            using (Stream myStream = GetStream(3000, tmpUrl, ""))
            {
                if (myStream != null)
                {
                    // Make sure the input stream is positioned at the start
                    if (myStream.CanSeek)
                        myStream.Seek(0, SeekOrigin.Begin);
                    try
                    {
                        if (!IOHelper.ExistsDir(localFilePath, true))
                            Directory.CreateDirectory(Path.GetDirectoryName(localFilePath));
                        using (FileStream fileStream = new FileStream(localFilePath, FileMode.Create, FileAccess.Write))
                        {
                            myStream.CopyTo(fileStream);
                            fileStream.Close();
                        }
                        ret = true;
                    }
                    catch (Exception ex)
                    {
                        IOHelper.WriteLog(logFilePath, $"url : {url}{Environment.NewLine}localFilePath : {localFilePath}");
                        IOHelper.WriteLog(logFilePath, "【Error】" + ex.ToString());
                    }
                }
            }
            return ret;
        }

        static MatchCollection Filter(string url, string param, string reg)
        {
            string content = GetContent(url, param);
            if (string.IsNullOrEmpty(content))
            {
                return null;
            }
            string regex = reg.Length > 0 ? reg : @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>(?<text>(?:(?!</?a\b).)*)</a>";

            Regex rx = new Regex(regex);
            return rx.Matches(content);
        }

        static MatchCollection Filter(string pageContent, string reg)
        {
            string content = pageContent;
            if (string.IsNullOrEmpty(content))
            {
                return null;
            }
            string regex = reg.Length > 0 ? reg : @"<a(?:(?!href=).)*href=(['""]?)(?<url>[^""\s>]*)\1[^>]*>(?<text>(?:(?!</?a\b).)*)</a>";

            Regex rx = new Regex(regex);
            return rx.Matches(content);
        }

        // Append a line to the blog list file
        private static void WriteBlogList(string content)
        {
            IOHelper.WriteLine(AppDomain.CurrentDomain.BaseDirectory + outCategoryFile, content);
        }

        // Write the post content to a file
        private static void WriteBlogToFile(string CategoryName, string bLogName, string content)
        {
            string fPath = AppDomain.CurrentDomain.BaseDirectory + outDir + "/" + CategoryName + "/" + bLogName + ".html";
            if (!IOHelper.ExistsDir(fPath, true))
            {
                Directory.CreateDirectory(Path.GetDirectoryName(fPath));
            }
            IOHelper.WriteLine(fPath, content);
        }


    }

    class utilityHelper
    {
        public static string UrlEncode(string str)
        {
            string tmp = WebUtility.UrlEncode(str);
            return tmp.Replace("+", "%20");
        }

        public static string HtmlDecode(string htmlCode)
        {
            string tmp = WebUtility.HtmlDecode(htmlCode);
            tmp = tmp.Replace('\\', '_');
            tmp = tmp.Replace('/', '_');
            tmp = tmp.Replace(':', '_');
            tmp = tmp.Replace('*', '_');
            tmp = tmp.Replace('?', '_');
            tmp = tmp.Replace('<', '_');
            tmp = tmp.Replace('>', '_');
            tmp = tmp.Replace('|', '_');
            tmp = tmp.Replace('"', '_');
            return tmp;
        }

        /// <summary>
        /// Returns the MD5 hash of the input as a 32-character hexadecimal string
        /// </summary>
        /// <param name="input"></param>
        /// <returns></returns>
        public static string GetMD5_32(string input)
        {
            System.Security.Cryptography.MD5 md5 = System.Security.Cryptography.MD5.Create();
            byte[] data = md5.ComputeHash(System.Text.Encoding.Default.GetBytes(input));
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < data.Length; i++)
            {
                sb.Append(data[i].ToString("x2"));
            }
            return sb.ToString();

        }
        /// <summary>
        /// Returns the middle 16 characters (offset 8) of the 32-character MD5 hash
        /// </summary>
        /// <param name="input"></param>
        /// <returns></returns>
        public static string GetMD5_16(string input)
        {
            return GetMD5_32(input).Substring(8, 16);
        }

        /// <summary>
        /// Returns 8 characters (offset 8) of the 32-character MD5 hash
        /// </summary>
        /// <param name="input"></param>
        /// <returns></returns>
        public static string GetMD5_8(string input)
        {
            return GetMD5_32(input).Substring(8, 8);
        }


    }

    class IOHelper
    {

        public static bool ExistsFile(string fileName)
        {
            if (fileName == null || fileName.Trim() == "")
            {
                return false;
            }
            if (File.Exists(fileName))
            {
                return true;
            }
            return false;
        }

        public static bool ExistsDir(string localPath, bool isFile = false)
        {
            string dPath = localPath;
            if (isFile)
            {
                dPath = Path.GetDirectoryName(dPath);
            }

            return Directory.Exists(dPath);
        }

        public static bool DeleteFile(string fileName)
        {
            if (ExistsFile(fileName))
            {
                File.Delete(fileName);
                return true;
            }
            return false;
        }

        public static bool DeleteDir(string dirPath)
        {
            string tmp = dirPath;
            if (!string.IsNullOrEmpty(Path.GetExtension(tmp)))
            {
                tmp = Path.GetDirectoryName(tmp);
            }
            var di = new DirectoryInfo(tmp);
            if (di.Exists)
            {
                di.Delete(true);
                return true;
            }
            else
            {
                return false;
            }

        }

        public static bool WriteLine(string fileName, string content)
        {
            if (!ExistsDir(fileName, true))
            {
                Directory.CreateDirectory(Path.GetDirectoryName(fileName));
            }
            using (FileStream fileStream = new FileStream(fileName, FileMode.Append))
            {
                lock (fileStream)
                {
                    if (!fileStream.CanWrite)
                    {
                        throw new SecurityException("File fileName=" + fileName + " is read-only and cannot be written to!");
                    }
                    StreamWriter streamWriter = new StreamWriter(fileStream, Encoding.UTF8);
                    streamWriter.WriteLine(content);
                    streamWriter.Dispose(); // flushes and closes the writer; a separate Close() call is redundant
                    return true;
                }
            }
        }

        public static string GetFileExtension(string filePath)
        {
            return Path.GetExtension(filePath);
        }

        public static void WriteLog(string logFilePath, string msg)
        {
            Console.WriteLine(msg);
            WriteLine(logFilePath, msg);
        }
    }

    class DataItem
    {
        public string Text { get; set; }
        public string Value { get; set; }
    }

}
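
The resource-handling steps in the listing above (pulling image URLs out of the post HTML in getBlogResource, then trimming the URL in downResource before downloading) can be tried in isolation. The following is only a minimal sketch under stated assumptions: the class name ResourceSketch, its method names, and the example.com URLs are hypothetical and not part of the original program. The image pattern itself is copied from regImg.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper class for illustration; not part of the original program.
static class ResourceSketch
{
    // Same pattern as regImg in getBlogResource: the unnamed group (\1) captures
    // the optional opening quote, and the named group "url" captures the src value.
    const string ImgPattern = @"<img(?:(?!src=).)*src=(['""]?)(?<url>[^""\s>]*)\1[^>]*>";

    // Returns the src value of every <img> tag found in the given HTML.
    public static List<string> ExtractImageUrls(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Regex.Matches(html ?? string.Empty, ImgPattern))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }

    // Drops a query string ("?...") and then a "!suffix" from the URL.
    // The second Split works on the already-trimmed value, not on the original URL.
    public static string StripUrlSuffix(string url)
    {
        string tmp = url.Split('?')[0];
        return tmp.Split('!')[0];
    }
}
```

With this sketch, ResourceSketch.StripUrlSuffix("https://example.com/a.png!small?v=1") returns "https://example.com/a.png", so the file extension can be read reliably before the download starts.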

 

posted on 2020-04-07 23:03  jack_Meng
