Matching Chinese Characters and Punctuation with Regular Expressions

The pattern can be written like this:

string strRegex = @"[\u4e00-\u9fa5]|[（）《》—；，。“”<>！]";

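The first part matches Chinese characters (the range \u4e00-\u9fa5 covers the basic CJK Unified Ideographs), and the second part lists the punctuation marks to match. Add or remove marks in the second character class as needed.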
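As a minimal sketch of how the pattern might be used, the class below checks whether a string contains any matching character and extracts only the matched characters. The class and method names (`ChineseTextDemo`, `ContainsChinese`, `ExtractChinese`) are made up for illustration; the pattern is the one from the text, written without redundant backslashes.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class ChineseTextDemo
{
    // Pattern from the text: one CJK ideograph, or one of the listed punctuation marks.
    private static readonly Regex ChineseRegex =
        new Regex(@"[\u4e00-\u9fa5]|[（）《》—；，。“”<>！]");

    // Returns true if the input contains at least one matching character.
    public static bool ContainsChinese(string input)
    {
        return !string.IsNullOrEmpty(input) && ChineseRegex.IsMatch(input);
    }

    // Concatenates every matched character and drops everything else.
    public static string ExtractChinese(string input)
    {
        if (string.IsNullOrEmpty(input))
        {
            return string.Empty;
        }

        return string.Concat(
            ChineseRegex.Matches(input).Cast<Match>().Select(m => m.Value));
    }
}
```

For example, `ChineseTextDemo.ExtractChinese("Hello，世界！abc 123")` returns `，世界！`, because only the full-width comma, the two ideographs, and the full-width exclamation mark are covered by the pattern.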

In addition, for processing HTML source, HtmlAgilityPack is recommended. The code below removes the scripts, styles, and comments from a document.

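using System.Linq;
using HtmlAgilityPack;

// Parses the HTML and strips <script>, <style>, and comment nodes.
public static HtmlDocument InitializeHtmlDoc(string htmlString)
{
    if (string.IsNullOrEmpty(htmlString))
    {
        return null;
    }

    HtmlDocument doc = new HtmlDocument();
    doc.LoadHtml(htmlString);

    // ToList() takes a snapshot so nodes can be removed while iterating.
    doc.DocumentNode.Descendants()
        .Where(n => n.Name == "script" || n.Name == "style" || n.Name == "#comment")
        .ToList()
        .ForEach(n => n.Remove());

    return doc;
}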

HtmlAgilityPack uses XPath syntax, and "//comment()" in XPath means "all comment nodes". If "#comment" does not work, replace it with the XPath form. See http://www.cnblogs.com/rupeng/archive/2012/02/07/2342012.html
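The XPath variant mentioned above could look roughly like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced; the class and method names (`HtmlCleaner`, `StripNoise`) are hypothetical. The XPath union selects script, style, and comment nodes in a single query.

```csharp
using System.Linq;
using HtmlAgilityPack;

public static class HtmlCleaner
{
    // Removes script, style, and comment nodes selected by one XPath union.
    public static HtmlDocument StripNoise(string htmlString)
    {
        if (string.IsNullOrEmpty(htmlString))
        {
            return null;
        }

        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(htmlString);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection nodes =
            doc.DocumentNode.SelectNodes("//script|//style|//comment()");
        if (nodes != null)
        {
            // Copy to a list first so removal cannot disturb the enumeration.
            foreach (HtmlNode node in nodes.ToList())
            {
                node.Remove();
            }
        }

        return doc;
    }
}
```

The null check matters here: calling `foreach` directly on the result of `SelectNodes` throws when the page has no matching nodes.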

To read the content of a (static) web page from a URL, the following code can be used.

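using System;
using System.IO;
using System.Net;
using System.Text;

// Downloads a static page and decodes it with the charset the server reports.
public static string GetHtmlStr(string url)
{
    if (string.IsNullOrEmpty(url))
    {
        return string.Empty;
    }

    string html = string.Empty;
    try
    {
        WebRequest webRequest = WebRequest.Create(url);
        webRequest.Timeout = 30 * 1000; // milliseconds
        using (WebResponse webResponse = webRequest.GetResponse())
        {
            HttpWebResponse httpResponse = (HttpWebResponse)webResponse;
            if (httpResponse.StatusCode == HttpStatusCode.OK)
            {
                // Fall back to the system default encoding when no charset is reported.
                string coder = httpResponse.CharacterSet;
                Encoding encoding = string.IsNullOrEmpty(coder) ? Encoding.Default : Encoding.GetEncoding(coder);

                using (Stream stream = httpResponse.GetResponseStream())
                using (StreamReader reader = new StreamReader(stream, encoding))
                {
                    html = reader.ReadToEnd();
                }
            }
        }
    }
    catch (Exception)
    {
        // The request may time out or fail; an empty string is returned in that case.
    }

    return html;
}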
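One weak spot in the code above is that the reported charset is passed straight to `Encoding.GetEncoding`, which throws an `ArgumentException` for a name it does not recognize, so the whole page is lost to the catch block. A minimal sketch of a more forgiving lookup follows. The names `CharsetHelper` and `ResolveEncoding` are hypothetical, and falling back to UTF-8 (rather than `Encoding.Default`) is an assumption, not the author's choice. Note also that on .NET Core and later, encodings such as GB2312 or GBK are typically unavailable until `CodePagesEncodingProvider.Instance` is registered with `Encoding.RegisterProvider`.

```csharp
using System;
using System.Text;

public static class CharsetHelper
{
    // Maps a charset name from an HTTP response to an Encoding.
    // Falls back to UTF-8 (an assumption) when the name is missing or unknown.
    public static Encoding ResolveEncoding(string charset)
    {
        if (string.IsNullOrWhiteSpace(charset))
        {
            return Encoding.UTF8;
        }

        try
        {
            // Strip whitespace and quotes in case the header value was quoted.
            return Encoding.GetEncoding(charset.Trim(' ', '"', '\''));
        }
        catch (ArgumentException)
        {
            // Unknown or unsupported charset name.
            return Encoding.UTF8;
        }
    }
}
```

In `GetHtmlStr`, the encoding passed to `StreamReader` could then be `CharsetHelper.ResolveEncoding(coder)`, so a bad charset header degrades to a best-effort decode instead of an empty result.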

posted @ 2015-07-13 11:36 维博.WILBUR
