Removing external links and filtering unsafe tags from HTML with C#

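Choosing a tool:

At first I tried to parse the HTML with XDocument.Parse() and regular expressions, but ran into too many problems: real-world HTML is rarely well-formed XML.

A Google search turned up a powerful tool, Html Agility Pack (HAP). Its main features: it copes with unclosed tags, it supports XPath and LINQ, it handles entities, and so on.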
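As a rough illustration of those features, the sketch below loads a fragment with an unclosed tag and queries it once with XPath and once with LINQ. The sample markup and the class name `HapDemo` are made up for illustration, and it assumes the HtmlAgilityPack package is referenced by the project.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class HapDemo {
    static void Main() {
        // Hypothetical input: the <p> is never closed
        var html = "<div><p>Hello <a href=\"/about\">about</a></div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant parser: malformed markup does not throw

        // XPath query; SelectNodes returns null when nothing matches
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors != null) {
            foreach (var a in anchors)
                Console.WriteLine(a.GetAttributeValue("href", string.Empty));
        }

        // The same lookup with LINQ over Descendants()
        var hrefs = doc.DocumentNode.Descendants("a")
            .Select(a => a.GetAttributeValue("href", string.Empty));
        Console.WriteLine(string.Join(", ", hrefs));
    }
}
```

The null check after SelectNodes matters in practice: an input without any matching element yields null rather than an empty collection.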

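Problems to solve:

Remove links that point to external sites.
Remove inline scripts on elements, for example onclick, onload, onblur and so on.
Remove disallowed tags, for example script, object and so on.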

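Note that an anchor can carry more than one href attribute; likewise, an element can carry several "onevent" attributes. Checking only the first match is therefore not enough: every attribute has to be examined.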
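A minimal sketch of that point, assuming HAP keeps duplicate attributes in the node's attribute collection: looking up a single attribute by name (for example `Attributes["href"]`) only hands back one of them, so the loop below walks the whole collection instead. The markup and the class name `DuplicateAttributeDemo` are hypothetical.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

static class DuplicateAttributeDemo {
    static void Main() {
        // Hypothetical markup with two href and two onclick attributes
        var html = "<a href=\"http://a.example/\" href=\"http://b.example/\" "
                 + "onclick=\"x()\" onclick=\"y()\">link</a>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);
        var anchor = doc.DocumentNode.SelectSingleNode("//a");

        // Copy the collection first so attributes can be removed while looping
        foreach (var attr in anchor.Attributes.ToList()) {
            bool isHref = attr.Name == "href";
            bool isHandler = attr.Name.StartsWith("on", StringComparison.OrdinalIgnoreCase);
            if (isHref || isHandler) attr.Remove();
        }

        Console.WriteLine(anchor.OuterHtml);
    }
}
```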

This was my first time using C# extension methods and enums, and it was a lot of fun.

The code:


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using HtmlAgilityPack;
using System.Text.RegularExpressions;

namespace WebLibrary {
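    [Flags]
    public enum ReplaceHtmlOptions {
        // Explicit bit values so the options can be combined and tested with HasFlag
        None = 0,
        DisableExternalLinks = 1,
        StripUnsafeTages = 2,
        All = DisableExternalLinks | StripUnsafeTages
    }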

    /// <summary>
    /// Experimental extension methods for replacing content in HTML strings
    /// </summary>
    public static class ExtensionMethods {

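        // Links to keep (whitelist); verbatim string so the dots are matched literally
        const string ReserveLinks = @"^http://8\.8\.8\.[89]/";
        // Unsafe tags; anchored so that "head" does not also match "thead"
        const string UnsafeTags = "^(?:head|iframe|style|script|object|embed|applet|noframes|noscript|noembed)$";
        // Default truncation length
        const int DefaultTruncationLength = 250;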

        public static string ReplaceHtml(this string htmlString) {
            return htmlString.ReplaceHtml(ReplaceHtmlOptions.All);
        }


        /// <summary>
        /// Filters an HTML string
        /// </summary>
        /// <param name="htmlString">Source HTML string</param>
        /// <param name="option">Filter options</param>
        /// <returns>The filtered HTML, or an empty string for null or empty input</returns>
        public static string ReplaceHtml(this string htmlString, ReplaceHtmlOptions option) {
            if (string.IsNullOrEmpty(htmlString)) return string.Empty;

            var hdoc = new HtmlDocument() { OptionWriteEmptyNodes = true };
            hdoc.LoadHtml(htmlString);

            var needToStripUnsafeTages = option.HasFlag(ReplaceHtmlOptions.StripUnsafeTages);
            var needToDisableExternalLinks = option.HasFlag(ReplaceHtmlOptions.DisableExternalLinks);

            var nodes = hdoc.DocumentNode.SelectNodes("//*");
            if (nodes != null) {
                nodes.ToList().ForEach(node => {
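                    // Remove unsafe tags; the element is detached together with its content,
                    // so its attributes need no further filtering
                    if (needToStripUnsafeTages && Regex.IsMatch(node.Name, UnsafeTags)) {
                        node.Remove();
                        return;
                    }

                    // Filter attributes (on a copy, because attributes are removed while iterating)
                    node.Attributes.ToList().ForEach(attr => {
                        // Remove inline event handlers (onclick, onload, ...)
                        if (attr.Name.StartsWith("on", StringComparison.OrdinalIgnoreCase)) {
                            attr.Remove();
                            return;
                        }

                        // Remove external links: any href on an anchor that is not whitelisted
                        if (
                            needToDisableExternalLinks
                            && node.Name == "a"
                            && attr.Name == "href"
                            && !Regex.IsMatch(attr.Value, ReserveLinks)
                           ) {
                            attr.Remove();
                        }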
                    });

                });
            }

            return hdoc.DocumentNode.WriteTo();
        }
    }
}
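To show how the extension method might be called, here is a short usage sketch. It assumes the listing above is compiled into the same project; the sample markup and the class name `ReplaceHtmlDemo` are made up for illustration.

```csharp
using System;
using WebLibrary;

static class ReplaceHtmlDemo {
    static void Main() {
        // Hypothetical input: an external link, a whitelisted link,
        // an inline handler and a script element
        var html =
            "<p onclick=\"track()\">" +
            "<a href=\"http://example.com/\">external</a> " +
            "<a href=\"http://8.8.8.8/page\">internal</a>" +
            "</p><script>alert(1)</script>";

        // Both filters; same as calling html.ReplaceHtml() without arguments
        Console.WriteLine(html.ReplaceHtml(ReplaceHtmlOptions.All));

        // Only strip unsafe tags; href attributes are left alone
        Console.WriteLine(html.ReplaceHtml(ReplaceHtmlOptions.StripUnsafeTages));

        // Flags can also be combined explicitly
        var both = ReplaceHtmlOptions.DisableExternalLinks
                 | ReplaceHtmlOptions.StripUnsafeTages;
        Console.WriteLine(html.ReplaceHtml(both));
    }
}
```

Note that a link is "disabled" by dropping its href attribute: the anchor element and its text stay in the output. Because ReserveLinks acts as a whitelist, relative links such as "/page" lose their href as well.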

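Further experiments:

HAP can also be used to truncate HTML. The idea is to add up the text length node by node, cut the source string at a node boundary, and let HAP re-parse the cut string so that tags left open are balanced again.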



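        /// <summary>
        /// Truncates HTML to roughly the given text length
        /// </summary>
        /// <param name="htmlString">Source HTML string</param>
        /// <param name="length">Maximum text length to keep</param>
        /// <returns>The truncated HTML, or an empty string if nothing fits</returns>
        public static string truncateHtml(this string htmlString, int length) {

            if (string.IsNullOrEmpty(htmlString)) return string.Empty;

            var hdoc = new HtmlDocument() { OptionWriteEmptyNodes = true };
            hdoc.LoadHtml(htmlString);

            // SelectNodes returns null when the input contains no elements
            var nodes = hdoc.DocumentNode.SelectNodes("//*");
            if (nodes == null) return string.Empty;

            var countLength = 0;
            var maxLength = length;

            // Walk the elements that have exactly one child, in document order,
            // adding up their text until the limit would be exceeded
            var lastNode =
                    nodes
                        .Where(n => n.HasChildNodes && n.ChildNodes.Count == 1)
                        .TakeWhile(n => 
                        {
                            countLength += n.InnerText.Trim().Length;
                            return countLength <= maxLength;
                        })
                        .LastOrDefault();

            if (lastNode == null) return string.Empty;

            // Cut the source at the start of that node (the node itself is not kept),
            // then re-parse so HAP can balance the tags that were left open
            hdoc.LoadHtml(htmlString.Substring(0, lastNode.StreamPosition));
            return hdoc.DocumentNode.WriteTo();
        }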
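A brief usage sketch, assuming the method above is placed inside the ExtensionMethods class from the first listing. The input string and the class name `TruncateDemo` are hypothetical.

```csharp
using System;
using WebLibrary;

static class TruncateDemo {
    static void Main() {
        // Hypothetical input: three short paragraphs inside a div
        var html = "<div><p>first paragraph</p>"
                 + "<p>second paragraph</p>"
                 + "<p>third paragraph</p></div>";

        // Ask for at most 40 characters of text
        var shortened = html.truncateHtml(40);
        Console.WriteLine(shortened);
    }
}
```

Since the cut always lands on a node boundary, the result can be noticeably shorter than the requested length, and it is empty when even the first counted node exceeds the limit.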

posted @ 2010-06-19 19:04 ambar
