Extracting All Links from a Web Page with Regular Expressions in ASP.NET

2011-09-22 23:44  侬卡

Add the following namespaces:
using System.Text.RegularExpressions;
using System.IO;
using System.Collections;
using System.Net;

Key code:
(TextBox1 takes the URL; TextBox2 displays every link found on that page.)

// URL of the page to fetch
String web_url = this.TextBox1.Text.Trim();
// Holds the page source
String code = String.Empty;
HttpWebRequest request = (HttpWebRequest)WebRequest.Create(web_url);
WebResponse response = request.GetResponse();
// Read the page source
StreamReader sr = new StreamReader(response.GetResponseStream());
code = sr.ReadToEnd();
sr.Close();
response.Close();
// Holds the links already seen
ArrayList list = new ArrayList();
// Regular expression for a link (matches http:// URLs only)
String reg = @"http://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?";
Regex regex = new Regex(reg, RegexOptions.IgnoreCase);
// Collection of all matches
MatchCollection mc = regex.Matches(code);
for (int i = 0; i < mc.Count; i++)
{
    // Flag: has this link been seen before?
    bool hasExist = false;
    String name = mc[i].ToString();
    foreach (String one in list)
    {
        if (name == one)
        {
            // Link already exists
            hasExist = true;
            break;
        }
    }
    // Link not seen yet: record it so later duplicates are skipped, then display it
    if (!hasExist)
    {
        list.Add(name);
        this.TextBox2.Text += name + "\n";
    }
}
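
The duplicate check can also be written with a HashSet<string>, which avoids rescanning the whole list for every match. The following is only a minimal sketch: the class name LinkExtractor and the method name ExtractLinks are hypothetical and not part of the original code, and the sketch works on a string that has already been downloaded, so it makes no network request. The pattern is the same one used above.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, named for illustration only.
public static class LinkExtractor
{
    // Same pattern as in the article; it matches http:// URLs only.
    private static readonly Regex UrlPattern = new Regex(
        @"http://([\w-]+\.)+[\w-]+(/[\w- ./?%&=]*)?",
        RegexOptions.IgnoreCase);

    // Returns each distinct link found in html, in order of first appearance.
    public static List<string> ExtractLinks(string html)
    {
        HashSet<string> seen = new HashSet<string>(StringComparer.Ordinal);
        List<string> links = new List<string>();
        foreach (Match m in UrlPattern.Matches(html))
        {
            // HashSet.Add returns false when the value is already present.
            if (seen.Add(m.Value))
            {
                links.Add(m.Value);
            }
        }
        return links;
    }
}
```

With this helper, the page handler would only need something like this.TextBox2.Text = String.Join("\n", LinkExtractor.ExtractLinks(code).ToArray()); after reading the page source into code.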