Regular Expressions - AdolphYang


Chapter 4   Regular Expressions

Section 1
Basic regular expression metacharacters (*)

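.     any character except '\n'
[]    a character class: [.] is a literal '.' (same as \.), [0-9] is any one digit from 0 to 9, [0-9a-zA-Z] is any one digit or letter
|     alternation: [0-9]|[a-z]
+     one or more times
.+    any character, one or more times
[0-9]*   a digit from 0 to 9, zero or more times
?     zero or one time
[0-9]{1,}   at least once
[0-9]{2,4}   the preceding expression occurs at least 2 and at most 4 times
[0-9]{,4}   // wrong: the lower bound cannot be omitted; write [0-9]{0,4}
\d    any digit
\D    any non-digit
\s    whitespace
\S    non-whitespace
\w    any word character, including the underscore; for ASCII text equivalent to [0-9a-zA-Z_]
\W    any non-word character; for ASCII text equivalent to [^0-9a-zA-Z_]
^     start of input, or negation inside []: [^0-9] is a non-digit; ^[0-9] starts with a digit
$     end of input
\b    word boundary
()    grouping: raises precedence and captures a group
\.    a literal .
\(    a literal (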
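
To make the table above concrete, here is a minimal sketch that exercises a few of the metacharacters with the static Regex methods. The sample strings and variable names are made up for illustration, and the code assumes a compiler that supports top-level statements (C# 9 or later).

```csharp
using System;
using System.Text.RegularExpressions;

// Anchoring with ^ and $ makes IsMatch test the whole string, not a substring.
bool threeDigitsOk = Regex.IsMatch("123", "^[0-9]{2,4}$");     // 2 to 4 digits
bool fiveDigitsOk = Regex.IsMatch("12345", "^[0-9]{2,4}$");    // too many digits
bool optionalSign = Regex.IsMatch("-7", "^-?[0-9]+$");         // ? makes the minus sign optional

// [.] matches only a literal dot, while a bare . matches any character except '\n'.
bool letterIsDot = Regex.IsMatch("a", "^[.]$");
bool letterIsAny = Regex.IsMatch("a", "^.$");

// \b marks a word boundary and \w+ is a run of word characters.
string firstWord = Regex.Match("hello world", @"\b\w+\b").Value;

// [^0-9] negates the class, so replacing it with "" keeps only the digits.
string digitsOnly = Regex.Replace("a1b2", "[^0-9]", "");

Console.WriteLine(threeDigitsOk + " " + fiveDigitsOk + " " + optionalSign);
Console.WriteLine(letterIsDot + " " + letterIsAny);
Console.WriteLine(firstWord + " " + digitsOnly);
```

The verbatim string @"\b\w+\b" avoids doubling every backslash, which is usually the easier way to write patterns that contain escapes.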
A regular expression for email addresses, e.g. xiaoming123@126.com.cn:
[0-9a-zA-Z_.-]+@[0-9a-zA-Z_.-]+([.][a-zA-Z]+){1,2}
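
As a quick check of the pattern above, the sketch below validates a whole string by wrapping the pattern in ^ and $ and adding two capture groups for the account and the domain. The anchors, the extra groups, and the second test string are assumptions added for illustration; the notes only give the bare pattern.

```csharp
using System;
using System.Text.RegularExpressions;

// The email pattern from the notes, anchored so the entire input must match.
// Group 1 captures the account, group 2 the domain.
string emailPattern = "^([0-9a-zA-Z_.-]+)@([0-9a-zA-Z_.-]+([.][a-zA-Z]+){1,2})$";

Match email = Regex.Match("xiaoming123@126.com.cn", emailPattern);
string account = email.Groups[1].Value;
string domain = email.Groups[2].Value;

// A string without '@' cannot match.
bool plainTextMatches = Regex.IsMatch("not-an-email", emailPattern);

Console.WriteLine(email.Success + ": " + account + " at " + domain);
Console.WriteLine(plainTextMatches);
```

This pattern is deliberately simple and accepts many strings that are not deliverable addresses, so it is better suited to extraction from text than to strict validation.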

Section 2
Example: extracting email addresses from a page

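--- http://localhost:8080/email.html --- any web page will do ---
// Requires: using System; using System.Net; using System.Text; using System.Text.RegularExpressions;
WebClient wc = new WebClient();   // WebClient (System.Net) sends data to and receives data from a resource identified by a URI
// If the text comes back garbled, look in the page source for the charset declared next to http-equiv and use that encoding here
wc.Encoding = Encoding.UTF8;   // encoding used to decode the download
// Download the whole page as one string (the URL must be absolute, including the http:// scheme)
string html = wc.DownloadString("http://www.rupeng.com/Segments/Index/611");
// Regex reg = new Regex(pattern);   // the C# regex class; the instance form takes the pattern in the constructor, which is more verbose, so the static form is used below
// Matches (like Match and IsMatch) is available both as an instance method and as a static method
// MatchCollection matches = reg.Matches(html);   // search the string and return the collection of matches
// MatchCollection matches = Regex.Matches(html, "[0-9a-zA-Z_.-]+@[0-9a-zA-Z_.-]+([.][a-zA-Z]+){1,2}");
MatchCollection matches = Regex.Matches(html, "([0-9a-zA-Z_.-]+)@([0-9a-zA-Z_.-]+([.][a-zA-Z]+){1,2})");   // group 1 captures the account, group 2 the domain
foreach (Match m in matches)
{
   if (m.Success)   // always true for items returned by Matches; kept as a guard
   {
       //Console.WriteLine(m.Value);
       Console.WriteLine(m.Groups[1].Value + "===" + m.Groups[2].Value);
   }
}
Console.WriteLine("========");
Console.WriteLine(matches.Count);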
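
The same extraction can be tried without a network connection. The sketch below runs the grouped pattern over a made-up HTML fragment (the fragment and the address admin@example.com are assumptions) and compares the instance form of Matches with the static form mentioned in the comments above.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical page content standing in for the downloaded HTML.
string sampleHtml = "<p>Contact: xiaoming123@126.com.cn or admin@example.com</p>";
string groupedPattern = "([0-9a-zA-Z_.-]+)@([0-9a-zA-Z_.-]+([.][a-zA-Z]+){1,2})";

// Instance form: the pattern goes into the constructor, the input into Matches.
Regex reg = new Regex(groupedPattern);
MatchCollection viaInstance = reg.Matches(sampleHtml);

// Static form: input and pattern are passed together.
MatchCollection viaStatic = Regex.Matches(sampleHtml, groupedPattern);

foreach (Match found in viaInstance)
{
    // Groups[0] is the whole match; Groups[1] and Groups[2] are the captures.
    Console.WriteLine(found.Groups[1].Value + "===" + found.Groups[2].Value);
}
Console.WriteLine(viaInstance.Count + " / " + viaStatic.Count);
```

The instance form is worth considering when one pattern is applied to many inputs, since the Regex object can be built once and reused.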

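Section 3
Extracting job postings from a web page

--- 51job (前程无忧) ---
// Create a WebClient to access the Internet; it sends data to a URI and receives resources from it
// Download the page as a string; if the text is garbled, specify the encoding used to read the download
// Get the collection of matches of the regular expression against the page string --- how should this regular expression be written? ---
--- View the page source --- search it for "信息专员" (information specialist) --- copy that hyperlink --- copy a second one and compare the two ---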
---<a adid="" onmousedown="return AdsClick()" href="http://search.51job.com/job/62995285,c.html" onclick="zzSearch.acStatRecJob( 1 );" class="jobname" target="_blank" >业务员（兼职）</a>
---<a\s.+href="http://search.51job.com/job/\d{8},c.html".+>(.+)</a> --- when written into a C# string literal, each '\' and '"' must be escaped
// Iterate over the matches and print each successful one
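
The escaping step described above can be sketched as follows. The anchor tag is the sample line copied from the page source; the variable names are made up for illustration. The input uses a verbatim string (quotes doubled), while the pattern uses an ordinary string so that the '\' and '"' escaping is visible.

```csharp
using System;
using System.Text.RegularExpressions;

// The sample hyperlink from the page source, as a verbatim string ("" stands for one ").
string jobHtml = @"<a adid="""" onmousedown=""return AdsClick()"" href=""http://search.51job.com/job/62995285,c.html"" onclick=""zzSearch.acStatRecJob( 1 );"" class=""jobname"" target=""_blank"" >业务员（兼职）</a>";

// The pattern from the notes in an ordinary C# string: \s becomes \\s, \d becomes \\d, " becomes \".
string jobPattern = "<a\\s.+href=\"http://search.51job.com/job/\\d{8},c.html\".+>(.+)</a>";

Match job = Regex.Match(jobHtml, jobPattern);
string jobTitle = job.Groups[1].Value;   // the link text captured by (.+)

if (job.Success)
{
    Console.WriteLine(jobTitle);
}
```

Because .+ is greedy, this pattern can swallow several links when more than one appears on the same line; a lazy .+? or a class such as [^<]+ for the link text is a common way to tighten it.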

Section 4
Grabbing images from a web page (*)

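// Requires: using System; using System.Net; using System.Text; using System.Text.RegularExpressions;
for (int i = 2; i < 26; i++)   // walk pages 2 to 25 to download several images
{
    // Create a WebClient to fetch resources from a URI
    WebClient wc = new WebClient();
    // Download the page as a string; if the text is garbled, specify the encoding
    wc.Encoding = Encoding.UTF8;
    string html = wc.DownloadString("http://www.souutu.com/mnmm/xgmm/6384_" + i + ".html");
    // Get the collection of matches of the regular expression against the page string
    string regular = "<img\\s.+src=\"(http://img.souutu.com/2015/0123/(20150123[0-9]{9}\\.jpg))\".+>";
    MatchCollection matches = Regex.Matches(html, regular);
    // Iterate over the matches
    foreach (Match m in matches)
    {
        // An if (m.Success) check could go here, although Matches only returns successful matches
        string path = m.Groups[1].Value;   // group 1 is the actual image URL
        string imgName = m.Groups[2].Value;   // group 2 is the image file name
        //Console.WriteLine(path + "\n\n" + imgName);
        // Download the image and write it to a file
        //string currPath = AppDomain.CurrentDomain.BaseDirectory + "image\\" + imgName;   // this attempt failed, possibly because the image directory did not exist (DownloadFile does not create directories)
        string currPath = AppDomain.CurrentDomain.BaseDirectory + imgName;
        wc.DownloadFile(path, currPath);
        Console.WriteLine("Downloaded the image from page " + i);
    }
}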
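
The commented-out attempt to save into an image subfolder is marked as failed in the code above. One possible cause is that the folder did not exist, so the sketch below creates it before building the target path. The img tag is a made-up sample, the temp directory is used instead of the application base directory so the sketch can run anywhere, and the dot before jpg is escaped in the pattern; all of these are assumptions. No download is performed.

```csharp
using System;
using System.IO;
using System.Text.RegularExpressions;

// Hypothetical img tag in the shape the pattern expects.
string imgTag = "<img alt=\"\" src=\"http://img.souutu.com/2015/0123/20150123123456789.jpg\" />";
string imgPattern = "<img\\s.+src=\"(http://img.souutu.com/2015/0123/(20150123[0-9]{9}\\.jpg))\".+>";

Match img = Regex.Match(imgTag, imgPattern);
string imgUrl = img.Groups[1].Value;    // full image URL
string fileName = img.Groups[2].Value;  // file name only

// CreateDirectory does nothing if the folder already exists.
string imageDir = Path.Combine(Path.GetTempPath(), "image");
Directory.CreateDirectory(imageDir);

// Path.Combine inserts the directory separator, so no manual "\\" is needed.
string targetPath = Path.Combine(imageDir, fileName);
Console.WriteLine(imgUrl);
Console.WriteLine(targetPath);
```

With the folder in place, a call such as wc.DownloadFile(imgUrl, targetPath) would have a valid destination to write to.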

posted on 2015-08-18 11:45 AdolphYang
