03 - Advanced Python - Intro to Web Scraping - Regular Expressions

【urllib and urllib2】

urllib and urllib2 are two of Python 2's built-in networking modules, and they make it easy to access resources over the network.

# coding: utf-8
import urllib2

# Fetch the page and print its raw HTML (Python 2).
res = urllib2.urlopen('http://www.baidu.com')
html = res.read()
print(html)

If we want to scrape some images, we can do it like this:

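#!/usr/bin/env python
# coding: utf-8
import re
import urllib2

url = "https://www.douban.com/doulist/121326/"

# Send a User-Agent header; some sites reject requests without one.
header = {'User-Agent': 'moto x'}
req = urllib2.Request(url, headers=header)

response = urllib2.urlopen(req)
data = response.read()

# Capture the src attribute of each <img> tag.
# The non-greedy .+? keeps every match inside a single tag.
p = re.compile(r'<img.+?src="(.*?)"')

matches = p.findall(data)
print matches

for m in matches:
    # Name each file after the last path segment of its URL.
    # Images are binary data, so open the file in 'wb' mode.
    with open(m.split('/')[-1], 'wb') as f:
        f.write(urllib2.urlopen(m).read())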

This downloads all the cover images from a Douban movie list page and saves each one under the file name taken from its URL.
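To see what the pattern and the file-naming step do without touching the network, here is a minimal offline sketch. The HTML snippet and the example.com URLs are made up for illustration; only the regular expression and the split('/') trick come from the script above.

```python
import re

# Hypothetical HTML standing in for a downloaded page (not real Douban markup).
sample_html = (
    '<div><img class="cover" src="https://example.com/img/a1.jpg" alt="A">'
    '<img src="https://example.com/img/b2.png"></div>'
)

# Non-greedy .+? stops at the first src=" after each <img.
img_pattern = re.compile(r'<img.+?src="(.*?)"')

# findall returns only the captured group, i.e. the URL inside src="...".
image_urls = img_pattern.findall(sample_html)

# The last path segment of each URL becomes the local file name.
file_names = [u.split('/')[-1] for u in image_urls]

print(image_urls)  # ['https://example.com/img/a1.jpg', 'https://example.com/img/b2.png']
print(file_names)  # ['a1.jpg', 'b2.png']
```

Real pages often use relative or protocol-less src values, so a more robust script would probably need to resolve each URL against the page address before downloading it.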

【Regular Expressions】

To use regular expressions in Python, you need the re module.

<html><body><h1>hello world<h1></body></html>

Say we want to extract hello world from this string:

#coding:utf-8
import re

key = r"<html><body><h1>hello world<h1></body></html>"
p1 = r"<h1>.+<h1>"
pattern1 = re.compile(p1)
print pattern1.findall(key)

First, . stands for any single character (except a newline), and + means that whatever precedes it occurs one or more times.
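Note that the pattern as written matches the tags along with the text. One common way to pull out only the text is a capturing group; the sketch below is an illustration using the same sample string, not part of the original example.

```python
import re

key = r"<html><body><h1>hello world<h1></body></html>"

# Without a group, findall returns the whole match, tags included.
whole = re.findall(r"<h1>.+<h1>", key)

# With one capturing group, findall returns only the text inside the group.
inner = re.findall(r"<h1>(.+)<h1>", key)

# + requires at least one character, so an empty heading does not match.
empty = re.findall(r"<h1>.+<h1>", "<h1><h1>")

print(whole)  # ['<h1>hello world<h1>']
print(inner)  # ['hello world']
print(empty)  # []
```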

So what if we want to match a literal . character?

For example, suppose we want to match the a.qq.com inside 213d3421a.qq.com123123:

key = r"213d3421a.qq.com123123"
p1 = r"a\.qq\.com"
pattern1 = re.compile(p1)
print pattern1.findall(key)

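We use '\' to escape the . so that it stops meaning "any single character" and goes back to being a literal dot. The backslash goes directly in front of the character being escaped.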
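The difference is easiest to see on a string that only looks like the target. In the sketch below, the lookalike string is invented for illustration; re.escape is the standard-library helper that adds the backslashes for you.

```python
import re

key = r"213d3421a.qq.com123123"

# Escaped dots match only a literal "." character.
found = re.findall(r"a\.qq\.com", key)

# Hypothetical look-alike string with other characters where the dots would be.
lookalike = "aXqqYcom"
unescaped = re.findall(r"a.qq.com", lookalike)    # . matches X and Y
escaped = re.findall(r"a\.qq\.com", lookalike)    # literal dots required

# re.escape builds an escaped pattern from a plain string.
auto_pattern = re.escape("a.qq.com")
auto_found = re.findall(auto_pattern, key)

print(found)       # ['a.qq.com']
print(unescaped)   # ['aXqqYcom']
print(escaped)     # []
print(auto_found)  # ['a.qq.com']
```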

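My personal take: when learning regular expressions, it helps to split each one into two parts, the expression (what to match) and the modifier (how many times to match it). Seen this way, patterns become much easier to read.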

Take the example we saw earlier:

#coding:utf-8
import re

key = r"<html><body><h1>hello world<h1></body></html>"
p1 = r"<h1>.+<h1>"
pattern1 = re.compile(p1)
print pattern1.findall(key)

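Here . is the expression and + is the modifier. The . stands for any single character and + means "one or more times", so .+ means any character repeated one or more times.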
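One consequence worth knowing is that + is greedy: it takes as many characters as it can. The sketch below uses a made-up string with two headings to show this, and contrasts it with the non-greedy form +?, which the example above does not use.

```python
import re

# Hypothetical input with two headings on one line.
text = "<h1>first<h1> and <h1>second<h1>"

# Greedy: .+ runs to the last <h1> it can still match against.
greedy = re.findall(r"<h1>(.+)<h1>", text)

# Non-greedy: .+? stops at the nearest following <h1>.
lazy = re.findall(r"<h1>(.+?)<h1>", text)

print(greedy)  # ['first<h1> and <h1>second']
print(lazy)    # ['first', 'second']
```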

# coding: utf-8
import re

key = r"abb adb abbb a abcd"
p1 = r"ab*"
pattern1 = re.compile(p1)
print pattern1.findall(key)

Which strings does this code match?

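* means that the character before it occurs any number of times, including zero. So here the expression is the character before the *, namely b, and the modifier is the * itself.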

The matches are:

['abb', 'a', 'abbb', 'a', 'ab']
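To make the role of the modifier concrete, the sketch below keeps the expression (a followed by b) and the input the same and swaps only the modifier. The + and ? variants are added here for comparison and are not part of the original example.

```python
import re

key = r"abb adb abbb a abcd"

# * : b may appear zero or more times, so a bare "a" also matches.
star = re.findall(r"ab*", key)

# + : b must appear at least once, so the bare "a" matches are dropped.
plus = re.findall(r"ab+", key)

# ? : b may appear zero times or once, so at most one b is consumed.
optional = re.findall(r"ab?", key)

print(star)      # ['abb', 'a', 'abbb', 'a', 'ab']
print(plus)      # ['abb', 'abbb', 'ab']
print(optional)  # ['ab', 'a', 'ab', 'a', 'ab']
```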

posted @ 2017-07-04 17:51 nerdlerss
