Learning Progress 01 - 雨过山


Learning Python web scraping:

https://www.cnblogs.com/vvlj/p/9580423.html

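#Four steps

1. Inspect the source format of the content to crawl. The content can be URLs (links), text, images, or video.

2. Request the page source. This may require setting a proxy, rate limiting, or cookies.

3. Match. Use regular expressions to extract the content you want from the page source.

4. Save the data. This is ordinary file handling.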
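
The four steps above can be sketched end to end as follows. This is only a minimal sketch, not a definitive implementation: the function names (`fetch`, `parse`, `save`), the sample HTML, and the link pattern are all assumptions made up for illustration. The network request is defined but never called, so the example runs offline.

```python
import csv
import re
import urllib.request

# Step 1: after inspecting the page source, describe the target with a pattern.
# This sample markup is hypothetical, not copied from a real site.
SAMPLE_HTML = """
<ul>
  <li><a href="/book/1" title="Book One">Book One</a></li>
  <li><a href="/book/2" title="Book Two">Book Two</a></li>
</ul>
"""
LINK_PATTERN = re.compile(r'<a href="(.*?)" title="(.*?)">')


def fetch(url):
    """Step 2: request the page source (hypothetical helper, needs network)."""
    headers = {"User-Agent": "Mozilla/5.0"}
    req = urllib.request.Request(url=url, headers=headers)
    with urllib.request.urlopen(req) as response:
        return response.read().decode("utf-8")


def parse(html):
    """Step 3: extract (link, title) pairs with a regular expression."""
    return LINK_PATTERN.findall(html)


def save(rows, path):
    """Step 4: save the extracted rows to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["link", "title"])
        writer.writerows(rows)


rows = parse(SAMPLE_HTML)
print(rows)  # [('/book/1', 'Book One'), ('/book/2', 'Book Two')]
```

Calling `save(rows, "books.csv")` afterwards would write a header row followed by one line per extracted pair. For a real page, `parse(fetch(url))` would replace the sample string.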

#Two basic tools (libraries)

1.urllib

https://www.cnblogs.com/duxie/p/10023732.html

2.requests

https://www.cnblogs.com/duxie/p/10024919.html
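
The request settings mentioned in step 2 (proxy, rate limiting, cookies) might be configured with urllib roughly like this. It is a sketch under assumptions: the proxy address is a placeholder, and `fetch_all` and its `delay_seconds` parameter are hypothetical names, not part of urllib.

```python
import http.cookiejar
import time
import urllib.request

# Hypothetical proxy address; replace with a real one before sending requests.
PROXY = {"http": "http://127.0.0.1:8080"}

# A CookieJar stores cookies set by the server and sends them back later.
cookie_jar = http.cookiejar.CookieJar()

# build_opener chains handlers; the opener applies them to every request.
opener = urllib.request.build_opener(
    urllib.request.ProxyHandler(PROXY),
    urllib.request.HTTPCookieProcessor(cookie_jar),
)
opener.addheaders = [("User-Agent", "Mozilla/5.0")]


def fetch_all(urls, delay_seconds=1.0):
    """Fetch each URL in turn, sleeping between requests as a simple rate limit."""
    pages = []
    for url in urls:
        with opener.open(url) as response:
            pages.append(response.read())
        time.sleep(delay_seconds)
    return pages
```

A fixed sleep is the simplest form of rate limiting. It keeps the crawler from hitting the server in a tight loop, at the cost of a slower crawl.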

#A working example that scrapes Douban Books:

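import re
import urllib.request

url = "https://book.douban.com/"
# Send a browser-like User-Agent, since some sites reject the default urllib one.
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:23.0) Gecko/20100101 Firefox/23.0'}
req = urllib.request.Request(url=url, headers=headers)
# The context manager closes the connection once the page has been read.
with urllib.request.urlopen(req) as response:
    html = response.read().decode('utf-8')

# re.S lets "." match newlines, so one pattern can span a multi-line <li> block.
pattern = re.compile('<li.*?cover.*?title="(.*?)".*?author">(.*?)</div>.*?year">(.*?)</span>.*?</li>', re.S)
results = pattern.findall(html)
for name, author, date in results:
    # Strip all whitespace from the captured author and year fields.
    author = re.sub(r"\s", "", author)
    date = re.sub(r"\s", "", date)
    print("[Title]:", name, " [Author]:", author, " [Year]:", date)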
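
The pattern in the example is compiled with `re.S`. A small offline check shows why the flag matters when the fields sit on different lines. The HTML snippet here is a made-up stand-in, not the real Douban markup.

```python
import re

# Hypothetical snippet: the title and the year sit on different lines.
html = '<li title="Book One">\n<span class="year">2019</span>\n</li>'
regex = r'title="(.*?)".*?year">(.*?)</span>'

# Without re.S, "." stops at a newline, so the pattern cannot span lines.
without_flag = re.findall(regex, html)
print(without_flag)  # []

# With re.S, "." also matches newlines and the whole block is matched.
with_flag = re.findall(regex, html, re.S)
print(with_flag)  # [('Book One', '2019')]
```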

#Learning regular expressions:

https://www.cnblogs.com/duxie/p/10033388.html

https://www.cnblogs.com/duxie/p/10031230.html

https://www.cnblogs.com/duxie/p/10025581.html

#Common regular expression functions:

https://www.cnblogs.com/longwhite/p/10397763.html

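re.match() tries to match a pattern starting at the beginning of the source string. It is called as re.match(pattern, string, flags), where pattern is the regular expression, string is the source string, and flags is an optional argument for flags such as pattern modifiers (for example re.I or re.S).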
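
As a small illustration of the call format described above, the following sketch matches at the start of a string with the `re.I` flag and reads the captured groups. The sample text and pattern are made up for the example.

```python
import re

text = "Python crawler notes"

# re.I makes the match case-insensitive, so "python" matches "Python".
m = re.match(r"(python) (\w+)", text, re.I)

print(m.group(0))  # Python crawler
print(m.group(1))  # Python
print(m.group(2))  # crawler
print(m.span())    # (0, 14)
```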

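The biggest difference from re.match() is that re.search() scans the whole string and returns the first match it finds, whereas re.match() only matches at the start of the source string.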
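
The difference can be seen by running both functions on the same input. The sample string is made up for the example.

```python
import re

text = "learning python crawler"

# "python" is not at position 0, so re.match finds nothing.
at_start = re.match(r"python", text)
print(at_start)  # None

# re.search scans the whole string and finds the first occurrence.
anywhere = re.search(r"python", text)
print(anywhere.group(), anywhere.span())  # python (9, 15)
```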

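To get every match rather than just the first, use findall(). A common approach is to precompile the regular expression with re.compile() and then call the compiled pattern's findall() on the source string. Calling re.findall() directly also works.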
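
A short sketch of both ways of collecting all matches, using a made-up string of page names:

```python
import re

text = "page1.html page22.html page333.html"

# Precompile once, then reuse the compiled pattern.
pattern = re.compile(r"page(\d+)\.html")
numbers = pattern.findall(text)
print(numbers)  # ['1', '22', '333']

# The module-level function gives the same result without precompiling.
same = re.findall(r"page(\d+)\.html", text)
print(same == numbers)  # True

# By contrast, search() stops at the first match.
first = pattern.search(text).group(1)
print(first)  # 1
```

Precompiling is mainly a convenience when the same pattern is applied many times, for example once per downloaded page.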

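re.sub(pattern, repl, string, count) replaces matches of pattern in string. repl is the replacement string, and count is the maximum number of replacements (0, the default, replaces all matches).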
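
The effect of the count argument is easiest to see side by side. The sample strings are made up for the example.

```python
import re

text = "a 1 b 22 c 333"

# With the default count, every match is replaced.
all_replaced = re.sub(r"\d+", "#", text)
print(all_replaced)  # a # b # c #

# count=2 replaces only the first two matches.
first_two = re.sub(r"\d+", "#", text, count=2)
print(first_two)  # a # b # c 333

# Replacing with an empty string removes matches, e.g. stray whitespace.
cleaned = re.sub(r"\s", "", " 2019 - 5 \n")
print(cleaned)  # 2019-5
```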

posted on 2020-02-01 21:17 雨过山