Regular Expression Study Notes
Regular expressions are useful in web scraping and in many other applications.
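1. Common matching patterns
Pattern   Description
\w        Matches a letter, digit, or underscore
\W        Matches any character that is not a letter, digit, or underscore
\s        Matches any whitespace character; for ASCII text, equivalent to [ \t\n\r\f\v]
\S        Matches any non-whitespace character
\d        Matches any digit; for ASCII text, equivalent to [0-9]
\D        Matches any non-digit
\A        Matches the start of the string
\Z        Matches the end of the string. In Perl-style engines it also matches just before a final newline; in Python's re it matches only at the very end
\z        Matches only at the very end of the string (Perl-style engines; Python's re has traditionally used \Z for this)
\G        Matches the position where the previous match ended (Perl-style engines; not supported by Python's re)
\n        Matches a newline
\t        Matches a tab
^         Matches the start of the string (and the start of each line under re.MULTILINE)
$         Matches the end of the string or just before a newline at the end of the string (and the end of each line under re.MULTILINE)
.         Matches any character except a newline; with the re.DOTALL flag it matches any character, including a newline
[...]     Matches any one character from the set: [amk] matches 'a', 'm', or 'k'
[^...]    Matches any one character not in the set: [^abc] matches any character other than a, b, and c
*         Matches 0 or more repetitions of the preceding expression (greedy)
+         Matches 1 or more repetitions of the preceding expression (greedy)
?         Matches 0 or 1 occurrence of the preceding expression (greedy). Placed after another quantifier, as in *? or +?, it makes that quantifier non-greedy
{n}       Matches exactly n repetitions of the preceding expression
{n,m}     Matches n to m repetitions of the preceding expression (greedy); write it without a space inside the braces
a|b       Matches a or b
( )       Matches the expression inside the parentheses and captures it as a group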
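A few of the patterns in the table can be tried out on a short string. This is a minimal sketch; the sample text and variable names are made up for illustration.

```python
import re

# Sample text invented for this illustration.
text = "Order A-102 shipped on 2019-05-07 to bob_smith"

# \d{4} and \d{2} use exact repetition counts to pick out the date fields.
date = re.search(r"(\d{4})-(\d{2})-(\d{2})", text)

# \w+ covers letters, digits, and underscore, so "bob_smith" stays in one piece.
words = re.findall(r"\w+", text)

# [A-Z] is a character class; | chooses between two alternatives.
code = re.search(r"[A-Z]-\d+", text)
verb = re.search(r"shipped|returned", text)

print(date.groups())               # ('2019', '05', '07')
print(words[-1])                   # bob_smith
print(code.group(), verb.group())  # A-102 shipped
```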
2. re.match()
re.match() tries to match the pattern at the beginning of the string. If the pattern does not match there, it returns None.
    re.match(pattern, string, flags=0)
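Because re.match() returns None when the pattern does not match at position 0, it is usually worth checking the result before calling group(). A small sketch of that behavior, with variable names chosen only for illustration:

```python
import re

content = "hfhdh 8484 djfjdj dkfd 8596"

at_start = re.match(r"hfhdh", content)    # the pattern occurs at position 0
not_at_start = re.match(r"\d+", content)  # digits occur later, not at position 0

print(at_start is not None)  # True
print(not_at_start)          # None

# Calling not_at_start.group() would raise AttributeError, so test for None first.
if not_at_start is None:
    print("no match at the start of the string")
```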
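Basic matching
    import re

    content = "hfhdh 8484 djfjdj dkfd 8596"
    # Raw strings (r"...") keep backslashes such as \s and \d intact.
    result = re.match(r"^hfhdh\s\d{4}\s\w.*96$", content)
    print(result)          # <_sre.SRE_Match object; span=(0, 27), match='hfhdh 8484 djfjdj dkfd 8596'> (Python 3.7+ shows re.Match instead)
    print(result.group())  # the matched text: hfhdh 8484 djfjdj dkfd 8596
    print(result.span())   # the (start, end) indices of the match: (0, 27)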
Generic matching
Use .* to match any run of characters.
    import re

    content = "hfhdh 8484 djfjdj dkfd 8596"
    result = re.match(r"^hfhdh.*96$", content)
    print(result)
    print(result.group())
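Target matching
To extract the numbers from the string, wrap parts of the pattern in parentheses. Each pair of parentheses forms a group, and the text matched by each group can be retrieved separately.
    import re

    content = "hfhdh 8484 djfjdj dkfd 8596"
    result = re.match(r"^hfhdh\s(\d+).*\s(\d+)$", content)
    print(result)           # <_sre.SRE_Match object; span=(0, 27), match='hfhdh 8484 djfjdj dkfd 8596'>
    print(result.group())   # hfhdh 8484 djfjdj dkfd 8596
    print(result.group(1))  # 8484
    print(result.group(2))  # 8596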
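Groups can also be read all at once with groups(), or given names with the (?P<name>...) syntax. The sketch below reuses the same string; the group names first and last are assumptions picked for this example.

```python
import re

content = "hfhdh 8484 djfjdj dkfd 8596"

# (?P<name>...) gives a group a name in addition to its number.
result = re.match(r"^hfhdh\s(?P<first>\d+).*\s(?P<last>\d+)$", content)

all_groups = result.groups()   # every captured group as a tuple
first = result.group("first")  # look a group up by name
by_name = result.groupdict()   # mapping of group name to captured text

print(all_groups)  # ('8484', '8596')
print(first)       # 8484
print(by_name)     # {'first': '8484', 'last': '8596'}
```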
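Greedy matching
This is the same match as before, except that the \s after .* has been removed, and the result changes. Because .* is greedy, it consumes as many characters as it can and leaves only the single digit that (\d+) needs.
    import re

    content = "hfhdh 8484 djfjdj dkfd 8596"
    result = re.match(r"^hfhdh\s(\d+).*(\d+)$", content)
    print(result.group(2))  # 6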
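Non-greedy matching
To stop .* from consuming too much, append ? to it. In this position ? does not mean "0 or 1"; placed after a quantifier it makes that quantifier non-greedy, so .*? matches as few characters as possible.
    import re

    content = "hfhdh 8484 djfjdj dkfd 8596"
    result = re.match(r"^hfhdh\s(\d+).*?(\d+)$", content)
    print(result.group(2))  # 8596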
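The difference between greedy and non-greedy matching is easiest to see when a delimiter appears more than once. A minimal sketch on a made-up string:

```python
import re

# Sample text invented for this illustration.
text = "<b>one</b> and <b>two</b>"

# Greedy: .* runs to the last </b>, so the two elements merge into one match.
greedy = re.findall(r"<b>(.*)</b>", text)

# Non-greedy: .*? stops at the first </b> it can, giving one match per element.
lazy = re.findall(r"<b>(.*?)</b>", text)

print(greedy)  # ['one</b> and <b>two']
print(lazy)    # ['one', 'two']
```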
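Match flags
What if the string contains a line break? By default . does not match a newline, so pass the flag re.S (short for re.DOTALL) to let it do so.
    import re

    # The \n puts a real line break inside the string.
    content = "hfhdh 8484 djfjdj\n" \
              "dkfd 8596"
    result = re.match(r"^hfhdh\s(\d+).*?(\d+)$", content, re.S)
    print(result.group(2))  # 8596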
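To see what the flag changes, the same pattern can be run with and without re.S on a string that contains a newline. This sketch assumes the content has a literal \n in the middle:

```python
import re

# Assumed input: the same sample text with a newline in the middle.
content = "hfhdh 8484 djfjdj\ndkfd 8596"
pattern = r"^hfhdh\s(\d+).*?(\d+)$"

# Without re.S, "." cannot cross the newline, so the match fails.
without_flag = re.match(pattern, content)

# With re.S (re.DOTALL), "." also matches "\n" and the match succeeds.
with_flag = re.match(pattern, content, re.S)

print(without_flag)        # None
print(with_flag.group(2))  # 8596
```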
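Escaping
If the text to match contains characters that have a special meaning in regular expressions (such as $ and .), escape them with a backslash "\".
    import re

    content = "This book's price is $10.00"
    result = re.match(r"This book's price is \$10\.00", content, re.S)
    print(result)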
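When the literal text comes from a variable, writing the backslashes by hand is error-prone. The standard library's re.escape() adds them automatically. A short sketch, where the variable names are illustrative:

```python
import re

content = "This book's price is $10.00"
literal = "$10.00"

# re.escape() backslash-escapes the regex metacharacters in a literal string.
escaped = re.escape(literal)
result = re.search(escaped, content)

# Unescaped, "." matches any character, so the pattern is looser than intended.
loose = re.search(r"10.00", "10x00")

print(escaped)            # \$10\.00
print(result.group())     # $10.00
print(loose is not None)  # True
```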
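3. re.search()
re.search() scans the whole string and returns the first match; the match does not have to begin at the first character.
    import re

    content = "hfhdh 8484 djfjdj dkfd 8596"
    result = re.search(r".*?(\d+).*?(\d+)$", content)
    print(result)
    print(result.group(1))  # 8484
    print(result.group(2))  # 8596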
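re.search() reports only the first place where the pattern matches, and span() shows where that is. Adding ^ to the pattern ties the search to the start of the string again. A minimal sketch:

```python
import re

content = "hfhdh 8484 djfjdj dkfd 8596"

# search() finds the first run of digits, wherever it starts.
first_number = re.search(r"\d+", content)

# With ^ the pattern must match at the start of the string, so this fails.
anchored = re.search(r"^\d+", content)

print(first_number.group())  # 8484
print(first_number.span())   # (6, 10)
print(anchored)              # None
```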
4. re.findall()
The functions above return a single match. To get every match in the string, use findall(), which returns all results as a list.
    import re

    content = """<div class="main-nav">
    <a href=//new.tmall.com/?abbucket=&acm=lb-zebra-12803-227044.1003.8.390455&aldid=74460&spm=3.7396704.20000005.d1.zNATmK&abtest=&scm=1003.8.lb-zebra-12803-227044.ITEM_14392515722151_390455&pos=1>尚天猫</a>
    <a href=//miao.tmall.com/?abbucket=&acm=lb-zebra-12803-227044.1003.8.390455&aldid=74460&spm=3.7396704.20000005.d2.zNATmK&abtest=&scm=1003.8.lb-zebra-12803-227044.ITEM_14392511874662_390455&pos=2>喵鲜生</a>
    <a href=//vip.tmall.com/vip/index.htm?abbucket=&acm=lb-zebra-12803-227044.1003.8.390455&aldid=74460&spm=3.7396704.20000005.d3.zNATmK&abtest=&scm=1003.8.lb-zebra-12803-227044.ITEM_14392523417123_390455&pos=3>天猫会员</a>
    <a href=//3c.tmall.com/?abbucket=&acm=lb-zebra-12803-227044.1003.8.390455&aldid=74460&spm=3.7396704.20000005.d4.zNATmK&abtest=&scm=1003.8.lb-zebra-12803-227044.ITEM_14392519569634_390455&pos=4>电器城</a>
    <a href=//chaoshi.tmall.com/?abbucket=&acm=lb-zebra-12803-227044.1003.8.390455&aldid=74460&spm=3.7396704.20000005.d5.zNATmK&abtest=&scm=1003.8.lb-zebra-12803-227044.ITEM_14392500332195_390455&pos=5>天猫超市</a>
    <a href=//yao.tmall.com/?abbucket=&acm=lb-zebra-12803-227044.1003.8.390455&aldid=74460&spm=3.7396704.20000005.d6.zNATmK&abtest=&scm=1003.8.lb-zebra-12803-227044.ITEM_14392496484706_390455&pos=6>医药馆</a>
    <a href=//www.tmall.hk/?abbucket=&acm=lb-zebra-12803-227044.1003.8.390455&aldid=74460&spm=3.7396704.20000005.d8.zNATmK&abtest=&scm=1003.8.lb-zebra-12803-227044.ITEM_14392508027177_390455&pos=8>天猫国际</a>
    <a class="last" href=//car.tmall.com/?acm=lb-zebra-12803-227044.1003.8.390455&spm=3.7396704.20000007.23.zNATmK&uuid=75987&abtest=&scm=1003.8.lb-zebra-12803-227044.ITEM_14392504179688_390455&pos=1>天猫汽车</a>
    </div>"""
    # With one group, findall() returns a list of that group's text;
    # with several groups, it returns a list of tuples, one item per group.
    result = re.findall(r'<a.*?>(.*?)</a>', content, re.S)
    print(result)  # ['尚天猫', '喵鲜生', '天猫会员', '电器城', '天猫超市', '医药馆', '天猫国际', '天猫汽车']
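The shape of the list that findall() returns depends on how many groups the pattern has. The sketch below uses a short made-up HTML string to show the three cases, plus re.finditer() for when match positions are needed.

```python
import re

# Sample markup invented for this illustration.
html = '<a href="/a">First</a> <a href="/b">Second</a>'

# No groups: each item is the whole match.
whole = re.findall(r'<a.*?</a>', html)

# One group: each item is that group's text.
texts = re.findall(r'<a.*?>(.*?)</a>', html)

# Two groups: each item is a tuple with one entry per group.
pairs = re.findall(r'<a href="(.*?)">(.*?)</a>', html)

# finditer() yields match objects, which also carry positions.
starts = [m.start() for m in re.finditer(r'<a ', html)]

print(whole)   # ['<a href="/a">First</a>', '<a href="/b">Second</a>']
print(texts)   # ['First', 'Second']
print(pairs)   # [('/a', 'First'), ('/b', 'Second')]
print(starts)  # [0, 23]
```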
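5. re.sub()
re.sub() replaces every substring that matches the pattern and returns the resulting string.
    import re

    content = "hfhdh 8484 djfjdj dkfd 8596"
    result = re.sub(r"(\d+)", 'hellO', content)
    print(result)  # hfhdh hellO djfjdj dkfd hellO
To keep the matched text as part of the replacement, refer to the group with a backreference:
    import re

    content = "hfhdh 8484 djfjdj dkfd 8596"
    # Note the r prefix: \1 stands for the text captured by the first group.
    result = re.sub(r"(\d+)", r'\1hellO', content)
    print(result)  # hfhdh 8484hellO djfjdj dkfd 8596hellO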
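re.sub() also accepts a function as the replacement and a count argument that limits how many matches are replaced. A small sketch; the helper double is a hypothetical name used only here.

```python
import re

content = "hfhdh 8484 djfjdj dkfd 8596"

def double(match):
    """Hypothetical helper: return the matched number multiplied by two."""
    return str(int(match.group()) * 2)

# A callable replacement is called once per match with the match object.
doubled = re.sub(r"\d+", double, content)

# count=1 replaces only the first match.
first_only = re.sub(r"\d+", "N", content, count=1)

print(doubled)     # hfhdh 16968 djfjdj dkfd 17192
print(first_only)  # hfhdh N djfjdj dkfd 8596
```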
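6. re.compile()
re.compile() compiles a pattern string into a regular expression object so that the pattern can be reused.
    import re

    content = "hfhdh 8484 djfjdj dkfd 8596"
    pattern = re.compile(r'.*?(\d+).*?(\d+)')
    # pattern.match(content) is equivalent to re.match(pattern, content).
    print(pattern.match(content).group(1))  # 8484
    print(pattern.match(content).group(2))  # 8596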
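A compiled pattern object has the same methods as the module-level functions, so one object can serve several operations. A minimal sketch, with the name number chosen for illustration:

```python
import re

content = "hfhdh 8484 djfjdj dkfd 8596"

# Compile once, then reuse the object for different operations.
number = re.compile(r"\d+")

found = number.findall(content)         # every run of digits
first = number.search(content).group()  # the first run of digits
masked = number.sub("#", content)       # replace each run with "#"

# Pattern methods accept an optional start position; here the scan begins at index 10.
later = number.search(content, 10).group()

print(found)   # ['8484', '8596']
print(first)   # 8484
print(masked)  # hfhdh # djfjdj dkfd #
print(later)   # 8596
```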
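7. Hands-on practice
Scrape the url, img, and author of each book from Douban Books (豆瓣读书).
    import requests
    import re

    # The pattern depends on the page markup at the time of writing;
    # if the site changes its HTML, the pattern needs adjusting.
    content = requests.get('https://book.douban.com').text
    results = re.findall(
        r'<li.*?class="cover".*?a\shref="(.*?)"\stitle=".*?">.*?src="(.*?)"\sclass.*?class="author">(.*?)</div>.*?/li>',
        content, re.S)
    for result in results:
        url, img, author = result
        author = re.sub(r'\s', '', author)  # strip whitespace around the author name
        print(url, img, author)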
Output:
https://book.douban.com/subject/30353889/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32288925.jpg (挪)奥斯娜·塞厄斯塔
https://book.douban.com/subject/30431051/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32288967.jpg 【英】霍吉淑
https://book.douban.com/subject/33428941/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32311017.jpg [意]伊塔洛·斯韦沃
https://book.douban.com/subject/30319982/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s29986124.jpg [美]巴巴拉·塔奇曼
https://book.douban.com/subject/33414749/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32295228.jpg [日]贵志祐介
https://book.douban.com/subject/33396548/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32281256.jpg [法]勒•柯布西耶
https://book.douban.com/subject/30281429/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s30021281.jpg [荷]高罗佩
https://book.douban.com/subject/33379779/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32305312.jpg [法]弗雷德里克·皮耶鲁齐 / [法]马修·阿伦
https://book.douban.com/subject/33404843/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32289202.jpg 远子
https://book.douban.com/subject/30432494/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32296675.jpg [美]琼·狄迪恩
https://book.douban.com/subject/30466222/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32281237.jpg [匈]雅歌塔·克里斯多夫
https://book.douban.com/subject/33435992/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32313677.jpg 葛兆光
https://book.douban.com/subject/30396696/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32301364.jpg [美]奥森·斯科特·卡德
https://book.douban.com/subject/33420594/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32317746.jpg 冯唐
https://book.douban.com/subject/33400116/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32310181.jpg (英)阿加莎•克里斯蒂著
https://book.douban.com/subject/30362709/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32273120.jpg [美]海莲·汉芙
https://book.douban.com/subject/32567841/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s31459918.jpg 伊谢尔伦的风
https://book.douban.com/subject/33408138/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32302726.jpg [美]罗威廉(WilliamT.Rowe)
https://book.douban.com/subject/33423702/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32311913.jpg 夏清影
https://book.douban.com/subject/33440284/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32314738.jpg 菲奥娜·斯塔福德
https://book.douban.com/subject/30480992/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32278296.jpg [英]约翰·勒卡雷
https://book.douban.com/subject/33393524/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32302539.jpg 许知远
https://book.douban.com/subject/30466204/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32318137.jpg [英]戴维·洛奇
https://book.douban.com/subject/30473225/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32304992.jpg [日]池田龟鉴
https://book.douban.com/subject/30200837/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s30017700.jpg [日]青山七惠
https://book.douban.com/subject/30481930/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s30019054.jpg (英)吉姆·克里斯蒂安 / 于应机 / 李阳欢
https://book.douban.com/subject/33399902/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32284301.jpg 池莉
https://book.douban.com/subject/30443973/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32271911.jpg [葡]费尔南多·佩索阿
https://book.douban.com/subject/33370472/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32326689.jpg [法]罗曼·加里
https://book.douban.com/subject/30415984/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32285594.jpg [美]克里斯·克利尔菲尔德 / [美]安德拉什·蒂尔克斯
https://book.douban.com/subject/33423373/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32323469.jpg 周恺
https://book.douban.com/subject/33381271/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32277009.jpg [美]戴维•戴恩
https://book.douban.com/subject/30406506/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32295155.jpg 练明乔
https://book.douban.com/subject/30436197/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32278312.jpg [英]海伦•拉塞尔
https://book.douban.com/subject/30446953/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32284916.jpg [美]劳伦斯·布洛克
https://book.douban.com/subject/32492398/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32259438.jpg [法]阿尔贝·奥古斯特·拉西内著
https://book.douban.com/subject/30464096/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32322848.jpg [英]格雷厄姆·格林
https://book.douban.com/subject/30475747/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32294085.jpg [法]米歇尔·维诺克
https://book.douban.com/subject/33411336/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32293152.jpg 曾铮
https://book.douban.com/subject/33387411/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32304678.jpg [美]拉塞尔·柯克
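The scraping pattern can also be tried without a network request by running it on a small HTML fragment. The fragment below is hypothetical: it was written for this sketch so that it has the attributes the pattern looks for, and it is not copied from the real page. The names sample and records are likewise assumptions.

```python
import re

# Hypothetical fragment, invented for illustration only.
sample = """
<ul>
<li>
  <div class="cover">
    <a href="https://example.com/book/1" title="Book One">
      <img src="https://example.com/img/1.jpg" class="thumb">
    </a>
  </div>
  <div class="author">
    Alice
  </div>
</li>
<li>
  <div class="cover">
    <a href="https://example.com/book/2" title="Book Two">
      <img src="https://example.com/img/2.jpg" class="thumb">
    </a>
  </div>
  <div class="author">
    Bob
  </div>
</li>
</ul>
"""

# Same pattern as in the scraping code, compiled once with re.S.
pattern = re.compile(
    r'<li.*?class="cover".*?a\shref="(.*?)"\stitle=".*?">'
    r'.*?src="(.*?)"\sclass.*?class="author">(.*?)</div>.*?/li>',
    re.S,
)

records = []
for url, img, author in pattern.findall(sample):
    # Remove all whitespace, including the line breaks around the author name.
    records.append((url, img, re.sub(r"\s", "", author)))

for record in records:
    print(*record)
```

Note that re.sub(r"\s", "", author) removes every whitespace character, so an author name containing spaces would be joined together as well.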
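Author: iveBoy
Copyright of this article is shared by the author and 博客园 (cnblogs). Reposting is welcome, but unless the author agrees otherwise, a link to the original article must be given on the reposting page; otherwise the author reserves the right to pursue legal liability.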