
Regular Expression Study Notes

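Regular expressions are useful in web scraping and in many other applications. These notes cover the common patterns and the main functions of Python's re module.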

1. Common matching patterns

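Pattern       Description
\w            Matches a letter, digit, or underscore
\W            Matches any character that is not a letter, digit, or underscore
\s            Matches any whitespace character; equivalent to [ \t\n\r\f\v] for ASCII text
\S            Matches any non-whitespace character
\d            Matches any digit; equivalent to [0-9] for ASCII text
\D            Matches any non-digit character
\A            Matches only at the start of the string
\Z            Matches at the end of the string; in Python's re only at the very end, while Perl-style engines also accept the position just before a trailing newline
\z            Matches only at the very end of the string (Perl-style syntax; Python's re has traditionally provided \Z for this)
\G            Matches at the position where the previous match ended (Perl-style; not supported by Python's re)
\n            Matches a newline character
\t            Matches a tab character
^             Matches the start of the string (and the start of each line with re.MULTILINE)
$             Matches the end of the string or just before a trailing newline (and the end of each line with re.MULTILINE)
.             Matches any character except a newline; with the re.DOTALL flag it matches any character, including a newline
[...]         A set of characters listed individually: [amk] matches 'a', 'm', or 'k'
[^...]        Any character not in the set: [^abc] matches any character other than a, b, or c
*             Matches 0 or more repetitions of the preceding expression (greedy)
+             Matches 1 or more repetitions of the preceding expression (greedy)
?             Matches 0 or 1 repetition of the preceding expression (greedy); placed after another quantifier, as in *? or +?, it makes that quantifier non-greedy
{n}           Matches exactly n repetitions of the preceding expression
{n,m}         Matches n to m repetitions of the preceding expression (greedy); write it without a space after the comma
a|b           Matches a or b
( )           Matches the expression inside the parentheses and captures it as a group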
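
A few of the entries above can be checked directly in the interpreter. The snippet below is only a small sketch; the sample strings are made up for illustration and are not from the original notes.

```python
import re

# \d{4} requires exactly four digits; fullmatch() demands that the whole string match.
four_digits = re.fullmatch(r"\d{4}", "8484")    # a match object
five_digits = re.fullmatch(r"\d{4}", "84845")   # None

# [^abc] matches any single character other than a, b, or c.
not_abc = re.findall(r"[^abc]", "abcde")

# a|b picks one alternative; the parentheses also form a capturing group.
pets = re.findall(r"(cat|dog)s?", "cats and dog")

# {n,m} is greedy: the run "4444" yields "444", and the leftover "4" is too short.
runs = re.findall(r"\d{2,3}", "1 22 333 4444")

print(four_digits is not None, five_digits)  # True None
print(not_abc)                               # ['d', 'e']
print(pets)                                  # ['cat', 'dog']
print(runs)                                  # ['22', '333', '444']
```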

2. re.match()

re.match() tries to match the pattern starting at the first character of the string. If the pattern does not match there, it returns None.

re.match(pattern, string, flags=0)
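
As a rough illustration of the signature, the sketch below passes a flag through the flags parameter and guards against the None return value. The uppercase pattern is an assumption chosen only to make the flag visible.

```python
import re

content = "hfhdh 8484 djfjdj dkfd 8596"

# Without flags the match is case-sensitive, so an uppercase pattern fails.
strict = re.match(r"HFHDH", content)

# re.I (re.IGNORECASE) is passed through the flags parameter.
relaxed = re.match(r"HFHDH", content, flags=re.I)

print(strict)            # None
print(relaxed.group())   # hfhdh

# Check for None before calling .group(); None has no such method.
if strict is None:
    print("no match at the start of the string")
```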

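Basic matching

import re

content = "hfhdh 8484 djfjdj dkfd 8596"

# A raw string (r"...") passes backslashes such as \s and \d to the regex engine unchanged.
result = re.match(r"^hfhdh\s\d{4}\s\w.*96$", content)
print(result)          # <re.Match object; span=(0, 27), match='hfhdh 8484 djfjdj dkfd 8596'> on Python 3.7+
print(result.group())  # the matched text: hfhdh 8484 djfjdj dkfd 8596
print(result.span())   # the (start, end) indices of the match: (0, 27)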

Wildcard matching

Use .* to match any run of characters.

import re

content = "hfhdh 8484 djfjdj dkfd 8596"
result = re.match(r"^hfhdh.*96$", content)
print(result)
print(result.group())

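Targeted matching

To extract the numbers from the string, wrap the relevant parts of the pattern in parentheses. Each pair of parentheses forms a group, and the text matched by each group can be retrieved by its index.

import re

content = "hfhdh 8484 djfjdj dkfd 8596"
result = re.match(r"^hfhdh\s(\d+).*\s(\d+)$", content)
print(result)           # <re.Match object; span=(0, 27), match='hfhdh 8484 djfjdj dkfd 8596'> on Python 3.7+
print(result.group())   # hfhdh 8484 djfjdj dkfd 8596
print(result.group(1))  # 8484
print(result.group(2))  # 8596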
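
Groups can also be read all at once or given names. The following is a minimal sketch of the same pattern using named groups; the names first and last are hypothetical and chosen only for this example.

```python
import re

content = "hfhdh 8484 djfjdj dkfd 8596"

# (?P<name>...) is a capturing group that can also be looked up by name.
named = re.match(r"^hfhdh\s(?P<first>\d+).*\s(?P<last>\d+)$", content)

all_groups = named.groups()      # every group as a tuple
by_name = named.group("first")   # lookup by name
as_dict = named.groupdict()      # name -> matched text

print(all_groups)  # ('8484', '8596')
print(by_name)     # 8484
print(as_dict)     # {'first': '8484', 'last': '8596'}
```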

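Greedy matching

This is the same pattern as before, except that the \s after .* has been removed, and the result is different. .* is greedy: it consumes as many characters as it can and leaves only the single digit that (\d+) needs in order to match.

import re

content = "hfhdh 8484 djfjdj dkfd 8596"
result = re.match(r"^hfhdh\s(\d+).*(\d+)$", content)
print(result.group(2))  # 6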

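Non-greedy matching

To stop .* from consuming too much, append ? to the quantifier. .*? is non-greedy and matches as few characters as possible, so the digits are left for (\d+). (On its own, ? means "0 or 1 of the preceding expression"; after another quantifier it switches that quantifier to non-greedy mode.)

import re

content = "hfhdh 8484 djfjdj dkfd 8596"
result = re.match(r"^hfhdh\s(\d+).*?(\d+)$", content)
print(result.group(2))  # 8596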
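
The difference is easy to see with both quantifiers run on the same input. The markup string below is a made-up example, not taken from the notes; this is a sketch of the behavior rather than a recommendation for parsing HTML.

```python
import re

snippet = "<b>one</b><b>two</b>"

# Greedy: .* runs to the end, then backs off only as far as the last </b>.
greedy = re.findall(r"<b>(.*)</b>", snippet)

# Non-greedy: .*? stops at the first </b> it can reach.
lazy = re.findall(r"<b>(.*?)</b>", snippet)

print(greedy)  # ['one</b><b>two']
print(lazy)    # ['one', 'two']
```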

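Match flags

How should line breaks in the string be handled? By default . does not match a newline, so pass the re.S (re.DOTALL) flag.

import re

# "\n" puts a real line break in the string; adjacent string literals alone are simply concatenated.
content = "hfhdh 8484 djfjdj \n" \
          "dkfd 8596"
result = re.match(r"^hfhdh\s(\d+).*?(\d+)$", content, re.S)
print(result.group(2))  # 8596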
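
To see what the flag changes, the sketch below runs one pattern against a string containing a line break, first without re.S and then with it. The variable names are chosen for this example only.

```python
import re

text = "hfhdh 8484 djfjdj \ndkfd 8596"   # contains a real newline
pattern = r"^hfhdh\s(\d+).*?(\d+)$"

# Without re.S, "." cannot cross the newline, so the match fails.
without_flag = re.match(pattern, text)

# With re.S (alias of re.DOTALL), "." matches the newline as well.
with_flag = re.match(pattern, text, re.S)

print(without_flag)         # None
print(with_flag.group(2))   # 8596
```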

Escaping

If the text to match contains regex metacharacters such as $ or ., escape them with a backslash.

import re

content = "This book's price is $10.00"
result = re.match(r"This book's price is \$10\.00", content, re.S)
print(result)
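
When the literal text comes from a variable, the standard library's re.escape() can add the backslashes instead of writing them by hand. A minimal sketch, where the variable names are illustrative:

```python
import re

content = "This book's price is $10.00"
literal = "$10.00"

# Unescaped, "$" is an end-of-string anchor and "." is a wildcard, so this finds nothing.
unescaped = re.search(literal, content)

# re.escape() backslash-escapes the metacharacters in the literal.
escaped_pattern = re.escape(literal)
escaped = re.search(escaped_pattern, content)

print(unescaped)         # None
print(escaped_pattern)   # \$10\.00
print(escaped.group())   # $10.00
```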

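3. re.search()

re.search() scans the whole string and returns the first match it finds; the match does not have to start at the first character.

import re

content = "hfhdh 8484 djfjdj dkfd 8596"
result = re.search(r".*?(\d+).*?(\d+)$", content)
print(result)
print(result.group(1))  # 8484
print(result.group(2))  # 8596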
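
The contrast with re.match() shows up most clearly when the pattern can only match somewhere in the middle of the string. A short sketch using the same sample string:

```python
import re

content = "hfhdh 8484 djfjdj dkfd 8596"

# match() only tries index 0, where there is a letter, not a digit.
from_start = re.match(r"\d+", content)

# search() moves forward until the pattern fits.
anywhere = re.search(r"\d+", content)

print(from_start)         # None
print(anywhere.group())   # 8484
print(anywhere.span())    # (6, 10)
```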

4. re.findall()

The functions above return only one match. To collect every match in the string, use findall(), which returns all results as a list.

import re

content="""<div class="main-nav">
<a href=//new.tmall.com/?abbucket=&amp;acm=lb-zebra-12803-227044.1003.8.390455&amp;aldid=74460&amp;spm=3.7396704.20000005.d1.zNATmK&amp;abtest=&amp;scm=1003.8.lb-zebra-12803-227044.ITEM_14392515722151_390455&amp;pos=1>尚天猫</a>
<a href=//miao.tmall.com/?abbucket=&amp;acm=lb-zebra-12803-227044.1003.8.390455&amp;aldid=74460&amp;spm=3.7396704.20000005.d2.zNATmK&amp;abtest=&amp;scm=1003.8.lb-zebra-12803-227044.ITEM_14392511874662_390455&amp;pos=2>喵鲜生</a>
<a href=//vip.tmall.com/vip/index.htm?abbucket=&amp;acm=lb-zebra-12803-227044.1003.8.390455&amp;aldid=74460&amp;spm=3.7396704.20000005.d3.zNATmK&amp;abtest=&amp;scm=1003.8.lb-zebra-12803-227044.ITEM_14392523417123_390455&amp;pos=3>天猫会员</a>
<a href=//3c.tmall.com/?abbucket=&amp;acm=lb-zebra-12803-227044.1003.8.390455&amp;aldid=74460&amp;spm=3.7396704.20000005.d4.zNATmK&amp;abtest=&amp;scm=1003.8.lb-zebra-12803-227044.ITEM_14392519569634_390455&amp;pos=4>电器城</a>
<a href=//chaoshi.tmall.com/?abbucket=&amp;acm=lb-zebra-12803-227044.1003.8.390455&amp;aldid=74460&amp;spm=3.7396704.20000005.d5.zNATmK&amp;abtest=&amp;scm=1003.8.lb-zebra-12803-227044.ITEM_14392500332195_390455&amp;pos=5>天猫超市</a>
<a href=//yao.tmall.com/?abbucket=&amp;acm=lb-zebra-12803-227044.1003.8.390455&amp;aldid=74460&amp;spm=3.7396704.20000005.d6.zNATmK&amp;abtest=&amp;scm=1003.8.lb-zebra-12803-227044.ITEM_14392496484706_390455&amp;pos=6>医药馆</a>
<a href=//www.tmall.hk/?abbucket=&amp;acm=lb-zebra-12803-227044.1003.8.390455&amp;aldid=74460&amp;spm=3.7396704.20000005.d8.zNATmK&amp;abtest=&amp;scm=1003.8.lb-zebra-12803-227044.ITEM_14392508027177_390455&amp;pos=8>天猫国际</a>
<a class="last" href=//car.tmall.com/?acm=lb-zebra-12803-227044.1003.8.390455&amp;spm=3.7396704.20000007.23.zNATmK&amp;uuid=75987&amp;abtest=&amp;scm=1003.8.lb-zebra-12803-227044.ITEM_14392504179688_390455&amp;pos=1>天猫汽车</a>
</div>"""

# With one group, each list item is that group's text; with several groups, each item is a tuple of them.
result = re.findall('<a.*?>(.*?)</a>', content, re.S)
print(result)  # ['尚天猫', '喵鲜生', '天猫会员', '电器城', '天猫超市', '医药馆', '天猫国际', '天猫汽车']
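
The shape of the returned list depends on how many groups the pattern has. The sketch below uses a small made-up fragment (the links and labels are hypothetical) to show the three cases.

```python
import re

fragment = '<a href="/a">Alpha</a>\n<a href="/b">Beta</a>'

# No groups: each item is the whole match.
whole = re.findall(r'<a.*?</a>', fragment)

# One group: each item is the text of that group.
labels = re.findall(r'<a.*?>(.*?)</a>', fragment)

# Two groups: each item is a tuple with one entry per group.
pairs = re.findall(r'<a href="(.*?)">(.*?)</a>', fragment)

print(whole)   # ['<a href="/a">Alpha</a>', '<a href="/b">Beta</a>']
print(labels)  # ['Alpha', 'Beta']
print(pairs)   # [('/a', 'Alpha'), ('/b', 'Beta')]
```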

5. re.sub()

Replaces every matching substring and returns the resulting string.

import re

content = "hfhdh 8484 djfjdj dkfd 8596"
result = re.sub(r"(\d+)", 'hellO', content)
print(result)  # hfhdh hellO djfjdj dkfd hellO

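To keep the matched text in the result, refer to the group with a backreference:

import re

content = "hfhdh 8484 djfjdj dkfd 8596"
result = re.sub(r"(\d+)", r'\1hellO', content)  # note the r prefix; \1 is the text captured by the first group
print(result)  # hfhdh 8484hellO djfjdj dkfd 8596hellO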
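
re.sub() has a few more options that fit the same example: a count argument, the \g<1> form of a backreference, and a function as the replacement. The following is a sketch of those three; the replacement values are arbitrary.

```python
import re

content = "hfhdh 8484 djfjdj dkfd 8596"

# count=1 replaces only the first match.
first_only = re.sub(r"\d+", "N", content, count=1)

# \g<1> is the unambiguous form of \1; r"\10" would mean group 10.
appended = re.sub(r"(\d+)", r"\g<1>0", content)

# A function receives each match object and returns the replacement text.
doubled = re.sub(r"\d+", lambda m: str(int(m.group()) * 2), content)

print(first_only)  # hfhdh N djfjdj dkfd 8596
print(appended)    # hfhdh 84840 djfjdj dkfd 85960
print(doubled)     # hfhdh 16968 djfjdj dkfd 17192
```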

6. re.compile()

Compiles a regex string into a pattern object so that the same pattern can be reused.

import re

content = "hfhdh 8484 djfjdj dkfd 8596"
pattern = re.compile(r'.*?(\d+).*?(\d+)')
result = pattern.match(content)  # same result as re.match(pattern, content)
print(result.group(1))  # 8484
print(result.group(2))  # 8596
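
Reuse pays off when one pattern is applied to many strings. A minimal sketch, assuming a list of sample lines invented for this example:

```python
import re

# Compile once, then call the pattern object's own methods.
number_pair = re.compile(r"(\d+)\D+(\d+)")

lines = [
    "hfhdh 8484 djfjdj dkfd 8596",
    "abc 12 def 34",
    "no digits here",
]

# search() returns None for lines without two numbers; those are skipped.
pairs = [m.groups() for m in map(number_pair.search, lines) if m]

print(pairs)  # [('8484', '8596'), ('12', '34')]
```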

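7. Hands-on practice

Scrape the url, img, and author of each book from the Douban Books home page.

import requests
import re

content = requests.get('https://book.douban.com').text
# Each matching <li> yields a (url, img, author) tuple; re.S lets .*? span line breaks.
results = re.findall(r'<li.*?class="cover".*?a\shref="(.*?)"\stitle=".*?">.*?src="(.*?)"\sclass.*?class="author">(.*?)</div>.*?/li>', content, re.S)
for result in results:
    url, img, author = result
    author = re.sub(r'\s', '', author)  # remove all whitespace from the author text
    print(url, img, author)

Output at the time of writing: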
https://book.douban.com/subject/30353889/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32288925.jpg (挪)奥斯娜·塞厄斯塔
https://book.douban.com/subject/30431051/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32288967.jpg 【英】霍吉淑
https://book.douban.com/subject/33428941/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32311017.jpg [意]伊塔洛·斯韦沃
https://book.douban.com/subject/30319982/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s29986124.jpg [美]巴巴拉·塔奇曼
https://book.douban.com/subject/33414749/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32295228.jpg [日]贵志祐介
https://book.douban.com/subject/33396548/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32281256.jpg [法]勒•柯布西耶
https://book.douban.com/subject/30281429/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s30021281.jpg [荷]高罗佩
https://book.douban.com/subject/33379779/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32305312.jpg [法]弗雷德里克·皮耶鲁齐&nbsp;/&nbsp;[法]马修·阿伦
https://book.douban.com/subject/33404843/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32289202.jpg 远子
https://book.douban.com/subject/30432494/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32296675.jpg [美]琼·狄迪恩
https://book.douban.com/subject/30466222/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32281237.jpg [匈]雅歌塔·克里斯多夫
https://book.douban.com/subject/33435992/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32313677.jpg 葛兆光
https://book.douban.com/subject/30396696/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32301364.jpg [美]奥森·斯科特·卡德
https://book.douban.com/subject/33420594/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32317746.jpg 冯唐
https://book.douban.com/subject/33400116/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32310181.jpg (英)阿加莎•克里斯蒂著
https://book.douban.com/subject/30362709/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32273120.jpg [美]海莲·汉芙
https://book.douban.com/subject/32567841/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s31459918.jpg 伊谢尔伦的风
https://book.douban.com/subject/33408138/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32302726.jpg [美]罗威廉(WilliamT.Rowe)
https://book.douban.com/subject/33423702/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32311913.jpg 夏清影
https://book.douban.com/subject/33440284/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32314738.jpg 菲奥娜·斯塔福德
https://book.douban.com/subject/30480992/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32278296.jpg [英]约翰·勒卡雷
https://book.douban.com/subject/33393524/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32302539.jpg 许知远
https://book.douban.com/subject/30466204/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32318137.jpg [英]戴维·洛奇
https://book.douban.com/subject/30473225/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32304992.jpg [日]池田龟鉴
https://book.douban.com/subject/30200837/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s30017700.jpg [日]青山七惠
https://book.douban.com/subject/30481930/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s30019054.jpg (英)吉姆·克里斯蒂安&nbsp;/&nbsp;于应机&nbsp;/&nbsp;李阳欢
https://book.douban.com/subject/33399902/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32284301.jpg 池莉
https://book.douban.com/subject/30443973/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32271911.jpg [葡]费尔南多·佩索阿
https://book.douban.com/subject/33370472/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32326689.jpg [法]罗曼·加里
https://book.douban.com/subject/30415984/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32285594.jpg [美]克里斯·克利尔菲尔德&nbsp;/&nbsp;[美]安德拉什·蒂尔克斯
https://book.douban.com/subject/33423373/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32323469.jpg 周恺
https://book.douban.com/subject/33381271/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32277009.jpg [美]戴维•戴恩
https://book.douban.com/subject/30406506/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32295155.jpg 练明乔
https://book.douban.com/subject/30436197/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32278312.jpg [英]海伦•拉塞尔
https://book.douban.com/subject/30446953/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32284916.jpg [美]劳伦斯·布洛克
https://book.douban.com/subject/32492398/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32259438.jpg [法]阿尔贝·奥古斯特·拉西内著
https://book.douban.com/subject/30464096/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32322848.jpg [英]格雷厄姆·格林
https://book.douban.com/subject/30475747/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32294085.jpg [法]米歇尔·维诺克
https://book.douban.com/subject/33411336/?icn=index-latestbook-subject https://img3.doubanio.com/view/subject/m/public/s32293152.jpg 曾铮
https://book.douban.com/subject/33387411/?icn=index-latestbook-subject https://img1.doubanio.com/view/subject/m/public/s32304678.jpg [美]拉塞尔·柯克
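
The same pattern can be tried offline, without a network request. The sketch below runs it on a hypothetical fragment (the URLs, title, and author names are invented, and the real page markup may differ). It also shows one possible alternative for cleaning the author text: html.unescape() decodes entities such as &nbsp;, and split/join collapses whitespace instead of deleting it.

```python
import html
import re

# Hypothetical fragment shaped like one list item; not real Douban markup.
sample = """
<li class="">
  <div class="cover">
    <a href="https://example.com/book/1" title="Sample Book">
      <img src="https://example.com/cover1.jpg" class="" alt="Sample Book">
    </a>
  </div>
  <div class="info">
    <div class="author">
      Author One&nbsp;/&nbsp;Author Two
    </div>
  </div>
</li>
"""

item = re.compile(
    r'<li.*?class="cover".*?a\shref="(.*?)"\stitle=".*?">.*?src="(.*?)"'
    r'\sclass.*?class="author">(.*?)</div>.*?/li>',
    re.S,
)

books = []
for url, img, raw_author in item.findall(sample):
    # Decode &nbsp; and friends, then collapse every whitespace run to one space.
    author = " ".join(html.unescape(raw_author).split())
    books.append((url, img, author))

print(books)
```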

 

posted @ 2019-05-24 22:21  iveBoy