Python Web Scraping Notes (5): Regular Expressions
Online regex tester: http://tool.oschina.net/regex
Regex tutorial: https://c.runoob.com/front-end/854
1. What is a regular expression?
Common matching patterns
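Pattern | Description |
\w | Matches a letter, digit, or underscore |
\W | Matches any character that is not a letter, digit, or underscore |
\s | Matches any whitespace character; equivalent to [ \t\n\r\f\v] for ASCII text |
\S | Matches any non-whitespace character |
\d | Matches any digit; equivalent to [0-9] for ASCII text |
\D | Matches any non-digit |
\A | Matches the start of the string |
\Z | Matches the end of the string. In Perl-style engines it also matches just before a trailing newline; in Python's re it matches only at the very end |
\z | Matches the very end of the string (Perl-style syntax; in Python, \Z has this meaning) |
\G | Matches the position where the previous match ended (Perl-style syntax; not supported by Python's re) |
\n | Matches a newline |
\t | Matches a tab |
^ | Matches the start of the string |
$ | Matches the end of the string, or just before a trailing newline |
. | Matches any character except a newline; with the re.DOTALL (re.S) flag it matches any character, including a newline |
[...] | A set of characters: [amk] matches 'a', 'm', or 'k' |
[^...] | Any character not in the set: [^abc] matches any character other than a, b, and c |
* | Matches 0 or more repetitions of the preceding expression |
+ | Matches 1 or more repetitions of the preceding expression |
? | Matches 0 or 1 repetitions of the preceding expression; placed after another quantifier (*?, +?, ??) it makes that quantifier non-greedy |
{n} | Matches exactly n repetitions of the preceding expression |
{n,m} | Matches n to m repetitions of the preceding expression, greedily |
a|b | Matches a or b |
() | Groups the enclosed expression and captures what it matches |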
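To make a few rows of the table concrete, here is a small sketch. The sample string `text` is made up for illustration; only standard `re` calls are used.

```python
import re

text = "user_42 paid $5.00 on 2024-01-15"  # invented sample string

# \d{4}: exactly four digits in a row
years = re.findall(r'\d{4}', text)

# \w+: runs of letters, digits, and underscores
words = re.findall(r'\w+', text)

# [^\d\s]+: runs of characters that are neither digits nor whitespace
non_digits = re.findall(r'[^\d\s]+', 'a1 b2')

# paid|on: either alternative
alts = re.findall(r'paid|on', text)

print(years)       # ['2024']
print(words)       # ['user_42', 'paid', '5', '00', 'on', '2024', '01', '15']
print(non_digits)  # ['a', 'b']
print(alts)        # ['paid', 'on']
```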
re.match
re.match tries to match a pattern at the start of the string. If the pattern does not match at the start, match() returns None.
re.match(pattern, string, flags=0)
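A minimal sketch of the "start of the string" rule. The strings are made up; the point is only that the same word matches or fails depending on where it sits.

```python
import re

m1 = re.match(r'Hello', 'Hello World')  # pattern occurs at position 0
m2 = re.match(r'World', 'Hello World')  # pattern occurs, but not at the start

print(m1.group())  # Hello
print(m2)          # None

# Because the result may be None, check it before calling .group()
if m2 is None:
    print('no match at the start of the string')
```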
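Basic matching

```python
import re

content = 'Hello 123 4567 World_This is a Regex Demo'
print(len(content))
# Spell out every part of the string: whitespace, digit counts, word length
result = re.match(r'^Hello\s\d{3}\s\d{4}\s\w{10}.*Demo$', content)
print(result)
print(result.group())  # the full matched text
print(result.span())   # (start, end) indices of the match
```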
Generic matching

```python
import re

content = 'Hello 123 4567 World_This is a Regex Demo'
# .* stands in for everything between Hello and Demo
result = re.match(r'^Hello.*Demo$', content)
print(result)
print(result.group())
print(result.span())
```
Capturing a target

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'
# Parentheses capture the digits so they can be read with group(1)
result = re.match(r'^Hello\s(\d+)\sWorld.*Demo$', content)
print(result)
print(result.group(1))  # 1234567
print(result.span())
```
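When a pattern has several groups, each one is numbered by its opening parenthesis. The sketch below reuses the sample string from this section; the three-group pattern is an illustration, not part of the original example.

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'
m = re.match(r'^(\w+)\s(\d+)\s(\w+)', content)

print(m.group(0))  # whole match: Hello 1234567 World_This
print(m.group(1))  # Hello
print(m.group(2))  # 1234567
print(m.groups())  # ('Hello', '1234567', 'World_This')
print(m.span(2))   # (6, 13), the position of the second group
```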
Greedy matching

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'
# .* is greedy: it consumes as much as it can, leaving only one digit for (\d+)
result = re.match(r'^He.*(\d+).*Demo$', content)
print(result)
print(result.group(1))  # 7
print(result.span())
```
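Non-greedy matching

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'
# .*? is non-greedy: it consumes as little as it can, so (\d+) gets every digit
result = re.match(r'^He.*?(\d+).*Demo$', content)
print(result)
print(result.group(1))  # 1234567
print(result.span())
```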
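To see where the characters go in each mode, this sketch captures the `.*` part as well. The extra group is added here only for illustration.

```python
import re

content = 'Hello 1234567 World_This is a Regex Demo'

greedy = re.match(r'^He(.*)(\d+)', content)
lazy = re.match(r'^He(.*?)(\d+)', content)

# Greedy .* takes everything it can and backs off just enough
# to leave a single digit for \d+
print(greedy.groups())  # ('llo 123456', '7')

# Non-greedy .*? stops as soon as \d+ can start matching
print(lazy.groups())    # ('llo ', '1234567')
```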
Match modes (flags)

```python
import re

content = '''Hello 1234567 World_This
is a Regex Demo
'''
# By default . does not match a newline; re.S makes it match newlines too
result = re.match(r'^He.*?(\d+).*?Demo$', content, re.S)
print(result)
print(result.group(1))
```
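A short sketch of what the flag changes: the same pattern is tried with and without `re.S` on a string that contains a newline.

```python
import re

content = 'Hello 1234567 World_This\nis a Regex Demo'
pattern = r'^He.*?(\d+).*?Demo$'

without_flag = re.match(pattern, content)       # . cannot cross the newline
with_flag = re.match(pattern, content, re.S)    # . matches the newline too

print(without_flag)        # None
print(with_flag.group(1))  # 1234567
print(re.S == re.DOTALL)   # True: re.S is the short name for re.DOTALL
```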
Escaping

```python
import re

content = 'price is $5.00'
# $ and . are special characters, so escape them to match them literally
result = re.match(r'price is \$5\.00', content)
print(result)
```
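When the literal text comes from a variable, escaping by hand is error-prone. One option is the standard `re.escape` helper; the sketch below is an illustration, and the variable names are made up.

```python
import re

content = 'price is $5.00'
literal = '$5.00'  # text that should be matched character for character

# Unescaped, $ is an end-of-string anchor, so the pattern cannot match
unescaped = re.match('price is ' + literal, content)

# re.escape adds the backslashes that the special characters need
escaped = re.match('price is ' + re.escape(literal), content)

print(unescaped)        # None
print(escaped.group())  # price is $5.00
```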
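Summary: prefer generic matching, use parentheses to capture the target, prefer non-greedy mode, and pass re.S when the text contains newlines.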
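re.search
re.search scans the whole string and returns the first successful match.

With re.match the pattern must match at the start of the string, so this call returns None because the string begins with "Extra":

```python
import re

content = 'Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
result = re.match(r'Hello.*?(\d+).*?Demo', content)
print(result)  # None
```

With re.search the same pattern is found in the middle of the string:

```python
import re

content = 'Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
result = re.search(r'Hello.*?(\d+).*?Demo', content)
print(result)
```

Summary: for convenience, prefer search over match.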
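Since both functions return None on failure, a small wrapper keeps calling code tidy. `first_number` is a hypothetical helper written for this sketch, not part of the `re` module.

```python
import re

def first_number(text):
    """Return the first run of digits anywhere in text, or None if there is none."""
    m = re.search(r'\d+', text)
    return m.group() if m else None

content = 'Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'

print(first_number(content))       # 1234567
print(first_number('no digits'))   # None
print(re.match(r'\d+', content))   # None: the string does not start with a digit
```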
Matching practice

```python
import re

html = '''
<div id="songs-list">
    <h2 class="title">金典老歌</h2>
    <p class="introduction">金典老歌列表</p>
    <ul id="list" class="list-group">
        <li data-view="2">一路有你</li>
        <li data-view="7">
            <a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
        </li>
        <li data-view="4" class="active">
            <a href="/3.mp3" singer="齐秦">往事随风</a>
        </li>
        <li data-view="6"><a href="4.mp3" singer="beyond">光辉岁月</a></li>
        <li data-view="5"><a href="5.mp3" singer="陈慧琳">记事本</a></li>
        <li data-view="5">
            <a href="6.mp3" singer="邓丽君"><i class="fa fa-user"></i>但愿人长久</a>
        </li>
    </ul>
</div>
'''

# Require "active" so the match lands on the highlighted <li>
result = re.search(r'<li.*?active.*?singer="(.*?)">(.*?)</a>', html, re.S)
if result:
    print(result.group(1), result.group(2))  # 齐秦 往事随风
else:
    print('no match')
```

Using the same html string, dropping "active" from the pattern returns the first <li> that contains a singer attribute:

```python
result = re.search(r'<li.*?singer="(.*?)">(.*?)</a>', html, re.S)
if result:
    print(result.group(1), result.group(2))  # 任贤齐 沧海一声笑
else:
    print('no match')
```

Without re.S, . cannot cross a newline, so the match must fit on a single line. The first <li> whose <a> tag sits on the same line is the one that matches:

```python
result = re.search(r'<li.*?singer="(.*?)">(.*?)</a>', html)
if result:
    print(result.group(1), result.group(2))  # beyond 光辉岁月
else:
    print('no match')
```
re.findall
re.findall searches the string and returns every non-overlapping match as a list.

```python
import re

html = '''
<div id="songs-list">
    <h2 class="title">金典老歌</h2>
    <p class="introduction">金典老歌列表</p>
    <ul id="list" class="list-group">
        <li data-view="2">一路有你</li>
        <li data-view="7">
            <a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
        </li>
        <li data-view="4" class="active">
            <a href="/3.mp3" singer="齐秦">往事随风</a>
        </li>
        <li data-view="6"><a href="/4.mp3" singer="beyond">光辉岁月</a></li>
        <li data-view="5"><a href="/5.mp3" singer="陈慧琳">记事本</a></li>
        <li data-view="5">
            <a href="/6.mp3" singer="邓丽君">但愿人长久</a>
        </li>
    </ul>
</div>
'''

# Three groups, so each result is a (href, singer, title) tuple
results = re.findall(r'<li.*?href="(.*?)".*?singer="(.*?)">(.*?)</a>', html, re.S)
print(results)
print(type(results))
for result in results:
    print(result)
    print(result[0], result[1], result[2])
```

Using the same html string, the next pattern also picks up the first song, which has no <a> tag. The opening and closing <a> tags are wrapped in optional groups, and the title is the second group:

```python
results = re.findall(r'<li.*?>\s*?(<a.*?>)?(\w+)(</a>)?\s*?</li>', html, re.S)
print(results)
for result in results:
    print(result[1])
```
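The shape of the list that `re.findall` returns depends on how many groups the pattern has. A minimal sketch on a made-up string:

```python
import re

text = 'a=1, b=22, c=333'  # invented sample string

# No group: a list of the full matches
no_group = re.findall(r'\w=\d+', text)

# One group: a list of what that group captured
one_group = re.findall(r'\w=(\d+)', text)

# Several groups: a list of tuples, one item per group
two_groups = re.findall(r'(\w)=(\d+)', text)

print(no_group)    # ['a=1', 'b=22', 'c=333']
print(one_group)   # ['1', '22', '333']
print(two_groups)  # [('a', '1'), ('b', '22'), ('c', '333')]
```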
re.sub
re.sub replaces every match in the string and returns the resulting string.

```python
import re

content = 'Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
# Replace each run of digits with the empty string, i.e. delete it
content = re.sub(r'\d+', '', content)
print(content)
```

```python
import re

content = 'Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
# Replace each run of digits with the string Reldjaidja
content = re.sub(r'\d+', 'Reldjaidja', content)
print(content)
```

```python
import re

content = 'Extra strings Hello 1234567 World_This is a Regex Demo Extra strings'
# \1 refers to the text captured by the first group, so the digits are kept
# and " 8910" is appended after them
content = re.sub(r'(\d+)', r'\1 8910', content)
print(content)
```
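Two details of `re.sub` that are easy to trip over, sketched on made-up strings: a backreference followed directly by digits, and limiting the number of replacements with the `count` argument.

```python
import re

content = 'Hello 1234567 World'

# r'\18910' would not be read as "group 1 followed by 8910";
# \g<1> marks where the group number ends
appended = re.sub(r'(\d+)', r'\g<1>8910', content)

# count=1 replaces only the first match
first_only = re.sub(r'\d', '#', 'a1b2c3', count=1)

print(appended)    # Hello 12345678910 World
print(first_only)  # a#b2c3
```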
Removing the <a> tags first makes the later pattern much simpler:

```python
import re

html = '''
<div id="songs-list">
    <h2 class="title">金典老歌</h2>
    <p class="introduction">金典老歌列表</p>
    <ul id="list" class="list-group">
        <li data-view="2">一路有你</li>
        <li data-view="7">
            <a href="/2.mp3" singer="任贤齐">沧海一声笑</a>
        </li>
        <li data-view="4" class="active">
            <a href="/3.mp3" singer="齐秦">往事随风</a>
        </li>
        <li data-view="6"><a href="/4.mp3" singer="beyond">光辉岁月</a></li>
        <li data-view="5"><a href="/5.mp3" singer="陈慧琳">记事本</a></li>
        <li data-view="5">
            <a href="/6.mp3" singer="邓丽君">但愿人长久</a>
        </li>
    </ul>
</div>
'''

# Strip every opening and closing <a> tag
html = re.sub(r'<a.*?>|</a>', '', html)
print(html)
# What remains inside each <li> is the song title plus surrounding whitespace
results = re.findall(r'<li.*?>(.*?)</li>', html, re.S)
print(results)
for result in results:
    print(result.strip())
```
re.compile
re.compile turns a regular expression string into a pattern object so that the same pattern can be reused.

```python
import re

content = '''Hello 1234567 World_This
is a Regex Demo
'''
# Option 1: compile once, then reuse the pattern object
pattern = re.compile(r'Hello.*Demo', re.S)
result = re.match(pattern, content)
print(result)
# Option 2: pass the pattern string and flags on every call
result = re.match(r'Hello.*Demo', content, re.S)
print(result)
```
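A compiled pattern also exposes the matching functions as methods, so the pattern and its flags are written once. A minimal sketch; the `number` pattern and its inputs are made up for illustration.

```python
import re

content = 'Hello 1234567 World_This\nis a Regex Demo\n'

pattern = re.compile(r'Hello.*Demo', re.S)
result = pattern.match(content)   # same as re.match(pattern, content)
print(result.span())              # (0, 40)

number = re.compile(r'\d+')
found = number.findall('a1 b22')      # ['1', '22']
masked = number.sub('#', 'a1 b22')    # 'a# b#'
print(found, masked)
```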
Exercise: scraping book information from Douban Books

```python
import re

import requests

# The pattern mirrors the page markup at the time of writing. If the site's
# HTML has changed since, findall simply returns an empty list.
content = requests.get('https://book.douban.com').text
pattern = re.compile(
    r'<li.*?"cover".*?href="(.*?)".*?title="(.*?)".*?more-meta">'
    r'.*?"author">(.*?)</span>.*?"year">(.*?)</span>'
    r'.*?"publisher">(.*?)</span>.*?</li>',
    re.S,
)
print(pattern)
results = re.findall(pattern, content)
print(results)
for ret in results:
    url, title, author, date, publisher = ret
    # Remove all whitespace (newlines, indentation, spaces) from each field
    author = re.sub(r'\s', '', author)
    date = re.sub(r'\s', '', date)
    publisher = re.sub(r'\s', '', publisher)
    print(url, title, author, date, publisher)
```
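The pattern can be tried offline before hitting the network. The markup in `sample` below is invented for this sketch and only imitates the class names the pattern looks for; it is not Douban's actual HTML. The sketch also uses `strip()` instead of `re.sub(r'\s', '', ...)`, because the latter deletes spaces inside a field as well.

```python
import re

# Hypothetical markup, invented for this sketch; NOT Douban's real HTML
sample = '''
<li>
  <div class="cover">
    <a href="https://example.com/book/1" title="Sample Book"></a>
  </div>
  <div class="more-meta">
    <span class="author">
      Jane Doe
    </span>
    <span class="year">
      2020-1
    </span>
    <span class="publisher">
      Example Press
    </span>
  </div>
</li>
'''

pattern = re.compile(
    r'<li.*?"cover".*?href="(.*?)".*?title="(.*?)".*?more-meta">'
    r'.*?"author">(.*?)</span>.*?"year">(.*?)</span>'
    r'.*?"publisher">(.*?)</span>.*?</li>',
    re.S,
)

# strip() trims only the surrounding whitespace of each captured field
rows = [tuple(field.strip() for field in row) for row in pattern.findall(sample)]
print(rows)

# re.sub(r'\s', '', ...) removes inner spaces too
squeezed = re.sub(r'\s', '', ' Jane Doe ')
print(squeezed)  # JaneDoe
```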