Python逆向爬虫之正则表达式

字符串是我们在编程的时候很常用的一种数据类型，检查会在字符串里面查找一些内容，对于比较简单的查找，字符串里面就有一些内置的方法可以处理，对于比较复杂的字符串查找，或者是有一些内容经常变化的字符串里面查找，那么字符串内置的查找方法已经不好使了，满足不了我们的要求，这个时候就得用正则表达式了，正则表达式就是用来匹配一些比较复杂的字符串。

正则表达式在线练习工具：https://tool.oschina.net/regex

一、正则表达式

符号	解释	示例	说明
元字符
.	匹配任意字符	b.t	可以匹配bat / but / b#t / b1t等
\w	匹配字母/数字/下划线	b\wt	可以匹配bat / b1t / b_t等但不能匹配b#t
\s	匹配空白字符（包括\r、\n、\t等）	love\syou	可以匹配love you
\d	匹配数字	\d\d	可以匹配01 / 23 / 99等
\b	匹配单词的边界	\bThe\b
^	匹配字符串的开始	^The	可以匹配The开头的字符串
$	匹配字符串的结束	.exe$	可以匹配.exe结尾的字符串
\W	匹配非字母/数字/下划线	b\Wt	可以匹配b#t / b@t等但不能匹配but / b1t / b_t等
\S	匹配非空白字符	love\Syou	可以匹配love#you等但不能匹配love you
\D	匹配非数字	\d\D	可以匹配9a / 3# / 0F等
\B	匹配非单词边界	\Bio\B
[]	匹配来自字符集的任意单一字符	[aeiou]	可以匹配任一元音字母字符
[^]	匹配不在字符集中的任意单一字符	[^aeiou]	可以匹配任一非元音字母字符
*	匹配0次或多次	\w*
+	匹配1次或多次	\w+
?	匹配0次或1次	\w?
	匹配N次	\w
	匹配至少M次	\w
	匹配至少M次至多N次	\w
\|	分支	foo\|bar	可以匹配foo或者bar
(?#)	注释
(exp)	匹配exp并捕获到自动命名的组中
(?exp)	匹配exp并捕获到名为name的组中
(?:exp)	匹配exp但是不捕获匹配的文本
(?=exp)	匹配exp前面的位置	\b\w+(?=ing)	可以匹配I'm dancing中的danc
(?<=exp)	匹配exp后面的位置	(?<=\bdanc)\w+\b	可以匹配I love dancing and reading中的第一个ing
(?!exp)	匹配后面不是exp的位置
(?<!exp)	匹配前面不是exp的位置
*?	重复任意次，但尽可能少重复	a.b a.?b	将正则表达式应用于aabab，前者会匹配整个字符串aabab，后者会匹配aab和ab两个字符串
+?	重复1次或多次，但尽可能少重复
??	重复0次或1次，但尽可能少重复
{M,N}?	重复M到N次，但尽可能少重复
{M,}?	重复M次以上，但尽可能少重复

说明：如果需要匹配的字符是正则表达式中的特殊字符，那么可以使用\进行转义处理。

二、进行逐个详解

2.1 匹配多种可能

#'run' or 'ran'
res = re.search(r'r[au]n','dog runs to cat')
print(res)
>>> <re.Match object; span=(4, 7), match='run'> 

res = re.search(r'r[au]n','dog rans to cat')
print(res)
>>> <re.Match object; span=(4, 7), match='ran'>

#continue 匹配更多种可能
res = re.search(r'r[A-Z]n','dog rans to cat')
print(res)
>>> None

res = re.search(r'r[a-z]n','dog rans to cat')
print(res)
>>> <re.Match object; span=(4, 7), match='ran'>

res = re.search(r'r[0-9]n','dog rans to cat')
print(res)
>>>None

res = re.search(r'r[0-9a-z]n','dog rans to cat')
print(res)
>>> <re.Match object; span=(4, 7), match='ran'>

2.2 匹配数字 \d and \D

# \d : decimal digit 数字的
res = re.search(r'r\dn','run r9n')
print(res)
>>> <re.Match object; span=(4, 7), match='r9n'>

# \D : any non-decimal digit 任何不是数字的
res = re.search(r'r\Dn','run r9n')
print(res)
>>> <re.Match object; span=(0, 3), match='run'>

2.3 匹配空白 \s and \S

# \s : any white space [\t \n \r \f \v]
res = re.search(r'r\sn','r\nn r9n')
print(res)
>>> <re.Match object; span=(0, 3), match='r\nn'>

# \S : 和\s相反，any non-white space
res = re.search(r'r\Sn','r\nn r9n')
print(res)
>>> <re.Match object; span=(4, 7), match='r9n'>

2.4 匹配所有的字母和数字以及"_" \w and \W

# \w : [a-zA-Z0-9_]
res = re.search(r'r\wn','r\nn r9n')
print(res)
>>> <re.Match object; span=(4, 7), match='r9n'>

# \W : opposite to \w 即与\w相反
res = re.search(r'r\Wn','r\nn r9n')
print(res)
>>> <re.Match object; span=(0, 3), match='r\nn'>

2.5 匹配空白字符 \b and \B

# \b : (only at the start or end of the word)
res = re.search(r'\bruns\b','dog runs to cat')
print(res)
>>> <re.Match object; span=(4, 8), match='runs'>
res = re.search(r'\bruns\b','dogrunsto cat')
print(res)
>>> None

# \B : ( but not at the start or end of the word)
res = re.search(r'\Bruns\B','dog runs to cat')
print(res)
>>> None
res = re.search(r'\Bruns\B','dogrunsto cat')
print(res)
>>> <re.Match object; span=(5, 11), match=' runs '>

2.6 匹配特殊字符任意字符 \ and .

# \\ : 匹配 \
res = re.search(r'runs\\','dog runs\ to cat')
print(res)
>>> <re.Match object; span=(4, 9), match='runs\\'>

# . : 匹配 anything （except \n）
res = re.search(r'r.ns','dog r;ns to cat')
print(res)
>>> <re.Match object; span=(4, 8), match='r;ns'>
>res = re.search(r'r.ns','dog r\nns to cat')
print(res)
>>> None

2.7 匹配句尾句首 $ and ^

# ^ : 匹配line beginning
res = re.search(r'^runs','dog runs to cat')
print(res)
>>> None
res = re.search(r'^dog','dog runs to cat')
print(res)
>>> <re.Match object; span=(0, 3), match='dog'>

# $ : 匹配line ending
res = re.search(r'runs$','dog runs to cat')
print(res)
>>> None
res = re.search(r'cat$','dog runs to cat')
print(res)
>>> <re.Match object; span=(12, 15), match='cat'>

2.8 是否匹配？

# ？ ： may or may nt occur
res = re.search(r'r(u)?ns','dog runs to cat')
print(res)
>>> <re.Match object; span=(4, 8), match='runs'>

res = re.search(r'r(u)?ns','dog rns to cat')
print(res)
>>><re.Match object; span=(4, 7), match='rns'>

2.9 多行匹配 re.M

# 匹配代码后面加上re.M
string = """ 123.
dog runs to cat.
You run to dog.
"""
res = re.search(r'^You',string)
print(res)
>>> None

res = re.search(r'^You',string,re.M)
print(res)
>>> <re.Match object; span=(10, 13), match='run'>

2.10 匹配零次或多次 *

# * ： occur 0 or more times
res = re.search(r'ab*','a')
print(res)
>>> <re.Match object; span=(0, 1), match='a'>

res = re.search(r'ab*','abbbbbbbbbb')
print(res)
>>> <re.Match object; span=(0, 11), match='abbbbbbbbbb'>

2.11 匹配一次或多次 +

# + ：occur 1 or more times
res = re.search(r'ab+','a')
print(res)
>>> None

res = re.search(r'ab+','abbbbbbbbbb')
print(res)
>>> <re.Match object; span=(0, 11), match='abbbbbbbbbb'>

2.12 可选次数匹配

# {n, m} : occur n to m times
res = re.search(r'ab{1,10}','a')
print(res)
>>> None

res = re.search(r'ab{1,10}','abbbbbbbbbb')
print(res)
>>> <re.Match object; span=(0, 11), match='abbbbbbbbbb'>

2.13 寻找所有匹配 findall

# re.findall()
res = re.findall(r'r[ua]n','run ran ren')
print(res)
>>> ['run', 'ran']

# 另一种写法
res = re.findall(r'(run|ran)','run ran ren')
print(res)
>>> ['run', 'ran']

2.14 替换匹配内容 sub

# re.sub(   ,replace,   )
res = re.sub(r'runs','catches','dog runs to cat')
print(res)
>>> dog catches to cat

2.15 分裂内容 split

# re.sub(   ,replace,   )
res = re.sub(r'runs','catches','dog runs to cat')
print(res)
>>> dog catches to cat

2.16 包装正则表达式 compile

# re. compile()
compile_re = re.compile(r'r[ua]n')
res = compile_re.findall('run ran ren')
print(res)
>>> ['run', 'ran']


compile_re = re.compile(r'(?P<name>r[ua]n?)')
res = compile_re.finditer('run ran ren')
for i in res:
    print(i.groupdict())

2.17 re.S

注意：不使用re.S时，则只在每一行内进行匹配，如果存在一行没有，就换下一行重新开始，使用re.S 参数以后，正则表达式会将这个字符串看做整体，在整体中进行匹配，一般在爬虫项目中会经常用到。

import re
a = """sdfkhellolsdlfsdfiooefo:877898989

worldafdsf"""
b = re.findall('hello(.*?)world',a)
c = re.findall('hello(.*?)world',a,re.S)
print ('b is ' , b)
print ('c is ' , c)

三、Python 爬取豆瓣电影排行榜

import re
import requests

def main():

    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36"
    }

    baseurl = "https://movie.douban.com/top250?start="

    res = requests.get(url=baseurl, headers=head)

    connect = res.text

    obj = re.compile(
        r'<li>.*?<div class="item">.*?<span class="title">(?P<name>.*?)</span>.*?<p class="">.*?<br>(?P<year>.*?)&nbsp.*?<span class="rating_num" property="v:average">(?P<score>.*?)</span>.*?<span>(?P<num>.*?)人评价</span>', re.S)

    result = obj.finditer(connect)

    for it in result:
        print(it.groupdict())

if __name__ == '__main__':

    main()

posted @ 2022-08-17 11:33 Alvin, 阅读(159) 评论(0) 编辑收藏举报

刷新页面返回顶部

小阳爱技术

专注于Linux、Python、Golang以及web前端等语言。

Python逆向爬虫之正则表达式

Python逆向爬虫之正则表达式

一、正则表达式

二、进行逐个详解

2.1 匹配多种可能

2.2 匹配数字 \d and \D

2.3 匹配空白 \s and \S

2.4 匹配所有的字母和数字以及"_" \w and \W

2.5 匹配空白字符 \b and \B

2.6 匹配特殊字符任意字符 \ and .

2.7 匹配句尾句首 $ and ^

2.8 是否匹配？

2.9 多行匹配 re.M

2.10 匹配零次或多次 *

2.11 匹配一次或多次 +

2.12 可选次数匹配

2.13 寻找所有匹配 findall

2.14 替换匹配内容 sub

2.15 分裂内容 split

2.16 包装正则表达式 compile

2.17 re.S

三、Python 爬取豆瓣电影排行榜

公告

小阳爱技术

专注于Linux、Python、Golang以及web前端等语言。

Python逆向爬虫之正则表达式

Python逆向爬虫之正则表达式

一、正则表达式

二、进行逐个详解

2.1 匹配多种可能

2.2 匹配数字 \d and \D

2.3 匹配空白 \s and \S

2.4 匹配所有的字母和数字以及"_" \w and \W

2.5 匹配空白字符 \b and \B

2.6 匹配特殊字符 任意字符 \ and .

2.7 匹配句尾句首 $ and ^

2.8 是否匹配 ？

2.9 多行匹配 re.M

2.10 匹配零次或多次 *

2.11 匹配一次或多次 +

2.12 可选次数匹配

2.13 寻找所有匹配 findall

2.14 替换匹配内容 sub

2.15 分裂内容 split

2.16 包装正则表达式 compile

2.17 re.S

三、Python 爬取豆瓣电影排行榜

公告

2.6 匹配特殊字符任意字符 \ and .

2.8 是否匹配？