Python Web Scraping: Regular Expressions

Acknowledgement:

The following content comes from these bloggers: http://www.cnblogs.com/huxi/

http://blog.csdn.net/pleasecallmewhy

http://cuiqingcai.com/

I have organized it into my own notes as needed, for study purposes.

Regular Expression Basics

[Figure: pyre, a regular expression syntax chart]

The re module (Python supports regular expressions through the re module)

The main functions used:

# Returns a Pattern object
re.compile(pattern[, flags])
# The functions below perform matching
re.match(pattern, string[, flags])
re.search(pattern, string[, flags])
re.split(pattern, string[, maxsplit])
re.findall(pattern, string[, flags])
re.finditer(pattern, string[, flags])
re.sub(pattern, repl, string[, count])
re.subn(pattern, repl, string[, count])
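To make the list above concrete, here is a minimal sketch that calls each function once on the same string. The sample text and the variable names are made up for illustration; only the re functions themselves come from the list.

```python
import re

text = "one1two22three333"  # made-up sample string

# re.match succeeds only if the pattern matches at the start of the string.
at_start = re.match(r"[a-z]+", text)
not_at_start = re.match(r"\d+", text)   # None: the string starts with a letter

# re.search scans the whole string and returns the first match.
first_number = re.search(r"\d+", text)

# re.split cuts the string wherever the pattern matches.
words = re.split(r"\d+", text)

# re.findall returns every match as a list of strings.
numbers = re.findall(r"\d+", text)

# re.finditer yields one Match object per match.
spans = [m.span() for m in re.finditer(r"\d+", text)]

# re.sub replaces every match; re.subn also returns the replacement count.
replaced = re.sub(r"\d+", "-", text)
replaced_n = re.subn(r"\d+", "-", text)

print(at_start.group())      # one
print(not_at_start)          # None
print(first_number.group())  # 1
print(words)                 # ['one', 'two', 'three', '']
print(numbers)               # ['1', '22', '333']
print(spans)                 # [(3, 4), (7, 9), (14, 17)]
print(replaced)              # one-two-three-
print(replaced_n)            # ('one-two-three-', 3)
```

Note the trailing empty string from re.split: the text ends with a match, so there is an empty piece after it.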

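Steps for using re:

Step 1: Compile the regular expression, given as a string, into a Pattern instance.

Step 2: Use the Pattern instance to process the text and obtain a match result (a Match instance).

Step 3: Use the Match instance to extract information and carry out further operations.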

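import re  # import the module

# Compile the regular expression into a Pattern object. The r prefix marks a
# "raw string": backslashes are kept exactly as written, not treated as escapes.
pattern = re.compile(r'hello')

# Use the Pattern object to match text and obtain the match results.
match1 = pattern.match('hello world')
match2 = pattern.match('helloo world')
match3 = pattern.match('helllo world')

if match1:  # the match succeeded
    print(match1.group())  # use the Match object to get the matched text
else:
    print('not match1')
if match2:  # also succeeds: match() only requires a match at the start
    print(match2.group())
else:
    print('not match2')
if match3:  # fails: 'helllo' does not start with 'hello'
    print(match3.group())
else:
    print('not match3')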
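Step 3 mentions getting information from the Match instance. As a small sketch of what that can look like, the example below uses a pattern with two groups; the pattern and the input string are illustrative choices, not taken from the original notes.

```python
import re

# Two capturing groups separated by a single space (illustrative pattern).
pattern = re.compile(r"(\w+) (\w+)")
match = pattern.match("hello world!")

if match:
    print(match.group())               # whole match: hello world
    print(match.group(1))              # first group: hello
    print(match.group(2))              # second group: world
    print(match.groups())              # all groups: ('hello', 'world')
    print(match.start(), match.end())  # position of the whole match: 0 11
    print(match.span(2))               # position of group 2: (6, 11)
```

The trailing "!" is not part of the match because \w does not match punctuation.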

Now let's look in detail at the key methods used in the code.

★ re.compile(strPattern[, flag]):

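This is the factory method of the Pattern class. It compiles a regular expression given as a string into a Pattern object.

The second parameter, flag, is the matching mode. Values can be combined with the bitwise OR operator '|' so that several modes take effect at once, for example re.I | re.M.

You can also specify the mode inside the regex string itself.

For example, re.compile('pattern', re.I | re.M) is equivalent to re.compile('(?im)pattern').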
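The equivalence can be checked with a short sketch. The sample text is made up, and a '^' anchor is added to the pattern so that both flags actually matter.

```python
import re

text = "first line\nPATTERN on the second line"  # made-up sample text

# Flags passed as the second argument, combined with '|'.
with_flags = re.compile(r"^pattern", re.I | re.M)
# The same flags written inline at the start of the regex string.
inline = re.compile(r"(?im)^pattern")
# No flags at all, for comparison.
plain = re.compile(r"^pattern")

m_flags = with_flags.search(text)
m_inline = inline.search(text)
m_plain = plain.search(text)

print(m_flags.group())   # PATTERN
print(m_inline.group())  # PATTERN
print(m_plain)           # None: wrong case, and not at the start of the string
```

Inline flags such as (?im) should be written at the very start of the expression.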

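Possible values:

re.I (full name: IGNORECASE): ignore case (the full name is given in parentheses, likewise below)
re.M (full name: MULTILINE): multi-line mode; changes the behavior of '^' and '$' so that they also match at the start and end of each line (see the figure above)
re.S (full name: DOTALL): dot-matches-all mode; changes the behavior of '.' so that it also matches a newline
re.L (full name: LOCALE): makes the predefined character classes \w \W \b \B \s \S depend on the current locale (in Python 3 this flag is only allowed with bytes patterns)
re.U (full name: UNICODE): makes the predefined character classes \w \W \b \B \s \S \d \D depend on the character properties defined by Unicode (already the default for str patterns in Python 3)
re.X (full name: VERBOSE): verbose mode. In this mode the regular expression can span multiple lines, whitespace is ignored, and comments can be added.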
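A quick sketch of how re.I, re.M and re.S change the result of the same search. The two-line sample string is made up for illustration.

```python
import re

text = "Hello\nworld"  # made-up two-line sample

# re.I: case is ignored.
no_i = re.findall(r"hello", text)
with_i = re.findall(r"hello", text, re.I)

# re.M: '^' and '$' also match at the start and end of every line.
no_m = re.findall(r"^\w+$", text)
with_m = re.findall(r"^\w+$", text, re.M)

# re.S: '.' also matches the newline character.
no_s = re.findall(r"Hello.world", text)
with_s = re.findall(r"Hello.world", text, re.S)

print(no_i, with_i)  # [] ['Hello']
print(no_m, with_m)  # [] ['Hello', 'world']
print(no_s, with_s)  # [] ['Hello\nworld']
```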

import re

# The same pattern written twice: a in verbose mode (re.X), where the
# whitespace and line breaks inside the pattern are ignored, and b on one line.
a = re.compile(r"""\d+
                   \.
                   \d*""", re.X)
b = re.compile(r'\d+\.\d*')

match1 = a.match('3.1415')
match2 = a.match('33')
match3 = b.match('3.1415')
match4 = b.match('33')

if match1:
    print(match1.group())
else:
    print('match1 is not a decimal number')
if match2:  # '33' has no decimal point, so this does not match
    print(match2.group())
else:
    print('match2 is not a decimal number')
if match3:
    print(match3.group())
else:
    print('match3 is not a decimal number')
if match4:
    print(match4.group())
else:
    print('match4 is not a decimal number')
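The repeated if/else blocks can also be written as a loop. The sketch below adds comments inside the verbose pattern, which re.X permits, and checks that the verbose and the one-line pattern agree on a few inputs. The extra samples "7." and "abc" and the names verbose, compact and results are illustrative additions.

```python
import re

# Verbose form: whitespace is ignored and '#' starts a comment.
verbose = re.compile(r"""
    \d+   # integer part
    \.    # decimal point
    \d*   # optional fractional digits
""", re.X)
# Compact form of the same pattern.
compact = re.compile(r"\d+\.\d*")

samples = ["3.1415", "33", "7.", "abc"]
results = {}
for s in samples:
    v = verbose.match(s)
    c = compact.match(s)
    # Store the matched text, or None when there is no match.
    results[s] = (v.group() if v else None, c.group() if c else None)
    print(s, results[s])
```

Both patterns give the same answer for every sample: "3.1415" and "7." match, while "33" and "abc" do not.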

posted @ 2015-11-01 21:48 邬家栋
