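Python Regular Expressions and JSON: Matching Digits, Non-Digits, Word Characters, Non-Word Characters, Greedy and Non-Greedy Modes, and Specifying Repetition Counts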
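1. Regular expressions: covered here mainly for web scraping, where they are an essential tool.
Regular expressions are used for string matching, for example checking whether a string is a phone number, an email address, or an IP address.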
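As a rough sketch of the "is this an IP address" check mentioned above: the helper name `looks_like_ipv4` and the pattern are illustrative assumptions, not something from the original notes. The regex only checks the shape (four dot-separated groups of one to three digits), and a plain integer comparison then checks the 0-255 range.

```python
import re

# Hypothetical helper for illustration only.
# Shape check: four groups of 1-3 ASCII digits separated by dots.
IPV4_SHAPE = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}", re.ASCII)


def looks_like_ipv4(text):
    """Return True if text is four dot-separated numbers, each 0-255."""
    if IPV4_SHAPE.fullmatch(text) is None:
        return False
    return all(int(part) <= 255 for part in text.split("."))


print(looks_like_ipv4("192.168.0.1"))  # True
print(looks_like_ipv4("256.1.1.1"))    # False
print(looks_like_ipv4("1.2.3"))        # False
```

Splitting the work this way keeps the pattern short. Encoding the 0-255 range inside the regex itself is possible but much harder to read.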
2. JSON: the mainstream format for exchanging data with external systems.
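3. Using regular expressions
re is Python's built-in module for regular expression matching.
re.findall(pattern, source)
pattern: the matching rule, also called the regular expression
source: the string to search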
    import re

    a = "C0C++7Java8C#Python6JavaScript"
    res = re.findall("Java", a)
    print(res)
    # Output: ['Java', 'Java']
4. Applying regular expressions
- Finding digits
Using a predefined character class: \d
    import re

    a = "C0C++7Java8C#Python6JavaScript"
    res = re.findall(r"\d", a)
    print(res)
    # Output: ['0', '7', '8', '6']

Using the other approach, a character set: [0-9]

    import re

    a = "C0C++7Java8C#Python6JavaScript"
    res = re.findall("[0-9]", a)
    print(res)
    # Output: ['0', '7', '8', '6']
Here "Java" is made up of ordinary (literal) characters, while "\d" is a metacharacter.
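Two practical details about \d, shown as a small sketch with a made-up sample string. First, writing the pattern as a raw string (r"\d") stops Python from treating the backslash as a string escape; newer Python versions warn about unrecognized escapes such as "\d" in ordinary strings. Second, in Python 3 a str pattern's \d matches any Unicode decimal digit unless re.ASCII is passed, so it is slightly broader than [0-9].

```python
import re

# Made-up sample: an ASCII digit plus ARABIC-INDIC DIGIT THREE (U+0663).
mixed = "a1b\u0663c"

unicode_digits = re.findall(r"\d", mixed)              # Unicode-aware by default
ascii_set = re.findall(r"[0-9]", mixed)                # ASCII digits only
ascii_flag = re.findall(r"\d", mixed, flags=re.ASCII)  # \d restricted to [0-9]

print(len(unicode_digits))  # 2
print(ascii_set)            # ['1']
print(ascii_flag)           # ['1']
```

For input that is plain ASCII, as in the examples in these notes, the two forms give the same result.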
- Finding non-digits
Using a predefined character class: \D
    import re

    a = "C0C++7Java8C#Python6JavaScript"
    res = re.findall(r"\D", a)
    print(res)
    # Output: ['C', 'C', '+', '+', 'J', 'a', 'v', 'a', 'C', '#', 'P', 'y', 't', 'h', 'o', 'n', 'J', 'a', 'v', 'a', 'S', 'c', 'r', 'i', 'p', 't']

Using the other approach, a negated character set: [^0-9]

    import re

    a = "C0C++7Java8C#Python6JavaScript"
    res = re.findall("[^0-9]", a)
    print(res)
    # Output: ['C', 'C', '+', '+', 'J', 'a', 'v', 'a', 'C', '#', 'P', 'y', 't', 'h', 'o', 'n', 'J', 'a', 'v', 'a', 'S', 'c', 'r', 'i', 'p', 't']
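- A reference list of regular expression syntax: https://baike.baidu.com/item/正则表达式/1700215?fr=aladdin (there is no need to practice every entry one by one; look things up when you need them)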
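5. Matching patterns
- Mixing metacharacters with ordinary characters
OR inside []: the set matches any one of the listed characters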
    import re

    a = "abc,acc,adc,aec,afc,ahc"
    # Match acc and afc
    res = re.findall("a[cf]c", a)
    print(res)
    # Output: ['acc', 'afc']
Negation: ^

    import re

    a = "abc,acc,adc,aec,afc,ahc"
    # Match everything except acc and afc
    res = re.findall("a[^cf]c", a)
    print(res)
    # Output: ['abc', 'adc', 'aec', 'ahc']
Range: -

    import re

    a = "abc,acc,adc,aec,afc,ahc"
    # Match acc, adc, aec, afc (middle character in the range c to f)
    res = re.findall("a[c-f]c", a)
    print(res)
    # Output: ['acc', 'adc', 'aec', 'afc']
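The three set operations above (listing, negation, range) can also be combined inside a single pair of brackets. The two patterns below are extra illustrations on the same string, not examples from the original notes:

```python
import re

a = "abc,acc,adc,aec,afc,ahc"

# A range and a single character in one set: b through d, or h.
range_plus_char = re.findall(r"a[b-dh]c", a)

# A negated range: the middle character is anything except c through f.
outside_range = re.findall(r"a[^c-f]c", a)

print(range_plus_char)  # ['abc', 'acc', 'adc', 'ahc']
print(outside_range)    # ['abc', 'ahc']
```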
- Matching digits and letters:
Using a predefined character class: \w
    import re

    a = "abc&cba"
    res = re.findall(r"\w", a)
    print(res)
    # Output: ['a', 'b', 'c', 'c', 'b', 'a']

Using a character set: [A-Za-z0-9]
    import re

    a = "abc123&cba321"
    res = re.findall("[A-Za-z0-9]", a)
    print(res)
    # Output: ['a', 'b', 'c', '1', '2', '3', 'c', 'b', 'a', '3', '2', '1']

Clearly, \w does not match symbols such as "&". Note that \w also matches the underscore, so it is slightly broader than [A-Za-z0-9].
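To see where \w and [A-Za-z0-9] differ, here is a small sketch on a made-up sample string. In Python 3, \w on a str pattern also accepts the underscore and non-ASCII letters, and the re.ASCII flag narrows it to [a-zA-Z0-9_].

```python
import re

sample = "snake_case&caf\u00e9 42"  # made-up sample text containing an accented e

word_chars = re.findall(r"\w", sample)
ascii_word_chars = re.findall(r"\w", sample, flags=re.ASCII)
letters_digits = re.findall(r"[A-Za-z0-9]", sample)

print("_" in word_chars)             # True: underscore is a word character
print("\u00e9" in word_chars)        # True: so is an accented letter
print("\u00e9" in ascii_word_chars)  # False: re.ASCII excludes it
print("_" in letters_digits)         # False: the explicit set has no underscore
```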
- Matching non-word characters (neither letters nor digits)
Using a predefined character class: \W
    import re

    a = "abc123&cba321"
    res = re.findall(r"\W", a)
    print(res)
    # Output: ['&']

Using a negated character set: [^A-Za-z0-9]

    import re

    a = "abc123&cba321"
    res = re.findall("[^A-Za-z0-9]", a)
    print(res)
    # Output: ['&']
- Matching whitespace such as spaces, tabs, and newlines: \s
    import re

    a = "python 111\tjava&67p\nh\rp"
    res = re.findall(r"\s", a)
    print(res)
    # Output: [' ', '\t', '\n', '\r']
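One common use of \s, sketched here as an addition to the notes, is splitting text on any run of whitespace with re.split. The string is the same one used above.

```python
import re

a = "python 111\tjava&67p\nh\rp"

# \s+ treats one or more consecutive whitespace characters as a single separator.
pieces = re.split(r"\s+", a)
print(pieces)  # ['python', '111', 'java&67p', 'h', 'p']
```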
- Quantifiers: extracting python, java, and php
Exactly three characters per match:
[a-z]{3}
    import re

    a = "python 1111java678php"
    res = re.findall("[a-z]{3}", a)
    print(res)
    # Output: ['pyt', 'hon', 'jav', 'php']
Three to six characters per match, since the longest word (python) has 6 letters and the shortest (php) has 3: [a-z]{3,6}

    import re

    a = "python 1111java678php"
    res = re.findall("[a-z]{3,6}", a)
    print(res)
    # Output: ['python', 'java', 'php']
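Question: three characters already satisfy the pattern, so why does matching not stop at pyt?
Because quantifiers can be greedy or non-greedy, and Python's re module is greedy by default: it takes as many characters as the quantifier allows.
To switch to non-greedy mode, append a question mark: [a-z]{3,6}?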
    import re

    a = "python 1111java678php"
    res = re.findall("[a-z]{3,6}?", a)
    print(res)
    # Output: ['pyt', 'hon', 'jav', 'php']
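The greedy versus non-greedy difference is often easier to see with delimiters. The tag-like string below is a made-up sample, and the last pattern is an alternative way to pull out the three words without guessing their lengths. Treat it as a sketch, not as the approach the notes prescribe.

```python
import re

html = "<b>bold</b>"  # made-up sample

greedy = re.findall(r"<.*>", html)  # .* runs as far as possible, to the last '>'
lazy = re.findall(r"<.*?>", html)   # .*? stops at the first '>' it can
print(greedy)  # ['<b>bold</b>']
print(lazy)    # ['<b>', '</b>']

# Same word-extraction task as above, using an open-ended greedy quantifier.
a = "python 1111java678php"
words = re.findall(r"[a-z]+", a)
print(words)  # ['python', 'java', 'php']
```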
- *: matches the character before it ('n' here) zero or more times
    import re

    a = "pytho0python1pythonn2"
    res = re.findall("python*", a)
    print(res)
    # Output: ['pytho', 'python', 'pythonn']
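For example, pytho has no n, so n is matched 0 times and the result is pytho; python has one n, so n is matched once and the result is python; pythonn has two n's, so n is matched twice and the result is pythonn.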
- +: matches the character before it ('n' here) one or more times
    import re

    a = "pytho0python1pythonn2"
    res = re.findall("python+", a)
    print(res)
    # Output: ['python', 'pythonn']
- ?: matches the character before it ('n' here) zero times or once
    import re

    a = "pytho0python1pythonn2"
    res = re.findall("python?", a)
    print(res)
    # Output: ['pytho', 'python', 'python']
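For example, pytho has no n, so n is matched 0 times and the result is pytho; python has one n, so n is matched once and the result is python; pythonn has two n's, but n can be matched at most once, so the extra n is cut off and the result is python, not pythonn. Note that this is not non-greedy mode: here ? is itself a quantifier (zero or one) and it is greedy. A ? only switches on non-greedy matching when it follows another quantifier, as in [a-z]{3,6}?.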
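The two roles of the question mark can be checked side by side on the same string. In this sketch, `python?` uses ? as a zero-or-one quantifier, while `python??` adds a second ? that makes that quantifier non-greedy:

```python
import re

a = "pytho0python1pythonn2"

optional_greedy = re.findall(r"python?", a)  # takes the 'n' whenever one is present
optional_lazy = re.findall(r"python??", a)   # leaves the 'n' out whenever it can

print(optional_greedy)  # ['pytho', 'python', 'python']
print(optional_lazy)    # ['pytho', 'pytho', 'pytho']
```

The first result keeps one n where available, which shows the single ? behaving greedily within its zero-or-one limit.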
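- If you do not want * or + to match an unlimited number of times and would rather bound the number of repetitions, specify a range:
python{1,2}
This means n is matched at least once and at most twice. The match is still greedy within that range.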
    import re

    a = "pytho0python1pythonn2"
    res = re.findall("python{1,2}", a)
    print(res)
    # Output: ['python', 'pythonn']
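A few more variations of the {m,n} form on the same string, added here as a sketch: an exact count, an open upper bound, and a bounded range made non-greedy by a trailing ?.

```python
import re

a = "pytho0python1pythonn2"

exactly_two = re.findall(r"python{2}", a)       # 'n' exactly twice
at_least_one = re.findall(r"python{1,}", a)     # 'n' one or more times, like +
lazy_bounded = re.findall(r"python{1,2}?", a)   # 1 to 2 times, as few as possible

print(exactly_two)   # ['pythonn']
print(at_least_one)  # ['python', 'pythonn']
print(lazy_bounded)  # ['python', 'python']
```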