Regular Expression Practice

Extracting Text from a Web Page

This experiment uses content from www.17k.com and draws on the blog post at http://www.cnblogs.com/huxi/archive/2010/07/04/1771073.html.

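# Python 2 (uses urllib.urlopen and the print statement)
from urllib import urlopen
import re

# Compile the regular expression into a Pattern object.
# re.S (DOTALL) makes '.' match newlines as well. The chapter text spans
# several lines, so without this flag '(.*?)' cannot cross a line break
# and the pattern matches nothing.
p = re.compile(r'<div class="p" id="chapterContent">(.*?)<p class="recent_read"', re.S)

# Read the content from the given URL.
text = urlopen(r'http://www.17k.com/chapter/317131/7299531.html').read()

# findall() returns every matching substring as a list; join them into one string.
# (Named 'content' rather than 'str' to avoid shadowing the built-in type.)
content = ''.join(p.findall(text))

# sub(repl, string[, count]) | re.sub(pattern, repl, string[, count]):
# Returns string with every match replaced by repl.
# When repl is a string, groups can be referenced with \id, \g<id> or \g<name>.
# \0 is not a group reference; use \g<0> for the whole match.
# When repl is a function, it must accept a single argument (a Match object) and
# return the replacement string (group references in the returned string are not expanded).
# count sets the maximum number of replacements; if omitted, all matches are replaced.
# The pattern matches the start of the string or a '<br>' tag; the named group 'pre' is not used below.
p1 = re.compile(r'(?P<pre>^|<br>)')
print p1.sub(r'\n', content)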
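
The comments in the listing above describe how re.S and sub() behave. As a rough illustration, here is a small offline sketch in Python 3 syntax that exercises those behaviours. The sample string and all variable names are made up for this example and are not real 17k.com markup.

```python
import re

# Hypothetical sample standing in for a downloaded page.
sample = '<div id="c">one<br>\ntwo<br>\nthree</div>'
pattern = r'<div id="c">(.*?)</div>'

# Without re.S, '.' stops at '\n', so the multi-line body cannot match.
without_dotall = re.findall(pattern, sample)
# With re.S (DOTALL), '.' also matches '\n'.
with_dotall = re.findall(pattern, sample, re.S)

body = with_dotall[0]

# repl as a string: \g<name> refers to a named group.
by_name = re.sub(r'(?P<tag><br>)', r'[\g<tag>]', body)
# \g<0> refers to the whole match.
whole_match = re.sub(r'<br>', r'[\g<0>]', body)

# repl as a function: it receives a Match object and returns the replacement.
upper_tags = re.sub(r'<br>', lambda m: m.group(0).upper(), body)

# count limits the number of replacements.
first_only = re.sub(r'<br>', '', body, count=1)

print(without_dotall)
print(repr(body))
print(repr(by_name))
print(repr(upper_tags))
print(repr(first_only))
```

Here the first findall call returns an empty list because the text between the tags contains line breaks, which is the situation the question in the comment on re.S asks about.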

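# Python 2. Same extraction as above, but using str.replace instead of a second
# regular expression: a literal '<br>' does not need a pattern.
from urllib import urlopen
import re

p = re.compile(r'<div class="p" id="chapterContent">(.*?)<p class="recent_read"', re.S)

text = urlopen(r'http://www.17k.com/chapter/317131/7299531.html').read()

# Join all matched fragments into one string.
content = ''.join(p.findall(text))

# Turn each '<br>' tag into a newline.
content = content.replace('<br>', '\n')

print content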
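
Both listings above are written for Python 2. A minimal sketch of the same steps in Python 3 might look like the following. The function names `extract_chapter` and `fetch` are hypothetical, and the UTF-8 decoding is an assumption, since the post does not state the page's encoding. The extraction step is kept separate from the download so it can be tried on a local string.

```python
import re
from urllib.request import urlopen

# Same pattern as in the post; re.S lets '.' match across line breaks.
CHAPTER_RE = re.compile(
    r'<div class="p" id="chapterContent">(.*?)<p class="recent_read"', re.S)


def extract_chapter(html):
    """Return the chapter text with each '<br>' turned into a newline."""
    content = ''.join(CHAPTER_RE.findall(html))
    return content.replace('<br>', '\n')


def fetch(url, encoding='utf-8'):
    """Download url and decode it. The default encoding is an assumption."""
    with urlopen(url) as response:
        return response.read().decode(encoding, errors='replace')


# Offline check with a hypothetical snippet shaped the way the pattern expects.
sample = ('<div class="p" id="chapterContent">first<br>second'
          '<p class="recent_read">x</p></div>')
result = extract_chapter(sample)
print(repr(result))
```

To run it against the page used in the post, one would call `print(extract_chapter(fetch('http://www.17k.com/chapter/317131/7299531.html')))`, assuming the page is still available and has the same structure.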


Posted @ 2012-11-22 22:42 by SubmarineX
