Matching Chinese text in web pages with Python regular expressions

To pick out the Chinese text from a fetched web page, the general approach is:

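1. Read the content-type field of the HTTP response header to find the character encoding of the page;

2. Decode the page content to unicode using that encoding;

3. Match it against a regular expression built on the character class [\u2e80-\u9fff], which runs from the CJK Radicals Supplement block through the CJK Unified Ideographs block;

4. Encode the matched strings as utf-8.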
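Step 1 deserves some care, because a content-type value does not always carry a charset parameter. The following is a minimal sketch of a more defensive parse, not part of the original demo: the helper name `charset_from_content_type` and the utf-8 fallback are assumptions made for illustration.

```python
def charset_from_content_type(content_type, default='utf-8'):
    """Return the charset named in a Content-Type value, or `default`.

    Hypothetical helper; the utf-8 fallback is an assumption, not a rule.
    """
    # Parameters follow the media type and are separated by ';'.
    for part in content_type.split(';')[1:]:
        name, _, value = part.strip().partition('=')
        if name.lower() == 'charset' and value:
            # Drop optional quotes and normalise the case.
            return value.strip('"\'').lower()
    return default


print(charset_from_content_type('text/html; charset=GBK'))
print(charset_from_content_type('text/html'))
```

With a fallback in place, a header such as `text/html` yields a usable codec name instead of the whole header value.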

Demo:

#coding=utf-8

import re
import urllib2

if __name__ == '__main__':
    try:
        url = 'https://play.google.com/store/apps/category/TRANSPORTATION/collection/topselling_free?start=48&num=24'
        req = urllib2.Request(url)
        res = urllib2.urlopen(req)
        # read the charset from the content-type header
        # (assumes the header actually carries a charset= parameter)
        encoding = res.headers['content-type'].split('charset=')[-1]
        # read the response body
        data = res.read()
        # decode the raw bytes to unicode
        data = unicode(data, encoding)
        res.close()
        # find every run of CJK characters
        matches = re.findall(ur'[\u2e80-\u9fff]+', data)
        for item in matches:
            # encode as utf-8 before printing
            item = item.encode('utf-8')
            print item
    except Exception, e:
        print e
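Steps 2 to 4 can also be tried without a network connection. The sketch below is an illustration rather than the original demo: the page fragment in `raw` is made up, GBK is only an assumed source encoding, and the character class is the range [\u2e80-\u9fff]. It avoids Python 2-only syntax, so it should run on Python 2 and Python 3 alike.

```python
import re

# Hypothetical page fragment, GBK-encoded the way a server might send it.
raw = u'<p>Hello \u4e2d\u6587 world \u7f51\u9875</p>'.encode('gbk')

# Step 2: decode the raw bytes to unicode with the page's encoding.
text = raw.decode('gbk')

# Step 3: collect every run of characters in the CJK range.
matches = re.findall(u'[\u2e80-\u9fff]+', text)

# Step 4: encode each match as utf-8 bytes for output or storage.
utf8_items = [m.encode('utf-8') for m in matches]

print(len(matches))
```

Note that this range also covers kana and CJK punctuation, which sit between U+2E80 and U+9FFF. If only Han ideographs are wanted, a narrower class such as [\u4e00-\u9fff] is one option.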

posted @ 2012-04-02 22:54 MindMac
