Post archive for April 2, 2012 - MindMac

April 2, 2012

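Summary: To match Chinese text in the content of a fetched web page, the general approach is:
1. Extract the content-type from the HTTP header of the fetched page to determine the page's encoding;
2. Convert the page content to unicode according to that encoding;
3. Match with the regular expression [\u2e80-\u4dfh];
4. Encode the matched characters as utf-8.
Demo:
    #coding=utf-8

    import urllib2

    if __name__ == '__main__':
        try:
            url = 'https://play.google.com/store/apps/categ.. Read more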
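The four steps in the summary can be sketched roughly as follows. This is a minimal illustration, not the post's own demo (which is truncated above and uses Python 2's `urllib2`). It assumes Python 3, takes the header value and raw bytes as arguments instead of fetching a URL, and uses hypothetical helper names `charset_from_content_type` and `extract_chinese`. The character range in the summary is not a valid escape as written, so the sketch assumes the CJK Unified Ideographs block `\u4e00-\u9fff` instead.

```python
import re
from email.message import Message

# Assumption: CJK Unified Ideographs block, not the range quoted in the summary.
CJK_PATTERN = re.compile(r'[\u4e00-\u9fff]+')


def charset_from_content_type(content_type, default='utf-8'):
    """Step 1: read the charset parameter from a Content-Type header value."""
    msg = Message()
    msg['Content-Type'] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    return msg.get_content_charset() or default


def extract_chinese(body, content_type):
    """Steps 2-4: decode the body, match Chinese runs, re-encode as UTF-8."""
    charset = charset_from_content_type(content_type)
    # Step 2: bytes -> unicode text; undecodable bytes are replaced, not raised.
    text = body.decode(charset, errors='replace')
    # Step 3: find every run of consecutive Chinese characters.
    matches = CJK_PATTERN.findall(text)
    # Step 4: encode each match as UTF-8 bytes.
    return [m.encode('utf-8') for m in matches]


if __name__ == '__main__':
    page = '<p>Hello 世界, 你好</p>'.encode('gbk')
    for chunk in extract_chinese(page, 'text/html; charset=GBK'):
        print(chunk.decode('utf-8'))
```

When fetching a real page with `urllib.request.urlopen`, the header value would typically come from `response.headers.get('Content-Type')`. Falling back to UTF-8 when the header has no charset is a choice made for this sketch, not something the post specifies.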

posted @ 2012-04-02 22:54 MindMac Views (1423) Comments (0) Recommendations (0)
