Handling JS escape-encoded data in Python
Reference: http://www.cnblogs.com/suwings/p/6360395.html
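Writing this crawler has been full of twists and turns. Today the site I was scraping returned content in JavaScript escape() encoding, which looked like complete gibberish. urllib.unquote() did not decode it, and neither did calling decode and then encode.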
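To illustrate why unquote alone falls short, here is a minimal sketch in Python 3, where the function lives at urllib.parse.unquote. The sample string is a shortened, hypothetical fragment in the same style as the response. Ordinary %XX sequences are decoded, but %uXXXX is not valid percent-encoding, so those sequences are left as they are.

```python
from urllib.parse import unquote

# Hypothetical shortened fragment of an escape()-encoded response.
escaped = "%3Ctd%3E%u6E56%u5357%3C/td%3E"

# %3C and %3E are decoded; %u6E56 and %u5357 are not valid
# percent-escapes, so unquote leaves them untouched.
partly_decoded = unquote(escaped)
print(partly_decoded)  # <td>%u6E56%u5357</td>
```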
After searching online, I found that the following works:
# Python 2 code (urllib.unquote and the print statement)
import demjson
import urllib

# Raw response: a JavaScript object literal whose content field is escape()-encoded
test = """{isSuccess:\'1\',pager:\'<i class="icon icon-arrow-left-mute disabled"></i><a class="pager active" data-page="1" onclick="ser(1,15)">1</a><i class="icon icon-arrow-right-active disabled" ></i>\',recordCount:\'1\',hrecordCount:\'1\',content:\'%3Ctr%20class%3D%22even%22%20onclick%3D%22locationUrl%28178303%2C0%29%3B%22%3E%3Ctd%3E1%3C/td%3E%3Ctd%20class%3D%22text-left%22%20title%3D%22%u6E56%u5357%u5929%u79CD%u5174%u519C%u517B%u6B96%u6709%u9650%u516C%u53F8%22%3E%u6E56%u5357%u5929%u79CD%u5174%u519C%u517B%u6B96%u6709%u9650%u516C%u53F8%3C/td%3E%3Ctd%3E%u5CB3%u9633%20/%20%3Cspan%20class%3D%22text-prov%22%3E%u6E56%u5357%3C/span%3E%3C/td%3E%3Ctd%3E2016%3C/td%3E%3Ctd%3E5%3C/td%3E%3C/tr%3E\'}"""

# Turn each %uXXXX into \uXXXX so the JSON parser can read it as a Unicode escape
value = test.replace('%u', '\\u')
# Decode the remaining %XX sequences
byts = urllib.unquote(value)
byts = byts.encode('utf-8')
# demjson accepts the non-strict syntax used here (unquoted keys, single-quoted strings)
test_dem = demjson.decode(byts)
print test_dem
for k, v in test_dem.items():
    print k, v
The output is shown in the figure.
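For Python 3, the same decoding can be sketched with the standard library alone. This is not the method from the referenced post but a hedged alternative: js_unescape is a hypothetical helper name, and it assumes the input is a str containing only %XX and %uXXXX sequences as produced by escape(). Characters outside the Basic Multilingual Plane, which escape() emits as two %uXXXX surrogate halves, are not recombined here.

```python
import re

# Matches %uXXXX (a 16-bit code unit) or %XX (a code point below 256).
_ESCAPE_RE = re.compile(r"%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})")


def js_unescape(text):
    """Reverse JavaScript's escape() for a str (hypothetical helper)."""
    def _decode(match):
        # Exactly one of the two groups is set for any given match.
        hex_digits = match.group(1) or match.group(2)
        return chr(int(hex_digits, 16))

    # Anything that is not a well-formed escape is left unchanged.
    return _ESCAPE_RE.sub(_decode, text)


sample = "%3Ctd%3E%u6E56%u5357%3C/td%3E"
print(js_unescape(sample))  # <td>湖南</td>
```

Because the helper decodes both escape forms in a single pass, it needs no intermediate replace step. The surrounding object literal would still need a lenient parser such as demjson, since it uses unquoted keys and single-quoted strings.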