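Decoding percent-encoded (%..%..) Chinese characters in URL request parameters with Python
While scraping, the URL I requested contained Chinese text, but the request.url my code received was full of % escapes. This is not real corruption: it is percent-encoding (each character's UTF-8 bytes written as %XX), and it has to be decoded to recover the Chinese text.
A GET request for downloading a file from this site looks like this, for example: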
http://www.shclearing.com/wcm/shch/pages/client/download/download.jsp?FileName=P020200213422190663763.pdf&&DownName=关于四川科伦药业股份有限公司2020年度第一期中期票据(疫情防控债)相关公告材料的更正说明.pdf
But request.url returns the percent-encoded form, for example:
http://www.shclearing.com/wcm/shch/pages/client/download/download.jsp?FileName=P020200212764099971564.pdf&&DownName=%E7%89%A7%E5%8E%9F%E9%A3%9F%E5%93%81%E8%82%A1%E4%BB%BD%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B82020%E5%B9%B4%E5%BA%A6%E7%AC%AC%E4%BA%8C%E6%9C%9F%E8%B6%85%E7%9F%AD%E6%9C%9F%E8%9E%8D%E8%B5%84%E5%88%B8(%E7%96%AB%E6%83%85%E9%98%B2%E6%8E%A7%E5%80%BA)%E7%94%B3%E8%B4%AD%E8%AF%B4%E6%98%8E.pdf
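For context, the round trip behind those % sequences can be reproduced in a few lines. This is a minimal Python 3 sketch; the sample string is made up for illustration and does not come from the site.

```python
from urllib.parse import quote, unquote

text = "中文.pdf"  # illustrative sample, not taken from the site

# quote() encodes the text as UTF-8 and writes each non-ASCII byte as %XX
encoded = quote(text)
print(encoded)           # %E4%B8%AD%E6%96%87.pdf

# unquote() reverses the process and decodes the bytes as UTF-8 by default
print(unquote(encoded))  # 中文.pdf
```

Each Chinese character takes three bytes in UTF-8, which is why it shows up as three %XX groups.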
After downloading, I want to save the file locally under its original Chinese name, so the URL has to be decoded. The decoding looks like this:
# Python 3. On Python 2, use `from urllib import unquote` and call .decode('utf-8') on the result.
from urllib.parse import unquote

fn = "http://www.shclearing.com/wcm/shch/pages/client/download/download.jsp?FileName=P020200212764099971564.pdf&&DownName=%E7%89%A7%E5%8E%9F%E9%A3%9F%E5%93%81%E8%82%A1%E4%BB%BD%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B82020%E5%B9%B4%E5%BA%A6%E7%AC%AC%E4%BA%8C%E6%9C%9F%E8%B6%85%E7%9F%AD%E6%9C%9F%E8%9E%8D%E8%B5%84%E5%88%B8(%E7%96%AB%E6%83%85%E9%98%B2%E6%8E%A7%E5%80%BA)%E7%94%B3%E8%B4%AD%E8%AF%B4%E6%98%8E.pdf"
print(fn)

# unquote turns every %XX escape back into a byte and decodes the bytes as UTF-8
uu = unquote(fn)
print(uu)
Output:
http://www.shclearing.com/wcm/shch/pages/client/download/download.jsp?FileName=P020200212764099971564.pdf&&DownName=%E7%89%A7%E5%8E%9F%E9%A3%9F%E5%93%81%E8%82%A1%E4%BB%BD%E6%9C%89%E9%99%90%E5%85%AC%E5%8F%B82020%E5%B9%B4%E5%BA%A6%E7%AC%AC%E4%BA%8C%E6%9C%9F%E8%B6%85%E7%9F%AD%E6%9C%9F%E8%9E%8D%E8%B5%84%E5%88%B8(%E7%96%AB%E6%83%85%E9%98%B2%E6%8E%A7%E5%80%BA)%E7%94%B3%E8%B4%AD%E8%AF%B4%E6%98%8E.pdf
http://www.shclearing.com/wcm/shch/pages/client/download/download.jsp?FileName=P020200212764099971564.pdf&&DownName=牧原食品股份有限公司2020年度第二期超短期融资券(疫情防控债)申购说明.pdf
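Since the goal is the file name rather than the whole URL, it may be simpler to pull out just the DownName parameter. The sketch below is one possible way to do that in Python 3 with the standard library. The helper name download_name is hypothetical, and it assumes the parameter is encoded as UTF-8, as in the URL above.

```python
from urllib.parse import urlsplit, parse_qs


def download_name(url, param="DownName"):
    """Return the decoded value of `param` from the URL's query string, or None.

    Hypothetical helper for illustration; assumes UTF-8 percent-encoding.
    """
    query = urlsplit(url).query
    # parse_qs percent-decodes the values (UTF-8 by default) and, in its
    # default non-strict mode, skips the empty field produced by "&&".
    values = parse_qs(query).get(param)
    return values[0] if values else None


url = (
    "http://www.shclearing.com/wcm/shch/pages/client/download/download.jsp"
    "?FileName=P020200212764099971564.pdf&&DownName="
    "%E7%89%A7%E5%8E%9F%E9%A3%9F%E5%93%81%E8%82%A1%E4%BB%BD%E6%9C%89%E9%99%90"
    "%E5%85%AC%E5%8F%B82020%E5%B9%B4%E5%BA%A6%E7%AC%AC%E4%BA%8C%E6%9C%9F"
    "%E8%B6%85%E7%9F%AD%E6%9C%9F%E8%9E%8D%E8%B5%84%E5%88%B8"
    "(%E7%96%AB%E6%83%85%E9%98%B2%E6%8E%A7%E5%80%BA)"
    "%E7%94%B3%E8%B4%AD%E8%AF%B4%E6%98%8E.pdf"
)

name = download_name(url)
print(name)
```

The returned string can then be used as the local file name. Because it comes from a remote URL, it is worth stripping any path separators from it before opening a file with it.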
References:
https://blog.csdn.net/kai402458953/article/details/83541079
https://blog.csdn.net/mp624183768/article/details/83451660