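Dev notes -- URLs that look garbled: fixing them with Python's urllib library 1
Scenario:
During development, and especially when scraping data, you occasionally run into URLs that look garbled because parts of them are percent-encoded, for example: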
https://gimg2.baidu.com/image_search/src=http%3A%2F%2Fp6.itc.cn%2Fq_70%2Fimages03%2F20210910%2F3a1618342d16479698e1026983dce86b.jpeg&refer=http%3A%2F%2Fp6.itc.cn&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=auto?sec=1670722169&t=6a3e0b1c459545b0dba348c38477ce9f
https://facert.gitbooks.io/python-data-structure-cn/2.%E7%AE%97%E6%B3%95%E5%88%86%E6%9E%90/2.2.%E4%BB%80%E4%B9%88%E6%98%AF%E7%AE%97%E6%B3%95%E5%88%86%E6%9E%90/
If you copy such an address straight into a browser, it reports an error, as with the following:
http%3A%2F%2Fp6.itc.cn%2Fq_70%2Fimages03%2F20210910%2F3a1618342d16479698e1026983dce86b.jpeg
How do we deal with it?
Use the unquote function from Python's urllib library (urllib.parse):
macdeMacBook-Pro-2:~ mac$ python3
Python 3.10.4 (v3.10.4:9d38120e33, Mar 23 2022, 17:29:05) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib.parse import unquote, quote
>>>
>>> html_str = unquote('https://gimg2.baidu.com/image_search/src=http%3A%2F%2Fp6.itc.cn%2Fq_70%2Fimages03%2F20210910%2F3a1618342d16479698e1026983dce86b.jpeg&refer=http%3A%2F%2Fp6.itc.cn&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=auto?sec=1670722169&t=6a3e0b1c459545b0dba348c38477ce9f')
>>> html_str
'https://gimg2.baidu.com/image_search/src=http://p6.itc.cn/q_70/images03/20210910/3a1618342d16479698e1026983dce86b.jpeg&refer=http://p6.itc.cn&app=2002&size=f9999,10000&q=a80&n=0&g=0n&fmt=auto?sec=1670722169&t=6a3e0b1c459545b0dba348c38477ce9f'
>>>
>>> html_str_1 = unquote('https://facert.gitbooks.io/python-data-structure-cn/2.%E7%AE%97%E6%B3%95%E5%88%86%E6%9E%90/2.2.%E4%BB%80%E4%B9%88%E6%98%AF%E7%AE%97%E6%B3%95%E5%88%86%E6%9E%90/')
>>> html_str_1
'https://facert.gitbooks.io/python-data-structure-cn/2.算法分析/2.2.什么是算法分析/'
>>> exit()
OK, the values printed for html_str and html_str_1 are the decoded addresses.
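The session above imports quote but never calls it. As a minimal sketch of the reverse direction, the script below decodes the gitbooks address and then re-encodes it. The variable names (encoded, decoded, re_encoded) and the choice of safe=":/" are illustrative assumptions, not part of the original session.

```python
from urllib.parse import quote, unquote

# Percent-encoded address taken from the note above.
encoded = (
    "https://facert.gitbooks.io/python-data-structure-cn/"
    "2.%E7%AE%97%E6%B3%95%E5%88%86%E6%9E%90/"
    "2.2.%E4%BB%80%E4%B9%88%E6%98%AF%E7%AE%97%E6%B3%95%E5%88%86%E6%9E%90/"
)

# unquote turns each %XX sequence back into the UTF-8 character it encodes.
decoded = unquote(encoded)

# quote goes the other way. By default only "/" is left untouched, so ":" is
# added to `safe` here (an illustrative choice) to keep "https://" readable.
re_encoded = quote(decoded, safe=":/")

print(decoded)
print(re_encoded == encoded)
```

With the default safe="/", quote would also encode the colon after https, which is usually not what you want for a complete URL.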
For example, this is the decoded image address embedded in html_str:
http://p6.itc.cn/q_70/images03/20210910/3a1618342d16479698e1026983dce86b.jpeg
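The embedded image address can also be pulled out programmatically instead of being copied by hand. The sketch below assumes, based only on this one sample, that the part of the path after /image_search/ is laid out like a query string (key=value pairs joined by &). The names embedded_params, params and image_url are hypothetical.

```python
from urllib.parse import parse_qs, urlsplit

# The Baidu image address from the note above, still percent-encoded.
url = (
    "https://gimg2.baidu.com/image_search/"
    "src=http%3A%2F%2Fp6.itc.cn%2Fq_70%2Fimages03%2F20210910%2F"
    "3a1618342d16479698e1026983dce86b.jpeg"
    "&refer=http%3A%2F%2Fp6.itc.cn&app=2002&size=f9999,10000"
    "&q=a80&n=0&g=0n&fmt=auto"
    "?sec=1670722169&t=6a3e0b1c459545b0dba348c38477ce9f"
)

# Split BEFORE decoding: urlsplit leaves the path percent-encoded, so the
# slashes inside the embedded address cannot be mistaken for path separators.
parts = urlsplit(url)

# Assumption: everything after "/image_search/" is a list of key=value pairs.
_, _, embedded_params = parts.path.partition("/image_search/")

# parse_qs splits on "&" and percent-decodes each value.
params = parse_qs(embedded_params)
image_url = params["src"][0]

print(image_url)
```

Splitting first and decoding second matters here: once the whole string has been passed through unquote, the embedded http:// address and the outer path are no longer distinguishable by structure alone.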