URL encoding and decoding
Code
from urllib.parse import quote, unquote, urlencode

print(quote('https://www.cnblogs.com/?a=bc&d=f'))
print(urlencode({'a': 'b', 'b': 'c'}))
# https%3A//www.cnblogs.com/%3Fa%3Dbc%26d%3Df
# a=b&b=c

print(unquote('https%3A//www.cnblogs.com/%3Fa%3Dbc%26d%3Df'))
print(unquote('a=b&b=c'))

# Encoding
# quote operates on a string; it encodes the URL's parameters and special characters.
# urlencode operates on a dict, or on a list of tuples.
# Decoding
# There is only unquote; urllib.parse has no urldecode.
# So decoding always goes through unquote.
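The comments above mention that urlencode also takes a list of tuples and that decoding relies on unquote. The following is a minimal sketch of both. It also uses parse_qsl, which the original does not mention; it is part of urllib.parse and is shown here only as one possible way to split a query string back into pairs.

```python
from urllib.parse import parse_qsl, unquote, urlencode

# urlencode accepts a dict or a sequence of (key, value) tuples.
pairs = [('a', 'b'), ('b', 'c')]
query = urlencode(pairs)
print(query)  # a=b&b=c

# Each value is percent-encoded on its own, so '&' and '=' inside a value
# cannot be mistaken for the separators.
tricky = urlencode({'q': 'x&y=z'})
print(tricky)  # q=x%26y%3Dz

# unquote only reverses the percent-encoding; it does not split the string.
print(unquote(tricky))  # q=x&y=z

# parse_qsl splits first and decodes afterwards, recovering the pairs.
decoded = parse_qsl(tricky)
print(decoded)  # [('q', 'x&y=z')]
```

Calling unquote on a whole query string is fine for display, but once the values may contain '&' or '=', splitting before decoding is the safer order.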
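As for encoding, the output above shows that even the colon after the http scheme gets encoded; only '/' is left alone. This causes a lot of trouble for crawlers, so let's look at the source code of quote.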
def quote(string, safe='/', encoding=None, errors=None):
    """quote('abc def') -> 'abc%20def'

    Each part of a URL, e.g. the path info, the query, etc., has a
    different set of reserved characters that must be quoted. The
    quote function offers a cautious (not minimal) way to quote a
    string for most of these parts.

    RFC 3986 Uniform Resource Identifier (URI): Generic Syntax lists
    the following (un)reserved characters.

    unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
    reserved      = gen-delims / sub-delims
    gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

    Each of the reserved characters is reserved in some component of a URL,
    but not necessarily in all of them.

    The quote function %-escapes all characters that are neither in the
    unreserved chars ("always safe") nor the additional chars set via the
    safe arg.

    The default for the safe arg is '/'. The character is reserved, but in
    typical usage the quote function is being called on a path where the
    existing slash characters are to be preserved.

    Python 3.7 updates from using RFC 2396 to RFC 3986 to quote URL strings.
    Now, "~" is included in the set of unreserved characters.

    string and safe may be either str or bytes objects. encoding and errors
    must not be specified if string is a bytes object.

    The optional encoding and errors parameters specify how to deal with
    non-ASCII characters, as accepted by the str.encode method.
    By default, encoding='utf-8' (characters are encoded with UTF-8), and
    errors='strict' (unsupported characters raise a UnicodeEncodeError).
    """
    if isinstance(string, str):
        if not string:
            return string
        if encoding is None:
            encoding = 'utf-8'
        if errors is None:
            errors = 'strict'
        string = string.encode(encoding, errors)
    else:
        if encoding is not None:
            raise TypeError("quote() doesn't support 'encoding' for bytes")
        if errors is not None:
            raise TypeError("quote() doesn't support 'errors' for bytes")
    return quote_from_bytes(string, safe)
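The body of quote shown above mostly deals with turning a str into bytes before handing off to quote_from_bytes. A small sketch of what that means in practice follows; the sample strings are illustrative assumptions and do not come from the original post.

```python
from urllib.parse import quote

# str input is encoded to bytes first (UTF-8 unless told otherwise).
utf8_quoted = quote('中文')
print(utf8_quoted)  # %E4%B8%AD%E6%96%87

# A different codec produces different bytes, hence different escapes.
gbk_quoted = quote('中文', encoding='gbk')
print(gbk_quoted)  # %D6%D0%CE%C4

# bytes input skips the encode step entirely.
bytes_quoted = quote(b'abc def')
print(bytes_quoted)  # abc%20def

# Passing encoding together with bytes hits the else branch and raises.
try:
    quote(b'abc def', encoding='utf-8')
    raised = False
except TypeError:
    raised = True
print(raised)  # True
```

For a crawler this matters when the target site expects a non-UTF-8 encoding in its URLs: the encoding argument has to match what the server decodes with.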
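In other words, apart from ASCII letters and digits (and the default safe character '/'), only the following four symbols are left unencoded by default; everything else is encoded:
'_.-~'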
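This claim can be checked directly. The sketch below assumes Python 3.7 or later (where "~" counts as unreserved, as the docstring notes) and passes safe='' so that the default '/' does not get in the way; the variable names are illustrative.

```python
import string
from urllib.parse import quote

# Letters, digits and the four symbols '_.-~' are never escaped.
always_safe = string.ascii_letters + string.digits + '_.-~'
unchanged = quote(always_safe, safe='') == always_safe
print(unchanged)  # True

# Every other ASCII punctuation character, and the space, is escaped
# to '%' followed by its two-digit uppercase hex code.
others = [c for c in string.punctuation + ' ' if c not in '_.-~']
all_escaped = all(
    quote(c, safe='') == '%{:02X}'.format(ord(c)) for c in others
)
print(all_escaped)  # True

# '/' survives only because it is the default value of safe.
slash_default = quote('/')
slash_strict = quote('/', safe='')
print(slash_default)  # /
print(slash_strict)   # %2F
```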
In addition, any character passed in through the safe argument is left unencoded. The effect looks like this:
>>> quote('https://www.cnblogs.com/?a=bc&d=f')
'https%3A//www.cnblogs.com/%3Fa%3Dbc%26d%3Df'
>>> quote('https://www.cnblogs.com/?a=bc&d=f', safe=':/')
'https://www.cnblogs.com/%3Fa%3Dbc%26d%3Df'
>>> quote('https://www.cnblogs.com/?a=bc&d=f', safe=':?/')
'https://www.cnblogs.com/?a%3Dbc%26d%3Df'
>>> quote('https://www.cnblogs.com/?a=bc&d=f', safe=':?=/')
'https://www.cnblogs.com/?a=bc%26d=f'
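In the source, the default safe contains only '/'. Once you pass your own safe, that default is replaced, so if you still want '/' left unencoded, remember to include '/' in the value you pass.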
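For crawler use, one way to avoid hand-tuning safe for a whole URL is to split the URL and quote each component separately. The following is only a sketch under that assumption: quote_url is a hypothetical helper name, not part of urllib, and the choice of safe='&=' for the query is an illustrative decision.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def quote_url(url):
    """Hypothetical helper: escape only the path and query of a URL."""
    parts = urlsplit(url)
    # The path keeps its slashes (the default safe value).
    path = quote(parts.path, safe='/')
    # The query keeps '&' and '=' so the parameters stay separated.
    query = quote(parts.query, safe='&=')
    return urlunsplit(
        (parts.scheme, parts.netloc, path, query, parts.fragment)
    )


# Forgetting '/' in a custom safe value escapes every slash.
no_slash = quote('https://www.cnblogs.com/', safe=':')
print(no_slash)  # https:%2F%2Fwww.cnblogs.com%2F

# The scheme and host are left alone; only unsafe characters change.
fixed = quote_url('https://www.cnblogs.com/my path/?a=b c&d=f')
print(fixed)  # https://www.cnblogs.com/my%20path/?a=b%20c&d=f

# A URL that needs no escaping comes back unchanged.
same = quote_url('https://www.cnblogs.com/?a=bc&d=f')
print(same)  # https://www.cnblogs.com/?a=bc&d=f
```

Note that this sketch would double-encode a URL that is already percent-encoded, because '%' itself gets escaped to '%25'; it is meant for raw, unencoded input only.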