URL encoding and decoding
Code
from urllib.parse import quote, unquote, urlencode

print(quote('https://www.cnblogs.com/?a=bc&d=f'))
print(urlencode({'a': 'b', 'b': 'c'}))
# https%3A//www.cnblogs.com/%3Fa%3Dbc%26d%3Df
# a=b&b=c

print(unquote('https%3A//www.cnblogs.com/%3Fa%3Dbc%26d%3Df'))
print(unquote('a=b&b=c'))

# Encoding
# quote works on a string and encodes the URL's parameters and special characters
# urlencode works on a dict, or on a list of tuples
# Decoding
# There is only unquote; there is no urldecode
# So decoding is done with unquote
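Although urllib.parse has no function named urldecode, a query string produced by urlencode can still be turned back into Python data. The following is a minimal sketch, assuming the standard-library helpers parse_qsl and parse_qs (which the original text does not mention) are acceptable for this purpose:

```python
from urllib.parse import urlencode, parse_qs, parse_qsl

# Encode a dict into a query string, as in the example above.
query = urlencode({'a': 'b', 'b': 'c'})

# parse_qsl returns a list of (key, value) pairs, percent-decoding each one.
pairs = parse_qsl(query)

# parse_qs groups values by key, so every value is a list.
grouped = parse_qs(query)

print(query)
print(pairs)
print(grouped)
```

parse_qsl keeps the order of the pairs and any repeated keys, which is usually what a crawler wants when it has to rebuild the same query later.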
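As for encoding, the output above shows that even the colon after the http scheme is encoded, and only '/' is left alone. This causes a lot of trouble for crawlers, so let's look at the source code: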
def quote(string, safe='/', encoding=None, errors=None):
    """quote('abc def') -> 'abc%20def'

    Each part of a URL, e.g. the path info, the query, etc., has a
    different set of reserved characters that must be quoted. The
    quote function offers a cautious (not minimal) way to quote a
    string for most of these parts.

    RFC 3986 Uniform Resource Identifier (URI): Generic Syntax lists
    the following (un)reserved characters.

    unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
    reserved      = gen-delims / sub-delims
    gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

    Each of the reserved characters is reserved in some component of a URL,
    but not necessarily in all of them.

    The quote function %-escapes all characters that are neither in the
    unreserved chars ("always safe") nor the additional chars set via the
    safe arg.

    The default for the safe arg is '/'. The character is reserved, but in
    typical usage the quote function is being called on a path where the
    existing slash characters are to be preserved.

    Python 3.7 updates from using RFC 2396 to RFC 3986 to quote URL strings.
    Now, "~" is included in the set of unreserved characters.

    string and safe may be either str or bytes objects. encoding and errors
    must not be specified if string is a bytes object.

    The optional encoding and errors parameters specify how to deal with
    non-ASCII characters, as accepted by the str.encode method.
    By default, encoding='utf-8' (characters are encoded with UTF-8), and
    errors='strict' (unsupported characters raise a UnicodeEncodeError).
    """
    if isinstance(string, str):
        if not string:
            return string
        if encoding is None:
            encoding = 'utf-8'
        if errors is None:
            errors = 'strict'
        string = string.encode(encoding, errors)
    else:
        if encoding is not None:
            raise TypeError("quote() doesn't support 'encoding' for bytes")
        if errors is not None:
            raise TypeError("quote() doesn't support 'errors' for bytes")
    return quote_from_bytes(string, safe)
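In other words, besides letters and digits, only the following four symbols are never encoded by default (plus '/', which comes from the default safe argument). Every other character is encoded:
'_.-~'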
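The set of symbols that survive can be checked directly. This is a minimal sketch, assuming Python 3.7 or later (where '~' counts as unreserved) and passing safe='' so that the default '/' does not hide the result. The variable name untouched is only illustrative:

```python
import string
from urllib.parse import quote

# Quote every ASCII punctuation character on its own, with an empty safe
# set, and keep the ones that come back unchanged.
untouched = ''.join(ch for ch in string.punctuation if quote(ch, safe='') == ch)
print(untouched)

# Letters and digits are unreserved too, so they are never encoded.
alnum = string.ascii_letters + string.digits
print(quote(alnum, safe='') == alnum)
```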
In addition, any character passed in the safe argument is not encoded either. The effect looks like this:
>>> quote('https://www.cnblogs.com/?a=bc&d=f')
'https%3A//www.cnblogs.com/%3Fa%3Dbc%26d%3Df'
>>> quote('https://www.cnblogs.com/?a=bc&d=f', safe=':/')
'https://www.cnblogs.com/%3Fa%3Dbc%26d%3Df'
>>> quote('https://www.cnblogs.com/?a=bc&d=f', safe=':?/')
'https://www.cnblogs.com/?a%3Dbc%26d%3Df'
>>> quote('https://www.cnblogs.com/?a=bc&d=f', safe=':?=/')
'https://www.cnblogs.com/?a=bc%26d=f'
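In the source, the default safe is just '/'. Once you pass your own safe, it replaces that default, so if you still need '/' left unencoded, remember to include '/' in it.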
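Since the trouble for crawlers comes from quoting a whole URL in one call, one way to avoid picking a safe set by hand is to encode each component separately. This is a rough sketch, assuming the input URL has an unencoded path and a simple key=value query. The helper name encode_url is hypothetical and not part of urllib:

```python
from urllib.parse import urlsplit, urlunsplit, quote, urlencode, parse_qsl

def encode_url(url):
    """Percent-encode the path and query of a URL; leave scheme and host alone."""
    parts = urlsplit(url)
    # Keep '/' as the path separator (this is also the default safe value).
    path = quote(parts.path, safe='/')
    # Re-encode the query pair by pair so '=' and '&' keep their meaning.
    # quote_via=quote encodes spaces as %20 instead of '+'.
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    query = urlencode(pairs, quote_via=quote)
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))

print(encode_url('https://www.cnblogs.com/a b/?a=b c&d=f'))

# Passing a safe value without '/' means slashes are encoded as well.
print(quote('/a b/', safe=':'))
```

Splitting first means the ':' after the scheme and the '?', '=' and '&' in the query never reach quote, so they do not need to be listed in safe.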