URL encoding and decoding

Code

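from urllib.parse import quote, unquote, urlencode


print(quote('https://www.cnblogs.com/?a=bc&d=f'))
# https%3A//www.cnblogs.com/%3Fa%3Dbc%26d%3Df

print(urlencode({'a': 'b', 'b': 'c'}))
# a=b&b=c

print(unquote('https%3A//www.cnblogs.com/%3Fa%3Dbc%26d%3Df'))
# https://www.cnblogs.com/?a=bc&d=f
print(unquote('a=b&b=c'))
# a=b&b=c

# Encoding
# quote operates on a string and percent-encodes the special characters in it,
# including the ones that delimit the URL's query parameters
# urlencode operates on a dict, or on a list of tuples
# Decoding
# There is only unquote, no urldecode,
# so decoding is done with unquote alone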
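There is no urldecode, but urllib.parse does ship parse_qsl and parse_qs, which turn a query string back into key/value pairs. Here is a minimal sketch of the round trip; the sample parameters are made up for illustration and are not from the code above.

```python
from urllib.parse import urlencode, parse_qsl, parse_qs

# urlencode accepts a dict or a list of (key, value) tuples (sample data)
pairs = [('a', 'b c'), ('b', 'c&d')]
query = urlencode(pairs)
print(query)        # a=b+c&b=c%26d

# parse_qsl reverses it into a list of (key, value) tuples
decoded = parse_qsl(query)
print(decoded)      # [('a', 'b c'), ('b', 'c&d')]

# parse_qs groups the values by key instead
grouped = parse_qs(query)
print(grouped)      # {'a': ['b c'], 'b': ['c&d']}
```

Note that urlencode turns a space into '+' rather than '%20', because it uses quote_plus by default.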

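As the output above shows, quote also encodes the colon that follows the scheme (https), and only '/' is left alone. This can cause real trouble for a crawler, because the encoded string is no longer a usable URL. Let's look at the source: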

def quote(string, safe='/', encoding=None, errors=None):
    """quote('abc def') -> 'abc%20def'

    Each part of a URL, e.g. the path info, the query, etc., has a
    different set of reserved characters that must be quoted. The
    quote function offers a cautious (not minimal) way to quote a
    string for most of these parts.

    RFC 3986 Uniform Resource Identifier (URI): Generic Syntax lists
    the following (un)reserved characters.

    unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
    reserved      = gen-delims / sub-delims
    gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
    sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

    Each of the reserved characters is reserved in some component of a URL,
    but not necessarily in all of them.

    The quote function %-escapes all characters that are neither in the
    unreserved chars ("always safe") nor the additional chars set via the
    safe arg.

    The default for the safe arg is '/'. The character is reserved, but in
    typical usage the quote function is being called on a path where the
    existing slash characters are to be preserved.

    Python 3.7 updates from using RFC 2396 to RFC 3986 to quote URL strings.
    Now, "~" is included in the set of unreserved characters.

    string and safe may be either str or bytes objects. encoding and errors
    must not be specified if string is a bytes object.

    The optional encoding and errors parameters specify how to deal with
    non-ASCII characters, as accepted by the str.encode method.
    By default, encoding='utf-8' (characters are encoded with UTF-8), and
    errors='strict' (unsupported characters raise a UnicodeEncodeError).
    """
    if isinstance(string, str):
        if not string:
            return string
        if encoding is None:
            encoding = 'utf-8'
        if errors is None:
            errors = 'strict'
        string = string.encode(encoding, errors)
    else:
        if encoding is not None:
            raise TypeError("quote() doesn't support 'encoding' for bytes")
        if errors is not None:
            raise TypeError("quote() doesn't support 'errors' for bytes")
    return quote_from_bytes(string, safe)
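The str/bytes branch in the source is easy to try out. The sketch below only exercises the encoding and errors parameters described in the docstring; the sample strings are my own.

```python
from urllib.parse import quote

# A str is encoded to bytes first; UTF-8 is the default codec
utf8_result = quote('中文')
print(utf8_result)      # %E4%B8%AD%E6%96%87

# A different codec produces different percent-escapes
gbk_result = quote('中文', encoding='gbk')
print(gbk_result)       # %D6%D0%CE%C4

# errors='replace' swaps an unencodable character for '?', which is then escaped
replaced = quote('中', encoding='ascii', errors='replace')
print(replaced)         # %3F

# bytes input skips the encode step, so passing encoding raises TypeError
error_message = None
try:
    quote(b'abc def', encoding='utf-8')
except TypeError as exc:
    error_message = str(exc)
print(error_message)
```

If the target site expects a codec other than UTF-8 (GBK is used here purely as an example), the escapes have to be produced with that codec, otherwise the server will decode them to the wrong characters.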

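By default, apart from ASCII letters and digits, only the following four symbols are never encoded. '/' is also left alone, but only because it is the default value of safe. Everything else gets encoded.

'_.-~'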
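We can check this list ourselves instead of reading it off the docstring. A small sketch, assuming Python 3.7 or later (earlier versions also escape '~'):

```python
import string
from urllib.parse import quote

# Punctuation that quote() returns unchanged with the default safe='/'
kept_default = ''.join(ch for ch in string.punctuation if quote(ch) == ch)
print(kept_default)     # -./_~

# With safe='' the slash is escaped too, leaving only the four unreserved symbols
kept_no_safe = ''.join(ch for ch in string.punctuation if quote(ch, safe='') == ch)
print(kept_no_safe)     # -._~

# ASCII letters and digits are never escaped
alnum = string.ascii_letters + string.digits
alnum_unchanged = quote(alnum) == alnum
print(alnum_unchanged)  # True
```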

Characters passed in through the safe parameter are not encoded either, as shown here:

>>> quote('https://www.cnblogs.com/?a=bc&d=f')
'https%3A//www.cnblogs.com/%3Fa%3Dbc%26d%3Df'

>>> quote('https://www.cnblogs.com/?a=bc&d=f', safe=':/')
'https://www.cnblogs.com/%3Fa%3Dbc%26d%3Df'

>>> quote('https://www.cnblogs.com/?a=bc&d=f', safe=':?/')
'https://www.cnblogs.com/?a%3Dbc%26d%3Df'

>>> quote('https://www.cnblogs.com/?a=bc&d=f', safe=':?=/')
'https://www.cnblogs.com/?a=bc%26d=f'

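The default safe in the source is just '/'. If you pass your own safe and still want '/' left unencoded, remember to include '/' in it, because the value you pass replaces the default instead of adding to it.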
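For a crawler, one way around hand-tuning safe is to split the URL and encode each part on its own. The following is only a sketch of that idea: encode_url is a hypothetical helper of mine (not something from urllib), and it assumes the input URL has not been percent-encoded already.

```python
from urllib.parse import quote, urlsplit, urlunsplit, parse_qsl, urlencode

# A custom safe replaces the default '/', it does not add to it
only_colon = quote('a/b:c', safe=':')
print(only_colon)        # a%2Fb:c
colon_and_slash = quote('a/b:c', safe=':/')
print(colon_and_slash)   # a/b:c


def encode_url(url):
    """Hypothetical helper: percent-encode the path and the query separately."""
    parts = urlsplit(url)
    # The default safe='/' keeps the path separators intact
    path = quote(parts.path)
    # Re-encode the query pair by pair so '=' and '&' stay as delimiters
    query = urlencode(parse_qsl(parts.query, keep_blank_values=True))
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


encoded = encode_url('https://www.cnblogs.com/a b/?a=b c&d=f')
print(encoded)           # https://www.cnblogs.com/a%20b/?a=b+c&d=f
```

The scheme and host are passed through untouched, so the colon and slashes of 'https://' never reach quote at all. Running the helper on a URL whose path is already encoded would escape the '%' signs a second time, so it should only be applied once.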

posted @ 阿布_alone