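Using the Basic Crawler Libraries: urllib

Basic usage of urllib

urllib is the HTTP request package in Python's standard library.

urllib contains four modules: request, error, parse, and robotparser.

Examples

Example 1:

Send a request to the given URL; urlopen() returns the server's response as a file-like object
```
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
```
Read the entire response body (returned as bytes)
```
html = response.read()
```
Decode the bytes into a str (assuming the page is UTF-8 encoded) and print it
```
print(html.decode('utf-8'))
```
This prints the HTML source of the page.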

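Example 2:

Send a request to the given URL; urlopen() returns the server's response as a file-like object
```
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
```
Print the type of the response object
```
print(type(response))
```

The output is:

<class 'http.client.HTTPResponse'>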

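Example 3:

import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(response.status)
print(response.getheaders())
print(response.getheader('Server'))

Notes:
# status attribute: the response status code, e.g. 200 means the request succeeded
# getheaders() method: returns all response headers
# getheader('name') method: returns the value of the response header called name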

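The Request class in urllib.request

urlopen() can issue the most basic requests. The Request class is more powerful: it lets you set the request headers, body data, and method.

Example

import urllib.request
request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))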

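Parameters of urllib.request.Request()

Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

Parameters:
url: the URL to request
data: the data to submit in the request body; it must be bytes and defaults to None. When data is not None, the request method defaults to POST
headers: the request headers; they can be passed as a dict when constructing the request or added afterwards with add_header()
The most common use of request headers is changing User-Agent so the request looks like it comes from a browser (for example, to imitate Firefox:
Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11
)
origin_req_host: the host name or IP address of the party making the request
unverifiable: whether the request is unverifiable; the default is False. A request is unverifiable when the user had no option to approve it. For example, if we request an image embedded in an HTML document and the user had no option to approve the automatic fetching of that image, unverifiable should be True.
method: a string naming the HTTP method to use (e.g. GET, POST, PUT)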
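
To make the parameter list more concrete, here is a minimal sketch that builds a Request without sending it and then inspects the result. The URL, form field, and User-Agent string are placeholder values chosen for illustration.

```python
from urllib import parse, request

# Placeholder values for illustration only.
url = 'http://httpbin.org/post'
form = {'name': 'Germey'}

# data must be bytes, so URL-encode the dict and then encode the string.
data = parse.urlencode(form).encode('utf-8')

req = request.Request(url, data=data)
# Headers can also be attached after construction with add_header().
req.add_header('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64)')

print(req.full_url)                  # http://httpbin.org/post
print(req.data)                      # b'name=Germey'
print(req.get_method())              # POST, because data is not None
print(req.get_header('User-agent'))  # header names are stored capitalized

# Without data the default method is GET; method= overrides the default.
print(request.Request(url).get_method())                # GET
print(request.Request(url, method='PUT').get_method())  # PUT
```

Nothing goes over the network here; the request is only sent once the object is passed to urlopen().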

Example

from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
form = {'name': 'Germey'}                                  # form fields to submit
data = bytes(parse.urlencode(form), encoding='utf-8')      # the body must be bytes
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

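Advanced usage

Request can build requests, but more advanced operations (such as cookie handling or proxy settings) need a more powerful tool: the Handler classes. All Handler subclasses inherit from BaseHandler.

Example 1

Some sites ask for a user name and password and only show the page after successful authentication. HTTPBasicAuthHandler can handle this case.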

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = 'http://localhost:5000/'

p = HTTPPasswordMgrWithDefaultRealm()           # password manager that stores the credentials for HTTP requests
p.add_password(None, url, username, password)   # register the URL, user name, and password
auth_handler = HTTPBasicAuthHandler(p)          # handler that performs HTTP Basic authentication
opener = build_opener(auth_handler)             # build an opener with build_opener()

try:
    result = opener.open(url)                   # open the URL with the opener, which completes the authentication
    html = result.read().decode('utf-8')        # read and decode the response body
    print(html)
except URLError as e:
    print(e.reason)                             # print the reason for the error

Example 2

Proxies

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

# ProxyHandler() routes requests through a proxy. Its argument is a dict whose keys are protocol names
# (such as http or https) and whose values are proxy URLs; several proxies can be listed.
proxy_handler = ProxyHandler(
    {
        'http': 'http://127.0.0.1:9743',
        'https': 'https://127.0.0.1:9743'
    }
)
opener = build_opener(proxy_handler)            # build an opener with build_opener()

try:
    response = opener.open('https://www.baidu.com')         # send the request
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

Example 3

Crawling sites that require a login calls for the cookie-related modules.

import http.cookiejar
# http.cookiejar.CookieJar()
#   1. Stores cookies and adds them to outgoing HTTP requests
#   2. Cookies live in memory and disappear once the CookieJar instance is garbage collected
import urllib.request

cookie = http.cookiejar.CookieJar()                         # create a CookieJar instance
handler = urllib.request.HTTPCookieProcessor(cookie)        # build a cookie handler around the CookieJar
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')

for item in cookie:
    print(item.name+"="+item.value)
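
The three handlers shown above can also be combined in a single opener, since build_opener() accepts any number of handlers. The sketch below is one way to do that; make_opener is a hypothetical helper name, and the credentials, proxy address, and User-Agent string are placeholders. No request is sent.

```python
import http.cookiejar
from urllib.request import (
    HTTPBasicAuthHandler,
    HTTPCookieProcessor,
    HTTPPasswordMgrWithDefaultRealm,
    ProxyHandler,
    build_opener,
)


def make_opener(url, username, password, proxies):
    """Build one opener with auth, proxy, and cookie handling (hypothetical helper)."""
    password_mgr = HTTPPasswordMgrWithDefaultRealm()
    password_mgr.add_password(None, url, username, password)

    cookie_jar = http.cookiejar.CookieJar()
    opener = build_opener(
        HTTPBasicAuthHandler(password_mgr),
        ProxyHandler(proxies),
        HTTPCookieProcessor(cookie_jar),
    )
    # Default headers for requests made through this opener.
    opener.addheaders = [('User-Agent', 'Mozilla/5.0 (X11; Linux x86_64)')]
    return opener, cookie_jar


# Placeholder credentials and proxy address; nothing is requested here.
opener, cookie_jar = make_opener(
    'http://localhost:5000/', 'username', 'password',
    {'http': 'http://127.0.0.1:9743'},
)
print([type(h).__name__ for h in opener.handlers])
```

Calling opener.open(url) would then apply authentication, the proxy, and cookie storage together, and cookie_jar would hold whatever cookies the server sets.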

The error module in urllib

URLError

from urllib import request, error
try:
    response = request.urlopen('https://www.bucunzai_tan90.com/index.htm')
    print(response.read().decode('utf8'))
except error.URLError as e:
    print(e.reason)

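Notes:
When the host does not exist, the output is something like: [Errno 8] nodename nor servname provided, or not known (the exact message depends on the operating system)
When the page exists, the output is the page's source code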

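HTTPError

HTTPError is a subclass of URLError dedicated to HTTP request errors, such as a failed authentication request. It has three attributes:

code: the HTTP status code, e.g. 404 means the page does not exist and 500 means an internal server error
reason: the reason for the error, as in the parent class
headers: the response headers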
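
The subclass relationship and the three attributes can be checked without any network access. The sketch below builds the exceptions by hand as stand-ins for what urlopen() would raise; describe is a hypothetical helper, and the URL and messages are made up.

```python
import io
from email.message import Message
from urllib.error import HTTPError, URLError


def describe(exc):
    """Return a short description, checking the subclass before the parent."""
    if isinstance(exc, HTTPError):
        return f'HTTP error {exc.code}: {exc.reason}'
    if isinstance(exc, URLError):
        return f'URL error: {exc.reason}'
    return f'other error: {exc}'


# Hand-built exceptions standing in for what urlopen() would raise.
headers = Message()
headers['Content-Type'] = 'text/html'
http_err = HTTPError('http://example.com/missing', 404, 'Not Found',
                     headers, io.BytesIO(b''))
url_err = URLError('connection refused')

print(describe(http_err))                # HTTP error 404: Not Found
print(describe(url_err))                 # URL error: connection refused
print(isinstance(http_err, URLError))    # True
print(http_err.headers['Content-Type'])  # text/html
```

Because every HTTPError is also a URLError, the HTTPError check has to come first; otherwise the more general branch would catch it.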

Example:

from urllib import request, error
try:
    response = request.urlopen('https://cuiqingcai.com/index.htm')
    print(response.read().decode('utf8'))
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n\n')
 
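# Alternatively, write it as follows.
# A better pattern: handle the subclass first, then the parent class, and put the success path in else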
 
from urllib import request, error
try:
    response = request.urlopen('https://cuiqingcai.com/index.htm')
    # print(response.read().decode('utf8'))
except error.HTTPError as e:                      # handle the HTTPError subclass first
    print(e.reason, e.code, e.headers, sep='\n\n')
except error.URLError as e:                       # then handle the URLError parent class
    print(e.reason)
else:                                             # runs only when no exception was raised
    print('Request Successful')

Output:

Not Found
 
404
 
Server: nginx/1.10.3 (Ubuntu)
Date: Sun, 16 Jun 2019 10:53:09 GMT
Content-Type: text/html; charset=UTF-8
Transfer-Encoding: chunked
Connection: close
Set-Cookie: PHPSESSID=vrvrfqq88eck9speankj0ogus0; path=/
Pragma: no-cache
Vary: Cookie
Expires: Wed, 11 Jan 1984 05:00:00 GMT
Cache-Control: no-cache, must-revalidate, max-age=0
Link: <https://cuiqingcai.com/wp-json/>; rel="https://api.w.org/"

Parsing URLs

urllib.parse defines a standard interface for working with URLs. It supports URL schemes such as file, ftp, hdl, https, imap, mms, news, prospero, and telnet.

Examples

urlparse: splits a URL into its components

from urllib.parse import urlparse

# Split the URL into its components
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result, sep='\n')     # result is a ParseResult, which is a named tuple


# Output:
<class 'urllib.parse.ParseResult'>
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')


# scheme: protocol, netloc: domain, path: access path, params: parameters, query: query string (after ?), fragment: anchor (after #)
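
Since ParseResult is a named tuple, its fields can be read by name or by index, and it also offers a few convenience attributes. The following sketch illustrates this; the port number in the URL is an addition made for this example.

```python
from urllib.parse import urlparse

# Hypothetical URL with a port added to show the extra attributes.
result = urlparse('http://www.baidu.com:8080/index.html;user?id=5#comment')

# Fields can be read by name or by index.
print(result.scheme, result[0])   # http http
print(result.netloc)              # www.baidu.com:8080
print(result.hostname)            # www.baidu.com
print(result.port)                # 8080

# Named tuples are immutable; _replace() returns a modified copy,
# and geturl() reassembles the parts into a string.
secure = result._replace(scheme='https')
print(secure.geturl())  # https://www.baidu.com:8080/index.html;user?id=5#comment
```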

urlunparse: builds a URL from its components

from urllib.parse import urlunparse
# urllib.parse.urlunparse() takes an iterable whose length must be exactly 6

# Lists are used here, but a tuple or another suitable data structure works as well
data1 = ['http', 'www.baidu.com', '/index.html', 'user', 'id=5', 'comment']
data2 = ['', 'www.baidu.com', '/index.html', 'user', 'id=5', 'comment']
data3 = ['http', '', '/index.html', 'user', 'id=5', 'comment']
data4 = ['http', 'www.baidu.com', '', 'user', 'id=5', 'comment']
data5 = ['http', 'www.baidu.com', '/index.html', '', 'id=5', 'comment']
data6 = ['http', 'www.baidu.com', '/index.html', 'user', '', 'comment']
data7 = ['http', 'www.baidu.com', '/index.html', 'user', 'id=5', '']
print("Missing scheme:\t"+urlunparse(data2), "Missing domain:\t"+urlunparse(data3),
      "Missing path:\t"+urlunparse(data4), "Missing params:\t"+urlunparse(data5),
      "Missing query:\t"+urlunparse(data6), "Missing fragment:\t"+urlunparse(data7),
      "Complete URL:\t"+urlunparse(data1), sep='\n\n')
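
A quick way to see that urlunparse() is the inverse of urlparse() is a round trip, sketched below. The example.com URL in the second call is a placeholder.

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?id=5#comment'

parts = urlparse(url)          # six-field ParseResult
rebuilt = urlunparse(parts)    # a ParseResult is itself a valid 6-item iterable
print(rebuilt == url)          # True

# Any 6-item iterable works, for example a plain tuple.
print(urlunparse(('https', 'example.com', '/search', '', 'q=urllib', '')))
# https://example.com/search?q=urllib
```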

urlsplit: splits a URL into its components

from urllib.parse import urlsplit

result = urlsplit('http://www.baidu.com/index.html;user?id=5#comment')
print(result, result.scheme, result[4], sep='\n')


# Output:
SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='id=5', fragment='comment')
http
comment

# urlsplit() is very similar to urlparse(). The difference is that urlsplit() does not split params out: it leaves them inside path, whereas urlparse() returns path and params separately

urlunsplit: builds a URL from its components

from urllib.parse import urlunsplit
# urlunsplit() is similar to urlunparse(): it takes an iterable, but its length must be 5,
# because path and params are combined into one element

data = ('http', 'www.baidu.com', 'index.html;user', 'id=5', 'comment')
print(urlunsplit(data))

# Output:
http://www.baidu.com/index.html;user?id=5#comment

urljoin: joins a base URL and a link

from urllib.parse import urljoin

# Resolve a (possibly relative) link against a base URL. Parts missing from the link
# (scheme, netloc, path) are taken from the base; parts present in the link win.

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))


# Output:
http://www.baidu.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html
https://cuiqingcai.com/FAQ.html?question=2
https://cuiqingcai.com/index.php
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2
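
In crawling, urljoin() is mostly used to turn relative links found in a page into absolute URLs. A small sketch of the common cases, using a placeholder base URL:

```python
from urllib.parse import urljoin

base = 'http://example.com/a/b/c.html'   # placeholder base URL

# A bare file name replaces the last path segment of the base.
print(urljoin(base, 'd.html'))        # http://example.com/a/b/d.html
# '..' climbs one directory up.
print(urljoin(base, '../d.html'))     # http://example.com/a/d.html
# A leading '/' is resolved from the site root.
print(urljoin(base, '/d.html'))       # http://example.com/d.html
# A leading '//' keeps only the scheme of the base.
print(urljoin(base, '//cdn.example.com/x.js'))  # http://cdn.example.com/x.js
```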

urlencode: serializes a dict of key-value pairs into a URL query string

from urllib.parse import urlencode

params = {}
params['name'] = 'Tom'
params['age'] = 21

base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)

# Output:
http://www.baidu.com?name=Tom&age=21

parse_qs: if urlencode() is serialization, then parse_qs() is deserialization; it turns a query string back into a dict

from urllib.parse import parse_qs

query = 'name=Tom&age=21'
print(parse_qs(query))


# Output:
{'name': ['Tom'], 'age': ['21']}
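
parse_qs() returns a list for every key because a key may appear more than once in a query string. The sketch below shows a round trip with a repeated key; the parameter names are made up, and doseq=True tells urlencode() to expand list values into repeated keys.

```python
from urllib.parse import parse_qs, urlencode

params = {'q': 'a b', 'tag': ['x', 'y']}   # made-up parameters

query = urlencode(params, doseq=True)
print(query)             # q=a+b&tag=x&tag=y
print(parse_qs(query))   # {'q': ['a b'], 'tag': ['x', 'y']}

# Encoding the parsed dict again yields the same query string.
print(urlencode(parse_qs(query), doseq=True) == query)  # True
```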

parse_qsl: very similar to parse_qs(), except that parse_qsl() returns a list in which each element is a (key, value) tuple, while parse_qs() returns a dict
```
from urllib.parse import parse_qsl

query = 'name=Tom&age=21'
print(parse_qsl(query))


# Output:
[('name', 'Tom'), ('age', '21')]
```

quote: percent-encodes text for use in a URL

from urllib.parse import quote

keyword = '壁纸'
url = 'https://www.baidu.com/s?wd=' + quote(keyword)
print(url)


# Output:
https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8

unquote: decodes percent-encoded text

from urllib.parse import unquote

url = 'https://www.baidu.com/s?wd=%E5%A3%81%E7%BA%B8'
print(unquote(url))


# Output:
https://www.baidu.com/s?wd=壁纸
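
quote() leaves '/' unescaped by default and encodes a space as %20, which matters when building query strings by hand. The following sketch compares it with quote_plus() and the matching decoders; the sample text is arbitrary.

```python
from urllib.parse import quote, quote_plus, unquote, unquote_plus

text = 'a b/c'   # arbitrary sample text

print(quote(text))              # a%20b/c    ('/' is safe by default)
print(quote(text, safe=''))     # a%20b%2Fc  (nothing is treated as safe)
print(quote_plus(text))         # a+b%2Fc    (space becomes '+', as in form data)

print(unquote('a%20b%2Fc'))     # a b/c
print(unquote_plus('a+b%2Fc'))  # a b/c
print(unquote('a+b'))           # a+b        (plain unquote leaves '+' alone)
```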

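robotparser

urllib.robotparser.RobotFileParser(url='') uses a site's robots.txt file to decide whether a crawler is allowed to fetch a given page.

set_url(url): sets the URL of the robots.txt file
read(): fetches the robots.txt file and parses it
parse(lines): parses robots.txt content; the argument is a list of lines from the file
can_fetch(useragent, url): returns True if the given user agent is allowed to fetch the URL, otherwise False
mtime(): returns the time at which robots.txt was last fetched and parsed
modified(): sets the time at which robots.txt was last fetched and parsed to the current time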

Example 1:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://www.jianshu.com/robots.txt')                 # set the URL of the robots.txt file
rp.read()                   # fetch and parse the robots.txt file
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))           # whether this user agent may fetch the URL
print(rp.can_fetch('*', 'http://www.jianshu.com/search?q=python&page=1&type=collections'))


# Output:
False
False

# False means this user agent is not allowed to fetch the URL

Example 2:

from urllib.robotparser import RobotFileParser
from urllib.request import urlopen

rp = RobotFileParser()
rp.parse(urlopen('http://www.jianshu.com/robots.txt').read().decode('utf-8').split('\n'))
print(rp.can_fetch('*', 'http://www.jianshu.com/p/b67554025d7d'))
print(rp.can_fetch('*', 'http://www.jianshu.com/search?q=python&page=1&type=collections'))



# The output is the same as in the previous example; that one uses read(), this one uses parse()
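
Because parse() accepts plain lines, the rule matching can also be tried offline. The sketch below feeds it a made-up robots.txt (not the real rules of any site) and checks a few URLs on the placeholder host example.com.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt rules, used only for this illustration.
robots_lines = [
    'User-agent: *',
    'Disallow: /search',
    'Allow: /p/',
]

rp = RobotFileParser()
rp.parse(robots_lines)   # parse() also records the current time, like modified()

print(rp.can_fetch('*', 'http://example.com/p/b67554025d7d'))    # True
print(rp.can_fetch('*', 'http://example.com/search?q=python'))   # False
print(rp.can_fetch('*', 'http://example.com/about'))             # True, no rule matches
print(rp.mtime() > 0)                                            # True
```

Rules are checked in the order they appear and the first matching rule decides; a URL that matches no rule is allowed.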

Posted 2019-06-19 10:42 by LeeHua
