urllib的使用

urllib是Python内置的用于处理URL操作的模块。它提供了很多功能，包括访问和处理URL内容、解析URL等。

1. 安装 urllib

urllib是Python标准库的一部分，因此无需单独安装。Python2分为urllib和urllib2，Python3合并为urllib

2. 导入 urllib

urllib实际上包括几个子模块：

urllib.request ：最基本的HTTP请求模块，可以模拟请求的发送。用于打开和读取URL
urllib.parse ：工具模块。用于处理URL，如拆分、解析、合并等
urllib.error ：异常处理模块。出现请求异常可以捕获异常
urllib.robotparser ：解析网站的robots.txt文件，判断网站是否允许被爬取

发送请求

urlopen

urllib.request提供了最基本的够早HTTP请求的方法，可以模拟浏览器发起请求

import urllib.request

response = urllib.request.urlopen('https://www.python.org')
content = response.read()
html = content.decode('utf-8')
print(html)
print(type(response))

报错：UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte

解决方法：https://www.python.org 的内容很可能被 GZIP 压缩了，因此直接尝试以 UTF-8 解码会失败。为了解决这个问题，可以使用 urllib.request 的 Request 类和 urlopen 函数，

以及 gzip 模块来自动处理 GZIP 压缩的内容。但更简单的方法是使用 requests 库，它会自动处理这些常见问题。

方法一：使用request库
import requests  
  
response = requests.get('https://www.python.org')  
html = response.text  # requests 会自动处理编码和解压  
print(html)  
print(type(response))
方法二：仍使用 urllib.request 并且手动处理 GZIP 解压
import urllib.request  
import gzip  
  
request = urllib.request.Request('https://www.python.org', headers={'Accept-Encoding': 'gzip'})  
response = urllib.request.urlopen(request)  
  
# 检查是否需要进行 GZIP 解压  
if response.info().get('Content-Encoding') == 'gzip':  
    html = gzip.decompress(response.read()).decode('utf-8')  
else:  
    html = response.read().decode('utf-8')  
  
print(html)  
print(type(response))

data参数：可选参数，添加该参数需使用bytes方法将参数转换为字节流编码格式的内容，即bytes类型，请求参数是POST，不再是GET。

import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'name': 'germey'}), encoding='utf-8')
response = urllib.request.urlopen('https://www.httpbin.org/post', data=data)
print(response.read().decode('utf-8'))

timeout参数：设置超时时间，到时没响应抛出异常

设置超时时间为0.1，查看运行结果

import urllib.request

response = urllib.request.urlopen('https://www.httpbin.org/get', timeout=0.1)
print(response.read())

抛出超时异常

import urllib.request
import urllib.error
import socket

try:
    response = urllib.request.urlopen('https://www.httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print("超时")

Request

urlopen可发起基本请求，若需在请求中加入headers等信息。则需要使用Request类来构建请求

import urllib.request
# 请求的参数为Request类型的对象
request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))

构造Request类及相关参数

class urllib.request.Request(url, data=None, headers=headers, method='GET', origin_req=req, universal_newlines=True):、

url:请求的目标URL（字符串）。必须参数，表示要访问的资源地址。

data:若要传参数必须是bytes类型。如果是数据字典可使用urllib.parse模块里的urlencode方法进行编码

headers:请求头信息（字典）。可选参数，用于指定HTTP头，如User-Agent、Content-Type等。默认是空字典。可以通过headers参数直接构造，也可调用请求实例add_header方法添加

method:请求方法（字符串）。可选参数，如GET、POST、PUT、DELETE等。如果不指定，则默认为GET（或在设置data时为POST）。

origin_req_host：指的是请求方的host名称或者IP地址

unverifiable：表示请求是否无法验证，默认是False，意思是用户没有足够权限来接收这个请求的结果。这种请求通常是通过第三方发起的，不是由用户直接发起的。这种情况常见于在浏览器环境中，比如当页面加载外部资源时。

传入多个参数构建Request类

from urllib import request, parse

url = 'https://www.httpbin.org/post'
headers = {
    "User-Agent": 'Mozilla/4.0 (compatible; MSIE 5.5 Windows NT)',
    "Host": "www.httpbin.org"
}
dict = {'name': 'germey'}
data = bytes(parse.urlencode(dict), encoding='utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))

Handler的使用

Handler可理解为处理各种的处理器

利用Handler类构建Opener类

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener
from urllib.error import URLError

username = 'username'
password = 'password'
url = "https://ssr3.scrape.center/"

p = HTTPPasswordMgrWithDefaultRealm()
# 实例化HTTPBasicAuthHandler对象auth_handler，参数为HTTPPasswordMgrWithDefaultRealm
# 使用add_password方法添加用户名和密码，建立了一个用来处理验证的Handler类
p.add_password(None, url, username, password)
auth_handler = HTTPBasicAuthHandler(p)
# 将auth_handler当做参数传入到build_opener方法
# 构建一个Opener在发送请求的时就相当于验证成功
opener = build_opener(auth_handler)

try:
    # 使用Opener中的open方法打开链接
    result = opener.open(url)
    html = result.read().decode('utf-8')
    print(html)
except URLError as e:
    print(e.reason)

代理

from urllib.error import URLError
from urllib.request import ProxyHandler, build_opener

proxy_handler = ProxyHandler({
    'http': 'http://127.0.0.1:8080',
    'https': 'https://127.0.0.1:8080'
})

opener = build_opener(proxy_handler)
try:
    response = opener.open('https://www.baidu.com')
    print(response.read().decode('utf-8'))
except URLError as e:
    print(e.reason)

import http.cookiejar, urllib.request, urllib.parse, urllib.error

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)

输出文件格式的内容，将Cookies保存成Mozilla型浏览器的Cookie格式

import http.cookiejar, urllib.request, urllib.parse, urllib.error

filename = 'cookie.txt'
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)

LWPCookieJar也可读取和保存Cookie，会被保存成LWP模式（libwww-perl）格式

cookie = http.cookiejar.LWPCookieJar(filename)

import urllib.request, http.cookiejar

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie2.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('https://www.baidu.com')
print(response.read().decode('utf-8'))

处理异常

URLError

来自urllib库的error模块，继承自OSError模块，是error异常模块的基类，由request模块产生的异常都可以通过捕获这个类来处理。

reason属性

from urllib import request, error

try:
    response = request.urlopen('http://www.dailymile.com/404notfound')
except error.URLError as e:
    print(e.reason)

HTTPError

URLError的子类，专门处理HTTP请求错误，如认证失败等。有以下三个属性。

code：这个属性返回HTTP状态码。HTTP状态码由三位十进制数字组成，出现在由HTTP服务器发送的响应的第一行。状态码分五种类型，由它们的第一位数字表示：4XX代表客户端错误，而5XX代表服务器错误。例如，404表示网页不存在，500表示服务器内部错误。

reason：这个属性返回错误的原因。与父类URLError一样，HTTPError的reason属性用于解释为何会出现HTTP错误。例如，当HTTP状态码为403时，reason可能返回“Forbidden”，表示服务器禁止访问。

headers：这个属性返回请求头。它包含了关于HTTP请求或响应的各种元信息。

在处理HTTPError时，可以通过访问这些属性来获取关于错误的详细信息，并据此采取适当的处理措施。同时，由于HTTPError是URLError的子类，因此在捕获错误时，通常会先捕获子类的错误（如HTTPError），然后再捕获父类的错误（如URLError）。

使用HTTPError捕获异常

from urllib import request, error

try:
    response = request.urlopen('http://www.dailymile.com/404notfound')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep="\n")

URLError是HTTPError的父类，可以先捕获子类的错误，再捕获父类，如下是更好的写法。先捕获HTTPError，获取错误原因、状态码、请求头等信息。若不是则捕获URLError异常，输出错误原因。

from urllib import request, error

try:
    response = request.urlopen('http://www.dailymile.com/404notfound')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep="\n")
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successfully')

有时reson属性返回的不一定是字符串，也可能是一个对象。

import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print("超时")

解析链接

urllib库提供了parse模块，定义了处理URL的标准接口，例如实现URL的各部分抽取、合并以及链接转换。

urlparse：可以实现URL的识别和分段

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')
print(type(result))
print(result)

在 Python 的 urllib.parse 模块中，urlparse 方法通常接受一个 URL 字符串作为其主要参数，但也可以接受一个可选的 scheme、allow_fragments 参数。

urlstring：必选项，待解析的url

scheme：(可选的位置参数，但在现代用法中通常作为关键字参数传递):

默认的 URL scheme（如 'http', 'https', 'ftp' 等）。如果 URL 字符串中没有 scheme，并且提供了此参数，那么该 scheme 将被使用。但在现代 Python 版本中，通常将此参数作为关键字参数传递，而不是位置参数。

allow_fragments (关键字参数):

一个布尔值，指定 URL 是否可以包含片段（即 '#' 后面的部分）。默认为 True，表示 URL 可以包含片段。如果设置为 False，并且 URL 包含片段，那么片段部分将被视为路径的一部分。

scheme的使用：

from urllib.parse import urlparse

result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme="https")
print(result)

allow_fragments：

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result)

ParseResult是一个元组，既可以用属性名获取，可以用索引顺序获取

from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)
print(result.scheme, result[0], result.netloc, result.path, result.query, result.fragment)

urlunparse

用于构造URL。该方法收集的参数是一个可迭代对象，其长度必须为6，否则会抛出参数数量不足或者过多的问题

from urllib.parse import urlunparse

data = ['https', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))

urlsplit

和url方法相似，不再单独解析params这一部分（合并到path中）

from urllib.parse import urlsplit, urlunsplit

result = urlsplit('https://www.baidu.com/index.html;user?id=5#comment')
print(result)

urljoin

可以提供一个base_url(基础链接)作为该方法的第一个参数,新链接作为第二个参数。urljoin会分析base_url的scheme、netloc和path这三个内容，并对新链接缺失的部分进行补充。

from urllib.parse import urljoin

# 基础 URL
base_url = "https://www.example.com"

# 示例 1: 合并一个相对路径
relative_path_1 = "/about"
full_url_1 = urljoin(base_url, relative_path_1)
print(full_url_1)  # 输出: https://www.example.com/about

# 示例 2: 合并一个带有查询参数的相对路径
relative_path_2 = "/search?q=python"
full_url_2 = urljoin(base_url, relative_path_2)
print(full_url_2)  # 输出: https://www.example.com/search?q=python

# 示例 3: 合并一个包含协议和域名的完整 URL
full_url_3 = urljoin(base_url, "https://www.another-example.com")
print(full_url_3)  # 输出: https://www.another-example.com （注意：如果第二个参数是一个完整的 URL，它会被直接返回）

# 示例 4: 合并一个仅包含域名的 URL
relative_domain_4 = "//www.another-example.com/contact"
full_url_4 = urljoin(base_url, relative_domain_4)
print(full_url_4)  # 输出: //www.another-example.com/contact （注意：双斜杠表示协议与基础 URL 相同，但域名不同）

# 示例 5: 合并一个仅包含路径的片段（没有开头的斜杠）
relative_path_5 = "news/latest"
full_url_5 = urljoin(base_url, relative_path_5)
print(full_url_5)  # 输出: https://www.example.com/news/latest

urlencode

构造GET请求参数时常用，将字典转换为GET请求参数

from urllib.parse import urlencode

params = {
    'name': 'germey',
    'age': 25
}

base_url = "https://www.baidu.com?"
url = base_url + urlencode(params)
print(url)

parse_qs

反序列化，可将一串GET请求参数转回字典

from urllib.parse import parse_qs

query = 'name=gerey&age=25'
print(parse_qs(query))

parse_qsl

将参数转换为由元组组成的列表

from urllib.parse import parse_qsl

query = 'name=gerey&age=25'
print(parse_qsl(query))

quote

将URL转换为URL编码的格式。当URL中存在中文时可能会乱码，可以使用quote将中文转换成URL编码

from urllib.parse import quote

keyword = '关键字'
url = 'http://www.baidu.com/s?wd=' + quote(keyword)
print(url)

unquote

进行url进行编解码

from urllib.parse import unquote

URL = 'http://www.baidu.com/s?wd=%E5%85%B3%E9%94%AE%E5%AD%97'
print(unquote(URL))

分析Robots协议

Robots协议

Robots协议（也称为爬虫协议、机器人协议等），全称“网络爬虫排除标准”（Robots Exclusion Protocol），是一种用于规范网络爬虫行为的协议。通过Robots协议，网站管理员可以告诉网络爬虫哪些页面可以被爬取，哪些页面应该被忽略。

User-agent: 这一行用于定义应用该规则的搜索引擎爬虫的名称。可以指定具体的爬虫名称，如“Googlebot”或“Baiduspider”，也可以使用“*”通配符来代表所有爬虫。

User-agent: *

表示以下规则适用于所有搜索引擎爬虫。

Disallow: 这一行用于定义不允许爬虫访问的URL路径。可以指定具体的目录或文件，也可以使用通配符来匹配多个URL。

Disallow: /private/  
Disallow: /admin/  
Disallow: /cgi-bin/*.php

/private/ 和 /admin/ 目录下的所有内容都将被禁止访问。同时，/cgi-bin/ 目录下所有以 .php 结尾的文件也将被禁止访问。

Allow: 这一行（尽管不是robots.txt的标准部分，但一些搜索引擎爬虫会支持）用于定义允许爬虫访问的URL路径。如果同时存在Disallow和Allow规则，并且它们针对相同的URL路径，那么Allow规则将覆盖Disallow规则。

User-agent: Googlebot  
Allow: /some-directory/  
Disallow: /some-directory/private/

Googlebot被允许访问/some-directory/目录下的内容，但/some-directory/private/目录下的内容除外

注释: 你可以使用#来添加注释，注释内容将被搜索引擎爬虫忽略。

# This is a sample robots.txt file.  
# All robots are welcome to index our site except for the private area.  
  
User-agent: *  
Disallow: /private/

空行和换行: 在robots.txt文件中，空行和换行通常被忽略。

文件大小: robots.txt文件的大小应尽可能小，以减少对服务器带宽的占用。同时，也应避免在文件中包含大量复杂的规则，以提高爬虫解析的效率。

文件位置: robots.txt文件必须位于网站的根目录下（即与网站的索引文件，如index.html或index.php等位于同一目录下），并且文件名必须全部小写。

爬虫名称

Googlebot：这是谷歌搜索引擎的爬虫。Googlebot从谷歌的网站索引和新闻索引中抓取网页，用于谷歌搜索引擎的搜索结果。

Baiduspider：这是百度搜索引擎的爬虫，也称为百度蜘蛛。Baiduspider会从互联网上抓取网页，用于百度搜索引擎的搜索结果。

Sogouspider：这是搜狗搜索引擎的爬虫。搜狗蜘蛛有多种，包括Sogou web spider、Sogou inst spider、Sogou spider2、Sogou blog、Sogou News Spider、Sogou Orion spider等，它们共同用于搜狗搜索引擎的网页抓取。

Sosospider：这是腾讯SOSO搜索引擎的爬虫。SOSO蜘蛛也会不断地从互联网上抓取网页，以更新SOSO搜索引擎的搜索结果。

YodaoBot：这是有道搜索引擎的爬虫。有道蜘蛛会从互联网上抓取网页，以支持有道搜索引擎的搜索服务。

robotparser

可以使用robotparser模块来解析robots.txt文件。提供了RobotFileParser，可以根据robots.txt文件判断爬虫是否有权限爬取。

在构造方法里传入robots.txt文件的链接

urllib.robotparser.RobotFileParser(url='')

RobotFileParser的常用方法：

set_url(url)：设置robots.txt文件的URL。如果在创建RobotFileParser对象时未传入URL，可以使用此方法设置。

read()：读取robots.txt文件并进行分析。这个方法不会直接返回结果，但会执行读取文件的操作。在调用can_fetch()之前，必须先调用此方法，否则后续的can_fetch()判断将始终返回False。

can_fetch(useragent, url)：判断指定的user-agent是否可以访问指定的URL。根据已解析的robots.txt文件中的规则进行判断，并返回True或False。

mtime()：返回上次抓取和分析robots.txt文件的时间。这对于长时间运行的网络爬虫来说可能很有用，可以用来判断是否需要重新抓取和分析robots.txt文件。

modified()：将当前时间设置为上次抓取和分析robots.txt文件的时间。这通常用于手动更新mtime()的返回值，以便爬虫知道何时需要再次检查robots.txt文件。

parse(lines)：从给定的行参数中解析robots.txt文件的内容。这通常用于已经以某种方式获取了robots.txt文件内容的情况，而不是直接从网络上获取。

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url('http://www.baidu.com/robots.txt')
rp.read()
print(rp.can_fetch('Baiduspider', 'https://www.baidu.com/'))
print(rp.can_fetch('Baiduspider', 'https://www.baidu.com/homepage'))
print(rp.can_fetch('Googlebot', 'https://www.baidu.com/homepage/'))

使用parse方法执行对robots.txt文件的读取和分析

from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.parse(urlopen('http://www.baidu.com/robots.txt').read().decode('utf-8').split('\n'))
print(rp.can_fetch('Baiduspider', 'https://www.baidu.com/'))
print(rp.can_fetch('Baiduspider', 'https://www.baidu.com/homepage'))
print(rp.can_fetch('Googlebot', 'https://www.baidu.com/homepage/'))

posted @ 2024-05-23 00:37 JJJhr 阅读(34) 评论(0) 编辑收藏举报

刷新页面返回顶部

JJJhr'blog

urllib的使用