Python Web Scraping: Basic Libraries

3 Using the Basic Libraries

1) Using urllib

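urllib is Python's built-in HTTP request library. It contains four modules: request, error, parse, and robotparser.
urlopen()
- urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
  - url: the URL to open.
  - data: additional data to send to the server; defaults to None.
  - timeout: the timeout for the request, in seconds.
  - cafile and capath: cafile is a CA certificate file and capath is a directory of CA certificate files; both are used for HTTPS requests (deprecated since Python 3.6 in favor of context).
  - cadefault: deprecated.
  - context: an ssl.SSLContext instance that specifies the SSL settings.
- Example: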

import urllib.request

 

response=urllib.request.urlopen('https://www.python.org') # urlopen performs a basic GET request for a simple page

# print(response.read().decode('utf-8')) # read() returns the page content

# print(type(response))

print(response.status) # the response status code

print(response.getheaders()) # all response headers

print(response.getheader('Server')) # the Server response header

Request()
- urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)
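  - url: the URL to request.
  - data: must be of type bytes (a byte stream); if it is a dictionary, encode it first with urllib.parse.urlencode().
  - headers: a dictionary of request headers.
  - origin_req_host: the host name or IP address of the requester.
  - unverifiable: whether the request is unverifiable, meaning the user had no chance to approve it; defaults to False and rarely needs to be set.
  - method: the HTTP method to use, such as GET, POST, or PUT.
- Example 1: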

import urllib.request

 

request=urllib.request.Request('https://www.python.org') # wrap the URL in a Request object

response=urllib.request.urlopen(request)

print(response.read().decode('utf-8')) # read() returns the page content

- Example 2:

from urllib import request,parse

 

url='http://httpbin.org/post'

headers={

    'User-Agent':'Mozilla/4.0 (compatible;MSIE 5.5;Windows NT)',

    'Host':'httpbin.org'

}

form_data={

    'name':'Germey'

}

data=bytes(parse.urlencode(form_data),encoding='utf-8')

req=request.Request(url=url,data=data,headers=headers,method='POST')

response=request.urlopen(req)

print(response.read().decode('utf-8'))
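Before sending anything, it can help to check what a Request object actually holds. The sketch below rebuilds the request from Example 2 and inspects it without any network access; the name form_data and the printed checks are illustrative additions, not part of the original example.

```python
from urllib import parse, request

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible;MSIE 5.5;Windows NT)',
    'Host': 'httpbin.org',
}
form_data = {'name': 'Germey'}  # illustrative name for the form fields

# urlencode() turns the dict into 'name=Germey'; encode() makes it bytes
data = parse.urlencode(form_data).encode('utf-8')
req = request.Request(url=url, data=data, headers=headers, method='POST')

# Nothing has been sent yet; these calls only read the Request object
print(req.get_method())              # POST
print(req.data)                      # b'name=Germey'
print(req.get_header('User-agent'))  # header names are stored capitalized
print(req.host)                      # httpbin.org
```

If method is omitted, get_method() returns POST whenever data is not None and GET otherwise.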

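Advanced usage
- These features are built mainly on the Handler and Opener classes, for example HTTPCookieProcessor for handling cookies; urlopen() itself uses a ready-made Opener.
- Authentication, for a site that pops up a prompt asking for a username and password: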

from urllib.request import HTTPPasswordMgrWithDefaultRealm, HTTPBasicAuthHandler, build_opener

from urllib.error import URLError

 

username = '11459'

password = '114590'

url = 'https://zhxg.whut.edu.cn//yqtj/#/login'

 

p = HTTPPasswordMgrWithDefaultRealm()

p.add_password(None, url, username, password)

auth_handler = HTTPBasicAuthHandler(p) # instantiate the handler

opener = build_opener(auth_handler) # build the Opener

 

try:

    result = opener.open(url)

    html = result.read().decode('utf-8')

    print(html)

except URLError as e:

    print(e.reason)

Proxy:

from urllib.error import URLError

from urllib.request import ProxyHandler,build_opener

 

proxy_handler=ProxyHandler({

    'http':'http://127.0.0.1:9743',

    'https':'https://127.0.0.1:9743'

})

opener=build_opener(proxy_handler) # assumes a local proxy running on port 9743

 

try:

    response=opener.open('https://www.whut.edu.cn/')

    print(response.read().decode('utf-8'))

except URLError as e:

    print(e.reason)

Cookies

from urllib.error import URLError

import http.cookiejar,urllib.request



"""Print directly"""

# cookie=http.cookiejar.CookieJar()

# handler=urllib.request.HTTPCookieProcessor(cookie)

# opener=urllib.request.build_opener(handler)

# response=opener.open('https://baidu.com')

# for item in cookie:

#     print(item.name+" = "+item.value)



"""Save to a file"""

# filename='cookies.txt'

# cookie=http.cookiejar.MozillaCookieJar(filename) # save to a file; either MozillaCookieJar or LWPCookieJar can be used

# handler=urllib.request.HTTPCookieProcessor(cookie)

# opener=urllib.request.build_opener(handler)

# response=opener.open('https://baidu.com')

# cookie.save(ignore_discard=True,ignore_expires=True)



"""Load the file and use the cookies"""

cookie=http.cookiejar.MozillaCookieJar()

cookie.load('cookies.txt',ignore_discard=True,ignore_expires=True)

handler=urllib.request.HTTPCookieProcessor(cookie)

opener=urllib.request.build_opener(handler)

response=opener.open('https://baidu.com')

print(response.read().decode('utf-8'))

for item in cookie:

    print(item.name+" = "+item.value)
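The authentication, proxy, and cookie handlers shown above can also be combined into a single Opener. The following is a minimal sketch under assumed placeholder values (the localhost URL, credentials, and User-Agent string are hypothetical); it only builds the opener and makes no request.

```python
import http.cookiejar
from urllib.request import (HTTPBasicAuthHandler, HTTPCookieProcessor,
                            HTTPPasswordMgrWithDefaultRealm, ProxyHandler,
                            build_opener)

# Placeholder values for illustration only
password_mgr = HTTPPasswordMgrWithDefaultRealm()
password_mgr.add_password(None, 'http://localhost:5000/', 'username', 'password')

cookie_jar = http.cookiejar.CookieJar()
handlers = [
    HTTPBasicAuthHandler(password_mgr),
    ProxyHandler({'http': 'http://127.0.0.1:9743'}),
    HTTPCookieProcessor(cookie_jar),
]
opener = build_opener(*handlers)
# Headers sent with every request made through this opener
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (example)')]

# No request is made here; opener.open(url) would use all three handlers
print([type(h).__name__ for h in opener.handlers])
```

build_opener() also adds its default handlers, so the printed list is longer than the three passed in.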

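Handling exceptions
- URLError: the base exception class of the urllib.error module; its reason attribute gives the cause.
- HTTPError: a subclass of URLError raised for HTTP error responses, with attributes code, reason, and headers.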
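Because HTTPError is a subclass of URLError, the more specific exception has to be caught first. Here is a minimal sketch of that pattern; the fetch() helper and its opener parameter are hypothetical names introduced so the function can be tried without network access.

```python
from urllib import error, request


def fetch(url, opener=request.urlopen):
    """Return the decoded body, or None if the request failed.

    `opener` is a hypothetical hook added here so the function can be
    exercised without network access.
    """
    try:
        with opener(url) as response:
            return response.read().decode('utf-8')
    except error.HTTPError as e:
        # Must come first: HTTPError is a subclass of URLError
        print('HTTP error:', e.code, e.reason)
    except error.URLError as e:
        # No HTTP response at all (DNS failure, refused connection, ...)
        print('URL error:', e.reason)
    return None
```

Calling fetch('https://www.python.org') would use the real urlopen(); swapping the two except clauses would make the HTTPError branch unreachable.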
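Parsing URLs
- urlparse() parses a URL string
  - Signature: urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True). The parameters are the URL string, the default scheme used when the URL has none, and whether the fragment is parsed as a separate component.
  - It identifies a URL and splits it into its components.
  - Example: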

from urllib.parse import urlparse

 

result=urlparse('https://www.baidu.com/index.html;user?id=5#comment')

print(type(result),result)
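To make the six components and the two optional parameters concrete, the sketch below reads each field of the result and then tries scheme and allow_fragments. The second and third URLs are illustrative variations on the one used above.

```python
from urllib.parse import urlparse

result = urlparse('https://www.baidu.com/index.html;user?id=5#comment')
# ParseResult is a named tuple: fields are reachable by name or by index
print(result.scheme, result[0])   # https https
print(result.netloc)              # www.baidu.com
print(result.path)                # /index.html
print(result.params)              # user
print(result.query)               # id=5
print(result.fragment)            # comment

# scheme= is only a fallback for URLs that carry no scheme of their own
no_scheme = urlparse('www.baidu.com/index.html?id=5', scheme='https')
print(no_scheme.scheme, repr(no_scheme.netloc), no_scheme.path)

# allow_fragments=False leaves '#comment' attached to the preceding part
no_frag = urlparse('https://www.baidu.com/index.html?id=5#comment',
                   allow_fragments=False)
print(no_frag.query, repr(no_frag.fragment))
```

Note that without a leading // the host name ends up in path rather than netloc.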

urlunparse() builds a URL string from an iterable of exactly six components
- Example:

from urllib.parse import urlunparse

 

data=['http','www.baidu.com','index.html','user','a=6','comment']

print(urlunparse(data))
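A common use of parsing and rebuilding together is changing one component of an existing URL. The following sketch does this with the named tuple's _replace() and geturl() methods; the new query values are made up for illustration.

```python
from urllib.parse import urlencode, urlparse

parts = urlparse('http://www.baidu.com/index.html;user?a=6#comment')
# _replace() returns a copy of the named tuple with one field changed
updated = parts._replace(query=urlencode({'a': 7, 'page': 2}))
# geturl() reassembles the components, like urlunparse()
print(updated.geturl())
# http://www.baidu.com/index.html;user?a=7&page=2#comment
```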

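urlsplit() and urlunsplit() are similar to urlparse()/urlunparse(), but the result type differs: it has 5 elements (params stays inside path), whereas urlparse() returns 6
- Example: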

from urllib.parse import urlsplit

from  urllib.parse import urlunsplit

 

# urlsplit

result=urlsplit('http://www.baidu.com/index.html;user?a=6#comment')

print("Parsed URL:")

print(result) # length is 5

print(result.scheme,result[0])

 

# urlunsplit

data=['http','www.baidu.com','index.html','a=6','comment']  # length is 5

print("Built URL:")

print(urlunsplit(data))
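The difference between the two families is easiest to see side by side. A short sketch on the same URL:

```python
from urllib.parse import urlparse, urlsplit, urlunsplit

url = 'http://www.baidu.com/index.html;user?a=6#comment'
parsed = urlparse(url)  # 6 fields, params separated out
split = urlsplit(url)   # 5 fields, params stay in the path
print(len(parsed), parsed.path, parsed.params)  # 6 /index.html user
print(len(split), split.path)                   # 5 /index.html;user
print(urlunsplit(split) == url)                 # True
```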

urljoin()
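- Takes a base URL and a second URL, analyzes the scheme, netloc, and other parts of the base, and uses them to fill in whatever the second URL is missing, then returns the result. If both URLs provide the same component, the second one overrides the first.
- Example: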

from urllib.parse import urljoin

 

print(urljoin('http://www.baidu.com','fal.html'))
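The override rule is clearer with a few more cases. In this sketch the base URL and the second arguments are illustrative; each line shows which parts come from the base and which from the second URL.

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com/a/b.html?x=1'
# Relative path: replaces the last path segment of the base
print(urljoin(base, 'fal.html'))    # http://www.baidu.com/a/fal.html
# Absolute path: replaces the whole path
print(urljoin(base, '/fal.html'))   # http://www.baidu.com/fal.html
# Query only: keeps the base path, replaces the query
print(urljoin(base, '?y=2'))        # http://www.baidu.com/a/b.html?y=2
# Full URL: the base is ignored entirely
print(urljoin(base, 'https://www.python.org/about'))
# Scheme-relative URL: only the scheme is taken from the base
print(urljoin(base, '//www.python.org/about'))
```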

urlencode()
- Serializes a dictionary into the query string of a GET request
- Example:

from urllib.parse import urlencode

 

params={

    'name':'germey',

    'age':22

}

base_url='http://www.baidu.com?'

url=base_url+urlencode(params)

print(url)

parse_qs()
- The inverse of urlencode(): converts GET request parameters back into a dictionary
- Example:

from urllib.parse import parse_qs

 

query='name=germey&age=22'

print(parse_qs(query))

parse_qsl()
- Converts the parameters into a list of tuples
- Example:

from urllib.parse import parse_qsl

 

query='name=germey&age=22'

print(parse_qsl(query))
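Putting urlencode(), parse_qs(), and parse_qsl() together shows how they relate and why parse_qs() returns lists. The tag key in the second half is a made-up example.

```python
from urllib.parse import parse_qs, parse_qsl, urlencode

params = {'name': 'germey', 'age': 22}
query = urlencode(params)
print(query)              # name=germey&age=22
print(parse_qs(query))    # {'name': ['germey'], 'age': ['22']}
print(parse_qsl(query))   # [('name', 'germey'), ('age', '22')]

# Repeated keys are why parse_qs() wraps every value in a list
print(parse_qs('tag=a&tag=b'))                     # {'tag': ['a', 'b']}
# doseq=True makes urlencode() expand list values the same way
print(urlencode({'tag': ['a', 'b']}, doseq=True))  # tag=a&tag=b
```

The round trip is not exact: the integer 22 comes back as the string '22'.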

quote/unquote
- quote() converts content into URL-encoded form, mainly for encoding Chinese characters; unquote() decodes a URL
- Example:

from urllib.parse import quote,unquote

 

keyword='壁纸'

url='https://www.baidu.com/s?wd='+quote(keyword)

print(url)

print(unquote(url))
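The sketch below shows the encoded form of the keyword on its own, plus how the safe parameter and quote_plus() treat spaces and slashes. The string 'a b/c' is an arbitrary test value.

```python
from urllib.parse import quote, quote_plus, unquote

print(quote('壁纸'))                  # %E5%A3%81%E7%BA%B8
print(unquote('%E5%A3%81%E7%BA%B8'))  # 壁纸
print(quote('a b/c'))                 # a%20b/c  ('/' is safe by default)
print(quote('a b/c', safe=''))        # a%20b%2Fc
print(quote_plus('a b/c'))            # a+b%2Fc
```

quote_plus() matches how HTML forms encode values, so it is usually the better fit for query-string values, while quote() suits path segments.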

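Analyzing the Robots protocol
- Robots protocol: also called the crawler protocol or robot protocol; its full name is the Robots Exclusion Protocol. It tells crawlers and search engines which pages may be fetched and which may not, usually through a robots.txt file placed in the site's root directory. User-agent: * applies the rules to all crawlers; Disallow: / forbids crawling every directory; Allow: /public/ permits crawling only the public directory.
- robotparser parses robots.txt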

from urllib.robotparser import RobotFileParser

import time

 

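rp=RobotFileParser() # create a parser object

rp.set_url('http://www.jianshu.com/robots.txt') # set the robots.txt URL

rp.read() # fetch and parse the robots.txt file

print(rp.can_fetch('*','http://www.jianshu.com/p/b/b67554025d7d')) # can_fetch checks whether crawler * may fetch the given page

print(rp.mtime()) # mtime() returns the time robots.txt was last fetched and parsed

rp.modified() # set the last-fetched time of robots.txt to the current time

print(rp.mtime()) # printed as a number of seconds; it needs converting to a readable time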

tupTime = time.localtime(rp.mtime())

standardTime = time.strftime("%Y-%m-%d %H:%M:%S", tupTime)

print(standardTime)
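RobotFileParser can also parse rules handed to it directly through parse(), which makes it possible to try out the User-agent, Allow, and Disallow rules without fetching anything. This is a minimal sketch with a made-up robots.txt and example.com URLs. Python's parser applies the first rule that matches a path, so Allow: /public/ is placed before Disallow: / here.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, parsed without any network access
robots_txt = """\
User-agent: *
Allow: /public/
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())  # parse() takes an iterable of lines

print(rp.can_fetch('*', 'http://example.com/public/index.html'))  # True
print(rp.can_fetch('*', 'http://example.com/private/data.html'))  # False
```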

2) Using Requests

GET requests
- Basic usage, example:

import requests

 

data={

    'name':'germey',

    'age':22

}

r=requests.get('https://www.httpbin.org/get',params=data) # GET request

print(type(r))

print(r.status_code) # the status code

print(r.cookies) # the cookies

print(type(r.text)) # the body as text (str)

print(r.text)

print(r.json()) # json() parses the body and returns a dict

print(type(r.json()))
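To see what the params argument does to the URL without contacting httpbin, the request can be built and prepared locally. A minimal sketch, assuming the requests package is installed:

```python
import requests

data = {'name': 'germey', 'age': 22}
# prepare() builds the final request without sending it
prepared = requests.Request('GET', 'https://www.httpbin.org/get', params=data).prepare()
print(prepared.method)  # GET
print(prepared.url)     # https://www.httpbin.org/get?name=germey&age=22
```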

Scraping a web page

import requests

import re

 

headers={

    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36'

}

r=requests.get('https://www.zhihu.com/explore',headers=headers) # User-Agent identifies the browser; without it Zhihu blocks the request

print(r.text)

pattern=re.compile('explore-feed.*?question_link.*?>(.*?)</a>',re.S)

titles=re.findall(pattern,r.text)

print(titles)

Fetching binary data

import requests

 

r=requests.get('https://www.baidu.com/favicon.ico')

# print(r.text)

# print(r.content) # bytes; the printed value starts with b



# The code below saves the file; after it runs, favicon.ico appears in the current working directory

with open('favicon.ico','wb') as f:

    f.write(r.content)

POST requests

import requests

 

data={'name':'germey','age':22}

r=requests.post("https://httpbin.org/post",data=data)

print(type(r.status_code),r.status_code)

print(type(r.headers),r.headers)

print(type(r.cookies),r.cookies)

print(type(r.url),r.url)

print(type(r.history),r.history)

print(r.text)

 

if r.status_code==requests.codes.ok:

    # ok stands for 200; many other named status codes exist and can be looked up

    print("Request Successfully!")
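A few of the other named status codes, as a quick offline sketch (assuming the requests package is installed):

```python
import requests

# requests.codes maps readable names to numeric HTTP status codes
print(requests.codes.ok)                     # 200
print(requests.codes.not_found)              # 404
print(requests.codes.internal_server_error)  # 500
print(requests.codes['moved_permanently'])   # 301
```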

Advanced usage
- File upload:

import requests

 

files={'file':open('favicon.ico','rb')}

r=requests.post("https://httpbin.org/post",files=files)

print(r.text)

Cookies: they can be read from r.cookies and set through headers
Session persistence: Session
- Test:

import requests

 

s=requests.Session()

s.get('http://httpbin.org/cookies/set/number/123456789')

r=s.get('http://httpbin.org/cookies')

print(r.text)

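SSL certificate verification:
- The verify parameter of requests defaults to True, which checks the server's CA certificate. To visit a site whose certificate cannot be verified, set it to False.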

import requests

 

r=requests.get('https://www.12306.cn',verify=False) # the SSL error no longer reproduces here: 12306 now has a CA-signed certificate

print(r.status_code)

Proxy settings:

import requests

 

proxies={

    "http":"http://10.10.1.10:3128",

    "https":"https://10.10.1.10:3128"

}

 

# If the proxy requires HTTP Basic Auth

# proxies={

#     "http":"http://user:password@10.10.1.10:3128/",

# }

 

# requests supports SOCKS proxies; install the extra with: pip install 'requests[socks]'

# proxies={

#     "http":"socks5://user:password@host:port",

#     "https":"socks5://user:password@host:port"

# }

 

requests.get("https://www.taobao.com",proxies=proxies)

Timeout settings

import requests

 

r=requests.get("https://www.github.com/",timeout=1) # raises an exception if there is no response within 1 second

print(r.status_code)

Authentication

import requests

from requests.auth import HTTPBasicAuth

 

r=requests.get("https://www.github.com/",auth=HTTPBasicAuth('username','password'))

# Alternatively, skip the HTTPBasicAuth import and pass a tuple directly

# r=requests.get("https://www.github.com/",auth=('username','password'))

# OAuth authentication is also available; it requires installing requests_oauthlib

print(r.status_code)

Prepared Request
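- A Prepared Request is a data structure that holds every parameter of a request, so a request can be built and handled as an independent object before it is sent.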

from requests import Request,Session

 

url='http://httpbin.org/post'

data={

    'name':'germey'

}

headers={

    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.164 Safari/537.36'

}

s=Session()

req=Request('POST',url,data=data,headers=headers)

prepped=s.prepare_request(req)

r=s.send(prepped)

print(r.text)
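Since a prepared request is an ordinary object, its final method, URL, headers, and body can be inspected before anything is sent. The following is a minimal sketch of that, assuming the requests package is installed; the User-Agent value demo-agent/0.1 is a made-up placeholder, and s.send() is deliberately not called.

```python
from requests import Request, Session

url = 'http://httpbin.org/post'
data = {'name': 'germey'}
headers = {'User-Agent': 'demo-agent/0.1'}  # placeholder value

s = Session()
req = Request('POST', url, data=data, headers=headers)
# prepare_request() merges session settings and encodes the body
prepped = s.prepare_request(req)

print(prepped.method)                   # POST
print(prepped.url)                      # http://httpbin.org/post
print(prepped.body)                     # name=germey
print(prepped.headers['Content-Type'])  # application/x-www-form-urlencoded
print(prepped.headers['User-Agent'])    # demo-agent/0.1
```

Calling s.send(prepped) afterwards would transmit exactly what was inspected.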

3) Regular expressions

match(regex_str,content,re.S)

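# re.S makes . match newlines as well; several similar flags exist and can be looked up

import re



content='Hello 123 4567 World_This is a Regex Demo'

print(len(content))

result=re.match(r'^Hello\s\d{3}\s\d{4}\s\w{10}',content)

print(result) # the match object

print(result.group()) # the matched text

print(result.span()) # the position of the match in the original string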
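The effect of re.S is easiest to see on a string that contains a newline. In this sketch the content string and the pattern are illustrative; the same pattern fails without the flag and succeeds with it.

```python
import re

content = 'Hello 1234567 World_This\nis a Regex Demo'

# Without re.S, '.' stops at the newline, so the match fails
print(re.match(r'^He.*?(\d+).*?Demo$', content))  # None

# With re.S, '.' also matches '\n'
result = re.match(r'^He.*?(\d+).*?Demo$', content, re.S)
print(result.group(1))  # 1234567
print(result.span())    # covers the whole string
```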

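search()
- match() fails immediately if the start of the string does not match; search() scans the whole string and returns the first match of the regex.
findall()
- Scans the whole string and returns every match of the regex.
sub()
- Replaces every match in the string. The first argument is regex_str, the second is the replacement string, and the third is the original string; passing an empty replacement removes the matches.
compile()
- Compiles a regex string into a regular expression object, which can then be passed to match() and the other functions in place of regex_str.

pattern=re.compile(r'\d{2}:\d{2}')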
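The four functions can be compared on one string. A short sketch, using a made-up content string that contains two timestamps:

```python
import re

content = 'Posted 2021-07-25 20:46, edited 2021-07-26 09:30'

# match() only succeeds at the very start of the string
print(re.match(r'\d{2}:\d{2}', content))           # None
# search() returns the first match found anywhere
print(re.search(r'\d{2}:\d{2}', content).group())  # 20:46
# findall() returns every match as a list
print(re.findall(r'\d{2}:\d{2}', content))         # ['20:46', '09:30']
# sub() replaces every match; here the times are removed
print(re.sub(r'\s\d{2}:\d{2}', '', content))
# Posted 2021-07-25, edited 2021-07-26

# compile() builds a reusable pattern object with the same methods
pattern = re.compile(r'\d{2}:\d{2}')
print(pattern.findall(content))                    # ['20:46', '09:30']
```

Compiling is mostly a convenience when the same pattern is applied many times.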

posted @ 2021-07-25 20:46 未名w
