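Web Crawlers

What is a crawler
   A crawler is an automated program that sends requests to websites and extracts data from the responses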
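Basic crawler workflow
   1. Send a request to the server
       Use an HTTP library to send a Request to the target site; the request can carry extra information such as headers. Then wait for the server to respond
   2. Get the response content
       If the server responds normally, you receive a Response whose content is the page you asked for. It may be HTML, a JSON string, or binary data (such as images and video)
   3. Parse the content
       HTML can be parsed with regular expressions or a page-parsing library. JSON can be converted directly into a JSON object. Binary data can be saved or processed further
   4. Save the data
       The data can be saved in many forms: as text, in a database, or as a file in a specific format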
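
The four steps above can be sketched with the standard library alone. This is a minimal illustration rather than a production crawler: the helper names (fetch, parse_title, save, crawl), the User-Agent string, the UTF-8 decoding, and the choice to extract only the page title are all assumptions made for the example.

```python
import json
import re
from urllib.request import Request, urlopen


def fetch(url, timeout=10):
    """Steps 1 and 2: send the request and return the response body as text."""
    # The User-Agent value is a made-up example, not something a site requires.
    req = Request(url, headers={"User-Agent": "example-crawler/0.1"})
    with urlopen(req, timeout=timeout) as resp:
        # Assumes the page is UTF-8; undecodable bytes are replaced.
        return resp.read().decode("utf-8", errors="replace")


def parse_title(html):
    """Step 3: pull the <title> text out of the HTML with a regular expression."""
    match = re.search(r"<title[^>]*>(.*?)</title>", html, re.I | re.S)
    return match.group(1).strip() if match else None


def save(record, path):
    """Step 4: store the extracted data as a JSON text file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(record, f, ensure_ascii=False)


def crawl(url, path, fetcher=fetch):
    """Run the whole pipeline; fetcher can be swapped out for offline experiments."""
    html = fetcher(url)
    record = {"url": url, "title": parse_title(html)}
    save(record, path)
    return record
```

Calling crawl('http://httpbin.org/html', 'page.json') would perform a real request. Passing a different fetcher makes the parsing and saving steps easy to try without network access.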
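What are Request and Response
   1. The browser sends a message to the server that hosts the URL. This step is the HTTP Request
   2. The server receives the message, processes it according to its content, and sends a message back to the browser. This step is the HTTP Response
   3. The browser receives the server's Response, processes it, and displays the result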
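What a Request contains
   1. Request method
       Mainly GET and POST; others include HEAD, PUT, DELETE, and OPTIONS
   2. Request URL
       URL stands for Uniform Resource Locator; a web page, an image, or a video can each be uniquely identified by a URL
   3. Request headers
       Header information sent with the request, such as User-Agent, Host, and Cookie
   4. Request body
       Additional data carried by the request, such as the form data in a form submission
       A GET request normally carries no request body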
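
To make the four parts of a request concrete, the sketch below builds request objects with the standard library's urllib.request.Request and inspects them without sending anything. The User-Agent value and the form fields are example values chosen for illustration.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Request body: form data encoded as bytes (example fields only).
form = urlencode({"name": "xdl", "age": 25}).encode("utf-8")

post_req = Request(
    "http://httpbin.org/post",                      # request URL
    data=form,                                      # request body
    headers={"User-Agent": "example-crawler/0.1"},  # request headers
    method="POST",                                  # request method
)

# A GET request: the parameters travel in the URL and there is no body.
get_req = Request("http://httpbin.org/get?name=xdl&age=25")

for req in (post_req, get_req):
    print(req.get_method(), req.full_url, req.header_items(), req.data)
```

Nothing is sent over the network here; the objects only describe what would be sent if they were passed to urllib.request.urlopen.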
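What a Response contains
   1. Status code
       There are many status codes, such as 200 for success, 301 for a permanent redirect, 404 for page not found, and 502 for a bad gateway (a server-side error)
   2. Response headers
       Such as content type, content length, server information, and Set-Cookie
   3. Response body
       The most important part; it holds the requested resource, such as the page HTML, an image, or other binary data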
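
The status codes mentioned above can be looked up with the standard library's http.HTTPStatus enum. The describe helper and its category labels are assumptions made for this sketch; only the enum itself comes from Python.

```python
from http import HTTPStatus


def describe(code):
    """Return the code, its standard reason phrase, and a rough category."""
    status = HTTPStatus(code)  # raises ValueError for codes the enum does not know
    if 200 <= code < 300:
        kind = "success"
    elif 300 <= code < 400:
        kind = "redirect"
    elif 400 <= code < 500:
        kind = "client error"
    elif code >= 500:
        kind = "server error"
    else:
        kind = "informational"
    return f"{code} {status.phrase} ({kind})"


for code in (200, 301, 404, 502):
    print(describe(code))
```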
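What data can be crawled
   1. Web page text
       Such as HTML text and JSON text (for example the JSON returned by Ajax requests)
   2. Images
       Fetched as binary data and saved in an image format
           1. Fetch the image's binary data from its absolute URL
               resp = requests.get('https://www.baidu.com/img/bd_logo1.png')
           2. Write it to a file whose extension matches the image format
               with open('/home/xdl/python/paChong/百度.png','wb') as f:
                   f.write(resp.content)
   3. Video
       Also binary data; save it in a video format
   4. Other
       Any data that can be requested can be crawled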
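Parsing methods
   1. Direct processing, for example of plain strings
   2. JSON parsing
   3. Regular expressions
   4. The BeautifulSoup parsing library
   5. The PyQuery parsing library
   6. XPath, for example through the lxml library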
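
BeautifulSoup, PyQuery, and lxml are third-party packages, so the sketch below sticks to the standard library to illustrate three of the approaches listed: JSON parsing, a regular expression, and an HTML parser. The sample HTML, the sample JSON, and the LinkCollector class are made up for the example.

```python
import json
import re
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


html = '<html><body><a href="/a">A</a><a href="/b">B</a></body></html>'

# HTML parser: walks the tags instead of matching raw text.
collector = LinkCollector()
collector.feed(html)

# Regular expression: quick, but only reliable for simple, predictable markup.
links_by_regex = re.findall(r'href="([^"]+)"', html)

# JSON parsing: a JSON string becomes ordinary Python objects.
data = json.loads('{"name": "xdl", "age": 25}')

print(collector.links, links_by_regex, data["name"], data["age"])
```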
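Why the crawled data differs from what the browser displays
   The crawled data is the raw response; it has not been rendered by JavaScript, whereas the browser runs the page's scripts before displaying it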
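How to handle JS rendering
   1. Analyze the Ajax requests
   2. Use Selenium/WebDriver, a browser automation tool, to load the page in a real browser
       from selenium import webdriver
       driver = webdriver.Chrome()
       driver.get('http://zhihu.com')
   3. Use Splash to simulate loading the page
   4. Use PyV8 or Ghost.py to simulate loading the page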
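
Analyzing Ajax requests usually means finding the URL the page's scripts call (for example in the browser's network panel) and requesting that URL directly, since it often returns JSON. The sketch below shows only the offline parts of that idea: building a paged URL and reading the JSON payload. The endpoint, the page and size parameter names, and the payload shape are hypothetical.

```python
import json
from urllib.parse import urlencode

# Hypothetical Ajax endpoint, for illustration only.
AJAX_BASE = "http://example.com/api/list"


def build_ajax_url(page, page_size=20):
    """Build the URL for one page of results (parameter names are assumed)."""
    return AJAX_BASE + "?" + urlencode({"page": page, "size": page_size})


def extract_titles(payload_text):
    """Read the JSON text an Ajax call might return and collect the titles."""
    payload = json.loads(payload_text)
    return [item["title"] for item in payload.get("items", [])]


# A made-up payload standing in for a real Ajax response.
sample = '{"items": [{"title": "first"}, {"title": "second"}]}'
print(build_ajax_url(2))
print(extract_titles(sample))
```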
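How to save data
   1. Text
       Plain text, JSON, XML, and so on
   2. Relational databases
       Such as MySQL, Oracle, and SQL Server, which store data in structured tables
   3. Non-relational databases
       Such as MongoDB (document store) and Redis (key-value store)
   4. Binary files
       Such as images, video, and audio, saved directly in the appropriate format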
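
As a small illustration of the text and relational options, the sketch below serializes records to JSON and inserts them into a table. SQLite is used here only because it ships with Python; it stands in for the MySQL, Oracle, or SQL Server databases named above. The records, the person table, and the helper names are invented for the example.

```python
import json
import sqlite3

# Made-up records standing in for crawled data.
records = [
    {"name": "xdl", "age": 25},
    {"name": "smile", "age": 30},
]


def to_json_text(rows):
    """Text storage: serialize the records as a JSON string."""
    return json.dumps(rows, ensure_ascii=False, indent=2)


def save_to_sqlite(rows, db_path=":memory:"):
    """Relational storage: insert the records into a structured table."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS person (name TEXT, age INTEGER)")
    # Named placeholders are filled from each dictionary's keys.
    conn.executemany("INSERT INTO person (name, age) VALUES (:name, :age)", rows)
    conn.commit()
    return conn


print(to_json_text(records))
conn = save_to_sqlite(records)
print(conn.execute("SELECT name, age FROM person ORDER BY age").fetchall())
conn.close()
```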
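urllib in detail
   Python's built-in HTTP request library; no installation needed
       urllib.request: request module
       urllib.error: exception handling module
       urllib.parse: URL parsing module
       urllib.robotparser: robots.txt parsing module
   Usage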
       urllib.request.urlopen(url,data=None,[timeout,]*,cafile=None,capath=None,cadefault=False,context=None)
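
The four modules can be tried together in a short sketch. The URL parsing and robots.txt parts run offline; the fetch helper is only defined, not called, and its name, return shape, and error handling are assumptions for illustration. The robots.txt rules are made up.

```python
from urllib.error import HTTPError, URLError
from urllib.parse import parse_qs, urlencode, urlparse
from urllib.request import urlopen
from urllib.robotparser import RobotFileParser

# urllib.parse: split a URL into parts and build a query string.
parts = urlparse("http://httpbin.org/get?name=xdl&age=25")
print(parts.netloc, parts.path, parse_qs(parts.query))
print(urlencode({"name": "xdl", "age": 25}))

# urllib.robotparser: check made-up rules without downloading a robots.txt file.
rules = RobotFileParser()
rules.parse(["User-agent: *", "Disallow: /private/"])
print(rules.can_fetch("*", "http://example.com/private/page"))
print(rules.can_fetch("*", "http://example.com/index.html"))


def fetch(url, timeout=10):
    """urllib.request plus urllib.error: return (status, body) or (None, b'')."""
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.status, resp.read()
    except HTTPError as exc:
        # HTTPError is a subclass of URLError, so it has to be caught first.
        return exc.code, b""
    except URLError:
        # No HTTP response at all, e.g. DNS failure or refused connection.
        return None, b""
```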
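Requests in detail
   1. What is Requests
       Requests is an HTTP library written in Python, built on urllib3 and released under the Apache 2.0 license
       It is more convenient than urllib, saves a great deal of work, and fully covers HTTP testing needs
   2. Installing Requests
       pip3 install requests
   3. Making requests
       1. Basic GET request
           1. Basic usage: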

# Basic usage
import requests
response = requests.get('http://httpbin.org/get')
print(response.text)

2. GET request with parameters

# Passing parameters
import requests
# Method 1: append the parameters to the URL
response = requests.get('http://httpbin.org/get?name=xdl&age=25')
print(response.text)
# Method 2: put the parameters in a dictionary and pass it as params
data = {
    'name' :'xdl',
    'age' :25
}
response = requests.get('http://httpbin.org/get',params=data)
print(response.text)

3. Parsing JSON

import requests

response = requests.get('http://httpbin.org/get')
print(type(response.text))#<class 'str'>
print(response.json())# equivalent to json.loads(response.text)
#{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 
#'Connection': 'close', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.19.1'}, 
#'origin': '139.207.92.26', 'url': 'http://httpbin.org/get'}
print(type(response.json()))#<class 'dict'>

4. Fetching binary data and saving it, using the content attribute

# Read and save binary data
import requests
response = requests.get('https://github.com/favicon.ico')
print(type(response.text),type(response.content))#<class 'str'> <class 'bytes'>
with open('favicon.ico','wb') as f:
    f.write(response.content)

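5. Adding headers: some sites refuse requests that do not carry suitable headers, so the crawler disguises itself as a browser

# Add headers
import requests
# response = requests.get('https://www.zhihu.com/explore')
# print(response)#<Response [400]>; the site cannot be accessed without headers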
headers = {
    'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/'
}
response = requests.get('https://www.zhihu.com/explore',headers=headers)
print(response.text)

2. Basic POST request

import requests
data={
    'name':'xdl',
    'age':25
}
response = requests.post("http://httpbin.org/post",data=data)
print(response.text)

import requests
data={
    'name':'xdl',
    'age':25
}
headers = {
    'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/'
 }
response = requests.post("http://httpbin.org/post",data=data,headers=headers)
print(response.json())

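Responses
   1. Response attributes
       1. status_code: the status code, an int
       2. headers: a dict-like object (CaseInsensitiveDict)
       3. cookies: a RequestsCookieJar
       4. url: a str
       5. history: a list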

import requests
response = requests.get("http://www.jianshu.com")
print(type(response.status_code),response.status_code)#<class 'int'> 403
print(type(response.headers),response.headers)
#<class 'requests.structures.CaseInsensitiveDict'>
print(type(response.cookies),response.cookies)
#<class 'requests.cookies.RequestsCookieJar'> ,
print(type(response.url),response.url)
#<class 'str'>
print(type(response.history),response.history)
#<class 'list'>

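    2. Checking the status code
       1. Compare against a named constant such as requests.codes.ok
       2. Compare directly against the numeric status code

import sys
import requests
response = requests.get("http://jianshu.com")
# requests.codes.ok is 200; the direct form is response.status_code == 200
if response.status_code == requests.codes.ok:
    print("Request Success")
else:
    sys.exit()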

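Advanced usage
1. File upload

import requests
# Open the file in binary mode; the with block closes it once the upload is done
with open('favicon.ico','rb') as f:
    files = {'file': f}
    response = requests.post('http://httpbin.org/post',files=files)
print(response.text)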

2. Getting cookies

import requests
response = requests.get('http://www.baidu.com')
print(response.cookies)#<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>
for key,value in response.cookies.items():
    print(key+"="+value)#BDORZ=27315

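3. Session persistence (used to simulate login)

import requests
# The two requests are independent (like visiting from two different browsers), so the cookie set by the first is not sent with the second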
requests.get('http://httpbin.org/cookies/set/number/123456789')
response = requests.get("http://httpbin.org/cookies")
print(response.text)#"cookies": {}

import requests
# Use a Session object so that cookies persist across requests
s = requests.Session()
s.get('http://httpbin.org/cookies/set/number/123456789')
response = s.get('http://httpbin.org/cookies')
print(response.text)#"cookies": {"number": "123456789"}

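4. Certificate verification
When a site is accessed over HTTPS, Requests verifies its SSL certificate by default.

import requests
import urllib3
# Suppress the warning that verify=False would otherwise trigger
urllib3.disable_warnings()
# verify=False skips certificate verification; without the line above a warning is printed
response = requests.get('https://www.12306.cn',verify=False)
print(response.status_code)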

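import requests
# Supply a client-side certificate through the cert parameter (certificate file, key file)
response = requests.get('https://www.12306.cn',cert=('/path/server.crt','/path/key'))

Proxy settings
1. Set a proxy through the proxies parameter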
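
Proxies are passed to Requests through the proxies parameter, a dictionary that maps a URL scheme to a proxy URL. The sketch below is an assumption-laden illustration: the proxy address 127.0.0.1:9743 and the get_via_proxy helper are placeholders, and no request is sent unless the helper is called with a working proxy.

```python
import requests

# Placeholder proxy address; replace it with a proxy that actually exists.
proxies = {
    "http": "http://127.0.0.1:9743",
    "https": "http://127.0.0.1:9743",
}


def get_via_proxy(url, timeout=5):
    """Send one request through the proxies defined above."""
    return requests.get(url, proxies=proxies, timeout=timeout)


# A Session can carry the proxy settings for every request it makes.
session = requests.Session()
session.proxies.update(proxies)
print(session.proxies)
```

If the proxy needs credentials, Requests accepts them inside the proxy URL in the form http://user:password@host:port.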

Timeout settings
1. The timeout parameter is a number of seconds

import requests
from requests.exceptions import Timeout
try:
    response = requests.get('http://httpbin.org/get',timeout=0.1)
    print(response.status_code)
# Timeout covers both ConnectTimeout and ReadTimeout
except Timeout:
    print("Timeout")

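Authentication settings
1. Some sites require authentication before they can be accessed
Pass the credentials through the auth parameter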

import requests
from requests.auth import HTTPBasicAuth
r = requests.get('http://120.27.34.24:9001',auth=HTTPBasicAuth('user','123'))
print(r.status_code)

Exception handling
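
The original post gives no example for this heading, so the following is only one possible sketch. It relies on the exception classes in requests.exceptions, where Timeout and ConnectionError are both subclasses of RequestException, and catches the more specific ones first. The safe_get helper, its return shape, and its error labels are assumptions for illustration.

```python
import requests
# Note: this ConnectionError is the Requests class, not the Python built-in.
from requests.exceptions import ConnectionError, RequestException, Timeout


def safe_get(url, timeout=5):
    """Return (response, error); exactly one of the two is None."""
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx and 5xx status codes into an HTTPError exception.
        response.raise_for_status()
        return response, None
    except Timeout:
        return None, "timeout"
    except ConnectionError:
        return None, "connection error"
    except RequestException as exc:
        # Catch-all for everything else Requests raises, e.g. an invalid URL.
        return None, f"request error: {exc.__class__.__name__}"
```

A call such as safe_get('http://httpbin.org/get', timeout=0.1) would typically come back with the "timeout" label, matching the timeout example earlier in the post.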

posted @ 2018-08-26 23:52 xdl_smile
