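Introduction to Python Web Crawlers

1. Course contents

requests: fetching web page content

robots.txt: the Robots Exclusion Standard for web crawlers

BeautifulSoup: parsing HTML pages

Re: regular expressions in detail, for extracting key information from pages

Scrapy: how web crawlers work, and an introduction to a professional crawling framework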

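2. Python development environment

Text editors

IDLE, Notepad++, Sublime Text, Vim, Atom, Komodo Edit

Integrated development tools:

PyCharm, Wing, PyDev & Eclipse, Visual Studio, Anaconda & PTVS, Spyder, Canopy

Recommended

PyCharm (recommended): for developing large programs

Anaconda: a Python distribution for scientific computing and data analysis; recommended, open source, free

IDLE, Sublime Text

The website is the API

Requests library

robots protocol

5 hands-on projects

1. Getting started with Requests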

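requests: detailed information about the requests library is available at www.python-requests.org

Install the requests library with pip: pip install requests

Methods of requests

requests.get(url, params=None, **kwargs): builds a Request object and returns a Response object

url: the URL of the page to fetch
params: extra parameters for the URL, as a dict or byte stream; optional
**kwargs: 12 parameters that control the request

Attributes of the Response object:

r.status_code: the status code of the HTTP response; 200 means the request succeeded, while codes such as 404 mean it failed

r.text: the HTTP response body as a string, i.e. the content of the page at the URL

r.encoding: the encoding of the response body as guessed from the HTTP headers. It is taken from the charset field of the headers; if the headers contain no charset, the encoding is assumed to be ISO-8859-1, which cannot represent Chinese text

r.apparent_encoding: the likely encoding of the response body, inferred from the content itself; usually more accurate than r.encoding

r.content: the HTTP response body in binary (bytes) form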

import requests
r = requests.get('http://www.baidu.com')
print(r.status_code)
# use the encoding inferred from the content, so Chinese text decodes correctly
r.encoding = r.apparent_encoding
print(r.text)
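
To make the r.encoding fallback concrete, here is a small sketch that uses only the standard library and no network access. It decodes the same UTF-8 bytes twice: once as ISO-8859-1 (the fallback described above) and once as UTF-8. The sample string and the variable names are illustrative assumptions, not data from a real response.

```python
# Bytes as a server might send them for a UTF-8 encoded Chinese page.
# The sample text is an illustrative assumption.
body = "网络爬虫".encode("utf-8")

# Decoding with the ISO-8859-1 fallback never fails, but it maps every
# byte to one Latin-1 character, so the result is unreadable.
garbled = body.decode("iso-8859-1")

# Decoding with the encoding the content actually uses restores the text.
restored = body.decode("utf-8")

print(len(body), len(garbled), len(restored))
print(restored)
```

This is the reason the snippet above assigns r.apparent_encoding to r.encoding before reading r.text.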

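A general code framework for fetching web pages

Exceptions in the Requests library

Exception	Description
requests.ConnectionError	Network connection error, such as a failed DNS lookup or a refused connection
requests.HTTPError	HTTP error
requests.URLRequired	A URL is missing
requests.TooManyRedirects	The maximum number of redirects was exceeded
requests.ConnectTimeout	Timed out while connecting to the remote server
requests.Timeout	The request for the URL timed out

r.raise_for_status(): checks the returned status_code and raises an HTTPError if it indicates an error (a 4xx or 5xx code)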

# -*- coding: utf-8 -*-
import requests


def getHTMLText(url):
    """Return the page text at url, or None if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()  # raise HTTPError for 4xx/5xx responses
        r.encoding = r.apparent_encoding
        return r.text
    except Exception as e:
        print(e)


if __name__ == "__main__":
    url = "http://www.baidu.com"
    print(getHTMLText(url))
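
As a possible refinement of the framework, the sketch below catches the specific exceptions from the table instead of a bare Exception, so the caller can tell a timeout from a bad status code. The function name fetch_text and the messages are hypothetical; the exception classes are the ones Requests exports. It assumes the requests package is installed.

```python
import requests


def fetch_text(url, timeout=30):
    """Return the page text, or None after reporting what went wrong.

    fetch_text is a hypothetical helper name, not part of requests.
    """
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except requests.Timeout:
        # Also covers ConnectTimeout, which is a subclass of Timeout.
        print("timed out:", url)
    except requests.HTTPError as e:
        print("bad HTTP status:", e)
    except requests.ConnectionError as e:
        print("connection problem:", e)
    except requests.RequestException as e:
        # Base class of the Requests exceptions listed in the table.
        print("request failed:", e)
    return None
```

Order matters here: the more specific classes are listed before requests.RequestException, which would otherwise catch everything first.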

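HTTP: the Hypertext Transfer Protocol

A request-response protocol

URL format: http://host[:port][path]

host: the host name

port: the port of the service; defaults to 80

path: the path of the requested resource

Each URL corresponds to one resource on the network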
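
The URL format above can be explored with the standard library's urllib.parse. This is only a sketch: the helper name describe, the constant DEFAULT_HTTP_PORT, and the example.com URLs are assumptions made for illustration.

```python
from urllib.parse import urlsplit

DEFAULT_HTTP_PORT = 80  # used when the URL does not name a port


def describe(url):
    """Return (host, port, path) for an http URL."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_HTTP_PORT
    return parts.hostname, port, parts.path


print(describe("http://www.example.com:8080/docs/index.html"))
print(describe("http://www.example.com/docs/index.html"))
```

Note that urlsplit reports a missing port as None, so the default of 80 has to be applied by the caller.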

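Operations on resources in HTTP

Method	Description	requests function
GET	Requests the resource at the URL	requests.get(url, params=None, **kwargs)
HEAD	Requests only the response headers for the resource at the URL	requests.head(url, **kwargs)
POST	Appends new data to the resource at the URL	requests.post(url, data=None, json=None, **kwargs)
PUT	Stores a resource at the URL, overwriting whatever was there	requests.put(url, data=None, **kwargs)
PATCH	Partially updates the resource at the URL, changing only some of its content, which saves network bandwidth	requests.patch(url, data=None, **kwargs)
DELETE	Deletes the resource stored at the URL	requests.delete(url, **kwargs)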

Example of the requests.head() method

import requests
r = requests.head("http://www.baidu.com")
# print each response header as "name : value"
for item in r.headers.items():
    print("%s : %s" % item)

Output

Server : bfe/1.0.8.18
Date : Tue, 23 May 2017 08:01:17 GMT
Content-Type : text/html
Last-Modified : Mon, 13 Jun 2016 02:50:47 GMT
Connection : Keep-Alive
Cache-Control : private, no-cache, no-store, proxy-revalidate, no-transform
Pragma : no-cache
Content-Encoding : gzip

Testing the requests.put() method

import requests
payload = {'key1': 'value1', 'key2': 'value2'}
# a dict passed as data is sent as form fields
r = requests.put("http://httpbin.org/put", data=payload)
print(r.text)

Output

{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "key1": "value1", 
    "key2": "value2"
  }, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Connection": "close", 
    "Content-Length": "23", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.13.0"
  }, 
  "json": null, 
  "origin": "218.29.102.115", 
  "url": "http://httpbin.org/put"
}

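requests.request(method, url, **kwargs)

method: the request method, one of GET, HEAD, POST, PUT, PATCH, DELETE, OPTIONS

**kwargs: 13 optional parameters that control the request

params: a dict or byte sequence, appended to the URL as query parameters

Example

>>> import requests
>>> kv = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.request("GET", 'http://www.python123.io/ws', params=kv)
>>> print(r.url)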
http://www.python123.io/ws?key2=value2&key1=value1

data: a dict, byte sequence, or file object, sent as the body of the Request

>>> kv = {'key1': 'value1', 'key2': 'value2'}
>>> r = requests.request("PUT", 'http://www.python123.io/ws', data=kv)
>>> body = 'body content'
>>> r = requests.request("PUT", 'http://www.python123.io/ws', data=body)

json: data in JSON format, sent as the body of the Request

>>> kv = {'key1':'value1','key2':'value2'}
>>> r = requests.request("PUT",'http://www.python123.io/ws',json=kv)
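
The difference between params, data, and json is easier to see when the encodings are written out. The sketch below approximates them with the standard library only; it does not call Requests, and the exact bytes Requests produces may differ in detail. The variable names are assumptions for illustration.

```python
import json
from urllib.parse import urlencode

kv = {"key1": "value1", "key2": "value2"}

# params: the pairs are URL-encoded and appended to the URL as a query string.
url_with_params = "http://www.python123.io/ws" + "?" + urlencode(kv)

# data (given a dict): the same URL-encoded form goes into the request body,
# normally with Content-Type application/x-www-form-urlencoded.
form_body = urlencode(kv)

# json: the dict is serialized to JSON text for the body,
# normally with Content-Type application/json.
json_body = json.dumps(kv)

print(url_with_params)
print(form_body)
print(json_body)
```

In short, params changes the URL, while data and json change the body and differ in how the body is serialized.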

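headers: a dict of custom HTTP headers

cookies: a dict or CookieJar, the cookies sent with the Request

auth: a tuple, enabling HTTP authentication

files: a dict, for uploading files

>>> fs = {'file': open('data.xls', 'rb')}
>>> r = requests.request("POST", 'http://www.python123.io/ws', files=fs)

timeout: the timeout, in seconds

proxies: a dict that sets the proxy servers to go through, optionally with login credentials. A proxy hides the source IP address of the request from the target site, which can help prevent the crawler from being traced back

Example

>>> pxs = {'http': 'http://user:pass@10.10.10.1:1234', 'https': 'https://10.10.10.1:4321'}
>>> r = requests.request('GET', 'http://www.baidu.com', proxies=pxs)

allow_redirects: True/False, default True; switch for following redirects

stream: True/False, default False; when False the response body is downloaded immediately, when True the download is deferred until the content is accessed

verify: True/False, default True; switch for verifying SSL certificates

cert: path to a local SSL certificate

The most commonly used methods are get and head.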

Code framework

import requests


def getHTMLText(url):
    """Return the page text at url, or None if the request fails."""
    try:
        r = requests.get(url, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding
        return r.text
    except Exception as e:
        print(e)

Example 1: fetching a JD.com product page

# -*- coding: utf-8 -*-
import requests

url = "https://item.jd.com/1696975522.html"
try:
    r = requests.get(url)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])  # only the first 1000 characters
except Exception as e:
    print("Fetch failed:", e)

Example 2: fetching an Amazon product page

# -*- coding: utf-8 -*-
import requests

url = "https://www.amazon.cn/gp/product/B0186FESGW/ref=s9_acss_bw_cg_kin_1a1_w?pf_rd_m=A1U5RCOVU0NYF2&pf_rd_s=merchandised-search-2&pf_rd_r=MNPJWDEAVVCY67V3KKXB&pf_rd_t=101&pf_rd_p=190844af-fd7e-4d63-b831-fbd5601cfa0d&pf_rd_i=116087071"
try:
    # send a browser-like User-Agent instead of the default python-requests one
    kv = {"user-agent": "Mozilla/5.0"}
    r = requests.get(url, headers=kv)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[:1000])
except Exception as e:
    print("Fetch failed:", e)

Example 3: submitting keywords to a search engine

Baidu's keyword interface

http://www.baidu.com/s?wd=keyword

360's keyword interface

http://www.so.com/s?q=keyword

# -*- coding: utf-8 -*-
import requests

# url = 'http://www.baidu.com/s'  # Baidu: pass the keyword as 'wd'
url = 'http://www.so.com/s'       # 360: pass the keyword as 'q'
try:
    # kv = {'wd': 'python'}
    kv = {'q': 'python'}
    r = requests.get(url, params=kv)
    r.raise_for_status()
    print(len(r.text))
except Exception as e:
    print("Fetch failed:", e)

Example 4: downloading an image

# -*- coding: utf-8 -*-
import os
import requests

root = 'F:\\pythonstudy\\'
url = 'http://image.nationalgeographic.com.cn/2017/0523/20170523015851567.jpg'
path = root + url.split('/')[-1]  # name the file after the last URL segment
if not os.path.exists(root):
    os.mkdir(root)
if not os.path.exists(path):
    try:
        r = requests.get(url)
        r.raise_for_status()
        with open(path, 'wb') as f:
            f.write(r.content)  # r.content holds the raw bytes of the image
        print("File saved")
    except Exception as e:
        print("Fetch failed:", e)
else:
    print("File already exists")
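
Taking the last segment with url.split('/')[-1] works for the URL above, but it would keep any query string and yields an empty name for a URL ending in a slash. One possible alternative, sketched with the standard library only, is shown below. The helper name filename_from_url, the fallback name download.bin, and the example.com URLs are assumptions for illustration.

```python
import posixpath
from urllib.parse import urlsplit


def filename_from_url(url, default="download.bin"):
    """Return the last path segment of url, or default if there is none."""
    # urlsplit separates the path from the query string and fragment.
    name = posixpath.basename(urlsplit(url).path)
    return name or default


print(filename_from_url(
    "http://image.nationalgeographic.com.cn/2017/0523/20170523015851567.jpg"))
print(filename_from_url("http://example.com/a/b.jpg?size=large"))
print(filename_from_url("http://example.com/"))
```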

Example 5: looking up an IP address

http://m.ip138.com/ip.asp?ip=ipaddress

# -*- coding: utf-8 -*-
import requests

url = 'http://m.ip138.com/ip.asp?ip='
try:
    r = requests.get(url + '222.222.222.222')
    r.raise_for_status()
    print(r.text[-1000:])  # the result is near the end of the page
except Exception as e:
    print("Fetch failed:", e)

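Sizes of web crawlers

Scale	Characteristics	Typical tool	Target
Small	Small data volume, crawl speed not critical	Requests library (>90%)	Fetching and working with web pages
Medium	Larger data volume, crawl speed matters	Scrapy library	Crawling a website or a series of websites
Large	Search engines, crawl speed is critical	Custom development	Crawling the entire web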

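Problems caused by web crawlers: they can degrade a website's performance, they carry some legal risk, and they can leak personal private data

Restrictions on web crawlers:

Source checking: the target site first inspects the User-Agent field and responds only to requests from browsers and friendly crawlers

Published notice: the robots protocol tells all crawlers the site's crawling policy and asks them to comply

robots protocol: the Robots Exclusion Standard

The site tells crawlers which pages may be crawled and which may not

Form: a robots.txt file in the site's root directory

For example, JD.com's robots protocol: https://www.jd.com/robots.txt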

User-agent: * 
Disallow: /?* 
Disallow: /pop/*.html 
Disallow: /pinpai/*.html?* 
User-agent: EtaoSpider 
Disallow: / 
User-agent: HuihuiSpider 
Disallow: / 
User-agent: GwdangSpider 
Disallow: / 
User-agent: WochachaSpider 
Disallow: /

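Syntax of the robots protocol (* matches every crawler, / is the root directory):

User-agent: *
Disallow: /

If a site has no robots.txt file in its root directory, it places no restrictions on crawlers, which may then crawl all of its content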

http://www.baidu.com/robots.txt

http://news.sina.com/robots.txt

http://www.qq.com/robots.txt

http://news.qq.com/robots.txt

http://www.moe.edu.cn/robots.txt (no robots protocol)

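Using the robots protocol

Crawlers: identify robots.txt automatically or manually, then crawl accordingly

Binding force: the robots protocol is advisory rather than binding; a crawler may ignore it, but doing so carries legal risk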
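
Automatic identification of robots.txt can be sketched with the standard library's urllib.robotparser. The rules below are a simplified variant modeled on the JD.com example, fed to the parser as text so that no network access is needed. The /pop/ prefix rule, the crawler name MyCrawler, and the helper allowed are illustrative assumptions.

```python
from urllib.robotparser import RobotFileParser

# Simplified rules modeled on the JD.com example. The /pop/ prefix rule is
# an illustrative assumption, not the site's actual policy.
rules = """\
User-agent: *
Disallow: /pop/
User-agent: EtaoSpider
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())


def allowed(agent, url):
    """Return True if the parsed rules let agent fetch url."""
    return parser.can_fetch(agent, url)


# "MyCrawler" is a hypothetical crawler name that falls under "User-agent: *".
print(allowed("MyCrawler", "https://www.jd.com/item/1.html"))
print(allowed("MyCrawler", "https://www.jd.com/pop/1.html"))
print(allowed("EtaoSpider", "https://www.jd.com/item/1.html"))
```

In a real crawler the rules would come from the site itself, for example through the parser's set_url() and read() methods, and the check would run before each request.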

Posted on 2017-05-23 21:07 by blackclody