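Web scraping: HTTP requests, fetching data with urllib2, fetching data with the third-party requests library, processing data with BeautifulSoup, viewing and inspecting page source with Chrome Developer Tools, and the json module's dumps, loads, dump, and load methods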

Browser spoofing, IP restrictions, login, CAPTCHAs

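1. Web scraping: HTTP requests and Chrome

Visiting a web page
http://kaoshi.edu.sina.com.cn/college/scorelist?tab=batch&wl=1&local=2&batch=&syear=2013
URL: protocol + domain/IP + port + route (path) + parameters
ping (checks whether the host is reachable)
What can we get from a URL?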
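To make the URL breakdown above concrete, here is a minimal sketch that splits the example URL into its parts with Python's standard library. It is written for Python 3 (the rest of this post uses Python 2.7), and the variable names are my own.

```python
from urllib.parse import urlsplit, parse_qs

url = ("http://kaoshi.edu.sina.com.cn/college/scorelist"
       "?tab=batch&wl=1&local=2&batch=&syear=2013")

parts = urlsplit(url)
# keep_blank_values=True keeps the empty "batch" parameter in the result
params = parse_qs(parts.query, keep_blank_values=True)

print(parts.scheme)    # protocol
print(parts.hostname)  # domain or IP
print(parts.port)      # None here: no explicit port, so HTTP falls back to 80
print(parts.path)      # route
print(params)          # parameters, as a dict mapping each name to a list of values
```

Each parameter maps to a list because a query string may repeat the same name more than once.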

Opening it in a browser
I strongly recommend using Chrome;
its rendering and debugging tools are both excellent.
http://www.google.cn/intl/zh-CN/chrome/browser/desktop/index.html

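Developer Tools
View Page Source, Inspect
Elements: the page structure after rendering; edits take effect immediately;
Console: print and debug output;
Sources: the files the page uses;
Network: every network request.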

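HTTP requests
HTTP is currently the most widely used web transfer protocol
GET: parameters are included in the URL;
POST: parameters are carried in the request body and are not visible in the URL.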
http://shuju.wdzj.com/plat-info-59.html
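A small sketch of the GET/POST difference, assuming Python 3 and the standard library only. The endpoint and parameter are hypothetical, and nothing is sent over the network; the code only builds the two request objects so you can see where the parameters end up.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"wd": "python"}               # hypothetical parameters
base_url = "http://example.com/search"  # hypothetical endpoint

# GET: the encoded parameters are appended to the URL after "?"
get_request = Request(base_url + "?" + urlencode(params))

# POST: the same encoded parameters travel in the request body as bytes
post_request = Request(base_url, data=urlencode(params).encode("utf-8"))

print(get_request.get_method(), get_request.full_url, get_request.data)
print(post_request.get_method(), post_request.full_url, post_request.data)
```

In Chrome's Network panel the same difference shows up as a query string on the request URL for GET and as form data in the request payload for POST.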

URL types
HTML: returns an HTML page, which the browser renders and presents to the user;
API: Application Programming Interface; a request performs some function, for example returning data.

2. Fetching data with urllib2

urllib2 in Python
https://docs.python.org/2/library/urllib2.html
My Python version: 2.7

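Sending a GET request
import urllib2

url = 'http://kaoshi.edu.sina.com.cn/college/scorelist?tab=batch&wl=1&local=2&batch=&syear=2013'
headers = {'User-Agent': 'Mozilla/5.0'}  # example value; copy the real one from your browser
request = urllib2.Request(url=url, headers=headers)
response = urllib2.urlopen(request, timeout=20)  # give up after 20 seconds
result = response.read()  # raw response body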

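Sending a POST request
http://shuju.wdzj.com/plat-info-59.html
import urllib
import urllib2

# x, platId, and headers are defined elsewhere in the program
data = urllib.urlencode({'type1': x, 'type2': 0, 'status': 0, 'wdzjPlatId': int(platId)})
# headers must be passed by keyword: the second positional argument of Request is data
request = urllib2.Request('http://shuju.wdzj.com/depth-data.html', headers=headers)
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())  # keeps cookies across requests
response = opener.open(request, data)
result = response.read()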
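urllib2 exists only in Python 2. As a rough sketch of how the POST example might look on Python 3, where the same functionality lives in urllib.request and urllib.parse: the helper name build_post, the field values, and the User-Agent string below are placeholders of mine, not part of the original program.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def build_post(url, fields, headers):
    """Build (opener, request) for a form-encoded POST; nothing is sent yet."""
    body = urllib.parse.urlencode(fields).encode("utf-8")  # Python 3 needs bytes
    request = urllib.request.Request(url, data=body, headers=headers)
    # The cookie processor stores cookies from responses and replays them later.
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))
    return opener, request


# Field values and the User-Agent string are placeholders for illustration.
opener, request = build_post(
    "http://shuju.wdzj.com/depth-data.html",
    {"type1": 1, "type2": 0, "status": 0, "wdzjPlatId": 59},
    {"User-Agent": "Mozilla/5.0"},
)
print(request.get_method(), request.data)

# To actually send it (requires network access):
# result = opener.open(request, timeout=20).read()
```

Passing data to Request is what turns it into a POST; without data the same object would be sent as a GET.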

Processing the response
HTML: BeautifulSoup; some CSS basics are needed
API: JSON
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
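Here is a minimal sketch of both cases, assuming Python 3 with beautifulsoup4 installed. The HTML fragment and the JSON string are made up for illustration and stand in for a downloaded page and an API response; they are not real data from the sites above.

```python
import json
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# A made-up HTML fragment standing in for a downloaded page.
html = """
<table class="scores">
  <tr><th>Batch</th><th>Score</th></tr>
  <tr><td class="batch">First</td><td class="score">550</td></tr>
  <tr><td class="batch">Second</td><td class="score">480</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
# CSS selectors: every <td class="score"> inside <table class="scores">
scores = [int(td.get_text()) for td in soup.select("table.scores td.score")]
batches = [td.get_text() for td in soup.select("td.batch")]

# A made-up JSON string standing in for an API response body.
api_body = '{"platId": 59, "values": [1.5, 2.0]}'
data = json.loads(api_body)

print(batches, scores, data["values"])
```

This is where the CSS basics pay off: the selector you test in Chrome's Elements panel is usually the one you pass to select().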

3. Fetching data with the third-party requests library

Install with pip
pip install requests

Sending requests and passing parameters
import requests

r = requests.get(url='http://www.itwhy.org') # the most basic GET request
print(r.status_code) # response status code
r = requests.get(url='http://dict.baidu.com/s', params={'wd':'python'}) # GET request with parameters
print(r.url)
print(r.text) # print the decoded response body

requests.get('https://github.com/timeline.json') # GET request
requests.post('http://httpbin.org/post') # POST request
requests.put('http://httpbin.org/put') # PUT request
requests.delete('http://httpbin.org/delete') # DELETE request
requests.head('http://httpbin.org/get') # HEAD request
requests.options('http://httpbin.org/get') # OPTIONS request

Examples of requests with parameters:
import requests
requests.get('http://www.dict.baidu.com/s', params={'wd': 'python'}) # GET parameters
requests.post('http://www.itwhy.org/wp-comments-post.php', data={'comment': 'POST test'}) # POST parameters

Sending JSON data with POST:
import requests
import json

r = requests.post('https://api.github.com/some/endpoint', data=json.dumps({'some': 'data'}))
print(r.json())
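Besides data=json.dumps(...), requests (2.4.2 and later) also accepts a json= parameter that serializes the dict for you. The sketch below compares the two without sending anything, by preparing each request locally and inspecting it. The URL is the same placeholder endpoint used above.

```python
import json
import requests

payload = {"some": "data"}
url = "https://api.github.com/some/endpoint"  # placeholder endpoint, never contacted

# prepare() builds the request without sending it, so we can inspect it.
manual = requests.Request("POST", url, data=json.dumps(payload)).prepare()
automatic = requests.Request("POST", url, json=payload).prepare()

# data=json.dumps(...) sends the JSON text but sets no Content-Type header;
# json=... serializes the dict and sets Content-Type: application/json.
print(manual.headers.get("Content-Type"))
print(automatic.headers.get("Content-Type"))
print(json.loads(automatic.body) == payload)
```

If the server checks the Content-Type header, either use json= or set the header yourself when passing a pre-serialized string.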

Custom headers:
import requests
import json

data = {'some': 'data'}
headers = {'content-type': 'application/json',
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:22.0) Gecko/20100101 Firefox/22.0'}

# the content-type is application/json, so the body must be serialized to JSON as well
r = requests.post('https://api.github.com/some/endpoint', data=json.dumps(data), headers=headers)
print(r.text)

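r.status_code # response status code
r.raw # the raw response, a urllib3 response object; pass stream=True to the request, then read it with r.raw.read()
r.content # response body as bytes; gzip and deflate encodings are decoded automatically
r.text # response body as a string, decoded using the charset from the response headers
r.headers # server response headers in a dict-like object with case-insensitive keys; r.headers.get() returns None for a missing key
#* special methods *#
r.json() # the JSON decoder built into Requests
r.raise_for_status() # raises an exception for a failed request (4xx or 5xx response)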
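The behaviour of r.headers can be tried without any network access, because the object behind it is requests.structures.CaseInsensitiveDict. A small sketch with a hand-built instance and a made-up header value:

```python
from requests.structures import CaseInsensitiveDict

# r.headers is a CaseInsensitiveDict; this builds one by hand with made-up values.
headers = CaseInsensitiveDict({"Content-Type": "text/html; charset=utf-8"})

print(headers["content-type"])   # lookup ignores case
print(headers.get("X-Missing"))  # .get() returns None for a missing key

try:
    headers["X-Missing"]         # plain indexing raises KeyError instead
except KeyError:
    missing_raises = True
else:
    missing_raises = False
print(missing_raises)
```

So prefer r.headers.get('name') when a header may be absent.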

Uploading files:
Uploading a file with Requests is just as simple; the file type is handled automatically:

import requests

url = 'http://127.0.0.1:5000/upload'
files = {'file': open('/home/lyb/sjzl.mpg', 'rb')}
#files = {'file': ('report.jpg', open('/home/lyb/sjzl.mpg', 'rb'))} # set the filename explicitly

r = requests.post(url, files=files)
print(r.text)

Even more conveniently, you can upload a string as if it were a file:

import requests

url = 'http://127.0.0.1:5000/upload'
files = {'file': ('test.txt', b'Hello Requests.')} # the filename must be set explicitly

r = requests.post(url, files=files)
print(r.text)
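To see what requests actually builds for an upload, you can prepare the request locally instead of sending it. This sketch reuses the string upload above; no server needs to be running at the URL because nothing is sent.

```python
import requests

url = "http://127.0.0.1:5000/upload"
files = {"file": ("test.txt", b"Hello Requests.")}

# prepare() builds the request locally; nothing is sent to the server.
prepared = requests.Request("POST", url, files=files).prepare()

content_type = prepared.headers["Content-Type"]
print(content_type)  # multipart/form-data with a generated boundary
print(b'filename="test.txt"' in prepared.body)
print(b"Hello Requests." in prepared.body)
```

The files= argument is what makes the body multipart/form-data, the same encoding a browser uses for a file input in a form.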

Authentication

HTTP Basic Auth:

import requests
from requests.auth import HTTPBasicAuth

r = requests.get('https://httpbin.org/hidden-basic-auth/user/passwd', auth=HTTPBasicAuth('user', 'passwd'))
# r = requests.get('https://httpbin.org/hidden-basic-auth/user/passwd', auth=('user', 'passwd')) # shorthand
print(r.json())
Another very popular form of HTTP authentication is Digest Authentication, which Requests also supports out of the box:

from requests.auth import HTTPDigestAuth

requests.get(URL, auth=HTTPDigestAuth('user', 'pass')) # URL stands for the address of the protected resource
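For Basic Auth it helps to see what ends up on the wire. The sketch below prepares the request from the first example locally, without sending it, and compares the Authorization header against a hand-computed value.

```python
import base64
import requests
from requests.auth import HTTPBasicAuth

url = "https://httpbin.org/hidden-basic-auth/user/passwd"
prepared = requests.Request("GET", url, auth=HTTPBasicAuth("user", "passwd")).prepare()

# Basic Auth is just "user:password" base64-encoded in the Authorization header.
expected = "Basic " + base64.b64encode(b"user:passwd").decode("ascii")
print(prepared.headers["Authorization"])
print(prepared.headers["Authorization"] == expected)
```

Base64 is an encoding, not encryption, so Basic Auth should only be used over HTTPS.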

Cookies and session objects

If a response contains cookies, you can access them quickly:

import requests

r = requests.get('http://www.google.com.hk/')
print(r.cookies['NID'])
print(tuple(r.cookies))
To send your own cookies to the server, use the cookies parameter:

import requests

url = 'http://httpbin.org/cookies'
cookies = {'testCookies_1': 'Hello_Python3', 'testCookies_2': 'Hello_Requests'}
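# Cookie Version 0 forbids special characters such as spaces, square brackets, parentheses, equals signs, commas, double quotes, slashes, question marks, @, colons, and semicolons in cookie values.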
r = requests.get(url, cookies=cookies)
print(r.json())
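The examples above pass cookies per request. A session object keeps them between requests instead. This is a minimal sketch that stays offline: it stores one cookie on a requests.Session and prepares a request to show the Cookie header the session would send. The commented lines at the end show the networked version.

```python
import requests

session = requests.Session()
session.cookies.set("testCookies_1", "Hello_Python3")  # stored on the session

# prepare_request merges the session's cookies into the outgoing request.
request = requests.Request("GET", "http://httpbin.org/cookies")
prepared = session.prepare_request(request)
print(prepared.headers.get("Cookie"))

# With network access, every call through the session reuses the same cookie jar:
# session.get("http://httpbin.org/cookies/set/sessioncookie/123456789")
# print(session.get("http://httpbin.org/cookies").json())
```

This is the usual way to handle sites that require login: log in once through the session, then make the remaining requests through the same object.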

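Timeouts and exceptions

timeout is not a limit on the total download time of the response body. It bounds how long to wait for the connection to be established and for the server to send data; an exception is raised if no bytes arrive within timeout seconds.

>>> requests.get('http://github.com', timeout=0.001)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
requests.exceptions.Timeout: HTTPConnectionPool(host='github.com', port=80): Request timed out. (timeout=0.001)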
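In a scraper you normally catch the timeout rather than let it crash the program. Below is one possible sketch. The fetch helper, its injectable get argument, and the (3.05, 27) connect/read timeout pair are my own choices for illustration; the stand-in function simulates a timeout so the example runs without network access.

```python
import requests


def fetch(url, get=requests.get, timeout=(3.05, 27)):
    """Return the response text, or None if the request timed out.

    timeout is a (connect, read) pair in seconds; `get` is injectable so the
    function can be exercised without network access.
    """
    try:
        response = get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        return None
    response.raise_for_status()  # raise for 4xx/5xx responses
    return response.text


def always_times_out(url, timeout):
    # Stand-in for requests.get that simulates a slow server.
    raise requests.exceptions.ConnectTimeout("simulated timeout")


print(fetch("http://github.com", get=always_times_out))
```

requests.exceptions.Timeout covers both ConnectTimeout and ReadTimeout, so one except clause handles either case.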

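4. The json module's dumps, loads, dump, and load methods

The json module handles serialization; four of its methods are used most often.
1. dumps: serializes a Python object into a JSON string

import json

l1 = [1,2,3,454]
d1 = {'k1':'v1'}
ret = json.dumps(l1)

2. loads: converts a JSON string back into a list or dict

l1 = '[1,2,3,4]'
r = json.loads(l1)

3. dump: serializes an object and writes the resulting string to a file
json.dump(d1,open('db','w'))

4. load: reads JSON from a file and converts it into a list or dict
d1 = json.load(open('db','r'))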
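Putting the four methods together, here is a small round-trip sketch for Python 3. The record contents and the temporary file name are made up. It uses with blocks so the files are closed reliably, which the one-line open(...) calls above leave to the interpreter.

```python
import json
import os
import tempfile

record = {"k1": "v1", "items": [1, 2, 3, 454]}

text = json.dumps(record)      # object -> JSON string
restored = json.loads(text)    # JSON string -> object

# dump/load do the same through a file; "with" closes the file reliably.
path = os.path.join(tempfile.mkdtemp(), "db.json")
with open(path, "w") as f:
    json.dump(record, f)
with open(path) as f:
    from_file = json.load(f)

print(text)
print(restored == record, from_file == record)
```

A quick way to remember the names: the versions ending in s work with strings, the ones without it work with file objects.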

posted @ 2017-11-02 11:33 大树2

大树的Blog, 程序员猴哥, WeChat: chendashu618

Recording the learning process, summarizing work experience, and exploring the underlying logic of how things run.
