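The requests module

- 1. What is the requests module?
    - A third-party Python library for making HTTP requests; it lets a script send requests the way a browser does
- 2. Why use the requests module?
    - 1. It URL-encodes query strings automatically (with urllib you call urllib.parse.quote() yourself)
    - 2. It encodes POST parameters automatically (with urllib you call urllib.parse.urlencode(params) and then convert the result to bytes)
    - 3. It simplifies working with cookies and proxies
- 3. How to use requests
    - Install: pip install requests
    - Usage:
        1. Specify the URL
        2. Send the request
        3. Read the response data
        4. Save the data to persistent storage
- 4. Five small crawler projects based on requests, to learn and practice the module systematically
    - GET request
    - POST request
    - Ajax GET
    - Ajax POST
    - Comprehensive project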
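
To make point 2 above concrete, here is a rough sketch of the manual encoding steps that requests takes care of, written with only the standard library's urllib.parse. The sample form values are illustrative and not from the original.

```python
from urllib.parse import quote, urlencode

# Manual URL encoding of a non-ASCII query value
keyword = '江子牙'
encoded_keyword = quote(keyword)
query_url = 'https://www.sogou.com/web?query=' + encoded_keyword

# Manual encoding of POST parameters: dict -> form string -> bytes
form = {'a': 1, 'b': 2}  # illustrative values
post_body = urlencode(form).encode('utf-8')

print(query_url)
print(post_body)
```

With requests, both steps collapse into passing a dict as params= or data=.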

Sending a GET request with requests

- Task: crawl the page data of the Sogou home page

import requests

# Specify the URL
url = 'https://www.sogou.com/'

# Send the request
res = requests.get(url=url)

# Read the data: text returns the page data in the response as a string
data = res.text

# Save to persistent storage
with open('get.html', 'w', encoding='utf-8') as f:
    f.write(data)

print('Save Ok')

Save Ok

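# Commonly used attributes of the response object
import requests
url = 'https://www.sogou.com/'
res = requests.get(url)

# content returns the page data in the response as binary bytes
# print(res.content)

# Streaming a binary download, e.g. a video: if the file is 10 GB, reading
# all of res.content into memory and writing it out in one go is not reasonable.
# res = requests.get('xxx', stream=True)
# with open('movie.mp4', 'wb') as f:
#     for chunk in res.iter_content(chunk_size=8192):
#         f.write(chunk)

# status_code returns the response status code, e.g. 200
print(res.status_code)

# headers returns the response headers as a dict-like object
# print(res.headers)

# url returns the URL that was requested
print(res.url)

# Encoding
# Some sites return data that is not UTF-8 encoded; set the encoding before reading res.text
# res.encoding = 'gbk'

# Parsing JSON (only when the response body is JSON; the Sogou home page is HTML)
# import json
# res1 = json.loads(res.text)  # manual, more verbose
#
# In one step
# res2 = res.json()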

200
https://www.sogou.com/
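
The streaming idea in the comments above can be sketched as a small helper that writes chunks to disk one at a time. The name save_chunks is hypothetical, and the demo feeds it a plain list of bytes as an offline stand-in; with requests you would pass it res.iter_content(chunk_size=8192) from a response opened with stream=True.

```python
import os
import tempfile


def save_chunks(chunks, path):
    """Write an iterable of bytes chunks to path and return the bytes written."""
    written = 0
    with open(path, 'wb') as f:
        for chunk in chunks:
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


# Offline stand-in for res.iter_content(chunk_size=8192)
fake_chunks = [b'abc', b'', b'defg']
target = os.path.join(tempfile.mkdtemp(), 'movie.bin')
total = save_chunks(fake_chunks, target)
print(total)
```

Only one chunk is held in memory at a time, which is the point of streaming a large file.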

Sending a GET request with parameters

- Task: crawl the page data of a Sogou search entry

import requests

url = 'https://www.sogou.com/web'

res = requests.get(url=url, params={'query': '江子牙'})

with open('jiangziya.html', 'w', encoding='utf-8') as f:
    f.write(res.text)

print('Request URL:', res.url)
print('Save Ok')

Request URL: https://www.sogou.com/web?query=%E6%B1%9F%E5%AD%90%E7%89%99
Save Ok
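
If you only want to see how requests encodes the params dict, without touching the network, one option is to build the request and prepare it. This is a small sketch using requests.Request and its prepare() method; nothing is sent.

```python
import requests

# Build the request object but do not send it
req = requests.Request('GET', 'https://www.sogou.com/web', params={'query': '江子牙'})
prepared = req.prepare()

# The prepared URL already contains the percent-encoded query string
print(prepared.url)
```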

# Custom request headers with requests: masquerade as a browser
import requests

url = 'https://www.sogou.com/web'

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}

res = requests.get(url=url,params={'query':'江子牙'},headers=headers)
# A GET request can also send cookies (cookies is a dict you define yourself):
# res = requests.get(url=url, params={'query': '江子牙'}, headers=headers, cookies=cookies)

print(res.status_code)

200
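
When several requests share the same headers and cookies, a requests.Session can carry them so they need not be repeated on every call. The sketch below is offline: it prepares a request through the session and inspects the result. The User-Agent string and the cookie name and value are placeholders, not values from the original.

```python
import requests

session = requests.Session()
# Headers and cookies set on the session apply to every request made through it
session.headers.update({'User-Agent': 'Mozilla/5.0 (example)'})  # placeholder UA
session.cookies.set('sessionid', 'abc123')  # hypothetical cookie

req = requests.Request('GET', 'https://www.sogou.com/web', params={'query': 'python'})
prepared = session.prepare_request(req)  # merges session state; nothing is sent

print(prepared.headers['User-Agent'])
print(prepared.headers['Cookie'])
```

In real use you would simply call session.get(url, params=...) and the same headers and cookies would be attached.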

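Sending a POST request with requests

- Task: log in to Douban and fetch the page returned after logging in

import requests

url = 'https://accounts.douban.com/login'

# Fill in your own account; do not hard-code real credentials in shared code
data = {
    'form_email': 'your_account',
    'form_password': 'your_password',
}
res = requests.post(url=url, data=data)

# A 200 status only means the request was answered, not that the login worked
print(res.status_code)
html = res.content

with open('douban_login.html', 'wb') as f:
    f.write(html)
print('Login response saved')

# Notes:
headers = {
    'content-type': 'application/json'
}
# With no content-type header specified, data= is sent as application/x-www-form-urlencoded
# requests.post(url='xxx', data={'a': 1, 'b': 2})
# If the header says application/json but the values are passed with data=,
# the body is still form-encoded, so the server cannot read the values
# requests.post(url='xxx', data={'a': 1, 'b': 2}, headers=headers)
# Passing the values with json= and no explicit header sends them as application/json
# requests.post(url='', json={'a': 1})

200
Login response saved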
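
The difference between data= and json= is easy to check offline by preparing both requests and looking at the Content-Type header and body that requests would send. The URL below is a placeholder; nothing is sent.

```python
import json

import requests

url = 'https://example.com/api'  # placeholder URL, no request is made

# data= produces a form-encoded body
form_req = requests.Request('POST', url, data={'a': 1, 'b': 2}).prepare()
# json= produces a JSON body and sets the matching Content-Type
json_req = requests.Request('POST', url, json={'a': 1, 'b': 2}).prepare()

print(form_req.headers['Content-Type'], form_req.body)
print(json_req.headers['Content-Type'], json.loads(json_req.body))
```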

Sending an Ajax GET request with requests

- Task: crawl movie detail data from Douban

import json

import requests

url = 'https://movie.douban.com/j/new_search_subjects'

params = {
    'sort': 'U',
    'range': '0,10',
    'tags': '',
    'start': 40,
    'genres': '爱情',
}
# Custom request headers
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0) Gecko/20100101 Firefox/6.0'
}

res = requests.get(url=url, params=params, headers=headers)
print(res.url)
# Parse the JSON body before dumping it; dumping res.text would write one quoted string
with open('douban_movie.json', 'w', encoding='utf-8') as f:
    json.dump(res.json(), f, ensure_ascii=False)

https://movie.douban.com/j/new_search_subjects?sort=U&range=0%2C10&tags=&start=40&genres=%E7%88%B1%E6%83%85
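
One detail worth checking when saving a JSON response: json.dump applied to a string writes a single quoted JSON string, while json.dump applied to the parsed object writes the object itself. The sketch below shows the difference with json.dumps on a made-up payload; the payload is illustrative and not the actual Douban response.

```python
import json

raw_text = '{"title": "爱情"}'  # illustrative stand-in for res.text

# Dumping the text: the whole payload becomes one escaped JSON string
as_string = json.dumps(raw_text, ensure_ascii=False)

# Dumping the parsed object: the payload stays a JSON object
as_object = json.dumps(json.loads(raw_text), ensure_ascii=False)

print(as_string)
print(as_object)
```

Reading the first form back with json.loads yields a str that has to be parsed a second time; the second form yields a dict directly.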

Sending an Ajax POST request with requests

- Task: crawl KFC restaurant location data for a city

import json

import requests

url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword'

data = {
    'cname': '',
    'pid': '',
    'keyword': '江西',
    'pageIndex': 1,
    'pageSize': 20,
}
res = requests.post(url=url, data=data)
print(res.url)
# Parse the JSON body before dumping it; dumping res.text would write one quoted string
with open('kfc.json', 'w', encoding='utf-8') as f:
    json.dump(res.json(), f, ensure_ascii=False)

http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword
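
The request above fetches a single page (pageIndex 1, pageSize 20). To collect every page, one possible approach is to keep incrementing the page index until a page comes back short. The sketch below is offline and rests on assumptions: iter_rows and fake_fetch are hypothetical names, and it assumes the endpoint returns at most pageSize rows per page. The real response layout is not shown in the original, so extracting the row list from the parsed JSON is left to the fetch function.

```python
def iter_rows(fetch_page, page_size=20, max_pages=100):
    """Yield rows page by page until a short (or empty) page signals the end."""
    for page_index in range(1, max_pages + 1):
        rows = fetch_page(page_index, page_size)
        yield from rows
        if len(rows) < page_size:
            break


# Offline stand-in for the real POST call: 45 fake store records
FAKE_STORES = ['store-%d' % i for i in range(45)]
requested_pages = []


def fake_fetch(page_index, page_size):
    requested_pages.append(page_index)
    start = (page_index - 1) * page_size
    return FAKE_STORES[start:start + page_size]


all_rows = list(iter_rows(fake_fetch))
print(len(all_rows), requested_pages)
```

In real use, fetch_page would send the POST with the given pageIndex and pageSize and return the list of rows from the parsed response.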

Comprehensive project

- Task: crawl the Sogou Zhihu search result pages for a keyword over a given range of page numbers

# e.g. the first three pages: 1, 2, 3
import os

import requests

# Create the output directory if it does not exist yet
os.makedirs('pages', exist_ok=True)

word = input('Enter the keyword to search for: ')
start_page = int(input('Enter the first page number: '))
end_page = int(input('Enter the last page number: '))

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/69.0.3497.100 Safari/537.36'
}

url = 'https://zhihu.sogou.com/zhihu'
for page in range(start_page, end_page + 1):
    params = {
        'query': word,
        'page': page,
        'ie': 'utf-8'
    }
    res = requests.get(url=url, headers=headers, params=params)
    print(res.url)
    file_name = word + str(page) + '.html'
    file_path = os.path.join('pages', file_name)
    with open(file_path, 'w', encoding='utf-8') as f:
        f.write(res.text)
    print('Page %s saved' % page)

Enter the keyword to search for: python
Enter the first page number: 1
Enter the last page number: 4
https://zhihu.sogou.com/zhihu?query=python&page=1&ie=utf-8
Page 1 saved
https://zhihu.sogou.com/zhihu?query=python&page=2&ie=utf-8
Page 2 saved
https://zhihu.sogou.com/zhihu?query=python&page=3&ie=utf-8
Page 3 saved
https://zhihu.sogou.com/zhihu?query=python&page=4&ie=utf-8
Page 4 saved
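
The loop above could also be wrapped in a function so that it can be reused and tried without network access. The following is only a sketch under assumptions: save_pages, FakeResponse and fake_get are hypothetical names, and the getter is passed in as an argument so that the demo can run offline with a stand-in for requests.get.

```python
import os
import tempfile
import time


def save_pages(get, url, word, start_page, end_page, out_dir, delay=0.0):
    """Fetch each page with get(url, params=...) and save it under out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for page in range(start_page, end_page + 1):
        res = get(url, params={'query': word, 'page': page, 'ie': 'utf-8'})
        res.raise_for_status()  # stop on HTTP errors instead of saving an error page
        path = os.path.join(out_dir, '%s%d.html' % (word, page))
        with open(path, 'w', encoding='utf-8') as f:
            f.write(res.text)
        saved.append(path)
        time.sleep(delay)  # optional pause between requests
    return saved


class FakeResponse:
    """Offline stand-in for requests.Response (hypothetical test double)."""

    def __init__(self, text):
        self.text = text

    def raise_for_status(self):
        pass


def fake_get(url, params=None):
    return FakeResponse('<html>page %d</html>' % params['page'])


paths = save_pages(fake_get, 'https://zhihu.sogou.com/zhihu', 'python', 1, 3,
                   tempfile.mkdtemp())
print([os.path.basename(p) for p in paths])
```

For a real run, you could pass functools.partial(requests.get, headers=headers, timeout=10) as the getter and set delay to a second or so to avoid hammering the site.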
