Day 38: Web Scraping with the requests Module

Introduction

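Web scrapers written in Python usually send their network requests with one of two modules: urllib or requests. urllib is the older of the two and is fairly tedious and awkward to use. Once requests appeared it quickly displaced urllib, so in this course we recommend using requests.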

Requests is the only Non-GMO HTTP library for Python, safe for human consumption.

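Warning: recreational use of other HTTP libraries may result in dangerous side effects, including: security vulnerabilities, verbose code, reinventing the wheel, constantly reading documentation, depression, headaches, or even death.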

What is requests

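requests is a third-party Python library for sending network requests (it is not part of the standard library, so it has to be installed with pip). Its main job is to simulate a browser sending requests. It is powerful, and its API is concise and efficient, which is why it holds such a dominant position in the scraping world.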
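To make "simulating a browser" a little more concrete, here is a minimal sketch that compares the User-Agent requests sends by default with a browser-like one passed through `headers`. It only prepares the request and sends nothing over the network. The UA string is the one used in the KFC example in this post; the variable names are just for illustration.

```python
import requests

# Browser-like User-Agent (the same string used in the KFC example in this post).
BROWSER_UA = (
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36"
)

session = requests.Session()

# Without custom headers, requests identifies itself as "python-requests/<version>".
default_ua = session.headers["User-Agent"]

# Build (but do not send) a request whose headers override the User-Agent.
request = requests.Request(
    "GET", "http://www.baidu.com/baidu", headers={"User-Agent": BROWSER_UA}
)
prepared = session.prepare_request(request)

print("default:", default_ua)
print("spoofed:", prepared.headers["User-Agent"])
```

Some sites treat the default `python-requests/...` identifier differently from a browser, which is presumably why the examples in this post set the header explicitly.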

Why use the requests module

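Working with urllib is inconvenient in several ways:

1. URL encoding has to be handled by hand

2. POST request parameters have to be handled by hand

3. Handling cookies and proxies is tedious

With requests:

1. URL encoding is handled automatically

2. POST request parameters are handled automatically

3. Cookie and proxy handling is simplified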
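The first two points can be checked without sending anything. The sketch below prepares a GET and a POST request offline and compares them with what you would build by hand using `urllib.parse.urlencode`. The search keyword and the form fields are illustrative values, not part of any real API contract.

```python
from urllib.parse import urlencode

import requests

search_url = "http://www.baidu.com/baidu"
params = {"wd": "爬虫"}

# What urllib users do by hand: percent-encode the query string themselves.
manual_url = search_url + "?" + urlencode(params)

# requests performs the same encoding when given a plain dict via `params`.
prepared_get = requests.Request("GET", search_url, params=params).prepare()

# For POST, a dict passed as `data` is form-encoded into the request body.
store_url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx"
form = {"keyword": "杭州", "pageIndex": 1}
prepared_post = requests.Request("POST", store_url, data=form).prepare()

print(prepared_get.url)
print(prepared_post.body)
print(prepared_post.headers["Content-Type"])
```

For the third point, a `requests.Session` object keeps cookies between calls, and the request functions accept a `proxies` argument; neither is shown here because both need a live server to demonstrate.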

How to use the requests module

Installation: pip install requests

Workflow / coding steps

1. Specify the URL

2. Send the request with requests

3. Read the data from the response object

4. Persist the data to storage
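The four steps above can be sketched as one small function. This is only one way to arrange them: `fetch_and_save` and its `session` parameter are names made up for this illustration (the parameter exists so the function can be exercised without a network), and the 10-second timeout is an arbitrary choice.

```python
import requests


def fetch_and_save(url, out_path, params=None, headers=None, session=None):
    """Run the four steps: take a URL, send the request, read the body, save it."""
    # Step 1 is the `url` argument. `session` is an illustrative hook for testing.
    http = session or requests.Session()

    # Step 2: send the request; the timeout value is an arbitrary choice.
    response = http.get(url, params=params, headers=headers, timeout=10)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses

    # Step 3: read the decoded body of the response.
    page_text = response.text

    # Step 4: persist it to disk.
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(page_text)
    return len(page_text)
```

Calling `fetch_and_save('http://www.baidu.com/baidu', 'result.html', params={'wd': 'python'})` would send a real request and write the page to `result.html`.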

Examples: scraper programs

Example 1: a simple web page collector

import requests

# Read the search keyword from the user
wd = input('>>>')
param = {
    'wd': wd
}
url = 'http://www.baidu.com/baidu'
# Spoof the User-Agent so the request looks like it comes from a browser
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36 OPR/67.0.3575.115 (Edition B2)'
}
info = requests.get(url=url, params=param, headers=header)
info_text = info.text
# Save the page as <keyword>.html on the desktop
with open(r'C:\Users\Administrator\Desktop\%s.html' % wd, 'w', encoding='utf-8') as f:
    f.write(info_text)
print('Scraping finished')

Example 2: KFC store information

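import requests
import json

# Collects the parsed response of every page
info = []


def kfc(num):
    url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx"

    # data is a dict, so requests sends it as a form-encoded POST body.
    # If an endpoint expects a JSON body instead, pass json=data
    # (or serialize it yourself with json.dumps()).
    data = {
        "op": "keyword",
        'cname': '',
        'pid': '',
        'keyword': '杭州',
        'pageIndex': num,
        'pageSize': '10',
    }
    header = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
    }

    # The endpoint answers with JSON; parse it into Python objects
    req = requests.post(url=url, data=data, headers=header).json()
    info.append(req)
    print(req)


# Fetch pages 1 through 9
for i in range(1, 10):
    kfc(i)

# Write all pages as one JSON document; 'w' (rather than 'a') keeps the
# file valid JSON when the script is run more than once
with open(r'C:\Users\Administrator\Desktop\KFC.json', 'w', encoding='utf-8') as txt:
    json.dump(info, fp=txt, ensure_ascii=False)
print('over')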
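A note on the `data` argument used in the KFC example: as far as I understand requests, a dict passed as `data` goes out as a form-encoded body, while a dict passed as `json` is serialized to a JSON body. The sketch below prepares both variants offline so the difference is visible. The payload is a trimmed copy of the one above, and whether this particular endpoint would accept a JSON body is not something this post verifies.

```python
import json

import requests

url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx"
payload = {"keyword": "杭州", "pageIndex": 1, "pageSize": "10"}

# data=<dict>: form-encoded body, like an HTML form submission.
as_form = requests.Request("POST", url, data=payload).prepare()

# json=<dict>: the body is a JSON document and Content-Type is application/json.
as_json = requests.Request("POST", url, json=payload).prepare()

print(as_form.headers["Content-Type"], as_form.body)
print(as_json.headers["Content-Type"], as_json.body)

# The JSON body round-trips back to the original dict.
decoded = json.loads(as_json.body)
```

The example above uses the form-encoded variant, presumably because that is what the browser sends to this endpoint.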

Example 3: cosmetics production license information

import requests
import json

license_ids = []  # renamed from `id` to avoid shadowing the built-in
info = []
url = 'http://125.35.6.84:81/xk/itownet/portalAction.do'
header = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36",
    # 'Cookie': 'JSESSIONID=02AF3EF8CBE74529A7F6231987EE1A6A; JSESSIONID=64B83D7B541CEED78E13CF74B321D7A0'
}

# Step 1: walk list pages 1 through 5 and collect the ID of every license
for page in range(1, 6):
    data = {
        'method': 'getXkzsList',
        'on': 'true',
        'page': str(page),
        'pageSize': '15',
        'productName': '',
        'conditionType': '1',
        'applyname': '',
        'applysn': ''
    }

    req_id = requests.post(url=url, data=data, headers=header).json()
    for item in req_id['list']:
        license_ids.append(item['ID'])

# Step 2: request the detail record for each collected ID
for license_id in license_ids:
    data = {
        'method': 'getXkzsById',
        'id': license_id
    }
    req_info = requests.post(url=url, data=data, headers=header).json()
    info.append(req_info)

# Write all records as one JSON document; 'w' (rather than 'a') keeps the
# file valid JSON when the script is run more than once
with open(r'C:\Users\Administrator\Desktop\juqing.json', 'w', encoding='utf-8') as txt:
    json.dump(info, txt, ensure_ascii=False)
print('over')

posted @ 2020-04-11 11:51 亦双弓
