004 Python Web Crawling and Information Extraction: Requests Library in Practice
[A] Scraping a JD.com product page
Example code:

```python
import requests

url = 'https://item.jd.com/70076567438.html'
try:
    r = requests.get(url, timeout=30)
    r.raise_for_status()              # raise an exception for non-2xx status codes
    r.encoding = r.apparent_encoding  # guess the encoding from the page content
    print(r.text)
except requests.RequestException:
    print('Scraping failed')
```
[B] Scraping an Amazon product page
Example code:

```python
import requests

r = requests.get('https://www.amazon.cn/dp/B0785D5L1H/ref=sr_1_1?__mk_zh_CN=亚马逊网站'
                 '&dchild=1&keywords=极简&qid=1605500387&sr=8-1')
print(r.status_code)      # returns 503, i.e. this request failed
print(r.request.headers)  # the headers contain 'User-Agent': 'python-requests/2.24.0'
```
Analysis:
1. The returned status code tells us that this request failed.
2. Inspecting the request headers we sent (r.request.headers), we see that the User-Agent is python-requests/2.24.0. In other words, our crawler honestly tells the server: I am a crawler program.
3. By changing the User-Agent we can pass ourselves off as a regular browser and retrieve the resource.
Example code:

```python
import requests

kv = {'User-Agent': 'Mozilla/5.0'}
r = requests.get('https://www.amazon.cn/dp/B0785D5L1H/ref=sr_1_1?__mk_zh_CN=亚马逊网站'
                 '&dchild=1&keywords=极简&qid=1605500387&sr=8-1', headers=kv)
print(r.status_code)      # returns 200, i.e. this request succeeded
print(r.request.headers)  # the headers contain 'User-Agent': 'Mozilla/5.0'
```
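As a side note, instead of passing headers=kv on every call, a requests.Session can carry the User-Agent for all subsequent requests. A minimal sketch:

```python
import requests

# A Session sends its stored headers with every request it makes,
# so the User-Agent only needs to be set once.
s = requests.Session()
s.headers.update({'User-Agent': 'Mozilla/5.0'})
print(s.headers['User-Agent'])   # Mozilla/5.0
# r = s.get(url)  # any request made through s now carries this User-Agent
```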
[C] Submitting search keywords to Baidu and 360
Search engines generally expose a keyword-submission interface:
1. Baidu keyword search interface: http://www.baidu.com/s?wd=keyword
2. 360 keyword search interface: https://www.so.com/s?q=keyword
Example code:

```python
import requests

keyword = 'Python'
url = 'http://www.baidu.com/s'
try:
    kv = {'wd': keyword}
    r = requests.get(url, params=kv, timeout=30)
    print(r.request.url)   # the final URL, with the query string appended
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print('Sorry, scraping failed')
```
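The same framework should work for the 360 interface listed above; only the base URL and the parameter name change. A sketch, not verified against 360's current behavior:

```python
import requests

keyword = 'Python'
url = 'https://www.so.com/s'
try:
    kv = {'q': keyword}   # 360 uses 'q' where Baidu uses 'wd'
    r = requests.get(url, params=kv, timeout=30)
    print(r.request.url)
    r.raise_for_status()
    print(len(r.text))
except requests.RequestException:
    print('Sorry, scraping failed')
```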
[D] Scraping and saving images from the web
1. Identifying web images
A typical image link has the form http://www.example.com/picture.jpg,
i.e. a link ending in xxx.jpg, xxx.png, etc. points to an image.
2. Given an image URL, we can download the image and save it to disk.
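The extension check in point 1 can be sketched as a small helper (the function name and extension list here are my own choices); parsing out the URL path first keeps query strings from interfering:

```python
from os.path import splitext
from urllib.parse import urlparse

IMAGE_EXTS = {'.jpg', '.jpeg', '.png', '.gif', '.bmp'}

def is_image_url(url):
    """Return True if the URL's path ends in a common image extension."""
    path = urlparse(url).path          # drop scheme, host, and query string
    return splitext(path)[1].lower() in IMAGE_EXTS

print(is_image_url('http://www.example.com/picture.jpg'))   # True
print(is_image_url('http://www.example.com/index.html'))    # False
```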
Example code:

```python
import requests

path = 'C:/Users/Carrey/Desktop/abc.jpg'
url = 'https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000'\
      '&sec=1605513179028&di=9221230b2ef023a1e92f50105c1afad8&imgtype=0'\
      '&src=http%3A%2F%2Fpic1.win4000.com%2Fmobile%2Ff%2F53b4c394c966a.jpg'
r = requests.get(url)
print(r.status_code)
with open(path, 'wb') as f:   # the with block closes the file automatically
    f.write(r.content)        # r.content is the binary body of the response
print('Image scraped and saved')
```
Full image-scraping code:

```python
# full image-scraping code
import os
import requests

url = 'https://timgsa.baidu.com/timg?image&quality=80&size=b9999_10000'\
      '&sec=1605513179028&di=9221230b2ef023a1e92f50105c1afad8&imgtype=0'\
      '&src=http%3A%2F%2Fpic1.win4000.com%2Fmobile%2Ff%2F53b4c394c966a.jpg'
root = 'C:/Users/Carrey/Desktop/'
path = root + url[-10:]   # name the file after the last 10 characters of the URL
try:
    if not os.path.exists(root):
        os.mkdir(root)
    if not os.path.exists(path):
        r = requests.get(url)
        with open(path, 'wb') as f:
            f.write(r.content)
        print('File saved successfully!')
    else:
        print('File already exists')
except (requests.RequestException, OSError):
    print('So sorry, scraping the file failed')
```
[E] Automatic lookup of an IP address's location
We can look up the location of an IP address manually in a web page, or automate the lookup with a crawler.
Example code:

```python
import requests

IP = '202.204.80.112'
url = 'http://www.cip.cc/' + IP
kv = {'User-Agent': 'Mozilla/5.0'}
r = requests.get(url, headers=kv, timeout=3)
print(r.status_code)
print(r.request.headers)
print(r.text[-500:])
```
Full crawler-framework code:

```python
# full IP-lookup code
import requests

IP = '202.204.80.112'
url = 'http://www.cip.cc/' + IP   # append the IP to the query URL
try:
    kv = {'User-Agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=kv, timeout=3)
    r.raise_for_status()
    r.encoding = r.apparent_encoding
    print(r.text[-300:])
except requests.RequestException:
    print('So sorry, the lookup failed')
```
[Tip] To discover a site's query API, observe how the URL changes as you interact with the site.
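That observation is exactly what the Baidu example above exploits: requests builds the ?wd=keyword query string from the params dict. This can be checked offline with a prepared request, without sending anything over the network:

```python
import requests

# Build the request without sending it; the prepared URL shows the
# query string that requests assembles from the params dict.
req = requests.Request('GET', 'http://www.baidu.com/s', params={'wd': 'Python'})
prepared = req.prepare()
print(prepared.url)   # http://www.baidu.com/s?wd=Python
```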