Web scraping with requests
Naming conventions
- module_name, module
- package_name, package
- ClassName, class
- method_name, method
- ExceptionName, exception
- function_name, function
- GLOBAL_VAR_NAME, global variable
- instance_var_name, instance variable
- function_parameter_name, function parameter
- local_var_name, local variable
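A minimal sketch showing these conventions in one place (my own illustration, not code from the notes):

# module scraper_utils.py                      (module_name)
MAX_RETRIES = 3                                # GLOBAL_VAR_NAME

class FetchError(Exception):                   # ExceptionName
    pass

class PageFetcher:                             # ClassName
    def __init__(self, base_url):              # function_parameter_name
        self.base_url = base_url               # instance_var_name

    def build_url(self, path):                 # method_name
        full_url = self.base_url + path        # local_var_name
        return full_url

def fetch_page(fetcher, path):                 # function_name
    return fetcher.build_url(path)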
Scraping images
Just send a GET request to the image URL and write the binary response content to a file.
import requests

photo_url = 'https://wallpapers.wallhaven.cc/wallpapers/full/wallhaven-685513.jpg'
response_get = requests.get(photo_url)
# write the binary image content to a local file
with open('wallhaven-685513.jpg', 'wb') as f:
    f.write(response_get.content)
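For larger files, the same request can be streamed so the whole image never sits in memory at once; this uses standard requests features (stream=True with iter_content), with an arbitrary chunk size and output filename:

import requests

photo_url = 'https://wallpapers.wallhaven.cc/wallpapers/full/wallhaven-685513.jpg'
# stream=True defers downloading the body until it is iterated
with requests.get(photo_url, stream=True) as response_get:
    with open('wallhaven-685513-stream.jpg', 'wb') as f:
        # write the file in 8 KB chunks
        for chunk in response_get.iter_content(chunk_size=8192):
            f.write(chunk)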
Baidu Translate
Baidu's suggestion interface expects a fixed form field named kw; POST the request headers and the kw word to the Baidu Translate endpoint and decode the response as UTF-8.
import requests
import json

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0'
}

kw = {
    'kw': 'wolf'
}

response_post = requests.post('http://fanyi.baidu.com/sug', headers=headers, data=kw)
response_post.encoding = 'utf-8'
info = json.loads(response_post.text)
print(info)
# each suggestion's 'v' field holds translations separated by '; '
for i in info['data'][0]['v'].split('; '):
    print(i)
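The JSON handling can also lean on requests' own helpers instead of a separate json.loads call; a variant of the example above using raise_for_status() and .json(), keeping only the 'data' and 'v' fields that the original response shows:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0'
}
response_post = requests.post('http://fanyi.baidu.com/sug', headers=headers, data={'kw': 'wolf'})
# turn 4xx/5xx responses into exceptions instead of failing silently
response_post.raise_for_status()
# .json() parses the response body directly
info = response_post.json()
for entry in info.get('data', []):
    print(entry['v'])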
Scraping behind a login
To scrape pages that sit behind a login, copy the Set-Cookie / Cookie value from a logged-in session into the request headers; be aware that the site may rate-limit you.
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0',
    # 'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Mobile Safari/537.36',
    'Cookie': 'session_id_places=True; session_data_places=""'
}

r = requests.get('http://example.webscraping.com', headers=headers)
print(r.text)
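Instead of pasting a cookie header by hand, requests.Session keeps cookies across requests automatically. A sketch under the assumption that the site has a login form at /user/login with email and password fields (those names are hypothetical, not taken from the notes):

import requests

session = requests.Session()
session.headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0',
})

# hypothetical login endpoint and form field names; adjust to the real site
login_data = {'email': 'user@example.com', 'password': 'secret'}
session.post('http://example.webscraping.com/user/login', data=login_data)

# the session now carries whatever cookies the login response set
r = session.get('http://example.webscraping.com')
print(r.status_code)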
Proxy servers
Fetch the Baidu homepage through a proxy server (the proxy entry must specify the HTTP scheme and port number) by passing the proxies and headers with the GET request.
import requests

# the proxy entry must include the scheme and port, e.g. 'http://ip:port'
proxies = {'http': 'http://ip:port'}
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0',
}
r = requests.get('http://www.baidu.com', proxies=proxies, headers=headers)
# print(r.status_code)  # HTTP status code
# print(r.text)         # response body as text
# print(r.content)      # raw bytes (use when text has decoding issues)
# print(r.headers)      # response headers
# print(r.url)          # final request URL
# print(r.cookies)      # cookies returned by the server
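Because proxies are often slow or dead, it helps to add a timeout and catch request errors; a sketch with an arbitrary 5-second timeout and the same placeholder proxy:

import requests

proxies = {'http': 'http://ip:port'}  # placeholder; substitute a real proxy address
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:61.0) Gecko/20100101 Firefox/61.0',
}

try:
    # timeout=5 raises instead of hanging on an unresponsive proxy
    r = requests.get('http://www.baidu.com', proxies=proxies, headers=headers, timeout=5)
    print(r.status_code)
except requests.exceptions.RequestException as e:
    print('request failed:', e)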