二、爬虫的基本流程

流程图解

二、模拟请求模块,requests模块

1)request模块所支持的请求

import requests
requests.get("http://httpbin.org/get")
requests.post("http://httpbin.org/post")
requests.put("http://httpbin.org/put")
requests.delete("http://httpbin.org/delete")
requests.head("http://httpbin.org/get")
requests.options("http://httpbin.org/get")

2)get请求

import requests
response=requests.get('https://www.jd.com/',)
 
with open("jd.html","wb") as f:
    f.write(response.content)

3)含参数的get请求

import requests
response=requests.get('https://s.taobao.com/search?q=手机')
response=requests.get('https://s.taobao.com/search',params={"q":"美女"})

4)携带请求头的get请求

import requests
response=requests.get('https://dig.chouti.com/',
                      headers={
                         'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/62.0.3202.75 Safari/537.36',
                              }
                      )

5)综合,携带参数,请求头,数据的post请求。一般发送数据,都以post请求

import requests

res=requests.post('https://www.lagou.com/jobs/positionAjax.json',
             headers={
                    'Referer':"https://www.lagou.com/jobs/list_python",
                     'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36',

             },
             data={
                 'first':True,
                 'pn':2,
                 'kd':'java高级开发'
             },
             params={
                 'gj': '3年及以下',
                 'px': 'default',
                 'yx': '25k-50k',
                 'city': '北京',
                 'needAddtionalResult': False,
                 'isSchoolJob': 0
             }
             )

comapines_list=r6.json()

print(comapines_list)

三、request对象的属性方法

1)request的常见属性

import requests
respone=requests.get('https://sh.lianjia.com/ershoufang/')
# respone属性
print(respone.text)
print(respone.content)
print(respone.status_code)
print(respone.headers)
print(respone.cookies)
print(respone.cookies.get_dict())
print(respone.cookies.items())
print(respone.url)
print(respone.history)
print(respone.encoding)    

2)编码问题

import requests
response=requests.get('http://www.autohome.com/news')
#response.encoding='gbk' #汽车之家网站返回的页面内容为gb2312编码的,而requests的默认编码为ISO-8859-1,如果不设置成gbk则中文乱码
with open("res.html","w") as f:
    f.write(response.text)

3)下载二进制文件(图片,视频,音频等)

import requests
response=requests.get('http://bangimg1.dahe.cn/forum/201612/10/200447p36yk96im76vatyk.jpg')
with open("res.png","wb") as f:
    # f.write(response.content) # 比如下载视频时,如果视频100G,用response.content然后一下子写到文件中是不合理的
    for line in response.iter_content():
        f.write(line)

 4)解析json数据

import requests
import json
 
response=requests.get('http://httpbin.org/get')
res1=json.loads(response.text) #太麻烦
res2=response.json() #直接获取json数据
print(res1==res2)

5)redirection and history

 默认情况下,除了 HEAD, Requests 会自动处理所有重定向。可以使用响应对象的 history 方法来追踪重定向。Response.history 是一个 Response 对象的列表,为了完成请求而创建了这些对象。这个对象列表按照从最老到最近的请求进行排序。

>>> r = requests.get('http://github.com')
>>> r.url
'https://github.com/'
>>> r.status_code
200
>>> r.history
[<Response [301]>]

另外,还可以通过 allow_redirects 参数禁用重定向处理:

>>> r = requests.get('http://github.com', allow_redirects=False)
>>> r.status_code
301
>>> r.history
[]  

 

原文链接:https://www.cnblogs.com/yuanchenqi/articles/9449430.html

posted on 2018-12-24 10:56  可口_可乐  阅读(142)  评论(0编辑  收藏  举报