Learning Python Web Scraping (2): The requests Library

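I. The urllib library

1. About urllib

urllib is Python's built-in HTTP request library.

It consists of the following modules:

urllib.request: the request module

urllib.error: the exception-handling module

urllib.parse: the URL parsing module

urllib.robotparser: the robots.txt parsing module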
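As a small illustration of the urllib.parse module listed above, the sketch below splits a URL into its parts and builds a query string. It uses only the standard library; the sample URL is borrowed from the httpbin examples later in this post, and the variable names are just for illustration.

```python
from urllib.parse import urlparse, parse_qs, urlencode

# Split a URL into scheme, host, path and query string.
parts = urlparse("http://httpbin.org/get?name=zhaofan&age=23")
print(parts.scheme)   # http
print(parts.netloc)   # httpbin.org
print(parts.path)     # /get
print(parts.query)    # name=zhaofan&age=23

# Turn the query string into a dict of lists.
query = parse_qs(parts.query)
print(query)          # {'name': ['zhaofan'], 'age': ['23']}

# Go the other way: build a query string from a dict.
encoded = urlencode({"name": "zhaofan", "age": 23})
print(encoded)        # name=zhaofan&age=23
```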

II. The requests library

1. Basic usage

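import requests

url = "http://httpbin.org/get"  # example URL; substitute the page you want to fetch
response = requests.get(url)

print(type(response))  # <class 'requests.models.Response'>

print(response.status_code)  # HTTP status code, e.g. 200
print(response.cookies)  # cookies set by the server

print(response.text)  # body decoded to str

print(response.content)  # body as raw bytes
print(response.content.decode("utf-8"))  # raw bytes decoded as UTF-8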

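Note:

In many cases response.text comes out garbled because requests guessed the wrong encoding. A common workaround is to take response.content, which returns the body as raw bytes, and convert it with decode("utf-8").

Alternatively, set the encoding explicitly before reading response.text: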

response = requests.get(url)

response.encoding = 'utf-8'  # tell requests how to decode the body
print(response.text)
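To see where the garbled text comes from, the following sketch reproduces the problem without any network access. It assumes the server sent UTF-8 bytes and that the wrong guess was ISO-8859-1, which is one common case (requests falls back to it for text responses whose headers declare no charset). The sample string is made up.

```python
text = "爬虫"                       # what the page really says
raw = text.encode("utf-8")          # the bytes a UTF-8 server would send (like response.content)

garbled = raw.decode("iso-8859-1")  # decoding with the wrong encoding gives mojibake
fixed = raw.decode("utf-8")         # decoding with the right one restores the text

print(repr(garbled))
print(fixed)
```

Setting response.encoding = 'utf-8' has the same effect as the second decode: it tells requests which encoding to use when building response.text.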

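2. Requests

GET requests

  (1) Basic GET request

  (2) GET request with parameters

    Append the parameters to the URL as a query string, in the form get?key=val: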

response = requests.get("http://httpbin.org/get?name=zhaofan&age=23")

print(response.text)

    Pass the parameters through the params keyword argument:

data = {
    "name": "zhaofan",
    "age": 22
}

response = requests.get("http://httpbin.org/get",params=data)
print(response.url)
print(response.text)

  Parsing JSON: response.json() parses the body the same way json.loads() does, so the two calls below give the same result.

import json
import requests

response = requests.get("http://httpbin.org/get")

print(response.json())

print(json.loads(response.text))

  Adding headers: some sites (Zhihu, for example) refuse requests sent with the default requests headers.

Enter chrome://version in Chrome's address bar to see your user agent string, then add it to the request headers:

import requests
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
}

response = requests.get("https://www.zhihu.com",headers=headers)

print(response.text)
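A likely reason such sites turn the request away is the default User-Agent, which identifies the client as python-requests. The sketch below compares the default with an overridden header, prepared locally so that no request is actually sent. The shortened User-Agent value is a placeholder, not a real browser string.

```python
import requests
from requests.utils import default_user_agent

# What requests sends when no User-Agent is supplied.
print(default_user_agent())  # python-requests/<version>

headers = {"User-Agent": "Mozilla/5.0 (placeholder browser string)"}

# Prepare the request locally to inspect the headers that would be sent.
prepared = requests.Request("GET", "https://www.zhihu.com", headers=headers).prepare()
print(prepared.headers["User-Agent"])
```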

POST requests

Pass the form data through the data parameter:

import requests
data = {
    "name": "zhaofan",
    "age": 23
}

response = requests.post("http://httpbin.org/post",data=data)

print(response.text)
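For a dict passed as data, requests form-encodes the fields into the request body. The sketch below prepares the same POST locally (nothing is sent) to show the resulting body and Content-Type header.

```python
import requests

data = {"name": "zhaofan", "age": 23}

# Prepare the POST without sending it.
prepared = requests.Request("POST", "http://httpbin.org/post", data=data).prepare()

print(prepared.body)                     # name=zhaofan&age=23
print(prepared.headers["Content-Type"])  # application/x-www-form-urlencoded
```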

Responses

The response object exposes many useful attributes:

import requests

response = requests.get("http://www.baidu.com")

print(response.status_code)
print(response.headers)
print(response.cookies)
print(response.url)
print(response.history)

Checking the status code

requests.codes maps readable names to status codes, for example:

202: accepted

404: not_found
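These names are attributes of requests.codes, so a status check can compare against them instead of bare numbers. A minimal sketch follows; describe_status is a hypothetical helper written for this example, not part of requests.

```python
import requests

def describe_status(status_code):
    """Hypothetical helper: map a status code to a short label."""
    if status_code == requests.codes.ok:
        return "success"
    if status_code == requests.codes.accepted:
        return "accepted"
    if status_code == requests.codes.not_found:
        return "not found"
    return "other"

print(describe_status(202))  # accepted
print(describe_status(404))  # not found
```

In real code the argument would be response.status_code from an actual request.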

Posted 2020-05-05 12:12 by cola_cola
