Python Crawler 2

import requests

response=requests.get("https://www.baidu.com")
#print(response)
#print(type(response))
print(response.text)                     # page source as text
print(response.encoding)                 # encoding requests detected
print(response.content.decode("utf-8"))  # raw bytes decoded manually
 
r.text returns the page source as a string
r.content returns the raw bytes of the source; .decode(encoding) decodes those bytes with the given encoding
r.encoding returns the encoding requests detected; if the detection is wrong, the text comes out garbled
r.status_code returns the HTTP status code
print(response.status_code)  # prints the status code, 200 here
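
If the detected encoding is wrong and r.text comes back garbled, the encoding can be overridden before reading the text. A minimal sketch (assuming the page is actually UTF-8):

response=requests.get("https://www.baidu.com")
print(response.encoding)     # whatever requests guessed from the headers
response.encoding="utf-8"    # tell requests the real encoding
print(response.text[:200])   # now decodes correctly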
requests.get() parameters
  • url: the address to request
  • params: query parameters to append to the URL
  • headers: request headers to send with the request
response=requests.get("http://www.antvv.com/?cate=4")
print(response.text)

a={"cate":"4"}  # query parameters, e.g. the ?cate=4 from the URL above
response=requests.get("http://www.antvv.com",params=a)
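
To see where the params end up, response.url shows the final address requests actually fetched. A small sketch reusing the cate parameter from above:

# params are URL-encoded and appended to the address as the query string
response=requests.get("http://www.antvv.com",params={"cate":"4"})
print(response.url)  # e.g. http://www.antvv.com/?cate=4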
A site for testing HTTP requests:
http://httpbin.org/get echoes back the details of a GET request (headers, origin IP, etc.); http://httpbin.org/post does the same for POST
response=requests.get("http://httpbin.org/get")
print(response.text)

The output is:

D:\ProgramData\Anaconda3\python.exe "E:/WXA/PyCharm study/爬虫介绍和基础库/demo1_requests请求.py"
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.19.1", 
    "X-Amzn-Trace-Id": "Root=1-5e9fd4f8-4e3d91cc100f2c6674d3c0b2"
  }, 
  "origin": "124.64.16.230", 
  "url": "http://httpbin.org/get"
}


Process finished with exit code 0

In the User-Agent field you can see that the request identifies itself as a crawler; at this point you can define a custom headers dict.

Customizing the User-Agent

user-agent: how the server identifies which browser the client is using; if you do not set it, it defaults to python-requests, and it can be overridden with the headers parameter

referer: the address of the previous page, i.e. the page you navigated to the current one from; some sites reject requests whose referer is wrong

headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400",
    "Referer":"http://httpbin.org"
}
response=requests.get("http://httpbin.org/get",headers=headers)
print(response.text)

The output is:

D:\ProgramData\Anaconda3\python.exe "E:/WXA/PyCharm study/爬虫介绍和基础库/demo1_requests请求.py"
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400", 
    "X-Amzn-Trace-Id": "Root=1-5e9fd5d3-e0944316a8c4783b8e08fd2e"
  }, 
  "origin": "124.64.16.230", 
  "url": "http://httpbin.org/get"
}

Now the User-Agent no longer reveals that the request came from python-requests.

  • stream: streamed transfer
# download an image
url="https://ss3.bdstatic.com/70cFv8Sh_Q1YnxGkpoWK1HF6hhy/it/u=1208538952,1443328523&fm=26&gp=0.jpg"
r=requests.get(url,headers=headers)
print(r.content)
with open("1.jpg",'wb') as file:
    file.write(r.content)
# download the same image with streaming
url="https://ss3.bdstatic.com/70cFv8Sh_Q1YnxGkpoWK1HF6hhy/it/u=1208538952,1443328523&fm=26&gp=0.jpg"
r=requests.get(url,headers=headers,stream=True)
# print(r.content)
with open("1.jpg",'wb') as file:
    for j in r.iter_content(102400):  # read the body in 102400-byte chunks
        file.write(j)
        print(j)
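
For larger files it is also common to wrap the streamed response in a with block so the connection is released promptly; a sketch of that pattern (not from the original notes), reusing the image url and headers from above:

with requests.get(url,headers=headers,stream=True) as r:
    with open("1.jpg",'wb') as file:
        for chunk in r.iter_content(chunk_size=102400):
            if chunk:              # skip empty keep-alive chunks
                file.write(chunk)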
  • timeout: sets a timeout in seconds; if the request takes longer, an exception is raised
url="https://www.zhihu.com"
try:
    r=requests.get(url,timeout=2)
    print(r.text)
except requests.exceptions.Timeout:  # catch the specific timeout error rather than BaseException
    print("timed out")
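
timeout also accepts a (connect, read) tuple if you want separate limits for establishing the connection and reading the response; a sketch:

try:
    # 2 seconds to connect, 5 seconds to read the body
    r=requests.get("https://www.zhihu.com",timeout=(2,5))
    print(r.status_code)
except requests.exceptions.Timeout:
    print("timed out")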
  • proxies: route requests through a proxy
# proxies: send the request through a proxy, per scheme
url="http://httpbin.org/get"
proxies={
    "http":"182.35.84.181:9999",
    "https":"",
}
r=requests.get(url,proxies=proxies)
print(r.text)  # the "origin" field shows the proxy's IP when it works
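
If a proxy requires authentication, the credentials go into the proxy URL itself; the host, port, and credentials below are placeholders, not a working proxy:

proxies={
    # hypothetical authenticated proxy, for illustration only
    "http":"http://user:password@10.10.1.10:3128",
    "https":"http://user:password@10.10.1.10:3128",
}
r=requests.get("http://httpbin.org/get",proxies=proxies)
print(r.text)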
  • SSL
    verify=False skips certificate verification; set it if you run into an SSLError (12306 no longer requires this workaround)
import requests
response=requests.get('http://www.12306.cn',verify=False)
print(response.status_code)
print(response.content.decode('utf-8'))
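
With verify=False, urllib3 prints an InsecureRequestWarning for every request; it can be silenced explicitly. A sketch:

import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)  # hide the verify=False warning
response=requests.get('http://www.12306.cn',verify=False)
print(response.status_code)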
  • JSON-formatted responses
url="http://httpbin.org/get"
r=requests.get(url)
resp_str=r.text
print(resp_str)
print(type(resp_str))

Output:

D:\ProgramData\Anaconda3\python.exe "E:/WXA/PyCharm study/爬虫介绍和基础库/demo1_requests请求.py"
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.19.1", 
    "X-Amzn-Trace-Id": "Root=1-5ea13127-3bf7c712fb636862fd58c91c"
  }, 
  "origin": "117.136.0.252", 
  "url": "http://httpbin.org/get"
}

<class 'str'>
  • json.loads()
    json.loads() converts a JSON string into a Python dict or list
url="http://httpbin.org/get"
r=requests.get(url)
resp_str=r.text

import json
resp_dict=json.loads(resp_str)  # json.loads() converts a JSON string into a Python dict or list
print(resp_dict)
print(type(resp_dict))
print(resp_dict['url'])

Output:

D:\ProgramData\Anaconda3\python.exe "E:/WXA/PyCharm study/爬虫介绍和基础库/demo1_requests请求.py"
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.19.1', 'X-Amzn-Trace-Id': 'Root=1-5ea134f7-dc8428a433fcf066a2fde876'}, 'origin': '117.136.0.252', 'url': 'http://httpbin.org/get'}
<class 'dict'>
http://httpbin.org/get

Process finished with exit code 0
  • json.dumps()
print(json.dumps({"name":"tom",'age':18,'sex':"male"}))  # json.dumps(obj) converts a Python dict or list into a JSON string
print(type(json.dumps({"name":"tom",'age':18,'sex':"male"})))
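
One detail worth knowing when dumping Chinese text: json.dumps escapes non-ASCII characters by default, and ensure_ascii=False keeps them readable. A quick sketch:

import json
print(json.dumps({"name":"汤姆"}))                     # {"name": "\u6c64\u59c6"}
print(json.dumps({"name":"汤姆"},ensure_ascii=False))  # {"name": "汤姆"}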
  • r.json() parses the JSON response body; this is the JSON parsing built into requests
url="http://httpbin.org/get"
r=requests.get(url)
resp_dict=r.json()  # r.json() returns a dict directly, no json.loads() needed
print(resp_dict)

Output:

D:\ProgramData\Anaconda3\python.exe "E:/WXA/PyCharm study/爬虫介绍和基础库/demo1_requests请求.py"
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.19.1', 'X-Amzn-Trace-Id': 'Root=1-5ea13733-900d16cad13db5581550d818'}, 'origin': '117.136.0.252', 'url': 'http://httpbin.org/get'}

Process finished with exit code 0
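
Note that r.json() raises a ValueError when the body is not valid JSON (an HTML error page, for example), so it can be worth guarding the call; a sketch:

r=requests.get("http://httpbin.org/get")
try:
    resp_dict=r.json()
    print(resp_dict['url'])
except ValueError:  # body was not valid JSON
    print("response is not JSON:",r.text[:100])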
  • POST requests: used to upload images and other data to the server
# POST request
url="http://httpbin.org/post"
data={
    "uname":"admin",
    "upwd":"123456"  # field names from the login form at http://www.antvv.com/login/login.html (view page source)
}
r=requests.post(url,data=data)
print(r.text)
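
Besides form data, requests can also send a JSON body directly via the json= parameter, which serializes the dict and sets the Content-Type header automatically; a sketch (httpbin echoes it back under "json" instead of "form"):

r=requests.post("http://httpbin.org/post",json={"uname":"admin","upwd":"123456"})
print(r.text)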
  • files: send files to the server
# POST with a file upload
url="http://httpbin.org/post"
data={
    "uname":"admin",
    "upwd":"123456"  # same form fields as above
}
# files sends a file to the server as a multipart upload
files={
    "img1":open("./1.jpg",'rb')
}
r=requests.post(url,data=data,files=files)
print(r.text)
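
The file value can also be a (filename, file object, content type) tuple when you want to control the uploaded filename and MIME type explicitly; a sketch:

files={
    "img1":("1.jpg",open("./1.jpg",'rb'),"image/jpeg")
}
r=requests.post("http://httpbin.org/post",files=files)
print(r.text)  # httpbin echoes the upload under the "files" key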

 

 