Python Crawler 2

import requests

response=requests.get("https://www.baidu.com")
#print(response)
#print(type(response))
print(response.text)                     # page source as text
print(response.encoding)                 # encoding requests detected
print(response.content.decode("utf-8"))  # raw bytes decoded manually
 
r.text returns the page source as a string
r.content returns the raw bytes of the source; .decode(encoding) decodes those bytes with the given encoding
r.encoding returns the encoding requests detected; if the detection is wrong, the text comes out garbled
r.status_code returns the HTTP status code
print(response.status_code)  # prints the status code, 200 here
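
If the detected encoding is wrong and r.text comes back garbled, the encoding can be overridden before reading the text. A minimal sketch (assuming the page is actually UTF-8):

response=requests.get("https://www.baidu.com")
print(response.encoding)     # whatever requests guessed from the headers
response.encoding="utf-8"    # tell requests the real encoding
print(response.text[:200])   # now decodes correctly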
requests.get() parameters
  • url: the address to request
  • params: query parameters to append to the URL
  • headers: request headers to send with the request
response=requests.get("http://www.antvv.com/?cate=4")
print(response.text)

a={"cate":"4"}  # query parameters, e.g. the ?cate=4 from the URL above
response=requests.get("http://www.antvv.com",params=a)
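
To see where the params end up, response.url shows the final address requests actually fetched. A small sketch reusing the cate parameter from above:

# params are URL-encoded and appended to the address as the query string
response=requests.get("http://www.antvv.com",params={"cate":"4"})
print(response.url)  # e.g. http://www.antvv.com/?cate=4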
A site for testing HTTP requests:
http://httpbin.org/get echoes back the details of a GET request (headers, origin IP, etc.); http://httpbin.org/post does the same for POST
response=requests.get("http://httpbin.org/get")
print(response.text)

The output is:

D:\ProgramData\Anaconda3\python.exe "E:/WXA/PyCharm study/爬虫介绍和基础库/demo1_requests请求.py"
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.19.1", 
    "X-Amzn-Trace-Id": "Root=1-5e9fd4f8-4e3d91cc100f2c6674d3c0b2"
  }, 
  "origin": "124.64.16.230", 
  "url": "http://httpbin.org/get"
}


Process finished with exit code 0

In the User-Agent field you can see that the request identifies itself as a crawler; at this point you can define a custom headers dict.

Customizing the User-Agent

user-agent: how the server identifies which browser the client is using; if you do not set it, it defaults to python-requests, and it can be overridden with the headers parameter

referer: the address of the previous page, i.e. the page you navigated to the current one from; some sites reject requests whose referer is wrong

headers={
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400",
    "Referer":"http://httpbin.org"
}
response=requests.get("http://httpbin.org/get",headers=headers)
print(response.text)

The output is:

D:\ProgramData\Anaconda3\python.exe "E:/WXA/PyCharm study/爬虫介绍和基础库/demo1_requests请求.py"
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3741.400 QQBrowser/10.5.3863.400", 
    "X-Amzn-Trace-Id": "Root=1-5e9fd5d3-e0944316a8c4783b8e08fd2e"
  }, 
  "origin": "124.64.16.230", 
  "url": "http://httpbin.org/get"
}

Now the User-Agent no longer reveals that the request came from python-requests.

  • stream: streamed transfer
# download an image
url="https://ss3.bdstatic.com/70cFv8Sh_Q1YnxGkpoWK1HF6hhy/it/u=1208538952,1443328523&fm=26&gp=0.jpg"
r=requests.get(url,headers=headers)
print(r.content)
with open("1.jpg",'wb') as file:
    file.write(r.content)
# download the same image with streaming
url="https://ss3.bdstatic.com/70cFv8Sh_Q1YnxGkpoWK1HF6hhy/it/u=1208538952,1443328523&fm=26&gp=0.jpg"
r=requests.get(url,headers=headers,stream=True)
# print(r.content)
with open("1.jpg",'wb') as file:
    for j in r.iter_content(102400):  # read the body in 102400-byte chunks
        file.write(j)
        print(j)
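
For larger files it is also common to wrap the streamed response in a with block so the connection is released promptly; a sketch of that pattern (not from the original notes), reusing the image url and headers from above:

with requests.get(url,headers=headers,stream=True) as r:
    with open("1.jpg",'wb') as file:
        for chunk in r.iter_content(chunk_size=102400):
            if chunk:              # skip empty keep-alive chunks
                file.write(chunk)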
  • timeout: sets a timeout in seconds; if the request takes longer, an exception is raised
url="https://www.zhihu.com"
try:
    r=requests.get(url,timeout=2)
    print(r.text)
except requests.exceptions.Timeout:  # catch the specific timeout error rather than BaseException
    print("timed out")
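
timeout also accepts a (connect, read) tuple if you want separate limits for establishing the connection and reading the response; a sketch:

try:
    # 2 seconds to connect, 5 seconds to read the body
    r=requests.get("https://www.zhihu.com",timeout=(2,5))
    print(r.status_code)
except requests.exceptions.Timeout:
    print("timed out")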
  • proxies: route requests through a proxy
# proxies: send the request through a proxy, per scheme
url="http://httpbin.org/get"
proxies={
    "http":"182.35.84.181:9999",
    "https":"",
}
r=requests.get(url,proxies=proxies)
print(r.text)  # the "origin" field shows the proxy's IP when it works
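
If a proxy requires authentication, the credentials go into the proxy URL itself; the host, port, and credentials below are placeholders, not a working proxy:

proxies={
    # hypothetical authenticated proxy, for illustration only
    "http":"http://user:password@10.10.1.10:3128",
    "https":"http://user:password@10.10.1.10:3128",
}
r=requests.get("http://httpbin.org/get",proxies=proxies)
print(r.text)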
  • SSL
    verify=False skips certificate verification; set it if you run into an SSLError (12306 no longer requires this workaround)
import requests
response=requests.get('http://www.12306.cn',verify=False)
print(response.status_code)
print(response.content.decode('utf-8'))
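
With verify=False, urllib3 prints an InsecureRequestWarning for every request; it can be silenced explicitly. A sketch:

import requests
import urllib3

urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)  # hide the verify=False warning
response=requests.get('http://www.12306.cn',verify=False)
print(response.status_code)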
  • JSON-formatted responses
url="http://httpbin.org/get"
r=requests.get(url)
resp_str=r.text
print(resp_str)
print(type(resp_str))

Output:

D:\ProgramData\Anaconda3\python.exe "E:/WXA/PyCharm study/爬虫介绍和基础库/demo1_requests请求.py"
{
  "args": {}, 
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.19.1", 
    "X-Amzn-Trace-Id": "Root=1-5ea13127-3bf7c712fb636862fd58c91c"
  }, 
  "origin": "117.136.0.252", 
  "url": "http://httpbin.org/get"
}

<class 'str'>
  • json.loads()
    json.loads() converts a JSON string into a Python dict or list
url="http://httpbin.org/get"
r=requests.get(url)
resp_str=r.text

import json
resp_dict=json.loads(resp_str)  # json.loads() converts a JSON string into a Python dict or list
print(resp_dict)
print(type(resp_dict))
print(resp_dict['url'])

Output:

D:\ProgramData\Anaconda3\python.exe "E:/WXA/PyCharm study/爬虫介绍和基础库/demo1_requests请求.py"
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.19.1', 'X-Amzn-Trace-Id': 'Root=1-5ea134f7-dc8428a433fcf066a2fde876'}, 'origin': '117.136.0.252', 'url': 'http://httpbin.org/get'}
<class 'dict'>
http://httpbin.org/get

Process finished with exit code 0
  • json.dumps()
print(json.dumps({"name":"tom",'age':18,'sex':"male"}))  # json.dumps(obj) converts a Python dict or list into a JSON string
print(type(json.dumps({"name":"tom",'age':18,'sex':"male"})))
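
One detail worth knowing when dumping Chinese text: json.dumps escapes non-ASCII characters by default, and ensure_ascii=False keeps them readable. A quick sketch:

import json
print(json.dumps({"name":"汤姆"}))                     # {"name": "\u6c64\u59c6"}
print(json.dumps({"name":"汤姆"},ensure_ascii=False))  # {"name": "汤姆"}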
  • r.json() parses the JSON response body; this is the JSON parsing built into requests
url="http://httpbin.org/get"
r=requests.get(url)
resp_dict=r.json()  # r.json() returns a dict directly, no json.loads() needed
print(resp_dict)

Output:

D:\ProgramData\Anaconda3\python.exe "E:/WXA/PyCharm study/爬虫介绍和基础库/demo1_requests请求.py"
{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.19.1', 'X-Amzn-Trace-Id': 'Root=1-5ea13733-900d16cad13db5581550d818'}, 'origin': '117.136.0.252', 'url': 'http://httpbin.org/get'}

Process finished with exit code 0
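
Note that r.json() raises a ValueError when the body is not valid JSON (an HTML error page, for example), so it can be worth guarding the call; a sketch:

r=requests.get("http://httpbin.org/get")
try:
    resp_dict=r.json()
    print(resp_dict['url'])
except ValueError:  # body was not valid JSON
    print("response is not JSON:",r.text[:100])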
  • POST requests: used to upload images and other data to the server
# POST request
url="http://httpbin.org/post"
data={
    "uname":"admin",
    "upwd":"123456"  # field names from the login form at http://www.antvv.com/login/login.html (view page source)
}
r=requests.post(url,data=data)
print(r.text)
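
Besides form data, requests can also send a JSON body directly via the json= parameter, which serializes the dict and sets the Content-Type header automatically; a sketch (httpbin echoes it back under "json" instead of "form"):

r=requests.post("http://httpbin.org/post",json={"uname":"admin","upwd":"123456"})
print(r.text)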
  • files: send files to the server
# POST with a file upload
url="http://httpbin.org/post"
data={
    "uname":"admin",
    "upwd":"123456"  # same form fields as above
}
# files sends a file to the server as a multipart upload
files={
    "img1":open("./1.jpg",'rb')
}
r=requests.post(url,data=data,files=files)
print(r.text)
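
The file value can also be a (filename, file object, content type) tuple when you want to control the uploaded filename and MIME type explicitly; a sketch:

files={
    "img1":("1.jpg",open("./1.jpg",'rb'),"image/jpeg")
}
r=requests.post("http://httpbin.org/post",files=files)
print(r.text)  # httpbin echoes the upload under the "files" key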

 

 