1. Getting to know crawlers and the basic crawling workflow
Crawler experiments
A simple crawl of a page's source code
import requests

def main():
    url = "http://www.4399dmw.com/search/dh-1-0-0-0-0-{}-0/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"
    }
    for i in range(14):
        # Fill the page number into the URL template
        urla = url.format(i)
        print(urla)
        resp = requests.get(url=urla, headers=headers)
        # Write each page's raw response bytes to its own file
        with open("a" + str(i) + ".txt", "wb+") as file:
            file.write(resp.content)

if __name__ == '__main__':
    main()
Using proxies
A proxy is an anti-anti-crawling measure: it keeps your own IP from being exposed or tracked.
Proxy types: HTTP, HTTPS, SOCKS4, SOCKS5
# Proxy
proxies = {"http": "http://123.169.122.201:9999"}
resp = requests.get(url=urla, headers=headers, proxies=proxies)
If using the SOCKS family (requests needs the SOCKS extra: pip install requests[socks]):
proxies = {"https": "socks5://123.169.122.201:9999"}
Anonymous proxy: the target knows you are using a proxy, but not who you are
Distorting proxy: the target knows you are using a proxy, but receives a fake IP address
Elite (high-anonymity) proxy: the target cannot tell you are using a proxy at all
Anti-crawler detection:
How often an IP accesses the site within a given time window
Checks on cookie, session, User-Agent, Referer, and other header parameters
The hosting provider behind the source IP
Because of this, the IP address pool needs constant refreshing; see the rotation sketch after this list
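Since detection forces constant IP rotation, a minimal sketch of cycling through a pool and discarding dead entries might look like this (the pool contents and the helper name get_with_rotation are hypothetical, not from the original post):

import random
import requests

# Hypothetical proxy pool; in practice it would be refreshed from a provider
proxy_pool = [
    "http://123.169.122.201:9999",
    "http://123.169.122.202:9999",
]

def get_with_rotation(url, headers):
    # Try the proxies in random order until one answers
    for proxy in random.sample(proxy_pool, len(proxy_pool)):
        try:
            return requests.get(url, headers=headers,
                                proxies={"http": proxy, "https": proxy},
                                timeout=3)
        except requests.exceptions.RequestException:
            # Dead or banned proxy: drop it from the pool
            proxy_pool.remove(proxy)
    raise RuntimeError("no working proxy left in the pool")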
Handling sessions (maintaining a user login)
The page's form tells you what to POST, but pages inside the site can only be reached after logging in, carrying the cookie/session information along.
url = "[http://sqlilabs.njhack.xyz:8080/Less-20/index.php](http://sqlilabs.njhack.xyz:8080/Less-20/index.php)"
headers = {
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
}
data={"uname":"admin","passwd":"admin"}
# 实例化session
session = requests.session()
# 发送post请求,提交用户名密码
session.post(url,headers=headers,data=data)
# 此时session里面已经有cookie的信息了,回的是一个已经登陆的网页,可以直接用session去get登录后的任何页面
res = session.get(url,headers=headers)
print(res.content.decode("utf-8"))
pass
Handling login directly with cookies
1. Send the request carrying the cookie, passing it to requests via the cookies argument
url = "[http://sqlilabs.njhack.xyz:8080/Less-20/index.php](http://sqlilabs.njhack.xyz:8080/Less-20/index.php)"
headers = {
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
}
cookie_dict = {"uname":"admin"}
resp = requests.get(url,headers=headers,cookies=cookie_dict)
print(resp.content.decode("utf-8"))
2. Create a Cookie field inside the headers and submit it that way
url = "[http://sqlilabs.njhack.xyz:8080/Less-20/index.php](http://sqlilabs.njhack.xyz:8080/Less-20/index.php)"
headers = {
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
"Cookie":"uname=admin"
}
resp = requests.get(url,headers=headers)
print(resp.content.decode("utf-8"))
3. Extract the cookie information directly from the response
resp = requests.post(url, headers=headers, data=data)
# Convert the CookieJar in the response into a plain dict
cookies = requests.utils.dict_from_cookiejar(resp.cookies)
print(cookies)
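As a follow-up, the extracted dict can be passed straight back on later requests (this assumes the POST above was the login form from the session example):

resp2 = requests.get(url, headers=headers, cookies=cookies)
print(resp2.content.decode("utf-8"))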
Handling HTTPS pages
SSL certificate problems: simply skip SSL certificate verification
resp = requests.get(url, headers=headers, data=data, verify=False)
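Note that verify=False makes urllib3 emit an InsecureRequestWarning on every request; if that noise is unwanted, it can be silenced with urllib3's public helper (a small sketch, not from the original post):

import urllib3

# Suppress the InsecureRequestWarning triggered by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)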
Timeout parameter
Give the request 3 seconds to respond. This is usually combined with proxies: if a proxy stays silent past the timeout, it can be dropped from the IP pool.
resp = requests.get(url, headers=headers, data=data, verify=False, timeout=3)
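timeout also accepts a (connect, read) tuple when the two phases need different limits, and a timeout raises requests.exceptions.Timeout, which can be caught to trigger the pool removal described above (a sketch reusing the url, headers, and proxies from the earlier snippets):

try:
    # 3 seconds to connect, 10 seconds to read the response body
    resp = requests.get(url, headers=headers, proxies=proxies, timeout=(3, 10))
except requests.exceptions.Timeout:
    # The proxy did not answer in time, so drop it from the IP pool
    print("proxy timed out, remove it from the pool")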
The retrying module
Install it with: pip install retrying
import requests
from retrying import retry

# Give up after 3 failed attempts
@retry(stop_max_attempt_number=3)
def qingqiu(inurl):
    url = inurl
    print("starting request")
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36"
    }
    resp = requests.get(url=url, headers=headers)
    with open("a.txt", "wb+") as file:
        file.write(resp.content)
    print("request succeeded")

def main():
    try:
        qingqiu("http://www.4399dmw.com/search/dh-1-0-0-0-0-1-0/")
    except Exception:
        print("request failed after all retries")

if __name__ == '__main__':
    main()
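retrying also takes parameters such as wait_fixed (milliseconds to sleep between attempts); a hedged variant of the decorator above (the function name qingqiu_with_wait is illustrative only):

from retrying import retry

# Wait 2000 ms between attempts instead of retrying immediately
@retry(stop_max_attempt_number=3, wait_fixed=2000)
def qingqiu_with_wait(inurl):
    ...  # same body as qingqiu above; only the retry policy changes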
JSON
A crawler does not have to crawl an entire site; it can also request API endpoints directly.
Structured data: JSON, XML
Handling: convert it straight into Python types
Mobile versions of web pages often serve JSON data
You can see it in the browser under Inspect Element → Network → Response
Sometimes requesting a JSON page errors out because an anti-crawler check was triggered; consider whether the Referer or another header is missing.
Nanjing PM2.5 JSON data: http://api.help.bj.cn/apis/aqi3/?id=nanjing
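As a quick sketch of hitting that endpoint directly (the Referer value is an assumption; set it to whatever the site actually expects):

import requests

url = "http://api.help.bj.cn/apis/aqi3/?id=nanjing"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
    # Hypothetical Referer: some endpoints reject requests that lack one
    "Referer": "http://api.help.bj.cn/",
}
resp = requests.get(url, headers=headers)
print(resp.content.decode("utf-8"))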
import requests
import json

def main():
    url = "https://json.tewx.cn/user/API_kdd531mytfdzm06i?sdAS1dsnuUa3sd=190001&Jsdh4bajs99dii=sohpuisypf4nfaei"
    resp = requests.get(url=url)
    # content here is a string
    content = resp.content.decode("utf-8")
    print(content)
    # json.loads turns the string into a dict
    shuju = json.loads(content)
    # Access the dict converted from the JSON
    print(shuju["data"]["JSON"]["mydata"]["name"])

if __name__ == '__main__':
    main()
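requests can also do the decoding itself: for a well-formed JSON response, resp.json() is equivalent to the json.loads(resp.content.decode("utf-8")) used above (a short variant reusing the url from that example):

resp = requests.get(url=url)
# .json() parses the response body straight into Python types
shuju = resp.json()
print(shuju["data"]["JSON"]["mydata"]["name"])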