1. Getting to know crawlers and the basic crawling workflow
Crawler experiments
A simple crawl of a page's source code
import requests

def main():
    url = "http://www.4399dmw.com/search/dh-1-0-0-0-0-{}-0/"
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.99 Safari/537.36"
    }
    for i in range(14):
        # Fill the page number into the URL template
        urla = url.format(i)
        print(urla)
        resp = requests.get(url=urla, headers=headers)
        # Write each page's raw response bytes to its own file
        with open("a" + str(i) + ".txt", "wb+") as file:
            file.write(resp.content)

if __name__ == '__main__':
    main()
Using proxies
A proxy is an anti-anti-crawling measure: it keeps your own IP from being exposed or tracked.
Proxy types: HTTP, HTTPS, SOCKS4, SOCKS5
# Proxy
proxies = {"http": "http://123.169.122.201:9999"}
resp = requests.get(url=urla, headers=headers, proxies=proxies)
If using the SOCKS family (requests needs the SOCKS extra: pip install requests[socks]):
proxies = {"https": "socks5://123.169.122.201:9999"}
Anonymous proxy: the target knows you are using a proxy, but not who you are
Distorting proxy: the target knows you are using a proxy, but receives a fake IP address
Elite (high-anonymity) proxy: the target cannot tell you are using a proxy at all
Anti-crawler detection:
How often an IP accesses the site within a given time window
Checks on cookie, session, User-Agent, Referer, and other header parameters
The hosting provider behind the source IP
Because of this, the IP address pool needs constant refreshing; see the rotation sketch after this list
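Since detection forces constant IP rotation, a minimal sketch of cycling through a pool and discarding dead entries might look like this (the pool contents and the helper name get_with_rotation are hypothetical, not from the original post):

import random
import requests

# Hypothetical proxy pool; in practice it would be refreshed from a provider
proxy_pool = [
    "http://123.169.122.201:9999",
    "http://123.169.122.202:9999",
]

def get_with_rotation(url, headers):
    # Try the proxies in random order until one answers
    for proxy in random.sample(proxy_pool, len(proxy_pool)):
        try:
            return requests.get(url, headers=headers,
                                proxies={"http": proxy, "https": proxy},
                                timeout=3)
        except requests.exceptions.RequestException:
            # Dead or banned proxy: drop it from the pool
            proxy_pool.remove(proxy)
    raise RuntimeError("no working proxy left in the pool")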
Handling sessions (maintaining a user login)
The page's form tells you what to POST, but pages inside the site can only be reached after logging in, carrying the cookie/session information along.
url = "[http://sqlilabs.njhack.xyz:8080/Less-20/index.php](http://sqlilabs.njhack.xyz:8080/Less-20/index.php)"
headers = {
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36"
}
data={"uname":"admin","passwd":"admin"}
# 实例化session
session = requests.session()
# 发送post请求,提交用户名密码
session.post(url,headers=headers,data=data)
# 此时session里面已经有cookie的信息了,回的是一个已经登陆的网页,可以直接用session去get登录后的任何页面
res = session.get(url,headers=headers)
print(res.content.decode("utf-8"))
pass
Handling login directly with cookies
1. Send the request carrying the cookie, passing it to requests via the cookies argument
url = "[http://sqlilabs.njhack.xyz:8080/Less-20/index.php](http://sqlilabs.njhack.xyz:8080/Less-20/index.php)"
headers = {
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
}
cookie_dict = {"uname":"admin"}
resp = requests.get(url,headers=headers,cookies=cookie_dict)
print(resp.content.decode("utf-8"))
2. Create a Cookie field inside the headers and submit it that way
url = "[http://sqlilabs.njhack.xyz:8080/Less-20/index.php](http://sqlilabs.njhack.xyz:8080/Less-20/index.php)"
headers = {
"User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
"Cookie":"uname=admin"
}
resp = requests.get(url,headers=headers)
print(resp.content.decode("utf-8"))
3. Extract the cookie information directly from the response
resp = requests.post(url, headers=headers, data=data)
# Convert the CookieJar in the response into a plain dict
cookies = requests.utils.dict_from_cookiejar(resp.cookies)
print(cookies)
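As a follow-up, the extracted dict can be passed straight back on later requests (this assumes the POST above was the login form from the session example):

resp2 = requests.get(url, headers=headers, cookies=cookies)
print(resp2.content.decode("utf-8"))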
Handling HTTPS pages
SSL certificate problems: simply skip SSL certificate verification
resp = requests.get(url, headers=headers, data=data, verify=False)
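Note that verify=False makes urllib3 emit an InsecureRequestWarning on every request; if that noise is unwanted, it can be silenced with urllib3's public helper (a small sketch, not from the original post):

import urllib3

# Suppress the InsecureRequestWarning triggered by verify=False
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)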
Timeout parameter
Give the request 3 seconds to respond. This is usually combined with proxies: if a proxy stays silent past the timeout, it can be dropped from the IP pool.
resp = requests.get(url, headers=headers, data=data, verify=False, timeout=3)
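timeout also accepts a (connect, read) tuple when the two phases need different limits, and a timeout raises requests.exceptions.Timeout, which can be caught to trigger the pool removal described above (a sketch reusing the url, headers, and proxies from the earlier snippets):

try:
    # 3 seconds to connect, 10 seconds to read the response body
    resp = requests.get(url, headers=headers, proxies=proxies, timeout=(3, 10))
except requests.exceptions.Timeout:
    # The proxy did not answer in time, so drop it from the IP pool
    print("proxy timed out, remove it from the pool")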
The retrying module
Install it with: pip install retrying
import requests
from retrying import retry

# Give up after 3 failed attempts
@retry(stop_max_attempt_number=3)
def qingqiu(inurl):
    url = inurl
    print("starting request")
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.82 Safari/537.36"
    }
    resp = requests.get(url=url, headers=headers)
    with open("a.txt", "wb+") as file:
        file.write(resp.content)
    print("request succeeded")

def main():
    try:
        qingqiu("http://www.4399dmw.com/search/dh-1-0-0-0-0-1-0/")
    except Exception:
        print("request failed after all retries")

if __name__ == '__main__':
    main()
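retrying also takes parameters such as wait_fixed (milliseconds to sleep between attempts); a hedged variant of the decorator above (the function name qingqiu_with_wait is illustrative only):

from retrying import retry

# Wait 2000 ms between attempts instead of retrying immediately
@retry(stop_max_attempt_number=3, wait_fixed=2000)
def qingqiu_with_wait(inurl):
    ...  # same body as qingqiu above; only the retry policy changes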
JSON
A crawler does not have to crawl an entire site; it can also request API endpoints directly.
Structured data: JSON, XML
Handling: convert it straight into Python types
Mobile versions of web pages often serve JSON data
You can see it in the browser under Inspect Element → Network → Response
Sometimes requesting a JSON page errors out because an anti-crawler check was triggered; consider whether the Referer or another header is missing.
Nanjing PM2.5 JSON data: http://api.help.bj.cn/apis/aqi3/?id=nanjing
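As a quick sketch of hitting that endpoint directly (the Referer value is an assumption; set it to whatever the site actually expects):

import requests

url = "http://api.help.bj.cn/apis/aqi3/?id=nanjing"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0.4430.93 Safari/537.36",
    # Hypothetical Referer: some endpoints reject requests that lack one
    "Referer": "http://api.help.bj.cn/",
}
resp = requests.get(url, headers=headers)
print(resp.content.decode("utf-8"))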
import requests
import json

def main():
    url = "https://json.tewx.cn/user/API_kdd531mytfdzm06i?sdAS1dsnuUa3sd=190001&Jsdh4bajs99dii=sohpuisypf4nfaei"
    resp = requests.get(url=url)
    # content here is a string
    content = resp.content.decode("utf-8")
    print(content)
    # json.loads turns the string into a dict
    shuju = json.loads(content)
    # Access the dict converted from the JSON
    print(shuju["data"]["JSON"]["mydata"]["name"])

if __name__ == '__main__':
    main()
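requests can also do the decoding itself: for a well-formed JSON response, resp.json() is equivalent to the json.loads(resp.content.decode("utf-8")) used above (a short variant reusing the url from that example):

resp = requests.get(url=url)
# .json() parses the response body straight into Python types
shuju = resp.json()
print(shuju["data"]["JSON"]["mydata"]["name"])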