[Crawler] Advanced requests usage, proxy pools, and scraping videos and news

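1. Testing rate limits
2. Advanced requests usage
3. Building a proxy pool
- 3.1 Getting the client IP in a Django backend
4. Scraping a video site
5. Scraping news
6. Traversing the document tree with bs4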

1. Testing rate limits

# Reuse the cookie from a logged-in session, start 100 threads, and have each thread upvote in an infinite loop
import requests

from threading import Thread

def task():
    while True:
        data = {
            'linkId': '36996038'
        }
        header = {
            # Client type (browser User-Agent)
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36',
            # Send the login cookie
            'Cookie': 'deviceId=web.eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqaWQiOiI3MzAyZDQ5Yy1mMmUwLTRkZGItOTZlZi1hZGFmZTkwMDBhMTEiLCJleHBpcmUiOiIxNjYxNjU0MjYwNDk4In0.4Y4LLlAEWzBuPRK2_z7mBqz4Tw5h1WeqibvkBG6GM3I; __snaker__id=ozS67xizRqJGq819; YD00000980905869%3AWM_TID=M%2BzgJgGYDW5FVFVAVQbFGXQ654xCRHj8; _9755xjdesxxd_=32; Hm_lvt_03b2668f8e8699e91d479d62bc7630f1=1666756750,1669172745; gdxidpyhxdE=W7WrUDABQTf1nd8a6mtt5TQ1fz0brhRweB%5CEJfQeiU61%5C1WnXIUkZH%2FrE4GnKkGDX767Jhco%2B7xUMCiiSlj4h%2BRqcaNohAkeHsmj3GCp2%2Fcj4HmXsMVPPGClgf5AbhAiztHgnbAz1Xt%5CIW9DMZ6nLg9QSBQbbeJSBiUGK1RxzomMYSU5%3A1669174630494; YD00000980905869%3AWM_NI=OP403nvDkmWQPgvYedeJvYJTN18%2FWgzQ2wM3g3aA3Xov4UKwq1bx3njEg2pVCcbCfP9dl1RnAZm5b9KL2cYY9eA0DkeJo1zfCWViwVZUm303JyNdJVAEOJ1%2FH%2BJFZxYgMVI%3D; YD00000980905869%3AWM_NIKE=9ca17ae2e6ffcda170e2e6ee92bb45a398f8d1b34ab5a88bb7c54e839b8aacc1528bb8ad89d45cb48ae1aac22af0fea7c3b92a8d90fcd1b266b69ca58ed65b94b9babae870a796babac9608eeff8d0d66dba8ffe98d039a5edafa2b254adaafcb6ca7db3efae99b266aa9ba9d3f35e81bdaea4e55cfbbca4d2d1668386a3d6e1338994fe84dc53fbbb8fd1c761a796a1d2f96e81899a8af65e9a8ba3d4b3398aa78285c95e839b81abb4258cf586a7d9749bb983b7cc37e2a3; token=eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJqaWQiOiJjZHVfNTMyMDcwNzg0NjAiLCJleHBpcmUiOiIxNjcxNzY1NzQ3NjczIn0.50e-ROweqV0uSd3-Og9L7eY5sAemPZOK_hRhmAzsQUk; Hm_lpvt_03b2668f8e8699e91d479d62bc7630f1=1669173865'
        }
        res = requests.post('https://dig.chouti.com/link/vote', data=data, headers=header)
        print(res.text)


if __name__ == '__main__':
    for i in range(100):
        t = Thread(target=task)
        t.start()
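
The script above never stops, so its output makes it hard to see when the server starts throttling. Below is a minimal sketch of a bounded variant: it sends a fixed number of requests through a thread pool and tallies the status codes. The names `tally_statuses` and `make_fake_sender` are hypothetical, and the fake sender (5 successes, then 429) only simulates a rate limit so the example runs offline.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
import itertools
import threading


def tally_statuses(send, total=20, workers=5):
    """Call send() `total` times across `workers` threads and count the returned status codes."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(send) for _ in range(total)]
        return Counter(f.result() for f in futures)


def make_fake_sender(limit=5):
    """Hypothetical stand-in for a real request: the first `limit` calls succeed, the rest are throttled."""
    calls = itertools.count(1)
    lock = threading.Lock()

    def send():
        with lock:  # make the call counter thread-safe
            n = next(calls)
        return 200 if n <= limit else 429

    return send


result = tally_statuses(make_fake_sender(limit=5), total=20, workers=5)
print(result)
```

For a real test, `send` could be something like `lambda: requests.post(url, data=data, headers=header).status_code`.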

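2. Advanced requests usage

2.1 SSL verification

1. What is the difference between HTTPS and HTTP?
   HTTPS requires a certificate issued by a certificate authority (CA). Many certificates are paid, although free ones are available (for example from Let's Encrypt).
   HTTP is the Hypertext Transfer Protocol and sends data in plaintext; HTTPS sends the same traffic over an SSL/TLS-encrypted connection.
   HTTPS is HTTP layered on SSL/TLS, which adds encryption and identity verification, so it is more secure than HTTP.

2. If a certificate was issued by an authority the browser does not trust, the browser warns that the connection is not secure.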
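
To make "verification" concrete, here is a small sketch using the standard library `ssl` module rather than requests. It only illustrates what a verifying and a non-verifying TLS configuration look like; it is not a description of how requests is implemented internally.

```python
import ssl

# Default context: verifies the certificate chain and the hostname.
default_ctx = ssl.create_default_context()
print(default_ctx.verify_mode == ssl.CERT_REQUIRED)
print(default_ctx.check_hostname)

# Unverified context: roughly what skipping verification amounts to.
# check_hostname must be turned off before verify_mode can be set to CERT_NONE.
insecure_ctx = ssl.create_default_context()
insecure_ctx.check_hostname = False
insecure_ctx.verify_mode = ssl.CERT_NONE
print(insecure_ctx.verify_mode == ssl.CERT_NONE)
```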

2.2 Examples

# 1. Skip certificate verification
import requests
response = requests.get('https://www.12306.cn', verify=False)  # no verification: emits an InsecureRequestWarning, returns 200
print(response.status_code)


# 2. Send a client certificate manually
import requests
response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(response.status_code)

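2.3 Using a proxy

1. Sites throttle request frequency and ban accounts, keyed by IP or by user ID. A crawler has to work around both:
   IP bans: use proxies
   Account bans: register many throwaway accounts

2. What is a proxy?
   Forward proxy: acts on behalf of the client
   Reverse proxy: acts on behalf of the server; nginx is a reverse proxy server

3. Sending an HTTP request through a proxy
import requests

# The key is the scheme of the target URL; an https:// URL only uses the 'https' entry
proxies = {
    'http': 'http://192.168.10.102:9003',
    'https': 'http://192.168.10.102:9003',
}
response = requests.get('https://www.baidu.com', proxies=proxies)

print(response.text)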
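
The `proxies` mapping is looked up by the scheme of the URL being requested, which is why a mapping with only an `'http'` key has no effect on an `https://` request. The sketch below is a simplified illustration of that lookup, not the library's actual code (the real lookup also accepts per-host keys); `pick_proxy` is a hypothetical name.

```python
from urllib.parse import urlparse


def pick_proxy(url, proxies):
    """Simplified illustration: look up a proxy by the URL's scheme, then fall back to an 'all' entry."""
    scheme = urlparse(url).scheme
    return proxies.get(scheme, proxies.get('all'))


http_only = {'http': 'http://192.168.10.102:9003'}
print(pick_proxy('http://www.baidu.com', http_only))   # the proxy is used
print(pick_proxy('https://www.baidu.com', http_only))  # None: the request would go out directly
```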

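2.4 Timeout settings

# Timeout in seconds; requests raises an exception if the server does not respond in time
# A (connect, read) tuple such as timeout=(3, 10) sets the two limits separately
import requests
response = requests.get('https://www.baidu23.com', timeout=3)
print(response)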

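2.5 Exception handling


# Exception handling
import requests
# See requests.exceptions for the full list of exception types
from requests.exceptions import ConnectionError, ReadTimeout, RequestException, Timeout

try:
    r = requests.get('http://www.baidu.com', timeout=0.00001)
except ReadTimeout:  # the server did not send data in time
    print('read timeout')
except ConnectionError:  # network unreachable (also catches ConnectTimeout)
    print('connection error')
except Timeout:  # any other timeout
    print('timeout')
except RequestException:  # base class for every requests error
    print('Error')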
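
The order of the `except` clauses matters because some of these exception classes inherit from more than one parent. The following sketch only inspects the class hierarchy in `requests.exceptions` and sends no request.

```python
from requests.exceptions import (
    ConnectionError,
    ConnectTimeout,
    ReadTimeout,
    RequestException,
    Timeout,
)

# ConnectTimeout inherits from both ConnectionError and Timeout,
# so whichever of those two except clauses comes first will catch it.
print(issubclass(ConnectTimeout, ConnectionError))
print(issubclass(ConnectTimeout, Timeout))

# ReadTimeout is a Timeout but not a ConnectionError.
print(issubclass(ReadTimeout, Timeout))
print(issubclass(ReadTimeout, ConnectionError))

# Everything ultimately derives from RequestException, so it belongs last.
print(issubclass(Timeout, RequestException))
```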

2.6 Uploading files

# Upload a file as multipart/form-data
import requests
with open('a.txt', 'rb') as f:  # close the file once the upload finishes
    response = requests.post('http://httpbin.org/post', files={'file': f})
print(response.text)
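
To see what `files=` actually produces, the request can be prepared without being sent. This is a small offline sketch; the inline file content `b'hello world'` is made up for illustration.

```python
import requests

# Build the request without sending it; the file content is given inline as bytes.
req = requests.Request(
    'POST',
    'http://httpbin.org/post',
    files={'file': ('a.txt', b'hello world')},
)
prepared = req.prepare()

print(prepared.headers['Content-Type'])      # multipart/form-data with a generated boundary
print(b'filename="a.txt"' in prepared.body)  # the file name is part of the encoded body
```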

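3. Building a proxy pool

1. Run an open-source proxy pool from GitHub locally
   It crawls free-proxy sites, collects the proxies, validates them, and stores the working ones locally
   It uses Flask to serve a web backend; calling one endpoint returns a random working proxy address
   https://github.com/jhao104/proxy_pool

2. Setup steps:
    1 git clone https://github.com/jhao104/proxy_pool.git
    2 Create a virtual environment: mkvirtualenv -p python3.8 crawl
      Install dependencies: pip install -r requirements.txt
    3 Edit the config file setting.py (the Redis service must be running)
        # API service
        HOST = "0.0.0.0"               # IP
        PORT = 5010                    # listening port; the requests below use this port
        # Database
        DB_CONN = 'redis://127.0.0.1:6379/0'   # Redis address; 6379 is the Redis default, adjust to your instance
        # ProxyFetcher
        PROXY_FETCHER = [
            "freeProxy01",
            "freeProxy02",
        ]
    4 Start the crawler and the web service
        # Start the scheduler
        python proxyPool.py schedule
        # Start the web API service
        python proxyPool.py server

    5 Get a random proxy
        127.0.0.1:5010/get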
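
The idea behind the pool (collect candidates, validate them, store the working ones, hand one out at random) can be sketched in a few lines. This is a toy in-memory illustration, not proxy_pool's code: the real project stores proxies in Redis and serves them over Flask. `TinyProxyPool` and the port-based check are hypothetical.

```python
import random


class TinyProxyPool:
    """Toy in-memory pool: keep only proxies that pass a check, hand one out at random."""

    def __init__(self, check):
        self.check = check      # callable(proxy) -> bool, decides whether a proxy works
        self.proxies = set()

    def add(self, candidates):
        """Validate candidates and store the ones that pass."""
        for proxy in candidates:
            if self.check(proxy):
                self.proxies.add(proxy)

    def get(self):
        """Return a random stored proxy, or None if the pool is empty."""
        return random.choice(sorted(self.proxies)) if self.proxies else None


# Hypothetical check: pretend only proxies on port 8080 respond.
pool = TinyProxyPool(check=lambda p: p.endswith(':8080'))
pool.add(['10.0.0.1:8080', '10.0.0.2:3128', '10.0.0.3:8080'])
print(pool.get())
```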

import requests

# http://127.0.0.1:5010/get/
# Get a random proxy from the pool
res = requests.get('http://127.0.0.1:5010/get/').json()
if res['https']:
    http = 'https'
else:
    http = 'http'
proxie = {
    http: http + '://' + res['proxy']  # the proxy URL needs a scheme
}
print(proxie)
res = requests.get('https://www.cnblogs.com/liuqingzheng/p/16005896.html', proxies=proxie)
print(res.status_code)

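3.1 Getting the client IP in a Django backend

# A Django view that returns the caller's IP address
from django.http import HttpResponse

def ip_test(request):
    # Get the client IP
    ip = request.META.get('REMOTE_ADDR')
    return HttpResponse('Your IP is: %s' % ip)
# Deploy this on a cloud server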

from django.contrib import admin
from django.urls import path
from app01 import views
urlpatterns = [
    path('admin/', admin.site.urls),
    path('ip/',views.ip_test)
]
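
One caveat when the view is deployed behind a reverse proxy such as nginx: `REMOTE_ADDR` is then the reverse proxy's address, not the caller's. A minimal sketch of a helper that prefers the `X-Forwarded-For` header follows. It assumes the reverse proxy is configured to set that header; `client_ip` is a hypothetical name, and it takes a plain dict shaped like `request.META` so it runs without Django.

```python
def client_ip(meta):
    """Return the client IP from a request.META-like dict.

    Assumption: the app sits behind a reverse proxy that sets X-Forwarded-For.
    Without such a proxy the header is client-controlled and must not be trusted.
    """
    forwarded = meta.get('HTTP_X_FORWARDED_FOR')
    if forwarded:
        # The left-most entry is the original client; later entries are proxies.
        return forwarded.split(',')[0].strip()
    return meta.get('REMOTE_ADDR')


print(client_ip({'REMOTE_ADDR': '127.0.0.1', 'HTTP_X_FORWARDED_FOR': '203.0.113.7, 10.0.0.2'}))
print(client_ip({'REMOTE_ADDR': '198.51.100.4'}))
```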

# Locally, request the view with requests through a proxy and check whether the proxy's IP comes back
import requests

res = requests.get('http://127.0.0.1:5010/get/').json()
if res['https']:
    http = 'https'
else:
    http = 'http'
proxie = {
    http: http+'://'+res['proxy']
}
print(proxie)
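# A server running on the local network cannot be reached by the proxy; use NAT traversal or deploy it on a public server
# res = requests.get('http://192.168.1.143:8000/ip/', proxies=proxie)
# res = requests.get('https://46b3k95600.zicp.fun/ip/', proxies=proxie) # did not work
res = requests.get('http://101.133.225.166/ip/', proxies=proxie)
print(res.text)
# If the proxy is unavailable, fall back to a direct request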
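
The fallback mentioned in the last comment could look roughly like this. It is a sketch under assumptions: `get_with_fallback` is a hypothetical helper, and `get` is injected so the example runs offline with a fake; in real use it would be `requests.get`.

```python
def get_with_fallback(get, url, proxies, timeout=5):
    """Try the request through the proxy; on a network error, retry without it.

    `get` is any callable with the signature of requests.get.
    requests' exceptions derive from IOError (an alias of OSError), so
    catching OSError covers proxy, connection and timeout failures.
    """
    try:
        return get(url, proxies=proxies, timeout=timeout)
    except OSError:
        return get(url, timeout=timeout)


# Hypothetical stand-in for requests.get: fails whenever a proxy is supplied.
def fake_get(url, proxies=None, timeout=None):
    if proxies:
        raise ConnectionError('proxy unreachable')
    return 'direct response from %s' % url


print(get_with_fallback(fake_get, 'http://101.133.225.166/ip/', {'http': 'http://10.0.0.1:8080'}))
```

Note that falling back to a direct request exposes the real IP, so a crawler that relies on the proxy for anonymity may prefer to fetch another proxy from the pool instead.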

4. Scraping a video site

import requests
import re
res = requests.get('https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=1&start=1')

# Use a regex to extract every video link on the page
video_list = re.findall('<a href="(.*?)" class="vervideo-lilink actplay">', res.text)
for video in video_list:
    video_url = 'https://www.pearvideo.com/' + video
    print(video_url)
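
The regex can be tried offline on a small fragment. The HTML below is a hypothetical sample in the shape the pattern expects, not a copy of the real page.

```python
import re

# Hypothetical HTML fragment in the shape the regex expects.
sample_html = '''
<a href="video_1646509" class="vervideo-lilink actplay">first</a>
<a href="video_1646510" class="vervideo-lilink actplay">second</a>
<a href="about.html" class="other">ignored</a>
'''

# The lazy group (.*?) captures the href of each matching link.
paths = re.findall('<a href="(.*?)" class="vervideo-lilink actplay">', sample_html)
# The numeric id is the part after the last underscore.
video_ids = [path.split('_')[-1] for path in paths]
print(paths)
print(video_ids)
```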

import requests
import re

res = requests.get('https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=1&start=1')

# Use a regex to extract every video link on the page
video_list = re.findall('<a href="(.*?)" class="vervideo-lilink actplay">', res.text)
# print(video_list)
for video in video_list:
    # video_url = 'https://www.pearvideo.com/' + video
    # print(video_url)
    # res = requests.get(video_url)
    # print(res.text)
    # break
    # Request https://www.pearvideo.com/videoStatus.jsp?contId=1646509&mrd=0.6761335369801458 to get the video address
    video_id = video.split('_')[-1]
    header = {
        'Referer': 'https://www.pearvideo.com/%s' % video
    }
    res = requests.get('https://www.pearvideo.com/videoStatus.jsp?contId=%s&mrd=0.6761335369801458' % video_id,
                       headers=header).json()
    real_mp4_url = res['videoInfo']['videos']['srcUrl']
    real_mp4_url = real_mp4_url.replace(real_mp4_url.rsplit('/', 1)[-1].split('-')[0], 'cont-%s' % video_id)
    print(real_mp4_url)

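    res = requests.get(real_mp4_url, stream=True)  # stream the body instead of loading it all into memory
    # The ./video directory must already exist
    with open('./video/%s.mp4' % video_id, 'wb') as f:
        for chunk in res.iter_content(chunk_size=1024 * 64):  # the default chunk size is 1 byte
            f.write(chunk)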
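
The `replace` step in the loop is the least obvious part: the `srcUrl` returned by the API has a placeholder as the first segment of the file name, and the working address uses `cont-<video id>` in its place. Here is the same string manipulation on its own; `fix_src_url` is a hypothetical name and the sample URL is made up to show the shape.

```python
def fix_src_url(src_url, video_id):
    """Replace the leading segment of the file name (before the first '-') with 'cont-<video_id>'."""
    file_name = src_url.rsplit('/', 1)[-1]   # e.g. '1669280000000-15000000_hd.mp4'
    fake_part = file_name.split('-')[0]      # e.g. '1669280000000'
    return src_url.replace(fake_part, 'cont-%s' % video_id)


# Hypothetical srcUrl, shaped like the value the loop above rewrites.
src = 'https://video.example.com/mp4/20221124/1669280000000-15000000_hd.mp4'
fixed = fix_src_url(src, '1646509')
print(fixed)
```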

5. Scraping news

# requests + BeautifulSoup4 (parsing libraries: bs4, lxml, ...)
# https://www.autohome.com.cn/news/

import requests
# Parsing library: bs4    pip3 install beautifulsoup4
from bs4 import BeautifulSoup

res = requests.get('https://www.autohome.com.cn/news/1/#liststart')
# print(res.text)  # search the returned HTML; bs4 parses HTML and XML
soup = BeautifulSoup(res.text, 'html.parser')
# Find the ul tags whose class is "article"
ul_list = soup.find_all(name='ul', class_='article')
print(len(ul_list))  # 4 ul tags were found
for ul in ul_list:
    # Find every li tag under the ul
    li_list = ul.find_all(name='li')
    for li in li_list:
        h3 = li.find(name='h3')
        if h3:  # skip li tags without an h3; they are not articles
            title = h3.text  # text content of the h3 tag
            desc = li.find(name='p').text
            url = 'https:' + li.find(name='a').attrs.get('href')
            img = li.find(name='img').attrs.get('src')
            if not img.startswith('http'):
                img = 'https:' + img

            # Print inside the if block so that only real articles are reported
            print('''
            Title:   %s
            Summary: %s
            URL:     %s
            Image:   %s
            ''' % (title, desc, url, img))

            # To save the data to MySQL: create the database and table, then use pymysql to insert and call conn.commit()
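
The save step mentioned in the last comment can be sketched as follows. To keep the example runnable without a database server, it swaps in the standard library `sqlite3` module for pymysql and MySQL; the table name, columns, and sample rows are hypothetical. With pymysql the flow is the same (connect, create table, insert, commit), but the placeholders are `%s` and statements run through a cursor.

```python
import sqlite3

# Hypothetical rows in the shape the loop above produces.
articles = [
    ('Title A', 'Summary A', 'https://example.com/a', 'https://example.com/a.jpg'),
    ('Title B', 'Summary B', 'https://example.com/b', 'https://example.com/b.jpg'),
]

conn = sqlite3.connect(':memory:')  # stand-in for a MySQL connection
conn.execute(
    'CREATE TABLE article (id INTEGER PRIMARY KEY, title TEXT, summary TEXT, url TEXT, img TEXT)'
)
# Parameterized insert: the driver escapes the values.
conn.executemany(
    'INSERT INTO article (title, summary, url, img) VALUES (?, ?, ?, ?)', articles
)
conn.commit()

count = conn.execute('SELECT COUNT(*) FROM article').fetchone()[0]
print(count)
```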

6. Traversing the document tree with bs4


from bs4 import BeautifulSoup

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title" id='id_p' name='lqz' xx='yy'>lqz is handsome <b>The Dormouse's story</b></p>

<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>

<p class="story">...</p>
"""
soup = BeautifulSoup(html_doc, 'lxml')
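# 1 Pretty-print the HTML (for reference)
# print(soup.prettify())

# 2 Traversing the document tree
'''
# Traversing the document tree means selecting directly by tag name. It is fast, but when several tags share a name only the first is returned
#1. Basic usage
#2. Getting a tag's name
#3. Getting a tag's attributes
#4. Getting a tag's content
#5. Nested selection
#6. Children and descendants
#7. Parent and ancestors
#8. Siblings
'''
# 1 Basic usage: .tag_name
# res=soup.title
# print(res)
# res=soup.a
# print(res)
# Selections can be chained
# res=soup.head.title
# print(res)

# 2 Getting a tag's name
# Every tag returned is a Tag object (bs4.element.Tag)
# res=soup.head.title
# res=soup.body
# print(res.name)

# 3 Getting a tag's attributes
# res=soup.p
# print(res.attrs)  # dict of attributes


# 4 Getting a tag's content
# res = soup.p
# print(res.text) # text of all descendants joined into one string
# print(res.string) # None here: .string returns text only when the tag has a single child
# print(list(res.strings)) # generator yielding the text of every descendant

# 5 Nested selection

# res=soup.html.body.a
# print(res.text)


# 6 Children and descendants
# print(soup.p.contents) # list of all direct children of p
# print(soup.p.children) # iterator over the direct children of p

# 7 Parent and ancestors
# print(soup.a.parent) # direct parent of the a tag
# print(list(soup.a.parents)) # all ancestors of the a tag: parent, grandparent, and so on


# 8 Siblings
# print(soup.a.next_sibling)  # next sibling
# print(soup.a.previous_sibling)  # previous sibling

print(list(soup.a.next_siblings)) # all following siblings => generator object
print('-----')
print(list(soup.a.previous_siblings)) # all preceding siblings => generator object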
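
The difference between `.text`, `.string`, and `.strings` is easy to mix up, so here is a compact standalone check on a shortened version of the first paragraph. It uses the built-in `html.parser` so that lxml is not required.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>lqz is handsome <b>The story</b></p>', 'html.parser')

print(soup.p.text)            # text of p and all its descendants, joined
print(soup.p.string)          # None: p has two children (a string and a b tag)
print(soup.b.string)          # b has a single string child, so .string returns it
print(list(soup.p.strings))   # every text node under p, in document order
```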

posted @ 2022-11-24 19:57 相得益张
