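Web scraping: images and videos

Getting binary data: content or iter_content

res.content returns the whole response body as bytes, while res.iter_content yields it in chunks. Both are used to download images and videos.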
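
As a rough sketch of the chunked approach: the helper below is hypothetical (`save_chunks` is not part of requests) and accepts any iterable of bytes, so it could be fed `res.iter_content(100)` from a real response. Here a plain list stands in for the response body so the example runs without network access.

```python
import io
from typing import BinaryIO, Iterable


def save_chunks(chunks: Iterable[bytes], fileobj: BinaryIO) -> int:
    """Write every non-empty chunk to fileobj and return the total bytes written."""
    total = 0
    for chunk in chunks:
        if chunk:  # skip empty chunks
            total += fileobj.write(chunk)
    return total


# Stand-in for res.iter_content(2): the body arrives in small pieces
fake_body = [b"ab", b"cd", b"e"]
buffer = io.BytesIO()
written = save_chunks(fake_body, buffer)
print(written, buffer.getvalue())  # 5 b'abcde'
```

With a real response, the same helper would be called as `save_chunks(res.iter_content(100), f)` inside `with open("a.jpg", "wb") as f:`.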

Scraping an image:

import requests

header = {
   "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
    "referer":"https://mmzztt.com/"
}

res = requests.get("https://p.iimzt.com/2022/04/21g14fpi.jpg", headers=header)
with open("a.jpg", "wb") as f:
    # f.write(res.content)  # alternative: write the whole body at once
    # Number of bytes to take per iteration: take 100, write 100
    for chunk in res.iter_content(100):
        f.write(chunk)

The browser's request for this image carries user-agent and referer in its request headers, so our request headers need to carry them as well.
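
A small sketch of building those headers follows. `build_headers` is a hypothetical helper, and it assumes that the root of the referring site is an acceptable referer value, which matches the "https://mmzztt.com/" value used in this post but may not hold for other sites.

```python
from urllib.parse import urlsplit

USER_AGENT = (
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36"
)


def build_headers(page_url: str, user_agent: str = USER_AGENT) -> dict:
    """Return request headers whose referer is the root of page_url's site."""
    parts = urlsplit(page_url)
    return {
        "user-agent": user_agent,
        # Assumption: the site root is enough; some sites may check the full page URL
        "referer": f"{parts.scheme}://{parts.netloc}/",
    }


headers = build_headers("https://mmzztt.com/")
print(headers["referer"])  # https://mmzztt.com/
```

The resulting dict can be passed as `headers=` to `requests.get`.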


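Scraping videos:

Pear Video (梨视频, pearvideo.com) is used as the example.

Visit the category loading address (category_loading.jsp). Some parameters in the URL can be removed without affecting access.

We can then simulate the request in code, and use a regular expression with a for loop to get the 12 video links: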

import requests
import re

res = requests.get("https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=130&start=0")
print(res.text)
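
The trimmed URL can also be assembled from its three remaining parameters. This is only a sketch: `category_url` is a hypothetical helper, and stepping `start` by 12 for later pages is an assumption based on each response holding 12 links, not something confirmed here.

```python
from urllib.parse import urlencode

BASE = "https://www.pearvideo.com/category_loading.jsp"


def category_url(category_id: int, start: int = 0) -> str:
    """Build the category loading URL from the parameters kept in this post."""
    params = {"reqType": 5, "categoryId": category_id, "start": start}
    return f"{BASE}?{urlencode(params)}"


print(category_url(130))
# https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=130&start=0

# Assumption: later pages begin at start=12, 24, ...
for start in range(0, 36, 12):
    print(category_url(130, start))
```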

Next, look for the actual video links in the response: each one is the href of an <a> tag with class "vervideo-lilink actplay".

Now we can get these 12 video links in code:

import requests
import re

res = requests.get("https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=130&start=0")
# print(res.text)
"""
The href in <a href="video_1756640" class="vervideo-lilink actplay"> is the video link.
(.*?) is a capture group that lazily matches any characters; the pattern
<a href="(.*?)" class="vervideo-lilink actplay"> is matched against res.text.
"""
video_list = re.findall('<a href="(.*?)" class="vervideo-lilink actplay">',res.text)
"""
With one capture group, findall returns only the captured text of each match:
['video_1756640',
'video_1756571',
'video_1756570',
'video_1756573',
'video_1756572',
'video_1756575',
'video_1756574',
'video_1756577',
'video_1756576',
'video_1756579',
'video_1756578',
'video_1756581']
"""
# print(video_list)
for video in video_list:
    """
    video_1756640
    video_1756571
    ...
    """
    print(video)
    video_jump = "https://www.pearvideo.com/" + video
    """
    https://www.pearvideo.com/video_1756640
    https://www.pearvideo.com/video_1756571
    ...
    """
    print(video_jump)
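
The extraction step can be tried offline on a small HTML fragment. The fragment below is made up for illustration (only the <a> tag shape comes from this post), and `extract_video_pages` is a hypothetical helper name.

```python
import re

# Same pattern as above, compiled once for reuse
LINK_PATTERN = re.compile(r'<a href="(.*?)" class="vervideo-lilink actplay">')

# Made-up fragment that mimics the shape of the real response
SAMPLE_HTML = """
<li><a href="video_1756640" class="vervideo-lilink actplay"><div>...</div></a></li>
<li><a href="video_1756571" class="vervideo-lilink actplay"><div>...</div></a></li>
"""


def extract_video_pages(html: str) -> list:
    """Return the full page URL for every matching <a> tag in html."""
    return ["https://www.pearvideo.com/" + path for path in LINK_PATTERN.findall(html)]


print(LINK_PATTERN.findall(SAMPLE_HTML))  # ['video_1756640', 'video_1756571']
print(extract_video_pages(SAMPLE_HTML))
```

Because the pattern has exactly one capture group, `findall` returns the captured href values rather than the whole tags.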

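Note that what we have now is each video's page link (a jump address), not its MP4 address, so we still need to find the real MP4 address. Open one of the extracted video pages: loading the video sends an ajax request.

Simulating this GET request in code gives the same response as requesting the address directly.

Opening the address directly in the browser does not give a playable video either, so check the request headers.

Manually add referer in the code. Its value is the previously visited address, i.e. the video page.

We now get an MP4 address back, but opening it in the browser still fails: the site has an anti-scraping measure.

The ajax URL contains contId, and contId is the video id.

Parameterize the ajax address with the video id. This yields the 12 MP4 addresses, but they still cannot be opened directly in a browser because of the anti-scraping measure.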

import requests
import re

res = requests.get("https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=130&start=0")
# print(res.text)
"""
The href in <a href="video_1756640" class="vervideo-lilink actplay"> is the video link.
(.*?) is a capture group that lazily matches any characters; the pattern
<a href="(.*?)" class="vervideo-lilink actplay"> is matched against res.text.
"""
video_list = re.findall('<a href="(.*?)" class="vervideo-lilink actplay">',res.text)
"""
With one capture group, findall returns only the captured text of each match:
['video_1756640',
'video_1756571',
'video_1756570',
'video_1756573',
'video_1756572',
'video_1756575',
'video_1756574',
'video_1756577',
'video_1756576',
'video_1756579',
'video_1756578',
'video_1756581']
"""
# print(video_list)
for video in video_list:
    # video_1756640
    # print(video)
    video_jump = "https://www.pearvideo.com/" + video
    # https://www.pearvideo.com/video_1756640
    # print(video_jump)
    video_id = video.split('_')[-1]
    # print(video_id)
    header = {
        'Referer':video_jump
    }
    # https://www.pearvideo.com/videoStatus.jsp?contId=1756640&mrd=0.6021376528237194
    res2 = requests.get(f'https://www.pearvideo.com/videoStatus.jsp?contId={video_id}&mrd=0.6021376528237194',headers=header)
    """
    https://video.pearvideo.com/mp4/third/20220506/1651923079092-15498275-170720-hd.mp4
    https://video.pearvideo.com/mp4/third/20220505/1651923079178-13691186-165820-hd.mp4
    https://video.pearvideo.com/mp4/third/20220505/1651923079259-15454898-113859-hd.mp4
    https://video.pearvideo.com/mp4/third/20220505/1651923079338-15498275-091809-hd.mp4
    https://video.pearvideo.com/mp4/short/20220329/1651923079417-15851807-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079497-15851424-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079570-15851419-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079646-15851434-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079734-15851429-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079826-15851444-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079914-15851439-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923080012-15851454-hd.mp4
    """
    print(res2.json()['videoInfo']['videos']['srcUrl'])
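
The request-building part of this loop can be isolated as a sketch. Both helper names are hypothetical. Generating `mrd` with `random.random()` is an assumption (the code above reuses one fixed value), and the fallback to `None` only covers missing keys, since the exact shape of a failed response is not shown in this post.

```python
import random
from urllib.parse import urlencode


def status_request(video_path: str):
    """Turn a path such as 'video_1756640' into the (url, headers) pair for videoStatus.jsp."""
    video_id = video_path.split("_")[-1]
    # Assumption: mrd is just a random number; the post uses a fixed value
    query = urlencode({"contId": video_id, "mrd": random.random()})
    url = f"https://www.pearvideo.com/videoStatus.jsp?{query}"
    # The referer is the video page we would have come from
    headers = {"Referer": "https://www.pearvideo.com/" + video_path}
    return url, headers


def src_url_from(payload):
    """Return payload['videoInfo']['videos']['srcUrl'], or None if a key is missing."""
    try:
        return payload["videoInfo"]["videos"]["srcUrl"]
    except (KeyError, TypeError):
        return None


url, headers = status_request("video_1756640")
print(url)
print(headers)
```

In the loop above, `requests.get(url, headers=headers).json()` would then be passed to `src_url_from`.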

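Open the video page again, click play, and find the URL the video is actually played from.

Comparing the real video address with the address we obtained shows that the real one has cont-<video_id> where ours has a number at the start of the file name.

Video address analysis: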

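# Video address analysis
video_id = "1757346"  # the id that appears in the real address below
fake = 'https://video.pearvideo.com/mp4/third/20220403/1651906724559-15902642-124731-hd.mp4'
real = "https://video.pearvideo.com/mp4/third/20220403/cont-1757346-15902642-124731-hd.mp4"
prefix = fake.split('/')[-1].split('-')[0]
# 1651906724559
print(prefix)
# Replace the leading number with cont-<video id>
converted = fake.replace(prefix, "cont-%s" % video_id)
print(converted)
# True: the converted address equals the real one
print(converted == real)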
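
The same rewrite can be wrapped in a function. This sketch uses a hypothetical `to_real_url` helper that only touches the file name, since `str.replace` on the whole URL would also change any other place where the same digits happened to occur. The two sample addresses are the ones from the analysis above.

```python
FAKE = "https://video.pearvideo.com/mp4/third/20220403/1651906724559-15902642-124731-hd.mp4"
REAL = "https://video.pearvideo.com/mp4/third/20220403/cont-1757346-15902642-124731-hd.mp4"


def to_real_url(src_url: str, video_id: str) -> str:
    """Swap the first dash-separated part of the file name for cont-<video_id>."""
    base, _, filename = src_url.rpartition("/")
    _, _, rest = filename.partition("-")  # drop the leading number
    return f"{base}/cont-{video_id}-{rest}"


print(to_real_url(FAKE, "1757346") == REAL)  # True
```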

The final code, which scrapes the videos and writes them to local disk:

import requests
import re

res = requests.get("https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=130&start=0")
# print(res.text)
"""
The href in <a href="video_1756640" class="vervideo-lilink actplay"> is the video link.
(.*?) is a capture group that lazily matches any characters; the pattern
<a href="(.*?)" class="vervideo-lilink actplay"> is matched against res.text.
"""
video_list = re.findall('<a href="(.*?)" class="vervideo-lilink actplay">',res.text)
"""
With one capture group, findall returns only the captured text of each match:
['video_1756640',
'video_1756571',
'video_1756570',
'video_1756573',
'video_1756572',
'video_1756575',
'video_1756574',
'video_1756577',
'video_1756576',
'video_1756579',
'video_1756578',
'video_1756581']
"""
# print(video_list)
for video in video_list:
    # video_1756640
    # print(video)
    video_jump = "https://www.pearvideo.com/" + video
    # https://www.pearvideo.com/video_1756640
    # print(video_jump)
    video_id = video.split('_')[-1]
    # print(video_id)
    header = {
        'Referer':video_jump
    }
    # https://www.pearvideo.com/videoStatus.jsp?contId=1756640&mrd=0.6021376528237194
    res2 = requests.get(f'https://www.pearvideo.com/videoStatus.jsp?contId={video_id}&mrd=0.6021376528237194',headers=header)
    """
    https://video.pearvideo.com/mp4/third/20220506/1651923079092-15498275-170720-hd.mp4
    https://video.pearvideo.com/mp4/third/20220505/1651923079178-13691186-165820-hd.mp4
    https://video.pearvideo.com/mp4/third/20220505/1651923079259-15454898-113859-hd.mp4
    https://video.pearvideo.com/mp4/third/20220505/1651923079338-15498275-091809-hd.mp4
    https://video.pearvideo.com/mp4/short/20220329/1651923079417-15851807-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079497-15851424-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079570-15851419-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079646-15851434-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079734-15851429-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079826-15851444-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079914-15851439-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923080012-15851454-hd.mp4
    """
    # print(res2.json()['videoInfo']['videos']['srcUrl'])
    no_video_url = res2.json()['videoInfo']['videos']['srcUrl']
    video_real_url = no_video_url.replace(no_video_url.split('/')[-1].split('-')[0],'cont-%s' % video_id)
    """
    https://video.pearvideo.com/mp4/third/20220506/cont-1761232-15498275-170720-hd.mp4
    https://video.pearvideo.com/mp4/third/20220505/cont-1761062-13691186-165820-hd.mp4
    ...
    """
    print(video_real_url)
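    # stream=True avoids loading the whole video into memory before writing
    res3 = requests.get(video_real_url, stream=True)
    # The static/ directory must already exist
    with open('static/%s.mp4' % video_id, 'wb') as f:
        # Write the binary data chunk by chunk
        for chunk in res3.iter_content(1024):
            f.write(chunk)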
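
The saving step could be made a little more defensive, as sketched below. `save_video` is a hypothetical helper that expects an object with the `raise_for_status()` and `iter_content()` methods of a requests response. `FakeResponse` is a stand-in used only so the sketch runs without network access.

```python
import tempfile
from pathlib import Path


def save_video(res, video_id: str, out_dir: str = "static", chunk_size: int = 1024) -> Path:
    """Stream a response body into <out_dir>/<video_id>.mp4 and return the path."""
    res.raise_for_status()  # stop on 4xx/5xx instead of saving an error page
    folder = Path(out_dir)
    folder.mkdir(parents=True, exist_ok=True)  # create the folder if it is missing
    target = folder / f"{video_id}.mp4"
    with open(target, "wb") as f:
        for chunk in res.iter_content(chunk_size):
            f.write(chunk)
    return target


class FakeResponse:
    """Stand-in for a requests response, for offline demonstration only."""

    def __init__(self, pieces):
        self.pieces = pieces

    def raise_for_status(self):
        pass  # pretend the status was 200

    def iter_content(self, chunk_size):
        yield from self.pieces


demo_dir = tempfile.mkdtemp()
path = save_video(FakeResponse([b"abc", b"def"]), "1756640", out_dir=demo_dir)
print(path.name, path.read_bytes())  # 1756640.mp4 b'abcdef'
```

In the loop above, the call would be `save_video(res3, video_id)`.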


posted @ 2022-05-07 20:30  谢俊杰