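Web scraping: images and videos
To get binary data, use content or iter_content: res.content returns the whole response body as bytes, while res.iter_content(chunk_size) yields it in chunks.
Either can be used to download images and videos.
Scraping an image: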
import requests

header = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/101.0.4951.54 Safari/537.36",
    "referer": "https://mmzztt.com/"
}
res = requests.get("https://p.iimzt.com/2022/04/21g14fpi.jpg", headers=header)
with open("a.jpg", 'wb') as f:
    # f.write(res.content)
    # Read a fixed number of bytes per iteration: fetch 100, write 100
    for i in res.iter_content(100):
        f.write(i)
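The chunked write can be wrapped in a small helper. This is a minimal sketch rather than the author's code: `save_response` and `FakeResponse` are hypothetical names, and the fake response exists only so the helper can be tried without network access. One detail worth knowing: requests only defers downloading the body when `stream=True` is passed to `requests.get`; without it, the whole body is already in memory before `iter_content` runs.

```python
import os
import tempfile


def save_response(response, path, chunk_size=8192):
    """Write the body of a response-like object to path, chunk by chunk.

    `response` only needs an iter_content(chunk_size=...) method, which is
    what requests.Response provides. Returns the number of bytes written.
    """
    written = 0
    with open(path, "wb") as f:
        for chunk in response.iter_content(chunk_size=chunk_size):
            if chunk:  # skip empty chunks
                f.write(chunk)
                written += len(chunk)
    return written


class FakeResponse:
    """Hypothetical stand-in for requests.Response, for trying the helper offline."""

    def __init__(self, body):
        self.body = body

    def iter_content(self, chunk_size=1):
        # Yield the body in slices of at most chunk_size bytes
        for start in range(0, len(self.body), chunk_size):
            yield self.body[start:start + chunk_size]


demo_path = os.path.join(tempfile.mkdtemp(), "a.jpg")
demo_written = save_response(FakeResponse(b"x" * 250), demo_path, chunk_size=100)
print(demo_written)
```

With a real request, the call would look like: with requests.get(url, headers=header, stream=True) as res: save_response(res, "a.jpg").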
The browser's request headers for this image carry user-agent and referer, so our request headers need to carry them as well.
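Scraping videos:
Pear Video (梨视频) is used as the example:
Visit this address. Some of the parameters in the URL can be deleted, and access still works after deleting them:
We can then simulate the request in code, and use a regular expression together with a for loop to get the 12 video links.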
import requests
import re
res = requests.get("https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=130&start=0")
print(res.text)
Next, look for the real video address:
At this point we can use code to get the 12 video links.
import requests
import re

res = requests.get("https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=130&start=0")
# print(res.text)
"""
In the tag <a href="video_1756640" class="vervideo-lilink actplay">, the href is the video link.
(.*?) is a non-greedy capture group; match <a href="(.*?)" class="vervideo-lilink actplay"> against res.text.
"""
video_list = re.findall('<a href="(.*?)" class="vervideo-lilink actplay">', res.text)
"""
findall returns only the captured group, so video_list looks like:
['video_1756640', 'video_1756571', 'video_1756570', 'video_1756573',
 'video_1756572', 'video_1756575', 'video_1756574', 'video_1756577',
 'video_1756576', 'video_1756579', 'video_1756578', 'video_1756581']
"""
# print(video_list)
for video in video_list:
    # video_1756640
    # video_1756571
    # ...
    print(video)
    video_jump = "https://www.pearvideo.com/" + video
    # https://www.pearvideo.com/video_1756640
    # https://www.pearvideo.com/video_1756571
    # ...
    print(video_jump)
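The extraction step can be tried offline against a small piece of markup. The sketch below is illustrative only: `sample_html` is a hand-written fragment that mimics the listing markup, and `extract_video_pages` is a hypothetical helper name. It shows that `re.findall` with one capture group returns just the captured href values, and it uses `urllib.parse.urljoin` in place of string concatenation to build the page URLs.

```python
import re
from urllib.parse import urljoin

BASE_URL = "https://www.pearvideo.com/"
# Same pattern as in the text, compiled once
LINK_PATTERN = re.compile(r'<a href="(.*?)" class="vervideo-lilink actplay">')

# Hand-written sample that mimics the listing markup (not real site output)
sample_html = (
    '<li><a href="video_1756640" class="vervideo-lilink actplay">...</a></li>\n'
    '<li><a href="video_1756571" class="vervideo-lilink actplay">...</a></li>\n'
)


def extract_video_pages(html, base_url=BASE_URL):
    """Return the absolute page URL for every matching <a> tag in html."""
    return [urljoin(base_url, href) for href in LINK_PATTERN.findall(html)]


pages = extract_video_pages(sample_html)
for page in pages:
    print(page)
```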
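Note that what we have at this point is the link to each video's page, a jump address, not the MP4 address. We still need to find the real MP4 address. Open one of the filtered video links: loading the video sends an ajax request.
Simulating that GET request in code returns the same result.
Opening the same address in the browser does not give us the video either, so check the request headers.
Manually add referer in the code --> the previously visited address:
Now the response contains the video's MP4 address, but opening it in the browser still fails: the site has anti-scraping measures.
Notice contId in the request: contId is the video id.
Parameterize the ajax address used to request the video. This yields 12 MP4 addresses, but they cannot be opened directly in the browser because of the anti-scraping measures.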
import requests
import re

res = requests.get("https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=130&start=0")
# print(res.text)
"""
In the tag <a href="video_1756640" class="vervideo-lilink actplay">, the href is the video link.
(.*?) is a non-greedy capture group; match <a href="(.*?)" class="vervideo-lilink actplay"> against res.text.
"""
video_list = re.findall('<a href="(.*?)" class="vervideo-lilink actplay">', res.text)
"""
findall returns only the captured group, so video_list looks like:
['video_1756640', 'video_1756571', 'video_1756570', 'video_1756573',
 'video_1756572', 'video_1756575', 'video_1756574', 'video_1756577',
 'video_1756576', 'video_1756579', 'video_1756578', 'video_1756581']
"""
# print(video_list)
for video in video_list:
    # video_1756640
    # print(video)
    video_jump = "https://www.pearvideo.com/" + video
    # https://www.pearvideo.com/video_1756640
    # print(video_jump)
    video_id = video.split('_')[-1]
    # print(video_id)
    header = {
        'Referer': video_jump
    }
    # https://www.pearvideo.com/videoStatus.jsp?contId=1756640&mrd=0.6021376528237194
    res2 = requests.get(f'https://www.pearvideo.com/videoStatus.jsp?contId={video_id}&mrd=0.6021376528237194', headers=header)
    """
    https://video.pearvideo.com/mp4/third/20220506/1651923079092-15498275-170720-hd.mp4
    https://video.pearvideo.com/mp4/third/20220505/1651923079178-13691186-165820-hd.mp4
    https://video.pearvideo.com/mp4/third/20220505/1651923079259-15454898-113859-hd.mp4
    https://video.pearvideo.com/mp4/third/20220505/1651923079338-15498275-091809-hd.mp4
    https://video.pearvideo.com/mp4/short/20220329/1651923079417-15851807-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079497-15851424-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079570-15851419-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079646-15851434-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079734-15851429-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079826-15851444-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079914-15851439-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923080012-15851454-hd.mp4
    """
    print(res2.json()['videoInfo']['videos']['srcUrl'])
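The request-building and response-reading steps can be separated into two small functions, which makes them easy to check without touching the network. This is a sketch under stated assumptions: `build_status_request` and `extract_src_url` are hypothetical names, and treating `mrd` as an arbitrary random number is a guess based on how the value looks, not something the site documents. The `videoInfo` / `videos` / `srcUrl` path is the one used in the code above.

```python
import random


def build_status_request(video_path):
    """Build (url, params, headers) for the videoStatus.jsp request.

    video_path is a listing href such as "video_1756640".
    """
    video_id = video_path.split("_")[-1]
    url = "https://www.pearvideo.com/videoStatus.jsp"
    # Assumption: mrd is just a random value in [0, 1)
    params = {"contId": video_id, "mrd": random.random()}
    # The Referer is the video's own page, as in the text
    headers = {"Referer": "https://www.pearvideo.com/" + video_path}
    return url, params, headers


def extract_src_url(payload):
    """Return payload['videoInfo']['videos']['srcUrl'], or None if it is absent."""
    try:
        return payload["videoInfo"]["videos"]["srcUrl"]
    except (KeyError, TypeError):
        return None


url, params, headers = build_status_request("video_1756640")
print(url, params["contId"], headers["Referer"])
```

The tuple plugs into requests as requests.get(url, params=params, headers=headers), and extract_src_url(res2.json()) returns None instead of raising when the response does not have the expected shape, for example when the Referer is missing.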
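Play the video in the browser again and find the URL that the player actually requests.
Comparing the real video address with the one we obtained shows that the real address uses cont- followed by our video_id in place of the leading segment of the file name.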
Parsing the video address:
# Video address analysis
fake = 'https://video.pearvideo.com/mp4/third/20220403/1651906724559-15902642-124731-hd.mp4'
real = "https://video.pearvideo.com/mp4/third/20220403/cont-1757346-15902642-124731-hd.mp4"
# Video id, taken from the real address above
video_id = "1757346"
l = fake.split('/')[-1].split('-')[0]
# 1651906724559
print(l)
# Replace the split-off string with the video id we want
print(l.replace(l, "cont-%s" % video_id))
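The same rewrite can be expressed as a function that touches only the file name, so that a matching digit string elsewhere in the URL is never replaced by accident. This is a minimal sketch; `to_real_url` is a hypothetical name, and the two URLs are the `fake` and `real` examples from the analysis above.

```python
def to_real_url(src_url, video_id):
    """Swap the first '-'-separated part of the file name for 'cont-<video_id>'.

    Raises ValueError if the file name contains no '-'.
    """
    head, file_name = src_url.rsplit("/", 1)   # split off the last path segment
    _, rest = file_name.split("-", 1)          # drop the leading timestamp-like part
    return f"{head}/cont-{video_id}-{rest}"


fake = "https://video.pearvideo.com/mp4/third/20220403/1651906724559-15902642-124731-hd.mp4"
real = "https://video.pearvideo.com/mp4/third/20220403/cont-1757346-15902642-124731-hd.mp4"

print(to_real_url(fake, "1757346") == real)
```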
The final code, which scrapes the videos and writes them to local disk:
import os
import re
import requests

res = requests.get("https://www.pearvideo.com/category_loading.jsp?reqType=5&categoryId=130&start=0")
# print(res.text)
"""
In the tag <a href="video_1756640" class="vervideo-lilink actplay">, the href is the video link.
(.*?) is a non-greedy capture group; match <a href="(.*?)" class="vervideo-lilink actplay"> against res.text.
"""
video_list = re.findall('<a href="(.*?)" class="vervideo-lilink actplay">', res.text)
"""
findall returns only the captured group, so video_list looks like:
['video_1756640', 'video_1756571', 'video_1756570', 'video_1756573',
 'video_1756572', 'video_1756575', 'video_1756574', 'video_1756577',
 'video_1756576', 'video_1756579', 'video_1756578', 'video_1756581']
"""
# print(video_list)
# Make sure the output directory exists before writing into it
os.makedirs('static', exist_ok=True)
for video in video_list:
    # video_1756640
    # print(video)
    video_jump = "https://www.pearvideo.com/" + video
    # https://www.pearvideo.com/video_1756640
    # print(video_jump)
    video_id = video.split('_')[-1]
    # print(video_id)
    header = {
        'Referer': video_jump
    }
    # https://www.pearvideo.com/videoStatus.jsp?contId=1756640&mrd=0.6021376528237194
    res2 = requests.get(f'https://www.pearvideo.com/videoStatus.jsp?contId={video_id}&mrd=0.6021376528237194', headers=header)
    """
    https://video.pearvideo.com/mp4/third/20220506/1651923079092-15498275-170720-hd.mp4
    https://video.pearvideo.com/mp4/third/20220505/1651923079178-13691186-165820-hd.mp4
    https://video.pearvideo.com/mp4/third/20220505/1651923079259-15454898-113859-hd.mp4
    https://video.pearvideo.com/mp4/third/20220505/1651923079338-15498275-091809-hd.mp4
    https://video.pearvideo.com/mp4/short/20220329/1651923079417-15851807-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079497-15851424-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079570-15851419-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079646-15851434-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079734-15851429-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079826-15851444-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923079914-15851439-hd.mp4
    https://video.pearvideo.com/mp4/short/20220328/1651923080012-15851454-hd.mp4
    """
    # print(res2.json()['videoInfo']['videos']['srcUrl'])
    no_video_url = res2.json()['videoInfo']['videos']['srcUrl']
    video_real_url = no_video_url.replace(no_video_url.split('/')[-1].split('-')[0], 'cont-%s' % video_id)
    """
    https://video.pearvideo.com/mp4/third/20220506/cont-1761232-15498275-170720-hd.mp4
    https://video.pearvideo.com/mp4/third/20220505/cont-1761062-13691186-165820-hd.mp4
    ...
    """
    print(video_real_url)
    res3 = requests.get(video_real_url)
    with open('static/%s.mp4' % video_id, 'wb') as f:
        # Write the binary content chunk by chunk
        for i in res3.iter_content(1024):
            f.write(i)