A summary of web crawler essentials.

The basic workflow of a web crawler is as follows:

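1. Select the seed URLs;

2. Put these URLs into the to-crawl URL queue;

3. Take a URL from the to-crawl queue, resolve its DNS to get the host's IP address, download the page that the URL points to, and store it in the downloaded-page store;

4. Analyze the pages behind the already-crawled URLs, extract the other URLs they contain, and put those URLs into the to-crawl queue, which starts the next cycle.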
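The four steps above can be sketched as a small breadth-first loop. This is only a minimal illustration, not a production crawler: `crawl`, `fetch`, `max_pages`, and the `FAKE_WEB` dictionary are names invented for this sketch, and the in-memory dictionary stands in for real HTTP responses so the example runs without a network. In a real crawler, DNS resolution and the download would happen inside `fetch` (for example through `requests.get`).

```python
import re
from collections import deque
from urllib.parse import urljoin

# Naive link pattern, good enough for this sketch;
# a real crawler would use an HTML parser instead.
HREF_RE = re.compile(r'href="([^"]+)"')


def crawl(seed_urls, fetch, max_pages=10):
    """Breadth-first crawl starting from seed_urls.

    fetch is a caller-supplied function (an assumption of this sketch) that
    takes a URL and returns the page HTML as a string, or None on failure.
    """
    frontier = deque(seed_urls)   # steps 1-2: seeds go into the to-crawl queue
    seen = set(seed_urls)         # URLs already queued, so none is visited twice
    downloaded = {}               # step 3: the downloaded-page store

    while frontier and len(downloaded) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        downloaded[url] = html
        # step 4: extract links from the downloaded page and enqueue new ones
        for href in HREF_RE.findall(html):
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return downloaded


# A tiny in-memory "web" standing in for real HTTP responses (made-up URLs).
FAKE_WEB = {
    "http://example.test/": '<a href="/a">A</a> <a href="/b">B</a>',
    "http://example.test/a": '<a href="/b">B</a>',
    "http://example.test/b": '<a href="/">home</a>',
}

pages = crawl(["http://example.test/"], FAKE_WEB.get)
print(list(pages))
```

The `seen` set is what keeps the loop from cycling forever when pages link back to each other, and `max_pages` is a simple cap so a test run stays small.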

When matching tags in a fetched page, there are three ways to extract the data: re (regular expressions), XPath, and BeautifulSoup4.

I recommend learning re regex matching thoroughly, because on some sites XPath and BeautifulSoup4 are not as efficient as regex matching.

I came to this conclusion while crawling the Youku site: regex matching picks out exactly the data you are after.
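One case where a regex is convenient is a value that lives inside script text rather than in a tag or attribute. The sketch below contrasts the two approaches on a made-up page fragment. The markup, the URLs, and the `SAMPLE_PAGE` name are assumptions for illustration. To keep the example self-contained, the standard library's `xml.etree.ElementTree` stands in for lxml or BeautifulSoup4. It supports only a subset of XPath and requires well-formed markup.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment; markup and URLs are made up.
SAMPLE_PAGE = """<html><body>
<div class="actcontbd"><a href="video_1">first</a></div>
<div class="actcontbd"><a href="video_2">second</a></div>
<script>var srcUrl="https://cdn.example.test/video_1.mp4", other="x";</script>
</body></html>"""

# Element-style selection: works well for values stored in tags and attributes.
root = ET.fromstring(SAMPLE_PAGE)
hrefs = [a.get("href") for a in root.findall(".//div[@class='actcontbd']/a")]

# Regex: reaches a value that only exists inside the script text.
# The non-greedy (.+?) stops at the first closing quote.
video_urls = re.findall(r'srcUrl="(.+?)"', SAMPLE_PAGE)

print(hrefs)
print(video_urls)
```

An element selector could return the whole script text, but pulling the quoted URL out of that text still takes string matching, which is where the regex does the work.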

If you are a beginner, here is a tip for keeping your IP from being banned: exit(-1).

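This tip works as a breakpoint in the crawler. Set the breakpoint first so that the script fetches only one record and stops, which keeps the site from taking you for a crawler. Once all the logic has been confirmed to work, remove the breakpoint and crawl the data you want.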
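A minimal sketch of the breakpoint idea, assuming the crawl is a loop over URLs. The names `crawl_all`, `handle`, and `debug_single_item` are invented for this sketch. While the flag is on, the script handles one item and then stops through `sys.exit(-1)`, so only a single request is sent while the parsing logic is still being debugged.

```python
import sys


def crawl_all(urls, handle, debug_single_item=True):
    """Call handle(url) for each URL.

    handle is a caller-supplied function (hypothetical) that fetches and
    parses one page. With debug_single_item on, the script exits after the
    first item, acting as the exit(-1) breakpoint described above.
    """
    for url in urls:
        handle(url)
        if debug_single_item:
            # Breakpoint: stop the whole script after one record.
            sys.exit(-1)
```

Calling `crawl_all(url_list, handle)` stops after the first URL. Passing `debug_single_item=False` once the logic is confirmed lets the loop run through the whole list.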

So here is the code for a crawler that downloads a video:

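# Imports
import os
import re
from urllib.parse import urljoin

import requests
from lxml import etree


class PearVideo(object):

    # Fetch a page, caching the HTML in a local file so that repeated runs
    # do not request the site again
    def get_content(self, url, page_type):
        if page_type == 'index':
            fil_name = 'test_pear.html'
        else:
            fil_name = 'inner_pear.html'
        # Use the os module to check whether the cached file exists
        if not os.path.exists(fil_name):
            # Send the HTTP request
            r = requests.get(url)
            # Decode the response body
            contents = r.content.decode('utf-8')
            # Write the page to the cache file
            with open('./' + fil_name, 'w', encoding='utf-8') as f:
                f.write(contents)
        else:
            # Read the page back from the cache file
            with open('./' + fil_name, encoding='utf-8') as f:
                contents = f.read()
        return contents

    # Match the inner-page links, then download one video
    def get_xpath(self, html):
        # Convert the HTML string into an element tree
        html = etree.HTML(html)
        html_data_img = html.xpath("//div[@class='actcontbd']/a/@href")
        # print(html_data_img)
        # Build the absolute inner-page URLs
        url_list = []
        for item in html_data_img:
            item = urljoin('https://www.pearvideo.com/', item)
            url_list.append(item)
        # print(url_list)
        # Crawl one inner page (index 8 is hard-coded, so the list needs
        # at least 9 entries)
        url_page = url_list[8]
        inner_html = self.get_content(url_page, 'inner')
        # Match the real video address
        regex = re.compile('srcUrl="(.+?)"')
        video_urls = regex.findall(inner_html)
        print(video_urls)
        # Download the video; "wb" overwrites the file, so a second run
        # does not append to an existing one
        r = requests.get(video_urls[0])
        with open("./test_pear.mp4", "wb") as f:
            f.write(r.content)


if __name__ == "__main__":
    # Instantiate an object
    pearvideo = PearVideo()
    html = pearvideo.get_content('https://www.pearvideo.com/', 'index')
    # Match the inner-page links and download the video
    pearvideo.get_xpath(html)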

If anything here needs improving, I would welcome pointers from more experienced readers.

posted @ 2019-03-05 20:38 男神鹏●詹姆斯
