Web Scraping Basics (new)
1. Overview of web crawlers
What a crawler is:
A program that simulates a browser going online and then crawls/scrapes data from the internet.
Simulation: the browser itself is a natural, primitive crawling tool.
Types of crawlers:
General-purpose crawler: crawls the data of an entire page; the fetching component of a crawling system.
Focused crawler: crawls a specific part of a page; it is always built on top of a general-purpose crawl.
Incremental crawler: monitors a site for updates so that only the newly published data is crawled.
Risk analysis
Use crawlers responsibly
Where the risks come from:
The crawler interferes with the normal operation of the target website;
The crawler collects specific types of data or information that are protected by law.
How to avoid these risks:
Strictly follow the robots protocol set by the website;
When working around anti-crawling measures, optimize your code so that it does not disturb the normal operation of the target website;
When using or redistributing the collected information, review its content; if it contains users' personal information, private data, or someone else's trade secrets, stop immediately and delete it.
Anti-crawling mechanisms
Counter strategies against anti-crawling
robots.txt protocol: a plain-text protocol that states which data may and may not be crawled (a minimal sketch of checking it programmatically follows the header list below).
Commonly used request headers
User-Agent: identifies the client that sends the request
Connection: close
content-type
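A small sketch (not part of the original notes) of honoring robots.txt with the standard library's urllib.robotparser; the site URL and user agent below are placeholder assumptions.
from urllib import robotparser

# hypothetical target site; replace with the site you intend to crawl
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()  # download and parse robots.txt
# can_fetch(useragent, url) -> True if this user agent is allowed to crawl the url
print(rp.can_fetch("*", "https://www.example.com/some/page"))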
How do you tell whether a page contains dynamically loaded data?
Do a local search (within the page's own response) or a global search (across all captured requests) in the browser's developer tools
What is the first thing to do before crawling an unfamiliar website?
Determine whether the data you want to crawl is dynamically loaded!!!
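A minimal sketch of the same check done in code (not from the original notes; the URL and keyword are placeholder assumptions): request the raw page source and see whether a sample of the target data appears in it.
import requests

# hypothetical page and a snippet of data that you can see in the rendered page
url = "https://www.example.com/some/page"
keyword = "data you can see in the browser"
headers = {'User-Agent': "Mozilla/5.0"}  # any normal browser UA string
page_text = requests.get(url=url, headers=headers).text
# if the keyword is absent from the raw source, the data is loaded dynamically;
# find the extra ajax/js request in the browser's Network panel instead
print(keyword in page_text)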
2. Basic usage of the requests module
The requests module
What it is: a module built around network requests; it is used to simulate a browser sending requests.
Coding workflow:
Specify the URL
Send the request
Get the response data (the crawled data)
Persist the data
import requests
url = 'https://www.sogou.com'
# the return value is a Response object
response = requests.get(url=url)
# .text returns the response body as a string
data = response.text
with open('./sogou.html', "w", encoding='utf-8') as f:
    f.write(data)
Write a simple web page collector based on Sogou
Fix the garbled-character (encoding) problem
Get past UA detection
import requests
wd = input('输入key:')
url = 'https://www.sogou.com/web'
# the dynamic request parameters
params = {
    'query': wd
}
# params carries the URL query parameters
# headers fakes the UA to get past the UA-detection anti-crawling check
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, params=params, headers=headers)
# manually set the response encoding to fix garbled Chinese
response.encoding = 'utf-8'
data = response.text
filename = wd + '.html'
with open(filename, "w", encoding='utf-8') as f:
    f.write(data)
print(wd, "下载成功")
1. Crawl detailed movie data from Douban
Analysis
When you scroll to the bottom of the page, an AJAX request fires and returns a new batch of movie data.
Dynamically loaded data: data fetched by an additional, separate request
AJAX-generated dynamically loaded data
JS-generated dynamically loaded data
import requests
limit = input("排行榜前多少的数据:::")
url = 'https://movie.douban.com/j/chart/top_list'
params = {
    "type": "5",
    "interval_id": "100:90",
    "action": "",
    "start": "0",
    "limit": limit
}
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, params=params, headers=headers)
# .json() returns the already deserialized object
data_list = response.json()
with open('douban.txt', "w", encoding='utf-8') as f:
    for i in data_list:
        name = i['title']
        score = i['score']
        f.write(name + " " + score + "\n")
print("成功")
2. Crawl KFC store location data
import requests
url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"
params = {
    "cname": "",
    "pid": "",
    "keyword": "青岛",
    "pageIndex": "1",
    "pageSize": "10"
}
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, params=params, headers=headers)
# .json() returns the already deserialized object
data_list = response.json()
with open('kedeji.txt', "w", encoding='utf-8') as f:
    for i in data_list["Table1"]:
        name = i['storeName']
        address = i['addressDetail']
        f.write(name + "," + address + "\n")
print("成功")
3. Tencent Classroom (ke.qq.com)
import requests
res = requests.post(
    url='https://ke.qq.com/cgi-proxy/course_list/search_course_list?bkn=&r=0.1427',
    # the request body is a JSON payload, so pass json= rather than data=; data= is normally used for form submissions
    json={"word": "python", "page": "2", "visitor_id": "9283127513403748", "finger_id": "f62c588e17fd13645de79684bdcc3017", "platform": 3, "source": "search", "count": 24, "need_filter_contact_labels": 1},
    headers={
        'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36",
        "Referer": "https://ke.qq.com/course/list/python?page=2"
    }
)
print(res.json())
4. Bilibili account info and cookies
Fetch your personal Bilibili login info; the Cookie used here is the cookie obtained after logging in
import requests
res = requests.get(
url='https://api.bilibili.com/x/member/web/account?web_location=333.33',
headers={
'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36",
'Cookie':"buvid3=4D2C694C-3D16-25A9-AD4A-ACB2D65C90C918352infoc; b_nut=1705987918; _uuid=51064FE53-451E-6E74-E16F-6F8B92624F9718713infoc; buvid_fp=c00b689fc9db5c29f662cf2fb6bd01a6; buvid4=18A11919-E414-FBFC-1B8D-8BA88C1BCCB619628-024012305-nX8zov%2FsHs%2FB%2BKaOt7gP%2BA%3D%3D; CURRENT_FNVAL=4048; rpdid=|(J~|~~m)l|J0J'u~|lm)mmRJ; b_lsid=225B1026E_18D34F399DC; bsource=search_baidu; enable_web_push=DISABLE; header_theme_version=CLOSE; csrf_state=f9a3e7fd70aeb8dd0ee9ec51bb9a0e67; bili_ticket=eyJhbGciOiJIUzI1NiIsImtpZCI6InMwMyIsInR5cCI6IkpXVCJ9.eyJleHAiOjE3MDYyNDk2MTksImlhdCI6MTcwNTk5MDM1OSwicGx0IjotMX0.HZcDZSE0m-wZuevYHxD_f7SH-EkBX5ktM86m92q0SP4; bili_ticket_expires=1706249559; SESSDATA=9ba7b60b%2C1721542419%2C457da%2A11CjAb6kWlfz4IljS2QO8Sh_ohT9JXLK_vgAx0PTRmispeHAUur9NoA4gyvptuJ2pC-UcSVjhUajBkblptWHc2ZUNvN2NMMHRJeUdhR3F4SzRQU0tPRjUxMkJObWRzZGdTY3Z1UFk3bk1CRmhFdlo4am0yNnFndi02SzlfaWZ3V3NqaEtCLVhFNDZBIIEC; bili_jct=2312bb4ef19779446149d14e74fabb37; DedeUserID=388839130; DedeUserID__ckMd5=4cfa0a5f7f7d631f; sid=gfnbaeos; PVID=2; home_feed_column=5; browser_resolution=1440-791"
}
)
print(res.json())
#{'code': 0, 'message': '0', 'ttl': 1, 'data': {'mid': 388839130, 'uname': '追梦nan', 'userid': 'bili_43698205152', 'sign': '', 'birthday': '1970-01-01', 'sex': '男', 'nick_free': False, 'rank': '正式会员'}}
5. Response body formats
res.content   # raw response body (bytes): videos, files, images
res.text      # response body decoded to text
res.json()    # response body parsed as JSON
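As a small complement (a sketch under assumed conditions, not from the original notes), the response's Content-Type header can guide which of the three accessors to use; the URL below is a placeholder.
import requests

res = requests.get("https://www.example.com/")  # hypothetical URL
ctype = res.headers.get("Content-Type", "")
if "application/json" in ctype:
    data = res.json()    # parsed JSON object
elif ctype.startswith("text/"):
    data = res.text      # decoded text (html, plain text, ...)
else:
    data = res.content   # raw bytes (images, video, other files)
print(type(data))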
3. Data parsing
Parsing: extracting data according to specified rules
Purpose: it is what makes a focused crawler possible
Coding workflow for a focused crawler:
Specify the URL
Send the request
Get the response data
Parse the data
Persist the data
Data-parsing approaches:
Regular expressions
bs4
xpath
pyquery (extra)
What is the general principle behind data parsing?
Parsing operates on the page source (a collection of HTML tags).
What is the core purpose of HTML?
Displaying data
How does HTML display data?
The data HTML displays is always placed inside HTML tags or in their attributes.
General principle:
1. Locate the tag
2. Take its text or take an attribute (a small preview follows below)
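A tiny preview of these two steps (a sketch only; the HTML string, class name, and URL are made up), using bs4, which is covered in detail below:
from bs4 import BeautifulSoup

html = '<div><a class="link" href="https://www.example.com">hello</a></div>'
soup = BeautifulSoup(html, features="html.parser")
tag = soup.find(name="a", attrs={"class": "link"})  # 1. locate the tag
print(tag.text)            # 2a. take its text        -> hello
print(tag.attrs["href"])   # 2b. or take an attribute -> https://www.example.com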
1. Regex parsing
1. Crawl the image posts from Qiushibaike
Download a single image
import requests
url = "https://pic.qiushibaike.com/system/pictures/12330/123306162/medium/GRF7AMF9GKDTIZL6.jpg"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
response = requests.get(url=url, headers=headers)
# .content returns the response body as bytes
img_data = response.content
with open('./123.jpg', "wb") as f:
    f.write(img_data)
print("成功")
Crawl a single page
<div class="thumb">
<a href="/article/123319109" target="_blank">
<img src="//pic.qiushibaike.com/system/pictures/12331/123319109/medium/MOX0YDFJX7CM1NWK.jpg" alt="糗事#123319109" class="illustration" width="100%" height="auto">
</a>
</div>
import re
import os
import requests
dir_name = "./img"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
url = "https://www.qiushibaike.com/imgrank/"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
img_text = requests.get(url, headers=headers).text
ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
img_list = re.findall(ex, img_text, re.S)
for src in img_list:
    src = "https:" + src
    img_name = src.split('/')[-1]
    img_path = dir_name + "/" + img_name
    # request the image URL to get the binary image data
    response = requests.get(src, headers=headers).content
    with open(img_path, "wb") as f:
        f.write(response)
print("成功")
Crawl multiple pages
import re
import os
import requests
dir_name = "./img"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
for i in range(1, 5):
    url = f"https://www.qiushibaike.com/imgrank/page/{i}/"
    print(f"正在爬取第{i}页的图片")
    img_text = requests.get(url, headers=headers).text
    ex = '<div class="thumb">.*?<img src="(.*?)" alt=.*?</div>'
    img_list = re.findall(ex, img_text, re.S)
    for src in img_list:
        src = "https:" + src
        img_name = src.split('/')[-1]
        img_path = dir_name + "/" + img_name
        # request the image URL to get the binary image data
        response = requests.get(src, headers=headers).content
        with open(img_path, "wb") as f:
            f.write(response)
print("成功")
2. bs4 parsing
Environment setup
pip install beautifulsoup4
How bs4 parsing works
Instantiate a BeautifulSoup object (soup) and load the page source to be parsed into it,
then call the BeautifulSoup object's properties and methods to locate tags and extract data
Get a tag by its name (only the first match is returned)
from bs4 import BeautifulSoup
html_string = """<div>
<h1 class="item">zz</h1>
<ul class="item">
<li>篮球</li>
<li>足球</li>
</ul>
<div id='x3'>
<span>5xclass.cn</span>
<a href="www.xxx.com" class='info'>pythonav.com</a>
</div>
</div>"""
soup = BeautifulSoup(html_string, features="html.parser")
tag = soup.find(name='a')
print(tag)        # the Tag object
print(tag.name)   # tag name: a
print(tag.text)   # tag text: pythonav.com
print(tag.attrs)  # tag attributes: {'href': 'www.xxx.com', 'class': ['info']}
Get a tag by its attributes (only the first match is returned)
from bs4 import BeautifulSoup
html_string = """<div>
<h1 class="item">zz</h1>
<ul class="item">
<li>篮球</li>
<li>足球</li>
</ul>
<div id='x3'>
<span>5xclass.cn</span>
<a href="www.xxx.com" class='info'>pythonav.com</a>
</div>
</div>"""
soup = BeautifulSoup(html_string, features="html.parser")
tag = soup.find(name='div', attrs={"id": "x3"})
print(tag)
Nested lookup: first find a tag, then search among its child tags
from bs4 import BeautifulSoup
html_string = """<div>
<h1 class="item">zz</h1>
<ul class="item">
<li>篮球</li>
<li>足球</li>
</ul>
<div id='x3'>
<span>5xclass.cn</span>
<a href="www.xxx.com" class='info'>pythonav.com</a>
<span class='xx1'>zz</span>
</div>
</div>"""
soup = BeautifulSoup(html_string, features="html.parser")
parent_tag = soup.find(name='div', attrs={"id": "x3"})
child_tag = parent_tag.find(name="span", attrs={"class": "xx1"})
print(child_tag)
Get all matching tags (multiple results)
html_string = """<div>
<h1 class="item">zz</h1>
<ul class="item">
<li>篮球</li>
<li>足球</li>
</ul>
<div id='x3'>
<span>5xclass.cn</span>
<a href="www.xxx.com" class='info'>pythonav.com</a>
<span class='xx1'>zz</span>
</div>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_string, features="html.parser")
tag_list = soup.find_all(name="li")  # searches all descendants recursively
# tag_list = soup.find_all(recursive=False)  # recursive=False limits the search to direct children
print(tag_list)
# output
# [<li>篮球</li>, <li>足球</li>]
html_string = """<div>
<h1 class="item">zz</h1>
<ul class="item">
<li>篮球</li>
<li>足球</li>
</ul>
<div id='x3'>
<span>5xclass.cn</span>
<a href="www.xxx.com" class='info'>pythonav.com</a>
<span class='xx1'>zz</span>
</div>
</div>"""
from bs4 import BeautifulSoup
soup = BeautifulSoup(html_string, features="html.parser")
tag_list = soup.find_all(name="li")  # find all matches
for tag in tag_list:
    print(tag.text)
# output
# 篮球
# 足球
1. Crawl brand data from yiche.com
from bs4 import BeautifulSoup
import requests
url = 'https://car.yiche.com/'
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.106 Safari/537.36"
}
res = requests.get(url, headers=headers)
soup = BeautifulSoup(res.text, features="html.parser")
tag_list = soup.find_all(name="div", attrs={"class": "item-brand"})
# for tag in tag_list:
#     print(tag.attrs["data-name"])
for tag in tag_list:
    child = tag.find(name='div', attrs={'class': 'brand-name'})
    print(child.text)
2. Crawl playlist data from NetEase Cloud Music
import requests
from bs4 import BeautifulSoup
res = requests.get(
    url="https://music.163.com/discover/playlist/?cat=%E5%8D%8E%E8%AF%AD",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
        "Referer": "https://music.163.com/"
    }
)
soup = BeautifulSoup(res.text, features="html.parser")
parent_tag = soup.find(name='ul', attrs={"id": "m-pl-container"})
for child in parent_tag.find_all(recursive=False):
    title = child.find(name="a", attrs={"class": "tit f-thide s-fc0"}).text
    image_url = child.find(name='img').attrs['src']
    print(title, image_url)
    # download every cover image
    img_res = requests.get(url=image_url)
    file_name = title.split()[0]
    with open(f"{file_name}.jpg", mode='wb') as f:
        f.write(img_res.content)
3. xpath parsing
Environment setup
pip install lxml
How xpath parsing works
Instantiate an etree object and load the page source into it,
then call the object's xpath method with different xpath expressions to locate tags and extract data
Instantiating an etree object
tree = etree.parse(fileName)   # from a local file
tree = etree.HTML(page_text)   # from a page-source string
The xpath method always returns a list (a small sketch exercising the rules below follows this list)
Locating tags
tree.xpath("")
A / at the very left of an xpath expression means the tag must be located starting from the root node
A // at the very left means the tag can be located from any position in the document
A // that is not at the very left means "any number of levels"
A / that is not at the very left means "exactly one level"
Locating by attribute: //div[@class='ddd']
Locating by index: //div[@class='ddd']/li[3]   # indexing starts at 1
Locating by index: //div[@class='ddd']//li[2]  # indexing starts at 1
Extracting data
Getting text:
tree.xpath("//p[1]/text()"): gets only the text directly inside the tag
tree.xpath("//div[@class='ddd']/li[2]//text()"): gets all text, including that of descendants
Getting an attribute:
tree.xpath('//a[@id="feng"]/@href')
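To make the rules above concrete, here is a minimal self-contained sketch (not from the original notes; the HTML string, class names, and URLs are invented for illustration):
from lxml import etree

page_text = """
<html>
  <body>
    <div class="song">
      <p>first</p>
      <p>second</p>
      <a id="feng" href="https://www.example.com" title="demo">link text</a>
    </div>
    <div class="tang">
      <ul>
        <li><a href="https://a.example.com">tang-1</a></li>
        <li><a href="https://b.example.com">tang-2</a></li>
      </ul>
    </div>
  </body>
</html>"""

tree = etree.HTML(page_text)
print(tree.xpath('/html/body/div[1]/p[1]/text()'))        # leading /: start from the root node -> ['first']
print(tree.xpath('//div[@class="song"]/p[2]/text()'))     # attribute + index locating (index starts at 1) -> ['second']
print(tree.xpath('//div[@class="tang"]//li[2]//text()'))  # non-leading //: any number of levels -> ['tang-2']
print(tree.xpath('//a[@id="feng"]/@href'))                # take an attribute -> ['https://www.example.com']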
1. Crawl Qiushibaike text posts
Crawl the author and the post content. Note that an author can be either anonymous or a registered user.
from lxml import etree
import requests
url = "https://www.qiushibaike.com/text/page/4/"
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
page_text = requests.get(url, headers=headers).text
tree = etree.HTML(page_text)
div_list = tree.xpath('//div[@class="col1 old-style-col1"]/div')
print(div_list)
for div in div_list:
    # an author is either anonymous or registered, so two xpath branches are joined with |
    author = div.xpath('.//div[@class="author clearfix"]//h2/text() | .//div[@class="author clearfix"]/span[2]/h2/text()')[0]
    content = div.xpath('.//div[@class="content"]/span//text()')
    content = ''.join(content)
    print(author, content)
2. Crawl images from a wallpaper site (pic.netbian.com)
from lxml import etree
import requests
import os
dir_name = "./img2"
if not os.path.exists(dir_name):
    os.mkdir(dir_name)
headers = {
    'User-Agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36",
}
for i in range(1, 6):
    if i == 1:
        url = "http://pic.netbian.com/4kmeinv/"
    else:
        url = f"http://pic.netbian.com/4kmeinv/index_{i}.html"
    page_text = requests.get(url, headers=headers).text
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//*[@id="main"]/div[3]/ul/li')
    for li in li_list:
        img_src = "http://pic.netbian.com/" + li.xpath('./a/img/@src')[0]
        img_name = li.xpath('./a/b/text()')[0]
        # fix garbled Chinese filenames: re-encode the mis-decoded iso-8859-1 bytes and decode them as gbk
        img_name = img_name.encode('iso-8859-1').decode('gbk')
        response = requests.get(img_src, headers=headers).content
        img_path = dir_name + "/" + f"{img_name}.jpg"
        with open(img_path, "wb") as f:
            f.write(response)
    print(f"第{i}页成功")
Love technology, enjoy life, and thanks for recommending this post!