Web Scraping: IP Proxy Pools and the BeautifulSoup Module

IP proxy pools: concept and usage

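1. Many websites include IP banning among their anti-scraping measures
	Once my site notices that a single IP has made too many requests within a fixed period (say, 30 requests in one minute), it records the host IP address behind those requests and adds it to the site's blacklist.
    Whenever a new request arrives, the site first checks whether the requesting IP is on the blacklist; if it is, the request is rejected outright.
    Only if it is not does the request move on to the next stage.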
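
To make the server-side check concrete, here is a minimal sketch of that blacklist logic. The names `allow_request`, `blacklist`, and `recent_requests` are made up for illustration, and the 30-requests-per-minute threshold is just the example figure from the text; a real site would likely keep this state in a shared store rather than in process memory.

```python
import time
from collections import defaultdict, deque

# Assumed thresholds, taken from the example in the text: 30 requests per 60 seconds.
MAX_REQUESTS = 30
WINDOW_SECONDS = 60

blacklist = set()                      # IPs that have been banned
recent_requests = defaultdict(deque)   # ip -> timestamps of its recent requests


def allow_request(ip, now=None):
    """Return True if the request may proceed, False if it is rejected."""
    now = time.time() if now is None else now

    # Step 1: reject immediately if the IP is already on the blacklist.
    if ip in blacklist:
        return False

    # Step 2: forget timestamps that have fallen out of the time window.
    timestamps = recent_requests[ip]
    while timestamps and now - timestamps[0] >= WINDOW_SECONDS:
        timestamps.popleft()

    # Step 3: record this request and ban the IP once it reaches the limit.
    timestamps.append(now)
    if len(timestamps) >= MAX_REQUESTS:
        blacklist.add(ip)
        return False
    return True
```

Passing `now` explicitly is only there to make the function easy to try out without waiting for real time to pass.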
    
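IP proxy pools emerged in response to this kind of IP ban
	A proxy pool holds many IP addresses. Each time you request someone else's site,
    you pick one at random from the pool to disguise where the request comes from.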
    
Usage
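# Proxy addresses are available from both free and paid providers
import requests

# The dict is keyed by URL scheme, so each scheme can appear only once;
# repeating the 'https' key just overwrites the earlier entries
proxies = {
    'http': 'http://123.163.117.55:9999',
    'https': 'http://123.163.117.55:9999',
}
response = requests.get('https://www.12306.cn',
                        proxies=proxies)

print(response.status_code)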
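
The snippet above always goes through the same proxy. Picking one at random from a pool, as described earlier, might look roughly like the sketch below. `PROXY_POOL` and `pick_proxies` are hypothetical names, and apart from the address already used above the entries are placeholder addresses from a documentation range, not working proxies.

```python
import random

# Placeholder pool for illustration only; replace with proxies you actually have.
PROXY_POOL = [
    "123.163.117.55:9999",
    "192.0.2.10:8080",
    "192.0.2.11:8080",
]


def pick_proxies(pool):
    """Pick one proxy at random and build the dict that requests expects."""
    address = random.choice(pool)
    proxy_url = f"http://{address}"
    # Route both http and https traffic through the same proxy.
    return {"http": proxy_url, "https": proxy_url}


proxies = pick_proxies(PROXY_POOL)
print(proxies)
```

The result can then be passed along in the usual way, for example `requests.get(url, proxies=proxies, timeout=10)`. Free proxies fail often, so in practice you would probably wrap the request in a retry that picks a different proxy on error.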

The Beautiful Soup module

Beautiful Soup can save you hours or even days of work

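# Install Beautiful Soup
pip install beautifulsoup4  # do not leave off the 4

# Parsers
	Four are supported; these two are the most common
    html.parser  built in, nothing to install
    lxml         must be installed separately
    	pip3 install lxml
 
# Import
from bs4 import BeautifulSoup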
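
Since lxml is an optional install, one possible way to handle the parser choice is to fall back to the built-in parser when lxml is missing. This is only a sketch; `make_soup` is a made-up helper name, not part of bs4.

```python
from bs4 import BeautifulSoup


def make_soup(markup):
    """Parse with lxml when it is installed, otherwise use the built-in parser."""
    try:
        import lxml  # noqa: F401  (only checking that it can be imported)
        parser = "lxml"
    except ImportError:
        parser = "html.parser"
    return BeautifulSoup(markup, parser)


soup = make_soup("<p>hello</p>")
print(soup.p.text)
```

Keep in mind that the two parsers can build slightly different trees from broken HTML, so results may differ between machines if you rely on a fallback like this.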

Basic usage

from bs4 import BeautifulSoup
html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

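# Pass the HTML content to BeautifulSoup to build a soup object
soup = BeautifulSoup(html_doc,'lxml')  # tolerant of malformed HTML, such as the unclosed tags above

res = soup.prettify()  # pretty-print with proper indentation so the structure is visible
print(res)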

Common operations

html_doc = """
<html><head><title>The Dormouse's story</title></head>
<body>
<p id="my p" class="title jason" username="jason">123<b id="bbb" class="boldest">The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>;
and they lived at the bottom of a well.</p>
<p class="story">...</p>
"""

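from bs4 import BeautifulSoup
soup = BeautifulSoup(html_doc,'lxml')

print(soup.a)  # find an a tag; only the first one is returned

print(soup.p.name)  # get the tag name

print(soup.p.attrs)  # all attributes of the tag, as a dict
# {'id': 'my p', 'class': ['title', 'jason'], 'username': 'jason'}

print(soup.p.text)  # all text inside the p tag, including text in nested tags

# string is rarely used
print(soup.p.string)  # only works when the tag has a single child; this p holds "123" and <b>, so it prints None

# Nested selection
soup.head.title.string  # drill down one level at a time
soup.body.a.string

# Children and descendants
soup.p.contents  # list of all direct children of p
soup.p.children  # iterator over the same direct children
for child in soup.p.children:
    print(child)
soup.p.descendants  # generator over all descendants, not just direct children

# Parent and ancestors
soup.a.parent  # the parent of the a tag
soup.a.parents  # generator over all ancestors: parent, grandparent, and so on
for p in soup.a.parents:
    print(p)

# Siblings (text nodes count as siblings too)
soup.a.next_sibling  # the next sibling
for i in soup.a.next_siblings:  # every following sibling
    print(i)
soup.a.previous_sibling  # the previous sibling
list(soup.a.next_siblings)  # following siblings; next_siblings is a generator
soup.a.previous_siblings  # preceding siblings, also a generator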
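
The difference between `.string` and `.text` is easy to get wrong, so here is a small self-contained check. It uses a cut-down version of the first p tag above and the built-in `html.parser`, so it runs without lxml; the variable names are just for illustration.

```python
from bs4 import BeautifulSoup

snippet = '<p class="title jason">123<b>The Dormouse\'s story</b></p>'
soup = BeautifulSoup(snippet, "html.parser")

p = soup.p
print(p["class"])    # class is multi-valued, so it comes back as a list
print(p.string)      # None: p has two children (a text node and <b>)
print(p.b.string)    # <b> has exactly one child, so its text is returned
print(p.get_text())  # all text under p, concatenated (same as p.text)
```

In short, `.string` is only useful when a tag wraps a single piece of text, which is why `.text` is the safer default.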

Filters

from bs4 import BeautifulSoup

soup = BeautifulSoup(html_doc,'lxml')
# Five kinds of filter: string, regular expression, list, True, function
# 1. String: a tag name. The result is a list whose elements are the actual tag objects
print(soup.find_all('b'))  #[<b class="boldest" id="bbb">The Dormouse's story</b>]

# 2. Regular expression
import re   # always check what data type the result actually is
print(soup.find_all(re.compile('^b')))  # tags whose name starts with b: body and b

# 3. List: Beautiful Soup returns anything that matches any item in the list
print(soup.find_all(['a','b']))  # every <a> and <b> tag in the document

# 4. True: matches every tag, but never returns text nodes
print(soup.find_all(True))
for tag in soup.find_all(True):
    print(tag.name)

# 5. Function: if no other filter fits, define a function that takes a single tag argument
# and returns True when the tag matches, False otherwise
def has_class_but_no_id(tag):
    return tag.has_attr('class') and not tag.has_attr('id')
print(soup.find_all(has_class_but_no_id))
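
To see the five filters side by side, the sketch below runs them against a much smaller document and prints only the tag names. The markup and the `names` helper are invented for this example, and `html.parser` is used so that no extra html or head tags are added to the tree.

```python
import re
from bs4 import BeautifulSoup

html = """
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story"><a class="sister" id="link1" href="http://example.com/elsie">Elsie</a></p>
</body>
"""
soup = BeautifulSoup(html, "html.parser")


def names(tags):
    """Reduce a list of tags to their names so the output stays short."""
    return [tag.name for tag in tags]


def has_class_but_no_id(tag):
    return tag.has_attr("class") and not tag.has_attr("id")


print(names(soup.find_all("b")))                  # string       -> ['b']
print(names(soup.find_all(re.compile("^b"))))     # regex        -> ['body', 'b']
print(names(soup.find_all(["a", "b"])))           # list         -> ['b', 'a']
print(names(soup.find_all(True)))                 # True         -> every tag
print(names(soup.find_all(has_class_but_no_id)))  # function     -> ['p', 'p']
```

Results always come back in document order, which is why the list filter returns b before a here.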

Summary

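1. Finding tags is straightforward
	find()        returns the first match, or None if there is none
    find_all()    returns a list of all matches
"""
Commonly used arguments
	name		find tags by tag name
	id			find tags by id
	class_      find tags by class (trailing underscore because class is a Python keyword)
"""

2. Getting the text inside a tag
	tag.text
    
3. Getting the value of a tag attribute
	the href attribute of an a tag
    	a.get('href')
    the src attribute of an img tag
		img.get('src')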
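
Putting the three points of the summary together, a short sketch could look like this. The markup is trimmed from the example document, the variable names are arbitrary, and `html.parser` is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

html = """
<p class="story">
<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a>
</p>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find(name="a")               # by tag name: first match only
by_id = soup.find(id="link2")             # by id
sisters = soup.find_all(class_="sister")  # by class: list of all matches

print(first.text)                             # text inside the tag
print(by_id.get("href"))                      # value of an attribute
print([a.get("href") for a in sisters])
print(first.get("src"))                       # missing attribute gives None
```

Using `.get()` rather than `tag['href']` avoids a KeyError when a tag lacks the attribute.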

Chinese documentation

https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html#find-parents-find-parent

posted @ 2020-09-19 02:50 最冷不过冬夜
