Parsing data with bs4:

- How data parsing works:

1. Locate the tag

2. Extract the data stored in the tag or in its attributes

- How bs4 data parsing works:

1. Instantiate a BeautifulSoup object and load the page source into it

2. Call the object's attributes and methods to locate tags and extract data
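The two steps above can be sketched against a tiny inline snippet. The HTML string here is made up purely for illustration, and the stdlib-backed `html.parser` is used so the sketch runs without lxml:

```python
from bs4 import BeautifulSoup

# A made-up page snippet, standing in for real page source
html = '<html><body><div class="song">Li Bai</div></body></html>'

# Step 1: instantiate a BeautifulSoup object and load the source into it
soup = BeautifulSoup(html, 'html.parser')

# Step 2: call the object's methods to locate the tag and extract its data
div = soup.find('div', class_='song')
print(div.text)
```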

- Environment setup:

1、pip install bs4

2、pip install lxml

- How to instantiate a BeautifulSoup object:

1、from bs4 import BeautifulSoup

2. Instantiation:

1. Load a local HTML file into the object:

fp = open('./dog.html', 'r', encoding='utf-8')

soup = BeautifulSoup(fp, 'lxml')

2. Load page source fetched from the internet into the object:

page_text = response.text

soup = BeautifulSoup(page_text, 'lxml')

3. Methods and attributes provided for data parsing:

1. soup.tagName: returns the first tag named tagName in the document
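A quick sketch of this behavior (the snippet HTML is hypothetical):

```python
from bs4 import BeautifulSoup

html = '<div>first</div><div>second</div>'
soup = BeautifulSoup(html, 'html.parser')

# soup.div returns only the first <div> in document order
print(soup.div.text)
```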

2. soup.find():

- find('tagName'): equivalent to soup.tagName (e.g. find('div') is the same as soup.div)

- Locating by attribute:

- soup.find('div', class_='song')
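A minimal sketch of both find() forms, on a made-up snippet. Note the trailing underscore in `class_`, which keeps the keyword argument from clashing with Python's `class` keyword:

```python
from bs4 import BeautifulSoup

html = '<div class="tang">A</div><div class="song">B</div>'
soup = BeautifulSoup(html, 'html.parser')

# find('tagName') returns the same first tag as soup.tagName
first = soup.find('div')

# Locating by attribute: class_ avoids the reserved word 'class'
tag = soup.find('div', class_='song')
print(first.text, tag.text)
```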

3. soup.find_all('tagName'): returns all matching tags (as a list)
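In contrast to find(), find_all() collects every match (example snippet is made up):

```python
from bs4 import BeautifulSoup

html = '<ul><li>a</li><li>b</li><li>c</li></ul>'
soup = BeautifulSoup(html, 'html.parser')

# find_all returns a list of every matching tag, not just the first
items = soup.find_all('li')
print([li.text for li in items])
```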

4. select:

- select('some selector (id, class, or tag selector)') returns a list

- Hierarchy selectors:

- soup.select('.tang > ul > li > a'): > denotes one level of nesting

- soup.select('.tang > ul a'): a space denotes any number of levels
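The two hierarchy forms can be compared side by side. The snippet below mirrors the `.tang > ul > li > a` structure from the notes with invented content:

```python
from bs4 import BeautifulSoup

html = ('<div class="tang"><ul>'
        '<li><a href="#1">one</a></li>'
        '<li><a href="#2">two</a></li>'
        '</ul></div>')
soup = BeautifulSoup(html, 'html.parser')

# '>' requires a direct parent-child step at every arrow
direct = soup.select('.tang > ul > li > a')
# a space matches descendants at any depth
nested = soup.select('.tang a')
print(len(direct), len(nested))
```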

5. Getting the text between tags:

- soup.a.text / soup.a.string / soup.a.get_text()

- text / get_text(): returns all the text content inside a tag, including nested tags

- string: returns only the text that is a direct child of the tag
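The difference shows up as soon as a tag has nested children (snippet is hypothetical). With mixed children, `.string` comes back as None, while `.text` still concatenates everything:

```python
from bs4 import BeautifulSoup

html = '<div>outer <span>inner</span></div>'
soup = BeautifulSoup(html, 'html.parser')

div = soup.div
print(div.text)          # all text beneath the tag, nested included
print(div.string)        # None: the div has more than one child
print(soup.span.string)  # the span's single direct text child
```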

6. Getting a tag's attribute value:

- soup.a['href']
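Attribute access in brief (example URL is made up). Subscripting raises KeyError for a missing attribute, while `.get()` returns None:

```python
from bs4 import BeautifulSoup

html = '<a href="https://example.com">site</a>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.a['href'])        # subscript access to an attribute value
print(soup.a.get('title'))   # .get() returns None instead of raising
```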

 

Example 1: scraping the chapter list of one work by an author on Jinjiang Literature City (晋江文学城)

from bs4 import BeautifulSoup
import re

if __name__ == "__main__":
    # Load the local HTML file into the object
    fp = open('./text.html', 'r', encoding='utf-8')
    soup = BeautifulSoup(fp, 'lxml')
    fp.close()
    # print(soup.a)  # soup.tagName returns the first <a></a> tag
    # print(soup.find('tr', itemprop='chapter'))
    # print(soup.find_all('tr', itemprop='chapter'))
    # print(soup.select('#oneboolt > tbody > tr:nth-child(5)'))
    # print(type(soup.select('table tr')[5]))

    chapterlist = soup.find_all('tr', attrs={'itemprop': 'chapter', 'itemscope': '', 'itemtype': 'http://schema.org/Chapter'})
    rows = [tr.text for tr in chapterlist]  # don't shadow the built-in name 'list'
    reObj = re.compile(r"\s+")  # split on runs of whitespace (newlines, tabs, spaces)
    for text in rows:
        # removing items while iterating over the same list skips elements,
        # so filter out the empty fields with a comprehension instead
        txtlist = [txt for txt in reObj.split(text) if txt != '']
        print(txtlist)

 

Example 2: scraping the novel Romance of the Three Kingdoms (三国演义)

1. Using the select selector

# https://www.shicimingju.com/book/sanguoyanyi.html

import requests
from bs4 import BeautifulSoup

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36',
        'Cookie': 'Hm_lvt_649f268280b553df1f778477ee743752=1613016932; key_kw=; key_cate=zuozhe; Hm_lpvt_649f268280b553df1f778477ee743752=1613016981'
    }
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    page_txt = response.text

    # Parse the chapter titles and detail-page urls
    soup = BeautifulSoup(page_txt, 'lxml')
    li_list = soup.select('#main_left > div > div.book-mulu > ul > li')

    fp = open('./三国演义.txt', 'w', encoding='utf-8')
    for li in li_list:
        title = li.a.string
        title_url = 'https://www.shicimingju.com' + li.a['href']

        # Fetch the chapter's html
        detail_response = requests.get(url=title_url, headers=headers)
        detail_response.encoding = 'utf-8'
        detail_html = detail_response.text

        # Parse the chapter's html
        title_soup = BeautifulSoup(detail_html, 'lxml')
        content = title_soup.select('#main_left > div.card.bookmark-list > div')[0].text  # select returns a list
        fp.write('\n' + title + content + '\n')
        print(title + '   scraped successfully!')

    fp.close()

2. Combining the find and select selectors

# https://www.shicimingju.com/book/sanguoyanyi.html

import requests
from bs4 import BeautifulSoup

if __name__ == "__main__":
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36',
        'Cookie': 'Hm_lvt_649f268280b553df1f778477ee743752=1613016932; key_kw=; key_cate=zuozhe; Hm_lpvt_649f268280b553df1f778477ee743752=1613016981'
    }
    url = 'https://www.shicimingju.com/book/sanguoyanyi.html'
    response = requests.get(url=url, headers=headers)
    response.encoding = 'utf-8'
    page_txt = response.text

    # Parse the chapter titles and detail-page urls
    soup = BeautifulSoup(page_txt, 'lxml')
    li_list = soup.select('#main_left > div > div.book-mulu > ul > li')

    fp = open('./三国演义.txt', 'w', encoding='utf-8')
    for li in li_list:
        title = li.a.string
        title_url = 'https://www.shicimingju.com' + li.a['href']

        # Fetch the chapter's html
        detail_response = requests.get(url=title_url, headers=headers)
        detail_response.encoding = 'utf-8'
        detail_html = detail_response.text

        # Parse the chapter's html
        title_soup = BeautifulSoup(detail_html, 'lxml')
        detail_tag = title_soup.find('div', attrs={'class': 'chapter_content'})  # find returns a single bs4.element.Tag
        content = detail_tag.text
        fp.write('\n' + title + content + '\n')
        print(title + '   scraped successfully!')

    fp.close()