Parsing with bs4

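'''Data parsing with bs4
How data parsing works in general:
- 1. Locate the target tag
- 2. Extract the data stored in the tag's text or in its attributes
How bs4 parses data:
- 1. Instantiate a BeautifulSoup object and load the page source into it
- 2. Call the BeautifulSoup object's attributes and methods to locate tags and extract data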
- Installing the environment:
    pip install bs4
    pip install lxml
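- Instantiating a BeautifulSoup object:
    from bs4 import BeautifulSoup
    Instantiating the object
    Loading the contents of a local HTML document into the object:
    fp = open('D:/网页html/index.html','r',encoding='utf-8')
    soup = BeautifulSoup(fp,'lxml')  # lxml, the general-purpose parser
    soup = BeautifulSoup(fp,'html.parser')  # alternative: Python's built-in HTML parser
    Loading page source fetched from the internet into the object:
    page_text = response.text
    soup = BeautifulSoup(page_text,'lxml')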
- Methods and attributes provided for parsing data
    soup.tagname returns the first tagname tag that appears in the HTML
    soup.find('tagname') is equivalent to soup.tagname
    Locating by attribute
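        soup.find('div',class_='ddc'); id= and other attributes work the same way, and find() also accepts string= to match by text. class is a Python keyword, hence the trailing underscore. Alternatively: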
        soup.find('div',attrs={'class':'con'})
    soup.find_all('div') returns every matching tag (as a list)
- select
    - select('a CSS selector (id, class, tag, ...)') returns a list
    - soup.select('.con>ul>li>a')[0]: > steps down exactly one level (direct child)
    - soup.select('.con>ul a')[0]: a space spans any number of levels (descendant)
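- Getting the text inside a tag
    - soup.a.text/string/get_text()
    - text/get_text() return all of the text inside a tag, including text in nested tags
    - string returns only the text directly inside the tag; it is None when the tag has more than one child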
        -soup.find('div',class_='con').
- Getting a tag's attribute values:
    -soup.a['href']
    -soup.select('.con>ul a')[0]['href']
'''
from bs4 import BeautifulSoup
if __name__=="__main__":
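    # Load the contents of a local HTML document into the object;
    # the with block closes the file once parsing is done
    with open('D:/网页html/mine.html','r',encoding='utf-8') as fp:
        soup=BeautifulSoup(fp,'lxml')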
    #print(soup.find('div'))
    #print(soup.find('div',class_='ddc'))
    #print(soup.find_all('div'))
    print(soup.select('.con>ul a')[0]['href'])
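
The lookups listed in the notes can be tried without any local file. The following is a minimal sketch, not the author's code: the HTML snippet and every variable name are made up for illustration, and `html.parser` is used only so the example does not depend on lxml (passing `'lxml'` should behave the same way for this markup).

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, invented for this example only.
HTML = """
<div class="con" id="main">
  <p>Intro <b>bold</b> text</p>
  <ul>
    <li><a href="/chapter/1.html">Chapter 1</a></li>
    <li><span><a href="/chapter/2.html">Chapter 2</a></span></li>
  </ul>
</div>
<div class="other">Second div</div>
"""

# html.parser ships with Python, so no extra parser is required here.
soup = BeautifulSoup(HTML, "html.parser")

# Tag lookup: soup.div and soup.find('div') both return the first <div>.
first_div = soup.find("div")
same_div = soup.div

# Attribute lookup: class_ keyword or an attrs dictionary.
con_div = soup.find("div", class_="con")
con_div_attrs = soup.find("div", attrs={"class": "con"})

# find_all returns every match.
all_divs = soup.find_all("div")

# CSS selectors: '>' matches direct children only, a space matches any depth.
direct_links = soup.select(".con > ul > li > a")  # skips the <a> wrapped in <span>
nested_links = soup.select(".con > ul a")         # finds both <a> tags

# Text access: get_text()/text collect nested text, string needs a single child.
p_tag = soup.find("p")
p_text = p_tag.get_text()
p_string = p_tag.string      # <p> has three children, so this is None
a_string = soup.a.string     # <a> has a single text child

# Attribute values are read with dictionary-style indexing.
first_href = nested_links[0]["href"]

print(p_text, p_string, a_string, first_href)
```

The second `<a>` is wrapped in a `<span>` on purpose, so the two selectors return different numbers of results.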

import requests

from bs4 import BeautifulSoup
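# Scrape every chapter title and chapter text of Romance of the Three Kingdoms
if __name__=="__main__":
    # Fetch the table-of-contents page
    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36 Edg/99.0.1150.39'}
    url='https://www.shicimingju.com/book/sanguoyanyi.html'
    # .content is the raw response bytes, so BeautifulSoup detects the page
    # encoding itself. It matches .text.encode('ISO-8859-1') whenever requests
    # falls back to ISO-8859-1, and avoids a UnicodeEncodeError when it does not.
    page_text=requests.get(url=url,headers=headers).content
    # The chapter titles and detail-page URLs are parsed out of this page
    # 1. Instantiate a BeautifulSoup object and load the page source into it
    soup = BeautifulSoup(page_text,'lxml')
    # 2. Parse the chapter titles and detail-page URLs
    li_list=soup.select('.book-mulu > ul > li')
    print(li_list)
    # The with block closes the output file even if a request fails
    with open('./sanguo.txt','w',encoding='utf-8') as fp:
        for li in li_list:
            title=li.a.string
            detail_url='https://www.shicimingju.com'+li.a['href']
            # Request the detail page and parse out the chapter text
            detail_text=requests.get(url=detail_url,headers=headers).content
            soup1 = BeautifulSoup(detail_text, 'lxml')
            div_tag=soup1.find('div',class_='chapter_content')
            content=div_tag.text  # the chapter text
            fp.write(title+':'+content+'\n')
            print(title,'scraped successfully')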
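
The parsing half of the script can be separated from the network requests and checked offline. The sketch below is only an illustration under stated assumptions: `parse_chapter_links` and `parse_chapter_text` are hypothetical helper names, the sample HTML is invented to mirror the `.book-mulu` and `chapter_content` selectors used above, and `html.parser` stands in for lxml. The last lines show why the parser has to be given bytes: decoding with ISO-8859-1 and encoding back recovers the original bytes exactly.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

BASE_URL = "https://www.shicimingju.com"


def parse_chapter_links(index_html, base_url=BASE_URL):
    """Return (title, absolute_url) pairs from a table-of-contents page."""
    soup = BeautifulSoup(index_html, "html.parser")
    chapters = []
    for li in soup.select(".book-mulu > ul > li"):
        a_tag = li.a
        # Skip list items that carry no usable link.
        if a_tag is None or not a_tag.get("href"):
            continue
        title = a_tag.get_text(strip=True)
        chapters.append((title, urljoin(base_url, a_tag["href"])))
    return chapters


def parse_chapter_text(detail_html):
    """Return the chapter text, or an empty string if the div is missing."""
    soup = BeautifulSoup(detail_html, "html.parser")
    div_tag = soup.find("div", class_="chapter_content")
    return div_tag.get_text() if div_tag is not None else ""


# Hypothetical pages, invented for this example only.
SAMPLE_INDEX = """
<div class="book-mulu"><ul>
  <li><a href="/book/sanguoyanyi/1.html">Chapter One</a></li>
  <li>no link here</li>
  <li><a href="/book/sanguoyanyi/2.html">Chapter Two</a></li>
</ul></div>
"""
SAMPLE_DETAIL = '<div class="chapter_content"><p>First paragraph.</p></div>'

links = parse_chapter_links(SAMPLE_INDEX)
text = parse_chapter_text(SAMPLE_DETAIL)
missing = parse_chapter_text("<p>nothing to see</p>")

# A Latin-1 decode followed by a Latin-1 encode is lossless, so the
# original UTF-8 bytes come back and the parser can detect the encoding.
raw_bytes = "三国演义".encode("utf-8")
round_trip = raw_bytes.decode("ISO-8859-1").encode("ISO-8859-1")

print(links, text, round_trip == raw_bytes)
```

Using `urljoin` instead of string concatenation also handles the case where an `href` is already absolute.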

posted @ 2022-03-14 20:33 wzc6
