Web scraping with requests: getting the text and attributes of a tag on a page

1 What beautifulsoup4 does: it parses HTML and XML, and it can modify HTML and XML
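
As a minimal sketch of the "modify" side, the snippet below parses a small HTML string, then changes a tag's text and attributes. The HTML string and the variable names are made up for illustration, and it uses the built-in html.parser so it runs without lxml.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment, used only for illustration
html = '<div id="box"><a href="/old" class="link">old text</a></div>'
soup = BeautifulSoup(html, "html.parser")

a = soup.find("a")
a.string = "new text"   # replace the text inside the tag
a["href"] = "/new"      # change an existing attribute
del a["class"]          # remove an attribute

# Serialize the modified tree back to a string
result = str(soup)
print(result)
```

Calling str() on the soup (or on any tag) gives back the modified markup, which can then be written to a file.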

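Finding tags and their content in HTML with the re module is awkward, so we use bs4 instead.
Note: html.parser is built into Python, so it needs no third-party install.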
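
To illustrate why re is awkward here, the following sketch runs a naive regex and bs4 over the same hypothetical fragment. The regex pattern and the sample HTML are assumptions chosen for the demonstration, not something from the page being scraped.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical fragment: the second <h3> carries an attribute and a line break
html = """
<ul>
  <li class="item"><h3>First <em>big</em> title</h3></li>
  <li class="item"><h3
      data-x="1">Second title</h3></li>
</ul>
"""

# The naive pattern only matches a bare <h3>, so it misses the second
# heading and keeps the inner <em> markup of the first one.
regex_titles = re.findall(r"<h3>(.*?)</h3>", html)

# bs4 builds a tree, so attributes, line breaks and nested tags are handled.
soup = BeautifulSoup(html, "html.parser")
bs4_titles = [h3.text for h3 in soup.find_all("h3")]

print(regex_titles)
print(bs4_titles)
```

The regex could be patched to cope with this input, but each new variation in the markup needs another patch, while the bs4 call stays the same.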

Many programs store their configuration in XML format.

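Combined with the lxml parser (pip3 install lxml), bs4 parses the content of res.text so that you can get the text, attributes, and so on of a tag in it (selecting the tag by tag name, class name, id, etc.).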
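
The three ways of selecting a tag, and reading its text and attributes, can be sketched as follows. The HTML is a made-up fragment, and html.parser is used so the example runs without lxml; passing 'lxml' instead should behave the same for this input.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment, used only for illustration
html = """
<div id="main" class="wrapper">
  <p class="intro">Hello</p>
  <a href="//example.com/a" title="A link">Go</a>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

by_name = soup.find(name="a")          # select by tag name
by_class = soup.find(class_="intro")   # select by class (note the underscore)
by_id = soup.find(id="main")           # select by id

link_text = by_name.text               # text inside the tag
link_href = by_name["href"]            # raises KeyError if the attribute is missing
link_title = by_name.get("title")      # returns None if the attribute is missing
missing = by_name.get("target")        # attribute not present here
all_attrs = by_name.attrs              # dict of all attributes
intro_text = by_class.text
div_classes = by_id["class"]           # class is multi-valued, so this is a list

print(link_text, link_href, link_title, missing, all_attrs, intro_text, div_classes)
```

find returns the first match or None; find_all returns a list of every match.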

Reference code

import requests
# pip3 install beautifulsoup4  (parses and modifies HTML and XML)
from bs4 import BeautifulSoup


res=requests.get('https://www.autohome.com.cn/news/1/#liststart')
# print(res.text)
# The second argument selects which parser to use
# html.parser is built in, so no third-party module needs to be installed
# soup=BeautifulSoup(content_to_parse,'parser')
# soup=BeautifulSoup(res.text,'html.parser')
# pip3 install lxml
soup=BeautifulSoup(res.text,'lxml')


# Find the div whose class is article-wrapper
# div=soup.find(class_='article-wrapper')
# div=soup.find(id='auto-channel-lazyload-article')
# print(div)
ul=soup.find(class_='article')
# print(ul)
# Then find all li tags under the ul
li_list=ul.find_all(name='li')
# print(len(li_list))
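for li in li_list:
    # Look inside each li
    title=li.find(name='h3')
    # Some li tags have no h3, so skip those
    if title:
        title=title.text
        # url=li.find('a')['href']
        # The href and src values are assumed to be protocol-relative (//...), so prepend the scheme
        url='https:'+li.find('a').attrs.get('href')
        desc=li.find('p').text
        img='https:'+li.find(name='img').get('src')
        print('''
        News title: %s
        News URL: %s
        News summary: %s
        News image: %s
        
        '''%(title,url,desc,img))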
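
The loop above stops with an error if the page layout changes and find returns None. A more defensive variant might look like the sketch below. It runs offline against SAMPLE_HTML, a hypothetical fragment shaped like the list the reference code expects; parse_news and BASE_URL are names introduced here for illustration, and urljoin replaces the manual 'https:' prefix.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Assumed base URL, used to resolve protocol-relative links such as //host/path
BASE_URL = "https://www.autohome.com.cn/news/1/"

# Hypothetical markup, not copied from the real page
SAMPLE_HTML = """
<ul class="article">
  <li>
    <a href="//www.example.com/news/1.html">
      <div><img src="//img.example.com/1.jpg"></div>
      <h3>Sample title</h3>
      <p>Sample summary</p>
    </a>
  </li>
  <li class="ad"></li>
  <li><a href="//www.example.com/news/2.html"><h3>No image</h3></a></li>
</ul>
"""

def parse_news(html, base_url=BASE_URL):
    """Return a list of dicts, one per li that contains an h3 title."""
    soup = BeautifulSoup(html, "html.parser")
    ul = soup.find(class_="article")
    if ul is None:          # layout changed or the request was blocked
        return []
    items = []
    for li in ul.find_all(name="li"):
        title = li.find(name="h3")
        if title is None:   # skip li tags without a title
            continue
        a = li.find("a")
        p = li.find("p")
        img = li.find(name="img")
        items.append({
            "title": title.text.strip(),
            "url": urljoin(base_url, a.get("href")) if a and a.get("href") else None,
            "desc": p.text.strip() if p else "",
            "img": urljoin(base_url, img.get("src")) if img and img.get("src") else None,
        })
    return items

news = parse_news(SAMPLE_HTML)
for item in news:
    print(item)
```

To use it on the live page, pass res.text from the requests call instead of SAMPLE_HTML. Returning dicts rather than printing keeps the parsing separate from whatever is done with the results afterwards.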

posted @ 2022-09-21 16:21 tslam
