Getting Started with BeautifulSoup

Reference: https://blog.csdn.net/weixin_30788239/article/details/95076026

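Problem: a recent request made with requests returned a large amount of HTML, but only one fragment of it was actually needed. BeautifulSoup (soup) can pull out just that fragment.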

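import json
import requests
from requests.auth import HTTPBasicAuth
from urllib import parse
from bs4 import BeautifulSoup  # the HTML parser used below

    # Body of a method on a class that holds self.session and self.header
    # (the enclosing def is not shown here).
    result = ''
    try:
        # With POST the parameters go in data; with GET they would have to be put in the URL instead.
        res = self.session.post(url, data=req_data, headers=self.header)
        # print(res.text)
        # Parse the returned HTML and keep only the <textarea> elements.
        soup = BeautifulSoup(res.text, 'html.parser')
        result = soup.find_all('textarea')
        print(result)
    except Exception as exc:
        print("error:", exc)

    return result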
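
The fragment above depends on a live session, so here is a minimal self-contained sketch of the same idea. The HTML string and the helper name extract_textareas are made up for illustration; in real use the input would be res.text.

```python
from bs4 import BeautifulSoup

# Hypothetical response body; a real one would come from res.text.
SAMPLE_HTML = """
<html><body>
  <div id="header">navigation we do not care about</div>
  <textarea name="output">first result</textarea>
  <textarea name="log">second result</textarea>
</body></html>
"""


def extract_textareas(html):
    """Return the text of every <textarea> element found in html."""
    soup = BeautifulSoup(html, 'html.parser')
    return [tag.get_text() for tag in soup.find_all('textarea')]


textarea_texts = extract_textareas(SAMPLE_HTML)
print(textarea_texts)
```

Calling get_text() on each match returns plain strings rather than Tag objects, which is usually what the caller wants to pass on.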

2. Commonly used soup functions

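Goal: find every h4 tag and print both its text and the text together with its tags.

BeautifulSoup object.find(tag, attributes)
find(): returns the first element that matches the conditions
find_all(): returns all elements that match the conditions
To filter by CSS class, pass the class_='...' keyword argument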
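
The difference between the two calls can be sketched on a small inline document. The HTML and the class names below are invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up markup, used only to show what each call returns.
html = """
<ul>
  <li class="course">Python basics</li>
  <li class="course featured">Web scraping</li>
  <li class="note">Not a course</li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

first_item = soup.find('li')                    # only the first <li>
all_items = soup.find_all('li')                 # every <li>, as a list-like ResultSet
courses = soup.find_all('li', class_='course')  # any <li> whose class list includes "course"
missing = soup.find('table')                    # find() returns None when nothing matches

print(first_item.get_text())
print(len(all_items))
print([c.get_text() for c in courses])
print(missing)
```

Because find() returns None on a miss, it is worth checking the result before accessing attributes on it; find_all() simply returns an empty result.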

Example 1:
import requests
from bs4 import BeautifulSoup

url = 'http://www.itest.info/courses'  # URL of the page to scrape
# Fetch the page source with requests, then build a BeautifulSoup object using html.parser; this is the standard pattern.
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
for course in soup.find_all('h4'):  # iterate over every h4 tag on the page
    print(course.text)  # the text inside the h4, e.g. 测试开发--试验班
    print(course)       # the text plus its tags, e.g. <h4>测试开发--试验班</h4>
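
For a version that runs without network access, the same text-versus-markup distinction can be sketched on an inline string. The HTML here is invented for the example.

```python
from bs4 import BeautifulSoup

# Made-up markup standing in for the downloaded page.
html = "<div><h4>Course A</h4><h4>Course <b>B</b></h4></div>"
soup = BeautifulSoup(html, 'html.parser')

headings = soup.find_all('h4')
texts = [h.get_text() for h in headings]  # text only, nested tags stripped
markup = [str(h) for h in headings]       # full markup, including the h4 tags

print(texts)
print(markup)
```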

3. Finding div and span content

url = 'https://www.v2ex.com/'
soup = BeautifulSoup(requests.get(url).text, 'html.parser')
# Find span tags whose class is item_hot_topic_title. The keyword is class_, not class, because class is a reserved word in Python; the trailing underscore avoids the clash.
for span in soup.find_all('span', class_='item_hot_topic_title'):
    # Text of the a tag inside the span. If the span holds many a tags, narrow further with: for i in span.find_all('a', href='/t/415664')
    print(span.find('a').text)
    # In bs4, an element's attributes are read with [attribute_name]; here, href.
    print(span.find('a')['href'])
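
The loop above assumes every matching span contains an a tag with an href. A slightly more defensive sketch, run on invented markup with a hypothetical class name item_title, skips spans without a link and uses get() so that a missing attribute yields None instead of raising KeyError.

```python
from bs4 import BeautifulSoup

# Made-up markup; the class name item_title is an assumption for this example.
html = """
<div class="box">
  <span class="item_title"><a href="/t/1">First topic</a></span>
  <span class="item_title"><a href="/t/2">Second topic</a></span>
  <span class="item_title">No link here</span>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

topics = []
for span in soup.find_all('span', class_='item_title'):
    link = span.find('a')
    if link is None:  # this span has no <a>, so there is nothing to read
        continue
    # get() returns None for a missing attribute; link['href'] would raise KeyError.
    topics.append((link.get_text(), link.get('href')))

print(topics)
```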

posted @ 2020-07-30 09:42 马里亚纳仰望星空
