Python BeautifulSoup: Locating Elements and Extracting Values

-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-*-

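To pull specific tags and attribute values out of a web page, use the following accessors:

　　1. Tag name: tag.name returns the tag's name as a string; tag itself is of type <class 'bs4.element.Tag'>

　　2. All attributes: tag.attrs returns a dict of the tag's attributes

　　3. A single attribute: tag.get('attribute_name') or tag['attribute_name']; get() returns None when the attribute is missing, while the subscript form raises KeyError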
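The three accessors above can be tried on a single tag. This is a minimal sketch; the markup and the variable names are made up for illustration and do not come from any real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only to illustrate the accessors.
html = '<a id="link1" class="sister big" href="http://example.com/elsie">Elsie</a>'
soup = BeautifulSoup(html, "html.parser")
tag = soup.a                   # first <a> tag, a bs4.element.Tag

tag_name = tag.name            # the tag's name as a string
tag_attrs = tag.attrs          # dict holding every attribute
href = tag["href"]             # subscript form: KeyError if the attribute is missing
missing = tag.get("title")     # get(): None if the attribute is missing
classes = tag["class"]         # class is multi-valued, so it comes back as a list

print(type(tag), tag_name, tag_attrs, href, missing, classes)
```

Prefer get() when an attribute may be absent, so that a missing attribute does not interrupt a scraping loop.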

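Getting a tag's content:

　　1. tag.string returns the tag's text only when the tag has a single child (one string, or one child tag that itself has a .string); otherwise it returns None

　　2. tag.get_text() returns all strings inside the tag, including those in nested tags, joined into one string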
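The difference between the two is easiest to see on a tag with mixed content. The sketch below uses made-up markup and variable names.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one <p> with a single string, one with mixed content.
html = "<div><p>Hello</p><p>big <b>world</b></p></div>"
soup = BeautifulSoup(html, "html.parser")
first_p, second_p = soup.find_all("p")

single = first_p.string          # one child string, so .string returns it
mixed = second_p.string          # two children ("big " and <b>), so .string is None
all_text = second_p.get_text()   # concatenates every nested string

# get_text() also accepts a separator and strip=True to tidy each piece.
joined = soup.div.get_text(" ", strip=True)

print(repr(single), repr(mixed), repr(all_text), repr(joined))
```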

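　　1. stripped_strings

　　 The extracted strings may contain a lot of spaces or blank lines; .stripped_strings strips leading and trailing whitespace from each string and skips strings that are whitespace only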
for string in soup.stripped_strings:
    print(repr(string))
    # "The Dormouse's story"
    # "The Dormouse's story"
    # 'Once upon a time there were three little sisters; and their names were'
    # 'Elsie'
    # ','
    # 'Lacie'
    # 'and'
    # 'Tillie'
    # ';\nand they lived at the bottom of a well.'
　　2. Pretty-printing the page:

　　　　soup.prettify()  # returns the document as an indented string
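For a self-contained comparison, the sketch below contrasts .strings, .stripped_strings and prettify() on a tiny made-up fragment (the markup is an assumption, not the document used in the loop above).

```python
from bs4 import BeautifulSoup

# Hypothetical fragment with plenty of stray whitespace.
html = "<div>\n  <p>  Elsie  </p>\n\n  <p>Lacie</p>\n</div>"
soup = BeautifulSoup(html, "html.parser")

raw = list(soup.strings)             # every string, whitespace-only ones included
clean = list(soup.stripped_strings)  # trimmed, whitespace-only strings skipped
pretty = soup.prettify()             # one tag or string per line, indented

print(raw)
print(clean)
print(pretty)
```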

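Finding elements with BeautifulSoup:

　　1. find_all(class_="class") returns every matching tag as a <class 'bs4.element.ResultSet'> (a list subclass)

　　2. find(class_="class") returns the first matching tag as a <class 'bs4.element.Tag'>, or None if nothing matches

　　3. select_one() takes a CSS selector and returns the first matching tag as a <class 'bs4.element.Tag'>, or None if nothing matches

　　4. select() takes a CSS selector and returns every matching tag in a list-like <class 'bs4.element.ResultSet'>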
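The four lookups above can be compared side by side. This is a rough sketch on made-up markup; the class names and variable names are assumptions for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import ResultSet, Tag

# Hypothetical list markup.
html = """
<ul>
  <li class="item">one</li>
  <li class="item">two</li>
  <li class="other">three</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

items = soup.find_all(class_="item")     # all matches
first = soup.find(class_="item")         # first match only
css_first = soup.select_one("li.item")   # first match for a CSS selector
css_all = soup.select("li.item")         # all matches for a CSS selector
nothing = soup.find(class_="missing")    # no match gives None, not an error

print(type(items), type(first), type(css_first), type(css_all), nothing)
```

Because find() and select_one() return None when nothing matches, check the result before reading attributes from it.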

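　　5.　soup = BeautifulSoup(backdata, 'html.parser')　　# parse the response text into a BeautifulSoup object

　　　　soup.find_all('tag_name', attrs={'attribute_name': 'attribute_value'})  # returns a list-like ResultSet

　　　　limit caps the number of results find_all returns

　　　　recursive=False restricts the search to the tag's direct children

　　　　soup.find_all(string=re.compile('content'))     matches on text; a regular expression allows partial matches (before Beautiful Soup 4.4.0 this argument was called text)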
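The find_all options listed above can be sketched as follows. The markup, the data-kind attribute and the variable names are hypothetical; string= assumes Beautiful Soup 4.4.0 or later.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with one nested <p>.
html = """
<div id="root">
  <p data-kind="note">first content here</p>
  <section><p data-kind="note">nested content</p></section>
  <p data-kind="warn">other</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
root = soup.find("div", attrs={"id": "root"})

# attrs: filter on arbitrary attributes, including ones with hyphens.
notes = soup.find_all("p", attrs={"data-kind": "note"})

# limit: stop after the first N matches.
limited = soup.find_all("p", limit=1)

# recursive=False: only look at root's direct children, so the nested <p> is skipped.
direct = root.find_all("p", recursive=False)

# string + regex: returns the matching strings themselves, not their tags.
text_hits = soup.find_all(string=re.compile("content"))

print(len(notes), len(limited), len(direct), text_hits)
```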

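Working with child nodes:

　　1. contents

　　　　.contents returns the tag's direct children as a list

　　2. children

　　　　.children is an iterator over the tag's direct children

　　3. descendants

　　　　.contents and .children cover only direct children, whereas .descendants iterates over all descendants (children, grandchildren, and so on)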
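A short sketch of the three child accessors on made-up markup. Note that .descendants yields strings as well as tags, so the example filters with isinstance when only tags are wanted.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical markup without whitespace between tags.
html = "<ul><li>one <b>bold</b></li><li>two</li></ul>"
soup = BeautifulSoup(html, "html.parser")
ul = soup.ul

direct_list = ul.contents                      # list of the two <li> tags
direct_names = [c.name for c in ul.children]   # same nodes, via an iterator

# descendants walks the whole subtree: li, "one ", b, "bold", li, "two"
all_nodes = list(ul.descendants)
descendant_tags = [n.name for n in all_nodes if isinstance(n, Tag)]

print(len(direct_list), direct_names, len(all_nodes), descendant_tags)
```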

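Working with parent nodes:

　　1. parent

　　　　.parent returns an element's direct parent

　　2. find_parents()

　　　　returns all matching ancestors, nearest first

　　3. find_parent()

　　　　returns the nearest matching ancestor (the direct parent when called without arguments)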
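The parent lookups can be sketched on a nested fragment. The markup and ids are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical nesting: div#outer > div#inner > p > b
html = '<div id="outer"><div id="inner"><p><b>text</b></p></div></div>'
soup = BeautifulSoup(html, "html.parser")
b = soup.b

parent_name = b.parent.name                    # direct parent: <p>
nearest_div = b.find_parent("div")             # closest <div> ancestor
all_divs = b.find_parents("div")               # every <div> ancestor, nearest first
ancestor_names = [t.name for t in b.parents]   # walks up to the document root

print(parent_name, nearest_div["id"], [d["id"] for d in all_divs], ancestor_names)
```

The last entry of .parents is the BeautifulSoup object itself, whose name is '[document]'.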

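Working with sibling nodes:

　　1. next_siblings iterates over all siblings that follow the element

　　2. previous_siblings iterates over all siblings that precede the element

　　3. find_next_siblings() returns all following siblings that match

　　4. find_next_sibling() returns the first following sibling that matches; find_previous_sibling() is the counterpart for the preceding one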
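A minimal sketch of the sibling accessors, again on made-up markup. The fragment has no whitespace between tags; in real pages the whitespace between tags shows up as string siblings, which is why the find_* variants with a tag name are often more convenient.

```python
from bs4 import BeautifulSoup

# Hypothetical list; start from the second <li>.
html = "<ul><li>a</li><li>b</li><li>c</li><li>d</li></ul>"
soup = BeautifulSoup(html, "html.parser")
second = soup.find_all("li")[1]

following = [li.get_text() for li in second.next_siblings]      # everything after
preceding = [li.get_text() for li in second.previous_siblings]  # everything before
next_one = second.find_next_sibling("li")                       # first <li> after
next_all = second.find_next_siblings("li")                      # all <li> after
prev_one = second.find_previous_sibling("li")                   # first <li> before

print(following, preceding, next_one, next_all, prev_one)
```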

Posted 2018-12-01 19:35 by wlanan小栈
