Crawler Diary: Using the Various Attributes of Beautiful Soup

Beautiful Soup (BeautifulSoup) by example


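This library has to be installed first. Open cmd and run pip install bs4 to download it (bs4 on PyPI pulls in the actual package, beautifulsoup4).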
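
To confirm the install worked, a quick check along these lines can be run in a Python shell. This is only a sketch; the variable names and the tiny inline HTML string are made up for illustration.

```python
import bs4
from bs4 import BeautifulSoup

# The installed Beautiful Soup version, as a string.
version = bs4.__version__
print(version)

# Parse a tiny inline snippet to make sure the import really works.
check = BeautifulSoup("<p>ok</p>", "html.parser").p.string
print(check)  # ok
```

If the import fails with ModuleNotFoundError, pip most likely installed into a different Python environment than the one you are running.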

It parses the messy HTML or XML you scraped and tidies it up for you. The code later in this post shows how to use it.

BeautifulSoup takes two arguments: the first is the scraped HTML content, the second is the name of the parser used to parse it.
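
As a minimal sketch of those two arguments, assuming a short inline string (html_text, a made-up name) in place of a real downloaded page:

```python
from bs4 import BeautifulSoup

# Hypothetical inline markup standing in for a downloaded page.
html_text = "<html><head><title>Demo</title></head><body><p>Hello</p></body></html>"

# First argument: the markup to parse. Second argument: the parser name.
soup = BeautifulSoup(html_text, "html.parser")

print(soup.title.string)  # Demo
print(soup.p.string)      # Hello
```

"html.parser" is the parser from Python's standard library, so this needs nothing beyond bs4 itself.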

There are other parsers as well.

[Screenshot: other parsers]

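(Screenshot taken from the video in the top-right corner. If this infringes on anything, please let me know and I will fix it.)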
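
To see which parsers are usable on your own machine, something like the following sketch may help. It assumes the commonly used parser names "lxml" and "html5lib", which are optional third-party packages and may not be installed; the markup string and the available list are made up for illustration.

```python
from bs4 import BeautifulSoup, FeatureNotFound

markup = "<p>Hello<br>world</p>"

# "html.parser" ships with Python; "lxml" and "html5lib" must be
# installed separately, so each one is tried in turn.
available = []
for parser in ("html.parser", "lxml", "html5lib"):
    try:
        BeautifulSoup(markup, parser)
        available.append(parser)
    except FeatureNotFound:
        # Raised by bs4 when the requested parser is not installed.
        print(f"{parser}: not installed")

print("usable parsers:", available)
```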


<html><head><title>This is a python demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>

So the raw scraped data is this ugly, enough to make your scalp tingle. After running it through bs4, it turns into soup:

<html>
 <head>
  <title>
   This is a python demo page
  </title>
 </head>
 <body>
  <p class="title">
   <b>
    The demo python introduces several python courses.
   </b>
  </p>
  <p class="course">
   Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
   <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">
    Basic Python
   </a>
   and
   <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">
    Advanced Python
   </a>
   .
  </p>
 </body>
</html>

Much nicer. This is the full content of soup from the code below, in the indented form that soup.prettify() produces.
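
A small sketch of that pretty-printing step, using one line of the demo page as an inline string so it runs without network access (raw and pretty are made-up names):

```python
from bs4 import BeautifulSoup

raw = '<p class="title"><b>The demo python introduces several python courses.</b></p>'
soup = BeautifulSoup(raw, "html.parser")

# prettify() returns a str with each tag and each text node on its own
# line, indented by nesting depth.
pretty = soup.prettify()
print(pretty)
```

Note that prettify() adds whitespace around text nodes, so it is meant for reading the structure, not for saving the page back out unchanged.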

import requests
from bs4 import BeautifulSoup

try:
    r = requests.get('https://python123.io/ws/demo.html', timeout=10)
    r.raise_for_status()  # treat HTTP error codes as failures too
    demo = r.text
    soup = BeautifulSoup(demo, 'html.parser')

    # The <title> tag, i.e. the text shown on the browser tab:
    # <title>This is a python demo page</title>
    print(soup.title)

    # The tag named a; soup.a only returns the first <a> tag:
    # <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>
    print(soup.a)
    # The text inside it: Basic Python
    print(soup.a.string)

    # Name of the <a> tag: a
    print(soup.a.name)

    # Name of the parent tag of <a>: p
    print(soup.a.parent.name)

    # Name of the grandparent tag of <a>: body
    print(soup.a.parent.parent.name)

    tag = soup.a

    # All attributes of the <a> tag:
    # {'class': ['py1'], 'href': 'http://www.icourse163.org/course/BIT-268001', 'id': 'link1'}
    print(tag.attrs)

    # Value of the class attribute: ['py1']
    print(tag.attrs['class'])

    # Value of the href attribute: http://www.icourse163.org/course/BIT-268001
    print(tag.attrs['href'])

    # Type of tag.attrs: <class 'dict'>
    print(type(tag.attrs))

    # Type of soup.a: <class 'bs4.element.Tag'>
    print(type(tag))

    # The string inside the tag: Basic Python
    print(tag.string)

    # .string reaches through the nested <b> tag, because the first <p>
    # contains only that one child.
    tag2 = soup.p.string

    # The demo python introduces several python courses.
    print(tag2)

    # <class 'bs4.element.NavigableString'>
    print(type(tag2))

    newsoup = BeautifulSoup("<b><!--This is a comment --></b><p>This is not a comment</p>", "html.parser")

    # This is a comment
    print(newsoup.b.string)

    # <class 'bs4.element.Comment'> (the comment type)
    print(type(newsoup.b.string))

except requests.RequestException:
    print('Scraping failed')
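
Besides going through tag.attrs, a Tag can also be indexed like a dict directly. A short sketch with the first link of the demo page as an inline string (the title attribute is deliberately one the tag does not have):

```python
from bs4 import BeautifulSoup

html_text = '<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>'
tag = BeautifulSoup(html_text, "html.parser").a

# tag["href"] reads the same value as tag.attrs["href"].
href = tag["href"]
print(href)

# .get() returns None for a missing attribute instead of raising KeyError.
missing = tag.get("title")
print(missing)  # None

# class can hold several values, so it comes back as a list.
classes = tag.get("class")
print(classes)  # ['py1']
```

Using .get() is the safer choice when you are not sure every tag on the page carries the attribute.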

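I wrote all the comments myself, so read them carefully. If it still doesn't make sense, go through it again from the start; there is no way around it. The features above are not used very often, so a basic understanding is enough.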
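
One detail about .string that is easy to trip over: it only returns text when the tag has a single child. The sketch below uses a shortened, made-up version of the demo page's second paragraph, plus the comment example from the code above, to show the difference.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Shortened stand-in for the demo page, not the real markup.
html_text = (
    '<p class="course">Learn <a id="link1">Basic Python</a> and '
    '<a id="link2">Advanced Python</a>.</p><b><!--This is a comment --></b>'
)
soup = BeautifulSoup(html_text, "html.parser")

p = soup.p
# <p> has several children (text, <a>, text, <a>, text), so .string is None.
print(p.string)

# get_text() joins all the text nested inside the tag.
text = p.get_text()
print(text)  # Learn Basic Python and Advanced Python.

# A comment comes back through .string as well, so check its type
# before treating it as page text.
b_string = soup.b.string
is_comment = isinstance(b_string, Comment)
print(is_comment)  # True
```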

posted @ 2019-08-22 17:20  chanyuli