Python Web Scraping: beautifulsoup4 Series, Part 1

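Preface

Using cnblogs (博客园) as the example site, this post scrapes the publish date, title, and summary of each post on my blog's home page. This first part is a quick warm-up to show what beautifulsoup4 can do; later parts cover its features in detail.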

I. Installation

1. Open cmd and install beautifulsoup4 online with pip

>pip install beautifulsoup4
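
To confirm the install worked, a quick check along these lines can help. It is only a sketch, and it assumes the package was installed into the same Python interpreter that runs the script.

```python
# The package installs as "beautifulsoup4" but is imported as "bs4".
import bs4
from bs4 import BeautifulSoup  # the class used throughout this series

version = bs4.__version__
print("beautifulsoup4 version:", version)
```

If the import fails, pip most likely installed the package for a different Python installation.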

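II. Parsers

1. We mainly use html.parser, which is part of the Python standard library and works out of the box. The other parsers need the corresponding package installed first.

The main parsers are Python's html.parser, lxml's HTML parser, lxml's XML parser, and html5lib; they differ in speed, tolerance of bad markup, and external dependencies.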
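
As a minimal sketch of how the parser is chosen, the second argument to BeautifulSoup names it. The HTML string below is made up for illustration and is not taken from the blog.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only to show the call.
html = "<html><body><p class='intro'>Hello, <b>soup</b>!</p></body></html>"

# "html.parser" ships with Python, so nothing extra has to be installed.
soup = BeautifulSoup(html, "html.parser")

paragraph_text = soup.p.get_text()  # all text inside the first <p>
bold_text = str(soup.b.string)      # the single string inside <b>
print(paragraph_text)
print(bold_text)
```

Switching to another parser would mean installing it and changing only that second argument.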

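III. Printing the post dates on the home page

1. The date is hard to locate directly, so locate its parent element first: class="dayTitle"

2. Use the get method from requests to open the blog home page. r.content returns the full HTML of the page as bytes (a str under Python 2), which BeautifulSoup accepts directly

3. Find all Tag objects whose class attribute is dayTitle

4. Read the string value of the a tag inside the current Tag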
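
The steps above can be tried offline with a small sketch like this. Only the class name dayTitle comes from the real page; the surrounding markup and the dates are assumptions made for the example, and the inline string stands in for r.content.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the home page HTML (normally r.content).
html = """
<div class="day">
  <div class="dayTitle"><a href="#">2017-05-27</a></div>
</div>
<div class="day">
  <div class="dayTitle"><a href="#">2017-05-26</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# find_all returns a list-like collection of Tag objects.
day_titles = soup.find_all(class_="dayTitle")

# tag.a is the first <a> inside the tag; .string is its only text node.
dates = [str(tag.a.string) for tag in day_titles]
print(dates)
```

The keyword is spelled class_ with a trailing underscore because class is a reserved word in Python.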

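IV. Printing the summaries

1. Titles are fetched the same way as above. Summaries are slightly different: the parent <div class="c_b_p_desc"> contains the summary text plus an extra child a tag

2. Get the div Tag first; a tag's .contents attribute returns its child nodes as a list

3. Since the summary text is the first child node, index [0] reads it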
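
Here is a small sketch of what .contents returns in that situation. The markup is a hypothetical reconstruction based on the description above (summary text followed by an a tag), not a copy of the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: text node first, then an <a> child, inside c_b_p_desc.
html = (
    '<div class="postCon">'
    '<div class="c_b_p_desc">Summary: a first look at BeautifulSoup.'
    '<a href="#">Read more</a></div>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")

# .div picks the first <div> nested inside the postCon element.
desc = soup.find(class_="postCon").div

# .contents lists the direct children: here a text node, then the <a> Tag.
children = desc.contents
summary = str(children[0])
print(len(children), summary)
```

Because the text node and the a tag are separate children, taking index [0] leaves the link text out of the summary.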

V. Reference code

# coding:utf-8
from bs4 import BeautifulSoup
import requests

r = requests.get("http://www.cnblogs.com/yoyoketang/")
# After requesting the home page, take the whole HTML document
blog = r.content
# print(blog)
# Parse the HTML with html.parser
soup = BeautifulSoup(blog, "html.parser")
# Find every element whose class attribute is dayTitle; returns Tag objects
times = soup.find_all(class_="dayTitle")
# for i in times:
#     print(i.a.string)  # text of the a tag

title = soup.find_all(class_="postTitle")
# for i in title:
#     print(i.a.string)

# Read the summary content
descs = soup.find_all(class_="postCon")
# for i in descs:
#     # a tag's .contents attribute returns its child nodes as a list
#     c = i.div.contents[0]  # take the first one
#     print(c)

# Print date, title, and summary for each post, separated by a blank line
for i, j, k in zip(times, title, descs):
    print(i.a.string)
    print(j.a.string)
    print(k.div.contents[0])
    print("")
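
One possible variation, offered as a sketch and not as the original approach, wraps the same lookups in a function that takes HTML as input, so it can be tried without a network request. The function name extract_posts and the sample markup are hypothetical; the class names are the ones used in the reference code.

```python
from bs4 import BeautifulSoup


def extract_posts(html):
    """Return (date, title, summary) tuples parsed from blog-style HTML."""
    soup = BeautifulSoup(html, "html.parser")
    times = soup.find_all(class_="dayTitle")
    titles = soup.find_all(class_="postTitle")
    descs = soup.find_all(class_="postCon")

    posts = []
    for day, title, desc in zip(times, titles, descs):
        # Guard against a missing <a> instead of raising AttributeError.
        date = day.a.get_text(strip=True) if day.a is not None else ""
        name = title.a.get_text(strip=True) if title.a is not None else ""
        # Guard against a missing or empty inner <div>.
        inner = desc.div
        if inner is not None and inner.contents:
            summary = str(inner.contents[0]).strip()
        else:
            summary = ""
        posts.append((date, name, summary))
    return posts


# Hypothetical sample page; in the article the HTML comes from r.content.
sample = """
<div class="dayTitle"><a href="#">2017-05-27</a></div>
<div class="postTitle"><a href="#"> First post </a></div>
<div class="postCon"><div class="c_b_p_desc">Summary text<a href="#">more</a></div></div>
"""

posts = extract_posts(sample)
for date, name, summary in posts:
    print(date, name, summary)
```

Note that zip pairs the three lists purely by position and stops at the shortest one, so if a single day on the page holds several posts, dates and titles may no longer line up.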


posted @ 2017-05-27 21:32 上海-悠悠
