A Python 3 Web Scraping Journey (Part 1)

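In my spare time while learning basic Python 3 syntax, I got the itch to build something bigger, so I decided to start a side series. It will be interleaved with my other Python learning series and updated irregularly.

This is a standalone self-study series on web scraping, so read whichever parts are useful to you.

Of course, the series is also meant as hands-on practice.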

Getting Started

Let's write our first scraper, starting with Baidu.

from urllib.request import urlopen
html = urlopen("http://www.baidu.com")
print(html.read())

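That's it. To explain: urllib is a package in the Python standard library, request is a module inside that package, and the from ... import statement imports only the urlopen function from that module.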
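One detail worth noting about the snippet above: read() returns raw bytes rather than text, which is why the printed output starts with b'. The following is a minimal sketch of decoding the response into a string. The helper name fetch_text and the utf-8 fallback are assumptions of this example, not part of the original post, and a data: URL stands in for a real page so the sketch runs without network access.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_text(url, fallback_encoding="utf-8"):
    """Fetch a URL and return its body as str (hypothetical helper)."""
    with urlopen(url) as response:
        raw = response.read()  # bytes, not str
        # Use the charset declared in the Content-Type header, if there is one
        charset = response.headers.get_content_charset() or fallback_encoding
    return raw.decode(charset, errors="replace")


# A data: URL stands in for a real page so the sketch runs offline;
# an http:// URL such as "http://www.baidu.com" is passed in the same way.
demo_url = "data:text/html;charset=utf-8," + quote("<h1>Hello</h1>")
text = fetch_text(demo_url)
print(type(text).__name__, text)
```

With errors="replace", bytes that do not match the chosen encoding are substituted instead of raising an exception, which is a forgiving choice for a first scraper but may hide encoding problems.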

BeautifulSoup

BeautifulSoup is not part of the standard library and has to be installed first. Install command:

pip install beautifulsoup4

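Usage:

from urllib.request import urlopen
from bs4 import BeautifulSoup
html = urlopen("http://www.pythonscraping.com/pages/page1.html")
# Name the parser explicitly; otherwise bs4 picks one itself and emits a warning
bsObj = BeautifulSoup(html.read(), "html.parser")
print(bsObj.h1)
# All of the following expressions produce the same result:
# bsObj.html.body.h1
# bsObj.body.h1
# bsObj.html.h1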

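Output:

<h1>An Interesting Title</h1>

urlopen first fetches the page, read() then returns the document's content, and that content is handed to BeautifulSoup. After that, the resulting object gives access to any tag in the document.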
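To make the tag access concrete, here is a small sketch that parses an inline HTML string instead of a downloaded page. The sample markup is made up for illustration, and the example assumes beautifulsoup4 is installed. It shows that attribute-style access returns the first matching tag, and that a tag which does not exist comes back as None rather than raising an error.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, standing in for the result of html.read()
sample_html = """
<html>
  <head><title>Sample Page</title></head>
  <body>
    <h1>An Interesting Title</h1>
    <div>First paragraph.</div>
  </body>
</html>
"""

bsObj = BeautifulSoup(sample_html, "html.parser")

first_h1 = bsObj.h1            # first <h1> anywhere in the document
same_h1 = bsObj.html.body.h1   # the same tag, reached step by step
missing = bsObj.h2             # there is no <h2>, so this is None

print(first_h1)             # the whole tag, markup included
print(first_h1.get_text())  # only the text inside the tag
print(missing)
```

Because a missing tag comes back as None, chaining further access on it (for example bsObj.h2.span) raises AttributeError.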

Exception handling:

from urllib.request import urlopen
from urllib.error import HTTPError
from bs4 import BeautifulSoup


def getTitle(url):
    try:
        html = urlopen(url)
    except HTTPError:  # the server returned an HTTP error (e.g. 404 or 500)
        return None
    try:
        bsObj = BeautifulSoup(html.read(), 'html.parser')
        title = bsObj.body.h1
    except AttributeError:  # a node is missing (e.g. the page has no <body>)
        return None
    return title


title = getTitle("http://www.pythonscraping.com/pages/page1.html")
if title is None:
    print("Title could not be found")
else:
    print(title)
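
The function above only catches HTTPError, which covers responses such as 404 or 500. If no usable response arrives at all (for example, the host cannot be reached), urlopen raises URLError instead, and HTTPError is a subclass of it. A possible extension is sketched below. The helper name openOrNone is hypothetical, and a file: URL pointing at a path assumed not to exist stands in for an unreachable resource so the sketch runs offline.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def openOrNone(url):
    """Return the open response, or None if the request fails (hypothetical helper)."""
    try:
        return urlopen(url)
    except HTTPError as e:
        # The server answered, but with an error status code
        print("HTTP error:", e.code)
    except URLError as e:
        # No usable response at all: bad host, refused connection, missing file
        print("URL error:", e.reason)
    return None


# HTTPError has to be listed first because it is a subclass of URLError
print(issubclass(HTTPError, URLError))

# A file: URL for a path assumed not to exist triggers URLError offline
result = openOrNone("file:///no/such/dir/no_such_page.html")
print(result)
```

If the except clauses were swapped, the URLError branch would also swallow HTTP errors and the status code would never be reported separately.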

Posted on 2020-03-26 15:56 by little天
