第1章初见网络爬虫

本章先介绍向网络服务器发送GET请求以获取具体网页，再从网页中读取HTML内容，最后做一些简单的信息提取，将我们要寻找的内容分离出来。

注：本节用到的html文件就是书中的，可以通过url访问到。

1.1 网络连接

1 from urllib.request import urlopen
2 html = urlopen('https://pythonscraping.com/pages/page1.html')
3 print (html.read())

urlopen()方法用来打开并读取一个从网络获取的远程对象。可以读取HTML文件、图像文件、或其它任何文件流。

1.2 BeautifulSoup简介

BeautifulSoup通过定位HTML标签来格式化和组织复杂的网络信息，用简单易用的python对象展现XML结构信息。

1.2.1 安装BeautifulSoup

需要预先安装beautifulsoup4。

使用：pip install beautifulsoup4 -i http://pypi.douban.com/simple/ --trusted-host pypi.douban.com

1.2.2 运行BeautifulSoup

BeautifulSoup库最常用的是BeautifulSoup对象。

 1 from urllib.request import urlopen
 2 from bs4 import BeautifulSoup
 3 
 4 html = urlopen('http://pythonscraping.com/pages/page1.html')
 5 
 6 bsObj = BeautifulSoup(html.read())
 7 
 8 #获取h1信息，等价于print (bsObj.html.body.h1)
 9 # print (bsObj.body.h1)
10 # print (bsObj.html.h1)
11 print (bsObj.h1)

将HTML内容传入BeautifulSoup对象，会转换成下面的结构：

1.2.3 可靠的网络连接

编写爬虫时，需要考虑可能出来的异常，例如：

html = urlopen('http://pythonscraping.com/pages/page1.html')

这行代码可能发生两种异常：（1）网页在服务器上不存在（或者获取页面的时候出现错误）；（2）服务器不存在。可使用try语句进行处理，或者当调用BeautifulSoup对象里的一个标签是，增加一个检查条件。

 1 from urllib.request import urlopen
 2 from urllib.error import HTTPError
 3 from bs4 import BeautifulSoup
 4 
 5 # 获取网页的标题
 6 def getTitle(url):
 7     try:
 8         html = urlopen(url)
 9     except HTTPError as e:
10         return None
11     try:
12         bsObj = BeautifulSoup(html.read())
13         title = bsObj.body.h1
14     except AttributeError as e:
15         return None
16     return title
17 
18 title = getTitle('http://pythonscraping.com/pages/page1.html')
19 
20 if title == None:
21     print ('Title counld not be found')
22 else:
23     print (title)

posted @ 2020-02-09 20:00 zhengcixi 阅读(240) 评论(0) 编辑收藏举报

刷新页面返回顶部

MrLayflolk

有志者，事竟成。

第1章初见网络爬虫

1.1 网络连接

1.2 BeautifulSoup简介

1.2.1 安装BeautifulSoup

1.2.2 运行BeautifulSoup

1.2.3 可靠的网络连接

公告

MrLayflolk

有志者，事竟成。

第1章 初见网络爬虫

1.1 网络连接

1.2 BeautifulSoup简介

1.2.1 安装BeautifulSoup

1.2.2 运行BeautifulSoup

1.2.3 可靠的网络连接

公告

第1章初见网络爬虫