Using BeautifulSoup for Python Web Scraping

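While reading the Jingmi (静觅) blog I realized I was not very fluent with BeautifulSoup, so I looked online for videos on the topic. One of them explains it quite clearly: "python爬虫小白入门之BeautifulSoup库" (a beginner's introduction to the BeautifulSoup library for Python scraping). These are the notes I took when I had some free time: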

I. Basic Preparation Before Scraping

1. How do I install BeautifulSoup?

pip install beautifulsoup4 or easy_install beautifulsoup4

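Note: BeautifulSoup 4 (bs4) works well with Python 3. BeautifulSoup 3 is the older, discontinued release and only runs on Python 2, so use bs4 for new code.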

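2. How do I check that BeautifulSoup is installed?

Open an IDE and run from bs4 import BeautifulSoup; if no error is raised, it is installed. Alternatively, run pip list in cmd to see the third-party packages that pip has installed.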
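The import check can also be scripted. A minimal sketch that relies only on the bs4 package name; the variable bs4_version is just an illustrative name:

```python
# Try the import and report the result instead of letting it raise.
try:
    import bs4
    bs4_version = bs4.__version__   # version string of the installed package
except ImportError:
    bs4_version = None              # bs4 is not installed

if bs4_version is None:
    print('BeautifulSoup 4 is not installed')
else:
    print('BeautifulSoup 4 is installed, version ' + bs4_version)
```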

3. Which scraping modules should you know?

Scraping modules: urllib, urllib2 (Python 2 only), Requests, BeautifulSoup, Scrapy, lxml, and so on

II. BeautifulSoup Basics

1. How do I get a tag from an HTML string of my own?

# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup

html = '<title>女朋友</title>'
soup = BeautifulSoup(html, 'html.parser')    # create a BeautifulSoup object; html.parser is the parser built into Python's standard library
print(soup.title)                            # get the <title> tag

Output:
<title>女朋友</title>

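Data: tags such as <div>, <title>, <a>, ...

To get a tag, write soup followed by the tag name, e.g. soup.div. This returns the first matching tag together with its markup, not just its text.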
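Accessing a tag by name returns a Tag object, from which the tag name, text, and attributes can be read. A small sketch (the HTML string here is made up for illustration and does not come from the video):

```python
from bs4 import BeautifulSoup

html = '<div id="main"><a href="www.baidu.com">Baidu</a></div>'
soup = BeautifulSoup(html, 'html.parser')

link = soup.a              # first <a> tag anywhere in the document
tag_name = link.name       # the tag's name
link_text = link.string    # the text inside the tag
link_href = link['href']   # attribute lookup works like a dictionary
missing = soup.title       # None, because this document has no <title>

print(tag_name, link_text, link_href)
print(missing)
```

Because a missing tag comes back as None, it is worth checking the result before reading attributes from it.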

2. How do I read the content of a local HTML file?

# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup

# assumes a local HTML file named a.html has already been created at this path
with open('C:\\Users\\Administrator\\Desktop\\a.html') as f:
    soup = BeautifulSoup(f, 'html.parser')
print(soup.prettify())    # print the content of the soup object, pretty-printed

Output:
<h1>
 今天是周五
</h1>
<p>
 你们都很棒
</p>

Open a local HTML file: open

Print the file's content with formatting: soup.prettify()
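Since the page contains Chinese text, it can help to state the file encoding explicitly and to close the file when done. A sketch for Python 3 (the temporary directory and the file contents are assumptions made so that the example runs anywhere):

```python
import os
import tempfile
from bs4 import BeautifulSoup

# Write a throwaway HTML file so the example does not depend on a fixed path.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, 'a.html')
with open(path, 'w', encoding='utf-8') as f:
    f.write('<h1>今天是周五</h1><p>你们都很棒</p>')

# Read it back with the same explicit encoding and parse it.
with open(path, encoding='utf-8') as f:
    soup = BeautifulSoup(f, 'html.parser')

pretty = soup.prettify()
print(pretty)
```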

3. Many tags in the HTML source share the same name. How do I get just the part I want?

# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup

html = '<div class="a">科里小姐姐</div><div class="b">若兰姐姐小溪姐姐</div>'
soup = BeautifulSoup(html, 'html.parser')
e = soup.find('div', class_="b")    # class is a reserved word in Python, so the filter is spelled class_ (with a trailing underscore)
print(e.text)                       # .text returns the tag's text

Output:
若兰姐姐小溪姐姐

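Elements on a web page can be located by: tag name, class, id

find(name, attrs, recursive, text, **kwargs): these parameters act as filters on the search

name: filter on the tag name

attrs: filter on the tag's attributes, given as a dictionary

recursive: whether to search all descendants (the default) or only direct children

text: filter on the text inside the tag (newer Beautiful Soup releases call this parameter string)

**kwargs: filter on attributes passed as keyword arguments, such as id="..." or class_="..."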
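The filters listed above can be tried one at a time on a small document. This is only a sketch: the HTML and variable names are invented for illustration, and it uses the string spelling of the text filter, which assumes a reasonably recent Beautiful Soup 4 release.

```python
from bs4 import BeautifulSoup

html = ('<div id="outer"><p class="intro">hello</p>'
        '<div id="inner"><p class="note">world</p></div></div>')
soup = BeautifulSoup(html, 'html.parser')

by_name = soup.find('p')                            # name: first <p> tag
by_attrs = soup.find('p', attrs={'class': 'note'})  # attrs: dictionary of attributes
by_kwargs = soup.find(id='inner')                   # **kwargs: attribute as keyword
by_string = soup.find('p', string='world')          # text/string: match on the tag's text

outer = soup.find('div')                                # the outermost <div>
direct_children = outer.find_all('p', recursive=False)  # only <p> directly under outer
all_descendants = outer.find_all('p')                   # <p> at any depth under outer

print(by_name.get_text(), by_attrs.get_text(), by_kwargs['id'], by_string.get_text())
print(len(direct_children), len(all_descendants))
```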

4. The difference between find and find_all

# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup

html = '<a href="www.baidu.com">百度</a><a href="www.sina.com.cn">新浪</a>'
soup = BeautifulSoup(html, 'html.parser')

# first with find
a = soup.find('a')
print(a.get('href'))

# then with find_all
b = soup.find_all('a')
for c in b:
    print(c.get('href'))

Output:
find:
www.baidu.com

find_all:
www.baidu.com
www.sina.com.cn

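So find_all() returns a list of every matching tag in the document, which can be iterated over, while find() returns only the first match.

find_all() can also cap the number of results, e.g. soup.find_all('a', limit = 2). With limit = 1 it matches the same tag as find(), but still returns it wrapped in a one-element list, whereas find() returns the tag itself.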
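The difference in return types matters most when nothing matches. A short sketch reusing the two links from the example above (the variable names are illustrative):

```python
from bs4 import BeautifulSoup

html = '<a href="www.baidu.com">百度</a><a href="www.sina.com.cn">新浪</a>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('a')                 # a single Tag
limited = soup.find_all('a', limit=1)  # a list holding one Tag
missing_one = soup.find('img')         # None when nothing matches
missing_all = soup.find_all('img')     # empty list when nothing matches

print(first.get('href'))
print([tag.get('href') for tag in limited])
print(missing_one, len(missing_all))
```

Looping over the result of find_all() is therefore always safe, while the result of find() should be checked for None before use.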

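5. How do I deal with anti-scraping measures?

Add request headers. urllib2.Request() takes three main parameters: urllib2.Request(url, data, headers). If a request gets no response, the site may have anti-scraping measures in place. In that case we can add headers that mimic a browser, which often lets the request through so that we can fetch the data we need.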
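Building such a request can be tried without touching the network. A minimal sketch, where the URL is a placeholder and the shortened User-Agent string is only an example value; the import falls back to urllib2 on Python 2:

```python
try:                      # Python 3
    from urllib.request import Request
except ImportError:       # Python 2
    from urllib2 import Request

url = 'http://example.com/'   # placeholder URL, not a real target
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64)'}

# url, data and headers are the three parameters described above;
# data=None means a plain GET request.
request = Request(url, data=None, headers=headers)

print(request.get_full_url())
print(request.get_header('User-agent'))   # header names are stored capitalized as 'User-agent'
```

Passing the request object to urlopen() would then send it with these headers.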

III. Hands-on: Downloading Images from Douban Meizi (豆瓣妹子)

# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup    # parses the downloaded page
try:                             # Python 3
    from urllib.request import Request, urlopen, urlretrieve
except ImportError:              # Python 2 (urllib2), as in the original notes
    from urllib2 import Request, urlopen
    from urllib import urlretrieve

def crawl(url):
    # the site blocks crawlers, so send a browser-like User-Agent header
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'}
    request = Request(url, headers = headers)    # build a request object for the url
    response = urlopen(request, timeout = 20)
    contents = response.read()                   # fetch the page source

    soup = BeautifulSoup(contents, 'html.parser')
    my_girl = soup.find_all('img')
    x = 0
    for girl in my_girl:
        link = girl.get('src')
        if not link:                             # skip <img> tags that have no src attribute
            continue
        print(link)
        urlretrieve(link, 'E:\\image\\%s.jpg' % x)    # urlretrieve saves the image locally; the target folder must already exist
        x += 1


url = 'https://www.dbmeinv.com/?pager_offset=1'
crawl(url)
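The link-extraction step of crawl() can be separated from the download so that it can be tried on a string without any network access. A sketch for Python 3: extract_image_urls and the sample HTML are hypothetical, and urljoin is used on the assumption that some src values may be relative paths.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

def extract_image_urls(html, base_url):
    """Return absolute URLs for every <img> tag that has a src attribute."""
    soup = BeautifulSoup(html, 'html.parser')
    urls = []
    for img in soup.find_all('img'):
        src = img.get('src')
        if src:                                   # ignore <img> tags without src
            urls.append(urljoin(base_url, src))   # resolve relative paths
    return urls

sample_html = ('<img src="/pics/1.jpg">'
               '<img alt="no source">'
               '<img src="http://example.com/2.jpg">')
image_urls = extract_image_urls(sample_html, 'http://example.com/page')
print(image_urls)
```

Keeping the parsing in its own function makes it possible to check the extracted links before downloading anything.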

posted @ 2017-08-27 18:36 cnhkzyy
