Python Crawler (3): Five Ways to Clear Level 1 of Heibanke

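I came across a fun website for practicing web crawling. Level 1 is at: http://www.heibanke.com/lesson/crawler_ex00/

The rule of Level 1 is simple: the page gives you a number, you append that number to the URL to open the next page, and you repeat this until the level is cleared.

Typing the numbers in by hand is tedious, so this is a job for a crawler.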
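The core step of the rule above (find the number, build the next URL) can be tried offline first. The snippet below is a minimal sketch in Python 3, whereas the scripts in this post target Python 2. The helper name `next_url` and the sample page text are made up for illustration and are not taken from the site.

```python
import re

BASE_URL = 'http://www.heibanke.com/lesson/crawler_ex00/'


def next_url(page_text, base_url=BASE_URL):
    """Return the URL of the next page, or None if no 5-digit number is found."""
    match = re.search(r'\d{5}', page_text)
    if match is None:
        return None
    return '%s%s/' % (base_url, match.group())


# Hypothetical page fragments, used only to exercise the helper.
print(next_url('<h3>next number: 64899</h3>'))
print(next_url('<h3>no number here</h3>'))
```

Returning `None` when no number is found gives the crawl loop a natural stop condition.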

Method 1

Using urllib and regular expressions

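# coding:utf-8

# Note: on Linux the header comment lines must be written in this form; in particular, the first one must contain no extra spaces.

# This script is a Python crawling exercise for clearing the levels on Heibanke.
# It is written for Python 2 (print statement, urllib.urlopen).

# Analysis: open the first page: http://www.heibanke.com/lesson/crawler_ex00/
# Level 1 makes you keep changing the URL and opening the new page.
# The approach:
# 1. Open the page.
# 2. Extract the number from the page with bs4 or a regular expression, build the new URL from it, open that page, and repeat.

import re
import urllib
import datetime

begin_time = datetime.datetime.now()
url = 'http://www.heibanke.com/lesson/crawler_ex00/'
html = urllib.urlopen(url).read()
# The first page introduces the number with "输入数字" ("enter the number").
index = re.findall(r'输入数字([0-9]{5})', html)
while index:
    url = 'http://www.heibanke.com/lesson/crawler_ex00/%s/' % index[0]
    print url
    html = urllib.urlopen(url).read()
    # Later pages introduce it with "数字是" ("the number is").
    index = re.findall(r'数字是([0-9]{5})', html)

# No number left: the last page links to the next level.
html = urllib.urlopen(url).read()
url = 'http://www.heibanke.com' + re.findall(r'<a href="(.*?)" class', html)[0]
# Prints "The URL after clearing the level is ..., elapsed ..."
print '最后通关的的网址是%s, 耗时%s' % (url, (datetime.datetime.now() - begin_time))
print 'just for test,是吧！'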

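The output is as follows. The last two lines are the script's own messages: "The URL after clearing the level is ..., elapsed ..." and "just for test, right!".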

http://www.heibanke.com/lesson/crawler_ex00/64899/  
http://www.heibanke.com/lesson/crawler_ex00/36702/  
http://www.heibanke.com/lesson/crawler_ex00/83105/  
http://www.heibanke.com/lesson/crawler_ex00/25338/  
http://www.heibanke.com/lesson/crawler_ex00/19016/  
http://www.heibanke.com/lesson/crawler_ex00/13579/  
http://www.heibanke.com/lesson/crawler_ex00/43396/  
http://www.heibanke.com/lesson/crawler_ex00/39642/  
http://www.heibanke.com/lesson/crawler_ex00/96911/  
http://www.heibanke.com/lesson/crawler_ex00/30965/  
http://www.heibanke.com/lesson/crawler_ex00/67917/  
http://www.heibanke.com/lesson/crawler_ex00/22213/  
http://www.heibanke.com/lesson/crawler_ex00/72586/  
http://www.heibanke.com/lesson/crawler_ex00/48151/  
http://www.heibanke.com/lesson/crawler_ex00/53639/  
http://www.heibanke.com/lesson/crawler_ex00/10963/  
http://www.heibanke.com/lesson/crawler_ex00/65392/  
http://www.heibanke.com/lesson/crawler_ex00/36133/  
http://www.heibanke.com/lesson/crawler_ex00/72324/  
http://www.heibanke.com/lesson/crawler_ex00/57633/  
http://www.heibanke.com/lesson/crawler_ex00/91251/  
http://www.heibanke.com/lesson/crawler_ex00/87016/  
http://www.heibanke.com/lesson/crawler_ex00/77055/  
http://www.heibanke.com/lesson/crawler_ex00/30366/  
http://www.heibanke.com/lesson/crawler_ex00/83679/  
http://www.heibanke.com/lesson/crawler_ex00/31388/  
http://www.heibanke.com/lesson/crawler_ex00/99446/  
http://www.heibanke.com/lesson/crawler_ex00/69428/  
http://www.heibanke.com/lesson/crawler_ex00/34798/  
http://www.heibanke.com/lesson/crawler_ex00/16780/  
http://www.heibanke.com/lesson/crawler_ex00/36499/  
http://www.heibanke.com/lesson/crawler_ex00/21070/  
http://www.heibanke.com/lesson/crawler_ex00/96749/  
http://www.heibanke.com/lesson/crawler_ex00/71822/  
http://www.heibanke.com/lesson/crawler_ex00/48739/  
http://www.heibanke.com/lesson/crawler_ex00/62816/  
http://www.heibanke.com/lesson/crawler_ex00/80182/  
http://www.heibanke.com/lesson/crawler_ex00/68171/  
http://www.heibanke.com/lesson/crawler_ex00/45458/  
http://www.heibanke.com/lesson/crawler_ex00/56056/  
http://www.heibanke.com/lesson/crawler_ex00/87450/  
http://www.heibanke.com/lesson/crawler_ex00/52695/  
http://www.heibanke.com/lesson/crawler_ex00/36675/  
http://www.heibanke.com/lesson/crawler_ex00/25997/  
http://www.heibanke.com/lesson/crawler_ex00/73222/  
http://www.heibanke.com/lesson/crawler_ex00/93891/  
http://www.heibanke.com/lesson/crawler_ex00/29052/  
http://www.heibanke.com/lesson/crawler_ex00/72996/  
http://www.heibanke.com/lesson/crawler_ex00/73999/  
http://www.heibanke.com/lesson/crawler_ex00/23814/
最后通关的的网址是http://www.heibanke.com/lesson/crawler_ex01/, 耗时0:00:49.396000  
just for test,是吧！
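On Python 3, `urllib.urlopen` and the `print` statement no longer exist, so the same loop might look roughly like the sketch below. This is not the script that produced the output above: `fetch`, `crawl`, and `max_steps` are names introduced here, the page is assumed to be UTF-8 encoded, and the two text markers are the ones used in the regular expressions of the original script.

```python
import re
from urllib.request import urlopen

BASE_URL = 'http://www.heibanke.com/lesson/crawler_ex00/'
# The number follows "输入数字" on the first page and "数字是" on later pages.
NUMBER_RE = re.compile(r'(?:输入数字|数字是)([0-9]{5})')


def fetch(url):
    """Download a page and decode it (UTF-8 is an assumption)."""
    with urlopen(url) as response:
        return response.read().decode('utf-8')


def crawl(start_url=BASE_URL, fetch=fetch, max_steps=200):
    """Follow the numbers until a page no longer contains one.

    Returns the list of visited URLs. max_steps guards against an endless loop.
    """
    url = start_url
    visited = []
    for _ in range(max_steps):
        visited.append(url)
        found = NUMBER_RE.findall(fetch(url))
        if not found:
            break
        url = '%s%s/' % (BASE_URL, found[0])
    return visited
```

Calling `crawl()` with no arguments would hit the real site. Passing a different `fetch` callable (for example a dictionary lookup) lets the loop be tested without any network access.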

Method 2

Using the requests and re modules together

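#!/usr/bin/python
# coding:utf-8
# Fetch the page with requests and extract the number with a regular expression.
import requests
import re
import datetime, sys
# Python 2 only: make utf-8 the default encoding so that handling Chinese text does not fail.
reload(sys)
sys.setdefaultencoding('utf-8')
begin_time = datetime.datetime.now()

url = r'http://www.heibanke.com/lesson/crawler_ex00/'
new_url = url
# Capture the digits inside the <h3> element.
num_re = re.compile(r'<h3>[^\d<]*?(\d+)[^\d<]*?</h3>')
while True:
    print '正在读取网址 ', new_url  # "Reading URL"
    html = requests.get(new_url).text
    num = num_re.findall(html)
    if len(num) == 0:
        # No number in <h3>: this is the last page, so take the link to the next level.
        new_url = 'http://www.heibanke.com' + re.findall(r'<a href="(.*?)" class', html)[0]
        break
    else:
        new_url = url + num[0]
# Prints "The URL after clearing the level is ..., elapsed ..."
print '最后通关的的网址是%s, 耗时%s' % (new_url, (datetime.datetime.now() - begin_time))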

Total elapsed time:

最后通关的的网址是http://www.heibanke.com/lesson/crawler_ex01/, 耗时0:01:37.520779
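To see what the `<h3>` pattern from Method 2 does, it can be run against a couple of strings without any network access. This is a small Python 3 sketch. The first sample string is hypothetical, and the second uses the congratulation message that the last page shows. The pattern is written with the closing `>` of `</h3>` spelled out.

```python
import re

num_re = re.compile(r'<h3>[^\d<]*?(\d+)[^\d<]*?</h3>')

# Hypothetical fragment of an intermediate page.
with_number = '<h3>数字是36702.</h3>'
# Message shown once the level is cleared; it contains no digits.
without_number = '<h3>恭喜你,你找到了答案.继续你的爬虫之旅吧</h3>'

print(num_re.findall(with_number))     # ['36702']
print(num_re.findall(without_number))  # []
```

An empty result is exactly what the loop uses as its signal to stop and look for the link to the next level.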

Here is another way of extracting the number that is worth borrowing:

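# Fragment from inside the crawl loop; content holds the page HTML.
pattern = r'<h3>(.*)</h3>'
result = re.findall(pattern, content)
try:
    # Keep only the digit characters of the <h3> text and convert them to an int.
    num = int(''.join(map(lambda n: n if n.isdigit() else '', result[0])))
except (IndexError, ValueError):
    # No <h3> match, or no digits in it: leave the loop.
    break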
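The fragment above only runs inside a loop, so here is a self-contained Python 3 sketch of the same digit-filtering idea. The function name `extract_number` and the sample strings are made up for illustration, and a generator expression stands in for the `map`/`lambda` combination.

```python
import re


def extract_number(content):
    """Return the digits found in the first <h3> element as an int, or None."""
    result = re.findall(r'<h3>(.*)</h3>', content)
    try:
        # Keep only the digit characters, then convert the joined string.
        return int(''.join(ch for ch in result[0] if ch.isdigit()))
    except (IndexError, ValueError):
        # IndexError: no <h3> found. ValueError: int('') when there are no digits.
        return None


print(extract_number('<h3>数字是25338</h3>'))  # 25338
print(extract_number('<h3>no digits</h3>'))    # None
print(extract_number('no heading at all'))     # None
```

One caveat of converting to `int`: a number with a leading zero, such as `01234`, would become `1234`, so keeping the digits as a string is safer when they go straight back into a URL.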

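This snippet relies on a few things:
the join() function
the map() function
and lambda expressions

The join() function

join() is simply a concatenation function. See the following examples: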

>>> st1=['hello','world','','','j','i','m']
# Joining with an empty string simply concatenates the elements of the list
>>> ''.join(st1)
'helloworldjim'

# Joining with '.' as the separator: the empty-string elements still take up a position
>>> '.'.join(st1)
'hello.world...j.i.m'
# The same applies to a string, which is joined character by character
>>> st2='this is sendy'
>>> ''.join(st2)
'this is sendy'
>>> ':'.join(st2)
't:h:i:s: :i:s: :s:e:n:d:y'
>>>

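join()
Syntax: 'sep'.join(seq)
Parameters
sep: the separator; it may be an empty string
seq: the elements to join: a list, string, tuple, or dict (for a dict, the keys are joined); every element must be a string
In other words, all elements of seq are merged into one new string, with sep placed between them
Return value: the string produced by joining the elements with sep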
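The parameter rules above can be checked with a few lines of Python 3. This is only an illustrative sketch and the variable names are made up: tuples and dicts work the same way as lists, while non-string elements have to be converted first.

```python
# A tuple is joined just like a list.
joined_tuple = '-'.join(('a', 'b', 'c'))
print(joined_tuple)  # a-b-c

# For a dict, only the keys are joined (in insertion order on Python 3.7+).
joined_keys = ','.join({'x': 1, 'y': 2})
print(joined_keys)  # x,y

# Non-string elements raise TypeError, so convert them first.
numbers = [1, 2, 3]
try:
    '-'.join(numbers)
    join_failed = False
except TypeError:
    join_failed = True
print(join_failed)  # True

joined_numbers = '-'.join(str(n) for n in numbers)
print(joined_numbers)  # 1-2-3
```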

The map() function

map() applies a function to every element of the list passed in and returns a new list of the results.

def format_name(s):
    # Capitalize the first letter and lowercase the rest
    s1 = s[0:1].upper() + s[1:].lower()
    return s1

print map(format_name, ['adam', 'LISA', 'barT'])
# Input:  ['adam', 'LISA', 'barT']
# Output: ['Adam', 'Lisa', 'Bart']

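map() is a built-in higher-order function in Python. It takes a function f and a list, applies f to each element of the list in turn, and returns the results as a new list. (That is the Python 2 behavior; in Python 3, map() returns an iterator instead of a list.)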
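For comparison, here is the same example as a Python 3 sketch, where the result of `map()` has to be wrapped in `list()` before it can be printed as a list. The variable names are chosen for illustration.

```python
def format_name(s):
    """Capitalize the first letter and lowercase the rest."""
    return s[:1].upper() + s[1:].lower()


names = ['adam', 'LISA', 'barT']

mapped = list(map(format_name, names))
print(mapped)  # ['Adam', 'Lisa', 'Bart']

# A list comprehension produces the same result.
comprehended = [format_name(n) for n in names]
print(comprehended == mapped)  # True
```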

Using lambda

lambda works much like a def statement: the lambda keyword lets you write a small function in shorthand.

>>> aa=lambda : True if 4>6 else False
>>> aa()  
False  
>>> aa = lambda sr1:sr1+1  
>>> aa(5)  
6

The point of lambda is to express simple functions concisely.
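A typical place where this brevity pays off is a one-off `key` function. The Python 3 sketch below uses made-up sample values: it first shows a lambda next to its `def` equivalent, then sorts numeric strings by their integer value.

```python
# A lambda and the equivalent def.
add_one = lambda x: x + 1


def add_one_def(x):
    return x + 1


print(add_one(5), add_one_def(5))  # 6 6

# Sorting numeric strings: plain sorting compares character by character,
# while key=lambda converts each string to an int for the comparison.
numbers = ['64899', '9', '25338']
as_text = sorted(numbers)
by_value = sorted(numbers, key=lambda s: int(s))
print(as_text)   # ['25338', '64899', '9']
print(by_value)  # ['9', '25338', '64899']
```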

Method 3

Using the urllib2 and re libraries

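#!/usr/bin/python
# coding:utf-8
# Open the page and read its content with urllib2 (Python 2), then match the content with regular expressions.
import re
import urllib2
import datetime

begin_time = datetime.datetime.now()
url = 'http://www.heibanke.com/lesson/crawler_ex00/'
html = urllib2.urlopen(url).read()
# The first page introduces the number with "输入数字" ("enter the number").
index = re.findall(r'输入数字([0-9]{5})', html)

while index:
    url = 'http://www.heibanke.com/lesson/crawler_ex00/%s/' % index[0]
    print url
    html = urllib2.urlopen(url).read()
    # Later pages introduce it with "数字是" ("the number is").
    index = re.findall(r'数字是([0-9]{5})', html)

# No number left: the last page links to the next level.
html = urllib2.urlopen(url).read()
url = 'http://www.heibanke.com' + re.findall(r'<a href="(.*?)" class', html)[0]
# Prints "The URL after clearing the level is ..., elapsed ..."
print '最后通关的的网址是%s, 耗时%s' % (url, (datetime.datetime.now() - begin_time))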

Total elapsed time:

最后通关的的网址是http://www.heibanke.com/lesson/crawler_ex01/, 耗时0:00:42.172931

Method 4

Using the urllib2, re, and BeautifulSoup libraries

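#!/usr/bin/python
# coding:utf-8
# This method uses bs4 (BeautifulSoup) to pull out the useful part of the page, then processes it with a regular expression.
import re
import urllib2
import datetime
from bs4 import BeautifulSoup


begin_time = datetime.datetime.now()
url = 'http://www.heibanke.com/lesson/crawler_ex00/'
url2 = url

while True:
    print '正在爬取', url2  # "Crawling"
    html = urllib2.urlopen(url2).read()
    soup = BeautifulSoup(html, 'html.parser', from_encoding='utf8')
    str1 = soup.find_all('h3')  # all <h3> elements
    str2 = str1[0].get_text()  # text of the first <h3> as a string
    str3 = re.findall(r'[\d]{5}', str2)  # extract the 5-digit number
    if len(str3) == 0:  # no number left: this is the last page, so leave the loop
        new_url = 'http://www.heibanke.com' + re.findall(r'<a href="(.*?)" class', html)[0]
        break
    else:
        url2 = url + str3[0]  # build the next URL

# Print new_url (the link to the next level), not url (the starting page).
print '最后通关的的网址是%s, 耗时%s' % (new_url, (datetime.datetime.now() - begin_time))

Total elapsed time (this recorded run still shows the starting URL crawler_ex00, because the script as originally run printed url instead of new_url in its last line):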

最后通关的的网址是http://www.heibanke.com/lesson/crawler_ex00/, 耗时0:00:43.508280

Method 5

Using webdriver together with re regular expressions

#!/usr/bin/python
# coding:utf-8
# This method uses webdriver to load the page, then processes the text with a regular expression.
# It targets the Selenium releases of the time; newer Selenium versions no longer
# include PhantomJS support or the find_element_by_* helpers.
import re
import datetime
from selenium import webdriver
import sys
reload(sys)  # Python 2 only
sys.setdefaultencoding('utf-8')

begin_time = datetime.datetime.now()
url = 'http://www.heibanke.com/lesson/crawler_ex00/'

driver = webdriver.PhantomJS()
driver.get(url)
content = driver.find_element_by_tag_name('h3').text
print content
content = re.findall('([0-9]{5})', content)

while True:
    if len(content) == 0:  # no number left: this is the last page, so leave the loop
        content = driver.find_element_by_xpath('/html/body/div/div/div[2]/a')
        url = content.get_attribute('href')
        break
    else:
        url = 'http://www.heibanke.com/lesson/crawler_ex00/%s' % content[0]  # build the next URL
        driver.get(url)
        content = driver.find_element_by_tag_name('h3').text
        print content
        content = re.findall('([0-9]{5})', content)
# Prints "The URL after clearing the level is ..., elapsed ..."
print '最后通关的的网址是%s, 耗时%s' % (url, (datetime.datetime.now() - begin_time))

driver.quit()

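Elapsed time (the first line is the text of the last page's h3 element, which reads "Congratulations, you found the answer. Continue your crawler journey"):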

恭喜你,你找到了答案.继续你的爬虫之旅吧  
最后通关的的网址是http://www.heibanke.com/lesson/crawler_ex01/, 耗时0:02:07.484190

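There is a small trick in these scripts for measuring how long the program runs:

datetime.datetime.now()

Call it once when the program starts and once when it ends, then subtract the first result from the second to get the running time.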
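As a minimal Python 3 sketch of this trick: subtracting two `datetime` values yields a `timedelta`, which prints in the `0:00:49.396000` style seen in the outputs and can also be converted to seconds. The `time.sleep` call only stands in for the real work.

```python
import datetime
import time

begin_time = datetime.datetime.now()
time.sleep(0.1)  # placeholder for the actual crawl
elapsed = datetime.datetime.now() - begin_time

print(elapsed)                  # a timedelta, printed as H:MM:SS.ffffff
print(elapsed.total_seconds())  # the same duration as a float number of seconds
```

`datetime.now()` reads the wall clock, so it can jump if the system time is adjusted. For benchmarking, `time.perf_counter()` is a common alternative.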

Tips:

Installing BeautifulSoup:

pip install bs4

Installing selenium:

pip install selenium

PhantomJS also has to be downloaded separately from the web.

posted @ 2017-03-03 18:35 枫奇丶宛南