python 学习之爬虫练习
通过学习python,写两个简单的爬虫,没用线程,本地抓取速度还不错,有些瑕疵就是抓的图片有些显示不出来,代码做个笔记记录下:
# -*- coding:utf-8 -*- import re import urllib.request import os url = "http://www.58pic.com/yuanchuang/0/day-" def getHtml(url): page = urllib.request.urlopen(url) html = page.read().decode('gbk') return html def getImg(html,num): reg = r'src="(.*?)" ' imgre = re.compile(reg) imglist = re.findall(imgre,html) x = 0 os.mkdir(r"G:\collect/%d" % num) filePath = r"G:\collect/%d/" % num for imgurl in imglist: f=open(filePath+str(x)+".jpg",'wb') req=urllib.request.urlopen(imgurl) buf=req.read() f.write(buf) x+=1 for i in range(1,10): getUrl = url+"%d.html" % i print(getUrl) html = getHtml(getUrl) #print(html) print(getImg(html,i))
最终的结果如下图:
根据上面的初步代码,优化后加强版的爬虫代码,对于链接的状态异常的抛出异常后在继续执行程序。代码如下:
# -*- coding:utf-8 -*- import re import urllib.request import os url = "http://www.58pic.com/psd/" def getHtml(url): page = urllib.request.urlopen(url) html = page.read().decode('gbk') return html def getImg(html,num): reg = r'src="(.+?\.jpg)" class="show-area-pic" id="show-area-pic" alt="(.*?)"' imgre = re.compile(reg) imglist = re.findall(imgre,html) print(imglist) filePath = r"F:\Py/collect/%d/" % num isCreate = os.path.exists(filePath) if isCreate == False : os.mkdir(r"F:\Py/collect/%d" % num) for img in imglist: title = img[1] f=open(filePath+title+".jpg",'wb') req=urllib.request.urlopen(img[0]) buf=req.read() f.write(buf) for i in range(22797263,22797666): getUrl = url+"%d.html" % i #status = urllib.request.urlopen(getUrl).code try: html = getHtml(getUrl) #print(html) getImg(html,i) except urllib.request.URLError as e: print(e.code) print(e.reason)
一个人静静坐在电脑面前写代码的感觉,那是什么感觉?那是武林高手闭关修炼的感觉。