Python crawler (2): crawling images from Gamersky
An entry-level Python crawler.
Goal: crawl the images from a Gamersky (游民星空) article.
A couple of days ago I saw some beautiful pictures on Gamersky and wanted to save them, but right-clicking and saving them one by one is tedious. As a programmer, why save them in such a low-tech way?
I've recently started learning Python, and even as a beginner I couldn't resist using it to grab all the images. Python really is that handy.
Without further ado, here is the source code:
#!/usr/bin/python
# -*- coding: UTF-8 -*-
import urllib
import re

# Fetch the page source
def getHtml(url):
    page = urllib.urlopen(url)
    html = page.read()
    return html

# Find every image URL in the page, then save each image to disk
def getImg(html, count):
    # reg = r'"http.+?\.jpg'
    imgre = re.compile(r'src="(http.+?\.jpg)">')
    imglist = re.findall(imgre, html)
    x = 0
    for imgurl in imglist:
        urllib.urlretrieve(imgurl, '_picture_JM_%s_%s.jpg' % (x, count))
        x += 1
if __name__ == "__main__":
    count = 0
    html = getHtml("http://www.gamersky.com/ent/201605/752759.shtml")
    getImg(html, count)
    count += 1
    for i in range(2, 10):
        url = "http://www.gamersky.com/ent/201605/752759_%d.shtml" % (i)
        print "Start crawling", url
        html1 = getHtml(url)
        count += 1
        getImg(html1, count)
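The regex extraction used by getImg() can be sanity-checked offline before pointing the crawler at a live page. The snippet below runs the same `src="(http.+?\.jpg)">` pattern against a hand-written HTML fragment (the image URLs here are invented for illustration; the real page markup may differ):

```python
import re

# The same thumbnail pattern used in getImg() above
imgre = re.compile(r'src="(http.+?\.jpg)">')

# Hypothetical page fragment shaped like the markup the pattern expects
sample = ('<p><img src="http://img1.gamersky.com/image2016/05/a1.jpg">'
          '<img src="http://img1.gamersky.com/image2016/05/a2.jpg"></p>')

print(re.findall(imgre, sample))
# → ['http://img1.gamersky.com/image2016/05/a1.jpg',
#    'http://img1.gamersky.com/image2016/05/a2.jpg']
```

Note the non-greedy `.+?`: with a greedy `.+`, the first match would swallow everything up to the last `.jpg` in the page and return a single bogus URL.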
The output log looks like this:
Start crawling http://www.gamersky.com/ent/201605/752759_2.shtml
Start crawling http://www.gamersky.com/ent/201605/752759_3.shtml
Start crawling http://www.gamersky.com/ent/201605/752759_4.shtml
Start crawling http://www.gamersky.com/ent/201605/752759_5.shtml
Start crawling http://www.gamersky.com/ent/201605/752759_6.shtml
Start crawling http://www.gamersky.com/ent/201605/752759_7.shtml
Start crawling http://www.gamersky.com/ent/201605/752759_8.shtml
Start crawling http://www.gamersky.com/ent/201605/752759_9.shtml
The crawled images look like this:
Statement: this post is the blogger's original work; please cite the source when reposting ---- qiqiyingse_我是JM
Source download: http://download.csdn.net/detail/qiqiyingse/9565053
Tip: to grab the full-size images instead of the thumbnails, swap in this regular expression:
#imgre = re.compile(r'src="(http.+?\.jpg)">')
imgre = re.compile(r'href="http.+?\.shtml\?(http.+?\.jpg)">')
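To see the difference between the two patterns, you can run both against a fragment where a link carrying the full-size image URL in its query string wraps the thumbnail <img> tag, which is roughly how the gallery markup is shaped (the URLs below are invented for illustration):

```python
import re

# Hypothetical gallery markup: the href's query string carries the
# full-size image URL; the nested <img> src is only the thumbnail
sample = ('<a href="http://www.gamersky.com/ent/201605/752759_2.shtml'
          '?http://img1.gamersky.com/image2016/05/big.jpg">'
          '<img src="http://img1.gamersky.com/image2016/05/small.jpg"></a>')

thumb_re = re.compile(r'src="(http.+?\.jpg)">')
full_re = re.compile(r'href="http.+?\.shtml\?(http.+?\.jpg)">')

print(re.findall(thumb_re, sample))
# → ['http://img1.gamersky.com/image2016/05/small.jpg']
print(re.findall(full_re, sample))
# → ['http://img1.gamersky.com/image2016/05/big.jpg']
```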
Here is an updated version of the code with a few small fixes:
changed how the page content is fetched,
changed the regular-expression matching rule,
changed the file-naming scheme.
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
import urllib
import urllib2
import HTMLParser

# Fetch the page content
def getHtml(url):
    print u'start crawl %s ...' % url
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; rv:16.0) Gecko/20100101 Firefox/16.0'}
    req = urllib2.Request(url=url, headers=headers)
    try:
        html = urllib2.urlopen(req).read().decode('utf-8')
        # Unescape HTML entities, e.g. &quot; back into a double quote
        html = HTMLParser.HTMLParser().unescape(html)
    except urllib2.HTTPError, e:
        print u"Connection failed, reason: %s" % e.code
        return None
    except urllib2.URLError, e:
        if hasattr(e, 'reason'):
            print u"Connection failed, reason: %s" % e.reason
        return None
    return html

# Find every image URL in the page, then save each image to disk
def getImg(html, count):
    #reg = r'"http.+?\.jpg'
    #imgre = re.compile(r'src="(http.+?\.jpg)">')
    imgre = re.compile(r'href="http.+?\.shtml\?(http.+?\.jpg)">')
    imglist = re.findall(imgre, html)
    x = 0
    for imgurl in imglist:
        urllib.urlretrieve(imgurl, 'Picture_Jimy_%s_%s.jpg' % (count, x))
        x += 1

if __name__ == "__main__":
    print '''
    *****************************************
    **    Welcome to python of Image       **
    **      Modify on 2017-05-09           **
    **      @author: Jimy _Fengqi          **
    *****************************************
    '''
    count = 1
    html = getHtml("http://www.gamersky.com/ent/201605/752759.shtml")
    getImg(html, count)
    for i in range(2, 10):
        url = "http://www.gamersky.com/ent/201605/752759_%d.shtml" % (i)
        print "Start crawling", url
        html1 = getHtml(url)
        count += 1
        getImg(html1, count)
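Since urllib2 and HTMLParser were both reorganized away in Python 3, a minimal Python 3 sketch of the script above might look like this (the User-Agent string and file-name prefix are carried over from the Python 2 version; the error handling is deliberately simple):

```python
import html
import re
import urllib.error
import urllib.request

# Same full-size image pattern as the Python 2 version above
IMG_RE = re.compile(r'href="http.+?\.shtml\?(http.+?\.jpg)">')

def get_html(url):
    """Fetch a page; urllib.request replaces urllib2."""
    print('start crawl %s ...' % url)
    req = urllib.request.Request(
        url,
        headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.2; rv:16.0) '
                               'Gecko/20100101 Firefox/16.0'})
    try:
        raw = urllib.request.urlopen(req).read().decode('utf-8')
    except urllib.error.URLError as e:
        print('connection failed: %s' % e)
        return None
    # html.unescape() replaces HTMLParser.HTMLParser().unescape()
    return html.unescape(raw)

def get_img(page, count):
    """Save every full-size image found in the page source."""
    for x, imgurl in enumerate(IMG_RE.findall(page)):
        urllib.request.urlretrieve(
            imgurl, 'Picture_Jimy_%s_%s.jpg' % (count, x))

# Usage (performs network requests):
#   page = get_html('http://www.gamersky.com/ent/201605/752759.shtml')
#   if page:
#       get_img(page, 1)
```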
I have tidied the code up once more and added a Python 3 version; the details are on GitHub: https://github.com/JimyFengqi/JimyFengqi_spider/tree/master/01_youminxingkong