1. Use urllib2.urlopen to fetch the page content
2. Use a regular expression to extract the URLs of the .jpg images from the img src attributes
3. Save the images
urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface in the form of the urlopen function, which makes it possible to fetch URLs over a variety of protocols. It also provides a slightly more complex interface for handling common situations such as basic authentication, cookies and proxies; these are handled by objects called openers and handlers.
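As a rough illustration of the two interfaces described above, here is a minimal sketch (the proxy address, realm and credentials are made-up placeholders, not part of the original script):
Code
import urllib2
# simple case: urlopen fetches a page directly
page = urllib2.urlopen("http://photography.nationalgeographic.com/")
html = page.read()
page.close()
# more involved case: build an opener from handlers (basic auth + proxy)
auth_handler = urllib2.HTTPBasicAuthHandler()
auth_handler.add_password(realm="example realm",          # placeholder realm
                          uri="http://example.com/",      # placeholder protected URL
                          user="user", passwd="secret")   # placeholder credentials
opener = urllib2.build_opener(auth_handler,
                              urllib2.ProxyHandler({"http": "http://127.0.0.1:8080"}))
urllib2.install_opener(opener)  # later urlopen calls go through this opener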
Advanced features:
1. Note the URL pattern http://photography.nationalgeographic.com/ngs_pod_ext/searchPOD.jsp?month=08&day=26&year=2009&page=
Ideally the script would save every picture of a given year one by one (sparing a lot of manual Save work):
Code
#!/usr/bin/env python
#coding=utf-8
import urllib2
import re
import urllib
urltemplate = 'http://photography.nationalgeographic.com/ngs_pod_ext/searchPOD.jsp?month=%d&day=%d&year=2009&page='
urlList = [urltemplate %(month, day) for month in range(1, 13) for day in range(1, 32)]
# define a regex to get the img src
imgre = '<img alt="(?P<alt>[^"]*)" src="(?P<src>/staticfiles/NGS/Shared/StaticFiles/Photography/Images/POD/.+?-ga.jpg)">'
p = re.compile('<img.+?>.+?</a>', re.I|re.S)
p1 = re.compile(imgre, re.I|re.S)
for url in urlList:
    # get page html
    page = urllib2.urlopen(url)
    txt = page.read()
    page.close()
    m = p.findall(txt)
    for n in m:
        # keep only the img tags that match the picture-of-the-day pattern
        m1 = p1.search(n)
        if m1 is not None:
            tmp = m1.group("src")
            imgurl = "http://photography.nationalgeographic.com" + tmp
            n1 = tmp.split("/")
            urllib.urlretrieve(imgurl, "D:\\My Work\\" + n1[-1])
2. Save the downloaded images to a specified directory (see the sketch after this list).
3. Implement a GUI that grabs the images under a given URL (e.g. 163 photo albums, Tripntale albums).
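For item 2, a minimal sketch of saving into a caller-specified directory (the function name save_image and the example paths are assumptions, not part of the original code):
Code
import os
import urllib

def save_image(img_url, save_dir):
    """Download img_url into save_dir, keeping the original file name."""
    if not os.path.isdir(save_dir):
        os.makedirs(save_dir)            # create the target directory if needed
    filename = img_url.split("/")[-1]    # e.g. 'xxx-ga.jpg'
    urllib.urlretrieve(img_url, os.path.join(save_dir, filename))

# usage: save_image("http://photography.nationalgeographic.com" + tmp, "D:\\My Work")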
1. Initial version: it only extracts the image information and does not actually save anything to the local disk.
Code
import urllib2
import re
# get page html
page = urllib2.urlopen("http://photography.nationalgeographic.com/ngs_pod_ext/searchPOD.jsp?month=06&day=10&year=2009&page=")
txt = page.read()
#txt2 = page.read()
page.close()
# define a regex to get the img src
imgre = '<img alt="(?P<alt>[^"]*)" src="(?P<src>/staticfiles/NGS/Shared/StaticFiles/Photography/Images/POD/.+?-ga.jpg)">'
# define a regex to get summary
summaryre = '<div class="summary">\s*<h1 class="podsummary">(?P<podsummary>[^<h>]*)</h1>\s*<p class="credit">(?P<credit>[^</>]*)</p>\s*<div class="description">(?P<desc>.*?)<div style="float:right'
# get img alt and source
m2 = re.search(imgre, txt)
if m2 is not None:
    print "get picture alt is '%s', src is 'http://photography.nationalgeographic.com%s'" % \
        (m2.group("alt"), m2.group("src"))
# get description
m3 = re.search(summaryre, txt, re.I|re.M|re.S)
if m3 is not None:
    print "photo desc: summary is '%s', credit by '%s', description is '%s'" % (m3.group("podsummary"), m3.group("credit"), m3.group("desc"))
2. Optimized version: it filters all the img tags on the page, picks out the Picture of the Day photo, and saves it to a local directory.
Code
import urllib2
import re
import urllib
# get page html
page = urllib2.urlopen("http://photography.nationalgeographic.com/ngs_pod_ext/searchPOD.jsp?month=08&day=26&year=2009&page=")
txt = page.read()
page.close()
# define a regex to get the img src
imgre = '<img alt="(?P<alt>[^"]*)" src="(?P<src>/staticfiles/NGS/Shared/StaticFiles/Photography/Images/POD/.+?-ga.jpg)">'
p = re.compile('<img.+?>.+?</a>', re.I|re.S)
m = p.findall(txt)
p1 = re.compile(imgre, re.I|re.S)
for n in m:
    # keep only the img tags that match the picture-of-the-day pattern
    m1 = p1.search(n)
    if m1 is not None:
        tmp = m1.group("src")
        url = "http://photography.nationalgeographic.com" + tmp
        n1 = tmp.split("/")
        urllib.urlretrieve(url, "D:\\My Work\\" + n1[-1])
3. While thinking about how to save everything locally, I ran into a problem: I needed to parameterize month and day in nested loops, with month in (1, 13) and day in (1, 32), and I had no idea how to do it. I asked on chinaunix, but nobody answered, probably because the question was too simple...
Later, while studying the syntax in detail, I came across print [(x, y) for x in range(3) for y in range(3)] and suddenly realized that list comprehensions can be nested this way. That led to the version shown above under item 1 of the advanced features (I have not had time to debug it yet; I will update it once it runs, but the basic idea is the same).
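A small self-contained illustration of that nested comprehension (not from the original post; it only shows why the one-liner replaces the two nested for loops):
Code
# the comprehension below is equivalent to the nested for loops that follow
pairs = [(month, day) for month in range(1, 13) for day in range(1, 32)]

pairs2 = []
for month in range(1, 13):
    for day in range(1, 32):
        pairs2.append((month, day))

assert pairs == pairs2   # both give 12 * 31 = 372 (month, day) pairs
print len(pairs)         # 372; note that invalid combinations such as Feb 30 are also generated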