Python Practice: Google's Python Class
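First, a quick introduction to regular expressions:
1) Python supports regular expressions through the re module, so the first step is: import re
2) A few commonly used functions:
match = re.search(pat, str)
Notes: 1. On success, match is a match object and match.group() returns the matched text; on failure, re.search returns None.
2. search scans str from the beginning and stops at the first match.
3. The whole pattern must be matched, but the pattern does not have to cover the whole string.
4. The search first finds the leftmost position where the pattern can match, then extends the match as far to the right as possible.
list = re.findall(pat, str) finds all matches and returns them as a list.
Notes: 1. You can hand the entire text of a file to findall with f.read().
2. If the pattern contains two or more () groups, findall returns a list of tuples, one tuple per match; with a single group it returns a list of that group's text.
re.sub(pat, replacement, str) finds all matches and replaces them; the replacement string may contain \1, \2 to refer to the text of group(1), group(2).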
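The three functions above can be tried side by side. This is only a minimal sketch: the sample sentence and the e-mail pattern are made up for illustration and are not taken from the class material.

```python
import re

# Made-up sample text for illustration.
text = 'Contact alice@example.com or bob@example.org for details.'

# re.search returns a match object for the first match, or None on failure.
match = re.search(r'[\w.-]+@[\w.-]+', text)
first = match.group() if match else None

# Without groups, findall returns a list of the matched strings.
emails = re.findall(r'[\w.-]+@[\w.-]+', text)

# With two groups, findall returns a list of (group1, group2) tuples.
pairs = re.findall(r'([\w.-]+)@([\w.-]+)', text)

# In the replacement string, \1 and \2 refer to group(1) and group(2).
rewritten = re.sub(r'([\w.-]+)@([\w.-]+)', r'\1 at \2', text)

print(first)
print(emails)
print(pairs)
print(rewritten)
```

Checking the result of re.search against None before calling group() avoids an AttributeError when nothing matches.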
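3) Basic patterns
Ordinary characters match themselves; metacharacters get special treatment: . ^ $ * + ? { [ ] \ | ( )
. matches any single character except \n
\w matches one "word" character [a-zA-Z0-9_]
\W matches any character that is not a word character
\b the boundary between a word character and a non-word character
\s matches a single whitespace character [ \n\r\t\f]
\S matches any non-whitespace character
\t, \n, \r tab, newline, carriage return
\d a decimal digit [0-9]
^ start of the string, $ end of the string
\ removes the special meaning of the character that follows
[] specifies a set of characters; note that inside [] a . is just a literal dot, and [^...] negates the set
() groups part of the pattern so that part of the matched text can be extracted
Repetition:
+ one or more occurrences
* zero or more occurrences
? zero or one occurrence; a ? placed after a quantifier (as in *? or +?) turns off greedy matching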
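A few of the patterns above are easier to see on small inputs, especially the difference between greedy and non-greedy repetition. The strings below are invented for the example; treat this as a sketch rather than a complete tour.

```python
import re

# Invented sample input.
html = '<b>foo</b> and <i>bar</i>'

# Greedy: .* extends as far right as it can, so one match spans everything.
greedy = re.findall(r'<.*>', html)

# Non-greedy: .*? stops at the first > it can, so each tag matches separately.
lazy = re.findall(r'<.*?>', html)

# Inside [] a dot is a literal dot, not "any character".
dots = re.findall(r'[.]', 'a.b.c')

# [^...] negates the set: runs of characters that are neither digits nor whitespace.
non_digits = re.findall(r'[^\d\s]+', 'tel 555 1234 ext')

# \b keeps \d+ from matching the digit inside the word a1b.
numbers = re.findall(r'\b\d+\b', 'room 42, floor 7, a1b')

print(greedy)
print(lazy)
print(dots)
print(non_digits)
print(numbers)
```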
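Bug fixed:
Win7 + MinGW:
When calling commands.getstatusoutput(), the function wraps cmd in { ...; } before running it. The Windows shell misinterprets the braces, so the function has to be patched to skip the wrapping on Windows: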
def getstatusoutput(cmd):
    """Return (status, output) of executing cmd in a shell."""
    import sys
    mswindows = (sys.platform == "win32")
    import os
    # Only wrap the command in { ...; } when not running on Windows.
    if not mswindows:
        cmd = '{ ' + cmd + '; }'
    pipe = os.popen(cmd + ' 2>&1', 'r')
    text = pipe.read()
    sts = pipe.close()
    if sts is None: sts = 0
    if text[-1:] == '\n': text = text[:-1]
    return sts, text
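To see what the patch changes without actually running a shell, the command-building step can be pulled out on its own. The helper name wrap_command and its platform parameter are hypothetical, introduced only for this sketch; platform stands in for sys.platform.

```python
def wrap_command(cmd, platform):
    """Build the string the patched function would hand to os.popen.

    Hypothetical helper for illustration; `platform` stands in for sys.platform.
    """
    if platform != 'win32':
        # POSIX shells: group the command so that 2>&1 applies to all of it.
        cmd = '{ ' + cmd + '; }'
    # On Windows the braces are left out and only the redirection is appended.
    return cmd + ' 2>&1'


print(wrap_command('ls -l', 'linux2'))
print(wrap_command('dir', 'win32'))
```

The first call prints the brace-wrapped form and the second prints the plain command followed by the redirection, which is the form the patch produces on Windows.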
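Google's Python Class covers the basics: strings, lists, sorting, dictionaries and files, regular expressions, and a few utility modules.
The exercises cover strings and lists; regular expressions and files; and the utility modules. Reference solutions are provided.
The last exercise is especially worthwhile: it extracts image URLs from a file, downloads the images, and generates an HTML file. With small changes it can be used to follow a website's content, which makes it good practice for beginners.
Here is the code (it scrapes a specific part of a Sina photo page):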
#!/usr/bin/python
# -*- coding: utf-8 -*-
# Copyright 2010 Google Inc.
# Licensed under the Apache License, Version 2.0
# http://www.apache.org/licenses/LICENSE-2.0

# Google's Python Class
# http://code.google.com/edu/languages/google-python-class/

# Note: written for Python 2 (print statement, urllib.urlretrieve).

import os
import re
import sys
import urllib


"""Logpuzzle exercise
Given an apache logfile, find the puzzle urls and download the images.

Here's what a puzzle url looks like:
10.254.254.28 - - [06/Aug/2007:00:13:48 -0700] "GET /~foo/puzzle-bar-aaab.jpg HTTP/1.0" 302 528 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.6) Gecko/20070725 Firefox/2.0.0.6"
"""


def grep_file(url):
    # Save the page at url as test.html in the current directory.
    filename = 'test.html'
    abspath = os.path.abspath(filename)
    urllib.urlretrieve(url, abspath)


def read_urls(filename):
    """Returns a list of the puzzle urls from the given log file,
    extracting the hostname from the filename itself.
    Screens out duplicate urls and returns the urls sorted into
    increasing order."""
    # +++your code here+++
    # Adapted here: instead of parsing a log file, scan the saved page for
    # the photo section and collect the <img> urls inside it.
    url = []
    piclist = []
    # Disabled: hostname extraction from the original logpuzzle solution.
    # firstpart = re.search(r'(.*)_(.*)', filename)
    # if firstpart:
    #     first = firstpart.group(2)

    try:
        f = open(filename, 'rU')
        # Disabled: line-by-line log parsing from the original solution.
        # for line in f:
        #     urline = re.search(r'GET\s(.*\.jpg)\sHTTP/1.0', line)
        #     if urline:
        #         urlpart = urline.group(1)
        #         str = 'http://' + first + urlpart
        #         if str not in url:
        #             url.append(str)
        # The page is GBK-encoded; convert it to UTF-8 to match this source file.
        url = re.findall('<!--写真 start-->([\w\W]*?)<!--写真 end-->', f.read().decode('gbk').encode('utf-8'))
        f.close()
    except IOError as e:
        sys.stderr.write("I/O error({0}): {1}".format(e.errno, e.strerror))
    # Disabled: custom sort key from the original solution.
    # def MyFn(name):
    #     base = os.path.basename(name)
    #     set = re.findall(r'(.*?)[-.]', base)
    #     if set:
    #         #print set[0],set[1],set[2]
    #         return set[2]
    #     else:
    #         return base
    # url = sorted(url, key=MyFn)
    # #url.sorted()
    for i in url:
        # extend, not assign, so that earlier blocks are not overwritten
        piclist.extend(re.findall(r'<img src="(.*?)"', i))
    return piclist


def download_images(img_urls, dest_dir):
    """Given the urls already in the correct order, downloads
    each image into the given directory.
    Gives the images local filenames img0, img1, and so on.
    Creates an index.html in the directory
    with an img tag to show each local image file.
    Creates the directory if necessary.
    """
    # +++your code here+++
    abspath = os.path.abspath(dest_dir)
    if not os.path.exists(abspath):
        os.mkdir(abspath)

    count = 0
    for i in img_urls:
        # os.path.join instead of a hard-coded '\' keeps the path portable
        fn = os.path.join(abspath, 'img' + str(count))
        print 'Retrieving...' + fn
        urllib.urlretrieve(i, fn)
        count += 1

    # create html
    toshow = ''
    htmlpath = os.path.join(abspath, 'index.html')

    f = open(htmlpath, 'w')
    for i in range(count):
        toshow += '<img src="img' + str(i) + '">'
    f.write(toshow)
    f.close()  # the original wrote f.close without parentheses, which never closed the file


def main():
    args = sys.argv[1:]

    if len(args) > 0 and args[0] == '-h':
        print 'usage: [--todir dir]'
        sys.exit(1)

    todir = ''
    if len(args) > 0 and args[0] == '--todir':
        todir = args[1]
        del args[0:2]

    url = 'http://ent.sina.com.cn/photo/'
    grep_file(url)

    #read_urls('test.html')
    img_urls = read_urls('test.html')

    if todir:
        download_images(img_urls, todir)
    else:
        print '\n'.join(img_urls)

if __name__ == '__main__':
    main()
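The extraction and index-page steps of the script can be tried without any network access. The sketch below runs the same two-stage regex idea on an in-memory string; the sample markup, the marker comments, and the function names extract_image_urls and build_index_html are all assumptions made for illustration, not part of the original script.

```python
import re

# Hypothetical page fragment; the marker comments imitate the ones the script looks for.
SAMPLE = (
    '<img src="http://example.com/banner.jpg">'
    '<!--photo start-->'
    '<a href="#"><img src="http://example.com/a.jpg"></a>\n'
    '<img src="http://example.com/b.jpg">'
    '<!--photo end-->'
)


def extract_image_urls(page, start='<!--photo start-->', end='<!--photo end-->'):
    """Collect img src values that sit between the start and end markers."""
    urls = []
    # [\w\W]*? matches any characters, newlines included, non-greedily.
    block_pat = re.escape(start) + r'([\w\W]*?)' + re.escape(end)
    for block in re.findall(block_pat, page):
        urls.extend(re.findall(r'<img src="(.*?)"', block))
    return urls


def build_index_html(count):
    """Build the img tags for local files img0, img1, ... as one string."""
    return ''.join('<img src="img%d">' % i for i in range(count))


urls = extract_image_urls(SAMPLE)
print(urls)
print(build_index_html(len(urls)))
```

The banner image is skipped because it sits outside the markers, which is the point of matching the section first and the img tags second.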