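Dive Into Python: Study Notes

Dive Into Python (《深入Python》): http://woodpecker.org.cn/diveintopython/toc/index.html

Most of the material in chapters 1 to 6 is already covered in A Byte of Python (《简明Python教程》).

Topics not covered there: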

Private functions: a function whose name starts with two underscores (and does not end with two underscores) is private.
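As a small illustration (the class and method names are made up for these notes, not taken from the book), a double-underscore name inside a class gets name-mangled, so the obvious call from outside fails:

```python
class Account:
    def __audit(self):            # two leading underscores: treated as private
        return 'audited'

    def close(self):
        return self.__audit()     # fine from inside the class


acct = Account()
inside = acct.close()             # works through the public method

try:
    acct.__audit()                # the name was mangled, so this lookup fails
    outside_ok = True
except AttributeError:
    outside_ok = False

# The mangled name is still reachable: privacy here is a convention,
# not strict enforcement.
mangled = acct._Account__audit()
```

The last line shows that the method can still be reached under its mangled name, so this is best read as a strong hint rather than access control.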

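Raw strings: prefix a string with r and the backslashes inside it no longer need to be doubled as \\. For example, '\\b' can be written r'\b'. Regular expressions should use raw strings, otherwise the pattern becomes hard to read.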
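A quick check of what the r prefix changes (variable names are only for illustration):

```python
escaped = '\\bROAD\\b'      # ordinary string: every backslash is doubled
raw = r'\bROAD\b'           # raw string: backslashes are written once

same = (escaped == raw)     # both spell the same 8 characters
backspace_len = len('\b')   # without r, '\b' is a single backspace character
raw_len = len(r'\b')        # with r, it is a backslash followed by 'b'
```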

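Chapter 7: Regular expressions

import re

\b matches a word boundary: r'\bROAD\b' matches ROAD as a standalone word.

$ matches the end of the string: r'\bROAD$' matches the word ROAD at the end of the string.

^ matches the start of the string.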

Example: replace the word 'ROAD' in a string with 'RD.':

s = '100 BROAD ROAD APT. 3'; re.sub(r'\bROAD\b', 'RD.', s)

Result: '100 BROAD RD. APT. 3'
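To see why the word boundaries matter, here is a small comparison against plain str.replace (the variable names naive and bounded are mine):

```python
import re

s = '100 BROAD ROAD APT. 3'

naive = s.replace('ROAD', 'RD.')           # also rewrites the ROAD inside BROAD
bounded = re.sub(r'\bROAD\b', 'RD.', s)    # only touches the standalone word

print(naive)
print(bounded)
```

The naive version damages BROAD, while the \b version leaves it alone.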

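A ? after a character means that character occurs 0 or 1 times. For example, '^M?M?M?$' matches '', 'M', 'MM' and 'MMM': re.search('^M?M?M?$', 'MMM')

A + after a character means that character occurs one or more times.

{ } gives an explicit repeat count: '^M?M?M?$' can be written as '^M{0,3}$'.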
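A minimal sketch checking that the two spellings accept the same inputs (the sample list is my own):

```python
import re

optional = '^M?M?M?$'    # three optional M's
counted = '^M{0,3}$'     # between 0 and 3 M's

samples = ['', 'M', 'MM', 'MMM', 'MMMM']

# For each sample, record whether each pattern matched.
results = [
    (bool(re.search(optional, text)), bool(re.search(counted, text)))
    for text in samples
]
```

Both patterns accept the first four samples and reject 'MMMM'.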

| means alternation (or), as in 'A|B'.

Example: validating Roman numerals. pattern = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'
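The pattern can be wrapped in a small checker to try it out. The helper name is_roman is hypothetical, added only for this sketch:

```python
import re

pattern = '^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$'


def is_roman(text):
    """Return True if text matches the Roman numeral pattern."""
    return re.search(pattern, text) is not None


valid = [is_roman(t) for t in ('MCMXC', 'MMMDCCCLXXXVIII', 'IV')]
invalid = [is_roman(t) for t in ('IIII', 'MMMM', 'VX')]

# Every part of the pattern is optional, so the empty string matches too.
empty_matches = is_roman('')
```

Because the empty string is accepted, a real validator would likely need a separate check for empty input.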

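\d matches any single digit.

\D matches any single non-digit character.

By default Python regular expressions are written in compact form, which is hard to read. They can instead be written in the verbose form below: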

pattern = """
    ^                   # beginning of string
    M{0,3}              # thousands - 0 to 3 M's
    (CM|CD|D?C{0,3})    # hundreds - 900 (CM), 400 (CD), 0-300 (0 to 3 C's),
                        #            or 500-800 (D, followed by 0 to 3 C's)
    (XC|XL|L?X{0,3})    # tens - 90 (XC), 40 (XL), 0-30 (0 to 3 X's),
                        #        or 50-80 (L, followed by 0 to 3 X's)
    (IX|IV|V?I{0,3})    # ones - 9 (IX), 4 (IV), 0-3 (0 to 3 I's),
                        #        or 5-8 (V, followed by 0 to 3 I's)
    $                   # end of string
    """

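A verbose pattern only works when the extra flag re.VERBOSE is passed, for example re.search(pattern, 'MMMDCCCLXXXVIII', re.VERBOSE). With that flag, unescaped whitespace and # comments inside the pattern are ignored.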
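What goes wrong if the flag is forgotten can be shown with a toy pattern (the pattern below is my own, not from the book):

```python
import re

# Hypothetical toy pattern, written in verbose form.
pattern = r"""
    ^        # beginning of string
    \d{3}    # exactly three digits
    $        # end of string
"""

with_flag = re.search(pattern, '123', re.VERBOSE)   # layout and comments ignored
without_flag = re.search(pattern, '123')            # layout taken literally
```

Without re.VERBOSE the newlines, spaces and comment text are treated as literal characters to match, so the search returns None instead of raising an error, which makes the mistake easy to miss.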

Example: parsing phone numbers

>>> phonePattern = re.compile(r'''
                # don't match beginning of string, number can start anywhere
    (\d{3})     # area code is 3 digits (e.g. '800')
    \D*         # optional separator is any number of non-digits
    (\d{3})     # trunk is 3 digits (e.g. '555')
    \D*         # optional separator
    (\d{4})     # rest of number is 4 digits (e.g. '1212')
    \D*         # optional separator
    (\d*)       # extension is optional and can be any number of digits
    $           # end of string
    ''', re.VERBOSE)

>>> phonePattern.search('work 1-(800) 555.1212 #1234').groups()
('800', '555', '1212', '1234')

Note that the parentheses in (x) define a remembered group. Only the parts of a pattern wrapped in parentheses can be retrieved with groups().
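A short comparison of the same pattern with and without parentheses (the sample number is arbitrary):

```python
import re

with_groups = re.search(r'(\d{3})-(\d{4})', '555-1212')
without_groups = re.search(r'\d{3}-\d{4}', '555-1212')

captured = with_groups.groups()      # one entry per parenthesized group
nothing = without_groups.groups()    # empty tuple: no groups were defined
whole = without_groups.group(0)      # the full match is still available
```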

Chapter 8: HTML processing (parsing HTML files and scraping data)

Download the HTML with urllib, then parse it with sgmllib (SGMLParser).
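sgmllib exists only in Python 2 and was removed in Python 3. As a rough Python 3 equivalent, the sketch below swaps in html.parser.HTMLParser from the standard library. This is not the book's code; the class name LinkCollector and the sample HTML are made up for illustration.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.urls.append(value)


# In real use the text would come from urllib.request.urlopen(...).read();
# a literal string is used here so the sketch runs offline.
html = '<p><a href="index.html">Home</a> and <a href="toc.html">Contents</a></p>'

parser = LinkCollector()
parser.feed(html)
parser.close()
print(parser.urls)
```

The overall shape matches the SGMLParser approach: subclass the parser, override a handler for start tags, then feed it the document text.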

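Note that from module import and import module are different.

import module keeps the module's namespace: its functions and attributes must be accessed through the module name.

from module import brings the specified functions and attributes into the current namespace, so they can be used directly without the module name.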
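The difference can be seen side by side using the standard math module (chosen here only as a convenient example):

```python
import math               # names stay inside the math namespace
from math import sqrt     # sqrt is bound directly in the current namespace

a = math.sqrt(16)         # must be qualified with the module name
b = sqrt(16)              # can be called directly

same_object = (sqrt is math.sqrt)   # both names refer to one function
```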

from xml.dom import minidom: xml is a package, that is, a directory containing the special file __init__.py.

Dictionary-based string formatting: '%(key)s'
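A minimal sketch of the '%(key)s' form; the dictionary contents are sample data of my own:

```python
params = {'server': 'example', 'database': 'master', 'uid': 'sa'}

# Each %(key)s is replaced by the value stored under that key.
line = '%(uid)s@%(server)s/%(database)s' % params

# A key may be used more than once in the same template.
twice = '%(uid)s and %(uid)s again' % params
```

Compared with positional '%s' formatting, the template names what goes where, which helps when there are many values.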

Parsing XML

Use ElementTree from the Python standard library:

import xml.etree.ElementTree as etree

tree = etree.parse('aaa.xml')

root = tree.getroot()

root.tag

for child in root:
    print(child.tag)  # loop body is missing in the notes; printing the tag is an assumption
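Since aaa.xml is not included with these notes, here is a self-contained sketch that parses an inline document with etree.fromstring instead of etree.parse. The XML content is invented for the example.

```python
import xml.etree.ElementTree as etree

# Made-up sample document standing in for aaa.xml.
xml_text = """
<feed>
  <title>notes</title>
  <entry id="1">first</entry>
  <entry id="2">second</entry>
</feed>
"""

root = etree.fromstring(xml_text.strip())

root_tag = root.tag                                  # name of the root element
tags = [child.tag for child in root]                 # iterate over direct children
ids = [e.get('id') for e in root.findall('entry')]   # read an attribute
texts = [e.text for e in root.findall('entry')]      # read element text
```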

Text encoding: Unicode

The book's network programming material seems rather dated, so I skipped it.

Python, study notes

posted on 2012-04-26 23:38 by VinceOniPhone
