Fetching a web page and extracting information with Python (program walkthrough)

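I recently needed to process web pages with Python for a project, so I have been studying the relevant material. The program below (Python 2, since it relies on urllib2 and the print statement) fetches a web page and extracts information from it: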

#------------------------------------------------------------------------------

import urllib2 # extensible library for opening URLs
import re  # regular expression module

#------------------------------------------------------------------------------
def main():
    userMainUrl = "http://www.songtaste.com/user/351979/"
    req = urllib2.Request(userMainUrl)  # request
    resp = urllib2.urlopen(req)         # response
    respHtml = resp.read()     # read html
    print "respHtml =", respHtml
    # Target element: <h1 class="h1user">crifan</h1>
    foundH1user = re.search(r'<h1\s+?class="h1user">(?P<h1user>.+?)</h1>', respHtml)
    print "foundH1user =", foundH1user
    if foundH1user:
        h1user = foundH1user.group("h1user")
        print "h1user =", h1user
    
###################################################################################
if __name__=='__main__':
    main()
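
The listing above runs only on Python 2, because urllib2 was merged into urllib.request in Python 3. As a minimal sketch of the same two fetch steps on Python 3, the helper below builds a request and reads the response. The function name fetch_html, the timeout value, and the UTF-8 fallback are assumptions made for illustration and are not part of the original program.

```python
import urllib.request


def fetch_html(url, timeout=10):
    """Fetch url and return its body decoded to text (illustrative helper)."""
    # Step 1: build the request object.
    req = urllib.request.Request(url)
    # Step 2: send it and obtain the server's response.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        raw = resp.read()  # the body arrives as bytes in Python 3
        # Use the charset declared in the response headers;
        # falling back to UTF-8 is an assumption of this sketch.
        charset = resp.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

With this helper, fetch_html("http://www.songtaste.com/user/351979/") would play the role of the Request/urlopen/read lines in main(), provided the site is still reachable.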

The goal of the program is to find, in the source of the page http://www.songtaste.com/user/351979/, the element

<h1 class="h1user">crifan</h1>

and then extract "crifan" from it.

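Reading a page from the network takes two steps: sending a request to the web server (urllib2.Request) and receiving the server's response (urllib2.urlopen). The core statement of the program is analyzed below: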

foundH1user = re.search(r'<h1\s+?class="h1user">(?P<h1user>.+?)</h1>', respHtml)

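This statement uses a regular expression to match the string '<h1 class="h1user">crifan</h1>'. The content between the opening <h1 ...> tag and </h1> is captured as a group named h1user.
Note: the '1' in "h1user" is the digit '1', not the letter 'l'.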
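
The statement can be tried without any network access. The sketch below applies the same pattern to a hard-coded string; sample_html is a hypothetical stand-in for the downloaded page source, and the second search shows what happens when the letter 'l' is typed instead of the digit '1'.

```python
import re

# Hypothetical sample standing in for the downloaded page source.
sample_html = '<div><h1 class="h1user">crifan</h1></div>'

# Same pattern as in the program: the named group h1user captures
# the text between the opening tag and </h1>.
found = re.search(r'<h1\s+?class="h1user">(?P<h1user>.+?)</h1>', sample_html)
user_name = found.group("h1user") if found else None
print(user_name)  # crifan

# With the letter 'l' in the class name, the pattern no longer matches.
typo_found = re.search(r'<h1\s+?class="hluser">(?P<h1user>.+?)</h1>', sample_html)
print(typo_found)  # None
```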

The program relies on the following:

1. re.search

re.search(pattern, string, flags=0)

Scan through string looking for the first location where the regular expression pattern produces a match, and return a corresponding MatchObject instance. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

class re.MatchObject

Match objects always have a boolean value of True.

Since match() and search() return None when there is no match, you can test whether there was a match with a simple if statement:

match = re.search(pattern, string)
if match:
    process(match)
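
As a small sketch of this None check applied to the program's pattern, the helper below returns the captured name or None. The function name extract_user and the sample strings are assumptions made for illustration. The last lines show the distinction drawn in the documentation: a zero-length match is still a match object, not None.

```python
import re

PATTERN = re.compile(r'<h1\s+?class="h1user">(?P<h1user>.+?)</h1>')


def extract_user(html):
    """Return the user name found in html, or None if the pattern does not match."""
    match = PATTERN.search(html)
    if match:  # a match object is always truthy
        return match.group("h1user")
    return None  # search() returned None: nothing matched


print(extract_user('<h1 class="h1user">crifan</h1>'))  # crifan
print(extract_user('<p>no heading here</p>'))          # None

# A zero-length match is not the same as "no match":
# x* matches the empty string at position 0 of 'abc'.
empty = re.search(r'x*', 'abc')
print(empty is not None, repr(empty.group()))  # True ''
```

Calling .group() directly on the result of a failed search would raise AttributeError, which is why the check comes first.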

2. group([group1, ...])

Match objects support the following methods and attributes:

group([group1, ...])

Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned.
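
The cases described in this paragraph can be checked one by one. The following sketch uses small patterns and input strings chosen only for illustration; they do not come from the original program.

```python
import re

m = re.search(r'<(h\d)\s+class="(\w+)">', '<h1 class="h1user">crifan</h1>')

whole = m.group()      # no argument: the entire match
also_whole = m.group(0)
tag = m.group(1)       # first parenthesized group
pair = m.group(1, 2)   # several arguments: a tuple, one item per argument
print(whole)  # <h1 class="h1user">
print(pair)   # ('h1', 'h1user')

# A group number larger than the number of groups raises IndexError.
try:
    m.group(3)
    out_of_range = False
except IndexError:
    out_of_range = True

# A group in a part of the pattern that did not match yields None.
unmatched = re.search(r'(a)|(b)', 'b').group(1)

# A group that matched several times returns its last match.
last_digit = re.search(r'(\d)+', '123').group(1)
print(unmatched, last_digit)  # None 3
```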

3. (?P<name>...)

(?P<name>...) assigns the name "name" to a group, so that the group can be accessed with group('name'). For example, in the program:

foundH1user.group("h1user")

Here foundH1user is a MatchObject instance and h1user is the group name.

The Python documentation describes the construct as follows:

Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is also a numbered group, just as if the group were not named.
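
To illustrate that a named group is also a numbered group, the sketch below adds a second named group to the program's pattern. The group name css and the sample string are hypothetical and are introduced only for this example.

```python
import re

m = re.search(
    r'<h1\s+?class="(?P<css>\w+)">(?P<h1user>.+?)</h1>',
    '<h1 class="h1user">crifan</h1>',
)

by_name = m.group("h1user")  # access by symbolic name
by_number = m.group(2)       # the same group is also group number 2
names = m.groupdict()        # all named groups as a dictionary
print(by_name, by_number)    # crifan crifan

# Defining the same group name twice is rejected when the pattern is compiled.
try:
    re.compile(r'(?P<a>x)(?P<a>y)')
    duplicate_rejected = False
except re.error:
    duplicate_rejected = True
```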

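4. Regular expression symbols used in the program

Common metacharacters

\s matches any whitespace character

. matches any character except a newline

Common quantifiers

+ repeats one or more times

? repeats zero or one time; placed directly after another quantifier, as in "\s+?" and ".+?", it instead makes that quantifier lazy (it matches as few characters as possible)

Given these meanings, "\s+?" in the program can be replaced by "\s+" without changing the result, because the whitespace must be followed by the literal "class" in either case. "\s?" also works for this page, where exactly one space separates "h1" and "class", but it would fail if two or more whitespace characters appeared there.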
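
The behavior of these quantifiers can be compared on small inputs. The strings below are hypothetical samples, not taken from the real page: the first part contrasts the lazy ".+?" with the greedy ".+", and the second part tries the three whitespace variants on one space and on two spaces.

```python
import re

html = '<h1 class="h1user">crifan</h1><h1 class="h1user">other</h1>'

# Lazy .+? stops at the first </h1>; greedy .+ runs to the last one.
lazy = re.search(r'<h1\s+?class="h1user">(.+?)</h1>', html).group(1)
greedy = re.search(r'<h1\s+?class="h1user">(.+)</h1>', html).group(1)
print(lazy)    # crifan
print(greedy)  # crifan</h1><h1 class="h1user">other

one_space = '<h1 class="h1user">crifan</h1>'
two_spaces = '<h1  class="h1user">crifan</h1>'
variants = [r'<h1\s+?class', r'<h1\s+class', r'<h1\s?class']

# True where the variant finds a match in the given string.
on_one_space = [re.search(p, one_space) is not None for p in variants]
on_two_spaces = [re.search(p, two_spaces) is not None for p in variants]
print(on_one_space)   # [True, True, True]
print(on_two_spaces)  # [True, True, False]
```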

References:

1. http://www.crifan.com/crawl_website_html_and_extract_info_using_python/

2. https://docs.python.org/2/library/re.html#re.MatchObject

3. http://deerchao.net/tutorials/regex/regex.htm