Parsing Html and XHtml

HTMLParser module
    The usual way to parse an html document with the HTMLParser module is to define a
    subclass of HTMLParser and implement the tag handling you need in it, by
    'overriding' the parent class's handle_starttag(), handle_data(), handle_endtag()
    and related methods.

    Example,
        Parse the <head> tag of htmlsample.html,
            <!-- htmlsample.html -->  -> file contents,
                '
                <html>
                <head><title>404 Not Found</title></head>
                <body bgcolor="white">
                <center><h1>404 Not Found</h1></center>
                <hr><center>nginx/1.12.2</center>
                </body>
                </html>
                '
        from html.parser import HTMLParser
        class ParsingHeadT(HTMLParser):
            def __init__(self):
                self.headtag = ''
                self.parsesemaphore = False
                HTMLParser.__init__(self)

            def handle_starttag(self, tag, attrs): # enable the semaphore
                if tag == 'head':
                    self.parsesemaphore = True

            def handle_data(self, data):           # collect the data we care about
                if self.parsesemaphore:
                    self.headtag = data

            def handle_endtag(self, tag):
                if tag == 'head':
                    self.parsesemaphore = False

            def getheadtag(self):
                return self.headtag

        if __name__ == "__main__":
            with open('htmlsample.html') as FH:
                pht = ParsingHeadT()
                pht.feed(FH.read())    # HTMLParser invokes the overridden methods
                                       # handle_starttag, handle_data and handle_endtag
                print("Head Tag : %s" % pht.getheadtag())

        Output,
           Head Tag : 404 Not Found

    The example above handles a simple, well-formed piece of html; in real-world html,
    however, there are a few more cases to consider, such as named character references
    like &copy; (the copyright sign) and &amp; (the ampersand).
        For these, the traditional approach is to override the parent class's
        handle_entityref() method,
            HTMLParser.handle_entityref(name)
                This method is called to process a named character reference of the form
                &name; (e.g. &gt;), where name is a general entity reference (e.g. 'gt').
                This method is never called if convert_charrefs is True.
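
        As an illustration, here is a minimal sketch of that approach (the class name and
        the sample string below are made up for this note); convert_charrefs has to be set
        to False, otherwise handle_entityref() is never called,
            from html.parser import HTMLParser
            from html.entities import name2codepoint

            class EntityRefParser(HTMLParser):
                def __init__(self):
                    # convert_charrefs=False so the entity handler actually fires
                    HTMLParser.__init__(self, convert_charrefs=False)
                    self.text = ''

                def handle_data(self, data):
                    self.text += data

                def handle_entityref(self, name):   # called for &copy;, &amp;, ...
                    self.text += chr(name2codepoint[name])

            if __name__ == "__main__":
                p = EntityRefParser()
                p.feed('<title>&copy; 2017 &amp; beyond</title>')
                print(p.text)                       # © 2017 & beyond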

    Numeric character reference conversion is another case that needs attention, i.e.
    decimal and hexadecimal references, which used to be handled by overriding
    handle_charref(),
        HTMLParser.handle_charref(name)
            This method is called to process decimal and hexadecimal numeric character
            references of the form &#NNN; and &#xNNN;. For example, the decimal equivalent
            for &gt; is &#62;, whereas the hexadecimal is &#x3E;; in this case the method
            will receive '62' or 'x3E'. This method is never called if convert_charrefs is True.
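
    Again a minimal sketch (names and the sample string are made up for this note) of a
    manual handle_charref() override, mirroring the logic in the appendix below,
        from html.parser import HTMLParser

        class CharRefParser(HTMLParser):
            def __init__(self):
                HTMLParser.__init__(self, convert_charrefs=False)
                self.text = ''

            def handle_data(self, data):
                self.text += data

            def handle_charref(self, name):     # receives e.g. '62' or 'x3E'
                if name.startswith('x'):
                    self.text += chr(int(name[1:], 16))    # hexadecimal reference
                else:
                    self.text += chr(int(name))            # decimal reference

        if __name__ == "__main__":
            p = CharRefParser()
            p.feed('<p>&#62;&#x3E;</p>')
            print(p.text)                       # >>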

    Note,
        Fortunately, Python 3 already takes care of the cases above for us, since
        convert_charrefs defaults to True (as of Python 3.5). Reusing the first example,
        let's add some special characters to the <head> tag of htmlsample.html and see,
            <!-- htmlsample.html -->
            <html>
            <head><title>&#62 &#x3E 404 &copy Not &gt Found & </title></head>
            <body bgcolor="white">
            <center><h1>404 Not Found</h1></center>
            <hr><center>nginx/1.12.2</center>
            </body>
            </html>

        Output of the example above,
                Head Tag : > > 404 © Not > Found &
                As the output shows, in Python 3 the example handles these special
                characters correctly without any extra code.
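
        To make the role of convert_charrefs explicit, here is a small sketch (the class
        name and sample string are made up for this note): with the default setting the
        references arrive already converted inside handle_data(), whereas with
        convert_charrefs=False only the plain text reaches handle_data() and the
        references go to handle_entityref()/handle_charref() instead,
            from html.parser import HTMLParser

            class ShowData(HTMLParser):
                def handle_data(self, data):
                    print("Data :", repr(data))

            # Default (convert_charrefs=True): references already converted.
            ShowData().feed('<title>&gt;&#62;&#x3E; &copy;</title>')
            # Data : '>>> ©'

            # convert_charrefs=False: only the literal text ' ' reaches handle_data().
            ShowData(convert_charrefs=False).feed('<title>&gt;&#62;&#x3E; &copy;</title>')
            # Data : ' '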

    However, html also contains a class of 'unbalanced' tags whose end tag may be
    omitted, such as <p> and <li>. When we try to process such tags with the example
    above, we find they are not parsed correctly, so the code needs to be extended to
    handle these 'unbalanced' tags as well.
        First extend htmlsample.html, taking the <li> tag as an example,
        <!-- htmlsample.html -->
        <html>
        <head><title>&#62 &#x3E 404 &copy Not &gt Found &</title>
        <body bgcolor="white">
        <center><h1>404 Not Found</h1></center>
        <hr><center>nginx/1.12.2</center>
        <ul>
            <li> First Reason
            <li> Second Reason
        </body>
        </html>

        This htmlsample.html can still be rendered by a browser, yet its <head> and <ul>
        tags have no matching end tag and <li> is an unbalanced tag. Let's add some logic
        to the earlier example to deal with these cases.

        Example,
            from html.parser import HTMLParser
            class Parser(HTMLParser):
                def __init__(self):
                    self.taglevels = []     # stack used to track the currently open tags
                    self.tags = ['head', 'ul', 'li']
                    self.parsesemaphore = False
                    self.data = ''
                    HTMLParser.__init__(self)

                def handle_starttag(self, tag, attrs): # enable the semaphore
                    # the same tag opening again on top of the stack (e.g. <li> after
                    # <li>) implies the previous one was never closed: close it first
                    if len(self.taglevels) and self.taglevels[-1] == tag:
                        self.handle_endtag(tag)
                    self.taglevels.append(tag)

                    if tag in self.tags:
                        self.parsesemaphore = True

                def handle_data(self, data):          # collect data of the tracked tags
                    if self.parsesemaphore:
                        self.data += data

                def handle_endtag(self, tag):
                    self.parsesemaphore = False

                def gettag(self):
                    return self.data

            if __name__ == "__main__":
                with open('htmlsample.html') as FH:
                    pht = Parser()
                    pht.feed(FH.read())    # HTMLParser invokes the overridden methods
                                           # handle_starttag, handle_data and handle_endtag
                    print("Head Tag : %s" % pht.gettag())

            Output,
                 Head Tag : > > 404 © Not > Found &
                 First Reason
                 Second Reason

Reference,
    https://docs.python.org/3.6/library/html.parser.html?highlight=htmlparse#html.parser.HTMLParser.handle_entityref

Appendix,
    The example given by the Python docs,
        from html.parser import HTMLParser
        from html.entities import name2codepoint

        class MyHTMLParser(HTMLParser):
            def handle_starttag(self, tag, attrs):
                print("Start tag:", tag)
                for attr in attrs:
                    print("     attr:", attr)

            def handle_endtag(self, tag):
                print("End tag  :", tag)

            def handle_data(self, data):
                print("Data     :", data)

            def handle_comment(self, data):
                print("Comment  :", data)

            def handle_entityref(self, name):
                c = chr(name2codepoint[name])
                print("Named ent:", c)

            def handle_charref(self, name):
                if name.startswith('x'):
                    c = chr(int(name[1:], 16))
                else:
                    c = chr(int(name))
                print("Num ent  :", c)

            def handle_decl(self, data):
                print("Decl     :", data)

        parser = MyHTMLParser()

    Examples,
        Parsing a doctype:

        >>> parser.feed('<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
        ...             '"http://www.w3.org/TR/html4/strict.dtd">')
        Decl     : DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd"

        Parsing an element with a few attributes and a title:

        >>> parser.feed('<img src="python-logo.png" alt="The Python logo">')
        Start tag: img
             attr: ('src', 'python-logo.png')
             attr: ('alt', 'The Python logo')

        >>> parser.feed('<h1>Python</h1>')
        Start tag: h1
        Data     : Python
        End tag  : h1

        The content of script and style elements is returned as is, without further parsing:

        >>> parser.feed('<style type="text/css">#python { color: green }</style>')
        Start tag: style
             attr: ('type', 'text/css')
        Data     : #python { color: green }
        End tag  : style

        >>> parser.feed('<script type="text/javascript">'
        ...             'alert("<strong>hello!</strong>");</script>')
        Start tag: script
             attr: ('type', 'text/javascript')
        Data     : alert("<strong>hello!</strong>");
        End tag  : script

        Parsing comments:

        >>> parser.feed('<!-- a comment -->'
        ...             '<!--[if IE 9]>IE-specific content<![endif]-->')
        Comment  :  a comment
        Comment  : [if IE 9]>IE-specific content<![endif]

        Parsing named and numeric character references and converting them to the correct
        char (note: these 3 references are all equivalent to '>'):

        >>> parser.feed('&gt;&#62;&#x3E;')
        Named ent: >
        Num ent  : >
        Num ent  : >

        Feeding incomplete chunks to feed() works, but handle_data() might be called more
        than once (unless convert_charrefs is set to True):

        >>> for chunk in ['<sp', 'an>buff', 'ered ', 'text</s', 'pan>']:
        ...     parser.feed(chunk)
        Start tag: span
        Data     : buff
        Data     : ered
        Data     : text
        End tag  : span

        Parsing invalid HTML (e.g. unquoted attributes) also works:

        >>> parser.feed('<p><a class=link href=#main>tag soup</p ></a>')
        Start tag: p
        Start tag: a
             attr: ('class', 'link')
             attr: ('href', '#main')
        Data     : tag soup
        End tag  : p
        End tag  : a

 

posted @ 2017-12-14 12:07  zzYzz