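XML and JSON both have plenty of parsing libraries, but how do we parse HTML?
TFHpple is a small wrapper that can parse HTML. It is built on top of libxml, and queries are written in XPath.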
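As a quick illustration of the XPath style, here is a minimal sketch of a TFHpple query. It assumes ARC and the older `search:` spelling of the query method, which matches the API used in this post (newer TFHpple releases call it `searchWithXPathQuery:`). The helper name `LogLinks` and the `//a` query are made up for the example.

```objc
#import <Foundation/Foundation.h>
#import "TFHpple.h"

// Hypothetical helper: logs the text and href of every <a> element.
static void LogLinks(NSData *htmlData)
{
    TFHpple *doc = [[TFHpple alloc] initWithHTMLData:htmlData];

    // Older TFHpple spells this -search:; newer versions use
    // -searchWithXPathQuery: instead.
    NSArray *links = [doc search:@"//a"];

    for (TFHppleElement *link in links) {
        NSLog(@"%@ -> %@", [link content], [link objectForKey:@"href"]);
    }
}
```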
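Today I came across an article that parses HTML with libxml directly, see: http://www.cocoanetics.com/2011/09/taming-html-parsing-with-libxml-1/#comment-3090 The diagram in it makes the node structure clear at a glance and is well worth bookmarking. The source code in that article does not traverse the whole HTML document, so I modified it slightly so that it walks the HTML and prints it out: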
    CFStringEncoding cfenc = CFStringConvertNSStringEncodingToEncoding(encoding);
    CFStringRef cfencstr = CFStringConvertEncodingToIANACharSetName(cfenc);
    const char *enc = CFStringGetCStringPtr(cfencstr, 0);
    htmlDocPtr _htmlDocument = htmlReadDoc([data bytes],
        [[baseURL absoluteString] UTF8String],
        enc,
        XML_PARSE_NOERROR | XML_PARSE_NOWARNING);
    xmlFreeDoc(_htmlDocument);
    xmlNodePtr currentNode = (xmlNodePtr)_htmlDocument;
    if (currentNode->type == XML_ELEMENT_NODE)
    NSMutableArray *attrArray = [NSMutableArray array];
    for (xmlAttrPtr attrNode = currentNode->properties; attrNode; attrNode = attrNode->next)
    xmlNodePtr contents = attrNode->children;
    [attrArray addObject:[NSString stringWithFormat:@"%s='%s'", attrNode->name, contents->content]];
    NSString *attrString = [attrArray componentsJoinedByString:@" "];
    if ([attrString length])
    attrString = [@" " stringByAppendingString:attrString];
    NSLog(@"<%s%@>", currentNode->name, attrString);
    else if (currentNode->type == XML_TEXT_NODE)
    NSLog(@"%@", [NSString stringWithCString:(const char *)currentNode->content encoding:NSUTF8StringEncoding]);
    else if (currentNode->type == XML_COMMENT_NODE)
    NSLog(@"/* %s */", currentNode->name);
    if (currentNode && currentNode->children)
    currentNode = currentNode->children;
    else if (currentNode && currentNode->next)
    currentNode = currentNode->next;
    currentNode = currentNode->parent;
    if (currentNode && currentNode->type == XML_ELEMENT_NODE)
    NSLog(@"</%s>", currentNode->name);
    if (currentNode->next)
    currentNode = currentNode->next;
    currentNode = currentNode->parent;
    if (currentNode && currentNode->type == XML_ELEMENT_NODE)
    NSLog(@"</%s>", currentNode->name);
    if (strcmp((const char *)currentNode->name, "table") == 0)
    if (currentNode == nodes->nodeTab[0])
    if (currentNode && currentNode->next)
    currentNode = currentNode->next;
    if (currentNode == nodes->nodeTab[0])
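For comparison, the same kind of walk can also be written recursively, which avoids the manual child/next/parent bookkeeping. The following is only a sketch under a few assumptions: the input is UTF-8, and the helper names `LogNode` and `LogHTML` are invented here. It uses `htmlReadMemory` rather than `htmlReadDoc` because it takes an explicit byte length, whereas `htmlReadDoc` expects a NUL-terminated string and `[data bytes]` is not guaranteed to be one.

```objc
#import <Foundation/Foundation.h>
#import <libxml/HTMLparser.h>
#import <libxml/tree.h>

// Hypothetical helper: logs `node`, its following siblings and all of
// their descendants depth-first, indenting two spaces per level.
static void LogNode(xmlNodePtr node, int depth)
{
    NSString *indent = [@"" stringByPaddingToLength:(NSUInteger)depth * 2
                                         withString:@" "
                                    startingAtIndex:0];

    for (xmlNodePtr cur = node; cur != NULL; cur = cur->next) {
        if (cur->type == XML_ELEMENT_NODE) {
            NSLog(@"%@<%s>", indent, (const char *)cur->name);
            LogNode(cur->children, depth + 1);   // recurse into children
            NSLog(@"%@</%s>", indent, (const char *)cur->name);
        } else if (cur->type == XML_TEXT_NODE && cur->content != NULL) {
            NSLog(@"%@%@", indent,
                  [NSString stringWithUTF8String:(const char *)cur->content]);
        }
    }
}

// Hypothetical entry point: parses UTF-8 HTML bytes and logs the tree.
static void LogHTML(NSData *data)
{
    htmlDocPtr doc = htmlReadMemory([data bytes], (int)[data length],
                                    NULL, "UTF-8",
                                    HTML_PARSE_NOERROR | HTML_PARSE_NOWARNING);
    if (doc == NULL) {
        return;
    }
    LogNode(xmlDocGetRootElement(doc), 0);
    xmlFreeDoc(doc);   // free only after the walk is finished
}
```

Recursion keeps the open and close tags paired automatically; the trade-off is stack depth on very deeply nested documents, where an iterative loop like the one above is safer.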
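Still, I prefer TFHpple because it is simple and easy to use, although it is not feature-complete. For example, it cannot return an element's child nodes, so I wrote two methods: one that returns the child nodes and one that returns all of the contents. In addition, the content of a node's attribute is stored under the same key as the node's own content, @"nodeContent"; properly, an attribute's content should be stored under @"attributeContent".
So I wrote the function below, which also changes the key used for an attribute's content.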
    NSDictionary *DictionaryForNode2(xmlNodePtr currentNode, NSMutableDictionary *parentResult)
    NSMutableDictionary *resultForNode = [NSMutableDictionary dictionary];
    NSString *currentNodeContent =
        [NSString stringWithCString:(const char *)currentNode->name encoding:NSUTF8StringEncoding];
    [resultForNode setObject:currentNodeContent forKey:@"nodeName"];
    if (currentNode->content)
    NSString *currentNodeContent = [NSString stringWithCString:(const char *)currentNode->content encoding:NSUTF8StringEncoding];
    if (currentNode->type == XML_TEXT_NODE)
    if (currentNode->parent->type == XML_ELEMENT_NODE)
    [parentResult setObject:currentNodeContent forKey:@"nodeContent"];
    if (currentNode->parent->type == XML_ATTRIBUTE_NODE)
    stringByTrimmingCharactersInSet:[NSCharacterSet whitespaceAndNewlineCharacterSet]]
    forKey:@"attributeContent"];
    xmlAttr *attribute = currentNode->properties;
    NSMutableArray *attributeArray = [NSMutableArray array];
    NSMutableDictionary *attributeDictionary = [NSMutableDictionary dictionary];
    NSString *attributeName =
        [NSString stringWithCString:(const char *)attribute->name encoding:NSUTF8StringEncoding];
    [attributeDictionary setObject:attributeName forKey:@"attributeName"];
    if (attribute->children)
    NSDictionary *childDictionary = DictionaryForNode2(attribute->children, attributeDictionary);
    [attributeDictionary setObject:childDictionary forKey:@"attributeContent"];
    if ([attributeDictionary count] > 0)
    [attributeArray addObject:attributeDictionary];
    attribute = attribute->next;
    if ([attributeArray count] > 0)
    [resultForNode setObject:attributeArray forKey:@"nodeAttributeArray"];
    xmlNodePtr childNode = currentNode->children;
    NSMutableArray *childContentArray = [NSMutableArray array];
    NSDictionary *childDictionary = DictionaryForNode2(childNode, resultForNode);
    [childContentArray addObject:childDictionary];
    childNode = childNode->next;
    if ([childContentArray count] > 0)
    [resultForNode setObject:childContentArray forKey:@"nodeChildArray"];
In TFHppleElement.m I added two key constants:
    NSString * const TFHppleNodeAttributeContentKey = @"attributeContent";
    NSString * const TFHppleNodeChildArrayKey = @"nodeChildArray";
and changed the method that returns the attributes to:
    - (NSDictionary *) attributes
    {
      NSMutableDictionary * translatedAttributes = [NSMutableDictionary dictionary];
      for (NSDictionary * attributeDict in [node objectForKey:TFHppleNodeAttributeArrayKey]) {
        [translatedAttributes setObject:[attributeDict objectForKey:TFHppleNodeAttributeContentKey]
                                 forKey:[attributeDict objectForKey:TFHppleNodeAttributeNameKey]];
      }
      return translatedAttributes;
    }
and added a method that returns the child nodes:
    NSArray *childs = [node objectForKey:TFHppleNodeChildArrayKey];
    - (NSArray *) children
    if ([self hasChildren])
    return [node objectForKey:TFHppleNodeChildArrayKey];
Finally, I also added a method that returns all of the content:
    - (NSString *)contentsAt:(NSString *)xPathOrCss;
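The post only shows the declaration of this method. As a rough idea of how all the content under a node could be gathered from the dictionary tree built earlier, here is a minimal sketch. It is not the author's implementation: `CollectContents` and `AllContents` are hypothetical helpers, and joining the pieces with a single space is an arbitrary choice. The only thing taken from the post is the pair of keys, @"nodeContent" and @"nodeChildArray".

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper, not part of TFHpple: appends every non-empty
// "nodeContent" string found in `node` and its descendants to `out`,
// in document order.
static void CollectContents(NSDictionary *node, NSMutableArray *out)
{
    id content = [node objectForKey:@"nodeContent"];
    if ([content isKindOfClass:[NSString class]] &&
        [(NSString *)content length] > 0) {
        [out addObject:content];
    }
    // Enumerating a missing (nil) child array simply does nothing.
    for (NSDictionary *child in [node objectForKey:@"nodeChildArray"]) {
        CollectContents(child, out);
    }
}

// Hypothetical helper: joins all collected strings with single spaces.
static NSString *AllContents(NSDictionary *node)
{
    NSMutableArray *parts = [NSMutableArray array];
    CollectContents(node, parts);
    return [parts componentsJoinedByString:@" "];
}
```

A method such as `contentsAt:` could then look up the element for the given query and pass its underlying dictionary to a helper of this kind.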
See the source code.