在python开发[第九篇],我们已经在request模块中,讲解了如何根据url去获取网页内容。
如果返回的是内容的格式python的基本数据类型,可以json将返回的字符串转为python的基本数据类型。但是大多数情况下,我们通过http协议请求一个url后,返回的却是?xml格式。基于这种常见的报文格式,python对其进行提供了相应模块,如下:
一、xml模块
XML是实现不同语言或程序之间进行数据交换的协议,XML文件格式如下:
<data> <country name="Liechtenstein"> <rank updated="yes">2</rank> <year>2023</year> <gdppc>141100</gdppc> <neighbor direction="E" name="Austria" /> <neighbor direction="W" name="Switzerland" /> </country> <country name="Singapore"> <rank updated="yes">5</rank> <year>2026</year> <gdppc>59900</gdppc> <neighbor direction="N" name="Malaysia" /> </country> <country name="Panama"> <rank updated="yes">69</rank> <year>2026</year> <gdppc>13600</gdppc> <neighbor direction="W" name="Costa Rica" /> <neighbor direction="E" name="Colombia" /> </country> </data>
下面我们先试着请求一个地址,来查看其返回的结果:
#检查QQ是否在线的
import requests r = requests.get("http://www.webxml.com.cn//webservices/qqOnlineWebService.asmx/qqCheckOnline?qqCode=1223995142") result = r.text print(result) """ 执行结果: <?xml version="1.0" encoding="utf-8"?> <string xmlns="http://WebXml.com.cn/">Y</string> """
由上面可以看出,请求后返回的是xml格式的。为了能便捷的处理xml内容,python提供了内置模块模块xml模块,在python的Lib目录下。
1.xml报文的解析
第一种方法:
from xml.etree import ElementTree as ET #在poython的Lib目录中有xml/etree目录。并在该目录下导入ElementTree.py文件,并重命名为ET # 打开文件,读取XML内容 str_xml = open('first.xml', 'r').read() # 将字符串解析成xml特殊对象,root代指xml文件的根节点 root = ET.XML(str_xml) #调用ElementTree文件中的XML()方法
print(root,type(root)) #执行结果:<Element 'data' at 0x00702BA0> <class 'xml.etree.ElementTree.Element'>
#可见,root是Element的类对象
第二种方法:
from xml.etree import ElementTree as ET # 利用ElementTree.parse将文件直接解析成xml对象 。不需要打开文件 tree = ET.parse("first.xml") print(tree,type(tree)) #执行结果:<xml.etree.ElementTree.ElementTree object at 0x0067FDB0> <class 'xml.etree.ElementTree.ElementTree'> # 获取xml文件的根节点 root = tree.getroot() #此时生成的tree是ElementTree的类对象 print(root,type(root)) #执行结果:<Element 'data' at 0x00502BA0> <class 'xml.etree.ElementTree.Element'>
#注意:1.在使用ElementTree类方法进行解析时,tree是ElementTree的对象。而root(根节点)却是Element的对象
2.不管采用什么方法解析xml。解析后,所有的节点(不论根节点,还是子节点)都是Element对象
3.ET.XML()是对字符串解析成xml。所以呢,如果是想解析xml文件,就需要先读取文件,获取到文件内容,然后再解析。
4.ET。parse()是只能解析文件,不能将字符串解析为XML
class Element: """An XML element. This class is the reference implementation of the Element interface. An element's length is its number of subelements. That means if you want to check if an element is truly empty, you should check BOTH its length AND its text attribute. The element tag, attribute names, and attribute values can be either bytes or strings. *tag* is the element name. *attrib* is an optional dictionary containing element attributes. *extra* are additional element attributes given as keyword arguments. Example form: <tag attrib>text<child/>...</tag>tail """ 当前节点的标签名 tag = None """The element's name.""" 当前节点的属性 attrib = None """Dictionary of the element's attributes.""" 当前节点的内容 text = None """ Text before first subelement. This is either a string or the value None. Note that if there is no text, this attribute may be either None or the empty string, depending on the parser. """ tail = None """ Text after this element's end tag, but before the next sibling element's start tag. This is either a string or the value None. Note that if there was no text, this attribute may be either None or an empty string, depending on the parser. """ def __init__(self, tag, attrib={}, **extra): if not isinstance(attrib, dict): raise TypeError("attrib must be dict, not %s" % ( attrib.__class__.__name__,)) attrib = attrib.copy() attrib.update(extra) self.tag = tag self.attrib = attrib self._children = [] def __repr__(self): return "<%s %r at %#x>" % (self.__class__.__name__, self.tag, id(self)) def makeelement(self, tag, attrib): 创建一个新节点 """Create a new element with the same type. *tag* is a string containing the element name. *attrib* is a dictionary containing the element attributes. Do not call this method, use the SubElement factory function instead. """ return self.__class__(tag, attrib) def copy(self): """Return copy of current element. This creates a shallow copy. Subelements will be shared with the original tree. """ elem = self.makeelement(self.tag, self.attrib) elem.text = self.text elem.tail = self.tail elem[:] = self return elem def __len__(self): return len(self._children) def __bool__(self): warnings.warn( "The behavior of this method will change in future versions. " "Use specific 'len(elem)' or 'elem is not None' test instead.", FutureWarning, stacklevel=2 ) return len(self._children) != 0 # emulate old behaviour, for now def __getitem__(self, index): return self._children[index] def __setitem__(self, index, element): # if isinstance(index, slice): # for elt in element: # assert iselement(elt) # else: # assert iselement(element) self._children[index] = element def __delitem__(self, index): del self._children[index] def append(self, subelement): 为当前节点追加一个子节点 """Add *subelement* to the end of this element. The new element will appear in document order after the last existing subelement (or directly after the text, if it's the first subelement), but before the end tag for this element. """ self._assert_is_element(subelement) self._children.append(subelement) def extend(self, elements): 为当前节点扩展 n 个子节点 """Append subelements from a sequence. *elements* is a sequence with zero or more elements. """ for element in elements: self._assert_is_element(element) self._children.extend(elements) def insert(self, index, subelement): 在当前节点的子节点中插入某个节点,即:为当前节点创建子节点,然后插入指定位置 """Insert *subelement* at position *index*.""" self._assert_is_element(subelement) self._children.insert(index, subelement) def _assert_is_element(self, e): # Need to refer to the actual Python implementation, not the # shadowing C implementation. if not isinstance(e, _Element_Py): raise TypeError('expected an Element, not %s' % type(e).__name__) def remove(self, subelement): 在当前节点在子节点中删除某个节点 """Remove matching subelement. Unlike the find methods, this method compares elements based on identity, NOT ON tag value or contents. To remove subelements by other means, the easiest way is to use a list comprehension to select what elements to keep, and then use slice assignment to update the parent element. ValueError is raised if a matching element could not be found. """ # assert iselement(element) self._children.remove(subelement) def getchildren(self): 获取所有的子节点(废弃) """(Deprecated) Return all subelements. Elements are returned in document order. """ warnings.warn( "This method will be removed in future versions. " "Use 'list(elem)' or iteration over elem instead.", DeprecationWarning, stacklevel=2 ) return self._children def find(self, path, namespaces=None): 获取第一个寻找到的子节点 """Find first matching element by tag name or path. *path* is a string having either an element tag or an XPath, *namespaces* is an optional mapping from namespace prefix to full name. Return the first matching element, or None if no element was found. """ return ElementPath.find(self, path, namespaces) def findtext(self, path, default=None, namespaces=None): 获取第一个寻找到的子节点的内容 """Find text for first matching element by tag name or path. *path* is a string having either an element tag or an XPath, *default* is the value to return if the element was not found, *namespaces* is an optional mapping from namespace prefix to full name. Return text content of first matching element, or default value if none was found. Note that if an element is found having no text content, the empty string is returned. """ return ElementPath.findtext(self, path, default, namespaces) def findall(self, path, namespaces=None): 获取所有的子节点 """Find all matching subelements by tag name or path. *path* is a string having either an element tag or an XPath, *namespaces* is an optional mapping from namespace prefix to full name. Returns list containing all matching elements in document order. """ return ElementPath.findall(self, path, namespaces) def iterfind(self, path, namespaces=None): 获取当前节点下的指定的子节点(只能是子节点,孙节点都不行),并创建一个迭代器(可以被for循环) """Find all matching subelements by tag name or path. *path* is a string having either an element tag or an XPath, *namespaces* is an optional mapping from namespace prefix to full name. Return an iterable yielding all matching elements in document order. """ return ElementPath.iterfind(self, path, namespaces) def clear(self): 清空节点 """Reset element. This function removes all subelements, clears all attributes, and sets the text and tail attributes to None. """ self.attrib.clear() self._children = [] self.text = self.tail = None def get(self, key, default=None): 获取当前节点的属性值 """Get element attribute. Equivalent to attrib.get, but some implementations may handle this a bit more efficiently. *key* is what attribute to look for, and *default* is what to return if the attribute was not found. Returns a string containing the attribute value, or the default if attribute was not found. """ return self.attrib.get(key, default) def set(self, key, value): 为当前节点设置属性值 """Set element attribute. Equivalent to attrib[key] = value, but some implementations may handle this a bit more efficiently. *key* is what attribute to set, and *value* is the attribute value to set it to. """ self.attrib[key] = value def keys(self): 获取当前节点的所有属性的 key """Get list of attribute names. Names are returned in an arbitrary order, just like an ordinary Python dict. Equivalent to attrib.keys() """ return self.attrib.keys() def items(self): 获取当前节点的所有属性值,每个属性都是一个键值对 """Get element attributes as a sequence. The attributes are returned in arbitrary order. Equivalent to attrib.items(). Return a list of (name, value) tuples. """ return self.attrib.items() def iter(self, tag=None): 在当前节点的子孙中根据节点名称寻找所有指定的节点,并返回一个迭代器(可以被for循环)。 """Create tree iterator. The iterator loops over the element and all subelements in document order, returning all elements with a matching tag. If the tree structure is modified during iteration, new or removed elements may or may not be included. To get a stable set, use the list() function on the iterator, and loop over the resulting list. *tag* is what tags to look for (default is to return all elements) Return an iterator containing all the matching elements. """ if tag == "*": tag = None if tag is None or self.tag == tag: yield self for e in self._children: yield from e.iter(tag) # compatibility def getiterator(self, tag=None): # Change for a DeprecationWarning in 1.4 warnings.warn( "This method will be removed in future versions. " "Use 'elem.iter()' or 'list(elem.iter())' instead.", PendingDeprecationWarning, stacklevel=2 ) return list(self.iter(tag)) def itertext(self): 在当前节点的子孙中根据节点名称寻找所有指定的节点的内容,并返回一个迭代器(可以被for循环)。 """Create text iterator. The iterator loops over the element and all subelements in document order, returning all inner text. """ tag = self.tag if not isinstance(tag, str) and tag is not None: return if self.text: yield self.text for e in self: yield from e.itertext() if e.tail: yield e.tail
class ElementTree: """An XML element hierarchy. This class also provides support for serialization to and from standard XML. *element* is an optional root element node, *file* is an optional file handle or file name of an XML file whose contents will be used to initialize the tree with. """ def __init__(self, element=None, file=None): # assert element is None or iselement(element) self._root = element # first node if file: self.parse(file) def getroot(self): """Return root element of this tree.""" return self._root def _setroot(self, element): """Replace root element of this tree. This will discard the current contents of the tree and replace it with the given element. Use with care! """ # assert iselement(element) self._root = element def parse(self, source, parser=None): """Load external XML document into element tree. *source* is a file name or file object, *parser* is an optional parser instance that defaults to XMLParser. ParseError is raised if the parser fails to parse the document. Returns the root element of the given source document. """ close_source = False if not hasattr(source, "read"): source = open(source, "rb") close_source = True try: if parser is None: # If no parser was specified, create a default XMLParser parser = XMLParser() if hasattr(parser, '_parse_whole'): # The default XMLParser, when it comes from an accelerator, # can define an internal _parse_whole API for efficiency. # It can be used to parse the whole source without feeding # it with chunks. self._root = parser._parse_whole(source) return self._root while True: data = source.read(65536) if not data: break parser.feed(data) self._root = parser.close() return self._root finally: if close_source: source.close() def iter(self, tag=None): """Create and return tree iterator for the root element. The iterator loops over all elements in this tree, in document order. *tag* is a string with the tag name to iterate over (default is to return all elements). """ # assert self._root is not None return self._root.iter(tag) # compatibility def getiterator(self, tag=None): # Change for a DeprecationWarning in 1.4 warnings.warn( "This method will be removed in future versions. " "Use 'tree.iter()' or 'list(tree.iter())' instead.", PendingDeprecationWarning, stacklevel=2 ) return list(self.iter(tag)) def find(self, path, namespaces=None): """Find first matching element by tag name or path. Same as getroot().find(path), which is Element.find() *path* is a string having either an element tag or an XPath, *namespaces* is an optional mapping from namespace prefix to full name. Return the first matching element, or None if no element was found. """ # assert self._root is not None if path[:1] == "/": path = "." + path warnings.warn( "This search is broken in 1.3 and earlier, and will be " "fixed in a future version. If you rely on the current " "behaviour, change it to %r" % path, FutureWarning, stacklevel=2 ) return self._root.find(path, namespaces) def findtext(self, path, default=None, namespaces=None): """Find first matching element by tag name or path. Same as getroot().findtext(path), which is Element.findtext() *path* is a string having either an element tag or an XPath, *namespaces* is an optional mapping from namespace prefix to full name. Return the first matching element, or None if no element was found. """ # assert self._root is not None if path[:1] == "/": path = "." + path warnings.warn( "This search is broken in 1.3 and earlier, and will be " "fixed in a future version. If you rely on the current " "behaviour, change it to %r" % path, FutureWarning, stacklevel=2 ) return self._root.findtext(path, default, namespaces) def findall(self, path, namespaces=None): """Find all matching subelements by tag name or path. Same as getroot().findall(path), which is Element.findall(). *path* is a string having either an element tag or an XPath, *namespaces* is an optional mapping from namespace prefix to full name. Return list containing all matching elements in document order. """ # assert self._root is not None if path[:1] == "/": path = "." + path warnings.warn( "This search is broken in 1.3 and earlier, and will be " "fixed in a future version. If you rely on the current " "behaviour, change it to %r" % path, FutureWarning, stacklevel=2 ) return self._root.findall(path, namespaces) def iterfind(self, path, namespaces=None): """Find all matching subelements by tag name or path. Same as getroot().iterfind(path), which is element.iterfind() *path* is a string having either an element tag or an XPath, *namespaces* is an optional mapping from namespace prefix to full name. Return an iterable yielding all matching elements in document order. """ # assert self._root is not None if path[:1] == "/": path = "." + path warnings.warn( "This search is broken in 1.3 and earlier, and will be " "fixed in a future version. If you rely on the current " "behaviour, change it to %r" % path, FutureWarning, stacklevel=2 ) return self._root.iterfind(path, namespaces) def write(self, file_or_filename, encoding=None, xml_declaration=None, default_namespace=None, method=None, *, short_empty_elements=True): """Write element tree to a file as XML. Arguments: *file_or_filename* -- file name or a file object opened for writing *encoding* -- the output encoding (default: US-ASCII) *xml_declaration* -- bool indicating if an XML declaration should be added to the output. If None, an XML declaration is added if encoding IS NOT either of: US-ASCII, UTF-8, or Unicode *default_namespace* -- sets the default XML namespace (for "xmlns") *method* -- either "xml" (default), "html, "text", or "c14n" *short_empty_elements* -- controls the formatting of elements that contain no content. If True (default) they are emitted as a single self-closed tag, otherwise they are emitted as a pair of start/end tags """ if not method: method = "xml" elif method not in _serialize: raise ValueError("unknown method %r" % method) if not encoding: if method == "c14n": encoding = "utf-8" else: encoding = "us-ascii" enc_lower = encoding.lower() with _get_writer(file_or_filename, enc_lower) as write: if method == "xml" and (xml_declaration or (xml_declaration is None and enc_lower not in ("utf-8", "us-ascii", "unicode"))): declared_encoding = encoding if enc_lower == "unicode": # Retrieve the default encoding for the xml declaration import locale declared_encoding = locale.getpreferredencoding() write("<?xml version='1.0' encoding='%s'?>\n" % ( declared_encoding,)) if method == "text": _serialize_text(write, self._root) else: qnames, namespaces = _namespaces(self._root, default_namespace) serialize = _serialize[method] serialize(write, self._root, qnames, namespaces, short_empty_elements=short_empty_elements) def write_c14n(self, file): # lxml.etree compatibility. use output method instead return self.write(file, method="c14n")
2、遍历同级别的所有节点
由于 每个节点 都具有和根节点同样的方法,并且在上一步骤中解析时均得到了root(xml文件的根节点),so 可以利用以上方法进行操作xml文件。
from xml.etree import ElementTree as ET ############ 解析方式一 ############ """ # 打开文件,读取XML内容 str_xml = open('first.xml', 'r').read() # 将字符串解析成xml特殊对象,root代指xml文件的根节点 root = ET.XML(str_xml) """ ############ 解析方式二 ############ # 直接解析xml文件 tree = ET.parse("first.xml") # 获取xml文件的根节点 root = tree.getroot() ### 操作 # 顶层标签 print(root.tag) # 遍历XML文档的第二层 for child in root: # 第二层节点的标签名称和标签属性 print(child.tag, child.attrib) # 遍历XML文档的第三层 for i in child: # 第二层节点的标签名称和内容 print(i.tag,i.text)
3、仅遍历XML中指定的节点
from xml.etree import ElementTree as ET ############ 解析方式一 ############ """ # 打开文件,读取XML内容 str_xml = open('first.xml', 'r').read() # 将字符串解析成xml特殊对象,root代指xml文件的根节点 root = ET.XML(str_xml) """ ############ 解析方式二 ############ # 直接解析xml文件 tree = ET.parse("first.xml") # 获取xml文件的根节点 root = tree.getroot() ### 操作 # 顶层标签 print(root.tag) #指定了遍历的节点:year 。那么就只会遍历标签为year的节点 for node in root.iter('year'): # 节点的标签名称和内容 print(node.tag, node.text)
4、修改节点内容:
修改后的xml一定要保存,因为修改的动作只是发生在了内存中,并没有将xml进行重写
使用解析方式一:
from xml.etree import ElementTree as ET ############ 解析方式一 ############ # 打开文件,读取XML内容 str_xml = open('first.xml', 'r').read() # 将字符串解析成xml特殊对象,root代指xml文件的根节点 root = ET.XML(str_xml) ############ 操作 ############ # 顶层标签 print(root.tag) # 循环所有的year节点 for node in root.iter('year'): # 将year节点中的内容自增一 new_year = int(node.text) + 1 node.text = str(new_year) # 设置属性 node.set('name', 'alex') node.set('age', '18') # 删除属性 del node.attrib['name'] # 遍历data下的所有country节点 for country in root.findall('country'): # 获取每一个country节点下rank节点的内容 rank = int(country.find('rank').text) if rank > 50: # 删除指定country节点 root.remove(country) ############ 保存文件 ############ tree = ET.ElementTree(root) #因为Element类方法中,没有保存文件这个功能,所以只能将类转为ElementTree来进行保存 tree.write("newnew.xml", encoding='utf-8')
使用解析方式二:
from xml.etree import ElementTree as ET ############ 解析方式二 ############ # 直接解析xml文件 tree = ET.parse("first.xml") # 获取xml文件的根节点 root = tree.getroot() ############ 操作 ############ # 顶层标签 print(root.tag) # 循环所有的year节点 for node in root.iter('year'): # 将year节点中的内容自增一 new_year = int(node.text) + 1 node.text = str(new_year) # 设置属性 node.set('name', 'alex') node.set('age', '18') # 删除属性 del node.attrib['name'] # 遍历data下的所有country节点 for country in root.findall('country'): # 获取每一个country节点下rank节点的内容 rank = int(country.find('rank').text) if rank > 50: # 删除指定country节点 root.remove(country) ############ 保存文件 ############ tree.write("newnew.xml", encoding='utf-8')
注意:这两者的区别就在于保存文件时不同 。
5.节点的创建
创建方式一,(推荐方式)
在python中一切皆对象,如"i=3"其本质就是i=int(3),所以i是int类的一个实例对象。同样在xml模块下,我们在本章的<1.xml报文的解析>中已经指出过,所有的节点都是Element类的对象。那么我们就可以直接son1=ET.Element('son', {'name': '儿1'})来创建节点。如下:
from xml.etree import ElementTree as ET # 创建根节点 root = ET.Element("famliy") # 创建节点大儿子 son1 = ET.Element('son', {'name': '儿1'}) #节点标签是:son ,标签属性是:'name': '儿1' # 创建小儿子 son2 = ET.Element('son', {"name": '儿2'}) # 在大儿子中创建两个孙子 grandson1 = ET.Element('grandson', {'name': '儿11'}) grandson2 = ET.Element('grandson', {'name': '儿12'})
#给son1添加子节点grandson1 son1.append(grandson1) #节点创建好,只是代表将节点在内存中生成了,并不代表就已经将所有节点连接在了一起,所以需要将子节点加入父节点中。 son1.append(grandson2) # 把儿子添加到根节点中 root.append(son1) root.append(son1) tree = ET.ElementTree(root)
############ 写文件方式1(不推荐)##############
tree.write('oooo.xml') #没有任何要求的写入,这种写入方式,默认是:1、不识别中文,若写入中文会出现乱码 2、没有内容的节点会采取如:<neighbor direction="E"/>(非常简洁的闭合)
"""执行结果:
<famliy><son name="儿1"><age name="儿11">孙子</age></son><son name="儿2" /></famliy>
"""
############ 写文件方式2(推荐)##############
tree.write('oooo.xml',encoding='utf-8') #能够识别中文,2,没有内容的节点会采取简单闭合
"""执行结果:
<famliy><son name="儿1"><age name="儿11">孙子</age></son><son name="儿2" /></famliy>
"""
############ 写文件方式3(不推荐)##############
tree.write('oooo.xml',encoding='utf-8', short_empty_elements=False) #能识别中文,2,没有内容的节点采取的闭合方式是:<rank updated="yes"></rank>
"""执行结果:
<famliy><son name="儿1"><age name="儿11">孙子</age></son><son name="儿2"></son></famliy>
"""
#注:采取这种闭合方式,没有任何的优点,反而增加代码量
############ 写文件方式4(推荐)##############
et.write("test.xml", encoding="utf-8", xml_declaration=True) #能识别中文,同时在xml根节点前加入标识:<?xml version='1.0' encoding='utf-8'?>。
"""执行结果:
<?xml version='1.0' encoding='utf-8'?>
<famliy><son name="儿1"><age name="儿11">孙子</age></son><son name="儿2" /></famliy>
"""
#注:根节点前的标识,只是一个注释的作用,代表使用的xml格式是什么版本,编码使用的是什么
创建方式二(强烈推荐):
直接调用SubElement类来创建对象,如:son1 = ET.SubElement(root, "son", attrib={'name': '儿1'}) 。在创建过程中就直接指明了,在根节点下,创建一个子节点。
from xml.etree import ElementTree as ET # 创建根节点 root = ET.Element("famliy") # 创建节点大儿子。 在root根节点下,创建一个子节点,节点标签为:son,属性为:'name': '儿1' son1 = ET.SubElement(root, "son", attrib={'name': '儿1'}) # 创建小儿子 son2 = ET.SubElement(root, "son", attrib={"name": "儿2"}) # 在大儿子中创建一个孙子 grandson1 = ET.SubElement(son1, "age", attrib={'name': '儿11'}) grandson1.text = '孙子' et = ET.ElementTree(root) #生成文档对象 et.write("test.xml", encoding="utf-8", xml_declaration=True, short_empty_elements=False)
创建方式三:
调用Element类方法中的makeelement方法来节点:
from xml.etree import ElementTree as ET # 创建根节点 root = ET.Element("famliy") # 创建大儿子 # son1 = ET.Element('son', {'name': '儿1'}) son1 = root.makeelement('son', {'name': '儿1'}) #这个是调用Element类方法makeelement()来创建节点 # 创建小儿子 # son2 = ET.Element('son', {"name": '儿2'}) son2 = root.makeelement('son', {"name": '儿2'}) # 在大儿子中创建两个孙子 # grandson1 = ET.Element('grandson', {'name': '儿11'}) grandson1 = son1.makeelement('grandson', {'name': '儿11'}) # grandson2 = ET.Element('grandson', {'name': '儿12'}) grandson2 = son1.makeelement('grandson', {'name': '儿12'}) son1.append(grandson1) son1.append(grandson2) # 把儿子添加到根节点中 root.append(son1) root.append(son1) tree = ET.ElementTree(root) tree.write('oooo.xml',encoding='utf-8', short_empty_elements=False)
6.xml格式化调整
注:格式化只是为了美观,没有实质意义,一般不建议格式化
由于原生保存的XML时默认无缩进,如果想要设置缩进的话, 需要修改保存方式:
from xml.etree import ElementTree as ET from xml.dom import minidom #用于来格式化xml def prettify(elem,path): """将节点转换成字符串,并添加缩进。 """ rough_string = ET.tostring(elem, 'utf-8') #将创建的xml转成字符串 reparsed = minidom.parseString(rough_string) #将字符串进行解析为xml raw_str= reparsed.toprettyxml(indent="\t") #给xml进行格式化 f = open(path, 'w', encoding='utf-8') f.write(raw_str) f.close() # 创建根节点 root = ET.Element("famliy") # 创建大儿子 son1 = root.makeelement('son', {'name': '儿1'}) # 创建小儿子 son2 = root.makeelement('son', {"name": '儿2'}) # 在大儿子中创建两个孙子 grandson1 = son1.makeelement('grandson', {'name': '儿11'}) grandson2 = son1.makeelement('grandson', {'name': '儿12'}) son1.append(grandson1) son1.append(grandson2) # 把儿子添加到根节点中 root.append(son1) root.append(son1) prettify(root,"test.xml") #通过传入根节点,来指明刚创建的xml
7、命名空间
命名空间只是为了避免两个xml文件出现相同的节点名。
命名冲突 在 XML 中,元素名称是由开发者定义的,当两个不同的文档使用相同的元素名时,就会发生命名冲突。 有一个 XML 文档信息: <table> <tr> <td>Apples</td> <td>Bananas</td> </tr> </table> 然而还有另一个XML文档信息: <table> <name>African Coffee Table</name> <width>80</width> <length>120</length> </table> 假如这两个 XML 文档被一起使用,由于两个文档都定义的 <table> 元素,就会发生命名冲突。 XML 解析器无法确定如何处理这类冲突。 使用前缀来避免命名冲突 对第一个xml文档进行更改为: <h:table> <h:tr> <h:td>Apples</h:td> <h:td>Bananas</h:td> </h:tr> </h:table> 对另一个XML 文档进行更改为: <f:table> <f:name>African Coffee Table</f:name> <f:width>80</f:width> <f:length>120</f:length> </f:table> 现在,命名冲突不存在了,这是由于两个文档都使用了不同的名称来命名它们的 <table> 元素 (<h:table> 和 <f:table>)。
那么在程序中怎么实现的呢?
from xml.etree import ElementTree as ET ET.register_namespace('com',"http://www.company.com") #定义以别名,"com"就代指"http://www.company.com" # build a tree structure root = ET.Element("{http://www.company.com}STUFF")
#在创建节点的时候,里面{http://www.company.com}就代表要使用别名,在输出到文件中时{http://www.company.com}就会被自动解析为com body = ET.SubElement(root, "{http://www.company.com}MORE_STUFF", attrib={"{http://www.company.com}hhh": "123"}) body.text = "STUFF EVERYWHERE!" # wrap it in an ElementTree instance, and save as XML tree = ET.ElementTree(root) tree.write("page.xml", xml_declaration=True, encoding='utf-8', method="xml")
二、requests和xml的结合应用:
检查QQ是否在线:
import requests r = requests.get("http://www.webxml.com.cn//webservices/qqOnlineWebService.asmx/qqCheckOnline?qqCode=1223995142") result = r.text # XML 模块 from xml.etree import ElementTree as ET #解析xml格式内容 #xml接受一个参数:字符串,格式化为特殊的对象 node = ET.XML(result)
#注意:此处只能使用ET.XML() ,不可使用ET.parse(),原因:result是通过requests()方法获取到的字符串,而要将字符串解析成xml只能是ET.XML()。ET.parse()是直接解析文件 #获取内容 if node.text == "Y": # text 是xml的值 print("在线") elif node.text == "N": print("离线") elif node.text == "V": print("隐身")