python 爬取 xml

What is XML?

XML stands for eXtensible Markup Language. It was designed to store and transport small to medium amounts of data and is widely used for sharing structured information.

Python enables you to parse and modify XML document. In order to parse XML document you need to have the entire XML document in memory. In this tutorial, we will see how we can use XML minidom class in Python to load and parse XML file.

In this tutorial, we will learn-

 

How to Parse XML using minidom

We have created a sample XML file that we are going to parse.

Step 1) Inside file, we can see first name, last name, home and the area of expertise (SQL, Python, Testing and Business)

Python XML Parser Tutorial: Create & Read XML with Examples

Step 2) Once we have parsed the document, we will print out the "node name" of the root of the document and the "firstchild tagname". Tagname and nodename are the standard properties of the XML file.

Python XML Parser Tutorial: Create & Read XML with Examples

  • Import the xml.dom.minidom module and declare file that has to be parsed (myxml.xml)
  • This file carries some basic information about employee like first name, last name, home, expertise, etc.
  • We use the parse function on the XML minidom to load and parse the XML file
  • We have variable doc and doc gets the result of the parse function
  • We want to print the nodename and child tagname from the file, so we declare it in print function
  • Run the code- It prints out the nodename (#document) from the XML file and the first child tagname (employee) from the XML file

Note:

Nodename and child tagname are the standard names or properties of an XML dom. In case if you are not familiar with these type of naming conventions.

Step 3) We can also call the list of XML tags from the XML document and printed out. Here we printed out the set of skills like SQL, Python,Testing and Business.

Python XML Parser Tutorial: Create & Read XML with Examples

  • Declare the variable expertise, from which we going to extract all the expertise name employee is having
  • Use the dom standard function called "getElementsByTagName"
  • This will get all the elements named skill
  • Declare loop over each one of the skill tags
  • Run the code- It will give list of four skills

How to Create XML Node

We can create a new attribute by using "createElement" function and then append this new attribute or tag to the existing XML tags. We added a new tag "BigData" in our XML file.

  1. You have to code to add the new attribute (BigData) to the existing XML tag
  2. Then you have to print out the XML tag with new attributes appended with existing XML tag

Python XML Parser Tutorial: Create & Read XML with Examples

  • To add a new XML and add it to the document, we use code "doc.create elements"
  • This code will create a new skill tag for our new attribute "Big-data"
  • Add this skill tag into the document first child (employee)
  • Run the code- the new tag "big data" will appear with the other list of expertise

XML Parser Example

Python 2 Example

import xml.dom.minidom

def main():
# use the parse() function to load and parse an XML file
   doc = xml.dom.minidom.parse("Myxml.xml");
  
# print out the document node and the name of the first child tag
   print doc.nodeName
   print doc.firstChild.tagName
  
# get a list of XML tags from the document and print each one
   expertise = doc.getElementsByTagName("expertise")
   print "%d expertise:" % expertise.length
   for skill in expertise:
     print skill.getAttribute("name")
    
# create a new XML tag and add it into the document
   newexpertise = doc.createElement("expertise")
   newexpertise.setAttribute("name", "BigData")
   doc.firstChild.appendChild(newexpertise)
   print " "

   expertise = doc.getElementsByTagName("expertise")
   print "%d expertise:" % expertise.length
   for skill in expertise:
     print skill.getAttribute("name")
    
if name == "__main__":
  main();

Python 3 Example

import xml.dom.minidom

def main():
    # use the parse() function to load and parse an XML file
    doc = xml.dom.minidom.parse("Myxml.xml");

    # print out the document node and the name of the first child tag
    print (doc.nodeName)
    print (doc.firstChild.tagName)
    # get a list of XML tags from the document and print each one
    expertise = doc.getElementsByTagName("expertise")
    print ("%d expertise:" % expertise.length)
    for skill in expertise:
        print (skill.getAttribute("name"))

    # create a new XML tag and add it into the document
    newexpertise = doc.createElement("expertise")
    newexpertise.setAttribute("name", "BigData")
    doc.firstChild.appendChild(newexpertise)
    print (" ")

    expertise = doc.getElementsByTagName("expertise")
    print ("%d expertise:" % expertise.length)
    for skill in expertise:
        print (skill.getAttribute("name"))

if __name__ == "__main__":
    main();

How to Parse XML using ElementTree

ElementTree is an API for manipulating XML. ElementTree is the easy way to process XML files.

We are using the following XML document as the sample data:

<data>
   <items>
      <item name="expertise1">SQL</item>
      <item name="expertise2">Python</item>
   </items>
</data>

Reading XML using ElementTree:

we must first import the xml.etree.ElementTree module.

import xml.etree.ElementTree as ET

Now let's fetch the root element:

root = tree.getroot()

Following is the complete code for reading above xml data

import xml.etree.ElementTree as ET
tree = ET.parse('items.xml')
root = tree.getroot()

# all items data
print('Expertise Data:')

for elem in root:
   for subelem in elem:
      print(subelem.text)

output:

Expertise Data:
SQL
Python

Summary:

Python enables you to parse the entire XML document at one go and not just one line at a time. In order to parse XML document you need to have the entire document in memory.

  • To parse XML document
    • Import xml.dom.minidom
    • Use the function "parse" to parse the document ( doc=xml.dom.minidom.parse (file name);
    • Call the list of XML tags from the XML document using code (=doc.getElementsByTagName( "name of xml tags")
  • To create and add new attribute in XML document
    • Use function "createElement"

posted @ 2020-02-10 20:04  This_is_bill  阅读(256)  评论(0编辑  收藏  举报