Click to share! ⬇️

Python XML Parsing

In this tutorial, we’ll see some examples of using Python to parse XML or Extensible Markup Language. XML is kind of like a more flexible version of HTML. It is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. There are a couple of different ways XML is parsed by computers. The first is known as the Simple API for XML, also known as SAX. The other way to parse XML is by using the DOM or Document Object Model. Back to SAX for a moment. SAX reads XML data one character at a time all the way to the end of the document. As the XML is read, the parser emits events that relate to the XML content. Using Python, we can handle these events as they happen.


SAX Events

When the parser encounters XML as we see below, it generates an event for when it is starting, and then as the parser reaches this closing angle bracket of the opening tag, it will send a start tag event with the tag’s name, and a collection of the attributes, and their values. When the parser reaches the opening angle bracket of the closing tag it will send an end tag event and when it reaches the closing bracket of the closing tag it will send an event for that as well.

sax XML API parse

As these events are generated, we can use Python to respond and operate on the data. When using SAX, the content of the XML can not be accessed in random order. Remember, SAX works by moving through the XML file character by character until it reaches the end of the document. You can not “rewind” or back up during this process. Additionally, SAX can not change the XML data during processing. For this reason, SAX is good when using XML as a config file.


SAX API

To use the SAX API in Python, we use the xml.sax module. So we will be importing that module to run some test code. Once imported, we’ll have access to an xml.sax.parse() function that can work with a file or a stream object. Another function we can use is the xml.sax.parseString() function that can be used if you already have the XML in a string variable. In addition to these functions is a base class named ContentHandler which can be used for custom content processing. The ContentHandler class has functions for handling the start and the end of the document, the start and end of tags, and handling text data. You can create your own class that overrides these functions to handle each type of content.

Python SAX XML Example

Below we have some sample XML data. It is stored in a file names xmldata.xml.

<?xml version="1.0" encoding="UTF-8"?>
<blogposts title="Blog Posts Collection" date="A date" author="Some dude">

   <post type="summary">
      <title>Parse XML With SAX</title>
   </post>

   <post type="detail">
      <title>Overview</title>
      <entry>
         Parsing XML is great
      </entry>
      <entry />
      <entry>
         Have fun with XML parsing
      </entry>
   </post>
</blogposts>

The XML data we are working on represents a fictional blogposts element. There is a blogposts root tag and it has some attributes on it and inside the blogposts, there are some posts and each post has some entries. The code extracts information from this XML as it is being parsed by the SAX parser. There are functions that will indicate that we’re starting to process the document and that we’re finishing up processing. To print out the name of the blogposts, the startElement function is used. There are also methods of endElement, characters, startDocument, and endDocument. To run the program, we place it inside of the Python main() function. A new instance of CustomContentHandler is assigned to the handler variable. Then we simply use xml.sax.parse() to read the data and print out some results.

import xml.sax


# define a Custom ContentHandler class that extends ContenHandler
class CustomContentHandler(xml.sax.ContentHandler):
    def __init__(self):
        self.postCount = 0
        self.entryCount = 0
        self.isInTitle = False

    # Handle startElement
    def startElement(self, tagName, attrs):
        if tagName == 'blogposts':
            print('Blogposts title: ' + attrs['title'])
        elif tagName == 'post':
            self.postCount += 1
        elif tagName == 'entry':
            self.entryCount += 1
        elif tagName == 'title':
            self.isInTitle = True

    # Handle endElement
    def endElement(self, tagName):
        if tagName == 'title':
            self.isInTitle = False

    # Handle text data
    def characters(self, chars):
        if self.isInTitle:
            print('Title: ' + chars)

    # Handle startDocument
    def startDocument(self):
        print('About to start!')

    # Handle endDocument
    def endDocument(self):
        print('Finishing up!')


def main():
    # create a new content handler for the SAX parser
    handler = CustomContentHandler()

    # call the parse method on an XML file
    xml.sax.parse('xmldata.xml', handler)

    # when we're done, print out some interesting results
    print(f'There were {handler.postCount} post elements')
    print(f'There were {handler.entryCount} entry elements')


if __name__ == '__main__':
    main()
About to start!
Blogposts title: Blog Posts Collection
Title: Parse XML With SAX
Title: Overview
Finishing up!
There were 2 post elements
There were 3 entry elements

Process finished with exit code 0

XML DOM API

Another way that XML content can be manipulated is by using the Document Object Model API or DOM. One of the big differences between the DOM API and SAX API is that the DOM allows you to access any part of the XML file at random. This is not possible with SAX as it reads one character at a time from beginning to end. With the DOM, you can also modify the XML file content. When using the DOM to parse XML code, the XML is read into memory in full and represented as a tree structure. You can then use various API’s to work on the resulting document tree. The Python Standard Library provides an implementation of the DOM API in the xml.dom.minidom module. It is intended to be a smaller implementation than the full DOM API. Below are some of the key points and methods to be aware of.

  • Access any part of the XML structure at random
  • Modify XML content
  • Represents XML as a hierarchial tree structure
  • xml.dom.minidom is a lighweight implementation
  • domtree = xml.com.minidom.parseString(str)
  • elem.getElementById(id)
  • elem.getElementsByTagName(tagname)
  • elem.getAttribute(attrName)
  • elem.setAttribute(attrName, val)
  • newElem = document.createElement(tagName)
  • newElem = document.createTextNode(strOfText)
  • elem.appendChild(newElem)

Here is an example of using xml.dom.minidom to operate on the same xmldata.xml file that we used in the SAX example. Notice this method provides a bit more flexibility and we can even add data to the file in memory. Many of us are quite familiar with the DOM since it is so common in Web Development, so working with XML in Python using the DOM is fairly easy to understand.

import xml.dom.minidom


def main():
    domtree = xml.dom.minidom.parse('xmldata.xml')
    rootnode = domtree.documentElement

    # display some information about the content
    print(f'The root element is {rootnode.nodeName}')
    print(f'The Title is: {rootnode.getAttribute("title")}')
    entries = domtree.getElementsByTagName('entry')
    print(f'There are {entries.length} entry tags')

    # create a new entry tag in memory
    newItem = domtree.createElement('entry')

    # add some text to the entry
    newItem.appendChild(domtree.createTextNode('Magic Entry!'))

    # now add the entry to the first post
    firstPost = domtree.getElementsByTagName('post')[0]
    firstPost.appendChild(newItem)

    # Now count the entry tags again
    entries = domtree.getElementsByTagName('entry')
    print('Now there are {0} entry tags'.format(entries.length))

    # Print out the domtree as xml
    print(domtree.toxml())


if __name__ == '__main__':
    main()
The root element is blogposts
The Title is: Blog Posts Collection
There are 3 entry tags
Now there are 4 entry tags

<?xml version="1.0" ?><blogposts title="Blog Posts Collection" date="A date" author="Some dude">

   <post type="summary">
      <title>Parse XML With SAX</title>
   <entry>Magic Entry!</entry></post>

   <post type="detail">
      <title>Overview</title>
      <entry>
         Parsing XML is great
      </entry>
      <entry/>
      <entry>
         Have fun with XML parsing
      </entry>
   </post>
</blogposts>

Process finished with exit code 0

XML ElementTree API

The DOM API is vast and offers cross-language and cross-platform API for working with XML data. The ElementTree API takes a different approach by focusing instead on being a simpler way of working with XML With the ElementTree API, elements are treated as if they were lists. This means that if you have an XML element that contains other elements, it is possible to iterate over those child elements using standard iteration like a for loop. The ElementTree API treats attributes like dictionaries. So if you have a reference to an element, then you can access its attrib property which is a dictionary of all the attribute names and values. ElementTree makes searching for content within XML straightforward. It offers functions that can use XPath Syntax to search the XML for specific data.

In the example below we use the ElementTree API to test these concepts out. Once again, we use the same XML data file we have been using for the entire tutorial. We can see how to build a document structure and find the root element of the tree. We can access an attribute, iterate over tags, count the number of elements, add new data, and so on.

from lxml import etree


def main():
    postCount = 0
    entryCount = 0

    # build a doc structure using the ElementTree API
    doc = etree.parse('xmldata.xml').getroot()
    print(doc.tag)

    # Access the value of an attribute
    print(doc.attrib['title'])

    # Iterate over tags
    for elem in doc.findall('post'):
        print(elem.tag)

    # Count the number of posts
    postCount = len(doc.findall('post'))
    entryCount = len(doc.findall('.//entry'))

    print(f'There are {postCount} post elements')
    print(f'There are {entryCount} entry elements')

    # Create a new post
    newPost = etree.SubElement(doc, 'post')
    newPost.text = 'This is a new post'

    # Count the number of posts
    postCount = len(doc.findall('post'))
    entryCount = len(doc.findall('.//entry'))

    print(f'There are now {postCount} post elements')
    print(f'There are now {entryCount} entry elements')


if __name__ == '__main__':
    main()
blogposts
Blog Posts Collection
post
post
There are 2 post elements
There are 3 entry elements
There are now 3 post elements
There are now 3 entry elements

Process finished with exit code 0

Learn More About Python XML Parsing

Python XML Parsing Summary

The problem of reading, writing, and manipulating XML data in Python is solved using any of the libraries mentioned in this tutorial. We had a look at the SAX API for XML, the DOM API for XML, and lastly the ElementTree API for XML. They each have their pros can cons, and some of the links above will offer more tips and tricks for working with XML in Python.

Click to share! ⬇️