In this tutorial, we’ll work through some examples of using Python to parse XML, the Extensible Markup Language. XML is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable. It resembles HTML, but instead of a fixed set of tags it lets you define your own. There are two common ways to parse XML. The first is the Simple API for XML, also known as SAX. The second is the Document Object Model, or DOM. SAX reads XML data sequentially, from the start of the document all the way to the end. As the XML is read, the parser emits events that relate to the XML content. Using Python, we can handle these events as they happen.
SAX Events
When the parser begins reading an XML document, it generates a start-document event. Each time it finishes reading an opening tag, it sends a start-element event with the tag’s name and a collection of the attributes and their values. Text between tags is reported through character-data events. When the parser finishes reading a closing tag, it sends an end-element event, and when it reaches the end of the input it sends an end-document event.
As these events are generated, we can use Python to respond and operate on the data. When using SAX, the content of the XML cannot be accessed in random order. SAX moves through the XML in a single pass from beginning to end, so you cannot “rewind” or back up during this process. Additionally, SAX cannot change the XML data during processing. This makes SAX a good fit for read-only tasks, such as reading an XML config file.
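The order of the events described above can be seen with a small logging handler. This is only a minimal sketch: the EventLogger class and the SAMPLE document are made up for illustration and are not part of the tutorial's data.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Hypothetical handler that records each SAX event in arrival order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append('startDocument')

    def startElement(self, name, attrs):
        # attrs behaves like a read-only mapping of attribute names to values
        self.events.append(f'startElement {name} {dict(attrs.items())}')

    def characters(self, content):
        # skip whitespace-only text to keep the log short
        if content.strip():
            self.events.append(f'characters {content!r}')

    def endElement(self, name):
        self.events.append(f'endElement {name}')

    def endDocument(self):
        self.events.append('endDocument')


# a tiny made-up document, passed as bytes
SAMPLE = b'<post type="summary"><title>Hello</title></post>'

logger = EventLogger()
xml.sax.parseString(SAMPLE, logger)
for event in logger.events:
    print(event)
```

The log starts with startDocument and ends with endDocument, and every startElement is matched by an endElement in between.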
SAX API
To use the SAX API in Python, we use the xml.sax module. Once it is imported, we have access to the xml.sax.parse() function, which works with a filename or a stream object. If the XML is already in a string variable, the xml.sax.parseString() function can be used instead. In addition to these functions, there is a base class named ContentHandler that can be used for custom content processing. The ContentHandler class has methods for handling the start and end of the document, the start and end of tags, and text data. You can create your own class that overrides these methods to handle each type of content.
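As a rough illustration of the two entry points, the sketch below feeds the same XML to parseString() and to parse(). The TagCounter class and the XML_TEXT sample are hypothetical, and io.BytesIO is used here only as a stand-in for an open file.

```python
import io
import xml.sax


class TagCounter(xml.sax.ContentHandler):
    """Hypothetical handler that counts how often each tag name is opened."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1


# made-up sample data
XML_TEXT = '<posts><post/><post/><note/></posts>'

# parseString() takes XML that is already in memory
from_string = TagCounter()
xml.sax.parseString(XML_TEXT.encode('utf-8'), from_string)

# parse() takes a filename or a file-like object; BytesIO stands in for a file
from_stream = TagCounter()
xml.sax.parse(io.BytesIO(XML_TEXT.encode('utf-8')), from_stream)

print(from_string.counts)
print(from_stream.counts)
```

Both handlers end up with the same counts, since only the way the data reaches the parser differs.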
Python SAX XML Example
Below we have some sample XML data. It is stored in a file named xmldata.xml.
<?xml version="1.0" encoding="UTF-8"?>
<blogposts title="Blog Posts Collection" date="A date" author="Some dude">
  <post type="summary">
    <title>Parse XML With SAX</title>
  </post>
  <post type="detail">
    <title>Overview</title>
    <entry> Parsing XML is great </entry>
    <entry />
    <entry> Have fun with XML parsing </entry>
  </post>
</blogposts>
The XML data we are working with represents a fictional collection of blog posts. There is a blogposts root tag with some attributes on it, and inside it there are some post elements, each of which can contain entries. The code extracts information from this XML as it is being parsed by the SAX parser. The startDocument and endDocument methods report when processing begins and ends. The startElement method prints the title attribute of the blogposts element and counts the post and entry elements. The endElement and characters methods work together to print the text of each title element. The program runs from a main() function, where a new instance of CustomContentHandler is assigned to the handler variable. Then we simply use xml.sax.parse() to read the data and print out some results.
import xml.sax


# define a custom ContentHandler class that extends ContentHandler
class CustomContentHandler(xml.sax.ContentHandler):
    def __init__(self):
        # initialize the base class before adding our own state
        super().__init__()
        self.postCount = 0
        self.entryCount = 0
        self.isInTitle = False

    # Handle startElement
    def startElement(self, tagName, attrs):
        if tagName == 'blogposts':
            print('Blogposts title: ' + attrs['title'])
        elif tagName == 'post':
            self.postCount += 1
        elif tagName == 'entry':
            self.entryCount += 1
        elif tagName == 'title':
            self.isInTitle = True

    # Handle endElement
    def endElement(self, tagName):
        if tagName == 'title':
            self.isInTitle = False

    # Handle text data
    def characters(self, chars):
        if self.isInTitle:
            print('Title: ' + chars)

    # Handle startDocument
    def startDocument(self):
        print('About to start!')

    # Handle endDocument
    def endDocument(self):
        print('Finishing up!')


def main():
    # create a new content handler for the SAX parser
    handler = CustomContentHandler()
    # call the parse method on an XML file
    xml.sax.parse('xmldata.xml', handler)
    # when we're done, print out some interesting results
    print(f'There were {handler.postCount} post elements')
    print(f'There were {handler.entryCount} entry elements')


if __name__ == '__main__':
    main()
About to start!
Blogposts title: Blog Posts Collection
Title: Parse XML With SAX
Title: Overview
Finishing up!
There were 2 post elements
There were 3 entry elements
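One detail worth keeping in mind is that a SAX parser may deliver a single run of text through several characters() calls, so printing inside characters() can split a title across lines. A possible variation, sketched here with a hypothetical TitleCollector class and a made-up sample document, is to buffer the chunks and join them when the closing tag arrives.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Hypothetical handler that collects the full text of every <title>."""

    def __init__(self):
        super().__init__()
        self.titles = []
        self._buffer = None  # a list while inside <title>, otherwise None

    def startElement(self, name, attrs):
        if name == 'title':
            self._buffer = []

    def characters(self, content):
        # may be called more than once for a single run of text
        if self._buffer is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == 'title':
            # join the chunks only once the element is complete
            self.titles.append(''.join(self._buffer).strip())
            self._buffer = None


# made-up sample; the second title contains an entity reference
SAMPLE = b"""<blogposts>
  <post><title>Parse XML With SAX</title></post>
  <post><title>Tips &amp; Tricks</title></post>
</blogposts>"""

collector = TitleCollector()
xml.sax.parseString(SAMPLE, collector)
print(collector.titles)
```

Each title is stored as one complete string, however the parser happened to chunk the text.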
XML DOM API
Another way to manipulate XML content is the Document Object Model API, or DOM. One of the big differences between the DOM API and the SAX API is that the DOM allows you to access any part of the XML document at random. This is not possible with SAX, which reads the document in a single pass from beginning to end. With the DOM, you can also modify the XML content. When the DOM is used to parse XML, the whole document is read into memory and represented as a tree structure. You can then use various APIs to work on the resulting document tree. The Python Standard Library provides an implementation of the DOM API in the xml.dom.minidom module. It is intended to be a smaller, simpler implementation than the full DOM API. Below are some of the key points and methods to be aware of.
- Access any part of the XML structure at random
- Modify XML content
- Represents XML as a hierarchical tree structure
- xml.dom.minidom is a lightweight implementation
- domtree = xml.dom.minidom.parseString(str)
- document.getElementById(id)
- elem.getElementsByTagName(tagname)
- elem.getAttribute(attrName)
- elem.setAttribute(attrName, val)
- newElem = document.createElement(tagName)
- newElem = document.createTextNode(strOfText)
- elem.appendChild(newElem)
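Several of the calls listed above can be tried on a small in-memory document. This is a minimal sketch; the XML_TEXT sample and the variable names are made up for illustration.

```python
import xml.dom.minidom

# made-up sample data
XML_TEXT = '<blogposts title="Demo"><post type="summary"/></blogposts>'

domtree = xml.dom.minidom.parseString(XML_TEXT)
root = domtree.documentElement

# read an attribute, then change it
old_title = root.getAttribute('title')
root.setAttribute('title', 'Renamed')

# build <entry>Hello</entry> and attach it to the first <post>
entry = domtree.createElement('entry')
entry.appendChild(domtree.createTextNode('Hello'))
first_post = domtree.getElementsByTagName('post')[0]
first_post.appendChild(entry)

print(old_title)
print(root.toxml())
```

The serialized root shows both changes: the title attribute now reads Renamed, and the post element contains the new entry.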
Here is an example of using xml.dom.minidom to operate on the same xmldata.xml file that we used in the SAX example. Notice that this approach provides a bit more flexibility, and we can even add data to the document in memory. The file on disk is not changed. Many of us are quite familiar with the DOM since it is so common in web development, so working with XML in Python using the DOM is fairly easy to understand.
import xml.dom.minidom


def main():
    domtree = xml.dom.minidom.parse('xmldata.xml')
    rootnode = domtree.documentElement

    # display some information about the content
    print(f'The root element is {rootnode.nodeName}')
    print(f'The Title is: {rootnode.getAttribute("title")}')
    entries = domtree.getElementsByTagName('entry')
    print(f'There are {entries.length} entry tags')

    # create a new entry tag in memory
    newItem = domtree.createElement('entry')
    # add some text to the entry
    newItem.appendChild(domtree.createTextNode('Magic Entry!'))
    # now add the entry to the first post
    firstPost = domtree.getElementsByTagName('post')[0]
    firstPost.appendChild(newItem)

    # Now count the entry tags again
    entries = domtree.getElementsByTagName('entry')
    print('Now there are {0} entry tags'.format(entries.length))

    # Print out the domtree as xml
    print(domtree.toxml())


if __name__ == '__main__':
    main()
The root element is blogposts
The Title is: Blog Posts Collection
There are 3 entry tags
Now there are 4 entry tags
<?xml version="1.0" ?><blogposts title="Blog Posts Collection" date="A date" author="Some dude">
  <post type="summary">
    <title>Parse XML With SAX</title>
  <entry>Magic Entry!</entry></post>
  <post type="detail">
    <title>Overview</title>
    <entry> Parsing XML is great </entry>
    <entry/>
    <entry> Have fun with XML parsing </entry>
  </post>
</blogposts>
XML ElementTree API
The DOM API is vast and offers a cross-language, cross-platform API for working with XML data. The ElementTree API takes a different approach, focusing instead on being a simpler way of working with XML. With the ElementTree API, elements are treated as if they were lists. This means that if you have an XML element that contains other elements, you can iterate over those child elements with a standard for loop. The ElementTree API treats attributes like dictionaries. So if you have a reference to an element, you can access its attrib property, which is a dictionary of all the attribute names and values. ElementTree also makes searching for content within XML straightforward. It offers functions such as find() and findall() that accept XPath expressions (the standard library's xml.etree.ElementTree supports a subset of XPath).
In the example below we use the ElementTree API to test these concepts out. The example imports etree from lxml, a third-party package that implements the ElementTree API and has to be installed separately; the calls used here are also available in the standard library's xml.etree.ElementTree. Once again, we use the same XML data file we have been using for the entire tutorial. We can see how to build a document structure and find the root element of the tree. We can access an attribute, iterate over tags, count the number of elements, add new data, and so on.
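The list-like and dictionary-like behavior can be sketched with the standard library's xml.etree.ElementTree module. The XML_TEXT sample below is made up for illustration, and using the standard library module here is a choice of this sketch.

```python
import xml.etree.ElementTree as ET

# made-up sample data
XML_TEXT = """<blogposts title="Demo">
  <post type="summary"><title>First</title></post>
  <post type="detail"><title>Second</title><entry>Text</entry></post>
</blogposts>"""

root = ET.fromstring(XML_TEXT)

# attributes are a plain dictionary
print(root.attrib['title'])

# child elements can be iterated, counted, and indexed like a list
for post in root:
    print(post.tag, post.attrib)
print(len(root), root[0].get('type'))

# findall() accepts a limited subset of XPath
detail_titles = [t.text for t in root.findall("./post[@type='detail']/title")]
entry_count = len(root.findall('.//entry'))
print(detail_titles, entry_count)
```

The path with the [@type='detail'] predicate selects only the title inside the detail post, while './/entry' searches the whole tree below the root.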
from lxml import etree


def main():
    postCount = 0
    entryCount = 0

    # build a doc structure using the ElementTree API
    doc = etree.parse('xmldata.xml').getroot()
    print(doc.tag)

    # Access the value of an attribute
    print(doc.attrib['title'])

    # Iterate over tags
    for elem in doc.findall('post'):
        print(elem.tag)

    # Count the number of posts
    postCount = len(doc.findall('post'))
    entryCount = len(doc.findall('.//entry'))
    print(f'There are {postCount} post elements')
    print(f'There are {entryCount} entry elements')

    # Create a new post
    newPost = etree.SubElement(doc, 'post')
    newPost.text = 'This is a new post'

    # Count the number of posts
    postCount = len(doc.findall('post'))
    entryCount = len(doc.findall('.//entry'))
    print(f'There are now {postCount} post elements')
    print(f'There are now {entryCount} entry elements')


if __name__ == '__main__':
    main()
blogposts
Blog Posts Collection
post
post
There are 2 post elements
There are 3 entry elements
There are now 3 post elements
There are now 3 entry elements
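The example counts the new post but never shows the resulting XML. One way to inspect it is to serialize the tree back to text. The sketch below assumes the standard library's xml.etree.ElementTree rather than lxml, and uses a small made-up document instead of xmldata.xml.

```python
import xml.etree.ElementTree as ET

# made-up sample data standing in for xmldata.xml
root = ET.fromstring('<blogposts><post>Old</post></blogposts>')

# add a new post, as in the example above
new_post = ET.SubElement(root, 'post')
new_post.text = 'This is a new post'

# encoding='unicode' returns a str instead of bytes
xml_text = ET.tostring(root, encoding='unicode')
print(xml_text)
```

The new post element appears as the last child of the root in the serialized text.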
Learn More About Python XML Parsing
- Python XML Tutorial (knowledgehut.com)
- Python Examples of xml.sax.parse (programcreek.com)
- Partition Large XML Files Into Subfiles in Python Using SAX (stackoverflow.com)
- Python 3 Library Documentation: xml.sax (docs.python.org)
- Python XML Processing (tutorialspoint.com)
- SAX Parsing With Python (knowthytools.com)
- Python 3 Library Documentation: xml.dom (docs.python.org)
- Python Read XML File: DOM Example (mkyong.com)
- Reading and Writing XML Files in Python (stackabuse.com)
- Read XML File Example: minidom, ElementTree (python-tutorials.in)
- How I Used the lxml Library to Parse XML 20x Faster in Python (nickjanetakis.com)
- Python lxml (journaldev.com)
- lxml Project Page (pypi.org)
- An Intro to Web Scraping With lxml and Python (pythontips.com)
Python XML Parsing Summary
Reading, writing, and manipulating XML data in Python can be handled with any of the libraries covered in this tutorial. We had a look at the SAX API for XML, the DOM API for XML, and lastly the ElementTree API for XML. They each have their pros and cons, and some of the links above offer more tips and tricks for working with XML in Python.