Python XML Parsing

XML is a markup language that defines a set of rules for encoding documents in a form that is both human-readable and machine-readable. In this article, we will learn how to process XML files using different Python modules. Before discussing XML parsing in Python, let us get introduced to XML.

What is XML Parsing in Python?

XML stands for Extensible Markup Language. It is a markup language, not a programming language, and it defines rules for structuring documents in a digital format.

Everything in XML is enclosed within tags, and the outermost tag is called the root. All the other elements of the document are nested under the root. An example of XML is given below.

<?xml version="1.0"?>
<Class>
    <Student>
        <FirstName>Abc</FirstName>
        <MiddleName type="optional"></MiddleName>
        <LastName>TH</LastName>
        <RollNo>12</RollNo>
        <Section>A</Section>
    </Student>
    
     <Student>
      <FirstName>Prq</FirstName>
      <MiddleName type="optional">Rst</MiddleName>
      <LastName>Z</LastName>
      <RollNo>19</RollNo>
      <Section>B</Section>

  </Student>
  
  <Student>
      <FirstName>Xyz</FirstName>
      <MiddleName type="optional">Hi</MiddleName>
      <LastName>KE</LastName>
      <RollNo>11</RollNo>
      <Section>A</Section>

  </Student>
</Class>

This XML stores the details of a class. It contains multiple tags, and each tag is closed once the information under it ends. We will process this XML file using Python modules.

Python Modules for XML Parsing

We will be discussing the following three Python modules for XML parsing:

1. SAX:

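SAX stands for Simple API for XML. It is an event-driven parser that reads the document sequentially and calls handler methods as it encounters each element, so it does not need to load the whole file into memory.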

2. DOM:

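It stands for Document Object Model. It loads the entire XML document into memory and stores it in a hierarchical (tree) form, giving access to all the features of XML. Because the whole document is held in memory, it suits small to medium-sized files; for very large files an event-based parser such as SAX usually needs less memory.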

3. ElementTree:

This module stores the data of the XML document in a tree structure, that is, in a hierarchical format. We can parse data, find elements and also modify the data. Let’s discuss each of these in the next few sections.

SAX

SAX is a standard library module used for XML parsing. You can import the module with ‘import xml.sax’.

After importing, we need to create a handler by subclassing xml.sax.ContentHandler. The handler defines the callback methods that the parser invokes while reading the document. The parser calls the startDocument() and endDocument() methods at the start and end of parsing the file.

The handler is called at the beginning and end of each element to get information about the element. If the parser is in namespace mode, the methods startElementNS() and endElementNS() are called at the start and end of the element. Otherwise, it calls the methods startElement(tag, attributes) and endElement(tag), where tag is the element name and attributes holds that element's attributes.

The module provides the following methods to parse:

1. make_parser([parser_list]):

This function creates and returns a SAX XMLReader object, using the first parser found by the system. The optional argument can be any iterable of module names; if it is provided, the modules listed in parser_list are tried before the default ones. Its syntax is

xml.sax.make_parser([parser_list])

2. parse(filename_or_stream, handler, error_handler=handler.ErrorHandler())

This function creates a SAX parser and uses it to parse the XML document. The filename_or_stream can be a filename or a file object. The handler must be a SAX ContentHandler and the error_handler a SAX ErrorHandler. If the error_handler is not specified, the default ErrorHandler is used, which raises a SAXParseException on errors. The handler does all the work during parsing, and this function does not return any value.

Its syntax is:

xml.sax.parse(xmlfile,contenthandler[,errorhandler])

3. parseString(string, handler, error_handler=handler.ErrorHandler())

This method also creates a parser, but it parses the XML string passed as its first argument. Its syntax is:

xml.sax.parseString(xmlstring,contenthandler[,errorhandler])
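To make the order of the callbacks described above concrete, here is a minimal sketch that records every event the parser reports for a tiny inline document. The class name EventLogger and the sample XML are assumptions made for illustration only.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Hypothetical handler that records each SAX callback in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, tag, attributes):
        # attributes is a mapping-like object of attribute name -> value
        self.events.append(("start", tag, dict(attributes.items())))

    def endElement(self, tag):
        self.events.append(("end", tag))

logger = EventLogger()
# parseString() returns nothing; all results live on the handler
xml.sax.parseString(b'<Class><Student id="1"/></Class>', logger)
for event in logger.events:
    print(event)
```

The events come out in document order: the document start, each element's start and end, and finally the document end.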

Let us now see an example of using this module.

Example of SAX parser:

import xml.sax

class ClassHandler( xml.sax.ContentHandler ):
    def __init__(self):
        super().__init__()
        self.CurrentData = ""
        self.FirstName = ""
        self.MiddleName = ""
        self.LastName = ""
        self.RollNo = ""
        self.Section = ""

    # Gets called at the start of an element
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "Class":
            print("*****Class*****")

    # Gets called at the end of an element
    def endElement(self, tag):
        if self.CurrentData == "FirstName":
            print("FirstName:", self.FirstName)
        elif self.CurrentData == "MiddleName":
            print("MiddleName:", self.MiddleName)
        elif self.CurrentData == "LastName":
            print("LastName:", self.LastName)
        elif self.CurrentData == "RollNo":
            print("RollNo:", self.RollNo)
        elif self.CurrentData == "Section":
            print("Section:", self.Section)
        self.CurrentData = ""

    # Gets called when character data is read
    def characters(self, content):
        if self.CurrentData == "FirstName":
            self.FirstName = content
        elif self.CurrentData == "MiddleName":
            self.MiddleName = content
        elif self.CurrentData == "LastName":
            self.LastName = content
        elif self.CurrentData == "RollNo":
            self.RollNo = content
        elif self.CurrentData == "Section":
            self.Section = content

if __name__ == "__main__":
    # create an XMLReader
    parser = xml.sax.make_parser()
    # turn off namespace processing
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)

    # override the default ContentHandler
    handler = ClassHandler()
    parser.setContentHandler(handler)

    parser.parse("C:\\Users\\w7\\Desktop\\class.xml")

Output:

*****Class*****
FirstName: Abc
MiddleName:
LastName: TH
RollNo: 12
Section: A
FirstName: Prq
MiddleName: Rst
LastName: Z
RollNo: 19
Section: B
FirstName: Xyz
MiddleName: Hi
LastName: KE
RollNo: 11
Section: A
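One caveat with the handler above: a SAX parser is allowed to deliver the text of a single element in several characters() calls, so assigning content directly can keep only the last chunk. The following is a minimal sketch, under the assumption that we want one dictionary per student, that buffers the chunks and joins them when the element ends. The class name StudentCollector and the shortened inline XML are hypothetical.

```python
import xml.sax

class StudentCollector(xml.sax.ContentHandler):
    """Hypothetical handler that gathers each Student into a dict."""

    def __init__(self):
        super().__init__()
        self.students = []    # one dict per <Student>
        self._current = None  # dict for the student being read
        self._buffer = []     # text chunks of the element being read

    def startElement(self, tag, attributes):
        if tag == "Student":
            self._current = {}
        self._buffer = []

    def characters(self, content):
        # May be called more than once per element, so append
        self._buffer.append(content)

    def endElement(self, tag):
        if tag == "Student":
            self.students.append(self._current)
            self._current = None
        elif self._current is not None:
            self._current[tag] = "".join(self._buffer).strip()
        self._buffer = []

sample = b"""<Class>
  <Student><FirstName>Abc</FirstName><MiddleName type="optional"></MiddleName><RollNo>12</RollNo></Student>
  <Student><FirstName>Prq</FirstName><MiddleName type="optional">Rst</MiddleName><RollNo>19</RollNo></Student>
</Class>"""

collector = StudentCollector()
xml.sax.parseString(sample, collector)
print(collector.students)
```

Collecting the values on the handler instead of printing them also makes the parsed data available to the rest of the program.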

DOM

Document Object Model (DOM) is a cross-language API used for parsing, finding elements, and accessing values from the document. First, we convert the XML into a DOM by parsing.

We can do parsing by using either of the following functions:

1. parse():

This method takes the XML file as the argument and then parses the file. It returns an object which can be used later to find and access elements.

Example of parse() in dom:

from xml.dom import minidom
parser = minidom.parse("C:\\Users\\w7\\Desktop\\class.xml")

We can also parse by first opening the file and then passing the resulting file object as the argument to parse().

Example of first opening file and then parsing using parse():

from xml.dom import minidom

# the with block closes the file once parsing is done
with open("C:\\Users\\w7\\Desktop\\class.xml") as xml_file:
    parser = minidom.parse(xml_file)

2. parseString()

This is the other method, which parses XML given as a string. We can do parsing using the function in the following way.

Example of parseString():

parser = minidom.parseString('<name>PythonGeeks<empty/> Welcome!</name>')

Now using any of these parsers we can find the elements. When we print the parser obtained above, we see that it is a DOM Document object.

Example of printing the parser:

from xml.dom import minidom
parser = minidom.parse("C:\\Users\\w7\\Desktop\\class.xml")
print(parser)

Output:

<xml.dom.minidom.Document object at 0x0000000007B366A0>

Accessing Elements:

We can access the elements using the parser obtained and using the method getElementsByTagName().

Example of accessing elements using the parser and getElementsByTagName():

tag=parser.getElementsByTagName('FirstName')[0]
print(tag)

Output:

<DOM Element: FirstName at 0x7b3e790>

We now get a DOM element object. This refers to the first ‘FirstName’ tag of the file (see the index 0).

To get the attributes and data we can use the tags in the following way.

Example of accessing attributes:

tag= parser.getElementsByTagName('MiddleName')
print(tag[1].attributes['type'].value)

Output:

optional

In this example, we first got the list of ‘MiddleName’ tags. The index given to the tag refers to the student’s number, that is, first, second, and so on (index 1 is the second student). We then use the attributes mapping, giving the attribute name as the key, and read its value. We can also use the getAttribute() function for this purpose by passing the attribute name as the input.

We can also check whether a tag has a particular attribute by using the hasAttribute() function and passing the attribute name as the argument.
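As a short illustration of getAttribute(), here is a sketch that reads an attribute from an element that has it and from one that does not. The inline XML is a cut-down, assumed version of the class file so that the snippet runs on its own.

```python
from xml.dom import minidom

doc = minidom.parseString(
    '<Class><Student><MiddleName type="optional">Rst</MiddleName>'
    '<Section>B</Section></Student></Class>'
)
middle = doc.getElementsByTagName("MiddleName")[0]
section = doc.getElementsByTagName("Section")[0]

# getAttribute() returns the value as a string
print(middle.getAttribute("type"))
# for a missing attribute minidom returns an empty string, not an error
print(repr(section.getAttribute("type")))
```

Because a missing attribute yields an empty string rather than raising, an explicit presence check is still useful when an empty value and an absent attribute must be told apart.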

Example of checking presence of attributes:

tag= parser.getElementsByTagName('Section')
print(tag[0].hasAttribute('type'))

tag2= parser.getElementsByTagName('MiddleName')
print(tag2[0].hasAttribute('type'))

Output:

False
True

Example of accessing data:

tag= parser.getElementsByTagName('FirstName')
print(tag[2].firstChild.data)

Output:

Xyz

In this example, we first refer to the third ‘FirstName’ element using the index 2. Then we access its text through firstChild, which is the text node holding the first name.

We can also get the roll numbers of all the students by iteration.

Example of iterating over the data:

tag= parser.getElementsByTagName('RollNo')
for name in tag:
    print(name.firstChild.data)

Output:

12
19
11
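Note that firstChild is None for an element with no content, such as the empty ‘MiddleName’ of the first student, so calling .data on it would fail. Below is a minimal sketch of a guard for that case; the helper name text_of and the shortened inline XML are assumptions for illustration.

```python
from xml.dom import minidom

doc = minidom.parseString(
    "<Class>"
    "<Student><FirstName>Abc</FirstName><MiddleName></MiddleName></Student>"
    "<Student><FirstName>Prq</FirstName><MiddleName>Rst</MiddleName></Student>"
    "</Class>"
)

def text_of(element):
    """Hypothetical helper: text of an element, or '' if it is empty."""
    child = element.firstChild
    return child.data if child is not None else ""

middle_names = [text_of(e) for e in doc.getElementsByTagName("MiddleName")]
print(middle_names)
```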

ElementTree in Python

This module stores the data of the file in a tree structure and allows conversion to and from XML. As usual, we first parse the XML file and then use the result to access or modify the elements. It also allows us to write the modified tree back out as XML.

We have the following two methods to parse:

1. parse():

This function parses the XML file and returns an ElementTree object.

Example of parse() in ElementTree:

import xml.etree.ElementTree as ET
parser = ET.parse('C:\\Users\\w7\\Desktop\\class.xml')

Now to get the root of the document, here ‘Class’, we can use the getroot() method on the parser. This returns the object that holds the root data. This can be seen by printing the root as shown below.

Example of getting root:

root = parser.getroot()
print(root)

Output:

<Element 'Class' at 0x0000000007A33680>

2. fromstring():

This is the second function that we can use to parse XML. It takes the XML as a string, parses it, and returns the root element directly.

Example of parsing using fromstring():

import xml.etree.ElementTree as ET
xml='''<?xml version="1.0" encoding="UTF-8"?>
<Movies>
<Movie1>
    <title subTitle="sub">Title</title>
    <ratings>7</ratings>
    <description>
   It is a movie.
   </description>
</Movie1>
</Movies>
'''
parser = ET.fromstring(xml)

Previously we got the root object. We can get the name and the attributes of the root using the ‘tag’ and ‘attrib’ attributes respectively. For example, for the above XML string, where fromstring() already returned the root, we can get them with the below code.

Example of getting root name:

print(parser.tag)

print(parser.attrib)

Output:

Movies

{}
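Beyond the root, the same element object can be used to reach nested tags. The following is a small sketch, assuming a compact version of the Movies string, that reads the text and an attribute of a nested element with find(), get() and findtext().

```python
import xml.etree.ElementTree as ET

movies = ET.fromstring(
    '<Movies><Movie1><title subTitle="sub">Title</title>'
    '<ratings>7</ratings></Movie1></Movies>'
)

# the path is relative to the element find() is called on
title = movies.find("Movie1/title")
print(title.text)               # text inside the tag
print(title.get("subTitle"))    # attribute value, None if it is missing

# findtext() returns the text of the first match as a string
rating = int(movies.findtext("Movie1/ratings"))
print(rating)
```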

Finding Elements in Python

Let us use the original class XML file to access the values. We can get the name of the first tag inside the ‘Class’ using the below code.

Example of getting name of the first tag inside the root:

import xml.etree.ElementTree as ET
parser = ET.parse('C:\\Users\\w7\\Desktop\\class.xml')
root = parser.getroot()
print(root[0].tag)

Output:

Student

Similarly, we can get all the tags and attributes under the first ‘Student’ element.

Example of getting tags and attributes under the first student:

root = parser.getroot()
for i in root[0]:
    print(i.tag,i.attrib)

Output:

FirstName {}
MiddleName {'type': 'optional'}
LastName {}
RollNo {}
Section {}

We can also get the details of a student in the following way.

Example of getting details of a student:

root = parser.getroot()
for i in root[1]:
    print(i.text)

Output:

Prq
Rst
Z
19
B

We can also get particular information about all the students. Say we want to get the section of all the students. Then we can use the below code.

Example of getting section of all students:

for x in root.findall('Student'):
    sec =x.find('Section').text
    print(sec)

Output:

A
B
A
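The paths accepted by findall() also support simple predicates, which can be handy for filtering. As a sketch, assuming a shortened inline copy of the class data, the following selects only the students whose ‘Section’ is A.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<Class>"
    "<Student><FirstName>Abc</FirstName><Section>A</Section></Student>"
    "<Student><FirstName>Prq</FirstName><Section>B</Section></Student>"
    "<Student><FirstName>Xyz</FirstName><Section>A</Section></Student>"
    "</Class>"
)

# [Section='A'] keeps Student elements that have a Section child with text A
section_a = [s.findtext("FirstName")
             for s in root.findall("Student[Section='A']")]
print(section_a)
```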

Adding values to XML

Let us say we want to set the last name of every student to ‘RS’. We can iterate over all the ‘LastName’ tags with iter(), change the text of each one, and mark it with a new attribute using set(). Then we can either overwrite the existing XML file or create a new one using the write() function.

Example of adding to XML:

for lastname in root.iter('LastName'):
    name = 'RS'
    lastname.text = str(name)
    lastname.set('added_lastname', 'yes')
parser.write('Class1.xml')

We can change the existing XML by passing the current XML file name to the write() function. The Class1 XML file has the following data.

<Class>
    <Student>
        <FirstName>Abc</FirstName>
        <MiddleName type="optional" />
        <LastName added_lastname="yes">RS</LastName>
        <RollNo>12</RollNo>
        <Section>A</Section>
    </Student>
    
     <Student>
      <FirstName>Prq</FirstName>
      <MiddleName type="optional">Rst</MiddleName>
      <LastName added_lastname="yes">RS</LastName>
      <RollNo>19</RollNo>
      <Section>B</Section>

  </Student>
  
  <Student>
      <FirstName>Xyz</FirstName>
      <MiddleName type="optional">Hi</MiddleName>
      <LastName added_lastname="yes">RS</LastName>
      <RollNo>11</RollNo>
      <Section>A</Section>

  </Student>
</Class>
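The loop above updates every ‘LastName’ tag. If only one student should change, we can index that student first. This is a minimal sketch on an assumed two-student inline document; it serializes with tostring() instead of writing a file so that it runs anywhere.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<Class>"
    "<Student><LastName>TH</LastName></Student>"
    "<Student><LastName>Z</LastName></Student>"
    "</Class>"
)

# root[0] is the first Student; find() returns its LastName child
first_last = root[0].find("LastName")
first_last.text = "RS"
first_last.set("added_lastname", "yes")

result = ET.tostring(root, encoding="unicode")
print(result)
```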

We can also add a subtag inside the Student tag. We can do this by using the SubElement() method.

Example of adding subtag to XML:

ET.SubElement(root[0], 'Status')
for i in root.iter('Status'):
    new_status = 'Passed'
    i.text = str(new_status)
parser.write('Class2.xml')

The Class2 XML file has the following content.

<Class>
    <Student>
        <FirstName>Abc</FirstName>
        <MiddleName type="optional" />
        <LastName added_lastname="yes">RS</LastName>
        <RollNo>12</RollNo>
        <Section>A</Section>
    <Status>Passed</Status></Student>
    
     <Student>
      <FirstName>Prq</FirstName>
      <MiddleName type="optional">Rst</MiddleName>
      <LastName added_lastname="yes">RS</LastName>
      <RollNo>19</RollNo>
      <Section>B</Section>

  </Student>
  
  <Student>
      <FirstName>Xyz</FirstName>
      <MiddleName type="optional">Hi</MiddleName>
      <LastName added_lastname="yes">RS</LastName>
      <RollNo>11</RollNo>
      <Section>A</Section>

  </Student>
</Class>
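SubElement() returns the element it creates, so the text can be set directly on the return value. Here is a sketch, on an assumed inline document, that adds a ‘Status’ subtag to every student rather than only the first.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<Class>"
    "<Student><RollNo>12</RollNo></Student>"
    "<Student><RollNo>19</RollNo></Student>"
    "</Class>"
)

for student in root.findall("Student"):
    # SubElement appends the new tag as the last child and returns it
    status = ET.SubElement(student, "Status")
    status.text = "Passed"

result = ET.tostring(root, encoding="unicode")
print(result)
```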

Deleting from XML

In addition to adding values, ElementTree also allows you to delete values from the XML file. And similar to the addition of value, we can save the XML file using the write() function.

To remove an attribute we can use the pop() method.

Example of deleting attribute from XML:

root[0][1].attrib.pop('type', None)
parser.write('Class3.xml')

Here we removed the attribute in the ‘MiddleName’ (index 1) tag of the first student (index 0). The Class3 XML file has the following content.

<Class>
    <Student>
        <FirstName>Abc</FirstName>
        <MiddleName />
        <LastName added_lastname="yes">RS</LastName>
        <RollNo>12</RollNo>
        <Section>A</Section>
    <Status>Passed</Status></Student>
    
     <Student>
      <FirstName>Prq</FirstName>
      <MiddleName type="optional">Rst</MiddleName>
      <LastName added_lastname="yes">RS</LastName>
      <RollNo>19</RollNo>
      <Section>B</Section>

  </Student>
  
  <Student>
      <FirstName>Xyz</FirstName>
      <MiddleName type="optional">Hi</MiddleName>
      <LastName added_lastname="yes">RS</LastName>
      <RollNo>11</RollNo>
      <Section>A</Section>

  </Student>
</Class>

We can also remove a tag of a student using the remove() function.

Example of deleting a tag from XML:

root[1].remove(root[1][4])
parser.write('Class4.xml')

Here we are removing the tag ‘Section’ (index 4) of the second student (index 1). The Class4 XML file has the following content.

<Class>
    <Student>
        <FirstName>Abc</FirstName>
        <MiddleName />
        <LastName added_lastname="yes">RS</LastName>
        <RollNo>12</RollNo>
        <Section>A</Section>
    <Status>Passed</Status></Student>
    
     <Student>
      <FirstName>Prq</FirstName>
      <MiddleName type="optional">Rst</MiddleName>
      <LastName added_lastname="yes">RS</LastName>
      <RollNo>19</RollNo>
      </Student>
  
  <Student>
      <FirstName>Xyz</FirstName>
      <MiddleName type="optional">Hi</MiddleName>
      <LastName added_lastname="yes">RS</LastName>
      <RollNo>11</RollNo>
      <Section>A</Section>

  </Student>
</Class>

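We can also empty out a student. This can be done by using the clear() function, which removes all the child tags, text, and attributes of the element but leaves an empty ‘Student’ tag in the tree. For example,

Example of clearing a student in XML: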

root[2].clear()
parser.write('Class5.xml')

Here we are clearing the third student (index 2). The Class5 XML file has the following content.

<Class>
    <Student>
        <FirstName>Abc</FirstName>
        <MiddleName />
        <LastName added_lastname="yes">RS</LastName>
        <RollNo>12</RollNo>
        <Section>A</Section>
    <Status>Passed</Status></Student>
    
     <Student>
      <FirstName>Prq</FirstName>
      <MiddleName type="optional">Rst</MiddleName>
      <LastName added_lastname="yes">RS</LastName>
      <RollNo>19</RollNo>
      </Student>
  
  <Student /></Class>
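As the output shows, clear() leaves an empty ‘Student’ tag behind. To drop the student from the tree entirely, one option is to call remove() on the parent. This is a minimal sketch on an assumed inline document.

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<Class>"
    "<Student><FirstName>Abc</FirstName></Student>"
    "<Student><FirstName>Prq</FirstName></Student>"
    "<Student><FirstName>Xyz</FirstName></Student>"
    "</Class>"
)

# remove() is called on the parent and takes the child element itself
root.remove(root[2])

remaining = [s.findtext("FirstName") for s in root.findall("Student")]
print(remaining)
```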

Interview Questions on XML Processing in Python

Q1. Write a program to parse a string using the SAX module.
Ans. Below is the example of parsing a string using SAX:

import xml.sax

class ContextHandler( xml.sax.ContentHandler ):
    def __init__(self):
        self.CurrentData = ""
        self.description = ""

    # Called when an element starts
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "Website":
            print("*****Website*****")

    # Called when an element ends
    def endElement(self, tag):
        if self.CurrentData == "description":
            print("Description:", self.description)
        self.CurrentData = ""

    # Called when character data is read
    def characters(self, content):
        if self.CurrentData == "description":
            self.description = content

xml_str='''<?xml version="1.0" encoding="UTF-8"?>
<Website>
    <PythonGeeks>
        <description>It is a place to build your Python skills!</description>
    </PythonGeeks>
</Website>
'''
# parseString() returns None, so there is nothing to assign
xml.sax.parseString(xml_str, ContextHandler())

Output:

*****Website*****
Description: It is a place to build your Python skills!

Q2. Write a program to iterate over particular data of all the elements using DOM.
Ans. Let us consider the class xml to get the first names of all the students.

Example of iterating over data using DOM:

from xml.dom import minidom
parser = minidom.parse("C:\\Users\\w7\\Desktop\\class.xml")
tag= parser.getElementsByTagName('FirstName')
for name in tag:
    print(name.firstChild.data)

Output:

Abc
Prq
Xyz

Q3. Write a program to check if the attribute exists and then access it if it does. Use the DOM module.
Ans. Below is the example of getting attribute if exists using DOM:

from xml.dom import minidom
parser = minidom.parse("C:\\Users\\w7\\Desktop\\class.xml")
tag = parser.getElementsByTagName('MiddleName')
if tag[0].hasAttribute('type'):
    print(tag[0].getAttribute('type'))

Output:

optional

Q4. Write a program to get all the tags under the root and their attributes, if they exist using ElementTree.
Ans. Below is the example of getting tags and attributes using ElementTree:

import xml.etree.ElementTree as ET
parser = ET.parse('C:\\Users\\w7\\Desktop\\class.xml')
root = parser.getroot()
for i in root[0]:
    # print the attributes only when the tag has any
    if i.attrib:
        print("tag:", i.tag, "attrib:", i.attrib)
    else:
        print("tag:", i.tag)

Output:

tag: FirstName
tag: MiddleName attrib: {'type': 'optional'}
tag: LastName
tag: RollNo
tag: Section

Q5. Write a program to delete a tag from the xml file using ElementTree.
Ans. Let us write a program to delete the MiddleName tag (index 1) from the 3rd student.

Example of deleting a tag:

import xml.etree.ElementTree as ET
parser = ET.parse('C:\\Users\\w7\\Desktop\\class.xml')
root = parser.getroot()
root[2].remove(root[2][1])
parser.write('Class_new.xml')

Output:

<Class>
<Student>
<FirstName>Abc</FirstName>
<MiddleName type="optional" />
<LastName>TH</LastName>
<RollNo>12</RollNo>
<Section>A</Section>
</Student>
<Student>
<FirstName>Prq</FirstName>
<MiddleName type="optional">Rst</MiddleName>
<LastName>Z</LastName>
<RollNo>19</RollNo>
<Section>B</Section>
</Student>
<Student>
<FirstName>Xyz</FirstName>
<LastName>KE</LastName>
<RollNo>11</RollNo>
<Section>A</Section>
</Student>
</Class>

Conclusion

In this article, we learned about different modules to process XML files in Python. These include SAX, DOM, and ElementTree. Finally, we practiced some coding questions.

Hope you understood the concepts discussed. Happy learning.
