Archive

Posts Tagged ‘minidom’

Get XML element value in Python using minidom

29/07/2011 Leave a comment

Finally, a “development” post for my “developer” blog.

Recently, I’ve been working on some XML processing programs in Python.

The minidom module is great if you want your XML in a tree, and want tag names and attributes easily accessible, but, what happens if you want the text content inside a tag?

DOM, does not have a “tag value” concept. Instead, every bit of text in the XML, including the indentation is a “text node”, which is parsed as a separate tree element.

That means, that if you have something like this:


<name>John Smith</name>

You will get a tree with two levels: top level for “name” element, for which nodeValue will be None. This element will have a child node (second level of the tree) which will be of type TEXT_NODE an it’s values will be the text “John Smith”.

So far, so good, but, what if the value we want has some XML markup of its own?


<text>This text has <b>bold</b> and <i>italic</i> words.</text>

Now we have a complex tree on our hands with 3 levels and multiple branches.

It will look something like this:

<text>
   |______
          |-"This text has
          |-<b>
          |  |_________
          |            -"bold"
          |-"and"
          |-<i>
          |  |_________
          |            -"italic"
          --"words."

As you can see, this is a big mess, with the text split in to multiple parts on two separate tree levels.

There is no facility in minidom, to get the value of our <text> tag directly.

There is however, a way around it, that is simple but not obvious: you need to “flatten” the desired tag in to an XML string, then strip the tag it self from the string and you will have a clean value.

Here is the code:

def get_tag_value(node):
    """retrieves value of given XML node
    parameter:
    node - node object containing the tag element produced by minidom

    return:
    content of the tag element as string
    """

    xml_str = node.toxml() # flattens the element to string

    # cut off the base tag to get clean content:
    start = xml_str.find('>')
    if start == -1:
        return ''
    end = xml_str.rfind('<')
    if end < start:
        return ''

    return xml_str[start + 1:end]

Just pass the node you want the value of to the function and it will give you back the value as a string, including any internal markup.

I place this code in the public domain, which means you can use it anywhere any way you want with no strings attached.