minidom | Lev's developer blog

Get XML element value in Python using minidom

29/07/2011 levdev

Finally, a “development” post for my “developer” blog.

Recently, I’ve been working on some XML processing programs in Python.

The minidom module is great if you want your XML in a tree, with tag names and attributes easily accessible. But what happens if you want the text content inside a tag?

DOM does not have a “tag value” concept. Instead, every bit of text in the XML, including the indentation, is a “text node”, which is parsed as a separate tree element.

That means that if you have something like this:


<name>John Smith</name>

You will get a tree with two levels: the top level is the “name” element, for which nodeValue will be None. This element has a child node (the second level of the tree) of type TEXT_NODE, and its value is the text “John Smith”.
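To see this for yourself, here is a minimal sketch using `parseString` from the standard `xml.dom.minidom` module. The variable names and the second, indented sample document are my own additions for illustration.

```python
from xml.dom import minidom

doc = minidom.parseString("<name>John Smith</name>")
name = doc.documentElement           # the <name> element (first level)

print(name.nodeValue)                # element nodes carry no value of their own
child = name.firstChild              # the text node (second level)
print(child.nodeType == minidom.Node.TEXT_NODE)
print(child.data)                    # the actual text lives here

# Indentation between tags is parsed into text nodes as well.
indented = minidom.parseString(
    "<person>\n  <name>John Smith</name>\n</person>")
kinds = [n.nodeType for n in indented.documentElement.childNodes]
print(kinds)                         # text node, element node, text node
```

Note that the indented document yields three children for `<person>`, not one: the whitespace before and after `<name>` each becomes a text node.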

So far, so good. But what if the value we want has some XML markup of its own?


<text>This text has <b>bold</b> and <i>italic</i> words.</text>

Now we have a complex tree on our hands with 3 levels and multiple branches.

It will look something like this:

<text>
   |______
          |-"This text has "
          |-<b>
          |  |_________
          |            -"bold"
          |-" and "
          |-<i>
          |  |_________
          |            -"italic"
          --" words."

As you can see, this is a big mess, with the text split into multiple parts on two separate tree levels.
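The split can be checked with a short sketch that walks the direct children of the `<text>` element. The `children` list and its tuple layout are just an illustration I made up, not anything minidom provides.

```python
from xml.dom import minidom

xml = "<text>This text has <b>bold</b> and <i>italic</i> words.</text>"
text = minidom.parseString(xml).documentElement

# Describe each direct child: text nodes by their data,
# elements by tag name plus the text of their own first child.
children = []
for node in text.childNodes:
    if node.nodeType == minidom.Node.TEXT_NODE:
        children.append(("text", node.data))
    else:
        children.append(("element", node.tagName, node.firstChild.data))

for entry in children:
    print(entry)
```

The `<text>` element ends up with five direct children, three text nodes and two elements, and the words “bold” and “italic” sit one level further down.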

There is no facility in minidom to get the value of our <text> tag directly.

There is, however, a way around it that is simple but not obvious: you need to “flatten” the desired tag into an XML string, then strip the tag itself from the string, and you will have a clean value.

Here is the code:

def get_tag_value(node):
    """retrieves value of given XML node
    parameter:
    node - node object containing the tag element produced by minidom

    return:
    content of the tag element as string
    """

    xml_str = node.toxml() # flattens the element to string

    # cut off the base tag to get clean content:
    start = xml_str.find('>')
    if start == -1:
        return ''
    end = xml_str.rfind('<')
    if end < start:
        return ''

    return xml_str[start + 1:end]

Just pass the node you want the value of to the function and it will give you back the value as a string, including any internal markup.
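Here is a small usage sketch. The function body is repeated from above only so that the example runs on its own; the sample documents and variable names are assumptions added for illustration.

```python
from xml.dom import minidom

def get_tag_value(node):
    """Same logic as the function above, repeated so this runs standalone."""
    xml_str = node.toxml()
    start = xml_str.find('>')
    if start == -1:
        return ''
    end = xml_str.rfind('<')
    if end < start:
        return ''
    return xml_str[start + 1:end]

doc = minidom.parseString(
    "<text>This text has <b>bold</b> and <i>italic</i> words.</text>")
node = doc.getElementsByTagName("text")[0]
value = get_tag_value(node)
print(value)

# The result is still XML text, so special characters stay escaped.
firm = minidom.parseString("<name>Smith &amp; Sons</name>").documentElement
escaped = get_tag_value(firm)
print(escaped)

# An element without content serializes as <empty/> and yields ''.
empty = get_tag_value(minidom.parseString("<empty/>").documentElement)
```

One thing to keep in mind: because the value comes from `toxml()`, characters such as `&` come back as entities (`&amp;`), so unescape them if you need plain text rather than markup.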

I place this code in the public domain, which means you can use it anywhere any way you want with no strings attached.

Categories: Code, Tips and tricks Tags: Development, minidom, Python, XML
