Get XML element value in Python using minidom
Finally, a “development” post for my “developer” blog.
Recently, I’ve been working on some XML processing programs in Python.
The minidom module is great if you want your XML in a tree, with tag names and attributes easily accessible. But what happens if you want the text content inside a tag?
DOM does not have a “tag value” concept. Instead, every bit of text in the XML, including the indentation, is a “text node”, which is parsed as a separate node in the tree.
That means that if you have something like this:
<name>John Smith</name>
You will get a tree with two levels: the top level is the “name” element, for which nodeValue will be None. This element will have a child node (the second level of the tree) of type TEXT_NODE, and its value will be the text “John Smith”.
So far, so good. But what if the value we want has some XML markup of its own?
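To see the two levels for yourself, here is a small sketch that parses the one-line example with minidom's parseString and looks at both nodes. The variable names are just mine for illustration.

```python
from xml.dom import minidom

# Parse the one-element example from above.
doc = minidom.parseString("<name>John Smith</name>")

# Top level: the <name> element itself. It has no value of its own.
name = doc.documentElement
print(name.nodeName)    # name
print(name.nodeValue)   # None

# Second level: the text node that holds the actual text.
child = name.firstChild
print(child.nodeType == minidom.Node.TEXT_NODE)  # True
print(child.nodeValue)  # John Smith
```

For a text node, nodeValue and data give you the same string.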
<text>This text has <b>bold</b> and <i>italic</i> words.</text>
Now we have a complex tree on our hands with 3 levels and multiple branches.
It will look something like this:
<text>
|______
       |-"This text has "
       |-<b>
       |  |_________
       |            -"bold"
       |-" and "
       |-<i>
       |  |_________
       |            -"italic"
       --" words."
As you can see, this is a big mess, with the text split into multiple parts on two separate tree levels.
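If you want to check how the text gets split up, a rough sketch like the one below walks the parsed tree and prints one line per node. The describe helper is a hypothetical name I made up for this example; it is not part of minidom.

```python
from xml.dom import minidom

def describe(node, depth=0):
    """Return one indented line per node: element names in angle
    brackets, text nodes as quoted strings."""
    indent = "  " * depth
    if node.nodeType == node.TEXT_NODE:
        lines = [indent + repr(node.data)]
    else:
        lines = [indent + "<" + node.nodeName + ">"]
    # Text nodes have an empty childNodes list, so recursion stops there.
    for child in node.childNodes:
        lines.extend(describe(child, depth + 1))
    return lines

doc = minidom.parseString(
    "<text>This text has <b>bold</b> and <i>italic</i> words.</text>"
)
tree_lines = describe(doc.documentElement)
print("\n".join(tree_lines))
```

The <text> element ends up with five direct children (three text nodes and two elements), and the words “bold” and “italic” sit one level further down.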
There is no facility in minidom to get the value of our <text> tag directly.
There is, however, a way around it that is simple but not obvious: you need to “flatten” the desired tag into an XML string, then strip the tag itself from the string, and you will have a clean value.
Here is the code:
    def get_tag_value(node):
        """retrieves value of given XML node
        parameter:
        node - node object containing the tag element produced by minidom

        return:
        content of the tag element as string
        """

        xml_str = node.toxml()  # flattens the element to string

        # cut off the base tag to get clean content:
        start = xml_str.find('>')
        if start == -1:
            return ''
        end = xml_str.rfind('<')
        if end < start:
            return ''

        return xml_str[start + 1:end]
Just pass the node you want the value of to the function and it will give you back the value as a string, including any internal markup.
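Here is a quick usage sketch. The function is repeated so the snippet runs on its own, and the sample document and the first helper are made up for this example.

```python
from xml.dom import minidom

def get_tag_value(node):
    """Same function as above, repeated so this snippet is self-contained."""
    xml_str = node.toxml()
    start = xml_str.find('>')
    if start == -1:
        return ''
    end = xml_str.rfind('<')
    if end < start:
        return ''
    return xml_str[start + 1:end]

# Hypothetical sample document, invented for this example.
SAMPLE = (
    "<person>"
    "<name>John Smith</name>"
    "<text>This text has <b>bold</b> and <i>italic</i> words.</text>"
    "<note>Tom &amp; Jerry</note>"
    "<empty/>"
    "</person>"
)

doc = minidom.parseString(SAMPLE)

def first(tag):
    """Hypothetical helper: first element with the given tag name."""
    return doc.getElementsByTagName(tag)[0]

name_value = get_tag_value(first("name"))
text_value = get_tag_value(first("text"))
note_value = get_tag_value(first("note"))
empty_value = get_tag_value(first("empty"))

print(name_value)   # John Smith
print(text_value)   # This text has <b>bold</b> and <i>italic</i> words.
print(note_value)   # Tom &amp; Jerry
print(repr(empty_value))  # ''
```

One thing to keep in mind: because the value comes from toxml(), it is still XML, so special characters stay escaped (the ampersand comes back as &amp;amp; rather than a bare &amp;). A self-closing tag with no content gives you an empty string.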
I place this code in the public domain, which means you can use it anywhere any way you want with no strings attached.