Get XML element value in Python using minidom
Finally, a “development” post for my “developer” blog.
Recently, I’ve been working on some XML processing programs in Python.
The minidom module is great if you want your XML in a tree with tag names and attributes easily accessible. But what happens if you want the text content inside a tag?
DOM does not have a “tag value” concept. Instead, every bit of text in the XML, including the indentation, is a “text node”, which is parsed as a separate tree element.
That means that if you have something like this:
<name>John Smith</name>
You will get a tree with two levels: the top level is the “name” element, for which nodeValue will be None. This element will have one child node (the second level of the tree) of type TEXT_NODE, and its value will be the text “John Smith”.
So far, so good. But what if the value we want has some XML markup of its own?
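To see the two levels for yourself, here is a minimal sketch using the standard library's xml.dom.minidom. The variable names (doc, name, child) are just illustrative choices, not anything from the original program.

```python
from xml.dom import minidom

# Parse the one-line document from the example above.
doc = minidom.parseString("<name>John Smith</name>")
name = doc.documentElement        # top level: the <name> element

print(name.nodeName)              # the tag name
print(name.nodeValue)             # None: elements carry no value of their own

child = name.firstChild           # second level: the text node
print(child.nodeType == minidom.Node.TEXT_NODE)
print(child.nodeValue)            # the actual text
```

The text is only reachable through the child node, never through the element itself.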
<text>This text has <b>bold</b> and <i>italic</i> words.</text>
Now we have a complex tree on our hands with 3 levels and multiple branches.
It will look something like this:
<text>
 |
 |- "This text has "
 |- <b>
 |   |
 |   -- "bold"
 |- " and "
 |- <i>
 |   |
 |   -- "italic"
 -- " words."
As you can see, this is a big mess, with the text split into multiple parts on two separate tree levels.
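If you want to check this structure on your own data, a small recursive walk will do it. This is only a sketch: dump is a hypothetical helper I made up for illustration, and it assumes the tree contains only element and text nodes.

```python
from xml.dom import minidom

def dump(node, depth=0):
    """Hypothetical helper: return (depth, label) rows for node and its descendants."""
    if node.nodeType == minidom.Node.TEXT_NODE:
        label = repr(node.data)            # show text with its exact whitespace
    else:
        label = "<%s>" % node.nodeName     # assume anything else is an element
    rows = [(depth, label)]
    for child in node.childNodes:          # text nodes simply have no children
        rows.extend(dump(child, depth + 1))
    return rows

doc = minidom.parseString(
    "<text>This text has <b>bold</b> and <i>italic</i> words.</text>"
)
rows = dump(doc.documentElement)
for depth, label in rows:
    print("  " * depth + label)
```

The text pieces show up at depth 1 (directly under the text element) and at depth 2 (inside b and i), which is exactly the split described above.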
There is no facility in minidom to get the value of our <text> tag directly.
There is, however, a way around it that is simple but not obvious: you need to “flatten” the desired tag into an XML string, then strip the tag itself from the string, and you will have a clean value.
Here is the code:
def get_tag_value(node):
    """Retrieve the value of the given XML node.

    parameter:
        node - node object containing the tag element produced by minidom
    return:
        content of the tag element as a string
    """
    xml_str = node.toxml()  # flattens the element to a string
    # cut off the base tag to get clean content:
    start = xml_str.find('>')
    if start == -1:
        return ''
    end = xml_str.rfind('<')
    if end < start:
        return ''
    return xml_str[start + 1:end]
Just pass the node you want the value of to the function and it will give you back the value as a string, including any internal markup.
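Here is a short usage sketch. The function is repeated in condensed form so the snippet runs on its own. The sample documents and the variable names are illustrative only. The last part shows a different technique (joining the serialized children), which is not the approach described above, purely for comparison.

```python
from xml.dom import minidom

def get_tag_value(node):
    # Condensed copy of the function above, so this snippet is self-contained.
    xml_str = node.toxml()
    start = xml_str.find('>')
    end = xml_str.rfind('<')
    if start == -1 or end < start:
        return ''
    return xml_str[start + 1:end]

doc = minidom.parseString(
    "<text>This text has <b>bold</b> and <i>italic</i> words.</text>"
)
text_node = doc.getElementsByTagName("text")[0]
value = get_tag_value(text_node)
print(value)

# An element with no content is serialized as <empty/>, so the result is ''.
empty = get_tag_value(minidom.parseString("<empty/>").documentElement)

# Alternative technique for comparison: serialize each child and join the pieces.
alt = "".join(child.toxml() for child in text_node.childNodes)
```

For this input both approaches give the same string. The string-slicing version just avoids touching the child nodes at all.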
I place this code in the public domain, which means you can use it anywhere, any way you want, with no strings attached.
