An Introduction to Python and XML

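Python offers a growing range of tools for working with XML. This article is aimed at readers who are new to processing XML with Python, and it introduces some basic XML processing concepts in tutorial form. It assumes that the reader already knows the common XML terminology.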

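Prerequisites

All examples in this article are based on Python 2.6.5 and PyXML 0.8.1. Every example in this tutorial needs the minidom module, so your .py file should contain something like the following:

import xml.dom.minidom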

Of course, you can also import only the classes you need from minidom. To see what the module contains, use:

dir(xml.dom.minidom)
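
As a small illustration of importing only what you need, the sketch below (written for Python 3) pulls in just `Document` and `parseString` and collects the public names of the module. Skipping names that begin with an underscore is a choice made here for readability, not something the module requires.

```python
from xml.dom.minidom import Document, parseString
import xml.dom.minidom

# Public names offered by the module (names starting with "_" are skipped).
public_names = [name for name in dir(xml.dom.minidom) if not name.startswith("_")]

# The two names imported above can be used without the module prefix.
empty_doc = Document()
parsed_doc = parseString("<business/>")

print("Document" in public_names, "parseString" in public_names)
print(parsed_doc.documentElement.tagName)
```

A freshly created `Document` has no root element yet, so `empty_doc.documentElement` is `None` until an element is appended.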

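Creating an XML Document

First, as mentioned above, import the minidom module:

import xml.dom.minidom

To create an XML document, we need an instance of the Document class:

def get_a_document():
    doc = xml.dom.minidom.Document()

At this point the Document is still empty. Next we will add some content to it.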

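Elements

An XML document has exactly one root element; every other element and all text content sit inside it. Here we create an XML document that describes part of a company. The root element is named "business" and its namespace is http://www.boddie.org.uk/paul/business. The code, continuing the body of get_a_document, is:

    business_element = doc.createElementNS("http://www.boddie.org.uk/paul/business", "business")

The element now exists but is not yet part of the Document, so we append it:

    doc.appendChild(business_element)

Finally, at the end of the function, return the objects we created:

    return doc, business_element

For convenience, here is the code above put together: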

import xml.dom.minidom

def get_a_document():
    # Create an empty document, then attach a namespaced root element to it.
    doc = xml.dom.minidom.Document()
    business_element = doc.createElementNS("http://www.boddie.org.uk/paul/business", "business")
    doc.appendChild(business_element)
    return doc, business_element

After this function runs, the root element has been added to the Document, and we can inspect it:

>>> doc, business_element = get_a_document()

>>> doc.childNodes

[<DOM Element: business at 0x20ad0d0>]

You can also look inside this node list, for example at the namespace we set on the element:

>>> doc.childNodes[0].namespaceURI

'http://www.boddie.org.uk/paul/business'

Finally, check the element's local name:

>>> doc.childNodes[0].localName

'business'

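Here business is the root element name we just set. The local name matters because it tells us which part of the company an element describes. In the same way as before, we can write a function that adds another element, this time one named location:

def add_a_location(doc, business_element):
    location_element = doc.createElementNS("http://www.boddie.org.uk/paul/business", "location")

Append the new element as a child of the root element:

    business_element.appendChild(location_element)

Finally, return it:

    return location_element

With this function we can add a new element under the root element: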

>>> doc, business_element = get_a_document()

>>> location_element = add_a_location(doc, business_element)

>>> doc.childNodes

[<DOM Element: business at 0x20dc5f8>]

As before, we can look at the details of the element in this list:

>>> doc.childNodes[0].namespaceURI

'http://www.boddie.org.uk/paul/business'

>>> doc.childNodes[0].localName

'business'
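
The session above only inspects the root element. The following is a minimal, self-contained sketch (Python 3) of reaching the new child element as well, using minidom's `documentElement`, `childNodes`, `parentNode` and `getElementsByTagNameNS`. The constant `BUSINESS_NS` is introduced here only for brevity and is not part of the original code.

```python
import xml.dom.minidom

BUSINESS_NS = "http://www.boddie.org.uk/paul/business"  # assumed helper constant

doc = xml.dom.minidom.Document()
business_element = doc.createElementNS(BUSINESS_NS, "business")
doc.appendChild(business_element)
location_element = doc.createElementNS(BUSINESS_NS, "location")
business_element.appendChild(location_element)

# The root element is also reachable as doc.documentElement.
root = doc.documentElement
print(root.localName)

# The child sits inside the root element, not directly under the document.
child = root.childNodes[0]
print(child.localName, child.namespaceURI)

# Namespace-aware lookup by namespace URI and local name.
found = doc.getElementsByTagNameNS(BUSINESS_NS, "location")
print(len(found), found[0] is location_element)
```

This explains why `doc.childNodes` still shows a single entry after the location element is added: the new element is one level further down.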

Text

Text is the actual content of an XML document and normally sits between an element's tags. Continuing the example, we add a "surroundings" element as a child of location to describe the area around that location. Again we write a function:

def add_surroundings(doc, location_element):
    surroundings_element = doc.createElementNS("http://www.boddie.org.uk/paul/business", "surroundings")
    location_element.appendChild(surroundings_element)

Then create the text content:

    description = doc.createTextNode("A quiet, scenic park with lots of wildlife.")

Append this text node to surroundings:

    surroundings_element.appendChild(description)

Return the object:

    return surroundings_element

Starting from the root element, we can navigate down to the child nodes:

>>> surroundings_element = add_surroundings(doc, location_element)

>>> doc.childNodes[0].childNodes[0].childNodes[0].childNodes[0]

And of course we can read the whole text value:

>>> doc.childNodes[0].childNodes[0].childNodes[0].childNodes[0].nodeValue

'A quiet, scenic park with lots of wildlife.'
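
Indexing four levels deep works, but it breaks as soon as the structure changes. As a hedged sketch (Python 3), a small helper can collect the text that sits directly inside an element. The name `get_text` is hypothetical; it relies only on the standard `nodeType` and `data` attributes of DOM nodes.

```python
import xml.dom.minidom

def get_text(element):
    """Concatenate the data of all text nodes directly under element."""
    parts = []
    for node in element.childNodes:
        if node.nodeType == node.TEXT_NODE:
            parts.append(node.data)
    return "".join(parts)

doc = xml.dom.minidom.parseString(
    "<location><surroundings>A quiet, scenic park with lots of wildlife."
    "</surroundings></location>"
)
surroundings = doc.getElementsByTagName("surroundings")[0]
print(get_text(surroundings))
```

Text nested inside child elements is deliberately ignored here, so calling the helper on the location element returns an empty string.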

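In the same way, we can add more text:

def add_more_surroundings(doc, surroundings_element):
    description = doc.createTextNode(" It's usually sunny here, too.")
    surroundings_element.appendChild(description)

Let's check the result. The function is called twice here, so the element ends up with three separate text nodes:

>>> add_more_surroundings(doc, surroundings_element)

>>> add_more_surroundings(doc, surroundings_element)

>>> len(surroundings_element.childNodes)

3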

Sometimes we want these pieces of text merged into a single node. How do we do that?

def fix_element(element):
    element.normalize()

The result:

>>> fix_element(surroundings_element)

>>> surroundings_element.childNodes[0].nodeValue

"A quiet, scenic park with lots of wildlife. It's usually sunny here, too. It's usually sunny here, too."

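To make the effect of `normalize()` easier to see, here is a minimal stand-alone sketch (Python 3) that counts the child nodes before and after the call. It uses two text nodes rather than the three in the session above, and the variable names are chosen for this example only.

```python
import xml.dom.minidom

doc = xml.dom.minidom.Document()
element = doc.createElement("surroundings")
doc.appendChild(element)

# Two separate appendChild calls leave two adjacent text nodes.
element.appendChild(doc.createTextNode("A quiet, scenic park with lots of wildlife."))
element.appendChild(doc.createTextNode(" It's usually sunny here, too."))
count_before = len(element.childNodes)

# normalize() merges adjacent text nodes in place and returns None.
element.normalize()
count_after = len(element.childNodes)

print(count_before, count_after)
print(element.childNodes[0].nodeValue)
```

Because the merge happens in place, there is nothing to assign from the call; the element itself now holds one combined text node.
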
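Attributes

Elements in XML often carry attributes. For example, the "location" element from before can have a child element called "building", and that element can have an attribute called "name". First, create the building element:

def add_building(doc, location_element):
    building_element = doc.createElementNS("http://www.boddie.org.uk/paul/business", "building")

Append it to location and return it:

    location_element.appendChild(building_element)
    return building_element

Note that "building" is also a child of "location" and comes after "surroundings". We can confirm this as follows: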

>>> building_element = add_building(doc, location_element)

>>> location_element.childNodes

[<DOM Element: surroundings at 136727844>, <DOM Element: building at 136286548>]

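After that we can add the attribute directly:

def name_building(building_element):
    building_element.setAttributeNS("http://www.boddie.org.uk/paul/business", "business:name", "Ivory Tower")

An attribute is specified by its namespace, its qualified name, and its text value, and further attributes can be added the same way. To read the attribute back, pass the namespace and the local name: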

>>> name_building(building_element)

>>> building_element.getAttributeNS("http://www.boddie.org.uk/paul/business", "name")

'Ivory Tower'
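
One detail in the session above is easy to miss: the attribute is set with its qualified name ("business:name") but read back with its local name ("name"). The sketch below (Python 3) shows this side by side, together with `hasAttributeNS` and the behaviour for a missing attribute. The constant `BUSINESS_NS` and the attribute name "floors" are made up for this example.

```python
import xml.dom.minidom

BUSINESS_NS = "http://www.boddie.org.uk/paul/business"  # assumed helper constant

doc = xml.dom.minidom.Document()
building = doc.createElementNS(BUSINESS_NS, "building")
doc.appendChild(building)

# Set with the qualified name (prefix:local name) ...
building.setAttributeNS(BUSINESS_NS, "business:name", "Ivory Tower")

# ... but look up with the namespace URI and the local name only.
name = building.getAttributeNS(BUSINESS_NS, "name")
has_name = building.hasAttributeNS(BUSINESS_NS, "name")

# In minidom a missing attribute yields an empty string rather than an error.
missing = building.getAttributeNS(BUSINESS_NS, "floors")

print(name, has_name, repr(missing))
```

Since a missing attribute and an attribute set to an empty string both come back as "", `hasAttributeNS` is the safer check when that difference matters.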

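Writing XML Documents

Once the XML content is ready, you will usually want to save it, typically by writing it to a file. A simple way is to use one more module, xml.dom.ext, which is part of PyXML rather than the Python standard library:

import xml.dom.ext
import xml.dom.minidom

With these two modules imported, many functions and classes become available. Here we use the PrettyPrint function to output well-formatted XML:

def write_to_screen(doc):
    xml.dom.ext.PrettyPrint(doc)

Usage: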

>>> from XML_intro.Writing import *

>>> write_to_screen(doc)

<?xml version='1.0' encoding='UTF-8'?>
<business xmlns='http://www.boddie.org.uk/paul/business' xmlns:business='http://www.boddie.org.uk/paul/business'>
  <location>
    <surroundings>A quiet, scenic park with lots of wildlife.</surroundings>
    <building business:name='Ivory Tower'/>
  </location>
</business>
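
If PyXML is not installed, `xml.dom.ext` will not be importable. A rough standard-library substitute for printing to the screen is minidom's own `toprettyxml` method. This is a different technique from the PrettyPrint call used in this tutorial, and its output differs in details such as quoting. The sketch below targets Python 3; the constant `BUSINESS_NS` is an assumption of this example. Note that minidom does not write namespace declarations by itself, so they are added by hand here.

```python
import xml.dom.minidom

BUSINESS_NS = "http://www.boddie.org.uk/paul/business"  # assumed helper constant

doc = xml.dom.minidom.Document()
business = doc.createElementNS(BUSINESS_NS, "business")
# minidom does not emit xmlns declarations on its own, so set them explicitly.
business.setAttribute("xmlns", BUSINESS_NS)
business.setAttribute("xmlns:business", BUSINESS_NS)
doc.appendChild(business)

location = doc.createElementNS(BUSINESS_NS, "location")
business.appendChild(location)
surroundings = doc.createElementNS(BUSINESS_NS, "surroundings")
surroundings.appendChild(doc.createTextNode("A quiet, scenic park with lots of wildlife."))
location.appendChild(surroundings)
building = doc.createElementNS(BUSINESS_NS, "building")
building.setAttributeNS(BUSINESS_NS, "business:name", "Ivory Tower")
location.appendChild(building)

# toprettyxml() returns the indented document as a string.
pretty = doc.toprettyxml(indent="  ")
print(pretty)
```

The related `writexml` method takes a writable file object as its first argument, which makes it a candidate when the output should go to a file instead of a string.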

That only prints to the screen. Finally, write the output to a file:

def write_to_file(doc, name="/tmp/doc.xml"):
    file_object = open(name, "w")
    xml.dom.ext.PrettyPrint(doc, file_object)
    file_object.close()

Or, more simply, at the cost of leaving the file to be closed by the interpreter:

def write_to_file_easier(doc, name="/tmp/doc.xml"):
    xml.dom.ext.PrettyPrint(doc, open(name, "w"))

>>> write_to_file(doc)

Reading XML Documents

Next, how to read the XML file saved above:

import xml.dom.minidom

def get_a_document(name="/tmp/doc.xml"):
    return xml.dom.minidom.parse(name)

If you already have an open, readable file object containing the XML:

def get_a_document_from_file(file_object):
    return xml.dom.minidom.parse(file_object)
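
To try the reading functions without a file on disk, the following sketch (Python 3) feeds the same XML to `parse` through an in-memory file object and to `parseString` directly. The XML text is a shortened version of the document built in this tutorial, and `io.BytesIO` merely stands in for a real open file.

```python
import io
import xml.dom.minidom

xml_text = (
    "<business xmlns='http://www.boddie.org.uk/paul/business'>"
    "<location><surroundings>A quiet, scenic park with lots of wildlife."
    "</surroundings></location></business>"
)

# parse() accepts a file name or a readable file object.
doc_from_file = xml.dom.minidom.parse(io.BytesIO(xml_text.encode("utf-8")))

# parseString() is the shortcut when the XML is already held in a string.
doc_from_string = xml.dom.minidom.parseString(xml_text)

root = doc_from_file.documentElement
print(root.localName, root.namespaceURI)
```

After parsing, the namespace and local name of each element are available just as they were on the document built by hand.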

For more information, see: http://www.boddie.org.uk/python/XML_intro.html