XML Processing with Python
December 6, 1999
As part of our XML'99 coverage, we are pleased to bring you this taster from the "Working with XML in Python" tutorial led by Sean McGrath.
Introduction
A century ago, when HTML and CGI ruled the waves, Perl dominated the Web programming scene. As the transition to XML on the Web gathers pace, competition for the hearts and minds of Web developers is heating up. One language attracting a lot of attention at the moment is Python.
In this article we will take a high-level look at Python. We will use the time-honored "Hello world" example program to illustrate the principal features of the language. We will then examine the XML processing capabilities of Python.
Python is free
Python is free. You will find downloadable source code plus pre-compiled executables on python.org. As you know, "free" is one of those words that is often heavily loaded on the Internet. Fear not. Python is free with a capital "F". You are free to do essentially anything you like with Python, including making commercial use of it or of derivatives created from it.
Python is interpreted
Python is an interpreted language. Programs can execute directly from the plain text files that house them. Typically Python files have a .py extension. There is no compilation phase as far as the programmer is concerned. Just edit and run!
Python is portable
Python is portable. It runs on basically every computing platform of note, from mainframes to Palm Pilots and everything in between. Python uses a virtual machine architecture, similar in concept to Java's virtual machine. The Python interpreter "compiles" programs to virtual machine code on-the-fly. These compiled files (typically having a .pyc extension) are also portable. That is to say, if you wish to keep your source files hidden from your end-users you can simply ship the compiled .pyc files.
Python is easy to understand
Python is very easy to understand. Here is a Python program that prints the string "Hello world":
print "Hello world"
I think you will agree that programming a "Hello world" application cannot get much simpler than that! To execute this program, you put it in a text file, say Hello.py, and feed it to the Python interpreter like this:
python Hello.py
The output is, surprise, surprise:
Hello world
Note the complete lack of syntactic baggage in the Hello.py program. There are no mandatory keywords or semicolons required to get this simple job done. This spartan, no-nonsense approach to syntax is one of the hallmarks of Python and applies equally well to large Python programs.
Python is interactive
By invoking the Python interpreter (typically by typing python on a UNIX/Linux system, or running the "IDLE" application on Windows), you will find yourself in an environment where you can execute Python statements interactively. As an example, here is the "Hello world" application again:
>>> print "Hello world"
This will output:
Hello world
Note that the ">>>" above is Python's command prompt. The interactive mode is an excellent environment for playing around with Python. It is also indispensable as a fully programmable calculator!
Python is WYSIWYG
Python is sometimes referred to as a WYSIWYG programming language. This is because the indentation of Python code controls how the code is executed. Python does not have begin/end keywords or braces for grouping code statements. It simply does not need them. Take a look at the following Python fragment:
if x > y:
    print x
    if y > z:
        print y
        print z
else:
    print z
The indentation of the code is used to control how statements are grouped for execution purposes. There can be no ambiguity as to which if clause is associated with the else clause in the above code because both statements have the same level of indentation.
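To see the grouping rule at work, here is a minimal sketch in present-day Python 3 rather than the Python 1.5.2 syntax used in this article. The function name describe and the sample values are assumptions made for illustration. It has the same shape as the fragment above, with the else lined up under the outer if, and it collects the values in a list instead of printing them so that the result is easy to inspect:

```python
def describe(x, y, z):
    """Return, in order, the values that each branch would print."""
    printed = []
    if x > y:
        printed.append(x)
        if y > z:
            # Both statements belong to the inner "if"
            printed.append(y)
            printed.append(z)
    else:
        # Lined up with the outer "if", so this runs only when x <= y
        printed.append(z)
    return printed


print(describe(3, 2, 1))
print(describe(3, 2, 5))
print(describe(1, 2, 9))
```

In the second call the inner test fails and nothing further is collected: the else belongs to the outer if, so it is not consulted.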
Functions in Python
We can turn the "Hello world" program into a Python function like this:
def Hello():
    print "Hello world"
Note that statements within the body of a function are indented beneath the def Hello() line which introduces the function. The parentheses are a placeholder for function parameters. Here is a function that prints its parameters x and y as well as the string "Hello world":
def Hello(x,y):
    print "Hello world",x,y
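For readers following along with a current interpreter, here is a rough Python 3 equivalent. This is a sketch under assumptions: the lowercase name hello and the sample arguments are made up for illustration, and the function returns the text instead of printing it so the caller decides what to do with it.

```python
def hello(x, y):
    # Join the greeting and both parameters with single spaces,
    # which is how the comma-separated print statement spaces its output
    return " ".join(["Hello world", str(x), str(y)])


greeting = hello("from", "Python")
print(greeting)
```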
Python modules
A Python program typically consists of a number of modules. Any Python source file can serve as a module and be imported into another Python program. For example, assuming the Hello function above is housed in the file Greeting.py we can import the function into a Python program and call it as follows:
# Import the Hello function from the Greeting module
from Greeting import Hello
# Call the Hello function
Hello()
Programs as modules to larger programs
Python makes it easy to write programs that can be used both as stand-alone programs and as modules to other programs.
Here is a modified version of Greeting.py which will print "Hello world" but can also still be imported into other programs:
def Hello():
    print "Hello world"

if __name__ == "__main__":
    # Test Hello Function if running as
    # main program
    Hello()
Note the special __name__ variable above. This variable is automatically set to "__main__" when a program is being executed directly. If it is being imported into another program, __name__ is set to the name of the module, which in this case would be "Greeting".
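The two values of __name__ can be observed directly. The following is a sketch in present-day Python 3, using the standard runpy and importlib modules, neither of which is part of the Python 1.5.2 described in this article. The one-line stand-in for Greeting.py and the names SOURCE, seen_name and observe_name are assumptions made for illustration. The stand-in file simply records whatever __name__ is when it executes:

```python
import importlib.util
import os
import runpy
import tempfile

# A stand-in for Greeting.py that just records the value of __name__
SOURCE = "seen_name = __name__\n"


def observe_name():
    """Return __name__ as seen when run directly and when imported."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "Greeting.py")
        with open(path, "w") as f:
            f.write(SOURCE)

        # Case 1: execute the file as the main program
        as_script = runpy.run_path(path, run_name="__main__")["seen_name"]

        # Case 2: load the same file as a module called Greeting
        spec = importlib.util.spec_from_file_location("Greeting", path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        as_module = module.seen_name

    return as_script, as_module


print(observe_name())
```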
Python is object-oriented
Python is a very object-oriented language. Here is an extended version of the "Hello world" program, called Message.py, that can print any message via MessageHolder objects:
# Create a class called MessageHolder
class MessageHolder:
    # Constructor - called automatically
    # when an object of this class is created
    def __init__(self,msg):
        self.msg = msg

    # Function to return the stored message string
    def getMsg(self):
        return self.msg
Note how indentation is used to structure the source code. The getMsg function is associated with objects of the MessageHolder class because it is indented beneath the class MessageHolder line. Functions associated with objects are more generally known as methods.
Suppose now that I need a variation on the MessageHolder class in which all messages are returned in uppercase. I can do that by subclassing MessageHolder, specifying the class I wish to inherit from in parentheses after the class name:
# Import the string module, needed for string.upper
import string
# Import existing MessageHolder class from Message.py
from Message import MessageHolder

# Create a sub-class of MessageHolder called MessageUpper
class MessageUpper(MessageHolder):
    # Constructor
    def __init__(self,msg):
        # Call constructor of superclass, passing self explicitly
        MessageHolder.__init__(self,msg)

    # Over-ride getMsg with new
    # functionality
    def getMsg(self):
        return string.upper(self.msg)
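To show the two classes side by side, here is a self-contained sketch in present-day Python 3. Both classes are defined in one file so that it runs on its own, and the upper() string method is used because the string.upper function does not exist in Python 3. The holders and messages variables are assumptions made for illustration:

```python
class MessageHolder:
    # Constructor: store the message on the new object
    def __init__(self, msg):
        self.msg = msg

    # Return the stored message string
    def getMsg(self):
        return self.msg


class MessageUpper(MessageHolder):
    def __init__(self, msg):
        # Call the superclass constructor, passing self explicitly
        MessageHolder.__init__(self, msg)

    # Override getMsg to return the message in uppercase
    def getMsg(self):
        return self.msg.upper()


# The same call gives different results depending on the object's class
holders = [MessageHolder("Hello world"), MessageUpper("Hello world")]
messages = [h.getMsg() for h in holders]
print(messages)
```

Code that calls getMsg does not need to know which of the two classes it is dealing with.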
Python is extensible
The Python language consists of a small core and a large collection of modules. Some of these modules are written in Python and some are written in C. As a user of Python modules, you cannot tell the difference. For example:
import xmlproc
import pyexpat
The first statement imports Lars Marius Garshol's implementation of an XML parser that is written purely in Python. The second statement imports the Python wrapping of James Clark's expat XML parser which is written in C.
Python programs using these modules cannot tell what language they have been implemented in. As you would expect, programs based on expat are typically faster owing to the speed advantages of a pure C implementation of an XML parser.
It is remarkably easy to write a Python module in C. This facility is very useful for speed-critical parts of large Python systems. It is also easy to "wrap" existing C libraries as Python modules, as has been done with expat. Many technologies exposing a C API have been wrapped as Python modules, for example Oracle, the Win32 API, and the wxWindows GUI toolkit, to name a few.
XML programming support
The core Python distribution (currently at version 1.5.2) has a simple non-validating XML parser module called xmllib. The vast bulk of Python's XML support is in the form of an add-on module under active development by the SIG for XML Processing in Python (known as XML-SIG). To illustrate Python's XML support, we will switch to an XML 1.0 version of the "Hello world" program processing the following file:
<?xml version = "1.0"?>
<Greeting>
Hello world
</Greeting>
SAX
SAX is a simple API for XML, spearheaded by David Megginson and developed as a collaborative effort on the XML-dev mailing list. The Python implementation was developed by Lars Marius Garshol.
A Python SAX application to count the words in Greeting.xml looks like this:
from xml.sax import saxexts, saxlib, saxutils
import string

# Create a class to handle document events
class docHandler(saxlib.DocumentHandler):
    # Start of document handler
    def startDocument(self):
        # Initialize storage for character data
        self.Storage = ""

    # End of document handler
    def endDocument(self):
        # Print approximate number of words
        # by counting the number of elements in
        # the list of words returned by the
        # string.split function
        print len(string.split(self.Storage))

    # Character data handler; the third argument is a length,
    # not an end position
    def characters(self,str,start,length):
        # Accumulate character data
        self.Storage = self.Storage + str[start:start+length]

# Create a parser
parser = saxexts.make_parser()
# Provide the parser with a document handler
parser.setDocumentHandler(docHandler())
# Parse the Greeting.xml file
parser.parseFile(open("Greeting.xml"))
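The same event-driven approach can be tried with the xml.sax package in the standard library of present-day Python 3. This is a sketch under assumptions, not the XML-SIG API shown above: ContentHandler stands in for DocumentHandler, its characters method receives only the text, and the document is parsed from an in-memory byte string so that the example needs no Greeting.xml on disk. The names GREETING, WordCounter and count_words are made up for illustration:

```python
import xml.sax

# In-memory copy of the Greeting document
GREETING = b'<?xml version="1.0"?>\n<Greeting>\nHello world\n</Greeting>\n'


class WordCounter(xml.sax.ContentHandler):
    def startDocument(self):
        # Initialize storage for character data
        self.storage = ""

    def characters(self, content):
        # Accumulate character data as the parser reports it
        self.storage += content

    def endDocument(self):
        # Approximate word count: number of whitespace-separated tokens
        self.words = len(self.storage.split())


def count_words(xml_bytes):
    handler = WordCounter()
    xml.sax.parseString(xml_bytes, handler)
    return handler.words


print(count_words(GREETING))
```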
DOM
The DOM is a W3C initiative to standardize an API to XML (and HTML) documents. Python has two DOM implementations. The one in the XML-SIG modules is the work of Andrew Kuchling and Stéfane Fermigier. The other is called 4DOM and is the work of Fourthought, who have also created XSLT and XPath implementations in Python.
Here is a sample DOM application to count the words in Greeting.xml:
from xml.dom import utils,core
import string

# Read an XML document into a DOM object
reader = utils.FileReader('Greeting.xml')
# Retrieve top level DOM document object
doc = reader.document
Storage = ""
# Walk over the nodes
for n in doc.documentElement.childNodes:
    if n.nodeType == core.TEXT_NODE:
        # Accumulate contents of text nodes
        Storage = Storage + n.nodeValue
print len(string.split(Storage))
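A comparable tree walk can be written against xml.dom.minidom, the small DOM implementation in the standard library of present-day Python 3. This is a sketch under assumptions and does not use the XML-SIG utils and core modules shown above. The document is parsed from a string so the example is self-contained, and GREETING and count_words are names made up for illustration:

```python
from xml.dom import Node
from xml.dom.minidom import parseString

# In-memory copy of the Greeting document
GREETING = '<?xml version="1.0"?>\n<Greeting>\nHello world\n</Greeting>\n'


def count_words(xml_text):
    # Read the XML document into a DOM tree
    doc = parseString(xml_text)
    storage = ""
    # Walk over the direct children of the top-level element
    for n in doc.documentElement.childNodes:
        if n.nodeType == Node.TEXT_NODE:
            # Accumulate contents of text nodes
            storage += n.nodeValue
    return len(storage.split())


print(count_words(GREETING))
```

Like the listing above, this looks only at text nodes directly beneath the document element, so text inside nested elements is not counted.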
Native Python APIs
As well as industry standard APIs, there is a native Python XML processing library known as Pyxie.
Pyxie is an open source XML processing library for Python which will be made publicly available in January 2000. Pyxie tries to make the best of Python's features to simplify XML processing.
Here is the word counting application developed using Pyxie:
from pyxie import *
import string

# Load XML into tree structure
t = File2xTree("Greeting.xml")
Storage = ""
# Iterate over list of data nodes
for n in Data(t):
    Storage = Storage + t.Data
print len(string.split(Storage))
In conclusion
We have looked at some of the main features of Python in a high-level way. We have also glanced at some of the XML processing facilities available. For further information on programming with Python, I suggest you start with http://www.python.org.
Future articles on XML.com will take a closer look at implementing XML applications in Python.