Architectural Design Patterns for XML Documents
March 26, 2003
Introduction
No one wants to reinvent the wheel. One way programmers try to reuse good ideas about object design is to look to catalogs of design patterns like, most famously, the Gang of Four's Design Patterns: Elements of Reusable Object-Oriented Software (Gamma et al.). XML has been used enough now that some high-level patterns are starting to emerge. Some patterns revolve around the low-level details of good schema design, like those put together by Dare Obasanjo in "W3C XML Schema Design Patterns"; but when you have a blank sheet of paper in front of you and you're ready to start designing your new XML format, you want patterns to guide you at a higher level. This article attempts to document a few whole-document design patterns that have proven themselves in the field.
Dynamic Document
Abstract
This pattern produces XML that is not typed by a DTD or schema; instead, its structure follows the accessors of the underlying program objects. It allows for unlimited extension by multiple, uncoordinated parties at the cost of type-checking, and it is simple to implement, with supporting libraries abounding (e.g. Apache Commons for Java; .NET's XML marshalling for C#).
Problem
You need to develop a format quickly, or many different people are contributing on an ad-hoc basis at different times, and it's not possible to have a fixed document design.
Context
This pattern is more common for private formats or technical ones, such as configuration for a server or a marshaling format. It also is a good match for Extreme Programming projects because you can get it working quickly, refactoring later to use another mechanism if needed.
Forces
- You need a "quick and dirty" solution.
- You can't know beforehand what extensions will be required, but you know they will be many and created by people other than the original document format creator.
Solution
Don't design a format at all, and drop validation. Have a technical solution -- that is, a marshaller -- drive the XML generation. As data structures in your program change, the generated XML changes. In both .NET and Java the marshaller uses reflection and extra metadata (.NET CLR attributes or JavaBean BeanInfo classes) to find the read/write properties of a class. It moves recursively through the object graph, generating a tree of XML elements named after the accessors. For example, these two classes:
public class Person {
    public String getName() { ... }
    public void setName(String name) { ... }
    public Address getAddress() { ... }
    public void setAddress(Address address) { ... }
}

public class Address {
    public String getCity() { ... }
    public void setCity(String city) { ... }
    public String getState() { ... }
    public void setState(String state) { ... }
}
might be marshalled as
<person>
  <name>Kyle Downey</name>
  <address>
    <city>Forest Hills</city>
    <state>Queens</state>
  </address>
</person>
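To make the reflection step concrete, here is a minimal sketch of such a marshaller. It is not the API of any real library: DynamicMarshaller and its marshal method are hypothetical names, the getter and setter bodies are filled in only so the example runs, and the sketch assumes an acyclic object graph and String, Number or Boolean leaf values. Properties are emitted in alphabetical order, so the element order differs from the listing above.

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a reflection-driven marshaller; not a real library API.
class DynamicMarshaller {

    /** Marshals a bean to XML, naming the root element after its class. */
    static String marshal(Object bean) {
        StringBuilder out = new StringBuilder();
        try {
            write(decapitalize(bean.getClass().getSimpleName()), bean, out);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("Could not read a property", e);
        }
        return out.toString();
    }

    // Assumes an acyclic object graph; a cycle would recurse forever.
    private static void write(String name, Object value, StringBuilder out)
            throws ReflectiveOperationException {
        out.append('<').append(name).append('>');
        if (value instanceof String || value instanceof Number || value instanceof Boolean) {
            out.append(escape(value.toString()));
        } else {
            // Collect no-argument getters, sorted by property name for stable output.
            Map<String, Method> getters = new TreeMap<>();
            for (Method m : value.getClass().getMethods()) {
                String n = m.getName();
                if (n.startsWith("get") && n.length() > 3
                        && m.getParameterCount() == 0 && !n.equals("getClass")) {
                    getters.put(decapitalize(n.substring(3)), m);
                }
            }
            for (Map.Entry<String, Method> e : getters.entrySet()) {
                Object child = e.getValue().invoke(value);
                if (child != null) {
                    write(e.getKey(), child, out); // recurse into the object graph
                }
            }
        }
        out.append("</").append(name).append('>');
    }

    private static String decapitalize(String s) {
        return Character.toLowerCase(s.charAt(0)) + s.substring(1);
    }

    private static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    public static void main(String[] args) {
        Address address = new Address();
        address.setCity("Forest Hills");
        address.setState("Queens");
        Person person = new Person();
        person.setName("Kyle Downey");
        person.setAddress(address);
        System.out.println(marshal(person));
    }
}

// Bean bodies are illustrative; the article leaves them elided.
class Person {
    private String name;
    private Address address;
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public Address getAddress() { return address; }
    public void setAddress(Address address) { this.address = address; }
}

class Address {
    private String city;
    private String state;
    public String getCity() { return city; }
    public void setCity(String city) { this.city = city; }
    public String getState() { return state; }
    public void setState(String state) { this.state = state; }
}
```

Adding a new property to Person, say getPhone(), would add a <phone> element to the output with no change to the marshaller, which is the point of the pattern.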
Discussion
Before sitting down to do a potentially complex document design, you should always ask yourself if a dynamic, data-driven format might be sufficient. Most XML-aware development platforms provide at least one library that will take an object and convert it into XML. You've done the object design, and in a couple lines of code, you've done your document design as well. If you're on a tight deadline, this is a potentially big time-saver for the development team.
But not so fast. Dynamic document most likely isn't an option for you if
- you're designing a long-lived business-critical exchange format and thus you don't want the format to change whenever you change your object design; or
- you don't trust the producers of the data to get it right, and the cost of a mistake is high. For example, if the document notifies you about inventory changes at a partner's warehouse, the lack of validation is risky.
Related Patterns
None. This is the "zero design pattern design." Once you start to involve other patterns, you're enforcing a human design rather than having a dynamic document.
Known Uses
- Ant build.xml
- Apache Tomcat server.xml
- JDK 1.4 JavaBean XML persistence
- .NET XML Marshalling
- SOAP default encoding
Composition
Abstract
Wherever possible, define the format using existing standards, referencing their elements by namespace rather than rolling your own. For example, add metadata to your documents using RDF and the Dublin Core extensions rather than inventing your own <author> and <description> tags. This allows for independent evolution of markup by the parties who know the business domain best.
Problem
You have an existing or planned document format that provides common types of data using its own, proprietary elements and types, and you're forced to maintain and understand that subset of data yourself, even though you're not a domain specialist.
Context
With all the standardization work out there, just about any business-oriented document problem presents an opportunity for defining some elements with Composition.
Forces
- There is an opportunity to reuse a <simpleType>, <complexType> or <element> from another XML schema.
- You can accept or even want to have the composed data type definition evolve independently of your own efforts.
- Patents or other legal encumbrances do not prevent you from reusing that schema.
Solution
XML namespaces make it very easy to import entire elements from one spec to another. Let's say you're designing a format for capturing use cases. You want to include attribution information: who wrote it, when, and so on. You might want to consider using the Dublin Core RDF elements instead of defining your own <author> and other meta-information tags:
<uc:use-case xmlns:uc="http://example.com/my/usecase.xsd" id="3">
  <uc:metadata>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:dc="http://dublincore.org/documents/2002/07/31/dcmes-xml/dcmes-xml-xsd.xsd">
      <rdf:Description>
        <dc:title>Irritate Customer</dc:title>
        <dc:creator>Kyle Downey</dc:creator>
        <dc:date>2002-03-08</dc:date>
        <dc:format>text/xml</dc:format>
        <dc:language>en</dc:language>
        <dc:contributor>Amber Archer Consulting Co., Inc.</dc:contributor>
        <dc:identifier>UC#3</dc:identifier>
      </rdf:Description>
    </rdf:RDF>
  </uc:metadata>
  ...
</uc:use-case>
In your use case schema you would have (in part)
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <import namespace="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
          schemaLocation="http://dublincore.org/documents/2002/07/31/dcmes-xml/dcmes-rdf.xsd"/>
  <element name="metadata">
    <complexType>
      <sequence>
        <element ref="rdf:RDF"/>
      </sequence>
    </complexType>
  </element>
</schema>
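On the consuming side, a composed element is identified by its namespace URI and local name, not by whatever prefix the document author happened to pick. The sketch below illustrates this with the JDK's namespace-aware DOM parser. DublinCoreReader and dcValue are hypothetical names, and DC_NS simply copies the URI bound to the dc prefix in the use-case listing above; it is an assumption of this example rather than a statement about the official Dublin Core namespace.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper for reading composed Dublin Core elements.
class DublinCoreReader {

    // Assumption: the same URI that the article's listing binds to the "dc" prefix.
    static final String DC_NS =
            "http://dublincore.org/documents/2002/07/31/dcmes-xml/dcmes-xml-xsd.xsd";

    /** Returns the text of the first DC element with this local name, or null if absent. */
    static String dcValue(String xml, String localName) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // without this, namespace lookups find nothing
            Document doc = factory.newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            NodeList hits = doc.getElementsByTagNameNS(DC_NS, localName);
            return hits.getLength() == 0 ? null : hits.item(0).getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document", e);
        }
    }

    public static void main(String[] args) {
        // The prefix "meta" is deliberately different from "dc"; only the URI matters.
        String xml = "<use-case xmlns:meta='" + DC_NS + "'>"
                + "<meta:creator>Kyle Downey</meta:creator></use-case>";
        System.out.println(dcValue(xml, "creator"));
    }
}
```

A home-grown <creator> element in some other namespace would not match, which is what lets your own elements and the borrowed ones evolve side by side without colliding.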
Discussion
One of the strong arguments for Composition -- aside from the well-documented programmer's virtue of laziness -- is that you can lean on the more specialized knowledge of others. The people who put together Dublin Core put a lot of thought into how to best represent document metadata. They have been doing it since 1994. Most likely, you've been thinking about how to put meta-information into your document since two paragraphs ago. There's no match. So your choice is either to get taken down by an angry librarian who's breaking noses and taking names or to reuse the work. This design pattern recommends the latter.
As RDF and Dublin Core evolve, all you have to do is change the namespace and the import statement to point to a newer version of the schema, letting you take advantage of all the latest and greatest ways of representing metadata, widgets, documents, customers, fixed income instruments, or whatever it is you're reusing, with very little effort. This capacity for concurrent evolution is, however, also the biggest gotcha in Composition. Unless the promoters of your standard have done the right thing and put version information in the namespace and schema URI, there's a risk users in the field will suddenly start getting backward-incompatible version 2.0 of the schema and get very angry. So keep an eye on versioning, and if necessary copy the schema to your own namespace and reuse from there.
Even where you can't reuse a public XML schema, you can still look for common, reusable data clumps in your document formats. Let's put it this way: if you have five business processes involving customers and addresses, do you really need to define customer and address five times? Or even want to? Reuse through Composition can and should start inside your enterprise.
Related Patterns
None from this catalog.
Known Uses
- WSDL very nicely reuses XML schema by embedding a whole <schema> element in the WSDL document rather than defining its own mechanism for acceptable web service message types.
Self-Documenting Files
Abstract
Include as part of the document format elements that annotate the content.
Problem
Your human-readable format is so cryptic that it makes grown hackers cry. Consider this fragment of Perl code rendered as XML, which supposedly prints the entire Linux kernel when run:
<perlml> @P=split//,".URRUU\c8R";@d=split//,"\nrekcah xinU / lreP rehtona tsuJ";sub p{ @p{"r$p","u$p"}=(P,P);pipe"r$p","u$p";++$p;($q*=2)+=$f=!fork;map{$P=$P[$f^ord ($p{$_})&6];$p{$_}=/ ^$P/ix?$P:close$_}keys%p}p;p;p;p;p;map{$p{$_}=~/^[P.]/&& close$_}%p;wait until$?;map{/^r/&&<$_>}%p;$_=$d[$q];sleep rand(2)if/\S/;print </perlml>
Note how it's much improved with just a little annotation:
<perlml> <annotation> You're not expected to understand this. </annotation> <code> @P=split//,".URRUU\c8R";@d=split//,"\nrekcah xinU / lreP rehtona tsuJ";sub p{ @p{"r$p","u$p"}=(P,P);pipe"r$p","u$p";++$p;($q*=2)+=$f=!fork;map{$P=$P[$f^ord ($p{$_})&6];$p{$_}=/ ^$P/ix?$P:close$_}keys%p}p;p;p;p;p;map{$p{$_}=~/^[P.]/&& close$_}%p;wait until$?;map{/^r/&&<$_>}%p;$_=$d[$q];sleep rand(2)if/\S/;print </code> </perlml>
Context
This pattern applies to documents that are meant to be viewed by people, or at least post-processed to generate documentation for people. Internal data structure formats like on-the-wire marshaling generally don't need annotation.
Forces
- You're generating complex XML content that needs to be understood by people, or converted into some format for their viewing.
- The information in the document itself is not enough to be comprehensible.
Solution
Add an element or elements to your XML schema to include documentation. Generally you'll want to somehow tie the documentation to each significant element, so you could consider a base type -- for example, documentableType -- like this:
<complexType name="documentableType">
  <sequence>
    <element name="annotation" type="string"/>
  </sequence>
</complexType>
Discussion
XML comments are great, but if you find that they're becoming mandatory for users to decode your XML documents, maybe it's time to allow those annotations to be part of the XML itself. Probably the biggest win you get out of this (aside from standardizing where the comments go and how they're formatted using all the powerful features of XML Schema) is an ability to apply the rest of the XML toolkit to your documents. You could, for instance, write a "widgetdoc" XSLT stylesheet that takes your widget.xml files and converts them into an HTML document describing the widget, including all your extra annotations that might not mean much to your automatic widget-stamping machine that was reading the XML before, but will mean a lot to anyone debugging the machine's software.
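As a rough illustration of the "widgetdoc" idea, the sketch below runs a small XSLT 1.0 stylesheet through the JDK's built-in javax.xml.transform API and lists every element that carries an <annotation> child. The widget vocabulary, the WidgetDoc class and the stylesheet are all invented for this example; a real stylesheet would be shaped by your own format.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical "widgetdoc" generator: turns annotations into an HTML definition list.
class WidgetDoc {

    // Illustrative stylesheet: one <dt>/<dd> pair per element that has an <annotation> child.
    static final String STYLESHEET =
          "<xsl:stylesheet version='1.0' xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
        + "<xsl:output method='html' indent='no'/>"
        + "<xsl:template match='/'>"
        + "<html><body><dl>"
        + "<xsl:for-each select='//*[annotation]'>"
        + "<dt><xsl:value-of select='name()'/></dt>"
        + "<dd><xsl:value-of select='normalize-space(annotation)'/></dd>"
        + "</xsl:for-each>"
        + "</dl></body></html>"
        + "</xsl:template>"
        + "</xsl:stylesheet>";

    /** Applies the stylesheet to a widget document and returns the resulting HTML. */
    static String toHtml(String widgetXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(widgetXml)),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not transform document", e);
        }
    }

    public static void main(String[] args) {
        String widget = "<widget><annotation>Stamped from 2mm steel.</annotation>"
                + "<bolt size='4'><annotation>Torque to spec.</annotation></bolt>"
                + "<nut/></widget>";
        System.out.println(toHtml(widget));
    }
}
```

The widget-stamping machine can keep ignoring <annotation> elements, while the same file now feeds a documentation build.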
Related Patterns
There's a nice combination of Composition and Self-Documenting Files. There are two well-known formats for documentation in XML: DocBook and XHTML. DocBook is specialized for technical documentation, and there are powerful stylesheets out there for converting it to HTML and PDF. XHTML is, obviously, very good for online presentation. So if you want to be able to generate professional-quality documentation with links and images from your own XML format, you should definitely consider embedding XHTML or DocBook XML.
Known Uses
- XML Schema has annotations, and you can convert them to HTML using xs3p, a very snazzy schemadoc tool
- WSDL
Multipart Files
Abstract
Define an explicit mechanism for splitting content into multiple files: a primary document and satellite ones that represent faster changing components or sections of content shared with other primary documents.
Problem
Your documents have become large and unwieldy, and you want to share pieces of them.
Context
This pattern can apply to just about any format, but it seems to be more common in the technical arena.
Forces
- As documents grow in size and complexity, and as there are more documents that can overlap, this pattern becomes more appealing.
- Pushing against its use, security and the handling of absolute versus relative URIs become issues for anyone processing the format. If that's too complicated for your taste, or if there are concerns about a cracker manipulating this facility to pull in content he or she should not have access to, you might want to disallow inclusions.
Solution
Add to your schema an <import> or <include> element that takes an href attribute which can be any valid relative or absolute URI. Compliant processors for your format will load and incorporate valid subdocuments in your format from the URI.
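What a compliant processor does with such an element can be sketched as follows. This is an illustration under several assumptions, not a reference implementation: IncludeExpander is a hypothetical class, the element is assumed to be named include with no namespace, each href is treated as an opaque key, and documents come from an in-memory map that stands in for actually fetching a URI. Each include is replaced by the root element of the referenced document, and circular inclusion is rejected.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical processor that inlines <include href="..."/> elements.
class IncludeExpander {

    // Assumption: an in-memory map stands in for fetching each URI.
    private final Map<String, String> store;

    IncludeExpander(Map<String, String> store) {
        this.store = store;
    }

    /** Loads a document and replaces every include with the referenced root element. */
    Document load(String uri) {
        return load(uri, new HashSet<>());
    }

    private Document load(String uri, Set<String> inProgress) {
        if (!inProgress.add(uri)) {
            throw new IllegalStateException("Circular include: " + uri);
        }
        String xml = store.get(uri);
        if (xml == null) {
            throw new IllegalArgumentException("No such document: " + uri);
        }
        Document doc = parse(xml);
        // Copy the live NodeList first, because replacing nodes changes it underfoot.
        NodeList live = doc.getElementsByTagName("include");
        List<Element> includes = new ArrayList<>();
        for (int i = 0; i < live.getLength(); i++) {
            includes.add((Element) live.item(i));
        }
        for (Element include : includes) {
            // Subdocuments are expanded recursively before being copied in.
            Document part = load(include.getAttribute("href"), inProgress);
            Node copy = doc.importNode(part.getDocumentElement(), true);
            include.getParentNode().replaceChild(copy, include);
        }
        inProgress.remove(uri);
        return doc;
    }

    private static Document parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Malformed document", e);
        }
    }

    public static void main(String[] args) {
        Map<String, String> store = new HashMap<>();
        store.put("main.xml", "<book><title>T</title><include href='ch1.xml'/></book>");
        store.put("ch1.xml", "<chapter>One</chapter>");
        Document doc = new IncludeExpander(store).load("main.xml");
        System.out.println(doc.getDocumentElement().getTextContent());
    }
}
```

A real processor would also have to decide how relative hrefs resolve against the including document and what to do when a fetch fails; the sketch sidesteps both.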
SOAP 1.1 with Attachments takes an interesting alternative approach to this problem, using Composition along the way. SOAP co-opts the pre-existing MIME standard and allows SOAP messages to be MIME multipart/related, with the SOAP XML message as the initial part and others linked to it. This allows SOAP to behave something like FTP, with separate "control" and "data" streams. You can send metadata about binary content and directives for what the recipient should do with it as part of the XML message and just attach the content directly to the message.
Discussion
From #include to the humble href in HTML, systems abound with ways to pull together content from multiple locations. This makes documents more maintainable and encourages basic reuse of common components, whether they're shared stylesheet rules or whole XML schemas. While it may seem hard to find instances where you wouldn't want to allow sharing of document parts and file composition, as noted above in Forces there are potential complexity and security issues with allowing inclusions.
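One way to contain the security concern is to resolve every href against the including document's URI and refuse anything that lands outside an agreed root. The sketch below does this with java.net.URI. IncludePolicy and resolveChecked are hypothetical names, and the check is deliberately simple: it compares scheme, authority and normalized path prefix, and does not attempt to cover redirects, percent-encoded dot segments or case differences in host names.

```java
import java.net.URI;
import java.util.Objects;

// Hypothetical policy check for include hrefs; a sketch, not a complete defense.
class IncludePolicy {

    /**
     * Resolves href against base and returns the result only if it stays
     * under allowedRoot; otherwise throws SecurityException.
     */
    static URI resolveChecked(URI base, String href, URI allowedRoot) {
        URI resolved = base.resolve(href).normalize();
        URI root = allowedRoot.normalize();
        boolean sameOrigin = root.getScheme() != null
                && root.getScheme().equalsIgnoreCase(resolved.getScheme())
                && Objects.equals(root.getAuthority(), resolved.getAuthority());
        String path = resolved.getPath(); // null for opaque URIs such as mailto:
        if (!sameOrigin || path == null || !path.startsWith(root.getPath())) {
            throw new SecurityException("Include not allowed: " + resolved);
        }
        return resolved;
    }

    public static void main(String[] args) {
        URI base = URI.create("http://example.com/docs/main.xml");
        URI root = URI.create("http://example.com/docs/");
        System.out.println(resolveChecked(base, "parts/ch1.xml", root));
        try {
            resolveChecked(base, "../secret/passwd.xml", root);
        } catch (SecurityException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Whether to reject, log or silently skip a disallowed include is a design decision for your format; the sketch throws so the failure cannot go unnoticed.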
Related Patterns
You might want to make your Self-Documenting Files refer to external documents rather than embedding them, and you can use Composition by reusing the W3C standards for file inclusion: XInclude and XML Base. But if you need to have different meanings for including other files (as XSLT does with its <import> and <include> elements) you might still have to roll your own.
Known Uses
- XSLT
- XML Schemas
- WSDL
- SOAP with Attachments
References and Acknowledgments
- XML Schemas
- XSL/XSLT
- SOAP 1.2
- SOAP 1.2 Attachments
- WSDL 1.2
- XHTML
- XML Pointer, XML Base and XLink
- Dublin Core Group
- Expressing Simple Dublin Core in RDF/XML
- Programming Perl, 2nd Edition (for source of the "three great virtues of a programmer")
- thanks to Raymond Blum for pointing out that Dynamic Document and XP go together well