Typeless Schemas and Services
September 2, 2003
Chris Sells has been running a boutique developer's conference for a few years. It's currently called the "Applied XML Developers Conference," with the sub-title "applied topics for XML and Web Services zealots." The devcon is clearly a case where, contrary to the spam we all get, size doesn't matter: Chris attracts an influential set of speakers and attendees, and what is discussed at these conferences is often a leading indicator of major trends in web services.
This month, I want to look at what Noah Mendelsohn, Tim Ewald, and Don Box have been saying about W3C XML Schema and web services. (These are my impressions of talks that were given in July and last October, so please don't hold the speakers responsible for any errors. I name them only because I don't want to claim these ideas as my own.) Then next month, we'll look at how to use these ideas to drastically simplify WSDL.
Noah Mendelsohn was one of the editors of XML Schema Part 1: Structures, the specification for the XML Schema Language. He's also an editor for many of the SOAP 1.2 specifications. Last October he spoke about XML Schema, with a talk titled "what you might not know." He said that schemas are used for three things:
- Contracts: agreeing on formats. Think of this as distributed type safety, because your C/C++ compiler can't do type checking across a process boundary.
- Tools: knowing what the data will be. Think of this as making code wizards possible, automatically building bindings between data and your local programming language.
- Validation: getting what was expected. Think of this as run-time type-checking, the cousin to the first item because you can't just blindly trust the sender.
The difference between contracts and validation is important. The implementation of traditional RPC systems did not make this distinction, because RPC was all about preserving the function signature "across the wire." The contract specified what you were going to receive, and the validation decoded the network data and built the appropriate local datum.
Back then, you generally couldn't say "send me a number" in RPC; you had to say explicitly what kind of number it was: signed or unsigned, integer or floating point, and so on. No doubt part of the reason is that the common programming languages were all strongly typed. But for dynamically typed languages (starting with AWK, and now including the major scripting languages such as Perl), any of the following XML fragments could be consumed as "a number":
<value>5</value>
<value>5.0</value>
<value>5.400</value>
So now, I think of Noah's contract as describing the "shape" of the XML: a purchase order starts with a billing address, which starts with an email address. Validation is looking at the shaped content to see if it meets my requirements: the number of items ordered must be an integer.
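To make the split concrete, here is a minimal Python sketch of my own (it is not from Noah's talk). The function names are hypothetical; the point is only that reading the element as "a number" is one step, and the application's rule that the number be an integer is a separate one.

```python
import xml.etree.ElementTree as ET

FRAGMENTS = ["<value>5</value>", "<value>5.0</value>", "<value>5.400</value>"]

def as_number(fragment):
    """Shape: a <value> element holding text, consumed as "a number"."""
    elem = ET.fromstring(fragment)
    if elem.tag != "value":
        raise ValueError("unexpected element: " + elem.tag)
    return float(elem.text)

def is_integer_quantity(fragment):
    """Validation: this application's own rule that the number be integral."""
    return as_number(fragment).is_integer()

# All three fragments have the right shape and can be read as numbers...
numbers = [as_number(f) for f in FRAGMENTS]
# ...but only the first two pass the "must be an integer" requirement.
integral = [is_integer_quantity(f) for f in FRAGMENTS]
print(numbers, integral)
```

All three fragments get past the shape step; whether 5.400 is acceptable is a decision the receiving application makes afterwards.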
This distinction is only possible because XML provides a uniform data syntax. I think it's a wonderful and subtle strength -- there is an awful lot of power in nested angle brackets and text nodes -- and as developers we must fight short-sighted activities that might merge the two concepts.
In my experience, web services applications are often developed with schema validation tools. When the services are put into production, however, the cost of doing the validation is so great that it gets turned off. This results in the worst of all possible worlds: an application developed with a safety net is put into production -- sometimes made accessible over the internet -- with the net removed.
Instead, imagine an offline tool that compiled a schema into a shape description. This description could easily be used to augment any event-like parser (such as SAX or a pull model), for the nominal cost of checking a state stack and doing comparisons on the element name and namespace URI.
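As a rough illustration of that idea, here is a Python sketch built on the standard xml.sax module. Everything specific in it is an assumption of mine: the SHAPE table stands in for what such an offline compiler might emit, and the namespace and element names are invented. It checks only which children may appear under which parent, not their order or how often they occur.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces

NS = "urn:example:po"  # hypothetical namespace URI

# Hypothetical "compiled" shape: parent (namespace URI, local name) mapped to
# the set of permitted children. None as the parent marks the document root.
SHAPE = {
    None: {(NS, "purchaseOrder")},
    (NS, "purchaseOrder"): {(NS, "billingAddress"), (NS, "item")},
    (NS, "billingAddress"): {(NS, "email"), (NS, "street")},
    (NS, "item"): set(),
    (NS, "email"): set(),
    (NS, "street"): set(),
}

class ShapeError(Exception):
    """Raised when an element appears where the shape does not allow it."""

class ShapeChecker(ContentHandler):
    def __init__(self):
        super().__init__()
        self.stack = [None]  # state stack of open elements

    def startElementNS(self, name, qname, attrs):
        # name is a (namespace URI, local name) pair; compare it against
        # the children permitted under the element on top of the stack.
        parent = self.stack[-1]
        if name not in SHAPE.get(parent, set()):
            raise ShapeError("%r not allowed under %r" % (name, parent))
        self.stack.append(name)

    def endElementNS(self, name, qname):
        self.stack.pop()

def check_shape(document):
    """Return True if the document matches SHAPE, False otherwise."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(ShapeChecker())
    try:
        parser.parse(io.BytesIO(document.encode("utf-8")))
    except ShapeError:
        return False
    return True

good = ('<purchaseOrder xmlns="urn:example:po"><billingAddress>'
        '<email>a@example.com</email></billingAddress><item>5</item>'
        '</purchaseOrder>')
bad = '<purchaseOrder xmlns="urn:example:po"><shipping/></purchaseOrder>'
print(check_shape(good), check_shape(bad))
```

The per-element work is one set lookup and one stack push or pop, which is the kind of nominal cost I have in mind.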
Tim Ewald works on designing the next generation of plumbing for the Microsoft Developer Network (MSDN). His talk at the July devcon was titled "Rebuilding MSDN with Angle Brackets." He explained that the software and databases behind MSDN are old and much is still oriented toward the production of CDs. They're moving to an XML-oriented foundation; while he didn't get into details, presumably one of the major motivations for this is to let them more easily re-purpose their content, such as their RSS feeds.
As a side note, the official format for articles submitted to MSDN is Microsoft Word using a specific template. As Tim and his group make progress, presumably this will get changed to be XML and a particular schema definition.
Tim talked about the flow of documents through the system. When a document is received, they want to verify that it meets some basic criteria, such as having a title, author, and content. As the document progresses through the system (i.e., has been subject to technical review, copy-editing, and so on), more details about the document type are exposed and enforced. For example, tutorials might be required to have code fragments, while interview pieces have a prologue, Q&A, and an epilogue.
Think of the pipeline as starting out very wide, and then splitting up into narrow branches. An initial schema would define only the three elements I mentioned above, but as the article reached publication stage, a more complicated schema -- one that mandated alternating question and answer elements, for example -- would be used.
This is a perfect instance of location validation. It's not always necessary, and certainly not always possible, to fully validate every single datum of something. This is more typical of real-world use of XML. For example, when renting a car on a business trip, you usually don't care what color it is.
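Here is a small Python sketch of that staged checking, strictly my own illustration and not MSDN's actual pipeline. The element names (article, title, author, content, prologue, question, answer, epilogue) are assumptions chosen to match the description above. In practice each stage would be a schema rather than hand-written code.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names for the three basic items.
INTAKE_REQUIRED = ("title", "author", "content")

def check_intake(article):
    """Wide stage: require only the basic elements and ignore everything else."""
    return [name for name in INTAKE_REQUIRED if article.find(name) is None]

def check_interview(article):
    """Narrow stage: intake rules plus prologue, alternating Q&A, epilogue."""
    problems = check_intake(article)
    content = article.find("content")
    if content is None:
        return problems
    kids = [child.tag for child in content]
    if not kids or kids[0] != "prologue" or kids[-1] != "epilogue":
        problems.append("content must start with prologue and end with epilogue")
    qa = kids[1:-1]
    expected = ["question", "answer"] * (len(qa) // 2)
    if not qa or qa != expected:
        problems.append("questions and answers must alternate")
    return problems

draft = ET.fromstring(
    "<article><title>T</title><author>A</author>"
    "<content><p>text</p></content></article>")
interview = ET.fromstring(
    "<article><title>T</title><author>A</author><content><prologue/>"
    "<question/><answer/><question/><answer/><epilogue/></content></article>")

print(check_intake(draft))         # the draft is fine at the wide stage
print(check_interview(draft))      # but not yet publishable as an interview
print(check_interview(interview))
```

The same draft passes the first check and fails the second, which is the point: how much gets validated depends on where the document is in the pipeline.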
The W3C XML Schema any element and its processContents attribute are used to let you define how strictly something should be validated -- i.e., how much detail you want to require. For details, see section 3.10.1 in the wildcard section of the XML Schema specification.
Related to this type of "shape" validation is the notion of not using W3C XML Schema's type system and mechanism. For validation, treat the XML "as XML," and not as the serialization of some local object. This concept really opened my eyes. I used to think that defining types and then instances of those types was the way to do things. In my mind, the XML Signature specification was the apotheosis of this style. It completely and consistently defined all elements in this style:
<element name="Signature" type="ds:SignatureType"/>
<complexType name="SignatureType">
  ...
</complexType>
Instead, following Tim's suggestion, use anonymous types, essentially in-lining the data definition:
<element name="Signature">
  <complexType>
    ...
  </complexType>
</element>
A security specification that wants to use an XML DSIG, then, will use the ref attribute to indicate where the signature should appear:
<element name="AppData">
  <complexType>
    ...
    <element ref="ds:Signature"/>
    ...
  </complexType>
</element>
You can argue that this limits reuse, because anyone who wants to use a definition from another schema is forced to use that schema's name, and I don't disagree. But engineering is all about trade-offs, and I have come to believe that this meets the 80/20 rule. After all, XML is all about element names, and Namespaces in XML is the official mechanism for distributed naming. One of the great fissures in the XML community is between those who like the W3C XML Schema type system and those who abhor it. Web services have, so far, been forced into the former camp, unnecessarily antagonizing the latter.
Don Box has been arguing against "RPC encoding" for some time, which is clearly a fall-out from the techniques I just described. No doubt his position within Microsoft gives him a great deal of influence on their approach to web services, and what large numbers of developers will see in the next couple of years.
On the outward-facing side, however, he has lately been giving (revised) versions of the same talk -- I caught it at the July devcon -- which his blog promises is the first chapter of his new book. To me, one of the most notable contributions is that he has come up with a new unifying acronym for this document-centric style of computing: Service Oriented Architecture and Protocol.