XML Schema Design Patterns: Is Complex Type Derivation Unnecessary?

October 29, 2003

W3C XML Schema (WXS) possesses a number of features that mimic object-oriented concepts, including type derivation and polymorphism. However, real-world experience has shown that these features tend to complicate schemas, may have subtle interactions that lead to tricky problems, and can often be replaced by other features of WXS. In this article I explore both derivation by restriction and derivation by extension of complex types, showing the pros and cons of both techniques as well as alternatives for achieving the same results.

Why Validate XML Documents?

The WXS recommendation is just one of many XML schema languages; others include DTDs, RELAX NG, and XML Data-Reduced. An XML schema is used to describe the structure of an XML document by specifying the valid elements that can occur in a document, the order in which they can occur, as well as constraints on certain aspects of these elements. As usage of XML and XML schema languages has become more widespread, two primary usage scenarios have developed around XML document validation and XML schemas.

Describing and enforcing the contract between producers and consumers of XML documents: An XML schema ordinarily serves as a means for consumers and producers of XML to understand the structure of the document being consumed or produced. Schemas are a fairly terse and machine-readable way to describe what constitutes a valid XML document according to a particular XML vocabulary. Thus a schema can be thought of as a contract between the producer and consumer of an XML document. Typically the consumer ensures that the XML document being received from the producer conforms to the contract by validating the received document against the schema.

This description covers a wide array of XML usage scenarios, from business entities exchanging XML documents to applications that utilize XML configuration files.

Creating the basis for processing and storing typed data represented as XML documents: As XML became popular as a way to represent rigidly structured, strongly typed data, such as the content of a relational database or programming language objects, the ability to describe the datatypes within an XML document became important. This led to Microsoft's XML Data and XML Data-Reduced schema languages, which ultimately led to WXS. These schema languages are used to convert an input XML infoset into a type annotated infoset (TAI) where element and attribute information items are annotated with a type name.

WXS describes the creation of a type annotated infoset as a consequence of document validation against a schema. During validation against a WXS schema, an input XML infoset is converted into a post schema validation infoset (PSVI), which among other things contains type annotations. However, practical experience has shown that one does not need to perform full document validation to create type annotated infosets. In general, many applications that use XML schemas to create strongly typed XML, such as XML<->object mapping technologies, do not perform full document validation, since a number of WXS features do not map to concepts in the target domain.

In presenting the pros and cons of complex type derivation this article will focus on its effects on these uses of XML schema.

A Look at Derivation by Restriction of Complex Types

Restriction of complex types involves creating a derived complex type whose content model is a subset of its base type's content model. This means that an instance of the derived type should also be a valid instance of the base complex type. Examples of acceptable restrictions to declarations in the content model include

Changing an optional attribute to being required
Changing the occurrence range of an element so it is a subset of the original occurrence range (e.g. from minOccurs="1" and maxOccurs="unbounded" to minOccurs="2" and maxOccurs="4")
Changing the nillability of an element from true to false
Changing the type of an element or attribute to a subtype (e.g. going from xs:integer in the base type to xs:positiveInteger in the derived type)
Changing an element or attribute to having a fixed value

Derivation by restriction is primarily useful in combination with abstract elements or types. One can create an abstract type that contains all the characteristics of a number of related content models, then restrict it to create each of the target content models. This approach is highlighted in a post to XML-DEV by Roger Costello, where a PublicationType is restricted to a MagazineType.
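
As a rough sketch of this pattern, an abstract PublicationType can hold the union of the related content models, while MagazineType restricts away what a magazine does not need. Only the two type names come from the post mentioned above; the title, author, and date elements are my own illustrative assumptions.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

 <!-- abstract base type: superset of the related content models (element names are hypothetical) -->
 <xs:complexType name="PublicationType" abstract="true">
  <xs:sequence>
   <xs:element name="title" type="xs:string" />
   <xs:element name="author" type="xs:string" minOccurs="0" maxOccurs="unbounded" />
   <xs:element name="date" type="xs:gYear" />
  </xs:sequence>
 </xs:complexType>

 <!-- concrete type: the content model is repeated, with author restricted away -->
 <xs:complexType name="MagazineType">
  <xs:complexContent>
   <xs:restriction base="PublicationType">
    <xs:sequence>
     <xs:element name="title" type="xs:string" />
     <xs:element name="author" type="xs:string" minOccurs="0" maxOccurs="0" />
     <xs:element name="date" type="xs:gYear" />
    </xs:sequence>
   </xs:restriction>
  </xs:complexContent>
 </xs:complexType>

</xs:schema>
```

Because PublicationType is abstract, an element in an instance document can never be validated directly against it; a concrete derived type such as MagazineType has to be used instead.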

The following schema taken from one of my previous articles uses derivation by restriction to restrict a complex type which describes a subscriber to the XML-DEV mailing list to a type that describes me. Any element that conforms to the DareObasanjo type can also be validated as an instance of the XML-Deviant type.


<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

 <!-- base type -->
 <xs:complexType name="XML-Deviant">
  <xs:sequence>
   <xs:element name="numPosts" type="xs:integer" minOccurs="0" maxOccurs="1" />
   <xs:element name="signature" type="xs:string" nillable="true" />
   <xs:element name="email" type="xs:string" minOccurs="0" maxOccurs="1" />
  </xs:sequence>
  <xs:attribute name="firstSubscribed" type="xs:date" use="optional" />
  <xs:attribute name="mailReader" type="xs:string" />
 </xs:complexType>

 <!-- derived type -->
 <xs:complexType name="DareObasanjo">
  <xs:complexContent>
   <xs:restriction base="XML-Deviant">
    <xs:sequence>
     <xs:element name="numPosts" type="xs:integer" minOccurs="1" />
     <xs:element name="signature" type="xs:string" nillable="false" />
     <!-- minOccurs must not exceed maxOccurs, so both are set to 0 -->
     <xs:element name="email" type="xs:string" minOccurs="0" maxOccurs="0" />
    </xs:sequence>
    <xs:attribute name="firstSubscribed" type="xs:date" use="required" />
    <xs:attribute name="mailReader" type="xs:string" fixed="Microsoft Outlook" />
   </xs:restriction>
  </xs:complexContent>
 </xs:complexType>

</xs:schema>

When a given complex type is to be derived by restriction from another complex type, its content model must be duplicated and refined.
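
To make the subset relationship concrete, here is a sketch of an instance that satisfies both content models. The subscriber element name is hypothetical: it assumes an element declaration such as `<xs:element name="subscriber" type="DareObasanjo" />` has been added to the schema, and the values shown are invented for illustration.

```xml
<!-- hypothetical instance; the element name and values are illustrative only -->
<subscriber firstSubscribed="1999-05-31" mailReader="Microsoft Outlook">
 <numPosts>42</numPosts>
 <signature>XML is about data not objects, that is the zen of XML.</signature>
</subscriber>
```

The instance carries the required numPosts and firstSubscribed, a non-nil signature, and no email, so it meets the DareObasanjo constraints. Each of those constraints is merely a tightening of something the XML-Deviant type already allows, which is why the same element also validates against the base type.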

The Problems with Derivation by Restriction of Complex Types

In a previous article in the XML Design Pattern series entitled "Avoiding Complexity" I pointed out why you should use restriction of complex types with great care, with the following admonition:

The rules for derivation by restriction of complex types are described in Section 3.4.6 and Section 3.9.6 of the WXS recommendation. Most bugs in implementations cluster around this feature, and it is quite common to see implementers express exasperation when discussing the various nuances of derivation by restriction in complex types. Further, this kind of derivation does not neatly map to concepts in either object oriented programming or relational database theory, which are the primary producers and consumers of XML data.

For the contract-validation class of users, derivation by restriction provides little, if any, benefit over defining content models without using derivation. The following schema is equivalent to the one in the previous section if all you're interested in is ensuring that an XML-Deviant or DareObasanjo element conforms to the specified content model.


<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

 <xs:complexType name="XML-Deviant">
  <xs:sequence>
   <xs:element name="numPosts" type="xs:integer" minOccurs="0" maxOccurs="1" />
   <xs:element name="signature" type="xs:string" nillable="true" />
   <xs:element name="email" type="xs:string" minOccurs="0" maxOccurs="1" />
  </xs:sequence>
  <xs:attribute name="firstSubscribed" type="xs:date" use="optional" />
  <xs:attribute name="mailReader" type="xs:string" />
 </xs:complexType>

 <xs:complexType name="DareObasanjo">
  <xs:sequence>
   <xs:element name="numPosts" type="xs:integer" minOccurs="1" />
   <xs:element name="signature" type="xs:string" nillable="false" />
   <!-- minOccurs must not exceed maxOccurs, so both are set to 0 -->
   <xs:element name="email" type="xs:string" minOccurs="0" maxOccurs="0" />
  </xs:sequence>
  <xs:attribute name="firstSubscribed" type="xs:date" use="required" />
  <xs:attribute name="mailReader" type="xs:string" fixed="Microsoft Outlook" />
 </xs:complexType>

</xs:schema>

It should be noted that this schema does not enforce the relationship between the XML-Deviant and DareObasanjo types. For cases where the subtype relationship must be maintained the alternative is not satisfactory.

For usage scenarios where a schema is used to create strongly typed XML, derivation by restriction is problematic. The ability to restrict optional elements and attributes does not exist in the relational model or in traditional concepts of type derivation from OOP languages. The example from the previous section, where the email element is optional in the base type but cannot appear in the derived type, is incompatible with the notion of derivation in an object-oriented sense, and is similarly hard to model using tables in a relational database. Likewise, changing the nillability of a type through derivation is not a capability that maps to relational or OOP models. On the other hand, the example that doesn't use derivation by restriction can more straightforwardly be modeled as classes in an OOP language or as relational tables. This is important because it reduces the impedance mismatch that occurs when attempting to map the contents of an XML document into a relational database or convert an XML document into an instance of an OOP class.

Although certain aspects of derivation by restriction do not map well, it's possible to enforce these constraints directly by, for example, always throwing an exception when attempting to access a property or field in a derived type that has been restricted away. However, not only is such direct enforcement of WXS constraints unnatural to developers who traditionally use OOP languages, it is also unlikely that such conventions would be uniform across all implementations of WXS mapping tools.

A Look at Derivation by Extension of Complex Types

Extension of complex types involves creating a derived complex type whose content model is a superset of its base type's content model. Complex type extension involves adding extra attributes or elements to the content model of a base type in the derived type. Elements added via extension are treated as if they were appended to the content model of the base type in sequence. This technique is useful for extracting the common aspects of a set of complex types and then reusing these commonalities via extending the base type definition.

The following schema uses derivation by extension to extend a complex type which describes a subscriber to the XML-DEV mailing list to a type that describes me. An instance of the DareObasanjo type is not necessarily a valid instance of the XML-Deviant type.


<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

 <!-- base type -->
 <xs:complexType name="XML-Deviant">
  <xs:sequence>
   <xs:element name="numPosts" type="xs:integer" minOccurs="0" maxOccurs="1" />
   <xs:element name="email" type="xs:string" />
  </xs:sequence>
  <xs:attribute name="firstSubscribed" type="xs:date" use="optional" />
  <xs:attribute name="lastPostDate" type="xs:date" use="optional" />
 </xs:complexType>

 <!-- derived type -->
 <xs:complexType name="DareObasanjo">
  <xs:complexContent>
   <xs:extension base="XML-Deviant">
    <xs:sequence>
     <xs:element name="signature" type="xs:string" />
    </xs:sequence>
    <xs:attribute name="mailReader" type="xs:string" fixed="Microsoft Outlook" />
   </xs:extension>
  </xs:complexContent>
 </xs:complexType>

</xs:schema>

The Problems with Derivation by Extension of Complex Types

For users who want to use an XML schema to validate that an XML document conforms to its contract, derivation by extension seems to be an excellent way to componentize and reuse aspects of a schema. Although this seems true at first glance, interactions with other features of WXS such as substitution groups and xsi:type make the usage of derivation by extension problematic. For instance consider the following element declaration:


  <xs:element name="xml-deviant" type="XML-Deviant" />

which declares an xml-deviant element whose type is the XML-Deviant complex type from the schema in the previous section. Both of the following XML elements are valid against the xml-deviant element declaration:


  <xml-deviant firstSubscribed="1999-05-31">
   <email>johndoe@example.com</email>
  </xml-deviant>

  <xml-deviant xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:type="DareObasanjo" firstSubscribed="1999-05-31"
               mailReader="Microsoft Outlook">
   <email>dareo@online.microsoft.com</email>
   <signature>XML is about data not objects, that is the zen of XML.</signature>
  </xml-deviant>

Although the element declaration explicitly states that the type of the xml-deviant element is the XML-Deviant complex type, it is possible for an instance to override the declaration in the schema using the xsi:type attribute as long as the new type is a subtype of the original type. This means that, by default, even though an element is successfully validated, it does not necessarily conform to the content model the consumer believes it's being validated against. A similar problem is faced when the target element declaration is designated as the head of a substitution group.
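
The substitution group case can be sketched as follows, assuming the extension schema from the previous section. The dare element is a hypothetical name introduced purely for illustration; both declarations must be global (direct children of xs:schema).

```xml
<!-- head of the substitution group -->
<xs:element name="xml-deviant" type="XML-Deviant" />

<!-- hypothetical member: its type must be derived from the head's type -->
<xs:element name="dare" type="DareObasanjo" substitutionGroup="xml-deviant" />
```

With these declarations, anywhere a content model refers to xml-deviant through `<xs:element ref="xml-deviant" />`, a dare element carrying the extended content model may appear instead, without any xsi:type attribute in the instance to signal the change.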

There are two ways to get around this potential problem with derivation by extension. The first involves blocking substitution or type derivation by placing the block or final attribute on the element declaration or the complex type declaration. Similarly, the blockDefault or finalDefault attribute can be placed on the xs:schema element to specify which kinds of substitutions or derivations are disallowed in the schema. The second option involves using named model groups (xs:group) and attribute groups to modularize one's schema as opposed to using derivation by extension. Below is the schema from the previous section rewritten using named model groups:

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

 <xs:complexType name="XML-Deviant">
  <xs:group ref="XMLDeviantGrp" />
  <xs:attributeGroup ref="XMLDeviantAttrGrp" />
 </xs:complexType>

 <xs:complexType name="DareObasanjo">
  <xs:sequence>
   <xs:group ref="XMLDeviantGrp" />
   <xs:element name="signature" type="xs:string" />
  </xs:sequence>
  <xs:attributeGroup ref="XMLDeviantAttrGrp" />
  <xs:attribute name="mailReader" type="xs:string" fixed="Microsoft Outlook" />
 </xs:complexType>

 <xs:group name="XMLDeviantGrp">
  <xs:sequence>
   <xs:element name="numPosts" type="xs:integer" minOccurs="0" maxOccurs="1" />
   <!-- email is required, matching the base type in the extension example -->
   <xs:element name="email" type="xs:string" />
  </xs:sequence>
 </xs:group>

 <xs:attributeGroup name="XMLDeviantAttrGrp">
  <xs:attribute name="firstSubscribed" type="xs:date" use="optional" />
  <xs:attribute name="lastPostDate" type="xs:date" use="optional" />
 </xs:attributeGroup>

</xs:schema>
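
The first option, blocking, can be sketched with either of the fragments below. They are alternatives applied to the derivation-by-extension schema rather than one combined schema, and the choice of "#all" is simply the broadest setting picked for illustration.

```xml
<!-- Alternative 1: on a single element declaration, disallow xsi:type
     overrides and substitution group members -->
<xs:element name="xml-deviant" type="XML-Deviant" block="#all" />

<!-- Alternative 2: make the same setting the default for every
     declaration in the schema that has no block attribute of its own -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" blockDefault="#all">
 <!-- type definitions and element declarations as before -->
</xs:schema>
```

Note that block leaves the derived type definitions legal and only prevents them from being substituted in instances, whereas final on the XML-Deviant type would make the DareObasanjo derivation itself a schema error.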

For usage scenarios that revolve around strongly typed XML, derivation by extension poses a different but related set of problems. In situations where an XML schema is used as a basis to map between XML and the object-oriented or relational models, derivation by extension does not prove too problematic. However, when processing such strongly typed XML with schema-aware programming languages such as XQuery or XSLT 2.0, certain problems arise. XQuery is a statically typed language, meaning that it is expected to detect type-related errors at compile time instead of at execution time. The following query is problematic given the previous examples:


   for $x in //xml-deviant
   return $x/signature

On the one hand, the above expression should lead to a static error because the xml-deviant element is declared as having XML-Deviant as its type, which does not have a signature element. On the other hand, since a subtype of XML-Deviant exists which has a signature element in its content model, and hence could be the target of an xsi:type directive, this shouldn't be a static error. Both positions are valid, and regardless of which one XQuery has chosen there will be people who expect the opposite. Developers with a background in XPath may expect it to work, while developers who are familiar with statically typed languages would recognize it as being equivalent to the following, and thus an error:


      // C#-style fragment: XmlDeviant declares no signature member
      foreach (XmlDeviant b in list) {
          yield return b.signature; // static type error
      }

To prevent this problem and others related to it, it is best to avoid derivation by extension if the XML document will be processed by an XML Schema-aware processing language like XQuery.

Conclusion

Based on the current technological landscape, the complex type derivation features of WXS may add more problems than they solve in the two most common schema use cases. For validation scenarios, derivation by restriction is of marginal value, while derivation by extension is a good way to create modularity as well as encourage reuse. Care must, however, be taken to consider the ramifications of the various type substitutability features of WXS (xsi:type and substitution groups) when using derivation by extension in scenarios revolving around document validation.

Currently, processing and storage of strongly typed XML data is primarily the province of conventional OOP languages and relational databases respectively. This means that certain features of WXS such as derivation by restriction (and to a lesser extent derivation by extension) cause an impedance mismatch between the type system used to describe strongly typed XML and the mechanisms used for processing and storing said XML. Eventually, when technologies like XQuery become widespread for processing typed XML and support for XML and W3C XML Schema is integrated into mainstream database products, this impedance mismatch will not be important. Until then, complex type derivation should be carefully evaluated before being used in situations where W3C XML Schema is primarily being used as a mechanism to create type annotated XML infosets.

Acknowledgments

I'd like to thank Don Box, Chris Lovett and Erik Meijer for their ideas and feedback while writing this article.