W3C XML Schema Design Patterns: Avoiding Complexity
November 20, 2002
Introduction
Over the course of the past year, during which I've worked closely with W3C XML Schema (WXS), I've observed many schema authors struggle with various aspects of the language. Given the size and relative complexity of the WXS recommendation (parts one and two), it seems that many schema authors would be best served by understanding and utilizing an effective subset instead of attempting to comprehend all of its esoterica.
There have been a few public attempts to define an effective subset of W3C XML Schema for general usage; the most notable have been W3C XML Schema Made Simple by Kohsuke Kawaguchi and the X12 Reference Model for XML Design by the Accredited Standards Committee (ASC) X12. However, both documents are extremely conservative and advise against useful features of WXS without adequately describing the cost of doing so.
This article is primarily a counterpoint to Kohsuke's and considers each of his original guidelines; the goal is to provide a set of solid guidelines about what you should do and shouldn't do when working with WXS.
The Guidelines
I've altered some of Kohsuke's original guidelines:
- Do use element declarations, attribute groups, model groups, and simple types.
- Do use XML namespaces as much as possible. Learn the correct way to use them.
- Do not try to be a master of XML Schema. It would take months.
- Do use complex types and attribute declarations (originally "Do not").
- Do not use notations.
- Do use local declarations (originally "Do not").
- Do carefully use substitution groups (originally "Do not").
- Do carefully use a schema without the targetNamespace attribute, aka a chameleon schema (originally "Do not").
I propose some additional guidelines as well:
- Do favor key/keyref/unique over ID/IDREF for identity constraints.
- Do not use default or fixed values especially for types of xs:QName.
- Do not use type or group redefinition.
- Do use restriction and extension of simple types.
- Do use extension of complex types.
- Do carefully use restriction of complex types.
- Do carefully use abstract types.
- Do use elementFormDefault set to qualified and attributeFormDefault set to unqualified.
- Do use wildcards to provide well defined points of extensibility.
The features covered by guidelines qualified with the word "carefully" are best avoided by novice users unless absolutely required by the problem being solved.
Why You Should Use Global And Local Element Declarations
An element declaration is used to specify the structure, type, occurrence, and value constraints for an element. The element declaration is the most important and common piece of a schema document.
Element declarations that appear as children of the xs:schema element are global elements, which can be reused by referencing them in other parts of the schema or from other schema documents. They can also be members of substitution groups. Since the WXS recommendation doesn't provide a mechanism for specifying the root element of the document being validated, any global element can be used as the root element for a valid document.
Element declarations that appear within complex type or model group definitions, and that aren't references to a global element, are local elements. Unlike global elements, there can be many local element declarations with the same name and differing types in a schema as long as the local elements are not declared at the same level. Section 3.3 of the W3C XML Schema Primer gives the following example:
You can only declare one global element called "title", and that element is bound to a single type (e.g., xs:string or PersonTitle). However, you can locally declare one element called "title" that has a string type, and is a subelement of "book". Within the same schema (target namespace) you can declare a second element also called "title" that is an enumeration of the values "Mr Mrs Ms".
Global element declarations should be used for elements that will be reused from the target schema as well as from other schema documents, when the element and its associated type are comfortably bound together for widespread use. Local elements are to be favored when element declarations only make sense in the context of the declaring type and are unlikely to be reused.
By default, global elements have a namespace name equivalent to that of the target namespace of the schema, while local elements have no namespace name. So, by default, elements in an XML document which are meant to be validated against global element declarations should have a namespace name identical to that of the global element's schema target namespace. Those which are to be validated against local elements should have no namespace name. For example, consider this schema:
test.xsd

    <?xml version="1.0" encoding="UTF-8" ?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               targetNamespace="http://www.example.com"
               xmlns="http://www.example.com">

      <!-- global element declaration validates <language> elements
           from http://www.example.com namespace -->
      <xs:element name="language" type="xs:string" />

      <xs:element name="Root" type="sequenceOfLanguages" />
      <xs:element name="Root2" type="sequenceOfLanguages2" />

      <!-- complex type with local element declaration validates
           <language> elements without a namespace name -->
      <xs:complexType name="sequenceOfLanguages" >
        <xs:sequence>
          <xs:element name="language" type="xs:NMTOKEN" maxOccurs="unbounded" />
        </xs:sequence>
      </xs:complexType>

      <!-- complex type with reference to global element declaration -->
      <xs:complexType name="sequenceOfLanguages2" >
        <xs:sequence>
          <xs:element ref="language" maxOccurs="10" />
        </xs:sequence>
      </xs:complexType>

    </xs:schema>

test.xml

    <?xml version="1.0"?>
    <ex:Root xmlns:ex="http://www.example.com">
      <language>EN</language>
    </ex:Root>

test2.xml

    <?xml version="1.0"?>
    <ex:Root2 xmlns:ex="http://www.example.com">
      <ex:language>English</ex:language>
      <ex:language>Klingon</ex:language>
    </ex:Root2>

Why You Should Use Global And Local Attribute Declarations
An attribute declaration is used to specify the type, optionality, and defaulting information for an attribute.
Attribute declarations that appear as children of the xs:schema element are global attributes, which can be reused by referencing them in other parts of the schema or from other schema documents. Attribute declarations that appear within complex type definitions, and that do not reference global attributes, are local attributes.
Global attribute declarations should be used for attributes that will be reused from the target schema as well as from other schema documents. Local attributes should be used when attribute declarations only make sense in the context of the declaring type and are unlikely to be reused. Since attributes are usually tightly coupled to their parent elements, local attribute declarations are typically favored by schema authors. But there are cases where global attributes which can apply to many elements from multiple namespaces are useful (for example, xsi:type and xsi:schemaLocation).
By default, global attributes have a namespace name equivalent to that of the target namespace of the schema, while local attributes have no namespace name. Thus, attributes which are to be validated against global attribute declarations should have a namespace name identical to that of the global attribute's schema target namespace. Those to be validated against local attributes should have no namespace name. For example:
test.xsd

    <?xml version="1.0" encoding="UTF-8" ?>
    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               targetNamespace="http://www.example.com"
               xmlns="http://www.example.com">

      <!-- global attribute declaration validates language attributes
           from http://www.example.com namespace -->
      <xs:attribute name="language" type="xs:string" />

      <xs:element name="Root" type="sequenceOfNotes" />
      <xs:element name="Root2" type="sequenceOfNotes2" />

      <!-- complex type with local attribute declaration validates
           language attributes without a namespace name -->
      <xs:complexType name="sequenceOfNotes" >
        <xs:sequence>
          <xs:element name="Note" type="xs:string" />
        </xs:sequence>
        <xs:attribute name="language" type="xs:NMTOKEN" />
      </xs:complexType>

      <!-- complex type with reference to global attribute declaration -->
      <xs:complexType name="sequenceOfNotes2" >
        <xs:sequence>
          <xs:element name="Note" type="xs:string" />
        </xs:sequence>
        <xs:attribute ref="language" />
      </xs:complexType>

    </xs:schema>

test.xml

    <?xml version="1.0"?>
    <ex:Root xmlns:ex="http://www.example.com" language="EN" >
      <Note>Nothing to see here</Note>
    </ex:Root>

test2.xml

    <?xml version="1.0"?>
    <ex:Root2 xmlns:ex="http://www.example.com" ex:language="The English Language">
      <Note>Nothing to see here</Note>
    </ex:Root2>

Why You Should Understand How XML Namespaces Affect WXS
Support for XML Namespaces is woven tightly into the WXS recommendation. Namespaces are used in a number of places:
- when referencing global elements, attributes, or types;
- in XPath expressions used for identity constraints;
- in determining what elements and attributes schema declarations can validate; and
- when importing and including other schema documents.
Thus, schema authors should be familiar with how namespaces work, including their effect on W3C XML Schema. I wrote two MSDN articles which address this issue: "XML Namespaces and How They Affect XPath and XSLT" provides a detailed overview of XML namespaces and "Working with Namespaces in XML Schema" explains the ramifications of namespaces in WXS.
Why You Should Always Set elementFormDefault to "qualified"
Elements or attributes with a namespace name are said to be "namespace qualified". It's possible to override whether local declarations validate namespace qualified elements and attributes or not. The xs:schema element has the elementFormDefault and attributeFormDefault attributes, which specify whether local declarations in the schema should validate namespace qualified elements and attributes respectively. The valid values for either attribute are "qualified" and "unqualified". The default value of both attributes is "unqualified".
The form attribute on local element and attribute declarations can be used to override the values of the elementFormDefault and attributeFormDefault attributes specified on the xs:schema element. This allows for fine-grained control over the way validation of elements and attributes in the instance document operates in relation to global or local declarations.
The following example, taken from Kohsuke's article (the "Why You Should Avoid Local Declarations" section), shows exactly how these attributes can significantly affect the outcome of validation:
This schema

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               targetNamespace="http://example.com">
      <xs:element name="person">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="familyName" type="xs:string" />
            <xs:element name="firstName" type="xs:string" />
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>

validates the following document

    <foo:person xmlns:foo="http://example.com">
      <familyName> KAWAGUCHI </familyName>
      <firstName> Kohsuke </firstName>
    </foo:person>

which is unlikely to be what the schema author intended. And it's ugly, too. Altering the schema thus:

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               targetNamespace="http://example.com"
               elementFormDefault="qualified">
      <xs:element name="person">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="familyName" type="xs:string" />
            <xs:element name="firstName" type="xs:string" />
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>

allows it to validate

    <person xmlns="http://example.com">
      <familyName> KAWAGUCHI </familyName>
      <firstName> Kohsuke </firstName>
    </person>

or

    <foo:person xmlns:foo="http://example.com">
      <foo:familyName> KAWAGUCHI </foo:familyName>
      <foo:firstName> Kohsuke </foo:firstName>
    </foo:person>

Leaving the value of the attributeFormDefault attribute as "unqualified" makes sense because most schema authors don't want to have to namespace qualify all attributes explicitly by prefixing them.
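As a minimal sketch of the form attribute described in this section, the hypothetical schema below leaves attributeFormDefault at "unqualified" but qualifies a single local attribute. The names item, id, and lang are invented for illustration and come from neither article.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://example.com"
           elementFormDefault="qualified"
           attributeFormDefault="unqualified">
  <xs:element name="item">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <!-- follows attributeFormDefault: validates an unprefixed id attribute -->
          <xs:attribute name="id" type="xs:string" />
          <!-- form overrides the schema-wide default: validates only a lang
               attribute in the http://example.com namespace -->
          <xs:attribute name="lang" type="xs:language" form="qualified" />
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Under these assumptions, an instance such as <foo:item xmlns:foo="http://example.com" id="a1" foo:lang="en">text</foo:item> should validate, while an unprefixed lang attribute should not.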
Why You Should Use Attribute Groups
An attribute group definition is a way to create a named collection of attribute declarations and attribute wildcards. Attribute groups increase the modularity of schemas. You can declare a commonly used set of attributes in a single location and then reference them from other schemas.
Kohsuke's article describes attribute groups as an alternative to global attribute declarations, which may give the incorrect impression that the two are mutually exclusive. A globally declared attribute is an individual, reusable attribute declaration. An attribute group is a modularly clustered set of attributes; the attribute declarations in an attribute group can either be local attribute declarations or references to global declarations.
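To illustrate that the two mechanisms combine rather than compete, here is a small sketch of an attribute group holding one local attribute declaration and one reference to a global attribute. The names commonAttributes, id, and Note are assumptions made for this example.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.example.com"
           xmlns="http://www.example.com">

  <!-- global attribute: individually reusable -->
  <xs:attribute name="language" type="xs:language" />

  <!-- attribute group: a named cluster mixing both kinds of declaration -->
  <xs:attributeGroup name="commonAttributes">
    <!-- local attribute declaration -->
    <xs:attribute name="id" type="xs:string" />
    <!-- reference to the global attribute declared above -->
    <xs:attribute ref="language" />
  </xs:attributeGroup>

  <xs:element name="Note">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attributeGroup ref="commonAttributes" />
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>

</xs:schema>
```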
Why You Should Use Model Groups
A model group definition is a mechanism for creating named groups of elements using the all, choice, or sequence compositors. Model groups are useful for reusing groups of elements without resorting to type derivation. However, model groups are not a replacement for complex types; they cannot contain attribute declarations and they cannot be specified as the type of an element declaration. Additionally, derivation of model groups is much more limited than derivation of complex types.
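A rough sketch of this kind of reuse follows. The group personName and the types Author and Employee are hypothetical; the point is only that two unrelated complex types share a sequence of elements without any derivation relationship.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- named model group: a reusable sequence of elements; no attributes allowed -->
  <xs:group name="personName">
    <xs:sequence>
      <xs:element name="firstName" type="xs:string" />
      <xs:element name="familyName" type="xs:string" />
    </xs:sequence>
  </xs:group>

  <!-- two unrelated types reuse the group without type derivation -->
  <xs:complexType name="Author">
    <xs:sequence>
      <xs:group ref="personName" />
      <xs:element name="publisher" type="xs:string" />
    </xs:sequence>
  </xs:complexType>

  <xs:complexType name="Employee">
    <xs:sequence>
      <xs:group ref="personName" />
      <xs:element name="department" type="xs:string" />
    </xs:sequence>
    <!-- attributes have to be declared on the type, not in the group -->
    <xs:attribute name="id" type="xs:string" />
  </xs:complexType>

</xs:schema>
```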
Why You Should Use The Builtin Simple Types
A major benefit of WXS over DTDs in XML 1.0 is the existence of datatypes. The ability to specify that the values of elements or attributes are strings, dates, or numeric data enables schema authors to specify and validate the contents of XML data in an interoperable and platform independent manner. Given the number of built-in datatypes (44 by my count), it may be wise for schema authors to standardize on a subset of the built-in types to avoid information overload.
In most cases users can do without the subtypes of xs:string (e.g. xs:ENTITY or xs:language), the subtypes of xs:integer (e.g. xs:short or xs:unsignedByte), or the Gregorian date types (e.g. xs:gMonthDay or xs:gYearMonth). Eliminating these types reduces the amount of information to a more easily managed amount.
Why You Should Use Complex Types
A complex type definition is used to specify a content model consisting of elements and attributes. An element declaration can specify its content model by referring to a named or anonymous complex type. Named complex types can be referenced by name from the schema they are defined in or by external schema documents; anonymous complex types must be defined within the declaration for the element which uses the type. Additionally the content models of named complex types can be extended or restricted using WXS inheritance mechanisms.
Complex types are similar to model group definitions with two main differences. First, complex type definitions can include attributes in the content models they define. Second, it's possible to use type derivation with complex types, which isn't the case with named model groups. Kohsuke's article advocates using a combination of anonymous complex types, model group definitions, and attribute groups to specify the content model of an element instead of named complex types. He does so in an attempt to avoid dealing with what he sees as the complexity of named complex types. However, I'd counter that using three mechanisms instead of one to specify the content model of an element is actually more prone to confusion. Thus, in addition to the fact that named complex types allow for reuse of content models, they're also the most straightforward way of specifying the content model of an element.
Anonymous complex types should only be used if references to the type will not be needed outside the element declaration and there is no need for type derivation. It is important to note that it is not possible to derive a new type from an anonymous complex type. In general, schemas that make heavy use of anonymous types are likely to have problems with uniformity and consistency.
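The contrast can be sketched as follows, with illustrative names (PersonType, author, editor, reviewer) that are not taken from either article: the named type is shared by two elements and could serve as a base for derivation, while the anonymous type is tied to a single element.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- named complex type: reusable and derivable -->
  <xs:complexType name="PersonType">
    <xs:sequence>
      <xs:element name="name" type="xs:string" />
    </xs:sequence>
    <xs:attribute name="id" type="xs:string" />
  </xs:complexType>

  <!-- two elements share the same named type -->
  <xs:element name="author" type="PersonType" />
  <xs:element name="editor" type="PersonType" />

  <!-- anonymous complex type: usable only by this one element,
       and no other type can be derived from it -->
  <xs:element name="reviewer">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string" />
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>
```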
Why You Should Not Use Notation Declarations
Kohsuke's admonition to avoid notation declarations is spot on. They exist only to provide backward compatibility with DTDs, except they are not backward compatible with DTD notations. Pretend they do not exist. I certainly do.
Why You Should Use Substitution Groups Carefully
Substitution Groups provide a mechanism similar to subtype polymorphism in programming languages. One or more elements can be marked as being substitutable for a global element (also called the head element), which means that members of this substitution group are interchangeable with the head element in a content model. For example, for an Address substitution group with members USAddress and UKAddress, the generic element Address can be used in the content model, or it can be substituted by a USAddress or a UKAddress. The only requirement is that the members of the substitution group must be of the same type or be in the same type hierarchy as the head element.
The following is an example schema and the instance which it validates:
example.xsd:

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               targetNamespace="http://www.example.com"
               xmlns:ex="http://www.example.com"
               elementFormDefault="qualified">
      <xs:element name="book" type="xs:string" />
      <xs:element name="magazine" type="xs:string" substitutionGroup="ex:book" />
      <xs:element name="library">
        <xs:complexType>
          <xs:sequence>
            <xs:element ref="ex:book" maxOccurs="unbounded"/>
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>

example.xml:

    <library xmlns="http://www.example.com">
      <magazine>MSDN Magazine</magazine>
      <book>Professional XML Databases</book>
    </library>

The content model of the library element says that it can hold one or more book elements. Since magazine elements are in the book substitution group, it's valid for magazine elements to appear in the instance XML where book elements are expected.
Substitution groups make content models more flexible and allow extensibility in directions the schema author may not have anticipated. This flexibility is a two-edged sword: although it allows greater extensibility, it makes processing documents based on such schemas more difficult. For instance, the code that processes the library element must not only handle its child book elements but magazine elements as well. If the instance document specified additional schemas via the xsi:schemaLocation attribute, the processing application could have to deal with even more members of the book substitution group as children of the library element.
Another complication is that members of a substitution group can be of a type derived from the type of the substitution group's head. Writing code to properly handle any derived type generically is difficult, especially since there are two opposite notions of derivation. The first, restriction, restricts the range of values in the content model. The second, extension, adds elements or attributes to the content model. Certain attributes on element declarations can be used to give schema authors more control over element substitutions and reduce the likelihood of unexpected substitutions in XML instance documents. The block attribute is used to specify whether elements whose types use a certain derivation method can substitute for the element in an instance document, while the final attribute is used to specify whether elements whose types use a certain derivation method can declare themselves to be part of the target element's substitution group. The default values of the block and final attributes for all element declarations in a schema can be specified via the blockDefault and finalDefault attributes of the root xs:schema element. By default all substitutions are allowed without limitation.
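The following sketch shows one way the block and final attributes might be applied. The journal and newsletter elements are hypothetical additions to the library vocabulary used above.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.example.com"
           xmlns:ex="http://www.example.com"
           elementFormDefault="qualified">

  <!-- final="#all": no element may name ex:book as its substitutionGroup -->
  <xs:element name="book" type="xs:string" final="#all" />

  <!-- block="substitution": members may join the group, but they cannot
       stand in for ex:journal in an instance document -->
  <xs:element name="journal" type="xs:string" block="substitution" />
  <xs:element name="newsletter" type="xs:string" substitutionGroup="ex:journal" />

</xs:schema>
```

Setting blockDefault="substitution" on xs:schema would apply the same restriction to every element declaration that does not specify its own block value.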
Why You Should Favor key/keyref/unique Over ID/IDREF For Identity Constraints
DTDs provide a mechanism for specifying that an attribute's type is ID, i.e., its value will be unique within the document and matches the Name production in XML 1.0. IDs in XML 1.0 can also be referenced by attributes of type IDREF or IDREFS. For compatibility with DTDs, WXS has the xs:ID, xs:IDREF, and xs:IDREFS types.
WXS identity constraints are used for specifying unique values, keys, or references to keys using XPath expressions defined within the scope of an element declaration. Comparing feature for feature, the identity constraint mechanisms offer more than ID/IDREF. First, there is no limit on the values or types that can be used as part of an identity constraint, whereas IDs can only be one of a specific range of values (e.g., 7 is not a valid ID). A more important benefit is scope: an ID has to be unique within the entire document, but the values covered by a WXS identity constraint don't. The symbol space for unique IDs is the entire document, but for unique keys it's the target scope of the XPath. This is particularly useful if uniqueness is needed in two overlapping value spaces with different scopes in the same XML document. For example, consider an XML document that contains room numbers and table numbers for a hotel. It is likely that some of the numbers overlap (e.g., there is a room 18 and a table 18), but they should be unique within either value space.
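The hotel scenario can be sketched with two independent keys. The element and key names (hotel, room, table, roomNumber, tableNumber) are assumptions made for this illustration.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="hotel">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="room" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="number" type="xs:positiveInteger" use="required" />
          </xs:complexType>
        </xs:element>
        <xs:element name="table" maxOccurs="unbounded">
          <xs:complexType>
            <xs:attribute name="number" type="xs:positiveInteger" use="required" />
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>

    <!-- room numbers must be unique among rooms only -->
    <xs:key name="roomNumber">
      <xs:selector xpath="room" />
      <xs:field xpath="@number" />
    </xs:key>

    <!-- table numbers must be unique among tables only -->
    <xs:key name="tableNumber">
      <xs:selector xpath="table" />
      <xs:field xpath="@number" />
    </xs:key>
  </xs:element>
</xs:schema>
```

With this schema a document containing both <room number="18"/> and <table number="18"/> should be valid, which is not expressible with xs:ID, both because 18 is not a legal ID value and because IDs share one document-wide symbol space.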
The WXS family of ID types is not exactly compatible with the DTD ID types. First, the xs:ID, xs:IDREF, and xs:IDREFS types can be applied to both elements and attributes in WXS, although they can only apply to attributes in their DTD equivalents. Second, there's no restriction on how many attributes of type xs:ID can appear on an element, although such a restriction exists for ID attributes in the DTD equivalents.
Why You Should Use Chameleon Schemas Carefully
The target namespace of a schema document identifies the namespace name of the elements and attributes which can be validated against the schema. A schema without a target namespace can typically only validate elements and attributes without a namespace name. However, if a schema without a target namespace is included in a schema with a target namespace, the included schema takes on the target namespace of the including schema. This feature is typically called the Chameleon schema design pattern.
In Kohsuke's article he claims that the chameleon schema pattern does not work, which is incorrect. A full rebuttal of Kohsuke's claim was made by Michael Leditschke on XML-DEV, and it shows that the design pattern does work and is useful for creating a reusable module of type definitions and declarations.
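A minimal sketch of the pattern, using two hypothetical files named person-types.xsd and main.xsd, might look like this. The first file has no target namespace:

```xml
<!-- person-types.xsd: no targetNamespace, so it can act as a chameleon -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:complexType name="PersonType">
    <xs:sequence>
      <xs:element name="name" type="xs:string" />
    </xs:sequence>
  </xs:complexType>
</xs:schema>
```

The second file includes it, after which PersonType is treated as a component of the including schema's target namespace:

```xml
<!-- main.xsd: PersonType is now referenced as ex:PersonType -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.example.com"
           xmlns:ex="http://www.example.com">
  <xs:include schemaLocation="person-types.xsd" />
  <xs:element name="person" type="ex:PersonType" />
</xs:schema>
```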
There is a problem with combining chameleon schemas with identity constraints. Although QName references to types, definitions, and declarations in the chameleon schema are coerced into the namespace of the including schema, the same is not done for XPath expressions used by xs:key, xs:keyref, and xs:unique identity constraints. Consider the following schema:

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               elementFormDefault="qualified">
      <xs:element name="Root">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="person" type="PersonType" maxOccurs="unbounded" />
          </xs:sequence>
        </xs:complexType>
        <xs:key name="PersonKey">
          <xs:selector xpath="person"/>
          <xs:field xpath="@name"/>
        </xs:key>
        <xs:keyref name="BestFriendKey" refer="PersonKey">
          <xs:selector xpath="person"/>
          <xs:field xpath="@best-friend"/>
        </xs:keyref>
      </xs:element>
      <xs:complexType name="PersonType">
        <xs:simpleContent>
          <xs:extension base="xs:string">
            <xs:attribute name="best-friend" type="xs:string" />
            <xs:attribute name="name" type="xs:string" />
          </xs:extension>
        </xs:simpleContent>
      </xs:complexType>
    </xs:schema>

If this schema is included in another schema with a target namespace, the XPath expressions in both the key and keyref will fail. In this specific example, the person element is in no namespace in the chameleon schema, but once included in another schema it picks up that schema's target namespace. The XPath expressions, which match a person element in no namespace, then select nothing, and they fail silently because processors are not obliged to ensure that path expressions in identity constraints actually return results.
The point is that it is not advisable to use identity constraints in chameleon schemas.
Why You Should Not Use Default Or Fixed Values Especially For Types Of xs:QName
The primary complaint against default and fixed values is that they cause new data to be inserted into the source XML after validation, thus changing the data. This means that an unvalidated document that has a schema with default values is incomplete. Tying the actual content of the XML document to the validation process is unwise since a schema may not always be available. It's also unwise to assume that consumers of the document will always perform validation.
The xs:QName type has additional validation problems caused by the fact that it has no canonical form. Consider this schema and XML instance:

    <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
               targetNamespace="http://www.example.com"
               xmlns:ex="http://www.example.com"
               xmlns:ex2="ftp://ftp.example.com"
               elementFormDefault="qualified">
      <xs:element name="Root">
        <xs:complexType>
          <xs:sequence>
            <xs:element name="Node" type="xs:QName" default="ex2:FtpSite" />
          </xs:sequence>
        </xs:complexType>
      </xs:element>
    </xs:schema>

    <Root xmlns="http://www.example.com"
          xmlns:ex2="smtp://smtp.example.org"
          xmlns:foo="ftp://ftp.example.com">
      <Node />
    </Root>

What value should be inserted into the Node element upon validation? Should it be "ex2:FtpSite"? Even if the ex2 prefix is mapped to a different namespace in the instance document than in the schema? Maybe it should be "foo:FtpSite" because the prefix "foo" is mapped to the same namespace that "ex2" was mapped to in the schema. But then what would happen if no XML namespace declaration existed for the ftp://ftp.example.com namespace? Would a namespace declaration have to be inserted? None of these questions can be answered in a satisfactory manner without violating some opinions as to what the correct behavior should be. It is best to avoid using xs:QName default values because it's unlikely that different implementations agree on the relevant semantics.
Why You Should Use Restriction And Extension Of Simple Types
Restriction of a simple type involves constraining the facets of the type, thus reducing the permitted values of the type. Such restrictions involve specifying a maximum length for a string value, specifying a date range, or enumerating the list of permitted values. Types constrained in this manner are very commonly used by schema authors and account for most uses of type derivation in WXS. Such types can be used by both elements and attributes as their type definition.
Extension of simple types allows one to create a complex type (i.e. an element content model) with simple content that has attributes. A typical extension scenario is any situation where an element declaration has a simple type as its content and one or more attributes. Since such element content models occur commonly in XML documents, derivation by extension is another commonly used feature.
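Both kinds of derivation can be sketched in a few lines. The names CurrencyCode, Price, and price are hypothetical: the first type restricts xs:string by enumeration, and the second extends xs:decimal with an attribute to produce a complex type with simple content.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- restriction: a simple type whose facets narrow xs:string -->
  <xs:simpleType name="CurrencyCode">
    <xs:restriction base="xs:string">
      <xs:enumeration value="USD" />
      <xs:enumeration value="GBP" />
      <xs:enumeration value="EUR" />
    </xs:restriction>
  </xs:simpleType>

  <!-- extension: simple content (a decimal) plus an attribute -->
  <xs:complexType name="Price">
    <xs:simpleContent>
      <xs:extension base="xs:decimal">
        <xs:attribute name="currency" type="CurrencyCode" use="required" />
      </xs:extension>
    </xs:simpleContent>
  </xs:complexType>

  <xs:element name="price" type="Price" />

</xs:schema>
```

An instance such as <price currency="USD">19.95</price> should validate against this sketch.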
As with complex types, there are named and anonymous simple types. Named simple types can be referenced by name from the schema they are defined in or from external schema documents. Anonymous simple types must be defined within the declaration for the element or attribute which uses the type. And type derivation can only be performed on named types.
A common misconception is that anonymous types with the same structure are the same type. In other words, assuming that this schema fragment

    <!-- fragment A -->
    <xs:element name="quantity">
      <xs:simpleType>
        <xs:restriction base="xs:positiveInteger">
          <xs:maxExclusive value="100"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:element>
    <xs:element name="size">
      <xs:simpleType>
        <xs:restriction base="xs:positiveInteger">
          <xs:maxExclusive value="100"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:element>

is equivalent to this one

    <!-- fragment B -->
    <xs:simpleType name="underHundred">
      <xs:restriction base="xs:positiveInteger">
        <xs:maxExclusive value="100"/>
      </xs:restriction>
    </xs:simpleType>
    <xs:element name="size" type="underHundred"/>
    <xs:element name="quantity" type="underHundred"/>

is incorrect with regard to whether both element declarations have the same type. Various aspects of WXS may require element declarations to have the same type (substitution groups, specifying key/keyref pairs, and type derivation). For instance, a keyref must be of the same type as a key. Most features of WXS treat the element declarations in fragment A as having different types and those in fragment B as having the same type.
Why You Should Use Extension Of Complex Types
Extension of a complex type involves adding extra attributes or elements to the content model in the derived type. Elements added via extension are treated as if they were appended to the content model of the base type in sequence. This technique is useful for extracting the common aspects of a set of complex types and then reusing these commonalities by extending the base type definition. The following schema fragment, taken from the discussion of complex type extension in the WXS Primer, shows how extension enables the reuse of the common aspects of a mailing address.

    <xs:complexType name="Address">
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
        <xs:element name="street" type="xs:string"/>
        <xs:element name="city" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>

    <xs:complexType name="USAddress">
      <xs:complexContent>
        <xs:extension base="Address">
          <xs:sequence>
            <xs:element name="state" type="USState"/>
            <xs:element name="zip" type="xs:positiveInteger"/>
          </xs:sequence>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>

    <xs:complexType name="UKAddress">
      <xs:complexContent>
        <xs:extension base="Address">
          <xs:sequence>
            <xs:element name="postcode" type="UKPostcode"/>
          </xs:sequence>
          <xs:attribute name="exportCode" type="xs:positiveInteger" fixed="1"/>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>

In this schema the Address type defines the information common to addresses in general; its derived types add information specific to addresses from the United States and United Kingdom, respectively. The ability to reuse and build upon content models using extension is a powerful and useful feature of WXS that promotes modularity and content uniformity.
There is a caveat for processors that deal with types derived by extension. This caveat has to do with type-aware processors and the elements added to a content model by extension. In the future it is possible that type-aware languages like XQuery or XSLT 2.0 will be able to process XML elements and attributes polymorphically. For instance, an application can decide to process all elements of type Address or that have Address as their base type, choosing to process the information that is common to all types. However a query such as
//*[. instance of Address]/city
could return unexpected results if dealing with a derived type that extended the content model in the following way

    <xs:complexType name="BadAddress">
      <xs:complexContent>
        <xs:extension base="Address">
          <xs:sequence>
            <!-- address format has two city entries, one for
                 neighborhood and another for the actual city -->
            <xs:element name="city" type="xs:string"/>
            <xs:element name="state" type="xs:string"/>
            <xs:element name="country" type="xs:string"/>
          </xs:sequence>
          <xs:attribute name="exportCode" type="xs:positiveInteger" fixed="1"/>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>

Although the example is contrived and the scenario seems unlikely, it demonstrates a real risk. A more detailed exposition on this potential problem has been provided by Paul Prescod on XML-DEV.
Why You Should Very Carefully Use Restriction Of Complex Types
Restriction of complex types involves creating a derived complex type whose content model is a subset of the base type.
The parts of the WXS spec which describe derivation by restriction in complex types (Section 3.4.6 and Section 3.9.6) are generally considered to be its most complex parts. Most bugs in implementations cluster around this feature, and it is quite common to see implementers express exasperation when discussing the various nuances of derivation by restriction in complex types. Further, this kind of derivation does not neatly map to concepts in either object-oriented programming or relational database theory, the domains of the primary producers and consumers of XML data. This is the exact opposite of the situation with derivation by extension of complex types.
Another challenge in using derivation by restriction of complex types arises from the way in which restrictions are declared: when a complex type is derived by restriction from another complex type, the base type's content model must be duplicated and then refined. This duplication can extend down a long derivation chain, so any modification to an ancestor type must be manually propagated down the derivation tree. Furthermore, such replication cannot cross namespace boundaries: deriving ns2:SlowCar from ns1:Car may not work if ns2:SlowCar has a child element ns2:MaxSpeed, because that element cannot be correctly derived from ns1:Car's child element ns1:MaxSpeed.
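The namespace problem can be sketched as follows. The file name car.xsd, the example.com namespace URIs, and the xs:integer type of MaxSpeed are assumptions made for illustration; only the type and element names come from the discussion above. The comment at the top summarizes what the imported ns1 schema is assumed to declare.

```xml
<!-- Assumed content of car.xsd (targetNamespace http://www.example.com/ns1,
     elementFormDefault="qualified"):
       <xs:complexType name="Car">
         <xs:sequence>
           <xs:element name="MaxSpeed" type="xs:integer" minOccurs="0"/>
         </xs:sequence>
       </xs:complexType>
     The local MaxSpeed element therefore lives in the ns1 namespace. -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:ns1="http://www.example.com/ns1"
           targetNamespace="http://www.example.com/ns2"
           elementFormDefault="qualified">

  <xs:import namespace="http://www.example.com/ns1" schemaLocation="car.xsd"/>

  <xs:complexType name="SlowCar">
    <xs:complexContent>
      <xs:restriction base="ns1:Car">
        <xs:sequence>
          <!-- This local declaration is in the ns2 namespace, so it does not
               match the ns1 MaxSpeed particle it is meant to restrict. -->
          <xs:element name="MaxSpeed" type="xs:integer"/>
        </xs:sequence>
      </xs:restriction>
    </xs:complexContent>
  </xs:complexType>
</xs:schema>
```

A conforming processor should reject SlowCar as an invalid restriction. One way around this, if the ns1 schema can be changed, is to declare MaxSpeed as a global element there and have both types refer to it with ref="ns1:MaxSpeed".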
The following schema uses derivation by restriction to restrict a complex type, which describes a subscriber to the XML-DEV mailing list, to a type that describes me. Any element that conforms to the DareObasanjo type can also be validated as an instance of the XML-Deviant type.
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- base type -->
  <xs:complexType name="XML-Deviant">
    <xs:sequence>
      <xs:element name="numPosts" type="xs:integer" minOccurs="0" maxOccurs="1" />
      <xs:element name="signature" type="xs:string" nillable="true" />
    </xs:sequence>
    <xs:attribute name="firstSubscribed" type="xs:date" use="optional" />
    <xs:attribute name="mailReader" type="xs:string"/>
  </xs:complexType>

  <!-- derived type -->
  <xs:complexType name="DareObasanjo">
    <xs:complexContent>
      <xs:restriction base="XML-Deviant">
        <xs:sequence>
          <xs:element name="numPosts" type="xs:integer" minOccurs="1" />
          <xs:element name="signature" type="xs:string" nillable="false" />
        </xs:sequence>
        <xs:attribute name="firstSubscribed" type="xs:date" use="required" />
        <xs:attribute name="mailReader" type="xs:string" fixed="Microsoft Outlook" />
      </xs:restriction>
    </xs:complexContent>
  </xs:complexType>

</xs:schema>
Derivation by restriction of complex types is a multifaceted feature that is useful in situations where secondary types need to conform to a generic primary type, but also add their own constraints which go beyond those of the primary type. However, its extreme complexity requires that it be used only by those who have a firm grasp of WXS.
Why You Should Carefully Use Abstract Types
Borrowing a concept from OOP languages like C# and Java, both element declarations and complex type definitions can be made abstract. An abstract element declaration cannot be used to validate an element in an XML instance document and can only appear in content models via substitution. An abstract complex type definition similarly cannot be used to validate an element in an XML instance document, but it can be used as the abstract parent of an element's derived type or in cases where the element's type is overridden in the instance using xsi:type.
Abstract complex types and element declarations are useful for creating generic base types which contain information common to a set of types (such as Shape vs. Circle or Square), yet the definition is not deemed "complete" unless further derivation (extension or restriction) has been applied. While this feature is not complicated to use, some implications of its use are subtle and complex. Abstract types should be used with care.
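A minimal sketch of the Shape and Circle case might look like the following. The element names label, radius, and shape are hypothetical and chosen only for illustration.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- abstract base type: holds what all shapes share, but cannot itself
       be the type used to validate an element in an instance -->
  <xs:complexType name="Shape" abstract="true">
    <xs:sequence>
      <xs:element name="label" type="xs:string"/>
    </xs:sequence>
  </xs:complexType>

  <!-- concrete type derived by extension -->
  <xs:complexType name="Circle">
    <xs:complexContent>
      <xs:extension base="Shape">
        <xs:sequence>
          <xs:element name="radius" type="xs:decimal"/>
        </xs:sequence>
      </xs:extension>
    </xs:complexContent>
  </xs:complexType>

  <!-- declared with the abstract type, so every instance must name a
       concrete derived type via xsi:type -->
  <xs:element name="shape" type="Shape"/>

</xs:schema>
```

Under this schema a shape element is valid only if it carries xsi:type="Circle" (or another concrete type derived from Shape) and contains the corresponding children; a shape element without xsi:type is rejected because Shape is abstract.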
Do Use Wildcards to Provide Well Defined Points Of Extensibility
WXS provides the wildcards xs:any and xs:anyAttribute, which can be used to allow elements and attributes from specified namespaces to occur in a content model. Wildcards allow schema authors to enable extensibility of the content model while maintaining a degree of control over the occurrence of elements and attributes. A good discussion of the benefits of using wildcards is available in an XML.com article, "W3C XML Schema Design Patterns: Dealing With Change".
Cautious schema authors, concerned with the problems posed by type derivation, may choose to block attempts at type derivation using the final attribute on complex type definitions and element declarations (similar to sealed in C# and final in Java). They may then choose to allow extensibility at specific parts of the content model by using wildcards. This gives schema authors more control over the content models they define and may reduce some of the problems with various aspects of complex type derivation (specifically derivation by extension).
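As a sketch of this approach, the Address type from earlier could be sealed and given a single extension point at the end of its content model. The target namespace URI and the choice of processContents="lax" are assumptions made for illustration.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.example.com/address"
           elementFormDefault="qualified">

  <!-- final="#all" blocks derivation by both extension and restriction -->
  <xs:complexType name="Address" final="#all">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="street" type="xs:string"/>
      <xs:element name="city" type="xs:string"/>
      <!-- the one sanctioned extension point: any number of elements
           from namespaces other than the target namespace -->
      <xs:any namespace="##other" processContents="lax"
              minOccurs="0" maxOccurs="unbounded"/>
    </xs:sequence>
    <!-- likewise for attributes from other namespaces -->
    <xs:anyAttribute namespace="##other" processContents="lax"/>
  </xs:complexType>

</xs:schema>
```

Because the extension content must come from a different namespace and can only appear after city, a query for the city child of an Address cannot pick up a second, unrelated city element the way it could with the BadAddress extension.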
It should be noted that wildcards, if used improperly, can introduce non-determinism that violates the Unique Particle Attribution rule. The following schema causes such a problem.
<?xml version="1.0" encoding="utf-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.example.com/fruit/"
           elementFormDefault="qualified">

  <xs:complexType name="myKitchen">
    <xs:choice maxOccurs="unbounded">
      <xs:any processContents="skip" />
      <xs:element name="apple" type="xs:string"/>
      <xs:element name="cherry" type="xs:string"/>
    </xs:choice>
  </xs:complexType>

</xs:schema>
The content model of the myKitchen type is such that it can contain one or more apple, cherry, or any other element. However, during validation, if an apple element is seen, the schema processor cannot tell whether it should be validated against the wildcard or the apple element declaration.
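One way to remove the ambiguity, sketched here under the assumption that extension content can be required to come from a different namespace, is to limit the wildcard to namespace="##other". Since elementFormDefault is qualified, apple and cherry belong to the target namespace and can no longer match the wildcard.

```xml
<?xml version="1.0" encoding="utf-8" ?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           targetNamespace="http://www.example.com/fruit/"
           elementFormDefault="qualified">

  <xs:complexType name="myKitchen">
    <xs:choice maxOccurs="unbounded">
      <!-- matches only elements outside the target namespace, so it
           never competes with the apple and cherry declarations -->
      <xs:any namespace="##other" processContents="skip" />
      <xs:element name="apple" type="xs:string"/>
      <xs:element name="cherry" type="xs:string"/>
    </xs:choice>
  </xs:complexType>

</xs:schema>
```

The trade-off is that the type no longer accepts arbitrary elements from the fruit namespace itself, nor unqualified elements; only apple, cherry, and elements from other namespaces are allowed.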
There are subtle but potentially profound ramifications to the selection of both the namespace attribute and the processContents attribute. Overly restrictive values can impede extensibility; overly loose values can open the schema up to abuse. Controlling the supported namespaces for a wildcard can also be bewildering, especially when the set of allowable namespaces is subject to change.
Do Not Use Group or Type Redefinition
Redefinition is a feature of WXS that allows you to change the meaning of an included type or group definition. Using xs:redefine, schema authors can include type or group definitions from schema documents and alter these definitions in a pervasive manner. Redefinition is pervasive because it affects type or group definitions not only in the including schema but also in the included schema. Thus all references to the original type or group in both schemas refer to the redefined type, while the original definition is overshadowed. This leads to the problems pointed out in "W3C XML Schema Design Patterns: Dealing With Change":
This causes a certain degree of fragility because redefined types can adversely interact with derived types and generate conflicts. A common conflict is when a derived type uses extension to add an element or attribute to a type's content model, and a redefinition also adds a similarly named element or attribute to the content model
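To make the pervasive effect concrete, here is a sketch of a redefinition. It assumes a hypothetical schema document address.xsd, with no target namespace, that contains the Address, USAddress, and UKAddress types shown earlier; the added country element is likewise only for illustration.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <xs:redefine schemaLocation="address.xsd">
    <!-- a redefined complex type must be derived from the type it replaces -->
    <xs:complexType name="Address">
      <xs:complexContent>
        <xs:extension base="Address">
          <xs:sequence>
            <xs:element name="country" type="xs:string"/>
          </xs:sequence>
        </xs:extension>
      </xs:complexContent>
    </xs:complexType>
  </xs:redefine>

</xs:schema>
```

After this redefinition, USAddress and UKAddress in address.xsd derive from the redefined Address, so their content models also gain the country element, placed before state, zip, and postcode. If one of those derived types had already added its own country element, the two additions would collide in the way the quotation describes.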
A major problem with type redefinition is that, unlike type derivation, it cannot be prevented by using the block or final attributes. Any schema can therefore have its types redefined in a pervasive manner, altering their semantics completely. It is advisable to avoid this feature due to the potential conflicts it can cause.
Many schema authors attempt to use type redefinition to increase the value space of an enumeration but this does not work. The only way to increase the number of values accepted by an enumeration used as a base type is to create a union. However, those additional values are only available to applications of the resulting union type, not for the applications of the original base type. Also note that chained redefinitions (redefining a redefine) can be problematic, resulting in unexpected definition clashes.
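The union approach can be sketched as follows. The type names Fruit and MoreFruit and the enumeration values are hypothetical.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">

  <!-- original enumeration -->
  <xs:simpleType name="Fruit">
    <xs:restriction base="xs:string">
      <xs:enumeration value="apple"/>
      <xs:enumeration value="cherry"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- union of the original enumeration and an anonymous enumeration
       that supplies the additional values -->
  <xs:simpleType name="MoreFruit">
    <xs:union memberTypes="Fruit">
      <xs:simpleType>
        <xs:restriction base="xs:string">
          <xs:enumeration value="mango"/>
        </xs:restriction>
      </xs:simpleType>
    </xs:union>
  </xs:simpleType>

</xs:schema>
```

An element or attribute declared with type MoreFruit accepts apple, cherry, and mango, while anything still declared with type Fruit continues to accept only apple and cherry.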
Conclusion
The WXS recommendation is a complex specification because it attempts to solve complex problems. One can reduce its burdens by utilizing its simpler aspects. Schema authors should ensure that their schemas validate in multiple schema processors. Schemas are an important facilitator of interoperability. It's foolish to depend on the nuances of a specific implementation and inadvertently give up this interoperability.
Acknowledgments
I'd like to thank Priya Lakshminarayanan and Mark Feblowitz for their help with this article.