The W3C XML Schema Specification in Context
January 10, 2001
This article gives simple comparisons between the W3C XML Schemas and
- W3C XML instances
- W3C XML DTDs
- ISO SGML DTDs
- ISO SGML meta-DTDs
- Perl regular expressions
And some technologies that have arisen as a response to it:
- JIS RELAX
- Schematron
- DSD
It does not provide an exhaustive list of all W3C XML Schemas features. The information was prepared with the October Candidate Recommendation versions in mind.
W3C XML Schemas operates on the Information Set of a Document
W3C XML Schemas does not operate on marked-up instances per se, but on the information set of a document after it has been parsed, after any entity expansion and attribute value defaulting have occurred. Think of it as if it were a process looking at the W3C DOM API. The result of schema-validating a document is
- a set of outcomes giving, in particular, any violations of constraints -- there is currently no standard API for this; however the W3C XML Schemas specification gives a complete list of the constraint violations;
- an enhanced information set, the post-schema-validation information set, which can include various details about type and facets -- there is currently no standard API for this either; however, the W3C XML Schemas specification gives a complete list of the additional information.
XML Instance Markup | XML Schema | Comments |
---|---|---|
Element | W3C XML Schemas can constrain which elements are allowed in a particular context. | Actually, W3C XML Schemas constrain the types allowed in a particular complex type. W3C XML Schemas provides many ways in which this can be done, and one of the most useful acts through specifying which elements an element can contain. |
Attribute | W3C XML Schemas can constrain which attributes are allowed in a particular context. |  |
xml:lang Global Attribute | No specific support. | A special datatype for language is available. However, there is no facility for global declarations. The schema for schemas gives an example of how to include the declarations for the xml: namespace. |
xml:space Global Attribute | No specific support. | There is no facility for global declarations. The primitive datatype "string" has a facet "whitespace" that can be used to set various stripping and folding behaviors. The schema for schemas gives an example of how to include the declarations for the xml: namespace. |
Attribute Value | W3C XML Schemas can constrain the attribute values allowed in a particular context. The context is based on the current type and its ancestor types. |  |
Data Content | W3C XML Schemas can constrain the data content allowed in a particular context. The context is based on the current type and its ancestor types. |  |
CDATA Sections | No specific support. | This is a parser function. |
Comments | No support. |  |
Processing Instructions | No support. |  |
Entity References | No support. |  |
Character Reference | No support. | This is a parser function. |
XML Header | No effect. | The standalone and encoding declarations are not part of the core information set of an entity; consequently there is no support for constraining them in W3C XML Schemas. |
Namespace Declarations | No support. | W3C XML Schemas is highly aware of namespaces. It does not support altering namespace information of an instance. |
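As a minimal sketch of the xml:lang and xml:space rows above: the following uses the syntax of the final 2001 Recommendation, which differs in places from the October Candidate Recommendation discussed here. The schema imports the xml: namespace and then refers to its attributes in each complex type that permits them. The target namespace, the `para` element, and the `schemaLocation` value are illustrative assumptions.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            targetNamespace="http://example.org/doc"
            xmlns="http://example.org/doc"
            elementFormDefault="qualified">

  <!-- Pull in declarations for the xml: namespace. The location is only a
       hint and is assumed here to point at the W3C-published schema. -->
  <xsd:import namespace="http://www.w3.org/XML/1998/namespace"
              schemaLocation="http://www.w3.org/2001/xml.xsd"/>

  <xsd:element name="para">
    <xsd:complexType mixed="true">
      <!-- There is no global declaration facility: each type that allows
           these attributes has to reference them itself. -->
      <xsd:attribute ref="xml:lang"/>
      <xsd:attribute ref="xml:space"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

The `xml` prefix is bound implicitly, so it needs no namespace declaration of its own.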
W3C XML Schemas and W3C XML Markup Declarations (DTDs)
W3C XML Markup Declarations (DTDs) are geared to provide simple datatyping on attributes, sufficient to support graph-structures in the document only. W3C XML Schemas are intended to provide a systematic datatyping capability. W3C XML DTDs provide a basic macro facility, parameter entities, with which many good effects can be achieved. W3C XML Schemas reconstructs the most common of these in various high-level features.
W3C XML Markup Declarations | W3C XML Schema | Comments |
---|---|---|
DOCTYPE Declaration | No equivalent header declaration. A W3C XML Schema has no conception of "document"; it operates on elements and attributes which may be in namespaces. | A W3C XML Schema cannot specify which element must be the top-level element of an instance. The attribute schemaLocation can be used on elements in instances to name the location of a retrievable schema for that element associated with that namespace. |
Internal and External Subset | No equivalent header declaration. However, a schema can be placed within the document that uses it. Schemas can <include>, <import> and <redefine> other schemas, which may be external. There is no mechanism for overriding a declaration in the same schema; however, redefine can be used to restrict or extend declarations from other schemas. | A schema for a single namespace can be composed of several distributed schema documents. Furthermore, an instance may require reference to multiple schemas as it uses elements from different namespaces. |
ELEMENT Declaration | An <element> declaration creates a binding between a (namespaced) name and its attributes, content models and annotations. | The big difference between W3C XML DTDs and W3C XML Schemas is the so-called tag/type distinction. This means that there is not a one-to-one correspondence between an element name and its type: for example, it is possible for one complex type to define local types of elements which have precedence over global declarations of an element with the same name. This kind of scoping mechanism is not available in DTDs. |
#PCDATA Declared Content Type | Supported by defining a complex type with mixed="true" and no allowed elements. | The simple type string can also be used, but caution should be exercised as this may cause problems if you ever need to extend the type to allow subelements. |
ANY Declared Content Type | Supported as <any>. | <any> has different wildcards to support a richer range of possibilities. Note that <anyAttribute> is also available to allow wildcards on the possible attributes. Whether the subelements found are validated or not depends on whether the contents are assessed strictly, loosely, or skipped. |
EMPTY Declared Content Type | Supported by declaring a complex type with mixed="false" and no allowed subelements. | Note that W3C XML Schemas supports an explicit null as distinct from empty strings. An element (not a type) can be declared null, and a boolean attribute xsi:null on the instance specifies the null value. An implied attribute with no default can be taken as having a null value. There is no way provided to make this usage explicit. |
Content Model | Supported as <complexType>. | W3C XML Schemas keeps the W3C XML Markup Declarations requirement for unambiguous content models. Note that W3C XML Schemas maintains XML's model of mixed content, either allowing character data anywhere inside an element or nowhere. |
, (Sequence Connector) | Supported. Sequence compositor is the <sequence> grouping element. |  |
\| (Alternative Connector) | Supported. Disjunction compositor is the <choice> grouping element. |  |
? (Optional) | Supported, through maxOccurs and minOccurs attributes on elements, wildcards, and groups. |  |
+ (Required and Repeatable) | Supported, through maxOccurs and minOccurs attributes on elements, wildcards, and groups. |  |
* (Optional and Repeatable) | Supported, through maxOccurs and minOccurs attributes on elements, wildcards, and groups. |  |
( ) (Groups) | Supported by the <group> grouping element. |  |
ATTLIST Declaration | <attribute> declarations can be grouped into <attributeGroup> declarations. |  |
Multiple ATTLIST Declarations | Not supported; however, not all attributes need to be defined in the same complex type declaration. | All attribute declarations for a complex type are declared in one place. However, not all of them need to be defined there: they may be defined in an attribute group and declared by reference, or they may belong to the base type. |
CDATA Attribute Type | Supported as a built-in simple type "CDATA". | Lexical constraints can be specified using regular expressions in the pattern attribute. |
ID Attribute Type | Supported as a built-in simple type. | Lexical constraints on these names can be specified using regular expressions in the pattern attribute. W3C XML Schemas extends the capabilities of ID in the <unique> element, which allows scoping of uniqueness and multipart IDs based on W3C XPaths. |
IDREF IDREFS Attribute Types | Supported as built-in simple types. | Lexical constraints on these names can be specified using regular expressions in the pattern attribute. W3C XML Schemas extends the capabilities of IDREF with the <key> and <keyref> elements, which allow scoping of references and multipart keys based on W3C XPaths. |
NOTATION Attribute Type | Supported as a built-in simple type. | Lexical constraints on these names can be specified using regular expressions in the pattern attribute. |
NMTOKEN NMTOKENS Attribute Types | Supported as built-in simple types. | Lexical constraints on these names can be specified using regular expressions in the pattern attribute. |
ENTITY ENTITIES Attribute Types | Supported as built-in simple types. | Lexical constraints on these names can be specified using regular expressions in the pattern attribute. This refers to entity references as a datatype, not to entity declarations. |
Enumerations | Supported. | Available on elements as well as attributes. |
Attribute Defaults | Supported through the attribute value, a string, and with the attribute use="default". | Available on elements as well as attributes. The syntax is different: use an attribute default of type string. |
#FIXED Attributes | Supported through the attribute value, a string, and with the attribute use="fixed". | Available on elements as well as attributes. The syntax is different: use an attribute fixed of type string. |
#REQUIRED and #IMPLIED | Supported through the attribute use, with values "prohibited," "optional," or "required". |  |
ENTITY Declaration | Not supported. | Entities are declared in W3C XML markup declarations (DTDs). |
ENTITY % Parameter Entity Declaration | Not supported. Functionality reconstructed with higher-level constructs. | Parameter entities provide a low-level mechanism useful for many different purposes. W3C XML Schemas has tried to provide first-class support for some of the most important, such as the separation of <element> and <complexType>. General entities can also be used to provide some of the other rarer uses of parameter entities. XML Schemas 1.0 does not attempt to systematically reconstruct every possible use of parameter entities. |
IGNORE/INCLUDE Marked Sections | Not supported. | W3C XML Schemas CR does not provide any mechanism equivalent to IGNORE/INCLUDE marked sections. Consequently, if such functionality is required as part of the markup, W3C XML DTDs should be used. However, this functionality can be achieved in other ways: for example, by a schema management application. |
NOTATION Declaration | Supported. | W3C XML Schemas does not introduce any mechanism for datatyping mixed content data. The underspecified NOTATION mechanism from W3C XML DTDs is supported. |
Comments in DTDs | The <documentation> subelement of the <annotation> element provides this functionality. (Comments can still be used.) | <documentation> elements are available to users of the Schema. Comments are not part of the core information set of a document and may not be available or in a useful form. |
PIs in DTDs | The <appinfo> subelement of the <annotation> element provides this functionality. (PIs can still be used.) | <appinfo> elements are available to users of the Schema. PIs require knowledge of their notation to parse correctly. Extensions to the XML Schema can be made using <appinfo>. An extension will not change the schema-validity of the document. |
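To make the mapping in the table above concrete, here is a small sketch that recasts a hypothetical DTD fragment as a schema. It uses the syntax of the final 2001 Recommendation, so defaults and fixed values appear as `default` and `fixed` attributes rather than the Candidate Recommendation's `value` plus `use`. All element and attribute names are invented for illustration.

```xml
<?xml version="1.0"?>
<!-- DTD being recast (hypothetical vocabulary):
     <!ELEMENT memo  (title, (para | list)+, note?)>
     <!ELEMENT title (#PCDATA)>
     <!ELEMENT para  (#PCDATA)>
     <!ELEMENT list  (item+)>
     <!ELEMENT item  (#PCDATA)>
     <!ELEMENT note  EMPTY>
     <!ATTLIST memo  status  (draft | final) "draft"
                     version CDATA #FIXED "1.0"
                     id      ID    #REQUIRED>
-->
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="memo">
    <xsd:complexType>
      <xsd:sequence>                               <!-- , connector -->
        <xsd:element ref="title"/>
        <xsd:choice maxOccurs="unbounded">         <!-- ( | )+ -->
          <xsd:element ref="para"/>
          <xsd:element ref="list"/>
        </xsd:choice>
        <xsd:element ref="note" minOccurs="0"/>    <!-- ? -->
      </xsd:sequence>
      <!-- Enumerated attribute with a default value -->
      <xsd:attribute name="status" default="draft">
        <xsd:simpleType>
          <xsd:restriction base="xsd:NMTOKEN">
            <xsd:enumeration value="draft"/>
            <xsd:enumeration value="final"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:attribute>
      <xsd:attribute name="version" type="xsd:string" fixed="1.0"/>
      <xsd:attribute name="id" type="xsd:ID" use="required"/>
    </xsd:complexType>
  </xsd:element>

  <!-- #PCDATA as a simple type: compact, but awkward to extend later -->
  <xsd:element name="title" type="xsd:string"/>
  <xsd:element name="item" type="xsd:string"/>

  <!-- #PCDATA as mixed content with no allowed subelements -->
  <xsd:element name="para">
    <xsd:complexType mixed="true"/>
  </xsd:element>

  <xsd:element name="list">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref="item" maxOccurs="unbounded"/>  <!-- + -->
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <!-- EMPTY: a complex type with no content model -->
  <xsd:element name="note">
    <xsd:complexType/>
  </xsd:element>
</xsd:schema>
```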
Role of W3C XML Markup Declarations (DTDs) in the Immediate Future
W3C XML Markup Declarations (DTDs) are not superseded by W3C XML Schemas 1.0, and there is no general way to specify that DTD processing should not occur, nor any way to verify that it has not. So Markup Declarations will continue to be useful for non-schema related tasks in the near future, in particular as a simple and terse syntax for removing document constants to headers, which isn't really a task related to data or structural type specification:
- Entity declarations;
- Namespace declarations;
- Global variable defaults, particularly for xml:space;
- Attribute defaulting.
The terseness of DTDs and their widespread deployment in XML processors make them a suitable notation for simple client-side validation; a W3C XML Schema may be transformed into the closest approximating W3C XML DTD. However, given that a W3C XML Schema does not have a one-to-one correspondence between element name and content model, the closest approximating DTD may be less strict than what is needed or desired.
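The loss of strictness can be seen in a small sketch. In the schema below (final 2001 Recommendation syntax; the `book`, `person`, `title`, and `em` names are hypothetical), the local element `title` has a different type in each context. A DTD has a single declaration per element name, so the approximating DTD shown in the trailing comment has to accept the union of both content models.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="book">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Local declaration: inside book, title is mixed text with em -->
        <xsd:element name="title">
          <xsd:complexType mixed="true">
            <xsd:sequence>
              <xsd:element name="em" type="xsd:string"
                           minOccurs="0" maxOccurs="unbounded"/>
            </xsd:sequence>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name="person">
    <xsd:complexType>
      <xsd:sequence>
        <!-- Local declaration: inside person, title is plain text only -->
        <xsd:element name="title" type="xsd:string"/>
      </xsd:sequence>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
<!-- Closest approximating DTD: one declaration of title must serve both
     contexts, so em is now wrongly permitted inside person/title.
     <!ELEMENT book   (title)>
     <!ELEMENT person (title)>
     <!ELEMENT title  (#PCDATA | em)*>
     <!ELEMENT em     (#PCDATA)>
-->
```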
W3C XML Schema and ISO SGML Markup Declarations (SGML DTDs)
ISO SGML (IS 8879:1986 as amended 1997) provides features and capabilities additional to those of W3C XML 1.0. ISO SGML allows the specification of many different kinds of grammars: different levels of tag and delimiter omission, contextual delimiter recognition, and richer support for modeling a document as an asynchronous tree of elements and tree of entities, each of which can have local links and other attributes. Consequently, an ISO SGML DTD is really a grammar specification rather than a data schema per se, though in practice such a regular grammar contains enough structural definition to make it useful for many kinds of data modeling.
ISO SGML Markup Declarations | W3C XML Schemas | Comments |
---|---|---|
CDATA Declared Content Type | Not supported by XML. | This is a parser function. |
RCDATA Declared Content Type | Not supported by XML. | This is a parser function. |
ANY Declared Content Type | Supported by the <any/> particle. The urType (the top-most type from which all other types are derived) -- called "anyType" -- allows any subelements and any attributes. | Various wildcards allow ANY to be restricted to certain namespaces. Also, substitution groups allow a content model to name the position of a particle but to allow the name and complex type to be specified elsewhere. |
& Connector | Conjunction. The <all> element allows this functionality at the top-level of an element only. | In an <all> group, the elements have a maxOccurs of 1. |
ISO SGML Content Models | No equivalent of allowing #PCDATA as a particular particle in content models. | W3C XML Schemas keeps the ISO SGML requirement for unambiguous content models. The big difference between ISO SGML DTDs and W3C XML Schemas is the tag/type distinction. |
Global Inclusion Exceptions | Not supported directly. A single-level inclusion can be made by using type refinement on the complex types elsewhere. | However, the effect of a global inclusion can be achieved by deriving restricted types for each complex type possible underneath an element to any level. This may double the number of declarations for each inclusion. |
Global Exclusion Exceptions | Not supported directly. A single-level exclusion can be made using type refinement. | However, the effect of a global exclusion can be achieved by deriving restricted types for each complex type possible underneath an element to any level. This may double the number of declarations for each exception. |
NUTOKEN NUTOKENS Attribute Types | Can be supported using regular expressions, subclassing simple type "token". NUTOKENS can be supported by deriving a list type. |  |
NAME, NAMES Attribute Types | NAME is supported by the simple type "Name". NAMES can be supported by deriving a list type. | To get a type closer to the ISO SGML Reference Concrete Syntax defaults, derive a type from NCName (which allows no colons) and restrict the type further to characters less than 0xFF using regular expressions and the pattern attribute. |
ENTITY Types (e.g., SDATA, CDATA) and Data Attributes (Attributes on Entities) | Not supported. |  |
#CONREF Attribute Keyword | Not supported. | No support for keying occurrence from the value of an attribute. |
SUBDOC | Not supported. | However, the key mechanism allows a scoping of IDs and IDREFs; and both the namespace mechanism and the tag/type mechanism allow element names to refer to different types in different contexts. In a sense, each new namespace encountered is a SUBDOC, as they all will have a separate schema. |
LINK Attribute Groups | Not supported. | However, a similar effect can be gained using type refinement, so that different default and fixed attribute values are added in different contexts. |
Data Attributes on Elements | Not supported. |  |
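Two rows of the table above lend themselves to a short sketch: the & connector expressed with `<all>`, and NUTOKEN and NUTOKENS rebuilt from `token` with a pattern and a derived list type. The syntax is that of the final 2001 Recommendation. The type names, the `address` element, and the exact character set in the pattern (letters, digits, full stop and hyphen, as in the Reference Concrete Syntax) are assumptions made for illustration.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- NUTOKEN approximated: a token that must begin with a digit -->
  <xsd:simpleType name="nutoken">
    <xsd:restriction base="xsd:token">
      <xsd:pattern value="[0-9][A-Za-z0-9.\-]*"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- NUTOKENS: a whitespace-separated list of nutoken values -->
  <xsd:simpleType name="nutokens">
    <xsd:list itemType="nutoken"/>
  </xsd:simpleType>

  <!-- SGML (street & city & postcode?): children may appear in any order,
       each at most once, and only at the top level of the content model -->
  <xsd:element name="address">
    <xsd:complexType>
      <xsd:all>
        <xsd:element name="street" type="xsd:string"/>
        <xsd:element name="city" type="xsd:string"/>
        <xsd:element name="postcode" type="xsd:string" minOccurs="0"/>
      </xsd:all>
      <xsd:attribute name="units" type="nutokens"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```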
W3C XML Schema and ISO SGML Extended Facilities (Meta-DTDs and Lexical Types)
A W3C XML Schema is a high-level specification of an architecture. W3C XML Schemas could be implemented as
- a transformation on the document to add xsi:type attributes, based on the type derivation mechanism;
- a transformation on the schema to derive an effective schema, expressed according to the ISO HyTime Architectural Forms Definition Requirements;
- an architectural parse of the document using the effective schema as a meta-DTD and the xsi:type attribute as the element form.
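As a rough illustration of the first step, this is what an instance might look like once every element carries an explicit xsi:type attribute naming its type. The element names and the `PurchaseOrderType` and `USAddress` type names are hypothetical, and the namespace URIs are those of the final 2001 Recommendation rather than the Candidate Recommendation.

```xml
<?xml version="1.0"?>
<purchaseOrder xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xmlns:xsd="http://www.w3.org/2001/XMLSchema"
               xsi:type="PurchaseOrderType">
  <!-- USAddress is assumed to be derived from a more general address type;
       an architectural processor would read the xsi:type value as the
       element form to validate against. -->
  <shipTo xsi:type="USAddress">
    <street xsi:type="xsd:string">123 Maple Street</street>
    <zip xsi:type="xsd:decimal">90952</zip>
  </shipTo>
</purchaseOrder>
```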
It has not been proven yet that all W3C XML Schema constraints can be expressed using meta-DTDs and the other standard features of the ISO SGML Extended Facilities (given in the Annexes to the ISO HyTime standard). Consequently, an architectural validation system using meta-DTDs in ISO SGML markup declaration syntax may not completely validate every W3C XML Schema instance. In particular, the use of namespaces complicates understanding of the transformations required. Certainly it is not true that every schema definable using Architectural Forms has an equivalent W3C XML Schema: attribute renaming cannot be performed, for example. The tag/type distinction is the same as the element-form/architecture distinction: an abstract element type is a "base" (architectural) element.
W3C XML Schemas provides similar lexical capabilities to the ISO SGML Extended Facilities Lexical Definition Requirements, using a non-standard regular expression syntax.
W3C XML Schema and Perl Regular Expressions
Perl Regular Expressions | W3C XML Schema Regular Expressions | Comments |
---|---|---|
^ = beginning of string | ^ = character ^ only | All regular expression matches start from the beginning of the string. For substring matching use .*substring.* |
$ = end of string | $ = character $ only | All regular expression matches end at the end of the string. |
Zero-width assertions, look-ahead and look-behind, back references | Not available |  |
Non-greedy + and * | Not available |  |
\c | An XML NAME character |  |
\i | An XML initial NAME (i.e., SGML NAMESTRT) character |  |
\033 and \xAB | XML Numeric Character Reference must be used |  |
\p{} | \p{} | The character classes allowed are the Unicode Consortium's character classes. |
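The differences in the table can be sketched with a few pattern facets, written in final 2001 Recommendation syntax. The type names are hypothetical; the point is only that each pattern must match the whole value, so no anchors are written.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <!-- Implicitly anchored at both ends.
       The Perl equivalent would be /^[A-Z]{2}[0-9]{4}$/ -->
  <xsd:simpleType name="partCode">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="[A-Z]{2}[0-9]{4}"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- Substring matching needs an explicit .* on both sides -->
  <xsd:simpleType name="mentionsXML">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value=".*XML.*"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- \i is an initial name character, \c is a name character -->
  <xsd:simpleType name="xmlName">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="\i\c*"/>
    </xsd:restriction>
  </xsd:simpleType>

  <!-- \p{Lu} is the Unicode uppercase-letter category. The guillemets are
       written as XML numeric character references where Perl would use
       \xAB and \xBB. -->
  <xsd:simpleType name="quotedCaps">
    <xsd:restriction base="xsd:string">
      <xsd:pattern value="&#xAB;\p{Lu}+&#xBB;"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
```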
Schema Languages influenced by W3C XML Schemas
JIS RELAX and Schematron are schema languages influenced in various degrees by W3C XML Schema. Both are created by W3C XML Schema WG members and may be seen as "minority reports" espousing alternate features or approaches to W3C XML Schemas. Both have adopted suggestions made during the course of the development of W3C XML Schemas that did not make the final cut. However, in public material the authors of both have stressed that the design differences largely flow from having a different answer to the question, "what problem should Schemas solve?" In particular, the view that a schema language should not make information set contributions is shared: a document should not require schema processing to have a complete information set.
DSD takes an opposite approach, paying much attention to various defaulting issues, including defaults which make information set contributions.
Interested readers may also enjoy the paper "Comparative Analysis of Six Schema Languages", Lee and Chu, ACM SIGMOD Record 29(3), September 2000. The comments on the various schema languages refer to earlier drafts, so individual comments may be out-of-date.
JIS RELAX
JIS RELAX is based on supporting document editing and schema modularization: it treats the schema as a grammar definition. This can be contrasted with XML Schemas, which treats the schema as a type definition system.
A subset called RELAX Core is fairly convertible to W3C XML Schemas and to DTDs. RELAX adopts W3C XML Schema's datatypes, annotation elements, and uses the same naming conventions as far as possible. It does not provide type derivation. Some features of W3C XML Schemas are not supported.
A second level, RELAX Namespace, adds some others, particularly in the areas of modularization and schema combination.
Not every full RELAX Schema can be converted fully into a W3C XML Schema: modularization information will be lost, as will the selection of type based on all the information present in a start-tag (an excellent design feature which DSD and Schematron share, following the lead in Dave Raggett's Assertion Grammars). However, useful conversion is certainly possible.
Refer: http://www.xml.gr.jp/relax/ especially the FAQ. JIS RELAX is mooted for release as an ISO Technical Report.
Schematron
My Schematron schema language is based on making assertions about the presence or absence of patterns in the document object tree. Paths and expressions use the version of W3C XPath paths and expressions available in W3C XSLT. Schematron schemas in particular allow the validation of co-occurrence constraints in a document where the presence or absence or value of some element or attribute in some document impacts the presence or absence or value of another element or attribute, possibly in another document.
Schematron relates to XML Schemas in two ways: first, a W3C XML Schema may be partially but usefully transformed into a Schematron schema, though this may be quite complex to achieve; second, a Schematron schema can be embedded into a W3C XML Schema in an <appinfo> element, providing an extension to express co-occurrence constraints. Schematron demonstrates that path expressions are a useful tool in the vocabulary of schema language designers, perhaps as useful as grammars, though with different modeling capabilities.
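The embedding can be sketched as follows. This assumes the Schematron 1.5 namespace URI and its `pattern`, `rule`, and `assert` elements, together with final 2001 Recommendation syntax for the surrounding schema. The `loan` vocabulary and the threshold are invented for illustration.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema"
            xmlns:sch="http://www.ascc.net/xml/schematron">

  <xsd:element name="loan">
    <xsd:annotation>
      <xsd:appinfo>
        <!-- Co-occurrence rule that the content model alone cannot state:
             a large amount makes the optional guarantor mandatory -->
        <sch:pattern name="Guarantor required for large loans">
          <sch:rule context="loan">
            <sch:assert test="not(@amount &gt; 10000) or guarantor">
              A loan above 10000 must name a guarantor.
            </sch:assert>
          </sch:rule>
        </sch:pattern>
      </xsd:appinfo>
    </xsd:annotation>
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="guarantor" type="xsd:string" minOccurs="0"/>
      </xsd:sequence>
      <xsd:attribute name="amount" type="xsd:decimal" use="required"/>
    </xsd:complexType>
  </xsd:element>
</xsd:schema>
```

A schema processor ignores the content of `<appinfo>`, so the embedded rules have to be extracted and run by a separate Schematron step. The schema-validity of the document is unaffected either way.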
A Schematron implementation may take only a few hundred lines (given the availability of a W3C XSL library) while a W3C XML Schema implementation may take tens of thousands of lines.
Refer: http://www.ascc.net/xml/resource/schematron/schematron.html and, for an interview with me, see http://www.xmlhack.com/read.php?item=121
Document Structure Descriptions
DSD is a grammar-based approach based on transplanting some of the useful mechanisms of CSS into the schema world. It does not handle namespaces yet.
DSD allows a kind of simple path expressions for getting context-dependent validation, inspired by W3C CSS selectors. In this it provides more than W3C XML Schema but much less than Schematron, which is not constrained to the ancestor tree. In W3C XML Schemas, a type can be selected based on the element type name and in the context of a particular type (e.g., by the element's name and the parent's type).
DSD also follows W3C CSS by allowing gradual declaration of allowed contents and attributes, as part of a comprehensive defaulting strategy.
Refer: http://www.brics.dk/DSD/ but note that some of the statements about XML Schemas are now out-of-date. W3C XML Schemas has edged unsystematically closer to DSD:
- W3C XML Schemas do allow the parent type to determine the type of a child element, not only the name of an element; an element type is like a non-terminal;
- W3C XML Schemas do allow the type of the target of a reference to be constrained, using the identity constraint mechanism which utilizes XPaths;
- W3C XML Schemas do allow the piecemeal declaration of types by the substitution group mechanism.
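The second point can be illustrated with a minimal identity-constraint sketch in final 2001 Recommendation syntax. The `catalog`, `part`, and `orderLine` names are hypothetical. Unlike an untyped IDREF, the reference here can only resolve to a `part` element.

```xml
<?xml version="1.0"?>
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">

  <xsd:element name="catalog">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element name="part" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:attribute name="num" type="xsd:string" use="required"/>
          </xsd:complexType>
        </xsd:element>
        <xsd:element name="orderLine" minOccurs="0" maxOccurs="unbounded">
          <xsd:complexType>
            <xsd:attribute name="partRef" type="xsd:string" use="required"/>
          </xsd:complexType>
        </xsd:element>
      </xsd:sequence>
    </xsd:complexType>

    <!-- Every part/@num must be present and unique within this catalog -->
    <xsd:key name="partKey">
      <xsd:selector xpath="part"/>
      <xsd:field xpath="@num"/>
    </xsd:key>

    <!-- Each orderLine/@partRef must equal the @num of some part -->
    <xsd:keyref name="partRefToPart" refer="partKey">
      <xsd:selector xpath="orderLine"/>
      <xsd:field xpath="@partRef"/>
    </xsd:keyref>
  </xsd:element>
</xsd:schema>
```

Because the constraints are declared on `catalog`, both uniqueness and reference resolution are scoped to each `catalog` element rather than to the whole document.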