Managing Enumerations in W3C XML Schemas
February 5, 2003
Introduction
When working with data-oriented XML, there is often a requirement to handle "controlled vocabularies", otherwise known as enumerated values. Consider the following example of a bank account summary:
<accountSummary>
  <timestamp>2003-01-01T12:25:00</timestamp>
  <currency>USD</currency>
  <balance>2703.35</balance>
  <interest rounding="down">27.55</interest>
</accountSummary>
There are two controlled vocabularies in this document. One is the currency, which is an ISO-4217 3-letter currency code ("USD" is US Dollar). The other is the rounding direction for the interest, which can be "up", "down", or "nearest". The bank in this example prefers to round the interest down.
The problem in designing this schema is that the ISO 3-letter currency codes are externally controlled. They can change at any time. If you embed them in your schema, you need to reissue the schema every time ISO makes a change, which can be expensive. This is especially true in enterprise situations where any schema change, no matter how small, can require full retesting of any applications that use the schema. This needs to be avoided whenever possible.
In this article, we will discuss how controlled vocabularies can be managed when using W3C XML Schemas, since this is the dominant XML schema format for data-oriented XML. Note that the "vocabularies" we refer to are enumerated lists of element-attribute values. This differs from other contexts where "vocabularies" are sets of XML element names.
Step 1: Monolithic Schema
Before worrying about which controlled vocabularies are out of our control, the first thing to do is create a schema, using W3C XML Schema, for the account summaries. For the purposes of this article, we will use just a subset of the ISO 3-letter currency codes. A suitable schema is
<xsd:schema xmlns:xsd = "http://www.w3.org/2001/XMLSchema" version = "1.0" elementFormDefault = "qualified">
  <xsd:element name = "accountSummary">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref = "timestamp"/>
        <xsd:element ref = "currency"/>
        <xsd:element ref = "balance"/>
        <xsd:element ref = "interest"/>
      </xsd:sequence>
      <xsd:attribute name = "version" use = "required">
        <xsd:simpleType>
          <xsd:restriction base = "xsd:string">
            <xsd:pattern value = "[1-9]+[0-9]*\.[0-9]+"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:attribute>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name = "timestamp" type = "xsd:dateTime"/>
  <xsd:element name = "currency" type = "iso3currency"/>
  <xsd:element name = "balance" type = "xsd:decimal"/>
  <xsd:element name = "interest">
    <xsd:complexType>
      <xsd:simpleContent>
        <xsd:extension base = "xsd:decimal">
          <xsd:attribute name = "rounding" use = "required" type = "roundingDirection"/>
        </xsd:extension>
      </xsd:simpleContent>
    </xsd:complexType>
  </xsd:element>
  <xsd:simpleType name = "iso3currency">
    <xsd:annotation>
      <xsd:documentation>ISO-4217 3-letter currency codes, as defined at
        http://www.bsi-global.com/Technical+Information/Publications/_Publications/tig90.xalter
        or available from http://www.xe.com/iso4217.htm
        Only a subset are defined here.</xsd:documentation>
    </xsd:annotation>
    <xsd:restriction base = "xsd:string">
      <xsd:enumeration value = "AUD"/><!-- Australian Dollar -->
      <xsd:enumeration value = "BRL"/><!-- Brazilian Real -->
      <xsd:enumeration value = "CAD"/><!-- Canadian Dollar -->
      <xsd:enumeration value = "CNY"/><!-- Chinese Yuan Renminbi -->
      <xsd:enumeration value = "EUR"/><!-- Euro -->
      <xsd:enumeration value = "GBP"/><!-- British Pound -->
      <xsd:enumeration value = "INR"/><!-- Indian Rupee -->
      <xsd:enumeration value = "JPY"/><!-- Japanese Yen -->
      <xsd:enumeration value = "RUR"/><!-- Russian Rouble -->
      <xsd:enumeration value = "USD"/><!-- US Dollar -->
      <xsd:length value = "3"/>
    </xsd:restriction>
  </xsd:simpleType>
  <xsd:simpleType name = "roundingDirection">
    <xsd:annotation>
      <xsd:documentation>Whether the interest is rounded up, down or to the nearest round value.</xsd:documentation>
    </xsd:annotation>
    <xsd:restriction base = "xsd:string">
      <xsd:enumeration value = "up"/>
      <xsd:enumeration value = "down"/>
      <xsd:enumeration value = "nearest"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
Notice the two controlled vocabularies (enumerations), the simple types iso3currency and roundingDirection. For iso3currency, the length of the enumeration strings is explicitly set to 3, to help avoid stupid editing errors in future when the list of currencies needs to be updated.
Note also that the schema's optional version attribute has been set to "1.0". When working with data-oriented XML messages, it is usually necessary to support multiple versions of the message schema concurrently, as the systems that use the message schema will probably not be able to upgrade to the latest version simultaneously. So, it is vital to identify the schema version that an XML message was validated against. In keeping with this, we will name our schema accountSummary-1.0.xsd, so that future versions won't overwrite the current version.
Further, a version attribute has been added to the accountSummary element, so that message instances clearly identify their schema version. It is assumed that the version numbers have the form M.N, where M is the major version number and N is the minor version number. With this change, plus a reference to the schema, the account summary now becomes
<accountSummary
version = "1.0"
xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
xsi:noNamespaceSchemaLocation = "accountSummary-1.0.xsd">
<timestamp>2003-01-01T12:25:00</timestamp>
<currency>USD</currency>
<balance>2703.35</balance>
<interest rounding = "down">27.55</interest>
</accountSummary>
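As a rough illustration of how a receiving application might use the version attribute, here is a minimal Python sketch. It assumes the M.N convention and the file naming scheme described above; the function names schema_file_for and major_minor are hypothetical and not part of the article's example files.

```python
import re
import xml.etree.ElementTree as ET

# Mirrors the xsd:pattern on the version attribute: major.minor
VERSION_PATTERN = re.compile(r"[1-9]+[0-9]*\.[0-9]+")


def major_minor(version):
    """Split an M.N version string into (major, minor) integers."""
    if VERSION_PATTERN.fullmatch(version) is None:
        raise ValueError(f"unrecognized version: {version!r}")
    major, minor = version.split(".")
    return int(major), int(minor)


def schema_file_for(instance_xml):
    """Return the versioned schema file name for a message instance."""
    root = ET.fromstring(instance_xml)
    version = root.get("version", "")
    major_minor(version)  # raises ValueError if the version is malformed
    # Naming scheme assumed from this article: <root element>-<M.N>.xsd
    return f"{root.tag}-{version}.xsd"
```

Given the instance above, schema_file_for would select accountSummary-1.0.xsd, so several schema versions can sit side by side while each message is checked against the one it claims.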
Step 2: Isolate Volatile Controlled Vocabularies
When dealing with controlled vocabularies (enumerations) in schemas, it is a good idea to rate the volatility of each vocabulary. A volatile vocabulary is one which is expected to change independently of the normal release cycle of schema versions. A stable vocabulary is one which is expected to change (if at all) only as new schema versions are released. Volatile vocabularies are a problem if embedded in a schema because they impose extra releases on all dependent applications.
In our example of an account summary, the currency codes are a volatile vocabulary: they are externally controlled by ISO, and currencies can be added or removed by ISO at any time.
On the other hand, the set of rounding directions {"up", "down", "nearest"} is unlikely to change, so it is a stable vocabulary. From the point of view of somebody maintaining an application which deals with account summaries, adding a new rounding direction would mean writing, testing, and deploying a new version of the application. Political pressure would dictate that rounding values would only ever change as part of the planned release cycle of the schema. So it makes sense to leave the roundingDirection simple type embedded in the schema.
However, it is unlikely that an application would need to be recoded just to handle a change in the set of currency codes; if it did, that would be a sign of an inflexible design. As the currency codes are externally controlled, they need to be isolated: we do that by creating a separate vocabulary schema for them. A vocabulary schema is one which contains a single simple type definition with enumerated values and nothing else. The vocabulary schema for the currencies is
<xsd:schema xmlns:xsd = "http://www.w3.org/2001/XMLSchema" version = "1.0" elementFormDefault = "qualified">
  <xsd:simpleType name = "iso3currency">
    <xsd:annotation>
      <xsd:documentation>ISO-4217 3-letter currency codes, as defined at
        http://www.bsi-global.com/Technical+Information/Publications/_Publications/tig90.xalter
        or available from http://www.xe.com/iso4217.htm
        Only a subset are defined here.</xsd:documentation>
    </xsd:annotation>
    <xsd:restriction base = "xsd:string">
      <xsd:enumeration value = "AUD"/><!-- Australian Dollar -->
      <xsd:enumeration value = "BRL"/><!-- Brazilian Real -->
      <xsd:enumeration value = "CAD"/><!-- Canadian Dollar -->
      <xsd:enumeration value = "CNY"/><!-- Chinese Yuan Renminbi -->
      <xsd:enumeration value = "EUR"/><!-- Euro -->
      <xsd:enumeration value = "GBP"/><!-- British Pound -->
      <xsd:enumeration value = "INR"/><!-- Indian Rupee -->
      <xsd:enumeration value = "JPY"/><!-- Japanese Yen -->
      <xsd:enumeration value = "RUR"/><!-- Russian Rouble -->
      <xsd:enumeration value = "USD"/><!-- US Dollar -->
      <xsd:length value = "3"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
and is named iso3currency-1.0.xsd. As you can see, the currency vocabulary now has its own version numbers and, thus, its own release cycle. The vocabulary schema can now be included in the new version (1.1) of the main message schema:
<xsd:schema xmlns:xsd = "http://www.w3.org/2001/XMLSchema" version = "1.1" elementFormDefault = "qualified">
  <xsd:include schemaLocation = "iso3currency-1.0.xsd"/>
  <xsd:element name = "accountSummary">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref = "timestamp"/>
        <xsd:element ref = "currency"/>
        <xsd:element ref = "balance"/>
        <xsd:element ref = "interest"/>
      </xsd:sequence>
      <xsd:attribute name = "version" use = "required">
        <xsd:simpleType>
          <xsd:restriction base = "xsd:string">
            <xsd:pattern value = "[1-9]+[0-9]*\.[0-9]+"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:attribute>
    </xsd:complexType>
  </xsd:element>
  <xsd:element name = "timestamp" type = "xsd:dateTime"/>
  <xsd:element name = "currency" type = "iso3currency"/>
  <xsd:element name = "balance" type = "xsd:decimal"/>
  <xsd:element name = "interest">
    <xsd:complexType>
      <xsd:simpleContent>
        <xsd:extension base = "xsd:decimal">
          <xsd:attribute name = "rounding" use = "required" type = "roundingDirection"/>
        </xsd:extension>
      </xsd:simpleContent>
    </xsd:complexType>
  </xsd:element>
  <xsd:simpleType name = "roundingDirection">
    <xsd:annotation>
      <xsd:documentation>Whether the interest is rounded up, down or to the nearest round value.</xsd:documentation>
    </xsd:annotation>
    <xsd:restriction base = "xsd:string">
      <xsd:enumeration value = "up"/>
      <xsd:enumeration value = "down"/>
      <xsd:enumeration value = "nearest"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>
and this is accountSummary-1.1.xsd according to our naming scheme. Note that the currency codes no longer appear in the main schema.
Step 3: Decouple Controlled Vocabularies
The problem with accountSummary-1.1.xsd is that it directly includes iso3currency-1.0.xsd. When a new version of the ISO currency vocabulary schema is released, you still have to release a new version of the account summary schema. What is needed is a mechanism to decouple the vocabulary schema versions from the main schema versions. The simple solution is to use an unversioned "pass-through" vocabulary schema:
<xsd:schema xmlns:xsd = "http://www.w3.org/2001/XMLSchema"
elementFormDefault = "qualified">
<xsd:include schemaLocation = "iso3currency-1.0.xsd"/>
</xsd:schema>
This unversioned vocabulary schema has no version attribute and is named iso3currency.xsd. To complete the decoupling, a new version of the main schema, accountSummary-1.2.xsd, is released. The only change from version 1.1 is that the <xsd:include> changes from
<xsd:include schemaLocation = "iso3currency-1.0.xsd"/>
to
<xsd:include schemaLocation = "iso3currency.xsd"/>
so that the unversioned currency vocabulary schema is included. The decoupling is now complete. If ISO changes the list of currency codes, a new currency schema is released and iso3currency.xsd is updated so that it includes the new currency schema. The main schema does not need to be changed, since it includes iso3currency.xsd and is agnostic to the version of the currency vocabulary schema.
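Repointing the pass-through schema is a mechanical step, so it can be scripted. The following Python sketch generates the text of the unversioned schema for a given vocabulary version and reads back which file it includes. The function names and the version number "1.1" used in the usage note are illustrative assumptions, not part of the article's example files.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"


def pass_through_schema(vocabulary, version):
    """Build an unversioned schema that includes one versioned vocabulary schema."""
    target = f"{vocabulary}-{version}.xsd"
    return (
        f'<xsd:schema xmlns:xsd = "{XSD_NS}"\n'
        f'elementFormDefault = "qualified">\n'
        f'<xsd:include schemaLocation = "{target}"/>\n'
        f'</xsd:schema>\n'
    )


def included_location(schema_text):
    """Return the schemaLocation of the single xsd:include in a pass-through schema."""
    root = ET.fromstring(schema_text)
    includes = root.findall(f"{{{XSD_NS}}}include")
    if len(includes) != 1:
        raise ValueError("expected exactly one xsd:include")
    return includes[0].get("schemaLocation")
```

For example, writing the result of pass_through_schema("iso3currency", "1.1") to iso3currency.xsd would switch every main schema that includes it to a hypothetical version 1.1 of the currency vocabulary, without touching the main schemas themselves.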
Step 4: Protect Applications
Decoupling vocabulary schemas like this is not without issues. First, as new versions of the currency vocabulary schema are released, existing instance files will become invalid if they contain currency codes which ISO has deleted. In some situations that would be unacceptable, but it makes sense here. If an instance file refers to a currency code that no longer exists, then it has become semantically invalid; it is not unreasonable for it to become syntactically invalid too. The invalid syntax can then be used to detect such instances and route them for special processing, so that the code in the main application can focus on what to do with valid currency codes. Being able to remove error handling from the main application means the main application code remains smaller and easier to maintain.
Second, with the currency codes able to change at any time, there needs to be synchronization between the currency codes in the currency vocabulary schema and the currency codes known to the applications. There are two solutions to this. The first is that applications can use the vocabulary schema as the source of the currency codes. Treating the vocabulary schema as an XML file, a quick SAX parse is all you need to pull out the <xsd:enumeration> elements containing the allowed values. The second solution is to keep the currency codes in a table in a central relational database. Applications can access this table directly, while the vocabulary schema can be dynamically generated from the same table. Either method keeps the set of allowed values synchronized across applications.
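Both approaches can be sketched in a few lines of Python. The example below is only an illustration under stated assumptions: the table name currency, its column code, and the function names are hypothetical, and SQLite stands in for whatever central database is actually used. One function generates a vocabulary schema from the table, and the other uses a namespace-aware SAX parse to pull the allowed values back out of a vocabulary schema.

```python
import io
import sqlite3
import xml.sax
from xml.sax.handler import feature_namespaces
from xml.sax.saxutils import quoteattr

XSD_NS = "http://www.w3.org/2001/XMLSchema"


def schema_from_table(conn, type_name="iso3currency"):
    """Generate a vocabulary schema from a (hypothetical) currency table."""
    rows = conn.execute("SELECT code FROM currency ORDER BY code").fetchall()
    enums = "\n".join(
        f"      <xsd:enumeration value={quoteattr(code)}/>" for (code,) in rows
    )
    return (
        f'<xsd:schema xmlns:xsd="{XSD_NS}" version="1.0">\n'
        f"  <xsd:simpleType name={quoteattr(type_name)}>\n"
        f'    <xsd:restriction base="xsd:string">\n'
        f"{enums}\n"
        f"    </xsd:restriction>\n"
        f"  </xsd:simpleType>\n"
        f"</xsd:schema>\n"
    )


class EnumerationHandler(xml.sax.ContentHandler):
    """Collects the value attribute of every xsd:enumeration element."""

    def __init__(self):
        super().__init__()
        self.values = []

    def startElementNS(self, name, qname, attrs):
        if name == (XSD_NS, "enumeration"):
            # Unqualified attributes are keyed by (None, local name)
            self.values.append(attrs.get((None, "value")))


def enumeration_values(schema_text):
    """SAX-parse a vocabulary schema and return its enumerated values in order."""
    handler = EnumerationHandler()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(schema_text.encode("utf-8")))
    return handler.values
```

An application would call enumeration_values on the current vocabulary schema at startup (or whenever the schema is republished), while the publishing side would regenerate the schema with schema_from_table after each change to the table.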
Third, using such vocabulary schemas is only workable if applications can rely on them changing in one of two ways only: either an enumerated value is added or one is deleted.
Vocabulary schemas must never change structurally. If a new simple type, complex type, or element definition were added to a vocabulary schema, it could change the results of validating an instance with the main schema and cause a major application failure. So vocabulary schemas need to be "validated" to ensure that they contain just a single simple type definition with enumerated values. This is exactly the situation Will Provost described in "Working with a Metaschema".
An obvious solution would be to write a schema for vocabulary schemas as the metaschema. In practice I don't do this. The existing "Schema for Schemas" is known not to be 100% correct in describing the W3C XML Schema syntax, and so schema editing tools use it as an indicative, rather than normative, guide. This means that schema editors tend to ignore any attempt to impose a metaschema on a schema. For this reason, and because the vocabulary schema format is quite simple, I use the following Schematron schema:
<sch:schema xmlns = "http://www.w3.org/2001/XMLSchema"
    xmlns:sch = "http://www.ascc.net/xml/schematron"
    xmlns:xsd = "http://www.w3.org/2001/XMLSchema"
    xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation = "http://www.ascc.net/xml/schematron schematron-1.5.xsd">
  <sch:title>Controlled vocabulary validation</sch:title>
  <!-- The input is assumed to be a valid W3C XML Schema. -->
  <!-- This just checks that it is also a valid -->
  <!-- vocabulary Schema. -->
  <sch:pattern name = "controlled-vocabulary-schema">
    <sch:rule context = "schema">
      <sch:assert test = "count(*) = count(simpleType[@name])"
        >The schema must contain only a single simple type definition.</sch:assert>
      <sch:assert test = "count(simpleType[@name]) = 1"
        >The schema must contain a single simpleType definition or a single include.</sch:assert>
    </sch:rule>
    <sch:rule context = "simpleType">
      <sch:assert test = "@name"
        >The simpleType must have a name.</sch:assert>
      <sch:assert test = "count(restriction) = 1"
        >The simpleType must contain a single restriction.</sch:assert>
      <sch:assert test = "count(*) = count(annotation)+count(restriction)"
        >The simpleType may have an annotation as well as its restriction, but no other structure.</sch:assert>
    </sch:rule>
    <sch:rule context = "restriction">
      <sch:assert test = "enumeration"
        >A restriction must contain enumerated values.</sch:assert>
    </sch:rule>
    <sch:rule context = "enumeration">
      <sch:key name = "enumerationsByValue" path = "@value"/>
      <sch:assert test = "count(key('enumerationsByValue', @value)) = 1"
        >An enumerated value must be unique.</sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
Under Windows, you can validate a vocabulary schema against this Schematron schema using the free validator from Topologi. For other platforms, see the list of tools in the Schematron Resource Directory. Chimezie Ogbuji introduced Schematron in "Validating XML with Schematron".
Schematron assertions are expressed using XPath expressions which must evaluate to true. If they evaluate to false, a Schematron validation error is generated. In our Schematron schema, note the following:
- Look at the rule for the schema context. It contains the assertions that are applied to the <xsd:schema> element in the vocabulary schema. The first assertion checks that the only things in the schema are <xsd:simpleType> definitions. The second assertion checks that there is only one <xsd:simpleType> definition.
- The rule for the simpleType context asserts that the <xsd:simpleType> must have a name attribute, that the <xsd:simpleType> may contain an <xsd:annotation> and must contain an <xsd:restriction>, but cannot contain any other elements.
- The rule for the restriction context asserts that the <xsd:restriction> must contain one or more enumerated values.
- The rule for the enumeration context asserts that the enumeration values must be unique. This is checked using a Schematron key (equivalent to an XSLT key). The expression key('enumerationsByValue', @value) returns a list of the <xsd:enumeration> elements with the same value as the element being validated. If the values are unique, there will always be just one <xsd:enumeration> element in the list, the one being validated.
Conclusion
WXS schemas can be made more manageable by separating volatile controlled vocabularies (enumerations) into their own vocabulary schemas. In this article, we have seen how to identify volatile controlled vocabularies, how to separate them from the main schema, how to decouple the versions, and how to validate vocabulary schemas. There is no absolute rule for when a controlled vocabulary should have its own schema. Use the guidelines here, but always use your own judgment and your knowledge of your problem domain.
Resources
- The example files from this article are available as a ZIP archive (9K).