Managing Enumerations in W3C XML Schemas

February 5, 2003

Introduction

When working with data-oriented XML, there is often a requirement to handle "controlled vocabularies", otherwise known as enumerated values. Consider the following example of a bank account summary:

<accountSummary>
  <timestamp>2003-01-01T12:25:00</timestamp>
  <currency>USD</currency>
  <balance>2703.35</balance>
  <interest rounding="down">27.55</interest>
</accountSummary>

There are two controlled vocabularies in this document. One is the currency, which is an ISO-4217 3-letter currency code ("USD" is US Dollar). The other is the rounding direction for the interest, which can be "up", "down", or "nearest". The bank in this example prefers to round the interest down.

The problem in designing this schema is that the ISO 3-letter currency codes are externally controlled. They can change at any time. If you embed them in your schema, you need to reissue the schema every time ISO makes a change, which can be expensive. This is especially true in enterprise situations where any schema change, no matter how small, can require full retesting of any applications that use the schema. Such forced reissues should be avoided whenever possible.

In this article, we will discuss how controlled vocabularies can be managed when using W3C XML Schemas, since this is the dominant XML schema format for data-oriented XML. Note that the "vocabularies" we refer to are enumerated lists of element and attribute values. This differs from other contexts where "vocabularies" are sets of XML element names.

Step 1: Monolithic Schema

Before worrying about which controlled vocabularies are out of our control, the first thing to do is create a schema, using W3C XML Schema, for the account summaries. For the purposes of this article, we will use just a subset of the ISO 3-letter currency codes. A suitable schema is

<xsd:schema xmlns:xsd = "http://www.w3.org/2001/XMLSchema"
  version = "1.0"
  elementFormDefault = "qualified">

  <xsd:element name = "accountSummary">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref = "timestamp"/>
        <xsd:element ref = "currency"/>
        <xsd:element ref = "balance"/>
        <xsd:element ref = "interest"/>
      </xsd:sequence>
      <xsd:attribute name = "version" use = "required">
        <xsd:simpleType>
          <xsd:restriction base = "xsd:string">
            <xsd:pattern value = "[1-9]+[0-9]*\.[0-9]+"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:attribute>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name = "timestamp" type = "xsd:dateTime"/>

  <xsd:element name = "currency" type = "iso3currency"/>

  <xsd:element name = "balance" type = "xsd:decimal"/>

  <xsd:element name = "interest">
    <xsd:complexType>
      <xsd:simpleContent>
        <xsd:extension base = "xsd:decimal">
          <xsd:attribute name = "rounding" use = "required"
                         type = "roundingDirection"/>
        </xsd:extension>
      </xsd:simpleContent>
    </xsd:complexType>
  </xsd:element>

  <xsd:simpleType name = "iso3currency">
    <xsd:annotation>
      <xsd:documentation>ISO-4217 3-letter currency codes,
as defined at
http://www.bsi-global.com/Technical+Information/Publications/_Publications/tig90.xalter
or available from
http://www.xe.com/iso4217.htm
Only a subset are defined here.</xsd:documentation>
    </xsd:annotation>
    <xsd:restriction base = "xsd:string">
      <xsd:enumeration value = "AUD"/><!-- Australian Dollar -->
      <xsd:enumeration value = "BRL"/><!-- Brazilian Real -->
      <xsd:enumeration value = "CAD"/><!-- Canadian Dollar -->
      <xsd:enumeration value = "CNY"/><!-- Chinese Yuan -->
      <xsd:enumeration value = "EUR"/><!-- Euro -->
      <xsd:enumeration value = "GBP"/><!-- British Pound -->
      <xsd:enumeration value = "INR"/><!-- Indian Rupee -->
      <xsd:enumeration value = "JPY"/><!-- Japanese Yen -->
      <xsd:enumeration value = "RUR"/><!-- Russian Rouble -->
      <xsd:enumeration value = "USD"/><!-- US Dollar -->
      <xsd:length value = "3"/>
    </xsd:restriction>
  </xsd:simpleType>

  <xsd:simpleType name = "roundingDirection">
    <xsd:annotation>
      <xsd:documentation>Whether the interest is
rounded up, down or to the
nearest round value.</xsd:documentation>
    </xsd:annotation>
    <xsd:restriction base = "xsd:string">
      <xsd:enumeration value = "up"/>
      <xsd:enumeration value = "down"/>
      <xsd:enumeration value = "nearest"/>
    </xsd:restriction>
  </xsd:simpleType>

</xsd:schema>

Notice the two controlled vocabularies (enumerations): the simple types iso3currency and roundingDirection. For iso3currency, the length of the enumeration strings is explicitly set to 3, to help catch careless editing errors in the future, when the list of currencies needs to be updated.

Note also that the schema's optional version attribute has been set to "1.0". When working with data-oriented XML messages, it is usually necessary to support multiple versions of the message schema concurrently, as the systems that use the message schema will probably not be able to upgrade to the latest version simultaneously. So, it is vital to identify the schema version that an XML message was validated against. In keeping with this, we will name our schema accountSummary-1.0.xsd, so that future versions won't overwrite the current version.

Further, a version attribute has been added to the accountSummary element, so that message instances clearly identify their schema version. It is assumed that the version numbers have the form M.N where M is the major version number and N is the minor version number. With this change, plus the schema, the account summary now becomes

<accountSummary
  version = "1.0"
  xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
  xsi:noNamespaceSchemaLocation = "accountSummary-1.0.xsd">
  <timestamp>2003-01-01T12:25:00</timestamp>
  <currency>USD</currency>
  <balance>2703.35</balance>
  <interest rounding = "down">27.55</interest>
</accountSummary>
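
As a rough illustration of how an application might use the version attribute, the following Python sketch reads it from an instance, checks it against the same M.N expression used in the schema, and derives the schema file name under the naming scheme above. The function name schema_file_for is hypothetical, and the sketch assumes the root element name doubles as the schema's base name.

```python
import re
import xml.etree.ElementTree as ET

# Same expression as the xsd:pattern facet in the schema. XSD patterns are
# implicitly anchored, so re.fullmatch is the closest Python equivalent.
VERSION_PATTERN = re.compile(r"[1-9]+[0-9]*\.[0-9]+")


def schema_file_for(instance_xml):
    """Return the schema file name implied by the root element's version."""
    root = ET.fromstring(instance_xml)
    version = root.get("version", "")
    if not VERSION_PATTERN.fullmatch(version):
        raise ValueError(f"unsupported version attribute: {version!r}")
    # Naming scheme assumed from the article: <root element name>-<M.N>.xsd
    return f"{root.tag}-{version}.xsd"


if __name__ == "__main__":
    sample = (
        '<accountSummary version="1.0">'
        "<currency>USD</currency>"
        "</accountSummary>"
    )
    print(schema_file_for(sample))
```

Note that the pattern rejects a major version with a leading zero, such as "0.1" or "01.0", because the first digit must be in the range 1 to 9.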

Step 2: Isolate Volatile Controlled Vocabularies

When dealing with controlled vocabularies (enumerations) in schemas, it is a good idea to rate the volatility of each vocabulary. A volatile vocabulary is one which is expected to change independently of the normal release cycle of schema versions. A stable vocabulary is one which is expected to change (if at all) only as new schema versions are released. Volatile vocabularies are a problem if embedded in a schema because they impose extra releases on all dependent applications.

In our example of an account summary, the currency codes are a volatile vocabulary: they are externally controlled by ISO, and currencies can be added or removed by ISO at any time. On the other hand, the set of the rounding directions {"up", "down", "nearest"} is unlikely to change, so it is a stable vocabulary. From the point of view of somebody maintaining an application which deals with account summaries, adding a new rounding direction would mean writing, testing, and deploying a new version of the application. Political pressure would dictate that rounding values would only ever change as part of the planned release cycle of the schema. So it makes sense to leave the roundingDirection simple type embedded in the schema.

However, it is unlikely that an application would need to be recoded just to handle a change in the set of currency codes; if it did, that would be a sign of an inflexible design. As the currency codes are externally controlled, they need to be isolated: we do that by creating a separate vocabulary schema for them. A vocabulary schema is one which contains a single simple type definition with enumerated values and nothing else. The vocabulary schema for the currencies is

<xsd:schema xmlns:xsd = "http://www.w3.org/2001/XMLSchema"
  version = "1.0"
  elementFormDefault = "qualified">

  <xsd:simpleType name = "iso3currency">
    <xsd:annotation>
      <xsd:documentation>ISO-4217 3-letter currency codes,
as defined at
http://www.bsi-global.com/Technical+Information/Publications/_Publications/tig90.xalter
or available from
http://www.xe.com/iso4217.htm
Only a subset are defined here.</xsd:documentation>
    </xsd:annotation>
    <xsd:restriction base = "xsd:string">
      <xsd:enumeration value = "AUD"/><!-- Australian Dollar -->
      <xsd:enumeration value = "BRL"/><!-- Brazilian Real -->
      <xsd:enumeration value = "CAD"/><!-- Canadian Dollar -->
      <xsd:enumeration value = "CNY"/><!-- Chinese Yuan -->
      <xsd:enumeration value = "EUR"/><!-- Euro -->
      <xsd:enumeration value = "GBP"/><!-- British Pound -->
      <xsd:enumeration value = "INR"/><!-- Indian Rupee -->
      <xsd:enumeration value = "JPY"/><!-- Japanese Yen -->
      <xsd:enumeration value = "RUR"/><!-- Russian Rouble -->
      <xsd:enumeration value = "USD"/><!-- US Dollar -->
      <xsd:length value = "3"/>
    </xsd:restriction>
  </xsd:simpleType>
</xsd:schema>

and is named iso3currency-1.0.xsd. As you see, the currency vocabulary now has its own version numbers and, thus, its own release cycle. The vocabulary schema can now be included in the new version (1.1) of the main message schema:

<xsd:schema xmlns:xsd = "http://www.w3.org/2001/XMLSchema"
  version = "1.1"
  elementFormDefault = "qualified">

  <xsd:include schemaLocation = "iso3currency-1.0.xsd"/>

  <xsd:element name = "accountSummary">
    <xsd:complexType>
      <xsd:sequence>
        <xsd:element ref = "timestamp"/>
        <xsd:element ref = "currency"/>
        <xsd:element ref = "balance"/>
        <xsd:element ref = "interest"/>
      </xsd:sequence>
      <xsd:attribute name = "version" use = "required">
        <xsd:simpleType>
          <xsd:restriction base = "xsd:string">
            <xsd:pattern value = "[1-9]+[0-9]*\.[0-9]+"/>
          </xsd:restriction>
        </xsd:simpleType>
      </xsd:attribute>
    </xsd:complexType>
  </xsd:element>

  <xsd:element name = "timestamp" type = "xsd:dateTime"/>
  <xsd:element name = "currency" type = "iso3currency"/>
  <xsd:element name = "balance" type = "xsd:decimal"/>

  <xsd:element name = "interest">
    <xsd:complexType>
      <xsd:simpleContent>
        <xsd:extension base = "xsd:decimal">
          <xsd:attribute name = "rounding" use = "required" type = "roundingDirection"/>
        </xsd:extension>
      </xsd:simpleContent>
    </xsd:complexType>
  </xsd:element>

  <xsd:simpleType name = "roundingDirection">
    <xsd:annotation>
      <xsd:documentation>Whether the interest is
rounded up, down or to the
nearest round value.</xsd:documentation>
    </xsd:annotation>
    <xsd:restriction base = "xsd:string">
      <xsd:enumeration value = "up"/>
      <xsd:enumeration value = "down"/>
      <xsd:enumeration value = "nearest"/>
    </xsd:restriction>
  </xsd:simpleType>

</xsd:schema>

and this is accountSummary-1.1.xsd according to our naming scheme. Note that the currency codes no longer appear in the main schema.

Step 3: Decouple Controlled Vocabularies

The problem with accountSummary-1.1.xsd is that it directly includes iso3currency-1.0.xsd. When a new version of the ISO currency vocabulary schema is released, you still have to release a new version of the account summary schema. What is needed is a mechanism to decouple the vocabulary schema versions from the main schema versions. The simple solution is to use an unversioned "pass-through" vocabulary schema:

<xsd:schema xmlns:xsd = "http://www.w3.org/2001/XMLSchema"
  elementFormDefault = "qualified">
  <xsd:include schemaLocation = "iso3currency-1.0.xsd"/>
</xsd:schema>

This unversioned vocabulary schema has no version attribute and is named iso3currency.xsd. To complete the decoupling, a new version of the main schema, accountSummary-1.2.xsd, is released. The only change from version 1.1 is that the <xsd:include> changes from

<xsd:include schemaLocation = "iso3currency-1.0.xsd"/>

to

<xsd:include schemaLocation = "iso3currency.xsd"/>

so that the unversioned currency vocabulary schema is included. The decoupling is now complete. If ISO changes the list of currency codes, a new currency schema is released and iso3currency.xsd is updated so that it includes the new currency schema. The main schema does not need to be changed, since it includes iso3currency.xsd and is agnostic to the version of the currency vocabulary schema.
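
The update step can be sketched in Python as regenerating the pass-through file so that it points at the newly released version. This is only an illustration: the function names are hypothetical, and the version number "1.1" for the next currency schema is an assumption, not a release described in this article.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"


def passthrough_schema(vocabulary, version):
    """Build the text of an unversioned pass-through vocabulary schema."""
    return (
        f'<xsd:schema xmlns:xsd = "{XSD_NS}"\n'
        f'  elementFormDefault = "qualified">\n'
        f'  <xsd:include schemaLocation = "{vocabulary}-{version}.xsd"/>\n'
        f"</xsd:schema>\n"
    )


def included_locations(schema_text):
    """List the schemaLocation of every xsd:include directly under xsd:schema."""
    root = ET.fromstring(schema_text)
    return [inc.get("schemaLocation") for inc in root.findall(f"{{{XSD_NS}}}include")]


if __name__ == "__main__":
    # Hypothetical release of version 1.1 of the currency vocabulary schema.
    text = passthrough_schema("iso3currency", "1.1")
    print(text)
    print(included_locations(text))
```

Writing the returned text to iso3currency.xsd is the only deployment step in this sketch; the main schema file is never touched.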

Step 4: Protect Applications

Decoupling vocabulary schemas like this is not without issues. First, as new versions of the currency vocabulary schema are released, existing instance files will become invalid if they contain currency codes which ISO has deleted. In some situations that would be unacceptable, but it makes sense here. If an instance file refers to a currency code that no longer exists, then it has become semantically invalid; it is not unreasonable for it to become syntactically invalid too. The invalid syntax can then be used to detect such instances and route them for special processing, so that the code in the main application can focus on what to do with valid currency codes. Being able to remove error handling from the main application means the main application code remains smaller and easier to maintain.
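
The routing decision might look something like the following Python sketch. It is not schema validation: the validation step is replaced here by a plain membership test against a set of current codes, purely to show the shape of the dispatch. The function name route and the queue labels "main" and "special" are hypothetical.

```python
import xml.etree.ElementTree as ET


def route(instance_xml, allowed_codes):
    """Return "main" for a current currency code, "special" otherwise."""
    root = ET.fromstring(instance_xml)
    # findtext returns None when there is no currency element, which also
    # sends the instance to special processing.
    code = root.findtext("currency")
    return "main" if code in allowed_codes else "special"


if __name__ == "__main__":
    current = {"EUR", "USD"}  # assumed current vocabulary, for illustration
    withdrawn = '<accountSummary version="1.0"><currency>XXX</currency></accountSummary>'
    print(route(withdrawn, current))
```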

Second, with the currency codes able to change at any time, there needs to be synchronization between the currency codes in the currency vocabulary schema and the currency codes known to the applications. There are two solutions to this. The first is that applications can use the vocabulary schema as the source of the currency codes. Treating the vocabulary schema as an XML file, a quick SAX parse is all you need to pull out the <xsd:enumeration> elements containing the allowed values. The second solution is to keep the currency codes in a table in a central relational database. Applications can access this table directly, while the vocabulary schema can be dynamically generated from the same table. Either method keeps the set of allowed values synchronized across applications.
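
Both solutions can be sketched with the Python standard library. The first function below uses a SAX parse to collect the enumerated values from a vocabulary schema; the second generates a vocabulary schema from a database table. The table name currency, the column name code, and the use of SQLite are assumptions made for the sake of a self-contained example.

```python
import io
import sqlite3
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces
from xml.sax.saxutils import quoteattr

XSD_NS = "http://www.w3.org/2001/XMLSchema"


class EnumerationCollector(ContentHandler):
    """Collect the value attribute of every xsd:enumeration element."""

    def __init__(self):
        super().__init__()
        self.values = []

    def startElementNS(self, name, qname, attrs):
        # With namespace processing on, name is a (URI, local name) pair and
        # unqualified attributes are keyed by (None, local name).
        if name == (XSD_NS, "enumeration"):
            self.values.append(attrs.getValue((None, "value")))


def enumerated_values(schema_text):
    """First solution: read the allowed values out of a vocabulary schema."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_namespaces, True)
    handler = EnumerationCollector()
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(schema_text.encode("utf-8")))
    return handler.values


def vocabulary_schema_from_table(conn, type_name, version):
    """Second solution: generate the vocabulary schema from a database table.

    The table name "currency" and column name "code" are hypothetical.
    """
    rows = conn.execute("SELECT code FROM currency ORDER BY code").fetchall()
    enumerations = "\n".join(
        f"      <xsd:enumeration value = {quoteattr(code)}/>" for (code,) in rows
    )
    return (
        f'<xsd:schema xmlns:xsd = "{XSD_NS}"\n'
        f'  version = "{version}"\n'
        f'  elementFormDefault = "qualified">\n'
        f'  <xsd:simpleType name = "{type_name}">\n'
        f'    <xsd:restriction base = "xsd:string">\n'
        f"{enumerations}\n"
        f'      <xsd:length value = "3"/>\n'
        f"    </xsd:restriction>\n"
        f"  </xsd:simpleType>\n"
        f"</xsd:schema>\n"
    )


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE currency (code TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO currency VALUES (?)", [("USD",), ("EUR",), ("JPY",)])
    schema_text = vocabulary_schema_from_table(conn, "iso3currency", "1.0")
    print(enumerated_values(schema_text))
```

The demonstration at the bottom generates a schema from the table and then parses it back, which shows that the two routes agree on the set of allowed values.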

Third, using such vocabulary schemas is only workable if applications can rely on them changing in one of two ways only: either an enumerated value is added or one is deleted.

Vocabulary schemas must never change structurally. If a new simple type, complex type, or element definition were added to a vocabulary schema, it could change the results of validating an instance with the main schema and cause a major application failure. So vocabulary schemas need to be "validated" to ensure that they contain just a single simple type definition with enumerated values. This is exactly the situation Will Provost described in "Working with a Metaschema".
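
The two permitted kinds of change, adding an enumerated value and deleting one, can be made visible by comparing two versions of a vocabulary schema. The following Python sketch does this with set arithmetic; the function names are hypothetical, and the RUR-to-RUB revision in the demonstration is used only as an illustration of a change in the code list.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"


def enumerated_values(schema_text):
    """Return the set of xsd:enumeration values found in a schema."""
    root = ET.fromstring(schema_text)
    return {e.get("value") for e in root.iter(f"{{{XSD_NS}}}enumeration")}


def vocabulary_changes(old_schema_text, new_schema_text):
    """Report which enumerated values were added and which were deleted."""
    old = enumerated_values(old_schema_text)
    new = enumerated_values(new_schema_text)
    return {"added": sorted(new - old), "deleted": sorted(old - new)}


def tiny_schema(codes):
    """Build a minimal vocabulary schema for demonstration purposes only."""
    body = "".join(f'<xsd:enumeration value="{c}"/>' for c in codes)
    return (
        f'<xsd:schema xmlns:xsd="{XSD_NS}">'
        '<xsd:simpleType name="iso3currency">'
        f'<xsd:restriction base="xsd:string">{body}</xsd:restriction>'
        "</xsd:simpleType></xsd:schema>"
    )


if __name__ == "__main__":
    old = tiny_schema(["EUR", "RUR", "USD"])
    new = tiny_schema(["EUR", "RUB", "USD"])
    print(vocabulary_changes(old, new))
```

A report like this covers only the enumerated values; it says nothing about structural changes, which still need a separate check.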

An obvious solution would be to write a schema for vocabulary schemas as the metaschema. In practice I don't do this. The existing "Schema for Schemas" is known not to be 100% correct in describing the W3C XML Schema syntax, and so schema editing tools use it as an indicative, rather than normative, guide. This means that schema editors tend to ignore any attempt to impose a metaschema on a schema. For this reason, and because the vocabulary schema format is quite simple, I use the following Schematron schema:

<sch:schema
  xmlns = "http://www.w3.org/2001/XMLSchema"
  xmlns:sch = "http://www.ascc.net/xml/schematron"
  xmlns:xsd = "http://www.w3.org/2001/XMLSchema"
  xmlns:xsi = "http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation =
    "http://www.ascc.net/xml/schematron schematron-1.5.xsd">

  <sch:title>Controlled vocabulary validation</sch:title>

  <!-- The input is assumed to be a valid W3C XML Schema. -->
  <!-- This just checks that it is also a valid           -->
  <!-- vocabulary Schema.                                 -->

  <!-- XPath 1.0 ignores the default namespace, so the W3C XML  -->
  <!-- Schema elements are matched through a declared prefix.   -->
  <sch:ns prefix = "xsd" uri = "http://www.w3.org/2001/XMLSchema"/>

  <sch:pattern name = "controlled-vocabulary-schema">
    <sch:rule context = "xsd:schema">
      <sch:assert test = "count(*) = count(xsd:simpleType[@name])"
      >The schema must contain only a
       single simple type definition.</sch:assert>
      <sch:assert test = "count(xsd:simpleType[@name]) = 1"
      >The schema must contain exactly one
       named simpleType definition.</sch:assert>
    </sch:rule>

    <sch:rule context = "xsd:simpleType">
      <sch:assert test = "@name"
      >The simpleType must have a name.</sch:assert>
      <sch:assert test = "count(xsd:restriction) = 1"
      >The simpleType must contain a
       single restriction.</sch:assert>
      <sch:assert test = "count(*) = count(xsd:annotation)+count(xsd:restriction)"
      >The simpleType may have an annotation as well as its
       restriction, but no other structure.</sch:assert>
    </sch:rule>

    <sch:rule context = "xsd:restriction">
      <sch:assert test = "xsd:enumeration"
      >A restriction must contain enumerated values.</sch:assert>
    </sch:rule>

    <sch:rule context = "xsd:enumeration">
      <sch:key name = "enumerationsByValue" path = "@value"/>
      <sch:assert test = "count(key('enumerationsByValue', @value)) = 1"
      >An enumerated value must be unique.</sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>

Under Windows, you can validate a vocabulary schema against this Schematron schema using the free validator from Topologi. For other platforms, see the list of tools in the Schematron Resource Directory. Chimezie Ogbuji introduced Schematron in "Validating XML with Schematron".

Schematron assertions are expressed using XPath expressions which must evaluate to true. If they evaluate to false, a Schematron validation error is generated. In our Schematron schema, note the following:

Look at the rule whose context is the schema element. It contains the assertions that are applied to the <xsd:schema> element in the vocabulary schema. The first assertion checks that the schema contains nothing but named <xsd:simpleType> definitions. The second assertion checks that there is exactly one such definition.
The rule for the simpleType context asserts that the <xsd:simpleType> must have a name attribute, that it must contain exactly one <xsd:restriction> and may contain an <xsd:annotation>, and that it cannot contain any other elements.
The rule for the restriction context asserts that the <xsd:restriction> must contain one or more enumerated values.
The rule for the enumeration context asserts that the enumeration values must be unique. This is checked using a Schematron key (equivalent to an XSLT key). The expression key('enumerationsByValue', @value) returns the set of <xsd:enumeration> elements with the same value as the element being validated. If the values are unique, there will always be just one <xsd:enumeration> element in that set, the one being validated.
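
For readers without a Schematron processor to hand, the same four rules can be approximated in plain Python. This is a swapped-in technique, not Schematron, and is only a sketch: the function name vocabulary_schema_errors is hypothetical, and uniqueness is checked per restriction rather than with a document-wide key (the two are equivalent when the schema holds a single simple type).

```python
import xml.etree.ElementTree as ET

XSD = "{http://www.w3.org/2001/XMLSchema}"


def vocabulary_schema_errors(schema_text):
    """Return error messages; an empty list means every check passed."""
    root = ET.fromstring(schema_text)  # comments are dropped by default
    if root.tag != XSD + "schema":
        return ["The document element must be xsd:schema."]
    errors = []

    # Rule 1: nothing but exactly one named simpleType under xsd:schema.
    named = [c for c in root if c.tag == XSD + "simpleType" and c.get("name")]
    if len(root) != len(named) or len(named) != 1:
        errors.append("The schema must contain only a single simple type definition.")

    # Rule 2: a name, one restriction, optional annotation, nothing else.
    for simple_type in root.iter(XSD + "simpleType"):
        if not simple_type.get("name"):
            errors.append("The simpleType must have a name.")
        restrictions = simple_type.findall(XSD + "restriction")
        annotations = simple_type.findall(XSD + "annotation")
        if len(restrictions) != 1:
            errors.append("The simpleType must contain a single restriction.")
        if len(simple_type) != len(restrictions) + len(annotations):
            errors.append("The simpleType may contain only an annotation and a restriction.")

    # Rules 3 and 4: at least one enumerated value, and no duplicates.
    for restriction in root.iter(XSD + "restriction"):
        values = [e.get("value") for e in restriction.findall(XSD + "enumeration")]
        if not values:
            errors.append("A restriction must contain enumerated values.")
        if len(values) != len(set(values)):
            errors.append("An enumerated value must be unique.")
    return errors
```

Passing the text of a vocabulary schema such as iso3currency-1.0.xsd to this function should yield an empty list, while a schema that also declares an element, or repeats a currency code, yields one or more messages.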

Conclusion

W3C XML Schema (WXS) schemas can be made more manageable by separating volatile controlled vocabularies (enumerations) into their own vocabulary schemas. In this article, we have seen how to identify volatile controlled vocabularies, how to separate them from the main schema, how to decouple the versions, and how to validate vocabulary schemas. There is no absolute rule for when a controlled vocabulary should have its own schema. Use the guidelines here, but always use your own judgment and your knowledge of your problem domain.

Resources

The example files from this article are available as a ZIP archive (9K).