A Smoother Change to Version 2.0

April 11, 2007

"Ch-ch-ch-changes" sang David Bowie, "Just gonna have to be a different man. Time may change me but I can't trace time." It's a great idea for a song, but when moving to the next version of an XML-based exchange, we would like a transition with less stutter. However, this is not easy: on one hand, we want old version processors to accept new messages, the way older browsers can display newer HTML by ignoring unknown tags. On the other hand, those "mustIgnoreUnknown" semantics are often unwanted. In financial transactions, medical care, justice, etc., certain parts must always be understood -- we'd rather have a physician grab the phone and call us than have him ignore those parts about lethal allergies that his client software did not understand. SOAP provides a mechanism for SOAP headers: a "mustUnderstand" flag, which indicates which parts may not be ignored. Unfortunately, this mechanism is rather inflexible.

This article will outline a design pattern that makes a version transition much easier, and that is both more powerful and simpler than SOAP-style "mustUnderstand" semantics. In a nutshell, the next version of an XML vocabulary can be backward- or forward-compatible with the previous. We know the language version we use when we create a message, and all previous language versions that are forward-compatible with this message. So we put the list of those versions in the message. We cannot know which message versions the receiver supports, but if it supports only earlier versions, it can decide from the list whether it will process our message or not. If the receiver supports later versions, it can decide whether the message is backward-compatible with the receiver's version. In all cases, the receiver can make the best decision possible. Let's study this "Capability Compatibility Design Pattern" in more detail.

On Compatibility

Figure 1. (In)compatible language

First, some background. Read David Orchard's "A Theory of Compatible Versions" or the W3C TAG's Versioning Finding for more. With traditional software -- say, a word processor -- things are simple: if your V2 word processor can read documents created with your V1 word processor, the V2 processor is backward-compatible with the V1 processor. If the V1 processor can read V2 documents, it is forward-compatible with V2. Of course no one will buy a V2 word processor, write documents, and then buy a V1 word processor and try to access the V2 documents, so this is uncommon in traditional software applications. It does happen when we email documents and the receiver has an older word processor than we do. It also happens a lot on the Web, where older web browsers may encounter markup that wasn't known when the browser was built. HTML tackles this problem by requiring software to ignore all unknown tags, and just display the content (if possible).

For two language versions -- L1 and L2 -- for message exchange, we can summarize:

If L2 applications can read all L1 documents, L1 and L2 are backward-compatible
If L1 applications accept all L2 documents, L1 and L2 are forward-compatible

Figure 2. Forward and backward compatibility

IgnoreUnknown and MustUnderstand Semantics

If two language versions, L1 and L2, are forward-compatible, we do not expect L1 to process all L2 syntax. Like HTML, we just expect the earlier application to accept documents in later versions of the language, and show what can be shown. This is what we call "IgnoreUnknown" semantics, and this is where "MustUnderstand" comes in. Some information simply may not be ignored. This is frequently the case with information related to security. SOAP provides a mechanism for SOAP headers to achieve this:

<my:security-header soap:mustUnderstand = "1">

If the mustUnderstand attribute is set to "1", an application may only process the message if it understands the semantics of this header. MustUnderstand overrides IgnoreUnknown.
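As a rough illustration of that rule, here is a minimal Python sketch that lists the header blocks a receiver would have to reject. It assumes the SOAP 1.1 envelope namespace; the my prefix, its urn:example:my namespace, the trace-header block, and the function name not_understood are all hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope namespace

# Hypothetical envelope; the "my" namespace and both header names are made up.
ENVELOPE = """<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/"
               xmlns:my="urn:example:my">
  <soap:Header>
    <my:security-header soap:mustUnderstand="1"/>
    <my:trace-header/>
  </soap:Header>
  <soap:Body/>
</soap:Envelope>"""


def not_understood(envelope_xml, understood_tags):
    """Return the mustUnderstand="1" header blocks this receiver does not know."""
    root = ET.fromstring(envelope_xml)
    header = root.find(f"{{{SOAP_NS}}}Header")
    if header is None:
        return []
    return [
        block.tag
        for block in header
        if block.get(f"{{{SOAP_NS}}}mustUnderstand") == "1"
        and block.tag not in understood_tags
    ]


# A receiver that knows no headers must refuse the message because of the
# security header; the unflagged trace header may simply be ignored.
print(not_understood(ENVELOPE, set()))
```

A non-empty result means the message may not be processed; an empty list means every flagged header is understood.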

IgnoreUnknown works well for browsers, but sometimes understanding is simply mandatory. Again, this is true for nearly everything related to security, and for much of reliable messaging and transactions as well. It is also often true in environments such as health care or finance: if you do not understand the information I sent you, I'd rather have you reject the message and call me than ignore the dosage in the medical prescription I sent, or the maximum on the stock order I submitted. Some things need to be understood. SOAP mustUnderstand semantics are not very flexible, however: mustUnderstand works only for SOAP headers. It could be extended to cover elements in SOAP:Body as well, but this potentially adds an attribute to every element in the tree -- yuck! It also works only at the level of an entire element. There must be a better way.

The Capability Compatibility Design Pattern

One of the principles that follows from the discussion of compatibility is that a sender knows which language version was used to create a message, and the capabilities of that language and of earlier versions. So the sender can put this information in the message itself. Of course the language version L4 that was used to produce a message is suitable for understanding it, so any receiver that understands L4 may process it. The sender can also know whether this particular message uses any new items introduced in version L4. Maybe it uses only items already in the previous language version, L3. So the sender could indicate in the message that any L3 receiver may process it. Ditto for L2 and L1.

If the message does contain new L4 items, and those items can be safely ignored, the sender can also list L3 as sufficient for processing. If the message contains items from L4 that must be understood, the sender will list only the L4 capability as sufficient for processing the message. The receiver knows which version was used to build the receiving software, its capabilities, and the capabilities of earlier versions. So if the receiver is built using language version L5, it will know whether it can process L4 messages (it usually will -- but sometimes language changes will not be backward-compatible). If it can, L5 receivers will simply know they can safely process L4 messages. So if we put the version information into the message itself, the receiver can calculate whether it may process the message or not -- in the latter case, the receiver can return an error message.
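The receiver's side of this reasoning can be sketched in a few lines of Python. This is only an illustration under simple assumptions: versions are plain integers, the receiver knows the set of versions it can read, and the function name may_process is hypothetical.

```python
def may_process(required_versions, supported_versions):
    """True if the receiver reads at least one version the sender listed."""
    return bool(set(required_versions) & set(supported_versions))


# An L3 receiver that can also read L1 and L2 messages.
l3_receiver = {1, 2, 3}

# The sender used L4 but judged every new L4 item ignorable, so it also lists L3.
print(may_process([3, 4], l3_receiver))  # True

# The sender used an L4 item that must be understood, so it lists only L4.
print(may_process([4], l3_receiver))     # False
```

When the check fails, the receiver returns an error instead of processing the message.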

Figure 3. L3 and L4 compatibility

This "Capability Compatibility Design Pattern" extends well beyond elements. Of course any attribute in a particular language version can be handled in exactly the same way. More than this, the Pattern easily handles element content as well. If we have an L4 code list with values "Standard" and "Handle with Care," and then L5 introduces a code "Unknown," it can be ignored, and L4 receivers may process it. If L5 contains a new code "Hazardous," this may not be ignored -- only L5 receivers may use such a message for subsequent transport of associated goods. In fact, the Capability Compatibility Design Pattern can handle any type of change in the language. And instead of requiring mustUnderstand attributes sprinkled throughout the entire document, a single list with a couple of language versions required for processing is sufficient.

Let's do a walkthrough of the Capability Compatibility Design Pattern. In each example a new type of version change is shown, as well as the way it is handled by the Capability Compatibility Design Pattern.

The Medication Example

We'll start with a language used by physicians to send medication prescriptions to apothecaries. Here is version 1:

<?xml version="1.0" encoding="UTF-8"?>

<message version="1">

    <require>

        <version>1</version>

    </require>

    <prescription>

        <medication>aspirin</medication>

        <amount>24</amount>

    </prescription>

</message>

We'll ignore all details such as patient IDs, namespaces, etc., and focus on the medication and the versioning information. In a <require> element we list the versions that may accept our message -- just version 1 for version 1 of the language. Normally, using a URI to identify the version would be the thing to do, but for brevity, I've used just integers in the examples. L1 processors will also need the capability to ignore unknown tags. I've supplied an XSLT script that transforms an Lx document to L1 by removing all unknown elements in the <prescription> element. There are other mechanisms -- using NVDL, authoring XML Schemas with wildcards, doing this in Java or C on the server -- but this will do here. The important thing is that any language that uses the Capability Compatibility Design Pattern must have a mechanism for ignoring unknown content. The processing model for the language is:

Check the required versions.
If not available, return an error.
Strip unknown content with stylesheet.
Validate against schema.
Dispatch for further processing.
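The five steps above can be sketched for an L1 receiver as follows. This is not the stylesheet and schema from the article's zip file: the stripping and validation steps are crude Python stand-ins, and the names process, SUPPORTED_VERSIONS, and KNOWN_ELEMENTS, as well as the error strings, are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

SUPPORTED_VERSIONS = {1}                   # versions this L1 receiver can read
KNOWN_ELEMENTS = {"medication", "amount"}  # children of <prescription> in L1

L1_MESSAGE = """<message version="1">
  <require><version>1</version></require>
  <prescription>
    <medication>aspirin</medication>
    <amount>24</amount>
  </prescription>
</message>"""


def process(xml_text):
    root = ET.fromstring(xml_text)

    # Steps 1 and 2: check the required versions, return an error if none fits.
    required = {int(v.text) for v in root.findall("./require/version")}
    if not required & SUPPORTED_VERSIONS:
        return "error: unsupported version"

    # Step 3: strip unknown content (stand-in for the stylesheet).
    prescription = root.find("prescription")
    if prescription is None:
        return "error: invalid message"
    for child in list(prescription):
        if child.tag not in KNOWN_ELEMENTS:
            prescription.remove(child)

    # Step 4: validate (crude stand-in for schema validation).
    if [child.tag for child in prescription] != ["medication", "amount"]:
        return "error: invalid message"

    # Step 5: dispatch; here we simply hand back the extracted fields.
    return {child.tag: child.text for child in prescription}


print(process(L1_MESSAGE))
# Same message, but claiming that only (hypothetical) version 2 receivers may read it.
print(process(L1_MESSAGE.replace("<version>1</version>", "<version>2</version>")))
```

The version check runs before any stripping, so a receiver never discards content from a message it was not allowed to process in the first place.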

Here's a zip file with all sample XML, all "ignore unknown" stylesheets, and schemas for the examples in the article.

Adding an IgnoreUnknown Element

For version 2 of the language, we'll add a <packaging> element. This is the advised packaging of the medicine. Understanding it is not required; apothecaries are specialized enough to select the best packaging, and the element contains merely advice, not a prescription:

# Language version 2

# added packaging, mustUnderstand = false

element message {

    attribute version { xsd:integer },

    element require {

        element version { xsd:integer }+

    },

    element prescription {

        element medication { xsd:string },

        element amount { xsd:integer },

        element packaging { xsd:string }?

    }

}

The change in language 2 -- L2 for short -- is backward-compatible: since the <packaging> element is optional, any L1 document (such as the one above) will be valid in L2. Because the language is backward-compatible, L2 receivers have the capability to understand L1 and L2 messages. L2 also gets its own stylesheet for ignoring unknown tags; this one will also retain the new packaging element.

L2 capabilities: read L1 L2 write L2

An L2 instance may contain the new element:

<?xml version="1.0" encoding="UTF-8"?>

<message version="2">

    <require>

        <version>1</version>

        <version>2</version>

    </require>

    <prescription>

        <medication>aspirin</medication>

        <amount>24</amount>

        <packaging>box</packaging>

    </prescription>

</message>

The L2 instance lists the receivers that may process this message: version 1 and version 2 receivers. If an L1 receiver gets this message, it will conclude that it is safe to process this message. It will then do its "ignore unknown" magic and remove unknown elements, which will yield message version 1 above. We thus have the desired forward-compatibility.
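To make the difference between the L1 and L2 "ignore unknown" steps concrete, here is a small Python sketch. It stands in for the XSLT stylesheets rather than reproducing them; the KNOWN_ELEMENTS table and the function name strip_unknown are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Children of <prescription> that each language version knows about.
KNOWN_ELEMENTS = {
    1: {"medication", "amount"},
    2: {"medication", "amount", "packaging"},
}

L2_MESSAGE = """<message version="2">
  <require><version>1</version><version>2</version></require>
  <prescription>
    <medication>aspirin</medication>
    <amount>24</amount>
    <packaging>box</packaging>
  </prescription>
</message>"""


def strip_unknown(xml_text, receiver_version):
    """Return the <prescription> children a receiver of that version keeps."""
    root = ET.fromstring(xml_text)
    prescription = root.find("prescription")
    for child in list(prescription):
        if child.tag not in KNOWN_ELEMENTS[receiver_version]:
            prescription.remove(child)
    return [(child.tag, child.text) for child in prescription]


print(strip_unknown(L2_MESSAGE, 1))  # the L1 receiver drops <packaging>
print(strip_unknown(L2_MESSAGE, 2))  # the L2 receiver retains it
```

The L1 receiver ends up with exactly the content of the version 1 prescription, while the L2 receiver also sees the packaging advice.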

Applying MustUnderstand Semantics

Next we'll go for an element that must be understood. We'll expand the language and enable the physician to instruct the apothecary to send the medication by mail to the patient's home address. We'll introduce a <delivery> element, declared as element delivery { "mail" | "standard" }?. The receiver must understand this element; otherwise the medication would never be sent to the patient. However, the element is optional, so understanding is necessary only when the element is present. We now get two flavors of instances:

<?xml version="1.0" encoding="UTF-8"?>

<message version="3">

    <require>

        <version>3</version>

    </require>

    <prescription>

        <medication>aspirin</medication>

        <amount>24</amount>

        <delivery>mail</delivery>

    </prescription>

</message>

The first flavor does have the <delivery> element. The only receivers that may process it are version 3 receivers. Version 1 and 2 receivers will recognize that they are not allowed to process it, and must return an error. It amounts to the same as a SOAP-style "mustUnderstand" flag on the <delivery> element, but without the need for such flags on every element that must be understood.

The second flavor does not have the delivery element:

<?xml version="1.0" encoding="UTF-8"?>

<message version="3">

    <require>

        <version>1</version>

        <version>2</version>

        <version>3</version>

    </require>

    <prescription>

        <medication>aspirin</medication>

        <amount>24</amount>

    </prescription>

</message>

It is basically message 1 again. Receivers that support L1, L2, or L3 may process it. This highlights a principle that every writer application should adhere to: maintain a list of versions that may consume the produced instance. For L3, the default list is L1, L2, L3. But whenever a <delivery> element is inserted, the list should be restricted to L3 only.
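A writer that follows this principle might look like the Python sketch below. It is a minimal illustration, not the article's sample code: the function names build_l3_message and required_versions are hypothetical, and the optional <packaging> element is left out for brevity.

```python
import xml.etree.ElementTree as ET


def build_l3_message(medication, amount, delivery=None):
    """Build an L3 message and list the versions that may consume it."""
    # Default list: without must-understand items, L1, L2 and L3 may process it.
    required = [1, 2, 3]

    message = ET.Element("message", version="3")
    require = ET.SubElement(message, "require")
    prescription = ET.SubElement(message, "prescription")
    ET.SubElement(prescription, "medication").text = medication
    ET.SubElement(prescription, "amount").text = str(amount)

    if delivery is not None:
        # <delivery> must be understood, so only L3 receivers may process it.
        ET.SubElement(prescription, "delivery").text = delivery
        required = [3]

    for version in required:
        ET.SubElement(require, "version").text = str(version)
    return message


def required_versions(message):
    """Read back the version list from a built message."""
    return [int(v.text) for v in message.findall("./require/version")]


print(required_versions(build_l3_message("aspirin", 24)))
print(required_versions(build_l3_message("aspirin", 24, delivery="mail")))
```

Keeping the restriction next to the code that inserts the element makes it hard to add a must-understand item without also narrowing the list.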

Removing Obsolete Features

Sometimes backward-incompatible changes are made. A common case is when an ill-conceived part is replaced by a better alternative; the original is marked as "obsolete" for some versions, then removed. Such a removal is not backward-compatible. Let's take a detailed look:

<?xml version="1.0" encoding="UTF-8"?>

<message version="4">

    <require>

        <version>4</version>

    </require>

    <prescription>

        <medication>aspirin</medication>

        <quantity unit="pcs">24</quantity>

    </prescription>

</message>

Of course, the idea of an <amount> element was ill-conceived. Not all medication comes in countable pieces. Sometimes prescriptions are in milliliters or milligrams. So we decided to remove <amount> and introduce <quantity>, with a unit attribute. We won't wait several versions before removing the obsolete <amount> element -- we'll do it right away. Receivers now must support L4: if older processors tried to process the message, <amount> would be missing and <quantity> would be stripped, making the prescription incomplete.

Furthermore, version 4 of the language could refuse to accept documents with the <amount> element:

L4 capabilities: read L4 write L4

In this case, messages from older senders would be rejected with an "Obsolete version, please upgrade" error. This seems a bit harsh for the <amount> example, but if a security leak were discovered in the older versions, such a policy would be advisable for sensitive messages. And even for simple features, after enough time it makes sense to require all parties in a professional environment to support a specific minimum level of a language specification.
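A receiver with "read L4" as its only capability could tell the two failure cases apart roughly as follows. This Python sketch assumes integer versions; the function name check_message and every status string except the "Obsolete version, please upgrade" text quoted above are hypothetical.

```python
def check_message(required_versions, readable_versions):
    """Decide whether a message may be processed, and say why not otherwise."""
    if not required_versions:
        return "Missing version requirements"
    if set(required_versions) & set(readable_versions):
        return "ok"
    if max(required_versions) < min(readable_versions):
        # Every listed version is older than anything we still read.
        return "Obsolete version, please upgrade"
    return "Unsupported version"


L4_READABLE = {4}  # L4 capabilities: read L4

print(check_message([4], L4_READABLE))        # an L4 message
print(check_message([1, 2, 3], L4_READABLE))  # a message from an L3 sender
print(check_message([5], L4_READABLE))        # a later message that must be understood
```

Distinguishing "obsolete" from "too new" tells the sender which side needs to upgrade.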

Attributes and Code Lists

The Capability Compatibility Design Pattern supports not only new and removed elements, but attributes as well. It depends a bit on the "ignore unknown" implementation, but suppose we remove not only unknown elements but also unknown attributes in known elements. The mechanics for attributes are then no different from those sketched above. What's more, we can require support for some version of the language based on the code value in an enumeration. Above, we introduced the <delivery> tag:

element delivery { "mail" | "standard" }?

We can change it to:

element delivery { "mail" | "standard" | "personal" | "any" }?

Now "any" could mean it's up to the apothecary to decide how to deliver the medication. The value can safely be ignored by older processors.

<?xml version="1.0" encoding="UTF-8"?>

<message version="5">

    <require>

        <version>4</version>

        <version>5</version>

    </require>

    <prescription>

        <medication>aspirin</medication>

        <quantity unit="pcs">24</quantity>

        <delivery>any</delivery>

    </prescription>

</message>

The stylesheet for version 4 will remove the <delivery> element with the unknown value. But the value "personal" means the physician insists that the medication may only be given to the patient in person, not to anybody else. This value may not be ignored, so version 5 processors should require a minimum of version 5 whenever they insert the personal value:

<?xml version="1.0" encoding="UTF-8"?>

<message version="5">

    <require>

        <version>5</version>

    </require>

    <prescription>

        <medication>aspirin</medication>

        <quantity unit="pcs">24</quantity>

        <delivery>personal</delivery>

    </prescription>

</message>
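On the writer's side, the choice between these two version lists can be driven by a small lookup table. The Python sketch below is an assumption-laden illustration: the table MINIMUM_VERSION and the function required_versions are hypothetical names, and the floor of 4 reflects the <quantity> change that already excludes L1 to L3 receivers.

```python
# Lowest language version whose receivers may handle each <delivery> value.
# None means the element is absent. "any" is new in L5 but may be ignored,
# so L4 receivers may still process it; "personal" must be understood.
MINIMUM_VERSION = {
    None: 4,
    "mail": 4,
    "standard": 4,
    "any": 4,
    "personal": 5,
}


def required_versions(delivery=None, writer_version=5):
    """List the versions an L5 writer should put in <require>."""
    return list(range(MINIMUM_VERSION[delivery], writer_version + 1))


print(required_versions("any"))       # [4, 5]
print(required_versions("personal"))  # [5]
```

Adding a new code value then means adding one table entry that records whether older receivers may ignore it.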

Namespaces and multiple languages in a single document make things more complicated than can be shown here, but the same principles apply.

Conclusions

The Capability Compatibility Design Pattern is a very flexible and powerful way to control changes in versions of languages for exchanges over the Internet. It goes beyond SOAP-style mustUnderstand headers and easily supports IgnoreUnknown and mustUnderstand semantics for elements, attributes, and enumerations. It does this all by adhering to two simple principles:

List all versions, including older ones, that you know may process your message inside the message you make.
Know all versions you support, and check whether they fit requirements in incoming messages.