Impact of XML Schema Versioning on System Design

(Strategies for Facilitating System Evolution)

by Roger L. Costello and Melissa Utzinger

Introduction

Creating a new version of an XML Schema may have effects that ripple through many parts of a system. Managing these effects can be expensive. So it is worthwhile to examine ways to mitigate the costly ripple effects of new versions of a Schema.

Frequently, Schema versioning is considered in isolation from the rest of the system. However, as noted, Schema changes may impact other parts of a system, so we recommend that Schema versioning be part of an integrated system evolution plan. Schema versioning is one of the drivers for system evolution.

As a strategy for facilitating system evolution we focus on these three parts of a system - Schemas, instance documents, and applications. To treat these three parts in a holistic fashion we make the following recommendations:

Schema Design Recommendations:

- Use the same namespace for all Schema versions.

- Give each new Schema version a different filename or a different URL location or both.

- Don't use anonymous types. Instead, use named types.

- If you change a type when you create a new version of a Schema then give the type a different name.

- Change the name of an element's type only if its immediate content has changed.

- Use a version attribute on the root element. If an instance document is a compound document - that is, an assembly of XML fragments - then place a version attribute on the root of each fragment.

Instance Document Design Recommendations:

- Use the schemaLocation attribute to identify the target Schema (i.e., don't have the Schema validator use out-of-band information to identify the target Schema)

Application Design Recommendations:

- Applications should use the tag names to locate data in instance documents. (Applications should be designed to anticipate that the order of tags may change)

- Define a system-wide protocol (e.g., fault reporting mechanism) to be used when an application is unable to process an instance document it receives from another application.

The rationale for each of these recommendations is explained over the course of this paper. But first we begin by defining the nature of the systems being targeted.

The System

We assume that the system under consideration possesses these characteristics:

- The system is composed of multiple independent applications that collaborate by exchanging XML documents (henceforth called "instance documents").

- All instance documents conform to a common XML Schema. (The applications are part of a community which uses the same XML Schema.)

- All applications are independent and are not required to have prior knowledge, understanding, or agreements with one another.

- The XML Schema is periodically updated, i.e., periodically a new version is created.

- The XML Schemas (both old and new versions) are accessible by the applications.

- Applications are not required to upgrade in lockstep. For example, application A may have been upgraded to send and receive instance documents which conform to the latest version of an XML Schema. Meanwhile, the application that it exchanges data with may still be coded to the previous version.

- The semantics of an element do not change with new versions of the Schema. Thus, a <location> element always means "position", although in the version 1 Schema its contents may be <lat> and <lon> whereas in the version 2 Schema its contents may be <x> and <y>. The representation of <location> may change between versions, but its semantics remain constant.

- Instance documents use the schemaLocation attribute to reference the XML Schema(s).

Given the above system characterization we now state the problem.

Problem Statement

How can a system be designed to minimize breakage with each new version of the XML Schema? Specifically, which strategies can be employed to minimize breakage when designing these three things:

- XML Schemas

- Instance documents

- Applications

Categories of Schema Changes that Impact Instance Documents and Applications

There are six categories of changes to an XML Schema that can impact instance documents and applications:

- Namespace: the new version (of the XML Schema) could have a different namespace (i.e., targetNamespace).

- Location: the new version could physically reside at a new location (URL).

- Change: the new version could have changed the contents of an element.

- Shuffle: the new version could have reorganized the data in some way, such as by changing the order of elements.

- Remove: the new version could have removed an element or attribute that was previously in the old version.

- Add: the new version could have added an element or attribute that was not in the old version.

Note: Many kinds of changes other than those listed above could occur in an XML Schema. However, they are internal to the Schema and have no manifestation in instance documents.

Below we discuss how to mitigate the impact of each of these changes.

1. Namespace-Aware Applications

Most XML applications are "namespace aware". That is, the application is designed to process elements belonging to a specific namespace.

For example, an XML Stylesheet Language Transformations (XSLT) Processor is an application which understands the XSLT namespace:

      http://www.w3.org/1999/XSL/Transform

Concretely, this means that an XSLT Processor (application) knows how to process elements such as <template>, <for-each>, <if>, etc., provided the elements are associated with the XSLT namespace.

Changing namespaces results in breaking namespace-aware applications. This brings us to our first recommendation:

Recommendation 1: To avoid breaking namespace-aware applications with each new version of an XML Schema, use the same namespace for all versions.
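
The effect of a namespace change on a namespace-aware application can be sketched in a few lines of Python, using the standard library's xml.etree.ElementTree. This is only an illustration: the namespace URIs, the sample documents, and the find_locations helper are hypothetical and do not come from any real system.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace that the application was coded against.
KNOWN_NS = "http://www.example.org/aircraft"

def find_locations(xml_text):
    """Return the <location> elements that belong to KNOWN_NS."""
    root = ET.fromstring(xml_text)
    # ElementTree spells a qualified name as "{namespace}localname".
    return list(root.iter(f"{{{KNOWN_NS}}}location"))

# Two hypothetical instance documents that differ only in their namespace.
same_ns = '<aircraft xmlns="http://www.example.org/aircraft"><location/></aircraft>'
new_ns = '<aircraft xmlns="http://www.example.org/aircraft/v2"><location/></aircraft>'

print(len(find_locations(same_ns)))  # the element is found
print(len(find_locations(new_ns)))   # nothing is found: the namespace changed
```

The application code is identical in both calls. Only the namespace of the incoming document differs, and that alone is enough to make the <location> element invisible to the application.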

2. Place the New Version of a Schema at a New Location to Avoid Breaking Old Instance Documents

Suppose that a new version of an XML Schema is created (using the same namespace, as described above), and that the new version simply overwrites the old version. That is, the new version has the same filename and the same URL location as the old Schema. Depending on the kinds of changes made, this may break all instance documents that were written to conform to the old Schema, because their schemaLocation now resolves to a Schema they no longer conform to.

Recommendation 2: To prevent breaking old instance documents give the new Schema version a different filename or a different URL location or both.
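
As a rough illustration, the Python sketch below reads the schemaLocation attribute of two instance documents that share a namespace but point at different Schema files. The namespace, the URLs, and the schema_locations helper are assumptions made up for the example.

```python
import xml.etree.ElementTree as ET

XSI_NS = "http://www.w3.org/2001/XMLSchema-instance"

def schema_locations(xml_text):
    """Map each namespace to the Schema URL named in xsi:schemaLocation."""
    root = ET.fromstring(xml_text)
    # The attribute value is a whitespace-separated list of
    # "namespace location" pairs.
    tokens = root.get(f"{{{XSI_NS}}}schemaLocation", "").split()
    return dict(zip(tokens[0::2], tokens[1::2]))

# Hypothetical documents: same namespace, one Schema file per version.
v1_doc = '''<aircraft xmlns="http://www.example.org/aircraft"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.example.org/aircraft
                        http://www.example.org/schemas/aircraft-v1.xsd"/>'''
v2_doc = v1_doc.replace("aircraft-v1.xsd", "aircraft-v2.xsd")

print(schema_locations(v1_doc))
print(schema_locations(v2_doc))
```

Because the version 1 file is left in place, the old document continues to validate against the Schema it was written for, while new documents reference the version 2 file.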

3. Dealing with Change to an Element's Content Model

A common occurrence when creating a new version of an XML Schema is to change an element's content. (The technical expression is: "change an element's content model")

For example, in a version 1 Schema the <location> element may have been declared to contain <lat> and <lon>, whereas in version 2 its contents may be <x> and <y>.

Suppose that an application receives an instance document which conforms to the latest version of the Schema, and that the application is still coded to the previous version. The application parses through the instance document and arrives, say, at the <location> element. How will the application recognize that <location>'s content model has changed?

It would be useful if the application could consult the parser: "What's the type (content model) of <location>?" If the type is not one that it expects then the application must decide how to proceed.

There are many possible courses of action that an application may take when it encounters an element with an unfamiliar content model. For example it may (1) simply skip the <location> element, or it may (2) attempt to dynamically understand the new content model by consulting an ontology. Which action is taken depends on the application and is beyond the scope of this paper.

How can we help an application recognize an element's type? That is, how do we enable an application to determine the type of each element it encounters? Answer: the XML Schema must be designed to provide explicit type information.

Recommendation 3: To facilitate an application in recognizing that an element's content has changed, don't use anonymous types. Instead, use named types.

Example. Do not design your Schema like this:

<element name="location">
    <complexType>
        <sequence>
            <element name="x" type="decimal"/>
            <element name="y" type="decimal"/>
        </sequence>
    </complexType>
</element>

What is location's type? Answer: it is anonymous. This Schema is not designed to facilitate an application in obtaining type information.

Instead, design your Schema like this:

<complexType name="locationType-x_y_version">
    <sequence>
        <element name="x" type="decimal"/>
        <element name="y" type="decimal"/>
    </sequence>
</complexType>

<element name="location" type="locationType-x_y_version"/>

What is location's type? Answer: the named type, locationType-x_y_version. Thus, this Schema is designed to facilitate an application in obtaining type information.

Suppose that a Schema designer follows Recommendation 3 and always uses named types. This will enable applications to query a parser for type information, e.g., "What's the type of <location>?" The parser will reply with: "the type is locationType-x_y_version". If this is a type that the application did not expect (i.e., was not coded to understand) then it will take appropriate steps (as described above).

As a parser validates an instance document against a Schema, it collects from the Schema information about each element in the instance document (such as the datatype of each element). This collection of information is called the Post-Schema-Validation Infoset (PSVI). It is the PSVI that allows a parser to answer type queries such as the one above.
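
Python's standard library does not expose a PSVI, so the sketch below is only a rough stand-in: it reads the declared type name of a global element directly from the Schema document. The declared_type helper is hypothetical, and a real schema-aware parser would report this information during validation instead.

```python
import xml.etree.ElementTree as ET

XSD_NS = "http://www.w3.org/2001/XMLSchema"

def declared_type(schema_text, element_name):
    """Return the named type of a global element declaration.

    Returns None if the element is undeclared or its type is anonymous.
    """
    schema = ET.fromstring(schema_text)
    # findall() with a plain tag looks only at direct children of <schema>,
    # i.e. at global declarations.
    for decl in schema.findall(f"{{{XSD_NS}}}element"):
        if decl.get("name") == element_name:
            return decl.get("type")
    return None

anonymous = '''<schema xmlns="http://www.w3.org/2001/XMLSchema">
  <element name="location">
    <complexType>
      <sequence>
        <element name="x" type="decimal"/>
        <element name="y" type="decimal"/>
      </sequence>
    </complexType>
  </element>
</schema>'''

named = '''<schema xmlns="http://www.w3.org/2001/XMLSchema">
  <complexType name="locationType-x_y_version">
    <sequence>
      <element name="x" type="decimal"/>
      <element name="y" type="decimal"/>
    </sequence>
  </complexType>
  <element name="location" type="locationType-x_y_version"/>
</schema>'''

print(declared_type(anonymous, "location"))  # no type name is available
print(declared_type(named, "location"))      # the named type is reported
```

With the anonymous design there is simply no name to report, which is the practical reason for preferring named types.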

Let's continue with the <location> example. Above we saw the motivation for using named types - it enables an application to easily discover an element's content model. Of course, if a new version of a Schema is created and <location>'s type is changed but the new type is given the same name as the old type, then it defeats the whole purpose of type information. This leads us to the next recommendation:

Recommendation 4: If you change a type when you create a new version of a Schema then give the type a different name.

Example. Suppose that in the version 1 Schema <location> has this as its contents: <lat> and <lon>. The Schema declares this named type:

<complexType name="locationType">
    <sequence>
        <element name="lat" type="decimal"/>
        <element name="lon" type="decimal"/>
    </sequence>
</complexType>

<element name="location" type="locationType"/>

Now suppose that in the version 2 Schema the contents of <location> are changed to <x> and <y>. It is important to give a new name to <location>'s type:

<complexType name="locationType-x_y_version">
    <sequence>
        <element name="x" type="decimal"/>
        <element name="y" type="decimal"/>
    </sequence>
</complexType>

<element name="location" type="locationType-x_y_version"/>

Thus, if a version 1 application receives a version 2 instance document then, when it parses down to the <location> element, it will be able to easily recognize that <location>'s content model has changed (it has changed from locationType to locationType-x_y_version).
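
One way a version 1 application might act on this information is sketched below in Python. It assumes that a schema-aware parser has already reported the element's type name and its child values; the HANDLERS table and the function names are hypothetical. Skipping the element is just one of the possible actions mentioned earlier.

```python
def read_lat_lon(fields):
    """Handler for the version 1 content model (<lat> and <lon>)."""
    return ("lat/lon", float(fields["lat"]), float(fields["lon"]))

# Type names this hypothetical version 1 application was coded to understand.
HANDLERS = {"locationType": read_lat_lon}

def process_location(type_name, fields):
    """Dispatch on the reported type name; skip unfamiliar content models."""
    handler = HANDLERS.get(type_name)
    if handler is None:
        # Unfamiliar content model: skip it rather than misread it.
        return None
    return handler(fields)

print(process_location("locationType", {"lat": "38.9", "lon": "-77.0"}))
print(process_location("locationType-x_y_version", {"x": "1.0", "y": "2.0"}))
```

Had the version 2 type kept the name locationType, the application would have called read_lat_lon on <x> and <y> data and failed, which is the situation Recommendation 4 is meant to prevent.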

3.b Localize Type Changes

Suppose that the <location> element is nested within an <aircraft> element, e.g.,

<complexType name="aircraftType">
    <sequence>
        <element name="location" type="locationType-x_y_version"/>
    </sequence>
</complexType>

<element name="aircraft" type="aircraftType"/>

Technically, the contents of <aircraft> have changed, since the contents of <location> have changed. Should the type name for <aircraft> be changed? Answer: no. The reason is that we want to minimize changes. That is, we want an application to see as many familiar elements and types as possible. The aircraftType is a familiar type. It still has as its contents a <location> element. We want to preserve this familiarity.

Recommendation 5: Change the name of an element's type only if its immediate content has changed.

3.c Use a Version Attribute

An application will find it useful to have an indication of whether it can expect changes as it processes an instance document. This can be accomplished using a version attribute on the root element.

Note that this is what XSLT does. As the XSLT technology has migrated to a new version, instance documents (i.e., XSLT documents) indicate which version is being used with a version attribute on the root element.

Recommendation 6: Use a version attribute on the root element. If an instance document is a compound document - that is, an assembly of XML fragments - then place a version attribute on the root of each fragment.
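
A minimal Python sketch of how an application might use such attributes follows. The element names, the version strings, and the unfamiliar_fragments helper are hypothetical; the sample is a compound document whose root and embedded fragment carry different versions.

```python
import xml.etree.ElementTree as ET

# Versions this hypothetical application was coded to understand.
SUPPORTED_VERSIONS = {"1.0"}

def unfamiliar_fragments(xml_text):
    """Return (tag, version) for every element whose version is unsupported."""
    root = ET.fromstring(xml_text)
    return [
        (elem.tag, elem.get("version"))
        for elem in root.iter()
        if elem.get("version") is not None
        and elem.get("version") not in SUPPORTED_VERSIONS
    ]

# A compound document: the root and one fragment each state their version.
doc = '''<flight version="1.0">
  <aircraft version="2.0">
    <location><x>1.0</x><y>2.0</y></location>
  </aircraft>
</flight>'''

print(unfamiliar_fragments(doc))
```

The application learns up front that the <aircraft> fragment may contain unfamiliar content, while the rest of the document can be processed as usual.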

4. Effect of Shuffling Elements

A new version of a Schema may make a change as simple as reordering the contents of an element. For example, in a version 1 Schema the order may be A, B, C, e.g.,

<complexType name="...">
    <sequence>
        <element name="A" type="..."/>
        <element name="B" type="..."/>
        <element name="C" type="..."/>
    </sequence>
</complexType>

In the version 2 Schema the order may be changed to B, C, A, e.g.,

<complexType name="...">
    <sequence>
        <element name="B" type="..."/>
        <element name="C" type="..."/>
        <element name="A" type="..."/>
    </sequence>
</complexType>

If an application is coded to expect a certain ordering of the data then the new version of the Schema will break the application. To avoid this an application should never depend on specific ordering of data. It should locate the data using the tags.

Recommendation 7: Applications should use the tag names to locate data in instance documents. Applications should be designed to anticipate that the order of tags may change.

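Thus, from an application's point of view, the ordering imposed by a Schema's <sequence> should be treated as notional only: a validator still enforces it, but application logic should not depend on it.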
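
The difference between the two styles of access can be sketched in Python as follows. The <record> wrapper element and the two helper functions are assumptions made for the illustration; A, B, and C are the elements from the example above.

```python
import xml.etree.ElementTree as ET

def read_by_position(xml_text):
    """Fragile: assumes that the first child is A."""
    return ET.fromstring(xml_text)[0].text

def read_by_tag(xml_text):
    """Robust: finds A wherever it appears among the children."""
    return ET.fromstring(xml_text).findtext("A")

v1 = "<record><A>a</A><B>b</B><C>c</C></record>"  # order A, B, C
v2 = "<record><B>b</B><C>c</C><A>a</A></record>"  # order B, C, A

print(read_by_position(v1), read_by_position(v2))  # second value is B's data
print(read_by_tag(v1), read_by_tag(v2))            # A's data both times
```

The positional reader silently returns the wrong data for the version 2 document, whereas the tag-based reader is unaffected by the reordering.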

5. Effect of Removing an Element or Attribute

Creating a new version of an XML Schema may result in removing an element or attribute. Consider an application that has not been upgraded to the new version, and receives an instance document that conforms to the new version. The application must decide whether the lack of the element or attribute is catastrophic or whether it can live without the information. The action taken is application-specific (and is outside the scope of this paper).

6. Effect of Adding an Element or Attribute

Creating a new version of an XML Schema may result in adding an element or attribute. Consider an application that has not been upgraded to the new version, and receives an instance document that conforms to the new version. The application must decide what to do with the additional information. Again, what action is taken is application-specific.
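
Although the action taken is application-specific, one common pattern for both removal and addition can be sketched in Python: ignore elements the application does not know, and report which of the elements it requires are missing. The REQUIRED and OPTIONAL sets, the element names <altitude> and <heading>, and the read_location helper are all hypothetical.

```python
import xml.etree.ElementTree as ET

REQUIRED = {"lat", "lon"}  # hypothetical: the application cannot work without these
OPTIONAL = {"altitude"}    # hypothetical: useful but not essential

def read_location(xml_text):
    """Return (values, missing_required, ignored_tags) for a <location> document."""
    root = ET.fromstring(xml_text)
    values, ignored = {}, []
    for child in root:
        if child.tag in REQUIRED | OPTIONAL:
            values[child.tag] = child.text
        else:
            ignored.append(child.tag)  # unknown (added) element: skip it
    missing = sorted(REQUIRED - set(values))
    return values, missing, ignored

# Suppose the newer version removed <altitude> and added <heading>.
newer = "<location><lat>38.9</lat><lon>-77.0</lon><heading>90</heading></location>"
values, missing, ignored = read_location(newer)
print(values, missing, ignored)
```

Here the removal of an optional element and the addition of an unknown one are both tolerated. A non-empty list of missing required elements would be the signal that the application cannot proceed.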

What to do when an Application Breaks

The above recommendations will help mitigate breakage due to Schema changes. However, they do not guarantee that applications will not break. An old application may receive a new instance that is missing crucial information, or the content model of a crucial element may have changed to a type that cannot be dynamically understood.

To anticipate such occurrences it will be beneficial to institute a system protocol that specifies what actions should be taken by applications when breakage occurs. One possible protocol is for an application to respond to the sender with a fault message.

Recommendation 8: Define a system-wide protocol (e.g., fault reporting mechanism) to be used when an application is unable to process an instance document it receives from another application.
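
What a fault message might look like is up to the system designers. The Python sketch below shows one hypothetical shape: the <fault>, <reason>, <element>, and <documentVersion> names are invented for this example and are not part of any standard.

```python
import xml.etree.ElementTree as ET

def build_fault(reason, element, version):
    """Build a (hypothetical) fault message to return to the sender."""
    fault = ET.Element("fault")
    ET.SubElement(fault, "reason").text = reason
    ET.SubElement(fault, "element").text = element
    ET.SubElement(fault, "documentVersion").text = version
    # encoding="unicode" yields a str rather than bytes.
    return ET.tostring(fault, encoding="unicode")

print(build_fault("unrecognized content model", "location", "2.0"))
```

Whatever its exact form, the message should tell the sender which part of the document could not be processed and why, so that the sender can fall back to an older version or escalate the problem.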

Summary