
A Single XML Vocabulary, Customized for each Subcommunity

Multiple Subcommunities within a Community

Frequently a community creates a single XML vocabulary, but subcommunities within it have different perspectives on what data is relevant and needed. Several approaches to dealing with this situation are discussed below.

Problem Statement

How do you create a single XML vocabulary, and validate that XML vocabulary, for a community which has subcommunities that have overlapping but different data needs?

Example

Consider the book community. It comprises book sellers, book distributors, and book printers.

They have overlapping, but different data needs.

For example, the data needed by a book seller is Title, Author, Date, ISBN, and Publisher.

The book distributor has many of the same data needs, but also some differences: Title, Author, Size, Weight, and MailingCost.

And the book printer has overlapping but different needs: Size and NumPages.

How does the book community deal with such differing needs?

Approach #1 - Make Everything Optional

One approach is to define a schema where everything is optional, e.g.

book.xsd

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        targetNamespace="http://www.books.org"
        elementFormDefault="qualified">

    <element name="Book">
        <complexType>
            <sequence>
                <element name="Title" minOccurs="0" type="string"/>
                <element name="Author" minOccurs="0" type="string"/>
                <element name="Date" minOccurs="0" type="gYear"/>
                <element name="ISBN" minOccurs="0" type="string"/>
                <element name="Publisher" minOccurs="0" type="string"/>
                <element name="Size" minOccurs="0" type="string"/>
                <element name="Weight" minOccurs="0" type="string"/>
                <element name="MailingCost" minOccurs="0" type="string"/>
                <element name="NumPages" minOccurs="0" type="nonNegativeInteger"/>
            </sequence>
        </complexType>
    </element>

</schema>

Then, each subcommunity in the book community uses just the elements it needs, ignoring the others. Thus, the book seller creates XML instance documents composed of Title, Author, Date, ISBN, and Publisher, e.g.

book-seller.xml

<?xml version="1.0" encoding="UTF-8"?>
<Book xmlns="http://www.books.org">
    <Title>The Wisdom of Crowds</Title>
    <Author>James Surowiecki</Author>
    <Date>2005</Date>
    <ISBN>0-385-72170-6</ISBN>
    <Publisher>Anchor Books</Publisher>
</Book>

The book distributor creates XML instance documents comprised of Title, Author, Size, Weight, and MailingCost, e.g.

book-distributor.xml

<?xml version="1.0" encoding="UTF-8"?>
<Book xmlns="http://www.books.org">
    <Title>The Wisdom of Crowds</Title>
    <Author>James Surowiecki</Author>
    <Size>5" x 8"</Size>
    <Weight>15oz</Weight>
    <MailingCost>$3.90</MailingCost>
</Book>

And the book printer creates XML instance documents comprised of Size and NumPages, e.g.

book-printer.xml

<?xml version="1.0" encoding="UTF-8"?>
<Book xmlns="http://www.books.org">
    <Size>5" x 8"</Size>
    <NumPages>301</NumPages>
</Book>

Advantage

This approach has the advantage that the schema is very simple. Each subcommunity can also pick and choose whatever elements it wants, and can easily change its choices.

Disadvantage

The disadvantage of this approach is that validation is weak. For example, a book seller may accidentally forget to provide the Publisher in his instance document:

<?xml version="1.0" encoding="UTF-8"?>
<Book xmlns="http://www.books.org">
    <Title>The Wisdom of Crowds</Title>
    <Author>James Surowiecki</Author>
    <Date>2005</Date>
    <ISBN>0-385-72170-6</ISBN>
</Book>

Validation would not catch this error.

Approach #2 - Create Subschemas

A second approach is to create a master book schema, and then each subcommunity selectively filters, retaining only the desired set of elements. Here is the master book schema:

book-community.xsd

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        targetNamespace="http://www.books.org"
        xmlns:bk="http://www.books.org"
        elementFormDefault="qualified">

    <complexType name="BookType">
        <sequence>
            <element name="Title" minOccurs="0" type="string"/>
            <element name="Author" minOccurs="0" type="string"/>
            <element name="Date" minOccurs="0" type="gYear"/>
            <element name="ISBN" minOccurs="0" type="string"/>
            <element name="Publisher" minOccurs="0" type="string"/>
            <element name="Size" minOccurs="0" type="string"/>
            <element name="Weight" minOccurs="0" type="string"/>
            <element name="MailingCost" minOccurs="0" type="string"/>
            <element name="NumPages" minOccurs="0" type="nonNegativeInteger"/>
        </sequence>
    </complexType>

    <element name="Book" type="bk:BookType" />

</schema>

The book seller subcommunity creates a schema that retains Title, Author, Date, ISBN, and Publisher (making each of them required) and filters out Size, Weight, MailingCost, and NumPages:

book-seller.xsd

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        targetNamespace="http://www.books.org"
        xmlns:bk="http://www.books.org"
        elementFormDefault="qualified">

    <redefine schemaLocation="book-community.xsd" >
        <complexType name="BookType">
            <complexContent>
                <restriction base="bk:BookType">
                    <sequence>
                        <element name="Title" type="string"/>
                        <element name="Author" type="string"/>
                        <element name="Date" type="gYear"/>
                        <element name="ISBN" type="string"/>
                        <element name="Publisher" type="string"/>
                    </sequence>
                </restriction>
            </complexContent>
        </complexType>
    </redefine>

</schema>

The book distributor subcommunity creates another schema, which does the appropriate filtering:

book-distributor.xsd

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        targetNamespace="http://www.books.org"
        xmlns:bk="http://www.books.org"
        elementFormDefault="qualified">

    <redefine schemaLocation="book-community.xsd" >
        <complexType name="BookType">
            <complexContent>
                <restriction base="bk:BookType">
                    <sequence>
                        <element name="Title" type="string"/>
                        <element name="Author" type="string"/>
                        <element name="Size" type="string"/>
                        <element name="Weight" type="string"/>
                        <element name="MailingCost" type="string"/>
                    </sequence>
                </restriction>
            </complexContent>
        </complexType>
    </redefine>

</schema>

And the book printer subcommunity creates yet another schema, which does the appropriate filtering:

book-printer.xsd

<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        targetNamespace="http://www.books.org"
        xmlns:bk="http://www.books.org"
        elementFormDefault="qualified">

    <redefine schemaLocation="book-community.xsd" >
        <complexType name="BookType">
            <complexContent>
                <restriction base="bk:BookType">
                    <sequence>
                        <element name="Size" type="string"/>
                        <element name="NumPages" type="nonNegativeInteger"/>
                    </sequence>
                </restriction>
            </complexContent>
        </complexType>
    </redefine>

</schema>

Advantage

There is strong validation.
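
To illustrate, the incomplete seller document from approach #1 can be pointed at book-seller.xsd. This is only a sketch: the xsi:schemaLocation hint is an assumption about how the validator is told which schema to use, and many validators take the schema as a separate argument instead. Because the restricted BookType declares Publisher without minOccurs="0", a validator should now report the missing element.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Expected to be invalid against book-seller.xsd: Publisher is required but missing -->
<Book xmlns="http://www.books.org"
      xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://www.books.org book-seller.xsd">
    <Title>The Wisdom of Crowds</Title>
    <Author>James Surowiecki</Author>
    <Date>2005</Date>
    <ISBN>0-385-72170-6</ISBN>
</Book>
```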

Disadvantage

Instead of one schema, there are now four schemas to manage. The subcommunity-specific schemas are also somewhat complex: they rely on the relatively obscure <redefine> element, which is deprecated in XML Schema 1.1.
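
XML Schema 1.1 introduced <override> as the successor to <redefine>. The following is a minimal sketch of how the book seller schema might look with it, assuming an XSD 1.1 processor is available. It is an illustration, not part of the approach as described above. Note one difference in behavior: the overriding BookType simply replaces the master definition rather than being derived from it by restriction, so the processor no longer checks that the subcommunity's type is a valid subset of the master.

```xml
<?xml version="1.0"?>
<schema xmlns="http://www.w3.org/2001/XMLSchema"
        targetNamespace="http://www.books.org"
        elementFormDefault="qualified">

    <!-- Requires an XSD 1.1 processor; <override> does not exist in XSD 1.0 -->
    <override schemaLocation="book-community.xsd">
        <!-- Replaces the master BookType with the seller's required elements -->
        <complexType name="BookType">
            <sequence>
                <element name="Title" type="string"/>
                <element name="Author" type="string"/>
                <element name="Date" type="gYear"/>
                <element name="ISBN" type="string"/>
                <element name="Publisher" type="string"/>
            </sequence>
        </complexType>
    </override>

</schema>
```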

Approach #3 - Layer Business Rules on Top of the Grammar

The third approach is to use the simple schema from approach #1, and then add a business-rules layer on top to constrain it in a way that is appropriate for a subcommunity. The business-rules can be expressed using Schematron.

The lower layer is the grammar (element declarations), and the upper layer is each subcommunity's business rules (Schematron).

Here is the Schematron schema that implements the constraints needed by the book seller subcommunity:

book-seller.sch

<?xml version="1.0"?>
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">

   <sch:ns uri="http://www.books.org"
           prefix="bk" />

   <sch:pattern name="Book Sellers">

      <sch:p>The book data required for a seller is 
             title, author, date, ISBN, and publisher.</sch:p> 

      <sch:rule context="bk:Book">

         <sch:assert test="count(bk:Title) = 1 and
                           count(bk:Author) = 1 and
                           count(bk:Date) = 1 and
                           count(bk:ISBN) = 1 and
                           count(bk:Publisher) = 1 and
                           count(*[not(self::bk:Title or 
                                       self::bk:Author or 
                                       self::bk:Date or 
                                       self::bk:ISBN or 
                                       self::bk:Publisher)]) = 0">
             The book data required for a seller is 
             title, author, date, ISBN, and publisher.
         </sch:assert>

      </sch:rule>

   </sch:pattern>

</sch:schema>

Here is the Schematron schema that implements the constraints needed by the book distributor subcommunity:

book-distributor.sch

<?xml version="1.0"?>
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">

   <sch:ns uri="http://www.books.org"
           prefix="bk" />

   <sch:pattern name="Book Distributors">

      <sch:p>The book data required for a distributor is 
             title, author, size, weight, and mailing cost.</sch:p> 

      <sch:rule context="bk:Book">

         <sch:assert test="count(bk:Title) = 1 and
                           count(bk:Author) = 1 and
                           count(bk:Size) = 1 and
                           count(bk:Weight) = 1 and
                           count(bk:MailingCost) = 1 and
                           count(*[not(self::bk:Title or 
                                       self::bk:Author or 
                                       self::bk:Size or 
                                       self::bk:Weight or 
                                       self::bk:MailingCost)]) = 0">
             The book data required for a distributor is 
             title, author, size, weight, and mailing cost.
         </sch:assert>

      </sch:rule>

   </sch:pattern>

</sch:schema>

And here is the Schematron schema that implements the constraints needed by the book printer subcommunity:

book-printer.sch

<?xml version="1.0"?>
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">

   <sch:ns uri="http://www.books.org"
           prefix="bk" />

   <sch:pattern name="Book Printers">

      <sch:p>The book data required for a printer is 
             the size and number of pages.</sch:p> 

      <sch:rule context="bk:Book">

         <sch:assert test="count(bk:Size) = 1 and
                           count(bk:NumPages) = 1 and
                           count(*[not(self::bk:Size or self::bk:NumPages)]) = 0">
             The book data required for a printer is 
             the size and number of pages.
         </sch:assert>

      </sch:rule>

   </sch:pattern>

</sch:schema>
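
Each schema above bundles all of its constraints into a single assertion, so a failure reports the same message whichever constraint was violated. One possible refinement, sketched here for the book printer, is to split the test into one assertion per constraint. The constraints are intended to be the same as in book-printer.sch; only the messages are new, and their wording is illustrative.

```xml
<?xml version="1.0"?>
<sch:schema xmlns:sch="http://www.ascc.net/xml/schematron">

   <sch:ns uri="http://www.books.org"
           prefix="bk" />

   <sch:pattern name="Book Printers">

      <sch:rule context="bk:Book">

         <!-- Exactly one Size -->
         <sch:assert test="count(bk:Size) = 1">
             A printer's Book must contain exactly one Size.
         </sch:assert>

         <!-- Exactly one NumPages -->
         <sch:assert test="count(bk:NumPages) = 1">
             A printer's Book must contain exactly one NumPages.
         </sch:assert>

         <!-- No child elements other than Size and NumPages -->
         <sch:assert test="count(*[not(self::bk:Size or self::bk:NumPages)]) = 0">
             A printer's Book may contain only Size and NumPages.
         </sch:assert>

      </sch:rule>

   </sch:pattern>

</sch:schema>
```

The same split could be applied to the seller and distributor schemas, at the cost of somewhat longer files.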

Concurrent Validation

To tie the two layers together, what is needed is to validate a subcommunity's XML instance document against the grammar-based schema plus the appropriate Schematron schema. This "concurrent validation" can be accomplished using an NVDL meta-schema.

Here is the NVDL document to perform concurrent validation of the book schema and the book seller's Schematron schema:

book-seller.nvdl

<?xml version="1.0"?>
<rules xmlns="http://purl.oclc.org/dsdl/nvdl/ns/structure/1.0">

   <namespace ns="http://www.books.org">
     <validate schema="book.xsd" />
     <validate schema="book-seller.sch" />
   </namespace>

</rules>

Here is the NVDL document to perform concurrent validation of the book schema and the book distributor's Schematron schema:

book-distributor.nvdl

<?xml version="1.0"?>
<rules xmlns="http://purl.oclc.org/dsdl/nvdl/ns/structure/1.0">

   <namespace ns="http://www.books.org">
     <validate schema="book.xsd" />
     <validate schema="book-distributor.sch" />
   </namespace>

</rules>

Here is the NVDL document to perform concurrent validation of the book schema and the book printer's Schematron schema:

book-printer.nvdl

<?xml version="1.0"?>
<rules xmlns="http://purl.oclc.org/dsdl/nvdl/ns/structure/1.0">

   <namespace ns="http://www.books.org">
     <validate schema="book.xsd" />
     <validate schema="book-printer.sch" />
   </namespace>

</rules>

Advantage

The layered approach maintains a clean separation of concerns — it keeps the task of defining the XML vocabulary separate from the task of constraining the XML vocabulary based on a subcommunity's particular business data needs. Thus, the tasks of defining the XML vocabulary, specifying each subcommunity's business rules, and validating the instance documents can be divided up and assigned to the appropriate people/department. The XML Schema is simple. The Schematron schemas are simple. And the NVDL meta-schema is simple.

Disadvantage

There are now seven files to maintain.

Approach #4 - Use Active Schema Language (ASL)

Philippe Poulard describes the ASL approach to solving the problem.

Approach #5 - Retain Subcommunity-Specific XML Vocabulary and Converge at a Higher Level

Getting agreement on a common XML vocabulary among a group of subcommunities that have different needs and terminology is difficult. Doing so sometimes results in a vocabulary that is a least-common-denominator, where the richness of an individual subcommunity's vocabulary is lost. And the cost of creating and maintaining a global vocabulary may exceed its benefit. Sometimes it's best to retain each subcommunity's XML vocabulary.

In any case, prior to attempting to create a community-wide XML vocabulary, a "usage analysis" should be undertaken. "How much direct exchange is there between subcommunities?" "If one subcommunity doesn't play by the rules, how will that impact the rest of the community?"

Rather than converging on a community-wide XML vocabulary, it may be more beneficial to defer convergence to a higher level, at the transformation level or query level. Thus each subcommunity continues to use its own vocabulary. When an exchange is made with another subcommunity, the global community-agreed-to transformation is utilized, or the global community-agreed-to query mechanism is utilized.
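
As a rough sketch of convergence at the transformation level, a community-agreed-to XSLT stylesheet might map one subcommunity's vocabulary onto another's. Everything specific in this example is hypothetical and invented purely for illustration: the two namespaces, the distributor's Item and Dimensions elements, and the printer's PrintJob and TrimSize elements. None of them come from the book vocabulary used earlier.

```xml
<?xml version="1.0"?>
<xsl:stylesheet version="1.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:d="http://www.example.org/distributor"
                xmlns="http://www.example.org/printer"
                exclude-result-prefixes="d">

    <xsl:output method="xml" indent="yes"/>

    <!-- Hypothetical mapping: the distributor's Item becomes the printer's PrintJob -->
    <xsl:template match="/d:Item">
        <PrintJob>
            <!-- The distributor's Dimensions value is carried over as the printer's TrimSize -->
            <TrimSize><xsl:value-of select="d:Dimensions"/></TrimSize>
        </PrintJob>
    </xsl:template>

</xsl:stylesheet>
```

With such a stylesheet, each subcommunity keeps validating against its own schema, and only the mapping between vocabularies needs community-wide agreement.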

Summary

The above discussion illustrates how a community can create a single XML vocabulary that can be appropriately customized to the needs of differing subcommunities.

Four approaches were identified. The first approach created a smorgasbord schema — all the elements are optional and each subcommunity picks the elements that are relevant. The second approach used a hierarchy of schemas; each subcommunity creates a schema that filters the base schema. The third approach used a layering approach — a simple grammar-based schema defined the XML vocabulary, Schematron rules were defined to constrain the XML vocabulary in a way appropriate to each subcommunity, and NVDL was used to tie together the grammar-based schema with the Schematron schema. The fourth approach is to use a powerful schema language called ASL.

All four approaches have their advantages and disadvantages. As always, understand the approaches, and choose the approach that is right for your situation.

Finally, a fifth approach was identified. This approach asserts that attempting to create and maintain a community-wide XML vocabulary is folly; it is better to allow each subcommunity to retain its own terminology. This approach suggests that convergence may better occur at a higher level than the XML vocabulary, namely at the transformation or query level.

Acknowledgements

Thanks to the following people for their input to this document:

Last Updated: July 20, 2008