Schema Repositories: What's at Stake?
January 26, 2000
Table of Contents |
Part One |
No race has been at once as enigmatic and as heated as the race to be Your Source for XML Schemas and DTDs. Since the publication of XML 1.0, the schema-writing public has been graciously invited to deposit its intellectual property on other folks' web sites. (Here and throughout the article I use "schema" in the general sense, including XML Schemas, DTDs, etc.) When the invitations came from out of the blue, there was no incentive to respond; but with the world's largest software company sending out invitations you can't refuse, and the compulsion to offer an alternative now galvanizing support for the Organization for the Advancement of Structured Information Standards (OASIS), it's time to figure out what's at stake.
Here's a theory: It's not the siting and cataloging of schemas that is important; it's the potential relationship to the content of the schemas that is the ultimate prize. If the sole intent was to be a source of schemas and useful information about schemas, then either contending organization might have taken seriously the question of what the schema-seeker needs right now and in the near term. But to date, neither repository seems grounded in real-world, current requirements. The registry and repository sites launched by Microsoft and OASIS (BizTalk and XML.org, respectively) will come into their own only when they can dish out schemas that are part of a comprehensive and cohesive framework. In fact, both sites are being developed in conjunction with such frameworks.
Before talking about what the schema-seeking public really needs, who's calling the shots, and where all this is leading, let's see what the BizTalk and XML.org repository efforts offer and how they compare.
| | BizTalk.org repository | XML.org repository |
|---|---|---|
| Control | Microsoft Corp. | Sponsor organizations: IBM, Oracle, SAP, Sun Microsystems, CommerceOne, DataChannel, Documentum, SoftQuad |
| Advisory | Membership not published, but partial list of 29 organizations, including OAG, the DoD, CommerceOne, and RosettaNet, supplied on request.* | A project of OASIS, so it is possible to say that OASIS, with 150 members, is an "advisory" to XML.org. |
| Policy on Receipt of Material | Use XDR with BizTalk wrapper tags; statement on site says will support W3C schemas when done. | Use standard schema language (W3C or ISO DTDs until W3C schema language done.) |
| Services | Discovery and hosting (search for and retrieve actual schemas and supporting documents from repository) | Current: catalog of links to sites developing schemas. Future: discovery and hosting; referral of queries to alternate repositories |
| Schemas listed | Hosts 212-250 schemas searchable in 11 industry categories from more than 50 organizations. | Links to over 100 schema-producing organizations listed in 45 categories. |
| Supporting material | Sample document instance and documentation for each schema; documentation according to template | Future: DTD/schema and supporting files |
| Interface | Keyword search or select industry and organization | Browse list organized by industry and organization |
| Descriptive documents | BizTalk Framework 1.0 Independent Document Specification (applies to business schema language more than repository). Linked from BizTalk.org. | OASIS R&R Technical Committee; History |
* The list is not posted on BizTalk's site, but is "public." In response to a query, Chris Kurt supplied this list: "American Petroleum Institute, Ariba, Baan, Boeing, Clarus, CommerceOne, Compaq, Concur, Dell, DISA, EXE, Extricity, Ford, GEIS, Harbinger, i2, Intelysis, JDEdwards, US DOD, Merrill Lynch, Neon, Open Applications Group, Pivotal, Reuters, RosettaNet, SAP, Siebel, UPS, webMethods, and others."
The BizTalk Repository
It's hard to tell exactly how many schemas are on the BizTalk server, as there is no browse interface (selection is either by keyword or by industry and organization). A search on "unclassified" returns 212 hits, and press releases claim over 250 on the site. If you want a purchase order schema, you can search on those keywords and get 19 hits of various types. If you want a manufacturing purchase order schema, you can look at organizations listed as schema providers under "manufacturing," but you can't search by keyword within that category. As the number of listed schemas grows, this interface will need some work.
Presentations on BizTalk—there were three at XML '99—canonically reiterate its three components:
- the repository
- the BizTalk schemas
- the BizTalk Server, which is a commercial Microsoft product
The BizTalk Framework document deals almost exclusively with the BizTalk schema tags, and the rambling, undated "BizTalk Philosophy" does not define repository requirements. An overview of the framework says:
The BizTalk Framework Web site will be an interactive place where industry groups and developers can publish their schemas. The Web site will allow public and private publication based on the decision of the publishing organization. Once a BizTalk Framework schema is accepted and published, the repository will provide versioning and specialization support for BizTalk Framework schema adoption and alteration. The repository will support dynamic detection of schemas, processes and visualization maps connected to any given version of a BizTalk Framework schema.
A press release on December 15, 1999, gives some indication of how the repository is to be viewed:
One hundred and fifty organizations are now registered as schema publishers on www.biztalk.org. "We're far and above in the lead," says Dan Rogers, Program Manager of www.biztalk.org. "The difference between our library and others is the richness and correctness of the content." No other schema library or so-called "repository" validates the technical correctness of schemas.
The PR is consistent with presentations made in Philadelphia, which indicated that Microsoft sees this as something of a horse race based on numbers of schemas and quality of supporting documents and services. The release states, "Another important feature to look for in a schema library is run-time hosting.... Hosting allows an application that is using a schema to access the schema over the Internet at any time." I did not find any further indication of exactly what this means or how it is implemented on the current site. Advisory committee members are already privy to the draft 2.0 spec for the Framework.
XML.org
The XML.org repository is the work product of OASIS, officially an "initiative" of the consortium, which has grown from a few dozen to over 150 members, with eight sponsors putting up half a million dollars total to jumpstart the repository. (Four partner sponsors have paid a $100,000 entry tab and four affiliate sponsors have paid $25,000 to get the site up and running.) The project will adopt the specifications developed by the OASIS Registry and Repository Technical Committee, chaired by Terry Allen of CommerceOne. The actual site is under the direction of Craig Chevrier, recently hired as XML.org managing editor.
The current site catalogs and links to schema-writing organizations, from the American Institute of CPAs to the Workflow Management Coalition. This catalog is the precursor to the actual registry. By April, according to Chevrier, they hope to be opening their doors to deposit of actual schemas, as BizTalk.org does now.
The XML.org catalog has a browse interface, which, in the absence of a robust taxonomy or classification scheme, gives a better overview than the limited search and classification system of BizTalk, but again, won't scale up to thousands of entries. Chevrier says they have not yet decided on an interface, but the goal is to allow querying by keyword, application type, and industry. Browsing will be maintained as long as it stays manageable, but may become less and less viable as the volume increases. The site itself is undergoing a major revision that should be up by early February.
The documents describing the OASIS Registry posted by Allen's technical committee apply generally to XML schema repositories. According to Rogers, the BizTalk site will use the OASIS specification when it is complete, the idea being that interoperability between repositories will allow a query to be passed to an alternate source.
"We're tracking the progress of that work, and will make any change to our software that we feel is appropriate once the specification reaches a mature state and other schema libraries start implementing it and need to interoperate. We're working on defining automation interfaces for this purpose as well."
Since the Microsoft specifications are not public, the Registry TC documents apply generally to both sites.
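The idea of passing a query to an alternate source can be pictured with a minimal sketch. Everything here is an assumption for illustration: the in-memory "repositories," the `find_schema` helper, and the schema names are hypothetical, and neither site has published an interface like this.

```python
from typing import Callable, Optional

# A lookup takes a schema name and returns the schema text, or None if unknown.
Lookup = Callable[[str], Optional[str]]


def make_repository(holdings: dict[str, str]) -> Lookup:
    """Build an in-memory stand-in for a repository's query interface."""
    def lookup(name: str) -> Optional[str]:
        return holdings.get(name)
    return lookup


def find_schema(name: str,
                repositories: list[tuple[str, Lookup]]) -> Optional[tuple[str, str]]:
    """Ask each repository in turn; return (label, schema text) for the first hit."""
    for label, lookup in repositories:
        schema = lookup(name)
        if schema is not None:
            return label, schema
    return None  # no repository in the chain knows this schema


# Hypothetical holdings: the primary site lacks the schema, the alternate has it.
primary = make_repository({"PurchaseOrder": "<!ELEMENT PurchaseOrder (Item+)>"})
alternate = make_repository({"RadiologyExam": "<!ELEMENT RadiologyExam (Finding*)>"})
chain = [("primary", primary), ("alternate", alternate)]

hit = find_schema("RadiologyExam", chain)
miss = find_schema("Invoice", chain)
```

The sketch only shows the referral order. A shared specification would also have to settle how schemas are named and identified across sites, which is the harder part.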
Are Repositories Useful?
Here are the use cases projected for the repositories (explicitly for XML.org and implicitly for BizTalk.org) and summaries of why, on closer examination, I think the area of application may be significantly narrower in the near term:
- Obtain schema (and other required supporting files, such as stylesheet) automatically on receipt of a document referencing an unknown schema.
Counter: If you don't know the information model of the schema, or if it has changed, retrieving the schema won't automate interoperability. If you do know the information model underlying the schema, and you are using it on a real-time, transactional basis, you will likely download it once and maintain it locally, rather than hitting the repository server every time you need to parse an instance. If you aren't convinced by this argument, search for "purchase order" on BizTalk.org. The 19 hits (as of 1/20/00) include general and specialized documents, complementary and contradictory approaches, and pieces of larger schema frameworks. If you know ahead of time which one you want, finding it here might be convenient. If you don't know, this selection would represent the beginning of your research, not its conclusion.
- Upload schema and supporting files, thus taking the burden of being a schema server off the creator. Files may be available for archival access (slow retrieval) or utility access, where the server would require high speed and possibly high bandwidth. Posting material can also solicit useful feedback.
Counter: The arguments against utility usage seem the same as above: if you use it frequently, you will fetch your own copy once. If an update is made, you will fetch the new schema once. This process can be automated as long as the revision does not affect the relationship to your local information model and how instances are processed locally. But how will the updated processing system know that a change in datatype means a commensurate change in local processing? It seems difficult to automate this level of discretion without a set of ground rules on the range of changes possible within an "update."
- Register without deposit to gain visibility, but maintain local control from original site or alternate repository.
Counter: None. This is the library or catalog function of the repository, consonant with the archival search and browse functions. This seems quite reasonable and immediately useful.
- Browse or search for schema for new editing application. End user may not even be aware of use of XML or invocation of remote schema. (Example: I'm listing my house with a real estate broker, but don't have the right schema. My editing application hits a repository, finds and downloads the correct schema, and customizes itself for my data input.)
Counter: At XML '99, all three vendors showing XML document editing tools promoted easily customizable, schema-specific applications (Arbortext's Adept Lite, SoftQuad's XMetaL, Excosoft's Documentor). Adapting any of these to a schema today is neither an automated nor an end-user process. The vendors have done much to lower the bar, but the task still requires integration and programming. Fully automating the process will require a major chunk of work in the implementation and execution of editing tools and interfaces. I'd really love to see this, and I applaud the writers of the OASIS Use Scenarios for looking ahead—it's a refreshing change from looking backward at the word processing paradigm—but I don't think this is a use case applicable to repositories in 2000 or 2001.
In summary then, the repository use cases that are compelling, at least for the near-term, are the discover-what's-out-there, look-at-it, and evaluate-it yellow pages scenarios.
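The fetch-once argument made in the counters above, and the worry about datatype changes, can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `LocalSchemaCache` class, the `fetch` callback, and the element-to-datatype dictionaries are all hypothetical, not anything either repository provides.

```python
from typing import Callable, Dict, List, NamedTuple


class SchemaVersion(NamedTuple):
    version: str
    datatypes: Dict[str, str]  # element name -> declared datatype


class LocalSchemaCache:
    """Keep one local copy per schema; contact the repository only when needed."""

    def __init__(self, fetch: Callable[[str], SchemaVersion]) -> None:
        self._fetch = fetch  # hypothetical call to the remote repository
        self._copies: Dict[str, SchemaVersion] = {}
        self.fetch_count = 0

    def get(self, name: str) -> SchemaVersion:
        """Return the local copy, fetching only the first time it is requested."""
        if name not in self._copies:
            self._copies[name] = self._fetch(name)
            self.fetch_count += 1
        return self._copies[name]

    def refresh(self, name: str) -> List[str]:
        """Fetch the current version and list elements whose datatype changed
        or which disappeared. A non-empty list means a person has to review
        the local processing; the cache cannot decide that by itself."""
        old = self.get(name)
        new = self._fetch(name)
        self.fetch_count += 1
        self._copies[name] = new
        return sorted(
            element for element, datatype in old.datatypes.items()
            if new.datatypes.get(element) != datatype
        )


# Hypothetical repository contents, version 1.0.
published = {"PurchaseOrder": SchemaVersion("1.0", {"quantity": "int", "price": "string"})}
cache = LocalSchemaCache(lambda name: published[name])

for _ in range(1000):            # parse a thousand instances
    cache.get("PurchaseOrder")   # only the first call reaches the repository

# The publisher revises the schema: price changes from string to decimal.
published["PurchaseOrder"] = SchemaVersion("1.1", {"quantity": "int", "price": "decimal"})
needs_review = cache.refresh("PurchaseOrder")
```

The cache can report that `price` changed, but it cannot say whether the local code that handles prices still works. That judgment is the part that resists automation.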
What We Really Need in a Repository
There are essentially three levels of utility a repository could provide:
- A yellow-pages-like listing of anyone willing to pay the price of admission and conform to minimal constraints
- A reference-librarian or encyclopedia-like resource that informs and guides users to the information they really need
- A dynamic, real-time source for schema location during transactional processing
Both XML.org and BizTalk could grow into a yellow pages for schemas. However, to guide users to the right schema—that is, to be a reference library rather than a phone book—the sites will need to put some more muscle and moxie into the project and produce more than just a flat list of everything that comes their way.
Currently, BizTalk touts its validation service, but if this is anything more than a validating XML parser, it is not obvious. BizTalk also ranks schemas by what they call "use counts." This is the number of individuals registered with BizTalk who ask to be notified if a change is made to a schema. Perhaps it indicates something, but moving from "tell me if Foo.xdr gets tweaked" to "Foo.xdr is mission-critical to my business" is an unwarranted leap of faith. XML.org proposes a similar metric: tracking the number of downloads. But this won't work either. Here's why:
Let's say I post Lioras.RadiologyExam.DTD on the XML.org site, and a dozen integrators—from the Mayo Clinic to King Faisal Hospital—download the thing to see what the heck I've done. The indicators would be "high usage." Meanwhile, the American College of Radiology has ACR.RadiologyExam.DTD, which is listed for reference on both sites. But everyone in the field has been tracking the development of the document, is a member of ACR, and gets their copy directly from the ACR site. Result: "low usage."
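To put numbers on that scenario, here is a small sketch. The figures are invented for illustration (a dozen curiosity downloads on one side, a hypothetical 200 production users on the other); the only point is that the repository can rank by what it observes, which is not adoption.

```python
from typing import NamedTuple


class SchemaStats(NamedTuple):
    name: str
    repository_downloads: int  # what the repository can observe
    production_users: int      # what it cannot observe (invented figure)


stats = [
    # Downloaded by a dozen curious integrators, used by nobody.
    SchemaStats("Lioras.RadiologyExam.DTD", repository_downloads=12, production_users=0),
    # Fetched directly from the ACR site, so the repository sees almost nothing.
    SchemaStats("ACR.RadiologyExam.DTD", repository_downloads=1, production_users=200),
]

by_downloads = max(stats, key=lambda s: s.repository_downloads).name
by_adoption = max(stats, key=lambda s: s.production_users).name
```

Ranked by downloads, my DTD wins; ranked by actual use, the ACR's does. The two orderings disagree exactly where a schema-seeker most needs guidance.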
In short, measuring status or usage hasn't gotten more than a lick and a promise from either site. Users need to know what is really standards conformant. They need to find out what is used by whom; what experience others have had working with the schema; and its relationship to other schemas. If not a critical edition, at least we need an Amazon.com-like source of user feedback and a NY Times best seller list version of popularity.
Dan Rogers of BizTalk indicated that Microsoft had no plans to qualify or make judgements on schemas, and that usage indicators would become more representative as traffic to the site and use of schemas rose. Laura Walker, Executive Director of OASIS, on the other hand, suggested that "long term, OASIS will play more of a role in arbitrating the standards and offering opinions on the validity and viability of the standards." She believes that it is too early to add this layer of valuation, that the repository should get started on a "democratic" basis while experience is compiled on the various schemas. According to Walker, "More needs to be done in the process of downloading and testing, then using and applying the schemas. 12-18 months from now, this will change."
For one-stop schema shopping, a repository will not only need to guide a user to the right model, it will need to provide an unambiguous information model documenting the schema's semantics. If I'm going to map my local database to information exchanged via that schema, I need to know its information model and its relationship to other models and schemas.
The OASIS Registry TC design principles call for "providing DTDs and schemas, and an interface to their metadata, before proceeding to other matters." BizTalk-hosted schemas have rudimentary documentation on site. Neither BizTalk nor OASIS will necessarily set the context required for "semantic interoperability"—the sine qua non of the exchange world.
Semantic interoperability means that when I send you my XML instance, you not only can parse it against a known schema, but you know what the components mean and can relate them to your local information model. To pull a schema off the shelf or down from a repository site and put it to work, the schema has to be a known quantity, part of a known framework of interoperable schemas or one with an unambiguous derivation from a known information model.
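The gap between parsing an instance and knowing what it means can be shown with a short sketch. The element names, the local field names, and the `to_local_record` helper are hypothetical; the mapping table stands in for the information-model knowledge that a schema alone does not supply.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from schema element names to local database fields.
# Writing this table is the human work that semantic interoperability requires.
ELEMENT_TO_LOCAL_FIELD = {
    "PatientName": "patient.full_name",
    "ExamDate": "exam.performed_on",
}


def to_local_record(instance_xml: str) -> tuple[dict[str, str], list[str]]:
    """Return (values mapped to local fields, element names with no known local meaning)."""
    root = ET.fromstring(instance_xml.strip())
    record: dict[str, str] = {}
    unmapped: list[str] = []
    for child in root:
        local_field = ELEMENT_TO_LOCAL_FIELD.get(child.tag)
        if local_field is None:
            unmapped.append(child.tag)  # parses fine, but we don't know what it means
        else:
            record[local_field] = (child.text or "").strip()
    return record, unmapped


instance = """
<RadiologyExam>
  <PatientName>Jane Doe</PatientName>
  <ExamDate>2000-01-20</ExamDate>
  <Modality>CT</Modality>
</RadiologyExam>
"""
record, unmapped = to_local_record(instance)
```

The `Modality` element is well-formed and would validate against its schema, yet it lands in `unmapped` because nothing tells the receiver how it relates to the local model.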
While the current sites are clearly intended to rise above the level of a yellow pages, neither has addressed the requirements for qualification or documentation of their wares.
So, what are they aiming at?
The Business of Business Schemas
With a proven framework and set of interoperable vertical schemas, exchange communities could use a repository in real time. Both repository initiatives (BizTalk and XML.org) are associated with efforts to create frameworks for interoperable business schemas. The BizTalk repository is actually ancillary to the BizTalk Framework Independent Document Specification. OASIS, parent to XML.org, is co-sponsor of ebXML together with UN/CEFACT. Let's take a look at what is proposed:
| | BizTalk Framework | ebXML |
|---|---|---|
| Control | Microsoft Corp. | Sponsor organizations: OASIS, UN/CEFACT |
| Advisory | Membership not published, but partial list of 29 organizations, including OAG, the DoD, CommerceOne, and RosettaNet, supplied on request. | Invited and confirmed participants listed on site include over a hundred individuals and organizations, including several major standards organizations as well as vendors and users. |
| Status | Version 1.0 issued 11/30/99; Version 2.0 available to advisory committee. | First meeting held November 1999; task forces formed; next meeting January 31, 2000; 18-month time frame projected. |
| Sources | Microsoft specification | Submissions accepted from CEN, UN/CEFACT, CommerceOne, and other organizations |
| Objective | to build "...a set of guidelines for how to publish schemas in XML and how to use XML messages to easily integrate software programs together in order to build rich new solutions." "...to leverage what you have today—your existing data models, solutions, and application infrastructure—and adapt it for electronic commerce through the use of XML." (www.BizTalk.org) | "to research and identify the technical basis upon which the global implementation of XML can be standardized." A project "... for the exchange of electronic business data in application-to-application, application-to-person and person-to-application environments." (www.ebXML.org) |
The BizTalk Framework describes the schema for BizTalk Messages. The messages start with a transport-specific envelope, which encloses a BizTalk Document. The BizTalk Document header has delivery and manifest information and the body is the actual business document payload. The 1.0 specification addresses logical and physical addressing and point-to-point request/reply exchanges. Subsequent releases will address distribution lists, handlers, anonymous messaging, and publish/subscribe scenarios. The specification complies with the requirements of the BizTalk Server, which can route and manage the messages.
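The nesting just described (envelope, then document, then header and body) can be sketched as follows. The element names are placeholders chosen for readability, not the tag set defined in the BizTalk Framework specification, and `wrap_business_document` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def wrap_business_document(payload: ET.Element, sender: str,
                           receiver: str, doc_name: str) -> ET.Element:
    """Wrap a business document in the layers the framework describes.
    All element names below are illustrative placeholders."""
    envelope = ET.Element("TransportEnvelope")              # transport-specific wrapper
    document = ET.SubElement(envelope, "BizTalkDocument")
    header = ET.SubElement(document, "Header")
    delivery = ET.SubElement(header, "Delivery")            # addressing information
    ET.SubElement(delivery, "To").text = receiver
    ET.SubElement(delivery, "From").text = sender
    manifest = ET.SubElement(header, "Manifest")            # what the body contains
    ET.SubElement(manifest, "DocumentName").text = doc_name
    body = ET.SubElement(document, "Body")
    body.append(payload)                                    # the actual business document
    return envelope


# A hypothetical purchase order as the payload.
order = ET.Element("PurchaseOrder")
ET.SubElement(order, "Item").text = "widget"

message = wrap_business_document(order, "buyer.example", "seller.example", "PurchaseOrder")
print(ET.tostring(message, encoding="unicode"))
```

Note that the wrapper says nothing about what `PurchaseOrder` means. Routing is handled by the header; the semantics of the payload remain the schema author's problem.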
ebXML is a joint project of the United Nations body for Trade Facilitation and Electronic Business (UN/CEFACT) and OASIS, created to "develop a technical framework that will enable XML to be utilized in a consistent manner for the exchange of all electronic business data." Over 150 people participated in the first meeting, including the key players in international EDI.
UN/CEFACT and OASIS characterize ebXML as an 18-month initiative—an ambitious timeframe, even in web time. But a working draft is achievable in that time if they get rapid consensus to incorporate existing work. The semantic layer will come from EDI and EDIFACT, and the XML framework from what is essentially the third generation of CommerceOne's Common Business Language, part of the CommerceNet eCo Framework.
The eCo Framework, which effectively sets the scope of ebXML, is a combination of layered protocols and services that govern the exchange of business information within communities and markets. At XML '99, Dr. Robert J. Glushko, Director of Advanced Technology at Commerce One, Inc., said that the eCo Framework doesn't compete with the other business dialects like OBI and RosettaNet or even BizTalk. Instead, "It creates a world in which they can co-exist and makes it easier to compare and contrast them." It wraps them in a conceptual marketplace.
In this view of the world, the BizTalk specifications could provide an alternate component of messaging semantics, a uniform wrapper for the business documents containing actual business information. In this sense, it would be comparable to the current eCo Framework semantic recommendation (which is derived from legacy EDI systems).
Robert Worden took on the assumptions behind the BizTalk repository/schema scenario in his article, XML E-Business Standards: Promises and Pitfalls, published here on XML.com, when he wrote:
"None of the XML repositories will solve the N-squared translation problem for all businesses, unless it can establish a common model of all business information, agreed between all parties, which can then act as an interlingua for all XML translations. The chances of such a massive information model being developed consistently and completely, agreed across all countries and industry sectors, and then maintained effectively, are remote."
Regarding the "supra-standards," such as the eCo Framework, he says that they may manage the complexity of certain dialects, but it is too early to tell which will work and which will flop. Without a universal schema or schema management system anywhere on the horizon, he concludes that organizations should buckle down to the tough but necessary work of building their own gold standard schema since they are going to need it anyway.
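Worden's N-squared point is easy to make concrete. The sketch below assumes one translator per ordered pair of schemas in the direct case, and one translator into and one out of the common model per schema in the interlingua case; the function names are mine, not his.

```python
def pairwise_translators(n: int) -> int:
    """Direct translation: one translator per ordered pair of distinct schemas."""
    return n * (n - 1)


def interlingua_translators(n: int) -> int:
    """Common model: each schema needs one translator in and one out."""
    return 2 * n


for n in (3, 10, 100):
    print(n, pairwise_translators(n), interlingua_translators(n))
```

With 100 schemas the direct approach needs 9,900 translators against 200 for an interlingua. The saving is real only if the common model exists and is agreed upon, which is exactly what Worden considers remote.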
My own experience in building interoperable schemas for health care leads me to be somewhat more optimistic on the prospects for some industry-wide or protocol-specific agreements within a reasonable timeframe (say, within 2001) in domains where the EDI legacy has created a basis for shared semantics. I endorse Worden's advice to develop one's own model, and would like to extend it to say that modeling should be done in concert with an industry model or framework, where one exists. Entities can build their own model while they contribute to and extend the common models such as the Health Level 7 (HL7) Reference Information Model (RIM).
At the end of the day, however, you can't magically build a common infrastructure by fiat, regardless of market position or funding or degree of openness. HL7 has been working on the RIM for three years, and it builds on top of a decade of messaging; most verticals are not so far along. So, if unified frameworks are not imminent, where does that leave the repository business?
A Unified Theory of Repositories and Schemas
BizTalk is "ahead" of XML.org in the sense that it has actual schemas, but it still functions as a limited catalog-type resource. The wider, shallower net of links cast by XML.org's catalog is as useful as BizTalk if you want to know who is doing what and where. While Microsoft can claim an early lead in the repository race, it hardly appears commanding, and significant doubts remain whether anyone is even watching the race.
Users polled at XML '99 said either that they were indifferent or that they would post on both sites. The search and retrieval aspects of both sites and the basic services offered will have to go through major revisions before either site becomes more than a curiosity for this group. Andrew Hinchley, a consultant working on health care standards architecture for CEN, ISO, and the NHS, felt that the proposed schema frameworks, while urgently needed, are either woefully immature or susceptible to the consensual mire of the open standards process. The public draft of the BizTalk schemas requires Dun & Bradstreet identifiers, for example, which is hardly feasible for, say, an English public hospital. At the same time, even eighteen months seems too long to wait for ebXML.
On the other side of the fence, some vendors are adding "now available on BizTalk.org" to their PR notices, as if that were a significant mark of industry acceptance. Another vendor claims standards-compliance on the basis of BizTalk tags. On a calmer note, at least one standards-writing organization (which has been courted warmly by both groups) will post to both sites, provided that it can be done using an industry-standard schema language. In other words, it will send DTDs today, and later W3C schemas, to XML.org, but it won't translate its specifications to Microsoft's XDR and won't post on BizTalk.org until BizTalk.org accepts them in industry-standard markup.
To be useful as a reference catalog, repositories need qualitative analysis. To be useful as a real-time source of interoperable specifications, they need to serve components of a known framework. Until one or both of these criteria are met, repositories can provide a yellow pages directory of schema development, but not much more.
So if the real payback for repositories is either out of line with the current repository punchlist or way over the horizon, why have eight companies put up a total of half a million dollars to build one that will stand opposite Microsoft's BizTalk? Why are they focusing on the pot of gold at the end of the rainbow and ignoring the potholes in front of us?
To understand Microsoft's commitment, look at the third leg on the BizTalk stool: first is the repository, second is the schema framework, and third is the BizTalk commercial server. You can't blame Microsoft for using their influence to promote cross-industry support for their product format—anyone would. But when the booster is the big company in Redmond, promotion takes on a different character. (See sidebar: "The Microsoft Effect.")
We cannot forget that the background to the repository competition is the multi-track race to be the authoritative provider of schemas in every domain, from medicine to matchmaking. Schemas and schema frameworks represent the information model on which business is based. Selection and distribution of them is too critical a task to leave to anything less than a vetted source.