Generic Data Models and Schemas
May 3, 2000
Several weeks ago, I wrote in XML.com about the inadvisability of using a "one-size-fits-all" DOM approach to internal data models in applications. Jeff Lowery wrote me back pointing out that in his industry, publishing, the generic nature of the DOM was a huge asset (it is, after all, called the Document Object Model).
Jeff's e-mail also considered the potential usefulness of XML Schemas in providing the necessary validation for an internal DOM data model. I asked Jeff if he would share his experiences and thoughts in an article for XML.com. This article provides an interesting insight into the interplay between the development of XML tools and standards and the real-world requirements of developers using XML. -- E.D.
There are some unique aspects to writing software for the publishing industry that motivate the use of generic data structures (such as XML) as an application's internal data model. For instance, there are many tagged data formats (PDF, PJTF, PPD, etc.) that we must support. By having tag names for every data element, we can easily map one tagged data structure to another. We do this by building hash maps in Java that map our tags to classes that transform one particular type of data item to a target data format. As we traverse our tree-based data model, each tag encountered is looked up in the hash map, the associated class instance is retrieved, and one of its methods is called with the instance data as its parameter.
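The lookup-and-dispatch traversal described above can be sketched roughly as follows. This is a minimal illustration in present-day Java, not our production code: the `Node`, `TagTransformer`, and `TagDispatcher` names are hypothetical, and the output strings stand in for whatever the target format actually requires.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** A node in a hypothetical generic tree model: a tag name, a value, and children. */
class Node {
    final String tag;
    final String value;
    final List<Node> children = new ArrayList<>();

    Node(String tag, String value) {
        this.tag = tag;
        this.value = value;
    }

    /** Appends a child and returns this node so calls can be chained. */
    Node add(Node child) {
        children.add(child);
        return this;
    }
}

/** Converts one kind of data item into the target format (hypothetical interface). */
interface TagTransformer {
    String transform(Node node);
}

/** Maps tag names to transformers and applies them during a tree walk. */
class TagDispatcher {
    private final Map<String, TagTransformer> transformers = new HashMap<>();

    void register(String tag, TagTransformer transformer) {
        transformers.put(tag, transformer);
    }

    /** Walks the tree depth-first; each tag is looked up and its transformer invoked. */
    void traverse(Node node, List<String> out) {
        TagTransformer transformer = transformers.get(node.tag);
        if (transformer != null) {
            out.add(transformer.transform(node));
        }
        // Tags with no registered transformer are skipped, but their children are still visited.
        for (Node child : node.children) {
            traverse(child, out);
        }
    }
}
```

Supporting a new target format then amounts to registering a different set of transformers; the traversal itself does not change.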
Because publishing and printing is a manufacturing process (making books, magazines, catalogs, newspapers, and packaging), there are often many steps involved in producing a finished product. At each step along the way, data must not only be retrieved, but often new data must be added. We, as software developers, may not always know the nature of the data being added, yet we must be careful to preserve this data should it be read back into our data model. A generic data model can handle this task quite easily as long as the data is well-formed. For XML-based workflows, this means that any new data must be parsable by an XML parser.
Another advantage of generic data models is that undo/redo operations become trivial, as there are generally only three operations that change the model: add, modify, and delete. To undo an add operation, you do a delete; to undo a delete, you add the removed data back; to undo a modify operation, you merely modify again with the old value. In our application, before a command object is executed, it registers itself as a listener to our data model. As the command is executed, the resultant changes in the data model are recorded, including old/new data values, if any. Should the command be undone, the opposites of these operations are performed in reverse order. This is very similar in function, if not in methodology, to database transaction rollbacks.
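A stripped-down sketch of this record-and-reverse mechanism might look like the following. It assumes a flat key/value model in place of our tree, and the `DataModel`, `ChangeListener`, and `UndoRecorder` names are hypothetical; the point is only that recording each old value and replaying the list backwards is enough to reverse a command.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Receives one notification per change made to the model. */
interface ChangeListener {
    void changed(String key, String oldValue, String newValue);
}

/** A flat stand-in for the generic data model; a null value means "absent". */
class DataModel {
    private final Map<String, String> data = new HashMap<>();
    private ChangeListener listener;

    void setListener(ChangeListener listener) {
        this.listener = listener;
    }

    String get(String key) {
        return data.get(key);
    }

    /** Add, modify, and delete all pass through here; put(key, null) deletes. */
    void put(String key, String newValue) {
        String oldValue = data.get(key);
        if (newValue == null) {
            data.remove(key);
        } else {
            data.put(key, newValue);
        }
        if (listener != null) {
            listener.changed(key, oldValue, newValue);
        }
    }
}

/** Records the changes made while a command runs so they can be reversed later. */
class UndoRecorder implements ChangeListener {
    private static final class Change {
        final String key;
        final String oldValue;

        Change(String key, String oldValue) {
            this.key = key;
            this.oldValue = oldValue;
        }
    }

    // Used as a stack: the most recent change is undone first.
    private final Deque<Change> changes = new ArrayDeque<>();

    @Override
    public void changed(String key, String oldValue, String newValue) {
        changes.push(new Change(key, oldValue));
    }

    /** Restores each recorded old value, in reverse order of the original changes. */
    void undo(DataModel model) {
        model.setListener(null); // stop listening so the undo itself is not recorded
        while (!changes.isEmpty()) {
            Change change = changes.pop();
            model.put(change.key, change.oldValue); // a null old value undoes an add
        }
    }
}
```

Redo could be handled the same way by also keeping the new values, which this sketch omits for brevity.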
Schemas
Still, a generic structure by itself is not sufficient as an application's internal data model. A generic data model says nothing about the nature of the data within it. In some cases that can be a good thing, such as when you must incorporate third-party data as mentioned above. Unfortunately, for an application to be able to manipulate the data within a data model, it has to know, at a minimum, the location, structure, and type of the data members. This is where a schema comes in handy: it's the contract the data model has with the application. The data model must conform to the schema to be understandable to the application. What about that third-party data? Well, certain nodes of the data model can be designated as "open," meaning that the node can accept not only the data described in its corresponding schema definition, but other well-formed data as well. XML-Schema, for instance, uses the <any> element to allow for this case.
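To make the idea of an "open" node concrete, here is a small sketch of how an application might check child data against a schema entry. The `NodeSchema` class is purely hypothetical and far simpler than anything XML-Schema specifies; it only illustrates the rule that a closed node accepts declared children, while an open node accepts those plus anything else that is well-formed.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical schema entry for one node: the child tags it declares, plus an "open" flag. */
class NodeSchema {
    private final Set<String> declaredChildren = new HashSet<>();
    private final boolean open;

    NodeSchema(boolean open, String... childTags) {
        this.open = open;
        for (String tag : childTags) {
            declaredChildren.add(tag);
        }
    }

    /** A child is accepted if the schema declares it, or if the node is open. */
    boolean accepts(String childTag) {
        return open || declaredChildren.contains(childTag);
    }

    /** True when every child tag in the list is acceptable to this node. */
    boolean acceptsAll(List<String> childTags) {
        for (String tag : childTags) {
            if (!accepts(tag)) {
                return false;
            }
        }
        return true;
    }
}
```

With this in place, a third-party tag such as a vendor note is rejected by a closed node but passes through an open one untouched.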
Tradeoffs
What disadvantages might we run into? Performance isn't an issue because we don't do ad-hoc queries or large-scale batch manipulation of our data. Except for reading, there aren't any massive single updates of the data model. Size isn't a factor because our data model isn't very large to begin with. But any of these could change, forcing us to re-evaluate our decision to use a generic data model.
The one disadvantage we do have right now is that much of our data integrity is enforced at runtime, a situation that is less than optimal. The solution would be to take our schema and compile it into simple data structures in the language that we use (Java), thus moving constraint checking to compile time. As it turns out, these classes are currently being written by hand in order to simplify the interaction between data model clients (views) and the physical data structure. When writing these "wrapper" classes, one becomes aware that today's programming languages seem to lack the specificity for defining data structures that schema languages such as SQL and XML-Schema provide. Why is this?
Could We Use Schema Definitions in Programming Languages?
It seems that the ability to specify low-level data constraints in a programming language the same way one would in a schema language would save programmers a lot of tedious work. I have my own personal wish list:
- The separation of data members from their class methods and association of the two through a mechanism similar to inheritance.
- The ability to have more than one set of class methods support a schema.
- The implicit generation of get/set, add/remove methods for data members, with the ability to override them in any classes supporting the schema.
- The ability to restrict data member access to the above methods, ensuring code integrity should those methods be overridden.
- The ability to declare, and implicitly enforce through an exception mechanism, value constraints based on range or enumeration.
- The ability to declare, and implicitly enforce through an exception mechanism, maximum cardinality constraints.
- An implicitly generated validate() method that can be used to check minimum cardinality constraints.
- The ability to further constrain a schema's members via a refining schema (similar to the mechanism described in XML-Schema).
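For comparison, this is roughly what the constraint-related items on the list look like when written by hand today. The `PrintJob` class, its members, and its limits are invented for illustration; the range check, the maximum cardinality check, and `validate()` are exactly the kind of code a schema-aware language could generate from a declaration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical hand-written wrapper for a "PrintJob" schema type. */
class PrintJob {
    private static final int MIN_COPIES = 1;
    private static final int MAX_COPIES = 10000;
    private static final int MIN_PAGES = 1;
    private static final int MAX_PAGES = 3;

    private int copies = MIN_COPIES;
    private final List<String> pages = new ArrayList<>();

    int getCopies() {
        return copies;
    }

    /** Range constraint, enforced through an exception at the moment of the change. */
    void setCopies(int copies) {
        if (copies < MIN_COPIES || copies > MAX_COPIES) {
            throw new IllegalArgumentException("copies out of range: " + copies);
        }
        this.copies = copies;
    }

    /** Read-only view, so pages can only be changed through addPage/removePage. */
    List<String> getPages() {
        return Collections.unmodifiableList(pages);
    }

    /** Maximum cardinality, enforced through an exception at the moment of the change. */
    void addPage(String page) {
        if (pages.size() >= MAX_PAGES) {
            throw new IllegalStateException("at most " + MAX_PAGES + " pages allowed");
        }
        pages.add(page);
    }

    boolean removePage(String page) {
        return pages.remove(page);
    }

    /** Minimum cardinality cannot be enforced on every change, so it is checked on demand. */
    boolean validate() {
        return pages.size() >= MIN_PAGES;
    }
}
```

Note the asymmetry: a maximum can be rejected the instant it is exceeded, but a minimum is necessarily violated while the object is still being populated, which is why it falls to a separate validate() call.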
Having a simple but powerful declarative mechanism for describing data member constraints in a programming language (such as a schema definition) would save developers a great deal of mindless hand coding of low-level constraints through class methods. Higher-level business rule constraints will still have to be written by hand, but these business rules often change over time, or from one department to another, so declaring these high-level rules would not be as useful. Lastly, having schema definitions in a programming language should facilitate the transformation of one data format into another.