Mapping DTDs to Databases
May 9, 2001
1. Overview
A common question in the XML community is how to map XML to databases. This article discusses two mappings: a table-based mapping and an object-relational (object-based) mapping. Both mappings model the data in XML documents rather than the documents themselves. This makes the mappings a good choice for data-centric documents and a poor choice for document-centric documents. The table-based mapping can't handle mixed content at all, and the object-relational mapping of mixed content is extremely inefficient.
Both mappings are commonly used as the basis for software that transfers data between XML documents and databases, especially relational databases. An important characteristic in this respect is that they are bidirectional. That is, they can be used to transfer data both from XML documents to the database and from the database to XML documents. One consequence is that they are likely to be used as canonical mappings on top of which XML query languages can be built over non-XML databases. The canonical mappings will define virtual XML documents that can be queried with something like XQuery.
In addition to being used to transfer data between XML documents and databases, the first part of the object-relational mapping is used in "data binding", the marshalling and unmarshalling of data between XML documents and objects.
2. Table-based Mapping
There is an obvious mapping between the following XML document and table:
<A> <B> <C>ccc</C> Table A <D>ddd</D> ------- <E>eee</E> C D E </B> --- --- --- <B> <=> ... ... ... <C>fff</C> ccc ddd eee <D>ggg</D> fff ggg hhh <E>hhh</E> ... ... ... </B> </A>
It's called a table-based mapping. It views the document as a single table or a set of tables. The structure of the document must be either
<Table> <Row> <Column_1>...</Column_1> ... <Column_n>...</Column_n> </Row> ... <Row> <Column_1>...</Column_1> ... <Column_n>...</Column_n> </Row> </Table>
or
<Tables> <Table_1> <Row> <Column_1>...</Column_1> ... <Column_n>...</Column_n> </Row> ... </Table_1> ... <Table_n> <Row> <Column_1>...</Column_1> ... <Column_m>...</Column_m> </Row> ... </Table_n> </Tables>
with the caveat that column data can be represented either as PCDATA-only elements (shown) or attributes.
The obvious advantage of this mapping is its simplicity. Because it matches the structure of tables and result sets in a relational database, it is easy to write code based on this mapping; code which is fast, scales well, and is quite useful for certain applications, such as transferring data between databases one table at a time.
The mapping has several disadvantages; primarily, it only works with a very small subset of XML documents. In addition, it does not preserve physical structure (such as character and entity references, CDATA sections, character encodings, or the standalone declaration) or document information (such as the document type or DTD), comments, or processing instructions.
The table-based mapping is commmonly used by middleware to transfer data between XML documents and relational databases. It is also used in some Web application servers to return result set data as XML.
3. Object-Relational Mapping
Because table-based mappings only work with a limited subset of XML documents, some middleware tools, most XML-enabled relational databases, and most XML-enabled object servers use a more sophisticated mapping, called an object-relational mapping. This models the XML document as a tree of objects that are specific to the data in the document, then maps these objects to the database.
(The name "object-relational" is actually a misnomer -- a better name would be object-based mapping. This is because the objects can be mapped to non-relational databases, such as object-oriented or hierarchical databases, or simply left alone, as is done in data binding systems. However, because object-relational is a well-known term and this mapping is most commonly used with relational databases, that term will be used here. In addition, all examples use relational tables.)
To understand the object-relational mapping, it is easiest to first look at some simple examples. To start, notice that there is an obvious mapping between the following XML document, object, and row in a table:
XML Objects Tables ============= ============ =============== Table A <A> object A { ------- <B>bbb</B> B = "bbb" B C D <C>ccc</C> <=> C = "ccc" <=> --- --- --- <D>ddd</D> D = "ddd" ... ... ... </A> } bbb ccc ddd ... ... ...
Similarly, there is an obvious mapping between the following element type definition, class, and table schema:
DTD Classes Table Schema ====================== ============ ========================== class A { CREATE TABLE A String B; B VARCHAR(10) NOT NULL, <!ELEMENT A (B, C, D)> <=> String C; <=> C VARCHAR(10) NOT NULL, String D; D VARCHAR(10) NOT NULL } )
As a more complex example, consider the following XML document:
<SalesOrder> <Number>1234</Number> <Customer>Gallagher Industries</Customer> <Date>29.10.00</Date> <Item Number="1"> <Part>A-10</Part> <Quantity>12</Quantity> <Price>10.95</Price> </Item> <Item Number="2"> <Part>B-43</Part> <Quantity>600</Quantity> <Price>3.99</Price> </Item> </SalesOrder>
which maps to the following objects:
object SalesOrder { number = 1234; customer = "Gallagher Industries"; date = 29.10.00; items = {ptrs to Item objects}; } / \ / \ / \ object Item { object Item { number = 1; number = 2; part = "A-10"; part = "B-43"; quantity = 12; quantity = 600; price = 10.95; price = 3.95; } }
and then to rows in the following tables:
SaleOrders ---------- Number Customer Date ------ -------------------- -------- 1234 Gallagher Industries 29.10.00 ... ... ... ... ... ... Items ----- SONumber Item Part Quantity Price -------- ---- ---- -------- ----- 1234 1 A-10 12 10.95 1234 2 B-43 600 3.99 ... ... ... ... ...
All of these are examples of object-relational mappings from DTDs to relational databases.
3.1. The Basic Mapping
An object-relational mapping is performed in two steps. First, an XML schema (a DTD in this case) is mapped to an object schema, then the object schema is mapped to the database schema. The two mappings can optionally be combined for a direct DTD-to-database mapping, as is done in most software today.
In considering this mapping, it is important to understand that the objects involved are specific to each DTD and not objects from the DOM. In particular, these objects model the data in the XML document, while the DOM models the structure of the XML document. For example, the following shows the object trees for data-specific objects and the DOM that would be created for the XML document in the example above:
SalesOrder Document / \ | Item Item Element_______ ______/ / \ \______ \_________ / / \ \ \ Element Element Element Element Element | | | / | \ \ etc. Text Text Text / | \ \_______ / | \ \ Element Element Element Attr | | | | Text Text Text Text
This distinction is particularly important when you consider how the data is stored in the database. To store the data-specific objects, you will need SalesOrders and Items tables; to store the DOM objects, you will need Document, Element, Text, and Attr tables. The most important difference is that non-XML applications can use the data-specific tables but can't use the DOM-specific tables.
3.1.1. Mapping DTDs to Object Schemas
The mapping starts by recognizing that element types are data types. Element types that have PCDATA-only content are called simple element types; the term is borrowed from W3C XML Schema. These hold a single data value and are equivalent to scalar data types in an object-oriented programming language. (Note that the word "scalar" is used here to mean "consisting of a single data value". In some languages, "scalar" data types -- in this sense of the word -- are represented by objects. The most notable example of this is the String data type in Java.) Attribute types are also simple types.
Element types that have element or mixed content, or that have attributes, are called complex data types; the term is again borrowed from XML Schema. These hold a structured value and are equivalent to classes in an object-oriented programming language or structs in C. Note that an element type that has empty content and attributes is still "complex". The reason for this is that the attributes also provide structure and are roughly equivalent to child PCDATA-only elements.
The object-relational mapping first maps simple types to scalar data types. For example, the element type Title might be mapped to a String, and the element type Price might be mapped to a float. It then maps complex types to classes, with each element type in the content model of the complex type mapped to a property of that class. The data type of each property is the data type to which the referenced element type is mapped. For example, a reference to a Title element would be mapped to a String property, and a reference to a Price element would be mapped to a float property. References to complex element types are mapped to pointers/references to an object of the class to which the complex element type is mapped.
The last part of the mapping maps attributes to properties, with the data type of the property determined from the data type of the attribute. Note that attributes are equivalent to references to element types in a content model. This is because, like references in a content model, they are local to a given element type. The only conceptual difference is that the attribute type is defined locally, rather than at a global (DTD-wide) level, as is the case with element types.
For example, in the following the simple element types B, D, E, and the attribute F are all mapped to Strings and the complex element types A and C are mapped to classes A and C. The content models and attributes of A and C are mapped to properties of classes A and C. The references to B, D, and E in the content models of A and C are mapped to String properties (because the types are mapped to Strings) and the attribute F is also mapped to a String property. The reference to C in the content model of A is mapped to a property with the type pointer/reference to an object of class C because element type C is mapped to class C.
DTD Classes ========================= ============= <!ELEMENT A (B, C)> class A { <!ELEMENT B (#PCDATA)> String b; <!ATTLIST A ==> C c; F CDATA #REQUIRED> String f; } <!ELEMENT C (D, E)> class C { <!ELEMENT D (#PCDATA)> ==> String d; <!ELEMENT E (#PCDATA)> String e; }
One important point to reiterate here is that mapping references to element types in a content model is different than mapping the element types themselves. Element types map to data types, while references to element types map to properties in structured data types (classes). The difference is clear when you consider an element type that is referenced in two different content models. In this case, each reference must be mapped separately, with the data types of the resulting properties determined by the data types to which the element types themselves (not the references) are mapped.
For example, consider the Title and Section element types in the following. Each of these element types is referenced in the content models of both Chapter and Appendix. Each reference is mapped separately for each parent element type. The data type of the properties to which the references to Title are mapped is String, since Title contains only PCDATA and is mapped to a String. The data type of the properties to which the references to Section are mapped is a pointer/reference to a Section object, since the Section element type is complex and is mapped to a Section class.
DTD Classes ========================================= ============= <!ELEMENT Chapter (Title, Section+)> class Chapter { <!ELEMENT Appendix (Title, Section+)> ==> String title; <!ELEMENT Title (#PCDATA)> Section[] sections; <!ELEMENT Section (#PCDATA | p | b | i)*> } class Appendix { String title; Section[] sections; }
The other important point to reiterate is that simple element types and attributes can be mapped to data types other than String. For example, an element type named Quantity might be mapped to an integer. When mapping from a DTD, this requires human intervention, since there is no way to predict the target data type from a PCDATA-only element type. However, when mapping from an XML Schema, the target type is known since XML Schemas have data types.
3.1.2. Mapping Object Schemas to Database Schemas
In the second part of the object-relational mapping, classes are mapped to tables (known as class tables), scalar properties are mapped to columns, and pointer/reference properties are mapped to primary key/foreign key relationships. For example:
Classes Tables ============ ================= class A { Table A: String b; Column b C c; ==> Column c_fk String f; Column f } class C { Table C: String d; ==> Column d String e; Column e } Column c_pk
Note that the tables are joined by a primary key (C.c_pk) and a foreign key (A.c_fk). Because the relationship between the parent and child elements is one-to-one, the primary key can be in either table. If the relationship is one-to-many, the primary key must be on the "one" side of the relationship, regardless of whether this is the parent or child. For example, if a SalesOrder element contains multiple Item elements, then the primary key must be in the SalesOrder table (the parent). But if each Item element contains a Part element, the primary key must be in the Part table (the child), since one part can occur in many items.
A primary key column can be created as part of the mapping, as is the case for the column c_pk, or an existing column or columns can be used as the primary key. For example, if a SalesOrder element type has a Number child, this might be mapped to the primary key column.
If a primary key column is created as part of the mapping, its value must be generated by the database or the data transfer software. While this is generally considered better database design than using data columns as primary keys, it has a disadvantage when used with XML, and this is that the generated key is meaningless outside the source database. Thus, when data with a generated key is transferred to an XML document, it will either contain a meaningless primary key (if the primary key value is transferred) or no primary key at all (if the key is not transferred). In the latter case, it may be impossible to re-identify the source of the data, which is a problem if the data is modified and returned to the database as an XML document.
3.1.3. Miscellany
Before we continue with the more complicated parts of the mapping, we need to mention two things. First, names can be changed during the mapping. For example, the DTD, object schema, and relational schema can all use different names. For example, the following DTD uses different names than the following class:
DTD Classes =============================== ===================== <!ELEMENT Part (Number, Price)> class PartClass { <!ELEMENT Number (#PCDATA)> ==> String numberProp; <!ELEMENT Price (#PCDATA)> float priceProp; }
which uses different names than the following table:
Classes Tables ===================== ================== class PartClass { Table PRT String numberProp; ==> Column PRTNUM float priceProp; Column PRTPRICE }
Second, the objects involved in the mapping are conceptual. That is, there is no need to actually instantiate them when transferring data between an XML document and a relational database. (This is not to say that the objects can't be instantiated. Whether it is useful to instantiate the objects depends on the actual application.)
3.2. Mapping Complex Content Models
The content models we have seen so far are relatively simple. What happens with more complex content models, such as the following?
<!ELEMENT A (B?, (C | ((D | E | F | G)*, (H | I)+, J?)))>
In this section, we will consider the various parts of the content model. Mapping the above example will be left as an exercise to the reader. (I've always wanted to say that.)
3.2.1. Mapping Sequences
As has already been seen, each element type referenced in a sequence is mapped to a property, which is then mapped either to a column or to a primary key, foreign key relationship. For example:
DTD Classes Tables ====================== ============ ============== <!ELEMENT A (B, C)> class A { Table A <!ELEMENT B (#PCDATA)> ==> String b; ==> Column b C c; Column c_fk } <!ELEMENT C (D, E)> class C { Table C <!ELEMENT D (#PCDATA)> ==> String d; ==> Column d <!ELEMENT E (#PCDATA)> String e; Column e } Column c_pk
3.2.2. Mapping Choices
As with sequences, each element type referenced in a choice is also mapped to a property, then either to a column or a primary key, foreign key relationship. The only difference from the way sequences are mapped is that the properties and columns can be null. For example, suppose we changed the sequence in the content model of A to a choice. The mapping from DTD to object schema would then be
DTD Classes ====================== ======================== <!ELEMENT A (B | C)> class A { <!ELEMENT B (#PCDATA)> ==> String b; // Nullable C c; // Nullable } <!ELEMENT C (D, E)> class C { <!ELEMENT D (#PCDATA)> ==> String d; <!ELEMENT E (#PCDATA)> String e; }
and the mapping from object schema to database schema would be
Classes Tables ======================== =========================== class A { Table A ( String b; // Nullable ==> Column b // Nullable C c; // Nullable Column c_fk // Nullable } class C { Table C ( String d; ==> Column d // Not nullable String e; Column e // Not nullable } Column c_pk // Not nullable
To see why this is true, consider the following XML document, which conforms to the DTD above. Because the choice in the content model of A requires that either B or C (but not both) be present as a child, one of the two corresponding properties (and columns) will always be null.
XML Objects Tables ============= ============ =========== Table A --------- <A> object a { b c_fk <B>bbb</B> ==> b = "bbb" ==> --- ---- </A> c = null bbb NULL } ... ...
Note that if the primary key used to join the tables was in table A, the corresponding foreign key column in C would not be nullable. If A did have a C child, the column would have to have a value to join it to the correct row in table A. If A did not have a C child, there would simply be no row in table C.
3.2.3. Mapping Repeated Children
Children that can occur multiple times in their parent, which are known as repeated children, are mapped to multi-valued properties and then either to multiple columns in a table or to a separate table, known as a property table.
If a content model contains repeated references to an element type, the references are mapped to a single property, which is an array of known size. This can be mapped either to multiple columns in a table or to a property table. For example, the following shows how to map a repeated reference to multiple columns in a table:
DTD Classes Tables ========================= ============== =========== <!ELEMENT A (B, B, B, C)> class A { Table A <!ELEMENT B (#PCDATA)> ==> String[] b; ==> Column b1 <!ELEMENT C (#PCDATA)> String c; Column b2 } Column b3 Column c
If the + or * operator is applied to a reference, the reference is again mapped to a single property, which this time is an array of unknown size. Since the number of values can be arbitrarily large, the property must be mapped to a property table, which will contain one row for each value. The property table is linked to the class table by a primary key, foreign key relationship, where the primary key is in the class table. For example,
DTD Classes Tables ====================== ============== ============== <!ELEMENT A (B+, C)> class A { Table A <!ELEMENT B (#PCDATA)> ==> String[] b; ==> Column a_pk <!ELEMENT C (#PCDATA)> String c; Column c } Table B Column a_fk Column b
3.2.4. Mapping Optional Children
Children that are optional in their parent are mapped to nullable properties, then to nullable columns. This has already seen for children that appear in a choice group, as in the following mapping from DTD to object schema:
DTD Classes ========================= ======================== <!ELEMENT A (B | C | D)> class A { <!ELEMENT B (#PCDATA)> ==> String b; // Nullable <!ELEMENT C (#PCDATA)> String c; // Nullable D d; // Nullable } <!ELEMENT D (E, F)> class D { <!ELEMENT E (#PCDATA)> ==> String e; <!ELEMENT F (#PCDATA)> String f; }
and object schema to database schema:
Classes Tables ======================== ======================== class A { Table A String b; // Nullable ==> Column b // Nullable String c; // Nullable Column c // Nullable D d; // Nullable Column d_fk // Nullable } class D { Table D String e; ==> Column e // Not nullable String f; Column f // Not nullable } Column d_pk // Not nullable
It also applies when the ? or * operator is applied to a reference, as in the following mapping from DTD to object schema:
DTD Classes ====================== ========================== <!ELEMENT A (B?, C*)> class A { <!ELEMENT B (#PCDATA)> ==> String b; // Nullable <!ELEMENT C (#PCDATA)> String c[]; // Nullable }
and object schema to database schema:
Classes Tables ========================== =============================== class A { Table A String b; // Nullable ==> Column b // Nullable String c[]; // Nullable Column a_pk // Not nullable } Table C Column a_fk // Not nullable Column c // Not nullable
Note that the column used to store C (in property table C) is not nullable. This is because, if no C children are present in A, there are simply no rows in table C.
3.2.5. Mapping Subgroups
References in subgroups are mapped to properties of the parent class, then to columns in the class table, as in the following mapping from DTD to object schema:
DTD Classes ========================= ============================ <!ELEMENT A (B, (C | D))> class A { <!ELEMENT B (#PCDATA)> ==> String b; // Not nullable <!ELEMENT C (#PCDATA)> String c; // Nullable <!ELEMENT D (E, F)> D d; // Nullable } <!ELEMENT D (E, F)> class D { <!ELEMENT E (#PCDATA)> ==> String e; <!ELEMENT F (#PCDATA)> String f; }
and object schema to database schema:
Classes Tables ============================ ========================== class A { Table A String b; // Not nullable ==> Column b // Not nullable String c; // Nullable Column c // Nullable D d; // Nullable Column d_fk // Nullable } class D { Table D ( String e; ==> Column e String f; Column f } Column d_pk
You might be wondering how this is possible. What happened to the structure imposed by the subgroup? In fact, this structure appears only in the content model, not in instance documents. For example, both of the following documents conform to the above content model:
<A> <B>bbbbbb</B> <C>cccccc</C> </A> <A> <B>bbbbbb</B> <D> <E>eee</E> <F>fff</F> </D> </A>
There's no way to determine the presence of the subgroup from these documents. Structurally, C and D are indistinguishable from B; they are just children of A. Thus, they can be mapped like children of A that are not in a subgroup.
One consequence of mapping references in subgroups directly to properties of the parent class is that repeatability and optionality can be indirect. For example, in the following content model C, D, and E are both optional and repeatable. They are repeatable because the + operator applies to them indirectly. C is optional because it is directly in a choice group, but D and E are optional because they are indirectly in a choice group.
DTD Classes =============================== ============================= <!ELEMENT A (B, (C | (D, E))+)> class A { <!ELEMENT B (#PCDATA)> String b; <!ELEMENT C (#PCDATA)> ==> String[] c; // May be null <!ELEMENT D (E, F)> D[] d; // May be null <!ELEMENT E (#PCDATA)> String[] e; // May be null }
3.3. Mapping Mixed Content
Mixed content is just a choice group to which the * operator applies indirectly, except that it can contain PCDATA mixed between child elements. Thus, the element type references in mixed content can be mapped first to properties that are nullable arrays of unknown size, then to property tables.
To see how to map mixed content, consider the following XML document:
<A> This text <c>cc</c> makes <b>bbbb</b> no sense <c>cccc</c> except as <b>bb</b> an example. </A>
and then notice that it is essentially the same as the following document, in which PCDATA has been wrapped in <pcdata> elements:
<A> <pcdata>This text </pcdata><c>cc</c><pcdata> makes </pcdata><b>bbbb</b><pcdata> no sense </pcdata><c>cccc</c><pcdata> except as </pcdata><b>bb</b><pcdata> an example.</pcdata> </A>
From this, it is easy to see that PCDATA can be treated like any other child element. Thus, PCDATA in mixed content is mapped to a nullable array of unknown size, then to a property table. The following shows how to map mixed content from a DTD to an object schema:
DTD Classes =============================== =================== class A { <!ELEMENT A (#PCDATA | B | C)*> String[] pcdata; <!ELEMENT B (#PCDATA)> ==> String[] b; <!ELEMENT C (#PCDATA)> String[] c; }
and an object schema to a database schema:
Classes Tables =================== =================================== Table PCDATA ------Column a_fk class A { / Column pcdata String[] pcdata; Table A / Table B String[] b; ==> Column a_pk--------Column a_fk String[] c; \ Column b } \ Table C \-----Column a_fk Column c
To see what is actually stored in the database, consider the document shown at the start of this section, which is mapped to the following object, then to rows in the following tables. (We assume that the system generates a primary key of value 1 for the row in the table for A. This is used to link the row in table A to the rows in the other tables.)
Objects Tables ============================ =============================== Table PCDATA a_fk pcdata ---- ----------- 1 This text 1 makes object a { 1 no sense pcdata = {"This text ", 1 except as " makes ", Table A 1 an example. " no sense ", a_pk " except as", ==> ---- Table B " an example."} 1 a_fk b b = {"bbbb", "bb"} ---- ---- c = {"cc", "cccc"} 1 bbbb } 1 bb Table C a_fk c ---- ---- 1 cc 1 cccc
One of the things that should be readily obvious from this example is that the object-relational mapping is not very efficient at storing mixed content. Because of this, it is more commonly used in data-centric applications, which tend to have little mixed content.
There are two ways to solve this problem. The first is to use a mapping other than the object-relational mapping. For example, if the document is modeled using the DOM or a similar structure, and this is mapped to the database with an object-relational mapping, there are far fewer tables in the database -- Document, Element, Attr, Text, etc. -- although a similar number of joins are required to retrieve a document. The second strategy is to not break documents into their smallest possible components but instead to break them into larger pieces, such as chapters or sections. This strategy can be used with the object-relational mapping; for more information, see section 3.6.1, "Mapping Complex Element Types to Scalar Types".
3.4. Mapping Order
This section discusses how the object-relational mapping handles order.
3.4.1. Sibling Order, Hierarchical Order, and Document Order
Sibling means "brother or sister". Thus, sibling elements or PCDATA are elements or PCDATA that have the same parent. In other words, they appear in the same content model. For example, if the document from the previous section is represented as a tree, it is readily apparent which elements are siblings: those elements at the second level of the hierarchy, which all have A as their parent.
A ___________________________|______________________ | | | | | | | | | This text C makes B no sense C except as B an example | | | | cc bbbb cccc bb
Note that the elements at the third level of the hierarchy are not siblings because they don't share the same parent. This also points out the difference between sibling order, which is the order in which children occur in their parent, and hierarchical order, which is the level at which children appear in a tree representing the document. Different still is document order, which is the order in which elements and text appear in an XML document. For example:
Sibling order (order not shown where there is only one sibling):
A ___________________________|______________________ | | | | | | | | | This text C makes B no sense C except as B an example 1 2 3 4 5 6 7 8 9 | | | | cc bbbb cccc bb
Hierarchical order:
1 A ___________________________|______________________ | | | | | | | | | 2 This text C makes B no sense C except as B an example | | | | 3 cc bbbb cccc bb
Document order:
A 1 ___________________________|______________________ | | | | | | | | | This text C makes B no sense C except as B an example 2 3 5 6 8 9 11 12 14 | | | | cc bbbb cccc bb 4 7 10 13
According to the XML specification, sibling order is significant. In practice, this depends on the application. For example, in a data-centric application, where an XML document is used to populate an object or a table, sibling order usually does not matter because object-oriented languages have no concept of order among their properties. Similarly, relational databases have no concept of order among their columns. Thus, the sibling order is not significant in either of the following documents:
<Part> <Number>123</Number> <Desc>Turkey wrench</Desc> <Price>10.95</Price> </Part> <Part> <Price>10.95</Price> <Desc>Turkey wrench</Desc> <Number>123</Number> </Part>
both of which can be mapped to the following object and row in a table:
Objects Tables ========================= =================================== Table Parts object part { ------------------------------- number = 123 ==> Number Desc Price desc = "Turkey wrench" ------ ------------- ----- price = 10.95 123 Turkey wrench 10.95
(A major exception to this is when a data-centric document must match a specific DTD. This occurs when an application must validate documents, such as when they come from an unknown or untrusted source. Although "all groups" in XML Schemas help in this situation by allowing a set of children to appear in any order, they do not support repeated children.)
On the other hand, in document-centric applications, in which documents are generally designed for human consumption, sibling order is very important. For example, I am likely to like the first review and not the second:
<Review> <p>Ronald Bourret is an <b>excellent writer</b>. Only an <b>idiot</b> wouldn't read his work.</p> </Review> <Review> <p>Ronald Bourret is an <b>idiot</b>. Only an <b>excellent writer</b> wouldn't read his work.</p> </Review>
The object-relational mapping can preserve sibling order, as will be seen below, although in practice few products support this. It inherently preserves hierarchical order by mapping references to simple element types to columns in a table and by mapping references to complex element types to primary key, foreign key relationships. It preserves document order when both hierarchical and sibling order are preserved.
3.4.2. Mapping Sibling Order
Because object-oriented languages have no concept of order among their properties, and relational databases have no concept of order among their columns, it is necessary to store sibling order values separately from data values. One way to do this is to introduce separate properties and columns in which to store order values. Another way to do this is to store the order values in the mapping itself.
3.4.2.1. Order Properties and Columns
Order properties and order columns are used to store order values. They are separate from data properties and data columns. One property or column is needed for each referenced element type or PCDATA for which order is deemed important. For example, consider the above mixed content example. The following maps the sibling order in a DTD to order properties:
DTD Classes =============================== ======================== class A { String[] pcdata; int[] pcdataOrder; <!ELEMENT A (#PCDATA | B | C)*> String[] b; <!ELEMENT B (#PCDATA)> ==> int[] bOrder; <!ELEMENT C (#PCDATA)> String[] c; int[] cOrder; }
and then to order columns:
Classes Tables ======================== ======================================== Table PCDATA class A { -----Column a_fk String[] pcdata; / Column pcdata int[] pcdataOrder; / Column pcdataOrder String[] b; Table A / Table B int[] bOrder; ==> Column a_pk--------Column a_fk String[] c; \ Column b int[] cOrder; \ Column bOrder } \ Table C \----Column a_fk Column c Column cOrder
Notice that the order properties are stored in tables parallel to the properties that they order.
The following example shows order properties being used to preserve sibling order in the "makes-no-sense" example. One important thing to notice here is that all the order properties share the same order space. An order value that appears in one order property won't appear in another order property.
Classes Tables ================================= ===================================== Table PCDATA a_fk pcdata pcdataOrder ---- ----------- ----------- 1 This text 1 object a { 1 makes 3 pcdata = {"This text ", 1 no sense 5 " makes ", 1 except as 7 " no sense ", Table A 1 an example. 9 " except as", a_pk " an example."} ==> ---- Table B pcdataOrder = {1, 3, 5, 7, 9} 1 a_fk b bOrder b = {"bbbb", "bb"} ---- ---- ------ bOrder = {4, 8} 1 bbbb 4 c = {"cc", "cccc"} 1 bb 8 cOrder = {2, 6} } Table C a_fk c cOrder ---- ---- ------ 1 cc 2 1 cccc 6
Although order properties are most commonly used to maintain order in mixed content, they can be used with element content as well. For example, consider the following element type definition. Because B can appear an arbitrary number of times in A, it is stored in a separate property table. Without order properties, there would be no way to determine how to order the B children. (Note that row order cannot be used here, as relational databases are not guaranteed to return rows in any particular order.)
<!ELEMENT A (B*, C)>
3.4.2.2. Storing Order in the Mapping
In many cases, sibling order is important only because of validation; the application itself does not care about sibling order except to be able to validate a document. This is especially true of element content in data-centric documents. In such cases, it may be sufficient to store order information in the mapping itself.
For example, given the following content model, the mapping could store the information that the children of A are ordered B, then C, then D:
<!ELEMENT A (B, C, D)>
In practice, there are limitations to storing order information in a mapping. For example, consider the following content model:
<!ELEMENT A (B?, C, B)>
Constructing a document that matches this content model requires software to decide first how much data is available for constructing B elements. If there is only enough data to construct one B element, it won't construct the first B element, since the second B element is required.
It is unlikely that most software will go to the trouble of doing this. Instead, a reasonable limitation is to support only those content models that group all siblings of the same element type together. This is sufficient for many data-centric content models and can be implemented by storing the position of each element in the content model in the mapping.
For example, order of siblings in the following content models can be mapped this way. Note that in the third content model, Author and Editor can both be assigned the same order value or different values; if they are assigned different values, all elements of one type one will occur before any elements of the other type.
<!ELEMENT Part (Number, Description, Price)> <!ELEMENT Order (Number, CustNum, Date, Item*)> <!ELEMENT Book (Title, (Author | Editor)+, Price, Review*)>
When order information is stored only in the mapping, round-tripping of documents is not possible whenever the content model contains more than one element of the same type. For example, consider the following content model:
<!ELEMENT A (B+, C)>
Although the mapping can tell the software that all B elements must occur before the C element, it cannot specify the order of the B elements. Thus, if data is transferred from a document containing this content model to the database and back again, there is no guarantee that the B elements will occur in the same order as in the original document. Fortunately, this is not often a problem for data-centric documents.
3.5. Mapping Attributes
As was seen earlier, attributes are mapped to scalar properties. This section will discuss the details of that mapping, along with other issues.
3.5.1. Mapping Single- and Multi-Valued Attributes
There are two different classes of attributes: single-valued (CDATA, ID, IDREF, NMTOKEN, ENTITY, NOTATION, and enumerated) and multi-valued (IDREFS, NMTOKENS, and ENTITIES). As might be expected, these map to single-valued properties (and then to columns) and multi-valued properties (and then to property tables). For example:
DTD Classes Tables ============================ ============ =========== <!ELEMENT A (B, C)> class A { Table A <!ATTLIST A String b; Column B D CDATA #REQUIRED> ==> String c; ==> Column C <!ELEMENT B (#PCDATA)> String d; Column D <!ELEMENT C (#PCDATA)> }
and:
DTD Classes Tables ======================== ============== ============== <!ELEMENT A (B, C)> class A { Table A <!ATTLIST A String b; Column a_pk D IDREFS #IMPLIED> ==> String c; ==> Column b <!ELEMENT B (#PCDATA)> String[] d; Column c <!ELEMENT C (#PCDATA)> } Table D Column a_fk Column d
The order in which attributes occur is not significant according to the XML Information Set. For example, the following two XML documents are considered identical. Because of this, there is no need to use order properties to maintain the order in which attributes occur, although it would certainly be possible to do so.
<A B="bbb" C="ccc" D="ddd"/> <A C="ccc" B="bbb" D="ddd"/>
On the other hand, the order in which values occur in multi-valued attributes is considered significant. As was the case with sibling elements and PCDATA, order properties can be used to maintain information about the order in which values in a multi-valued attribute occur. However, there is one important difference between the order properties used for sibling elements and PCDATA and those used for multi-valued attributes: each order property used for a multi-valued attribute has its own order space. This can be seen in the following example:
XML Objects =============== ========================= object a { <A B="dd ee ff" b = {"dd", "ee", "ff"} C="gg hh"/> ==> bOrder = {1, 2, 3} c = {"gg", "hh"} cOrder = {1, 2} }
Alert readers will note that order properties are not strictly necessary at the object level; array order could be used instead. However, they are needed in relational databases, since there is no concept of row order there.
3.5.2. Mapping ID/IDREF(S) Attributes
ID attributes are used to uniquely identify elements in an XML document. IDREF and IDREFS attributes are used to associate one element with another by referring to the latter element's ID. This is usually done when it is not possible to form this association by nesting one element inside another. For example, consider the following directed graph:
A / \ B C \ / D
This can be represented in an XML document as:
<A> <B ref_d="1"> ... </B> <C ref_d="1"> ... </C> <D id="1"> ... </D> </A>
ID/IDREF(S) attributes map to primary key, foreign key relationships. For example, the above document could be stored in the database with the following tables and columns:
Table A Column a_pk ... / \ / \ Table B Table C Column a_fk Column a_fk Column ref_d Column ref_d ... ... \ / \ / Table D Column a_fk Column id ...
One thing that data transfer software needs to be careful of when storing ID/IDREF(S) attributes in a database is that IDs are only guaranteed to be unique inside a given XML document. Thus, if the data from more than one document is stored in the same table, there is no guarantee that the IDs will be unique. The solution to this problem is to somehow "decorate" the IDs. This could be done by mapping the attributes to two columns, one of which contains a value that is unique to each document and the other of which contains the ID, or by decorating the ID itself, such as prefixing it with a unique value.
A similar problem exists when transferring data from the database to a XML document. If the retrieved data originates from more than one document, the data transfer software needs to ensure that the ID values are unique. This might involve changing one or more values, along with the values of any IDREF(S) attributes that reference them.
Currently, most products do not support ID/IDREF attributes as distinct from other attributes.
3.5.3. Mapping Notation Attributes
Notations are used in XML to alert the application of how an element or unparsed entity is to be treated. For example, the following "xhtml" notation might tell the application that the element contains XHTML and should be displayed by a browser:
<Desc type="xhtml"> <html> <head><title>Turkey wrench</title></title></head> <body> <p>A very <b>big</b> turkey wrench.</p> </body> </html> </Desc>
Notation attributes and their values are generally of no interest to the object-relational mapping; they're treated as simply another attribute.
The only exception to this occurs when the notation indicates the data type of the contained text. For example, the notation "base64" might tell the application that an element contains binary data encoded as Base64 (a MIME encoding that maps binary data to a subset of US-ASCII). In most cases, this information will only be of interest to software that generates mappings. It could use the information to map the element type to a binary-valued property and then to a BLOB (Binary Large OBject). In such cases, the mapping itself does not use this information. The mapping from element to BLOB is independent of the fact that a notation contains data type information.
The only exception to this is when the data transfer software is sophisticated enough to switch mappings at run time based on the notation value. In this case, each possible notation is mapped to a data type, which is then used to convert the data.
3.5.4. Mapping ENTITY/ENTITIES Attributes
ENTITY and ENTITIES attributes are used to associate unparsed, external data (such as a binary file) with an XML document. These are mapped the same as any other attribute, except that, when the data is transferred, the entity may be substituted for the attribute value (when transferring data from XML to the database) or a new entity may be created and its identifier stored as the attribute value (when transferring data from the database to XML). Because unparsed entity values can be generated dynamically, it is a good idea for the mapping to specify whether the value or the entity URI is to be stored in the database.
Because unparsed entities always have associated notations, it is possible to use those notations when determining the data type of the entity (either at map time or run time).
3.6. Alternate Mappings
In the previous sections, we have described how to map DTDs to databases. In fact, this description has not been complete, as there are a number of other ways we could have done the mapping. In this section, we will discuss two of the most significant alternatives.
3.6.1. Mapping Complex Element Types to Scalar Types
Although complex element types are normally mapped to classes and then to tables, it is possible to map them to scalar types. In other words, references to complex element types can be mapped to scalar properties. The value of such properties is generally the element content, serialized as XML. This is useful when the element's value makes sense only as a whole and shouldn't be broken into smaller parts.
For example, consider an XML document that gives information about a part. If one of the child elements is a part description in XHTML, it probably makes no sense to fragment it further. As we have already seen, this will result in the data being scattered across numerous tables; one for italic words, one for bold words, one for words used in hyperlinks, etc. Therefore, it is better to store this description in a single column:
DTD Classes Tables =========================== =============== ============== <!ELEMENT Part (Num, Desc)> class Part { Table Part <!ELEMENT Number (#PCDATA)> ==> String num; ==> Column num <!-- Use Inline entity String desc; Column desc from XHTML --> } <!ELEMENT Desc (%Inline;)>
For example, the following description is stored as follows:
<Part> <Number>127</Number> Table Part <Desc> Num Desc A very <b>big</b> => --- ---------------- turkey wrench. 127 A very <b>big</> </Desc> turkey wrench. </Part>
Note that storing data as XML does cause problems for data transfer software. In particular, the software cannot distinguish between markup and data. For example, how does the application determine if the <b> in the following text is a <b> element or text?
An example of the <b> element is <b>this element</b>.
One solution to this is to store actual elements in the database using tags and character data using entity references:
An example of the <b> element is <b>this element</b>.
The problem with this is that non-XML applications can't then search the database as they might expect.
3.6.2. Mapping Scalar Properties to Property Tables
Although single, scalar-valued properties are usually mapped to columns, they can be mapped to property tables. This is useful, for example, for storing BLOBs or infrequently used properties in a separate table from the main table. For example:
Classes Tables =============== ================== class Part { Table Parts String num; ==> Column num String desc; Table Descriptions } Column num Column desc
3.6. Conclusions
The object-relational mapping works for all XML documents, maps well to objects, and allows non-XML applications to use the data in the database. Because of this, it is a good choice for storing data-centric documents and (not surprisingly) is used as the underlying model in some middleware, most XML-enabled relational databases, and most XML-enabled object servers.
It should be noted that all of these products implement slightly different versions of the object-relational mapping and none implements all possibilities in the mapping. Of the more common parts of the mapping, all map complex element types to classes and references to element types to properties, as well as using primary key, foreign key pairs to join tables. However, some map columns only to PCDATA-only elements, others map columns only to attributes, and still others allow both. Similarly, most of these products do not support sibling order or mixed content, and many do not allow users to change names during the mapping.
The object-relational mapping is not a good choice for ordinary documents. First, it is inefficient when used with mixed content. Second, like the table-based mapping, it does not preserve physical structure, comments, or processing instructions.
4. Generating Schema
We now consider how to generate relational database schema from a DTD according to the object-relational mapping and vice versa. Since there are a number of possible paths through the object-relational mapping, the algorithms here simply choose the most commonly used branch each time there is a choice. For example, single references to PCDATA-only elements can be mapped to a column or a separate property table. Since the most common choice is to map them to a column, the following algorithm generates a column from such references.
For object-oriented databases, the generation process is similar.
4.1. Generating Relational Database Schema from DTDs
Relational schemas are generated by reading through the DTD and processing each element type:
- Complex element types generate class tables with primary key columns.
- Simple element types are ignored except when processing content models.
To process a content model:
- Single references to simple element types generate columns; if the reference is optional (? operator), the column is nullable.
- Repeated references to simple element types generate property tables with foreign keys.
- References to complex element types generate foreign keys in remote class tables.
- PCDATA in mixed content generates a property table with a foreign key.
- Optionally generate order columns for all referenced element types and PCDATA.
To process attributes:
- Single-valued attributes generate columns; if the attribute is optional, the column is nullable.
- Multi-valued attributes generate property tables with foreign keys.
- If an attribute has a default, it is used as the column default.
The following example shows how this process works. Consider the following DTD:
DTD Tables ================================================= ================= <!ELEMENT Order (OrderNum, Date, CustNum, Item*)> <!ELEMENT OrderNum (#PCDATA)> <!ELEMENT Date (#PCDATA)> <!ELEMENT CustNum (#PCDATA)> <!ELEMENT Item (ItemNum, Quantity, Part)> <!ELEMENT ItemNum (#PCDATA)> <!ELEMENT Quantity (#PCDATA)> <!ELEMENT Part (PartNum, Price)> <!ELEMENT PartNum (#PCDATA)> <!ELEMENT Price (#PCDATA)>
In the first step, we generate tables for complex element types and primary keys for these tables:
DTD Tables ================================================= ================= <!ELEMENT Order (OrderNum, Date, CustNum, Item*)> ==> Table Order <!ELEMENT OrderNum (#PCDATA)> Column OrderPK <!ELEMENT Date (#PCDATA)> <!ELEMENT CustNum (#PCDATA)> <!ELEMENT Item (ItemNum, Quantity, Part)> ==> Table Item <!ELEMENT ItemNum (#PCDATA)> Column ItemPK <!ELEMENT Quantity (#PCDATA)> <!ELEMENT Part (PartNum, Price)> ==> Table Part <!ELEMENT PartNum (#PCDATA)> Column PartPK <!ELEMENT Price (#PCDATA)>
In the second step, we generate columns for references to simple element types:
DTD Tables ================================================= ================= <!ELEMENT Order (OrderNum, Date, CustNum, Item*)> ==> Table Order <!ELEMENT OrderNum (#PCDATA)> Column OrderPK <!ELEMENT Date (#PCDATA)> Column OrderNum <!ELEMENT CustNum (#PCDATA)> Column Date Column CustNum <!ELEMENT Item (ItemNum, Quantity, Part)> ==> Table Item <!ELEMENT ItemNum (#PCDATA)> Column ItemPK <!ELEMENT Quantity (#PCDATA)> Column ItemNum Column Quantity <!ELEMENT Part (PartNum, Price)> ==> Table Part <!ELEMENT PartNum (#PCDATA)> Column PartPK <!ELEMENT Price (#PCDATA)> Column PartNum Column Price
In the final step, we generate foreign keys for references to complex element types:
DTD Tables ================================================= ================= <!ELEMENT Order (OrderNum, Date, CustNum, Item*)> ==> Table Order <!ELEMENT OrderNum (#PCDATA)> Column OrderPK <!ELEMENT Date (#PCDATA)> Column OrderNum <!ELEMENT CustNum (#PCDATA)> Column Date Column CustNum <!ELEMENT Item (ItemNum, Quantity, Part)> ==> Table Item <!ELEMENT ItemNum (#PCDATA)> Column ItemPK <!ELEMENT Quantity (#PCDATA)> Column ItemNum Column Quantity Column OrderFK <!ELEMENT Part (PartNum, Price)> ==> Table Part <!ELEMENT PartNum (#PCDATA)> Column PartPK <!ELEMENT Price (#PCDATA)> Column PartNum Column Price Column PartFK
A generated schema isn't going to be the same as a human would have written. In addition to naming problems (for example, a person might have called the tables Orders, Items, and Parts), the generation algorithm was unable to determine that the relationship between Items and Parts was many-to-one. The algorithm was also unable to recognize that OrderNum and PartNum could be used as primary keys, and it could not determine data types and column lengths, although XML Schemas will solve the latter problem. Although no naming collisions occurred or illegal names were generated, both are possible.
4.2. Generating DTDs from Database Schema
DTDs are generated by starting from a single "root" table or set of root tables and processing each:
- Each root table generates an element type with element content in the form of a single sequence.
- Each data (non-key) column in the table generates an element type with PCDATA-only content and a reference in the sequence; nullable columns generate optional references.
Primary and foreign keys are generated as follows:
- The remote table is processed in the same manner as a root table.
- A reference to the element type for the remote table is added to the sequence.
- If the key is the primary key, the reference is optional and repeated (* operator). This is because there is no guarantee that a row will exist in the foreign table, nor is there a guarantee that only one row will exist.
- If the key is the primary key, PCDATA-only element types are optionally generated for each column in the key. If these are generated, references to these element types are added to the sequence. This is useful only if primary keys contain data.
- If the key is a foreign key and is nullable, the reference is optional (? operator).
During this process, the key used to reach the table (if any) is not processed. This saves the algorithm from generating element types that duplicate those created in the parent table.
The following example shows how this process works. Consider the following database schema:
Table Orders Column OrderNum Column Date Column CustNum Table Items Column OrderNum Column ItemNum Column Quantity Column PartNum Table Parts Column PartNum Column Price
In our first step, we generate an element type for the root table (Orders):
Tables DTD ================== =================================================== Table Orders ==> <!ELEMENT Orders ()> Column OrderNum Column Date Column CustNum Table Items Column OrderNum Column ItemNum Column Quantity Column PartNum Table Parts Column PartNum Column Price
Next, we generate PCDATA-only elements for the data columns (Date and CustNum) and add references to these elements to the content model of the Orders element:
Tables DTD ================== =================================================== Table Orders ==> <!ELEMENT Orders (Date, CustNum)> Column OrderNum Column Date <!ELEMENT Date (#PCDATA)> Column CustNum <!ELEMENT CustNum (#PCDATA)> Table Items Column OrderNum Column ItemNum Column Quantity Column PartNum Table Parts Column PartNum Column Price
Now we generate a PCDATA-only element for the primary key (OrderNum) and add a reference to it to the content model:
Tables DTD ================== =================================================== Table Orders ==> <!ELEMENT Orders (Date, CustNum, OrderNum)> Column OrderNum <!ELEMENT OrderNum (#PCDATA)> Column Date <!ELEMENT Date (#PCDATA)> Column CustNum <!ELEMENT CustNum (#PCDATA)> Table Items Column OrderNum Column ItemNum Column Quantity Column PartNum Table Parts Column PartNum Column Price
And then add an element for the table (Items) to which the primary key is exported, as well as a reference to it in the content model:
Tables DTD ================== =================================================== Table Orders <!ELEMENT Orders (Date, CustNum, OrderNum, Items*)> Column OrderNum <!ELEMENT OrderNum (#PCDATA)> Column Date <!ELEMENT Date (#PCDATA)> Column CustNum <!ELEMENT CustNum (#PCDATA)> Table Items ==> <!ELEMENT Items()> Column OrderNum Column ItemNum Column Quantity Column PartNum Table Parts Column PartNum Column Price
We process the data and primary key columns in the remote (Items) table in the same way:
Tables DTD ================== =================================================== Table Orders <!ELEMENT Orders (Date, CustNum, OrderNum, Items*)> Column OrderNum <!ELEMENT OrderNum (#PCDATA)> Column Date <!ELEMENT Date (#PCDATA)> Column CustNum <!ELEMENT CustNum (#PCDATA)> Table Items ==> <!ELEMENT Items(ItemNum, Quantity)> Column OrderNum Column ItemNum <!ELEMENT ItemNum (#PCDATA)> Column Quantity <!ELEMENT Quantity (#PCDATA)> Column PartNum Table Parts Column PartNum Column Price
And then add an element for the table (Parts) to which the foreign key corresponds:
Tables DTD ================== =================================================== Table Orders <!ELEMENT Orders (Date, CustNum, OrderNum, Items*)> Column OrderNum <!ELEMENT OrderNum (#PCDATA)> Column Date <!ELEMENT Date (#PCDATA)> Column CustNum <!ELEMENT CustNum (#PCDATA)> Table Items <!ELEMENT Items(ItemNum, Quantity, Parts)> Column OrderNum Column ItemNum <!ELEMENT ItemNum (#PCDATA)> Column Quantity <!ELEMENT Quantity (#PCDATA)> Column PartNum Table Parts ==> <!ELEMENT Parts()> Column PartNum Column Price
Finally, we process the foreign key table (Parts):
Tables DTD ================== =================================================== Table Orders <!ELEMENT Orders (Date, CustNum, OrderNum, Items*)> Column OrderNum <!ELEMENT OrderNum (#PCDATA)> Column Date <!ELEMENT Date (#PCDATA)> Column CustNum <!ELEMENT CustNum (#PCDATA)> Table Items <!ELEMENT Items (ItemNum, Quantity, Parts)> Column OrderNum Column ItemNum <!ELEMENT ItemNum (#PCDATA)> Column Quantity <!ELEMENT Quantity (#PCDATA)> Column PartNum Table Parts ==> <!ELEMENT Parts(PartNum, Price)> Column PartNum <!ELEMENT PartNum (#PCDATA)> Column Price <!ELEMENT Price (#PCDATA)>
As was the case in the previous section, the generated DTD is not what a human would have created. Although the only problems here are with names, the algorithm cannot recognize order columns or property tables.
5. Mapping XML Schemas to Databases
Most XML schema languages can be mapped to databases with an object-relational mapping. The exact mappings depend on the language. DDML, DCD, and XML Data Reduced schemas can be mapped in a manner almost identical to DTDs. The mappings for W3C Schemas, Relax, TREX, and SOX appear to be somewhat more complex. It is not clear to me that Schematron can be mapped.
In the case of W3C Schemas, a complete mapping to object schemas and then to database schemas is available. Briefly, this maps complex types to classes (with complex type extension mapped to inheritance) and maps simple types to scalar data types (although many facets are lost). "All" groups are treated like unordered sequences and substitution groups are treated like choices. Finally, most identity constraints are mapped to keys. For complete details, see http://www.rpbourret.com/xml/SchemaMap.htm.
6. Related Topics
For a wider discussion of XML and databases, see:
XML and Databases (http://www.rpbourret.com/xml/XMLAndDatabases.htm)
For a reasonably updated list of XML database products, see:
XML Database Products (http://www.rpbourret.com/xml/XMLDatabaseProds.htm)
If you have comments or questions about this paper, you can reach me at:
rpbourret@rpbourret.com