Building Metadata Applications with RDF
February 12, 2003
The real test of any technology's value is what kinds of tasks are easier with it than without it. If I hear about some new technology, I'm not going to learn it and use it unless it saves me some trouble. Well, being a bit of a geek, I might play with it a bit, but I'm going to lose interest if I don't eventually see tangible proof that it either makes new things possible or old things easier.
I've played with RDF for a while and found some parts interesting and other parts a mess. During this time, I continued to wonder what tasks would be easier with RDF than without it. I came across some answers (for example, various xml-dev postings or some documentation on the Mozilla project), but they usually addressed the issue in fairly abstract terms.
The first time I tried the RDFLib Python libraries, the lightbulb finally flashed on. RDFLib lets you generate, store, and query RDF triples without requiring you to ever deal directly with the dreaded RDF/XML syntax. And you can do all this with a minimal knowledge of Python.
Storing and Using Triples with RDFLib
Daniel Krech developed RDFLib to parse and serialize RDF/XML and to store RDF triples. To install it, unzip the distribution file, run the following command, and you'll be ready to run scripts like those shown in this article:
python setup.py install
The examples here are built around RDFLib 1.2.3. Suppose that you need to aggregate electronic publications to republish their content, and you need to track the incoming documents' metadata. Some of the metadata is embedded in the documents being republished, but some of the documents aren't even XML, so metadata about them must be stored externally.
The makeTriples.py script in Listing 1 demonstrates how easy this is using RDFLib. In addition to the setup declarations, it performs four basic tasks:
- creates an empty RDFLib TripleStore storage object in memory
- stores metadata about three documents in that storage object
- outputs all the information about one of the documents
- saves all the stored metadata in an RDF/XML document
Listing 1: makeTriples.py
#! /usr/bin/python
# makeTriples.py: demonstrate the creation of an RDFLib TripleStore

from rdflib.TripleStore import TripleStore
from rdflib.Literal import Literal
from rdflib.Namespace import Namespace

# Declare namespaces to use.
ns_sn = Namespace("http://www.snee.com/ns/misc#")
ns_sd = Namespace("http://www.snee.com/docs/")
ns_dc = Namespace("http://purl.org/dc/elements/1.1/")
ns_pr = Namespace("http://prismstandard.org/1.0#")

# Create storage object for triples.
store = TripleStore()

# Add triples to store.
store.add((ns_sd["d1001"], ns_dc["title"], Literal("Sample Acrobat document")))
store.add((ns_sd["d1001"], ns_dc["format"], Literal("PDF")))
store.add((ns_sd["d1001"], ns_dc["creator"], Literal("Billy Shears")))
store.add((ns_sd["d1001"], ns_pr["publicationTime"], Literal("2002-12-19")))

store.add((ns_sd["d1002"], ns_dc["title"], Literal("Sample RTF document")))
store.add((ns_sd["d1002"], ns_dc["format"], Literal("RTF")))
store.add((ns_sd["d1002"], ns_dc["creator"], Literal("Nanker Phelge")))
store.add((ns_sd["d1002"], ns_pr["publicationTime"], Literal("2002-12-15")))

store.add((ns_sd["d1003"], ns_dc["title"], Literal("Sample LaTeX document")))
store.add((ns_sd["d1003"], ns_dc["format"], Literal("LaTeX")))
store.add((ns_sd["d1003"], ns_dc["creator"], Literal("Richard Mutt")))
store.add((ns_sd["d1003"], ns_pr["publicationTime"], Literal("2002-12-16")))
store.add((ns_sd["d1003"], ns_sn["quality"], Literal("pretty good")))

# Output information about one document.
docID = "d1003"
print "Information about document " + docID + ":"
for docInfo in store.predicate_objects(ns_sd[docID]):
    print docInfo

# Store saved information in serialized XML.
store.save("articlesIncoming.rdf")
The declaration of the store object is all you need to create a usable TripleStore. To add triples to a TripleStore object like store, you pass the subject-predicate-object triple as a single parameter (specifically, a Python tuple) to the TripleStore class's add method. There's no need to worry about RDF/XML syntax.
The simplicity of creating a container for RDF triples is why some people refer to this as a "freeform database"; you can add attribute name-value pairs for objects without first defining a schema for the attributes you plan to store. If you tracked the metadata about incoming documents using a relational database, you might declare a table with fields for each document's ID, title, format, creator, and publication time. What if your system added records for thousands of documents, and you then wanted to add a new field for "quality"? You'd have to go through all the testing and rollout steps associated with schema maintenance. In makeTriples.py, adding the quality value for document d1003 is as easy as adding any other metadata values.
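The contrast can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions, not part of RDFLib: the table names, column names, and shortened predicate names are hypothetical, and the code is written for Python 3 rather than the Python 2 used in the listings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Relational style: one column per attribute, fixed up front.
# (Hypothetical table and column names, chosen for this sketch.)
conn.execute(
    "CREATE TABLE docs (id TEXT PRIMARY KEY, title TEXT, format TEXT, "
    "creator TEXT, publication_time TEXT)"
)
conn.execute(
    "INSERT INTO docs VALUES (?, ?, ?, ?, ?)",
    ("d1003", "Sample LaTeX document", "LaTeX", "Richard Mutt", "2002-12-16"),
)

# A new attribute means changing the schema before any value can be stored.
conn.execute("ALTER TABLE docs ADD COLUMN quality TEXT")
conn.execute("UPDATE docs SET quality = ? WHERE id = ?", ("pretty good", "d1003"))

# Triple style: every fact is one (subject, predicate, object) row.
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO triples VALUES (?, ?, ?)",
    [
        ("d1003", "title", "Sample LaTeX document"),
        ("d1003", "format", "LaTeX"),
        ("d1003", "quality", "pretty good"),  # new attribute, no schema change
    ],
)
conn.commit()

quality_rows = conn.execute(
    "SELECT object FROM triples WHERE subject = ? AND predicate = ?",
    ("d1003", "quality"),
).fetchall()
print(quality_rows)
```

In the first table, "quality" exists only after the ALTER TABLE statement; in the second, it is just another row.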
The for loop near the end of the script shows how easy it is to output all the data about one of the subjects in this freeform database without knowing the classes of information stored for that subject. It prints all the attribute-value pairs for document d1003 with no need to specify any attribute names. The following shows the output:
Information about document d1003:
('http://purl.org/dc/elements/1.1/title', 'Sample LaTeX document')
('http://purl.org/dc/elements/1.1/format', 'LaTeX')
('http://prismstandard.org/1.0#publicationTime', '2002-12-16')
('http://purl.org/dc/elements/1.1/creator', 'Richard Mutt')
('http://www.snee.com/ns/misc#quality', 'pretty good')
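To make the idea behind that lookup concrete, here is a minimal sketch of an in-memory triple store in plain Python 3. It is not RDFLib's implementation: the TinyTripleStore class is hypothetical, it stores plain strings instead of RDFLib's URI and Literal objects, and it only borrows the add and predicate_objects method names from Listing 1.

```python
class TinyTripleStore:
    """Hypothetical stand-in for a triple store: a set of 3-tuples."""

    def __init__(self):
        self._triples = set()

    def add(self, triple):
        # A triple is a (subject, predicate, object) tuple.
        subject, predicate, obj = triple
        self._triples.add((subject, predicate, obj))

    def predicate_objects(self, subject):
        # Yield every (predicate, object) pair recorded for one subject,
        # sorted only so that the output order is repeatable.
        for s, p, o in sorted(self._triples):
            if s == subject:
                yield (p, o)


DC = "http://purl.org/dc/elements/1.1/"
SN = "http://www.snee.com/ns/misc#"
DOCS = "http://www.snee.com/docs/"

store = TinyTripleStore()
store.add((DOCS + "d1002", DC + "title", "Sample RTF document"))
store.add((DOCS + "d1003", DC + "title", "Sample LaTeX document"))
store.add((DOCS + "d1003", DC + "format", "LaTeX"))
store.add((DOCS + "d1003", SN + "quality", "pretty good"))

# No attribute names are mentioned here; the store reports whatever it has.
for predicate, obj in store.predicate_objects(DOCS + "d1003"):
    print(predicate, obj)
```

Because the query filters on the subject alone, the "quality" pair shows up without the calling code ever naming that predicate.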
Of course, you don't have to add values to your RDF store from strings hardcoded into your script. They might come from data entry screens, a bar code reader, or some other source of data. They can also come from RDF/XML files, as we'll see in the next script.
Reading Other Applications' RDF
In addition to doing the required setup work, the following readDocs.py script does two things. First, it reads in the RDF triples stored in four different documents: the one saved by the makeTriples.py script that we saw earlier and three content documents that have their metadata embedded. Then it outputs the dc:creator and dc:title values for all documents whose prism:publicationTime comes after December 16, 2002.
Listing 2: readDocs.py
#! /usr/bin/python
# readDocs.py: read 4 RDF documents, output data meeting test condition

from rdflib.TripleStore import TripleStore
from rdflib.Namespace import Namespace

ns_dc=Namespace("http://purl.org/dc/elements/1.1/")
ns_pr=Namespace("http://prismstandard.org/1.0#")

store = TripleStore()
store.load("articlesIncoming.rdf")  # Saved by makeTriples.py.
store.load("doc1.xml")              # Three documents
store.load("doc2.xml")              # with
store.load("doc3.xml")              # embedded RDF.

cutOffDate = "2002-12-16"

# For all triples that have prism:publicationTime as their predicate,
for s,o in store.subject_objects(ns_pr["publicationTime"]):
    # if the triples' object is greater than the cutoff date,
    if o > cutOffDate:
        # print the date, author name, and title.
        print o,
        for object in store.objects(s, ns_dc["creator"]):
            print object + ": ",
        for object in store.objects(s, ns_dc["title"]):
            print object
The following shows the output:
"2002-12-18" Nanker Phelge: Sample HTML Document
"2002-12-19" Billy Shears: Sample Acrobat document
"2002-12-17" Nanker Phelge: Sample DocBook document
At first glance, the code in readDocs.py doesn't seem very interesting. It reads some input and then outputs the input data that meets a certain condition. However, if we look in Listings 3, 4, and 5 at the different ways that the doc1.xml, doc2.xml, and doc3.xml files store their RDF triples, the small amount of simple code that RDFLib needs to read and use those triples looks much more impressive. In addition to conforming to three completely different DTDs, these three XML documents use three different RDF/XML conventions to store their triples:
- RDF values in doc1.xml are child elements of RDF's Description element
- RDF values in doc2.xml are child elements of DocBook's articleinfo element
- RDF values in doc3.xml are attributes, not child elements, of RDF's Description element
Listing 3: doc1.xml
<html>
  <head>
    <title>Sample HTML Document</title>
    <meta>
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
               xmlns:prism="http://prismstandard.org/1.0#"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
        <rdf:Description rdf:about="http://http:/www.snee.com/samples/doc1.xml">
          <dc:description>Short HTML document with PRISM RDF embedded. </dc:description>
          <dc:title>Sample HTML Document</dc:title>
          <dc:creator>Nanker Phelge</dc:creator>
          <dc:format>html</dc:format>
          <prism:publicationTime>2002-12-18</prism:publicationTime>
        </rdf:Description>
      </rdf:RDF>
    </meta>
  </head>
  <body>
    <p>This is a test HTML document with some PRISM RDF embedded in the <tt>meta</tt> element.</p>
  </body>
</html>
Listing 4: doc2.xml
<article xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:prism="http://prismstandard.org/1.0#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns="http://www.docbook.org#">
  <rdf:RDF>
    <articleinfo rdf:about="http://http:/www.snee.com/samples/doc2.xml">
      <dc:format>xml</dc:format>
      <prism:publicationTime>2002-12-17</prism:publicationTime>
      <dc:title>Sample DocBook document</dc:title>
      <dc:creator>Nanker Phelge</dc:creator>
    </articleinfo>
  </rdf:RDF>
  <title>Sample DocBook Document</title>
  <para>This is a test DocBook document with some PRISM RDF embedded in the <literal>articleinfo</literal> element.</para>
</article>
Listing 5: doc3.xml
<nitf>
  <head>
    <docdata>
      <!-- docdata content model doesn't really allow the following
           but it's a logical place for them. -->
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
               xmlns:prism="http://prismstandard.org/1.0#"
               xmlns:dc="http://purl.org/dc/elements/1.1/">
        <rdf:Description rdf:about="http://http:/www.snee.com/samples/doc3.xml"
                         dc:description="Short NITF document with PRISM RDF embedded."
                         dc:title="Sample NITF Document"
                         dc:creator="Bob DuCharme"
                         dc:format="xml"
                         prism:publicationTime="2002-12-14" />
      </rdf:RDF>
    </docdata>
  </head>
  <body>
    <body.head>
      <hedline>
        <hl1>Sample NITF Article</hl1>
      </hedline>
    </body.head>
    <body.content>
      <p>This is a test NITF document with some PRISM RDF embedded in the docdata element.</p>
    </body.content>
  </body>
</nitf>
As the author of the readDocs.py script in Listing 2, I really don't care about these differences. Again, I let RDFLib's TripleStore class handle them for me.
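To see why those differences can be hidden from the script's author, consider this deliberately simplified Python 3 sketch, which uses only the standard library's ElementTree module. It is not how RDFLib parses RDF/XML, and it is far from a conforming parser: the extract_triples and to_uri functions and the example.org URI are hypothetical, and the code ignores features such as rdf:resource, nested descriptions, and the rdf:type triple implied by a typed node like articleinfo. It only shows that properties written as child elements and properties written as attributes reduce to the same triples.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"


def to_uri(clark_name):
    """Turn ElementTree's '{namespace}local' form into namespace + local."""
    if clark_name.startswith("{"):
        namespace, local = clark_name[1:].split("}", 1)
        return namespace + local
    return clark_name


def extract_triples(xml_text):
    """Return a set of (subject, predicate, object) string tuples."""
    triples = set()
    root = ET.fromstring(xml_text)
    for rdf_root in root.iter("{%s}RDF" % RDF_NS):
        for node in rdf_root:
            subject = node.get("{%s}about" % RDF_NS)
            if subject is None:
                continue
            # Properties written as attributes on the node element.
            for name, value in node.attrib.items():
                if not name.startswith("{" + RDF_NS + "}"):
                    triples.add((subject, to_uri(name), value))
            # Properties written as child elements of the node element.
            for child in node:
                text = (child.text or "").strip()
                triples.add((subject, to_uri(child.tag), text))
    return triples


AS_ELEMENTS = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/doc">
    <dc:title>Sample</dc:title>
  </rdf:Description>
</rdf:RDF>"""

AS_ATTRIBUTES = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <rdf:Description rdf:about="http://example.org/doc" dc:title="Sample"/>
</rdf:RDF>"""

print(extract_triples(AS_ELEMENTS) == extract_triples(AS_ATTRIBUTES))
```

Because the sketch treats any child of rdf:RDF that carries an rdf:about attribute as a subject, it also picks up the properties of an element like doc2.xml's articleinfo.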
RDF's ability to assign attribute name-value pairs to anything that can be identified with a URI always appealed to me, but the mechanics of generating, reading, and using the subject-predicate-object triples seemed like too much trouble. Now that I've found a tool that makes it easy, I have a new perspective on RDF. And RDFLib isn't the only tool that lets you do this. A variety of tools are available for processing RDF in different languages: the Repat C library; Hewlett-Packard's Jena and David Megginson's RDF Filter for Java; and, in addition to RDFLib, 4Suite for Python. And many people don't know about the exposed RDF support built right into Mozilla!
Conversations about the value of RDF often veer off on two distracting detours: debates about the architecture and syntax of RDF/XML and debates about the potential value of the Semantic Web. If you ignore the latter and let tools like RDFLib shield you from the less appealing details of the former, RDF's value becomes much more readily apparent, and its increasing success in the metadata community starts to make more sense. I look forward to more work with it.