Introduction to XFML
January 22, 2003
XFML is a simple XML format for exchanging metadata in the form of faceted hierarchies, sometimes called taxonomies. Its basic building blocks are topics, also called categories. XFML won't solve all your metadata needs. It's focused on interchanging faceted classification and indexing data. XFML addresses the following problems with basic hierarchical classification:
- Creating and maintaining a good topic hierarchy is a lot of work; ask any librarian.
- Indexing (categorizing) large amounts of content consistently is even harder. See Cory Doctorow's "Metacrap".
- Creating a centralized hierarchy to organize a large amount of information doesn't scale. (If you think Yahoo's hierarchy scales, ask yourself why you keep turning to Google.)
XFML provides a simple format to share classification and indexing data. It also provides two ways to build connections between topics, information that lets you write clever tools to automate the sharing of indexing efforts. It's based on the principles of faceted classification, addressing many of the scaling issues with simple hierarchies.
What is Faceted Classification?
Facets sound scary and librarian-like, but they are really just a common-sense approach to classifying things. Instead of building one huge tree of topics, a faceted classification uses multiple smaller trees (each tree is called a facet) that the user can then combine to find things more easily.
Say you're building a travel site about the USA. You could build a hierarchy to browse it that looks something like this:
- USA
  - New York
    - Bars
      - Blues music
      - Latin music
    - Restaurants
      - Blues music
      - Latin music
  - L.A.
    - Bars
      - Blues music
      - Latin music
    - Restaurants
      - Blues music
      - Latin music
If you're going to New York and want to find a blues bar, browsing this hierarchy will work just fine for you. That's because it's organized by city first, type of place second, and type of music third, which is exactly what you happen to need. But if you're about to visit the USA and want to decide which city to go to based on its blues bars, our classification breaks down. You first want to select your type of music, not your city. Unless there's a good search, you will have to browse every single city looking for blues bars, which is neither elegant nor user friendly.
Combining different types of information (city, type of music, type of place) in one big hierarchy can never address all possible information needs. Faceted classification addresses this problem by providing separate facets that can be combined in the user interface. For example:
City (City is a facet)
- New York (New York is a topic within the facet City)
- L.A.
Type of place
- Bars
- Restaurants
Type of music
- Blues
- Latin
By combining these facets, a user could view all bars in New York, all places that have Latin music throughout the country, or any other combination. Things have suddenly become a lot more interesting. If you want to know what an interface for this can look like, check out Facetmap, a tool that automatically generates four ways of browsing the same faceted classification. You can even upload XFML files to it.
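The idea of combining facets can be sketched in a few lines of Python. This is only an illustration, not how Facetmap or any other tool works internally: the `places` list, its entries, and the `select` helper are all hypothetical names invented for this example.

```python
# Hypothetical sample data: each place carries one topic per facet.
places = [
    {"name": "Example Blues Bar", "city": "New York", "place": "Bars", "music": "Blues"},
    {"name": "Example Latin Restaurant", "city": "New York", "place": "Restaurants", "music": "Latin"},
    {"name": "Example Latin Bar", "city": "L.A.", "place": "Bars", "music": "Latin"},
    {"name": "Example Blues Grill", "city": "L.A.", "place": "Restaurants", "music": "Blues"},
]


def select(items, **wanted):
    """Return the items that match every requested facet=topic pair."""
    return [
        item
        for item in items
        if all(item.get(facet) == topic for facet, topic in wanted.items())
    ]


# All bars in New York.
print([p["name"] for p in select(places, city="New York", place="Bars")])

# All places with Latin music, regardless of city or type of place.
print([p["name"] for p in select(places, music="Latin")])
```

Because every facet is an independent key, the user can start from any facet, which is exactly what the single big hierarchy could not offer.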
How XFML Works
The XFML core spec gives an introduction, defines the concepts, and specifies the XML format. The spec is stable and frozen, which means you can safely build applications that use it.
An empty XFML Core document looks like this:
<?xml version="1.0" ?>
<xfml version="1.0"
url="http://domain.com/xfml/map1.xml" language="en-us">
</xfml>
It's a valid XML document and conforms to the XFML Core DTD. The url attribute is required; it's the URL where the original XFML document can be found. To be nice we add a comment pointing to the XFML Core spec:
<?xml version="1.0" ?>
<xfml version="1.0"
url="http://domain.com/xfml/map1.xml" language="en-us">
<!-- This document conforms to XFML Core. See http://purl.oclc.org/NET/xfml/core/ -->
</xfml>
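As a rough sketch of how a program might read such a document, the following Python uses the standard library's `xml.etree.ElementTree`. The function name `read_map_header` is an assumption made for this example, and the check only covers the required url attribute mentioned above; it does not validate against the DTD.

```python
import xml.etree.ElementTree as ET

XFML_DOC = """<?xml version="1.0" ?>
<xfml version="1.0"
      url="http://domain.com/xfml/map1.xml" language="en-us">
</xfml>"""


def read_map_header(text):
    """Parse an XFML document and return the attributes of its root element.

    Hypothetical helper: it checks only the root tag and the required
    url attribute, nothing else from the XFML Core DTD.
    """
    root = ET.fromstring(text)
    if root.tag != "xfml":
        raise ValueError("root element is not <xfml>")
    url = root.get("url")
    if url is None:
        raise ValueError("the url attribute is required")
    return {
        "version": root.get("version"),
        "url": url,
        "language": root.get("language"),
    }


print(read_map_header(XFML_DOC))
```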
Facets and Topics
The building blocks of a faceted hierarchy in XFML are facets and topics. A facet is the top node of each tree. The nodes in the tree are called topics. XFML can define multiple hierarchies, and each hierarchy is a facet. Our hierarchy expressed in XFML looks like this:
<facet id="city">City</facet>
<facet id="place">Type of place</facet>
<facet id="music">Type of music</facet>
<topic id="ny" facetid="city"><name>New York</name></topic>
<topic id="la" facetid="city"><name>Los Angeles</name></topic>
<topic id="bar" facetid="place"><name>bar</name></topic>
<topic id="restaurant" facetid="place"><name>restaurant</name></topic>
<topic id="blues" facetid="music"><name>blues</name></topic>
<topic id="latin" facetid="music"><name>latin</name></topic>
Topics have a child element called <name> and facets don't because topics can have other child elements. We'll get to those later. Facet and topic ids are declared in the DTD with the XML ID type, so they must be unique within the document and cannot contain spaces or start with a digit. The facetid attribute for topics is required.
You can add unlimited topic hierarchies within a facet, using the parentTopicid attribute:
<topic id="ny" facetid="city"><name>New York</name></topic>
<topic id="brooklyn" facetid="city" parentTopicid="ny"><name>Brooklyn</name></topic>
<topic id="brooklyn_heights" facetid="city" parentTopicid="brooklyn"><name>Brooklyn Heights</name></topic>
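To show how software might follow the parentTopicid links, here is a minimal Python sketch that builds the path from a topic up to the top of its hierarchy. The helper name `topic_path` and the cycle check are assumptions added for this example; the spec may recommend a different processing approach.

```python
import xml.etree.ElementTree as ET

XFML_DOC = """<xfml version="1.0" url="http://domain.com/xfml/map1.xml" language="en-us">
  <facet id="city">City</facet>
  <topic id="ny" facetid="city"><name>New York</name></topic>
  <topic id="brooklyn" facetid="city" parentTopicid="ny"><name>Brooklyn</name></topic>
  <topic id="brooklyn_heights" facetid="city" parentTopicid="brooklyn"><name>Brooklyn Heights</name></topic>
</xfml>"""


def topic_path(root, topic_id):
    """Return topic names from the top-level topic down to topic_id."""
    topics = {t.get("id"): t for t in root.iter("topic")}
    path = []
    seen = set()
    current = topic_id
    while current is not None:
        if current in seen:
            # Defensive check (an assumption): a broken map could loop forever.
            raise ValueError("cycle in parentTopicid links at " + current)
        seen.add(current)
        topic = topics[current]
        path.append(topic.findtext("name"))
        current = topic.get("parentTopicid")
    return list(reversed(path))


root = ET.fromstring(XFML_DOC)
print(" > ".join(topic_path(root, "brooklyn_heights")))
# prints: New York > Brooklyn > Brooklyn Heights
```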
So when do you make a hierarchy of topics become a facet? The spec says, when describing the facet concept, that "[f]acets are mutually exclusive containers that contain hierarchies of topics. Mutually exclusive means that a certain topic can only possibly belong to one facet". The mutual exclusivity requirement is semantic: it can't be (realistically) enforced by software. It means that you should separate out a new facet when you are describing topics that can be usefully combined. Type of music and city are mutually exclusive facets because a topic in type of music (Latin) can never be a topic in city (New York). Note that the mutual exclusivity requirement does not mean that pages (see next section) can only have occurrences in one facet.
Pages
Once you have some facets and topics defined, you will want to classify or index some web pages and add them to your XFML document so your indexing efforts can be shared. You can only classify things that have a URI. Each URI (we call them pages, but you can use other file types as well) can be classified under multiple topics. The homepage of the B.B. King Blues Club and Grill in New York can be classified under the ny, bar, and blues topics. We say these topics occur on the page and we call them topic occurrences:
<page url="http://bbkingblues.com/">
  <title>B.B. King Blues Club and Grill</title>
  <description>Conveniently located in the heart of Times Square near Penn Station and Port Authority, The B.B. King Blues Club and Grill offers music fans a unique experience. Owned by the Bensusan Family, proprietors of the world renowned Blue Note Jazz Club, the club features world-class musical talent and consists of two distinct spaces: the Showcase Room and Lucille’s Grill.</description>
  <occurrence topicid="bar" />
  <occurrence topicid="blues" />
  <occurrence topicid="ny" />
</page>
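A consumer of this data will usually want to turn pages and occurrences into an index from topic to URLs. The Python sketch below shows one way to do that. The second page in the sample document and the helper names `pages_by_topic` and `pages_with_all` are hypothetical, and the facet and topic definitions are left out of the sample for brevity.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

XFML_DOC = """<xfml version="1.0" url="http://domain.com/xfml/map1.xml" language="en-us">
  <page url="http://bbkingblues.com/">
    <title>B.B. King Blues Club and Grill</title>
    <occurrence topicid="bar" />
    <occurrence topicid="blues" />
    <occurrence topicid="ny" />
  </page>
  <page url="http://example.com/latin-restaurant/">
    <title>Hypothetical Latin restaurant</title>
    <occurrence topicid="restaurant" />
    <occurrence topicid="latin" />
    <occurrence topicid="ny" />
  </page>
</xfml>"""


def pages_by_topic(root):
    """Build a dict mapping each topic id to the URLs it occurs on."""
    index = defaultdict(list)
    for page in root.iter("page"):
        for occurrence in page.findall("occurrence"):
            index[occurrence.get("topicid")].append(page.get("url"))
    return dict(index)


def pages_with_all(root, *topic_ids):
    """Return the URLs classified under every one of the given topics."""
    index = pages_by_topic(root)
    url_sets = [set(index.get(topic_id, [])) for topic_id in topic_ids]
    return sorted(set.intersection(*url_sets)) if url_sets else []


root = ET.fromstring(XFML_DOC)
print(pages_with_all(root, "ny", "blues"))
# prints: ['http://bbkingblues.com/']
```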
The mapInfo Element
mapInfo is an optional element containing administrative metadata about the map. Usage is simple; check the spec for details. For our example, mapInfo could look something like this:
<mapInfo>
  <managingEditor>
    <name>Joe Blogs</name>
    <email>feedback@joeblogs.com</email>
    <url>http://joeblogs.com/</url>
  </managingEditor>
  <license>
    <name>GNU Free Documentation License</name>
    <url>http://www.gnu.org/licenses/fdl.html</url>
  </license>
</mapInfo>
The mapInfo element can also contain child elements describing additional editors, a technical contact, the owner of the map, and the software used to generate the map.
Distributed Metadata
What we have so far (facets, topics, pages, and occurrences) lets us build a file that provides some interesting metadata for others to reuse. Typically you will write some code that regularly downloads an updated XFML file from web sites with similar topics to yours, then takes all the topic occurrences that are relevant to your topics and copies those occurrences to your XFML document. That's how you can automate the reuse of indexing efforts.
There is a problem though. If site A wants to reuse the indexing work of site B, they have to use exactly the same topics. That's not how the world works. Site A might have topics "blues" and "latin", and site B might have topics "blues & jazz" and "Latino". They probably mean the same thing, and B might want to reuse the indexing of A, but how can your code know which topic occurrences to reuse?
XFML provides two answers. You can create direct connections between two topics in different maps, indicating that for example the topic "latin" in map A is equal to the topic "Latino" in map B. You can also create implicit connections by pointing a topic to a web page that describes that topic, for example a page with the dictionary definition for Latino. The software can then infer that any topics it finds that point to that same page are really the same topic, no matter what the topic is called.
These two approaches mean that you can create a web of loosely distributed metadata, which is how XFML attempts to address the problems with centralized hierarchies.
Connecting Topics
The first approach to reusing indexing efforts is to connect individual topics between maps. The connect element is a child of the topic element; its content is the concatenation of three strings: the URL of another map, the "#" character, and the id of a topic in that map:
<topic id="latin" facetid="music">
<name>latin</name>
<connect>http://domainb.com/mapb.xml#latino</connect>
</topic>
A topic can contain multiple connect elements.
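Since the content of connect is a map URL, a "#", and a topic id, software has to split it back into its parts. A minimal Python sketch follows; the function name `parse_connect` and its error handling are assumptions for this example, not something the spec prescribes.

```python
def parse_connect(value):
    """Split the content of a <connect> element into (map URL, topic id).

    Splits on the last "#" so that the topic id is whatever follows it.
    """
    map_url, separator, topic_id = value.strip().rpartition("#")
    if not separator or not map_url or not topic_id:
        raise ValueError("expected '<map url>#<topic id>', got: " + repr(value))
    return map_url, topic_id


print(parse_connect("http://domainb.com/mapb.xml#latino"))
# prints: ('http://domainb.com/mapb.xml', 'latino')
```

With the map URL in hand, a tool can download that map, look up the topic by id, and treat its occurrences as candidates for the local topic.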
Published Subject Indicators
The second approach to reusing indexing efforts is to point a topic to a resource on the web that describes it; in other words, to point to a published subject indicator represented by the psi element.
<topic id="latin" facetid="music">
<name>latin</name>
<psi>http://dictionary.reference.com/search?q=latino</psi>
</topic>
A topic can have multiple psi elements. It can even have multiple connect and psi elements: the more psi or connect elements it has, the higher the value of your XFML document. Also note that, once you have established a connection with a topic in another map (through <connect> or a common <psi>), your software can safely copy all of the <psi> and <connect> elements from that topic to your topic. Two topics in the same map are not allowed to have the same <psi> or <connect> elements. Automatically copying <connect> or <psi> elements can occasionally produce contradictions through network effects, but those can be resolved by presenting a choice to the administrator when that happens.
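The inference through a shared psi can be sketched as follows in Python. Both sample maps, the salsa club page, and the helper names (`psi_index`, `match_topics`, `importable_occurrences`) are hypothetical, and a real importer would also need to handle <connect> elements and the contradictions mentioned above.

```python
import xml.etree.ElementTree as ET

MAP_A = """<xfml version="1.0" url="http://domaina.com/mapa.xml" language="en-us">
  <facet id="music">Type of music</facet>
  <topic id="latin" facetid="music">
    <name>latin</name>
    <psi>http://dictionary.reference.com/search?q=latino</psi>
  </topic>
</xfml>"""

MAP_B = """<xfml version="1.0" url="http://domainb.com/mapb.xml" language="en-us">
  <facet id="genre">Genre</facet>
  <topic id="latino" facetid="genre">
    <name>Latino</name>
    <psi>http://dictionary.reference.com/search?q=latino</psi>
  </topic>
  <page url="http://example.com/salsa-club/">
    <title>Hypothetical salsa club</title>
    <occurrence topicid="latino" />
  </page>
</xfml>"""


def psi_index(root):
    """Map each psi URL to the ids of the topics that point to it."""
    index = {}
    for topic in root.iter("topic"):
        for psi in topic.findall("psi"):
            url = (psi.text or "").strip()
            if url:
                index.setdefault(url, set()).add(topic.get("id"))
    return index


def match_topics(local_root, remote_root):
    """Return sorted (local id, remote id) pairs of topics sharing a psi."""
    local, remote = psi_index(local_root), psi_index(remote_root)
    pairs = set()
    for url in local.keys() & remote.keys():
        for local_id in local[url]:
            for remote_id in remote[url]:
                pairs.add((local_id, remote_id))
    return sorted(pairs)


def importable_occurrences(local_root, remote_root):
    """List (page URL, local topic id) pairs that could be copied locally."""
    remote_to_local = {}
    for local_id, remote_id in match_topics(local_root, remote_root):
        remote_to_local.setdefault(remote_id, set()).add(local_id)
    found = set()
    for page in remote_root.iter("page"):
        for occurrence in page.findall("occurrence"):
            for local_id in remote_to_local.get(occurrence.get("topicid"), ()):
                found.add((page.get("url"), local_id))
    return sorted(found)


map_a, map_b = ET.fromstring(MAP_A), ET.fromstring(MAP_B)
print(match_topics(map_a, map_b))
print(importable_occurrences(map_a, map_b))
```

Here the topics are matched even though one is called "latin" and the other "Latino", because the comparison looks only at the psi URL.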
Using XFML
Don't try to fit all your internal metadata into the XFML format. It's an export format like RSS, and your database will surely have more fields than XFML can handle. That's okay. If you want a format that can handle (almost) all your metadata, check out Topic Maps or RDF. When programming XFML support into your system, check the processing instructions in the spec. They are just recommendations, however; you may come up with better ways of doing things.
Exporting XFML is easy; often you can just add a template to your content management system and leave it at that. A (somewhat rough) example template for Movable Type took about half an hour to hack together. Most content management systems don't support faceted classification internally, so you are limited in the richness of metadata you can export. However, you can automatically generate data for facets like date of publication, length of entry, number of comments, and so on; or, if you have categories that don't change often, hardcode the facets and just generate occurrences.
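If your system has no template mechanism, an export can also be generated in code. The following Python sketch builds a small XFML document from plain dictionaries; the function `export_xfml`, the shape of its arguments, and the sample data are all assumptions for illustration, and the output is not validated against the XFML Core DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical data as it might come out of a content management system.
FACETS = {"music": "Type of music"}
TOPICS = {"blues": ("music", "blues"), "latin": ("music", "latin")}
PAGES = {"http://bbkingblues.com/": ("B.B. King Blues Club and Grill", ["blues"])}


def export_xfml(map_url, facets, topics, pages, language="en-us"):
    """Serialize facets, topics, and pages into an XFML 1.0 string."""
    root = ET.Element("xfml", {"version": "1.0", "url": map_url, "language": language})
    for facet_id, label in facets.items():
        ET.SubElement(root, "facet", {"id": facet_id}).text = label
    for topic_id, (facet_id, name) in topics.items():
        topic = ET.SubElement(root, "topic", {"id": topic_id, "facetid": facet_id})
        ET.SubElement(topic, "name").text = name
    for url, (title, topic_ids) in pages.items():
        page = ET.SubElement(root, "page", {"url": url})
        ET.SubElement(page, "title").text = title
        for topic_id in topic_ids:
            ET.SubElement(page, "occurrence", {"topicid": topic_id})
    return ET.tostring(root, encoding="unicode")


print(export_xfml("http://domain.com/xfml/map1.xml", FACETS, TOPICS, PAGES))
```

Building the tree with a library rather than string concatenation means special characters in titles and names are escaped automatically.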
When you make XFML feeds available on your site, indicate them with an XFML button and add a link element in your HTML as described here for auto discovery purposes.
Expect some experimentation when importing XFML and automating indexing work: you'll be traveling in unknown territory. Taxomita is currently the only tool under development that does advanced importing of XFML. However, importing is the cutting edge. This is where you take advantage of the real strength of XFML, namely, distributed metadata. Importing will allow you to use the information in the <connect> and <psi> elements to automatically expand your metadata without resorting to a central list of metadata. We expect exciting things to happen in this area in 2003.
The XFML.org website has a page with tools that support the standard. Livetopics (a plug-in for Radio UserLand) and Drupal (a content management system) export XFML. Facetmap lets you import and browse XFML files, and Taxomita is an upcoming authoring tool built around XFML. Templates and code libraries are being developed for a variety of environments.
XFML Core (XFML version 1.0) is the first version of XFML. Work is being done on XFML 2.0, but that version won't be finished for at least another year. It may feature elements to describe controlled vocabularies and more ways to distribute metadata. Check the XFML mailing list for the latest developments.
Conclusion
XFML is a simple standard to exchange faceted, hierarchical metadata. What makes it different is the way it addresses specific problems with metadata authoring by allowing for distributed metadata through the <connect> and <psi> elements. It is designed to be easy to code for and is already supported by a number of tools.
To get started with XFML, I recommend writing an XFML file by hand and uploading it to Facetmap. There's nothing like seeing this in action to get your head around the possibilities. After that, try exporting your existing data (if you have a site with some existing metadata) as XFML or play around with some of the available tools.
The XFML site has a page with relevant links to learn more about XFML and faceted classification. Let me just highlight the Faceted Classification mailing list, an excellent (non-techie) list about faceted classification, as well as Mark Pilgrim's Really Understandable Introduction to XFML.