Parsing Microformats
September 4, 2007
Microformats are a way to embed specific semantic data into the HTML that we use today. One of the first questions an XML guru might ask is "Why use HTML when XML lets you create the same semantics?" I won't go into all the reasons XML might be a better or worse choice for encoding data or why microformats have chosen to use HTML as their encoding base. This article will focus more on how to extract microformats data from the HTML, how the basic parsing rules work, and how they differ from XML.
Contact Information in HTML
One of the more popular and well-established microformats is hCard. This is a vCard representation in HTML, hence the h in hCard: HTML vCard. You can read more about hCards on the microformats wiki. A vCard contains basic information about a person or an organization. This format is used extensively in address book applications as a way to back up and interchange contact information. By Internet standards it's an old format: the specification is RFC 2426, from 1998. It is pre-XML, so the syntax is just simple text with a few delimiters and start and end elements. We'll use my information for this example.
BEGIN:VCARD
FN:Brian Suda
N:Suda;Brian;;;
URL:http://suda.co.uk
END:VCARD
This vCard file has a BEGIN:VCARD and an END:VCARD that act as a container so the parser knows when to stop looking for more data. There might be multiple vCards in one file, so this nicely groups the data into distinct vCards. The FN stands for Formatted Name, which is used as the display name. The N is the structured name, which encodes things like first, last, and middle names, prefixes, and suffixes, all semicolon separated. Finally, URL is the URL of the web site associated with this contact.
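To make the container and property structure concrete, here is a minimal Python sketch that reads vCard text like the example above. The function name parse_vcards is hypothetical, and the sketch assumes one property per line with no parameters or folded lines, which real vCard files may contain.

```python
def parse_vcards(text):
    """Split vCard text into a list of {property: value} dicts, one per card.

    Assumption: one "NAME:value" pair per line, no parameters, no line folding.
    """
    cards, current = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first colon only, so URLs keep their own colons.
        name, _, value = line.partition(":")
        name = name.upper()
        if name == "BEGIN" and value.upper() == "VCARD":
            current = {}                      # a new container starts
        elif name == "END" and value.upper() == "VCARD":
            if current is not None:
                cards.append(current)         # the container is complete
            current = None
        elif current is not None:
            current[name] = value
    return cards


sample = """BEGIN:VCARD
FN:Brian Suda
N:Suda;Brian;;;
URL:http://suda.co.uk
END:VCARD"""

cards = parse_vcards(sample)
# N is semicolon separated: family name first, then given name.
family, given, *rest = cards[0]["N"].split(";")
```

The BEGIN and END lines are what let the loop keep several cards in one file apart.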
If we were to encode this in XML it would probably look something like this:
<vcard>
  <fn>Brian Suda</fn>
  <n>
    <given-name>Brian</given-name>
    <family-name>Suda</family-name>
  </n>
  <url>http://suda.co.uk</url>
</vcard>
Let's see how we can mark up the same vCard data in HTML using microformats, which make extensive use of the rel, rev, and class attributes to help encode the semantics. The class attribute is used in much the same way as elements are used in XML. So the previous XML example might be marked up in HTML as:
<div class="vcard">
  <div class="fn">Brian Suda</div>
  <div class="n">
    <div class="given-name">Brian</div>
    <div class="family-name">Suda</div>
  </div>
  <div class="url">http://suda.co.uk</div>
</div>
If that were all microformats did, it wouldn't be very interesting. Instead, microformats make use of the semantics of existing HTML elements to explain where the encoded data can be found. In this example everything is a <div>, but it doesn't have to be. This makes extracting data from the HTML slightly more difficult for parsers, but it makes things easier for publishers. Microformats do not force publishers to change their current HTML structure or publishing behavior. At the end of the day, there will be orders of magnitude more people writing HTML than writing parsers, so why not make it as easy as possible for the publishers?
It bugs me when I look at the previous XML example and see "Brian Suda" encoded twice, once for FN and again for N. With HTML this isn't a problem: we can combine those two XML elements using space-separated values in the class attribute. It is a little-known fact that the class, rel, and rev attributes in HTML can actually take a space-separated list of values. If we combine the FN and N we get something like this:
<div class="n fn">
  <div class="given-name">Brian</div>
  <div class="family-name">Suda</div>
</div>
Now the N property still has its children and the FN has the same value as before. Remember, HTML collapses whitespace, so the FN is still "Brian Suda" even though it is now spread over two <div>s with whitespace between them.
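As a rough illustration of that whitespace collapsing, the Python sketch below gathers all the text under the combined element and squeezes runs of whitespace into single spaces. The helper name node_value is hypothetical, and the snippet assumes the markup is already well-formed so the standard library XML parser can read it.

```python
import xml.etree.ElementTree as ET

snippet = """
<div class="n fn">
  <div class="given-name">Brian</div>
  <div class="family-name">Suda</div>
</div>
"""


def node_value(element):
    """Concatenate all descendant text, then collapse runs of whitespace."""
    raw = "".join(element.itertext())   # includes newlines and indentation
    return " ".join(raw.split())        # split() drops every whitespace run


fn = node_value(ET.fromstring(snippet))
```

Here fn comes out as the single string a reader would see in the browser, while the children remain available for the N subproperties.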
So, we have sorted out how to condense multiple properties that share the same value. The next thing that bothers me about the XML example is that the raw URL is displayed, which doesn't seem natural. In XML we are talking about data, but the HTML is being displayed to people in a browser. Conveniently, there is an <a> element, which has an href attribute that takes the URL value and also a node-value to display more human-friendly text. We can further refine our HTML example to include the URL by switching the <div> to an <a> element.
<a class="n fn url" href="http://suda.co.uk">
  <span class="given-name">Brian</span>
  <span class="family-name">Suda</span>
</a>
After switching to the <a> element, we needed to change the child <div>s to <span>s because the <a> element can only contain inline elements as children. Microformats do not force publishers to use specific elements, but it is recommended that you use the most semantic element for each case. For URL data, it makes the most sense to use an <a> element. Because of this, the parsing rules change slightly (we'll discuss this in a bit).
The final hCard microformat might look something like the following in HTML:
<div class="vcard">
  <a class="n fn url" href="http://suda.co.uk">
    <span class="given-name">Brian</span>
    <span class="family-name">Suda</span>
  </a>
</div>
To me, this is much more intuitive, simpler, and more compact than the XML example at the start. People are already publishing blogrolls and links in this manner and all browsers recognize and style this information, plus it can easily be passed around inside a feed.
Parsing with XSLT
Let's take that HTML example and try to parse it using XSLT.
Microformats are designed to work with HTML 4 and up. The downside to using XSLT is that the document needs to be well-formed, and HTML 4 does not require that. HTML 4 can use <br>, <img>, and <hr> elements without closing tags. If you were using a different technology, such as regular expressions or the DOM, to extract microformats, this would be a separate issue, but with XSLT we need to clean up the HTML first. There are two simple ways to do this: run the page through Tidy, or load it with a function like HTMLlib or loadHTML. Either will load the HTML document and convert it into a usable state for XSLT.
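To show what "cleaning up" means in the simplest case, here is a toy Python sketch that re-serializes HTML so that empty elements such as <br>, <img>, and <hr> are self-closed. It is not a replacement for Tidy: the class name WellFormer, the VOID set, and the sample markup (including the image file name) are assumptions made for illustration, and the sketch does nothing about other problems such as unclosed <p> or <li> elements.

```python
from html.parser import HTMLParser
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Elements that HTML 4 allows without a closing tag (partial list, assumed).
VOID = {"br", "img", "hr", "input", "meta", "link"}


class WellFormer(HTMLParser):
    """Re-emit the parsed HTML with void elements written as <tag/>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        attr_text = "".join(f" {k}={quoteattr(v or '')}" for k, v in attrs)
        closer = "/>" if tag in VOID else ">"
        self.out.append(f"<{tag}{attr_text}{closer}")

    def handle_endtag(self, tag):
        if tag not in VOID:             # void elements were already closed
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data))   # re-escape &, <, > for XML


def to_wellformed(html):
    parser = WellFormer()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


# Hypothetical input; "me.png" is a made-up file name.
messy = '<div class="vcard">Brian<br>Suda<hr><img src="me.png" alt="me"></div>'
clean = to_wellformed(messy)
root = ET.fromstring(clean)             # now acceptable to an XML parser
```

Once the markup parses as XML, the XPath and XSLT techniques in the rest of this section can be applied to it.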
Now that we know we have a well-formed HTML document, we can begin to extract the microformat data. The following is a very rough XSLT that is far from comprehensive, but it should get you started. For more information you can see the microformats.org wiki page about parsing or use the XSLT templates that do most of the heavy-lifting data extraction (available at hg.microformats.org).
All the data inside an hCard is contained within the element that has a class of "vcard". In our example this is a <div>, but it could be any element, so we'll start with:
//*[@class="vcard"]
This XPath expression looks for any element anywhere in the tree that has a class equal to "vcard". At first glance, this should find all the hCards, but the problem is that the class attribute can take a space-separated list of values. So, class="vcard myStyle" would not be picked up by that XPath expression. To fix this we can use the contains function.
//*[contains(@class,"vcard")]
This is better: now we find any element whose class attribute contains the term "vcard". This will successfully find the "vcard" in class="vcard myStyle", but there is still a problem. The contains function is not word-safe; it is a substring match. So, class="my-vcard" would be found by contains() just the same as class="vcard", even though "my-vcard" is not the proper class name for an hCard microformat. That is a false positive. To fix this we need to work some magic: pad the class value with a space on each side, then search for the term with spaces around it. It sounds complicated, but it really isn't.
//*[contains(concat(" ",@class," "), " vcard ")]
With padding, class="my-vcard" becomes " my-vcard " and does not contain the substring " vcard ", which solves the substring problem. In the other instance, class="vcard myStyle" becomes " vcard myStyle ", which does contain " vcard ", so the issue of space-separated values in a class is also solved with the padding technique.
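The three matching strategies can be compared side by side in a few lines of Python. This is only a sketch of the string logic, not of XPath itself, and the function names are hypothetical.

```python
def contains_substring(class_attr, name):
    """Like contains(@class, 'name'): a plain substring test."""
    return name in class_attr


def has_class_padded(class_attr, name):
    """Like contains(concat(' ', @class, ' '), ' name ')."""
    return f" {name} " in f" {class_attr} "


def has_class_tokens(class_attr, name):
    """Split on whitespace and compare whole tokens."""
    return name in class_attr.split()


samples = ["vcard", "vcard myStyle", "my-vcard"]
results = {
    s: (contains_substring(s, "vcard"), has_class_padded(s, "vcard"))
    for s in samples
}
```

Only the substring test accepts "my-vcard". One caveat with the padded form is that it looks for literal spaces, so a class value separated by tabs or newlines would be missed; in XPath 1.0, wrapping @class in normalize-space() is one way to guard against that.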
Now that we know how to find the data, let's loop through each hCard using XSLT and begin to extract it into vCard output. At this point, it is pretty easy to see how using XSLT can let you easily convert this HTML data into just about any format you want. This includes other HTML, XML, RDF, flat vCard text, CSV, SPARQL results, JSON, or just about anything else your heart desires.
The for-each will find all instances of an hCard on the page and create a new vCard for each one. While creating each vCard it applies the templates to look for any properties inside an hCard, such as FN, N, and URL.
<xsl:for-each select="//*[contains(concat(' ',@class,' '), ' vcard ')]">
  <!-- vCard is line-based, so each property ends with a newline (&#10;) -->
  <xsl:text>BEGIN:VCARD&#10;</xsl:text>
  <xsl:apply-templates />
  <xsl:text>END:VCARD&#10;</xsl:text>
</xsl:for-each>
The FN is a simple template that extracts the node-value of the element that contains FN as a class value.
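<xsl:template match="//*[contains(concat(' ',@class,' '), ' fn ')]">
  <!-- normalize-space collapses whitespace the way HTML rendering does -->
  <xsl:text>FN:</xsl:text><xsl:value-of select="normalize-space(.)"/>
  <xsl:text>&#10;</xsl:text>
</xsl:template>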
The N template is slightly more complex. It first has to look for an element with a class containing N. Then it looks for child elements that contain subproperties of N, such as family-name and given-name, and outputs those values.
<xsl:template match="//*[contains(concat(' ',@class,' '), ' n ')]">
  <xsl:text>N:</xsl:text>
  <!-- .// limits the search to descendants of the current N element -->
  <xsl:value-of select=".//*[contains(concat(' ',@class,' '), ' family-name ')]"/>
  <xsl:text>;</xsl:text>
  <xsl:value-of select=".//*[contains(concat(' ',@class,' '), ' given-name ')]"/>
  <xsl:text>;;;&#10;</xsl:text>
</xsl:template>
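The same two-step lookup, first the N element and then the subproperties beneath it, can be sketched in Python. The helper names are hypothetical, and the second contact ("Jane Doe") is invented only to show why the subproperty search should be limited to the current N element rather than the whole page.

```python
import xml.etree.ElementTree as ET


def has_class(element, name):
    return name in element.get("class", "").split()


def find_by_class(root, name):
    """Descendants of root (not root itself) that carry the class token."""
    return [el for el in root.iter() if el is not root and has_class(el, name)]


def text_of(element):
    return " ".join("".join(element.itertext()).split())


def n_line(n_element):
    """Build the N: line from subproperties found under this element only."""
    def first(name):
        found = find_by_class(n_element, name)
        return text_of(found[0]) if found else ""
    return "N:" + first("family-name") + ";" + first("given-name") + ";;;"


# Hypothetical page with two hCards; the second person is made up.
page = ET.fromstring("""
<body>
  <div class="vcard">
    <div class="n"><span class="given-name">Brian</span> <span class="family-name">Suda</span></div>
  </div>
  <div class="vcard">
    <div class="n"><span class="given-name">Jane</span> <span class="family-name">Doe</span></div>
  </div>
</body>
""")

lines = [
    n_line(n)
    for card in find_by_class(page, "vcard")
    for n in find_by_class(card, "n")
]
```

Because each lookup starts from the N element in hand, the second card gets its own names instead of repeating the first match in the document.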
The template for URL uses the choose element to determine where the most semantic information for the URL value is encoded. It tests whether the element with class="url" is an <a> element. If it is, the value of URL is extracted from the @href; otherwise it uses the node-value.
<xsl:template match="//*[contains(concat(' ',@class,' '), ' url ')]">
  <xsl:text>URL:</xsl:text>
  <xsl:choose>
    <!-- On an <a> element the URL lives in the href attribute -->
    <xsl:when test="local-name() = 'a'">
      <xsl:value-of select="@href"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:value-of select="."/>
    </xsl:otherwise>
  </xsl:choose>
  <xsl:text>&#10;</xsl:text>
</xsl:template>
The <a> element and many others carry implied semantics. In our original HTML example the URL had been encoded on a <div>. In that case, the node-value would have been extracted, and the value of URL would have been the same. This is just one of the many ways microformats are different from XML. The parsing of microformats data depends on the type of data and on the HTML element it is encoded on.
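The element-dependent rule for URL can be written as a small Python function. This is a sketch of the single rule discussed here, with a hypothetical function name; the full hCard parsing rules cover more elements and property types than this.

```python
import xml.etree.ElementTree as ET


def url_value(element):
    """Return the URL carried by an element marked with class="url"."""
    if element.tag == "a":
        # On an <a> element the href attribute holds the URL.
        return element.get("href", "")
    # On any other element, fall back to the whitespace-collapsed text.
    return " ".join("".join(element.itertext()).split())


link = ET.fromstring(
    '<a class="n fn url" href="http://suda.co.uk">'
    '<span class="given-name">Brian</span> '
    '<span class="family-name">Suda</span></a>'
)
block = ET.fromstring('<div class="url">http://suda.co.uk</div>')

from_link = url_value(link)
from_block = url_value(block)
```

Both forms of markup yield the same URL value, even though the link displays a name to the reader and the <div> displays the address itself.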
This is a very basic overview of parsing data from a microformat. There are more rules depending on the type of vCard property and on which HTML element it is encoded. For more information, you can refer to the Microformats wiki, my O'Reilly PDF book Using Microformats, or you can always email me or join the microformats dev mailing list if you have questions.