Handling Atom Text and Content Constructs
December 7, 2005
The Atom Syndication Format (RFC 4287) came about in part for social reasons and in part for technical reasons. The social reasons came down to difficulties reconciling factions of existing web feed formats. One of the key technical reasons is that existing web feed formats were not clear and rigorous in specifying rules for and interpretation of embedded content and human-readable text. Atom fixes this deficiency, making things easier for those writing processing code, but it also means you should clearly understand the rules governing such constructs, and, ideally, adopt reusable libraries for the purpose. In this article I discuss the forms of text and content constructs available in Atom, and in recognized extensions, and how to process them.
Text and Content Representation Options
Atom 1.0 defines text constructs and content constructs. The Atom spec says:
A Text construct contains human-readable text, usually in small quantities. The content of Text constructs is Language-Sensitive.
Text constructs are limited in allowed representation and are used for the following Atom elements:
- title
- subtitle
- summary
- rights
Content constructs are used only in content elements. There are no limits to the allowed representation (as long as the well-formedness of the Atom document is not compromised).
Text Constructs
The simplest possible form of text construct is exemplified by the title in listing 1.
Listing 1: Default form of text construct
<title>One bold foot forward</title>
This is simply a convenient abbreviation of the form in listing 2, and Atom processors must treat listings 1 and 2 identically.
Listing 2: Explicitly unmarked-up plain text construct
<title type="text">One bold foot forward</title>
This is unmarked-up plain text content. No actual child elements are allowed, and you should not tunnel markup through escaping either. Atom does not strictly prohibit the form in listing 3, but it does violate the spirit of the specification. The problem is that an Atom processor should never second-guess the meaning of the type attribute, and since I implicitly use type="text" a processor will not interpret the contents as markup, even though that is what the example intends.
Listing 3: Bogus (unsignalled) encoded markup in plain text construct
<title>One &lt;strong&gt;bold&lt;/strong&gt; foot forward</title>
If you do want to embed HTML markup as in listing 3, you should signal this fact by using type="html", as in listing 4.
Listing 4: Signalled, encoded markup in text construct
<title type="html">One &lt;strong&gt;bold&lt;/strong&gt; foot forward</title>
You can use a CDATA section to express the exact same Atom form as in listing 4, as illustrated in listing 5.
Listing 5: Signalled, encoded markup in text construct using CDATA sections
<title type="html"><![CDATA[One <strong>bold</strong> foot forward]]></title>
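The equivalence of listings 4 and 5 can be checked with a few lines of Python. This is only a sketch using the standard library's xml.dom.minidom; the helper name text_of is made up for the illustration. Both forms should yield the same character data once parsed.

```python
from xml.dom import minidom

# Listing 4 style: markup tunnelled using entity escaping
escaped = '<title type="html">One &lt;strong&gt;bold&lt;/strong&gt; foot forward</title>'
# Listing 5 style: the same markup tunnelled in a CDATA section
cdata = '<title type="html"><![CDATA[One <strong>bold</strong> foot forward]]></title>'

def text_of(xml_source):
    """Return the concatenated character data of the document element."""
    root = minidom.parseString(xml_source).documentElement
    # minidom's CDATASection is a kind of Text node, so .data works for both
    return "".join(child.data for child in root.childNodes)

escaped_text = text_of(escaped)
cdata_text = text_of(cdata)
same = escaped_text == cdata_text
```

In both cases the processor is handed a plain string that happens to contain angle brackets, so any HTML parsing still has to happen as a separate step.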
Listings 4 and 5 are perfectly valid Atom, but such escaping does make the embedded markup a second-class citizen, and will complicate processing (more on this later). Some people have a misperception that using CDATA sections, as in listing 5, skirts these issues, but it is very important to note that CDATA sections are nothing but syntactic sugar and do not in any way affect the core semantic issues of escaped markup. If possible, I advise you to use the final form of text construct if you wish to embed markup. Rather than tunnelling the markup into encoded text, you can use XHTML directly within the construct by using type="xhtml", as in listing 6.
Listing 6: XHTML text construct
<title type="xhtml">
  <div xmlns="http://www.w3.org/1999/xhtml">
    One <strong>bold</strong> foot forward
  </div>
</title>
Yes, you must wrap the content in an XHTML div, and all that. This makes listing 6 a bit cumbersome and verbose, but it more than makes up for these shortcomings by offering a very clean layering of XML vocabularies, both of which you can be sure are not tag soup. The overhead is likely to be less imposing if you use XHTML text constructs with the typically longer content in summary. Also, if you prefer, you can declare the XHTML namespace once, on the Atom feed element, and then use the appropriate prefix (or default namespace) for all the XHTML, as in listing 7.
Listing 7: XHTML text construct using top-level namespace declaration
<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:xh="http://www.w3.org/1999/xhtml">
  ...
  <title type="xhtml">
    <xh:div>
      One <xh:strong>bold</xh:strong> foot forward
    </xh:div>
  </title>
  ...
Of course, you could choose to use a prefix for Atom elements and make XHTML the default namespace, but this feels a bit backwards considering that Atom is the host vocabulary. Properly implemented processors won't care one way or another. Keeping all the namespace declarations at the top level is actually a good practice in itself, so you might consider always using the form in listing 7, at the cost of having to use prefixes on many elements.
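To illustrate why a namespace-aware processor does not care which style you pick, here is a small sketch using Python's xml.etree.ElementTree. The two sample titles and the helper expanded_names are invented for the illustration; the point is only that both spellings expand to the same namespace-qualified element names.

```python
import xml.etree.ElementTree as ET

# Listing 6 style: XHTML namespace declared on the div itself
local_decl = (
    '<title xmlns="http://www.w3.org/2005/Atom" type="xhtml">'
    '<div xmlns="http://www.w3.org/1999/xhtml">'
    'One <strong>bold</strong> foot forward</div></title>'
)
# Listing 7 style: a prefix bound once and used on every XHTML element
prefixed = (
    '<title xmlns="http://www.w3.org/2005/Atom" '
    'xmlns:xh="http://www.w3.org/1999/xhtml" type="xhtml">'
    '<xh:div>One <xh:strong>bold</xh:strong> foot forward</xh:div></title>'
)

def expanded_names(xml_source):
    """List the namespace-qualified name of every element, in document order."""
    return [elem.tag for elem in ET.fromstring(xml_source).iter()]

names_local = expanded_names(local_decl)
names_prefixed = expanded_names(prefixed)
```

Code that matches on the expanded names (namespace plus local name) rather than on prefixes will treat both documents identically.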
Content Constructs
The simplest content construct is illustrated in listing 8.
Listing 8: Default form of content construct
<content> The "atom:content" element either contains or links to the content of the entry. The content of atom:content is Language-Sensitive. </content>
Again, this is effectively the same as if there were a type="text" attribute on the content element. Once again there is the option of using type="html", as in listing 9.
Listing 9: Embedded HTML content construct
<content type="html">
  The &lt;code&gt;atom:content&lt;/code&gt; element either contains or links to
  the content of the entry. The content of &lt;code&gt;atom:content&lt;/code&gt; is
  &lt;a href="http://www.ietf.org/rfc/rfc3066.txt"&gt;Language-Sensitive&lt;/a&gt;.
</content>
You can also use CDATA sections, similarly to listing 5, or preferably use type="xhtml", similarly to listing 6. You can also embed other textual formats if you specify a type with a value starting with text/, in which case the content must not have any child elements and must be text, with escaping applied where necessary.
Atom content also allows arbitrary XML content, as long as you provide an XML media type in the type attribute, with "XML media type" as defined in RFC 3023. Listing 10 shows how you would embed an SVG image in Atom content.
Listing 10: SVG as Atom content
<content type="image/svg+xml">
  <svg xmlns="http://www.w3.org/2000/svg" width="100px" height="100px">
    <title>Itsy bitsy SVG</title>
    <circle cx="40" cy="25" r="20" style="fill: black;"/>
    <text x="10" y="80" fill="blue">Hello World</text>
  </svg>
</content>
If you want to have content in-line while using any non-text and non-XML type, you must include it in Base64-encoded form. Listing 11 is a PNG image embedded as Atom content.
Listing 11: PNG as Atom content, embedded
<content type="image/png"> iVBORw0KGgoAAAANSUhEUgAAAB8AAAAqCAYAAABLGYAnAAAABmJLR0QA/wD/AP+gvaeTAAAACXBI WXMAAAsTAAALEwEAmpwYAAAAB3RJTUUH1QwCBCUlRSCuygAAAetJREFUWMPt1j1IVmEUB/Dfo1mv Bg1iSgjhZNGnFIhTDUXQ25x9LE2NbQ1NLe1tbbo1tjQEDk2BUzSITUapUejyvkQthXKfhvvcEs3g vl55Fd4/HLgfz//8zzn3Ps85tBGhBc4oLuIYuvAV77CwW0F24x7mEbex+bSmu0rhEcz9R3SzzSXO jnEWjRLChTUSt2X0Y7EF4cIWk4+WMLUD4cKmtPhHr1cgvp58/RNd2zy/VdFf2518lcJsBVkXNls2 85GKt2qpE+4nDlUk/gu1Mpk3Ksy8UbbsSxWKL5UVn6lQfGZP7vM9ecK1/Wxva1drez/f1UlmX8xw HXTQQQcd7Ctkl8jGiMeJk8SeDe9uEI+Q1bfyYg/xNrGf7CaxRhzP77esPUy8soFXLwbIU4QaPuM6 YY04SDxAOI2MMJGIw8Q0z4c14lg+k4dr6MEAoUkcIvYlzgAO4nKK5CgmknjIUou8mvflrJ6uH+Vt +k/09UR88rc64Xnq4Su4gzXiKMbxgHgBTzGcfEyirxivitH5LeF1yvIkXuLZptKdwQ98Q28Sf58y msbd1NdPJMKHlPFCWgfnsIxmEo/NvHRxCB/xgng/ZfkFg1glTBPP4w3h+4agHhOW8TAvuVcpuBV8 yrmxl7iKVKm4mn/WDtqA3yOQKuHaSApTAAAAAElFTkSuQmCC </content>
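As a rough sketch of the round trip, the following Python uses the standard base64 and xml.dom.minidom modules. The payload is just the 8-byte PNG signature standing in for a real image, and the variable names are illustrative only.

```python
import base64
from xml.dom import minidom

# Stand-in payload: the 8-byte PNG signature rather than a whole image
payload = b"\x89PNG\r\n\x1a\n"

# Producer side: encode the bytes and place them in the content element
encoded = base64.b64encode(payload).decode("ascii")
entry_content = '<content type="image/png">\n%s\n</content>' % encoded

# Consumer side: read the character data back and decode it
node = minidom.parseString(entry_content).documentElement
node.normalize()
decoded = base64.b64decode(node.firstChild.data)
```

By default Python's base64.b64decode discards characters outside the Base64 alphabet, so the line breaks around and inside the encoded text do not need to be stripped first.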
Finally, you can use any content externally sourced by specifying a src attribute with the IRI (basically, "internationalized URI") of the content and being sure to specify a type attribute that is a proper media type and not one of the special text|html|xhtml values. Listing 12 is similar to listing 11 except that the PNG file is external to the Atom document.
Listing 12: PNG as Atom content, externally sourced
<content src="image.png" type="image/png"/>
Notice that the content element is empty. It must be so if you use src in this way. Of course you could get tricksy (to put it like Gollum) with data scheme URLs, which embed the content right in the URL itself. I do not for a moment recommend a trick such as in listing 13, where HTML is smuggled in even more diabolically than by using type="html", but I'm exploring the breadth of cases, so there it is.
Listing 13: HTML content provided in a data scheme URL (not recommended)
<content src="data:text/html,%26lt%3Bi%3E3733t%2C%20d00d%26lt%3B/i%3E" type="text/html"/>
%26lt%3Bi%3E3733t%2C%20d00d%26lt%3B/i%3E is the URL-quoted version of &lt;i>3733t, d00d&lt;/i>, which is in turn the escaped form of <i>3733t, d00d</i>.
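The layers in that URL can be peeled off with Python's standard library, as in this sketch. It assumes a simple data URL with no parameters or base64 flag; the variable names are made up for the example.

```python
from html import unescape
from urllib.parse import unquote

src = "data:text/html,%26lt%3Bi%3E3733t%2C%20d00d%26lt%3B/i%3E"

# Split off the "data:<media type>" header at the first literal comma,
# then undo the URL quoting of the payload
header, _, quoted = src.partition(",")
escaped_html = unquote(quoted)

# Undo the entity escaping to recover the markup itself
html = unescape(escaped_html)
```

Two separate decoding steps stand between the attribute value and the markup, which is a good part of why the technique is so unfriendly to processors.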
Approach for Processing Atom Content
To demonstrate a likely algorithm for processing all these text and content construct possibilities, listing 14 is Python code using some hypothetical functions for parsing Atom using DOM and then emitting an XML output. In effect, it shows skeleton pivot code for the boundary between one XML processing pipeline and another where the origin stage produces Atom output and the destination is some XML format (perhaps Atom as well). A real-world example of where I have used such code is in an aggregator that combines multiple Atom feeds into a single feed (an aggregator pattern). It could also be used to generate presentation XHTML from source Atom. I chose to make it skeleton code so you can feel free to substitute the XML generation toolkit of your choice, and so the algorithm can be copied more transparently to other languages such as ECMAScript, Ruby, or even XSLT.
Listing 14: Skeleton Python code for processing Atom input to produce XML output
import base64

# write_cdata, write_literal_xml, write_ext_reference, tidy and
# handle_foreign_content are the hypothetical output hooks; substitute
# the equivalents from your own XML toolkit.

def first_child_element(node):
    # Skip any whitespace text nodes around the wrapped element
    for child in node.childNodes:
        if child.nodeType == child.ELEMENT_NODE:
            return child
    raise ValueError("Expected a child element")

def handle_text_construct(node):
    # Merge adjacent text nodes
    node.normalize()
    text_type = node.getAttributeNS(None, "type")
    if text_type in ["", "text"]:
        write_cdata(node.firstChild.data)
    elif text_type == "html":
        # The XML parser has already decoded the escaping, so the text
        # node holds the tag soup itself
        tagsoup = node.firstChild.data
        tidied = tidy(tagsoup)
        write_literal_xml(tidied)
    elif text_type == "xhtml":
        # The content is the XHTML div element, not a text node
        write_literal_xml(first_child_element(node).toxml())
    else:
        raise TypeError("Illegal text construct type")
    return

def handle_content(node):
    content_type = node.getAttributeNS(None, "type")
    content_src = node.getAttributeNS(None, "src")
    if content_src:
        # For example write an XHTML object start tag
        write_ext_reference(content_src, content_type)
        return
    # Atom built-in types are handled the same way as text constructs
    if content_type in ["", "text", "html", "xhtml"]:
        handle_text_construct(node)
        return
    node.normalize()
    # Check the XML type case before the text type case
    if content_type.endswith("/xml") or content_type.endswith("+xml"):
        write_literal_xml(first_child_element(node).toxml())
    elif content_type.startswith("text/"):
        write_cdata(node.firstChild.data)
    else:
        # You may choose to handle such by
        # duplicating this construct, creating an entity with NDATA,
        # using a reference with data type URL, or other means
        content = base64.b64decode(node.firstChild.data)
        handle_foreign_content(content)
    return
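The order of the tests in handle_content matters, and it can be isolated in a small, runnable function. This is a sketch only: classify_content_type and its category labels are invented for the illustration, and the suffix test is the same simplification of "XML media type" that the listing uses.

```python
def classify_content_type(type_attr):
    """Map an atom:content type attribute to a processing category."""
    media_type = type_attr.strip().lower()
    if media_type in ("", "text", "html", "xhtml"):
        # Atom built-in types; a missing attribute means "text"
        return media_type or "text"
    # Test for XML before text/*: "text/xml" satisfies both patterns
    if media_type.endswith("/xml") or media_type.endswith("+xml"):
        return "xml"
    if media_type.startswith("text/"):
        return "text-media"
    # Anything else is carried in Base64 when in-line
    return "base64"

categories = [
    classify_content_type(t)
    for t in ["", "html", "text/xml", "image/svg+xml", "text/plain", "image/png"]
]
```

If the text/ test came first, a type such as text/xml would be written out as character data instead of being passed through as markup.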
Using Data URLs in HTML Output
While I do not recommend tunneling tag-soup content in data scheme URLs when expressed in a non-tag-soup format such as Atom, such URLs can be a limited solution for one problem I've encountered in Atom processing. If you want to tunnel tag soup to an output that can handle it (say a web browser), and the browser understands data scheme URLs, you can skip the decoding then tidying step for processing type="html" and just URL encode the escaped HTML into a data URL in an object element for output.
Yes, this is a very suspicious hack, but it illustrates some of the desperate measures I have had to resort to when working with Atom given the realities of ubiquitous tag soup. Specifically I was trying to write a little personal feed viewer using XSLT so that I could render feed contents in Firefox. Writing object elements with data URLs was the easiest way to tame the escaped tag soup I was getting from upstream feeds. I would never have done such a thing if the next stage in the processing pipeline was what I consider a proper XML stage, but since it was directly to a web browser at the end of the line, I took the liberty. The relevant bit of XSLT was as in listing 15.
Listing 15: Sample XSLT for hacking escaped HTML into an HTML object with data URL
<xsl:transform version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:atom="http://www.w3.org/2005/Atom"
  xmlns:str="http://exslt.org/strings"
>
  <xsl:output method="html"/>

  <xsl:template match="atom:content[@type='html']">
    <object type="text/html" height="100" width="100"
      data="data:text/html,{str:encode-uri(string(.), false())}">
      Unfortunately, your browser does not support data scheme URLs,
      so you cannot view this embedded content
    </object>
  </xsl:template>

</xsl:transform>
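For readers working outside XSLT, roughly the same trick can be sketched in Python with the standard library. The function name html_object_for is hypothetical, and quote with safe="" is only an approximation of what str:encode-uri does with a false second argument. The sketch puts the URL in the object element's data attribute, which is where HTML expects the resource address.

```python
from urllib.parse import quote
from xml.sax.saxutils import quoteattr

def html_object_for(tag_soup):
    """Build an HTML object element that carries the markup in a data URL."""
    # safe="" percent-encodes every reserved character, including "/"
    data_url = "data:text/html," + quote(tag_soup, safe="")
    return '<object type="text/html" data=%s></object>' % quoteattr(data_url)

# Character data of an atom:content element, as delivered by the XML parser
tag_soup = "<p>Unclosed <b>tag soup"
markup = html_object_for(tag_soup)
```

Because everything that could be mistaken for markup is percent-encoded, the tag soup never touches the XML layer of the output; only the browser's HTML parser ever sees it.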
At present data URLs are supported in Mozilla, Opera, Safari, and Konqueror. So far Internet Explorer does not support data URLs, which is a huge damper, but there are a lot of user requests for the feature, so it might make a surprise appearance in IE7. See, for example, comments in the IE team blog entry URLs in Internet Explorer 7.
Atomic Text Clean-up
For more on why you should avoid escaped HTML in Atom documents, see Escaped Markup Considered Harmful by Norm Walsh here on XML.com, and his follow-up Escaped Markup: What To Do Instead. Atom is of the XML family, and it's always best to keep as much as possible in the XML layer. If you do, processing will be easier and your data will be cleaner. Of course, since it's not always easy to keep the gloves on after more than a decade of tag soup on the Web, Atom makes it possible to deal with messy content without devolving completely to the chaotic content representation that marks many other web feed formats. In Atom, you at least have to properly declare your mess.
By the way, if this article interested you I'd like to invite you to join the Atom IRC channel on Freenode (#atom on irc.freenode.net), which I revived last month. We've settled down to a few regulars with people often popping in to ask quick questions or announce work in progress, but the more the merrier. Atom 1.0 is just out of the shrink-wrap and the Atom Publishing Protocol -- featured in Joe Gregorio's Restful Web column this week, Catching Up with the Atom Publishing Protocol -- is advancing towards production, so it's a great time to discuss user and implementation details in a friendly forum.