Handling Atom Text and Content Constructs

December 7, 2005

The Atom Syndication Format (RFC 4287) came about in part for social reasons and in part for technical reasons. The social reasons came down to difficulties reconciling factions of existing web feed formats. One of the key technical reasons is that existing web feed formats were not clear and rigorous in specifying rules for and interpretation of embedded content and human-readable text. Atom fixes this deficiency, making things easier for those writing processing code, but it also means you should clearly understand the rules governing such constructs, and, ideally, adopt reusable libraries for the purpose. In this article I discuss the forms of text and content constructs available in Atom, and in recognized extensions, and how to process them.

Text and Content Representation Options

Atom 1.0 defines text constructs and content constructs. The Atom spec says:

A Text construct contains human-readable text, usually in small quantities. The content of Text constructs is Language-Sensitive.

Text constructs are limited in allowed representation and are used for the following Atom elements:

title
subtitle
summary
rights

Content constructs are used only in content elements. There are no limits to the allowed representation (as long as the well-formedness of the Atom document is not compromised).

Text Constructs

The simplest possible form of text construct is exemplified by the title in listing 1.

Listing 1: Default form of text construct


<title>One bold foot forward</title>

This is simply a convenient abbreviation of the form in listing 2, and Atom processors must treat listings 1 and 2 identically.

Listing 2: Explicitly unmarked-up plain text construct


<title type="text">One bold foot forward</title>

This is plain text with no markup. Child elements are not allowed, and you should not tunnel markup through escaping either. Atom does not strictly prohibit the form in listing 3, but it does violate the spirit of the specification. The problem is that an Atom processor should never second-guess the type attribute: because the example implicitly uses type="text", a conforming processor will display the escaped tags literally rather than interpret them as markup, which is not what the author of such a title intended.

Listing 3: Bogus (unsignalled) encoded markup in plain text construct


<title>One &lt;strong&gt;bold&lt;/strong&gt; foot forward</title>

If you do want to embed HTML markup as in listing 3, you should signal this fact by using type="html", as in listing 4.

Listing 4: Signalled, encoded markup in text construct


<title type="html">One &lt;strong&gt;bold&lt;/strong&gt; foot forward</title>
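The practical difference between listings 3 and 4 shows up when a processor renders the title. The following Python sketch assumes the output is an HTML page and uses the standard library's xml.dom.minidom. The function name render_title_as_html is invented for illustration, the type="xhtml" case is left out, and a real processor would tidy or sanitize declared HTML rather than pass it through untouched.

```python
from html import escape
from xml.dom import minidom


def render_title_as_html(xml_source):
    """Return an HTML fragment for a title given as a standalone XML string."""
    title = minidom.parseString(xml_source).documentElement
    title.normalize()  # merge adjacent text nodes
    text = title.firstChild.data if title.firstChild else ""
    text_type = title.getAttribute("type") or "text"
    if text_type == "text":
        # Plain text: markup characters must show up literally, so escape them
        return escape(text, quote=False)
    if text_type == "html":
        # Declared HTML: the parser has already undone the escaping
        return text
    raise ValueError("type not handled in this sketch: %r" % text_type)


listing_3 = "<title>One &lt;strong&gt;bold&lt;/strong&gt; foot forward</title>"
listing_4 = ('<title type="html">One &lt;strong&gt;bold&lt;/strong&gt;'
             ' foot forward</title>')
print(render_title_as_html(listing_3))
print(render_title_as_html(listing_4))
```

For listing 3 the reader sees the strong tags as literal text. For listing 4 the word is rendered in bold.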

You can use a CDATA section to express the exact same Atom form as in listing 4, as illustrated in listing 5.

Listing 5: Signalled, encoded markup in text construct using a CDATA section


<title type="html"><![CDATA[One <strong>bold</strong> foot forward]]></title>
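One way to convince yourself that listings 4 and 5 are the same Atom is to parse both and compare the character data a processor receives. This is a minimal sketch using Python's xml.dom.minidom; the helper name character_data is my own.

```python
from xml.dom import minidom

LISTING_4 = ('<title type="html">One &lt;strong&gt;bold&lt;/strong&gt;'
             ' foot forward</title>')
LISTING_5 = ('<title type="html"><![CDATA[One <strong>bold</strong>'
             ' foot forward]]></title>')


def character_data(xml_source):
    """Concatenate the character data of the document element's children."""
    root = minidom.parseString(xml_source).documentElement
    # Text nodes and CDATA section nodes both expose their content as .data
    return "".join(child.data for child in root.childNodes)


print(character_data(LISTING_4))
print(character_data(LISTING_5))
```

Both calls yield the same string of tag soup, so the code downstream faces the same job of making sense of it.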

Listings 4 and 5 are perfectly valid Atom, but such escaping does make the embedded markup a second-class citizen, and will complicate processing (more on this later). Some people believe that using CDATA sections, as in listing 5, skirts these issues, but CDATA sections are nothing but syntactic sugar and do not in any way affect the core semantic issues of escaped markup. If possible, I advise you to use the final form of text construct if you wish to embed markup. Rather than tunnelling the markup into encoded text, you can use XHTML directly within the construct by using type="xhtml", as in listing 6.

Listing 6: XHTML text construct


<title type="xhtml">
  <div xmlns="http://www.w3.org/1999/xhtml">
    One <strong>bold</strong> foot forward
  </div>
</title>

Yes, you must wrap the content in an XHTML div, and all that. This makes listing 6 a bit cumbersome and verbose, but it more than makes up for these shortcomings by offering a very clean layering of XML vocabularies, both of which you can be sure are not tag soup. The overhead is likely to be less imposing if you use XHTML text constructs with the typically longer content in summary. Also, if you prefer, you can declare the XHTML namespace once, on the Atom feed element, and then use the appropriate prefix (or default namespace) for all the XHTML, as in listing 7.

Listing 7: XHTML text construct using top-level namespace declaration


<feed xmlns="http://www.w3.org/2005/Atom"
  xmlns:xh="http://www.w3.org/1999/xhtml">
...
<title type="xhtml">
  <xh:div>
    One <xh:strong>bold</xh:strong> foot forward
  </xh:div>
</title>
...

Of course, you could choose to use a prefix for Atom elements and make XHTML the default namespace, but this feels a bit backwards considering that Atom is the host vocabulary. Properly implemented processors won't care one way or another. Keeping all the namespace declarations at the top level is actually a good practice in itself, so you might consider always using the form in listing 7, at the cost of having to use prefixes on many elements.
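To see why a namespace-aware processor does not care which form you pick, compare the expanded element names it works with. This sketch uses Python's xml.etree.ElementTree on cut-down versions of listings 6 and 7; I have put the Atom namespace declaration on the title element only so that each fragment parses on its own.

```python
import xml.etree.ElementTree as ET

DEFAULT_NS_FORM = """<title xmlns="http://www.w3.org/2005/Atom" type="xhtml">
  <div xmlns="http://www.w3.org/1999/xhtml">
    One <strong>bold</strong> foot forward
  </div>
</title>"""

PREFIXED_FORM = """<title xmlns="http://www.w3.org/2005/Atom"
    xmlns:xh="http://www.w3.org/1999/xhtml" type="xhtml">
  <xh:div>
    One <xh:strong>bold</xh:strong> foot forward
  </xh:div>
</title>"""


def expanded_names(xml_source):
    """List every element as {namespace}local-name, in document order."""
    return [element.tag for element in ET.fromstring(xml_source).iter()]


print(expanded_names(DEFAULT_NS_FORM))
print(expanded_names(PREFIXED_FORM))
```

The prefixes disappear in the parsed form; what remains is the pairing of namespace and local name, which is identical in both cases.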

Content Constructs

The simplest content construct is illustrated in listing 8.

Listing 8: Default form of content construct


<content>
The "atom:content" element either contains or links to the content of
the entry.  The content of atom:content is Language-Sensitive.
</content>

Again, this is effectively the same as if there were a type="text" attribute on the content element. Once again there is the option of using type="html", as in listing 9.

Listing 9: Embedded HTML content construct


<content type="html">
The &lt;code&gt;atom:content&lt;/code&gt; element either contains or links to the
content of the entry.  The content of &lt;code&gt;atom:content&lt;/code&gt; is
&lt;a href="http://www.ietf.org/rfc/rfc3066.txt"&gt;Language-Sensitive&lt;/a&gt;.
</content>

You can also use CDATA sections, similarly to listing 5, or preferably use type="xhtml", similarly to listing 6. You can also embed other textual formats if you specify a type with a value starting with text/, in which case the content must not have any child elements and must be text, with escaping applied where necessary.

Atom content also allows arbitrary XML content, as long as you provide an XML media type in the type attribute, with "XML media type" as defined in RFC 3023. Listing 10 shows how you would embed an SVG image in Atom content.

Listing 10: SVG as Atom content


<content type="image/svg+xml">
<svg xmlns="http://www.w3.org/2000/svg"
  width="100px" height="100px">
  <title>Itsy bitsy SVG</title>
  <circle cx="40" cy="25" r="20" style="fill: black;"/>
  <text x="10" y="80" fill="blue">Hello World</text>
</svg>
</content>

If you want to have content in-line while using any non-text and non-XML type, you must include it in Base64-encoded form. Listing 11 is a PNG image embedded as Atom content.

Listing 11: PNG as Atom content, embedded


<content type="image/png">
iVBORw0KGgoAAAANSUhEUgAAAB8AAAAqCAYAAABLGYAnAAAABmJLR0QA/wD/AP+gvaeTAAAACXBI
WXMAAAsTAAALEwEAmpwYAAAAB3RJTUUH1QwCBCUlRSCuygAAAetJREFUWMPt1j1IVmEUB/Dfo1mv
Bg1iSgjhZNGnFIhTDUXQ25x9LE2NbQ1NLe1tbbo1tjQEDk2BUzSITUapUejyvkQthXKfhvvcEs3g
vl55Fd4/HLgfz//8zzn3Ps85tBGhBc4oLuIYuvAV77CwW0F24x7mEbex+bSmu0rhEcz9R3SzzSXO
jnEWjRLChTUSt2X0Y7EF4cIWk4+WMLUD4cKmtPhHr1cgvp58/RNd2zy/VdFf2518lcJsBVkXNls2
85GKt2qpE+4nDlUk/gu1Mpk3Ksy8UbbsSxWKL5UVn6lQfGZP7vM9ecK1/Wxva1drez/f1UlmX8xw
HXTQQQcd7Ctkl8jGiMeJk8SeDe9uEI+Q1bfyYg/xNrGf7CaxRhzP77esPUy8soFXLwbIU4QaPuM6
YY04SDxAOI2MMJGIw8Q0z4c14lg+k4dr6MEAoUkcIvYlzgAO4nKK5CgmknjIUou8mvflrJ6uH+Vt
+k/09UR88rc64Xnq4Su4gzXiKMbxgHgBTzGcfEyirxivitH5LeF1yvIkXuLZptKdwQ98Q28Sf58y
msbd1NdPJMKHlPFCWgfnsIxmEo/NvHRxCB/xgng/ZfkFg1glTBPP4w3h+4agHhOW8TAvuVcpuBV8
yrmxl7iKVKm4mn/WDtqA3yOQKuHaSApTAAAAAElFTkSuQmCC
</content>
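On the receiving end, a processor has to turn such character data back into bytes. The sketch below shows the idea in Python with xml.dom.minidom and the base64 module. The function name decode_inline_content is hypothetical, and to keep the example short the payload is just the 8-byte PNG signature rather than the image in listing 11.

```python
import base64
from xml.dom import minidom


def decode_inline_content(content_node):
    """Decode the Base64 character data of an atom:content element."""
    content_node.normalize()  # merge adjacent text nodes
    # b64decode discards line breaks and other non-alphabet characters
    # by default, so the wrapped lines need no special treatment
    return base64.b64decode(content_node.firstChild.data)


# Stand-in payload: the PNG file signature, not a complete image
payload = b"\x89PNG\r\n\x1a\n"
encoded = base64.encodebytes(payload).decode("ascii")
doc = minidom.parseString(
    '<content type="image/png">\n%s</content>' % encoded)
decoded = decode_inline_content(doc.documentElement)
print(decoded == payload)
```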

Finally, you can source any content externally by specifying a src attribute with the IRI (basically, "internationalized URI") of the content and being sure to specify a type attribute that is a proper media type and not one of the special text|html|xhtml values. Listing 12 is similar to listing 11 except that the PNG file is external to the Atom document.

Listing 12: PNG as Atom content, externally sourced


<content src="image.png" type="image/png"/>

Notice that the content element is empty. It must be so if you use src in this way. Of course you could get tricksy (to put it like Gollum) with data scheme URLs, which embed the content right in the URL itself. I do not for a moment recommend a trick such as in listing 13, where HTML is smuggled in even more diabolically than by using type="html", but I'm exploring the breadth of cases, so there it is.

Listing 13: HTML content provided in a data scheme URL (not recommended)


<content src="data:text/html,%26lt%3Bi%3E3733t%2C%20d00d%26lt%3B/i%3E" type="text/html"/>

%26lt%3Bi%3E3733t%2C%20d00d%26lt%3B/i%3E is the URL-quoted version of &lt;i>3733t, d00d&lt;/i>, which is in turn the escaped form of <i>3733t, d00d</i>.
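You can peel back the two layers in listing 13 with a couple of Python standard library calls. This is only a sketch to show what a consumer would have to undo; the variable names are my own.

```python
from html import unescape
from urllib.parse import unquote

quoted = "%26lt%3Bi%3E3733t%2C%20d00d%26lt%3B/i%3E"

escaped_html = unquote(quoted)      # first layer: URL quoting
html_text = unescape(escaped_html)  # second layer: markup escaping

print(escaped_html)
print(html_text)
```

Two rounds of decoding for seven words of italic text is a good measure of how much is being smuggled.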

Approach for Processing Atom Content

To demonstrate a likely algorithm for processing all these text and content construct possibilities, listing 14 is Python code using some hypothetical functions for parsing Atom using DOM and then emitting an XML output. In effect, it shows skeleton pivot code for the boundary between one XML processing pipeline and another where the origin stage produces Atom output and the destination is some XML format (perhaps Atom as well). A real-world example of where I have used such code is in an aggregator that combines multiple Atom feeds into a single feed (an aggregator pattern). It could also be used to generate presentation XHTML from source Atom. I chose to make it skeleton code so you can feel free to substitute the XML generation toolkit of your choice, and so the algorithm can be copied more transparently to other languages such as ECMAScript, Ruby, or even XSLT.

Listing 14: Skeleton Python code for processing Atom input to produce XML output

import base64

# write_cdata, write_literal_xml, write_ext_reference, tidy and
# handle_foreign_content are hypothetical output functions; substitute
# the equivalents from your XML generation toolkit.

def child_xml(node):
    #Serialize the child nodes as XML (toxml() as in xml.dom.minidom)
    return "".join([child.toxml() for child in node.childNodes])

def handle_text_construct(node):
    #Merge adjacent text nodes
    node.normalize()
    text_type = node.getAttributeNS(None, "type")
    if text_type in ["", "text"]:
        write_cdata(node.firstChild.data)
    elif text_type == "html":
        #The XML parser has already undone the escaping, so the
        #text node holds the tag soup itself
        tagsoup = node.firstChild.data
        tidied = tidy(tagsoup)
        write_literal_xml(tidied)
    elif text_type == "xhtml":
        #The children are element nodes (the XHTML div), not text
        write_literal_xml(child_xml(node))
    else:
        raise TypeError("Illegal text construct type")
    return

def handle_content(node):
    content_type = node.getAttributeNS(None, "type")
    content_src = node.getAttributeNS(None, "src")
    if content_src:
        #For example write an XHTML object start tag
        write_ext_reference(content_src, content_type)
        return
    #Atom built-in types are handled same way as text constructs
    if content_type in ["", "text", "html", "xhtml"]:
        handle_text_construct(node)
        return
    node.normalize()
    #Check the XML type case before the text type case
    #(media types are compared case-insensitively)
    media_type = content_type.lower()
    if media_type.endswith("/xml") or media_type.endswith("+xml"):
        write_literal_xml(child_xml(node))
    elif media_type.startswith("text/"):
        write_cdata(node.firstChild.data)
    else:
        #You may choose to handle such by
        #duplicating this construct, creating an entity with NDATA,
        #using a reference with data type URL, or other means
        content = base64.b64decode(node.firstChild.data)
        handle_foreign_content(content)
    return
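If you want to try the dispatch order from listing 14 without wiring up any output functions, the decision can be isolated into a small pure function. This is a sketch under my own naming (classify_content and the category labels are invented for illustration), and it assumes that media types are compared case-insensitively.

```python
def classify_content(content_type="", has_src=False):
    """Name the handling category for an atom:content element."""
    if has_src:
        return "external"
    # The three built-in values (and the default) come first
    if content_type in ("", "text"):
        return "text"
    if content_type in ("html", "xhtml"):
        return content_type
    media_type = content_type.lower()
    # XML media types are checked before text/ so that text/xml counts as XML
    if media_type.endswith(("/xml", "+xml")):
        return "xml"
    if media_type.startswith("text/"):
        return "textual"
    # Anything else arrives Base64 encoded
    return "base64"


for sample in ("", "html", "image/svg+xml", "text/xml", "text/plain",
               "image/png"):
    print(repr(sample), "->", classify_content(sample))
```

Keeping the classification separate from the output code also makes it easy to port the same order of checks to another language.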

Using Data URLs in HTML Output

While I do not recommend tunnelling tag-soup content in data scheme URLs when expressed in a non-tag-soup format such as Atom, such URLs can be a limited solution for one problem I've encountered in Atom processing. If you want to tunnel tag soup to an output that can handle it (say a web browser), and the browser understands data scheme URLs, you can skip the decoding and tidying steps for processing type="html" and just URL-encode the escaped HTML into a data URL in an object element for output.

Yes, this is a very suspicious hack, but it illustrates some of the desperate measures I have had to resort to when working with Atom given the realities of ubiquitous tag soup. Specifically I was trying to write a little personal feed viewer using XSLT so that I could render feed contents in Firefox. Writing object elements with data URLs was the easiest way to tame the escaped tag soup I was getting from upstream feeds. I would never have done such a thing if the next stage in the processing pipeline was what I consider a proper XML stage, but since it was directly to a web browser at the end of the line, I took the liberty. The relevant bit of XSLT was as in listing 15.

Listing 15: Sample XSLT for hacking escaped HTML into an HTML object with data URL


<xsl:transform version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:atom="http://www.w3.org/2005/Atom"
    xmlns:str="http://exslt.org/strings"
>
  <xsl:output method="html"/>

  <xsl:template match="atom:content[@type='html']">
    <object type="text/html" height="100" width="100"
      data="data:text/html,{str:encode-uri(string(.), false())}">
Unfortunately, your browser does not support data scheme URLs,
so you cannot view this embedded content
    </object>
  </xsl:template>

</xsl:transform>
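If you are not working in XSLT, the URL construction in listing 15 is easy to reproduce elsewhere. Here is a rough Python equivalent of just that step, using urllib.parse.quote. The function name html_data_url is hypothetical, and unlike the listing this sketch percent-encodes every reserved character.

```python
from urllib.parse import quote, unquote


def html_data_url(html_text):
    """Build a data scheme URL carrying the given HTML text."""
    # safe="" percent-encodes reserved characters too, including "/" and "#"
    return "data:text/html," + quote(html_text, safe="")


url = html_data_url("<i>3733t, d00d</i>")
print(url)
# Unquoting the part after the comma recovers the original markup
print(unquote(url.split(",", 1)[1]))
```

The input here is the text as a parser delivers it, with the Atom-level escaping already undone, which is also what string(.) supplies in the XSLT.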

At present data URLs are supported in Mozilla, Opera, Safari, and Konqueror. So far Internet Explorer does not support data URLs, which is a huge damper, but there are a lot of user requests for the feature, so it might make a surprise appearance in IE7. See, for example, comments in the IE team blog entry URLs in Internet Explorer 7.

Atomic Text Clean-up

For more on why you should avoid escaped HTML in Atom documents, see Escaped Markup Considered Harmful by Norm Walsh here on XML.com, and his follow-up Escaped Markup: What To Do Instead. Atom is of the XML family, and it's always best to keep as much as possible in the XML layer. If you do, processing will be easier and your data will be cleaner. Of course, since it's not always easy to keep the gloves on after more than a decade of tag soup on the Web, Atom makes it possible to deal with messy content without devolving completely to the chaotic content representation that marks many other web feed formats. In Atom, you at least have to properly declare your mess.

By the way, if this article interested you I'd like to invite you to join the Atom IRC channel on Freenode (#atom on irc.freenode.net), which I revived last month. We've settled down to a few regulars with people often popping in to ask quick questions or announce work in progress, but the more the merrier. Atom 1.0 is just out of the shrink-wrap and the Atom Publishing Protocol -- featured in Joe Gregorio's Restful Web column this week, Catching Up with the Atom Publishing Protocol -- is advancing towards production, so it's a great time to discuss user and implementation details in a friendly forum.