Implementing the Atom Publishing Protocol
July 19, 2006
The Atom Publishing Protocol (APP) is nearing completion; many of the issues that I pointed out in a previous article have settled down, and work is being done on implementations and interoperability. Although the interoperability work will go on for years to come, we can put together an implementation and discuss the requirements the APP puts on you, the gotchas, and the ways we can optimize the service. If you've been following along with the Restful Web columns at home, you won't be surprised that the implementation is in Python. In future articles we'll start building more complex services on top of this APP implementation.
Before we dive into the code, let's back up and take a high-level look at what we need to implement. For reference, we'll be implementing draft-08 of the Atom Publishing Protocol. At the conclusion of "How to Create a REST Protocol" the four primary questions about a REST protocol are answered and a table describes the protocol. Table 1 is just such a table for the Atom Publishing Protocol.
Table 1.

| Resource | HTTP Method | Representation | Description |
|---|---|---|---|
| Introspection | GET | Introspection Document | Enumerates a set of collections and lists their URIs and other information about the collections. |
| Collection | GET | Atom Feed | A list of members of the collection. Note that this may be a subset of all entries in the collection. |
| Collection | POST | Atom Entry | Create a new entry in the collection. |
| Member | GET | Atom Entry | Get the Atom Entry. |
| Member | PUT | Atom Entry | Update the Atom Entry. |
| Member | DELETE | N/A | Delete the Atom Entry from the collection. |
All of these operations are pretty obvious, except using GET on a collection. In that case, the response to a GET may not return all of the entries in a collection. Actually, if it's a large collection I really hope it doesn't return all the entries in the collection. So the initial GET returns what may be a subset of the entries in the collection, ordered in reverse chronological order of their atom:updated date. Note that this means the most recently updated entries are returned first in the feed. If the feed returned doesn't contain all the entries in the collection, then the feed will contain an atom:link element with rel="next" that points to another feed with the next set of entries in the collection -- it will also be ordered in reverse atom:updated chronological order. Reverse atom:updated chronological order is quite a mouthful, and it doesn't even abbreviate very nicely; "RAUCO" sounds more like a Sopranos character than a technical term.
Table 1 is good, but if we are going to generate a concrete implementation, then we need to add more detailed information. We'll add another column for the URIs of each of the resources and drop the description column to save space (see Table 2).
Table 2.

| URI | Resource | HTTP Method | Representation |
|---|---|---|---|
| /collection/introspection/ | Introspection | GET | Introspection Document |
| /collection/ | Collection | GET | Atom Feed |
| /collection/ | Collection | POST | Atom Entry |
| /collection/member/{id} | Member | GET | Atom Entry |
| /collection/member/{id} | Member | PUT | Atom Entry |
| /collection/member/{id} | Member | DELETE | N/A |
Of course, those aren't really URIs in the URI column -- the last three rows have URI Templates instead of URIs. That is, you substitute the string {id} with some value, in this case some unique identifier for each entry in the collection, in order to obtain a full URI. The idea of URI Templates isn't new; I provided code for how to handle URI Template variable expansion in "Constructing or Traversing URIs?". We now need to take three short side trips to gather the pieces we need to put together our implementation of the APP.
A Place for My Stuff
The first thing we will need is a place to store our Atom Entries. Let's look back at Table 2 and see about our requirements. Each entry needs to be able to be retrieved, updated, and deleted based on a single id. Here is a sketch of the interface for the Python class:
```python
class Store:
    def get(self, id):
        pass

    def delete(self, id):
        pass

    def put(self, id, entry):
        pass
```
We'll assume that the entry taken in put() and returned by get() will be in the form of a string. In addition, we need the ability to add new entries to the collection using just an entry document, and that creation process needs to report the id that was assigned to the new entry.
```python
class Store:
    def get(self, id):
        pass

    def delete(self, id):
        pass

    def put(self, id, entry):
        pass

    def post(self, entry):
        pass
```
And, finally, we need to enumerate the members of a collection, which means we have to generate a linked chain of Atom feeds that enumerate the entries in the collection in reverse atom:updated order (aka el-RAUCO). If we assume that we will generate the feeds with a fixed number of entries per feed, then we need a way to query with a size and offset from the beginning of the list. The index method will just return a subset of the entries' ids that is size entries long, with the first entry offset back in the list.
```python
class Store:
    def get(self, id):
        pass

    def post(self, entry):
        pass

    def delete(self, id):
        pass

    def put(self, id, entry):
        pass

    def index(self, size, offset):
        pass
```
Another requirement, which we have by virtue of building a web service, is that the underlying data store needs to be able to safely operate when being accessed simultaneously by multiple processes or threads. That requirement was met by building the underlying store on top of SQLite, which automatically handles accesses from multiple processes or threads. The Store also parses each incoming entry to ensure that it is at least well-formed and also updates fields that need to be updated. For example, on creation an entry will be assigned a unique id and that id will overwrite the value in the atom:id. There are also hooks for custom behaviors on updates. I will save the rest of the technical details of the Store class for another day. What I really want to concentrate on now is how implementing a RESTful protocol with the right tools is easy, and the advantages you can get from using HTTP correctly.
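That said, to make the interface concrete, here is a minimal sketch of how an SQLite-backed store might satisfy it. The table layout, the uuid-based ids, and the datetime('now') timestamps standing in for atom:updated are all assumptions, and the well-formedness parsing and atom:id rewriting described above are left out.

```python
# A minimal sketch of an SQLite-backed store; the schema and id scheme
# are assumptions, not the Store described in this article.
import sqlite3
import uuid

class SqliteStore:
    def __init__(self, filename):
        self.filename = filename
        self._execute("CREATE TABLE IF NOT EXISTS entries "
                      "(id TEXT PRIMARY KEY, entry TEXT, updated TEXT)")

    def _execute(self, sql, params=()):
        # Open a fresh connection per call; SQLite itself serializes
        # concurrent writers from other processes and threads.
        conn = sqlite3.connect(self.filename)
        try:
            cursor = conn.execute(sql, params)
            rows = cursor.fetchall()
            conn.commit()
            return rows
        finally:
            conn.close()

    def get(self, id):
        rows = self._execute("SELECT entry FROM entries WHERE id = ?", (id,))
        return rows[0][0] if rows else None

    def put(self, id, entry):
        self._execute("UPDATE entries SET entry = ?, updated = datetime('now') "
                      "WHERE id = ?", (entry, id))

    def delete(self, id):
        self._execute("DELETE FROM entries WHERE id = ?", (id,))

    def post(self, entry):
        id = uuid.uuid4().hex
        self._execute("INSERT INTO entries (id, entry, updated) "
                      "VALUES (?, ?, datetime('now'))", (id, entry))
        return id

    def index(self, size, offset):
        # Ids of the most recently updated entries first (el-RAUCO).
        rows = self._execute("SELECT id FROM entries ORDER BY updated DESC "
                             "LIMIT ? OFFSET ?", (size, offset))
        return [row[0] for row in rows]
```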
A Word About WSGI
Please read PEP 333, a nicely written and detailed account of the Web Server Gateway Interface (WSGI). WSGI is an API for writing web services or components in Python. It also allows you to write applications in a way that is independent of the underlying web server. wsgiref is a library that will be making its Python core library debut in Python 2.5; it includes a reference implementation of a server, some middleware, and a WSGI application. We'll write our APP implementation as a WSGI application, which gives us more portability and opens up possibilities to write less code, which is always a good thing.
WSGI is simple and simply explained:
The WSGI interface has two sides: the "server" or "gateway" side, and the "application" or "framework" side. The server side invokes a callable object that is provided by the application side. The specifics of how that object is provided are up to the server or gateway. It is assumed that some servers or gateways will require an application's deployer to write a short script to create an instance of the server or gateway, and supply it with the application object. Other servers and gateways may use configuration files or other mechanisms to specify where an application object should be imported from, or otherwise obtained. [PEP 333]
So the application side is a callable object, and if you are familiar with Python you'll soon realize that you can start doing functional-type things with callable objects, like composing them. That observation leads to the concept of "middleware." Not to be confused with high-priced enterprisey solutions, this kind of middleware is made of Python objects that wrap themselves around application objects to provide enhanced behavior.
In addition to "pure" servers/gateways and applications/frameworks, it is also possible to create "middleware" components that implement both sides of this specification. Such components act as an application to their containing server, and as a server to a contained application, and can be used to provide extended APIs, content transformation, navigation, and other useful functions. [PEP 333]
Here is an example of a simple WSGI application, straight from PEP 333:
```python
def simple_app(environ, start_response):
    """Simplest possible application object"""
    status = '200 OK'
    response_headers = [('Content-type', 'text/plain')]
    start_response(status, response_headers)
    return ['Hello world!\n']
```
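Middleware, in turn, is just a callable that wraps another application. Here is a minimal sketch of that composition: a wrapper that adds a response header on the way out (the header name is purely illustrative).

```python
# A minimal sketch of WSGI middleware: wrap an application and add a
# header to every response it produces.
class AddHeader:
    def __init__(self, application, name, value):
        self.application = application
        self.header = (name, value)

    def __call__(self, environ, start_response):
        def my_start_response(status, headers, exc_info=None):
            return start_response(status, headers + [self.header], exc_info)
        return self.application(environ, my_start_response)

# Composition is just wrapping one callable in another:
#   app = AddHeader(simple_app, 'X-Powered-By', 'wsgi')
```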
I won't go into any further detail on WSGI here. PEP 333 does a very good job of describing it, and I heartily suggest you go read the PEP if you are at all curious.
Selector
Selector is a piece of WSGI middleware from Luke Arno that, "...provides WSGI middleware for 'RESTful' mapping of URL paths to WSGI applications." So if we know our URI structure and have built WSGI applications for each of the resources in our application, Selector lets us map all those pieces together in a completely natural way, by mapping from URI Templates and method names into WSGI applications. Let's take Table 2 from above and redo it one more time to drop the resource and representation columns and instead plug in our WSGI application names (see Table 3).
Table 3.

| URI | Method | WSGI Application |
|---|---|---|
| /collection/introspection/ | GET | introspection |
| /collection/ | GET | enumerate_collection |
| /collection/ | POST | create_new_entry |
| /collection/member/{id} | GET | member_get |
| /collection/member/{id} | PUT | member_update |
| /collection/member/{id} | DELETE | member_delete |
Selector makes it easy to specify such a service. Assuming our applications are already defined, we can create a selector object that does the mapping:
```python
import selector

s = selector.Selector()
s.add('/collection/introspection/', GET=introspection)
s.add('/collection/', POST=create_new_entry, GET=enumerate_collection)
s.add('/collection/member/{id}', GET=member_get, PUT=member_update,
      DELETE=member_delete)
```
If we want to run our service as a CGI application, we can use the wrapper provided in the wsgiref library.
```python
from wsgiref.handlers import CGIHandler

CGIHandler().run(s)
```
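For local testing, the same selector object could also be served with wsgiref's built-in development server instead of CGI; a minimal sketch, with an arbitrary host and port:

```python
# A sketch: serve the selector object with wsgiref's development server.
from wsgiref.simple_server import make_server

httpd = make_server('localhost', 8080, s)
httpd.serve_forever()
```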
So, all that's left is the individual applications themselves -- the ones that do the work and provide an interface into our Store class. Let's look at the implementation of the WSGI application to create a new entry, remembering that a WSGI application is just a function or callable object that implements the WSGI interface. In this case the application is implemented as a function.
Create an Entry
```python
import wsgiref.util
from urlparse import urljoin

def create_new_entry(environ, start_response):
    # 1. Check for a good Content-Type: header.
    content_type = environ.get('CONTENT_TYPE', '')
    content_type = content_type.split(';')[0]
    if content_type and content_type != 'application/atom+xml':
        start_response("400 Bad Request", [('Content-Type', 'text/plain')])
        return ["Wrong media type."]

    # 2. Read in the entry.
    length = int(environ['CONTENT_LENGTH'])
    content = environ['wsgi.input'].read(length)

    # 3. Store the entry.
    store = getstore(environ)
    id = store.post(content)

    # 4. Response includes a Location: header with the absolute URI
    #    of the newly created member.
    location = urljoin(wsgiref.util.application_uri(environ),
                       expand_uri_template('/collection/member/{id}',
                                           {'id': id}))
    start_response("201 Created", [('Location', location)])
    return [store.get(id).encode('utf-8')]
```
First, we do some basic checks (1) to ensure we have been sent the right kind of data. Then we read in the entry that was sent (2) and place it in the store (3). If successful, we send a 201 Created response that includes a Location: header (4) pointing to the newly created resource; from the HTTP spec (RFC 2616) we know that the URI returned in that header must be an absolute URI.
The call to wsgiref.util.application_uri() gets us our base URI, and then we use expand_uri_template() to expand the URI Template with the id we just assigned the new entry. The expand_uri_template() function is described fully in the XML.com article, "Constructing or Traversing URIs?".
If the store has problems with the entry -- for example, it isn't well-formed -- it will throw an exception that our WSGI wrapper will catch and turn into an appropriate error response. Note that the default error response isn't very helpful and a future enhancement will be to add more informative status codes and error messages.
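Such a wrapper might take a shape like the following sketch; the blanket except clause and the 400 status are placeholder choices, and a real version would map specific store exceptions to more informative status codes.

```python
import sys

# A sketch of an error-handling wrapper: catch exceptions raised anywhere
# downstream (for example, by the store when an entry isn't well-formed)
# and turn them into a plain-text error response.
class ErrorWrapper:
    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):
        try:
            return self.application(environ, start_response)
        except Exception, e:
            # Passing exc_info lets the server re-raise if headers
            # have already been sent.
            start_response('400 Bad Request',
                           [('Content-Type', 'text/plain')],
                           sys.exc_info())
            return ['The request could not be processed: %s' % e]
```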
Get an Entry
Let's look at the application that handles a GET on a member of the collection:
```python
def member_get(environ, start_response):
    store = getstore(environ)

    # 1. Retrieve entry.
    body = store.get(environ['selector.vars']['id']).encode('utf-8')

    # 2. Send back to client.
    headers = [('Content-Type', 'application/atom+xml;charset=utf-8')]
    start_response("200 OK", headers)
    return [body]
```
This is rather simple since Selector pulls the id out of the request URI and places it in the environment as selector.vars. From the id we can retrieve the entry (1) from the Store and send it back to the client (2). Now I've talked in the past about using etags and the If-None-Match: header to speed up requests if a resource hasn't been updated since the last request. We will need to modify our application to calculate an etag, which in this case will just be an MD5 hash of the response body.
```python
import md5

def member_get(environ, start_response):
    store = getstore(environ)
    body = store.get(environ['selector.vars']['id']).encode('utf-8')
    etag = md5.new(body).hexdigest()                        # 1
    incoming_etag = environ.get('HTTP_IF_NONE_MATCH', '')   # 2
    if etag == incoming_etag:                               # 3
        start_response("304 Not Modified", [])
        return []
    else:
        headers = [('Content-Type', 'application/atom+xml;charset=utf-8'),
                   ('ETag', etag)                            # 4
                  ]
        start_response("200 OK", headers)
        return [body]
```
We calculate the etag (1) for the response and return it via the ETag header (4). If the client has sent an old etag via the If-None-Match: header, we get that etag (2) and compare it against the current etag (3); if they match, then we return with a status of 304 Not Modified and an empty response body, otherwise we just return the entry as before. This means that if a client supports etags and the resource has not been updated since the last GET, then the only data that passes over the wire is the response headers.
In this case we have built the etag handling right into the application to show how easy etag handling can be, but that probably isn't the best way to handle it. A much better approach would be to have our application compute the etag, and to have some WSGI middleware that wraps our applications, looks for the If-None-Match: header, and handles the 304 response, as sketched below.
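Here is a sketch of what that middleware might look like: it leaves the etag computation to the wrapped application, which emits the ETag header, and takes over the If-None-Match comparison and the 304 response.

```python
# A sketch of etag-aware middleware: if the application's ETag header
# matches the incoming If-None-Match header, downgrade the response to 304.
class ETagMiddleware:
    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):
        incoming_etag = environ.get('HTTP_IF_NONE_MATCH', '')
        matched = []

        def my_start_response(status, headers, exc_info=None):
            etag = dict(headers).get('ETag')
            if status.startswith('200') and etag and etag == incoming_etag:
                matched.append(True)
                return start_response('304 Not Modified', [], exc_info)
            return start_response(status, headers, exc_info)

        body = self.application(environ, my_start_response)
        # Drop the body if we downgraded the response to a 304.
        return [] if matched else body
```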
Etag handling isn't the only way to speed up your responses; the response can also be gzip'd. You have several choices when handling gzip. If you are running under Apache you can turn on mod_deflate and that will handle gzip'ing your content. An alternative is to add some WSGI middleware that handles it for you. Here is our startup code from earlier, but with the addition of the gzipper middleware from Python Paste.
```python
from wsgiref.handlers import CGIHandler
import paste.gzipper

s = paste.gzipper.middleware(s, None)
CGIHandler().run(s)
```
Note that we didn't have to change our applications at all; the functionality is completely orthogonal to the existing applications.
Delete an Entry
Deleting an entry is the mirror of GETting an entry -- we get the id of the entry from Selector's parsing of the request URI and we just pass the delete on down to the Store.
```python
def member_delete(environ, start_response):
    store = getstore(environ)
    id = environ['selector.vars']['id']
    store.delete(id)
    start_response("200 OK", [])
    return []
```
Update an Entry
Updating an entry is equally simple in a naive implementation. We read in the sent entry (1) and, after determining the id from the URI, we put the entry into the store (2) at that location.
```python
def member_update(environ, start_response):
    # 1. Read the entry.
    length = int(environ['CONTENT_LENGTH'])
    content = environ['wsgi.input'].read(length)

    store = getstore(environ)
    id = environ['selector.vars']['id']

    # 2. Put the entry in the store.
    store.put(id, content)
    start_response("200 OK", [])
    return []
```
We can do better. One of the things we would like to protect against is lost updates. For example, two different clients request an entry at the same time (that's not a problem), and both clients edit those entries (also not a problem); but then both clients PUT those modified entries back to the server -- now we have a problem! There will be a race condition and one of the clients' edits will be lost. HTTP has a minimal set of capabilities that allows a server to detect a conflict and inform the client of that condition. The solution relies on etags, which we already used to optimize our GETs. In this case we rely on the GET to include an etag and then look for that etag in an If-Match header on the PUT request. If the new and old etags match, then we let the PUT proceed; otherwise, we fail with a status code of 412 Precondition Failed.
```python
def member_update(environ, start_response):
    length = int(environ['CONTENT_LENGTH'])
    content = environ['wsgi.input'].read(length)

    store = getstore(environ)
    id = environ['selector.vars']['id']

    body = store.get(id).encode('utf-8')                    # 1
    etag = md5.new(body).hexdigest()                         # 2
    incoming_etag = environ.get('HTTP_IF_MATCH', '*')        # 3
    if (etag == incoming_etag) or ('*' == incoming_etag):    # 4
        store.put(id, content)
        start_response("200 OK", [])
        return []
    else:
        start_response("412 Precondition Failed", [])         # 5
        return []
```
We will need to determine the etag for the current entry (1)(2) and then compare that to (3) the etag sent in via the If-Match: header. If the two are equal (4), or if the value of the etag sent is '*', then the PUT request goes through as before. A value of '*' for If-Match: means that the client wishes the request to go through regardless of the current resource's etag value, which gives the client a way to forcibly overwrite the server's current value. If the etags don't match (5), we reject the request with a 412 status code.
This code isn't optimal, since we do a get() to retrieve the entry to calculate the etag just to check it against the incoming If-Match: header. A faster way would be to calculate and store the etag for each entry instead of recalculating it every time we need it.
There is also a bug in this code: a request could come in from another client between the call to store.get() and store.put(). In reality we need to either have Store expose some sort of locking of the database or push the etag functionality down into Store.
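Here is a sketch of what pushing the etag check down into Store might look like; it assumes the hypothetical SQLite layout sketched earlier, extended with an etag column, so the compare and the update happen in a single SQL statement rather than racing between get() and put().

```python
# A sketch: conditional update pushed down to the database, assuming an
# entries table with an extra etag column.
import md5
import sqlite3

def put_if_match(filename, id, entry, incoming_etag):
    """Update the entry only if its stored etag matches (or '*' was sent)."""
    new_etag = md5.new(entry).hexdigest()
    conn = sqlite3.connect(filename)
    try:
        if incoming_etag == '*':
            cursor = conn.execute(
                "UPDATE entries SET entry = ?, etag = ?, "
                "updated = datetime('now') WHERE id = ?",
                (entry, new_etag, id))
        else:
            cursor = conn.execute(
                "UPDATE entries SET entry = ?, etag = ?, "
                "updated = datetime('now') WHERE id = ? AND etag = ?",
                (entry, new_etag, id, incoming_etag))
        conn.commit()
        # Zero rows updated means the entry was missing or the etag was stale.
        return cursor.rowcount == 1
    finally:
        conn.close()
```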
This isn't the only way to avoid the lost update problem. Google's GData implementation of the Atom Publishing Protocol gives a unique edit URI to each version of an entry. Every time the entry is updated the edit URI changes. If the client sends a PUT or DELETE to a stale edit URI, then the server returns a status code of 409 Conflict. There are advantages to both approaches. With the ETag approach the edit URI never changes, thus allowing local and intermediate caches to work better. In addition, the ETag approach gives a defined mechanism, If-Match: *, to forcibly overwrite an entry. The GData approach has the advantage that even naive clients will be protected from accidental overwrites. The ETag approach requires the client to know about preserving etags that it sees in GET responses and using them in PUT requests back to the same URI, which is not required of clients of the GData implementation. On the other hand, both systems must be prepared to handle 4xx responses by doing a GET and applying the edits again, so on that account it's a wash.
A Cliff Hanger
Next time we will finish up by looking at the implementations for introspection and for enumerating the entries in a collection. That will require introducing a few more tools before we're done. After that we'll dig into the implementation of Store and start building some applications on top of our APP implementation. Now, you might be asking yourself how we are going to go straight into building applications when I've said nothing about the associated HTML pages for each entry in the collection. In a traditional weblog implementation of the APP, the collection is just an analogue of the web pages that make up the blog, but that doesn't mean those web pages have to exist, and our APP service can add plenty of value all on its own. For a flavor of how such a service could be used, you can read the ACM Queue article "A Conversation with Werner Vogels".