Just Use Media Types?

June 8, 2005

In three of my four Restful Web columns, I've been describing the design of a REST web service for creating and managing web bookmarks. It's now time to get down to some coding. The major part of creating such a service is implementing the method of dispatching: how does an incoming HTTP request get routed to the right piece of code?

As a quick recap, here is the table summarizing the resources in our bookmark service. You will notice that the second column, Representations, now lists the media types of the representations we will accept at each of our resources. For now we'll assume that application/xbel+xml is a valid media type, even though it is not, in fact, registered. IANA maintains a list of the registered media types. If it's not on that list, it's not really a valid type. If you want to officially register a media type, the IANA has a web page for doing so.

For the simple format that we are using as the representation of [user]/config, we will use the media type application/xml. See RFC 3023 and Mark Pilgrim's XML.com article XML on the Web Has Failed to learn why we don't use text/xml.

*Resources in the Bookmark Service*
URI	Representations	Description
[user]/bookmark/[id]/	application/xbel+xml	A single bookmark for "user"
[user]/bookmarks/	application/xbel+xml	The 20 most recent bookmarks for "user"
[user]/bookmarks/all/	application/xbel+xml	All the bookmarks for "user"
[user]/bookmarks/tags/[tag]	application/xbel+xml	The 20 most recent bookmarks for "user" that were filed in the category "tag"
[user]/bookmarks/date/[Y]/[M]/	application/xbel+xml	All the bookmarks for "user" that were created in a certain year [Y] or month [M]
[user]/config/	application/xml	A list of all the "tags" a user has ever used

MIME or Media?

The first confusion to get out of the way is MIME versus media. In many discussions of HTTP you will see reference to both MIME types and media types. What's the difference? MIME stands for Multipurpose Internet Mail Extensions, which are extensions to RFC 822 that allow the transporting of something besides plain ASCII text. If you are going to allow other stuff--that is, other media besides plain text--then you will need to know what type of media it is. Thus RFC 2045 gave birth to MIME Media-Types. They have spread beyond mail messages--that is, beyond MIME--and that includes HTTP. The list of types is used by both MIME and HTTP, but that doesn't mean the HTTP entities are valid RFC 2045 entities--in fact, they aren't.

So where does that leave us? MIME Media-Type is rather awkward, so it's often shortened to MIME type or media type. For our purposes here, they are the same thing.

Where Did He Go?

One of the benefits of using HTTP correctly is that we can dispatch on a whole range of things. To make the discussion more concrete, let's look at an example HTTP request:

GET / HTTP/1.1
Host: 127.0.0.1:8080
User-Agent: Mozilla/5.0 (...) Gecko/20050511 Firefox/1.0.4
Accept: text/xml, application/xml, application/xhtml+xml, text/html;q=0.9,
        text/plain;q=0.8, image/png,*/*;q=0.5
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

There are three items of interest here. First, the HTTP request method is GET. Second, the URI is carried in two locations. The path and query parameters are on the first line of the request. The remainder of the URI, the domain name of the server, is carried in the Host header. Third, the media type is carried in the Accept header because this is a GET request. For POST or PUT requests, the Content-Type header in the request carries the media type of the entity body.
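To make the "two locations" point concrete, here is a minimal sketch of putting the URI back together. The function name request_uri is my own invention for illustration, and I'm assuming the headers have already been parsed into a dict keyed by the exact header name, that the request target is a path (not an absolute URI), and that the scheme is plain http.

```python
def request_uri(request_line, headers, scheme="http"):
    """Rebuild the full URI from the request line and the Host header.

    Hypothetical helper: assumes 'headers' is a dict keyed by the exact
    header name and that the request target is a path such as "/".
    """
    # The request line has three space-separated parts:
    # method, request target, and protocol version.
    method, target, version = request_line.split()
    host = headers["Host"]
    return method, "%s://%s%s" % (scheme, host, target)


method, uri = request_uri("GET / HTTP/1.1", {"Host": "127.0.0.1:8080"})
print(method, uri)
```

For the example request above, this gives back the method GET and the URI http://127.0.0.1:8080/.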

When requests come into our service, we can route them based on the URI, the method, and the media type. We'll return to dispatching on the URI and the HTTP method later. The media type is what we are concentrating on right now. It turns out that dispatching on media types isn't as simple as it sounds. It's not really that complicated--we'll be doing it by the end of this article--but it's not trivial either.

Media type location
Method	Header
GET	Accept
HEAD	Accept
PUT	Content-Type
POST	Content-Type
DELETE	n/a
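The table above translates almost directly into a lookup. This is only a sketch: the names MEDIA_TYPE_HEADER and media_type_header are hypothetical, and I'm again assuming the request headers arrive as a plain dict keyed by the exact header name.

```python
# Which request header carries the media type, per HTTP method.
# None means the method carries no media type we care about.
MEDIA_TYPE_HEADER = {
    "GET": "Accept",
    "HEAD": "Accept",
    "PUT": "Content-Type",
    "POST": "Content-Type",
    "DELETE": None,
}


def media_type_header(method, headers):
    """Return the raw value of the header that carries the media type
    for this method, or None if there is no such header."""
    name = MEDIA_TYPE_HEADER.get(method)
    if name is None:
        return None
    return headers.get(name)


print(media_type_header("PUT", {"Content-Type": "application/xbel+xml"}))
print(media_type_header("DELETE", {}))
```

The value that comes back is still just a string. Making sense of it is the rest of this column.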

If an entity is involved in the request--that is, a POST or PUT--then the media type is contained in the Content-Type header. If the request is a HEAD or GET, then a list of acceptable media types for the response is given in the Accept header. That's actually not true, but I'll discuss the falseness of that claim below. First, let's look at the Content-Type header. Here is the definition straight from the HTTP specification (RFC 2616):

Content-Type   = "Content-Type" ":" media-type
media-type     = type "/" subtype *( ";" parameter )
parameter      = attribute "=" value
attribute      = token
value          = token | quoted-string
quoted-string  = ( <"> *(qdtext | quoted-pair ) <"> )
qdtext         = <any TEXT except <">>
quoted-pair    = "\" CHAR
type           = token
subtype        = token
token          = 1*<any CHAR except CTLs or separators>
separators     = "(" | ")" | "<" | ">" | "@"
               | "," | ";" | ":" | "\" | <">
               | "/" | "[" | "]" | "?" | "="
               | "{" | "}" | SP | HT
CTL            = <any US-ASCII ctl chr (0-31) and DEL (127)>
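The token rule is the workhorse of that grammar, so here is a small sketch of checking it by hand. The helper name is_token is hypothetical. It relies on the RFC 2616 definition of CHAR as any US-ASCII character (0-127), which isn't quoted above.

```python
# The separators listed in the grammar, plus SP (space) and HT (tab).
SEPARATORS = set('()<>@,;:\\"/[]?={} \t')


def is_token(s):
    """True if s is a non-empty run of US-ASCII characters that are
    neither control characters (0-31, 127) nor separators."""
    if not s:
        return False
    for ch in s:
        code = ord(ch)
        # Reject CTLs, DEL, anything outside US-ASCII, and separators.
        if code <= 31 or code >= 127 or ch in SEPARATORS:
            return False
    return True


print(is_token("application"), is_token("xbel+xml"), is_token("text/xml"))
```

Note that "+" is allowed in a token, which is why xbel+xml is a legal subtype, while "/" is a separator and so can never appear inside a type or subtype.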

I've gathered up all the pertinent pieces, but really the thing we'll be using the most is the definition of media-type. That definition states that a media type contains a type, subtype, and parameter, which are separated by "/" and ";" characters, respectively. We can decompose a media-type into its component parts using Python code like this:

(mime_type, parameter) = media_type.split(";")  # assumes exactly one parameter
(type, subtype) = mime_type.split("/")

I said the Accept header contained a list of all of the media types that the client was able to, well, accept. That isn't quite true. Accept is a little more complicated, allowing the client to list multiple media ranges. A media range is different from a media type: a media range can use wildcards (*) for the type and subtype and can have multiple parameters. For example, the media type application/xbel+xml would match any of the media ranges application/xbel+xml, application/*, or */*. One of the parameters that can be used is q, which is a quality indicator. It has a value, from 0.0 to 1.0, that indicates the client's preference for that media range. The higher the quality indicator value, the more preferred the media range is.
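Before we get to real parsing, the wildcard part of that can be sketched in a few lines. The name range_matches is hypothetical, and this version deliberately ignores parameters (including q) so that only the type and subtype comparison is on show.

```python
def range_matches(media_type, media_range):
    """True if media_type (e.g. "application/xbel+xml") falls inside
    media_range (e.g. "application/*"). Parameters are ignored."""
    (type_, subtype) = media_type.split("/")
    # Drop any ";q=..." style parameters from the range before comparing.
    bare_range = media_range.split(";")[0].strip()
    (range_type, range_subtype) = bare_range.split("/")
    return range_type in (type_, "*") and range_subtype in (subtype, "*")


for media_range in ["application/xbel+xml", "application/*", "*/*", "text/*"]:
    print(media_range, range_matches("application/xbel+xml", media_range))
```

The first three ranges match application/xbel+xml and the last one, text/*, does not.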

Microsoft's Internet Explorer browser typically uses the following Accept header: Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, */*, while Mozilla Firefox typically uses Accept: text/xml, application/xml, application/xhtml+xml, text/html;q=0.9, text/plain;q=0.8, image/png,*/*;q=0.5.

One thing that makes our lives a little easier is that media-type, as defined for the Content-Type header, is also a valid media range for an Accept header. So we only have to parse strings defined by media-type. If we do that well, then we will be able to parse Accept headers without much additional work.

Our first function is parse_mime_type:

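def parse_mime_type(mime_type):
    # Split "type/subtype;key=value;..." into the full type and its parameters.
    parts = mime_type.split(";")
    # Build a dict of parameters, splitting each one on its first "=" only.
    params = dict([tuple([s.strip() for s in param.split("=", 1)])
                   for param in parts[1:]])
    (type, subtype) = parts[0].split("/")
    return (type.strip(), subtype.strip(), params)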

Let's follow the code by watching how a media range would be dissected. If our media range is application/xhtml+xml;q=0.5, then

parts = ["application/xhtml+xml", "q=0.5"]
params = {"q": "0.5"}
(type, subtype) = ("application", "xhtml+xml")

and the function returns the tuple ("application", "xhtml+xml", {"q": "0.5"}).

Now remember that the difference between a MIME type and a media range is the presence of wildcards and the q parameter. Our parse_mime_type function doesn't actually care about wildcards and will happily parse them. All that's left is to ensure that the q quality parameter is set, using a default value of 1 if none is given.

def parse_media_range(media_range):
    (type, subtype, params) = parse_mime_type(media_range)
    # Fall back to q=1 when q is missing, empty, zero, or outside 0..1.
    if 'q' not in params or not params['q'] or \
            not float(params['q']) or float(params['q']) > 1 \
            or float(params['q']) < 0:
        params['q'] = '1'
    return (type, subtype, params)
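One wrinkle worth knowing about: because not float('0') is true, the function above treats q=0 the same as a missing q and resets it to 1, whereas RFC 2616 uses a quality value of 0 to mean "not acceptable." If that distinction matters to your service, a stricter reading of q might look something like the sketch below. The helper parse_q is hypothetical and is not part of the module described in this column. It also swallows non-numeric values rather than letting float() raise.

```python
def parse_q(params):
    """Return the quality value from a parameter dict as a float.

    Defaults to 1.0 when q is missing, empty, non-numeric, or outside
    the range 0..1, but keeps an explicit q=0 as 0.0.
    """
    try:
        q = float(params.get("q", "1"))
    except ValueError:
        return 1.0
    # The chained comparison also rejects NaN and infinity.
    if not (0.0 <= q <= 1.0):
        return 1.0
    return q


for params in [{}, {"q": "0.5"}, {"q": "0"}, {"q": "2"}, {"q": "high"}]:
    print(params, parse_q(params))
```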

So we can parse media ranges, and now we need to compare a target media type against a list of media ranges. That is, if we know our application supports image/jpeg, and we get a request that contains an Accept header--image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, */*--will the client be able to accept a response with the MIME type image/jpeg? And what is the quality value associated with that type?

This is where things get a little tricky. Here are the rules for how to match a media type to a list of media ranges, which are distilled from Section 14.1 of RFC 2616:

More specific media ranges have precedence. application/foo;key=value has a higher precedence than application/foo, which has a higher precedence than application/*, which in turn has a higher precedence than */*.
Once a match is found, the q parameter for that media range is applied.

Once we have this match function working, then matching up the media types we accept is easy: just pass each one to the match function; the one that comes out with the highest q value is the winner and, therefore, the MIME type of the representation we are going to return. I like to turn these kinds of comparisons into math problems. (It's the kind of thing I do.) To find the most specific match, we'll score a media range in the following way:

If a media range matches the "type," it scores 100 points.
If a media range matches the "subtype," it scores an additional 10 points.
If a media range matches in the parameters, it scores 1 point for each parameter.
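As a worked example of that arithmetic, here is a standalone sketch that scores the four ranges from the precedence rule against the target application/foo;key=value. The function name fitness and the tuple layout (type, subtype, params) are assumptions made for this illustration. A range that doesn't match at all gets -1.

```python
def fitness(media_range, target):
    """Score a parsed media range against a parsed target media type.
    Both are (type, subtype, params) tuples; returns -1 if no match."""
    (r_type, r_subtype, r_params) = media_range
    (t_type, t_subtype, t_params) = target
    # A range only matches if each half is equal to the target's or a wildcard.
    if r_type not in (t_type, "*") or r_subtype not in (t_subtype, "*"):
        return -1
    score = 0
    if r_type == t_type:
        score += 100  # exact type match
    if r_subtype == t_subtype:
        score += 10   # exact subtype match
    # One point for each target parameter (other than q) the range repeats.
    score += sum(1 for (key, value) in t_params.items()
                 if key != "q" and r_params.get(key) == value)
    return score


target = ("application", "foo", {"key": "value"})
for media_range in [("*", "*", {}),
                    ("application", "*", {}),
                    ("application", "foo", {}),
                    ("application", "foo", {"key": "value"})]:
    print(media_range, fitness(media_range, target))
```

The four ranges score 0, 100, 110, and 111 respectively, which is exactly the precedence order the rule asks for.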

Now we just score each media range, and the one with the highest score is the best match. We return the q parameter of the best match.

def quality_parsed(mime_type, parsed_ranges):
    """Find the best match for a given mime_type against a list of
       media_ranges that have already been parsed by
       parse_media_range(). Returns the 'q' quality parameter of the
       best match, 0 if no match was found. This function behaves the
       same as quality() except that 'parsed_ranges' must be a list of
       parsed media ranges."""
    best_fitness = -1
    best_fit_q = 0
    (target_type, target_subtype, target_params) = parse_media_range(mime_type)
    for (type, subtype, params) in parsed_ranges:
        # Count the target's parameters (other than q) that the range repeats.
        param_matches = sum([1 for (key, value) in
                target_params.items() if key != 'q' and
                key in params and value == params[key]])
        if (type == target_type or type == '*') \
                and (subtype == target_subtype or subtype == "*"):
            fitness = (type == target_type) and 100 or 0
            fitness += (subtype == target_subtype) and 10 or 0
            fitness += param_matches
            if fitness > best_fitness:
                best_fitness = fitness
                best_fit_q = params['q']
    return float(best_fit_q)

The best_match function, which ties all of this together, takes the list of MIME types that we support and the value of the Accept: header and returns the best match.

def best_match(supported, header):
    """Takes a list of supported mime-types and finds the best match
    for all the media-ranges listed in header. The value of header
    must be a string that conforms to the format of the HTTP Accept:
    header. The value of 'supported' is a list of mime-types.

    >>> best_match(['application/xbel+xml', 'text/xml'],
    ...            'text/*;q=0.5,*/*; q=0.1')
    'text/xml'
    """
    parsed_header = [parse_media_range(r) for r in header.split(",")]
    weighted_matches = [(quality_parsed(mime_type, parsed_header), mime_type)
                        for mime_type in supported]
    weighted_matches.sort()
    return weighted_matches[-1][0] and weighted_matches[-1][1] or ''

The full Python module, which includes comments and unit tests, is available from bitworking.org.

So now let's loop back to where we started. When we receive an HTTP request, part of our dispatching is going to depend on the media type. The header we need to look at depends on the type of the request or response. Using our newly created module, we can parse both the Content-Type and Accept headers. In the next column we'll jump into the meat of dispatching our incoming requests.