Just Use Media Types?
June 8, 2005
In three of my four Restful Web columns, I've been describing the design of a REST web service for creating and managing web bookmarks. It's now time to get down to some coding. The major part of creating such a service is implementing the method of dispatching: how does an incoming HTTP request get routed to the right piece of code?
As a quick recap, here is the table summarizing the resources in our bookmark service.
You will notice that the second column, Representations, now lists the media types of the representations we will accept at each of our resources. For now we'll assume that application/xbel+xml is a valid media type, even though it is not, in fact, registered. IANA maintains a list of the registered media types. If it's not on that list, it's not really a valid type. If you want to officially register a media type, the IANA has a web page for doing so. For the simple format that we are using as the representation of [user]/config, we will use the media type application/xml. See RFC 3023 and Mark Pilgrim's XML.com article "XML on the Web Has Failed" to learn why we don't use text/xml.
| URI | Representations | Description |
|---|---|---|
| [user]/bookmark/[id]/ | application/xbel+xml | A single bookmark for "user" |
| [user]/bookmarks/ | application/xbel+xml | The 20 most recent bookmarks for "user" |
| [user]/bookmarks/all/ | application/xbel+xml | All the bookmarks for "user" |
| [user]/bookmarks/tags/[tag] | application/xbel+xml | The 20 most recent bookmarks for "user" that were filed in the category "tag" |
| [user]/bookmarks/date/[Y]/[M]/ | application/xbel+xml | All the bookmarks for "user" that were created in a certain year [Y] or month [M] |
| [user]/config/ | application/xml | A list of all the "tags" a user has ever used |
MIME or Media?
The first confusion to get out of the way is MIME versus media. In many discussions of HTTP you will see reference to both MIME types and media types. What's the difference? MIME stands for Multipurpose Internet Mail Extensions, which are extensions to RFC 822 that allow the transporting of something besides plain ASCII text. If you are going to allow other stuff--that is, other media besides plain text--then you will need to know what type of media it is. Thus RFC 2045 gave birth to MIME Media-Types. They have since spread beyond mail messages--that is, beyond MIME--to other protocols, including HTTP. The list of types is used by both MIME and HTTP, but that doesn't mean that HTTP entities are valid RFC 2045 entities--in fact, they aren't.
So where does that leave us? MIME Media-Type is rather awkward, so it's often shortened to MIME type or media type. For our purposes here, they are the same thing.
Where Did He Go?
One of the benefits of using HTTP correctly is that we can dispatch on a whole range of things. To make the discussion more concrete, let's look at an example HTTP request:
    GET / HTTP/1.1
    Host: 127.0.0.1:8080
    User-Agent: Mozilla/5.0 (...) Gecko/20050511 Firefox/1.0.4
    Accept: text/xml, application/xml, application/xhtml+xml, text/html;q=0.9, text/plain;q=0.8, image/png,*/*;q=0.5
    Accept-Language: en-us,en;q=0.5
    Accept-Encoding: gzip,deflate
    Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
    Keep-Alive: 300
    Connection: keep-alive
There are three items of interest here. First, the HTTP request method is GET. Second, the URI is carried in two locations. The path and query parameters are on the first line of the request. The remainder of the URI, the domain name of the server, is carried in the Host header. Third, the media type is carried in the Accept header because this is a GET request. For POST or PUT requests, the Content-Type header in the request carries the media type of the entity body.
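To make the "two locations" point concrete, here is a minimal sketch that reassembles the method and the full URI from a raw request. The function name split_request, the http scheme default, and the sample path and query string are assumptions for illustration only; the request itself does not say which scheme was used.

```python
def split_request(raw_request, scheme="http"):
    """Return (method, full_uri) for a raw HTTP/1.1 request string.

    The scheme is an assumption: it is not carried in the request.
    """
    lines = raw_request.strip().splitlines()
    # Request line: method, then path and query, then protocol version.
    method, target, _version = lines[0].split(" ", 2)
    host = ""
    for line in lines[1:]:
        if not line.strip():
            break  # a blank line ends the headers
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            host = value.strip()
    return method, scheme + "://" + host + target


# Hypothetical request against the bookmark service.
request = (
    "GET /joe/bookmarks/?page=2 HTTP/1.1\r\n"
    "Host: 127.0.0.1:8080\r\n"
    "Accept: application/xbel+xml\r\n"
    "\r\n"
)
print(split_request(request))
# ('GET', 'http://127.0.0.1:8080/joe/bookmarks/?page=2')
```

The sketch ignores folded headers and absolute request targets, which a real server library would have to handle.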
When requests come into our service, we can route them based on the URI, the method, and the media type. We'll return to dispatching on the URI and the HTTP method later. The media type is what we are concentrating on right now. It turns out that dispatching on media types isn't as simple as it sounds. It's not really that complicated--we'll be doing it by the end of this article--but it's not trivial either.
| Method | Header |
|---|---|
| GET | Accept |
| HEAD | Accept |
| PUT | Content-Type |
| POST | Content-Type |
| DELETE | n/a |
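The table above translates directly into a lookup. The sketch below is only one way to write it; MEDIA_TYPE_HEADER and media_type_value are hypothetical names, and the headers argument is assumed to be a plain dict of header names to values.

```python
# Which header carries the media type for each HTTP method (from the table).
MEDIA_TYPE_HEADER = {
    "GET": "Accept",
    "HEAD": "Accept",
    "PUT": "Content-Type",
    "POST": "Content-Type",
    "DELETE": None,  # no media type is involved
}


def media_type_value(method, headers):
    """Return the raw header value to dispatch on, or None if there is none."""
    header_name = MEDIA_TYPE_HEADER.get(method)
    if header_name is None:
        return None
    # Header names are case-insensitive, so compare them lowercased.
    lowered = {name.lower(): value for name, value in headers.items()}
    return lowered.get(header_name.lower())


print(media_type_value("GET", {"accept": "text/html"}))
print(media_type_value("PUT", {"Content-Type": "application/xml"}))
print(media_type_value("DELETE", {"Accept": "*/*"}))
```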
If an entity is involved in the request--that is, a POST or PUT--then the media type is contained in the Content-Type header. If the request is a HEAD or GET, then a list of acceptable media types for the response is given in the Accept header. That's actually not true, but I'll discuss the falseness of that claim below. First, let's look at the Content-Type header. Here is the definition straight from the HTTP specification (RFC 2616):
    Content-Type   = "Content-Type" ":" media-type
    media-type     = type "/" subtype *( ";" parameter )
    parameter      = attribute "=" value
    attribute      = token
    value          = token | quoted-string
    quoted-string  = ( <"> *(qdtext | quoted-pair ) <"> )
    qdtext         = <any TEXT except <">>
    quoted-pair    = "\" CHAR
    type           = token
    subtype        = token
    token          = 1*<any CHAR except CTLs or separators>
    separators     = "(" | ")" | "<" | ">" | "@"
                   | "," | ";" | ":" | "\" | <">
                   | "/" | "[" | "]" | "?" | "="
                   | "{" | "}" | SP | HT
    CTL            = <any US-ASCII ctl chr (0-31) and DEL (127)>
I've gathered up all the pertinent pieces, but really the thing we'll be using the most is the definition of media-type. That definition states that a media type contains a type, a subtype, and zero or more parameters, which are separated by "/" and ";" characters, respectively. We can decompose a media-type that carries a single parameter into its component parts using Python code like this:
    (mime_type, parameter) = media_type.split(";")
    (type, subtype) = mime_type.split("/")
I said the Accept header contained a list of all of the media types that the client was able to, well, accept. That isn't quite true. Accept is a little more complicated, allowing the client to list multiple media ranges. A media range is different from a media type: a media range can use wildcards (*) for the type and subtype and can have multiple parameters. For example, application/xbel+xml could match application/xbel+xml, application/*, or */*. One of the parameters that can be used is q, which is a quality indicator. It has a value, from 0.0 to 1.0, that indicates the client's preference for that media range. The higher the quality indicator value, the more preferred the media type is.
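Wildcard matching on its own is easy to sketch. The helper below, range_matches, is a hypothetical name used only for illustration; it ignores parameters entirely and simply checks whether a concrete media type falls inside a media range.

```python
def range_matches(media_type, media_range):
    """True if media_type (no wildcards) falls inside media_range."""
    # Drop any parameters such as ";q=0.5" before comparing.
    type_, subtype = media_type.split(";")[0].strip().split("/")
    range_type, range_subtype = media_range.split(";")[0].strip().split("/")
    return range_type in (type_, "*") and range_subtype in (subtype, "*")


for media_range in ("application/xbel+xml", "application/*", "*/*", "text/*"):
    print(media_range, range_matches("application/xbel+xml", media_range))
```

The first three ranges match and text/* does not. This says nothing yet about which matching range should win, or what q value applies.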
Microsoft's Internet Explorer browser typically uses the following Accept header:

    Accept: image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, */*

while Mozilla Firefox typically uses:

    Accept: text/xml, application/xml, application/xhtml+xml, text/html;q=0.9, text/plain;q=0.8, image/png,*/*;q=0.5
One thing that makes our lives a little easier is that media-type, as defined for the Content-Type header, is also a valid media range for an Accept header. So we only have to parse strings defined by media-type. If we do that well, then we will be able to parse Accept headers without much additional work.
Our first function is parse_mime_type:

    def parse_mime_type(mime_type):
        parts = mime_type.split(";")
        params = dict([tuple([s.strip() for s in param.split("=")])
                       for param in parts[1:]])
        (type, subtype) = parts[0].split("/")
        return (type.strip(), subtype.strip(), params)
Let's follow the code by watching how a media range would be dissected. If our media range is application/xhtml+xml;q=0.5, then

    parts = ["application/xhtml+xml", "q=0.5"]
    params = {"q": "0.5"}
    (type, subtype) = ("application", "xhtml+xml")

and the function returns the tuple ("application", "xhtml+xml", {"q": "0.5"}).
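Real headers are sometimes untidier than that example. As a sketch only, here is a more forgiving variant under a hypothetical name, parse_mime_type_lenient, that skips empty pieces (such as a trailing ";") and splits each parameter on the first "=" only. It still does not understand quoted strings that contain ";".

```python
def parse_mime_type_lenient(mime_type):
    """Like parse_mime_type, but tolerant of a few untidy inputs."""
    parts = mime_type.split(";")
    params = {}
    for param in parts[1:]:
        if "=" not in param:
            continue  # skip empty pieces such as a trailing ";"
        # Split on the first "=" only, so values may contain "=".
        key, value = param.split("=", 1)
        params[key.strip()] = value.strip()
    (type_, subtype) = parts[0].split("/")
    return (type_.strip(), subtype.strip(), params)


print(parse_mime_type_lenient("application/xhtml+xml;q=0.5"))
print(parse_mime_type_lenient("application/xml; charset=utf-8;"))
print(parse_mime_type_lenient("*/*"))
```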
Now remember that the difference between a MIME type and a media range is the presence of wildcards and the q parameter. Our parse_mime_type function doesn't actually care about wildcards and will happily parse them. All that's left is to ensure that the q quality parameter is set, using a default value of 1 if none is given.
    def parse_media_range(range):
        (type, subtype, params) = parse_mime_type(range)
        # Fall back to q=1 when q is missing, empty, or outside 0..1.
        # An explicit q=0 is kept: it marks the range as not acceptable.
        if 'q' not in params or not params['q'] or \
                float(params['q']) > 1 or float(params['q']) < 0:
            params['q'] = '1'
        return (type, subtype, params)
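The defaulting rule can also be isolated into a small helper, which makes the edge cases easier to see. This is a sketch rather than the module's code: normalized_q is a hypothetical name, it returns a float instead of rewriting the dict, it treats a non-numeric q as missing, and it deliberately keeps an explicit q=0, which RFC 2616 uses to mark a type as not acceptable.

```python
def normalized_q(params):
    """Return the q value from a parameter dict as a float in [0, 1]."""
    try:
        q = float(params.get("q", "1"))
    except ValueError:
        return 1.0  # not a number (or empty), e.g. q=high
    if not 0.0 <= q <= 1.0:
        return 1.0  # out of range; this test also catches NaN
    return q


for params in ({}, {"q": "0.5"}, {"q": "0"}, {"q": "7"}, {"q": "high"}):
    print(params, normalized_q(params))
```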
So we can parse media ranges, and now we need to compare a target media type against a list of media ranges. That is, if we know our application supports image/jpeg, and we get a request that contains an Accept header--image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, application/x-shockwave-flash, */*--will the client be able to accept a response with a MIME type image/jpeg? And what is the quality value associated with that type?
This is where things get a little tricky. Here are the rules for how to match a media type to a list of media ranges, which are distilled from Section 14.1 of RFC 2616:
- More specific media ranges have precedence. application/foo;key=value has a higher precedence than application/foo, which has a higher precedence than application/*, which in turn has a higher precedence than */*.
- Once a match is found, the q parameter for that media range is applied.
Once we have this match function working, then matching up the media types we accept is easy: just pass each one to the match function; the one that comes out with the highest q value is the winner and, therefore, the MIME type of the representation we are going to return. I like to turn these kinds of comparisons into math problems. (It's the kind of thing I do.) To find the most specific match, we'll score a media range in the following way:
- If a media range matches the "type," it scores 100 points.
- If a media range matches the "subtype," it scores an additional 10 points.
- If a media range matches in the parameters, it scores 1 point for each parameter.
Now we just score each media range, and the one with the highest score is the best match. We return the q parameter of the best match.
    def quality_parsed(mime_type, parsed_ranges):
        """Find the best match for a given mime_type against a list of
        media_ranges that have already been parsed by
        parse_media_range(). Returns the 'q' quality parameter of the
        best match, 0 if no match was found. This function behaves the
        same as quality() except that 'parsed_ranges' must be a list of
        parsed media ranges."""
        best_fitness = -1
        best_fit_q = 0
        (target_type, target_subtype, target_params) = parse_media_range(mime_type)
        for (type, subtype, params) in parsed_ranges:
            # One point for every non-q parameter that both sides agree on.
            param_matches = sum([1 for (key, value) in
                                 target_params.items() if key != 'q' and
                                 key in params and value == params[key]])
            if ((type == target_type or type == '*') and
                    (subtype == target_subtype or subtype == '*')):
                fitness = (type == target_type) and 100 or 0
                fitness += (subtype == target_subtype) and 10 or 0
                fitness += param_matches
                if fitness > best_fitness:
                    best_fitness = fitness
                    best_fit_q = params['q']
        return float(best_fit_q)
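To see the scoring at work, here is a standalone sketch of just the 100/10/1 arithmetic. The fitness helper is a hypothetical name; it takes already-parsed (type, subtype, params) tuples and returns -1 when the range does not match at all. The ranges are the four from the precedence rule above, with made-up q values.

```python
def fitness(target, media_range):
    """Score how specifically media_range matches target; -1 means no match.

    Both arguments are (type, subtype, params) tuples.
    """
    (target_type, target_subtype, target_params) = target
    (type_, subtype, params) = media_range
    if type_ not in (target_type, "*") or subtype not in (target_subtype, "*"):
        return -1
    score = 100 if type_ == target_type else 0
    score += 10 if subtype == target_subtype else 0
    # One extra point for every non-q parameter both sides agree on.
    score += sum(1 for key, value in target_params.items()
                 if key != "q" and params.get(key) == value)
    return score


target = ("application", "foo", {"key": "value"})
ranges = [
    ("*", "*", {"q": "0.1"}),
    ("application", "*", {"q": "0.5"}),
    ("application", "foo", {"q": "0.8"}),
    ("application", "foo", {"key": "value", "q": "1"}),
]
for media_range in ranges:
    print(media_range[:2], fitness(target, media_range))
```

The four ranges score 0, 100, 110, and 111 in that order, so the most specific range wins and its q value is the one that would be returned.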
The best_match function, which ties all of this together, takes the list of MIME types that we support and the value of the Accept: header and returns the best match.
    def best_match(supported, header):
        """Takes a list of supported mime-types and finds the best match
        for all the media-ranges listed in header. The value of header
        must be a string that conforms to the format of the HTTP Accept:
        header. The value of 'supported' is a list of mime-types.

        >>> best_match(['application/xbel+xml', 'text/xml'],
        ...            'text/*;q=0.5,*/*; q=0.1')
        'text/xml'
        """
        parsed_header = [parse_media_range(r) for r in header.split(",")]
        weighted_matches = [(quality_parsed(mime_type, parsed_header), mime_type)
                            for mime_type in supported]
        weighted_matches.sort()
        return weighted_matches[-1][0] and weighted_matches[-1][1] or ''
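To try the whole chain against the two browser headers quoted earlier, here is a condensed, self-contained Python 3 port. Treat it as a sketch and not as the published module: the bodies are tightened, an explicit q=0 is preserved, and SUPPORTED is simply the two media types from the resource table.

```python
def parse_mime_type(mime_type):
    parts = mime_type.split(";")
    params = dict(tuple(s.strip() for s in param.split("=", 1))
                  for param in parts[1:])
    (type_, subtype) = parts[0].split("/")
    return (type_.strip(), subtype.strip(), params)


def parse_media_range(media_range):
    (type_, subtype, params) = parse_mime_type(media_range)
    try:
        if not 0 <= float(params["q"]) <= 1:
            raise ValueError
    except (KeyError, ValueError):
        params["q"] = "1"  # missing, malformed, or out-of-range q
    return (type_, subtype, params)


def quality_parsed(mime_type, parsed_ranges):
    best_fitness, best_fit_q = -1, 0
    (target_type, target_subtype, target_params) = parse_media_range(mime_type)
    for (type_, subtype, params) in parsed_ranges:
        if type_ in (target_type, "*") and subtype in (target_subtype, "*"):
            fitness = 100 if type_ == target_type else 0
            fitness += 10 if subtype == target_subtype else 0
            fitness += sum(1 for key, value in target_params.items()
                           if key != "q" and params.get(key) == value)
            if fitness > best_fitness:
                best_fitness, best_fit_q = fitness, params["q"]
    return float(best_fit_q)


def best_match(supported, header):
    parsed_header = [parse_media_range(r) for r in header.split(",")]
    weighted = sorted((quality_parsed(m, parsed_header), m) for m in supported)
    return weighted[-1][1] if weighted[-1][0] else ""


SUPPORTED = ["application/xbel+xml", "application/xml"]
FIREFOX = ("text/xml, application/xml, application/xhtml+xml, text/html;q=0.9, "
           "text/plain;q=0.8, image/png,*/*;q=0.5")
MSIE = ("image/gif, image/x-xbitmap, image/jpeg, image/pjpeg, "
        "application/x-shockwave-flash, */*")

print(best_match(SUPPORTED, FIREFOX))      # application/xml
print(best_match(SUPPORTED, MSIE))         # application/xml
print(best_match(SUPPORTED, "image/png"))  # empty string: nothing acceptable
```

Firefox names application/xml explicitly, so it beats application/xbel+xml, which only matches */* at q=0.5. Internet Explorer's header gives both types the same quality through */*, and the tie is broken purely by sort order, which is worth keeping in mind if the server has a preferred representation.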
The full Python module, which includes comments and unit tests, is available from bitworking.org.
So now let's loop back to where we started. When we receive an HTTP request, part of our dispatching is going to depend on the media type. The header we need to look at depends on the type of the request or response. Using our newly created module, we can parse both the Content-Type and Accept headers. In the next column we'll jump into the meat of dispatching our incoming requests.