Doing HTTP Caching Right: Introducing httplib2
February 1, 2006
You need to understand HTTP caching. No, really, you do. I have mentioned repeatedly that you need to choose your HTTP methods carefully when building a web service, in part because you can get the performance benefits of caching with GET. Well, if you want to get the real advantages of GET then you need to understand caching and how you can use it effectively to improve the performance of your service.
This article will not explain how to set up caching for your particular web server, nor will it cover the different kinds of caches. If you want that kind of information I recommend Mark Nottingham's excellent tutorial on HTTP caching.
Goals
First you need to understand the goals of the HTTP caching model. One objective is to let both the client and server have a say over when to return a cached entry. Giving both sides input on when a cached entry is considered stale inevitably introduces some complexity.
The HTTP caching model is based on validators, which are bits of data that a client can use to validate that a cached response is still valid. They are fundamental to the operation of caches since they allow a client or intermediary to query the status of a resource without having to transfer the entire response again: the server returns an entity body only if the validator indicates that the cache has a stale response.
Validators
One of the validators for HTTP is the ETag. An ETag is like a fingerprint for the bytes in the representation; if a single byte changes the ETag also changes.
Using validators requires that you have already done a GET once on a resource. The cache stores the value of the ETag header if present and then uses the value of that header in later requests to that same URI.
For example, if I send a request to example.org and get back this response:

    HTTP/1.1 200 OK
    Date: Fri, 30 Dec 2005 17:30:56 GMT
    Server: Apache
    ETag: "11c415a-8206-243aea40"
    Accept-Ranges: bytes
    Content-Length: 33286
    Vary: Accept-Encoding,User-Agent
    Cache-Control: max-age=7200
    Expires: Fri, 30 Dec 2005 19:30:56 GMT
    Content-Type: image/png

    -- binary data --

Then the next time I do a GET I can add the validator in. Note that the value of ETag is placed in the If-None-Match: header.

    GET / HTTP/1.1
    Host: example.org
    If-None-Match: "11c415a-8206-243aea40"

If there was no change in the representation then the server returns a 304 Not Modified.

    HTTP/1.1 304 Not Modified
    Date: Fri, 30 Dec 2005 17:32:47 GMT

If there was a change, the new representation is returned with a status code of 200 and a new ETag.

    HTTP/1.1 200 OK
    Date: Fri, 30 Dec 2005 17:32:47 GMT
    Server: Apache
    ETag: "0192384-9023-1a929893"
    Accept-Ranges: bytes
    Content-Length: 33286
    Vary: Accept-Encoding,User-Agent
    Cache-Control: max-age=7200
    Expires: Fri, 30 Dec 2005 19:32:47 GMT
    Content-Type: image/png

    -- binary data --

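The exchange above can be sketched from the server's side. This is only an illustration under simplifying assumptions, not how Apache or any other server implements it: the function name conditional_get is made up for this example, entity tags are compared as exact strings, and the If-None-Match: value is split naively on commas.

```python
def conditional_get(current_etag, request_headers, body):
    """Pick between 304 and 200 for a GET, based on If-None-Match.

    Simplified sketch: exact string comparison of entity tags, plus "*".
    """
    candidates = [
        tag.strip()
        for tag in request_headers.get("If-None-Match", "").split(",")
        if tag.strip()
    ]
    if "*" in candidates or current_etag in candidates:
        # The validator matched: the cached copy is still good, so no body.
        return 304, {"ETag": current_etag}, b""
    # No validator was sent, or it did not match: send everything again.
    return 200, {"ETag": current_etag}, body


status, headers, payload = conditional_get(
    '"11c415a-8206-243aea40"',
    {"If-None-Match": '"11c415a-8206-243aea40"'},
    b"-- binary data --",
)
print(status, len(payload))
```

The point of the sketch is the asymmetry: a match costs a few header bytes, while a mismatch costs the whole representation.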
Cache-Control
While validators are used to test if a cached entry is still valid, the Cache-Control: header is used to signal how long a representation can be cached. The most fundamental of all the cache-control directives is max-age. This directive asserts that the cached response can be only max-age seconds old before being considered stale. Note that max-age can appear in both request headers and response headers, which gives both the client and server a chance to assert how old a cached response they are willing to accept. If a cached response is fresh then we can return the cached response immediately; if it's stale then we need to validate the cached response before returning it.
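The fresh-or-stale decision can be sketched in a few lines. This is a rough illustration, not a complete age calculation: it assumes the response's age is simply the current time minus the Date: header, ignoring the Age: header and network delay, and the helper names max_age_of and is_fresh are invented for this example.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def max_age_of(cache_control):
    """Return the max-age value in seconds, or None if it is absent."""
    for directive in cache_control.split(","):
        name, _, value = directive.strip().partition("=")
        if name.lower() == "max-age" and value.isdigit():
            return int(value)
    return None


def is_fresh(response_headers, now):
    """True if the stored response is younger than its max-age."""
    max_age = max_age_of(response_headers.get("Cache-Control", ""))
    if max_age is None:
        return False  # no explicit lifetime: treat as stale and validate
    stored_at = parsedate_to_datetime(response_headers["Date"])
    return (now - stored_at).total_seconds() < max_age


stored = {
    "Date": "Fri, 30 Dec 2005 17:32:47 GMT",
    "Cache-Control": "max-age=7200",
}
half_an_hour_later = datetime(2005, 12, 30, 18, 2, 47, tzinfo=timezone.utc)
print(is_fresh(stored, half_an_hour_later))
```

A fresh entry is returned straight from the cache; a stale one sends the client back to the validators described earlier.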
Let's take another look at our example response from above. Note that the Cache-Control: header is set and that a max-age of 7200 means that the entry can be cached for up to two hours.

    HTTP/1.1 200 OK
    Date: Fri, 30 Dec 2005 17:32:47 GMT
    Server: Apache
    ETag: "0192384-9023-1a929893"
    Accept-Ranges: bytes
    Content-Length: 33286
    Vary: Accept-Encoding,User-Agent
    Cache-Control: max-age=7200
    Expires: Fri, 30 Dec 2005 19:32:47 GMT
    Content-Type: image/png

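There are lots of directives that can be put in the Cache-Control: header, and the Cache-Control: header may appear in requests, in responses, or in both. The first table below lists the directives a client can send in a request; the second lists the directives a server can send in a response.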
Request directive | Description |
---|---|
no-cache | The cached response must not be used to satisfy this request. |
no-store | Do not store this response in a cache. |
max-age=delta-seconds | The client is willing to accept a cached response that is delta-seconds old without validating. |
max-stale=delta-seconds | The client is willing to accept a cached response that is no more than delta-seconds stale. |
min-fresh=delta-seconds | The client is willing to accept only a cached response that will still be fresh delta-seconds from now. |
no-transform | The entity body must not be transformed. |
only-if-cached | Return a response only if there is one in the cache. Do not validate or GET a response if no cache entry exists. |
Response directive | Description |
---|---|
public | This can be cached by any cache. |
private | This can be cached only by a private cache. |
no-cache | The cached response must not be used on subsequent requests without first validating it. |
no-store | Do not store this response in a cache. |
no-transform | The entity body must not be transformed. |
must-revalidate | If the cached response is stale it must be validated before it is returned in any response. Overrides max-stale. |
max-age=delta-seconds | A cache may return this response without validating it until it is delta-seconds old. |
s-maxage=delta-seconds | Just like max-age but it applies only to shared caches. |
proxy-revalidate | Like must-revalidate, but only for proxies. |
Let's look at some Cache-Control: header examples.
- Cache-Control: private, max-age=3600
  If sent by a server, this Cache-Control: header states that the response can only be cached in a private cache for one hour.
- Cache-Control: public, must-revalidate, max-age=7200
  The included response can be cached by a public cache and can be cached for two hours; after that the cache must revalidate the entry before returning it to a subsequent request.
- Cache-Control: must-revalidate, max-age=0
  This forces the client to revalidate every request, since a max-age=0 forces the cached entry to be instantly stale. See Mark Nottingham's Leveraging the Web: Caching for a nice example of how this can be applied.
- Cache-Control: no-cache
  This is pretty close to must-revalidate, max-age=0, except that a client could use a max-stale directive on a request and get a stale response. The must-revalidate directive will override the max-stale directive. I told you that giving both client and server some control would make things a bit complicated.
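Before a cache can act on any of these headers it has to split the header value into directives. Here is a minimal sketch of that step; parse_cache_control is a name made up for this example, and it assumes no directive carries a quoted value containing a comma.

```python
def parse_cache_control(value):
    """Split a Cache-Control header value into a dict of directives.

    Directives without an argument map to True; arguments stay strings.
    """
    directives = {}
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        name, separator, argument = part.partition("=")
        name = name.strip().lower()
        directives[name] = argument.strip().strip('"') if separator else True
    return directives


for example in (
    "private, max-age=3600",
    "public, must-revalidate, max-age=7200",
    "must-revalidate, max-age=0",
    "no-cache",
):
    print(example, "->", parse_cache_control(example))
```

Directive names are lowercased because they are case-insensitive, which keeps later lookups such as "max-age" in directives simple.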
So far all of the Cache-Control: header examples we have looked at are on the response side, but directives can be added to requests too.
- Cache-Control: no-cache
  This forces an "end-to-end reload," where the client forces the cache to reload its entry from the origin server.
- Cache-Control: min-fresh=200
  Here the client asserts that it wants a response that will be fresh for at least 200 seconds.
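To see how the request directives interact with a cached entry's age, here is a rough sketch of the decision a cache might make. It is a simplification built on assumptions: the function cache_can_answer and its parameters are invented for this example, ages and lifetimes are plain numbers of seconds, and max-stale is assumed to always carry a value.

```python
def cache_can_answer(age, lifetime, request_cc, must_revalidate=False):
    """Decide whether a cached entry may be returned without validation.

    age and lifetime are in seconds; request_cc maps request directive
    names to their values, for example {"min-fresh": 200}.
    """
    if "no-cache" in request_cc:
        return False  # end-to-end reload: never answer from the cache
    if "max-age" in request_cc:
        # The client may ask for something younger than the server allows.
        lifetime = min(lifetime, int(request_cc["max-age"]))
    remaining = lifetime - age  # zero or negative once the entry is stale
    if "min-fresh" in request_cc:
        return remaining >= int(request_cc["min-fresh"])
    if remaining > 0:
        return True
    if "max-stale" in request_cc and not must_revalidate:
        # The client tolerates some staleness, unless the server forbade it.
        return -remaining <= int(request_cc["max-stale"])
    return False


# An entry with 100 seconds of freshness left fails min-fresh=200.
print(cache_can_answer(7100, 7200, {"min-fresh": 200}))
```

Note how must_revalidate, which comes from the server's response, overrides the client's max-stale, which is exactly the tug-of-war described above.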
Vary
You may be wondering about situations where a cache might get confused. For example, what if a server does content negotiation, where different representations can be returned from the same URI? For cases like this HTTP supplies the Vary: header. The Vary: header informs the cache of the names of all the request headers that might cause a resource's representation to change.
For example, if a server did do content negotiation then the Content-Type: header would be different for the different types of responses, depending on the type of content negotiated. In that case the server can add a Vary: Accept header, which causes the cache to consider the Accept: header when caching responses from that URI.

    Date: Mon, 23 Jan 2006 15:37:34 GMT
    Server: Apache
    Accept-Ranges: bytes
    Vary: Accept-Encoding,User-Agent
    Content-Encoding: gzip
    Cache-Control: max-age=7200
    Expires: Mon, 23 Jan 2006 17:37:34 GMT
    Content-Length: 5073
    Content-Type: text/html; charset=utf-8

In this example the server is stating that responses can be cached for two hours, but that responses may vary based on the Accept-Encoding and User-Agent headers.
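One way to picture what Vary: asks of a cache is to fold the named request headers into the cache key. The sketch below is an assumption about how a cache could do this, not a description of any particular implementation; the function name cache_key is invented, and the special value Vary: * is not handled.

```python
def cache_key(uri, request_headers, vary):
    """Build a cache key from the URI plus the request headers Vary names."""
    names = sorted(n.strip().lower() for n in vary.split(",") if n.strip())
    lowered = {k.lower(): v for k, v in request_headers.items()}
    selected = tuple((name, lowered.get(name, "")) for name in names)
    return (uri, selected)


vary = "Accept-Encoding,User-Agent"
gzip_key = cache_key(
    "http://example.org/",
    {"Accept-Encoding": "gzip", "User-Agent": "demo/1.0"},
    vary,
)
plain_key = cache_key(
    "http://example.org/",
    {"Accept-Encoding": "identity", "User-Agent": "demo/1.0"},
    vary,
)
print(gzip_key != plain_key)
```

Two requests for the same URI that differ in a header named by Vary: get separate entries, so a client that cannot handle gzip is never handed the gzipped copy.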
Connection
When a server successfully validates a cached response, using for example the If-None-Match: header, then the server returns a status code of 304 Not Modified. So nothing much happens on a 304 Not Modified response, right? Well, not exactly. In fact, the server can send updated headers for the entity that have to be updated in the cache. The server can also send along a Connection: header that says which headers shouldn't be updated.
Some headers are by default excluded from the list of headers to update. These are called hop-by-hop headers and they are: Connection, Keep-Alive, Proxy-Authenticate, Proxy-Authorization, TE, Trailers, Transfer-Encoding, and Upgrade. All other headers are considered end-to-end headers.

    HTTP/1.1 304 Not Modified
    Content-Length: 647
    Server: Apache
    Connection: close
    Date: Mon, 23 Jan 2006 16:10:52 GMT
    Content-Type: text/html; charset=iso-8859-1
    ...

In the above example Date: is not a hop-by-hop header nor is it listed in the Connection: header, so the cache has to update the value of Date: in the cache.
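The update rule can be sketched as a merge of the 304 response's headers into the stored ones. This is a minimal illustration under assumptions: merge_304_headers is a name invented here, header names are normalized to lowercase, and the hop-by-hop list is the one given above.

```python
HOP_BY_HOP = {
    "connection", "keep-alive", "proxy-authenticate", "proxy-authorization",
    "te", "trailers", "transfer-encoding", "upgrade",
}


def merge_304_headers(cached_headers, not_modified_headers):
    """Return the cached headers updated with end-to-end headers from a 304."""
    fresh = {k.lower(): v for k, v in not_modified_headers.items()}
    excluded = set(HOP_BY_HOP)
    # Anything named in Connection: is hop-by-hop for this message as well.
    excluded.update(
        token.strip().lower()
        for token in fresh.get("connection", "").split(",")
        if token.strip()
    )
    merged = {k.lower(): v for k, v in cached_headers.items()}
    for name, value in fresh.items():
        if name not in excluded:
            merged[name] = value
    return merged


cached = {"Date": "Mon, 23 Jan 2006 15:37:34 GMT", "ETag": '"abc"'}
update = {"Date": "Mon, 23 Jan 2006 16:10:52 GMT", "Connection": "close"}
print(merge_304_headers(cached, update))
```

After the merge the entry carries the newer Date: value, which in turn resets the age used for freshness calculations.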
If Only It Were That Easy
While a little complex, the above is at least conceptually nice. Of course, one of the problems is that we have to be able to work with HTTP 1.0 servers and caches. These use a different set of headers, all time-based, to do caching, and out of necessity those headers are carried forward into HTTP 1.1.
The older cache control model from HTTP 1.0 is based solely on time. The Last-Modified cache validator is just that, the last time that the resource was modified. The cache uses the Date:, Expires:, Last-Modified:, and If-Modified-Since: headers to detect changes in a resource.
If you are developing a client you should always use both validators if present; you never know when an HTTP 1.0 cache will pop up between you and a server. This is the protocol equivalent of wearing a belt and suspenders. HTTP 1.1 was published seven years ago, so you'd think that by now most things would have been updated, but you can't count on it.
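In client code, the belt-and-suspenders approach amounts to echoing back every validator the cached response carried. A minimal sketch, assuming the cached headers are stored in a dict with lowercase names; the function name conditional_headers is made up for this example.

```python
def conditional_headers(cached_headers):
    """Build the request headers that revalidate a cached response.

    Sends both validators when the cached response supplied both.
    """
    headers = {}
    if "etag" in cached_headers:
        headers["If-None-Match"] = cached_headers["etag"]
    if "last-modified" in cached_headers:
        headers["If-Modified-Since"] = cached_headers["last-modified"]
    return headers


cached = {
    "etag": '"11c415a-8206-243aea40"',
    "last-modified": "Thu, 29 Dec 2005 09:00:00 GMT",  # made-up value
}
print(conditional_headers(cached))
```

An HTTP 1.1 server can act on If-None-Match:, while an HTTP 1.0 cache along the way still has If-Modified-Since: to work with.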
Now that you understand caching you may be wondering if the client library in your favorite language even supports caching. I know the answer for Python, and sadly that answer is currently no. It pains me that my favorite language doesn't have one of the best HTTP client implementations around. That needs to change.
Introducing httplib2
Introducing httplib2, a comprehensive Python HTTP client library that supports a local private cache that understands all the caching operations we just talked about. In addition it supports many features left out of other HTTP libraries.
- HTTP and HTTPS
  HTTPS support is available only if the socket module was compiled with SSL support.
- Keep-Alive
  Supports HTTP 1.1 Keep-Alive, keeping the socket open and performing multiple requests over the same connection if possible.
- Authentication
  Three types of HTTP Authentication are supported. These can be used over both HTTP and HTTPS.
- Caching
  The module can optionally operate with a private cache that understands the Cache-Control: header and uses both the ETag and Last-Modified cache validators.
- All Methods
  The module can handle any HTTP request method, not just GET and POST.
- Redirects
  Automatically follows 3XX redirects on GETs.
- Compression
  Handles both compress and gzip types of compression.
- Lost Update Support
  Automatically adds back ETags into PUT requests to resources we have already cached. This implements Section 3.2 of Detecting the Lost Update Problem Using Unreserved Checkout.
- Unit Tested
  A large and growing set of unit tests.
See the httplib2 project page for more details.
Next Time
Next time I will cover HTTP authentication, redirects, keep-alive, and compression in HTTP and how httplib2 handles them. You might also be wondering how the "big guys" handle caching. That will take a whole other article to cover.