httplib2: HTTP Persistence and Authentication
March 29, 2006
Last time we covered HTTP caching and how it can improve the performance of your web service. This time we'll cover some other aspects of HTTP that, if fully utilized, can also speed up your web service.
Persistent Connections
Persistent connections are critical to performance. In early versions of HTTP, connections from the client to the server were built up and torn down for every request. That's a lot of overhead on the client, on the server, and on any intermediaries. The persistent connection approach, that is, keeping the same socket connection open for multiple requests, is the default behavior in HTTP 1.1.
Now if all HTTP 1.1 connections are considered persistent, then there must be a mechanism to signal that the connection is to be closed, right? That is handled by the Connection: header.
Connection: close
The header signals that the connection is to be closed after the current request-response is finished. Note that either the client or the server can send such a header.
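As a minimal sketch of that rule, here is how a client or server might decide whether to keep a socket open after a message. The function name, the version tuple, and the lowercased header dict are assumptions made for this illustration, not part of any library; the HTTP 1.0 branch reflects the older opt-in Keep-Alive convention.

```python
def connection_persists(http_version, headers):
    """Return True if the connection should stay open after this message.

    http_version is a tuple such as (1, 1); headers is a dict keyed by
    lowercased header names. Both conventions are assumptions of this sketch.
    """
    tokens = {t.strip().lower() for t in headers.get("connection", "").split(",")}
    if "close" in tokens:
        return False  # either side asked for the connection to be closed
    if http_version >= (1, 1):
        return True  # persistent by default in HTTP 1.1
    return "keep-alive" in tokens  # HTTP 1.0 needs an explicit opt-in


print(connection_persists((1, 1), {}))                       # True
print(connection_persists((1, 1), {"connection": "close"}))  # False
print(connection_persists((1, 0), {}))                       # False
```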
If you allow persistent connections, then the next obvious optimization is pipelining: stuffing a bunch of requests down a socket without waiting for the response from the first request to be returned before sending subsequent requests. Now this only works for certain types of requests; at a minimum, those requests have to be idempotent. Now aren't you glad you made all your GETs idempotent when you designed your RESTful web service?
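To make the idempotency constraint concrete, the sketch below groups a queue of requests into batches that a client could consider pipelining. The function name and the (method, uri) tuples are hypothetical; the method set is the one RFC 2616 describes as idempotent. It only plans the batches and does no socket I/O.

```python
# Methods that RFC 2616 (section 9.1.2) describes as idempotent.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"}


def split_for_pipelining(requests):
    """Group (method, uri) pairs into batches that could each be pipelined.

    Runs of idempotent requests share a batch. Any other request gets a
    batch of its own, so the client waits for its response before sending more.
    """
    batches = []
    current = []
    for method, uri in requests:
        if method.upper() in IDEMPOTENT_METHODS:
            current.append((method, uri))
        else:
            if current:
                batches.append(current)
                current = []
            batches.append([(method, uri)])
    if current:
        batches.append(current)
    return batches


queue = [("GET", "/a"), ("GET", "/b"), ("POST", "/c"), ("GET", "/d")]
for batch in split_for_pipelining(queue):
    print(batch)
```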
Compression
So now we're saving time and bandwidth by using caching to avoid retrieving content if it hasn't changed, and using persistent connections to avoid the overhead of tearing down and rebuilding sockets. If you have an entity to transfer, then you can still speed things up by transferring fewer bytes over the wire--that is, by using compression.
Though RFC 2616 specifies three types of compression, the values are actually tracked in an IANA registry and could in theory be supplemented by others. But it's been nine years since HTTP 1.1 was released and the registry hasn't been added to yet. Even at that, with three types of compression specified, only two, gzip and compress, are regularly seen in the wild.
The way compression normally works is that the client announces the types of compression it can handle by using the Accept-Encoding: request header:
Accept-Encoding: gzip;q=1.0, identity; q=0.5, *;q=0
Those are weighted parameters with resolution rules similar to the mime-types in Accept: headers. I covered parsing and interpreting those in Just Use Media Types?, which you should read if you missed it the first time.
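Here is a rough sketch of how a server might read those weights and pick a coding. Both function names are made up for this example, and the rules are simplified: an unlisted coding falls back to the `*` weight if one is present, identity is treated as acceptable when nothing says otherwise, and ties go to whichever coding the server lists first.

```python
def parse_accept_encoding(header):
    """Return {coding: qvalue} for an Accept-Encoding header value."""
    weights = {}
    for item in header.split(","):
        item = item.strip()
        if not item:
            continue
        coding, _, params = item.partition(";")
        q = 1.0  # a coding without an explicit q parameter gets 1.0
        for param in params.split(";"):
            name, _, value = param.partition("=")
            if name.strip().lower() == "q":
                q = float(value.strip())
        weights[coding.strip().lower()] = q
    return weights


def choose_encoding(header, supported):
    """Pick the supported coding the client prefers, or None if none is acceptable."""
    weights = parse_accept_encoding(header)

    def weight(coding):
        if coding in weights:
            return weights[coding]
        if "*" in weights:
            return weights["*"]
        return 1.0 if coding == "identity" else 0.0

    if not supported:
        return None
    best = max(supported, key=weight)  # first entry wins on ties
    return best if weight(best) > 0 else None


header = "gzip;q=1.0, identity; q=0.5, *;q=0"
print(parse_accept_encoding(header))
print(choose_encoding(header, ["gzip", "identity"]))      # gzip
print(choose_encoding(header, ["compress", "identity"]))  # identity
```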
If the server supports any of the listed compression types, it can compress the response entity and announce how a response was compressed so that the client can decompress it correctly. That information is carried by the Content-Encoding: header.
Content-Encoding: gzip
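On the client side, handling that header can be as small as the sketch below. It uses only Python's standard gzip module and covers only the gzip case; the function name and the lowercased header dict are assumptions for the example, and this is not how httplib2 itself is written.

```python
import gzip


def decode_body(headers, body):
    """Undo the Content-Encoding applied to a response body (gzip only)."""
    encoding = headers.get("content-encoding", "identity").strip().lower()
    if encoding in ("gzip", "x-gzip"):
        return gzip.decompress(body)
    if encoding == "identity":
        return body  # nothing to undo
    raise ValueError("unsupported Content-Encoding: %s" % encoding)


original = b"<entry>hello</entry>" * 50
compressed = gzip.compress(original)
print(len(original), len(compressed))
print(decode_body({"content-encoding": "gzip"}, compressed) == original)  # True
```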
In the process of implementing httplib2 I also discovered some rough spots in HTTP implementations.
Authentication
In the past, people have asked me how to protect their web services and I've told them to just use HTTP authentication, by which I meant either Basic or Digest as defined in RFC 2617.
For most authentication requirements, using Basic alone isn't really an option since it transmits your name and password unencrypted. Yes, it encodes them as base64, but that's not encryption.
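To see how little protection base64 gives, here is a short sketch that builds a Basic credential and then reverses it. The helper name is invented for the example; the username and password are the sample pair from RFC 2617.

```python
import base64


def basic_authorization(username, password):
    """Build the value of a Basic Authorization header."""
    token = base64.b64encode(("%s:%s" % (username, password)).encode("utf-8"))
    return "Basic " + token.decode("ascii")


header = basic_authorization("Aladdin", "open sesame")
print(header)  # Basic QWxhZGRpbjpvcGVuIHNlc2FtZQ==

# Anyone who can see the header can recover the password just as easily.
print(base64.b64decode(header.split(" ", 1)[1]))  # b'Aladdin:open sesame'
```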
The other option is Digest, which tries to protect your password by not transferring it directly, but uses challenges and hashes to let the client prove to the server that it knows a shared secret.
Here's the "executive summary" of HTTP Digest authentication:
- The server rejects an unauthenticated request with a challenge. That challenge contains a nonce, a random string generated by the server.
- The client responds with the same request again, but this time with an Authorization: header that contains a hash of the supplied nonce, the username, the password, the request URI, and the HTTP method.
The problem with Digest is that it suffers from too many options, which are implemented non-uniformly, and not always correctly. For example, there is an option to include the entity body in the calculation of the hash, called auth-int. There are also two different kinds of hashing, MD5 and MD5-sess. The server can return a fresh challenge nonce with every response, or the client can include a monotonically increasing nonce-count value with each request. The server also has the option of returning a digest of its own, which is a way the server can prove to the client that it also knows the shared secret.
With all those options it doesn't seem surprising that there are interop problems. For example, Apache 2.0 does not do auth-int in Digest. While Python's urllib2 claims to do MD5-sess, Apache does not implement it correctly. In addition, looking at the code of Python's urllib2, it appears to support the SHA hash in addition to the standard MD5 hash. The only problem is that there's no mention of SHA as an option in RFC 2617. And, of course, no mention of Digest is complete without mentioning Internet Explorer, which doesn't calculate the digest correctly for URIs that have query parameters.
Now in case it seems like we're trapped in a twisted Monty Python sketch, there are some bright spots: on Apache 2.0.51 or later you can get IE and Digest to work by using this directive:
BrowserMatch "MSIE" AuthDigestEnableQueryStringHack=On
OK, you know you're in trouble when a directive called AuthDigestEnableQueryStringHack is the bright spot.
Oh yeah, one last twist in implementing both Basic and Digest is that you should keep track of the URIs that you have authenticated because if you attempt to access a URI "below" an authenticated URI, then you can send authentication on the first request and not wait for a challenge. By "below," I mean based on the URI path. Also, be prepared because the authentication at a lower level in path depth may require a different set of credentials or use a different authentication scheme.
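One way to do that bookkeeping is sketched below: remember the path prefix each successful authentication covered, and on a new request use the deepest matching prefix. The class and method names are hypothetical, and a real client would also key on host and realm, which this sketch leaves out.

```python
class CredentialCache:
    """Remember which credentials worked for which part of the URI path."""

    def __init__(self):
        self._entries = []  # list of (path_prefix, scheme, credentials)

    def remember(self, path, scheme, credentials):
        # Treat everything in the same "directory" as the URI that
        # succeeded, and anything below it, as covered.
        prefix = path.rsplit("/", 1)[0] + "/"
        self._entries.append((prefix, scheme, credentials))

    def lookup(self, path):
        """Return (scheme, credentials) for the deepest matching prefix, or None."""
        matches = [e for e in self._entries if path.startswith(e[0])]
        if not matches:
            return None
        return max(matches, key=lambda e: len(e[0]))[1:]


cache = CredentialCache()
cache.remember("/docs/index.html", "Basic", ("joe", "pw1"))
cache.remember("/docs/private/report", "Digest", ("joe", "pw2"))

print(cache.lookup("/docs/faq"))              # ('Basic', ('joe', 'pw1'))
print(cache.lookup("/docs/private/2006/q1"))  # ('Digest', ('joe', 'pw2'))
print(cache.lookup("/other"))                 # None
```

Choosing the longest prefix is what lets a deeper path carry different credentials or a different scheme than its parent.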
If you move outside of RFC 2617 you could use WSSE, but it isn't really specified for plain HTTP and it doesn't work in any known browsers. It was originally designed for WS-Security and unofficially ported to work in HTTP headers rather than in a SOAP envelope. The definitive reference is an XML.com article, and while XML.com is an august publication, it isn't the IETF or W3C.
Now you might think I could use TLS (HTTPS), which is what lots of web apps and services use in conjunction with HTTP Basic. But you should realize that I, like many other people, use a shared hosting account; even if I wanted to shell out the money to buy a certificate, I wouldn't be able to set up TLS for my site. A certificate names a domain, but the server has to present it before it ever sees which host the client is asking for, so in practice each certificate needs its own IP address, and a shared host doesn't give me one. This is really too bad since client-side support for TLS (HTTPS) seems pretty good.
The bad news is that the current state of security with HTTP is bad. The best interoperable solution is Basic over HTTPS. The good news is that everyone agrees the situation stinks and there are multiple efforts afoot to fix the problem. Just be warned that security is not a one-size-fits-all game and that the result of all this heat and smoke may be several new authentication schemes, each targeted at a different user community.
For further reading you may want to check out this W3C note from 1999 (!), User Agent Authentication Forms. In addition the WHATWG's Web Applications 1.0 specification lists as a requirement "Better defined user authentication state handling. (Being able to 'log out' of sites reliably, for instance, or being able to integrate the HTTP authentication model into the Web page.)"
Redirects
As I implemented 3xx redirects I came across a couple of things that were new to me, some of which could provide performance boosts. Now, in general, the 3xx series of HTTP status codes are either for redirecting the client to a new location or for indicating that more work needs to be done by the client.
One of the things I learned is that 300, 301, 302, and 307 are all cacheable in some circumstances, either by default or in the presence of cache control headers. That means that if your client implements caching, it may avoid one or more round trips if it is able to cache those 3xx responses.
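As a rough sketch of that distinction, the function below follows my reading of RFC 2616: 300 and 301 are cacheable unless told otherwise, while 302 and 307 need an Expires header or an explicit freshness directive. The function name is invented, the header dict is assumed to use lowercased names, and the Cache-Control handling is deliberately minimal.

```python
def redirect_is_cacheable(status, headers):
    """Decide whether a 3xx response may be stored by a client cache."""
    cache_control = headers.get("cache-control", "").lower()
    directives = {d.strip().split("=", 1)[0]
                  for d in cache_control.split(",") if d.strip()}
    if "no-store" in directives:
        return False
    if status in (300, 301):
        return True  # cacheable by default
    if status in (302, 307):
        # cacheable only when the server says how long the response is good for
        return "expires" in headers or "max-age" in directives
    return False


print(redirect_is_cacheable(301, {}))                                 # True
print(redirect_is_cacheable(302, {}))                                 # False
print(redirect_is_cacheable(302, {"cache-control": "max-age=3600"}))  # True
```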
httplib2
At the end of my last article I introduced httplib2, a Python client library that implemented all the caching covered in that article. So for those of you keeping track at home, httplib2 also handles many of the things here, such as HTTPS, Keep-Alive, Basic, Digest, WSSE, and both gzip and compress forms of compression. That's enough of libraries and specs for now; next article, we'll get back to writing code and putting all this infrastructure to work.