Introducing OpenSearch
July 24, 2007
Search and web feeds go together pretty naturally, as anyone who has set up some kind
of
vanity search feed knows. You go to your favorite Web 2.0 search engine and set up
a query
like http://web20.example.com/search=john+doe&ouptut=atom
and search for
"john doe," but rather than getting back results as the usual HTML web page, you get
it back
in Atom format. You can subscribe to this URL in your favorite feed reader, and you
have all
the useful features of web feeds attached to this search query. Most notably, rather
than
having to poll the search engine yourself and having to remember which results you
have
seen, your reader will simply alert you when there are new results. This simple but
very
useful concept is the core idea behind the OpenSearch specification.
OpenSearch was originally developed at Amazon.com's A9 incubator. It's a specification under the Creative Commons Attribution-ShareAlike License, covering discovery and description documents for search engines, expression of queries, and the convention of RSS 2.0 or Atom Web feed results. It is very RESTful in nature and complementary to the Atom Publishing Protocol (APP). In fact, many have called for OpenSearch to serve as the query aspect of APP, which provides a way to access identified or located results, but no mechanism for ad hoc query. With all this affinity to Atom and REST, OpenSearch is a natural topic for this Agile Web column. OpenSearch 1.0 is still the latest full version; it has been around since 2005. Version 1.1 is in beta, but has some important improvements and is thus the version I'll be discussing.
Finding a Suitable Search Engine
Once you've found a search engine, the first issue is learning more about it and, in particular, how to query it. The OpenSearch description document format is designed to provide this information. Listing 1 is a simple example describing a fictitious search engine for XML.com.
Listing 1: OpenSearch description document
<?xml version="1.0" encoding="UTF-8"?> <OpenSearchDescription xmlns="http://a9.com/-/spec/opensearch/1.1/" xmlns:dc="http://purl.org/dc/elements/1.1/"> <ShortName>XML.com search</ShortName> <dc:relation href="http://www.xml.com"/> <Description>Search XML.com articles and Weblogs</Description> <Tags>xml web</Tags> <Contact>admin@xml.com</Contact> <Url type="application/atom+xml" template="http://search.xml.com/?q={searchTerms}&p={startPage?}&format=atom"/> <Query role="example" searchTerms="test+xml"/> <Attribution>All content Copyright 2007, O'Reilly and Associates</Attribution> </OpenSearchDescription>
It's pretty straightforward stuff for the most part. Elements such as
ShortName
and Description
provide basic information for people
who are browsing search engine information. Tags
and Attribution
offer additional details that are useful when narrowing down the choice to use the
search
engine. Url
is an interesting element. It tells the search client how to query
the search engine in terms of what URL forms can be used for searching. In this way
it
connects to another important section of the OpenSearch specification, URL template
syntax,
which I'll discuss in a later section. Query
is another special element that,
in this case, tells search clients that they can test the search engine (this test
purpose
indicated by role="example"
) by querying with the search terms "test+xml."
Query
elements are more broadly used in OpenSearch results, as I'll discuss
in a later section.
Foreign Markup
Listing 1 also demonstrates how you can extend OpenSearch description syntax using
the
common mechanism of adding foreign elements in a separate namespace. In this case,
there is
a Dublin Core metadata element dc:relation
to express a simple relationship
between search.xml.com and www.xml.com. It's interesting that, besides
Url
and Query
, all the elements in Listing 1 could be expressed
in equivalent Atom syntax. Even the foreign dc:relation
is similar to
atom:link
, and the latter provides a bit more expressiveness (though you can
even things up a bit by using Dublin Core qualifiers). Listing 2 is an example of
the search
engine description like in Listing 1, but converted to Atom syntax; it is purely the
envelope with no entries, which is perfectly legal in Atom.
Listing 2: Atom document with the equivalent information to the OpenSearch description document in Listing 1
<?xml version="1.0" encoding="UTF-8"?> <feed xmlns="http://www.w3.org/2005/Atom" xmlns:os="http://a9.com/-/spec/opensearch/1.1/"> <id>http://search.xml.com</id> <link rel="self" href="http://search.xml.com"/> <link type="text/html" href="http://www.xml.com"/> <updated>2007-07-07T12:00:00Z</updated> <title>XML.com search</title> <subtitle>Search XML.com articles and Weblogs</subtitle> <author> <name>XML.com</name> <email>admin@xml.com</email> </author> <rights>All content Copyright 2007, O'Reilly and Associates</rights> <category term="xml"/> <category term="web"/> <os:Url type="application/atom+xml" template="http://search.xml.com/?q={searchTerms}&p={startPage?}&format=atom"/> <os:Query role="example" searchTerms="test+xml"/> </feed>
There is no need for Dublin Core, in this case, given atom:link
. But rather
than abuse that element, Url
is pulled in from the OpenSearch namespace to
express the search URL template. The purpose of this example is not to disparage
OpenSearch's choice in rolling its own format. I do believe that it's useful to reuse
formats where possible, but I also think that it's important not to push reuse until
you're
stretching a format to an alien purpose. One could make an argument that Listing 2
stretches
the purpose of Atom syntax too far.
URL Templates
Take another look at the Url
element in Listing 1, which serves as the
mechanism for telling the search client how to query the search engine. The
template
attribute looks like a URL, but the parts within curly braces are
parameters the client provides to specialize the search. There are half a dozen parameter
names like searchTerms
with a purpose established within the OpenSearch spec.
The searchTerms
parameter is a placeholder for the search criteria (e.g.,
john+doe); startPage
allows the client to specify a page within the result set.
More on result pages in a later section. You can use a parameter in the form of an
XML QName
for a foreign namespace to cover meanings not provided by the standard parameters.
Notice
the question mark after the startPage
parameter in Listing 1. This means the
search client is free to not provide a value for this parameter (i.e., to substitute
it with
an empty string). The search client must provide a non-empty value for
searchTerms
because it does not have the question mark.
Result Format
Once again, a central idea of OpenSearch is that search results come as web feeds. The supported formats are RSS 2.0 and Atom 1.0. In this article, and in my personal recommendation, I stick to the latter. Each search result corresponds to an Atom entry, using the usual semantics for entry syntax. There is, however, some interesting specialization at the feed element level. Listing 3 is a sample OpenSearch query result with a single result item.
Listing 3: OpenSearch Atom search result
<?xml version="1.0" encoding="UTF-8"?> <feed xmlns="http://www.w3.org/2005/Atom" xmlns:os="http://a9.com/-/spec/opensearch/1.1/"> <id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id> <title>XML.com search: atom store python</title> <link rel="self" type="application/atom+xml" href="http://search.xml.com/?q=atom+store+python&p=&format=atom"/> <link rel="alternate" type="text/html" href="http://search.xml.com/?q=atom+store+python&p="/> <link rel="search" type="application/opensearchdescription+xml" href="http://example.com/opensearchdescription.xml"/> <os:totalResults>1</os:totalResults> <os:startIndex>1</os:startIndex> <os:itemsPerPage>10</os:itemsPerPage> <os:Query role="request" searchTerms="atom store python" startPage=""/> <updated>2007-07-07T12:00:00Z</updated> <author> <name>XML.com</name> <email>admin@xml.com</email> </author> <rights>All content Copyright 2007, O'Reilly and Associates</rights> <entry> <title>XML.com: Implementing the Atom Publishing Protocol</title> <!-- Note: following URL modified for article formatting reasons --> <link href="http://www.xml.com/pub/a/2006/07/19/implementing-app-python-wsgi.html"/> <id>http://www.xml.com/2006/07/19/implementing-app-python-wsgi</id> <updated>2006-07-19T15:00:00Z</updated> <content type="text"> Joe Gregorio's latest Restful Web column implements the Atom Publishing Protocol as a Python web service using WSGI. </content> <author> <name>Joe Gregorio</name> </author> </entry> </feed>
The links with rel="self"
and rel="alternate"
follow the usual
Atom semantics. The rel="search"
link is a convention added by OpenSearch for
feed auto-discovery. When accessing this URL you should get a search engine description
document like Listing 1. Notice the application/opensearchdescription+xml
media
type, which OpenSearch proposes for description documents. You can also use special
link
types for paging search results. If a search would produce thousands of results, neither
the
client nor the service provider is likely to want to pile them all into a single result
feed
document, especially considering that most search engines provide hits of most likely
interest in early pages. The Feed Paging and Archiving (aka Feed History) extension
to Atom
provides a simple mechanism for breaking down large virtual feeds into pages or sections
in
such cases. It's currently an IETF Internet Draft, but probably will be adopted as
a
standard soon. OpenSearch adopts its conventions for paging search results. An OpenSearch
response might be one of a series of feeds, each of which represents a subset of the
total
results, including links with types such as first
, last
,
prev
, and next
to inform the search client how to navigate
through the results. Listing 3 also shows elements, totalResults
,
startIndex
, and itemsPerPage
in the OpenSearch namespace that
provide additional contextual metadata for search results. The common URL parameter
startPage
allows the search client to jump to a specific result page, and the
count
parameter controls the number of result items per page.
OpenSearch provides a few conventions to work with HTML web pages, including meta
tags for
auto-discovery of description documents and the totalResults
,
startIndex
, and itemsPerPage
information about search
results.
Searching, Agile Web Style
OpenSearch really just provides the framework of a query mechanism to complement the
Atom
Protocol. It defines enough semantics to tell you how to express simple full-text
searches.
You can extend it for more specialized query by adding your own extension parameters
in URL
templates. For example, you might want to specify a parameter to limit searching to
a
specific element in Atom feeds using a template like
http://search.xml.com/?q={searchTerms}&f={x:restrictField?}
(you'd have
to define a namespace for the "x" prefix). This would allow the search client to search
for
"xslt" within summaries by specifying
http://search.xml.com/?q=xslt&f=atom:summary
. By keeping it simple,
OpenSearch complements other related technologies very well, and adheres to solid
Agile Web
principles. There is a long and growing list of OpenSearch tools and search engines,
so
there is a good chance this specification will guide how we approach search and query
for
Web 2.0.