GovTrack.us, Public Data, and the Semantic Web
February 8, 2006
No matter where you fall in debates over free software or DRM, there's one type of information that is unarguably meant to be free, and that's information about our government. The more knowledge citizens have about government the better. So how can we use XML and the Semantic Web to make it easier to get that knowledge, and to foster civic participation?
This is a question I've spent a lot of time on over the past few years while putting together www.GovTrack.us, a site that gathers existing information on the web about the U.S. Congress and puts it all together in new ways, using RSS feeds and Google Maps, for instance. The site is possible because the government has been posting the relevant information online for a while, but in scattered locations. For instance, legislation is posted in one place and votes on the very same legislation in another. Gathering the information in one place and in a common format gives rise to new ways of mixing the information together.
Each day GovTrack screen-scrapes these sites to gather the new information. The information gets normalized and goes into XML files so that when GovTrack wants to display the status of a bill to a user, it can just run an XSLT stylesheet on the XML bill file.
There have been around 40,500 bills introduced in Congress since 1999 (the vast majority aren't ever seriously considered, which says a lot about the process). Here's part of the file for the bill passed Sept. 14, 2001, authorizing the President to use military force against terrorists:
<bill session="107" type="sj" number="23">
  <titles>
    <title type="popular">Military Force Authorization resolution</title>
    <title type="official">A joint resolution to authorize the use of United States Armed Forces against those responsible for the recent attacks launched against the United States.</title>
  </titles>
  <sponsor id="300031" />
  <actions>
    <vote date="1000440000" how="roll" roll="281" where="s" />
    <vote date="1000520280" how="without objection" where="h" />
    <enacted law="107-40" date="1000785600"/>
  </actions>
  <subjects>
    <term name="Defense policy"/>
    <term name="Air piracy"/>
    <term name="Armed forces"/>
    ...
  </subjects>
  <summary>
    Authorizes the President to use all necessary and appropriate force...
  </summary>
</bill>
All of that comes from the official source, except the official source doesn't provide the information in a structured way. GovTrack is responsible for parsing dates, turning names into IDs, picking out the list of actions, and so on. GovTrack also fetches voting records and other documents and puts them into XML. (By the way, if you want to play with the data, all of the XML files that power GovTrack are made available to be freely reused.)
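As a small illustration of what "parsing dates" buys a consumer of these files, here is a minimal Python sketch that reads the date attributes back out. It assumes the date values are Unix timestamps (seconds since 1970, UTC); that is an inference from the numbers in the sample, not something the file format states, and the helper name action_dates is made up for this example.

```python
from datetime import datetime, timezone
import xml.etree.ElementTree as ET

# Trimmed copy of the <actions> element from the sample bill file.
ACTIONS_XML = """
<actions>
  <vote date="1000440000" how="roll" roll="281" where="s" />
  <vote date="1000520280" how="without objection" where="h" />
  <enacted law="107-40" date="1000785600"/>
</actions>
"""

def action_dates(xml_text):
    """Return (tag, UTC datetime) pairs.

    Assumption: the `date` attribute holds Unix seconds.
    """
    root = ET.fromstring(xml_text)
    return [
        (action.tag,
         datetime.fromtimestamp(int(action.get("date")), tz=timezone.utc))
        for action in root
    ]

for tag, when in action_dates(ACTIONS_XML):
    print(tag, when.isoformat())
```

Under that assumption the first vote decodes to 04:00 UTC on September 14, 2001, which is midnight U.S. Eastern time on the date given for the bill's passage.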
XML has been good for the job, but when you put lots of XML files together, you don't immediately get something special out of it; code has to be written for each new use. That is how GovTrack came to have a way to browse bills by the subject terms assigned to them.
But I got an email the other day asking for legislation that falls into two categories, and I needed a way to write a simple query over the data, looking for bills that matched both subject terms. The simplest thing to do might have been to write a program that evaluates an XPath expression over each bill file:
count(bill/subjects/term[@name = "Medical care"]) > 0 and count(bill/subjects/term[@name = "Illegal aliens"]) > 0
Sure, that would have gotten the job done. But if I stuck with XPath for all of my querying needs, I'd be very limited in the types of queries I could run over the data. An XPath expression really can involve only one document, which is to say that the only type of question one can ask with XPath is whether or not a document matches an expression, and that match depends only on the document itself. (True, you can use the document() function to cross documents, but only if you can get the name of the file.)
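The one-file-at-a-time approach might look roughly like the Python sketch below. It uses the limited XPath subset in the standard library's xml.etree.ElementTree rather than a full XPath engine, and the two bill files, their file names, and the helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two made-up bill files, trimmed to the parts the query needs.
BILLS = {
    "h1.xml": """<bill session="109" type="h" number="1"><subjects>
        <term name="Medical care"/><term name="Illegal aliens"/>
    </subjects></bill>""",
    "h2.xml": """<bill session="109" type="h" number="2"><subjects>
        <term name="Medical care"/><term name="Armed forces"/>
    </subjects></bill>""",
}

def has_all_terms(bill, terms):
    """True if the <bill> element carries every one of the subject terms.

    The terms are spliced into the path, so this sketch only handles
    terms that contain no apostrophes.
    """
    return all(
        bill.find(f"subjects/term[@name='{term}']") is not None
        for term in terms
    )

def matching_bills(bills, terms):
    """Test each file independently, which is all a per-document query can do."""
    return [
        name for name, xml_text in bills.items()
        if has_all_terms(ET.fromstring(xml_text), terms)
    ]

print(matching_bills(BILLS, ["Medical care", "Illegal aliens"]))
```

Note that the loop over files lives outside the query: the expression itself never sees more than one bill.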
If tomorrow someone asks me for a list of bills that Bill Frist and John Kerry voted differently on, I'll be stuck. Each roll call vote file looks something like this:
<roll where="senate" year="2005" roll="00230">
  <question>On the Motion (Motion To Table)</question>
  <voter id="300001" vote="+"/>
  <voter id="300002" vote="+"/>
  <voter id="300003" vote="+"/>
  <voter id="400546" vote="-"/>
  ...
</roll>
The names of the senators aren't in the file, so the first step would be to look up the IDs of Frist and Kerry. Then, iterate through the bill XML files, open up the related vote file for each, and finally use an XPath (or even XQuery) expression to test the vote file to see if the votes differed.
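To make the amount of glue code concrete, here is a rough Python sketch of those steps. Everything in it is illustrative: the people file and its format, the IDs 900001 and 900002, the bill identifiers, and the idea of keying roll files by roll number are all assumptions, not GovTrack's actual layout.

```python
import xml.etree.ElementTree as ET

# Hypothetical lookup file mapping names to IDs (format and IDs are made up).
PEOPLE_XML = """<people>
  <person id="900001" name="Bill Frist"/>
  <person id="900002" name="John Kerry"/>
</people>"""

# Made-up roll call files, keyed by roll number.
ROLLS = {
    "00230": """<roll where="senate" year="2005" roll="00230">
        <voter id="900001" vote="+"/><voter id="900002" vote="-"/></roll>""",
    "00231": """<roll where="senate" year="2005" roll="00231">
        <voter id="900001" vote="+"/><voter id="900002" vote="+"/></roll>""",
}

# Made-up bill files; each roll call action names its roll file.
BILLS = {
    "s1": """<bill><actions>
        <vote how="roll" roll="00230" where="s"/></actions></bill>""",
    "s2": """<bill><actions>
        <vote how="roll" roll="00231" where="s"/></actions></bill>""",
    "s3": """<bill><actions>
        <vote how="without objection" where="s"/></actions></bill>""",
}

def person_id(people, name):
    """Step 1: turn a name into the ID used in the vote files."""
    for person in people.iter("person"):
        if person.get("name") == name:
            return person.get("id")
    raise KeyError(name)

def vote_of(roll, voter_id):
    """Return '+', '-', or None if the person did not vote in this roll."""
    voter = roll.find(f"voter[@id='{voter_id}']")
    return None if voter is None else voter.get("vote")

def bills_with_split_votes(name_a, name_b):
    """Steps 2 and 3: walk the bills, open each roll file, compare votes."""
    people = ET.fromstring(PEOPLE_XML)
    id_a, id_b = person_id(people, name_a), person_id(people, name_b)
    split = []
    for bill_id, bill_xml in BILLS.items():
        bill = ET.fromstring(bill_xml)
        for action in bill.findall("actions/vote[@how='roll']"):
            roll = ET.fromstring(ROLLS[action.get("roll")])
            a, b = vote_of(roll, id_a), vote_of(roll, id_b)
            if a is not None and b is not None and a != b:
                split.append(bill_id)
                break
    return split

print(bills_with_split_votes("Bill Frist", "John Kerry"))
```

Almost none of this code is about the question being asked; it is about knowing which attribute in one file corresponds to which attribute in another.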
Or what if someone wants to know whether the votes on a bill were correlated with the representative's age, amount of campaign contributions, or geographic location of his or her district? Better yet, what about the question of whether the board of directors of Disney made contributions to representatives introducing legislation about copyrights? GovTrack has all of this information (it's all public information downloadable from the Census, Federal Election Commission, and Securities and Exchange Commission). When we can ask these types of questions easily, things start to get much more interesting.
Here's an interesting question I can answer today with a simple query: what's the population of each senator's state? This is the query:
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX pol: <tag:govshare.info,2005:rdf/politico/>
PREFIX census: <tag:govshare.info,2005:rdf/census/>
SELECT ?name ?statename ?population
WHERE {
  ?person foaf:name ?name .
  ?person pol:hasRole [ pol:forOffice [ pol:represents ?state ] ] .
  ?state dc:title ?statename .
  ?state census:population ?population .
}
This is a SPARQL query that results in a table of the names of senators and the corresponding state and population. SPARQL is a new query language over information in RDF. Not to make a shameless plug, but I really recommend reading my own introduction to RDF. It goes beyond the old notion of RDF as an XML metadata format. RDF is more commonly thought of today as a general method for knowledge interchange. And for more about SPARQL itself, I recommend Leigh Dodds' SPARQL tutorial on XML.com, Introducing SPARQL: Querying the Semantic Web.
You can play around with queries over GovTrack's data here, but I don't want to talk about SPARQL in this article. I just wanted to show that the types of questions we can ask can easily grow in complexity and "interestingness" using RDF. No XPath or XQuery query is going to be nearly so concise for those questions.
Of course it's possible to do this with XML rather than RDF, and the difference is just in where the effort must be applied to get the data sources to link together. In XML, the burden is on the person with the query to figure out how the elements and attributes in one XML file relate to the elements and attributes in another. Glue code has to be written to mesh the data. With RDF, the burden is on the people with the data to ensure that their identifiers for things overlap with other data sources. The difficulty in RDF is more of a design decision, and design decisions are tough too. But one of RDF's advantages in meshing disparate data is that the hard work is done once by the people who know the data best, rather than repeated by each programmer who has a new query to make.
So in the rest of this article I'll go over some of the design of the RDF version of GovTrack's data (which you can also download to play around with).
Here's some biographical data for Senator Schumer from New York:
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix pol: <tag:govshare.info,2005:rdf/politico/> .
@prefix usgov: <tag:govshare.info,2005:rdf/usgovt/> .
@prefix people: <tag:govshare.info,2005:data/us/congress/people/> .

people:S000148 rdf:type pol:Politician ;
    foaf:name "Charles Schumer" ;
    foaf:gender "male" ;
    usgov:party "Democrat" .
This is RDF data in a format called Notation 3, which is a nice alternative to the XML format of RDF. Since RDF is really an abstract way to represent information, rather than a particular data format, we're always free to choose the serialization syntax that's easiest to read (or scribble or exchange) for the task and data at hand. (When in doubt about the syntax, run it through a validator to see what the underlying triples are.) This data was intended to mean that the entity identified by the URI tag:govshare.info,2005:data/us/congress/people/S000148 (put together by simply concatenating the prefix URI with the local name) is a politician, has the name given, is a Democrat, etc.
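The concatenation rule is simple enough to sketch in a few lines of Python. The prefix table copies three of the @prefix declarations from the sample; the function name expand is made up for this example.

```python
# Prefix declarations copied from the Notation 3 sample.
PREFIXES = {
    "people": "tag:govshare.info,2005:data/us/congress/people/",
    "pol": "tag:govshare.info,2005:rdf/politico/",
    "foaf": "http://xmlns.com/foaf/0.1/",
}

def expand(qname, prefixes=PREFIXES):
    """Expand a prefixed name such as 'foaf:name' into a full URI.

    The full URI is just the prefix's URI followed by the local name.
    """
    prefix, _, local = qname.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"undeclared prefix: {prefix}")
    return prefixes[prefix] + local

print(expand("people:S000148"))
print(expand("foaf:name"))
```

Because the expansion is plain string concatenation, the local name is case-sensitive: a different capitalization yields a different URI and therefore a different entity.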
In addition to the data above, there is RDF data about Schumer's role in Congress, including the state he represents. This is where some real modeling choices came in. There are a number of sensible ways to relate a politician to the region he or she represents. Here's one:
people:S000148 pol:represents "New York" .
This is very to-the-point. Schumer represents New York. It's accurate enough, but not particularly precise. The literal expression "New York" isn't very informative. New York State or New York City? We could get around this problem by stating in the pol: vocabulary that pol:represents only refers to states and not cities, and that would be a fine solution if that restriction were acceptable. But we can make a small change to make it better:
@prefix states: <tag:govshare.info,2005:data/us/> .
people:S000148 pol:represents states:ny .
Now it's very precise. Except, when a computer reads in the URI tag:govshare.info,2005:data/us/ny it has no idea what that means. So we have to list somewhere else:
@prefix dc: <http://purl.org/dc/elements/1.1/> .
states:ny rdf:type <tag:govshare.info,2005:rdf/usgovt/State> .
states:ny dc:isPartOf <tag:govshare.info,2005:data/us> .  # i.e., the United States
The computer may have no idea what tag:govshare.info,2005:rdf/usgovt/State means either, but at least it knows it's the same type of thing as the other states. Or the application writer can assign a special meaning to the URI tag:govshare.info,2005:rdf/usgovt/State.
Using a URI rather than a literal value also lets you, or others, contribute information about the entity. If I'm publishing information on Congress and someone else transforms some census data into this:
@prefix census: <tag:govshare.info,2005:rdf/census/> . states:ny census:population "18976457" .
then immediately one can start writing queries that bridge the two data sets.
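A toy Python sketch shows why the shared URI matters. Triples are modeled as plain tuples of prefixed names rather than with a real RDF library, and the function names are made up; the point is only that merging the two data sets is a set union, after which one query spans both.

```python
# Triples as (subject, predicate, object), using the article's prefixed names.
congress = {
    ("people:S000148", "foaf:name", "Charles Schumer"),
    ("people:S000148", "pol:represents", "states:ny"),
}

# Published separately by someone else, but using the same URI for the state.
census = {
    ("states:ny", "census:population", "18976457"),
}

def objects(graph, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in graph if s == subject and p == predicate]

def population_by_politician(graph):
    """Join politician -> state -> population across whatever is in the graph."""
    rows = []
    for person, predicate, state in graph:
        if predicate != "pol:represents":
            continue
        for name in objects(graph, person, "foaf:name"):
            for population in objects(graph, state, "census:population"):
                rows.append((name, int(population)))
    return sorted(rows)

merged = congress | census  # merging RDF data sets is just a union of triples
print(population_by_politician(merged))
```

Run against the congressional data alone, the same query returns nothing; no glue code is needed to connect the two sources, only agreement on the identifier states:ny.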
This is a fine way of representing the information. Beyond this point, the modeling choices become a real trade-off between simplicity and informativeness. There are two shortcomings with the representation of the pol:represents relation above. The first is that it misses the generalization that anyone who is a senator from New York represents New York. Or, rather, it's not an inherent property of Schumer that he represents New York, but rather it's in virtue of another property of his, which is holding the office of senator. So then we should revise the information like this:
@prefix senate: <tag:govshare.info,2005:data/us/congress/senate/> .
people:S000148 pol:holdsOffice senate:ny .
senate:ny pol:represents states:ny .
That's more informative, at the cost of being more complex to create and query.
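The extra cost on the query side can be seen in a small Python sketch, again modeling triples as tuples with made-up helper names. Finding whom a person represents now takes two hops, through the office, instead of one.

```python
# Triples as (subject, predicate, object), following the revised model.
GRAPH = {
    ("people:S000148", "pol:holdsOffice", "senate:ny"),
    ("senate:ny", "pol:represents", "states:ny"),
}

def objects(graph, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return {o for s, p, o in graph if s == subject and p == predicate}

def represented_by(graph, person):
    """Follow person -> office -> region: two hops instead of one."""
    regions = set()
    for office in objects(graph, person, "pol:holdsOffice"):
        regions |= objects(graph, office, "pol:represents")
    return regions

print(represented_by(GRAPH, "people:S000148"))
```

A query written for the earlier, direct form of pol:represents would find nothing here, which is the kind of breakage a modeling change imposes on existing consumers.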
The second shortcoming is a pervasive problem in any representation of the real world, and it's that the world isn't static. There are two ways to look at this. First, it's not an inherent property of Schumer that he holds the office of senator. Compare that to the assertion above that New York is a part of the United States, which we could reasonably say is a time-invariant truth. The second perspective is that this information may be correct now, but it won't be when Schumer leaves office. So when we write RDF, are we asserting time-invariant information or information that's claimed to be true only at the time of writing?
The answer is, we don't know. Some predicates are time-sensitive, some are time-invariant. Lots of information in RDF out there on the internet is time-sensitive with no indication of the time that it was written, or how long it might be correct for. This is a problem we'll have to deal with in the future.
So while it would be appropriate to leave the design as time-sensitive, GovTrack goes a step further and models the time that someone holds an office:
@prefix time: <http://pervasive.semanticweb.org/ont/2004/06/time#> .
people:S000148 pol:hasRole [
    rdf:type pol:Term ;
    time:from [ time:at "2005-01-01" ] ;
    time:to [ time:at "2010-12-31" ] ;
    pol:forOffice senate:ny
] .
The practical benefit of this is that GovTrack can include historical data this way without it seeming like George Washington is still in office.
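Here is a minimal Python sketch of how an application might use those dates. The first record mirrors the term shown for Schumer; the second record, including the identifier people:EXAMPLE and its dates, is invented purely to show a past office holder, and the dictionary layout is an assumption for illustration.

```python
from datetime import date

# Each role carries the period during which it held.
ROLES = [
    {"person": "people:S000148", "office": "senate:ny",
     "start": date.fromisoformat("2005-01-01"),
     "end": date.fromisoformat("2010-12-31")},
    # Invented historical record, for illustration only.
    {"person": "people:EXAMPLE", "office": "senate:ny",
     "start": date.fromisoformat("1999-01-01"),
     "end": date.fromisoformat("2004-12-31")},
]

def holders(office, on, roles=ROLES):
    """People whose term for the given office covers the given day."""
    return [
        role["person"] for role in roles
        if role["office"] == office and role["start"] <= on <= role["end"]
    ]

print(holders("senate:ny", date(2006, 2, 8)))
print(holders("senate:ny", date(2000, 6, 1)))
```

Historical and current records can sit side by side in the same data, and the date filter decides which ones a given question is about.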
In the Semantic Web, it's easy to get caught up in the theory. Modeling issues are fun to think about (at least for me), but it's good to have a practical application too, at least from time to time. Stay tuned for a future article where I'll get into the nuts and bolts of bringing government data onto the Semantic Web.
As a final note: as an American, my knowledge of my own government is fair and my knowledge of the outside world is, alas, minimal. I encourage you to post comments below about the Semantic Web and government in other countries.