The Library of Congress Comes Home

March 17, 2004

In the inaugural article ("Geeks and the Dijalog Lifestyle") of my new XML.com column, Hacking the Library, I offered a short tour of the territory I intend to explore with you, dear reader. It's a territory I call "dijalog", which stands for the confluence and intertwingling of the digital and the analog. If you're like me, you will never live the pure, weightless all-digital media lifestyle. Our media collections weren't born digital.

While I presented dijalog last time as a characteristic of people with lifestyles like ours, it's more fundamental than that. Dijalog is really about correlating and managing the interplay between physical space and virtual space, which is accomplished by organizing, describing, and managing objects which exist in both of these spaces.

I know that sounds complex and cerebral, but it's actually a simple, even basic idea. And the best way to explain it to you is by way of example. Let's consider what will be my subject for this and the next column: organizing your library (that is, your personal media collection) at home.

The Library as Dijalog Institution

What is a library? The first social thing to say about lending libraries in the United States is that they are thriving socialist institutions in the midst of capitalist fervor. That's a very interesting idea, one which could delay me for many paragraphs if I let it.

What can we say about lending libraries from the perspective of information management? Libraries, including some personal ones, are dijalog institutions. Libraries are (1) chunks of physical space, (2) highly organized and regimented, which exist, in part, to facilitate (3) the navigation of a virtual space, in this case, the information space of all (ideally, anyway) recorded human knowledge.

Let's unpack that sentence in three steps. First. Libraries are places, sites, locations in the physical world. A library is a place that you can visit, around and in and through which you can move, as a body moves through space.

Second. Libraries aren't merely spaces: they are highly regimented, organized, controlled spaces. A library is a space that brims with all the signs and pomps of human purpose. I want to revive an old word to describe this kind of socially significant space. A library is a habitation; it is a human dwelling place -- a place where human projects, goals, purposes, and ends can be acted out.

Third. Libraries aren't merely habitations: they are social spaces organized to aid people's navigation of another, a non-physical space, namely, the information space made up of and by all recorded human knowledge.

Want to learn about Chinese pottery during the Han period? Take the elevator to the third floor, go down 12 rows, turn right, walk halfway down the aisle, five rows from the top, grab the first three books. Need schematics for the design of a wastewater treatment plant with excellent aerative capacity and a small footprint? Take the tunnel to the next building, up the stairs to the seventh floor, turn left....

Thus, by navigating through, that is, by cleverly inhabiting, a particular, highly regimented social space, you can identify, locate, and interact with objects -- born digital, born physical, or both -- that represent or constitute your very own culture, or cultures far removed in space and time from your own.

A library is, then, a dijalog institution: it's a place where the interplay of physical and information space is managed.

LCC@Home: Why and What

Who knew libraries were such complex places, right? Librarians knew, of course, as do most people, even if we don't often think about libraries in these terms. This social and informational complexity means at the very least that it's okay for librarians to be so incredibly anal retentive. It's okay because they have to be! It's not a simple job.

But if it's so hard, why do I want you to consider implementing a classification scheme for your library at home? Because, first, all of the really hard work has already been done for you; and, second, there are benefits in return for a small investment of time and energy. You'll understand the point about the hard work having been done already after next month's column; but what about the expected benefits? They include easier discovery of things you own but don't know that you own; easier management of the physical constraints of owning a large library; easier digital management of physical objects, and so on.

Let me put this point in another way. XML.com writers, editors, and columnists often say that no XML vocabulary is worth much unless or until there is code that consumes and produces it. That's roughly the case with personal media collections, too. Implementing a classification scheme for your library is the first step toward managing a dijalog lifestyle because it gives you a replicable, algorithmic, tractable grounding in the real world. It means you can easily, predictably, reliably put your hands on your copy of Wayne Meeks' The Origins of Christian Morality or Jorge Luis Borges' Ficciones -- both of which you forgot you even owned -- as a result of asking a computer to tell you about some books that discuss Christian ethics in the patristic period or notable Spanish fabulists of the 20th century.

A Classification Scheme for Your Personal Media Collection

In the idiosyncratic way that I'm using the term, a "classification scheme" is a method of organizing the items of a media collection in such a way that they can be physically indexed, digitally queried, and physically retrieved. I'm focusing on books, but the items of a media collection may include other artifacts: CDs, DVDs, cassettes, 8-tracks, albums, magazines, journals, and assorted ephemera.

In other words, a classification scheme is a method for managing the interplay of information space (your collection) and physical space (all of the items which constitute your collection, as well as the space in which they reside). Chances are, if you have a large library, you've already implemented some kind of classification scheme. Probably something like this: most of my art history stuff lives on that skinny shelf in the bedroom, except for the oversized coffee table books which live on the coffee table; the fiction stays in the rec room; all the CDs are in the basement, near the stereo; and all the computer books are in the office, separated into open source and non-open source.

Which Classification Scheme?

Let's assume that I've convinced you to consider implementing a classification scheme at home. Which one should you use? There are several possibilities: Library of Congress Classification (LCC), Universal Decimal Classification (UDC), Dewey Decimal Classification (DDC), Colon Classification (CC), Bliss Classification (BC). If you're curious about some of these, I've collected good web resources about each one, as well as some microcommentary, in the Resources section at the end of this article.

To anticipate next month's column, I'm going to show you how to implement LCC. Well, really, I'm going to show you how to implement a variant that I'm calling "LCC@Home". It's a variant because we're not going to do much, if any, actual cataloging (though I probably won't be able to resist telling you about cuttering, since it's an interesting canonicalization algorithm of sorts), and we're going to make a few simplifying moves and assumptions in order to keep things realistic and manageable.

Before moving on, I want to show you the top-level categories of LCC, so that you can start to get an idea of what it's like. According to the Library of Congress Classification Outline, there are 21 top-level categories, one for most of the letters in the Latin alphabet:

A -- GENERAL WORKS

B -- PHILOSOPHY. PSYCHOLOGY. RELIGION

C -- AUXILIARY SCIENCES OF HISTORY

D -- HISTORY (GENERAL) AND HISTORY OF EUROPE

E -- HISTORY: AMERICA

F -- HISTORY: AMERICA

G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION

H -- SOCIAL SCIENCES

J -- POLITICAL SCIENCE

K -- LAW

L -- EDUCATION

M -- MUSIC AND BOOKS ON MUSIC

N -- FINE ARTS

P -- LANGUAGE AND LITERATURE

Q -- SCIENCE

R -- MEDICINE

S -- AGRICULTURE

T -- TECHNOLOGY

U -- MILITARY SCIENCE

V -- NAVAL SCIENCE

Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)
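To make the outline a little more concrete, here is a minimal Python sketch that stores the 21 top-level categories listed above in a lookup table and reads the top-level letter off the front of a call number. The names LCC_TOP_LEVEL and top_level_class are my own inventions for illustration, and the call number in the usage line is a placeholder, not a real catalog entry.

```python
# Top-level LCC classes, copied from the outline above.
LCC_TOP_LEVEL = {
    "A": "GENERAL WORKS",
    "B": "PHILOSOPHY. PSYCHOLOGY. RELIGION",
    "C": "AUXILIARY SCIENCES OF HISTORY",
    "D": "HISTORY (GENERAL) AND HISTORY OF EUROPE",
    "E": "HISTORY: AMERICA",
    "F": "HISTORY: AMERICA",
    "G": "GEOGRAPHY. ANTHROPOLOGY. RECREATION",
    "H": "SOCIAL SCIENCES",
    "J": "POLITICAL SCIENCE",
    "K": "LAW",
    "L": "EDUCATION",
    "M": "MUSIC AND BOOKS ON MUSIC",
    "N": "FINE ARTS",
    "P": "LANGUAGE AND LITERATURE",
    "Q": "SCIENCE",
    "R": "MEDICINE",
    "S": "AGRICULTURE",
    "T": "TECHNOLOGY",
    "U": "MILITARY SCIENCE",
    "V": "NAVAL SCIENCE",
    "Z": "BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)",
}


def top_level_class(call_number: str) -> str:
    """Return the top-level LCC letter for a call number (hypothetical helper)."""
    letter = call_number.strip()[:1].upper()
    if letter not in LCC_TOP_LEVEL:
        raise ValueError(f"not an LCC top-level class: {call_number!r}")
    return letter


# Usage with a placeholder call number.
letter = top_level_class("BR100")
print(letter, "--", LCC_TOP_LEVEL[letter])
```

Note that the letters I, O, W, X, and Y are absent from the table, so the helper rejects them rather than guessing.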

Alpha By Title or Author?

All of this raises a real question: why shouldn't we just alphabetize our collection items by title or by author's last name? There are a few reasons why that's not ideal.

It will come as no surprise to XML developers and other readers that the choice of classification scheme goes a long way toward determining what kinds of queries one can easily perform. As I discuss below in the Resources section, some of the alternatives to LCC are faceted schemes, which allow for a variety of complex, composable queries. LCC is not in fact a faceted scheme, though for personal collections that's not going to matter very much.

The problem with only arranging your collection physically by alphabetical order is that, without a computerized index of the collection, you can't form queries like, "show me all the resources that are about Spanish anarchism or anarcho-syndicalism" or "show me all the resources that are about Buddhist folk magic". The only way to browse an alphabetized collection is to stroll among its items, looking at each one carefully, trying to decide whether it matches what you want. Or you have to know a relevant title or author's name already.
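As a rough illustration of what such a computerized index buys you, here is a small Python sketch of a subject query. Everything in it is an assumption made for the example: the Item record, the find_by_subject function, the placeholder titles, the subject labels, and the shelf locations. A real index would more likely be a database or a library catalog queried over the Web.

```python
from dataclasses import dataclass


@dataclass
class Item:
    title: str          # placeholder titles only
    subjects: set[str]  # free-form subject labels (an assumption)
    shelf: str          # hypothetical physical location


def find_by_subject(items, *wanted):
    """Return the items tagged with ANY of the wanted subjects."""
    wanted_set = {w.lower() for w in wanted}
    return [
        item for item in items
        if wanted_set & {s.lower() for s in item.subjects}
    ]


collection = [
    Item("Placeholder title one", {"Spanish anarchism", "Labor history"}, "rec room"),
    Item("Placeholder title two", {"Buddhist folk magic"}, "bedroom shelf"),
    Item("Placeholder title three", {"Anarcho-syndicalism"}, "office"),
]

# "Show me all the resources about Spanish anarchism or anarcho-syndicalism."
for item in find_by_subject(collection, "Spanish anarchism", "anarcho-syndicalism"):
    print(f"{item.title} -> {item.shelf}")
```

The point of the sketch is the last line: the query runs in information space, but the answer is a place in physical space.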

Discovering resources, in the common case, is going to take longer if your collection is arranged alphabetically, though that's really only a problem once it grows above a certain size. For me that size was about 1,000 books. As soon as I could no longer take in my library in one continuous glance or eye-sweep, I started being unable to find things easily. Now that my library is spread out over three distinct physical spaces, it would be even harder to find things if it were arranged alphabetically.

I should tell the truth: you can get by with physically arranging your collection alphabetically by title or author. In that case, you use some computerized index at a big library, probably over the Web, to look for stuff. When you find items that you're interested in, you retrieve them in your collection, if they exist there, by doing an alphabetical lookup.

But that approach has real limitations. An alphabetical arrangement makes future planning harder because it doesn't scale well at all. It doesn't scale well because it lacks a rational or predictable connection between the information space which you query and the physical space within which you retrieve the results of that query. Recall the top-level LCC categories: LCC offers exactly that connection between the conceptual and the physical arrangement of the collection.

For example, it is very unlikely that I will ever own any items that are classifiable as U or V. Sorry, but that's not gonna happen. Likewise, while I have a copy of Black's Law Dictionary (what self-respecting IP-loving geek doesn't?), and a few dozen other legal reference texts, I'm never going to have a large number of K or R items. On the flip side, as a long-time student of philosophy and religion, I have some thousand or so B books; I have nearly 500 M items (mostly CDs, but lots of books about music); and my Q and T sections are very crowded, too.

While this may seem a subtle, insignificant point, it's actually quite useful. It allows me to do some rational preplanning about the physical arrangement of my collection. I typically want all of the items of a particular top-level category to be as close together, physically, as possible. Since I know how LCC distributes and joins items by looking at its top-level categories (as well as major categories within the top-levels: I have tons and tons of BR, BT, B790--B802, and T), I can make some decisions in advance about how to map my local distribution of items in top-level categories onto the constraints of my physical space. Given that I have a lot of B items, I need to make sure I leave a lot of shelf space for them, for example.
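This kind of preplanning is simple arithmetic, and a short Python sketch shows the idea: divide the shelf space you have in proportion to the number of items in each top-level category, optionally weighted by how much you expect each category to grow. The function name allocate_shelf_space, the growth weights, the 4,000 cm of total shelf space, and the Q and T counts are all assumptions for the example; only the rough B and M counts come from my own collection as described above.

```python
def allocate_shelf_space(counts, total_shelf_cm, growth=None):
    """Split shelf space across LCC classes in proportion to item counts.

    counts: items per top-level class, e.g. {"B": 1000, "M": 500}
    total_shelf_cm: total linear shelf space available
    growth: optional multipliers for classes expected to grow (assumption)
    """
    growth = growth or {}
    weights = {cls: n * growth.get(cls, 1.0) for cls, n in counts.items()}
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("no items to allocate space for")
    return {cls: total_shelf_cm * w / total_weight for cls, w in weights.items()}


# B and M roughly as described in the text; Q and T are made-up numbers.
counts = {"B": 1000, "M": 500, "Q": 300, "T": 200}
plan = allocate_shelf_space(counts, total_shelf_cm=4000)

for cls, cm in plan.items():
    print(f"{cls}: {cm:.0f} cm")
```

Passing something like growth={"T": 1.5} shifts space toward a category you expect to keep buying in, which is the "present and expected future interests" part of the planning.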

Compare this predictability to the case where your collection is arranged alphabetically. Do you have any idea how many items the titles of which begin with "r" or "p" or any other letter are in your collection? There's no easy way to know this and certainly not with any predictability. We do know roughly the most commonly used letters in English, but how that correlates to initial letters of author names and resource titles is anyone's guess. You could roughly allocate space according to wild guesses derived from letter frequencies, but that's a far cry from the kind of planning you can do with LCC.

Now, clearly, this argument is especially relevant to universities and other institutions with very large collections. In fact, you can often save yourself time on the campus of a large university by remembering that the law, engineering, divinity, and medical schools are all likely to have their own libraries, which is where you'll find the highest concentrations of K, T, B, and R items. Copies of items of general relevance -- like that Black's Law Dictionary I mentioned -- may, at some universities, be found in the central library, but not always.

So, yes, this is an argument more for large, resource-constrained institutions, but I think it applies to resource-constrained individuals with relatively large collections. If we're gonna do this, we might as well do it right.

Why Not Dewey?

Finally, before concluding this column, I want to consider briefly the reasons I had in mind when I chose LCC over Dewey Decimal Classification (DDC). DDC is, as its partisans will remind you, the most widely used scheme in the world. But I think that's slightly misleading. I don't mean in any way to denigrate DDC, since I'm neither remotely qualified to do so nor do I have any technical reasons whatever for preferring LCC. However, "widely used" is ambiguous. I have no doubt that DDC is the scheme used in the largest number of libraries. I also have as little doubt that more money and resources are poured into LCC cataloging efforts than into DDC cataloging efforts. I'm also relatively confident that more LCC-organized library indexes are available for query over the Web.

Why do those things matter? They matter because we want to push all of the cataloging burden onto relatively well-heeled public institutions, where it belongs. Implementing LCC@Home is possible at all only because individuals are able to push the cataloging burden onto powerful public institutions and to take advantage of the results of all the investment that's gone into cataloging to date. The simple fact is that LCC is supported by the US federal government and by the overwhelming majority of research universities in the US. The goal is for us little folks to do as little actual cataloging work as possible; one way we can achieve that goal is to align ourselves with the biggest and best funded cataloging effort. Near as I can tell, that's LCC.

Resources

Library of Congress

Home of the Library of Congress Classification (LCC): a rather US-centric (two top-level categories, E and F, containing U.S. history), unfaceted, not-rational classification scheme, the top-level categories of which are largely a result of local need. The subject of probably the largest cataloging budget in existence. See the Wikipedia entry on LCC.

Dewey Decimal Classification (DDC)

The best known, "most widely used" classification scheme. Also called the "Dewey Decimal System". Unlike LCC, a faceted classification scheme. Less expressive than UDC, which is a "forked variant" of DDC. See the Wikipedia entry on DDC.

Universal Decimal Classification

A fork of DDC dating to the first decade of the last century. Highly expressive, natural language independent. Can express relations between subjects. Allows for composable classifications using five relational operators: addition (+), extension (/), relation (:), algebraic subgrouping ([]), natural language (=). "In UDC, the universe of information (all recorded knowledge) is treated as a coherent system, built of related parts..." See the Wikipedia entry on UDC.

Colon Classification

Created by S.R. Ranganathan in the first half of the previous century. The first faceted or "analytico-synthetic" classification scheme. Still much used in libraries in India.

Bliss Bibliographic Classification

Created for CCNY (now known as CUNY) by Henry E. Bliss, adopted most often by UK libraries. "...the leading example of a fully faceted classification scheme." See the Wikipedia entry on Bliss.

How Exactly?

As a teaser for next month's column, I want to summarize very concisely how we'll actually implement LCC@Home:

1. Form an initial impression of the distribution of your collection in terms of LCC top-level categories and major subcategories.
2. Allocate physical and storage space (bookshelves, primarily) in a way that corresponds roughly with (1), taking into consideration your present and expected future interests.
3. Gather item-labeling materials -- including a variety of labels, stickers, and pens of various kinds -- taking into consideration any special requirements presented by unusual items in your collection.
4. For each item in your collection, find its unique LCC identifier and affix that identifier to the item, using the materials in (3).
5. Depending on the number and type of items in your collection that are not LCC cataloged, apply some other classification scheme, leave the items unidentified, or consider cataloging the item yourself.
6. Physically arrange the distribution of items matching LCC categories according to some locally-derived, sensible plan.
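
For the final step, once items carry their identifiers, putting them in shelf order is mostly a sorting problem. The Python sketch below shows one simplified way to do it: split each identifier into its class letters and class number, and compare the number numerically so that B802 sorts before B1000. The shelf_key function and the regular expression are my own assumptions, the sample identifiers are placeholders, and the sketch deliberately treats everything after the class number (cutters, dates, and so on) as a plain string, which real LCC filing rules do not.

```python
import re

# Class letters, an optional (possibly decimal) class number, then the rest.
_CALL_NUMBER = re.compile(r"^([A-Z]{1,3})\s*(\d+(?:\.\d+)?)?\s*(.*)$")


def shelf_key(call_number):
    """Build a sort key from a simplified LCC-style identifier (a sketch)."""
    match = _CALL_NUMBER.match(call_number.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized identifier: {call_number!r}")
    letters, number, rest = match.groups()
    # Compare the class number as a number, not as text.
    return (letters, float(number) if number else 0.0, rest)


# Placeholder identifiers, not real catalog entries.
labels = ["BT10", "B802", "B1000", "BR5", "B790"]

print(sorted(labels))                 # plain text sort misplaces B1000
print(sorted(labels, key=shelf_key))  # numeric-aware shelf order
```

A plain string sort puts B1000 ahead of B790 because it compares character by character; the key function avoids that by turning the class number into a float before comparing.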

In next month's column, replete with pictures and diagrams, I'll walk you through these steps, point out pitfalls and hidden traps, and discuss some of the choices we have to make. As always, I'm curious to hear your feedback about this article and these ideas.