Friday, April 28, 2006

Thesauri and controlled vocabularies

I had a very interesting conversation recently with two colleagues about the differences between thesauri and controlled vocabularies. Both of these colleagues are developers who work in my department. One is finishing up a Ph.D. in Computer Science, is currently in charge of system design for a major initiative of ours, and has a knack for seeing all the aspects of a problem before finding the right solution; the other is a database guru with whom I've collaborated on some very interesting research and who has just started pursuing an M.L.S. to add to his already considerable expertise. I like and respect both of these individuals a great deal.

The interesting conversation began when the database-guru-and-soon-to-be-librarian (DGASTBL) (geez, that's not any better, is it?) asked if the terms "controlled vocabulary" and "thesaurus" are used interchangeably in the library world. He asked because from our previous work and a solid basis in these concepts he knew they really aren't the same thing, yet he had seen them used in print in ways that didn't match his (correct) understanding. The high-level system diagram we had at the time had a box for "vocabulary" which was intended to handle thesaurus lookups for the system. We discussed how a more precise representation of that diagram would have an outer box for "vocabulary" to handle things like name authority files and subject vocabularies with lead-in terms but no other relationships, and an inner box for "thesauri" (as a subset of controlled vocabularies) with full syndetic structures that the system could make use of. We lamented that the required outer label in this scenario of "controlled vocabulary" isn't as sexy as its subset "thesauri." The latter sounds a great deal more interesting when describing a system to those not involved in developing it.

The system designer then presented a different perspective on the issue. While the librarian types considered thesauri a subset of controlled vocabularies (perhaps partly for historical reasons - we've been using loosely controlled vocabularies a lot longer than true thesauri), the system designer viewed the situation as the opposite - that controlled vocabularies were a specific type of thesaurus using only one type of relationship (the synonym), or perhaps also some rudimentary broader/narrower relationships that don't qualify as true thesauri (think LCSH). I found the difference in point of view interesting - that the C.S. perspective expected a completely structured approach to the vocabulary problem, and the library perspective represented an evolving view that has never quite gotten to the point where we can make robust use of this data in our systems. It struck me that the system designer's perspective in this conversation was overly optimistic as to the state of controlled vocabularies in libraries.

Yet there's light at the end of this particular tunnel. Production systems in digital libraries are starting to emerge that make good use of controlled vocabularies in search systems, rather than relying on users to consult external vocabulary resources before searching. Libraries have not taken advantage of the revolution in search systems shifting many functions from the user to the system (think spell-checking), to our supreme discredit. Making better use of these vocabularies and thesauri is one way of shifting this burden. I hope this integration of vocabularies into search systems will push the development of these vocabularies further and make them more useful as system tools rather than just cataloger tools. By providing search systems that can integrate this structured metadata, we can improve discovery in ways not currently provided by either library catalogs or mainstream search engines.
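To make the idea of shifting the vocabulary burden from the user to the system a little more concrete, here is a minimal sketch in Python. It is purely illustrative and does not describe any of the systems mentioned above: the Term and Vocabulary classes, the resolve and expand methods, and the sample terms are all hypothetical names I made up for the example. It assumes a vocabulary with lead-in terms (the "controlled vocabulary" part) and narrower-term relationships (one piece of a thesaurus's syndetic structure), and uses both to expand what the user typed.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One preferred term in a hypothetical vocabulary."""
    preferred: str
    lead_in: set[str] = field(default_factory=set)   # "USE" references (synonyms)
    narrower: set[str] = field(default_factory=set)  # narrower-term (NT) relationships


class Vocabulary:
    """Illustrative in-memory vocabulary; not modeled on any real system."""

    def __init__(self, terms):
        # Index preferred terms and lead-in terms case-insensitively.
        self.terms = {t.preferred.lower(): t for t in terms}
        self.lead_in_index = {
            s.lower(): t.preferred.lower() for t in terms for s in t.lead_in
        }

    def resolve(self, text):
        """Map user input to a preferred term, or None if it is unknown."""
        key = text.strip().lower()
        if key in self.terms:
            return self.terms[key].preferred
        target = self.lead_in_index.get(key)
        return self.terms[target].preferred if target else None

    def expand(self, text):
        """Return the preferred term plus all of its narrower terms.

        Unknown input is passed through unchanged so that a plain
        keyword search still works.
        """
        preferred = self.resolve(text)
        if preferred is None:
            return [text]
        result, seen, queue = [preferred], {preferred.lower()}, [preferred]
        while queue:
            term = self.terms.get(queue.pop(0).lower())
            if term is None:
                continue
            for name in sorted(term.narrower):
                if name.lower() not in seen:
                    seen.add(name.lower())
                    result.append(name)
                    queue.append(name)
        return result


# Sample data, invented for this example.
vocab = Vocabulary([
    Term("Motion pictures",
         lead_in={"Movies", "Films"},
         narrower={"Documentary films", "Silent films"}),
    Term("Documentary films"),
    Term("Silent films"),
])

print(vocab.expand("movies"))
```

With only lead-in terms, the system can do no more than swap "movies" for "Motion pictures"; it is the narrower-term relationships that let it broaden the search on the user's behalf, which is the distinction we were drawing between the outer and inner boxes.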

Monday, April 17, 2006

"Orienteering" as an information seeking strategy

I was introduced today to the notion of "orienteering" as an information seeking strategy, through a paper presented at the CHI 2004 conference by Jaime Teevan and several colleagues. The paper discusses orienteering as a strategy by which users make "small steps...used to narrow in on the target" rather than simply typing words in a search box. For some time, I've been struggling to articulate the differences between the search engine model, with a wide-open box for typing in a search, and the library model, with vast resources but a need for users to know ahead of time which of those resources are relevant to their search. This paper very clearly spoke to me, by demonstrating that real users (to use one of my favorite phrases) are somewhere in the middle.

Users have resources they like. We prefer one map site over another, one news site over another, one author over another. And we know where each of our preferred resources can be accessed. For many types of information needs, we know the right place (for us) to start looking. Even as we make the hidden Web more accessible, the resource (like an email) we need often won't be something a generic Web search engine can get to. But for many information needs, a box and "I'm feeling lucky" is an effective solution. I think the point is that we need a wide variety of discovery models to match the wide variety of our searching needs. We can't expect all users to start with the "right" resource (what's "right"?), but we should provide seamless methods for users to move, step by step, towards what they're looking for.

Thursday, April 06, 2006

techessence.info launched

I was recently honored to be asked to participate with a stunningly informed and diverse group of library technology types in an online initiative called TechEssence. TechEssence is envisioned as a rich resource for library decision-makers to learn just enough about a wide variety of technologies to allow them to make good decisions. I'm a big fan of this approach - not everyone can know everything, and many of us need succinct information with just the right amount of evaluation from those with experience. As of yesterday, the site is now officially launched!

Here's a summary from Roy Tennant, our fearless leader:

TechEssence.info
The essence of technology for library decision-makers

A new web site and collaborative blog on technology for library
decision-makers is now available at http://techessence.info/.
TechEssence provides library managers with summary information about
library technologies, suggested criteria for decision-making, and
links to resources for more information on essential library
technologies.

A collaborative blog provides centralized access to some of the best
writers in the field. By subscribing to the RSS feed of the
TechEssence.info blog, you will be able to keep tabs on the latest
trends of which library administrators should be aware.

To accomplish this I am joined by a truly amazing group:

* Andrew Pace
* Dorothea Salo
* Eric Lease Morgan
* Jenn Riley
* Jerry Kuntz
* Marshall Breeding
* Meredith Farkas
* Thomas Dowling

For more information on the group, see our "about us" page at http://techessence.info/about/.

Wednesday, April 05, 2006

Library digitization efforts

Many libraries are seeing efforts such as the Google Books Library Project and think they need to follow suit by digitizing books in order not to be left behind. I worry that many of these libraries are jumping in just to be on the bandwagon without fully considering where their efforts fit in with those of others. Digitizing books, performing dirty OCR, and making use of existing metadata is about as easy as it gets in the digital library world (not that this is exactly a walk in the park), so it's an attractive option for libraries looking to make a splash with their first efforts to deliver their local collections online.

I argue that this is not the right approach for most libraries. The impact libraries are looking for as a result of digitization of local collections is achieved through the right ratio of benefit to users versus costs to the library. While it costs the library less to digitize already-described, published books sitting on the shelves, the benefits are also lower than those of focusing on other types of materials (more on which materials I'm thinking of later...). We already have reasonable access to the books in our collection. I'll be the first to go on and on ad infinitum about the poor intellectual access we currently provide to our library materials. But there is some intellectual access. For books a library doesn't own, interlibrary loan is a slightly cumbersome but mostly reasonable method of delivering a title to a user. There are also a (comparatively) great many digitized books out there, without good registries of what's digitized and what isn't, or good ways to share digital versions when they do exist and the institution that owns the files is willing to share. Take the Google project - they're digitizing collections from five major research libraries, yet libraries planning digitization projects don't have access to lists of materials that are being digitized as part of this project, even though we expect to have some (not complete) access to these materials through Google's services at some point in the next few years. Even though library collections have less duplication than one might expect, a library embarking on a digitization project for published books would, to some non-negligible extent, be duplicating effort already spent.

Libraries in the aggregate hold almost unimaginably vast amounts of material. We're simply never going to get around to digitizing all of it, or even the proportion we would select given any reasonable set of selection guidelines. A very small proportion of these materials are the "easy" type - books, published, with MARC records. The huge majority are rare or unique materials: historical photographs, letters, sound recordings, original works of art, rare imprints. These sorts of materials generally have grossly inadequate or no networked method of intellectual discovery. While digitizing these collections and delivering them online would take more time, effort and money than published collections, I believe strongly that the increase in benefit greatly outweighs the additional costs. In the end, the impact of focusing our efforts on classes of materials that we currently underserve will be greater than taking the easy road. Our money is better spent focusing on those materials that are held by individual libraries, held by few or no others, and to which virtually no intellectual access exists. Isn't this preferable to spending our money digitizing published books to which current access is reasonable, if not perfect?

Tuesday, April 04, 2006

On metadata "experts"

I'm often asked how one gets the skills required to do my job as a Metadata Librarian. My answer is one I can't stress strongly enough: experience. We need to know the theoretical foundation of what we do inside and out, and need to constantly think about why we're doing something - the big picture. But theory is not enough. The only way to become skilled at making good metadata decisions is practice - seeing what happens as a result of an approach and improving on that approach the next time. No matter how many times I've done a certain type of task, I see the next repetition as a way to re-use good decisions and re-think others.

I've found the metadata community in libraries to be a very open one. When I'm starting on a task that I haven't done before, I use what I can from my experience with similar tasks. But I also ask around for advice from others who do have that experience. "Metadata" is a very big and diverse area of work. Even with the best abstract thinking, applying known principles to new environments, we all often need a boost for getting started from someone who has been through a given situation.

I'm skeptical of the idea of "experts" overall. These things are all relative - only once you start learning enough to be able to effectively share what you've learned with others do you truly realize how much you still have to learn. I put much more stock in the goal of becoming good at thinking about generalized solutions, good at making decisions for classes of problems rather than simply repeating specific implementations over and over. I'm not a programmer, and neither are many in the metadata librarian community. Yet the type of thinking that makes a good programmer can, in my opinion, make the best metadata experts as well.