Wednesday, October 26, 2005

Hierarchical catalog records

The October issue of D-Lib Magazine has an article describing an experimental FRBR implementation in the Perseus Digital Library, by David Mimno, Gregory Crane, and Alison Jones, entitled Hierarchical Catalog Records. I'm thrilled to see reports of experiments like this shared in outlets as widely read as D-Lib. I'm also extremely happy to see this particular experiment happen outside of the MARC environment. As some experiments we're conducting with MARC records for sound recordings progress, I've become more and more convinced that FRBRization really is a revolution in resource description. Statistics abound estimating the small percentage of works that exist in more than one expression, and of expressions that exist in more than one manifestation. While I don't doubt these numbers at all, I think the way in which they're presented minimizes the enormity of the task ahead of us to reach a true FRBR environment.

I believe the same is true of efforts to use MARC for FRBRized records. The MARC format could be adopted for this purpose. But is it in our best interests to do so? Using MARC makes the task seem less scary, as though it won't be that difficult. But it is a difficult task, and we're fooling ourselves if we pretend otherwise. I wonder if we aren't better off addressing the issue head-on, admitting to a change with a new base record format. The change would be one of mind-set, rather than functionality.

I've mentioned I believe the FRBRization task is difficult. I don't believe difficult means impossible in this case, however. We don't yet have a good sense of the cost associated with such a conversion, so any claim to its value will be tempered by that uncertainty. But I am convinced of that value, and I believe studies like that of the Perseus Digital Library are vital in demonstrating it. No cost can be justified without first understanding the associated benefit. We have a great deal more work to do to reach that understanding.

Sunday, October 23, 2005

You know you go to too many conferences when... you look down halfway through the morning session and realize you're wearing the name tag from the wrong conference. Sigh.

Saturday, October 22, 2005

Separating data entry from data structure

I believe we've fallen extremely short in at least one area of potential for improving our cataloging and metadata creation systems -- user interfaces. We're still stuck in a mindset developed in the early days of the MARC format, whereby data is entered in the exact form in which it needs to be stored. When Web-based OPACs and cataloging modules emerged, cursory attempts to "improve" the interface appeared, but the changes were almost exclusively surface changes (labeling, etc.), and not implemented with community involvement.

But of course current technology provides many possibilities for a design layer in between the data entry interface and the data storage format. Metadata creation by humans is expensive. We need to do everything we can to design data entry interfaces that speed this process along, that help the cataloger to create high-quality data quickly. Visual cues, tab completion, and keyboard shortcuts are just a few simple tricks that could help. More fundamental approaches like automatic inclusion of boilerplate text and integration of controlled vocabularies could provide enormous strides forward.
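
Here's a minimal sketch, in Python, of the kind of in-between layer I have in mind. Everything in it is made up for illustration: the `INSTRUMENT_VOCAB` list, the `BOILERPLATE` values, the field names, and the shape of the stored record are all assumptions, not any real system's format. The point is only that what the cataloger types (shortcuts, loose spacing) need not be what gets stored.

```python
# Hypothetical controlled vocabulary: forms a cataloger might type,
# each mapped to the preferred term that actually gets stored.
INSTRUMENT_VOCAB = {
    "vln": "violin",
    "violin": "violin",
    "pf": "piano",
    "piano": "piano",
}

# Hypothetical boilerplate the system supplies so nobody has to type it.
BOILERPLATE = {"cataloging_source": "Example Library", "record_format": "demo-v1"}


def complete(prefix, vocab=INSTRUMENT_VOCAB):
    """Tab completion: preferred terms whose entry forms start with the prefix."""
    prefix = prefix.strip().lower()
    return sorted({term for key, term in vocab.items() if key.startswith(prefix)})


def to_storage_record(form):
    """Translate what the cataloger entered into the structure that is stored."""
    instruments = []
    for raw in form.get("instruments", "").split(","):
        key = raw.strip().lower()
        if not key:
            continue  # ignore empty entries such as a trailing comma
        if key not in INSTRUMENT_VOCAB:
            raise ValueError(f"unknown instrument: {raw.strip()!r}")
        instruments.append(INSTRUMENT_VOCAB[key])

    record = dict(BOILERPLATE)                          # boilerplate added automatically
    record["title"] = " ".join(form["title"].split())   # collapse stray whitespace
    record["instruments"] = instruments                 # preferred terms only
    return record


if __name__ == "__main__":
    print(complete("v"))
    print(to_storage_record({"title": "  Sonata   no. 1 ", "instruments": "vln, pf"}))
```

The cataloger enters "vln, pf" and never sees or cares how the stored record spells it out. Swapping the storage format would mean changing `to_storage_record`, not retraining anyone.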

Yet with all of this potential, I frequently (WAAAAAY too frequently) have conversations with librarians where it becomes clear they're focused exclusively on the data output format. It never even occurred to them that a system could do something with entered data that doesn't require cataloger involvement. (Man, I knew we librarians were control freaks, but this really takes the cake.) Of course, librarians aren't on the whole system designers. That's OK. But all librarians still need to be able to think creatively about possibilities. I'm convinced that the way forward here is to take the initiative to develop systems that demonstrate this potential, that show everyone what is possible with today's technology. Everyone has vision, yet that vision always has limits. By demonstrating explicitly a few steps forward from where we are, vision can then expand that much further.

Sunday, October 09, 2005

Museums and user-contributed metadata

It's funny how often, once one starts thinking about a subject, one finds examples of it absolutely everywhere. I've been thinking about user-contributed metadata for a while now in the context of a digital music library project, where we can provide innovative types of searching, if only we could find a way to make the creation of the robust metadata that drives it cost-effective. I wrote about this topic recently, inspired by OCLC's Wiki WorldCat pilot service.

So imagine my pleasure when, catching up on my reading this weekend, I came across "Social Terminology Enhancement through Vernacular Engagement" by David Bearman and Jennifer Trant in September's D-Lib Magazine. (Yes, I do know it's no longer September. Thanks for asking.) I'm thrilled to hear about this initiative, especially how well-developed it seems to be. I haven't yet followed the citations in the article to read any of the project documentation, but it certainly looks extensive. In the digital library (and museum!) world, I firmly believe ongoing documentation such as this associated with a project can be of as much or even more value than formally published reports.

Two features strike me about the "Steve" system described here, and they make it clear to me there are many ways to implement systems collecting metadata from users. They also make me realize these decisions need to be made at the very beginning of a project, as they drive all other implementation decisions. The first is an assumption that the user interacting with the system is charged with the task of description rather than simply reacting to something they see and perceive either as an error or an omission. The user is interacting with the system for the purpose of contributing metadata; finding resources relevant to an information need is not the point. I suppose different users end up contributing with this model than with one that allows users to comment casually on resources they find in the course of doing other work. Different users might affect the "authoritativeness" of the metadata being contributed, but I wonder to what degree.

The second feature I find notable is that the system is designed to be folksonomic; there is no attempt at vocabulary control. We library folk tend to start from the assumption that controlled vocabulary is better than uncontrolled and move on from there. At first glance, some of the reports from this project seem to resist that assumption, and start from the beginning looking for a real comparison. I'm eager to read on.

Thursday, October 06, 2005

User-contributed metadata

OCLC recently announced the Wiki WorldCat pilot service. What a fantastic idea! Too bad I'm having trouble trying it out. I looked at a few books in Open WorldCat (via Google), including this one that I read recently and the book shown on the Wiki WorldCat page (The Da Vinci Code), and I didn't see the reviews tab or the links to add a table of contents or a note shown on the project page. Hmm. I wonder what I'm missing.

But, anyhoo... incorporating user-contributed metadata into library systems is something I've been thinking about for a while. Librarians tend to be pretty wedded to the notion of authority, that as curators of knowledge we're the best qualified folks out there to perform the documentation of bibliographic information. Assuming for a moment that this is true for some data elements, there are still several classes of data that could easily benefit from end-user involvement.

The first is detailed information from specialized domains. I work on a number of projects related to music. Information such as exactly which members of a jazz combo play on any given piece on a CD or the date of composition of a relatively obscure work is the sort of thing our catalogs could be providing to serve as research systems instead of just finding systems. But this sort of metadata is expensive to create; it requires research and domain expertise on the part of the cataloger. Many of our users, however, do have this specialized knowledge and love to share it.

Other information that might be appropriate for supplying by end-users could be tables of contents, instrumentation of a musical work, language of a text, and others of this type of "objective" information. Before you say, "But what about standard terminology, spelling, capitalization?!?" in a panicked voice, consider basic interface capabilities in 21st-century systems such as picking values from provided lists rather than typing them in.

But should we restrict ourselves to these more obvious elements? I've been hoping for some time to be able to test various degrees of vetting of user-contributed metadata in a digital library system. I have in mind a completely open Wiki-type system, one that simply sends a suggestion to a cataloger, and a number of options in between. I suspect the quality of the user-contributed metadata will be overall much higher than critics assume. Yet even if it isn't, what sort of trade-off between quality and quantity are we willing to make? Traditional cataloging operations don't have extensive quality control operations, perhaps because QC is expensive work. And catalogers make mistakes, every day, just like the rest of us. Assuming a system where users can correct errors, how quickly will errors (made by a cataloger or by another end-user) be found and corrected? Will the "correct" data win out in the end? Surely these issues are worth a serious look.
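
To make those degrees of vetting concrete, here's a rough Python sketch of how one might switch between them in a test system. The three policy names, the `Record` class, and the idea of a "trusted" flag on a user are all my own assumptions for illustration; the in-between option shown is just one of many possible ones.

```python
from dataclasses import dataclass, field
from enum import Enum


class Vetting(Enum):
    OPEN = "open"                    # wiki-style: every edit goes live at once
    TRUSTED_USERS = "trusted_users"  # one possible in-between option
    SUGGEST_ONLY = "suggest_only"    # every edit waits for a cataloger


@dataclass
class Record:
    fields: dict = field(default_factory=dict)   # live metadata
    pending: list = field(default_factory=list)  # suggestions awaiting review


def submit(record, policy, user_is_trusted, name, value):
    """Apply a user's contribution directly or queue it, depending on policy."""
    goes_live = policy is Vetting.OPEN or (
        policy is Vetting.TRUSTED_USERS and user_is_trusted
    )
    if goes_live:
        record.fields[name] = value
        return "applied"
    record.pending.append((name, value))
    return "queued"


def approve(record, index=0):
    """A cataloger accepts one queued suggestion, making it live."""
    name, value = record.pending.pop(index)
    record.fields[name] = value


if __name__ == "__main__":
    rec = Record()
    print(submit(rec, Vetting.OPEN, False, "language", "Latin"))
    print(submit(rec, Vetting.SUGGEST_ONLY, True, "date_of_composition", "1787"))
    approve(rec)
    print(rec.fields)
```

Running the same contributions through each policy and comparing the resulting records against a vetted baseline would be one way to start measuring the quality and quantity trade-off.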