Monday, December 18, 2006

I *love* this

I'm talkin' 'bout the new version of OCLC's FictionFinder. Specifically, the browse feature in FictionFinder. You heard me. Browse. In a library system. Not the LCSH browse with pages upon pages upon pages upon pages of subdivisions with no discernible grouping, but a real browse.

The best is the "genre" browse (but take out those --Young adult fiction subdivisions and move them to an audience facet). It's not a short list, but it's not too long either. It would be interesting to arrange these hierarchically and see if navigating that list made any sense to users. And "settings"! How cool to be able to locate fiction that takes place in the Pyrenees. This is what library catalogs should do for our users.

I'm also intrigued by the "character" browse. This is something I've never thought of before. My general rule for browsing facets is to only include facets that have a (relatively) small number of categories, each with a (relatively) large number of members. At first, I didn't think characters met this requirement. Then I clicked on Captain Ahab, and I realized just how many works of fiction there are about him! Great works inspire derivatives, and exploring those is a fun way to guide new reading, in my opinion. It would be interesting to have access to a browse list of all characters in some situations, and only those with a large number of works (note works here, not publications) in other situations. Exploring which situations warrant which presentation would be another interesting line of inquiry.

The next improvement I want to see is allowing users to combine these facets (and others) dynamically so I can find Psychological fiction set in the Pyrenees, then narrow it to works after 1960, then remove the Pyrenees requirement, then add Captain Ahab to the requirements that are left... ad nauseam. Our catalogs need to support discovery of new works, not just those for which we already know the author and title. Systems like this are light years (sci-fi fan here!) ahead of LCSH-style "browsing". I want more!
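The kind of dynamic narrowing and widening I'm describing is easy to sketch. Here's a toy illustration in Python; the records, titles, and facet values are all invented for the example:

```python
# Toy records: each work carries explicitly coded facet values.
# All titles and values here are invented for illustration.
WORKS = [
    {"title": "Work A", "genre": "Psychological fiction",
     "setting": "Pyrenees", "year": 1972, "character": None},
    {"title": "Work B", "genre": "Psychological fiction",
     "setting": "Pyrenees", "year": 1948, "character": None},
    {"title": "Work C", "genre": "Psychological fiction",
     "setting": "Nantucket", "year": 1979, "character": "Ahab, Captain"},
]

def search(criteria):
    """Return titles of works matching every active facet restriction."""
    return [w["title"] for w in WORKS
            if all(test(w) for test in criteria.values())]

# Start with Psychological fiction set in the Pyrenees...
criteria = {
    "genre":   lambda w: w["genre"] == "Psychological fiction",
    "setting": lambda w: w["setting"] == "Pyrenees",
}
print(search(criteria))   # ['Work A', 'Work B']

# ...narrow to works after 1960...
criteria["year"] = lambda w: w["year"] > 1960
print(search(criteria))   # ['Work A']

# ...then drop the Pyrenees requirement and add Captain Ahab.
del criteria["setting"]
criteria["character"] = lambda w: w["character"] == "Ahab, Captain"
print(search(criteria))   # ['Work C']
```

In a real system, each step would also recompute the counts shown next to every remaining facet value, so users could see where each next click would take them.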

(Note to OCLC - the link to "Known problems" is broken. I'm interested to find out what challenges you've faced when building this beta system. I have a very strange idea of fun.)


Friday, December 08, 2006

True confessions

I recently checked out David Allen's Getting things done from my local public library, thinking I could use a little help calming down the craziness that my life seems to have turned into. Probably predictably, I turned it in late, having read only the first two chapters. Oh, well.

In light of this and other related events, I've been thinking a bit about what I do get done and why. I believe I've been spoiled by having jobs for a number of years now where I find the work interesting. It's a whole lot easier to get work done when it's engaging and I care about the outcome. For the most part, the tasks I find interesting are the ones I end up working on, leaving the uninteresting ones until right before a deadline.

So what does this mean for libraries? I think it means that we need to make sure to allow our staff to step up and get involved in projects as deeply as interests them. There are many of us out there who get motivated by understanding and buying into the big picture. Don't "protect" your staff from those high-level discussions - allow them to participate as much as they see fit. Sure, there are lots of folks in library-land that are just interested in the paycheck. We need to meet their needs too. But reward those who think beyond the next five minutes - they're going to be running the place soon enough.

Wednesday, November 15, 2006

Children's Book Week

Reading all the touching stories of favorite childhood books across the biblioblogosphere in honor of Children's Book Week has guilted me into posting my own contribution. I still smile when I think of The Little Old Man Who Could Not Read, written by Irma Simonton Black and illustrated by Seymour Fleishman. It's a story of a man (who cannot read) who goes to the grocery store and selects items based on box size and color, trying to match them to products he knows he has at home. Of course, he ends up with an amusing assortment of unintended purchases. The story is touching and the illustrations really make the point. Like many books from my childhood, I think it's out of print (and I see it was first published in 1968, before I was born), but it looks like Amazon can hook you up with a copy, as could many local libraries.

Tuesday, November 07, 2006

More structured metadata

I often encounter people who see my job title (Metadata Librarian) and assume I have an agenda to do away with human cataloging entirely and rely solely on full-text searching and uncontrolled metadata generated by authors and publishers. That’s simply not true; I have no such goal. I am interested in exploring new means of description, not for their own sake, but for the retrieval possibilities they suggest for our users. So here are a few statements that begin to explain my metadata philosophy:

I want more automation. Throwing more money at a manual cataloging process is not a reasonable solution. First of all, it would take waaaaaaayyyyy more money than we can even dream of getting, and second, much metadata creation is not a good use of human effort. Let’s automate everything we can, saving our skilled people for the tasks current automated methods are furthest from performing adequately. Let’s get more objective types of metadata, such as pagination, from resources themselves or from their creators (including publishers). Let’s build systems that make data entry and authority control easy. Yes, there will be some mistakes. There would be mistakes if the whole thing were done by humans too. Is catching the few mistakes these automated processes will let through really more important than devoting our human effort to those extra few resources? More automation means more data total, and the sorts of discovery services I have in mind need lots of that data.

I want more consistency. Users can’t find what’s not there. While we can’t prescribe that all records for all resources everywhere must have a large number of features (I’m against metadata police!), the more of those features are present, the more discovery options those users have. Imagine a system that provides access to fiction based on geographic setting. Cool, huh? I read one book recently set in Cape Breton Island and can’t wait to get my hands on more. We can’t do that very well today because that data is in very few of our records, and when it is there, it isn’t always in the same place. The more consistent we are with our metadata, the better able we’ll be to build those next-generation systems.

I want more structure. I’m a big fan of faceted browsing. The ability to move seamlessly through a system, adding and removing features such as language, date, geography, topic, instrumentation (hey, I’m a musician…), and the like based on what I’m currently seeing in a result set is something I believe our users will be demanding more and more. But we can’t do this if that information isn’t explicitly coded. Instrumentation (e.g., “means of performance”) as part of a generic “subject” string isn’t going to cut it. Geographic subdivisions (even in their own subfield) that are structured to be human- rather than machine-readable also aren’t going to cut it. Nor are textual language notes, [ca. 1846?], or most GMDs. Many of these things can be parsed, and turned into more highly structured data with some degree of success. But why aren’t we doing it that way in the first place? More structure = better discovery capabilities.
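To make the "[ca. 1846?]" point concrete, here's a toy comparison in Python (both lists are invented for illustration): a textual date note has to be scraped with heuristics that lose information, while an explicitly structured date answers queries directly.

```python
import re

# The same publication dates twice: as catalog display text, and as
# structured values. Both lists are invented for illustration.
textual = ["[ca. 1846?]", "1921.", "c1967", "[between 1890 and 1900?]"]
structured = [
    {"year": 1846, "qualifier": "approximate"},
    {"year": 1921, "qualifier": None},
    {"year": 1967, "qualifier": "copyright"},
    {"year": 1890, "qualifier": "range", "end": 1900},
]

def guess_year(note):
    """Heuristically scrape a year out of a textual date note; may misfire."""
    m = re.search(r"1[5-9]\d\d", note)
    return int(m.group(0)) if m else None

# Parsing the display text drops the qualifiers and flattens the range...
parsed = [guess_year(n) for n in textual]    # [1846, 1921, 1967, 1890]

# ...while structured values support queries directly, no parsing needed.
after_1900 = [d["year"] for d in structured if d["year"] > 1900]   # [1921, 1967]
```

Note how the parsed list silently turns "between 1890 and 1900" into a bare 1890; that's the kind of loss that coding structure up front avoids.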

What this all means is I’m glad there are lots of extremely bright people with all sorts of perspectives and skills thinking about improved discovery for library materials, but that doesn’t necessarily mean throwing out metadata-based searching. The sorts of systems I envision require more, more highly structured, more predictable, and higher-quality metadata. I want more, not less.

I’ll stand on one last (smallish) soapbox before wrapping this up. In many communities (including both search engines and libraries), discussions about retrieval possibilities often center around textual resources. However, not everything that people are interested in is textual. That’s of course not a surprise, but I’m shocked at how often discovery models are presented that rely on this assumption. I’m all for using the contents of a textual resource to enhance discovery in interesting ways, but we need systems that can provide good retrieval for other sorts of materials too. Let’s not leave our music, our art, our data sets, our maps hanging out to dry while we plow forward with text alone.

Sunday, October 29, 2006

Thinking bigger than fixing typos

The Typo of the Day blog, which each day presents a typographical error likely to be found in library catalogs and encourages catalogers to search their own catalogs for it, has generated much discussion and linking in the library world. I’m all for ensuring our catalog data is as accurate as possible; however, I would like to think beyond the needle-in-a-haystack approach presented here. I want our emphasis to be on systems that make it difficult to make a mistake in the first place, rather than focusing on review systems that emphasize what’s wrong over what’s right and give a select few a false sense of intellectual superiority over those who do good work and make the occasional inevitable simple mistake.

There are many ways our cataloging systems could better promote quality records and make it more difficult to commit simple errors. I’ll mention just two here: spell checking and heading control. We hear frequent complaints about the lack of spell checking in our patron search interfaces, but few talk about this feature as being useful to catalogers. And I’m not talking about a button that looks over a record before saving it—I’m talking about real interactive visual feedback that helps a cataloger fix a typo right when it happens. Think Word with its little red squiggly lines—they show up instantly, so all you have to do is hit backspace a few times while you’re thinking about this particular field and not miss a beat. If it’s not really an error, the feedback is easy to ignore. Word also has a feature whereby it can automatically correct a misspelling as you type based on a preset (and customizable) list of common typos. Features like this require a bit more attention to make sure the change isn’t an undesired one, but for most people in most cases it saves a great deal more time than it takes, and the feature can be tuned to an individual’s preferences. Checking the entire record after the fact requires a higher cognitive load (turning back to a title page, remembering what you were thinking when you formulated the value for that field, checking an authority file a second time, and so on) and is less helpful than real-time feedback.
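The autocorrect half of this takes very little code. A minimal sketch, assuming a cataloger-customizable typo table (the entries below are made up):

```python
# A tiny typo -> correction table, standing in for the preset (and
# customizable) list a cataloging client could apply as each word is typed.
AUTOCORRECT = {
    "teh": "the",
    "libary": "library",
    "recieved": "received",
}

def correct_word(word):
    """Return the corrected form of a just-finished word, preserving case."""
    fixed = AUTOCORRECT.get(word.lower())
    if fixed is None:
        return word                       # not a known typo; leave it alone
    return fixed.capitalize() if word[:1].isupper() else fixed

print(correct_word("Libary"))   # Library
print(correct_word("catalog"))  # catalog
```

The hard part isn't the lookup; it's the interface work of applying it at the moment a word is completed, with an easy one-keystroke undo when the cataloger really did mean what was typed.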

Heading control is the second area in which our systems could make it easy to do the right thing. Easier movement between a bibliographic record and an authority file, widgets that fill in headings based on a single click or keystroke, and automatic checks that ensure a controlled value matches an authority reference before leaving the field can all help the cataloger avoid simple typographical errors in the first place and make the sort of treasure hunt common typo lists provide less necessary.

Consider also the enormous duplication of effort we’re expending by hundreds of individuals at hundreds of institutions all looking up the same typos in our catalogs and all editing our own copies of the same records. This local editing makes an already tough version control problem worse by increasing the differences between hundreds of copies of a record for the same thing. We have way more cataloging work to do than we can possibly afford, and duplication of effort like this is an embarrassingly poor use of our limited resources. The single most effective step we can take to improve record quality is to stop this insanity we call “cooperative cataloging” today and adopt a streamlined model whereby all benefit instantaneously and automatically from one person fixing a simple typo.

Tuesday, October 17, 2006

Grant proposals

Writing competitive grant proposals for putting analog collections online is difficult, and is becoming more so as more institutions are in a position to submit high-quality proposals and digitization for its own sake is no longer a priority for funding agencies. Collections themselves are no longer enough. There are many more collections that deserve a wider audience, that will significantly contribute to the work of scholars, and that will bring new knowledge to light, than can possibly be funded by even a hundred times the amount of grant funding available. The key is to offer something new. A new search feature. Expert contextual materials. User tagging capabilities. Something to make your project stand out as special and test some new ideas.

The trick is that in order to write that convincing proposal, you have to do a significant amount of the project, even before you write the proposal and before you get any money. Most of the important decisions, such as what metadata standards you will use, must be made before you write the proposal, both to convince a funding agency you know what you are doing and to develop reasonable cost figures. Making these decisions requires an in-depth understanding of the materials, your users, the sorts of discovery and delivery functionality you will provide, and the systems you will use. Coming to that understanding is no small task, and it is one of the most important parts of project planning. Don’t think of grant money as “free”—think of it as a way to do something you were going to do anyway, just a bit faster and sooner.

Saturday, September 30, 2006

Librarians in the Media

CNET news published an article this week entitled, “Most reliable search tool could be your librarian.” While it’s nice to see librarians getting some press, I remain concerned about our image, both as presented in the media and as we present ourselves.

The article contains the usual rhetoric about caution in evaluating the “authority” of information retrieved by Web search engines, the need for advanced search strategies to achieve better search results, and the bashing of keyword searching. Here, as in so many other places, the subtext is that “our” (meaning libraries’) information is “better” – that if only you, the lowly ignorant user, would simply deign to listen to us, we could enlighten you, teach you the rituals of “quality” searching and the location of deserving resources, rather than that drivel out there on the Web that could be written by (gasp!) any yahoo out there.

Of course we know it’s not that simple. But the oversimplification is what’s out there. We’re not doing ourselves any favors by portraying ourselves (or allowing ourselves to be portrayed) as holier-than-thou, constantly telling people they’re not looking for things the right way or using the right things from what they do find, even though they thought they were getting along just fine. We simply can’t draw a line in the sand and say, “the things you find through libraries are good and the things you don’t are suspect.” There are really terrible articles in academic journals, and equally terrible books, many published by reputable firms. There are, on the other hand, countless very good resources out there on the Web, discoverable through search engines. And the line between the two is becoming ever more blurry as scholarly publishing moves towards open access, libraries are putting their collections online, government resources are increasingly becoming Web-accessible, and search engines gain further access to the deep Web.

The first strategy we should take is to move the discussion away from the resource and its authority and toward the information need. Evaluating an individual resource is of course important, but it’s not the first step. Let’s instead talk first about all the resources and search strategies that can meet a given need, rather than always focusing on resources and search strategies that can’t meet that need. There are many, many ways a user can successfully locate the name of the actor in the movie he saw last night, identify a source to purchase a household item at a reasonable price, find a good novel to read on a given theme, or learn more about how the War of 1812 started. Let’s not assume every information need is best met by a peer-reviewed resource; instead, let’s make peer-reviewed resources, and the mediation services we can offer around them, more accessible when they are the appropriate way to meet the need. Let’s be a part of the information landscape for our patrons, rather than telling them we sit above it.

Saturday, September 02, 2006

On "authority"

I recently got around to reading the response from Encyclopedia Britannica to the comparison of the “accuracy” of articles in Britannica and Wikipedia by Nature. It’s got me thinking about the nature of authority, accuracy, and truth.

Britannica’s objections to the Nature article arise from a different interpretation of the words “accuracy” and “error.” The refutations by Britannica fall into two general categories. The first is the disputation of certain factual statements, mostly where such facts were established by research. These facts aren’t truly objective; rather, they’re a product of what a human is willing to believe based on the evidence. Different humans will draw different conclusions from the same evidence. And then there’s the other human element: mistakes. We make them, both those who work for Britannica and those who work for Nature. The “error” rates Nature reported for both sources are astonishingly high. Certainly not all of these are true mistakes, maybe not even very many of them, but they exist in every resource humans create, despite any level of editorial oversight.

Second, and more prevalent, are differing opinions among reasonable people, even experts in a given domain, about what is and isn’t appropriate to include in text written for a given audience. Anything but the most detailed, comprehensive coverage of a subject requires some degree of oversimplification (and maybe even that does as well). By some definition, all such oversimplifications are “wrong” – it’s a matter of perspective and interpretation whether or not they’re useful in any given set of circumstances. Truth is circumstantial, much as we hate to admit it.

I’d say the same principles apply to library catalog records. First, think about factual statements. At first glance, something like a publication date would seem to be an objective bit of data that’s either wrong or right. But it’s not that simple. There are multitudes of rules in library cataloging governing how to determine a publication date and how to format it. Interpretation of those rules is necessary, so two catalogers can reasonably reach different decisions about what the publication date is. In cases where a true mistake has been made, our copy cataloging workflows require huge amounts of effort to distribute corrections among all libraries that have used the record with that mistake. Only sometimes is a library correcting a mistake able to reflect this correction in a shared version of a record, and no reasonable system exists to propagate that correction to libraries that have already made their own copy of that record. The very idea of hundreds of copies of these records, each slightly different, floating around out there is ridiculous in today’s information environment. We’re currently stuck in this mode for historical reasons, and a major cooperative cataloging infrastructure upgrade is in order.

More subjective decisions are not frequently recognized as such when librarians talk about cataloging. We talk as if one would only follow the rules, the perfect catalog record would be produced, and that if two people were to just follow the same rules, they would produce identical records. But of course that’s not true. There will always be individual variation, no matter how well-written, well-organized, or complete the instructions. Librarians complain about “poor” records when subject headings don’t match their ideas of what a work is about. But catalogers don’t (and of course can’t) read every book, watch every video, or listen to every musical composition they describe. Why have we set up a system whereby we spend a great deal of duplicate effort overriding one subjective decision with another, based on only the most cursory understanding of the resources we’re describing, and keeping multiple but different copies of these records in hundreds of locations? How, exactly, does this promote “quality” in cataloging?

An underlying assumption here is that there is one single perfect cataloging record that is the best description of an item. But of course this isn’t true either. All metadata is an interpretation. The choices we make about vocabularies, level of description, and areas of focus all privilege certain uses over others. I’m fond of citing Carl Lagoze’s statement that "it is helpful to think of metadata as multiple views that can be projected from a single information object." Few would argue with this statement taken alone, yet our descriptive practices don’t reflect it. It’s high time we stopped pretending that the rules are all we need, changed our cooperative cataloging models to be truly cooperative, and used content experts rather than syntax experts to describe our valuable resources.

Tuesday, August 08, 2006

What about dirty OCR?

I often hear discussions as part of the digital project planning process about how best to approach full-text searching of documents. A common theme of these discussions is whether “dirty” (uncorrected, raw) OCR is acceptable. The “con” position tends to argue that OCR is only so effective (say, 95%) and that the errors made can and will adversely affect searching. The “pro” position is that some access is better than none, and OCR is a relatively cheap option for providing that “some” access.

The con position has some convincing arguments. Providing some sort of full text search sends a very strong implication that the search works – and if the error rate in the full text is more than negligible, it could be said that implied promise has been broken. Error rates themselves are misleading. A colleague of mine likes to use the following (very effective, in my opinion) example, noting that error rates refer to characters, but we search with words:

Quick brown fix jumps ever the lazy dog.

In this case, there are two errors (fix and ever) out of 40 characters (including spaces), for an accuracy rate of 95%. However, only 6 of 8 words (75%) are correct in that example.
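That arithmetic is easy to check mechanically. A small Python verification of the example above (the reference sentence is reconstructed from the two errors noted):

```python
def char_accuracy(reference, ocr):
    """Fraction of character positions that match (equal-length strings here)."""
    matches = sum(a == b for a, b in zip(reference, ocr))
    return matches / len(reference)

def word_accuracy(reference, ocr):
    """Fraction of whitespace-delimited words that match exactly."""
    ref_words, ocr_words = reference.split(), ocr.split()
    matches = sum(a == b for a, b in zip(ref_words, ocr_words))
    return matches / len(ref_words)

reference = "Quick brown fox jumps over the lazy dog."
ocr       = "Quick brown fix jumps ever the lazy dog."

print(char_accuracy(reference, ocr))   # 0.95  (38 of 40 characters)
print(word_accuracy(reference, ocr))   # 0.75  (6 of 8 words)
```

The gap between the two numbers widens as errors spread across more words, which is exactly why a per-character rate overstates how well search will work.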

So uncorrected OCR has some problems. But the costs of human editing of OCR-ed texts are high – too high to be a valuable alternative in many situations. Double- and triple-keying (two or three humans manually typing in a text while looking at scanned images) tends to be cheaper than OCR with human editing, but these cost savings are typically achieved by outsourcing the work to third-world countries, raising ethical concerns for many. And both of the human-intervention options themselves carry a non-zero error rate. No solution can reasonably yield completely error-free results.

I’ll argue that the appropriate choice lies, as always, in the details of the situation. How accurate can you expect the OCR to be for the materials in question? 90% vs. 95% vs. 99% makes a big difference. What sorts of funds are available for the project? Are there existing staff available for reassignment, or is there a pool of money available for paying for outsourcing? TEST all the available options with the actual materials needing conversion. Find out what accuracy rate can be achieved via OCR with all available software. Ask editing and double-keying vendors for samples of their work based on samples from the collection. Do a systematic analysis of the results. Don’t guess as to which way is better. Make a decision based on actual evidence, and make sure you get ample quantities of that evidence. Results from one page, or even ten pages, are not sufficient to make a reasoned decision. Use a larger sample, based on the size of the entire collection, to provide an appropriate testbed for making an informed choice between the available options. Too often we assume a small sample represents actual performance and accept quick support of our existing preferences as evidence of their superiority. To make good decisions about the balance of cost and accuracy, we must use all available information, including accurate performance measures from OCR and its alternatives.

Monday, July 10, 2006

No more magic bullets

This week OCLC announced WorldCat.org, a freely available site for searching WorldCat data, which will be released in August 2006. Here’s their one-sentence explanation of its purpose:

Where Open WorldCat inserts "Find in a Library" results within regular search engine results, WorldCat.org provides a permanent destination page and search box that lets a broader range of people discover the riches of library-held materials cataloged in the WorldCat database.

I’m a huge fan of this addition to the OCLC arsenal. I’m also a fan of Open WorldCat, however. I think these two tools need to work together (and along with many others) to provide the full set of services our users need. Like others, I use various tricks to limit search engine results to Open WorldCat items when I’m looking for basic information about a book I know exists, and, like others, I’ve never seen an Open WorldCat item appear in a Google search result set that wasn’t intentionally limited to Open WorldCat results. While Open WorldCat has its benefits, it can’t be all things to all users.

And there’s the rub: in libraries (and, to be fair, in many other fields as well) we tend to think there’s a magic solution. We just need to be more like Google. Federated searching is the answer. If we had Endeca, like NCSU, everything would be fine. Shelf-ready cataloging will make all of this affordable. Put like this, it sounds absurd. Yet the magic bullet theory drives all too many library decision-making processes. Of course, only by combining these and many other technologies in innovative ways will we make the substantive changes needed in the discovery systems libraries provide to our users. Systems of different scope need different means of presenting search results. A system with a tightly-controlled scope may be able to present search results in a simple list (note: these are few and far between!). The wider the scope of the system, with regard to format, genre, and subject, the more sophisticated we need to get in presenting the search results. Grouping, drilling down, and dynamic expanding and refining of results all need to be incorporated into our next-generation systems. Single books in Google results aren’t going to cut it – we need to find ways to represent groups of our materials in aggregated resources.

For many user needs, sophisticated searching options for a specific genre or format of resource are absolutely essential. For others, more generic access to a variety of resources is the appropriate approach. Flexibility is the key, and the data we’re talking about here will never live in a single location. Mashups of data from multiple sources, presented with a variety of interfaces and search systems, can provide the advanced access envisioned here. We need to stop accepting the quick fix; instead, we must broaden our expectations, and move forward evaluating every option as to its place in the grand vision.

Wednesday, June 28, 2006

A quest for better metadata

I wasn’t able to attend the ACM/IEEE Joint Conference on Digital Libraries this year, but the buzz surrounding the paper by Carl Lagoze (et al) about the challenges faced by a large aggregator despite using supposedly low-barrier methods such as OAI led me to look up the written version of this paper. This paper demonstrates very well that no matter how “low-barrier” the technology (OAI) or the metadata schema (DC), bad metadata makes life difficult for aggregators. Garbage in, garbage out has been a truism for some time, and the “magic” behind current technology can help, but can only go so far to mitigate poor input.

There has been spirited discussion in the library world recently about next generation catalogs, but that discussion has heavily centered on systems rather than the data that drives them. I’d argue that one needs both highly functional systems and good data in order to provide the sorts of access our users demand. How we get that good data is what I’ve been interested in recently. Humans generating it the way libraries currently do is one part of a larger-scale solution, but given the current ratio of interesting resources to funding for humans to describe them, we must find other means to supplement our current approach.

So what might we do? Here are my thoughts:

  • Tap into our users. There are a whole lot of people out there that know and care a lot more about our resources than Random J. Cataloger. Let’s harness the knowledge and passion of those users, and provide systems that let them quickly and easily share what they know with us and other users.

  • Get more out of existing library data. As Lorcan Dempsey says, we should “make our data work harder.” Although MARC and other library descriptive traditions have many limitations in light of next-generation systems, they still represent a substantial corpus of data that we must use as a basis for future enhancements. Let’s use any and all techniques at our disposal to transform this data into that which drives these next-generation systems.

  • Look outside of libraries. Libraries do things differently than publishers, vendors, enthusiasts, and many other communities that create and use metadata. We should keep in mind the cliché, “Different is not necessarily better.” We need to both look at ways of mining existing metadata from other communities to meet our needs, and re-examine the way we structure our metadata with specific user functions in mind.

  • Put more IR techniques into production. Information retrieval research provides a wide variety of techniques to better process metadata from libraries and other communities. Simple field-to-field mapping is only a portion of what we can make this existing data do for us. We must work with IR experts to push our existing data farther. IR techniques can also be made to work not just on metadata but the data itself. Document summarization, automatic metadata generation, and content-based searching of text, still images, audio, and video can all provide additional data points for our systems to operate upon.

  • Develop better cooperative models. Libraries have a history of cooperative cataloging, yet this process is anything but streamlined. We simply must get away from models where every library hosts local copies of records, and each of those records receives individual attention: staff change, enhance, even remove (!) data for local purposes. Any edits or enhancements performed by one should benefit all, and the current networked environment can support this approach much better than was possible when cooperative cataloging systems were first developed.
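As one tiny illustration of the IR point above: even naive term-frequency counting over full text yields candidate subject keywords, a crude form of automatic metadata generation. The function, sample text, and stopword list below are my own invented sketch, not anything from a production system:

```python
from collections import Counter
import re

# Toy sketch of automatic metadata generation via term frequency.
# The stopword list is a tiny invented sample, not a real IR resource.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}

def candidate_keywords(text: str, n: int = 3) -> list:
    """Return the n most frequent non-stopword terms as candidate keywords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return [w for w, _ in counts.most_common(n)]

doc = ("Whaling voyages and whaling ships fill the novel; "
       "the whaling industry frames every chapter of the voyage.")
print(candidate_keywords(doc)[0])  # 'whaling' dominates by frequency
```

Real IR techniques go far beyond this, of course, but even this crude counting produces a data point a discovery system could use alongside human-created metadata.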

My point is, we can’t plug our ears, sing a song, and keep doing things the way we have been doing. Let’s make use of the developments around us, contribute the expertise we have, and all benefit as a result.

Saturday, June 24, 2006

Finding new perspectives

I spent last week at a conference with an extremely diverse group of attendees. Almost all were trained musicians; among these were traditional humanist scholars, librarians of all sorts, and a smattering of technologists. I spoke at two sessions, each on a topic related to how library systems might better meet the needs of our users. I was pleasantly surprised by the environment in these sessions, and in the conference as a whole.

Due to the diversity of attendees, I had feared that my ideas might be either rejected wholesale in light of very real and valid practical concerns, or ignored due to a perception that they were irrelevant to the work of many attendees. I was wrong. I had many stimulating, mutually idea-generating discussions with other attendees, most of whom don't spend their time thinking about system design like I'm lucky enough to do. My habit of thinking big and not being satisfied with what current systems deliver was greeted with a great deal of enthusiasm, showing me in no uncertain terms just how connected and devoted many librarians (and those in related fields) are to the needs of our users. Perhaps those who disagreed with my approach were simply being polite in not expressing major differences in perspective publicly or privately (it was an international conference, and I admit to not fully understanding all the cultural factors at work). I hope not; or at least I'd like to think that such disagreements could take the form of a collegial conversation that starts in a session and continues afterward to the mutual benefit of both parties. But, then again, I can be an optimist about such things.

Perhaps the most surprising thing was that my point of view wasn't the most progressive there. I had a number of conversations with attendees whose vision was broader, more visionary, more of a departure from the current environment than mine. I view myself as striking a reasonable compromise between vision and practicality in the digital library realm, but my preconception of this conference was that I would be very far outside the attendees' respective norms. I was certainly on that side, and it was good to see I had company, and even a few compatriots that were further out to stimulate discussion.

What I took away was that we in the digital library world have a tendency to navel-gaze, to think we're the only ones that can plan our next-generation systems. This week I found an excellent cross-section of groups we need to more fully engage in this discussion. Without them and others like them, we're missing vital ideas.

Monday, May 29, 2006

An RDF Revelation

While doing some reading recently, I had an RDF revelation. I've long felt I didn't really get RDF. This time, the parts that sunk in made a bit more sense. I'm not a convert in this particular religious war, but I do feel like I now understand both sides a bit better.

I've read the W3C RDF Primer before; several times, I think. The first thing that struck me this time was a simple fact I know I'd read before but had forgotten--that an object can be either a URIref or a literal (a URI referencing a definition elsewhere, or a string containing a human-readable value). This means the strict machine-readable definitions RDF strives for are potentially only half there--only the predicate (the relationship between the subject and object) is expected to be a reference to a presumably shared source. I assume this option exists for ease of use. Certainly building up an infrastructure that allows all values to be referenced rather than declared would represent unreasonable startup time. This sort of thing is better done in an evolutionary fashion rather than forced at the start; a reasonable decision on the part of RDF.

RDF contains some other constructs to make things easier, for example, blank nodes to group a set of nodes (or, in the words of the primer, provide "the necessary connectivity"). Blank nodes are a further feature that allows entities to go without formal identification. The primer discusses a case using a blank node to describe a person, rather than relying on a URI such as an email address as an identifier for that person. A convenient feature, certainly, but also a step away from the formal structures envisioned in Semantic Web Nirvana.
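The two escape hatches described above (literal objects and blank nodes) can be sketched in a few lines. This is a toy model of the triple structure I wrote for illustration; the class names and example URIs are my own, not part of any RDF toolkit:

```python
from dataclasses import dataclass

# Hand-rolled sketch of the three RDF node types: URIrefs, literals,
# and blank nodes. Illustrative only -- not a real RDF library.

@dataclass(frozen=True)
class URIRef:
    uri: str

@dataclass(frozen=True)
class Literal:
    value: str

@dataclass(frozen=True)
class BNode:
    label: str  # blank node: groups statements without a formal identifier

DC_CREATOR = URIRef("http://purl.org/dc/elements/1.1/creator")
FOAF_NAME = URIRef("http://xmlns.com/foaf/0.1/name")

book = URIRef("http://example.org/book/moby-dick")
author = BNode("b1")  # a person described without committing to any URI

triples = [
    # The predicate is always a URIref; the object may be a human-readable literal...
    (book, DC_CREATOR, Literal("Herman Melville")),
    # ...or a URIref pointing at a shared, machine-resolvable identity...
    (book, DC_CREATOR, URIRef("http://example.org/person/melville")),
    # ...and a blank node lets us describe the author with no identifier at all.
    (author, FOAF_NAME, Literal("Herman Melville")),
]

fully_identified = [t for t in triples if isinstance(t[2], URIRef)]
print(f"{len(fully_identified)} of {len(triples)} statements use only shared identifiers")
```

The last line makes the "only half there" point concrete: only one of the three statements is fully machine-resolvable, and the blank-node statement can't be merged with descriptions of the same person from other sources.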

So now I'm looking at the whole XML vs. RDF discussion much more as a continuum than as opposing philosophical perspectives. The general tenor of RDF is that it expects everything to be declared in an extremely formal manner. But there are reasonable exceptions to that model, and RDF seems to make them. I'd argue now that both RDF and XML represent practical compromises. Both strive for interoperability in their own way. It's just a question of degree: does one expect a metadata creator to check existing vocabularies, sources, and models for individual concepts (RDF-ish), or for representing entire resources (XML-ish)? I see the value of RDF for use in unpredictable environments. Yet I'm still not convinced our library applications are ready for it. The reality is that libraries are still for the most part sharing metadata in highly controlled environments where some human semantic understanding is present in the chain somewhere (even in big aggregations like OAIster). (Of course, if we had more machine-understandable data, that human step would be less essential...)

I'm a big champion of two-way sharing of metadata between library sources and the "outside world." I just don't think the applications that can make use of RDF metadata for this purpose are yet mature enough to make it worth the extra development time on my end. And, again, the reality is that it really would take significant extra development time for me. The metadata standards libraries use are overwhelmingly XML-based rather than RDF-based. XML tools are much more mature than RDF tools. I fully understand the power of the RDF vision. But this is one area where I just can't be the one pushing the envelope.

Monday, May 15, 2006

Whither book scanning

A recent New York Times Magazine article entitled Scan This Book! by Kevin Kelly is getting a great deal of attention in the library world. The article describes the current digitization landscape, discussing the Google book project among other initiatives, and lays out both the potential benefits of and current challenges to the grand vision of a digitized, hyperlinked world. I was specifically glad to see the discussion not just centering on books, but on other forms of information and expression as well. However, library folk are starting in on our usual reactions to such pieces: finding factual errors, talking about how tags and controlled subjects aren't mutually exclusive, pointing out the economics of digitization efforts, and noting that the digitization part is only the first step and the rest is much more difficult. All of these points are perfectly valid.

Yet even though these criticisms might be correct, I think that we do ourselves a disservice by letting knee-jerk reactions to "outsiders" talking about our books take center stage. Librarians have a great deal to offer to the digitization discussion. We've done some impressive demonstrations of the potential for information resources in the networked world. Yet we don't have a corner on this particular market. Like any group with a long history, we can be pathetically short-sighted about changes we're facing. I believe it would be a fatal mistake to believe we can face this future alone. We have solid experience and many ideas to bring to Kelly's vision for the information future. However, we simply can't do it alone, and not just for economic reasons. We must listen to other perspectives, just as we expect search engines, publishers, and others we might be working with to listen to ours. Let's keep our defensiveness in check and start a dialog with those who are interested in these efforts, instead of finding ways to criticize them.

Tuesday, May 09, 2006

On the theoretical and the practical

When I do metadata training, I make a point to talk about theoretical issues first, to help set the stage for why we make the decisions we do. Only then do I give practical advice on approaches to specific situations. I’m a firm believer in that old cliché about teaching a man to fish, and think that doing any digital library project involves creative decision-making, applying general principles rather than hard-and-fast rules.

Yet the feedback I get from these sessions frequently ranks practical advice as the most useful part of the training. I struggle with how to structure these training sessions based on the difference between what I think is important and what others find useful. I learned to make good metadata decisions first by implementing rules and procedures developed by others, and only later to develop those rules and procedures myself. It should make sense that others would learn the same way.

The difference is that I learned these methods over a long period of time. The training sessions I teach don’t ever claim to provide anyone with everything they would need to know to be a metadata expert. Instead, their goal is to provide participants with the tools they need to start thinking about their digital projects. I expect each of them will have many more questions and ample opportunity to apply the theory presented to them as they begin planning for digital projects. This is where I see the theoretical foundation for metadata decisions coming into play. I can’t possibly provide enough practical advice to meet every need in the room; I can, however, make a reasonable attempt to cover the theoretical issues that will help participants meet those needs themselves.

I realize the theory (why we do things) can be an overwhelming introduction to the metadata landscape. Without any practical grounding, it doesn’t make any sense. Yet I know it’s essential in order to plan even one digital project, much less many. I and many others out there need to continue to improve the methods by which we train others to create consistent, high-quality, shareable metadata, finding the appropriate balance between giving a theoretical foundation and providing practical advice.

Friday, April 28, 2006

Thesauri and controlled vocabularies

I had a very interesting conversation recently with two colleagues about the differences between thesauri and controlled vocabularies. Both of these colleagues are developers who work in my department. One is finishing up a Ph.D. in Computer Science, is currently in charge of system design for a major initiative of ours, and has a knack for seeing all the aspects of a problem before finding the right solution; the other is a database guru with whom I've collaborated on some very interesting research, and who has just started pursuing an M.L.S. to add to his already considerable expertise. I like and respect both of these individuals a great deal.

The interesting conversation began when the database-guru-and-soon-to-be-librarian (DGASTBL) (geez, that's not any better, is it?) asked if the terms "controlled vocabulary" and "thesaurus" are used interchangeably in the library world. He asked because, from our previous work together and a solid basis in these concepts, he knew they really aren't the same thing, yet he had seen them used in print in ways that didn't match his (correct) understanding. The high-level system diagram we had at the time had a box for "vocabulary" which was intended to handle thesaurus lookups for the system. We discussed how a more precise representation of that diagram would have an outer box for "vocabulary" to handle things like name authority files and subject vocabularies with lead-in terms but no other relationships, and an inner box for "thesauri" (as a subset of controlled vocabularies) with full syndetic structures that the system could make use of. We lamented that the outer label this scenario requires, "controlled vocabulary," isn't as sexy as its subset, "thesauri." The latter sounds a great deal more interesting when describing a system to those not involved in developing it.

The system designer then presented a different perspective on the issue. While the librarian types considered thesauri a subset of controlled vocabularies (perhaps partly for historical reasons - we've been using loosely controlled vocabularies a lot longer than true thesauri), the system designer viewed the situation as the opposite - that controlled vocabularies were a specific type of thesaurus using only one type of relationship (the synonym), or perhaps also some rudimentary broader/narrower relationships that don't qualify as a true thesaurus (think LCSH). I found the difference in point of view interesting - the C.S. perspective expected a completely structured approach to the vocabulary problem, while the library perspective represented an evolving view that has never quite gotten to the point where we can make robust use of this data in our systems. It struck me that the system designer's perspective in this conversation was overly optimistic as to the state of controlled vocabularies in libraries.

Yet there's light at the end of this particular tunnel. Production systems in digital libraries are starting to emerge that make good use of controlled vocabularies in search systems, rather than relying on users to consult external vocabulary resources before searching. Libraries have not taken advantage of the revolution in search systems shifting many functions from the user to the system (think spell-checking), to our supreme discredit. Making better use of these vocabularies and thesauri is one way of shifting this burden. I hope this integration of vocabularies into search systems will push the development of these vocabularies further and make them more useful as system tools rather than just cataloger tools. By providing search systems that can integrate this structured metadata, we can improve discovery in ways not currently provided by either library catalogs or mainstream search engines.

Monday, April 17, 2006

"Orienteering" as an information seeking strategy

I was introduced today to the notion of "orienteering" as an information seeking strategy, through a paper presented at the CHI 2004 conference by Jamie Teevan and several colleagues. The paper discusses orienteering as a strategy by which users make "small steps...used to narrow in on the target" rather than simply typing words in a search box. For some time, I've been struggling to articulate the differences between the search engine model, with a wide-open box for typing in a search, and the library model, with vast resources but a need for users to know ahead of time which of those resources are relevant to their search. This paper very clearly spoke to me, by demonstrating that real users (to use one of my favorite phrases) are somewhere in the middle.

Users have resources they like. We prefer one map site over another, one news site over another, one author over another. And we know where each of our preferred resources can be accessed. For many types of information needs, we know the right place (for us) to start looking. Even as we make the hidden Web more accessible, the resource (like an email) we need often won't be something a generic Web search engine can get to. But for many information needs, a box and "I'm feeling lucky" is an effective solution. I think the point is that we need a wide variety of discovery models to match the wide variety of our searching needs. We can't expect all users to start with the "right" resource (what's "right"?), but we should provide seamless methods for users to move, step by step, towards what they're looking for.

Thursday, April 06, 2006

TechEssence launched

I was recently honored to be asked to participate with a stunningly informed and diverse group of library technology types in an online initiative called TechEssence. TechEssence is envisioned as a rich resource for library decision-makers to learn just enough about a wide variety of technologies to allow them to make good decisions. I'm a big fan of this approach - not everyone can know everything, and many of us need succinct information with just the right amount of evaluation from those with experience. As of yesterday, the site is now officially launched!

Here's a summary from Roy Tennant, our fearless leader:
The essence of technology for library decision-makers

A new web site and collaborative blog on technology for library decision-makers is now available at

TechEssence provides library managers with summary information about library technologies, suggested criteria for decision-making, and links to resources for more information on essential library technologies.

A collaborative blog provides centralized access to some of the best writers in the field. By subscribing to the RSS feed of the blog, you will be able to keep tabs on the latest trends of which library administrators should be aware.

To accomplish this I am joined by a truly amazing group:

* Andrew Pace
* Dorothea Salo
* Eric Lease Morgan
* Jenn Riley
* Jerry Kuntz
* Marshall Breeding
* Meredith Farkas
* Thomas Dowling

For more information on the group, see our "about us" page at

Wednesday, April 05, 2006

Library digitization efforts

Many libraries are seeing efforts such as the Google Books Library Project, and think they need to follow suit by digitizing books in order not to be left behind. I worry that many of these libraries are jumping in just to be on the bandwagon, without fully considering where their efforts fit in with those of others. Digitizing books, performing dirty OCR, and making use of existing metadata is about as easy as it gets in the digital library world (not that this is exactly a walk in the park), so it's an attractive option for libraries looking to make a splash with their first efforts to deliver their local collections online.

I argue that this is not the right approach for most libraries. The impact libraries are looking for as a result of digitization of local collections is achieved through the right ratio of benefit to users versus cost to the library. While the costs to the library are lower to digitize already-described, published books sitting on the shelves, the benefits are also lower than focusing on other types of materials (more on which materials I'm thinking of later...). We already have reasonable access to the books in our collections. I'll be the first to go on and on ad infinitum about the poor intellectual access we currently provide to our library materials. But there is some intellectual access. For books a library doesn't own, interlibrary loan is a slightly cumbersome but mostly reasonable method of delivering a title to a user. There are also a (comparatively) great many digitized books out there already, but no good registries of what's digitized and what isn't, nor good ways to share digital versions when they do exist and the institution that owns the files is willing to share. Take the Google project - they're digitizing collections from five major research libraries, yet libraries planning digitization projects don't have access to lists of materials being digitized as part of this project, even though we expect to have some (not complete) access to these materials through Google's services at some point in the next few years. Even though library collections have surprisingly little duplication, a library embarking on a digitization project for published books would be duplicating effort already spent to some non-negligible extent.

Libraries in the aggregate hold almost unimaginably vast amounts of material. We're simply never going to get around to digitizing all of it, or even the proportion we would select given any reasonable set of selection guidelines. A vanishingly small proportion of these materials are the "easy" type - books, published, with MARC records. The vast majority are rare or unique materials: historical photographs, letters, sound recordings, original works of art, rare imprints. These sorts of materials generally have grossly inadequate or no networked method of intellectual discovery. While digitizing and delivering these collections online would take more time, effort, and money than published collections, I believe strongly that the increase in benefit greatly outweighs the additional costs. In the end, the impact of focusing our efforts on classes of materials that we currently underserve will be greater than taking the easy road. Our money is better spent focusing on those materials that are held by individual libraries, held by few or no others, and to which virtually no intellectual access exists. Isn't this preferable to spending our money digitizing published books to which current access is reasonable, if not perfect?

Tuesday, April 04, 2006

On metadata "experts"

I'm often asked how one gets the skills required to do my job as a Metadata Librarian. My answer is one I can't stress strongly enough: experience. We need to know the theoretical foundation of what we do inside and out, and need to constantly think about why we're doing something - the big picture. But theory is not enough. The only way to become skilled at making good metadata decisions is practice--seeing what happens as a result of an approach and improving on that approach the next time. No matter how many times I've done a certain type of task, I see the next repetition as a way to re-use good decisions and re-think others.

I've found the metadata community in libraries to be a very open one. When I'm starting on a task that I haven't done before, I use what I can from my experience with similar tasks. But I also ask around for advice from others who do have that experience. "Metadata" is a very big and diverse area of work. Even with the best abstract thinking, applying known principles to new environments, we all often need a boost for getting started from someone who has been through a given situation.

I'm skeptical of the idea of "experts" overall. These things are all relative - only once you start learning enough to be able to effectively share what you've learned with others do you truly realize how much you still have to learn. I put much more stock in the goal of becoming good at thinking about generalized solutions, good at making decisions for classes of problems rather than simply repeating specific implementations over and over. I'm not a programmer, and neither are many in the metadata librarian community. Yet the type of thinking that makes a good programmer can, in my opinion, also make the best metadata experts.

Saturday, March 18, 2006

What exactly is the "catalog"?

From reading UC's Bibliographic Services Task Force report on "Rethinking How We Provide Bibliographic Services for the University of California," and participating in an initiative to write a similar white paper at MPOW, I've been thinking a great deal recently about what people mean when they talk about the future of the "catalog" or the "OPAC" in libraries.

Many people, when referring to the future of the catalog, mean the future of MARC. These arguments tend to center around how we can adapt the MARC record to handle new types of materials. Others mean the cataloging/metadata creation system present in the library's Integrated Library System. Many vendors (and OCLC) are talking about including metadata formats other than MARC in these systems. This sounds like a reasonable idea on the surface, but given the track records of ILS vendors, I'm not holding my breath for this one to work out very well. Another common usage is to mean "those things which the library owns," but this model has become problematic with the advent of licensed and free online resources, so this meaning is falling out of use.

I do think we need to figure out what systems locally-created metadata will go into. However, it's not realistic to expect we're moving towards an environment in which everything our patrons want access to is in a single database. As I'm fond of saying in this context, "That ship has sailed." Consider article-level access to the journal literature as the "elephant in the room" example of this phenomenon. Many vendors provide databases for this purpose that we happily subscribe to. It would be madness for libraries to try to replicate this information. We need to focus our attention instead on systems to make all the various information sources (including the catalog!) work together to provide seamless access to our users. Federated searching products on the market today are a step in this direction, but I've been decidedly underwhelmed by their functionality. We have a long way to go, one step at a time.

After all of this, I'm still not sure what my definition of "catalog" in ten years will be. I toyed briefly with the idea of "metadata records we created locally," but our current models, with my library having a local copy of a shared record, don't really fit with that definition. It could be something more like "records we manage locally," but that seems too administrative to be useful to anyone other than ourselves. Perhaps we should just bite the bullet and call the as-yet-still-imaginary single front end to all resources of possible interest to our users the catalog.

We're starting to tackle these issues in a big way, and I hope we can continue to make progress by agreeing on some semantics so we're not constantly talking past each other.

Tuesday, March 07, 2006

RSS feed gone

For some reason the conversion from the native Blogger Atom feed to RSS via 2RSS for Inquiring Librarian (and a bunch of other Blogger blogs, I see!) seems to be down. I haven't had time to look into this yet, unfortunately. So in the meantime, you can subscribe to the Atom feed. 'Course, if you read this via the RSS feed, you won't be seeing this note...

Wednesday, March 01, 2006

More of four

Yikes! I'm way behind in my blog reading. As I start to catch up, I find I was pseudo-tagged for 4 things by Kevin two weeks ago. Congrats on the new addition to the family, Kevin. Seems like there must be something in the water up there in Princeton! ;-)

I see a variety of categories floating around out there. I'll pick my favorites.

4 Jobs I’ve Had

Metadata Librarian, Indiana University Digital Library Program
Circulation Supervisor, Indiana University Cook Music Library
Phone answerer/order taker, TIS Music Catalog
Camp counselor at the Brevard Music Center

4 Places I've Lived
Bloomington, IN
Coral Gables, FL
Marietta, GA
Satellite Beach, FL

4 Movies I Can Watch Over & Over
When Harry Met Sally
The Empire Strikes Back
Moulin Rouge (I have no idea why.)

4 TV Shows I Love To Watch
The Simpsons
Baseball (does that count?)

4 Novels
Lucia, Lucia by Adriana Trigiani
The Handmaid's Tale by Margaret Atwood
The Foundation Series by Isaac Asimov
The Stand by Stephen King

4 Places I’ve Been On Vacation
Gatlinburg, TN (it's a family thing…)
The Grand Canyon
Las Vegas
Walt Disney World (the Florida one)

4 Favorite Dishes
Pasta primavera (that's with veggies)
Potatoes, cooked any way
Salads with nuts and dried fruit
Chicken Breasts Stuffed with Fontina and Sun-Dried Tomato Sauce (mmm…)

4 Websites I Visit Daily
Charles W. Cushman Photograph Collection
The Onion (well, maybe weekly, not daily)

4 Places I’d Rather Be
Hanging out with my puppy (aww)
The Grand Canyon
Camping in the middle of nowhere
Right where I am

4 Bloggers I’m Tagging
This tagging thing has really made the rounds in the biblioblogosphere, but here are 4 blogs I read that I haven't seen 4 things on. My apologies if you've already been tagged and I missed it.

Vampire Librarian
Audio Artifacts
The FRBR Blog

FRBRizing Find in a Library

Wow! I go traveling for a while and find all sorts of interesting things have happened while I was gone! OCLC's Open WorldCat now has FRBRized results. This is pretty darn cool. But I can't help thinking, yet again, that it hasn't quite gone far enough. I know, one step at a time. I have to do that in my job too. But I like thinking big, and I know the folks at OCLC Research like thinking big too. They've done a great job with Open WorldCat so far, and I hope they keep pushing the envelope.

Soooooo, how about limiting not just by format? What about language? There are probably other options beyond these that I'm not thinking of right now.

Also, as an extension to that last idea, how about mechanisms for moving about between related works? Do a search on "Gone with the Wind" in Google, limiting to the Find in a Library service. The novel, film, film score, etc., are all separate search results and once you pick one, I don't see a way to know the others exist. Yeah, I know there's no consensus on whether or not the novel and film version of Gone With the Wind are separate Works or two Expressions of the same work. Regardless, shouldn't we let our users move between them? Please?

I'm a huge FRBR fan, as I think it gives us a very useful model for thinking about the relationships between things. But I think perhaps right now we're getting a bit too bogged down in the terminology when we start building services like this - a Work is selected, all Expressions are displayed, etc. - and we can forget that the exact definitions of these things aren't useful to our patrons. We should take full advantage of these relationships, and make sure our patrons can get between a film and a novel, even if they're separate Works related to one another. This leads me toward my current favorite rant about FRBR seeming to sideline Work relationships with this "Aggregation" idea, but I'll save that one for when I have more time...
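A minimal sketch of the navigation I'm asking for, assuming nothing about OCLC's actual data model (the Work class, its fields, and the sample records below are all invented for illustration):

```python
from dataclasses import dataclass, field

# Toy FRBR-flavored entity with work-to-work links. The point: whether the
# novel and film are modeled as two Works or two Expressions, the system
# should let a user hop between them from either search result.

@dataclass
class Work:
    title: str
    expressions: list = field(default_factory=list)  # e.g. text, translation, score
    related: list = field(default_factory=list)      # links to other Works

novel = Work("Gone with the Wind (novel)", ["English text", "French translation"])
film = Work("Gone with the Wind (film)", ["1939 motion picture"])

# Link them in both directions, sidestepping the Work-vs-Expression debate:
novel.related.append(film)
film.related.append(novel)

# From a hit on the novel, a user can now discover the film exists:
print([w.title for w in novel.related])
```

The modeling debate can continue in the background; the user-facing behavior (a "related works" link on every result) doesn't have to wait for it to be settled.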

Wednesday, February 22, 2006

Back to the Basics

I spend a large proportion of my time thinking about pretty advanced library-type systems, and how we can always go one step farther in providing better access to our materials for our users. But every once in a while I hear someone talking or experience something that makes me step back and think about the basics, why we do this in the first place.

I've been an avid reader from a very young age. My biggest relief in finishing graduate school was that I could read books for fun again, without feeling like I should be reading something else (OK, well I still do this a bit because I'm always behind on my professional reading, but you get the idea…). The recent release of the film version of C.S. Lewis' The Lion, the Witch, and the Wardrobe pushed me to re-read the Chronicles of Narnia series. I haven't read these books since I was something like 10 or 12 years old. Reading The Lion, the Witch, and the Wardrobe again was nothing short of a magical experience for me. I'd long forgotten the details of the story or even perhaps the major themes. But every page I turned while reading brought back a flood of memories and an overwhelming nostalgia. I did know what was coming next once I dove in. I did remember meanings behind the actions as I came close to them in the story. I completely lost myself in the book and read it through in two short sittings.

What fun it was to simply sit back and enjoy a book for its own sake. Information of any sort can be this enlightening to the right user. I'm going to remember that.

Sunday, February 12, 2006

Copyright for Sound Recordings

I've been catching up on reading I've been meaning to do while traveling recently. I found the CLIR report on Copyright Issues Relevant to Digital Preservation and Dissemination of Pre-1972 Commercial Sound Recordings by Libraries and Archives to be very interesting. Like discussions of copyright issues often must be, this report tends towards scenarios, likelihoods, and trends rather than absolute conclusions. I think that's OK. Even if there are no easy answers, knowing more about the issues involved is certainly beneficial.

Perhaps the most interesting part of this report is the discussion of how state copyright laws still affect audio preservation activities in libraries. The report's appendix summarizes state laws in California, Illinois, Michigan, New York, and Virginia. Each of the states examined includes language in the criminal statute cited by the report indicating that reproduction for profit or commercial gain is illegal. Some, but not all, include specific exemptions for educational or non-profit use (under which library preservation activities would presumably fall?), but all specifically say what's illegal is profiting from the copying. This is a very different tone from today's discussions of copyright issues, where intent rarely enters into the argument. I wasn't previously aware of this shift, and wonder if state laws such as these could help serve as models as federal copyright law undergoes future revision.

Tuesday, January 17, 2006

Next-generation catalogs

Bravo! I'll add my voice to the hubbub surrounding the announcement that NCSU has launched a new library catalog, representing a new model for user interaction. I'm a huge fan of the "narrow by" menus on the left-hand side. (I should be--we did this in a digital library system a few years ago: here's an example and a paper describing the project.) I believe some of the options presented here are more useful than others, but different options would be useful in different situations and picking the right default is tricky.

I also love the idea of browsing the collection in the OPAC. Long have librarians extolled the virtues of the serendipity of shelf browsing. Our catalogs can and should try to replicate this experience online, and allow other sorts of browsing our shelves don't provide.

Despite the increased functionality of the NCSU catalog, the results within any given set, regardless of the sort option chosen, are the same sort of jumble we see in more traditional OPACs. I'm thrilled to see that FRBR-like grouping is on the list for the next release.
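FRBR-like grouping is, at its simplest, a matter of collapsing a flat manifestation list under a work-level key. A rough sketch of the idea (the choice of author + title as a work key is a crude stand-in for real work identification, and the records are invented):

```python
from collections import defaultdict

# A flat result list of the kind a traditional OPAC returns (invented data).
hits = [
    {"author": "Lewis, C. S.", "title": "The Lion, the Witch and the Wardrobe", "year": 1950},
    {"author": "Lewis, C. S.", "title": "The Lion, the Witch and the Wardrobe", "year": 1994},
    {"author": "Lewis, C. S.", "title": "Prince Caspian", "year": 1951},
]

def group_by_work(results):
    """Collapse individual editions under a work-like (author, title) key."""
    works = defaultdict(list)
    for r in results:
        works[(r["author"], r["title"])].append(r)
    return works

for (author, title), editions in group_by_work(hits).items():
    print(f"{title} ({author}): {len(editions)} edition(s)")
```

The hard part, of course, is deciding what the work key really is when titles and headings vary; the grouping itself is the easy bit.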

It's too bad that NCSU had to go to a third party (and presumably shell out some big bucks) in order to provide this innovative service. I hope this demonstration will move more of us to press our OPAC vendors relentlessly for similar improvements, and to put our money where our mouths are.

Thursday, January 12, 2006

User-contributed metadata failure?

In November's D-Lib Magazine, there is an extensive article describing the development of the Digital Library for Earth System Education (DLESE). I was interested to learn more about this project, and delighted to see they had put in a method for end-users to provide descriptive information into their system. Unfortunately, the DLESE staff felt user contributions weren't the right way to go:

One approach that did not work well for DLESE was "community cataloging." The idea behind community cataloging was that educators would contribute to the library by cataloging a few of their favorite on-line educational resources through an easy-to-use web interface. In spite of considerable effort spent on developing the web-interface, guidelines and best practices documents, this approach yielded few resources and the community-cataloged metadata often turned out to be incomplete or incorrect. The community cataloging functionality has been replaced by a simple "Suggest a Resource" web form.

I'm disappointed to see an example of this approach actually put into practice and then rescinded. I haven't seen the "community cataloging" interface they used, so I don't know what sorts of tools existed to assist the user in providing accurate and complete data. But I do wonder how closely the community cataloging tool resembled a professional cataloger's tool. Today's library catalog systems are designed for use by experts. They don't assist in data entry in any meaningful way, and they rely on catalogers to make use of a vast amount of outside resources in order to create quality records. If a system for user-contributed metadata followed the same model (some empty boxes and a dense set of instructions on what to put in each of them), I'd predict that system would fail.

I believe in order to make good use of our users' expertise, we need to build interfaces on new models. These interfaces need to make it easy to do the right thing. Users don't have to create entire records, for example. Interfaces for user-contributed metadata could allow those who believe they have supplemental or corrected information for a resource to target their efforts to the bit of information they possess, rather than asking them to provide a complete descriptive record. Interfaces could limit the fields users are allowed to contribute or edit, or enforce strict datatyping for small bits of metadata in order to prevent simple data entry errors.
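The "limit the fields and enforce datatypes" idea can be sketched very simply. Here's one hypothetical shape for it; the field names and validation rules are mine, not DLESE's or anyone else's:

```python
import re

# Hypothetical per-field rules: which fields users may edit, and how
# values are checked before they ever reach the catalog record.
EDITABLE_FIELDS = {
    "publication_year": lambda v: re.fullmatch(r"\d{4}", v) is not None,
    "isbn": lambda v: re.fullmatch(r"\d{9}[\dX]|\d{13}", v) is not None,
    "summary": lambda v: 0 < len(v) <= 2000,
}

def accept_contribution(field, value):
    """Return True only for an editable field with a well-formed value."""
    check = EDITABLE_FIELDS.get(field)
    return bool(check and check(value))

print(accept_contribution("publication_year", "1936"))  # True
print(accept_contribution("publication_year", "193"))   # False: not a 4-digit year
print(accept_contribution("call_number", "PS3525"))     # False: field not open to edits
```

Even a rule set this crude rules out whole classes of the "incomplete or incorrect" contributions the DLESE article describes, without asking the user to learn anything about cataloging.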

User-contributed metadata is not just about shifting the effort from catalogers to end-users. It's really about supplementing our current practices with new models, in order to start to get a handle on the vast information landscape we face today.

Wednesday, January 04, 2006

Britannica vs. Wikipedia and a parallel in cataloging?

An interesting new commentary in the Britannica vs. Wikipedia discussion following an article comparing the accuracy of the two in Nature was published yesterday in the NY Times, with a clever title: The Nitpicking of the Masses vs. the Authority of the Experts (free registration may be required, yadda yadda yadda...). The NYT article supports Wikipedia pretty strongly, but it wasn't the conclusion that struck me. Rather, it was this idea:

"The idea that perfection can be achieved solely through deliberate effort and centralized control has been given the lie in biology with the success of Darwin and in economics with the failure of Marx."

I'm not overly informed on either Darwin or Marx, so I'll refrain from analyzing the validity of this claim. But reading it, I'm strongly reminded of arguments being made for library cataloging in the Google age. The effort and control described here could also apply to the effort and control spent in expert cataloging. But I believe the key here is the word solely. Just because effort and control in and of themselves don't get us where we want to be, doesn't mean we should abandon them entirely. It just means we can supplement them with other means and potentially end up better off.

The idea of "perfection" in this quote interests me as well. Proponents of the status quo in library cataloging frequently speak as if library catalog records are perfect. As if they are the pinnacle of description, meeting every user need, exactly right if only the rules are followed. But of course that's not true. The good news is that library catalogs and cataloging rules are evolving, and that our systems are just starting to make use of other sources of information supplementing those human-generated-through-blood-sweat-and-tears catalog records. There are many experts on our materials out there - each of our users has something useful to tell us about our resources. I think it's time we listen to them.

Tuesday, January 03, 2006

Ways of thinking and ways of representing

I've been thinking lately about the interaction between the spread of ideas and the labeling of those ideas. Every so often, a new technology trend spreads around, creating a buzz. But I'm getting to the point where the buzzwords no longer create much excitement for me, no longer represent to me a new way of thinking or approaching a problem. I suspect my changing attitude has two sources. First, I'm in touch with the field enough that I see the small bits of progress in ideas that precede the label and the hype. (Or at least once the trend gets a name I can identify signs of its development in hindsight!) Second, as I see more and more of these trends play out, I'm becoming more skeptical about the revolution each one promises. Often a single idea is represented as single-handedly altering the information landscape, but instead I for the most part see many factors converging to effect a change.

Rarely is the idea truly new and revolutionary once it gets a label. Consider "Web 2.0." One much-cited explanation from Tim O'Reilly appeared recently. The trend gets a label because it's emerging across many different implementers. In turn, the label inspires more implementers. But the O'Reilly article shows the label is intended to provide a convenient way of referring to an emerging paradigm, rather than as a means of causing a shift. As this article indicates, though, labels can quickly descend in common usage to mean the catalyst rather than an assessment of an existing trend. True interactivity, meaningful end-user participation, and personalization aren't the result of "Web 2.0." Rather, "Web 2.0" gives us an easy way to refer to these and other similar trends that together represent an emerging shift in the norm of the Web.

I don't mean to say Web 2.0 is a "meaningless marketing buzzword," as the O'Reilly article warns against. I do, however, think we need to remember that the label is not the buzz. The work of countless people over a period of time finding ways to make their ideas a reality, which happen to coalesce around a theme, is what's really important.