Monday, December 19, 2005

Out of the loop

Wow! I've been buried in deadlines recently, and am just starting to get caught up on email lists and the biblioblogosphere. Have I been missing out!

On Web4Lib, there was a ruckus recently about activism, copyright, and free expression. It at times descended into obscenities and name-calling, but also raised a number of thought-provoking questions about the information landscape and the maintenance of relevant and professional forums for discussion about issues that librarians should care about.

The Shifted Librarian, Jenny Levine, with a reasonable concern about the lack of comping registration fees for invited speakers at library conferences, sparked a rousing debate about conference economics, the value of institutional support for professional development, and a librarian's responsibility to give back to the profession.

I'm taking all of this as a reminder to step back from the inevitable daily emergencies and petty disagreements to think about the larger issues: why I'm a librarian in the first place, how I can contribute to our shared mission, and what our users really need in this day and age. I'm going to take some time over the next few weeks, as I have some time off (in between all the writing I have been putting off!), to reflect on these issues and re-focus my work. I hope everyone out there has a similar opportunity.

Thursday, December 08, 2005

DLF MODS Implementation Guidelines available for public comment and review

The Digital Library Federation's Aquifer Initiative is pleased to invite public review and comment on the DLF MODS Implementation Guidelines for Cultural Heritage Materials.

The primary goal of the Digital Library Federation's Aquifer Initiative is to enable distributed content to be used effectively by libraries and scholars for teaching, learning, and research. The provision of rich, shareable metadata for this distributed content is an important step towards this goal. To this end, the Metadata Working Group of the DLF Aquifer Initiative has developed a set of implementation guidelines for the Metadata Object Description Schema (MODS). These guidelines are meant specifically for metadata records that are to be shared (whether by the Open Archives Initiative Protocol for Metadata Harvesting (OAI PMH) or other means) and that describe digital cultural heritage and humanities-based scholarly resources. The Guidelines are available at http://www.diglib.org/aquifer/DLF_MODS_ImpGuidelines_ver4.pdf (pdf document about 470 kb).

In order to ensure the Implementation Guidelines are useful and coherent, we are collecting comments and feedback from the wider digital library community. We appreciate any and all comments, feedback, and questions. These may be sent to DLF-MODS-GUIDELINES-COMMENTS-L@LISTSERV.INDIANA.EDU. The deadline for comments and review is January 20, 2006.

DLF Aquifer Metadata Working Group:

Sarah Shreeves (Chair) - University of Illinois at Urbana-Champaign
John Chapman - University of Minnesota
Bill Landis - California Digital Library
Liz Milewicz - Emory University
David Reynolds - Johns Hopkins University
Jenn Riley - Indiana University
Gary Shawver - New York University

Tuesday, November 15, 2005

Learning cool new things

One of the things I love most about my job as a librarian is the enormous variety of content I get to work with. By partnering with content specialists for most if not all of our digital library projects, I get introduced to research areas I previously didn't know much about: the rise and fall of the "company town," Victorian literature, "the commons," etc. One of the more recent topics is The Chymistry of Isaac Newton, a project in which I'm only tangentially involved. But this is really cool stuff. You can learn more about it too in a Nova episode titled Newton's Dark Secrets, premiering tonight in most areas.

Saturday, November 05, 2005

True folksonomic thesauri?

I recently saw a thread on the Simile mailing list discussing possible uses of relationships between user-supplied tags in an information system. This idea is intriguing to me. I've long believed we don't use the relationships recorded in our library-land controlled vocabularies in our systems for end-users to anywhere near their potential. A digital library collection I've been involved with demonstrates ways in which we might use these relationships. The methodology used is documented in this paper.

Yet I'd never thought about relationships for folksonomic vocabularies before. I think it's a fantastic idea, however. The same strategies for improving end-user discovery based on term relationships can be used no matter where those relationships come from. Relationships determined by methods such as this could be used in the same way as human-generated relationships in a formal thesaurus. I wonder if these relationships might be even more important in a folksonomic environment, as a method by which the vocabulary control us library folk hold so dear could be achieved.
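
To make the idea concrete, here's a small Python sketch (the tag data, the threshold, and the function names are all my own inventions for illustration) of how co-occurrence among user-supplied tags might yield "related term" links that a discovery system could use for query expansion, much the way it might use RT references from a formal thesaurus.

from collections import defaultdict
from itertools import combinations

# Hypothetical user tagging data: each item's set of user-supplied tags.
tagged_items = {
    "item1": {"symphonies", "orchestra", "beethoven"},
    "item2": {"symphonies", "orchestra", "romanticism"},
    "item3": {"orchestra", "conducting"},
}

def related_tags(items, threshold=2):
    """Count how often two tags appear on the same item; pairs at or
    above the threshold are treated as 'related terms'."""
    counts = defaultdict(int)
    for tags in items.values():
        for a, b in combinations(sorted(tags), 2):
            counts[(a, b)] += 1
    related = defaultdict(set)
    for (a, b), n in counts.items():
        if n >= threshold:
            related[a].add(b)
            related[b].add(a)
    return related

def expand_query(term, related):
    """Expand a user's search term with its related tags."""
    return {term} | related.get(term, set())

rel = related_tags(tagged_items)
print(expand_query("symphonies", rel))   # e.g. {'symphonies', 'orchestra'}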

Wednesday, October 26, 2005

Hierarchical catalog records

The October issue of D-Lib Magazine has an article describing an experimental FRBR implementation in the Perseus Digital Library, by David Mimno, Gregory Crane, and Alison Jones, entitled Hierarchical Catalog Records. I'm thrilled to see reports of experiments like this be shared in outlets as widely read as D-Lib. I'm also extremely happy to see this particular experiment happen outside of the MARC environment. As some experiments we're conducting with MARC records for sound recordings progress, I've become more and more convinced that FRBRization really is a revolution in resource description. Statistics abound estimating the small percentage of works that exist in more than one expression, and expressions that exist in more than one manifestation. While I don't doubt these numbers at all, I think the way in which they're presented minimizes the enormity of the task ahead of us to reach a true FRBR environment.

I believe the same is true of efforts to use MARC for FRBRized records. The MARC format could be adapted for this purpose. But is it in our best interests to do so? Using MARC makes the task seem less scary, as if it won't be that difficult. But it is a difficult task, and we're fooling ourselves if we pretend otherwise. I wonder if we aren't better off addressing the issue head-on, admitting to a change with a new base record format. The change would be one of mind-set, rather than functionality.

I've mentioned I believe the FRBRization task is difficult. I don't believe difficult means impossible in this case, however. We don't yet have a good sense of the cost associated with such a conversion, so any claim to its value will be tempered by that uncertainty. But I am convinced of that value, and I believe studies like that of the Perseus Digital Library are vital in demonstrating it. No cost can be justified without first understanding the associated benefit. We have a great deal more work to do to reach that understanding.

Sunday, October 23, 2005

You know you go to too many conferences when...

...you look down halfway through the morning session and realize you're wearing the name tag from the wrong conference. Sigh.

Saturday, October 22, 2005

Separating data entry from data structure

I believe we've fallen extremely short in at least one area of potential for improving our cataloging and metadata creation systems -- user interfaces. We're still stuck in a mindset developed in the early days of the MARC format, whereby data is entered in the exact form in which it needs to be stored. When Web-based OPACs and cataloging modules emerged, cursory attempts to "improve" the interface appeared, but the changes were almost exclusively surface changes (labeling, etc.), and not implemented with community involvement.

But of course current technology provides many possibilities for a design layer in between the data entry interface and the data storage format. Metadata creation by humans is expensive. We need to do everything we can to design data entry interfaces that speed this process along, that help the cataloger to create high-quality data quickly. Visual cues, tab completion, and keyboard shortcuts are just a few simple tricks that could help. More fundamental approaches like automatic inclusion of boilerplate text and integration of controlled vocabularies could provide enormous strides forward.
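
As a purely hypothetical illustration of the kind of data entry help I have in mind, here's a little Python sketch of prefix completion against a controlled vocabulary plus boilerplate insertion. The vocabulary terms, the boilerplate text, and the function names are all made up; a real cataloging module would obviously need much more.

# Hypothetical controlled vocabulary and boilerplate, for illustration only.
VOCABULARY = ["Symphonies", "Symphonies (Band)", "Symphonies (Chamber orchestra)", "Sonatas"]
BOILERPLATE = {"rights": "Copyright status undetermined; contact the repository."}

def suggest(prefix, vocabulary=VOCABULARY, limit=5):
    """Offer tab-completion style suggestions as the cataloger types."""
    prefix = prefix.lower()
    return [term for term in vocabulary if term.lower().startswith(prefix)][:limit]

def new_record(title):
    """Start a record with boilerplate already filled in, so the cataloger
    only supplies what is unique to the resource."""
    return {"title": title, "rights": BOILERPLATE["rights"], "subjects": []}

print(suggest("symp"))   # ['Symphonies', 'Symphonies (Band)', 'Symphonies (Chamber orchestra)']
record = new_record("Fifth symphony")
record["subjects"].extend(suggest("symphonies ("))  # in a real UI the cataloger picks from the list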

Yet with all of this potential, I frequently (WAAAAAY too frequently) have conversations with librarians where it becomes clear they're focused exclusively on the data output format. It never even occurred to them that a system could do something with entered data that doesn't require cataloger involvement. (Man, I knew we librarians were control freaks, but this really takes the cake.) Of course, librarians aren't on the whole system designers. That's OK. But all librarians still need to be able to think creatively about possibilities. I'm convinced that the way forward here is to take the initiative to develop systems that demonstrate this potential, that show everyone what is possible with today's technology. Everyone has vision, yet that vision always has limits. By demonstrating explicitly a few steps forward from where we are, vision can then expand that much further.

Sunday, October 09, 2005

Museums and user-contributed metadata

It's funny how often, once one starts thinking about a subject, one finds examples of it absolutely everywhere. I've been thinking about user-contributed metadata for a while now in the context of a digital music library project, where we can provide innovative types of searching, if only we could find a way to make the creation of the robust metadata that drives it cost-effective. I wrote about this topic recently, inspired by OCLC's Wiki WorldCat pilot service.

So imagine my pleasure when, catching up on my reading this weekend, I came across "Social Terminology Enhancement through Vernacular Engagement" by David Bearman and Jennifer Trant in September's D-Lib Magazine. (Yes, I do know it's no longer September. Thanks for asking.) I'm thrilled to hear about this initiative, especially how well-developed it seems to be. I haven't yet followed the citations in the article to read any of the project documentation, but it certainly looks extensive. In the digital library (and museum!) world, I firmly believe ongoing documentation such as this associated with a project can be of as much or even more value than formally-published reports.

Two features of the "Steve" system described here strike me, making it clear there are many ways to implement systems that collect metadata from users. They also make me realize these decisions need to be made at the very beginning of a project, as they drive all other implementation decisions. The first is an assumption that the user interacting with the system is charged with the task of description rather than simply reacting to something they see and perceive either as an error or an omission. The user is interacting with the system for the purpose of contributing metadata; finding resources relevant to an information need is not the point. I suppose different users end up contributing with this model than with one that allows users to comment casually on resources they find in the course of doing other work. Different users might affect the "authoritativeness" of the metadata being contributed, but I wonder to what degree.

The second feature I find notable is that the system is designed to be folksonomic; there is no attempt at vocabulary control. Us library folk tend to start from the assumption that controlled vocabulary is better than uncontrolled and move on from there. At first glance, some of the reports from this project seem to resist that assumption, and start from the beginning looking for a real comparison. I'm anxious to read on.

Thursday, October 06, 2005

User-contributed metadata

OCLC recently announced the Wiki WorldCat pilot service. What a fantastic idea! Too bad I'm having trouble trying it out. I looked at a few books in Open WorldCat (via Google), including this one that I read recently and the book shown on the Wiki WorldCat page (The Da Vinci Code), and I didn't see the reviews tab or the links to add a table of contents or a note shown on the project page. Hmm. I wonder what I'm missing.

But, anyhoo... incorporating user-contributed metadata into library systems is something I've been thinking about for a while. Librarians tend to be pretty wedded to the notion of authority, that as curators of knowledge we're the best qualified folks out there to perform the documentation of bibliographic information. Assuming for a moment that this is true for some data elements, there are still several classes of data that could easily benefit from end-user involvement.

The first is detailed information from specialized domains. I work on a number of projects related to music. Information such as exactly which members of a jazz combo play on any given piece on a CD or the date of composition of a relatively obscure work is the sort of thing our catalogs could be providing to serve as research systems instead of just finding systems. But this sort of metadata is expensive to create; it requires research and domain expertise on the part of the cataloger. Many of our users, however, do have this specialized knowledge and love to share it.

Other information that end-users might appropriately supply includes tables of contents, instrumentation of a musical work, language of a text, and other "objective" information of this type. Before you say, "But what about standard terminology, spelling, capitalization?!?" in a panicked voice, consider basic interface capabilities in 21st-century systems such as picking values from provided lists rather than typing them in.

But should we restrict ourselves to these more obvious elements? I've been hoping for some time to be able to test various degrees of vetting of user-contributed metadata in a digital library system. I have in mind a completely open Wiki-type system, one that simply sends a suggestion to a cataloger, and a number of options in between. I suspect the quality of the user-contributed metadata will be overall much higher than critics assume. Yet even if it isn't, what sort of trade-off between quality and quantity are we willing to make? Traditional cataloging operations don't have extensive quality control processes, perhaps because QC is expensive work. And catalogers make mistakes, every day, just like the rest of us. Assuming a system where users can correct errors, how quickly will errors (made by a cataloger or by another end-user) be found and corrected? Will the "correct" data win out in the end? Surely these issues are worth a serious look.
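
Here's a rough sketch, in Python with invented policy names and data, of what those degrees of vetting might look like as a configuration choice in such a system. The three policies mirror the range described above: fully open, queued for cataloger review, and something in between.

from enum import Enum

class Vetting(Enum):
    OPEN = "open"          # wiki-style: changes go live immediately
    TRUSTED = "trusted"    # in between: known contributors go live, others are queued
    REVIEWED = "reviewed"  # every suggestion waits for a cataloger

def submit(record, field, value, user, policy, trusted_users=frozenset()):
    """Apply or queue a user's contribution depending on the vetting policy."""
    if policy is Vetting.OPEN or (policy is Vetting.TRUSTED and user in trusted_users):
        record[field] = value
        return "applied"
    record.setdefault("pending", []).append((field, value, user))
    return "queued for review"

record = {"title": "Kind of Blue"}
print(submit(record, "personnel", "Bill Evans, piano", "jazzfan42", Vetting.REVIEWED))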

Tuesday, September 27, 2005

The more things change...

I've just finished reading the short volume The MARC music format : from inception to publication / by Donald Seibert. MLA technical report, no. 13. Philadelphia : Music Library Association, 1982. The book is an account of the decision-making process involved in designing and implementing the MARC music format. I was both heartened and discouraged to read arguments in support of implementing MARC that mirror closely arguments I and others make today for moving beyond MARC.

The rationale behind the MARC music format reads full of hope, for improved access for users and higher quality data. Yet many of the improvements mentioned have not come to fruition. I'm heartened to see the vision represented here for the type of access we can and should be providing. Yet I'm discouraged to see more evidence that we haven't achieved this level of access in the time since the MARC format was implemented. I believe this serves to remind us that many factors other than database structure contribute to the success of a library system.

I also learned a valuable lesson from this text: ideas and potential alone are not enough to convince everyone that any given change is a good idea. A large percentage of librarians out there have heard these very arguments before and seen them not pan out. I do believe, however, that this time can be different. (Yes, I know how that sounds...) Computer systems are much more flexible than they were when the MARC music format was first implemented, and can be designed to alleviate more of the human effort than before. We've learned a great deal from automation and implementation of the MARC format that we can build on in the next generation library catalog. We have a long road ahead of us, but I think it's time to address these issues head-on once again. I'd like to believe we can leverage the experience of those like Donald Seibert involved in the first round of MARC implementation, together with experts in recent developments, to make progress towards our larger goal.

Sunday, September 18, 2005

The next big thing in searching?

At a conference last week, I heard Stephen Robertson of Microsoft Research Cambridge speak about the primacy of text in information retrieval, whether for text, images, or any other type of medium. He made a statement in the talk that the first generation of information retrieval systems operated on Boolean principles, and the second generation (our current systems) provide relevance-ranked lists. This may be a truism in the IR world, but it's something I hadn't thought about in these terms before. Our library systems certainly are primitive in terms of searching, and they operate on the Boolean model. But I hadn't thought of relevance ranking as the "next step" - probably because the control freak in me is suspicious of a definition of "relevance" not my own. But I think it's fine to look at the progression of IR systems in this way.

So what's the third generation? Where are we going next? I think the next step is grouping in search results. Grouping is where I see the power of Google-like search systems merging with library priorities like vocabulary control. Imagine systems that allow the user to explore (and refine) a result set by a specific meaning of a search term that has multiple meanings, by format, or by any number of other features meaningful to that user for that query at that time. I picture highly adaptive systems far more interactive than those we see today. I don't believe options for search refinement alone go far enough, as they require the user to deduce patterns in the result set. I believe systems should explicitly tell users about some of those patterns and use them to present the result set in a more meaningful way. Search engines like Clusty are starting to incorporate some of these ideas. It remains to be seen if they catch on.
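
A toy sketch of the kind of grouping I mean, in Python with an invented result list: the system detects a feature (here, the sense of the term or the format) and presents the result set organized by it rather than as one flat ranked list. The feature names and records are assumptions for illustration only.

from collections import defaultdict

# Invented search results for a query on "mercury", each already tagged with features.
results = [
    {"title": "Mercury: the planet", "format": "book", "sense": "planet"},
    {"title": "Mercury in thermometers", "format": "article", "sense": "element"},
    {"title": "Queen: greatest hits", "format": "sound recording", "sense": "Freddie Mercury"},
]

def group_by(results, feature):
    """Group a flat result list by one feature meaningful to the user."""
    groups = defaultdict(list)
    for r in results:
        groups[r.get(feature, "unknown")].append(r)
    return dict(groups)

# The same result set could be presented by sense of the term, or by format.
for sense, items in group_by(results, "sense").items():
    print(sense, [r["title"] for r in items])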

FRBR assumes this sort of grouping can be provided, using the different levels of group 1 entities. Discussions of FRBR displays frequently talk about presenting Expressions with a language for textual items, with a director for film, or with a performer for music, allowing users to select the Expression most useful to them before viewing Manifestations. What's missing is how the system knows what bits of information would be relevant for distinguishing between Expressions, since these bits of information will be different for different types of materials, and sometimes even with similar types of materials. We have a ways to go before the type of system I'm imagining reaches maturity.
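
For the FRBR case, here's a minimal Python sketch of how a display might label Expressions by a different attribute depending on the type of material (language for text, director for film, performer for music, as in the discussions mentioned above). The mapping is simply hard-coded here, which is exactly the part a real system would somehow need to know; all of the data and names are invented.

# Assumption for illustration: which Expression-level attribute best
# distinguishes Expressions of a Work, by type of material.
DISTINGUISHING_ATTRIBUTE = {
    "text": "language",
    "moving image": "director",
    "music": "performer",
}

expressions = [
    {"work": "Hamlet", "type": "text", "language": "English"},
    {"work": "Hamlet", "type": "text", "language": "German"},
    {"work": "Hamlet", "type": "moving image", "director": "Branagh"},
]

def expression_label(expr):
    """Label an Expression by the attribute most useful for its material type."""
    attr = DISTINGUISHING_ATTRIBUTE.get(expr["type"])
    return f'{expr["work"]} ({expr.get(attr, "unspecified")})' if attr else expr["work"]

for e in expressions:
    print(expression_label(e))   # Hamlet (English), Hamlet (German), Hamlet (Branagh)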

Wednesday, September 07, 2005

Dangers of assumptions

Over the holiday weekend, I read the paper by Thomas Mann, Will Google’s Keyword Searching Eliminate the Need for LC Cataloging and Classification? Mann presumes to know exactly what is possible (not just currently implemented) in a search engine - the paper is stuffed full of absolutes: "cannot," "only," and "will not." The paper seems to focus on Google as simply taking in words in a query, looking them all up in a word-by-word index of all documents, and performing some sort of relevance ranking on documents that contain the search terms. It not only assumes that Google takes this simplistic approach, it rejects that any further capabilities are even possible in a search engine.

I believe this is a thoroughly (and perhaps, in this case, deliberately) naive assessment of the situation. Just because library catalogs offer only simple fielded searching and straightforward keyword indexes doesn't mean all retrieval systems do the same. Mann ignores the possibility of a layer between the user's query and the word-by-word index. He states that the problem of "having only keyword access to content is that it cannot solve the problems of synonyms, variant phrases, and different languages being used for the same subjects." This statement confuses "keyword access" (just looking something up in a full-text index) with a system that uses a keyword index among other things for searching. Google could (and right now, does, with the ~ operator [thanks Pat, for the heads up on this!], and who of us library folk is to say they won't do this by default in Google Print) do synonym expansion on search terms before sending the query to the full-text index. Point is, it's not impossible to do this in a search system. The same idea goes for finding items in other languages - translation before the search is actually executed could be done. Ordering, grouping (yes, grouping!), and presentation of search results in this environment would require some advanced processing, but that's doable too.
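
As a sketch of the "layer between the user's query and the word-by-word index" I mean (not how Google actually does it, just a Python illustration with an invented synonym table), query terms could be expanded before the full-text index is ever consulted:

# Invented synonym table; a real system might draw on a thesaurus,
# statistical co-occurrence, or translation dictionaries.
SYNONYMS = {
    "car": {"automobile", "auto"},
    "film": {"movie", "motion picture"},
}

def preprocess(query):
    """Expand each query term with synonyms before it reaches the index."""
    expanded = []
    for term in query.lower().split():
        expanded.append(" OR ".join(sorted({term} | SYNONYMS.get(term, set()))))
    return " AND ".join(f"({t})" for t in expanded)

print(preprocess("car repair"))
# (auto OR automobile OR car) AND (repair)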

Of course, there is a difference between what's possible and what's actually implemented in Google today. Mann's language confuses the two, by stating (incorrectly) what's possible using as evidence what's implemented. What's implemented today is the functionality in the Web search engine, but we shouldn't assume the same functionality will drive Google Print. This article uses rhetoric to stir the librarians up for their cause. But it does us a disservice by making false assumptions and obscuring the facts. There are arguments to be made for why libraries are still essential and relevant today. But rabble-rousing with partial truths isn't the way to make them.

Monday, August 29, 2005

Google Print and Fair Use

Having thought some about the "copying" aspect of Google Print, it would now be prudent to think about exceptions to the exclusive right of copyright holders to reproduce a work. Google's stance seems to be that their activities fall under the scope of the Fair Use exception to copyright. Fair Use is far from a straightforward concept, and comparatively very few cases have served to clarify the issue. Here's the text of section 107 of the copyright act, which describes the fair use exception:


Notwithstanding the provisions of sections 106 and 106A, the fair use of a copyrighted work, including such use by reproduction in copies or phonorecords or by any other means specified by that section, for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research, is not an infringement of copyright. In determining whether the use made of a work in any particular case is a fair use the factors to be considered shall include —

(1) the purpose and character of the use, including whether such use is of a commercial nature or is for nonprofit educational purposes;

(2) the nature of the copyrighted work;

(3) the amount and substantiality of the portion used in relation to the copyrighted work as a whole; and

(4) the effect of the use upon the potential market for or value of the copyrighted work.

The fact that a work is unpublished shall not itself bar a finding of fair use if such finding is made upon consideration of all the above factors.


Note that whether the copyright owner objects or not is not a factor to be considered when determining fair use. That copyright owner could file a lawsuit, but the fair use claim is evaluated on these four factors only.

So how does Google Print stack up against the four factors?

(1) Purpose and character. Commercial vs. educational is singled out here, and certainly Google's use is commercial. But that's not the only purpose or character allowed to be considered. A lawyer for Google could claim that their service, meeting people's information needs and directing them to a copyright holder when a work meets that information need, is a Good Thing. They could then go on to argue that making money off of this is secondary, but lots of folks wouldn't believe that.

(2) Nature of the copyrighted work. This is hard to pin down due to the scope of what's being digitized. Books that have been out of print for 45 years and aren't widely available in the used book market would evaluate differently according to this criterion than Harry Potter. (Yes, research libraries collect fiction too.)

(3) Amount of the work. Again, tricky. Google is digitizing (copying) the entire work, and, presumably, using the entire work to create their index. The counter-argument seems to be they're only showing a small part to users of their service, but I don't believe that applies here. The exclusive right is the copying part, not what you show to other people.

(4) Effect on the market. Here is where only showing snippets to end-users comes into play. Certainly the effect on the market is potentially severe if one could download, print, and read a whole book from Google instead of purchasing it. The recording industry feels that way about file sharing, but there are many who disagree, claiming file sharing actually stimulates purchasing. (Sorry no citations right now, but there are gobs of studies out there on both sides of this issue.) I imagine Google would claim that by showing snippets they're telling users about resources they didn't know about before, and are thus adding to the market. This will be an interesting argument to follow.

My conclusion is that the fair use claim is far from a slam dunk in either direction. Personally, I'd love to see this litigated (and found in favor of Google!) to start what I consider to be much-needed reform in copyright law.

IANAL. Any misinterpretations or flawed analyses are entirely mine, and the result of me trying to pretend I know something about this stuff.

Sunday, August 28, 2005

Musings on the state of copyright

The recent brou-ha-ha (wow, I think that’s the first time I’ve ever written that word down!) over Google Print has me thinking about copyright law. I am not a lawyer. I have no legal training or education. I have picked up a bit about copyright law while working in the area of digital libraries for the past five years, however. I think what I think I know is accurate, but hey, I'm wrong a reasonable amount of the time.

The publishers who have objected to the Google Print project say that the project violates copyright law by scanning the books in question (copying, which is the first exclusive right granted to copyright holders by section 106 of U.S. copyright law) to index them. So how is this different than Google’s Web index? Well, in creating the Web index Google caches Web pages too. Caching may not actually be the right word there – Google probably more actively, intentionally, or permanently creates a copy than Random J. User’s Web browser does. One could argue there’s some sort of difference between the caching done by Google of Web pages and scanning page images of printed books, but it seems to me this difference is a matter of degree rather than of real substance. So if the digitization for Google Print is a copyright violation, does that mean all Web search engines are copyright violations?

Let’s take this exercise one step further. Indexes have been around for a very long time: the Readers’ Guide to Periodical Literature, Academic Search Premier, the MLA International Bibliography, and on ad infinitum. I admit to being ignorant as to whether these more traditional indexes tend to operate with the blessing of the copyright holders (although many of them are actually produced by publishers to cover their content), but surely not all of them do, and the library world isn’t exactly abuzz with these copyright holders crying foul. One difference is that the processing that happens to create these more traditional indexes (although this may no longer be true today!) is entirely an intellectual exercise. Any “copying” of the work done to create the index is purely in a person’s head. Is this difference one of degree or of substance?

To go yet another step further, library catalogs use a copyrighted item to create a new representation – is there an argument there that catalog records are derivative works? Obviously we’re in danger of descending into the ridiculous here, but the need for some sort of balance is clear. The concept of balance between the rights of the creator of a work and the benefit to the public good from its use is inherent in copyright law. Too bad the specifics of maintaining this balance are in language that languishes far behind current technologies.

I think it will take a copyright challenge to a large for-profit like Google (rather than to even the most resource-rich library) to overhaul copyright law, to bring it up to the times. Google seems to me to have the desire and the resources to present a reasonable defense, and persist through a legal battle rather than settling the short-term problem through an agreement with publishers. But, as I’ve said, I’m wrong a reasonable amount of the time.

Thursday, August 11, 2005

A billion and one, a billion and two...

The OCLC folks are all abuzz with the addition today of the billionth holding to WorldCat, as reported all over. This is obviously an enormous milestone for OCLC and for libraries in general. Kudos are in order for all of us, I think!

The union catalog has transformed the way libraries provide access to their material. A billion holdings in one database seems to me to be proof positive of that. But OCLC Research staff and many others, researchers and practitioners, aren't content with the functionality our current union catalogs offer. The enormous wealth of data represented by those one billion holdings has the potential to be used in innumerable ways. I believe OCLC's FRBR activities are excellent examples of the sorts of creative things we can do with this data to better serve our users. We've made huge strides in access to materials, yet we have many miles to go.

UPDATE: I've discovered today the misfortune of having a book on The Monkees be WorldCat's one billionth holding. We're going to have a country of librarians walking around for two weeks now with that damn theme song stuck in our heads!

Wednesday, August 10, 2005

Keeping up with technology

Podcasting, Web services, RDF, Flickr, Ebooks. Buzzwords, right? All of these are extremely useful technologies or applications, but I don't actually use any of them. Each has its place, each is good at solving certain types of problems. None, of course, is a magic wand that makes everything in life easier.

I follow a number of library- and technology-related blogs. Many of them hype a certain technology that is meaningful to the blogger for their particular needs. I learn a huge amount from these bloggers, the information they provide, and the fervor with which they provide it. But rarely do I go out and try any of the technologies being described just to see what they are. A few pique my curiosity and I go check them out, but for the majority I just mentally file the information away for when I have a problem the technology in question solves. There's just too much going on in this environment right now to really delve in and learn everything new that comes along. Each of us picks up on the emerging technologies most relevant to us in our personal or professional lives. Other technologies are only relevant to us at a later time, but hearing about them before we need them reminds us of the vast range of possibility out there. Sharing our experiences helps others both to adopt them right away when appropriate and to adopt them later as the need grows.

Tuesday, August 02, 2005

To each their own "metadata"

I was introduced to someone today as the "Metadata Librarian," and received a reaction I seem to get a lot: "Oh, metadata, huh? Someday I'll understand that." On my optimistic days, I want to respond "Would you like to go get a cup of coffee and chat?" On my cynical days, "You've got an opportunity here to learn something new! Take it!"

Everyone has their talents and areas of difficulty. We're all really good at some things and equally bad at others. Me, I'm completely spatially inept. It once took me 3 hours to put together a futon frame (with instructions). I'm fine with that, because I know my talents lie elsewhere, although I do often think it would be nice to be handy. Despite my lack of innate talent in some areas, I've never thought I simply can't learn any of it. Little by little I'll learn to fix things around the house. I'll never be able to paint with any level of inspiration, but with a whole lot of practice I might be able to use color effectively or produce a still life that is recognizable. One might think metadata is uninteresting. That's cool. I find a lot of stuff out there uninteresting. But don't think it's unlearnable.

Part of the problem here is that "metadata" isn't a monolithic concept. Depending on one's perspective, it can mean virtually anything. For lots of people, all they need is descriptive metadata, and maybe just whatever version of qualified Dublin Core their content management solution provides them. GIS specialists delve deeply into an area of metadata many know very little about. For many, text encoding is the metadata world, of extremely rich depth and subtlety. I had an interesting conversation recently with a colleague about the definition of "structural" metadata. By some definition, TEI markup is structural metadata, indicating the structure of the text by surrounding that text with tags. Does that same logic apply to music encoding? Music markup languages specify the musical features themselves, rather than "surrounding" them with metadata. But certainly there's some similarity to text markup. The boundary between structural metadata and markup isn't the same to everyone. Similarly, there are times when I use the word metadata to refer to something that might more accurately be "data," and when I use it to refer to something that might be "meta-metadata."

All of these views are valid. I'm constantly reminding myself of this. Often when my first reaction is that someone doesn't get it, it's really their view not quite meshing with mine. It's important that we have some common terminology and meanings, but I believe there's room for perspective as well. I can get better at my job if I listen more closely to these perspectives.

Thursday, July 28, 2005

Music subject headings

Wow, the blog has been lonely lately, hasn't it? It's almost as if there are too many things floating around in my head that I can't get any of them fully formed enough to even write about here.

One of those many floating thoughts has been subject headings for music. Many traditional schemes, like LCSH, make a distinction between headings used for works about music, and headings used for music itself. For example, "Symphonies" is used for music scores and recordings of symphonies. But "Symphony" is assigned to texts about symphonies.

Obviously at first glance the distinction between the two forms is subtle. Even if a user realized the potential for this distinction being made (!), it would be difficult for that user to determine which form to use in which case. In my library catalog, a subject browse on "symphonies" lists first an entry for 5407 matches, then second, "see related headings for: symphonies." Clicking on the latter yields a screen saying "Search topics related to the subject SYMPHONIES," but no way to actually do that. This is probably because the authority record for symphonies has no 550s specifying any related headings. Geez. It's frustrating both because the system shows this anyway and because there are no related headings. [Yet another NOTE: the mechanism for specifying that a heading is broader or narrower than another heading in the MARC authority format is ridiculously complicated. No wonder the relationships between LCSH headings are so poor.] This same screen is also where one would view the scope note for the heading "symphonies":

Here are entered symphonies for orchestra. Symphonies for other mediums of performance are entered under this heading followed by the medium, e.g. Symphonies (Band); Symphonies (Chamber orchestra). Works about the symphony are entered under Symphony.

OK. So to find out if "symphonies" is what I'm looking for, I need to click "see related headings for: symphonies"? Riiiiight. Sure, my catalog could handle this better. Not many do.

This distinction isn't always so obvious to specialists, either. I've been reading up on the topic for a project and I'm struck by how rarely it's made explicit. A huge majority of writings simply assume they're talking about one, the other, or both, but never say so. Many others indicate they're discussing one or the other but provide examples of both. I myself recently forgot the distinction at a critical juncture. :-)

I'm wondering if this distinction between headings for works about music and works of music is still needed in modern systems. [NOTE: I don't consider any of the MARC catalogs I'm familiar with to be "modern systems"!] We certainly now have mechanisms to make this distinction in ways other than a subject string. Most of me says this is an outdated mechanism. But in a huge library catalog covering both types of materials, the distinction does need to be made in some way. I'm still pondering over exactly which way that should be.

Monday, July 11, 2005

Structure standards and content standards

It's funny how related things seem to come in spurts in our lives. Or maybe it's just that once we notice something once, it's easier to notice again. In my standard metadata spiel, I, like many others, distinguish between structure standards that tell you what "fields" (for lack of a better term) to record, and content standards that tell you how to structure values in those fields. The latter can be either rules for structuring content or actual lists of permissible entries. It's an extremely useful distinction. Yet I've been noticing recently that it's frequently misunderstood, or that the distinction is implicit in a conversation rather than explicit.

One place this trend caught my eye recently was in a blog post by Christopher Harris on using LII's RSS feed to generate MARC records, and subsequent comments and posts by several people, including Karen Schneider of LII. Most of the ensuing discussion was about keeping the two data sources in sync, which of course is important to plan for. But I noted a conspicuous absence of content standards in the discussion. MARC records, of course, do not have to adhere to AACR2 practices. In fact, there are millions of non-AACR2 records (mostly created pre-AACR2 and never upgraded for practical reasons) in our catalogs. But today if one is creating a MARC record, it would be prudent to either use AACR2 or have a compelling argument against it. Yet neither of those options appeared in this discussion. Reading between the lines, I suspect the transformation should be reasonably straightforward, but one shouldn't have to read between the lines to know.
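
A hypothetical sketch of the point, in Python: mapping a structure standard (RSS elements to MARC fields) is only half the job; a content standard governs how the values themselves are formed. The field mapping, the sample item, and the crude title rule below are my own illustrations, not how Christopher Harris's transformation actually works.

# A hypothetical item from an RSS feed like LII's.
rss_item = {
    "title": "The Chymistry of Isaac Newton",
    "link": "http://www.example.org/newton",
    "description": "An edition of Newton's alchemical manuscripts.",
}

def to_marc_fields(item, apply_content_standard=True):
    """Map RSS elements to MARC fields (the structure standard), optionally
    shaping the values along content-standard lines as well."""
    title = item["title"]
    if apply_content_standard:
        # Crude stand-in for a content-standard rule: capitalize only the first
        # letter. A real AACR2 transformation would preserve proper nouns and
        # is far more involved; the point is that some rule must be chosen.
        title = title[0].upper() + title[1:].lower()
    return {
        "245": f"$a {title}",
        "520": f"$a {item['description']}",
        "856": f"$u {item['link']}",
    }

for tag, value in to_marc_fields(rss_item).items():
    print(tag, value)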

I suppose what I'm really saying here is that when talking about these sorts of activities, we need to completely define the problem to be solved before a solution can be determined. And that includes dealing with content standards in addition to structure standards. Explicitly. Knowing which standards (or lack of them) are in use in the source data and which are expected in the target schema. Planning for moving between them. This is an extremely interesting topic, and I personally would love to see more discussion about it.

Oh, and, for the record, I'm with Karen that one would want to be careful about putting lots of records for things like LII content into our MARC catalogs. My vision (imperfectly focused, unfortunately!) is that because the format (and the content standard that is normally used with it) doesn't describe this type of material well, and the systems in which we store and deliver our MARC records don't provide the sort of retrieval we might desire for these materials, our users would be better served by a layer on top of the catalog that also provides retrieval on other information sources better suited to describing these materials. This higher-level system would provide some basic searching but most importantly lead a user down into specific information sources that best meet his needs. We have lots of technologies and bits of applications that might be used for this purpose. I wonder what will emerge.

Wednesday, July 06, 2005

So what's up with RDF?

Kevin Clarke posted on his blog last week some thoughts on the recent ALA conference and a session on metadata interoperability. A discussion has ensued from this about RDF, with commentary by Leigh Dodds and a follow-up post by Kevin. I've learned a great deal from this exchange. I've always felt that I was missing something with RDF, that I needed a discussion on a much more practical level than those I'd been exposed to in order to understand what it could do for me better than the tools I already use. I've heard smart people I like and respect make comments like those by Kevin, Bill Moen, Dorothea Salo, and Roy Tennant quoted in these blog postings, and felt a bit of comfort that I wasn't the only one who felt left out. But it's not enough to have company in the "huh?" camp - I want to understand. I want to be able to make a reasoned argument against RDF, or embrace it for tasks it does better (in my world) than other things. Yet I've never felt like I can do either of those things. For now, I'll follow discussions such as this one in order to slowly absorb all the angles.

And all of this banter reminds me I need to learn RelaxNG and finally figure out what the deal is with topic maps. Anybody have a few extra hours in their day they're willing to send my way? :-)

Tuesday, July 05, 2005

Addition of dates to existing name headings

The Library of Congress Cataloging Policy and Support Office recently announced a review and request for comments on a potential change of policy regarding addition of dates to existing personal name headings. Currently, dates are only added in certain situations, and once a heading is established, dates are never added to it after the fact. Personal name headings are frequently created while an individual is alive, leading to headings such as (from the CPSO proposal):

100 1# $a Bernstein, Leonard, $d 1918-

This heading then was not changed when Bernstein died in 1990. The CPSO proposal notes that libraries, including LC, receive frequent comments and complaints from users regarding the "out of date" nature of headings of this sort.

In discussion of this policy on the AUTOCAT listserv, the question arose as to whether name authority files served simply to generate unique headings for a person, or if they served a wider biographical function. Certainly historically the former is true. But many, including the CPSO, are recognizing that increasingly we may be well served by delving into the latter. We have an opportunity here to become more useful and relevant to the wider information community. To take that opportunity might seem to be a no-brainer.

However, the current cataloging infrastructure makes the implementation of this change challenging, to say the least. As authority data is replicated in local catalogs and the shared environment, and most integrated library systems store actual heading strings in bibliographic records rather than pointers to authority records, changing a heading requires notifying all libraries that a change has been made, propagating that change from one library to the rest, then continuing to propagate that change in every local system to all affected bibliographic records. Clearly this mechanism is anachronistic in today's networked world, where relational databases are so entrenched as to be considered almost quaint. I fully understand the practical implications of the CPSO implementing this policy. Yet I believe that it is the right thing to do. We as librarians simply must have a vision for what we're trying to accomplish, and work tirelessly towards that goal. While we must keep the practical considerations in mind, we can't let them dictate all of our other decisions. Let's set the policy to do the right thing, and insist on systems that support our goals.
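
Here's a minimal sketch (Python, with invented identifiers and titles) of the alternative: bibliographic records point to authority records by identifier, so adding Bernstein's death date means updating one authority record, not touching every bibliographic record in every local system.

# Authority "file": one record per heading, keyed by an identifier.
authorities = {
    "n001": {"heading": "Bernstein, Leonard, 1918-"},
}

# Bibliographic records store the identifier, not the heading string.
bibs = [
    {"title": "West Side Story", "creator_id": "n001"},
    {"title": "Chichester Psalms", "creator_id": "n001"},
]

def display(bib):
    """Resolve the pointer at display time, so heading changes show up everywhere."""
    return f'{authorities[bib["creator_id"]]["heading"]} -- {bib["title"]}'

# One update to the authority record...
authorities["n001"]["heading"] = "Bernstein, Leonard, 1918-1990"

# ...and every bibliographic record reflects it immediately.
for bib in bibs:
    print(display(bib))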

Tuesday, June 28, 2005

Back from ALA

ALA Annual in Chicago was the usual flurry of old friends, Powerpoint presentations, and exposure to topics new to me. The blogger's get-together put together by the It's all good folks and generously hosted by OCLC was certainly one of the highlights. I've been a bit tentative in promoting my blog to date, so it was nice to mingle with other bloggers and talk shop (and beyond!). Another major highlight was Kevin Clarke's presentation on XOBIS. Nice to finally meet you, Kevin!

I spent most of my time at ALA attending presentations I "had" to attend--those related to my daily work. I was able to spend a small amount of time expanding my horizons, but I wish I could have done more. And this schedule is without being involved in any ALA committees that meet during the conference. There is simply too much going on to take advantage of it all.

On another note: on the trip home I started reading, but didn't finish, Martha Yee's recent paper outlining a "MARC 21 Shopping List." I should hold any substantial comment until I finish the article, but so far I'm impressed. The approach of looking very precisely at the criticism of MARC and current cataloging practice to determine what exactly is being criticized, I believe, is long overdue. I do find myself thinking of counter-arguments to some of the conclusions, however. But intelligent discourse is absolutely what we should be striving for!

Thursday, June 23, 2005

Coming out of the woodwork

I've been noticing lately just how progressive librarians are. It gives me a nice warm fuzzy feeling inside every time I see evidence of this phenomenon.

FRBR is a good example. A colleague of mine recently described FRBR as a "religion," and I think that's not entirely untrue. But I'm increasingly seeing rank-and-file librarians (not just us "digital" folks or special collections librarians who do things "differently" anyways, according to one popular perception) show an interest in it. These folks commonly just want to learn what it is and what it can do for them. They aren't interested in jumping on a bandwagon just to be there. Rather, they genuinely want to evaluate for themselves the value of the model to them and their users. Sure, there are now and will always be extremists on both sides of the issue. I know librarians who want nothing to do with FRBR, and I know others who insist nothing from today's bibliographic control practices will be of any use in five years. But thankfully most of us fall somewhere in the middle.

I see huge numbers of librarians willing to talk about their ideas, even if they represent a departure at some small or vast level from current practice. I see huge numbers of librarians taking analytical approaches to solving real access problems they deal with every day. I see huge numbers of librarians keeping the overall goals of access and preservation of intellectual output foremost in their minds as they look for solutions. I see huge numbers of librarians having lively, interesting, professional discussions about options for achieving these goals. I love my job.

Friday, June 17, 2005

DCMI & Bibliographic Description

This week the Dublin Core Metadata Initiative published a new recommendation, "Guidelines for Encoding Bibliographic Citation Information in Dublin Core Metadata." I'm finding it to be a muddled mess of possibilities and examples with few real, clear guidelines for an implementer to follow.

The recommendation is described as emerging from the need to describe journal articles in DC. The recommendations tend to center on putting the information that was previously problematic (journal title, volume number, issue number, page range, etc.) within a bibliographicCitation refinement for dc:identifier, while getting the rest of the citation information from other parts of the DC record. "Optionally, but redundantly, these details may be included in the citation as well." This optional part has huge consequences for anyone using DC metadata to get to these citations. One could never know if the complete citation is present in the dc:identifier.bibliographicCitation element, or if one needs to look elsewhere for information to complete the citation. Also, it results in a situation where some of the data needed for this citation is clearly fielded (author in dc:creator, article title in dc:title, etc.), but the rest of it is not. This is hardly an elegant solution to the problem at hand.
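
To see why the optional-but-redundant approach worries me, consider two hypothetical records for the same article, sketched here as Python dictionaries. The element names follow the recommendation; the article, journal, and author are invented. Nothing in the metadata itself tells a consumer which convention a given provider followed, so any check is reduced to guessing from the string.

# Record A: the complete citation is (redundantly) in bibliographicCitation.
record_a = {
    "dc:creator": "Doe, Jane",
    "dc:title": "An invented article",
    "dcterms:bibliographicCitation": "Doe, Jane. An invented article. Journal of Examples 12, no. 3 (2005): 45-67.",
}

# Record B: bibliographicCitation carries only the parts with no other home;
# author and article title must be assembled from dc:creator and dc:title.
record_b = {
    "dc:creator": "Doe, Jane",
    "dc:title": "An invented article",
    "dcterms:bibliographicCitation": "Journal of Examples, vol. 12, no. 3 (2005), p. 45-67",
}

def citation_looks_complete(record):
    """A fragile heuristic: guess from the string whether the author is present.
    The point is that a consumer should not have to guess at all."""
    return record["dc:creator"].split(",")[0] in record["dcterms:bibliographicCitation"]

print(citation_looks_complete(record_a))   # True
print(citation_looks_complete(record_b))   # False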

Also, "there are no recommendations for providing bibliographic citations in simple Dublin Core." However, it is "suggested" that citation information be put in dc:identifier or dc:description. How is anybody suppposed to use DC for this purpose if the "experts" on it can't bring themselves to turn a "suggestion" into a "recommendation?" This document says to all of us out in metadata-land that there's a solution (actually, TWO solutions - identifier and description - choose between them randomly!), but the powers that be can't or won't formally endorse it, perhaps because it's viewed as a hack. This passive-aggressive "well, we see you have a problem and here are some possible ways to solve it on an official-looking document, but we're not going to tell you that we think any of these solutions are a good idea" crap is really starting to get on my nerves.

I'm also confused about something. bibliographicCitation is a refinement of dc:identifier, and therefore by the DC "dumb-down" rule is a type of identifier. The recommendation says, "dcterms:bibliographicCitation is an element refinement of dc:identifier, recognising that a bibliographic resource is effectively identified by its citation information." But then it goes on to say, "In Dublin Core Abstract Model terms the value of the dcterms:bibliographicCitation property is a value string, encoded according to a KEV ContextObject encoding scheme. It is not intended to be the resource identifier, which for a journal article would probably use an appropriate URI scheme such as DOI." So which is it? Is bibliographicCitation an identifier or not? Is the second quote using "identifier" to mean something different than dc:identifier without telling us? I'm willing to assume for now what I see as a contradiction here comes from my purely surface-level understanding of the DC Abstract model. But maybe not...

Monday, June 13, 2005

A gulf between research and practice

I've observed, as have others, that there is often a large gap between "digital library research" and "digital library practice" (by some definition of those terms). I got a good taste of this at the Joint Conference on Digital Libraries last week. At one session, an audience member asked the presenter if he had read this:

Nov. 2004, PhD dissertation, Marcos André Gonçalves, "Streams, Structures, Spaces, Scenarios, and Societies (5S): A Formal Digital Library Framework and Its Applications", http://scholar.lib.vt.edu/theses/available/etd-12052004-135923/

... as it related to the topic at hand. The presenter hadn't heard of it, and neither had I. But why hadn't I heard of it?!? This sort of work should absolutely be on any digital library practitioner's reading list, and any researcher in this area, be it computer science (as this one was) or LIS, should have some familiarity and ongoing discourse with practitioners. Both pure research and pure implementations of digital libraries are necessary, but that doesn't mean there is no middle ground, or that the two can't engage each other in a meaningful way. My work will be better for having read this research, and research will be better for having learned about what departments like mine produce.

I think one reason for this gulf is the differing definition of "library" held by different folks. But that's a post for another day.

Wednesday, June 01, 2005

Beyond silly...

Ok. I'm not usually one to dismiss something out of hand as silly. I've definitely become in adulthood a "let's take a minute to look at all sides" kind of person. After that, I'll still tend to develop a strong opinion, but I like to believe I'm always willing to listen. That said, there are some things that I do have an immediate reaction to, consisting of me wanting to yell, "What in the world were you thinking?!?!" I had one of those moments stretched out over the last few days. Feel free to get me off my high horse and engage in real dialogue!

A post on Autocat last Friday asked about what to record in MARC 007 as the playing speed of a CD. The answer:

"Compact digital discs: Speed is measured in meters per second. This
is the distance covered on the disc's surface per second, and not the
number of revolutions.
f 1.4 m. per sec."

WHY, exactly, is this information important to be included in a MARC record? CDs and DVDs only play at one speed. I know that for analog discs (records, remember those?), one needs to know, for example, if it's a 45 or a 33 1/3, but not for the media currently under discussion! (And LP speeds are what they're *supposed* to be, not what they really should be to reproduce at pitch!) It strikes me very strongly as an anachronism, completely unnecessary in a bibliographic record for a CD created in 2005.

The conversation on Autocat then spun into a discussion of why it's not measured in revolutions per second, some technical details about how CD players work, etc. Interesting, certainly. But I'm a bit incredulous that the focus is on the method of measurement rather than the point of including that data in the first place! If and when CD players are historical artifacts, and all information on how they worked is lost, looking in MARC records and interpreting the very complex semantics of 007 is not going to be the revelation that reconstructs the speed at which they should play. Even if we should be recording this information for posterity (value for dollar, anyone?), it doesn't have to be in every single bib record for a CD! We record this information at the expense of far more important data, such as analytics for individual musical works on the recording. Please, please, please! Let's step back and think about why we create these records in the first place. AACR3 (oops, RDA!) is trying to do this, but I fear it's not going nearly far enough.

Rant over. I do realize there are lots of practical problems we have with legacy data if we're going to make large-scale changes to cataloging practice. Let's work to solve those problems and not let them scare us off from doing anything. There are lots and lots of folks out there doing just this stepping back I'm pleading for. Good work, all of you! Let's do some more.

UPDATE! I get AUTOCAT in digest mode, and wrote the above based on messages received up to the morning of 6/1. In the digest I received 6/2, there are no less than TWO posters wondering what the heck this stuff is doing in a MARC record anyways. There's also continued endless discussion about linear velocity, how the CD measurements relate to tape media, how they relate to the "48x" speed advertised for CD-ROMs, etc. It's great that folks want to really understand these things, but I'd still argue that preferencing this sort of information over lots of other useful information isn't the right thing to do.

Wednesday, May 25, 2005

Z39.19 Revision

Well, today is the deadline for comments to NISO on the new Z39.19 revision, and unfortunately I haven't had a chance to dig in far enough to make any comments useful to them. Blast. I did, however, open the document up today to look for something specific. In the course of that search, I came across this text:

8.3.3.2 Parts of Multiple Wholes
When a whole-part relationship is not exclusive to a pair [of terms, i.e., when a part can belong to many] wholes, the name of the whole and its part(s) should not [be linked hierarchically;] they should be linked associatively rather than hierarchically. Carburetors, for example, are parts of machines other [than cars; the] relationship in this instance is cars RT carburetors.

[The quoted passage was truncated at the right margin in my copy; bracketed text is my reconstruction of the missing words.]

I'm disappointed in this decision. In order to preserve a pure hierarchy (something cannot be a part of multiple wholes), some semantics are lost. The idea that a carburetor is a part of a car (as well as potentially a part of lots of other stuff) disappears when the link is relegated to an RT (associative relationship). Whole-part relationships appear in the document as one of three types of hierarchical relationships; by categorizing them there, the authors seem to have forced themselves to move a huge number of things commonly thought of as whole-part relationships into the associative category. We librarians just love hierarchy, don't we. Too bad the world is polyhierarchical. Looks like our information systems won't be able to catch up just yet.
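(To make the complaint concrete, here's a toy sketch, with made-up terms and a made-up relationship label, of what gets lost. BTP, "broader term, partitive," is one convention some thesauri use for whole-part links.)

# What the draft prescribes: the part-of meaning is flattened to a plain RT.
prescribed = {
    "carburetors": {"RT": ["cars", "lawn mowers", "motorcycles"]},
}

# A polyhierarchical alternative: the same term is a part of several wholes,
# and the nature of the relationship is preserved.
polyhierarchical = {
    "carburetors": {"BTP": ["cars", "lawn mowers", "motorcycles"]},
}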

Monday, May 23, 2005

"In Search of the Single Search Box"

Whew! It sure has been a while since I've posted. When starting this blog, finding the time to post was one of my major concerns. I'd been doing pretty well, but I recently hit a stretch where I was traveling more than I was home for about 6 weeks, and I moved house in there as well! But I'm back now, and should be closer to home for a large portion of the summer. Here's to keeping up the blog while sitting on my new patio with a frosty beverage!

I heard an excellent presentation recently at the Digital Library Federation Spring Forum, one which has also been referenced recently on a library mailing list (WEB4LIB?). Staff at NC State have developed methods for making a single search box on the library's web site actually return relevant information for the many types of queries users type into it, no matter how much explanatory text about what that box searches appears on the page. The presentation was titled "In Search of the Single Search Box: Building a 'First Step' Library Search Tool." (Firefox users beware: the presentation is in HTML-ized Powerpoint and will look really strange in your browser!) Their video demo does an excellent job of illustrating the types of information needs to which the tool can respond. As the presentation suggests, this box doesn't search inside absolutely everything, but is intended to be a first step from which users can see some ideas and choose among them for continuing their journey.

As I recall (this is what I get for waiting this long to post on the topic...), the tool presents results in four major categories:

1) FAQ for the libraries
2) Library web pages
3) Links to perform the same search in some databases (the catalog, Academic Search Premier, list of journal titles, etc.)
4) Related subject categories

The FAQs meet needs like wanting to know the library hours or where the closest computer lab is. The library web page results are Google-driven, so an excerpt of each page appears, which a user might find helpful for picking a result when they want some context about a resource. The "search the collection" links put catalog or database search results an extra click away if that was the desired search, but that click is simply moved from the beginning of the process (click a catalog link on the home page, or, alternatively, take a few minutes to figure out which box on the front page to type in!) to this stage.

The "Browse Subjects" area, where a list of potentially relevant subjects is displayed, peaks my interest most about this project. The presentation didn't have a ton of information about where these links go and how the logic to develop them is created, and unfortunately I didn't have a chance to ask the NC State folks in person more about it. But from the presentation and the demo video, it looks like these links go to pathfinder-style pages where "selected" resources (selected how and by who would presumably be a local implementation decision) are displayed or linked. The presentation slides state that journal article titles and course descriptions are currently used to provide the connections between search terms and the pre-defined subjects. That's a great place to start! One can imagine a host of other options, including subject authority files, those same library web pages indexed elsewhere, and periodic looks at search logs for this box. Oh, and I see now one of the final slides in the presentation talks about some other sources - I'd forgotten that! I find the huge amount of potential here very exciting.

This tool isn't currently deployed on the NC State Libraries Web site, but I hope to see it soon. I don't recall if they plan to release any of their source code, but it sure would be nice if this was possible. I'll be keeping an eye on developments in this area.

Oh, and by the way. Never. Moving. Again. :-)

Saturday, May 07, 2005

Cataloging sound recordings

There has been a fascinating discussion on the Association for Recorded Sound Collections (ARSC) email list over the last week on cataloging sound recordings (look for threads starting with "database template" and "cataloging," then continuing here). The ARSC community is wonderfully diverse, including audiophiles, librarians, archivists, and others just interested in learning about sound recordings. The thread started out with an announcement of a database template for recording information about sound recordings; someone solving an immediate problem and wanting to share their solution with others. It's expanded greatly to become somewhat of a religious discussion on the relative merits and problems of MARC/AACR2 cataloging.

I can't help but feel that, as in a great many discussions of this sort, the participants are talking past each other. One point that has been made, though perhaps not strongly enough, is that the user-experience problems with library cataloging are heavily a problem of what use the search system makes of the data and how it's presented to end-users. Ralph Papakhian, one of the premier music catalogers in the country, whom I like and respect a great deal, has made the point in this thread that the data elements some respondents say they want to record are in fact recordable in MARC. And if anyone would know and can explain this to others, it's Ralph. But these elements, even though they're there, are often not accessible to users. For example, MARC has fields for date of composition and for coded instrumentation of a recording or score. But few if any library systems index or display this data. So catalogers rarely enter it, which gives systems less incentive to use it, which gives catalogers less incentive to enter it, which gives systems less incentive to use it...

But I believe systems aren't the only problem. There are lots of little things I think MARC/AACR2 could do better. The biggest difference, though, mostly implicit in this discussion, between what MARC does and what some of the other participants in this thread want from sound recording cataloging is the library focus on the carrier over the content. Catalogers discuss this issue frequently, but it hasn't been brought up explicitly in this thread. Audiophiles absolutely are interested in the recording as a whole--its matrix number, sound engineers, etc. But they are equally interested in the musical works on the recording, which personnel are connected with which piece, timings of tracks, etc. MARC has places for these things, but they are relegated to second-class status. Catalogers know and tout the benefits of structure and authority control in information retrieval. But when it comes to the contents of a bibliographic item, we apply none of these principles in the MARC environment. Contents notes are largely unstructured (and what structure is possible is rarely used and keeps changing!), don't make use of name or title authority control, and in many cases aren't indexed in library systems.
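(A made-up illustration of what I mean by second-class status. The "enhanced" contents note with $t and $r subfields does exist in MARC; the record below does not.)

# Flat contents note: one long string, no authority control, often unindexed.
flat_505 = (
    "Symphony no. 5 in C minor, op. 67 / Beethoven -- "
    "Symphony no. 7 in A major, op. 92 / Beethoven."
)

# Enhanced contents note: titles ($t) and responsibility ($r) are at least
# tagged, but still uncontrolled and still second-class in most systems.
enhanced_505 = [
    ("t", "Symphony no. 5 in C minor, op. 67"), ("r", "Beethoven"),
    ("t", "Symphony no. 7 in A major, op. 92"), ("r", "Beethoven"),
]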

As pointed out in this thread, creating this content-level information is extremely expensive. But the networked world has the potential to change that. Much of this information has been created in structured form outside of the library environment, by record companies, retailers, and enthusiasts, but we don't make use of it. Right now, it's difficult to make use of it because our systems don't know how to talk to each other. It will take a great many baby steps, but I hope we can start down the road towards changing that.

Matt Snyder of NYPL, whom I met at MLA this year and was extremely impressed with, has made the point in this thread that MARC records (and, by extension, library catalogs) and discographies have different purposes. This is definitely true in today's environment. Library catalogs are primarily for locating things, and discographies have more of a research bent. But I feel strongly, and this email discussion seems to support the view, that the distinction is largely artificial and is becoming less relevant as information retrieval systems continue to evolve. More sharing of data between systems will hopefully mean fewer systems for end-users to consult. That's certainly my goal!

Thursday, May 05, 2005

Known-item vs. unknown-item searching

A series of project assignments and offhand conversations recently have me thinking about how well (or how poorly) our current library-ish systems support users diving in and simply exploring what the system has to offer. On the whole, most of our discovery systems focus on known-item searching, where a user comes to the system with something specific in mind that they want to find: books by a certain author, a movie with a specific title, recordings by a particular artist. These information needs are of course common, and they are in fact the focus of Cutter's first objective of the catalog.

But look more closely at c) in that first objective - we should provide access to an item when the subject of it is known. So what exactly does that mean? Most current systems in a library environment fulfil that by making text in a subject-ish field keyword searchable. When I do a subject search in a system of that sort, I get back records that have subjects containing the word I typed in. But how do users know what the words in those subjects are? Some (certainly not all!) systems provide the user a way to look at a list of subjects used in that system. The user then is expected to locate all subjects of interest in that list, then construct a properly-formulated Boolean query OR-ing those subjects together. I'll be perfectly frank and state that I believe strongly that this is silly to expect of any user in this day and age, even an "expert" user such as a reference librarian. Let's use the computing power we have!

And what about these other objectives of Cutter's?

2. To show what the library has
e. On a given and related subjects
f. In a given kind of literature
Mechanisms to achieve these goals, in support of unknown-item searching, fall far short of the sophistication we provide for known-item searching. We don't provide our users with ways to look around, to explore, to just see what we've got. If I read a book that inspires me to read more on the topic, I go to my public library's catalog, find the book I liked, and click on a subject heading (from a maximum of three!) that seems like it might be promising. And what I find a huge majority of the time is a browse screen of LCSH headings, each with three or fewer hits. The topic I'm interested in tends to be the first part of the heading, but the browse index is a seemingly endless list of geographic subdivisions of that topic, interspersed with other subdivisions such as "juvenile," and, in particularly poor systems, with other headings that start with the same word as the term before the first subdivision.

What we need are systems that do an exponentially better job of starting out from an interesting thing and finding more things like it. I personally think postcoordinated subject headings would be a major advance in this area, but they're certainly not enough. Systems that map lead-in terms to authorized terms, and expand search results to include narrower terms than a matched broader term are also necessary. One can also imagine other mechanisms to build that "like" relationship, based on information retrieval research, folksonomies, and transaction logs.
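(A sketch of the narrower-term expansion I have in mind, using a toy vocabulary; the headings and lead-in terms are invented for illustration.)

# Toy vocabulary: lead-in terms map to authorized headings, and each
# heading knows its narrower terms.
lead_in = {"cds": "Compact discs", "lps": "Phonograph records"}
narrower = {
    "Sound recordings": ["Compact discs", "Phonograph records"],
    "Compact discs": [],
    "Phonograph records": [],
}

def expand(term):
    """Map a lead-in term to its authorized form, then add narrower terms."""
    authorized = lead_in.get(term.lower(), term)
    results = {authorized}
    for child in narrower.get(authorized, []):
        results.update(expand(child))  # recurse down the hierarchy
    return results

print(sorted(expand("Sound recordings")))
# ['Compact discs', 'Phonograph records', 'Sound recordings']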

I suppose my point in the end is that it's simple to build a system that searches the text of pre-created metadata fields for an entered query string. It's much more difficult to build systems that allow users to truly explore. We often forget how important that exploration function is. We look at our search logs, and see mostly known-item searches, so we think that's what we need to focus on. Of course we see that - it's what our systems are designed around! But what would happen if we started to provide relevant results to subject and other unknown-item searches? I'd bet a whole lot of money that we'd see a huge increase in unknown-item searching. Sure, for some types of materials, known-item searching may very well be the primary means of access users need. But let's at least look at the alternative, and work with actual users to see how we can provide them with exploratory functions we don't currently supply.

Tuesday, May 03, 2005

FRBR Workshop

Wow! Wow, wow, wow, and WOW. I'm at the end of day 2 of a 2.5 day FRBR Workshop at OCLC, and I've been continuously blown away by the activity going on here. The workshop is supposed to be in large part a working session to start thinking about what revisions to the original FRBR report would look like. I was skeptical of that goal coming in, seeing as 75 people are here, but I've been extremely pleasantly surprised. If I've ever been in a room with as many bright, engaged, and interesting people before, I didn't appreciate it at the time. Within the discussion, I find just the right balance of theory and practice, of idealism and realism. There's a very clear vision of what a bibliographic future could be, and a great many ideas for ways we can get there in manageable steps.

The workshop itself is a mixture of presentations on specific topics and time to just talk. Some presentations don't at first glance look to be FRBR related, but every single one really does have a definite impact on how FRBR should develop in the future, either as a conceptual model or as some sort of implementation model based on the conceptual one. Some presentation slides are on the workshop site now, and hopefully all will eventually be. But the presentation slides in no way do the actual presentations and the resulting large- and small-group discussions justice. I feel more confident than at most meetings of this type that the discussions will have real results, in the form of writings and implementations. I sincerely hope so - many people out there are interested in this topic, and the best thing we can do now is share, share, and share some more.

Thursday, April 28, 2005

Newsweek article on "tagging"

In the April 18, 2005, issue of Newsweek, "The Technologist" column is about "tagging" and sites like Flickr and del.icio.us that collect users' labels for things. These "things" can be absolutely anything - the great thing about the Internet is that communities can appear almost instantaneously around anything at all. These labels can then be used to generate a folksonomy. There's a lot of well-deserved buzz about folksonomies on the Web right now. It's cool stuff. It provides such a great sense of how REAL PEOPLE (who?) think about things.

I'm enough of a skeptic to think it's not practical for libraries to switch wholesale to folksonomy-type endeavors for subject access, but surely there are ways in which we can capitalize on the wealth of relevant information being generated out there. I've been interested for some time in incorporating user-contributed content into a project I work on. My plugs for this to date have used wikis as examples - I think I'm going to have to add folksonomies to my spiel!

Tuesday, April 26, 2005

AMeGA Automatic Metadata Generation final report

I've just finished reading (reading, what's that? haven't done it in a while...) the final report from the AMeGA (Automatic Metadata Generation Applications) Project. I filled out the survey on which part of this report was based, and I have to admit, I wasn't optimistic about the project. The survey indicated it was meant primarily for text objects, which, as someone who works heavily in non-text environments, I found disappointing. But now that it's out, I think the report overall does a good job of outlining the issues involved.

Of particular interest to me is Section 8, where proposed functionalities for metadata generation applications are listed. There are a number of very good suggestions here, often focusing on streamlining the metadata generation process - making use of automation where current technologies perform well, and making the human-generated part of the process easier. I definitely agree with the report that there is a huge disconnect today between research in this area and production systems. There is very interesting research going on, but production systems don't yet make good use of it. Right now, we still need humans in the process. I'm not opposed on principle to changing this, but that's today's reality.
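(The division of labor the report seems to have in mind, sketched very loosely and hypothetically: automate the elements machines do well, and hand the rest to a human with the machine's best guesses attached.)

# Hypothetical sketch of a human-in-the-loop metadata generation step.
def generate_record(resource, extract_format, suggest_subjects):
    return {
        "format": extract_format(resource),           # reliable: automate outright
        "subjects": {
            "suggested": suggest_subjects(resource),  # unreliable: suggest only
            "status": "needs human review",
        },
    }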

The report characterizes survey respondents as "optimists" and "skeptics," based on their projections of future abilities to automate metadata creation. The report quotes several skeptics as proclaiming it simply not possible, under any circumstances, to completely automate metadata creation. I'd like to think of myself as on the fence on this issue. I don't like to say "never," but I do see that some types of metadata elements will be easier to generate automatically than others. The more we can automate, the better. I also understand the problem with evaluating automatic metadata generation applications. Few people agree on appropriate subject headings, etc., so how do we know whether a generated heading is appropriate? In my opinion, the more we can expose people to the results of generated metadata, the better we can evaluate it, and the better these systems will eventually get.

Wednesday, April 20, 2005

ANSI/NISO Z39.19 draft revision

Whew! I'm finally back from 3 trips in 3 weeks, and have slogged through enough email to think about the blog again. I had lots of interesting developments waiting for me when I returned - new blog fodder!

ANSI/NISO has released a draft revision of Z39.19, now titled "Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies." I haven't had a chance to read the document yet, but it sure looks interesting! From the table of contents, I'm glad to see a small section on synonym rings, as we encountered these not working the way we expected in an implementation of OracleText. At first glance, the scope of the standard seems to have expanded. There are sub-sections of the "principles" section on ambiguity and facet analysis that I don't recall being in the existing standard (but don't quote me on that!). I'm extremely interested in the section on displaying controlled vocabularies. In my opinion this is the biggest barrier to end users of systems using controlled vocabularies today - displays that completely separate the vocabulary from the search interface, requiring users to know of their existence, understand their structure, and take the time to consult them! I look forward to seeing if this draft standard can make them more understandable.
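(For anyone who hasn't met them: a synonym ring is a set of terms treated as interchangeable at search time, with no preferred term. A toy sketch, with made-up terms, and nothing to do with how Oracle's implementation actually behaves:)

# Any member of a ring expands the query to all of its members.
rings = [
    {"automobiles", "cars", "motor cars"},
    {"films", "movies", "motion pictures"},
]

def expand_query(term):
    for ring in rings:
        if term.lower() in ring:
            return ring
    return {term}

print(expand_query("movies"))
# {'films', 'movies', 'motion pictures'} (in some order)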

Sunday, April 10, 2005

"Authority control in AACR3"

I recently read a paper by Deirdre Kiorgaard and Ann Huthwaite that I heard about on Catalogablog, entitled "Authority control in AACR3." The paper describes the efforts underway to address authority control in AACR3 in a manner more explicit than in AACR2. The statement in this paper I find most interesting is this:

"The definition that is likely to be included in AACR3 is: 'the means by which entries for a specific entity are collocated under a single, unique authorized form of a heading; access is provided to that authorized form from variant forms; and relationships between entities are expressed.'"

Authority control for names certainly fulfils the collocating function described here and, conversely, a disambiguating function, by creating different headings for different people with similar or identical names. But in today's information systems it can and should fulfil another function - helping users decide whether the name heading displayed to them is for the individual they're interested in. I believe only the first goal is served by a system where the uniqueness of a person is represented only by the form of the heading. Name authority files also don't completely disambiguate names; there are many cases of duplicate names in the authority file when no information other than what appears on a publication is available to the cataloger.

I can't help but wonder if we're missing an opportunity here to move to a structure that can more easily fulfil both goals. Information that would help a user decide whether a person is the one they're interested in is frequently added to a name heading, but not always. If all of that information, plus any more that might be of use, were made available to the user in a flexible manner, rather than just the data necessary to disambiguate one name from another, the second goal would be much more easily served. Perhaps this is not the time for this sort of change. I do think we as librarians and system designers should be open to changes of this sort, continuing to focus primarily on the task we want to accomplish, and leaving the mechanics of accomplishing it as a later step.
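(What I'm picturing, very roughly: the information that identifies a person lives as structured attributes of an entity rather than being baked into, or left out of, a heading string. Everything in the record below is invented.)

person = {
    "preferred_name": "Smith, Jane",
    "variant_names": ["Smith, J.", "Smith, Jane A."],
    "dates": "1948-",
    "field_of_activity": "organic chemistry",
    "affiliation": "Example State University",
}
# Collocation still works, since one record gathers the variants, and a
# system can surface any of these attributes to help a user decide whether
# this is the Jane Smith they are looking for.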

Saturday, April 09, 2005

LJ April 1 Retrospective

The Library Journal April 1 issue has been archived.

Friday, April 01, 2005

April Fools!

Be sure to check out the April 1 edition of Library Journal. I hope this gets archived somewhere. What a hoot!

Sunday, March 27, 2005

Random thoughts on XOBIS

Kevin Clarke, one of the authors of XOBIS, kindly left a comment on my recent blog post on the topic. It shamed me into returning to the XOBIS general overview document I peeked at briefly when originally writing about it. I've now given the entire document a quick read. I can't claim to have an in-depth understanding of it at this point; it certainly took me several readings of the FRBR report and a decent amount of time thinking about modeling different things in FRBR before I felt I could really say anything intelligent about it. Nevertheless, I have a few initial impressions on XOBIS.

The most obvious difference I see between XOBIS and FRBR is that XOBIS attempts to be a model that can describe all of knowledge, while FRBR limits itself to modelling bibliographic relationships. In a practical sense, for recording bibliographic data (and this certainly isn't the only possible use of XOBIS!), this means that XOBIS explicitly handles the entities that, in a bibliographic environment, represent creators or subjects of bibliographic items (and, in FRBR, other Group 1 entities); these currently reside in a relatively unstructured way in name and subject authority files. FRBR, on the other hand, considers its Group 2 ("person" and "corporate body") and Group 3 ("concept," "object," "event," and "place") entities only briefly, focusing instead on Group 1 entities.

Relationships between entities are a key feature of XOBIS; they are also a bit confusing to me on a first read. My initial impression is that the relationships as specified focus more on subject-type relationships than on relationships among bibliographic items. My reading is that the XOBIS definition of work is much closer to what we currently consider a bibliographic item than FRBR's work. The discussion and examples in the overview document talk about versions of works and how they are related, but I saw much less about the "accidental" sort of relationship a FRBR-ish work (as it's expressed in a specific manifestation) would have to another expressed work on the same manifestation, for example, two symphonies appearing on the same CD. It would be an interesting exercise to map out how the XOBIS model would handle this sort of situation, where the symphony itself is the entity of primary interest to a majority of end-users, rather than the specific performance or the title of the CD on which it appears.
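(The situation I mean, sketched with FRBR-ish entities; redoing this sketch in XOBIS terms is exactly the exercise I'd like to see. All identifiers and groupings below are illustrative.)

# Two independent works, each realized in a recorded performance, both
# carried by the same manifestation: one CD.
works = {
    "w1": "Symphony no. 5 (Beethoven)",
    "w2": "Symphony no. 7 (Beethoven)",
}
expressions = {
    "e1": {"realizes": "w1", "note": "recorded performance"},
    "e2": {"realizes": "w2", "note": "recorded performance"},
}
manifestation = {
    "title": "Beethoven: Symphonies nos. 5 & 7",  # the CD as published
    "embodies": ["e1", "e2"],                     # the "accidental" pairing
}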

XOBIS comes out of the Medlane project of the Lane Medical Library at Stanford. I wonder what effect medical materials have had on the development of the XOBIS model. I know my focus on musical materials in various projects, most notably Variations2, certainly strongly affects my thinking about FRBR and related efforts. I'm sure that's obvious from my earlier question wondering how XOBIS would handle a situation that the Variations2 model is designed around.

There are also some very interesting items in the report's bibliography, including a project mailing list (renamed since the version listed here, and looks low-traffic). Time for citation chasing!

Wednesday, March 23, 2005

Postcoordinated subject headings

There has been an interesting discussion on the Autocat mailing list over the last two weeks (well it's died down now, but I haven't gotten around to writing about it yet...) with the subject "The inadequacies of subject headings." The discussion has centered on a few posters questioning whether the LCSH-style focus on precoordinated headings is really a good idea. Several posters proposed (not all by name) postcoordinated headings as more useful, both for end-users and for catalogers. More than one person mentioned the large amount of training required for catalogers to effectively apply headings from a precoordinated system.
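(For anyone following along who isn't a cataloger, the difference in a nutshell; the strings below are illustrative rather than checked against LCSH:)

# Precoordinated: the combination is built into one string at cataloging time.
precoordinated = "Company towns -- Indiana -- History -- 19th century"

# Postcoordinated: separate facets are assigned, and the search system (or
# the user) combines them at query time.
postcoordinated = {"Company towns", "Indiana", "History", "Nineteenth century"}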

I was struck in the discussion by the widespread lack of big-picture thinking about the issue, and the corresponding lack of awareness of the many initiatives going on in this area. There were certainly some members contributing to the discussion who have spent time thinking about this issue, but many seemed afraid of the idea. I got the sense that many folks were trained on LCSH, that's what they use, and why in the world would they want to use anything else? When posts mentioned specific postcoordinated schemes (FAST, AAT, etc.), they tended to treat them as something the poster had heard of but never used and didn't fully understand. I'm generalizing a bit here, but that tone was definitely present.

I don't know that I have anything concrete to say other than that I've noticed a trend of resistance to non-LCSH subject systems, but I do think that as catalogers are increasingly being asked to be metadata experts (and by that I mean metadata in a broad sense, not just traditional cataloging practice!) they'll more and more need to know about what vocabularies are out there. A huge part of my job as a Metadata Librarian is choosing among the various data structure and data content standards available for a given implementation. We're definitely past the days when one size (MARC/AACR/LCSH) fits all. The more all sorts of librarians learn about alternatives and can make good decisions about when they're appropriate to use, the better off our whole profession will be.

Sunday, March 20, 2005

"We're not competing with Google"

I was in a meeting recently that had as an agenda item Google Print and its effect on our current library services. (I seem to be having this meeting a lot lately.) I was by far the youngest in the room and by far the attendee working most frequently in areas outside of "traditional" librarianship (whatever that means). I intentionally spent most of the meeting listening rather than talking. One statement in particular made by someone in the room struck me and started me thinking a great deal: "We're not competing with Google."

I didn't respond to it at the time, but the statement has been churning around in my head ever since. Whether or not it's true depends, of course, on what one means by "competing." If we mean "attempting to do exactly the same thing," then it's pretty much true. While we're both in the information business, the ways in which we approach it are fundamentally different. And that's OK. But if we mean "fighting for the attention of users" or "fighting for the perception that we provide valuable services worth funding," then maybe we are competing with Google. The differences between libraries' missions and the ways in which we go about achieving them are important to us, but perhaps they're too subtle for a large proportion of the population. Certainly there are lots of folks out there who think Google can and will replace libraries, even if we think they're wrong.

So what does this mean? Well, I think it means that libraries need to continue to promote what we do and why. Not in the preachy Michael Gorman style proclaiming from on high to the masses that libraries are the cornerstone of high civilization and those who disagree aren't worth thinking about, but rather by building and delivering services that meet our users' needs. In the rapidly changing information environment, this means we do need to be rethinking how we do a lot of what we do. Let's remember our core principles of preservation, collocation, and free access, and find new ways to implement these in today's environment and for today's diverse users.

Wednesday, March 16, 2005

A DC frustration

I had another one of those <sigh> moments about Dublin Core today. I've got some really amazingly simple bibliographic data I need to put within a METS document. At first I said, "Hey, let's just use DC. It will be easy." (Note to self: anytime you say "It will be easy," you're asking for trouble.) Everything was going along swimmingly until I started thinking about boilerplate text to put in all the records. One of these pieces of text would indicate the department at my institution that houses the materials in question. Ding, ding, ding! Alarm bells! There's no good place for this in simple Dublin Core! (Or qualified Dublin Core, for that matter.)

I've dealt with this exact situation before; I guess I was blocking it out because it's SO annoying. Some folks would put this information in <dc:contributor>, and in fact several of my OAI sets do just this in their DC records. I suppose that might be OK, but the DC Contributor definition is "An entity responsible for making contributions to the content of the resource," and I don't know that I'm comfortable calling "paying somebody to digitize this stuff and then asking another department to 'put it up on the Web'" "making contributions to the content of the resource." Some folks would put this information in <dc:publisher>, but again I'm skeptical. "An entity responsible for making the resource available" (the DC Publisher definition) does apply to the digital resource. However, we're dealing with published materials here, whose print publisher can be an important access point. And we don't (nor does pretty much anybody) have a sophisticated mechanism in place for making good 1:1 principle records and linking them all together in a way that allows users to search on things meaningful to them and get meaningful results back. Putting our holding institution in Publisher in this environment would not serve our users' needs.

I started out using a hack I'd used before: put the holding information at the beginning of a <dc:source> field and add the local call number to the end so it fits the Source definition. But then I got annoyed at using what I consider a hack. So I started digging around. The Western States Dublin Core Metadata Best Practices made up their own element (currently called "Contributing Institution") and don't map it to DC; this is one of the very few elements for which they go completely outside DC. The DC Libraries Working Group made a proposal in 2002 for a new DC element called holdingLocation, but by the time the proposal was reviewed by the Usage Board, MODS had gotten off the ground, so the UB decision said to use the MODS <location> element instead.

So the DC solution to this problem is to use an Application Profile that borrows an element from another schema. But once you start doing this, the draw of DC (simplicity!) is lost. I'm probably just going to use MODS instead. Sigh.
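(For the record, the hack versus the cleaner option, sketched with Python's xml.etree.ElementTree. The DC and MODS element names, including <location>/<physicalLocation>, are real; the values are placeholders.)

import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
MODS = "http://www.loc.gov/mods/v3"

# The hack: cram the holding department and call number into dc:source.
source = ET.Element(f"{{{DC}}}source")
source.text = "Example Department Library; call number XX 1234"

# The cleaner option: MODS has a place designed for exactly this.
location = ET.Element(f"{{{MODS}}}location")
ET.SubElement(location, f"{{{MODS}}}physicalLocation").text = "Example Department Library"

print(ET.tostring(source, encoding="unicode"))
print(ET.tostring(location, encoding="unicode"))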

Monday, March 14, 2005

Defining "librarian"

I've seen a few articles and discussions recently converging around the idea of defining what a "librarian" actually is. The March 2005 issue of American Libraries has a cover story about paraprofessionals working in libraries and how they are perceived by patrons and by "librarians" at their place of work; there has been an ongoing thread with the subject "End of Librarianship" on the Autocat mailing list during weeks 1 and 2 of March 2005 (browse the archives); and a posting today at lisnews.com reports a library director job listing indicating that an MLS was optional for applicants. These all touch in some way on whether an MLS, a job title including the word "librarian," or a job in a library makes one a "librarian."

Certainly the definition of "librarian" is contextual. The American Libraries article asks some library paraprofessionals how they answer the question "Are you the librarian?" Since a patron asking that question almost certainly means "Can you help me?" rather than "Do you have an MLS?" or "Does your job title say you're a librarian?", the answer there, in my opinion, should be an emphatic YES.

But many librarians are extremely protective of this label. It represents a significant investment of time, money, and intellect into earning a professional degree. And that's certainly nothing to sneeze at. (Even if some MLS programs in this country today can't reasonably be described as "rigorous.") However, I certainly know a number of people in jobs with titles including "librarian" who were hired under the rationale "MLS or equivalent experience" who do excellent jobs. Shouldn't one's ability to perform the duties of a position be the primary criterion for hiring them? I tend to think that a piece of paper bearing the designation MLS doesn't necessarily tell an employer whether or not an applicant is qualified.

I guess the argument comes down to whether the term librarian should refer to "what you do" or "who you are." And I can see how each would be appropriate in different circumstances. I tend to believe one should demonstrate his or her skill and professionalism in their interactions with people and in their work performance, rather than assuming an acronym and a diploma are an accurate indication.

Saturday, March 12, 2005

"Google at the Gate"

In the March 2005 issue of American Libraries, there is an article entitled "Google at the Gate" containing questions about the recently-announced Google digitization project, with answers from Michael Gorman of Cal State-Fresno and ALA president-elect, Deanna Marcum of LC, Susan McGlamery of OCLC, and Ann Wolpert of MIT. The article appears at an interesting time, just as the buzz is dying down from what some have called "Gormangate" - a huge reaction, especially among the blog community, to comments Michael Gorman made recently lampooning the value of bloggers.

In this article, Gorman continues the dismissive style of rhetoric that has incensed so many in his previous comments on the Google project and on blogging. The tone is very much that of a person who is certain he is right and need not consider any other arguments put to him. Two quotes in particular caught my eye:

"Any user of Google knows it is pathetic as an information retrieval system..."

This quote, of course, depends heavily on the definition of "information retrieval system." The remainder of the sentence references the traditional IR research metrics of recall and precision, so it's probably reasonable to assume that Gorman is measuring the effectiveness of Google along those lines. And that's one fair way to measure. However, your random Google user is probably unlikely to measure Google according to those terms. Most information needs are for something on a topic rather than everything on a topic. We in libraries are used to (and should be!) focusing on the latter. But that doesn't mean it's the only way to design a search engine.
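For readers who don't swim in IR research, the rough definitions: precision is the share of retrieved documents that are relevant, and recall is the share of all relevant documents that get retrieved. In symbols:

precision = |relevant AND retrieved| / |retrieved|
recall = |relevant AND retrieved| / |relevant|

A ranked web search engine lives or dies on precision near the top of the result list; "everything on a topic" is a recall goal.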

"I cannot see the threat to small libraries [from the Google digitization project], nor can I see much of an advantage."

Gorman's answer to this question stands in stark contrast to those of the other participants in the interview. The others give multi-sentence responses, addressing at least some possibilities for advantages and disadvantages to small libraries from the Google digitization project. But the style of Gorman's answer is, again, dismissive, giving the impression he's made up his mind that the Google project is "bad" and that there is no need to consider its impact on libraries, small or otherwise. Perhaps he's carefully thought through all the issues and this quote is the result of a great deal of reflection. But no explanation is presented, so the reader cannot know. I suspect this style of rhetoric, handing down a conclusion from on high without any explanation or support, will not prove effective for libraries as we increasingly need to talk about our services and expertise to those outside the profession.