Tuesday, September 27, 2005

The more things change...

I've just finished reading the short volume The MARC music format : from inception to publication / by Donald Seibert. MLA technical report, no. 13. Philadelphia : Music Library Association, 1982. The book is an account of the decision-making process involved in designing and implementing the MARC music format. I was both heartened and discouraged to read arguments in support of implementing MARC that mirror closely arguments I and others make today for moving beyond MARC.

The rationale behind the MARC music format is full of hope for improved access for users and higher-quality data. Yet many of the improvements mentioned have not come to fruition. I'm heartened to see the vision represented here for the type of access we can and should be providing. Yet I'm discouraged to see more evidence that we haven't achieved this level of access in the time since the MARC format was implemented. I believe this serves to remind us that many factors other than database structure contribute to the success of a library system.

Reading this text also taught me a valuable lesson: ideas and potential alone are not enough to convince everyone that any given change is a good idea. A large percentage of librarians out there have heard these very arguments before and seen them not pan out. I do believe, however, that this time can be different. (Yes, I know how that sounds...) Computer systems are much more flexible than they were when the MARC music format was first implemented, and can be designed to alleviate more of the human effort than before. We've learned a great deal from automation and implementation of the MARC format that we can build on in the next-generation library catalog. We have a long road ahead of us, but I think it's time to address these issues head-on once again. I'd like to believe we can leverage the experience of those like Donald Seibert involved in the first round of MARC implementation, together with experts in recent developments, to make progress towards our larger goal.

Sunday, September 18, 2005

The next big thing in searching?

At a conference last week, I heard Stephen Robertson of Microsoft Research Cambridge speak about the primacy of text in information retrieval, whether for text, images, or any other type of medium. He stated that the first generation of information retrieval systems operated on Boolean principles, and the second generation (our current systems) provides relevance-ranked lists. This may be a truism in the IR world, but it's something I hadn't thought about in these terms before. Our library systems certainly are primitive in terms of searching, and they operate on the Boolean model. But I hadn't thought of relevance ranking as the "next step" - probably because the control freak in me is suspicious of a definition of "relevance" not my own. But I think it's fine to look at the progression of IR systems in this way.
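To make the two generations concrete, here's a toy sketch (nothing any real catalog or search engine actually runs): Boolean retrieval returns an unordered set of documents containing every query term, while a ranked system scores all documents and orders them. The documents and the plain term-frequency scoring are invented stand-ins for the far richer models real systems use.

```python
# Toy comparison of Boolean retrieval vs. relevance ranking.
# Documents and queries are bags of words; scoring is plain term
# frequency, a stand-in for real relevance models.

docs = {
    "d1": "marc format music cataloging music records",
    "d2": "music library cataloging",
    "d3": "boolean retrieval model",
}

def boolean_and(query, docs):
    """First generation: the unordered set of docs containing every term."""
    terms = query.split()
    return {d for d, text in docs.items()
            if all(t in text.split() for t in terms)}

def ranked(query, docs):
    """Second generation: score every doc, return a relevance-ordered list."""
    terms = query.split()
    scores = {d: sum(text.split().count(t) for t in terms)
              for d, text in docs.items()}
    return sorted((d for d in scores if scores[d] > 0),
                  key=lambda d: scores[d], reverse=True)

print(boolean_and("music cataloging", docs))  # {'d1', 'd2'} in some order
print(ranked("music cataloging", docs))       # ['d1', 'd2'] (d1 scores higher)
```

The Boolean set gives the user no help deciding what to look at first; the ranked list encodes one opinion about that, which is exactly where my suspicion of someone else's "relevance" comes in.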

So what's the third generation? Where are we going next? I think the next step is grouping in search results. Grouping is where I see the power of Google-like search systems merging with library priorities like vocabulary control. Imagine systems that allow the user to explore (and refine) a result set by a specific meaning of a search term that has multiple meanings, by format, or by any number of other features meaningful to that user for that query at that time. I picture highly adaptive systems far more interactive than those we see today. I don't believe options for search refinement alone go far enough, as they require the user to deduce patterns in the result set. I believe systems should explicitly tell users about some of those patterns and use them to present the result set in a more meaningful way. Search engines like Clusty are starting to incorporate some of these ideas. It remains to be seen if they catch on.
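A minimal sketch of the kind of grouping I mean, assuming each record carries facet values such as subject (the "meaning" of an ambiguous term) and format; the records and facet names here are invented for illustration:

```python
# Group one result set by a chosen facet, so the system surfaces
# patterns explicitly instead of leaving the user to deduce them.

from collections import defaultdict

results = [
    {"title": "Mercury (planet overview)", "subject": "astronomy", "format": "book"},
    {"title": "Mercury (element handbook)", "subject": "chemistry", "format": "book"},
    {"title": "Mercury exposure data", "subject": "chemistry", "format": "dataset"},
    {"title": "Transit of Mercury", "subject": "astronomy", "format": "video"},
]

def group_by(records, facet):
    """Present one result set as labeled groups rather than a flat list."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[facet]].append(rec["title"])
    return dict(groups)

# The same search for "mercury" can be explored by meaning...
print(group_by(results, "subject"))
# ...or by format, depending on what matters to this user, right now.
print(group_by(results, "format"))
```

The interesting (and hard) part, which this sketch hand-waves away, is deciding which facet to offer for a given query and user; that's where the adaptivity would have to live.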

FRBR assumes this sort of grouping can be provided, using the different levels of group 1 entities. Discussions of FRBR displays frequently talk about presenting Expressions with a language for textual items, with a director for film, or with a performer for music, allowing users to select the Expression most useful to them before viewing Manifestations. What's missing is how the system knows what bits of information would be relevant for distinguishing between Expressions, since these bits of information will be different for different types of materials, and sometimes even with similar types of materials. We have a ways to go before the type of system I'm imagining reaches maturity.
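One way to picture the missing piece is as a lookup from material type to distinguishing attribute, which is exactly what a real system would somehow need to know. This is a hypothetical sketch; the attribute table and records are invented for illustration:

```python
# FRBR-style display sketch: pick the distinguishing attribute for each
# material type, then label Expressions of a Work by it. The hard problem
# the post raises (how the system knows which attribute distinguishes) is
# precisely the table hand-coded below.

DISTINGUISHER = {"text": "language", "film": "director", "music": "performer"}

expressions = [
    {"work": "Hamlet", "type": "text", "language": "English"},
    {"work": "Hamlet", "type": "text", "language": "French"},
    {"work": "Hamlet", "type": "film", "director": "Branagh"},
]

def expression_label(expr):
    """Label an Expression by the attribute relevant to its material type."""
    attr = DISTINGUISHER[expr["type"]]
    return f'{expr["work"]} ({attr}: {expr[attr]})'

for e in expressions:
    print(expression_label(e))
```

A static table like this breaks down as soon as two similar items need different distinguishers, which is the point the paragraph above is making.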

Wednesday, September 07, 2005

Dangers of assumptions

Over the holiday weekend, I read the paper by Thomas Mann, Will Google's Keyword Searching Eliminate the Need for LC Cataloging and Classification? Mann presumes to know exactly what is possible (not just currently implemented) in a search engine - the paper is stuffed full of absolutes: "cannot," "only," and "will not." The paper seems to treat Google as simply taking in the words of a query, looking them all up in a word-by-word index of all documents, and performing some sort of relevance ranking on documents that contain the search terms. Not only does it assume that Google takes this simplistic approach; it denies that any further capabilities are even possible in a search engine.

I believe this is a thoroughly (and perhaps, in this case, deliberately) naive assessment of the situation. Just because library catalogs offer only simple fielded searching and straightforward keyword indexes doesn't mean all retrieval systems do the same. Mann ignores the possibility of a layer between the user's query and the word-by-word index. He states, "[the problem with] having only keyword access to content is that it cannot solve the problems of synonyms, variant phrases, and different languages being used for the same subjects." This statement confuses "keyword access" (just looking something up in a full-text index) with a system that uses a keyword index among other things for searching. Google could do synonym expansion on search terms before sending the query to the full-text index - and right now it does, with the ~ operator (thanks, Pat, for the heads-up on this!). Who of us library folk is to say they won't do this by default in Google Print? The point is, it's not impossible to do this in a search system. The same idea goes for finding items in other languages: translation could be done before the search is actually executed. Ordering, grouping (yes, grouping!), and presentation of search results in this environment would require some advanced processing, but that's doable too.
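Here's a minimal sketch of that "layer between the user's query and the word-by-word index": each query term is expanded with synonyms before the index lookup, so the keyword index is one component of the system rather than the whole system. The thesaurus and index are tiny invented examples; I make no claim this is how Google implements its ~ operator.

```python
# Synonym expansion in front of a full-text (inverted) index.
# The expansion layer, not the index, is what answers Mann's objection.

THESAURUS = {"car": ["automobile", "auto"], "film": ["movie", "motion picture"]}

# Toy inverted index: term -> set of document ids containing it.
index = {
    "automobile": {"d1"},
    "car": {"d2"},
    "movie": {"d3"},
}

def expand(term):
    """Rewrite one query term as itself plus its known synonyms."""
    return [term] + THESAURUS.get(term, [])

def search(query):
    """OR the postings for all variants of each term, AND across terms."""
    result = None
    for term in query.split():
        hits = set()
        for variant in expand(term):
            hits |= index.get(variant, set())
        result = hits if result is None else result & hits
    return result or set()

print(search("car"))  # {'d1', 'd2'}: expansion finds the "automobile" doc too
```

A plain lookup of "car" would miss d1 entirely; the expansion layer finds it without the index itself knowing anything about synonyms. Swapping in a translation table instead of a thesaurus gives the cross-language case the same way.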

Of course, there is a difference between what's possible and what's actually implemented in Google today. Mann's language conflates the two, using what's implemented as evidence for (incorrect) claims about what's possible. What's implemented today is the functionality in the Web search engine, but we shouldn't assume the same functionality will drive Google Print. This article uses rhetoric to stir librarians up for their cause. But it does us a disservice by making false assumptions and obscuring the facts. There are arguments to be made for why libraries are still essential and relevant today. But rabble-rousing with partial truths isn't the way to make them.