Tuesday, October 23, 2007

Google Book Search and... LCSH?

The Inside Google Book Search Blog recently announced that they've "added subject links in a left navigation bar as additional entry points into the index." This, predictably, piqued my interest. I followed the links in the blog entry, poked around a bit, then looked at this book. (Hope that link is persistent... the book is "Asian American Playwrights: A Bio-Bibliographical Critical Sourcebook" by Miles Xian Liu.) [UPDATE: Either that link evaporated or I screwed up and pointed to the wrong place. Try this one.] See that "Subjects" heading over on the right? Expand it. At least some of those are LCSH! ("Asian Americans in literature" is a dead giveaway.) The first three are close to what one sees in the book and e-book records in Open WorldCat. I've been living under something very close to a rock recently, so maybe this isn't news, but it's news to me at least.

I don't quite know what to think of this. I've heard Google was getting MARC records from libraries for the books they're digitizing, but this doesn't appear to be one of those books. Is this a sign they're incorporating library cataloging from other places as well? To date we haven't seen them do much with that data. Is this the sign of a change? I don't know that we can interpret it that way. This is perpetual beta, remember, and it's Google, with roughly a zillion servers and the ability to try all sorts of things out simultaneously. Just because I see those headings now doesn't mean they'll be there tomorrow, or even today for those of you hitting the service through a different route.

To some extent I think this is a good thing. We have a great deal of data in our catalogs that deserves to be put to better use than it currently is. It's great to see this data making its way into services such as GBS, and for GBS to realize "subjects" are useful, perhaps even essential, access points. (I'll skip in this post a rant about the many things "subject" can mean, including "genre" [pet peeve warning!], and my thoughts on when this data needs to be human-generated and when it doesn't.)

But I'm surprised to see the precoordinated headings there. One of them seems to have the free-floating --Biography and --Dictionaries removed, but Dictionaries stays in two of the headings. It's also interesting, although I don't know what it means, that the delimiter between parts of the heading in GBS is / rather than --. I'm wondering if there's any intelligent processing at work here or if this is a quick and dirty approach to providing subject access. These headings have a subfield structure that would make it trivial to just leave in the topical aspects (according to some definition of topical that doesn't match mine, especially for music) and remove the rest. Why wasn't this done? Does GBS perceive value in the precoordinated headings? Or have they just not spent time focusing on this yet?
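To make "trivial" concrete, here is a minimal sketch of that kind of filtering. It assumes a heading has already been parsed into (subfield code, value) pairs using the MARC 21 650 codes ($a topical term, $x general subdivision, $y chronological, $z geographic, $v form). The names `keep_topical` and `render` are hypothetical, the example heading is illustrative rather than copied from the actual record, and "topical" here just means $a and $x, which is one possible definition and not necessarily the right one.

```python
from typing import Iterable, List, Tuple

# A heading is modeled as an ordered list of (subfield code, value) pairs.
Subfield = Tuple[str, str]

# Assumption: "topical" means $a (topical term) and $x (general subdivision).
# $v (form), $y (chronological) and $z (geographic) are dropped by default.
TOPICAL_CODES = frozenset({"a", "x"})


def keep_topical(heading: Iterable[Subfield],
                 keep: frozenset = TOPICAL_CODES) -> List[Subfield]:
    """Return only the subfields whose code is in `keep`, preserving order."""
    return [(code, value) for code, value in heading if code in keep]


def render(heading: Iterable[Subfield], delimiter: str = "--") -> str:
    """Join the subfield values into a display string."""
    return delimiter.join(value for _, value in heading)


# Illustrative heading only, not the actual record for the book above.
example = [
    ("a", "American drama"),
    ("x", "Asian American authors"),
    ("v", "Dictionaries"),
]

print(render(keep_topical(example)))   # topical parts only, library-style "--"
print(render(example, " / "))          # full heading with a GBS-style "/"
```

Changing `keep` is all it takes to try a different definition of "topical", which is the point: the structure is already in the data, so the choice of what to show is a policy decision rather than a technical hurdle.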

It's my great hope that the way in which GBS ends up using library-originated subject headings sparks a great rethinking of how we provide subject access in the library community. We're heavily invested in the way we do things, and there's a great deal of value behind those ways. But just because there's some value doesn't mean that we can rest on our laurels. We simply must continually evaluate how well our vocabularies perform as systems and user expectations evolve. How closely services like GBS stick to those vocabularies will be a litmus test for us. Ever the optimist, I hope we can use what they do as data to help us shape our evolution, rather than dismissing it as uninformed or not applicable to us. Only time will tell.

Saturday, October 13, 2007

Catalog vs. search engine - or is it?

Discussion of the future of library catalogs is common today. In these conversations, I often hear an argument something like this: “Catalogs and search engines have different goals; they are trying to accomplish different things. Therefore we shouldn’t be making direct comparisons between them. By extension, we shouldn’t be comparing their functionality and features either.” This is of course an oversimplification of what’s generally said, but the spirit is there.

I’m concerned about this line of thinking. The original premise makes sense on the surface, in the sense that there is a history of analyzing and documenting the goals of the catalog (Panizzi, etc.), and that the business goal of search engines is to make money by selling advertising. But I think this approach both sells search engines short and doesn’t go far enough in thinking about catalogs. From the search engine point of view, the business argument is true, of course, but overly simplistic. We can extend the definition of the goal of search engines to say that they strive to make money by selling advertising in a system that connects people to information they seek. Google wasn’t a business at first; it started as a research project by CS students to better index information. That’s a pretty simple and laudable goal – to help people find things. The catalog is the same. With all the talk about the goal of the catalog being collocation (and all the other related goals well-documented in the literature), it’s easy to forget that those goals exist (wait for it…) to connect people, today and in the future, with information they seek. So in this very basic sense, catalogs and search engines are trying to accomplish the same thing. The methods are often different, but I don’t think we’re serving ourselves well if we write off the success of search engines and the current struggles of library catalogs because of those differences.

Early search engines differed from library catalogs in one big way: the materials they indexed. But this is no longer true to any significant degree. I’m no fan of cataloging web sites in MARC to make them searchable in our catalogs, and I see this as largely out of favor now, but it was only the first step towards blurring the line between the content indexed by search engines and that in our catalogs. Google Book Search, for example, provides access to many of the same materials that are in our catalogs. The methods of searching are very different, with full-text indexing being a strong component of GBS and bibliographic information the strongest component of our catalogs, but again, the goal is the same – getting people to books relevant to their information need. The argument separating catalogs from search engines by format of materials indexed is waning, but I still hear it from time to time. The conventional argument that a catalog provides access to things a library owns is also waning, for obvious reasons.

So what’s left to distinguish the goals of our catalogs from those of search engines, giving us a convenient excuse for why our catalogs perform so poorly? Not much of substance, I think. To me, the difference is all in style. Let’s certainly keep those goals of the catalog in mind, but let’s not assume that the methods we’ve used to achieve those goals in the past are the only methods that can be effective. If the goals of search engines and catalogs aren’t all that different in the end, maybe we can mix and match some methods too. We’ll never know until we try.