Wednesday, September 07, 2005

Dangers of assumptions

Over the holiday weekend, I read Thomas Mann's paper, "Will Google’s Keyword Searching Eliminate the Need for LC Cataloging and Classification?" Mann presumes to know exactly what is possible (not just currently implemented) in a search engine - the paper is stuffed full of absolutes: "cannot," "only," and "will not." The paper seems to treat Google as simply taking in the words of a query, looking them all up in a word-by-word index of all documents, and performing some sort of relevance ranking on the documents that contain the search terms. It not only assumes that Google takes this simplistic approach, it denies that any further capabilities are even possible in a search engine.

I believe this is a thoroughly (and perhaps, in this case, deliberately) naive assessment of the situation. Just because library catalogs offer only simple fielded searching and straightforward keyword indexes doesn't mean all retrieval systems do the same. Mann ignores the possibility of a layer between the user's query and the word-by-word index. He states, "having only keyword access to content is that it cannot solve the problems of synonyms, variant phrases, and different languages being used for the same subjects." This statement confuses "keyword access" (just looking something up in a full-text index) with a system that uses a keyword index, among other things, for searching. Google could do synonym expansion on search terms before sending the query to the full-text index. In fact, it already does this right now with the ~ operator (thanks Pat, for the heads up on this!), and who of us library folk is to say they won't do it by default in Google Print? Point is, it's not impossible to do this in a search system. The same idea goes for finding items in other languages: the query could be translated before the search is actually executed. Ordering, grouping (yes, grouping!), and presentation of search results in this environment would require some advanced processing, but that's doable too.
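To make the "layer between the query and the index" idea concrete, here is a minimal sketch in Python. It is purely illustrative and says nothing about how Google actually works: the synonym table, the three sample documents, and the function names (build_index, expand, search) are all made up for this example. It builds a plain word-by-word index, then expands each query term with its listed synonyms before looking anything up.

```python
from collections import defaultdict

# Hypothetical synonym table (an assumption for illustration only).
# A real system would draw on a much larger thesaurus or learned data.
SYNONYMS = {
    "car": {"automobile", "auto"},
    "film": {"movie"},
}


def build_index(docs):
    """Build a word-by-word inverted index: term -> set of document ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def expand(term):
    """Return the term itself plus any synonyms listed for it."""
    return {term} | SYNONYMS.get(term, set())


def search(index, query, expand_terms=True):
    """AND the query terms together; each term is OR-ed with its synonyms."""
    result = None
    for term in query.lower().split():
        variants = expand(term) if expand_terms else {term}
        matches = set()
        for variant in variants:
            # .get avoids creating empty entries in the defaultdict
            matches |= index.get(variant, set())
        result = matches if result is None else result & matches
    return result or set()


# Made-up sample documents.
DOCS = {
    1: "history of the automobile industry",
    2: "car repair manual",
    3: "history of film",
}
INDEX = build_index(DOCS)
```

With expansion switched off, search(INDEX, "car", expand_terms=False) finds only the document that literally contains "car"; with it on, the document about the "automobile" industry comes back too. The index underneath is exactly the same keyword index in both cases. Note that this toy table is one-directional ("car" expands to "automobile" but not the reverse), which a real thesaurus layer would want to handle.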

Of course, there is a difference between what's possible and what's actually implemented in Google today. Mann's language confuses the two by stating (incorrectly) what's possible, using what's implemented as evidence. What's implemented today is the functionality in the Web search engine, but we shouldn't assume the same functionality will drive Google Print. This article uses rhetoric to stir the librarians up for their cause. But it does us a disservice by making false assumptions and obscuring the facts. There are arguments to be made for why libraries are still essential and relevant today. But rabble-rousing with partial truths isn't the way to make them.

4 comments:

Jenn Riley said...

Thanks for the support and the advice, Dorothea! Checking my blogger settings now...

That's a great point about the MARC records for the digitized books, Thom. As I recall, the Michigan contract with Google doesn't mention these records. That doesn't mean Google isn't using them, though. This would be interesting to find out!

Amy said...

Nice point, Jenn. Librarians need to stop being intimidated by computer-driven search engines, and start trying to find ways to enhance the generated results. We need to work with search engines instead of against them. After all, why do humans need to prove that we offer a different perspective than computers? It should be obvious.

Anonymous said...

This post was recommended for the Carnival of the Infosciences #6 which can be found at: http://tinyurl.com/chqkr

Anonymous said...

Good job! You really nail it, IL. -- Karen (Free Range Librarian)