Monday, May 29, 2006

An RDF Revelation

While doing some reading recently, I had an RDF revelation. I've long felt I didn't really get RDF. This time, the parts that sank in made a bit more sense. I'm not a convert in this particular religious war, but I do feel like I now understand both sides a bit better.

I've read the W3C RDF Primer before; several times, I think. The first thing that struck me this time was a simple fact I know I'd read before but had forgotten: an object can be either a URIref or a literal (a URI referencing a definition elsewhere, or a string containing a human-readable value). This means the strict machine-readable definition of things that RDF strives for is potentially only half there; only the predicate (the relationship between the subject and object) is expected to be a reference to a presumably shared source. I assume this option exists for ease of use. Building up an infrastructure that allows every value to be referenced rather than declared would certainly represent unreasonable startup time. This sort of thing is better done in an evolutionary fashion rather than forced to happen at the start; a reasonable decision on the part of RDF.
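To make that distinction concrete, here's a minimal sketch using Python's rdflib (the document and creator URIs are made-up placeholders in the style of the primer's examples, not real resources): the same subject gets one statement whose object is a plain literal and one whose object is a URIref pointing at something defined elsewhere.

```python
from rdflib import Graph, URIRef, Literal, Namespace

# Dublin Core element namespace; the resource URIs below are hypothetical.
DC = Namespace("http://purl.org/dc/elements/1.1/")

g = Graph()
doc = URIRef("http://www.example.org/index.html")

# Object as a literal: just a human-readable string, no shared definition.
g.add((doc, DC.title, Literal("Example Home Page")))

# Object as a URIref: a pointer to a (presumably shared) definition elsewhere.
g.add((doc, DC.creator, URIRef("http://www.example.org/staffid/85740")))

print(g.serialize(format="turtle"))
```

Only the predicates (dc:title, dc:creator) are guaranteed to come from a shared vocabulary here; the title never points anywhere at all.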

RDF contains some other constructs to make things easier, for example, blank nodes to group a set of nodes (or, in the words of the primer, to provide "the necessary connectivity"). Blank nodes are a further feature that lets entities go without formal identification. The primer discusses a case that uses a blank node to describe a person, rather than relying on a URI such as an email address as an identifier for that person. A convenient feature, certainly, but also a step away from the formal structures envisioned in Semantic Web Nirvana.
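The primer's person example looks roughly like this in rdflib terms (again a sketch; the URIs and the "terms" vocabulary are placeholders): the blank node groups the statements about the person without ever minting a global identifier for them.

```python
from rdflib import Graph, URIRef, Literal, BNode, Namespace

# Placeholder vocabulary, not a real shared RDF vocabulary.
EX = Namespace("http://www.example.org/terms/")

g = Graph()
doc = URIRef("http://www.example.org/index.html")

# The blank node stands in for the person; it has no URI of its own,
# so the person is never formally identified outside this graph.
person = BNode()
g.add((doc, EX.creator, person))
g.add((person, EX.fullName, Literal("John Smith")))
g.add((person, EX.mailbox, URIRef("mailto:johnsmith@example.org")))

print(g.serialize(format="turtle"))
```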

So now I'm looking at the whole XML vs. RDF discussion much more as a continuum than as opposing philosophical perspectives. The general tenor of RDF is that it expects everything to be declared in an extremely formal manner. But there are reasonable exceptions to that model, and RDF seems to make them. I'd argue now that both RDF and XML represent practical compromises. Both strive for interoperability in their own way. It's a question of degree: does one expect a metadata creator to check existing vocabularies, sources, and models for individual concepts (RDF-ish) or for entire resources (XML-ish)? I see the value of RDF for use in unpredictable environments. But I'm still not convinced our library applications are ready for it yet. The reality is that libraries are still, for the most part, sharing metadata in highly controlled environments where some human semantic understanding is present somewhere in the chain (even in big aggregations like OAIster). (Of course, if we had more machine-understandable data, that human step would be less essential...)

I'm a big champion of two-way sharing of metadata between library sources and the "outside world." I just don't think the applications that could make use of RDF metadata for this purpose are mature enough yet to make it worth the extra development time on my end. And, again, the reality is that it really would take significant extra development time for me. The metadata standards libraries use are overwhelmingly XML-based rather than RDF-based, and XML tools are much more mature than RDF tools. I fully understand the power of the RDF vision. But this is one area where I just can't be the one pushing the envelope to get there.

Monday, May 15, 2006

Whither book scanning

A recent New York Times Magazine article entitled Scan This Book! by Kevin Kelly is getting a lot of attention in the library world. The article surveys the current digitization landscape, discussing the Google book project among other initiatives, and describes both the potential benefits of and the current challenges to the grand vision of a digitized, hyperlinked world. I was especially glad to see the discussion centering not just on books, but on other forms of information and expression as well. However, library folk are starting in on our usual reactions to such pieces: finding factual errors, talking about how tags and controlled subjects aren't mutually exclusive, pointing out the economics of digitization efforts, and explaining how the digitization part is only the first step and the rest is much more difficult. All of these points are perfectly valid.

Yet even though these criticisms might be correct, I think we do ourselves a disservice by letting knee-jerk reactions to "outsiders" talking about our books take center stage. Librarians have a great deal to offer to the digitization discussion. We've done some impressive demonstrations of the potential for information resources in the networked world. Yet we don't have a corner on this particular market. Like any group with a long history, we can be pathetically short-sighted about the changes we're facing. I believe it would be a fatal mistake to think we can face this future alone. We have solid experience and many ideas to bring to Kelly's vision for the information future. However, we simply can't do it alone, and not just for economic reasons. We must be listening to other perspectives, just as we expect search engines, publishers, and others we might be working with to listen to ours. Let's keep our defensiveness in check and start a dialog with those who are interested in these efforts, instead of finding ways to criticize them.

Tuesday, May 09, 2006

On the theoretical and the practical

When I do metadata training, I make a point to talk about theoretical issues first, to help set the stage for why we make the decisions we do. Only then do I give practical advice on approaches to specific situations. I’m a firm believer in that old cliché about teaching a man to fish, and think that doing any digital library project involves creative decision-making, applying general principles rather than hard-and-fast rules.

Yet the feedback I get from these sessions frequently ranks the practical advice as the most useful part of the training. I struggle with how to structure these sessions, given the gap between what I think is important and what others find useful. I learned to make good metadata decisions first by implementing rules and procedures developed by others, and only later by developing those rules and procedures myself. It should make sense that others would learn the same way.

The difference is that I learned these methods over a long period of time. The training sessions I teach don’t claim to provide anyone with everything they would need to know to be a metadata expert. Instead, their goal is to give participants the tools they need to start thinking about their digital projects. I expect each of them will have many more questions, and ample opportunity to apply the theory presented to them, as they begin planning for digital projects. This is where I see the theoretical foundation for metadata decisions coming into play. I can’t possibly provide enough practical advice to meet every need in the room, but I can make a reasonable attempt to cover the theoretical issues that will help participants address those needs themselves.

I realize the theory (why we do things) can be an overwhelming introduction to the metadata landscape; without any practical grounding, it doesn’t make much sense. Yet I know it’s essential for planning even one digital project, much less many. I and many others out there need to continue improving the methods by which we train people to create consistent, high-quality, shareable metadata, finding the appropriate balance between laying a theoretical foundation and providing practical advice.