Tuesday, August 08, 2006

What about dirty OCR?

I often hear discussions as part of the digital project planning process about how best to approach full-text searching of documents. A common theme of these discussions is whether “dirty” (uncorrected, raw) OCR is acceptable. The “con” position tends to argue that OCR is only so accurate (say, 95%) and that the errors it makes can and will adversely affect searching. The “pro” position is that some access is better than none, and OCR is a relatively cheap option for providing that “some” access.

The con position has some convincing arguments. Providing any sort of full-text search implies very strongly that the search works, and if the error rate in the full text is more than negligible, that implied promise has arguably been broken. Error rates themselves can also be misleading. A colleague of mine likes to use the following (very effective, in my opinion) example, noting that error rates refer to characters, but we search with words:

Quick brown fix jumps ever the lazy dog.

In this case, there are two errors (“fix” for “fox” and “ever” for “over”) out of 40 characters (including spaces and the period), for a character accuracy rate of 95%. However, only 75% (6 of 8) of the words are correct in that example.
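For anyone who wants to check the arithmetic, here is a minimal sketch in Python. It assumes the intended text was "Quick brown fox jumps over the lazy dog." and that the OCR only substituted characters (nothing inserted or dropped), so the two strings can be compared position by position. The function names are mine, purely for illustration; a real evaluation would need an edit-distance comparison to cope with missing or extra characters.

```python
# Assumed ground truth and the OCR output from the example above.
reference = "Quick brown fox jumps over the lazy dog."
ocr_output = "Quick brown fix jumps ever the lazy dog."


def char_accuracy(ref, hyp):
    """Fraction of character positions that match (substitution errors only)."""
    if len(ref) != len(hyp):
        raise ValueError("this simple comparison needs strings of equal length")
    matches = sum(r == h for r, h in zip(ref, hyp))
    return matches / len(ref)


def word_accuracy(ref, hyp):
    """Fraction of whitespace-separated words that match exactly."""
    ref_words = ref.split()
    hyp_words = hyp.split()
    if len(ref_words) != len(hyp_words):
        raise ValueError("this simple comparison needs the same number of words")
    matches = sum(r == h for r, h in zip(ref_words, hyp_words))
    return matches / len(ref_words)


print(f"Character accuracy: {char_accuracy(reference, ocr_output):.0%}")  # 95%
print(f"Word accuracy: {word_accuracy(reference, ocr_output):.0%}")       # 75%
```

The same two wrong characters cost two percentage points at the character level but a quarter of the words, which is the number a searcher actually feels.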

So uncorrected OCR has some problems. But the costs of human editing of OCR-ed texts are high, too high to be a viable alternative in many situations. Double- and triple-keying (two or three humans manually typing in a text while looking at scanned images) tends to be cheaper than OCR with human editing, but these cost savings are typically achieved by outsourcing the work to third-world countries, raising ethical concerns for many. And both of the human-intervention options have a non-zero error rate of their own. No solution can reasonably yield completely error-free results.

I’ll argue that the appropriate choice lies, as always, in the details of the situation. How accurate can you expect the OCR to be for the materials in question? 90% vs. 95% vs. 99% makes a big difference. What sorts of funds are available for the project? Are there existing staff available for reassignment, or is there a pool of money available for paying for outsourcing? TEST all the available options with the actual materials needing conversion. Find out what accuracy rate can be achieved via OCR with all available software. Ask editing and double-keying vendors for samples of their work based on samples from the collection. Do a systematic analysis of the results. Don’t guess as to which way is better. Make a decision based on actual evidence, and make sure you get ample quantities of that evidence. Results from one page, or even ten pages, are not sufficient to make a reasoned decision. Use a larger sample, based on the size of the entire collection, to provide an appropriate testbed for making an informed choice between the available options. Too often we assume a small sample represents actual performance and accept quick support of our existing preferences as evidence of their superiority. To make good decisions about the balance of cost and accuracy, we must use all available information, including accurate performance measures from OCR and its alternatives.
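To see why 90% vs. 95% vs. 99% makes such a difference, here is a rough back-of-the-envelope sketch in Python. It rests on two simplifying assumptions of mine, not on measured data: that character errors occur independently of one another, and that a typical word is five characters long. Under those assumptions a word is correct only if every one of its characters is correct. Real OCR errors tend to cluster, so treat this as an illustration rather than a prediction, and measure your own materials.

```python
def expected_word_accuracy(char_accuracy, word_length=5):
    """Chance that every character in a word is right.

    Assumes independent character errors and a fixed word length;
    both are simplifications made for illustration only.
    """
    return char_accuracy ** word_length


for rate in (0.90, 0.95, 0.99):
    words = expected_word_accuracy(rate)
    print(f"{rate:.0%} of characters correct -> about {words:.0%} of words correct")
```

With these assumptions, 90% character accuracy leaves only about 59% of words searchable, 95% gives about 77% (roughly in line with the 75% in the example sentence earlier), and 99% gives about 95%.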