Tuesday, August 08, 2006

What about dirty OCR?

I often hear discussions as part of the digital project planning process about how best to approach full-text searching of documents. A common theme of these discussions is whether “dirty” (uncorrected, raw) OCR is acceptable. The “con” position tends to argue that OCR is only so accurate (say, 95%) and that the errors it makes can and will adversely affect searching. The “pro” position is that some access is better than none, and OCR is a relatively cheap option for providing that “some” access.

The con position has some convincing arguments. Providing any sort of full-text search strongly implies that the search works – and if the error rate in the full text is more than negligible, it could be said that the implied promise has been broken. Error rates themselves can also be misleading. A colleague of mine likes to use the following (very effective, in my opinion) example, noting that error rates refer to characters, but we search with words:

Quick brown fix jumps ever the lazy dog.

In this case, there are two character errors (“fix” for “fox” and “ever” for “over”) out of 40 characters (including spaces and the period), for an accuracy rate of 95%. However, only 75% (6 of 8) of the words in that example are correct.
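The arithmetic is easy to check. Here is a minimal sketch in Python, assuming the intended sentence is the familiar "Quick brown fox jumps over the lazy dog." and that the OCR made only substitution errors, so the two strings line up position by position. The function names are my own, not from any OCR toolkit.

```python
def char_accuracy(reference: str, ocr: str) -> float:
    """Fraction of character positions that match.

    Assumes substitution-only errors, so both strings have the same length.
    """
    if len(reference) != len(ocr):
        raise ValueError("strings must be the same length for this simple check")
    matches = sum(r == o for r, o in zip(reference, ocr))
    return matches / len(reference)


def word_accuracy(reference: str, ocr: str) -> float:
    """Fraction of whitespace-separated words that match exactly."""
    ref_words = reference.split()
    ocr_words = ocr.split()
    if len(ref_words) != len(ocr_words):
        raise ValueError("word counts must match for this simple check")
    matches = sum(r == o for r, o in zip(ref_words, ocr_words))
    return matches / len(ref_words)


reference = "Quick brown fox jumps over the lazy dog."  # assumed correct text
ocr = "Quick brown fix jumps ever the lazy dog."        # the example above

print(f"characters: {char_accuracy(reference, ocr):.0%}")  # characters: 95%
print(f"words:      {word_accuracy(reference, ocr):.0%}")  # words:      75%
```

Real OCR output also drops and inserts characters, so a real comparison would need an alignment step (edit distance, for instance) instead of this position-by-position match.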

So uncorrected OCR has some problems. But the costs of human editing of OCR-ed texts are high – too high to be a viable alternative in many situations. Double- and triple-keying (two or three humans manually typing in a text while looking at scanned images) tends to be cheaper than OCR with human editing, but these cost savings are typically achieved by outsourcing the work to third-world countries, which raises ethical concerns for many. And both of the human-intervention options have non-zero error rates of their own. No solution can reasonably yield completely error-free results.

I’ll argue that the appropriate choice lies, as always, in the details of the situation. How accurate can you expect the OCR to be for the materials in question? 90% vs. 95% vs. 99% makes a big difference. What sorts of funds are available for the project? Are there existing staff available for reassignment, or is there a pool of money available for paying for outsourcing? TEST all the available options with the actual materials needing conversion. Find out what accuracy rate can be achieved via OCR with all available software. Ask editing and double-keying vendors for samples of their work based on samples from the collection. Do a systematic analysis of the results. Don’t guess as to which way is better. Make a decision based on actual evidence, and make sure you get ample quantities of that evidence. Results from one page, or even ten pages, are not sufficient to make a reasoned decision. Use a larger sample, based on the size of the entire collection, to provide an appropriate testbed for making an informed choice between the available options. Too often we assume a small sample represents actual performance and accept quick support of our existing preferences as evidence of their superiority. To make good decisions about the balance of cost and accuracy, we must use all available information, including accurate performance measures from OCR and its alternatives.
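To see why 90% vs. 95% vs. 99% makes such a difference, here is a back-of-the-envelope sketch. It rests on two assumptions of mine that real OCR output will not exactly satisfy: character errors are independent of one another, and a typical word is five letters long. Under those assumptions, a word comes through intact only if every one of its characters does.

```python
def expected_word_accuracy(char_accuracy: float, word_length: int = 5) -> float:
    """Chance that every character in a word is correct.

    Assumes independent character errors; word_length=5 is an illustrative
    assumption, not a measured average.
    """
    return char_accuracy ** word_length


for rate in (0.90, 0.95, 0.99):
    print(f"{rate:.0%} of characters -> about {expected_word_accuracy(rate):.0%} of words")
```

With these assumptions, 90% character accuracy leaves only about 59% of words intact, 95% leaves about 77%, and 99% leaves about 95%. Actual results depend on the materials, which is one more reason to test with your own.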


Dorothea said...

It seems to me that some very basic efficiencies are available to the proofreading process that aren't being sufficiently exploited.

Concordancing, for example. It's dead simple for a computer to create a concordance of an OCR'ed text and do two very simple checks: a dictionary check, and an outlier check. (I envision the latter looking for low-occurrence words and checking them with Soundex or similar against higher-occurrence words, presenting "best guess" to a human editor for confirmation.)

Plus, of course, some of the usual scanno checks ("words" containing numbers or oddly-placed punctuation, 1-for-l and 0-for-o, and so on).
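A rough sketch of what those checks might look like in Python follows. Everything here is illustrative. The tiny word list stands in for a real dictionary, the thresholds are guesses, and I've used the standard library's `difflib` string similarity in place of Soundex (which isn't in the standard library) to find the "best guess".

```python
import re
from collections import Counter
from difflib import get_close_matches


def build_concordance(text: str) -> Counter:
    """Count each token (runs of letters, digits, apostrophes), lowercased."""
    return Counter(re.findall(r"[A-Za-z0-9']+", text.lower()))


def flag_suspects(text: str, dictionary: set, rare_max: int = 1, common_min: int = 3) -> dict:
    """Return suspect tokens with reasons and a best guess for a human to confirm.

    rare_max and common_min are illustrative thresholds, not tuned values.
    """
    counts = build_concordance(text)
    common = [w for w, n in counts.items() if n >= common_min]
    suspects = {}
    for word, n in counts.items():
        reasons = []
        # Dictionary check
        if word not in dictionary:
            reasons.append("not in dictionary")
        # Scanno check: letters and digits in one "word" (1-for-l, 0-for-o)
        if any(c.isdigit() for c in word) and any(c.isalpha() for c in word):
            reasons.append("letters mixed with digits")
        # Outlier check: a rare token that closely resembles a frequent one
        guess = None
        if n <= rare_max:
            close = get_close_matches(word, common, n=1, cutoff=0.7)
            if close:
                guess = close[0]
                reasons.append("rare word resembling a frequent one")
        if reasons:
            suspects[word] = {"count": n, "reasons": reasons, "best_guess": guess}
    return suspects


# Hypothetical sample text and word list, for illustration only
text = ("The library scanned the 1ibrary catalog. "
        "The library staff proofed the catalog. The library reopened.")
dictionary = {"the", "library", "scanned", "catalog", "staff", "proofed", "reopened"}

suspects = flag_suspects(text, dictionary)
for word, info in suspects.items():
    print(word, info)
```

On this sample the only token flagged is "1ibrary", with "library" offered as the guess. Nothing is corrected automatically. The point is to hand a human editor a short list to confirm or reject.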

How many scannos would that eliminate, and what would the time-cost be? I don't know, but I'm guessing that it's a LOT less time than proofing by hand -- and it could well be more accurate. How will we know unless we try?

Anonymous said...

There is some work out there which looks into the recall numbers when searching dirty OCR. I kept some of these references when doing a literature review for a study that never went ahead.

Anyway, from within the library domain, see:

"Measuring Search Retrieval Accuracy of Uncorrected OCR", Harvard, 2001

"Measuring the Accuracy of the OCR in the Making of America", University of Michigan, 1998

From the information science side, see:

"Evaluation of Model-Based Retrieval Effectiveness with OCR Text"

"Robust Retrieval of Noisy Text"

"Information Retrieval Can Cope With Many Errors"

"Usable OCR: What are the Minimum Performance Requirements?"

These articles discuss relative rates of recall with various qualities of OCR, and also discuss whether any post-processing of OCR, such as dictionary look-ups, helps (they generally conclude "no").


Jenn Riley said...

Excellent points, all, thank you!

Thanks for the citations, Aaron. It's great to have easy access to this kind of information when making decisions. I'll still push hard for testing, though. It's nice to know how a certain OCR engine performs on a specific set of materials, but your materials are different than those. Features like dictionary lookups and analysis of low-occurring patterns work extremely well on certain types of texts, and much less well on others.

Dorothea, you've really hit on something here with the suggestion to include guesses for human confirmation. We tend to assume that things like OCR (and automated metadata enhancement methods, the area where I've been thinking about this) work only by the machine doing its work, then a person fixing all the errors. We think this because that's pretty much how they all work now. But this process can and should be more iterative, using human feedback to continually improve the system. I hope to get a chance to try out some methods like this if a certain grant proposal gets funded this fall...

Eby said...

Setting up some kind of review process with something like Amazon's Mechanical Turk might also be an option.