04 January 2008

The Dangers of OCR

"Stoopid Computers."

For those who may not be aware, OCR refers to Optical Character Recognition and is usually part of a software program (Adobe Acrobat has an OCR option, as do some scanner software packages) or can be a standalone program like OmniPage or ABBYY FineReader. At its best, OCR is artificial intelligence on a very small scale, using pattern recognition algorithms that it may or may not be able to adjust on the fly. At its worst, it is a frustrating process which may actually add time to a project; I have read some studies that suggest simply hiring a good typist or data entry clerk to input the text instead, especially if the original is hand-written or poor quality.

At work, I use OmniPage, which is very robust and very good at learning as it goes but still only manages an accuracy rate of about 90-95%. Other readers are not so good; tests on Acrobat's OCR found it had only about 55-60% accuracy on English documents. Sometimes it is the fault of the copy -- smudges on the original or the use of multiple fonts in one document can confuse the software; sometimes it's the hardware or its operator -- dust on the scanbed or a crooked scan can instantly make a mess of things; and sometimes, well, it's a mystery.
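
For the curious, here is a rough sketch of how an accuracy figure like those could be computed. I don't know how the numbers above were actually measured, so this is an assumption on my part: it uses one common approach, character-level accuracy, which counts the minimum number of single-character edits (the Levenshtein distance) needed to turn the OCR output back into the known-correct text. The function names are my own and not part of any OCR package.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions separating two strings."""
    # previous[j] holds the distance between the reference prefix
    # processed so far and the first j characters of the hypothesis.
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(
                previous[j] + 1,         # drop a reference character
                current[j - 1] + 1,      # add a hypothesis character
                previous[j - 1] + cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """Fraction of the reference text the OCR got right (assumed metric)."""
    if not reference:
        raise ValueError("reference text must not be empty")
    errors = edit_distance(reference, hypothesis)
    # Clamp at zero: very noisy output can contain more errors than
    # the reference has characters.
    return max(0.0, 1.0 - errors / len(reference))


if __name__ == "__main__":
    score = character_accuracy("New York Times", "New Yak Times")
    print(f"{score:.1%}")
```

By this measure a 90% rate still means roughly one wrong character in every ten, which is why even a "good" scan can need a lot of cleanup.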

I have no idea what happened in this example from Neil Gaiman's Journal, but it sure is funny:

Incidentally, I think Amazon are using more Optical Character Recognition these days. At least, according to this description I just cut and pasted from their description of Sandman: The Doll's House, where I learned,

Excerpt - Back Cover: "... . ~~- ~. ~ . _ .. " N Neil Gaiman is the New Yak Times bat-sdrmg hWphor of Mnorkan rods ..."

And, of course, I am.
