16 October 2011

OCR on GNU/Linux: A survey

Today was spent checking out options for optical character recognition (OCR) on GNU/Linux. There are apparently the following basic engines for OCR:

Engine               Version  Licence
OpenOCR (Cuneiform)  1.0.0    BSD-style
GOCR                 0.49     GPLv2
Ocrad                0.20     GPLv3
Tesseract            2.04     Apache 2.0
ABBYY OCR            9.0      proprietary
OCR Shop XTR         5.6      proprietary

In June of last year Andreas Gohr did a short experiment where he compared the first five above-listed GNU/Linux OCR engines and found that ABBYY OCR had the highest accuracy, with 100% for proportionally spaced serif and sans-serif text; Tesseract was the best-performing Free Software package, with accuracy in the 92–98% range. GOCR and Ocrad were significantly worse, with accuracy as low as 76% and 82%, respectively.
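For reference, here is one way a character-level accuracy figure like those above could be computed from a known reference text and an engine's output. This is only a minimal sketch in Python: the function name `char_accuracy` is my own, and I am assuming a simple "matched characters divided by reference length" definition, which is not necessarily the metric Gohr used.

```python
from difflib import SequenceMatcher


def char_accuracy(reference, ocr_output):
    """Fraction of reference characters that also appear, in order, in the OCR output.

    Assumed definition (hypothetical, for illustration only):
    matched characters / length of the reference text.
    """
    if not reference:
        raise ValueError("reference text must not be empty")
    # autojunk=False stops difflib from ignoring very frequent characters
    matcher = SequenceMatcher(None, reference, ocr_output, autojunk=False)
    matched = sum(block.size for block in matcher.get_matching_blocks())
    return matched / len(reference)


if __name__ == "__main__":
    # One wrong character out of four leaves three matched characters.
    print(char_accuracy("abcd", "abxd"))
```

A stricter evaluation would use an edit distance so that spurious inserted characters are also penalised; the sketch above counts only what was recovered from the reference.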

The non-proprietary engines usually just have bare-bones command-line interfaces with a very limited feature set. There are a number of higher-level tools, often with graphical interfaces and allowing more sophisticated pre- and post-processing of the data. These tools include the following:

Front end     Version  Licence     Back ends                                   Notes
easy-ocr      3.4      BSD-style   Cuneiform, GOCR, Ocrad, OCRopus, Tesseract  apparently available only as a Debian binary package
gImageReader  0.9      GPLv3       Tesseract
OCRFeeder     0.6.6    GPLv3       GOCR, Ocrad, Tesseract
ocrodjvu      0.7.5    GPLv2       Cuneiform, GOCR, Ocrad, OCRopus, Tesseract
OCRopus       0.4      Apache 2.0  Tesseract
pdfocr        0.1.2    BSD-style   Cuneiform
WatchOCR      0.8      GPL         Cuneiform                                   available only as a Debian binary package or Knoppix LiveCD

Since I intend to use Free Software wherever possible for this project, I installed Tesseract and a GPLv3-licensed graphical front end, gImageReader. (Rather than compiling from source, I used Malcolm Lewis's openSUSE RPMs. These RPMs fail to specify all the dependencies; they also require the python-imaging and python-enchant packages to be installed.) I then tried processing a couple of pages from the Standard as test runs: the front page of the September 1904 issue, and the third page of the July 1961 issue. The former is a relatively poor-quality scan, whereas the latter is quite clean and has a simple layout. You can see the results in the screenshots below: the text OCR'd from the 1904 issue is almost complete gibberish, whereas the text for the 1961 issue is mostly correct (though still with lots of mistakes).

Unfortunately, the gImageReader interface produces only plain text as its output, which is useless for my purposes. What I need is selectable, searchable text that is mapped to the position where it appears in the original scan. Apparently, there is an open standard, hOCR, for representing text layout, recognition confidence, style, and other OCR information. Tesseract, Cuneiform, and other OCR packages can output to hOCR. The problem is that the hOCR file doesn't itself contain the original scanned image; you need some extra software to produce (say) a PDF which combines the text information with the original image. Only then will you have a searchable PDF.
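To make the text-to-position mapping concrete: hOCR is ordinary HTML in which each recognised element carries a class such as ocr_line or ocrx_word and a title attribute holding properties like "bbox x0 y0 x1 y1" (pixel coordinates in the scan). The following is a minimal sketch in Python, using only the standard library, of pulling the words and their bounding boxes out of such a file. The sample markup and the names `SAMPLE_HOCR`, `WordBoxParser`, `parse_bbox`, and `extract_words` are made up for illustration; real engine output is more deeply nested and may differ in detail.

```python
from html.parser import HTMLParser

# Hypothetical hOCR fragment, hand-written for this example.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 800 600'>
 <span class='ocr_line' title='bbox 10 20 300 50'>
  <span class='ocrx_word' title='bbox 10 20 120 50; x_wconf 93'>Hello</span>
  <span class='ocrx_word' title='bbox 130 20 300 50; x_wconf 88'>world</span>
 </span>
</div>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title string, or None if absent."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class WordBoxParser(HTMLParser):
    """Collects (text, bbox) pairs for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # bbox of the word element currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocrx_word" in classes:
            self._bbox = parse_bbox(attrs.get("title") or "")

    def handle_data(self, data):
        text = data.strip()
        if self._bbox is not None and text:
            self.words.append((text, self._bbox))
            self._bbox = None

    def handle_endtag(self, tag):
        # Leaving any element ends the current word, if one was open.
        self._bbox = None


def extract_words(hocr):
    parser = WordBoxParser()
    parser.feed(hocr)
    return parser.words


if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE_HOCR):
        print(text, bbox)
```

Something along these lines is what a PDF-producing tool has to do internally: read each word's box, then place invisible text at the matching position over the page image.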

It turns out that only some of the engines and front ends support hOCR, and of the Free Software front ends, only two add text layers to PDFs: WatchOCR and pdfocr. The ocrodjvu wrapper produces DjVu files instead of PDFs. My next task will therefore be to install and test WatchOCR, pdfocr, and ocrodjvu. I may also try out some proprietary packages for comparison.