Showing posts with label ABBYY OCR. Show all posts

03 September 2012

hOCR-capable OCR programs

As indicated in my last posting, I tested various OCR programs which output either to hOCR or directly to PDF. For the ones which output hOCR, I tried producing a PDF with the text layer hidden underneath the image using hocr2pdf, a Free Software tool which the creators do a very good job of preventing you from finding. There seems to be absolutely nowhere on their website to download it, either in source or binary form. Fortunately, the source seems to be available on a few third-party download sites, and users at the openSUSE Build Service have posted RPMs.

Anyway, I tested each OCR package on two pages, one from a 1904 issue and one from a 1961 issue. My findings are summarized as follows:

Cuneiform
Cuneiform seemed to do a decent job at OCR, at least as far as character matching went, but the hOCR it produced didn't work well with hocr2pdf. This was despite my using an earlier version of Cuneiform, as instructed by a post on the DIY Book Scanner forums which an anonymous commenter referred me to.
Tesseract
Tesseract's output was almost as good as Cuneiform's, and moreover the hOCR was digestible by hocr2pdf.
Adobe Acrobat Professional
I have access to this through my workplace. Accuracy was similar to the above two Free Software packages, but the user interface doesn't support batch processing. I've got hundreds of issues to process, so OCRing them one at a time in a GUI isn't an option.
ABBYY OCR
I activated a trial version of ABBYY's command-line OCR package. The accuracy was by far the highest of any of the suites I tested. However, it's proprietary and also very expensive software; in order to process the complete Standard archive I'd need to buy a €999 licence.

None of the above OCR programs seemed to recognize the column layout of the newspaper. It's therefore not possible to use the text selection tool in the resulting PDF to copy and paste more than one line of a column at a time. However, at least the PDF will be searchable (modulo the character recognition errors).

I've therefore settled on Tesseract. I set up a batch processing job and estimate it will take about 20 to 30 hours to do the whole archive.
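A batch job along these lines can be sketched in Python. This is a minimal sketch, not the exact job I ran: it assumes one TIFF per page, a Tesseract build whose `hocr` config file is available, and an hocr2pdf that reads the hOCR on standard input. The function names and the `scans/` and `pdf/` directories mentioned below are hypothetical.

```python
import subprocess
from pathlib import Path


def tesseract_cmd(image, out_base):
    # The trailing "hocr" names a Tesseract config file that requests hOCR output.
    return ["tesseract", str(image), str(out_base), "hocr"]


def hocr2pdf_cmd(image, pdf):
    # hocr2pdf takes the page image via -i, writes the PDF named by -o,
    # and reads the hOCR from standard input.
    return ["hocr2pdf", "-i", str(image), "-o", str(pdf)]


def plan_batch(images, out_dir):
    """Return (out_base, ocr_cmd, pdf_cmd) for each page image, in sorted order."""
    out_dir = Path(out_dir)
    jobs = []
    for image in sorted(Path(p) for p in images):
        base = out_dir / image.stem
        pdf = out_dir / (image.stem + ".pdf")
        jobs.append((base, tesseract_cmd(image, base), hocr2pdf_cmd(image, pdf)))
    return jobs


def run_job(base, ocr_cmd, pdf_cmd):
    """OCR one page, then merge the hOCR and the image into a searchable PDF."""
    subprocess.run(ocr_cmd, check=True)
    # Assumption: the hOCR lands at base + ".hocr" or base + ".html",
    # depending on the Tesseract version.
    candidates = [Path(str(base) + ext) for ext in (".hocr", ".html")]
    hocr = next((p for p in candidates if p.exists()), None)
    if hocr is None:
        raise FileNotFoundError(f"no hOCR output found for {base}")
    with open(hocr, "rb") as fh:
        subprocess.run(pdf_cmd, stdin=fh, check=True)
```

Calling `run_job(*job)` for each entry of `plan_batch(Path("scans").glob("*.tif"), "pdf")` would then process every page in turn. Keeping the command construction separate from the execution makes it easy to print the planned commands and check them before committing to a run of many hours.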

One difficulty I foresee is that I don't think hocr2pdf works on the output of jbig2enc. I may need to use hocr2pdf to create an uncompressed PDF with hidden text, and then reprocess it using pdfsizeopt, which integrates jbig2.

16 October 2011

OCR on GNU/Linux: A survey

Today was spent checking out options for optical character recognition (OCR) on GNU/Linux. There are apparently the following basic engines for OCR:

Engine               Version  Licence
OpenOCR (Cuneiform)  1.0.0    BSD-style
GOCR                 0.49     GPLv2
Ocrad                0.20     GPLv3
Tesseract            2.04     Apache 2.0
ABBYY OCR            9.0      proprietary
OCR Shop XTR         5.6      proprietary

In June of last year Andreas Gohr did a short experiment where he compared the first five above-listed GNU/Linux OCR engines and found that ABBYY OCR had the highest accuracy, with 100% for proportionally spaced serif and sans-serif text; Tesseract was the best-performing Free Software package, with accuracy in the 92–98% range. GOCR and Ocrad were significantly worse, with accuracy as low as 76% and 82%, respectively.
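How those percentages were scored isn't stated here. One common way to put a number on OCR accuracy is character accuracy: one minus the edit distance between the OCR output and a hand-corrected reference, divided by the length of the reference. The sketch below illustrates that measure only and is an assumption on my part, not necessarily the method used in the experiment; both function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def char_accuracy(reference, ocr_text):
    """Fraction of the reference recovered by the OCR text, clamped at zero."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return max(0.0, 1 - edit_distance(reference, ocr_text) / len(reference))
```

For example, an OCR result that gets one character wrong in a four-character reference scores 0.75.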

The non-proprietary engines usually just have bare-bones command-line interfaces with a very limited feature set. There are a number of higher-level tools, often with graphical interfaces and allowing more sophisticated pre- and post-processing of the data. These tools include the following:

Front end     Version  Licence     Back ends                                   Notes
easy-ocr      3.4      BSD-style   Cuneiform, GOCR, Ocrad, OCRopus, Tesseract  apparently available only as a Debian binary package
gImageReader  0.9      GPLv3       Tesseract
OCRFeeder     0.6.6    GPLv3       GOCR, Ocrad, Tesseract
ocrodjvu      0.7.5    GPLv2       Cuneiform, GOCR, Ocrad, OCRopus, Tesseract
OCRopus       0.4      Apache 2.0  Tesseract
pdfocr        0.1.2    BSD-style   Cuneiform
WatchOCR      0.8      GPL         Cuneiform                                   available only as a Debian binary package or Knoppix LiveCD

Since it's my intention to use Free Software wherever possible for this project, I installed Tesseract and a GPLv3-licensed graphical front end, gImageReader. (Rather than compiling from source, I used Malcolm Lewis's openSUSE RPMs. These RPMs fail to specify all the dependencies; they require the presence of the python-imaging and python-enchant packages.) I then tried processing a couple of pages from the Standard as test runs: the front page of the September 1904 issue, and the third page of the July 1961 issue. The former is a relatively poor-quality scan, and the latter is quite clean and has a simple layout. You can see the results in the screenshots below: the text OCR'd from the 1904 issue is almost complete gibberish, whereas the text for the 1961 issue is mostly correct (though still with lots of mistakes).

Unfortunately, the gImageReader interface produces only plain text as its output, which is useless for my purposes. What I need is for there to be a mapping of selectable, searchable text to the position it appears at in the original scan. Apparently, there is an open standard, hOCR, for representing text layout, recognition confidence, style, and other OCR information. Tesseract, Cuneiform, and other OCR packages can output to hOCR. The problem is that the hOCR file doesn't itself contain the original scanned image; for this you need some extra software to produce (say) a PDF which combines the text information and the original image. Only then will you have a searchable PDF.
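To make the idea of a text-to-position mapping concrete: hOCR is ordinary HTML in which each recognized element carries a class such as `ocrx_word` and a `title` attribute holding properties like `bbox x0 y0 x1 y1` in image pixels. The sketch below pulls the words and their boxes out of a tiny hand-written fragment using only the Python standard library. The sample markup and the names `WordBoxes` and `parse_bbox` are made up for illustration, and real engine output may be structured differently.

```python
from html.parser import HTMLParser

# Hand-written fragment for illustration, not real engine output.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 2480 3508'>
  <span class='ocr_line' title='bbox 120 200 520 240'>
    <span class='ocrx_word' title='bbox 120 200 310 240'>Hello</span>
    <span class='ocrx_word' title='bbox 330 200 520 240; x_wconf 91'>world</span>
  </span>
</div>
"""


def parse_bbox(title):
    # The title holds ";"-separated properties, e.g. "bbox x0 y0 x1 y1; x_wconf 91".
    for prop in title.split(";"):
        parts = prop.split()
        if parts and parts[0] == "bbox":
            return tuple(int(n) for n in parts[1:5])
    return None


class WordBoxes(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word element."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None  # box of the word span currently open, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            self._bbox = parse_bbox(attrs.get("title") or "")

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None

    def handle_endtag(self, tag):
        if tag == "span":
            self._bbox = None  # never carry a box over to a later element


parser = WordBoxes()
parser.feed(SAMPLE_HOCR)
print(parser.words)
```

A tool that builds a searchable PDF does essentially this, then places each word invisibly at its box on top of the scanned image.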

It turns out that only some of the engines and front ends support hOCR, and of the Free Software front ends, only two of them add text layers to PDFs: WatchOCR and pdfocr. The ocrodjvu wrapper produces DjVu files instead of PDFs. My next task will therefore be to install and test WatchOCR, pdfocr, and ocrodjvu. I may also try out some proprietary packages for purposes of comparison.