As indicated in my last posting, I tested various OCR programs which output either to hOCR or directly to PDF. For the ones which output hOCR, I tried producing a PDF with the text layer hidden underneath the image using hocr2pdf, a Free Software tool which the creators do a very good job of preventing you from finding. There seems to be absolutely nowhere on their website to download it, either in source or binary form. Fortunately, the source seems to be available on a few third-party download sites, and users at the openSUSE Build Service have posted RPMs.
Anyway, I tested each OCR package on two pages, one from a 1904 issue and one from a 1961 issue. My findings are summarized as follows:
- Cuneiform seemed to do a decent job at OCR, at least as far as character matching went, but the hOCR it produced didn't work well with hocr2pdf. This was despite my using an earlier version of Cuneiform, as instructed by a post on the DIY Book Scanner forums which an anonymous commenter referred me to.
- Tesseract's output was almost as good as Cuneiform's, and moreover the hOCR was digestible by hocr2pdf.
- Adobe Acrobat Professional: I have access to this through my workplace. Accuracy was similar to the above two Free Software packages, but the user interface doesn't support batch processing. I've got hundreds of issues to process, so OCRing them one at a time in a GUI isn't an option.
- ABBYY OCR: I activated a trial version of ABBYY's command-line OCR package. The accuracy was by far the highest of any of the suites I tested. However, it's proprietary and also very expensive software; in order to process the complete Standard archive I'd need to buy a €999 licence.
None of the above OCR programs seemed to recognize the column layout of the newspaper. It's therefore not possible to use the text selection tool in the resulting PDF to copy and paste more than one line of a column at a time. However, at least the PDF will be searchable (modulo the character recognition errors).
I've therefore settled on Tesseract. I set up a batch processing job and estimate it will take about 20 to 30 hours to do the whole archive.
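For anyone wanting to set up something similar, here is a minimal sketch of what such a batch job could look like in Python. It is not the exact job I ran. The directory layout, the `*.tif` pattern and all function names are made up for illustration. It assumes `tesseract` and `hocr2pdf` are on the PATH, that Tesseract is invoked with its `hocr` config, and that hocr2pdf reads the hOCR on standard input. Depending on the Tesseract version, the hOCR file ends in `.hocr` or `.html`, so the sketch checks for both.

```python
import subprocess
from pathlib import Path


def tesseract_command(image: Path, out_base: Path) -> list:
    # "hocr" is the Tesseract config name that switches the output to hOCR.
    return ["tesseract", str(image), str(out_base), "hocr"]


def hocr2pdf_command(image: Path, pdf: Path) -> list:
    # hocr2pdf takes the page image via -i and reads the hOCR from stdin.
    return ["hocr2pdf", "-i", str(image), "-o", str(pdf)]


def find_hocr(out_base: Path) -> Path:
    # Older Tesseract releases write <base>.html, newer ones <base>.hocr.
    for suffix in (".hocr", ".html"):
        candidate = out_base.parent / (out_base.name + suffix)
        if candidate.exists():
            return candidate
    raise FileNotFoundError(f"no hOCR output found for {out_base}")


def process_page(image: Path, out_dir: Path) -> Path:
    # OCR one page image, then wrap image plus hidden text into a PDF.
    out_base = out_dir / image.stem
    pdf = out_dir / (image.stem + ".pdf")
    subprocess.run(tesseract_command(image, out_base), check=True)
    with find_hocr(out_base).open("rb") as hocr:
        subprocess.run(hocr2pdf_command(image, pdf), stdin=hocr, check=True)
    return pdf


def run_batch(scan_dir: Path, out_dir: Path, pattern: str = "*.tif") -> list:
    # Process every matching image; pages that already have a PDF are
    # skipped, so an interrupted run can simply be restarted.
    out_dir.mkdir(parents=True, exist_ok=True)
    pdfs = []
    for image in sorted(scan_dir.glob(pattern)):
        pdf = out_dir / (image.stem + ".pdf")
        if not pdf.exists():
            process_page(image, out_dir)
        pdfs.append(pdf)
    return pdfs
```

Calling `run_batch(Path("scans"), Path("out"))` would then work through a (hypothetical) `scans` directory one page at a time. Skipping finished pages matters for a job that runs for a day or more, since it will almost certainly be interrupted at some point.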
One difficulty I foresee is that I don't think hocr2pdf works on the output of jbig2enc. I may need to use hocr2pdf to create an uncompressed PDF with hidden text, and then reprocess it using pdfsizeopt, which integrates jbig2.
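The two-stage workaround could be sketched roughly as follows. This is untested, and in particular I have not yet checked whether the hidden text layer survives the pdfsizeopt pass. The function names and the `.raw.pdf` suffix are made up. The sketch assumes hocr2pdf reads the hOCR on standard input and that pdfsizeopt takes an input and an output file name, with no extra flags, finding the jbig2 tool on its own.

```python
import subprocess
from pathlib import Path


def two_stage_commands(image: Path, out_dir: Path) -> list:
    # Stage 1 writes an intermediate PDF with the hidden text layer;
    # stage 2 recompresses that PDF into the final file.
    raw_pdf = out_dir / (image.stem + ".raw.pdf")
    final_pdf = out_dir / (image.stem + ".pdf")
    return [
        ["hocr2pdf", "-i", str(image), "-o", str(raw_pdf)],
        ["pdfsizeopt", str(raw_pdf), str(final_pdf)],
    ]


def build_compressed_pdf(image: Path, hocr: Path, out_dir: Path) -> Path:
    embed, shrink = two_stage_commands(image, out_dir)
    with hocr.open("rb") as hocr_file:
        # hocr2pdf is assumed to read the hOCR from standard input.
        subprocess.run(embed, stdin=hocr_file, check=True)
    subprocess.run(shrink, check=True)
    Path(embed[-1]).unlink()  # drop the large intermediate PDF
    return Path(shrink[-1])
```

Keeping the command construction separate from the code that runs it makes it easy to print the commands first and try them by hand on a single page before committing to the whole archive.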