16 October 2011

OCR on GNU/Linux: A survey

Today was spent checking out options for optical character recognition (OCR) on GNU/Linux. There are apparently the following basic engines for OCR:

Engine               Version  Licence
OpenOCR (Cuneiform)  1.0.0    BSD-style
Tesseract            2.04     Apache 2.0
ABBYY OCR            9.0      proprietary
OCR Shop XTR         5.6      proprietary

In June of last year Andreas Gohr did a short experiment in which he compared five GNU/Linux OCR engines. He found that ABBYY OCR had the highest accuracy, scoring 100% on proportionally spaced serif and sans-serif text; Tesseract was the best-performing Free Software package, with accuracy in the 92–98% range. GOCR and Ocrad were significantly worse, with accuracies as low as 76% and 82%, respectively.

The non-proprietary engines usually offer just bare-bones command-line interfaces with a very limited feature set. There are, however, a number of higher-level tools, often with graphical interfaces, that allow more sophisticated pre- and post-processing of the data. These tools include the following:

Front end  Version  Licence     Back ends                                   Notes
easy-ocr   3.4      BSD-style   Cuneiform, GOCR, Ocrad, OCRopus, Tesseract  apparently available only as a Debian binary package
OCRFeeder  0.6.6    GPLv3       GOCR, Ocrad, Tesseract
ocrodjvu   0.7.5    GPLv2       Cuneiform, GOCR, Ocrad, OCRopus, Tesseract
OCRopus    0.4      Apache 2.0  Tesseract
WatchOCR   0.8      GPL         Cuneiform                                   available only as a Debian binary package or Knoppix LiveCD

Since it's my intention to use Free Software wherever possible for this project, I installed Tesseract and a GPLv3-licensed graphical front end, gImageReader. (Rather than compiling from source, I used Malcolm Lewis's openSUSE RPMs. These RPMs fail to specify all their dependencies; they also require the python-imaging and python-enchant packages.) I then tried processing a couple of pages from the Standard as test runs: the front page of the September 1904 issue, and the third page of the July 1961 issue. The former is a relatively poor-quality scan, while the latter is quite clean and has a simple layout. You can see the results in the screenshots below: the text OCR'd from the 1904 issue is almost complete gibberish, whereas the text for the 1961 issue is mostly correct (though still with lots of mistakes).

Unfortunately, the gImageReader interface produces only plain text as its output, which is useless for my purposes. What I need is a mapping from selectable, searchable text to the position where it appears in the original scan. There is apparently an open standard, hOCR, for representing text layout, recognition confidence, style, and other OCR information, and Tesseract, Cuneiform, and other OCR packages can output to it. The problem is that an hOCR file doesn't itself contain the original scanned image; for that you need some extra software to produce (say) a PDF which combines the text information with the original image. Only then do you have a searchable PDF.
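For illustration, here's a minimal sketch of what consuming hOCR data might look like: each word element carries its pixel coordinates in a title attribute, which a few lines of standard-library Python can pull out. The ocrx_word class and the bbox property come from the hOCR spec; the sample markup below is invented for illustration, not real Tesseract output.

```python
# Sketch: extract (word, bounding box) pairs from hOCR markup using
# only the standard library.
from html.parser import HTMLParser

class HocrWords(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) pairs from hOCR word spans."""
    def __init__(self):
        super().__init__()
        self.words = []      # [(text, bbox), ...]
        self._bbox = None    # bbox of the word span currently open

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if a.get("class") == "ocrx_word" and "title" in a:
            # The title looks like "bbox 393 432 470 462; x_wconf 85".
            for part in a["title"].split(";"):
                fields = part.split()
                if fields and fields[0] == "bbox":
                    self._bbox = tuple(int(n) for n in fields[1:5])

    def handle_data(self, data):
        if self._bbox is not None and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None

sample = ('<span class="ocrx_word" title="bbox 393 432 470 462; x_wconf 85">'
          'Standard</span>')
p = HocrWords()
p.feed(sample)
print(p.words)   # [('Standard', (393, 432, 470, 462))]
```

With coordinates like these in hand, the remaining work is exactly the missing piece described above: painting an invisible text layer at those positions over the scanned image.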

It turns out that only some of the engines and front ends support hOCR, and of the Free Software front ends, only two of them add text layers to PDFs: WatchOCR and pdfocr. The ocrodjvu wrapper produces DjVu files instead of PDFs. My next task will therefore be to install and test WatchOCR, pdfocr, and ocrodjvu. I may also try out some proprietary packages for purposes of comparison.

29 September 2011

January 1968 and 1969

I first reported in March 2010 that the microfiche scans provided by LSE are missing the cover pages for the January 1968 and 1969 issues. Before I left London I visited the Socialist Party of Great Britain headquarters and picked up a copy of the January 1969 issue. The archive had no more loose copies of the January 1968 issue, so I had to photocopy its cover from a bound volume. Unfortunately, this meant that the page was cropped along the binding edge, but it's better than no page at all.

I had also mentioned in March 2010 that the aspect ratio of the actual printed issues doesn't seem to correspond to that of the LSE scans. This is something I had forgotten about when I finally scanned these issues in yesterday, and it caused me no end of confusion when I saw the discrepancy between what I had scanned myself and what all the other scans looked like. Below are two images for comparison: the left shows my scan of the January 1969 issue cover, and the right is the same image stretched to fit the aspect ratio of the LSE scans.

Since all the other scans are distorted in this manner, I'm faced with a choice: similarly distort my scans to match them, or disproportionately scale all the LSE scans to match the physical page size. Much as I would like my archive to be as faithful as possible a copy of the printed issues, I am going to have to go with the first option, at least for now. The LSE scans aren't at a high enough resolution to withstand much image manipulation, and besides, altering them would necessitate another careful inspection of each page to ensure that there are no further unpaper anomalies.
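The distortion itself is just a one-dimensional rescale: keep the width, and stretch the height until the width-to-height ratio matches that of the LSE scans. A sketch of the arithmetic, with invented pixel dimensions (the actual dimensions of my scans and the LSE scans aren't given above, so the numbers below are placeholders):

```python
# Sketch: compute the new height needed to stretch a correctly
# proportioned scan to the aspect ratio of the LSE scans.

def match_aspect(width, height, target_w, target_h):
    """Return the height that gives (width x new_height) the same
    width:height ratio as target_w:target_h, keeping width fixed."""
    return round(width * target_h / target_w)

# Placeholder figures: suppose my scan is 2400x3600 px but the LSE
# scans are 1000x1400 px (ratio 5:7).  Matching them means changing
# my scan's height from 3600 px to:
new_h = match_aspect(2400, 3600, 1000, 1400)
print(new_h)   # 3360 -- i.e. the page gets squashed slightly
```

Going the other way (option two above) would use the same function with the roles of the two scans swapped, which is exactly the resampling of the low-resolution LSE images I'd rather avoid.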

28 September 2011

jbig2enc seeks a new maintainer

I just received the following message from Adam Langley, maintainer of jbig2enc:
I wrote jbig2enc many years ago and I'm aware that it's been very useful to some people, for which I'm glad. But I'm afraid that I simply don't have the time to maintain it any more. Please feel free to fork it in the usual open source style. If you think you have a sufficiently well maintained fork, let me know and I'll start directing people to it.
As someone who has found jbig2enc quite valuable, I just want to say thanks to Adam, and I hope that someone reading this blog will decide to take over the good work he's been doing!

25 September 2011

Upsampling results

In June I thought I had completed cropping, but upon reviewing the covers I spotted a further three with problems. It's possible that some of the inner pages were likewise overlooked, though I'll leave that rather tedious inspection for later (or perhaps to eagle-eyed volunteers, if I can get any!)…

Also in June I indicated that I had used jbig2enc to produce three PDFs for each of the issues I have up to the end of 1969: one with no upsampling, one with 2× upsampling, and one with 4× upsampling. The sizes of the three collections are as follows:

upsampling  size (MB)

So even at the greatest upsampling the collection is still easily small enough to fit on a DVD-ROM. However, from browsing the PDF covers, it seems a lot of the 4× upsampled issues are missing text (similar to how unpaper sometimes omits text on pages with low contrast or a close black border). I figure this could be fixed by playing with jbig2enc's threshold settings, though that could take several hours or days of experimentation and careful checking of the output.
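If I do get around to that experimentation, it will presumably amount to sweeping jbig2enc's classification threshold and comparing the outputs page by page. A sketch of how I might generate the command lines for such a sweep; the flags used (-s for symbol coding, -p for PDF-ready output, -b for the output basename, -t for the classification threshold, -2/-4 for upsampling) reflect my reading of jbig2enc's usage text and should be verified against the tool itself:

```python
# Sketch: build one jbig2enc command line per candidate threshold so
# that the resulting PDFs can be compared for missing text.  The flag
# names are assumptions based on jbig2enc's usage text, not verified
# against any particular version.

def jbig2_cmd(pages, threshold, upsample=4, basename="output"):
    """Return a jbig2enc argv list for the given pages and threshold."""
    cmd = ["jbig2", "-s", "-p", "-b", basename, "-t", threshold]
    if upsample in (2, 4):
        cmd.append("-%d" % upsample)   # jbig2enc's upsampling switches
    return cmd + list(pages)

for t in ("0.80", "0.85", "0.90"):
    print(" ".join(jbig2_cmd(["page-001.png", "page-002.png"], t)))
```

Each command would be run (and its output wrapped into a PDF) separately, after which the tedious part begins: checking every page of every variant for dropped text.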

The PDFs with 2× upsampling seem to be of good quality—much better than the non-upsampled ones, and not obviously worse than the 4× upsampled ones. And none of the covers had any obviously missing text; I hope this holds for the inside pages as well. I think, then, that it will be best to go with 2× upsampling. You can see some comparisons below.

No upsampling

2× upsampling

4× upsampling

My next steps will therefore be as follows:

  • Locate and scan the missing covers of the January 1968 and January 1969 issues. I got photocopies of these pages before I left London; I hope they didn't get lost in the move to Darmstadt!
  • Start investigating OCR software so that the final collection can have full-text search.
  • Start investigating DjVu to compare it with the PDFs.

A series of unfortunate events

This spring I accepted a new job in Darmstadt, with a starting date in July, so in late May I quit my old job in London, hoping to use the intervening time (except for that spent moving) to finish the digitization of at least the 1904–1970 issues. This plan was thwarted by a series of unfortunate events. June saw one legal, one medical, and one veterinary emergency, which together consumed all my available time. When I finally arrived in Darmstadt, nearly two months passed between my signing up for Internet access and the ISP coming to install the cables. And to top it all off, shortly after we got connected, my laptop was irreparably damaged, so I had to procure a new computer on short notice and transfer all my data onto it.

The new computer is all set up now, and the Internet connection is the fastest I've ever had. Here's a comparison of the specs of my old machine and the new one I'll be working with:

           sable (old machine)              ferret (new machine)
CPU        Intel Core2 Duo T8300 @ 2.4 GHz  AMD Athlon II X2 260 @ 3.2 GHz
Hard disk  250 GB SATA                      500 GB SATA II
Display    39 cm TFT @ 1440×900 (WSXGA)     61 cm TFT @ 1920×1080 (1080p)
Graphics   Intel 965 GM                     AMD Radeon HD3000
OS         openSUSE 11.3                    openSUSE 11.4

So as you can see, the new machine is a modest improvement on the old one in every respect: it's got a faster CPU and faster memory, a larger and faster hard drive, a larger and higher-resolution display, a better graphics card (not that that matters much for the 2D imaging work this project is concerned with), and a newer operating system.

Because the new machine's architecture is slightly different, and because it's a completely new installation of the operating system, I'm going to have to recompile jbig2enc, Leptonica, and various support utilities I've written myself. After that I can get back to work, so watch this space for updates in the (hopefully) very near future…

05 June 2011

Cropping complete

As reported in March 2010, the microfiche images omit the cover pages for the January 1968 and January 1969 issues. I visited the Socialist Party of Great Britain's head office yesterday and picked up a spare copy of the January 1969 issue. There were no extra copies of the January 1968 issue remaining, but it did appear in the bound volumes in the archive, so I took a photocopy. The binding obscures about a centimetre of the left edge of the page, but my copy is better than nothing, I guess. Now my only problem is getting these two pages digitized, as I no longer have access to a scanner. I'll either have to find someone with a scanner, or see if I can photograph the covers myself.

With most of the pages now in place, it's time to start thinking again about how to "bind" them into PDF or DjVu documents. Since it's been a year since I last experimented with this, I downloaded the latest version of jbig2enc and its dependency, Leptonica. I discovered that jbig2enc doesn't compile with Leptonica 1.68, but only because the parameters to the findFileFormat() function have changed. This function is referenced once, in jbig2.cc, where it's used to check something involving multi-page TIFFs. I don't use jbig2enc to process TIFFs so I just commented out these lines, and then jbig2enc compiled fine.

My computer is now whirring away, generating three PDFs for each of the issues that I have up to the end of 1969: one with no upsampling, one with 2× upsampling, and one with 4× upsampling. It will probably be busy doing this all night. Once it's done, I'll examine the results to see what looks the best and what the file sizes are like. Watch this space for further analysis of the results…

02 June 2011

unpaper revisited

The last few days have been spent reviewing and revising the results I obtained with unpaper in March of last year. After double-checking the image output, I found I had missed some cases where unpaper had failed to process the images properly. After applying the appropriate command-line options discussed previously, I was able to get unpaper to process most of these correctly; the rest I added to the list of images which unpaper cannot process. I also double-checked this list, which originally had 142 images; I found that many of them could be processed successfully with a little more experimentation with command-line options.

In the end, there was a net increase of seven images to the list, so it now contains 149 images. These cases are almost exclusively pages with illustrations (usually cover pages). I will now have to do some preliminary tests to determine whether it would be more efficient to crop these images manually or use my autocrop tool.

31 May 2011

PDF viewing woes: update

In March 2010 I reported on a couple of problems viewing PDFs. The first was that my file manager, Dolphin, was unable to generate previews of some PDFs due to an arbitrary limit on file sizes. The second was slow rendering of the PDFs in my viewer, Okular. I'm pleased to report that the first of these issues has been fixed, and the second is due to be fixed in the next stable release series of Poppler (the PDF rendering library used by Okular).