20 December 2012

PDFBeads doesn't like consecutive whitespace in hOCR

Lazy Kent has now published openSUSE RPMs for Tesseract 3.02, so I installed it and ran it on the files Tesseract 3.01 was failing on. This time it was able to produce hOCR files for them. However, PDFBeads did not like some of these hOCR files:

$ pdfbeads -f -t 216 -m JBIG2 -b JP2 -B 72 1910-036b.tiff  >/dev/null
Prepared data for processing 1910-036b.tiff
JBIG2 compression complete. pages:1 symbols:5780 log2:13
/usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:589:in `to_i': NaN (FloatDomainError)
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:589:in `getPDFText'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:584:in `each_index'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:584:in `getPDFText'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:564:in `each'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:564:in `getPDFText'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:267:in `process'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:202:in `each'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:202:in `process'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/bin/pdfbeads:203
        from /usr/bin/pdfbeads:19:in `load'
        from /usr/bin/pdfbeads:19

Through a tedious process of binary searching through the files, I narrowed the problem down to cases where the hOCR file contains two or more consecutive spans of class ocrx_word whose only text content is whitespace. Affected files can be found by grepping for the following extended regular expression:

(<span[^>]*>(<strong>)? +(</strong>)?</span> *){2}
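
For example, something like this should list the affected files (assuming GNU grep and that the hOCR files are named *.html):

$ grep -lE '(<span[^>]*>(<strong>)? +(</strong>)?</span> *){2}' *.html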

I don't know the hOCR format in detail, but I suspect that having two whitespace-containing ocrx_word spans in a row isn't prohibited. It's therefore probably PDFBeads which is at fault here.

The next step is therefore to process the Tesseract output with sed to find and remove the duplicate whitespace spans. Following that it looks like my OCR job is complete, and the only really necessary remaining work is to build some HTML-based UI to access the PDFs.
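
For the sed step, something like the following GNU sed one-liner might do it (entirely untested; it assumes each ocr_line sits on a single line of the hOCR file, and it repeatedly drops a whitespace-only ocrx_word span whenever another one immediately follows, so each run collapses to a single span):

$ sed -i -E ':a; s#<span[^>]*>(<strong>)? +(</strong>)?</span> *(<span[^>]*>(<strong>)? +(</strong>)?</span>)#\3#; ta' *.html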

04 November 2012

More bugs

So I wrote to Alexey Kryukov, author of PDFBeads, alerting him to his program's inability to handle input files where the horizontal and vertical DPI differ. I never heard back. I also did some further research on manually changing the output PDF resolution, but from what I can tell this isn't possible. So it looks like I'll just have to override the DPI settings in the original TIFFs and live with slightly stretched or compressed page sizes.
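
If I go down that road, something like this ImageMagick one-liner ought to rewrite the resolution metadata without resampling the pixels (untested; 216 DPI is only an example value, and libtiff's tiffset would be another way of setting the resolution tags):

$ for f in *.tiff; do convert "$f" -units PixelsPerInch -density 216 "${f%.tiff}-216dpi.tiff"; done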

After running PDFBeads on my entire collection of images, I noticed that it had failed to produce PDFs for some issues because their hOCR files were missing. Looking back, I see that Tesseract 3.01 had failed on some of the images, producing the following error message:

ELIST_ITERATOR::add_after_then_move:Error:Attemting to add an element with non NULL links, to a list

It looks like this problem has been reported at least a couple times before on the Tesseract issue tracker (Issue 541, Issue 788). Comments on the second report suggest the problem may have been solved in Tesseract 3.02, which was released a few days ago. This version hasn't yet been packaged in Lazy Kent's repository, so I can either wait to see if he updates the RPM, or try producing one myself using his spec file and patches.

05 September 2012

Further experiences with PDFBeads

I had a chance to visually examine the output of PDFBeads, and so far it looks OK. I think I will keep the unpapered backgrounds.

One problem that has arisen, however, is properly specifying the physical dimensions of the page. Back when I started this blog I reported that for most of my scans, the horizontal DPI is not the same as the vertical DPI. However, it seems that PDFBeads can't handle TIFFs where the horizontal and vertical DPI differ; when it tries to combine such images with hOCR data, the text in the resulting PDFs is a complete mess. I suppose there are three possible solutions to this problem:

  1. Examine the source code of PDFBeads to track down and fix the bug. This is likely to be difficult, at least for me, because the tool is written in Ruby, a language I have no knowledge of. (Or perhaps the author could be persuaded to fix it; there's no bug tracker but he does give his e-mail address in the documentation.)
  2. Postprocess the output PDF to override the DPI or paper size settings. I'm not sure if there's any easy way of doing this.
  3. Use ImageMagick's convert -density to override the input TIFF DPI so that the vertical and horizontal DPI values are the same. Since this only changes the metadata and not the pixels, however, the rendered pages will end up slightly stretched or compressed.

Experiences with PDFBeads

Further to my post of yesterday, I downloaded and installed PDFBeads, a Ruby tool for assembling images and hOCR data into PDFs. Unlike hocr2pdf, PDFBeads supports JBIG2, and is also able to up- or downscale the page image to a specified DPI. In theory this means it's no longer necessary to call jbig2enc separately. Also, unlike jbig2enc's pdf.py, PDFBeads inserts a JPEG or JPEG2000 background image of the original page, though whether or not this is desirable depends on the source material. For my microfiche scans, it's not particularly helpful: some of the scans are quite dark, so I would rather have bitonal, text-only images which print with higher contrast and less ink. Also, large areas of these images have been blanked by unpaper, so it's possible some of them may look a bit ugly. I'll have to examine the results to see whether they're acceptable.

First though, I had to figure out how to use the tool. The full manual is available in Russian only, though running it with --help does produce a useful, if incomplete, English summary of the command-line options. But the usage instructions leave something to be desired: "pdfbeads [options] [files to process] > out.pdf" conspicuously omits such important details as what file types are supported and how to associate a given image file with a given hOCR file. Some experimentation revealed some usage quirks and bugs, which I document here for future reference and for the benefit of anyone else using this tool:

  • One need only give the image files on the command line; the tool tries to find matching hOCR files automatically based on the image filenames. For example, if you call pdfbeads foo.tiff then it will look for hOCR data in the file foo.html. Frustratingly, however, it looks for this file in the current directory, and not in foo.tiff's directory, so calling pdfbeads /path/to/foo.tiff won't work if the hOCR data is in /path/to/foo.html.
  • The tool leaves a lot of temporary files lying around. To be fair, this is a good thing, since they are expensive to produce and you wouldn't want to recreate them on each run unless necessary; there's also a command-line option to delete them. The problem is that where these files are produced is neither documented nor specifiable. This issue, plus the one mentioned in the previous point, makes it a bit more difficult to cleanly use the tool in a batch environment such as a shell script or makefile (a rough wrapper script is sketched after this list).
  • The program doesn't always throw an error when things go wrong—for example, if you try to invoke it on a PNG image, it will happily produce a blank PDF instead of informing you that it can't handle PNG files. It took some trial and error to find an image file format that it liked (TIFF).
  • Even when called with the correct arguments, the program sometimes ends up producing a 0-byte PDF file. I let it run overnight to produce PDFs for 820 issues of the Standard, and in about a dozen cases it produced a 0-byte file. However, when I tried rerunning the tool on these cases, in all but one it successfully produced the PDF. So evidently it's a bit flaky.
  • The tool still failed on one of my newspaper issues, throwing the error /usr/lib64/ruby/gems/1.8/gems/hpricot-0.8.5/lib/hpricot/parse.rb:33: [BUG] Segmentation fault. The problem is evidently an insidious and often-reported bug with hpricot, the HTML parser PDFBeads uses to process the hOCR files. There was nothing obviously wrong with the particular hOCR file that hpricot was choking on; and I found that making almost any trivial modification to it (such as adding another newline to the end of the file) allowed hpricot to process it without error.
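
Here's the rough wrapper I have in mind for batch use, working around the hOCR-lookup and temporary-file issues and retrying the occasional 0-byte output (a sketch only: the issues/ directory layout is hypothetical, and the pdfbeads options are just the ones I've been using):

for dir in issues/*/; do
    issue=$(basename "$dir")
    # run pdfbeads inside the issue's own directory so that it finds the hOCR
    # files and leaves its temporary files there as well
    ( cd "$dir" && pdfbeads -f -t 216 -m JBIG2 -b JP2 -B 72 *.tiff > "$issue.pdf" )
    # pdfbeads occasionally emits a 0-byte PDF; a single retry usually fixes it
    if [ ! -s "$dir$issue.pdf" ]; then
        ( cd "$dir" && pdfbeads -f -t 216 -m JBIG2 -b JP2 -B 72 *.tiff > "$issue.pdf" )
    fi
done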

Now that I've used the tool to produce a set of PDFs, I'm doing some spot checks on them to make sure they all look OK and have the hOCR data properly integrated. Also, because my scans vary in size (both in terms of pixels and physical paper dimensions) I may need to rerun the tool using different DPI settings for different issue ranges. Once that is done I can look at adding proper metadata to the PDFs. (Then there's the whole issue of using DjVu as an alternative, which I haven't investigated yet!)

DIY Book Scanner

In one of my recent posts an anonymous commenter alerted me to the existence of the DIY Book Scanner website, and more specifically its forum. The forum looks to be an excellent resource for anyone doing their own book (or newspaper) scanning project, and contains areas for discussing both hardware and software workflows. It's there that I first learned about PDFBeads (more on which in an upcoming post).

04 September 2012

Combining hOCR and image data into a PDF

You will recall that the page images originally supplied to me were DCT images embedded in PDFs, DCT being a lossy compression scheme based on the JPEG standard. I needed to crop, deskew, and OCR these images, for which I had to decompress them to bitmaps. The finished PDFs I ultimately produce will use lossless JBIG2 compression on the scans—or so the plan is.

At the moment I have the cropped and deskewed page images as lossless PNG bitmaps, along with the OCR'd text in hOCR format. Using jbig2enc it's easy to create JBIG2 bitmaps and symbol tables from the PNG files (roughly as sketched after the list below). However, I don't (yet) have any tool which will directly combine the JBIG2 data and the hOCR data for a page into a single PDF. Jbig2enc's pdf.py can assemble JBIG2 files into a PDF, but it doesn't add the hOCR text. I did some investigation and I think I have two options available to me:

  1. I could use ExactImage's hocr2pdf to combine the PNG bitmaps and hOCR text into a PDF, and then use pdfsizeopt to JBIG2-compress the PDFs. There are two possibly surmountable problems with this:
    • Hocr2pdf always converts the images you give it to the lossy DCT format when outputting them to the PDF. In our case this is a bad thing, because our images are already from a DCT-compressed source, and are pretty low resolution to begin with. From reading the comments in the hocr2pdf source code (the only copy of the source I could find was in a user-contributed openSUSE RPM), I see that support for other image compression schemes is on the to-do list:
      // TODO: more image compressions, jbig2, Fax
      Fortunately, I think it should be easy to hack lossless image output into the code. The code for writing PDFs in codecs/pdf.cc starts off as follows:
        virtual void writeStreamTagsImpl(std::ostream& s)
        {
          // default based on image type
          if (image.bps < 8) encoding = "/FlateDecode";
          else encoding = "/DCTDecode";
      So apparently the code already supports not only DCT but also the lossless Flate scheme, and chooses between them based on the bit depth of the image. If I change the above code to
        virtual void writeStreamTagsImpl(std::ostream& s)
        {
          encoding = "/FlateDecode";
      and recompile, maybe hocr2pdf will no longer lossily compress the PNGs I feed it.
    • You will recall that I got the best-looking results from jbig2enc when I set it to upscale the images by a factor of two. However, pdfsizeopt doesn't appear to let you change the scaling factor. Since pdfsizeopt is just a Python script which calls jbig2, I should be able to add -2 to the system call at that point.
  2. I could instead use something called PDFBeads. According to a thread on DIY Book Scanner, PDFBeads is a Ruby application which can add hOCR to PDF. However, the reader is warned that the "manual [is] in Russian only"! This could be fun.
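
For the record, the jbig2enc step mentioned above goes roughly like this, following the usage shown in jbig2enc's README (-s selects symbol coding, -p writes the output.* files that pdf.py expects, and -2 is the factor-of-two upscaling; the details may need checking against the real tool):

$ jbig2 -s -p -2 *.png
$ python pdf.py output > out.pdf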

So tomorrow will be spent patching hocr2pdf and pdfsizeopt, and/or learning Russian. :)

03 September 2012

GNU Parallel, where have you been all my life?

Digitizing the Socialist Standard archive involves running CPU-bound image processing tools on a large number of files. Since I've got a multicore CPU, it makes sense to run such operations in parallel rather than one after another. (A good rule of thumb I've heard is to always have twice as many processes running as you have cores or CPUs.) Up until now, I've been coding each batch of tasks in a makefile, and then invoking make with the -j argument for parallel execution. Needless to say, this is a bit inconvenient when I just have a one-off batch job to run, and it also prevents me from developing and testing bash-scripted tasks from the command line. For years I've wished that bash's looping statements could be parameterized by the number of loop bodies to run simultaneously. For example, instead of writing for x in a b c d e f;do somehugecommand $x;done and waiting for somehugecommand to run six times, one after the other, I want to be able to write something like for x in a b c d e f;do -j3 somehugecommand $x;done and have three instances of somehugecommand launch and run simultaneously.

Well, apparently such a tool has existed for many years now, but no one told me about it. It's called GNU Parallel, and it works much like the old familiar xargs from GNU Findutils. You pass it a list of values on stdin, and give it the command to execute as command-line arguments. As with xargs, the character sequence {} gets replaced with the values from stdin. And of course, you also tell it how many simultaneous jobs to run with the -j option, just like with GNU Make. For example, whereas before I was calling the Tesseract OCR software on one file at a time with for f in $list_of_images;do tesseract $f.png $f -l eng hocr;done, I'm now executing them in parallel with echo $list_of_images | parallel -j4 tesseract {}.png {} -l eng hocr. What a fantastically useful utility!
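
And here's how I might point it at a whole directory tree of page images rather than a pre-built list (assuming the filenames contain no whitespace; the scans/ path and the -j value are just examples):

$ find scans/ -name '*.png' | sed 's/\.png$//' | parallel -j4 tesseract {}.png {} -l eng hocr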

As might be surmised from its name, GNU Parallel is an official GNU project, so it's surprising that it's not better known and more widely available. (For example, it's not packaged by openSUSE or other major distributions.) GNU Parallel's web page has some background which explains why:

In the years after 2005… I tried getting parallel accepted into GNU findutils. It was not accepted as it was written in Perl and the team did not want GNU findutils to depend on Perl…
In February 2009 I tried getting parallel added to the package moreutils. The author never replied to the email or the two reminders…
In 2010 parallel was adopted as an official GNU tool and the name was changed to GNU parallel. As GNU already had a tool for running jobs on remote computers (called pexec) it was a hard decision to include GNU parallel as well. I believe the decision was mostly based on GNU parallel having a more familiar user interface - behaving very much like xargs. Shortly after the release as GNU tool remote execution was added and all missing options from xargs were added to make it possible to use GNU parallel as a drop in replacement for xargs.

So to Ole Tange, the author of GNU Parallel, I just want to say thank you for this wonderful utility, and I'm sorry that you had so much trouble getting it adopted into a GNU package.

hOCR-capable OCR programs

As indicated in my last posting, I tested various OCR programs which output either to hOCR or directly to PDF. For the ones which output hOCR, I tried producing a PDF with the text layer hidden underneath the image using hocr2pdf, a Free Software tool which the creators do a very good job of preventing you from finding. There seems to be absolutely nowhere on their website to download it, either in source or binary form. Fortunately, the source seems to be available on a few third-party download sites, and users at the openSUSE Build Service have posted RPMs.

Anyway, I tested each OCR package on a page from a 1904 issue and a page from a 1961 issue. My findings are summarized as follows:

Cuneiform
Cuneiform seemed to do a decent job at OCR, at least as far as character matching went, but the hOCR it produced didn't work well with hocr2pdf—this despite using an earlier version of Cuneiform, as instructed by a post on the DIY Book Scanner forums which an anonymous commenter referred me to.
Tesseract
Tesseract's output was almost as good as Cuneiform's, and moreover the hOCR was digestible by hocr2pdf.
Adobe Acrobat Professional
I have access to this through my workplace. Accuracy was similar to the above two Free Software packages, but the user interface doesn't support batch processing. I've got hundreds of issues to process, so OCRing them one at a time in a GUI isn't an option.
ABBYY OCR
I activated a trial version of ABBYY's command-line OCR package. The accuracy was by far the highest of any of the suites I tested. However, it's proprietary and also very expensive software; in order to process the complete Standard archive I'd need to buy a €999 licence.

None of the above OCR programs seemed to recognize the column layout of the newspaper. It's therefore not possible to use the text selection tool in the resulting PDF to copy and paste more than one line of a column at a time. However, at least the PDF will be searchable (modulo the character recognition errors).

I've therefore settled on Tesseract. I set up a batch processing job and estimate it will take about 20 to 30 hours to do the whole archive.

One difficulty I foresee is that I don't think hocr2pdf works on the output of jbig2enc. I may need to use hocr2pdf to create an uncompressed PDF with hidden text, and then reprocess it using pdfsizeopt, which integrates jbig2.
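
If it comes to that, the per-page pipeline would presumably look something like this (untested: the filenames are only examples, hocr2pdf reads its hOCR from standard input, and I'd still have to check the exact pdfsizeopt invocation and options):

$ hocr2pdf -i 1910-036b.png -o 1910-036b.pdf < 1910-036b.html
$ pdfsizeopt.py 1910-036b.pdf 1910-036b-optimized.pdf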