Further to my post of yesterday, I downloaded and installed PDFBeads, a Ruby tool for assembling images and hOCR data into PDFs. Unlike hocr2pdf, PDFBeads supports JBIG2, and is also able to up- or downscale the page image to a specified DPI. In theory this means it's no longer necessary to call jbig2enc separately. Also, unlike jbig2enc's pdf.py, PDFBeads inserts a JPEG or JPEG2000 foreground image of the original page, though whether or not this is desirable depends on the source material. For my microfiche scans, it's not particularly helpful: some of the scans are quite dark, so I would rather have bitonal, text-only images which print with higher contrast and less ink. Also, large areas of these images have been blanked by unpaper, so it's possible some of them may look a bit ugly. I'll have to examine the results to see whether they're acceptable.
First though, I had to figure out how to use the tool. The full manual is available in Russian only, though running the tool with --help does produce a useful, if incomplete, English summary of the command-line options. But the usage instructions leave something to be desired: "pdfbeads [options] [files to process] > out.pdf" conspicuously omits such important details as what file types are supported and how to associate a given image file with a given hOCR file. Some experimentation revealed some usage quirks and bugs, which I document here for future reference and for the benefit of anyone else using this tool:
- One need give only the image files on the command line; the tool tries to find matching hOCR files automatically based on the image filenames. For example, if you call pdfbeads foo.tiff then it will look for hOCR data in the file foo.html. Frustratingly, however, it looks for this file in the current directory, and not in foo.tiff's directory, so calling pdfbeads /path/to/foo.tiff won't work if the hOCR data is in
- The tool leaves a lot of temporary files lying around. To be fair, this is a good thing, since they are expensive to produce and you wouldn't want to recreate them on each run unless necessary; there's also a command-line option to delete them. The problem is that where these files are produced is neither documented nor specifiable. This issue, plus the one mentioned in the previous point, makes it a bit more difficult to use the tool cleanly in a batch environment such as a shell script or makefile.
- The program doesn't always throw an error when things go wrong. For example, if you try to invoke it on a PNG image, it will happily produce a blank PDF instead of informing you that it can't handle PNG files. It took some trial and error to find an image file format that it liked (TIFF).
- Even when called with the correct arguments, the program sometimes ends up producing a 0-byte PDF file. I let it run overnight to produce PDFs for 820 issues of the Standard, and in about a dozen cases it produced a 0-byte file. However, when I tried rerunning the tool on these cases, in all but one it successfully produced the PDF. So evidently it's a bit flaky.
- The tool still failed on one of my newspaper issues, throwing the error /usr/lib64/ruby/gems/1.8/gems/hpricot-0.8.5/lib/hpricot/parse.rb:33: [BUG] Segmentation fault. The problem is evidently an insidious and often-reported bug with hpricot, the HTML parser PDFBeads uses to process the hOCR files. There was nothing obviously wrong with the particular hOCR file that hpricot was choking on, and I found that making almost any trivial modification to it (such as adding another newline to the end of the file) allowed hpricot to process it without error.
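For batch use, the workarounds in the list above could be scripted roughly as follows. This is only a sketch under my own assumptions, not anything PDFBeads provides: build_pdf, nudge_hocr and the command parameter are hypothetical names, and the only invocation relied on is the bare "pdfbeads image > out.pdf" form from the usage string. The wrapper runs the tool from the image's own directory so that a sibling hOCR file counts as being in the current directory, and retries when the result is a 0-byte file.

```python
import subprocess
from pathlib import Path


def build_pdf(image, out_pdf, command=("pdfbeads",), attempts=3):
    """Run `command <image name>` from the image's directory, saving stdout to out_pdf.

    Returns True once a non-empty file has been written, and False if every
    attempt exited with an error or left a 0-byte file.
    """
    image = Path(image).resolve()
    out_pdf = Path(out_pdf).resolve()
    for _ in range(attempts):
        with open(out_pdf, "wb") as out:
            # cwd is the image's directory, so foo.html next to foo.tiff is
            # in the "current directory" from the tool's point of view.
            result = subprocess.run(
                [*command, image.name], cwd=image.parent, stdout=out
            )
        if result.returncode == 0 and out_pdf.stat().st_size > 0:
            return True
    out_pdf.unlink(missing_ok=True)  # don't leave an empty PDF behind
    return False


def nudge_hocr(hocr_path):
    """Append a newline to an hOCR file: the trivial edit that got hpricot past its crash."""
    with open(hocr_path, "ab") as f:
        f.write(b"\n")
```

A call such as build_pdf("/path/to/foo.tiff", "foo.pdf") would then stand in for the plain command line. Note that this only catches empty output; it cannot tell a good PDF from the blank one produced for an unsupported image format, so spot checks are still needed.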
Now that I've used the tool to produce a set of PDFs, I'm doing some spot checks on them to make sure they all look OK and have the hOCR data properly integrated. Also, because my scans vary in size (in terms of both pixels and physical paper dimensions) I may need to rerun the tool using different DPI settings for different issue ranges. Once that is done I can look at adding proper metadata to the PDFs. (Then there's the whole issue of using DjVu as an alternative, which I haven't investigated yet!)