05 September 2012

Further experiences with PDFBeads

I had a chance to visually examine the output of PDFBeads, and so far it looks OK. I think I will keep the unpapered backgrounds.

One problem that has arisen, however, is properly specifying the physical dimensions of the page. Back when I started this blog I reported that for most of my scans, the horizontal DPI is not the same as the vertical DPI. However, it seems that PDFBeads can't handle TIFFs where the horizontal and vertical DPI differ; when it tries to combine such images with hOCR data, the text in the resulting PDFs is a complete mess. I suppose there are three possible solutions to this problem:

  1. Examine the source code of PDFBeads to track down and fix the bug. This is likely to be difficult, at least for me, because the tool is written in Ruby, a language I have no knowledge of. (Or perhaps the author could be persuaded to fix it; there's no bug tracker but he does give his e-mail address in the documentation.)
  2. Postprocess the output PDF to override the DPI or paper size settings. I'm not sure if there's any easy way of doing this.
  3. Use ImageMagick's convert -density to override the input TIFF DPI so that the vertical and horizontal DPI values are the same. This changes only the resolution metadata, not the pixels, so the images will come out distorted (stretched or squashed along one axis).
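
To make the distortion in option 3 concrete, here is a rough Python sketch of the arithmetic. The scan dimensions and DPI values are made up for illustration, and the function names are my own, not part of PDFBeads or ImageMagick. It also shows the pixel dimensions one would need to resample to if the goal were a single common DPI without changing the physical page size; I haven't tested whether PDFBeads is happy with images prepared that way.

```python
def physical_size_in(width_px, height_px, x_dpi, y_dpi):
    """Return the page size in inches implied by the pixel counts and DPI."""
    return width_px / x_dpi, height_px / y_dpi


def equalized_pixel_size(width_px, height_px, x_dpi, y_dpi):
    """Return pixel dimensions that keep the physical size at one common DPI.

    The common DPI is the larger of the two, so the lower-resolution axis
    is upsampled instead of detail being thrown away on the other axis.
    """
    target_dpi = max(x_dpi, y_dpi)
    new_width = round(width_px * target_dpi / x_dpi)
    new_height = round(height_px * target_dpi / y_dpi)
    return new_width, new_height, target_dpi


# Hypothetical scan: 2400 x 1650 pixels at 300 (horizontal) x 150 (vertical) DPI.
w, h, x_dpi, y_dpi = 2400, 1650, 300, 150

# True physical size of the page.
print(physical_size_in(w, h, x_dpi, y_dpi))      # (8.0, 11.0)

# Size after merely overriding the metadata to 300 x 300: the page is squashed.
print(physical_size_in(w, h, 300, 300))          # (8.0, 5.5)

# Pixel size needed to keep 8 x 11 inches at a uniform 300 DPI.
print(equalized_pixel_size(w, h, x_dpi, y_dpi))  # (2400, 3300, 300)
```

The resampling itself would have to be done by an image tool; ImageMagick's -resize with a geometry ending in ! forces exact pixel dimensions, which is presumably what would be needed here.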
