04 September 2012

Combining hOCR and image data into a PDF

You will recall that the page images originally supplied to me were DCT images embedded in PDFs, DCT being a lossy compression scheme based on the JPEG standard. I needed to crop, deskew, and OCR these images, for which I had to decompress them to bitmaps. The finished PDFs I ultimately produce will use lossless JBIG2 compression on the scans, or so the plan is.

At the moment I have the cropped and deskewed page images as lossless PNG bitmaps, along with the OCR'd text in hOCR format. Using jbig2enc it's easy to create JBIG2 bitmaps and symbol tables from the PNG files. However, I don't (yet) have any tool which will directly combine the JBIG2 data and the hOCR data for a page into a single PDF. Jbig2enc's pdf.py can assemble JBIG2 files into a PDF, but it doesn't add the hOCR text. I did some investigation and I think I have two options available to me:
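
For reference, the jbig2enc step can be scripted roughly as follows. This is only a minimal sketch: it assumes the `jbig2` binary and jbig2enc's `pdf.py` are both executable and on the PATH, and the helper names `jbig2_command` and `encode_pages`, as well as the basename `book`, are placeholders invented for illustration. It uses jbig2enc's `-s` (symbol mode), `-p` (PDF-ready output), `-b` (output basename), and optionally `-2` (upscale by two) switches.

```python
import subprocess


def jbig2_command(png_files, basename="book", upscale=False):
    """Build the jbig2enc command line for a batch of page images."""
    # -s: symbol mode, -p: emit PDF-ready fragments, -b: output basename
    cmd = ["jbig2", "-s", "-p", "-b", basename]
    if upscale:
        cmd.append("-2")  # upscale the input by a factor of two
    return cmd + list(png_files)


def encode_pages(png_files, out_pdf, basename="book", upscale=False):
    """Run jbig2enc, then let pdf.py wrap the output in an image-only PDF."""
    subprocess.run(jbig2_command(png_files, basename, upscale), check=True)
    # pdf.py writes the assembled PDF to standard output
    with open(out_pdf, "wb") as out:
        subprocess.run(["pdf.py", basename], stdout=out, check=True)
```

Calling `encode_pages(["page1.png", "page2.png"], "scans.pdf", upscale=True)` would then produce a PDF of the scans alone, still without any hOCR text layer.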

  1. I could use ExactImage's hocr2pdf to combine the PNG bitmaps and hOCR text into a PDF, and then use pdfsizeopt to JBIG2-compress the PDFs. There are two possibly surmountable problems with this:
    • Hocr2pdf always converts the images you give it to the lossy DCT format when outputting them to the PDF. In our case this is a bad thing, because our images already come from a DCT-compressed source, and are pretty low resolution to begin with. From reading the comments in the hocr2pdf source code (the only copy of which I could find was in a user-contributed openSUSE RPM) I see that support for other image compression schemes is on the to-do list:
      // TODO: more image compressions, jbig2, Fax
      Fortunately, I think it should be easy to hack lossless image output into the code. The code for writing PDFs in codecs/pdf.cc starts off as follows:
        virtual void writeStreamTagsImpl(std::ostream& s)
        {
          // default based on image type
          if (image.bps < 8) encoding = "/FlateDecode";
          else encoding = "/DCTDecode";
      So apparently the code already supports not only DCT but also the lossless Flate scheme, and chooses between them based on the bit depth of the image. If I change the above code to
        virtual void writeStreamTagsImpl(std::ostream& s)
        {
          // always use lossless Flate, regardless of bit depth
          encoding = "/FlateDecode";
      and recompile, maybe hocr2pdf will no longer lossily compress the PNGs I feed it.
    • You will recall that I got the best-looking results from jbig2enc when I set it to upscale the images by a factor of two. However, pdfsizeopt doesn't appear to let you change the scaling factor. Since pdfsizeopt is just a Python script which calls jbig2, I should be able to just add -2 to the system call at that point.
  2. I could instead use something called PDFBeads. According to a thread on DIY Book Scanner, PDFBeads is a Ruby application which can add hOCR to PDF. However, the reader is warned that the "manual [is] in Russian only"! This could be fun.
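
Option 1 can be sketched as a two-step script. This is a rough outline rather than a tested pipeline: it assumes `hocr2pdf` and `pdfsizeopt` are installed under those names (pdfsizeopt may be installed as `pdfsizeopt.py` instead), and the helper names and the temporary file name `page-raw.pdf` are made up for illustration. It relies on hocr2pdf reading the hOCR from standard input and taking the image and output PDF via `-i` and `-o`. It does not include either of the two patches discussed above.

```python
import subprocess


def hocr2pdf_command(image_path, out_pdf):
    """hocr2pdf takes the image via -i and the PDF via -o; hOCR comes on stdin."""
    return ["hocr2pdf", "-i", image_path, "-o", out_pdf]


def pdfsizeopt_command(in_pdf, out_pdf):
    """pdfsizeopt takes the input and output PDFs as positional arguments."""
    return ["pdfsizeopt", in_pdf, out_pdf]


def build_page(image_path, hocr_path, out_pdf, tmp_pdf="page-raw.pdf"):
    """Combine one page image with its hOCR text, then optimise the result."""
    # Step 1: merge the bitmap and the hOCR text layer into a PDF
    with open(hocr_path, "rb") as hocr:
        subprocess.run(hocr2pdf_command(image_path, tmp_pdf),
                       stdin=hocr, check=True)
    # Step 2: let pdfsizeopt recompress the images in that PDF
    subprocess.run(pdfsizeopt_command(tmp_pdf, out_pdf), check=True)
```

A call such as `build_page("page1.png", "page1.hocr", "page1.pdf")` would run both steps for a single page.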

So tomorrow will be spent patching hocr2pdf and pdfsizeopt, and/or learning Russian. :)
