Socialist Standard digitization blog

05 February 2013

UI mockup #1

I've produced a quick mock-up of what a web browser–based interface to the Socialist Standard archive might look like. Follow the link below to see a working copy.

I should stress that this is a very rough sketch; the colours and exact positioning of elements aren't finalized, the JavaScript is a bit buggy, and the page header is just something I threw together in the Gimp in five minutes. But I think that the overall issue navigation is OK. Opinions?

20 December 2012

PDFBeads doesn't like consecutive whitespace in hOCR

Lazy Kent has now published openSUSE RPMs for Tesseract 3.02, so I installed it and ran it on the files Tesseract 3.01 was failing on. This time it was able to produce hOCR files for them. However, PDFBeads did not like some of these hOCR files:

$ pdfbeads -f -t 216 -m JBIG2 -b JP2 -B 72 1910-036b.tiff  >/dev/null
Prepared data for processing 1910-036b.tiff
JBIG2 compression complete. pages:1 symbols:5780 log2:13
/usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:589:in `to_i': NaN (FloatDomainError)
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:589:in `getPDFText'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:584:in `each_index'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:584:in `getPDFText'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:564:in `each'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:564:in `getPDFText'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:267:in `process'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:202:in `each'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:202:in `process'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/bin/pdfbeads:203
        from /usr/bin/pdfbeads:19:in `load'
        from /usr/bin/pdfbeads:19

Through a tedious process of binary searching through the files, I narrowed down the problem to cases where the hOCR file contains two or more consecutive spans of class ocrx_word which contain no CDATA except for whitespace. Affected files can be found by grepping for the following extended regular expression:

(<span[^>]*>(<strong>)? +(</strong>)?</span> *){2}

I don't know the hOCR format in detail, but I suspect that having two whitespace-containing ocrx_word spans in a row isn't prohibited. It's therefore probably PDFBeads which is at fault here.

The next step is therefore to process the Tesseract output with sed to find and remove the duplicate whitespace spans. Following that it looks like my OCR job is complete, and the only really necessary remaining work is to build some HTML-based UI to access the PDFs.
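For illustration, here is a minimal sketch of that clean-up step, written with Python's re module rather than sed. It reuses the shape of the regular expression above; the names BLANK_SPAN, BLANK_RUN, and collapse_blank_spans are made up for this example, and it assumes that keeping the first span of each run and dropping the rest is enough to avoid the crash, which has not been verified here.

```python
import re

# One whitespace-only span, optionally wrapped in <strong>, plus any trailing
# spaces (the same shape as the extended regular expression above).
BLANK_SPAN = r"<span[^>]*>(?:<strong>)? +(?:</strong>)?</span> *"

# A run of two or more such spans; group 1 captures the first span of the run.
BLANK_RUN = re.compile(rf"({BLANK_SPAN})(?:{BLANK_SPAN})+")


def collapse_blank_spans(hocr: str) -> str:
    """Keep the first span of each run of whitespace-only spans, drop the rest."""
    return BLANK_RUN.sub(r"\1", hocr)


def has_blank_run(hocr: str) -> bool:
    """Report whether the text still contains consecutive whitespace-only spans."""
    return BLANK_RUN.search(hocr) is not None
```

Running has_blank_run on each file's text before and after the substitution gives a quick check that no affected runs remain.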

04 November 2012

More bugs

So I wrote to Alexey Kryukov, author of PDFBeads, alerting him about his program's inability to handle input files where the horizontal and vertical DPI differ. I never heard back. I also did some further research on manually changing the output PDF resolution, but from what I can tell this isn't possible. So it looks like I'll just have to override the DPI settings in the original TIFFs and live with slightly stretched or compressed page sizes.

After running PDFBeads on my entire collection of images, I noticed that it failed to produce some issues due to missing hOCR files. Looking back, I see that Tesseract 3.01 had failed on some of the images, producing the following error message:

ELIST_ITERATOR::add_after_then_move:Error:Attemting to add an element with non NULL links, to a list

It looks like this problem has been reported at least a couple times before on the Tesseract issue tracker (Issue 541, Issue 788). Comments on the second report suggest the problem may have been solved in Tesseract 3.02, which was released a few days ago. This version hasn't yet been packaged in Lazy Kent's repository, so I can either wait to see if he updates the RPM, or try producing one myself using his spec file and patches.

05 September 2012

Further experiences with PDFBeads

I had a chance to visually examine the output of PDFBeads, and so far it looks OK. I think I will keep the unpapered backgrounds.

One problem that has arisen, however, is properly specifying the physical dimensions of the page. Back when I started this blog I reported that for most of my scans, the horizontal DPI is not the same as the vertical DPI. However, it seems that PDFBeads can't handle TIFFs where the horizontal and vertical DPI differ; when it tries to combine such images with hOCR data, the text in the resulting PDFs is a complete mess. I suppose there are three possible solutions to this problem:

- Examine the source code of PDFBeads to track down and fix the bug. This is likely to be difficult, at least for me, because the tool is written in Ruby, a language I have no knowledge of. (Or perhaps the author could be persuaded to fix it; there's no bug tracker but he does give his e-mail address in the documentation.)
- Postprocess the output PDF to override the DPI or paper size settings. I'm not sure if there's any easy way of doing this.
- Use ImageMagick's convert -density to override the input TIFF DPI so that the vertical and horizontal DPI values are the same. This will result in distorted images, however.
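To get a feel for how much distortion the third option introduces, the arithmetic can be sketched as follows. The pixel dimensions and DPI values are invented for the example (the actual figures for the scans are not given here), and the function names are hypothetical.

```python
def physical_size(width_px, height_px, x_dpi, y_dpi):
    """Return the page's (width, height) in inches for the given resolutions."""
    return width_px / x_dpi, height_px / y_dpi


def stretch_factors(x_dpi, y_dpi, forced_dpi):
    """Return how much each axis is stretched (>1) or compressed (<1)
    when a single DPI value is forced onto both axes."""
    return x_dpi / forced_dpi, y_dpi / forced_dpi


# Hypothetical scan: 2400 x 3300 pixels at 300 DPI horizontal, 330 DPI vertical.
true_w, true_h = physical_size(2400, 3300, 300, 330)      # 8 in x 10 in
forced_w, forced_h = physical_size(2400, 3300, 300, 300)  # 8 in x 11 in
sx, sy = stretch_factors(300, 330, 300)                   # height stretched by about 10%
```

In this made-up case the page comes out an inch too tall. Forcing the mean of the two DPI values instead would spread the error across both axes.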

Experiences with PDFBeads

Further to my post of yesterday, I downloaded and installed PDFBeads, a Ruby tool for assembling images and hOCR data into PDFs. Unlike hocr2pdf, PDFBeads supports JBIG2, and is also able to up- or downscale the page image to a specified DPI. In theory this means it's no longer necessary to call jbig2enc separately. Also, unlike jbig2enc's pdf.py, PDFBeads inserts a JPEG or JPEG2000 background image of the original page, though whether or not this is desirable depends on the source material. For my microfiche scans, it's not particularly helpful: some of the scans are quite dark, so I would rather have bitonal, text-only images which print with higher contrast and less ink. Also, large areas of these images have been blanked by unpaper, so it's possible some of them may look a bit ugly. I'll have to examine the results to see whether they're acceptable.

First though, I had to figure out how to use the tool. The full manual is available in Russian only, though running it with --help does produce a useful, if incomplete, English summary of the command-line options. But the usage instructions leave something to be desired: "pdfbeads [options] [files to process] > out.pdf" conspicuously omits such important details as what file types are supported and how to associate a given image file with a given hOCR file. Some experimentation revealed some usage quirks and bugs, which I document here for future reference and for the benefit of anyone else using this tool:

- One need give only the image files on the command line; it tries to find matching hOCR files automatically based on the image filenames. For example, if you call pdfbeads foo.tiff then it will look for hOCR data in the file foo.html. Frustratingly, however, it looks for this file in the current directory, and not in foo.tiff's directory, so calling pdfbeads /path/to/foo.tiff won't work if the hOCR data is in /path/to/foo.html.
- The tool leaves a lot of temporary files lying around. To be fair, this is a good thing, since they are expensive to produce and you wouldn't want to recreate them on each run unless necessary; there's also a command-line option to delete them. The problem is that where these files are produced is neither documented nor specifiable. This issue, plus the one mentioned in the previous point, makes it a bit more difficult to cleanly use the tool in a batch environment such as a shell script or makefile.
- The program doesn't always throw an error when things go wrong. For example, if you try to invoke it on a PNG image, it will happily produce a blank PDF instead of informing you that it can't handle PNG files. It took some trial and error to find an image file format that it liked (TIFF).
- Even when called with the correct arguments, the program sometimes ends up producing a 0-byte PDF file. I let it run overnight to produce PDFs for 820 issues of the Standard, and in about a dozen cases it produced a 0-byte file. However, when I tried rerunning the tool on these cases, in all but one it successfully produced the PDF. So evidently it's a bit flaky.
- The tool still failed on one of my newspaper issues, throwing the error /usr/lib64/ruby/gems/1.8/gems/hpricot-0.8.5/lib/hpricot/parse.rb:33: [BUG] Segmentation fault. The problem is evidently an insidious and often-reported bug with hpricot, the HTML parser PDFBeads uses to process the hOCR files. There was nothing obviously wrong with the particular hOCR file that hpricot was choking on, and I found that making almost any trivial modification to it (such as adding another newline to the end of the file) allowed hpricot to process it without error.
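Two of these quirks (the hOCR lookup in the current directory and the occasional 0-byte output) can be worked around from a batch script. The following is a rough sketch in Python, not a tested recipe: the function names, the command parameter, and the retry count are all invented for the example, and it assumes only what is described above, namely that the tool reads image names from its arguments, looks for foo.html beside foo.tiff in the working directory, and writes the PDF to standard output.

```python
import subprocess
from pathlib import Path


def missing_hocr(images):
    """Return the images that have no same-named .html file alongside them."""
    return [p for p in map(Path, images) if not p.with_suffix(".html").exists()]


def run_in_image_dir(images, out_pdf, command=("pdfbeads",), retries=2):
    """Run `command` on the images from their own directory, retrying whenever
    the output file comes out empty.

    Returns True once the command exits cleanly and writes a non-empty file,
    and False if every attempt fails.
    """
    images = [Path(p).resolve() for p in images]
    workdir = images[0].parent
    if any(p.parent != workdir for p in images):
        raise ValueError("all images must share one directory")
    out_pdf = Path(out_pdf).resolve()
    for _ in range(1 + retries):
        with open(out_pdf, "wb") as out:
            # Pass bare filenames and set cwd, so that foo.html is looked up
            # in the same directory as foo.tiff.
            result = subprocess.run(
                [*command, *(p.name for p in images)], cwd=workdir, stdout=out
            )
        if result.returncode == 0 and out_pdf.stat().st_size > 0:
            return True
    return False
```

Any options go into command, for example command=("pdfbeads", "-f", "-t", "216"). Note that the size check catches only the 0-byte case; a blank but non-empty PDF, such as the one produced from PNG input, would still pass.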

Now that I've used the tool to produce a set of PDFs, I'm doing some spot checks on them to make sure they all look OK and have the hOCR data properly integrated. Also, because my scans vary in size (both in terms of pixels and physical paper dimensions) I may need to rerun the tool using different DPI settings for different issue ranges. Once that is done I can look at adding proper metadata to the PDFs. (Then there's the whole issue of using DjVu as an alternative, which I haven't investigated yet!)

DIY Book Scanner

In one of my recent posts an anonymous commenter alerted me to the existence of the DIY Book Scanner website, and more specifically its forum. The forum looks to be an excellent resource for anyone doing their own book (or newspaper) scanning project, and contains areas for discussing both hardware and software workflows. It's there that I first learned about PDFBeads (more on which in an upcoming post).

04 September 2012

Combining hOCR and image data into a PDF

You will recall that the page images originally supplied to me were DCT images embedded in PDFs, DCT being a lossy compression scheme based on the JPEG standard. I needed to crop, deskew, and OCR these images, for which I had to decompress them to bitmaps. The finished PDFs I ultimately produce will use lossless JBIG2 compression on the scans, or so the plan is.

At the moment I have the cropped and deskewed page images as lossless PNG bitmaps, along with the OCR'd text in hOCR format. Using jbig2enc it's easy to create JBIG2 bitmaps and symbol tables from the PNG files. However, I don't (yet) have any tool which will directly combine the JBIG2 data and the hOCR data for a page into a single PDF. Jbig2enc's pdf.py can assemble JBIG2 files into a PDF, but it doesn't add the hOCR text. I did some investigation and I think I have two options available to me:

I could use ExactImage's hocr2pdf to combine the PNG bitmaps and hOCR text into a PDF, and then use pdfsizeopt to JBIG2-compress the PDFs. There are two possibly surmountable problems with this:
- Hocr2pdf always converts the images you give it to the lossy DCT format when outputting them to the PDF. In our case this is a bad thing, because our images are already from a DCT-compressed source, and are pretty low resolution to begin with. From reading the comments in the hocr2pdf source code (the only source for which I found was a user-contributed openSUSE RPM) I see that support for other image compression schemes is on the to-do list:
```
// TODO: more image compressions, jbig2, Fax
```
  Fortunately, I think it should be easy to hack lossless image output into the code. The code for writing PDFs in codecs/pdf.cc starts off as follows:
```
  virtual void writeStreamTagsImpl(std::ostream& s)
  {
    // default based on image type
    if (image.bps < 8) encoding = "/FlateDecode";
    else encoding = "/DCTDecode";
```
  So apparently the code already supports not only DCT but also the lossless Flate scheme, and chooses between them based on the bit depth of the image. If I change the above code to
```
  virtual void writeStreamTagsImpl(std::ostream& s)
  {
    encoding = "/FlateDecode";
```
  and recompile, maybe hocr2pdf will no longer lossily compress the PNGs I feed it.
- You will recall that I got the best-looking results from jbig2enc when I set it to upscale the images by a factor of two. However, pdfsizeopt doesn't appear to let you change the scaling factor. Since pdfsizeopt is just a Python script which calls jbig2, I should be able to just add -2 to the system call at that point.
I could instead use something called PDFBeads. According to a thread on DIY Book Scanner, PDFBeads is a Ruby application which can add hOCR to PDF. However, the reader is warned that the "manual [is] in Russian only"! This could be fun.

So tomorrow will be spent patching hocr2pdf and pdfsizeopt, and/or learning Russian. :)
