20 December 2012

PDFBeads doesn't like consecutive whitespace in hOCR

Lazy Kent has now published openSUSE RPMs for Tesseract 3.02, so I installed them and ran the new version on the files that Tesseract 3.01 had been failing on. This time it was able to produce hOCR files for them. However, PDFBeads did not like some of these hOCR files:

$ pdfbeads -f -t 216 -m JBIG2 -b JP2 -B 72 1910-036b.tiff  >/dev/null
Prepared data for processing 1910-036b.tiff
JBIG2 compression complete. pages:1 symbols:5780 log2:13
/usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:589:in `to_i': NaN (FloatDomainError)
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:589:in `getPDFText'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:584:in `each_index'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:584:in `getPDFText'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:564:in `each'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:564:in `getPDFText'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:267:in `process'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:202:in `each'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:202:in `process'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/bin/pdfbeads:203
        from /usr/bin/pdfbeads:19:in `load'
        from /usr/bin/pdfbeads:19

Through a tedious binary search through the files, I narrowed the problem down to cases where the hOCR file contains two or more consecutive spans of class ocrx_word that contain nothing but whitespace. Affected files can be found by grepping for the following extended regular expression:

(<span[^>]*>(<strong>)? +(</strong>)?</span> *){2}
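For anyone who would rather script the search than use grep, here is a minimal Python sketch of the same check. The pattern is the expression above, unchanged. The function name has_whitespace_span_run is my own invention for illustration. Unlike grep, this searches the whole file at once rather than line by line, which should make no practical difference here since the pattern only allows spaces between spans.

```python
import re
import sys

# The extended regular expression from the post, unchanged.
# It matches two consecutive spans whose only content is spaces,
# optionally wrapped in <strong>.
WHITESPACE_SPAN_RUN = re.compile(
    r"(<span[^>]*>(<strong>)? +(</strong>)?</span> *){2}"
)


def has_whitespace_span_run(hocr_text):
    """Return True if the text contains two consecutive space-only spans.

    Hypothetical helper name, not part of PDFBeads or Tesseract.
    """
    return WHITESPACE_SPAN_RUN.search(hocr_text) is not None


if __name__ == "__main__":
    # Print the name of every file given on the command line that matches.
    for path in sys.argv[1:]:
        with open(path, encoding="utf-8") as handle:
            if has_whitespace_span_run(handle.read()):
                print(path)
```

Saved under some name of your choosing, it would be run with the hOCR files as arguments and would print the affected ones.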

I don't know the hOCR format in detail, but I suspect that having two whitespace-only ocrx_word spans in a row isn't prohibited. It's therefore probably PDFBeads that is at fault here.

The next step is therefore to process the Tesseract output with sed to find and remove the duplicate whitespace spans. After that, it looks like my OCR job is complete, and the only really necessary remaining work is to build some HTML-based UI to access the PDFs.
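The cleanup could be sketched roughly as follows. This is Python rather than sed, purely for illustration, and it rests on an assumption I have not verified: that keeping the first span of each run and dropping the rest is enough to keep PDFBeads happy, and that doing so leaves the hOCR valid. The function name collapse_whitespace_spans is hypothetical.

```python
import re

# One span whose only content is spaces, optionally wrapped in <strong>,
# followed by any spaces that separate it from the next span.
SPACE_SPAN = r"<span[^>]*>(?:<strong>)? +(?:</strong>)?</span> *"

# A run of two or more such spans; group 1 captures the first one.
SPACE_SPAN_RUN = re.compile("(" + SPACE_SPAN + ")(?:" + SPACE_SPAN + ")+")


def collapse_whitespace_spans(hocr_text):
    """Keep only the first span of each run of consecutive space-only spans.

    Assumption: dropping the later spans is harmless for PDFBeads.
    """
    return SPACE_SPAN_RUN.sub(r"\1", hocr_text)
```

Spans that contain actual text are left untouched, as is a single whitespace-only span on its own, since only runs of two or more trigger the problem described above.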