05 June 2011

Cropping complete

As reported in March 2010, the microfiche images omit the cover pages for the January 1968 and January 1969 issues. I visited the Socialist Party of Great Britain's head office yesterday and picked up a spare copy of the January 1969 issue. There were no spare January 1968 issues left, but the issue does appear in the bound volumes in the archive, so I took a photocopy. The binding obscures about a centimetre of the left edge of the page, but my copy is better than nothing, I guess. Now my only problem is getting these two pages digitized, as I no longer have access to a scanner. I'll either have to find someone with a scanner, or see if I can photograph the covers myself.

With most of the pages now in place, it's time to start thinking again about how to "bind" them into PDF or DjVu documents. Since it's been a year since I last experimented with this, I downloaded the latest version of jbig2enc and its dependency, Leptonica. I discovered that jbig2enc doesn't compile with Leptonica 1.68, but only because the parameters to the findFileFormat() function have changed. This function is referenced once, in jbig2.cc, where it's used in a check involving multi-page TIFFs. I don't use jbig2enc to process TIFFs, so I just commented out these lines, after which jbig2enc compiled fine.

My computer is now whirring away, generating three PDFs for each of the issues that I have up to the end of 1969: one with no upsampling, one with 2× upsampling, and one with 4× upsampling. It will probably be busy doing this all night. Once it's done, I'll examine the results to see what looks the best and what the file sizes are like. Watch this space for further analysis of the results…
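
For anyone wanting to set up a similar batch, here is a rough sketch of how the three variants per issue could be planned in Python. It is not the script I am actually running. It assumes the page images are PNG files, and the function names, issue name and output layout are made up for illustration. The jbig2 options used are -s (symbol mode), -p (PDF mode), -b (output basename), and -2 and -4 (upsample 2× or 4× before thresholding).

```python
from pathlib import Path

# Upsampling variants: label -> extra jbig2 flag (None means no upsampling).
VARIANTS = {"1x": None, "2x": "-2", "4x": "-4"}


def build_jbig2_command(pages, basename, upsample_flag=None):
    """Return the argument list for one jbig2 run in symbol/PDF mode."""
    cmd = ["jbig2", "-s", "-p", "-b", basename]
    if upsample_flag is not None:
        cmd.append(upsample_flag)
    cmd.extend(str(p) for p in pages)
    return cmd


def plan_issue(issue_name, pages, out_dir):
    """Return {variant label: (jbig2 command, target PDF path)} for one issue.

    issue_name and out_dir are hypothetical; adapt them to your own layout.
    """
    plan = {}
    for label, flag in VARIANTS.items():
        basename = str(Path(out_dir) / f"{issue_name}-{label}")
        command = build_jbig2_command(pages, basename, flag)
        plan[label] = (command, Path(basename + ".pdf"))
    return plan


if __name__ == "__main__":
    # Example with made-up file names; nothing is executed, only printed.
    pages = ["1968-01/p001.png", "1968-01/p002.png"]
    for label, (command, pdf) in plan_issue("1968-01", pages, "out").items():
        print(label, " ".join(command), "->", pdf)
```

Each command could then be run with subprocess.run, and the symbol and page files it writes under the basename assembled into a PDF with the pdf.py script that ships with jbig2enc. Keeping the planning separate from the execution makes it easy to check the command lines before committing the machine to an all-night run.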

02 June 2011

unpaper revisited

I have spent the last few days reviewing and revising the results I obtained with unpaper in March of last year. After double-checking the image output, I found I had missed some cases where unpaper had failed to process the images properly. By applying the appropriate command-line options discussed previously, I was able to get unpaper to process most of these correctly; the rest I added to the list of images that unpaper cannot process. I also double-checked this list, which originally had 142 images, and found that many of them could be processed successfully with a little more experimentation with the command-line options.

In the end, the list grew by a net seven images, so it now contains 149. These are almost exclusively pages with illustrations (usually cover pages). I will now have to do some preliminary tests to determine whether it would be more efficient to crop these images manually or to use my autocrop tool.