25 September 2011

Upsampling results

In June I thought I had completed cropping, but upon reviewing the covers I spotted a further three with problems. It's possible that some of the inner pages were likewise overlooked, though I'll leave that rather tedious inspection for later (or perhaps to eagle-eyed volunteers, if I can get any!)…

Also in June I indicated that I had used jbig2enc to produce three PDFs for each of the issues I have up to the end of 1969: one with no upsampling, one with 2× upsampling, and one with 4× upsampling. The sizes of the resulting collections are as follows:

upsampling    size (MB)
none          427
2×            733
4×            1257
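To put these figures in perspective, here is a small back-of-the-envelope sketch in Python. It assumes that the second and third sizes belong to the 2× and 4× collections respectively, and that a single-layer DVD holds about 4.7 GB, which I round to 4700 MB; the names used are purely illustrative.

```python
# Collection sizes in MB, as listed in the table above.
# Assumption: the 733 MB and 1257 MB figures are the 2x and 4x collections.
SIZES_MB = {"none": 427, "2x": 733, "4x": 1257}

# Assumption: a single-layer DVD holds roughly 4.7 GB, taken here as 4700 MB.
DVD_CAPACITY_MB = 4700


def growth_factor(label: str) -> float:
    """Size of a collection relative to the non-upsampled one."""
    return SIZES_MB[label] / SIZES_MB["none"]


def dvd_fraction(label: str) -> float:
    """Fraction of the assumed DVD capacity that a collection occupies."""
    return SIZES_MB[label] / DVD_CAPACITY_MB


for label, size in SIZES_MB.items():
    print(
        f"{label:>4}: {size:5d} MB, "
        f"{growth_factor(label):.2f}x the baseline, "
        f"{dvd_fraction(label):.0%} of a DVD"
    )
```

Under those assumptions, the 2× collection is roughly 1.7 times the size of the non-upsampled one and the 4× collection just under 3 times, even though the page images contain 4 and 16 times as many pixels respectively.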

So even at the greatest upsampling the collection is still easily small enough to fit on a DVD-ROM. However, from browsing the PDF covers, it seems a lot of the 4× upsampled issues are missing text (similar to how unpaper sometimes omits text on pages with low contrast or a close black border). I figure this could be fixed by playing with jbig2enc's threshold settings, though that could take several hours or days of experimentation and careful checking of the output.
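The threshold experiment could be partly automated. The following is only a sketch of how I might go about it, not something I have run: it assumes the relevant setting is jbig2enc's `-T` option (the black-and-white threshold applied after upsampling), that the encoder is installed as `jbig2`, and that the pages are encoded in symbol mode with PDF-ready output (`-s -p`). The function names, file names and threshold values are hypothetical.

```python
import subprocess
from pathlib import Path


def jbig2_command(pages, basename, upsample=4, bw_threshold=None):
    """Build a jbig2enc command line for one encoding run.

    Assumes the jbig2enc options -s (symbol mode), -p (PDF-ready output),
    -b (output basename), -2 / -4 (upsample before thresholding) and
    -T (1 bpp threshold).
    """
    cmd = ["jbig2", "-s", "-p", "-b", str(basename)]
    if upsample == 2:
        cmd.append("-2")
    elif upsample == 4:
        cmd.append("-4")
    elif upsample != 1:
        raise ValueError("upsample must be 1, 2 or 4")
    if bw_threshold is not None:
        cmd += ["-T", str(bw_threshold)]
    cmd += [str(page) for page in pages]
    return cmd


def sweep(pages, out_dir, thresholds, upsample=4, run=False):
    """Prepare (and optionally run) one encoding per threshold value."""
    commands = {}
    for threshold in thresholds:
        basename = Path(out_dir) / f"t{threshold}"
        cmd = jbig2_command(pages, basename, upsample, threshold)
        commands[threshold] = cmd
        if run:
            # Requires jbig2enc to be installed and the output directory to exist.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Hypothetical cover scan and threshold values, printed for inspection only.
    for threshold, cmd in sweep(["cover.png"], "out", [140, 170, 200]).items():
        print(threshold, " ".join(cmd))
```

Each run would still need its output assembled into a PDF and checked by eye for missing text, so a sweep like this saves typing rather than the careful inspection.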

The PDFs with 2× upsampling seem to be of good quality—much better than the non-upsampled ones, and not obviously worse than the 4× upsampled ones. And none of the covers had any obviously missing text; I hope this holds for the inside pages as well. I think, then, that it will be best to go with 2× upsampling. You can see some comparisons below.

[Comparison images: no upsampling, 2× upsampling, 4× upsampling]


My next steps will therefore be as follows:

  • Locate and scan the missing covers of the January 1968 and January 1969 issues. I got photocopies of these pages before I left London; I hope they didn't get lost in the move to Darmstadt!
  • Start investigating OCR software so that the final collection can have full-text search.
  • Start investigating DjVu to compare it with the PDFs.
