Also in June, I mentioned that I had used jbig2enc to produce three PDFs for each of the issues I have up to the end of 1969: one with no upsampling, one with 2× upsampling, and one with 4× upsampling. The sizes of the resulting collections are as follows:
upsampling | size (MB) |
---|---|
none | 427 |
2× | 733 |
4× | 1257 |
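As a quick sanity check on these figures, here is a small Python sketch that works out how much each upsampling level inflates the collection and how much of a disc it would occupy. The 4700 MB capacity is my assumption for a nominal single-layer DVD (4.7 GB, decimal); the sizes are simply copied from the table, and the function names are made up for illustration.

```python
# Collection sizes in MB, copied from the table above.
sizes_mb = {"none": 427, "2x": 733, "4x": 1257}

# Assumption: nominal single-layer DVD capacity of 4.7 GB (decimal units).
DVD_CAPACITY_MB = 4700


def growth_factor(label, baseline="none"):
    """Size of one collection relative to the non-upsampled one."""
    return sizes_mb[label] / sizes_mb[baseline]


def dvd_fraction(label):
    """Fraction of the assumed DVD capacity that one collection occupies."""
    return sizes_mb[label] / DVD_CAPACITY_MB


for label, size in sizes_mb.items():
    print(
        f"{label:>4}: {size:5d} MB, "
        f"{growth_factor(label):.2f}x the baseline, "
        f"{dvd_fraction(label):.0%} of a DVD"
    )
```

Note that the file size grows much more slowly than the pixel count: 4× upsampling means 16 times as many pixels per page, but the collection is only about three times as large.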
So even at 4× upsampling, the collection is still easily small enough to fit on a DVD-ROM. However, from browsing the covers in the PDFs, it seems that many of the 4× upsampled issues are missing text (similar to how unpaper sometimes omits text on pages with low contrast or a close black border). I figure this could be fixed by playing with jbig2enc's threshold settings, though that could take several hours or days of experimentation and careful checking of the output.
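If I do end up experimenting with thresholds, the sweep could be organized roughly like this. This is only a sketch: it assumes the setting that matters is jbig2enc's black/white threshold (`-T`), uses the `-s`, `-p`, `-b`, `-2` and `-4` options as I understand them from the tool's usage text (worth checking against `jbig2 --help` for the installed version), and the page file names and threshold values are hypothetical. It only builds and prints the command lines; it does not run them or assemble the final PDF.

```python
import shlex


def jbig2_command(pages, bw_threshold, upsample=4, basename="out"):
    """Build (but do not run) one jbig2enc command line.

    Assumed flags: -s symbol mode, -p PDF-ready output, -b output basename,
    -T black/white threshold, -2 / -4 upsample before thresholding.
    """
    if upsample not in (1, 2, 4):
        raise ValueError("upsample must be 1, 2 or 4")
    cmd = ["jbig2", "-s", "-p", "-b", basename, "-T", str(bw_threshold)]
    if upsample == 2:
        cmd.append("-2")
    elif upsample == 4:
        cmd.append("-4")
    return cmd + list(pages)


def threshold_sweep(pages, thresholds, upsample=4):
    """Map each threshold to its command, with a distinct basename per run."""
    return {
        t: jbig2_command(pages, t, upsample, basename=f"t{t}")
        for t in thresholds
    }


# Hypothetical cover scan and threshold values, for illustration only.
for threshold, cmd in threshold_sweep(["cover.png"], [160, 180, 200]).items():
    print(threshold, shlex.join(cmd))
```

Giving each run its own basename keeps the outputs side by side, which should make it easier to compare the same cover across thresholds and spot where text drops out.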
The PDFs with 2× upsampling seem to be of good quality: much better than the non-upsampled ones, and not obviously worse than the 4× upsampled ones. None of the covers had any obviously missing text, and I hope this holds for the inside pages as well. I think, then, that it will be best to go with 2× upsampling. You can see some comparisons below.
[Comparison images: no upsampling, 2× upsampling, 4× upsampling]
My next steps will therefore be as follows:
- Locate and scan the missing covers of the January 1968 and January 1969 issues. I got photocopies of these pages before I left London; I hope they didn't get lost in the move to Darmstadt!
- Start investigating OCR software so that the final collection can have full-text search.
- Start investigating DjVu to compare it with the PDFs.