The goal of today's exercise is to see how to make the smallest possible PDF from the scans I have. To this end, I experimented with the September 1954 issue, which is a good candidate because it is fairly long and contains both text and images. The following table summarizes four approaches and their results.
| process | PDF creation command | PDF size (KB) |
|---------|----------------------|---------------|
| join the JPEGs into a PDF with ImageMagick | `convert *.jpg JPEG.pdf` | 43 777 |
| convert the JPEGs to bilevel PNGs, then join them into a PDF with ImageMagick | `convert *.png PNG.pdf` | 6 907 |
| convert the JPEGs to bilevel JBIG2s, then join them into a PDF with jbig2enc | `jbig2 -b J -d -p -s *.jpg; pdf.py J > JBIG2.pdf` | 947 |
| upscale and convert the JPEGs to bilevel JBIG2s, then join them into a PDF with jbig2enc | `jbig2 -b J -d -p -s -2 *.jpg; pdf.py J > 2xJBIG2.pdf` | 1 451 |
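The table leaves out the binarization step for the bilevel-PNG route. A minimal sketch of how that step could look with ImageMagick — the `-monochrome` flag and the loop are my assumptions, not necessarily the exact commands used:

```shell
#!/bin/sh
# Sketch of the bilevel-PNG route (flags are a guess at a reasonable
# binarization, not a record of what was actually run).
bilevel_pngs() {
  for f in *.jpg; do
    # -monochrome dithers each page down to 1-bit black and white;
    # ${f%.jpg} strips the extension to build the output name
    convert "$f" -monochrome "${f%.jpg}.png"
  done
  convert *.png PNG.pdf   # join the bilevel PNGs into one PDF
}

# Run only where ImageMagick is actually installed
command -v convert >/dev/null 2>&1 && bilevel_pngs
```

Depending on the scans, a `-threshold` percentage may give cleaner results than dithering, but that needs experimenting per issue.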
So the clear winner here is JBIG2. The 2× upscaled version is actually much easier to read than the unscaled JBIG2 or PNG images, which are sometimes too faint. If I were to use the 2× upscaled JBIG2 method to produce PDFs for all the 1904–1972 issues, the total would come to about 450 MB, which would easily fit on a single CD-ROM.
However, I know that much better compression ratios can be achieved using DjVu—pages can typically be reduced to just a few kilobytes each. The problem is that creating DjVu documents is a bit more involved. I tried using pdf2djvu, but the DjVu files it produced were even larger than the PDFs; clearly what I really need to do is use the individual DjVuLibre tools to properly segment and compress the original cropped JPEGs. Fortunately, there appears to be some guidance, along with scripts, on Wikisource. The Wikisource guide also pointed me towards unpaper, which apparently autocrops scans better than my own tool does, and deskews the pages as well. So the next few days will probably be spent investigating these resources.
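As a starting point before digging into the Wikisource scripts, here is a rough sketch of a per-page DjVuLibre pipeline for bilevel material — the tool order and flags are my assumptions about a plausible recipe, not something I have verified against these scans:

```shell
#!/bin/sh
# Sketch (assumptions, not a tested recipe): deskew/crop each page with
# unpaper, binarize with ImageMagick, encode with DjVuLibre's cjb2, then
# bundle the per-page DjVu files into one document with djvm.
djvu_issue() {
  for f in *.jpg; do
    stem=${f%.jpg}
    convert "$f" "$stem.pgm"                     # JPEG -> greyscale PGM
    unpaper "$stem.pgm" "$stem.fixed.pgm"        # deskew and autocrop
    convert "$stem.fixed.pgm" -monochrome "$stem.pbm"
    cjb2 -clean -lossy "$stem.pbm" "$stem.djvu"  # bilevel JB2 encoding
  done
  djvm -c issue.djvu *.djvu                      # bundle pages into one file
}

# Run only where the DjVuLibre tools are actually installed
command -v cjb2 >/dev/null 2>&1 && djvu_issue
```

For pages with halftone images, the segmented route (csepdjvu, or the Wikisource scripts built around it) would presumably do better than treating everything as bilevel.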