23 March 2010


The goal of today's exercise is to see how to make the smallest possible PDF from the scans I have. To this end, I experimented with the September 1954 issue, which is a good candidate because it is fairly long and contains both text and images. The following table and graph summarize four approaches and the results.

| Process | PDF creation command | PDF size (KB) |
|---|---|---|
| join the JPEGs into a PDF with ImageMagick | `convert *.jpg JPEG.pdf` | 43 777 |
| convert the JPEGs to bilevel PNGs, then join them into a PDF with ImageMagick | `convert *.png PNG.pdf` | 6 907 |
| convert the JPEGs to bilevel JBIG2s, then join them into a PDF with jbig2enc | `jbig2 -b J -d -p -s *.jpg; pdf.py J > JBIG2.pdf` | 947 |
| upscale and convert the JPEGs to bilevel JBIG2s, then join them into a PDF with jbig2enc | `jbig2 -b J -d -p -s -2 *.jpg; pdf.py J > 2xJBIG2.pdf` | 1 451 |
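
To put the table's numbers side by side, here is a small sketch that works out how many times smaller each PDF is than the plain JPEG one. The sizes are copied from the table; the function and variable names are my own and only illustrative.

```python
# PDF sizes in KB, copied from the table above.
sizes_kb = {
    "JPEG": 43777,
    "bilevel PNG": 6907,
    "JBIG2": 947,
    "2x JBIG2": 1451,
}


def reduction_factors(sizes, baseline="JPEG"):
    """Return how many times smaller each PDF is than the baseline PDF."""
    base = sizes[baseline]
    return {name: base / kb for name, kb in sizes.items()}


factors = reduction_factors(sizes_kb)
for name, factor in factors.items():
    # The baseline row itself always comes out as a factor of 1.0.
    print(f"{name:12s} {sizes_kb[name]:>6d} KB  {factor:5.1f}x vs JPEG")
```

By this measure the JBIG2 PDF is roughly 46 times smaller than the JPEG PDF, and the 2× upscaled version is still about 30 times smaller.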

So the clear winner here is JBIG2. The 2× upscaled version is actually much easier to read than the unscaled JBIG2 or PNG images, which are sometimes too faint. If I were to use the 2× upscaled JBIG2 method to produce PDFs for all the 1904–1972 issues, the total would be about 450 MB in size, which would easily fit on a single CD-ROM.
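
As a rough sanity check on that estimate, the sketch below asks how many issues of the September 1954 size would add up to 450 MB, and what share of a disc that would fill. It rests on assumptions that are not in the measurements above: 1 MB is taken as 1000 KB, the disc capacity is taken as a nominal 700 MB, and every issue is treated as being the same size as this one (it is a fairly long issue, so the real count is probably higher).

```python
ISSUE_KB = 1451    # 2x JBIG2 PDF of the September 1954 issue (from the table)
ARCHIVE_MB = 450   # estimated total for all the 1904-1972 issues
CD_ROM_MB = 700    # assumption: nominal capacity of a common CD-ROM


def implied_issue_count(archive_mb, issue_kb, kb_per_mb=1000):
    """Number of issues of size issue_kb that add up to archive_mb."""
    return archive_mb * kb_per_mb / issue_kb


def fraction_of_disc(archive_mb, disc_mb):
    """Share of the disc that the archive would occupy."""
    return archive_mb / disc_mb


count = implied_issue_count(ARCHIVE_MB, ISSUE_KB)
share = fraction_of_disc(ARCHIVE_MB, CD_ROM_MB)
print(f"about {count:.0f} issues of this size; {share:.0%} of one disc")
```

Under these assumptions the estimate corresponds to a little over 300 issues of this length and leaves about a third of the disc free.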

However, I know that much better compression ratios can be achieved using DjVu: pages can typically be reduced to just a few kilobytes each. The problem is that creating DjVu documents is a bit more involved. I tried using pdf2djvu, but the DjVu files it created were even larger than the PDFs; clearly what I really need to do is use the individual DjVuLibre tools to properly segment and compress the original cropped JPEGs. Fortunately there appears to be some guidance, along with scripts, on Wikisource. The Wikisource guide also pointed me towards unpaper, which apparently does a better job of autocropping scans than my own tool, and also deskews the pages. So the next few days will probably be spent investigating these resources.
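
For orientation, here is a minimal sketch of the simplest DjVuLibre route, which is not the segmented approach described above: it treats every page as purely bilevel, compresses each one with `cjb2`, and bundles the results with `djvm -c`. It assumes the pages have already been converted to PBM files; the file names, the output name, and the 300 dpi value are all hypothetical placeholders, not settings I have tested on these scans. The function only builds the command lines and does not run them.

```python
import shlex
from pathlib import Path


def djvu_commands(pbm_pages, output="issue.djvu", dpi=300):
    """Build (not run) DjVuLibre commands for a bilevel multi-page document."""
    commands = []
    page_files = []
    for pbm in pbm_pages:
        page = str(Path(pbm).with_suffix(".djvu"))
        # cjb2 compresses one bitonal image into a single-page DjVu file.
        commands.append(["cjb2", "-dpi", str(dpi), str(pbm), page])
        page_files.append(page)
    # djvm -c bundles the single-page files into one multi-page document.
    commands.append(["djvm", "-c", output, *page_files])
    return commands


# Hypothetical page names, printed as shell-ready command lines.
for cmd in djvu_commands(["page001.pbm", "page002.pbm"]):
    print(shlex.join(cmd))
```

Each command list could be passed to `subprocess.run(cmd, check=True)` once the tools are installed. Pages with photographs would lose their images this way, which is why the segmented route still looks like the one worth investigating.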

1 comment:

  1. I know there is also *GSDjVu*, which performs conversion to DjVu. I have never used it, but you may want to test it.

    DjVuDigital is a very efficient way of converting PostScript and PDF documents into DjVu. It relies on a GhostScript driver named GSDjVu that analyzes the sequence of rendering operations and classifies each of them as foreground or background. The rather sophisticated algorithm is fully described in this paper. The resulting segmentation is then used to produce a DjVu file, using, for instance, the DjVuLibre program csepdjvu.