Socialist Standard digitization blog: PDFs: JPEG vs PNG vs JBIG2

23 March 2010

PDFs: JPEG vs PNG vs JBIG2

The goal of today's exercise is to find out how to make the smallest possible PDF from the scans I have. To this end, I experimented with the September 1954 issue, which is a good candidate because it is fairly long and contains both text and images. The following table and graph summarize the four approaches I tried and their results.

| Process | PDF creation command | PDF size (KB) |
|---|---|---|
| join the JPEGs into a PDF with ImageMagick | `convert *.jpg JPEG.pdf` | 43 777 |
| convert the JPEGs to bilevel PNGs, then join them into a PDF with ImageMagick | `convert *.png PNG.pdf` | 6 907 |
| convert the JPEGs to bilevel JBIG2s, then join them into a PDF with jbig2enc | `jbig2 -b J -d -p -s *.jpg; pdf.py J > JBIG2.pdf` | 947 |
| upscale and convert the JPEGs to bilevel JBIG2s, then join them into a PDF with jbig2enc | `jbig2 -b J -d -p -s -2 *.jpg; pdf.py J > 2xJBIG2.pdf` | 1 451 |
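The two jbig2enc rows in the table above differ only in the `-2` flag, so they are easy to wrap in a small script. The following is a minimal sketch rather than the script I actually used: the function names `jbig2_command` and `build_pdf` are made up for illustration, and it assumes that `jbig2` and `pdf.py` are both on the `PATH`, as the commands in the table imply.

```python
import subprocess
from pathlib import Path


def jbig2_command(jpegs, basename="J", upscale=False):
    """Return the jbig2enc argument list matching the commands in the table."""
    # -b: output basename, -d: duplicate line removal,
    # -p: produce PDF-ready output, -s: symbol mode
    cmd = ["jbig2", "-b", basename, "-d", "-p", "-s"]
    if upscale:
        cmd.append("-2")  # upsample 2x before thresholding
    return cmd + [str(p) for p in jpegs]


def build_pdf(jpeg_dir, out_pdf, basename="J", upscale=False, pdf_py="pdf.py"):
    """Encode every JPEG in jpeg_dir, then assemble the pages with pdf.py."""
    jpegs = sorted(Path(jpeg_dir).glob("*.jpg"))
    if not jpegs:
        raise ValueError(f"no .jpg files found in {jpeg_dir}")
    subprocess.run(jbig2_command(jpegs, basename, upscale), check=True)
    # pdf.py writes the PDF to standard output, so redirect it to a file.
    with open(out_pdf, "wb") as fh:
        subprocess.run([pdf_py, basename], stdout=fh, check=True)


if __name__ == "__main__":
    # Hypothetical invocation: the current directory holds the cropped JPEGs.
    build_pdf(".", "2xJBIG2.pdf", upscale=True)
```

Keeping the command construction in its own function makes it possible to inspect the exact arguments before anything is run.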

So the clear winner here is JBIG2. The 2× upscaled version is actually much easier to read than the unscaled JBIG2 or PNG images, which are sometimes too faint. If I were to use the 2× upscaled JBIG2 method to produce PDFs for all the 1904–1972 issues, the total would be about 450 MB in size, which would easily fit on a single CD-ROM.
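To put numbers on "clear winner", the sizes from the table can be turned into compression ratios relative to the plain JPEG PDF. This is only a quick check on the figures above; the dictionary keys and the `ratios` helper are names chosen for this sketch.

```python
# PDF sizes in KB, copied from the table above.
sizes_kb = {
    "JPEG": 43_777,
    "PNG": 6_907,
    "JBIG2": 947,
    "2x JBIG2": 1_451,
}


def ratios(sizes, baseline="JPEG"):
    """Return how many times smaller each PDF is than the baseline PDF."""
    base = sizes[baseline]
    return {name: base / kb for name, kb in sizes.items()}


for name, ratio in ratios(sizes_kb).items():
    print(f"{name:10s} {sizes_kb[name]:>7,d} KB  {ratio:5.1f}x smaller than JPEG")
```

The PNG version comes out roughly 6 times smaller than the JPEG version, the plain JBIG2 version roughly 46 times smaller, and the 2× upscaled JBIG2 version roughly 30 times smaller, so the extra legibility of the upscaled pages costs about half as much again in file size.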

However, I know that much better compression ratios can be achieved using DjVu: pages can typically be reduced to just a few kilobytes each. The problem is that creating DjVu documents is a bit more involved. I tried using pdf2djvu, but the DjVu files it created were even larger than the PDFs; clearly what I really need to do is use the individual DjVuLibre tools to properly segment and compress the original cropped JPEGs. Fortunately there appears to be some guidance, along with scripts, on Wikisource. The Wikisource guide also pointed me towards unpaper, which apparently does a better job of autocropping scans than my own tool, and also deskews the pages. So the next few days will probably be spent investigating these resources.
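As a starting point for that investigation, here is a rough sketch of what a DjVuLibre-based pipeline might look like for pages that are pure text. It is an assumption-laden outline, not a tested workflow: it thresholds each JPEG to a bilevel PBM with ImageMagick, encodes it with DjVuLibre's `cjb2`, and bundles the pages with `djvm -c`. The 50% threshold and the 300 dpi value are placeholders, the function name `djvu_commands` is made up, and the sketch does no foreground/background segmentation, so pages with photographs would need a different approach.

```python
from pathlib import Path


def djvu_commands(jpegs, out_djvu, dpi=300):
    """Build (but do not run) the commands for a bilevel-only DjVu document.

    dpi is a placeholder; it should match the real scan resolution.
    """
    commands, pages = [], []
    for jpeg in map(Path, jpegs):
        pbm = jpeg.with_suffix(".pbm")
        page = jpeg.with_suffix(".djvu")
        # ImageMagick: reduce to black and white, since cjb2 reads PBM input.
        commands.append(["convert", str(jpeg), "-colorspace", "Gray",
                         "-threshold", "50%", str(pbm)])
        # DjVuLibre's bilevel (JB2) encoder, one DjVu file per page.
        commands.append(["cjb2", "-dpi", str(dpi), str(pbm), str(page)])
        pages.append(str(page))
    # djvm -c bundles the single-page files into one multi-page document.
    commands.append(["djvm", "-c", str(out_djvu)] + pages)
    return commands


if __name__ == "__main__":
    # Print the commands for the JPEGs in the current directory.
    for cmd in djvu_commands(sorted(Path(".").glob("*.jpg")), "issue.djvu"):
        print(" ".join(cmd))
```

Printing the commands rather than running them makes it easy to try them by hand on a page or two before committing to a whole issue.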

1 comment:

Dingo, Tuesday, March 23, 2010 at 8:35:00 p.m. GMT+1
I know there is also GSDjVu

http://djvu.sourceforge.net/gsdjvu.html

which performs conversion to DjVu.

I have never used it, but you may want to test it.

DjVuDigital is a very efficient way of converting PostScript and PDF documents into DjVu. It relies on a GhostScript driver named GSDjVu that analyzes the sequence of rendering operations and classifies each of them as foreground or background. The rather sophisticated algorithm is fully described in this paper. The resulting segmentation is then used to produce a DjVu file, using, for instance, the DjVuLibre program csepdjvu.