21 March 2010

Manual cropping

This afternoon I used GIMP to find cropping coordinates for the 18 pages my autocrop program didn't successfully process. Having passed these to jpegtran, I'm now in possession of 13 020 properly cropped JPEG images, of which 11 164 are unique pages of the Socialist Standard (and the 1967 supplement) and the remaining 1856 are blank pages, microfiche title slides, indices, or duplicates.
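For illustration, passing hand-measured coordinates to jpegtran in a batch could look something like the sketch below. The coordinates file, its column layout, and the function names are assumptions made for the example, not a description of the actual files used here; file names are assumed to contain no whitespace. Note that jpegtran may move the upper-left corner of the region up or left to the nearest block boundary, since a lossless crop can only start there.

```shell
#!/bin/bash
# Sketch only: losslessly crop JPEGs with jpegtran from a list of coordinates.
# Assumed (hypothetical) list format, one page per line:
#   <file> <width> <height> <x-offset> <y-offset>

# Build the WxH+X+Y region string that jpegtran's -crop option expects.
crop_geometry() {
    printf '%sx%s+%s+%s\n' "$1" "$2" "$3" "$4"
}

# Crop every page named in the list, writing the results to an output directory.
crop_from_list() {
    local list=$1 outdir=$2
    local file w h x y
    mkdir -p "$outdir"
    while read -r file w h x y; do
        [ -n "$file" ] || continue          # skip blank lines
        jpegtran -crop "$(crop_geometry "$w" "$h" "$x" "$y")" \
                 -outfile "$outdir/$(basename "$file")" "$file"
    done < "$list"
}

# Usage (hypothetical file names):
#   crop_from_list manual-crops.txt cropped
```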

Cropping the images and discarding the irrelevant pages has brought the size of the corpus down from 17.58 GB to 15.13 GB, a savings of 13.94%. Of course, if LSE had properly scanned them as high-resolution bilevel images rather than JPEGs in the first place, the size would have been about a third of this. I am wondering if there is some way to convert the JPEGs to bilevel images, but given the relatively poor quality of the photographs and the low resolution of the scans, this may not be possible. I'll have a go at batch-converting them with ImageMagick and examine the results, but I am not optimistic that they will be acceptable.
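A batch conversion of this kind might be sketched as follows. This is only one plausible approach (a fixed global threshold); the 50% level, the PNG output format, the directory names, and the function names are all assumptions chosen for the example, and the threshold would almost certainly need tuning per scan.

```shell
#!/bin/bash
# Sketch only: threshold each JPEG to a 1-bit image with ImageMagick's convert.

# Map an input image path to a .png path inside the output directory.
bilevel_name() {
    local src=$1 outdir=$2 base
    base=$(basename "$src")
    printf '%s/%s.png\n' "$outdir" "${base%.*}"
}

# Convert one page: grayscale first, then a hard black/white threshold.
# The default level of 50% is an arbitrary starting point.
to_bilevel() {
    local src=$1 dst=$2 level=${3:-50%}
    convert "$src" -colorspace Gray -threshold "$level" -type Bilevel "$dst"
}

# Convert every .jpg in a directory, leaving the originals untouched.
batch_bilevel() {
    local indir=$1 outdir=$2 src
    mkdir -p "$outdir"
    for src in "$indir"/*.jpg; do
        [ -e "$src" ] || continue           # directory contains no .jpg files
        to_bilevel "$src" "$(bilevel_name "$src" "$outdir")"
    done
}

# Usage (hypothetical directory names):
#   batch_bilevel cropped bilevel
```

Writing to a separate directory makes it easy to compare a handful of pages side by side before committing to the whole corpus.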

At any rate, the next step will be to assemble the individual pages into PDFs or DjVus, one issue per file. I shall have to look around to see what software is available for this. The only one I'm aware of is the pdfpages package for pdfTeX, though I'm sure there are others more suitable for my task.
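Whatever software ends up doing the assembly, the pages first have to be grouped by issue. A minimal sketch of that grouping step, assuming a purely hypothetical naming scheme of `<year>-<month>-p<page>.jpg` (the real file names may look nothing like this):

```shell
#!/bin/bash
# Sketch only: group page images by issue, based on a hypothetical naming
# scheme such as 1904-09-p001.jpg (issue "1904-09", page 1).

# Print the issue identifier for one page file.
issue_of() {
    local name
    name=$(basename "$1")
    printf '%s\n' "${name%-p*}"             # strip the "-p<page>.<ext>" suffix
}

# Print each distinct issue found among the given page files, sorted.
list_issues() {
    local f
    for f in "$@"; do
        issue_of "$f"
    done | sort -u
}

# Print the page files that belong to the given issue, in argument order.
pages_of() {
    local issue=$1 f
    shift
    for f in "$@"; do
        if [ "$(issue_of "$f")" = "$issue" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Usage (hypothetical directory name):
#   for issue in $(list_issues cropped/*.jpg); do
#       pages_of "$issue" cropped/*.jpg    # hand this list to the assembler
#   done
```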

1 comment:

  1. For DjVu encoding:

    For black-and-white scans,

    *minidjvu*
    http://minidjvu.sourceforge.net/

    may be the best choice.

    For color/grayscale scans,

    *Djvusolo* (freeware)
    http://www.djvu.org/resources/

    is also available.

    ========
    For assembling PDFs I usually use:

    *sam2p*
    http://pts.szit.bme.hu/sam2p/

    with this script (which I wrote to convert .png to .pdf):

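    #!/bin/bash

    # Convert every .png in the current directory to a single-page .pdf.
    for file in ./*.png
    do
        # If there are no .png files the glob stays unexpanded; skip it.
        [ -e "$file" ] || continue
        # Quoting keeps file names with spaces intact.
        sam2p "$file" "${file%.png}.pdf"
    done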

    Finally, I join all the single-page PDFs into one PDF with

    *pdftk*
    http://www.accesspdf.com/pdftk/

    pdftk *.pdf cat output joined.pdf
    pdftk joined.pdf output fixed.pdf
    mv fixed.pdf joined.pdf

    The last two steps,

    pdftk joined.pdf output fixed.pdf
    mv fixed.pdf joined.pdf

    are needed because the PDF's xref table is only rebuilt correctly by

    pdftk joined.pdf output fixed.pdf
