Socialist Standard digitization blog: Manual cropping

21 March 2010

Manual cropping

This afternoon I used GIMP to find cropping coordinates for the 18 pages my autocrop program didn't successfully process. Having passed these to jpegtran, I'm now in possession of 13 020 properly cropped JPEG images, of which 11 164 are unique pages of the Socialist Standard (and the 1967 supplement) and the remaining 1856 are blank pages, microfiche title slides, indices, or duplicates.
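
As a rough sketch of the jpegtran step, the loop below reads crop coordinates from a text file and crops each listed page losslessly. The file format (filename, width, height, x offset and y offset on each line), the function name crop_from_list, and the output directory are assumptions for illustration, not the actual layout used here. Filenames are assumed to contain no spaces.

```shell
#!/bin/bash
# Crop every page listed in a coordinates file with jpegtran.
# Assumed (hypothetical) list format, one page per line:
#   filename width height x_offset y_offset
crop_from_list() {
    local list="$1" outdir="$2"
    local file w h x y
    mkdir -p "$outdir" || return 1
    while read -r file w h x y; do
        # Skip blank lines and lines starting with '#'.
        case "$file" in
            ''|'#'*) continue ;;
        esac
        # -crop takes a geometry of the form WxH+X+Y.
        jpegtran -crop "${w}x${h}+${x}+${y}" \
            -outfile "$outdir/$(basename "$file")" "$file"
    done < "$list"
}
```

It could be invoked as "crop_from_list coords.txt cropped". Because the crop is lossless, jpegtran may move the top-left corner up or left to the nearest block boundary, so a cropped page can come out slightly larger than requested.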

Cropping the images and discarding the irrelevant pages has brought the size of the corpus down from 17.58 GB to 15.13 GB, a savings of 13.94%. Of course, if LSE had properly scanned them as high-resolution bilevel images rather than JPEGs in the first place, the size would have been about a third of this. I am wondering if there is some way to convert the JPEGs to bilevel images, but given the relatively poor quality of the photographs and low resolution of the scans, this may not be possible. I'll have a go at batch-converting them with ImageMagick and examine the results, but I am not optimistic that they will be acceptable.
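
A minimal sketch of what such a batch conversion might look like, assuming ImageMagick's convert command is installed (on ImageMagick 7 the command is magick). The function name to_bilevel, the directory arguments, and the 50% default threshold are illustrative assumptions. A single global threshold is only the simplest option and may well be too crude for these scans.

```shell
#!/bin/bash
# Convert every .jpg in a source directory to a thresholded
# black-and-white PNG in a destination directory.
# The threshold defaults to 50%, an arbitrary starting point.
to_bilevel() {
    local src="$1" dst="$2" threshold="${3:-50%}"
    local file
    mkdir -p "$dst" || return 1
    for file in "$src"/*.jpg; do
        # If nothing matches, the glob stays literal; skip it.
        [ -e "$file" ] || continue
        # Reduce to grayscale, then force each pixel to black or white.
        convert "$file" -colorspace Gray -threshold "$threshold" \
            "$dst/$(basename "$file" .jpg).png"
    done
}
```

For example, "to_bilevel cropped bilevel 60%" would write one PNG per JPEG into the bilevel directory, which makes it easy to compare a few thresholds side by side before committing to one.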

At any rate, the next step will be to assemble the individual pages into PDFs or DjVus, one issue per file. I shall have to look around to see what software is available for this. The only one I'm aware of is the pdfpages package for pdfTeX, though I'm sure there are others more suitable for my task.

1 comment:

Dingo, Sunday, March 21, 2010 at 9:03:00 p.m. GMT+1
For DjVu encoding:

For black-and-white scans,

*minidjvu*
http://minidjvu.sourceforge.net/

may be the best choice.

For colour/grayscale scans,

*Djvusolo* (freeware)
http://www.djvu.org/resources/

is also available.

========
For PDF assembly I usually use:

*sam2p*
http://pts.szit.bme.hu/sam2p/

with this script (which I wrote to convert .png to .pdf):

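#!/bin/bash

# Convert every .png in the current directory to a single-page .pdf.
for file in ./*.png
do
    # Skip the literal pattern if no .png files are present.
    [ -e "$file" ] || continue
    filename=${file%.png}
    # Quote the names so paths containing spaces survive.
    sam2p "$file" "$filename.pdf"
done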

Finally, I join all the single-page PDFs into one PDF with

*pdftk*
http://www.accesspdf.com/pdftk/

pdftk *.pdf cat output joined.pdf
pdftk joined.pdf output fixed.pdf
mv fixed.pdf joined.pdf

These steps

pdftk joined.pdf output fixed.pdf
mv fixed.pdf joined.pdf

are needed because the PDF's xref table is rebuilt correctly by

pdftk joined.pdf output fixed.pdf