15 March 2010

LSE PDF cropping

Using the pdfimages tool from Poppler, I've confirmed that all the LSE PDFs consist of full-page RGB DCT images. There are a couple of problems with this:

  1. This format is lossy and thus the quality of the images could suffer if I transform them.
  2. The files are needlessly large, since they are stored in full colour even though the scans are only in shades of grey.

Now, it is possible to extract the images as JPEGs and crop them losslessly using jpegtran, but only when the upper left corner of the cropped region falls on an iMCU boundary. Fortunately, the images are scanned at a high enough resolution that increasing the dimensions of the region by a few extra pixels to ensure boundary alignment shouldn't be a problem. It's also possible to losslessly convert RGB JPEGs to greyscale, though the reduction in file size is negligible (for these images about 3.4%).
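As a rough sketch of that boundary alignment: the helper below (align_crop is a made-up name, not part of jpegtran) moves the upper left corner up and left to the nearest iMCU boundary and grows the region by the same amount, so the originally chosen area is still fully covered. The iMCU size of 16 pixels is an assumption; it depends on the chroma subsampling of the JPEG and may be 8 pixels in one or both directions.

```bash
# Hypothetical helper: snap a crop region W H X Y to an assumed iMCU grid.
# The default block size of 16 is an assumption; pass 8 as a fifth argument
# for images without chroma subsampling.
align_crop() {
  local w=$1 h=$2 x=$3 y=$4 mcu=${5:-16}
  local ax=$(( x / mcu * mcu ))   # round X down to a multiple of mcu
  local ay=$(( y / mcu * mcu ))   # round Y down to a multiple of mcu
  # Widen and heighten the region by however far the corner moved.
  echo "$(( w + x - ax ))x$(( h + y - ay ))+${ax}+${ay}"
}

align_crop 1000 1500 37 21   # prints 1005x1505+32+16
```

The printed string is in the WxH+X+Y form that jpegtran's -crop option accepts.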

So the next order of business is to extract the images from all the LSE PDFs and crop them. It's possible that the exact coordinates for the cropped regions may vary across the images, depending on the paper size (which changed throughout the Standard's run) and the positioning of the paper when the issues were originally photographed for microfiche. I am hoping, though, that large runs of issues will use the same cropping coordinates, allowing me to do most of the work automatically rather than manually.

I used the following bash script to extract all the DCT images from the LSE PDFs as JPEGs:

for year in {1904..1972}; do
  pdfimages -f 2 -j "LSE_SocialistStandard_$year.pdf" "$year"
done

I'll then have to examine the resulting JPEG files manually to determine the cropping region for the left- and right-hand pages. Once I have these, I can do batch greyscaling and cropping using the following script, where W and H are the pixel width and height of the region, respectively, and X and Y are its horizontal and vertical offsets from the upper left corner of the original image:

for f in *.jpg; do
  jpegtran -grayscale -crop W1xH1+X1+Y1 "$f" >"left-$f"
  jpegtran -grayscale -crop W2xH2+X2+Y2 "$f" >"right-$f"
done
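If the cropping region does turn out to differ between runs of years, one possible approach is to look it up from the year at the start of each file name (pdfimages names its output after the image root, so files extracted as above begin with the year and a hyphen). This is only a sketch: left_crop_for_year is a hypothetical function, and the year ranges and geometries in it are placeholders rather than measured values. A matching function would be needed for the right-hand pages.

```bash
# Hypothetical lookup table: year ranges and geometries are placeholders.
left_crop_for_year() {
  case $1 in
    190[4-9]|191[0-9]) echo "1600x2400+64+48" ;;
    *)                 echo "1504x2304+80+32" ;;
  esac
}

for f in [0-9][0-9][0-9][0-9]-*.jpg; do
  [ -e "$f" ] || continue          # skip if the pattern matched nothing
  year=${f%%-*}                    # text before the first hyphen
  jpegtran -grayscale -crop "$(left_crop_for_year "$year")" "$f" >"left-$f"
done
```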

Given the amount of data I have, the above scripts can take about an hour to complete. They also create a large amount of data, and I've found I'm running out of disk space. I've therefore ordered a Samsung Story Station 1.5 TB USB 2.0 external hard drive.
