20 March 2010

Autocrop

I have solved the problem of cropping the LSE images.

First, a quick recap: the microfiche scans of the Socialist Standard from the London School of Economics Library were provided as 6510 DCT (that is, JPEG-compressed) images embedded in 69 PDF files. The images are unsuitable for use as-is for three reasons. First, each image depicts a spread of two physical pages; unless one has a particularly enormous, high-resolution monitor, it's not possible to read the text without a lot of tiresome scrolling. Second, the images are uncropped photographs of bound volumes of the Standard; they include a very thick and uneven black margin all around the page spread, which besides being ugly also reduces the resolution of the text when the images are displayed in a viewer at full width or height. Third, LSE has unhelpfully tacked a rather garish ex libris banner at the bottom of each page. You can see a scaled-down copy of one of these DCT images below.

My task, then, is to crop the DCT images in such a way as to remove the black border and banner, and then to cut the image down the middle to isolate the two physical pages. I was afraid that, since the width of the DCT image and the position of the page spread therein varies from image to image, I would have to do the cropping manually. Assuming it takes two minutes to crop an image manually, it would have taken about 217 hours to do the entire microfiche collection.

Fortunately, I was able to devise an image processing algorithm, realized in the libjpeg-based C program below, which suggests the cropping region automatically. It examines successive rows from the top of the image and calculates their average brightness; once it discovers a row with a brightness above a certain threshold, it has found the upper crop line. It finds the bottom crop line similarly, but this time working upwards from just above the LSE banner. The left and right crop lines are handled similarly, except that the algorithm examines columns instead of rows, working inwards from the left and right edges. The cropping region is then passed to jpegtran for lossless cropping, as shown in the shell script which follows.

#include <stdio.h>
#include <stdlib.h>
#include <jpeglib.h>

#define THRESHOLD 15
#define MIN_X 65
#define MIN_Y 5
#define MAX_Y 2850

int main(int argc, char *argv[]) {

  struct jpeg_error_mgr jerr;
  struct jpeg_decompress_struct cinfo;
  FILE *infile;
  JSAMPARRAY buffer;
  int arg = 0;
  size_t row_stride;
  long x, y, top = 0 , bottom = 0, left = 0, right = 0;
  unsigned long v;

  /* Print usage information */
  if (argc <= 1) {
    fputs("Usage: autocrop file.jpg ...\n", stderr);
    return EXIT_FAILURE;
  }

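  /* For each filename on the command line */
  while (++arg < argc) {

    /* Reset the crop lines so a failed search cannot reuse stale values */
    top = bottom = left = right = 0;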

    /* Open the file */
    if ((infile = fopen(argv[arg], "rb")) == NULL) {
      fprintf(stderr, "can't open %s\n", argv[arg]);
      return EXIT_FAILURE;
    }

    /* Initialize JPEG decompression */
    cinfo.err = jpeg_std_error(&jerr);
    jpeg_create_decompress(&cinfo);
    jpeg_stdio_src(&cinfo, infile);
    (void) jpeg_read_header(&cinfo, TRUE);
    (void) jpeg_start_decompress(&cinfo);
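    row_stride = cinfo.output_width * cinfo.output_components;

    /* The scans below index rows up to MAX_Y, so the image must be taller */
    if (cinfo.output_height <= MAX_Y) {
      fprintf(stderr, "autocrop: %s is too short\n", argv[arg]);
      return EXIT_FAILURE;
    }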

    /* Slurp JPEG into memory */
    buffer = (*cinfo.mem->alloc_sarray)
      ((j_common_ptr) &cinfo, JPOOL_IMAGE, row_stride, cinfo.output_height); 
    if (buffer == NULL) {
      fprintf(stderr, "autocrop: out of memory\n");
      return EXIT_FAILURE;
    }
    while (cinfo.output_scanline < cinfo.output_height)
      jpeg_read_scanlines(&cinfo, &buffer[cinfo.output_scanline],
                          cinfo.output_height - cinfo.output_scanline);

    /* Find top crop */
    for (y = MIN_Y; y <= MAX_Y; y++) {
      v = 0;
      for (x = MIN_X * cinfo.output_components; x < row_stride; x++)
        v += buffer[y][x];
      if (v / row_stride > THRESHOLD) {
        top = y;
        break;
      }
    }

    /* Find bottom crop */
    for (y = MAX_Y; y >= MIN_Y; y--) {
      v = 0;
      for (x = MIN_X * cinfo.output_components; x < row_stride; x++)
        v += buffer[y][x];
      if (v / row_stride > THRESHOLD) {
        bottom = y;
        break;
      }
    }

    /* Find left crop */
    for (x = MIN_X * cinfo.output_components; x < row_stride; x++) {
      v = 0;
      for (y = MIN_Y; y <= MAX_Y; y++)
        v += buffer[y][x];
      if (v / row_stride > THRESHOLD) {
        left = x / cinfo.output_components;
        break;
      }
    }

    /* Find right crop */
    for (x = row_stride - 1; x >= MIN_X * cinfo.output_components; x--) {
      v = 0;
      for (y = MIN_Y; y <= MAX_Y; y++)
        v += buffer[y][x];
      if (v / row_stride > THRESHOLD) {
        right = x / cinfo.output_components;
        break;
      }
    }

    /* Print the crop width, height, and upper left coordinates */
    printf("%s\t%ld\t%ld\t%ld\t%ld\n", argv[arg], 
           right - left, bottom - top, left, top);

    /* Clean up */
    (void) jpeg_finish_decompress(&cinfo);
    jpeg_destroy_decompress(&cinfo);
    fclose(infile);
  }

  return EXIT_SUCCESS;
}

for p in */*.jpg; do
    w=-1
    pbase=$(basename $p .jpg)
    pdir=$(dirname $p)
    if [ ! -e ../LSE_JPEG_cropped/$pdir/${pbase}a.jpg ]; then
        read filename w h x y < <(echo $(../bin/autocrop $p))
        echo jpegtran -grayscale -crop $((w / 2))x$h+$x+$y $p
        jpegtran -grayscale -crop $((w / 2))x$h+$x+$y $p \
            > ../LSE_JPEG_cropped/$pdir/${pbase}a.jpg
    fi
    if [ ! -e ../LSE_JPEG_cropped/$pdir/${pbase}b.jpg ]; then
        if [ $w -eq -1 ];then
            read filename w h x y < <(echo $(../bin/autocrop $p))
        fi
        echo jpegtran -grayscale -crop $((w / 2))x$h+$((x + w / 2))+$y $p
        jpegtran -grayscale -crop $((w / 2))x$h+$((x + w / 2))+$y $p \
            > ../LSE_JPEG_cropped/$pdir/${pbase}b.jpg
    fi
done

This process is remarkably effective for these images. Below you can see how it properly cropped the page spread shown above into two separate pages.

Out of the 11 168 Socialist Standard pages, the autocrop algorithm cropped only 18 incorrectly, giving an error rate of just 0.161%. Of these 18 failures, 7 were due to the page being overly skewed, 10 were due to a particularly dark cover image, and 1 was due to noise in the bottom margin. Below are a couple of examples of improperly cropped images. As there are only 18 of them, I don't mind redoing these manually.

3 comments:

  1. Another useful C sourcecode from you, I'll compile for puppy linux community

    do you know *jpegcrops* anyway? it's a GUI program for visual (but batch) cropping several images with one touch

    http://ekot.dk/programmer/JPEGCrops/

  2. HU HO!

    Too soon spoken

    I compiled source code

    gcc jpegcrops.c -ljpeg -o jpegcrops

    but when I type:

    jpegcrops autocrop *.jpg

    it says to me:

    can't open autocrop

  3. Dingo, I wouldn't advise distributing that program as-is, since the margins and threshold are hard-coded. It would be best to add some command-line option processing which allows the user to set the top, bottom, left, and right margins, plus the brightness threshold. An even better change would be to specify the colour of the border to detect; right now it defaults to black. I might make these changes myself eventually, as they're not particularly difficult.

    By the way, the command is failing for you because the only arguments it takes are filenames. If there isn't a JPEG file named "autocrop" in the current directory, then "jpegcrops autocrop *.jpg" is going to fail.
