31 March 2010

Final results with unpaper

I've finished processing the LSE JPEGs with unpaper, at least for the time being. The vast majority of the images were successfully processed using the following command lines.

September 1904 to August 1918
unpaper --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 3500,2700 in.pgm out%d.pgm
September 1918 to August 1932
unpaper --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 3700,2600 in.pgm out%d.pgm
September 1932 to December 1950
unpaper --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 4000,2660 in.pgm out%d.pgm
January 1951 to December 1969
unpaper --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 3450,2700 in.pgm out%d.pgm

As you can see, the only difference in the command lines was the sheet size, which varied over the course of the Standard's print run. The --layout double and --output-pages 2 options simply specify that we are working with 2-up sheets that need to be split into two separate files, one for each page. The --pre-wipe option deletes the ex libris banner LSE inserted at the bottom of the image.
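
In case it's useful to anyone repeating this, a batch run over a whole period's worth of sheets is just a loop around the command line above. Here's a sketch using the 1904–1918 settings; the directory names and output pattern are made up:

# Sketch only: batch-process one period's sheets with the 1904-1918 settings.
# The "pgm" and "out" directories are hypothetical placeholders.
for f in pgm/*.pgm; do
    base=$(basename "$f" .pgm)
    unpaper --layout double --output-pages 2 \
        --pre-wipe 0,2898,4263,3102 --sheet-size 3500,2700 \
        "$f" "out/${base}-%d.pgm"
done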

Overall, 5445 of the 6078 sheets (90%) were processed more or less correctly with these default settings. In some of these cases unpaper failed to properly deskew the page, and sometimes a small portion of the page header was cut off, but no significant amount of article text or graphics is missing. A further 491 images (8%) required some additional options to correct the following processing errors:

folded columns
Unpaper's grey filter seemed to get confused with certain multicolumn layouts, especially if the column layout wasn't identical on both pages of a sheet. It would end up deleting a column down the side or middle and then squeezing the two edges together, as if it had folded the page onto itself. About 5% of sheets were so affected. The problem was solved by using the --no-grayfilter option.

[before and after images: folded column; fixed with --no-grayfilter]

missing text
About 4% of the sheets, mostly from 1921 and the 1960s, had blocks of text missing due to unpaper's grey filter misidentifying an area of a particularly dark scan. The issue was fixed by using --black-threshold to adjust the luminance value under which unpaper considers a pixel to be black.

[before and after images: missing text; fixed with --black-threshold]

In a further five cases, the missing text was due to a hair or other anomalous dark line leading to the black border area; unpaper then considered the text block to be part of the border and deleted it. These cases were solved by adjusting --blackfilter-intensity.

[before and after images: missing text; fixed with --blackfilter-intensity]

misaligned pages
Sometimes unpaper would split a sheet off centre, so that the rightmost edge of the left-hand page spilled over into the leftmost edge of the right-hand page. Using --pre-shift -100,0 solved this problem, which affected nearly 3% of the images.

[before and after images: misaligned page; fixed with --pre-shift]
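
For illustration, a corrective re-run just adds the relevant option(s) to the usual command line; the following made-up example combines two of the fixes on a 1951–1969 sheet (in practice the options and values varied per sheet):

unpaper --layout double --output-pages 2 --no-grayfilter --pre-shift -100,0 --pre-wipe 0,2898,4263,3102 --sheet-size 3450,2700 in.pgm out%d.pgm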

There were 142 images (just over 2%) which could not be easily fixed at all. Nearly all of these failures were due to unpaper's black filter erasing images with large dark patches, or images or headlines which lay too close to the edge of the page. I was unable to find any combination of options which would preserve the desired text and images while still erasing the black border around the page. I may write to the author of unpaper to see if he has any suggestions; if this proves fruitless then I will have to take a different approach to these images. Possibly I could use my own autocrop tool on them to eliminate the black border, and then pass the result to unpaper for deskewing, grey filtering, and noise filtering.
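
If it comes to that, the pipeline would look roughly like the sketch below (file names and sheet size are placeholders): autocrop suggests the crop region, jpegtran applies it losslessly, djpeg converts the result to the PGM that unpaper requires, and unpaper is then run with its black filter disabled. Since autocrop already removes the LSE banner along with the black border, no --pre-wipe should be needed.

# Rough sketch of a fallback pipeline for the stubborn sheets
# (hypothetical file names; choose --sheet-size per period as above).
read f w h x y < <(../bin/autocrop sheet.jpg)      # suggested crop region
jpegtran -grayscale -crop ${w}x${h}+${x}+${y} sheet.jpg > cropped.jpg
djpeg -grayscale -pnm cropped.jpg > cropped.pgm    # unpaper only reads PNM
unpaper --layout double --output-pages 2 --no-blackfilter \
    --sheet-size 3450,2700 cropped.pgm out%d.pgm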

The following stacked bar chart shows the number of successfully and unsuccessfully processed JPEGs for each volume of issues from 1904 to 1969.

[a stacked bar chart showing the proportion of images successfully and unsuccessfully processed by unpaper]

As can be seen, the number of failures increases significantly in the 1960s—this is due to the increased use of photographs, particularly on the cover pages. The 1970s issues used so many photographs that there were more failures than I cared to correct. Since I scanned those images from paper myself, I will use my own much better scans instead of trying to unpaper the microfiche photos.

28 March 2010

First results with unpaper

The last few days have been spent figuring out how to get unpaper to work. Unlike my autocrop tool, it needs the sheet size to be specified in advance, which makes it a bit trickier to use with the LSE scans (see my earlier post "Page and image size analysis"). It also handles only uncompressed PNM files, which for some strange reason the author thinks of as a feature rather than a shortcoming. So now my corpus has ballooned by another 74 GB. Good thing I bought that 1.5 TB drive.
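
The conversion itself is nothing more than a loop around djpeg; a sketch along these lines (directory names hypothetical) is all that's needed, and it's where the extra 74 GB comes from:

# Convert the cropped greyscale JPEGs to uncompressed PGM for unpaper
# (hypothetical directory names).
for f in cropped/*.jpg; do
    djpeg -grayscale -pnm "$f" > "pgm/$(basename "$f" .jpg).pgm"
done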

Anyway, I've run unpaper on the September 1904 through August 1918 issues (whose pages are all 242 mm × 460 mm). Of the 836 uncropped JPEGs for these issues, unpaper seems to have processed 779 of them (93%) correctly with the following command line:

unpaper --overwrite --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 3500,2700 in.pgm out%d.pgm

Of the remaining 57 JPEGs, all but 6 were correctly processed with some extra options to modify the behaviour of the black or grey filters. The remaining 6 images I will have to deskew and crop by hand, or just use my autocrop tool.

Finding out the correct unpaper options for the 57 anomalous JPEGs was somewhat tedious. I would run unpaper with various command-line options on the files, wait several seconds for it to process, launch an image viewer on the output files, and then, if the output was not acceptable, quit the image viewer and start again with a different set of command-line options. It would have been much easier if I could have just kept the image viewer open and set it to automatically refresh the images whenever they changed on disk. Unfortunately, neither of the viewers I tried (Gwenview and Kuickshow) has a "watch file" option. Gwenview does have a manual "refresh" command, but it does not refresh thumbnails. I've therefore created and/or voted for "watch file" and "refresh" feature requests on the KDE bug tracker.

In the meantime, does anyone know of a fast, lightweight image viewer for X11 which has a "watch file" feature? It should be able to view PNM and PNG files.

23 March 2010

PDFs: JPEG vs PNG vs JBIG2

The goal of today's exercise is to see how to make the smallest possible PDF from the scans I have. To this end, I experimented with the September 1954 issue, which is a good candidate because it is fairly long and contains both text and images. The following table summarizes the four approaches I tried and their results.

process | PDF creation command | PDF size (KB)
join the JPEGs into a PDF with ImageMagick | convert *.jpg JPEG.pdf | 43 777
convert the JPEGs to bilevel PNGs, then join them into a PDF with ImageMagick | convert *.png PNG.pdf | 6 907
convert the JPEGs to bilevel JBIG2s, then join them into a PDF with jbig2enc | jbig2 -b J -d -p -s *.jpg; pdf.py J > JBIG2.pdf | 947
upscale and convert the JPEGs to bilevel JBIG2s, then join them into a PDF with jbig2enc | jbig2 -b J -d -p -s -2 *.jpg; pdf.py J > 2xJBIG2.pdf | 1 451

So the clear winner here is JBIG2. The 2× upscaled version is actually much easier to read than the unscaled JBIG2 or PNG images, which are sometimes too faint. If I were to use the 2× upscaled JBIG2 method to produce PDFs for all the 1904–1972 issues, the total would be about 450 MB in size, which would easily fit on a single CD-ROM.

However, I know that much better compression ratios can be achieved using DjVu—pages can typically be reduced to just a few kilobytes each. The problem is that creating DjVu documents is a bit more involved. I tried using pdf2djvu, but the DjVu files it created were even larger than the PDFs; clearly what I really need to do is to use the individual DjVuLibre tools to properly segment and compress the original cropped JPEGs. Fortunately there appear to be some guidance and scripts on Wikisource. The Wikisource guide also pointed me towards unpaper, which apparently does a better job of autocropping scans than my own tool, and also deskews the pages. So the next few days will probably be spent investigating these resources.
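
From what I've read so far, the per-page workflow with the DjVuLibre tools would be roughly: binarize each page, encode it with cjb2, and bundle the pages with djvm. A minimal sketch, with made-up file names and a crude ImageMagick threshold standing in for proper foreground/background segmentation:

# Very rough DjVu sketch (file names made up): binarize each page, encode
# it with cjb2, then bundle the pages into one document with djvm.
for f in issue/*.jpg; do
    convert "$f" -threshold 60% "${f%.jpg}.pbm"
    cjb2 -dpi 200 -lossy "${f%.jpg}.pbm" "${f%.jpg}.djvu"
done
djvm -c issue.djvu issue/*.djvu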

jbig2enc

Today, on the recommendation of one of the readers of this blog, I decided to install jbig2enc to see how it might be useful for my digitization project. Unfortunately, it didn't seem to compile out of the box:

[psy@sable:~/src/agl-jbig2enc-edebc5a]$ make
g++ -c jbig2enc.cc -I../leptonlib-1.64/src -Wall -I/usr/include -L/usr/lib -O3 
g++ -c jbig2arith.cc -I../leptonlib-1.64/src -Wall -I/usr/include -L/usr/lib -O3 
g++ -c jbig2sym.cc -DUSE_EXT -I../leptonlib-1.64/src -Wall -I/usr/include -L/usr/lib -O3 
ar -rcv libjbig2enc.a jbig2enc.o jbig2arith.o jbig2sym.o
a - jbig2enc.o
a - jbig2arith.o
a - jbig2sym.o
g++ -o jbig2 jbig2.cc -L. -ljbig2enc ../leptonlib-1.64/src/liblept.a -I../leptonlib-1.64/src -Wall -I/usr/include -L/usr/lib -O3  -lpng -ljpeg -ltiff -lm
../leptonlib-1.64/src/liblept.a(gifio.o): In function `pixWriteStreamGif':
gifio.c:(.text+0x15c): undefined reference to `MakeMapObject'
gifio.c:(.text+0x21d): undefined reference to `EGifOpenFileHandle'
gifio.c:(.text+0x258): undefined reference to `EGifPutScreenDesc'
gifio.c:(.text+0x26f): undefined reference to `FreeMapObject'
gifio.c:(.text+0x277): undefined reference to `EGifCloseFile'
gifio.c:(.text+0x2b0): undefined reference to `FreeMapObject'
gifio.c:(.text+0x2df): undefined reference to `FreeMapObject'
gifio.c:(.text+0x2ff): undefined reference to `EGifPutImageDesc'
gifio.c:(.text+0x31a): undefined reference to `EGifCloseFile'
gifio.c:(.text+0x502): undefined reference to `EGifPutLine'
gifio.c:(.text+0x537): undefined reference to `EGifPutComment'
gifio.c:(.text+0x569): undefined reference to `EGifCloseFile'
gifio.c:(.text+0x582): undefined reference to `FreeMapObject'
gifio.c:(.text+0x5e0): undefined reference to `EGifCloseFile'
gifio.c:(.text+0x653): undefined reference to `EGifCloseFile'
gifio.c:(.text+0x682): undefined reference to `EGifCloseFile'
../leptonlib-1.64/src/liblept.a(gifio.o): In function `pixReadStreamGif':
gifio.c:(.text+0x6ce): undefined reference to `DGifOpenFileHandle'
gifio.c:(.text+0x6e6): undefined reference to `DGifSlurp'
gifio.c:(.text+0x878): undefined reference to `DGifCloseFile'
gifio.c:(.text+0x884): undefined reference to `DGifCloseFile'
gifio.c:(.text+0x9bc): undefined reference to `DGifCloseFile'
gifio.c:(.text+0xa2c): undefined reference to `DGifCloseFile'
gifio.c:(.text+0xa55): undefined reference to `DGifCloseFile'
../leptonlib-1.64/src/liblept.a(gifio.o):gifio.c:(.text+0xa86): more undefined references to `DGifCloseFile' follow
collect2: ld returned 1 exit status
make: *** [jbig2] Error 1
[psy@sable:~/src/agl-jbig2enc-edebc5a]$ 

It seems there were a few things wrong with the Makefile. The showstopper was that Leptonica, the library upon which jbig2enc depends, is expecting to link to giflib, but the Makefile doesn't specify this library. This was solved by adding -lgif to the command which compiles jbig2. The other problems were not fatal but somewhat irritating:

  • it is assumed that Leptonica isn't installed in a standard location;
  • there are no install and uninstall targets for (un)installing the package;
  • the program is written in C++, but the compiler is invoked with a redefined $(CC) rather than the standard $(CXX);
  • ar is invoked directly rather than through the standard $(AR); and
  • the clean target uses wildcards somewhat dangerously.

So here's an updated Makefile for jbig2enc 0.27. It should work with little or no modification on most *nix systems. On 64-bit systems which use lib64 directories, the libdir variable should be changed appropriately, or else overridden on the command line.

# Improved Makefile for jbig2enc by Tristan Miller, 2010-03-22

prefix=/usr/local
exec_prefix=$(prefix)
bindir=$(exec_prefix)/bin
libdir=$(exec_prefix)/lib
CFLAGS=-I/usr/local/include/liblept -I/usr/include/liblept -Wall -O3 ${EXTRA}

jbig2: libjbig2enc.a jbig2.cc
 $(CXX) -o jbig2 jbig2.cc -L. -ljbig2enc $(CFLAGS) -lpng -ljpeg -ltiff -lm -llept -lgif

libjbig2enc.a: jbig2enc.o jbig2arith.o jbig2sym.o
 $(AR) -rcv libjbig2enc.a jbig2enc.o jbig2arith.o jbig2sym.o

jbig2enc.o: jbig2enc.cc jbig2arith.h jbig2sym.h jbig2structs.h jbig2segments.h
 $(CXX) -c jbig2enc.cc $(CFLAGS)

jbig2arith.o: jbig2arith.cc jbig2arith.h
 $(CXX) -c jbig2arith.cc $(CFLAGS)

jbig2sym.o: jbig2sym.cc jbig2arith.h
 $(CXX) -c jbig2sym.cc -DUSE_EXT $(CFLAGS)

clean:
 rm -f jbig2enc.o jbig2arith.o jbig2sym.o jbig2 libjbig2enc.a

install:
 install -s jbig2 $(bindir)
 install pdf.py $(bindir)
 install -s libjbig2enc.a $(libdir)

uninstall:
 rm $(bindir)/jbig2
 rm $(bindir)/pdf.py
 rm $(libdir)/libjbig2enc.a

21 March 2010

Manual cropping

This afternoon I used GIMP to find cropping coordinates for the 18 pages my autocrop program didn't successfully process. Having passed these to jpegtran, I'm now in possession of 13 020 properly cropped JPEG images, of which 11 164 are unique pages of the Socialist Standard (and the 1967 supplement) and the remaining 1856 are blank pages, microfiche title slides, indices, or duplicates.

Cropping the images and discarding the irrelevant pages has brought the size of the corpus down from 17.58 GB to 15.13 GB, a savings of 13.94%. Of course, if LSE had properly scanned them as high-resolution bilevel images rather than JPEGs in the first place, the size would have been about a third of this. I am wondering if there is some way to convert the JPEGs to bilevel images, but given the relatively poor quality of the photographs and the low resolution of the scans, this may not be possible. I'll have a go at batch-converting them with ImageMagick and examine the results, but I am not optimistic that they will be acceptable.
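
The trial conversion I have in mind is something along these lines (a sketch only; the threshold value is a guess that will need tuning per scan):

# Trial bilevel conversion with ImageMagick; 55% is an arbitrary starting
# threshold and will almost certainly need adjusting.
for f in *.jpg; do
    convert "$f" -colorspace Gray -threshold 55% "${f%.jpg}.png"
done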

At any rate, the next step will be to assemble the individual pages into PDFs or DjVus, one issue per file. I shall have to look around to see what software is available for this. The only one I'm aware of is the pdfpages package for pdfTeX, though I'm sure there are others more suitable for my task.

20 March 2010

Autocrop

I have solved the problem of cropping the LSE images.

First, a quick recap: The microfiche scans of the Socialist Standard from the London School of Economics Library were provided as 6510 DCT images embedded into 69 PDF files. The images are unsuitable for use as-is for several reasons. First, each image depicts a spread of two physical pages—unless one has a particularly enormous, high-resolution monitor, it's not possible to read the text without doing a lot of tiresome scrolling. Second, the images are uncropped photographs of bound volumes of the Standard; they include a very thick and uneven black margin all around the page spread, which besides being ugly also reduces the resolution of the text when the images are displayed in a viewer at full width or height. Third, LSE has unhelpfully tacked a rather garish ex libris banner at the bottom of each page. You can see a scaled-down copy of one of these DCT images below.

My task, then, is to crop the DCT images in such a way as to remove the black border and banner, and then to cut the image down the middle to isolate the two physical pages. I was afraid that, since the width of the DCT image and the position of the page spread therein varies from image to image, I would have to do the cropping manually. Assuming it takes two minutes to crop an image manually, it would have taken about 217 hours to do the entire microfiche collection.

Fortunately, I was able to devise an image processing algorithm, realized in the libjpeg-based C program below, which suggests the cropping region automatically. It examines successive rows from the top of the image and calculates their average brightness; once it discovers a row with a brightness above a certain threshold, it has found the upper crop line. It finds the bottom crop line similarly, but this time working upwards from just before the LSE banner. The left and right crop lines are handled similarly, except that the algorithm examines columns instead of rows, working inwards from the left and right edges. The cropping region is then passed to jpegtran for lossless cropping, as shown in the shell script which follows.

#include <stdio.h>
#include <stdlib.h>
#include <jpeglib.h>

#define THRESHOLD 15
#define MIN_X 65
#define MIN_Y 5
#define MAX_Y 2850

int main(int argc, char *argv[]) {

  struct jpeg_error_mgr jerr;
  struct jpeg_decompress_struct cinfo;
  FILE *infile;
  JSAMPARRAY buffer;
  int arg = 0;
  size_t row_stride;
  long x, y, top = 0 , bottom = 0, left = 0, right = 0;
  unsigned long v;

  /* Print usage information */
  if (argc <= 1) {
    fputs("Usage: autocrop file.jpg ...\n", stderr);
    return EXIT_FAILURE;
  }

  /* For each filename on the command line */
  while (++arg < argc) {

    /* Open the file */
    if ((infile = fopen(argv[arg], "rb")) == NULL) {
      fprintf(stderr, "can't open %s\n", argv[arg]);
      return EXIT_FAILURE;
    }

    /* Initialize JPEG decompression */
    cinfo.err = jpeg_std_error(&jerr);
    jpeg_create_decompress(&cinfo);
    jpeg_stdio_src(&cinfo, infile);
    (void) jpeg_read_header(&cinfo, TRUE);
    (void) jpeg_start_decompress(&cinfo);
    row_stride = cinfo.output_width * cinfo.output_components;

    /* Slurp JPEG into memory */
    buffer = (*cinfo.mem->alloc_sarray)
      ((j_common_ptr) &cinfo, JPOOL_IMAGE, row_stride, cinfo.output_height); 
    if (buffer == NULL) {
      fprintf(stderr, "autocrop: out of memory\n");
      return EXIT_FAILURE;
    }
    while (cinfo.output_scanline < cinfo.output_height)
      jpeg_read_scanlines(&cinfo, &buffer[cinfo.output_scanline], 
                          cinfo.output_height);

    /* Find top crop */
    for (y = MIN_Y; y <= MAX_Y; y++) {
      v = 0;
      for (x = MIN_X * cinfo.output_components; x < row_stride; x++)
        v += buffer[y][x];
      if (v / row_stride > THRESHOLD) {
        top = y;
        break;
      }
    }

    /* Find bottom crop */
    for (y = MAX_Y; y >= MIN_Y; y--) {
      v = 0;
      for (x = MIN_X * cinfo.output_components; x < row_stride; x++)
        v += buffer[y][x];
      if (v / row_stride > THRESHOLD) {
        bottom = y;
        break;
      }
    }

    /* Find left crop */
    for (x = MIN_X * cinfo.output_components; x < row_stride; x++) {
      v = 0;
      for (y = MIN_Y; y <= MAX_Y; y++)
        v += buffer[y][x];
      if (v / row_stride > THRESHOLD) {
        left = x / cinfo.output_components;
        break;
      }
    }

    /* Find right crop */
    for (x = row_stride - 1; x >= MIN_X * cinfo.output_components; x--) {
      v = 0;
      for (y = MIN_Y; y <= MAX_Y; y++)
        v += buffer[y][x];
      if (v / row_stride > THRESHOLD) {
        right = x / cinfo.output_components;
        break;
      }
    }

    /* Print the crop width, height, and upper left coordinates */
    printf("%s\t%ld\t%ld\t%ld\t%ld\n", argv[arg], 
           right - left, bottom - top, left, top);

    /* Clean up */
    (void) jpeg_finish_decompress(&cinfo);
    jpeg_destroy_decompress(&cinfo);
    fclose(infile);
  }

  return EXIT_SUCCESS;
}

for p in */*.jpg; do
    w=-1
    pbase=$(basename $p .jpg)
    pdir=$(dirname $p)
    if [ ! -e ../LSE_JPEG_cropped/$pdir/${pbase}a.jpg ]; then
        read filename w h x y < <(echo $(../bin/autocrop $p))
        echo jpegtran -grayscale -crop $((w / 2))x$h+$x+$y $p
        jpegtran -grayscale -crop $((w / 2))x$h+$x+$y $p \
            > ../LSE_JPEG_cropped/$pdir/${pbase}a.jpg
    fi
    if [ ! -e ../LSE_JPEG_cropped/$pdir/${pbase}b.jpg ]; then
        if [ $w -eq -1 ];then
            read filename w h x y < <(echo $(../bin/autocrop $p))
        fi
        echo jpegtran -grayscale -crop $((w / 2))x$h+$((x + w / 2))+$y $p
        jpegtran -grayscale -crop $((w / 2))x$h+$((x + w / 2))+$y $p \
            > ../LSE_JPEG_cropped/$pdir/${pbase}b.jpg
    fi
done

This process is remarkably effective for these images. Below you can see how it properly cropped the page spread shown above into two separate pages.

Out of the 11 168 Socialist Standard pages, the autocrop algorithm cropped only 18 of them incorrectly, giving an error rate of just 0.161%. Of these 18 failures, 7 were due to the page being overly skewed, 10 were due to a particularly dark cover image, and 1 was due to noise in the bottom margin. Below are a couple examples of improperly cropped images. As there are only 18 of them, I don't mind redoing these manually.

19 March 2010

Page inventory

I've completed an inventory of the pages in the 69 LSE PDFs. That is, for each page in the PDF (or more specifically, for each of the two logical pages on every physical page), I noted whether it was a microfiche title slide, a regular Socialist Standard page, a blank page, or something else. Creating such an inventory was necessary so that I can later split up the pages by issue; I need to know where each issue starts and ends within the PDF.

Below is a chart showing the 13 020 logical pages as they appear in sequence in each PDF. The pages have been colour-coded as follows: white, a blank page; grey, a microfiche title slide; green, a regular Socialist Standard page; red, a "special supplement" that appears to have been distributed with the August or September 1967 issue; and blue, the official Socialist Standard index which was included in some bound volumes.

As can be seen, the length of the Standard varies a great deal, sometimes even within a single month (often due to special anniversary issues, but sometimes for no apparent reason). Indices were included only from 1940–1952 and 1969–1971. Every PDF except for 1915 starts with a title page, and further title pages appear in the middle of the 1922, 1927, and 1933 volumes. The January issues for 1968, 1969, and 1971 are missing their covers; for the first two I will have to contact the Party to get a physical copy to scan myself. Finally, the December 1966 issue appears twice.

17 March 2010

Page and image size analysis

I have just discovered two anomalies regarding the page and image sizes; it remains to be seen how they will affect the cropping task.

The Socialist Standard has used at least five different page sizes throughout its print run. However, as shown in the table below, the page dimension ratios don't seem to correspond with those of the scanned versions in the LSE PDFs.

period | physical page (mm) | ratio | scanned page (px) | ratio | apparent DPI (horiz., vert.)
1904-09 – 1918-08 | 242 × 460 | 0.526 | 1715 × 2670 | 0.642 | 180, 147
1918-09 – 1932-08 | 180 × 239 | 0.753 | 1815 × 2555 | 0.710 | 256, 272
1932-09 – 1950-12 | 208 × 276 | 0.754 | 1965 × 2625 | 0.749 | 240, 242
1951-01 – 1969-12 | 212 × 270 | 0.785 | 1715 × 2625 | 0.653 | 205, 247
1970-01 – 1972-12 | 210 × 297 | 0.707 | 1725 × 2825 | 0.611 | 209, 242
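
(The apparent DPI columns are simply the scanned pixel dimensions divided by the physical dimensions in inches; for the first period, for example, 1715 px ÷ (242 mm ÷ 25.4 mm/in) ≈ 180 dpi horizontally and 2670 px ÷ (460 mm ÷ 25.4 mm/in) ≈ 147 dpi vertically.)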

I'm at a loss as to what might account for the discrepancy, especially since the scanned aspect ratio is sometimes greater and sometimes smaller than the original. Keeping in mind that the microfiche was produced from photographs of bound collections of the Socialist Standard, here are some possible causes:

  • The Standard's sheets may have been cropped to a different size for binding.
  • The Standard was reprinted on paper of a different size for binding.
  • Portions of the physical sheets were obscured during photography (for instance, to hold the book open and in place for the camera), resulting in a cropped photo.
  • The paper is much wider than it appears in the two-dimensional photographs due to the binding gutter.
  • The horizontal and vertical DPI settings used for scanning the microfiche were not equal.

Not only are the page ratios different, but the dimensions of the entire scans (including the margins around the book and the LSE banner at the bottom) vary inexplicably. The height is always 3102 pixels but, as can be seen in the graph below, the width varies from 3405 to 4263 pixels. There is no obvious reason for this.

To automatically extract the image dimensions, I wrote the following C program using libjpeg:

#include <stdio.h>
#include <stdlib.h>
#include <jpeglib.h>

int main(int argc, char *argv[]) {

  struct jpeg_error_mgr jerr;
  struct jpeg_decompress_struct cinfo;
  FILE *infile;
  int arg = 0, status = EXIT_SUCCESS;

  /* Print usage information */
  if (argc <= 1) {
    fputs("Usage: jpegdims file.jpg ...\n", stderr);
    return EXIT_FAILURE;
  }

  /* For each filename on the command line */
  while (++arg < argc) {

    /* Open the file */
    if ((infile = fopen(argv[arg], "rb")) == NULL) {
      fprintf(stderr, "jpegdims: can't open %s\n", argv[arg]);
      status = EXIT_FAILURE;
      continue;
    }

    /* Initialize JPEG decompression */
    cinfo.err = jpeg_std_error(&jerr);
    jpeg_create_decompress(&cinfo);
    jpeg_stdio_src(&cinfo, infile);
    (void) jpeg_read_header(&cinfo, TRUE);
    (void) jpeg_start_decompress(&cinfo);

    printf("%7lu\t%7u\t%s\n", cinfo.output_width, 
                              cinfo.output_height, argv[arg]);

    /* Clean up */
    jpeg_destroy_decompress(&cinfo);
    fclose(infile);
  }

  return status;
}
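
The tab-separated output is easy to feed into the usual shell tools; for example, to tally the distinct widths (the source file name here is hypothetical):

cc -O2 -o jpegdims jpegdims.c -ljpeg    # hypothetical file name for the above
./jpegdims */*.jpg | cut -f1 | sort -n | uniq -c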

New hard drive

My new 1.5 TB external hard drive arrived today! Considering I placed the order with standard 3–5 day shipping, I'm rather impressed that it was dispatched and delivered in less than two days. I'm currently in the process of moving my files over; it will be nice to be able to work now without constantly scrounging for free space.

15 March 2010

LSE PDF cropping

Using the pdfimages tool from Poppler, I've confirmed that all the LSE PDFs consist of full-page RGB DCT images. There are a couple problems with this:

  1. This format is lossy and thus the quality of the images could suffer if I transform them.
  2. The files are needlessly large, since they are stored in full colour even though the scans are only in shades of grey.

Now, it is possible to extract the images as JPEGs and crop them losslessly using jpegtran, but only when the upper left corner of the cropped region falls on an iMCU boundary. Fortunately, the images are scanned at a high enough resolution that increasing the dimensions of the region by a few extra pixels to ensure boundary alignment shouldn't be a problem. It's also possible to losslessly convert RGB JPEGs to greyscale, though the reduction in file size is negligible (for these images about 3.4%).
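
(To put numbers on that: the iMCU is 8 × 8 pixels for a greyscale JPEG and 16 × 16 for a typical 4:2:0 colour one, so aligning the corner costs at most 15 extra pixels of margin at the top and left; at roughly 200 dpi that is about 2 mm.)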

So the next order of business is to extract the images from all the LSE PDFs and crop them. It's possible that the exact coordinates for the cropped regions may vary across the images, depending on the paper size (which changed throughout the Standard's run) and the positioning of the paper when the issues were originally photographed for microfiche. I am hoping, though, that large runs of issues will use the same cropping coordinates, allowing me to do most of the work automatically rather than manually.

I used the following bash script to extract all the DCT images from the LSE PDFs as JPEGs:

for year in {1904..1972}
do
  pdfimages -f 2 -j LSE_SocialistStandard_$year.pdf $year
done

I'll then have to examine the resulting JPEG files manually to determine the cropping region for the left- and right-hand pages. Once I have these, I can do batch greyscaling and cropping using the following script, where W and H are the pixel width and height of the region, respectively, and X+Y is the offset from the upper left corner of the original image:

for f in *
do
  jpegtran -grayscale -crop W1xH1+X1+Y1 $f >left-$f
  jpegtran -grayscale -crop W2xH2+X2+Y2 $f >right-$f
done

Given the amount of data I have, the above scripts can take about an hour to complete. They also create a large amount of data, and I've found I'm running out of disk space. I've therefore ordered a Samsung Story Station 1.5 TB USB 2.0 external hard drive.

PDF software inventory

Here's an inventory of the PDF manipulation software I have at my disposal for the project. All of these packages are Free Software in the sense that the user has the freedom to run, study, modify, and redistribute the program.
Okular
PDF viewer (GUI)
GNU gv
PDF viewer (GUI)
pdftk
merges and splits PDFs, rotates pages, edits metadata (command-line; GUI via pdftk-qgui)
PDFjam
merges, rotates, and n-ups PDFs (command-line)
PdfMod
adds, reorders, rotates, and removes pages; exports images; edits metadata (GUI)
poppler-tools
lists fonts; extracts images; prints metadata; converts PDF to HTML, PPM, PostScript, or text (command-line)
QPDF
linearizes (web-optimizes) PDFs, various other low-level transformations (command-line)
pspdftool
rearranges, deletes, scales, flips, numbers, crops, rotates, n-ups pages (command-line)
pstoedit
scales, shifts, splits, rotates, and resizes pages; converts PDF to other vector formats (command-line)
Ghostscript
PDF language interpreter; can do pretty much everything, though not necessarily easily (command-line)

PDF viewing woes

I've been running into problems with the programs I use to view the LSE PDFs. For one thing, these PDFs are an average of 261 MB in size, whereas my file manager, Dolphin, won't generate previews for files larger than 100 MB. This is a rather annoying and arbitrary upper limit, especially considering that it takes only a few milliseconds to generate thumbnails for 100 MB PDFs. I've accordingly filed a bug report asking that the limit be removed.

The other problem is that my usual PDF viewer, Okular, is phenomenally slow at rendering the pages of the LSE PDFs—it takes between 10 and 25 seconds per page. (By comparison, the proprietary Adobe Reader renders them almost instantly.) Okular, like many other Free Software document viewers, renders PDFs using the FreeDesktop project's Poppler library, and it is there that the problem lies. Most likely this is due to a known issue, Bug 13518. For now, then, I will be using GNU gv, which isn't based on Poppler and is able to render the LSE PDFs quickly.

The state of things

My project to digitize the Standard began in 2005. On 25 and 26 December of that year, Norbert Sanden and I, who possessed all issues dating back to February 1970, manually scanned them on a pair of high-end Ricoh office MFPs in Kaiserslautern. To save scanning time, and to avoid destroying the originals, we scanned two-page spreads into multi-page black-and-white TIFFs, usually one issue per file. This means that in these scans, all the pages are in order, except for the front and back pages. Additionally, from about April 1986 onwards, the cover sheet (including the front and back cover) was printed in spot colour (black plus one or two coloured inks), so the covers for these issues were scanned separately as colour JPEGs. In total I have 2.3 GB in 1-bit 600 dpi TIFF files, and a further 2.6 GB in 300 dpi JPEGs.

I moved to London in January 2006 and, with access to the Party's archives, intended to continue manually scanning issues back to 1904. The Party informed me that the 1904 to 1972 issues of the Standard had been microfiched by a third party, and that it might be easier to arrange for scanning directly from the microfiche. I set about calling various libraries to see if I could obtain a copy of the microfiche, and various document archiving companies to inquire about microfiche scanning costs. The first library I called, that of the London School of Economics, informed me that they were actually in the process of scanning much of their microfilm and microfiche collections anyway, and offered to include the Standard in this project and provide the Party with a copy for a small fee (about £20). I agreed to this as this was far cheaper than arranging for the scanning through a company, and far less effort than manually scanning the printed issues. However, the scanning project at LSE proceeded very slowly, and the scans weren't made available to me until several years later—namely, a few weeks ago.

I just got around to looking at the LSE scans yesterday. They're on 6 DVD-ROMs, and comprise 18 GB in 69 PDF files, one for each year from 1904 to 1972. The PDFs consist of two-page spreads scanned at 200 dpi greyscale; unlike the images I scanned, the front and back covers are in the correct order. The scans have not been OCR'd, and I haven't yet determined how the PDFs encode the image data; possibly it is JPEG. Each file includes a title page identifying the year of the archive, and some also include the index published in bound volumes of the Standard. All pages also include a banner at the bottom with the text, "London School of Economics & Political Science 2007 / Socialist Standard xxxx", where xxxx is the year.

Issues from 1998 onwards have been typeset digitally and are available as PDFs. I should be able to obtain these directly from the Party's Socialist Standard production team. The Party also has a basic electronic index for the Standard (probably including only title and author data, but possibly also subjects) which I hope to obtain later. The index wasn't professionally produced, so I doubt it will be of much use when it comes to looking for specific subjects. Since making a proper subject index would be a tremendous undertaking, I hope that OCR plus full text search will serve as a reasonable substitute for the time being.

The next step will be to determine how best to crop the LSE scans such that there is a single physical page per image and no LSE footer. This will be the subject of an upcoming post.

14 March 2010

Goals

The goal of this project is to produce a digital archive of the Socialist Standard which includes all issues from 1904 to the present, and which will be made available on DVD-ROM and online. The archive will include a user-friendly interface for finding and viewing the issues, and should be accessible with any modern computer.

In practice, this means that the issues will be provided in an accessible document archive format, such as PDF or DjVu. Older issues will need to be scanned and OCR'd. The interface will most likely take the form of a set of HTML pages viewed with a web browser, as this is the easiest way to ensure cross-platform compatibility. The index will list the issues by date and cover image. Ideally there will also be an index of article titles, authors, and subjects, and also a searchable full-text index.

Some of the goals, such as full-text searching, may not be achievable in the short term, so I plan to create various editions of the archive. The first edition may include just the non-OCR'd issues and a simple index; later editions can add other features as work on them is completed.

An introduction

The Socialist Party of Great Britain has been publishing its journal, the Socialist Standard, without interruption since 1904. This blog documents my project to produce a complete digital archive of the Standard. By doing so I hope to help myself and others track my progress, and to provide insight for fellow would-be amateur archivists into the process and challenges of digitizing a large newspaper archive.