31 March 2010

Final results with unpaper

I've finished processing the LSE JPEGs with unpaper, at least for the time being. The vast majority of the images were successfully processed using the following command lines.

September 1904 to August 1918
unpaper --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 3500,2700 in.pgm out%d.pgm
September 1918 to August 1932
unpaper --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 3700,2600 in.pgm out%d.pgm
September 1932 to December 1950
unpaper --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 4000,2660 in.pgm out%d.pgm
January 1951 to December 1969
unpaper --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 3450,2700 in.pgm out%d.pgm

As you can see, the only difference in the command lines was the sheet size, which varied over the course of the Standard's print run. The --layout double and --output-pages 2 options simply specify that we are working with 2-up sheets that need to be split into two separate files, one for each page. The --pre-wipe option deletes the ex libris banner LSE inserted at the bottom of the image.
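
Since only the sheet size changes between the four command lines, the choice could be scripted. What follows is a minimal sketch under stated assumptions, not the workflow actually used here: the function names `sheet_size_for` and `unpaper_cmd` and the YYYY-MM date format are invented for illustration, and the command is printed rather than run.

```shell
# Print the --sheet-size value for an issue month given as YYYY-MM.
# Ranges and sizes are the four listed above; the function names and the
# YYYY-MM input format are assumptions made for this sketch.
sheet_size_for() {
    ym="${1%-*}${1#*-}"                   # "1904-09" -> "190409"
    if   [ "$ym" -lt 190409 ]; then return 1             # before the run covered here
    elif [ "$ym" -lt 191809 ]; then echo "3500,2700"     # Sep 1904 to Aug 1918
    elif [ "$ym" -lt 193209 ]; then echo "3700,2600"     # Sep 1918 to Aug 1932
    elif [ "$ym" -lt 195101 ]; then echo "4000,2660"     # Sep 1932 to Dec 1950
    elif [ "$ym" -le 196912 ]; then echo "3450,2700"     # Jan 1951 to Dec 1969
    else return 1                                        # after the run covered here
    fi
}

# Print (rather than run) the unpaper command line for one sheet.
# Arguments: issue month, input file, output file template.
unpaper_cmd() {
    size=$(sheet_size_for "$1") || return 1
    printf '%s\n' "unpaper --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size $size $2 $3"
}

unpaper_cmd 1925-03 in.pgm out%d.pgm
```

The last line prints the second of the four command lines above, since March 1925 falls in the September 1918 to August 1932 range.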

Overall, 5445 of the 6078 sheets (90%) were processed more or less correctly with these default settings. In some of these cases unpaper failed to properly deskew the page, and sometimes a small portion of the page header was cut off, but no significant amount of article text or graphics is missing. A further 491 images (8%) required some additional options to correct the following processing errors:

folded columns
Unpaper's grey filter seemed to get confused with certain multicolumn layouts, especially if the column layout wasn't identical on both pages of a sheet. It would end up deleting a column down the side or middle and then squeezing the two edges together, as if it had folded the page onto itself. About 5% of sheets were so affected. The problem was solved by using the --no-grayfilter option.

[images: folded column; fixed with --no-grayfilter]
missing text
About 4% of the sheets, mostly from 1921 and the 1960s, had blocks of text missing due to unpaper's grey filter misidentifying an area of a particularly dark scan. The issue was fixed by using --black-threshold to adjust the luminance value under which unpaper considers a pixel to be black.

[images: missing text; fixed with --black-threshold]
In a further five cases, the missing text was due to a hair or other anomalous dark line leading to the black border area; unpaper then considered the text block to be part of the border and deleted it. These cases were solved by adjusting --blackfilter-intensity.

[images: missing text; fixed with --blackfilter-intensity]
misaligned pages
Sometimes unpaper would split a sheet off centre, so that the rightmost edge of the left-hand page spilled over into the leftmost edge of the right-hand page. Using --pre-shift -100,0 solved this problem, which affected nearly 3% of the images.

[images: misaligned page; fixed with --pre-shift]
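
The four fixes above amount to a lookup from error type to one extra option. A minimal sketch of that lookup follows; the class labels (`folded-columns`, `dark-scan`, `border-line`, `misaligned`) and the function name are hypothetical, and because the values used for --black-threshold and --blackfilter-intensity are not recorded here, the caller has to supply them.

```shell
# Print the extra unpaper option for one class of processing error.
# The class labels are hypothetical names invented for this sketch; the
# options are the ones described above. No values for --black-threshold
# or --blackfilter-intensity are given above, so the caller must pass
# one as the second argument.
extra_opts_for() {
    case "$1" in
        folded-columns) echo "--no-grayfilter" ;;
        dark-scan)      [ -n "${2:-}" ] || return 1
                        echo "--black-threshold $2" ;;
        border-line)    [ -n "${2:-}" ] || return 1
                        echo "--blackfilter-intensity $2" ;;
        misaligned)     echo "--pre-shift -100,0" ;;
        *)              return 1 ;;      # unknown class
    esac
}

extra_opts_for misaligned
```

The printed option would be added to the default command line for the affected sheet; the last line here prints `--pre-shift -100,0`.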

There were 142 images (just over 2%) which could not be easily fixed at all. Nearly all of these failures were due to unpaper's black filter erasing images with large dark patches, or images or headlines which lay too close to the edge of the page. I was unable to find any combination of options which would preserve the desired text and images while still erasing the black border around the page. I may write to the author of unpaper to see if he has any suggestions; if this proves fruitless then I will have to take a different approach to these images. Possibly I could use my own autocrop tool on them to eliminate the black border, and then pass the result to unpaper for deskewing, grey filtering, and noise filtering.
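
The three counts quoted so far (5445, 491, and 142) should add up to the 6078 sheets, and the rounded percentages can be recomputed from them. A quick check, with `pct` being a helper name made up for this sketch:

```shell
# Percentage of a total, to one decimal place.
pct() { awk -v n="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * n / total }'; }

echo "total: $((5445 + 491 + 142))"   # total: 6078
pct 5445 6078    # 89.6  (processed with the default settings)
pct 491 6078     # 8.1   (fixed with additional options)
pct 142 6078     # 2.3   (not easily fixed)
```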

The following stacked bar chart shows the number of successfully and unsuccessfully processed JPEGs for each volume of issues from 1904 to 1969.

[a stacked bar chart showing the proportion of images successfully and unsuccessfully processed by unpaper]

As can be seen, the number of failures increases significantly in the 1960s; this is due to the increased use of photographs, particularly on the cover pages. The 1970s issues used so many photographs that there were more failures than I cared to correct. Since I scanned those issues from paper myself, I will use my own much better scans instead of trying to unpaper the microfiche photos.


  1. This is all very impressive for the computer-illiterate like myself. Best of luck with the rest of the project!

  2. What's the time scale now for it to be online?