I've finished processing the LSE JPEGs with unpaper, at least for the time being. The vast majority of the images were successfully processed using the following command lines.
- September 1904 to August 1918
unpaper --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 3500,2700 in.pgm out%d.pgm
- September 1918 to August 1932
unpaper --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 3700,2600 in.pgm out%d.pgm
- September 1932 to December 1950
unpaper --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 4000,2660 in.pgm out%d.pgm
- January 1951 to December 1969
unpaper --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 3450,2700 in.pgm out%d.pgm
As you can see, the only difference between the command lines is the sheet size, which varied over the course of the Standard's print run. The --layout double and --output-pages 2 options simply specify that we are working with 2-up sheets that need to be split into two separate files, one for each page. The --pre-wipe option deletes the ex libris banner LSE inserted at the bottom of the image.
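For anyone wanting to script this, here is a minimal sketch of how the sheet size could be looked up from an issue's date and the default command line assembled. The date ranges and option values are copied from the commands above; the function names (sheet_size_for, default_command) and the use of Python are my own assumptions for illustration, not part of unpaper or of the workflow described here.

```python
from datetime import date
import shlex

# Sheet sizes by period, copied from the command lines above.
# Each entry: (first day of period, last day of period, --sheet-size value).
SHEET_SIZES = [
    (date(1904, 9, 1), date(1918, 8, 31), "3500,2700"),
    (date(1918, 9, 1), date(1932, 8, 31), "3700,2600"),
    (date(1932, 9, 1), date(1950, 12, 31), "4000,2660"),
    (date(1951, 1, 1), date(1969, 12, 31), "3450,2700"),
]


def sheet_size_for(issue_date):
    """Return the --sheet-size value for the period containing issue_date."""
    for start, end, size in SHEET_SIZES:
        if start <= issue_date <= end:
            return size
    raise ValueError(f"no sheet size known for {issue_date}")


def default_command(issue_date, infile="in.pgm", outpattern="out%d.pgm"):
    """Build the default unpaper argument list for one 2-up sheet."""
    return [
        "unpaper",
        "--layout", "double",
        "--output-pages", "2",
        "--pre-wipe", "0,2898,4263,3102",
        "--sheet-size", sheet_size_for(issue_date),
        infile,
        outpattern,
    ]


if __name__ == "__main__":
    # Print the command for a sheet from March 1925 (hypothetical example date).
    print(shlex.join(default_command(date(1925, 3, 1))))
```

The argument list can be handed to subprocess.run as is; keeping it as a list avoids any shell quoting issues with the comma-separated values.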
Overall, 5445 of the 6078 sheets (90%) were processed more or less correctly with these default settings. In some of these cases unpaper failed to properly deskew the page, and sometimes a small portion of the page header was cut off, but no significant amount of article text or graphics is missing. A further 491 images (8%) required some additional options to correct the following processing errors:
- folded columns
- Unpaper's grey filter seemed to get confused with certain multicolumn layouts, especially if the column layout wasn't identical on both pages of a sheet. It would end up deleting a column down the side or middle and then squeezing the two edges together, as if it had folded the page onto itself. About 5% of sheets were so affected. The problem was solved by using the --no-grayfilter option.
[Images: folded column; fixed with --no-grayfilter]
- missing text
- About 4% of the sheets, mostly from 1921 and the 1960s, had blocks of text missing due to unpaper's grey filter misidentifying an area of a particularly dark scan. The issue was fixed by using --black-threshold to adjust the luminance value under which unpaper considers a pixel to be black.
[Images: missing text; fixed with --black-threshold]
- In a further five cases, the missing text was due to a hair or other anomalous dark line leading to the black border area; unpaper then considered the text block to be part of the border and deleted it. These cases were solved by adjusting --blackfilter-intensity.
[Images: missing text; fixed with --blackfilter-intensity]
- misaligned pages
- Sometimes unpaper would split a sheet off centre, so that the rightmost edge of the left-hand page spilled over into the leftmost edge of the right-hand page. Using --pre-shift -100,0 solved this problem, which affected nearly 3% of the images.
[Images: misaligned page; fixed with --pre-shift]
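The fixes listed above could be kept in a small lookup so that a problem sheet is rerun with the right extra option. This is only a sketch: the problem labels ("folded_column", "dark_scan", "border_leak", "misaligned") and the helper names are invented here for illustration, and the values for --black-threshold and --blackfilter-intensity have to be supplied by the caller because they were tuned per image.

```python
def extra_options(problem, value=None):
    """Return the extra unpaper options for one kind of processing error.

    The problem labels are hypothetical names for the error types above.
    """
    if problem == "folded_column":
        # Grey filter deleted a column: switch the grey filter off.
        return ["--no-grayfilter"]
    if problem == "misaligned":
        # Sheet split off centre: shift as described in the post.
        return ["--pre-shift", "-100,0"]
    if problem in ("dark_scan", "border_leak"):
        # These fixes need a value chosen by hand for the image.
        if value is None:
            raise ValueError(f"{problem} needs an explicit value")
        option = {
            "dark_scan": "--black-threshold",
            "border_leak": "--blackfilter-intensity",
        }[problem]
        return [option, str(value)]
    raise ValueError(f"unknown problem: {problem}")


def with_fix(base_command, options):
    """Insert options before the trailing input and output file arguments."""
    return base_command[:-2] + options + base_command[-2:]
```

For example, with_fix(cmd, extra_options("folded_column")) takes a default argument list cmd and returns the same list with --no-grayfilter placed ahead of the file names.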
There were 142 images (just over 2%) which could not be easily fixed at all. Nearly all of these failures were due to unpaper's black filter erasing images with large dark patches, or images or headlines which lay too close to the edge of the page. I was unable to find any combination of options which would preserve the desired text and images while still erasing the black border around the page. I may write to the author of unpaper to see if he has any suggestions; if this proves fruitless then I will have to take a different approach to these images. Possibly I could use my own autocrop tool on them to eliminate the black border, and then pass the result to unpaper for deskewing, grey filtering, and noise filtering.
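As a rough sketch of that fallback, the two steps might be chained as below. Everything about the first step is an assumption: the command name autocrop and its arguments stand in for my own tool, whose real interface is not shown here. The second step turns off unpaper's black filter with --no-blackfilter and leaves deskewing, grey filtering, and noise filtering at their defaults. I have left out --pre-wipe because cropping would shift the banner's coordinates, and whether the earlier --sheet-size values still suit cropped images is untested.

```python
def fallback_commands(infile, sheet_size=None,
                      cropped="cropped.pgm", outpattern="out%d.pgm"):
    """Return the two argument lists for the crop-then-unpaper fallback."""
    # Step 1: remove the black border first.
    # "autocrop" and its arguments are hypothetical placeholders.
    crop = ["autocrop", infile, cropped]

    # Step 2: run unpaper on the cropped image with the black filter off.
    clean = ["unpaper", "--layout", "double", "--output-pages", "2",
             "--no-blackfilter"]
    if sheet_size is not None:
        clean += ["--sheet-size", sheet_size]
    clean += [cropped, outpattern]

    return [crop, clean]
```

Each returned list could be passed to subprocess.run in turn, stopping if the first step fails.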
The following stacked bar chart shows the number of successfully and unsuccessfully processed JPEGs for each volume of issues from 1904 to 1969.
As can be seen, the number of failures increases significantly in the 1960s—this is due to the increased use of photographs, particularly on the cover pages. The 1970s issues used so many photographs that there were more failures than I cared to correct. Since I scanned those images from paper myself, I will use my own much better scans instead of trying to unpaper the microfiche photos.
This is all very impressive for the computer-illiterate like myself. Best of luck with the rest of the project!
Mike
What's the time scale now for it to be online?