Socialist Standard digitization blog

UI mockup #1

noreply@blogger.com (Tristan Miller) — Tue, 05 Feb 2013 21:46:00 +0000

I've produced a quick mock-up of what a web browser–based interface to the Socialist Standard archive might look like. Follow the link below to see a working copy.

I should stress that this is a very rough sketch; the colours and exact positioning of elements aren't finalized, the JavaScript is a bit buggy, and the page header is just something I threw together in the Gimp in five minutes. But I think that the overall issue navigation is OK. Opinions?

PDFBeads doesn't like consecutive whitespace in hOCR

noreply@blogger.com (Tristan Miller) — Thu, 20 Dec 2012 08:46:00 +0000

Lazy Kent has now published openSUSE RPMs for Tesseract 3.02, so I installed it and ran it on the files Tesseract 3.01 was failing on. This time it was able to produce hOCR files for them. However, PDFBeads did not like some of these hOCR files:

$ pdfbeads -f -t 216 -m JBIG2 -b JP2 -B 72 1910-036b.tiff  >/dev/null
Prepared data for processing 1910-036b.tiff
JBIG2 compression complete. pages:1 symbols:5780 log2:13
/usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:589:in `to_i': NaN (FloatDomainError)
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:589:in `getPDFText'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:584:in `each_index'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:584:in `getPDFText'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:564:in `each'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:564:in `getPDFText'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:267:in `process'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:202:in `each'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/lib/pdfbeads/pdfbuilder.rb:202:in `process'
        from /usr/lib64/ruby/gems/1.8/gems/pdfbeads-1.0.9/bin/pdfbeads:203
        from /usr/bin/pdfbeads:19:in `load'
        from /usr/bin/pdfbeads:19

Through a tedious process of binary searching through the files, I narrowed the problem down to cases where the hOCR file contains two or more consecutive spans of class ocrx_word which contain no CDATA except for whitespace. Affected files can be found by grepping for the following extended regular expression:

(<span[^>]*>(<strong>)? +(</strong>)?</span> *){2}

I don't know the hOCR format in detail, but I suspect that having two whitespace-containing ocrx_word spans in a row isn't prohibited. It's therefore probably PDFBeads which is at fault here.

The next steps are therefore to process the Tesseract output with sed to find and remove the duplicate whitespace spans. Following that it looks like my OCR job is complete, and the only really necessary remaining work is to build some HTML-based UI to access the PDFs.
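For reference, here is a minimal Python sketch of that clean-up step, as an alternative to sed. It reuses the regular expression above to detect affected files. The function names are hypothetical, and keeping the first whitespace-only span of each run while dropping the rest is an assumption about what PDFBeads will tolerate, not something I have confirmed.

```python
import re

# The extended regular expression quoted above, unchanged.
WHITESPACE_SPAN_RUN = re.compile(
    r"(<span[^>]*>(<strong>)? +(</strong>)?</span> *){2}"
)

# One whitespace-only span (optionally wrapped in <strong>) plus trailing spaces.
_ONE_SPAN = r"<span[^>]*>(?:<strong>)? +(?:</strong>)?</span> *"
# A run of two or more such spans; group 1 captures the first one.
_RUN = re.compile(r"(" + _ONE_SPAN + r")(?:" + _ONE_SPAN + r")+")


def has_whitespace_run(hocr: str) -> bool:
    """Return True if the hOCR text contains consecutive whitespace-only spans."""
    return WHITESPACE_SPAN_RUN.search(hocr) is not None


def collapse_whitespace_runs(hocr: str) -> str:
    """Keep the first whitespace-only span of each run and drop the others."""
    return _RUN.sub(r"\1", hocr)
```

Reading each hOCR file, passing its contents through collapse_whitespace_runs, and writing the result back only when has_whitespace_run was true would leave unaffected files untouched.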

More bugs

noreply@blogger.com (Tristan Miller) — Sun, 04 Nov 2012 17:10:00 +0000

So I wrote to Alexey Kryukov, author of PDFBeads, alerting him about his program's inability to handle input files where the horizontal and vertical DPI differ. I never heard back. I also did some further research on manually changing the output PDF resolution, but from what I can tell this isn't possible. So it looks like I'll just have to override the DPI settings in the original TIFFs and live with slightly stretched or compressed page sizes.

After running PDFBeads on my entire collection of images, I noticed that it failed to produce some issues due to missing hOCR files. Looking back, I see that Tesseract 3.01 has failed on some of the images, producing the following error message:

ELIST_ITERATOR::add_after_then_move:Error:Attemting to add an element with non NULL links, to a list

It looks like this problem has been reported at least a couple times before on the Tesseract issue tracker (Issue 541, Issue 788). Comments on the second report suggest the problem may have been solved in Tesseract 3.02, which was released a few days ago. This version hasn't yet been packaged in Lazy Kent's repository, so I can either wait to see if he updates the RPM, or try producing one myself using his spec file and patches.

Further experiences with PDFBeads

noreply@blogger.com (Tristan Miller) — Wed, 05 Sep 2012 21:53:00 +0000

I had a chance to visually examine the output of PDFBeads, and so far it looks OK. I think I will keep the unpapered backgrounds.

One problem that has arisen, however, is properly specifying the physical dimensions of the page. Back when I started this blog I reported that for most of my scans, the horizontal DPI is not the same as the vertical DPI. However, it seems that PDFBeads can't handle TIFFs where the horizontal and vertical DPI differ; when it tries to combine such images with hOCR data, the text in the resulting PDFs is a complete mess. I suppose there are three possible solutions to this problem:

Examine the source code of PDFBeads to track down and fix the bug. This is likely to be difficult, at least for me, because the tool is written in Ruby, a language I have no knowledge of. (Or perhaps the author could be persuaded to fix it; there's no bug tracker but he does give his e-mail address in the documentation.)
Postprocess the output PDF to override the DPI or paper size settings. I'm not sure if there's any easy way of doing this.
Use ImageMagick's convert -density to override the input TIFF DPI so that the vertical and horizontal DPI values are the same. This will result in distorted images, however.
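To get a feel for how much distortion the third option introduces, the arithmetic can be sketched in Python. The function names and the numbers in the usage note are made up for illustration; they are not measurements from my scans.

```python
def physical_size(width_px, height_px, dpi_x, dpi_y):
    """Physical page size in inches implied by the pixel dimensions and DPI."""
    return width_px / dpi_x, height_px / dpi_y


def aspect_distortion(dpi_x, dpi_y):
    """Factor by which the width/height ratio changes when both axes
    are forced to a single DPI value.

    The true aspect ratio is (w / dpi_x) / (h / dpi_y); with one DPI for
    both axes it becomes w / h, so the ratio of the two is dpi_x / dpi_y.
    """
    return dpi_x / dpi_y
```

For a hypothetical 3000×2000 pixel image tagged as 300×200 DPI, physical_size(3000, 2000, 300, 200) gives a 10 by 10 inch page, and aspect_distortion(300, 200) is 1.5: after the override the page would appear one and a half times as wide, relative to its height, as it really is.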

Experiences with PDFBeads

noreply@blogger.com (Tristan Miller) — Wed, 05 Sep 2012 14:46:00 +0000

Further to my post of yesterday, I downloaded and installed PDFBeads, a Ruby tool for assembling images and hOCR data into PDFs. Unlike hocr2pdf, PDFBeads supports JBIG2, and is also able to up- or downscale the page image to a specified DPI. In theory this means it's no longer necessary to call jbig2enc separately. Also, unlike jbig2enc's pdf.py, PDFBeads inserts a JPEG or JPEG2000 foreground image of the original page, though whether or not this is desirable depends on the source material. For my microfiche scans, it's not particularly helpful: some of the scans are quite dark, so I would rather have bitonal, text-only images which print with higher contrast and less ink. Also, large areas of these images have been blanked by unpaper, so it's possible some of them may look a bit ugly. I'll have to examine the results to see whether they're acceptable.

First though, I had to figure out how to use the tool. The full manual is available in Russian only, though running it with --help does produce a useful, if incomplete, English summary of the command-line options. But the usage instructions leave something to be desired: "pdfbeads [options] [files to process] > out.pdf" conspicuously omits such important details as what file types are supported and how to associate a given image file with a given hOCR file. Some experimentation revealed some usage quirks and bugs, which I document here for future reference and for the benefit of anyone else using this tool:

One need give only the image files on the command line; it tries to find matching hOCR files automatically based on the image filenames. For example, if you call pdfbeads foo.tiff then it will look for hOCR data in the file foo.html. Frustratingly, however, it looks for this file in the current directory, and not in foo.tiff's directory, so calling pdfbeads /path/to/foo.tiff won't work if the hOCR data is in /path/to/foo.html.
The tool leaves a lot of temporary files lying around. To be fair, this is a good thing, since they are expensive to produce and you wouldn't want to recreate them on each run unless necessary; there's also a command-line option to delete them. The problem is that where these files are produced is neither documented nor specifiable. This issue, plus the one mentioned in the previous point, make it a bit more difficult to cleanly use the tool in a batch environment such as a shell script or makefile.
The program doesn't always throw an error when things go wrong—for example, if you try to invoke it on a PNG image, it will happily produce a blank PDF instead of informing you that it can't handle PNG files. It took some trial and error to find an image file format that it liked (TIFF).
Even when called with the correct arguments, the program sometimes ends up producing a 0-byte PDF file. I let it run overnight to produce PDFs for 820 issues of the Standard, and in about a dozen cases it produced a 0-byte file. However, when I tried rerunning the tool on these cases, in all but one it successfully produced the PDF. So evidently it's a bit flaky.
The tool still failed on one of my newspaper issues, throwing the error /usr/lib64/ruby/gems/1.8/gems/hpricot-0.8.5/lib/hpricot/parse.rb:33: [BUG] Segmentation fault. The problem is evidently an insidious and often-reported bug with hpricot, the HTML parser PDFBeads uses to process the hOCR files. There was nothing obviously wrong with the particular hOCR file that hpricot was choking on; and I found that making almost any trivial modification to it (such as adding another newline to the end of the file) allowed hpricot to process it without error.
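Two of the quirks above (the hOCR lookup in the current directory and the occasional 0-byte output) can be worked around with a small wrapper. What follows is only a sketch under assumptions: the function name run_pdfbeads and its parameters are hypothetical, the hOCR file is assumed to sit next to the image, and any PDFBeads options are passed through untouched rather than interpreted.

```python
import subprocess
from pathlib import Path


def run_pdfbeads(image, out_pdf, options=(), attempts=3, runner=subprocess.run):
    """Run pdfbeads from the image's own directory, retrying on empty output.

    `runner` defaults to subprocess.run and exists only so that the
    function can be exercised without pdfbeads installed.
    Returns True once a non-empty PDF has been written, False otherwise.
    """
    image = Path(image).resolve()
    out_pdf = Path(out_pdf).resolve()
    for _ in range(attempts):
        with open(out_pdf, "wb") as out:
            # cwd is the image's directory so that foo.html is found for foo.tiff.
            runner(["pdfbeads", *options, image.name],
                   cwd=image.parent, stdout=out, check=False)
        if out_pdf.stat().st_size > 0:
            return True
    return False
```

Because the working directory changes, any temporary files the tool leaves behind will presumably end up next to the images as well; I haven't verified this.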

Now that I've used the tool to produce a set of PDFs, I'm doing some spot checks on them to make sure they all look OK and have the hOCR data properly integrated. Also, because my scans vary in size (both in terms of pixels and physical paper dimensions) I may need to rerun the tool using different DPI settings for different issue ranges. Once that is done I can look at adding proper metadata to the PDFs. (Then there's the whole issue of using DjVu as an alternative, which so far I haven't investigated yet!)

DIY Book Scanner

noreply@blogger.com (Tristan Miller) — Wed, 05 Sep 2012 12:02:00 +0000

In one of my recent posts an anonymous commenter alerted me to the existence of the DIY Book Scanner website, and more specifically its forum. The forum looks to be an excellent resource for anyone doing their own book (or newspaper) scanning project, and contains areas for discussing both hardware and software workflows. It's there that I first learned about PDFBeads (more on which in an upcoming post).

Combining hOCR and image data into a PDF

noreply@blogger.com (Tristan Miller) — Mon, 03 Sep 2012 22:48:00 +0000

You will recall that the page images originally supplied to me were DCT images embedded in PDFs, DCT being a lossy compression scheme based on the JPEG standard. I needed to crop, deskew, and OCR these images, for which I had to decompress them to bitmaps. The finished PDFs I ultimately produce will use lossless JBIG2 compression on the scans—or so the plan is.

At the moment I have the cropped and deskewed page images as lossless PNG bitmaps, along with the OCR'd text in hOCR format. Using jbig2enc it's easy to create JBIG2 bitmaps and symbol tables from the PNG files. However, I don't (yet) have any tool which will directly combine the JBIG2 data and the hOCR data for a page into a single PDF. Jbig2enc's pdf.py can assemble JBIG2 files into a PDF, but it doesn't add the hOCR text. I did some investigation and I think I have two options available to me:

I could use ExactImage's hocr2pdf to combine the PNG bitmaps and hOCR text into a PDF, and then use pdfsizeopt to JBIG2-compress the PDFs. There are two possibly surmountable problems with this:
- Hocr2pdf always converts the images you give it to the lossy DCT format when outputting them to the PDF. In our case this is a bad thing, because our images are already from a DCT-compressed source, and are pretty low resolution to begin with. From reading the comments in the hocr2pdf source code (the only source for which I found was a user-contributed openSUSE RPM) I see that support for other image compression schemes is on the to-do list:
```
// TODO: more image compressions, jbig2, Fax
```
  Fortunately, I think it should be easy to hack lossless image output into the code. The code for writing PDFs in codecs/pdf.cc starts off as follows:
```
  virtual void writeStreamTagsImpl(std::ostream& s)
  {
    // default based on image type
    if (image.bps < 8) encoding = "/FlateDecode";
    else encoding = "/DCTDecode";
```
  So apparently the code already supports not only DCT but also the lossless Flate scheme, and chooses between them based on the bit depth of the image. If I change the above code to
```
  virtual void writeStreamTagsImpl(std::ostream& s)
  {
    encoding = "/FlateDecode";
```
  and recompile, maybe hocr2pdf will no longer lossily compress the PNGs I feed it.
- You will recall that I got the best-looking results from jbig2enc when I set it to upscale the images by a factor of two. However, pdfsizeopt doesn't appear to let you change the scaling factor. Since pdfsizeopt is just a Python script which calls jbig2, I should be able to just add -2 to the system call at that point.
I could instead use something called PDFBeads. According to a thread on DIY Book Scanner, PDFBeads is a Ruby application which can add hOCR to PDF. However, the reader is warned that the "manual [is] in Russian only"! This could be fun.

So tomorrow will be spent patching hocr2pdf and pdfsizeopt, and/or learning Russian. :)

GNU Parallel, where have you been all my life?

noreply@blogger.com (Tristan Miller) — Mon, 03 Sep 2012 21:19:00 +0000

Digitizing the Socialist Standard archive involves running CPU-bound image processing tools on a large number of files. Since I've got a multicore CPU, it makes sense to run such operations in parallel rather than one after another. (A good rule of thumb I've heard is to always have twice as many processes running as you have cores or CPUs.) Up until now, I've been coding each batch of tasks in a makefile, and then invoking make with the -j argument for parallel execution. Needless to say, this is a bit inconvenient when I just have a one-off batch job to run, and it also prevents me from developing and testing bash-scripted tasks from the command line. For years I've wished that bash's looping statements could be parameterized by the number of loop bodies to run simultaneously. For example, instead of writing for x in a b c d e f;do somehugecommand $x;done and waiting for somehugecommand to run six times, one after the other, I want to be able to write something like for x in a b c d e f;do -j3 somehugecommand $x;done and have three instances of somehugecommand launch and run simultaneously.

Well, apparently such a tool has existed for many years now, but no one told me about it. It's called GNU Parallel, and it works much like the old familiar xargs from GNU Findutils. You pass it a list of values on stdin, one per line, and give the command to execute as its command-line arguments. As with xargs, the character sequence {} gets replaced with the values from stdin. And of course, you also tell it how many simultaneous jobs to run with the -j option, just like with GNU Make. For example, whereas before I was calling the Tesseract OCR software on one file at a time with for f in $list_of_images;do tesseract $f.png $f -l eng hocr;done, I'm now executing them in parallel with printf '%s\n' $list_of_images | parallel -j4 tesseract {}.png {} -l eng hocr. (Note that parallel treats each input line as one value, so the names have to arrive one per line rather than space-separated.) What a fantastically useful utility!
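For comparison, the same "run at most N jobs at once" idea can be sketched in Python with a thread pool. This is not how GNU Parallel works internally, just an analogous pattern; the function names are hypothetical, and the Tesseract arguments are copied from the loop above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def tesseract_command(stem):
    """Build the command used above: tesseract STEM.png STEM -l eng hocr."""
    return ["tesseract", f"{stem}.png", stem, "-l", "eng", "hocr"]


def run_in_parallel(stems, jobs=4, runner=subprocess.run):
    """Run one Tesseract process per stem, with at most `jobs` at a time.

    Threads suffice here because each one only waits for an external
    process. `runner` is replaceable so the function can be tried out
    without Tesseract installed. Results come back in input order.
    """
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(lambda s: runner(tesseract_command(s)), stems))
```

For one-off jobs typed at the prompt, the parallel one-liner is still far more convenient.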

As might be surmised from its name, GNU Parallel is an official GNU project, so it's surprising that it's not better known and more widely available. (For example, it's not packaged by openSUSE or other major distributions.) GNU Parallel's web page has some background which explains why:

In the years after 2005… I tried getting parallel accepted into GNU findutils. It was not accepted as it was written in Perl and the team did not want GNU findutils to depend on Perl…

In February 2009 I tried getting parallel added to the package moreutils. The author never replied to the email or the two reminders…

In 2010 parallel was adopted as an official GNU tool and the name was changed to GNU parallel. As GNU already had a tool for running jobs on remote computers (called pexec) it was a hard decision to include GNU parallel as well. I believe the decision was mostly based on GNU parallel having a more familiar user interface - behaving very much like xargs. Shortly after the release as GNU tool remote execution was added and all missing options from xargs were added to make it possible to use GNU parallel as a drop in replacement for xargs.

So to Ole Tange, the author of GNU Parallel, I just want to say thank you for this wonderful utility, and I'm sorry that you had so much trouble getting it adopted into a GNU package.

hOCR-capable OCR programs

noreply@blogger.com (Tristan Miller) — Mon, 03 Sep 2012 12:36:00 +0000

As indicated in my last posting, I tested various OCR programs which output either to hOCR or directly to PDF. For the ones which output hOCR, I tried producing a PDF with the text layer hidden underneath the image using hocr2pdf, a Free Software tool which the creators do a very good job of preventing you from finding. There seems to be absolutely nowhere on their website to download it, either in source or binary form. Fortunately, the source seems to be available on a few third-party download sites, and users at the openSUSE Build Service have posted RPMs.

Anyway, I tested each OCR package on one page from a 1904 issue and one page from a 1961 issue. My findings are summarized as follows:

Cuneiform: Cuneiform seemed to do a decent job at OCR, at least as far as character matching went, but the hOCR it produced didn't work well with hocr2pdf—this despite using an earlier version of Cuneiform, as instructed by a post on the DIY Book Scanner forums which an anonymous commenter referred me to.
Tesseract: Tesseract's output was almost as good as Cuneiform's, and moreover the hOCR was digestible by hocr2pdf.
Adobe Acrobat Professional: I have access to this through my workplace. Accuracy was similar to the above two Free Software packages, but the user interface doesn't support batch processing. I've got hundreds of issues to process, so OCRing them one at a time in a GUI isn't an option.
ABBYY OCR: I activated a trial version of ABBYY's command-line OCR package. The accuracy was by far the highest of any of the suites I tested. However, it's proprietary and also very expensive software; in order to process the complete Standard archive I'd need to buy a €999 licence.

None of the above OCR programs seemed to recognize the column layout of the newspaper. It's therefore not possible to use the text selection tool in the resulting PDF to copy and paste more than one line of a column at a time. However, at least the PDF will be searchable (modulo the character recognition errors).

I've therefore settled on Tesseract. I set up a batch processing job and estimate it will take about 20 to 30 hours to do the whole archive.

One difficulty I foresee is that I don't think hocr2pdf works on the output of jbig2enc. I may need to use hocr2pdf to create an uncompressed PDF with hidden text, and then reprocess it using pdfsizeopt, which integrates jbig2.

OCR on GNU/Linux: A survey

noreply@blogger.com (Tristan Miller) — Sun, 16 Oct 2011 20:55:00 +0000

Today was spent checking out options for optical character recognition (OCR) on GNU/Linux. There are apparently the following basic engines for OCR:

Engine	Version	Licence
OpenOCR (Cuneiform)	1.0.0	BSD-style
GOCR	0.49	GPLv2
Ocrad	0.20	GPLv3
Tesseract	2.04	Apache 2.0
ABBYY OCR	9.0	proprietary
OCR Shop XTR	5.6	proprietary

In June of last year Andreas Gohr did a short experiment where he compared the first five above-listed GNU/Linux OCR engines and found that ABBYY OCR had the highest accuracy, with 100% for proportionally spaced serif and sans-serif text; Tesseract was the best-performing Free Software package, with accuracy in the 92–98% range. GOCR and Ocrad were significantly worse, with accuracy as low as 76% and 82%, respectively.

The non-proprietary engines usually just have bare-bones command-line interfaces with a very limited feature set. There are a number of higher-level tools, often with graphical interfaces and allowing more sophisticated pre- and post-processing of the data. These tools include the following:

Front end	Version	Licence	Back ends	Notes
easy-ocr	3.4	BSD-style	Cuneiform, GOCR, Ocrad, OCRopus, Tesseract	apparently available only as a Debian binary package
gImageReader	0.9	GPLv3	Tesseract	—
OCRFeeder	0.6.6	GPLv3	GOCR, Ocrad, Tesseract	—
ocrodjvu	0.7.5	GPLv2	Cuneiform, GOCR, Ocrad, OCRopus, Tesseract	—
OCRopus	0.4	Apache 2.0	Tesseract	—
pdfocr	0.1.2	BSD-style	Cuneiform	—
WatchOCR	0.8	GPL	Cuneiform	available only as a Debian binary package or Knoppix LiveCD

Since it's my intention to use Free Software wherever possible for this project, I installed Tesseract and a GPLv3-licensed graphical front end, gImageReader. (Rather than compiling from source, I used Malcolm Lewis's openSUSE RPMs. These RPMs fail to specify all the dependencies; they require the presence of the python-imaging and python-enchant packages.) I then tried processing a couple pages from the Standard as test runs: the front page of the September 1904 issue, and the third page of the July 1961 issue. The former is a relatively poor-quality scan, and the latter is quite clean and has a simple layout. You can see the results in the screenshots below: the text OCR'd from the 1904 issue is almost complete gibberish, whereas the text for the 1961 issue is mostly correct (though still with lots of mistakes).

gImageReader and the September 1904 Standard

gImageReader and the July 1961 Standard

Unfortunately, the gImageReader interface produces only plain text as its output, which is useless for my purposes. What I need is for there to be a mapping of selectable, searchable text to the position it appears at in the original scan. Apparently, there is an open standard, hOCR, for representing text layout, recognition confidence, style, and other OCR information. Tesseract, Cuneiform, and other OCR packages can output to hOCR. The problem is that the hOCR file doesn't itself contain the original scanned image; for this you need some extra software to produce (say) a PDF which combines the text information and the original image. Only then will you have a searchable PDF.
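To make the text-to-position mapping concrete: in hOCR, each recognized word is typically an HTML span of class ocrx_word whose title attribute carries a bounding box in pixel coordinates. The Python sketch below pulls out those word/box pairs using only the standard library. The class and function names are hypothetical, and real hOCR files may carry additional properties and element classes that this ignores.

```python
from html.parser import HTMLParser


def parse_bbox(title):
    """Extract (x0, y0, x1, y1) from a title such as 'bbox 10 20 50 40; x_wconf 90'."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class WordBoxParser(HTMLParser):
    """Collect (text, bbox) pairs for every ocrx_word span that has a bbox."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "span" and attrs.get("class") == "ocrx_word":
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text), self._bbox))
            self._bbox = None


def word_boxes(hocr):
    """Return a list of (text, (x0, y0, x1, y1)) tuples from an hOCR string."""
    parser = WordBoxParser()
    parser.feed(hocr)
    parser.close()
    return parser.words
```

A tool that builds a searchable PDF does essentially this, and then places each word as invisible text at the corresponding position over the page image.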

It turns out that only some of the engines and front ends support hOCR, and of the Free Software front ends, only two of them add text layers to PDFs: WatchOCR and pdfocr. The ocrodjvu wrapper produces DjVu files instead of PDFs. My next task will therefore be to install and test WatchOCR, pdfocr, and ocrodjvu. I may also try out some proprietary packages for purposes of comparison.

January 1968 and 1969

noreply@blogger.com (Tristan Miller) — Thu, 29 Sep 2011 19:06:00 +0000

I first reported in March 2010 that the microfiche scans provided by LSE are missing the cover pages for the January 1968 and 1969 issues. Before I left London I visited the Socialist Party of Great Britain headquarters and picked up a copy of the January 1969 issue. The archive had no more loose January 1968 issues, so I had to photocopy it from a bound volume. Unfortunately, this meant that the page was cropped along the binding edge, but it's better than no page at all.

I had also mentioned in March 2010 that the aspect ratio of the actual printed issues doesn't seem to correspond to that of the LSE scans. This is something I had forgotten about when I finally scanned these issues in yesterday, and it caused me no end of confusion when I saw the discrepancy between what I had scanned myself and what all the other scans looked like. Below are two images for comparison: the left shows my scan of the January 1969 issue cover, and the right is the same image stretched to fit the aspect ratio of the LSE scans.

Since all the other scans are distorted in this manner, I'm faced with the choice of similarly distorting my scans to match them, or else disproportionately scaling all the LSE scans to match the physical page size. Much as I would like my archive to be as faithful as possible a copy of the printed issues, I am going to have to go with the first option, at least for now. The LSE scans aren't at a high enough resolution to withstand too much image manipulation, and besides that altering them would necessitate another careful inspection of each page to ensure that there are no further unpaper anomalies.

jbig2enc seeks a new maintainer

noreply@blogger.com (Tristan Miller) — Wed, 28 Sep 2011 20:21:00 +0000

I just received the following message from Adam Langley, maintainer of jbig2enc:

I wrote jbig2enc many years ago and I'm aware that it's been very useful to some people, for which I'm glad. But I'm afraid that I simply don't have the time to maintain it any more. Please feel free to fork it in the usual open source style. If you think you have a sufficiently well maintained fork, let me know and I'll start directing people to it.

As someone who has found jbig2enc quite valuable, I just want to say thanks to Adam, and hope that someone reading this blog might decide to take over the good work he's been doing!

Upsampling results

noreply@blogger.com (Tristan Miller) — Sun, 25 Sep 2011 18:33:00 +0000

In June I thought I had completed cropping, but upon reviewing the covers I spotted a further three with problems. It's possible that some of the inner pages were likewise overlooked, though I'll leave that rather tedious inspection for later (or perhaps to eagle-eyed volunteers, if I can get any!)…

Also in June I indicated that I had used jbig2enc to produce three PDFs for each of the issues I have up to the end of 1969: one with no upsampling, one with 2× upsampling, and one with 4× upsampling. The sizes of the collections are as follows:

upsampling	size (MB)
none	427
2×	733
4×	1257

So even at the greatest upsampling the collection is still easily small enough to fit on a DVD-ROM. However, from browsing the PDF covers, it seems a lot of the 4× upsampled issues are missing text (similar to how unpaper sometimes omits text on pages with low contrast or a close black border). I figure this could be fixed by playing with jbig2enc's threshold settings, though that could take several hours or days of experimentation and careful checking of the output.

The PDFs with 2× upsampling seem to be of good quality—much better than the non-upsampled ones, and not obviously worse than the 4× upsampled ones. And none of the covers had any obviously missing text; I hope this holds for the inside pages as well. I think, then, that it will be best to go with 2× upsampling. You can see some comparisons below.

No upsampling

2× upsampling

4× upsampling

My next steps will therefore be as follows:

Locate and scan the missing covers of the January 1968 and January 1969 issues. I got photocopies of these pages before I left London; I hope they didn't get lost in the move to Darmstadt!
Start investigating OCR software so that the final collection can have full-text search.
Start investigating DjVu to compare it with the PDFs.

A series of unfortunate events

noreply@blogger.com (Tristan Miller) — Sun, 25 Sep 2011 10:03:00 +0000

This spring I accepted a new job in Darmstadt, with a starting date in July, so in late May I quit my old job in London, hoping to use the time (except for that spent moving) to finish the digitization of at least the 1904–1970 issues. Unfortunately, this plan was thwarted by a series of unfortunate events. June saw one legal, one medical, and one veterinary emergency, which together consumed all my available time. When I finally arrived in Darmstadt, it took nearly two months after signing up for Internet access before they came to install the cables. And to top it all off, shortly after we got wired, my laptop became irreparably damaged, so I had to procure a new computer on short notice and transfer all my data onto it.

The new computer is all set up, now, and the Internet is the fastest I've ever had. Here's a comparison of the specs of my old machine and the new one I'll be working with:

	sable (old machine)	ferret (new machine)
CPU	Intel Core2 Duo T8300 @ 2.4 GHz	AMD Athlon II X2 260 @ 3.2 GHz
RAM	4 GB DDR2	4 GB DDR3
Hard disk	250 GB SATA	500 GB SATA II
Display	39 cm TFT @ 1440×900 (WSXGA)	61 cm TFT @ 1920×1080 (1080p)
Graphics	Intel 965 GM	AMD Radeon HD3000
OS	openSUSE 11.3	openSUSE 11.4

So as you can see, the new machine is a modest improvement on the old one in every respect: it's got a faster CPU and faster memory, a larger and faster hard drive, a larger and higher-resolution display, a better graphics card (not that that matters much for the 2D imaging work this project is concerned with), and a newer operating system.

Because the machine's architecture is slightly different, and because it's a completely new install of the operating system, I'm going to have to recompile jbig2enc, Leptonica, and various support utilities I've written myself. After that I can get back to work, so watch this space for updates in the hopefully very near future…

Cropping complete

noreply@blogger.com (Tristan Miller) — Sun, 05 Jun 2011 21:04:00 +0000

As reported in March 2010, the microfiche images omit the cover pages for the January 1968 and January 1969 issues. I visited the Socialist Party of Great Britain's head office yesterday and picked up a spare copy of the January 1969 issue. There were no extra January 1968 issues remaining, but the issue did appear in the bound volumes in the archive, so I took a photocopy. The binding obscures about a centimetre of the left edge of the page, but my copy is better than nothing, I guess. Now my only problem is getting these two pages digitized, as I no longer have access to a scanner. I'll either have to find someone with a scanner, or see if I can photograph the covers myself.

With most of the pages now in place, it's time to start thinking again about how to "bind" them into PDF or DjVu documents. Since it's been a year since I last experimented with this, I downloaded the latest version of jbig2enc and its dependency, Leptonica. I discovered that jbig2enc doesn't compile with Leptonica 1.68, but only because the parameters to the findFileFormat() function have changed. This function is referenced once, in jbig2.cc, where it's used to check something involving multi-page TIFFs. I don't use jbig2enc to process TIFFs so I just commented out these lines, and then jbig2enc compiled fine.

My computer is now whirring away, generating three PDFs for each of the issues that I have up to the end of 1969: one with no upsampling, one with 2× upsampling, and one with 4× upsampling. It will probably be busy doing this all night. Once it's done, I'll examine the results to see what looks the best and what the file sizes are like. Watch this space for further analysis of the results…

unpaper revisited

noreply@blogger.com (Tristan Miller) — Thu, 02 Jun 2011 13:20:00 +0000

The last few days have been spent reviewing and revising the results I obtained with unpaper in March of last year. After double-checking the image output, I found I had missed some cases where unpaper had failed to properly process the images. After applying the appropriate command-line options discussed previously, I was able to get unpaper to correctly process most of these; the rest I added to the list of images which unpaper cannot process. I also double-checked this list, which originally had 142 images; I found that many of them were able to be processed successfully with a little more command-line option experimentation.

In the end, there was a net increase of seven images to the list, so it now contains 149 images. These cases are almost exclusively pages with illustrations (usually cover pages). I will now have to do some preliminary tests to determine whether it would be more efficient to crop these images manually or use my autocrop tool.

PDF viewing woes: update

noreply@blogger.com (Tristan Miller) — Tue, 31 May 2011 10:47:00 +0000

In March 2010 I reported on a couple of problems viewing PDFs. The first problem was that my file manager, Dolphin, was unable to generate previews of some PDFs due to an arbitrary limit on file sizes. The second problem was slow rendering of the PDFs in my viewer, Okular. I'm pleased to report that the first of these issues has been fixed, and the second is due to be fixed in the next stable release series of Poppler (the PDF rendering library used by Okular).

Final results with unpaper

noreply@blogger.com (Tristan Miller) — Wed, 31 Mar 2010 16:53:00 +0000

I've finished processing the LSE JPEGs with unpaper, at least for the time being. The vast majority of the images were successfully processed using the following command lines.

September 1904 to August 1918: unpaper --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 3500,2700 in.pgm out%d.pgm
September 1918 to August 1932: unpaper --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 3700,2600 in.pgm out%d.pgm
September 1932 to December 1950: unpaper --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 4000,2660 in.pgm out%d.pgm
January 1951 to December 1969: unpaper --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 3450,2700 in.pgm out%d.pgm

As you can see, the only difference in the command lines was the sheet size, which varied over the course of the Standard's print run. The --layout double and --output-pages 2 options simply specify that we are working with 2-up sheets that need to be split into two separate files, one for each page. The --pre-wipe option deletes the ex libris banner LSE inserted at the bottom of the image.
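Since only the sheet size changes from period to period, the choice can be scripted. The following is a minimal sketch, not the script actually used: the function names and the YYYY-MM date format are assumptions made for illustration, while the date ranges, sizes, and options are taken from the command lines above. It prints the unpaper command rather than running it.

```sh
# Sketch: pick the --sheet-size value for an issue from its date (YYYY-MM).
# sheet_size and unpaper_cmd are hypothetical helper names.
sheet_size() {
    ym=$(printf '%s' "$1" | tr -d '-')   # "1918-09" -> "191809"
    if   [ "$ym" -ge 190409 ] && [ "$ym" -le 191808 ]; then echo 3500,2700
    elif [ "$ym" -ge 191809 ] && [ "$ym" -le 193208 ]; then echo 3700,2600
    elif [ "$ym" -ge 193209 ] && [ "$ym" -le 195012 ]; then echo 4000,2660
    elif [ "$ym" -ge 195101 ] && [ "$ym" -le 196912 ]; then echo 3450,2700
    else
        echo "no sheet size known for $1" >&2
        return 1
    fi
}

# Print (rather than run) the unpaper command for one issue.
unpaper_cmd() {
    size=$(sheet_size "$1") || return 1
    echo "unpaper --layout double --output-pages 2" \
         "--pre-wipe 0,2898,4263,3102 --sheet-size $size in.pgm out%d.pgm"
}

unpaper_cmd 1918-09
```

Dates outside September 1904 to December 1969 are rejected rather than guessed at, since no sheet size is given for them here.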

Overall, 5445 of the 6078 sheets (90%) were processed more or less correctly with these default settings. In some of these cases unpaper failed to properly deskew the page, and sometimes a small portion of the page header was cut off, but no significant amount of article text or graphics is missing. A further 491 images (8%) required some additional options to correct the following processing errors:

folded columns

Unpaper's grey filter seemed to get confused with certain multicolumn layouts, especially if the column layout wasn't identical on both pages of a sheet. It would end up deleting a column down the side or middle and then squeezing the two edges together, as if it had folded the page onto itself. About 5% of sheets were so affected. The problem was solved by using the --no-grayfilter option.

[Images: a folded column, and the same sheet fixed with --no-grayfilter]

missing text

About 4% of the sheets, mostly from 1921 and the 1960s, had blocks of text missing due to unpaper's grey filter misidentifying an area of a particularly dark scan. The issue was fixed by using --black-threshold to adjust the luminance value under which unpaper considers a pixel to be black.

[Images: missing text, and the same sheet fixed with --black-threshold]

In a further five cases, the missing text was due to a hair or other anomalous dark line leading to the black border area; unpaper then considered the text block to be part of the border and deleted it. These cases were solved by adjusting --blackfilter-intensity.

[Images: missing text, and the same sheet fixed with --blackfilter-intensity]

misaligned pages

Sometimes unpaper would split a sheet off centre, so that the rightmost edge of the left-hand page spilled over into the leftmost edge of the right-hand page. Using --pre-shift -100,0 solved this problem, which affected nearly 3% of the images.

[Images: a misaligned page, and the same sheet fixed with --pre-shift]

There were 142 images (just over 2%) which could not be easily fixed at all. Nearly all of these failures were due to unpaper's black filter erasing images with large dark patches, or images or headlines which lay too close to the edge of the page. I was unable to find any combination of options which would preserve the desired text and images while still erasing the black border around the page. I may write to the author of unpaper to see if he has any suggestions; if this proves fruitless then I will have to take a different approach to these images. Possibly I could use my own autocrop tool on them to eliminate the black border, and then pass the result to unpaper for deskewing, grey filtering, and noise filtering.

The following stacked bar chart shows the number of successfully and unsuccessfully processed JPEGs for each volume of issues from 1904 to 1969.

As can be seen, the number of failures increases significantly in the 1960s—this is due to the increased use of photographs, particularly on the cover pages. The 1970s issues used so many photographs that there were more failures than I cared to correct. Since I scanned those images from paper myself, I will use my own much better scans instead of trying to unpaper the microfiche photos.

First results with unpaper

noreply@blogger.com (Tristan Miller) — Sun, 28 Mar 2010 15:36:00 +0000

The last few days have been spent figuring out how to get unpaper to work. Unlike my autocrop tool, unpaper needs to be told the sheet size, which makes it a bit trickier to use with the LSE scans (see my earlier post "Page and image size analysis"). It also handles only uncompressed PNM files, which for some strange reason the author thinks of as a feature rather than a shortcoming. So now my corpus has ballooned by another 74 GB. Good thing I bought that 1.5 TB drive.

Anyway, I've run unpaper on the September 1904 through August 1918 issues (whose pages are all 242 mm × 460 mm). Of the 836 uncropped JPEGs for these issues, unpaper seems to have processed 779 of them (93%) correctly with the following command line:

unpaper --overwrite --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 3500,2700 in.pgm out%d.pgm

Of the remaining 57 JPEGs, all but 6 were correctly processed with some extra options to modify the behaviour of the black or grey filters. The remaining 6 images I will have to deskew and crop by hand, or just use my autocrop tool.

Finding out the correct unpaper options for the 57 anomalous JPEGs was somewhat tedious. I would run unpaper with various command-line options on the files, wait several seconds for it to process, launch an image viewer on the output files, and then if the output was not acceptable, I would have to quit the image viewer and start again with a different set of command-line options. It would have been much easier if I could have just kept the image viewer open and set it to automatically refresh the images whenever they changed on disk. Unfortunately, neither of the viewers I tried (Gwenview and Kuickshow) has a "watch file" option. Gwenview does have a manual "refresh" command, but it does not refresh thumbnails. I've therefore created and/or voted for "watch file" and "refresh" feature requests on the KDE bug tracker.

In the meantime, does anyone know of a fast, lightweight image viewer for X11 which has a "watch file" feature? It should be able to view PNM and PNG files.
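Until such a viewer turns up, one way to cut down on the relaunching would be to run several candidate option sets in one pass, each writing to its own output files, and then open the whole batch in the viewer at once. What follows is only a sketch under stated assumptions: the values 0.2 and 30 and the try1 to try4 output prefixes are placeholders rather than tuned settings, and the function name is hypothetical. The options themselves are the ones that ended up mattering for these scans.

```sh
# Sketch: try several candidate option sets on one sheet in a single pass.
# The option values and the "tryN-" output prefixes are placeholders.
BASE="--overwrite --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 3500,2700"

try_options() {
    in=$1
    n=0
    for extra in "--no-grayfilter" \
                 "--black-threshold 0.2" \
                 "--blackfilter-intensity 30" \
                 "--pre-shift -100,0"; do
        n=$((n + 1))
        # Dry run: remove the leading "echo" to actually invoke unpaper.
        echo unpaper $BASE $extra "$in" "try${n}-%d.pgm"
    done
}

try_options in.pgm
```

Because every attempt gets its own prefix, the results can be compared side by side instead of being overwritten on each run.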

PDFs: JPEG vs PNG vs JBIG2

noreply@blogger.com (Tristan Miller) — Tue, 23 Mar 2010 13:45:00 +0000

The goal of today's exercise is to see how to make the smallest possible PDF from the scans I have. To this end, I experimented with the September 1954 issue, which is a good candidate because it is fairly long and contains both text and images. The following table and graph summarize four approaches and the results.

process	PDF creation command	PDF size (KB)
join the JPEGs into a PDF with ImageMagick	`convert *.jpg JPEG.pdf`	43 777
convert the JPEGs to bilevel PNGs, then join them into a PDF with ImageMagick	`convert *.png PNG.pdf`	6 907
convert the JPEGs to bilevel JBIG2s, then join them into a PDF with jbig2enc	`jbig2 -b J -d -p -s *.jpg; pdf.py J > JBIG2.pdf`	947
upscale and convert the JPEGs to bilevel JBIG2s, then join them into a PDF with jbig2enc	`jbig2 -b J -d -p -s -2 *.jpg; pdf.py J > 2xJBIG2.pdf`	1 451

So the clear winner here is JBIG2. The 2× upscaled version is actually much easier to read than the unscaled JBIG2 or PNG images, which are sometimes too faint. If I were to use the 2× upscaled JBIG2 method to produce PDFs for all the 1904–1972 issues, the total would be about 450 MB in size, which would easily fit on a single CD-ROM.
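To put the table's figures in proportion, each PDF size can be expressed as a percentage of the plain JPEG version. This is a small illustrative calculation only; the sizes are copied from the table, and the function name and short labels are assumptions.

```sh
# Sketch: express each PDF size from the table as a percentage of the JPEG PDF.
# Sizes are in KB, copied from the table; ratios() is a hypothetical helper.
ratios() {
    awk 'BEGIN {
        n = split("JPEG PNG JBIG2 2xJBIG2", name, " ")
        split("43777 6907 947 1451", kb, " ")
        for (i = 1; i <= n; i++)
            printf "%-8s %6d KB  %5.1f%% of JPEG\n", name[i], kb[i], 100 * kb[i] / kb[1]
    }'
}

ratios
```

On these numbers the unscaled JBIG2 file comes to roughly 2% of the JPEG version, and the 2× upscaled one to a little over 3%.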

However, I know that much better compression ratios can be achieved using DjVu: pages can typically be reduced to just a few kilobytes each. The problem is that creating DjVu documents is a bit more involved. I tried using pdf2djvu, but the DjVu files it created were even larger than the PDFs; clearly what I really need to do is to use the individual DjVuLibre tools to properly segment and compress the original cropped JPEGs. Fortunately there appears to be some guidance, along with scripts, on Wikisource. The Wikisource guide also pointed me towards unpaper, which apparently does a better job of autocropping scans than my own tool, and also deskews the pages. So the next few days will probably be spent investigating these resources.

jbig2enc

noreply@blogger.com (Tristan Miller) — Mon, 22 Mar 2010 23:27:00 +0000

Today, on the recommendation of one of the readers of this blog, I decided to install jbig2enc to see how it might be useful for my digitization project. Unfortunately, it didn't seem to compile out of the box:

[psy@sable:~/src/agl-jbig2enc-edebc5a]$ make
g++ -c jbig2enc.cc -I../leptonlib-1.64/src -Wall -I/usr/include -L/usr/lib -O3 
g++ -c jbig2arith.cc -I../leptonlib-1.64/src -Wall -I/usr/include -L/usr/lib -O3 
g++ -c jbig2sym.cc -DUSE_EXT -I../leptonlib-1.64/src -Wall -I/usr/include -L/usr/lib -O3 
ar -rcv libjbig2enc.a jbig2enc.o jbig2arith.o jbig2sym.o
a - jbig2enc.o
a - jbig2arith.o
a - jbig2sym.o
g++ -o jbig2 jbig2.cc -L. -ljbig2enc ../leptonlib-1.64/src/liblept.a -I../leptonlib-1.64/src -Wall -I/usr/include -L/usr/lib -O3  -lpng -ljpeg -ltiff -lm
../leptonlib-1.64/src/liblept.a(gifio.o): In function `pixWriteStreamGif':
gifio.c:(.text+0x15c): undefined reference to `MakeMapObject'
gifio.c:(.text+0x21d): undefined reference to `EGifOpenFileHandle'
gifio.c:(.text+0x258): undefined reference to `EGifPutScreenDesc'
gifio.c:(.text+0x26f): undefined reference to `FreeMapObject'
gifio.c:(.text+0x277): undefined reference to `EGifCloseFile'
gifio.c:(.text+0x2b0): undefined reference to `FreeMapObject'
gifio.c:(.text+0x2df): undefined reference to `FreeMapObject'
gifio.c:(.text+0x2ff): undefined reference to `EGifPutImageDesc'
gifio.c:(.text+0x31a): undefined reference to `EGifCloseFile'
gifio.c:(.text+0x502): undefined reference to `EGifPutLine'
gifio.c:(.text+0x537): undefined reference to `EGifPutComment'
gifio.c:(.text+0x569): undefined reference to `EGifCloseFile'
gifio.c:(.text+0x582): undefined reference to `FreeMapObject'
gifio.c:(.text+0x5e0): undefined reference to `EGifCloseFile'
gifio.c:(.text+0x653): undefined reference to `EGifCloseFile'
gifio.c:(.text+0x682): undefined reference to `EGifCloseFile'
../leptonlib-1.64/src/liblept.a(gifio.o): In function `pixReadStreamGif':
gifio.c:(.text+0x6ce): undefined reference to `DGifOpenFileHandle'
gifio.c:(.text+0x6e6): undefined reference to `DGifSlurp'
gifio.c:(.text+0x878): undefined reference to `DGifCloseFile'
gifio.c:(.text+0x884): undefined reference to `DGifCloseFile'
gifio.c:(.text+0x9bc): undefined reference to `DGifCloseFile'
gifio.c:(.text+0xa2c): undefined reference to `DGifCloseFile'
gifio.c:(.text+0xa55): undefined reference to `DGifCloseFile'
../leptonlib-1.64/src/liblept.a(gifio.o):gifio.c:(.text+0xa86): more undefined references to `DGifCloseFile' follow
collect2: ld returned 1 exit status
make: *** [jbig2] Error 1
[psy@sable:~/src/agl-jbig2enc-edebc5a]$

It seems there were a few things wrong with the Makefile. The showstopper was that Leptonica, the library upon which jbig2enc depends, is expecting to link to giflib, but the Makefile doesn't specify this library. This was solved by adding -lgif to the command which compiles jbig2. The other problems were not fatal but somewhat irritating:

it is assumed that Leptonica isn't installed in a standard location;
there are no install and uninstall targets for (un)installing the package;
the program is written in C++, but the compiler is invoked with a redefined $(CC) rather than the standard $(CXX);
ar is invoked directly rather than through the standard $(AR); and
the clean target uses wildcards somewhat dangerously.

So here's an updated Makefile for jbig2enc 0.27. It should work with little or no modification on most *nix systems. On 64-bit systems which use lib64 directories, the libdir variable should be changed appropriately, or else it should be overridden on the command line.

# Improved Makefile for jbig2enc by Tristan Miller, 2010-03-22

prefix=/usr/local
exec_prefix=$(prefix)
bindir=$(exec_prefix)/bin
libdir=$(exec_prefix)/lib
CFLAGS=-I/usr/local/include/liblept -I/usr/include/liblept -Wall -O3 ${EXTRA}

jbig2: libjbig2enc.a jbig2.cc
	$(CXX) -o jbig2 jbig2.cc -L. -ljbig2enc $(CFLAGS) -lpng -ljpeg -ltiff -lm -llept -lgif

libjbig2enc.a: jbig2enc.o jbig2arith.o jbig2sym.o
	$(AR) -rcv libjbig2enc.a jbig2enc.o jbig2arith.o jbig2sym.o

jbig2enc.o: jbig2enc.cc jbig2arith.h jbig2sym.h jbig2structs.h jbig2segments.h
	$(CXX) -c jbig2enc.cc $(CFLAGS)

jbig2arith.o: jbig2arith.cc jbig2arith.h
	$(CXX) -c jbig2arith.cc $(CFLAGS)

jbig2sym.o: jbig2sym.cc jbig2arith.h
	$(CXX) -c jbig2sym.cc -DUSE_EXT $(CFLAGS)

clean:
	rm -f jbig2enc.o jbig2arith.o jbig2sym.o jbig2 libjbig2enc.a

install:
	install -s jbig2 $(bindir)
	install pdf.py $(bindir)
	install -s libjbig2enc.a $(libdir)

uninstall:
	rm $(bindir)/jbig2
	rm $(bindir)/pdf.py
	rm $(libdir)/libjbig2enc.a

Manual cropping

noreply@blogger.com (Tristan Miller) — Sun, 21 Mar 2010 18:10:00 +0000

This afternoon I used GIMP to find cropping coordinates for the 18 pages my autocrop program didn't successfully process. Having passed these to jpegtran, I'm now in possession of 13 020 properly cropped JPEG images, of which 11 164 are unique pages of the Socialist Standard (and the 1967 supplement) and the remaining 1856 are blank pages, microfiche title slides, indices, or duplicates.

Cropping the images and discarding the irrelevant pages has brought the size of the corpus down from 17.58 GB to 15.13 GB, a saving of 13.94%. Of course, if LSE had properly scanned them as high-resolution bilevel images rather than JPEGs in the first place, the size would have been about a third of this. I am wondering if there is some way to convert the JPEGs to bilevel images, but given the relatively poor quality of the photographs and low resolution of the scans, this may not be possible. I'll have a go at batch-converting them with ImageMagick and examine the results, but I am not optimistic that they will be acceptable.

At any rate, the next step will be to assemble the individual pages into PDFs or DjVus, one issue per file. I shall have to look around to see what software is available for this. The only one I'm aware of is the pdfpages package for pdfTeX, though I'm sure there are others more suitable for my task.

Autocrop

noreply@blogger.com (Tristan Miller) — Sat, 20 Mar 2010 17:26:00 +0000

I have solved the problem of cropping the LSE images.

First, a quick recap: The microfiche scans of the Socialist Standard from the London School of Economics Library were provided as 6510 DCT images embedded into 69 PDF files. The images are unsuitable for use as-is for several reasons. First, each image depicts a spread of two physical pages—unless one has a particularly enormous, high-resolution monitor, it's not possible to read the text without doing a lot of tiresome scrolling. Second, the images are uncropped photographs of bound volumes of the Standard; they include a very thick and uneven black margin all around the page spread, which besides being ugly also reduces the resolution of the text when the images are displayed in a viewer at full width or height. Third, LSE has unhelpfully tacked a rather garish ex libris banner at the bottom of each page. You can see a scaled-down copy of one of these DCT images below.

My task, then, is to crop the DCT images in such a way as to remove the black border and banner, and then to cut the image down the middle to isolate the two physical pages. I was afraid that, since the width of the DCT image and the position of the page spread therein vary from image to image, I would have to do the cropping manually. Assuming it takes two minutes to crop an image manually, it would have taken about 217 hours to do the entire microfiche collection.

Fortunately, I was able to devise an image processing algorithm, realized in the libjpeg-based C program below, which suggests the cropping region automatically. It examines successive rows from the top of the image and calculates their average brightness; once it discovers a row with a brightness above a certain threshold, it has found the upper crop line. It finds the bottom crop line similarly, but this time working upwards from just before the LSE banner. The left and right crop lines are handled similarly, except that the algorithm examines columns instead of rows, working inwards from the left and right edges. The cropping region is then passed to jpegtran for lossless cropping, as shown in the shell script which follows.

#include <stdio.h>
#include <stdlib.h>
#include <jpeglib.h>

#define THRESHOLD 15
#define MIN_X 65
#define MIN_Y 5
#define MAX_Y 2850

int main(int argc, char *argv[]) {

  struct jpeg_error_mgr jerr;
  struct jpeg_decompress_struct cinfo;
  FILE *infile;
  JSAMPARRAY buffer;
  int arg = 0;
  size_t row_stride;
  long x, y, top = 0 , bottom = 0, left = 0, right = 0;
  unsigned long v;

  /* Print usage information */
  if (argc <= 1) {
    fputs("Usage: autocrop file.jpg ...\n", stderr);
    return EXIT_FAILURE;
  }

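  /* For each filename on the command line */
  while (++arg < argc) {

    /* Reset the crop lines so that a failed search can't reuse
       the values found for the previous file */
    top = bottom = left = right = 0;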

    /* Open the file */
    if ((infile = fopen(argv[arg], "rb")) == NULL) {
      fprintf(stderr, "can't open %s\n", argv[arg]);
      return EXIT_FAILURE;
    }

    /* Initialize JPEG decompression */
    cinfo.err = jpeg_std_error(&jerr);
    jpeg_create_decompress(&cinfo);
    jpeg_stdio_src(&cinfo, infile);
    (void) jpeg_read_header(&cinfo, TRUE);
    (void) jpeg_start_decompress(&cinfo);
    row_stride = cinfo.output_width * cinfo.output_components;

    /* Slurp JPEG into memory */
    buffer = (*cinfo.mem->alloc_sarray)
      ((j_common_ptr) &cinfo, JPOOL_IMAGE, row_stride, cinfo.output_height); 
    if (buffer == NULL) {
      fprintf(stderr, "autocrop: out of memory\n");
      return EXIT_FAILURE;
    }
    /* Never ask for more rows than remain in the buffer */
    while (cinfo.output_scanline < cinfo.output_height)
      jpeg_read_scanlines(&cinfo, &buffer[cinfo.output_scanline],
                          cinfo.output_height - cinfo.output_scanline);

    /* Find top crop */
    for (y = MIN_Y; y <= MAX_Y; y++) {
      v = 0;
      for (x = MIN_X * cinfo.output_components; x < row_stride; x++)
        v += buffer[y][x];
      if (v / row_stride > THRESHOLD) {
        top = y;
        break;
      }
    }

    /* Find bottom crop */
    for (y = MAX_Y; y >= MIN_Y; y--) {
      v = 0;
      for (x = MIN_X * cinfo.output_components; x < row_stride; x++)
        v += buffer[y][x];
      if (v / row_stride > THRESHOLD) {
        bottom = y;
        break;
      }
    }

    /* Find left crop */
    for (x = MIN_X * cinfo.output_components; x < row_stride; x++) {
      v = 0;
      for (y = MIN_Y; y <= MAX_Y; y++)
        v += buffer[y][x];
      if (v / row_stride > THRESHOLD) {
        left = x / cinfo.output_components;
        break;
      }
    }

    /* Find right crop */
    for (x = row_stride - 1; x >= MIN_X * cinfo.output_components; x--) {
      v = 0;
      for (y = MIN_Y; y <= MAX_Y; y++)
        v += buffer[y][x];
      if (v / row_stride > THRESHOLD) {
        right = x / cinfo.output_components;
        break;
      }
    }

    /* Print the crop width, height, and upper left coordinates */
    printf("%s\t%ld\t%ld\t%ld\t%ld\n", argv[arg], 
           right - left, bottom - top, left, top);

    /* Clean up */
    (void) jpeg_finish_decompress(&cinfo);
    jpeg_destroy_decompress(&cinfo);
    fclose(infile);
  }

  return EXIT_SUCCESS;
}

# Crop each two-page spread into a left-hand (a) and a right-hand (b) page
for p in */*.jpg; do
    w=-1
    pbase=$(basename "$p" .jpg)
    pdir=$(dirname "$p")
    # Left-hand page
    if [ ! -e "../LSE_JPEG_cropped/$pdir/${pbase}a.jpg" ]; then
        read -r filename w h x y < <(../bin/autocrop "$p")
        echo jpegtran -grayscale -crop "$((w / 2))x$h+$x+$y" "$p"
        jpegtran -grayscale -crop "$((w / 2))x$h+$x+$y" "$p" \
            > "../LSE_JPEG_cropped/$pdir/${pbase}a.jpg"
    fi
    # Right-hand page; autocrop is run only if it wasn't already run above
    if [ ! -e "../LSE_JPEG_cropped/$pdir/${pbase}b.jpg" ]; then
        if [ "$w" -eq -1 ]; then
            read -r filename w h x y < <(../bin/autocrop "$p")
        fi
        echo jpegtran -grayscale -crop "$((w / 2))x$h+$((x + w / 2))+$y" "$p"
        jpegtran -grayscale -crop "$((w / 2))x$h+$((x + w / 2))+$y" "$p" \
            > "../LSE_JPEG_cropped/$pdir/${pbase}b.jpg"
    fi
done
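The geometry arithmetic in the script above can be isolated and tried without any image files. The sketch below mirrors that arithmetic; the function name and the sample numbers are made up for illustration and are not measurements from the LSE scans.

```sh
# Sketch: turn one autocrop result (width, height, x, y of the whole spread)
# into the two jpegtran -crop geometries, left-hand page first.
# page_geometries is a hypothetical helper name.
page_geometries() {
    w=$1 h=$2 x=$3 y=$4
    half=$((w / 2))
    echo "${half}x${h}+${x}+${y}"
    echo "${half}x${h}+$((x + half))+${y}"
}

# Made-up example: a 3400x2700 spread whose top-left corner is at (420, 110).
page_geometries 3400 2700 420 110
```

Note that the division is integer division, so a spread with an odd width loses its last pixel column.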

This process is remarkably effective for these images. Below you can see how it properly cropped the page spread shown above into two separate pages.

Out of the 11 168 Socialist Standard pages, the autocrop algorithm cropped only 18 of them incorrectly, giving an error rate of just 0.161%. Of these 18 failures, 7 were due to the page being overly skewed, 10 were due to a particularly dark cover image, and 1 was due to noise in the bottom margin. Below are a couple of examples of improperly cropped images. As there are only 18 of them, I don't mind redoing these manually.

Page inventory

noreply@blogger.com (Tristan Miller) — Fri, 19 Mar 2010 20:17:00 +0000

I've completed an inventory of the pages in the 69 LSE PDFs. That is, for each page in the PDF (or more specifically, for each of the two logical pages on every physical page), I noted whether it was a microfiche title slide, a regular Socialist Standard page, a blank page, or something else. Creating such an inventory was necessary so that I can later split up the pages by issue; I need to know where each issue starts and ends within the PDF.

Below is a chart showing the 13 020 logical pages as they appear in sequence in each PDF. The pages have been colour-coded as follows: white, a blank page; grey, a microfiche title slide; green, a regular Socialist Standard page; red, a "special supplement" that appears to have been distributed with the August or September 1967 issue; and blue, the official Socialist Standard index which was included in some bound volumes.

As can be seen, the length of the Standard varies a great deal, sometimes even within a single month (often due to special anniversary issues, but sometimes for no apparent reason). Indices were included only from 1940–1952 and 1969–1971. Every PDF except for 1915 starts with a title page, and further title pages appear in the middle of the 1922, 1927, and 1933 volumes. The January issues for 1968, 1969, and 1971 are missing their covers; for the first two I will have to contact the Party to get a physical copy to scan myself. Finally, the December 1966 issue appears twice.

Page and image size analysis

noreply@blogger.com (Tristan Miller) — Wed, 17 Mar 2010 15:28:00 +0000

I have just discovered two anomalies regarding the page and image sizes; it remains to be seen how they will affect the cropping task.

The Socialist Standard has used at least five different page sizes throughout its print run. However, as shown in the table below, the page dimension ratios don't seem to correspond with those of the scanned versions in the LSE PDFs.

period	physical dimensions (mm)	physical ratio	scanned dimensions (px)	scanned ratio	apparent horizontal DPI	apparent vertical DPI
1904-09 – 1918-08	242 × 460	0.526	1715 × 2670	0.642	180	147
1918-09 – 1932-08	180 × 239	0.753	1815 × 2555	0.710	256	272
1932-09 – 1950-12	208 × 276	0.754	1965 × 2625	0.749	240	242
1951-01 – 1969-12	212 × 270	0.785	1715 × 2625	0.653	205	247
1970-01 – 1972-12	210 × 297	0.707	1725 × 2825	0.611	209	242
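The apparent DPI figures in the table follow from dividing the pixel dimensions by the physical dimensions converted to inches (25.4 mm per inch). Here is a short sketch that recomputes them for two of the rows; the function name is an assumption made for illustration.

```sh
# Sketch: recompute the "apparent DPI" columns from the page sizes.
# dpi = pixels / (millimetres / 25.4); apparent_dpi is a hypothetical helper.
apparent_dpi() {
    # $1 = width in mm, $2 = height in mm, $3 = width in px, $4 = height in px
    awk -v wmm="$1" -v hmm="$2" -v wpx="$3" -v hpx="$4" 'BEGIN {
        printf "%.0f\t%.0f\n", wpx / (wmm / 25.4), hpx / (hmm / 25.4)
    }'
}

apparent_dpi 242 460 1715 2670   # 1904-09 to 1918-08
apparent_dpi 212 270 1715 2625   # 1951-01 to 1969-12
```

If the physical and scanned aspect ratios agreed, the two numbers printed for a row would be equal; how far apart they are is a direct measure of the discrepancy.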

I'm at a loss as to what might account for the discrepancy, especially since the scanned aspect ratio is sometimes greater and sometimes less than that of the original. Keeping in mind that the microfiche was produced from photographs of bound collections of the Socialist Standard, here are some possible causes:

The Standard's sheets may have been cropped to a different size for binding.
The Standard may have been reprinted on paper of a different size for binding.
Portions of the physical sheets were obscured during photography (for instance, to hold the book open and in place for the camera), resulting in a cropped photo.
The paper is much wider than it appears in the two-dimensional photographs due to the binding gutter.
The horizontal and vertical DPI settings used for scanning the microfiche were not equal.

Not only are the page ratios different, but the dimensions of the entire scans (including the margins around the book and the LSE banner at the bottom) vary inexplicably. The height is always 3102 pixels but, as can be seen in the graph below, the width varies from 3405 to 4263 pixels. There is no obvious reason for this.

To automatically extract the image dimensions, I wrote the following C program using libjpeg:

#include <stdio.h>
#include <stdlib.h>
#include <jpeglib.h>

int main(int argc, char *argv[]) {

  struct jpeg_error_mgr jerr;
  struct jpeg_decompress_struct cinfo;
  FILE *infile;
  int arg = 0, status = EXIT_SUCCESS;

  /* Print usage information */
  if (argc <= 1) {
    fputs("Usage: jpegdims file.jpg ...\n", stderr);
    return EXIT_FAILURE;
  }

  /* For each filename on the command line */
  while (++arg < argc) {

    /* Open the file */
    if ((infile = fopen(argv[arg], "rb")) == NULL) {
      fprintf(stderr, "jpegdims: can't open %s\n", argv[arg]);
      status = EXIT_FAILURE;
      continue;
    }

    /* Initialize JPEG decompression */
    cinfo.err = jpeg_std_error(&jerr);
    jpeg_create_decompress(&cinfo);
    jpeg_stdio_src(&cinfo, infile);
    (void) jpeg_read_header(&cinfo, TRUE);
    (void) jpeg_start_decompress(&cinfo);

    /* Both dimensions are JDIMENSION (unsigned int), so use %u for each */
    printf("%7u\t%7u\t%s\n", cinfo.output_width,
                             cinfo.output_height, argv[arg]);

    /* Clean up */
    jpeg_destroy_decompress(&cinfo);
    fclose(infile);
  }

  return status;
}