29 September 2011

January 1968 and 1969

I first reported in March 2010 that the microfiche scans provided by LSE are missing the cover pages for the January 1968 and 1969 issues. Before I left London I visited the Socialist Party of Great Britain headquarters and picked up a copy of the January 1969 issue. The archive had no more loose copies of the January 1968 issue, so I had to photocopy its cover from a bound volume. Unfortunately, this meant that the page was cropped along the binding edge, but it's better than no page at all.

I had also mentioned in March 2010 that the aspect ratio of the actual printed issues doesn't seem to correspond to that of the LSE scans. This is something I had forgotten about when I finally scanned these issues in yesterday, and it caused me no end of confusion when I saw the discrepancy between what I had scanned myself and what all the other scans looked like. Below are two images for comparison: the left shows my scan of the January 1969 issue cover, and the right is the same image stretched to fit the aspect ratio of the LSE scans.

Since all the other scans are distorted in this manner, I'm faced with the choice of similarly distorting my scans to match them, or else non-uniformly scaling all the LSE scans to match the proportions of the physical page. Much as I would like my archive to be as faithful as possible a copy of the printed issues, I am going to have to go with the first option, at least for now. The LSE scans aren't at a high enough resolution to withstand much image manipulation, and besides that, altering them would necessitate another careful inspection of each page to ensure that there are no further unpaper anomalies.
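To make the first option concrete, here is a minimal sketch of the arithmetic involved in stretching one of my own scans to the aspect ratio of the LSE scans. The function names are hypothetical and the numbers in the usage note are made up for illustration; the real ratios would have to be measured from the scans themselves.

```python
def stretch_to_aspect(width, height, target_aspect):
    """Return (new_width, height) so that new_width / height matches
    target_aspect. Only the width changes, which means the image is
    deliberately distorted rather than cropped or padded."""
    if width <= 0 or height <= 0 or target_aspect <= 0:
        raise ValueError("dimensions and aspect ratio must be positive")
    return round(height * target_aspect), height


def horizontal_scale(width, height, target_aspect):
    """Return the factor by which the width must be multiplied."""
    if width <= 0 or height <= 0 or target_aspect <= 0:
        raise ValueError("dimensions and aspect ratio must be positive")
    return (height * target_aspect) / width
```

For example, a hypothetical 1000×1500 pixel scan stretched to a width-to-height ratio of 0.8 would come out 1200 pixels wide, a horizontal scale factor of 1.2. Going the other way (the second option) would mean applying the inverse factor to every LSE scan.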

28 September 2011

jbig2enc seeks a new maintainer

I just received the following message from Adam Langley, maintainer of jbig2enc:
"I wrote jbig2enc many years ago and I'm aware that it's been very useful to some people, for which I'm glad. But I'm afraid that I simply don't have the time to maintain it any more. Please feel free to fork it in the usual open source style. If you think you have a sufficiently well maintained fork, let me know and I'll start directing people to it."
As someone who has found jbig2enc quite valuable, I just want to say thanks to Adam, and hope that someone reading this blog might decide to take over the good work he's been doing!

25 September 2011

Upsampling results

In June I thought I had completed cropping, but upon reviewing the covers I spotted a further three with problems. It's possible that some of the inner pages were likewise overlooked, though I'll leave that rather tedious inspection for later (or perhaps to eagle-eyed volunteers, if I can get any!)…

Also in June I indicated that I had used jbig2enc to produce three PDFs for each of the issues I have up to the end of 1969: one with no upsampling, one with 2× upsampling, and one with 4× upsampling. The sizes of the three collections are as follows:

upsampling    size (MB)
none          427
2×            733
4×            1257
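As a quick check on these figures, the following sketch works out how much each level of upsampling inflates the collection and whether the result fits on a disc. It assumes the three sizes correspond to no, 2× and 4× upsampling in that order, and takes a single-layer DVD as holding a nominal 4700 MB; the names are hypothetical.

```python
# Collection sizes in MB, in the order: no, 2x, 4x upsampling (assumed).
SIZES_MB = {"none": 427, "2x": 733, "4x": 1257}

# Nominal capacity of a single-layer DVD, in decimal megabytes.
DVD_CAPACITY_MB = 4700


def growth_factor(sizes, key, baseline="none"):
    """Return how many times larger sizes[key] is than the baseline."""
    return sizes[key] / sizes[baseline]


def fits_on_disc(size_mb, capacity_mb=DVD_CAPACITY_MB):
    """Return True if a collection of size_mb fits in capacity_mb."""
    return size_mb <= capacity_mb


if __name__ == "__main__":
    for key, size in SIZES_MB.items():
        print(key, size, round(growth_factor(SIZES_MB, key), 2),
              fits_on_disc(size))
```

The file size grows far more slowly than the pixel count: 4× upsampling means sixteen times as many pixels per page but a collection not quite three times as large.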

So even at the greatest upsampling the collection is still easily small enough to fit on a DVD-ROM. However, from browsing the PDF covers, it seems a lot of the 4× upsampled issues are missing text (similar to how unpaper sometimes omits text on pages with low contrast or a close black border). I figure this could be fixed by playing with jbig2enc's threshold settings, though that could take several hours or days of experimentation and careful checking of the output.
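If I do get round to that experiment, it might look something like the sketch below, which only builds the command lines for a sweep of threshold values and does not run anything. The flags are as I understand them from jbig2enc's usage text (`-s` for symbol mode, `-p` for PDF output, `-2`/`-4` for upsampling, `-T` for the black-and-white threshold) and should be checked against the installed version; the file name and the threshold values are made up for illustration.

```python
def jbig2_command(pages, upsample, bw_threshold):
    """Build (but do not run) a jbig2 command line as a list of arguments.

    pages        -- iterable of input image file names
    upsample     -- 1 (no upsampling), 2 or 4
    bw_threshold -- value passed to -T (assumed to be the 1 bpp threshold)
    """
    cmd = ["jbig2", "-s", "-p"]
    if upsample in (2, 4):
        cmd.append(f"-{upsample}")
    elif upsample != 1:
        raise ValueError("upsample must be 1, 2 or 4")
    cmd += ["-T", str(bw_threshold)]
    cmd += list(pages)
    return cmd


def threshold_sweep(pages, upsample, thresholds):
    """Return one command line per threshold value to try."""
    return [jbig2_command(pages, upsample, t) for t in thresholds]


if __name__ == "__main__":
    # Hypothetical input file and threshold range.
    for cmd in threshold_sweep(["cover.png"], 4, range(160, 221, 20)):
        print(" ".join(cmd))
```

Each resulting command could then be handed to `subprocess.run` and the output inspected by eye for missing text, which is the slow part.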

The PDFs with 2× upsampling seem to be of good quality—much better than the non-upsampled ones, and not obviously worse than the 4× upsampled ones. And none of the covers had any obviously missing text; I hope this holds for the inside pages as well. I think, then, that it will be best to go with 2× upsampling. You can see some comparisons below.

No upsampling

2× upsampling

4× upsampling


My next steps will therefore be as follows:

  • Locate and scan the missing covers of the January 1968 and January 1969 issues. I got photocopies of these pages before I left London; I hope they didn't get lost in the move to Darmstadt!
  • Start investigating OCR software so that the final collection can have full-text search.
  • Start investigating DjVu to compare it with the PDFs.

A series of unfortunate events

This spring I accepted a new job in Darmstadt, with a starting date in July, so in late May I quit my old job in London, hoping to use the time (except for that spent moving) to finish the digitization of at least the 1904–1970 issues. Unfortunately, this plan was thwarted by a series of unfortunate events. June saw one legal, one medical, and one veterinary emergency, which together consumed all my available time. When I finally arrived in Darmstadt, it took nearly two months after I signed up for Internet access before the provider came to install the cables. And to top it all off, shortly after we got wired, my laptop became irreparably damaged, so I had to procure a new computer on short notice and transfer all my data onto it.

The new computer is all set up now, and the Internet connection is the fastest I've ever had. Here's a comparison of the specs of my old machine and the new one I'll be working with:

            sable (old machine)               ferret (new machine)
CPU         Intel Core2 Duo T8300 @ 2.4 GHz   AMD Athlon II X2 260 @ 3.2 GHz
RAM         4 GB DDR2                         4 GB DDR3
Hard disk   250 GB SATA                       500 GB SATA II
Display     39 cm TFT @ 1440×900 (WSXGA)      61 cm TFT @ 1920×1080 (1080p)
Graphics    Intel 965 GM                      AMD Radeon HD3000
OS          openSUSE 11.3                     openSUSE 11.4

So as you can see, the new machine is a modest improvement on the old one in every respect: it's got a faster CPU and faster memory, a larger and faster hard drive, a larger and higher-resolution display, a better graphics card (not that that matters much for the 2D imaging work this project is concerned with), and a newer operating system.

Because the machine's architecture is slightly different, and because it's a completely new install of the operating system, I'm going to have to recompile jbig2enc, Leptonica, and various support utilities I've written myself. After that I can get back to work, so watch this space for updates in the hopefully very near future…