15 March 2010

The state of things

My project to digitize the Standard began in 2005. On 25 and 26 December of that year, Norbert Sanden and I, who possessed all issues dating back to February 1970, manually scanned them on a pair of high-end Ricoh office MFPs in Kaiserslautern. To save scanning time, and to avoid destroying the originals, we scanned two-page spreads into multi-page black-and-white TIFFs, usually one issue per file. This means that in these scans, all the pages are in order, except for the front and back pages. Additionally, from about April 1986 onwards, the cover sheet (including the front and back cover) was printed in spot colour (black plus one or two coloured inks), so the covers for these issues were scanned separately as colour JPEGs. In total I have 2.3 GB in 1-bit 600 dpi TIFF files, and a further 2.6 GB in 300 dpi JPEGs.

I moved to London in January 2006 and, with access to the Party's archives, intended to continue manually scanning issues back to 1904. The Party informed me that the 1904 to 1972 issues of the Standard had been microfiched by a third party, and that it might be easier to arrange for scanning directly from the microfiche. I set about calling various libraries to see if I could obtain a copy of the microfiche, and various document archiving companies to inquire about microfiche scanning costs. The first library I called, that of the London School of Economics, informed me that they were already in the process of scanning much of their microfilm and microfiche collections, and offered to include the Standard in this project and provide the Party with a copy for a small fee (about £20). I agreed, since this was far cheaper than arranging for the scanning through a company, and far less effort than manually scanning the printed issues. However, the scanning project at LSE proceeded very slowly, and the scans weren't made available to me until several years later: namely, a few weeks ago.

I just got around to looking at the LSE scans yesterday. They're on 6 DVD-ROMs, and comprise 18 GB in 69 PDF files, one for each year from 1904 to 1972. The PDFs consist of two-page spreads scanned at 200 dpi greyscale; unlike the images I scanned, the front and back covers are in the correct order. The scans have not been OCR'd, and I haven't yet determined how the PDFs encode the image data; possibly it is JPEG. Each file includes a title page identifying the year of the archive, and some also include the index published in bound volumes of the Standard. All pages also include a banner at the bottom with the text, "London School of Economics & Political Science 2007 / Socialist Standard xxxx", where xxxx is the year.
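One rough way to check how the images are encoded, without installing a PDF library, is to count the stream filter names that appear in the raw bytes of a file. The sketch below is only a heuristic, and the function names are my own. It assumes the filter names are written out literally in the stream dictionaries. `FlateDecode` is also used for non-image streams, and a name could in principle turn up by chance inside compressed data, so the counts are a hint rather than proof. A file dominated by `DCTDecode` would suggest JPEG.

```python
import re
from collections import Counter

# Standard PDF stream filter names. DCTDecode means JPEG, JPXDecode means
# JPEG 2000, and CCITTFaxDecode / JBIG2Decode are bilevel (1-bit) codecs.
_FILTER_NAME = re.compile(
    rb"/(DCTDecode|JPXDecode|CCITTFaxDecode|JBIG2Decode|FlateDecode|LZWDecode)\b"
)


def count_filter_names(pdf_bytes: bytes) -> Counter:
    """Count literal occurrences of each known filter name in raw PDF bytes."""
    counts = Counter()
    for match in _FILTER_NAME.finditer(pdf_bytes):
        counts[match.group(1).decode("ascii")] += 1
    return counts


def count_filter_names_in_file(path: str) -> Counter:
    """Read a whole PDF into memory and count its filter names.

    Each yearly file is large, so this assumes enough RAM to hold one file.
    """
    with open(path, "rb") as handle:
        return count_filter_names(handle.read())
```

Calling `count_filter_names_in_file` on one of the yearly PDFs and looking at which name dominates should be enough to decide which extraction approach to try first.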

Issues from 1998 onwards have been typeset digitally and are available as PDFs. I should be able to obtain these directly from the Party's Socialist Standard production team. The Party also has a basic electronic index for the Standard (probably including only title and author data, but possibly also subjects) which I hope to obtain later. The index wasn't professionally produced, so I doubt it will be of much use when it comes to looking for specific subjects. Since making a proper subject index would be a tremendous undertaking, I hope that OCR plus full text search will serve as a reasonable substitute for the time being.

The next step will be to determine how best to crop the LSE scans such that there is a single physical page per image and no LSE footer. This will be the subject of an upcoming post.
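As a starting point, the cropping can be thought of as computing two rectangles per spread: everything above the footer, cut at the gutter. The sketch below only does the arithmetic. The footer height and gutter position are hypothetical parameters that would have to be measured from the actual scans, and it assumes the spread is not skewed and the gutter is a straight vertical line.

```python
from typing import NamedTuple, Optional, Tuple


class Box(NamedTuple):
    """A crop rectangle in pixels, ordered (left, top, right, bottom)."""
    left: int
    top: int
    right: int
    bottom: int


def split_spread(
    width: int,
    height: int,
    footer_px: int,
    gutter_x: Optional[int] = None,
) -> Tuple[Box, Box]:
    """Return crop boxes for the left and right pages of a two-page spread.

    footer_px is the height of the banner to discard at the bottom.
    gutter_x is the x position of the binding; it defaults to the midpoint.
    """
    if not 0 <= footer_px < height:
        raise ValueError("footer_px must be between 0 and the image height")
    if gutter_x is None:
        gutter_x = width // 2
    if not 0 < gutter_x < width:
        raise ValueError("gutter_x must fall inside the image")
    bottom = height - footer_px
    left_page = Box(0, 0, gutter_x, bottom)
    right_page = Box(gutter_x, 0, width, bottom)
    return left_page, right_page
```

The boxes use the same (left, upper, right, lower) ordering that Pillow's `Image.crop` accepts, so they could be passed to it directly once the page images have been extracted from the PDFs.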

7 comments:

  1. A few questions, Tristan.
    What about the 1972 to 1998 issues? [You mention "all issues dating back to February 1970" being manually scanned.]

    Will it be easy to cut and paste text from it, rather than an image where that cannot be done?

    Will it be accessible to a search engine? I think you say it will ["I hope that OCR plus full text search will serve as a reasonable substitute for the time being"].

    Good luck

  2. Thanks for your comments, Alan. The first order of business is to clean up the 1904–1972 issues scanned from microfiche, since they're already in the correct order and have fewer special cases (e.g., there are no colour pages). The 1970 through 1997 issues that Norbert and I scanned manually will be done next, and will take a bit more work to put the pages in order. For both collections, I hope to be able to crop and cut all the images programmatically rather than manually. (That is, I'll write some sort of script to trim the pages to the correct size, and cut the two-page spreads down the centre binding.)

    As for search engine accessibility, the online version should eventually be indexable by Google and other search engines once the documents are OCR'd. However, that is probably a large undertaking I'll reserve for a future edition. The initial release will just be bitmap images of the pages, plus a basic title index.

  3. This is a tremendous project! Well done to you and Norbert. Here's hoping it proceeds without too many frustrations for you.
    Cheers,
    Robert Whitfield

  4. Great stuff, Trist and Norbert.

    Will you be making details of this blog known on various other sites/forums?

  5. Thanks, Whichfinder. I don't really know which other sites and forums would be interested in this blog. (Keep in mind it's more of a technical than a political blog, though of course anyone interested in tracking the progress of the Standard archive may also be interested.) If you know of any, feel free to publicize this blog there.

  6. ISTR that the Access database of SS articles does have basic subject indexing, and of course there is the dead tree index at HO...

  7. This is really great news. Many thanks for your hard work on the project. Darren and I are volunteers at the Marxists Internet Archive, so if there is some way to also upload these back issues to that site, or to link from it to our website, it might be a good way to let more people know about it.
    Anyway, best of luck with what seems a tricky project.

    YfS,
    Mike Schauerte
