<?xml version='1.0' encoding='UTF-8'?><rss xmlns:atom='http://www.w3.org/2005/Atom' xmlns:openSearch='http://a9.com/-/spec/opensearchrss/1.0/' xmlns:georss='http://www.georss.org/georss' xmlns:gd='http://schemas.google.com/g/2005' xmlns:thr='http://purl.org/syndication/thread/1.0' version='2.0'><channel><atom:id>tag:blogger.com,1999:blog-8183298601630944795</atom:id><lastBuildDate>Sat, 07 Apr 2012 03:56:54 +0000</lastBuildDate><category>ocrodjvu</category><category>ABBYY OCR</category><category>Tesseract</category><category>gwenview</category><category>leptonica</category><category>pdfjam</category><category>bugs</category><category>poppler</category><category>pdfocr</category><category>unpaper</category><category>pdftk-qgui</category><category>OCRFeeder</category><category>pdftex</category><category>pdfmod</category><category>ghostscript</category><category>OCR Shop XTR</category><category>jbig2enc</category><category>pdf</category><category>PNM</category><category>pdfpages</category><category>WatchOCR</category><category>gv</category><category>cropping</category><category>make</category><category>libjpeg</category><category>pspdftool</category><category>Ocrad</category><category>easy-ocr</category><category>Cuneiform</category><category>OpenOCR</category><category>kuickshow</category><category>autocrop</category><category>pdf2djvu</category><category>gimp</category><category>GOCR</category><category>qpdf</category><category>pstoedit</category><category>imagemagick</category><category>jpegtran</category><category>hOCR</category><category>poppler-tools</category><category>dolphin</category><category>okular</category><category>OCRopus</category><category>OCR</category><category>pdftk</category><category>djvulibre</category><category>hardware</category><title>Socialist Standard digitization blog</title><description>Documenting the project to create a complete digital archive of the &lt;em&gt;Socialist Standard&lt;/em&gt; 
(1904–).</description><link>http://ssdigit.nothingisreal.com/</link><managingEditor>noreply@blogger.com (Tristan Miller)</managingEditor><generator>Blogger</generator><openSearch:totalResults>23</openSearch:totalResults><openSearch:startIndex>1</openSearch:startIndex><openSearch:itemsPerPage>25</openSearch:itemsPerPage><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-1236702271863940409</guid><pubDate>Sun, 16 Oct 2011 20:55:00 +0000</pubDate><atom:updated>2011-10-16T22:55:16.765+02:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>pdfocr</category><category domain='http://www.blogger.com/atom/ns#'>OCRFeeder</category><category domain='http://www.blogger.com/atom/ns#'>OCR Shop XTR</category><category domain='http://www.blogger.com/atom/ns#'>hOCR</category><category domain='http://www.blogger.com/atom/ns#'>Ocrad</category><category domain='http://www.blogger.com/atom/ns#'>GOCR</category><category domain='http://www.blogger.com/atom/ns#'>ocrodjvu</category><category domain='http://www.blogger.com/atom/ns#'>Tesseract</category><category domain='http://www.blogger.com/atom/ns#'>easy-ocr</category><category domain='http://www.blogger.com/atom/ns#'>Cuneiform</category><category domain='http://www.blogger.com/atom/ns#'>OCR</category><category domain='http://www.blogger.com/atom/ns#'>WatchOCR</category><category domain='http://www.blogger.com/atom/ns#'>ABBYY OCR</category><category domain='http://www.blogger.com/atom/ns#'>OCRopus</category><category domain='http://www.blogger.com/atom/ns#'>OpenOCR</category><title>OCR on GNU/Linux: A survey</title><description>&lt;p&gt;Today was spent checking out options for optical character recognition (OCR) on GNU/Linux.  There are apparently the following basic engines for OCR:&lt;/p&gt;

&lt;table&gt;
&lt;tr&gt;&lt;th&gt;Engine&lt;/th&gt;&lt;th&gt;Version&lt;/th&gt;&lt;th&gt;Licence&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://en.openocr.org/"&gt;OpenOCR (Cuneiform)&lt;/a&gt;&lt;/td&gt;&lt;td&gt;1.0.0&lt;/td&gt;&lt;td&gt;BSD-style&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://jocr.sourceforge.net/"&gt;GOCR&lt;/a&gt;&lt;/td&gt;&lt;td&gt;0.49&lt;/td&gt;&lt;td&gt;GPLv2&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://www.gnu.org/software/ocrad/"&gt;Ocrad&lt;/a&gt;&lt;/td&gt;&lt;td&gt;0.20&lt;/td&gt;&lt;td&gt;GPLv3&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://code.google.com/p/tesseract-ocr/"&gt;Tesseract&lt;/a&gt;&lt;/td&gt;&lt;td&gt;2.04&lt;/td&gt;&lt;td&gt;Apache 2.0&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://www.ocr4linux.com/"&gt;ABBYY OCR&lt;/a&gt;&lt;/td&gt;&lt;td&gt;9.0&lt;/td&gt;&lt;td&gt;proprietary&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://www.vividata.com/be_xtr_specs.html"&gt;OCR Shop XTR&lt;/a&gt;&lt;/td&gt;&lt;td&gt;5.6&lt;/td&gt;&lt;td&gt;proprietary&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;In June of last year &lt;a href="http://www.splitbrain.org/personal"&gt;Andreas Gohr&lt;/a&gt; did a short experiment where he &lt;a href="http://www.splitbrain.org/blog/2010-06/15-linux_ocr_software_comparison" title="Linux OCR Software Comparison"&gt;compared the first five above-listed GNU/Linux OCR engines&lt;/a&gt; and found that ABBYY OCR had the highest accuracy, with 100% for proportionally spaced serif and sans-serif text; Tesseract was the best-performing &lt;a href="http://www.gnu.org/philosophy/free-sw.html"&gt;Free Software&lt;/a&gt; package, with accuracy in the 92–98% range.  GOCR and Ocrad were significantly worse, with accuracy as low as 76% and 82%, respectively.&lt;/p&gt;

&lt;p&gt;The non-proprietary engines usually just have bare-bones command-line interfaces with a very limited feature set.  There are a number of higher-level tools, often with graphical interfaces and allowing more sophisticated pre- and post-processing of the data.  These tools include the following:&lt;/p&gt;

&lt;table&gt;
&lt;tr&gt;&lt;th&gt;Front end&lt;/th&gt;&lt;th&gt;Version&lt;/th&gt;&lt;th&gt;Licence&lt;/th&gt;&lt;th&gt;Back ends&lt;/th&gt;&lt;th&gt;Notes&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://code.google.com/p/easy-ocr/"&gt;easy-ocr&lt;/a&gt;&lt;/td&gt;&lt;td&gt;3.4&lt;/td&gt;&lt;td&gt;BSD-style&lt;/td&gt;&lt;td&gt;Cuneiform, GOCR, Ocrad, OCRopus, Tesseract&lt;/td&gt;&lt;td&gt;apparently available only as a Debian binary package&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://sourceforge.net/projects/gimagereader/"&gt;gImageReader&lt;/a&gt;&lt;/td&gt;&lt;td&gt;0.9&lt;/td&gt;&lt;td&gt;GPLv3&lt;/td&gt;&lt;td&gt;Tesseract&lt;/td&gt;&lt;td&gt;—&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://live.gnome.org/OCRFeeder"&gt;OCRFeeder&lt;/a&gt;&lt;/td&gt;&lt;td&gt;0.6.6&lt;/td&gt;&lt;td&gt;GPLv3&lt;/td&gt;&lt;td&gt;GOCR, Ocrad, Tesseract&lt;/td&gt;&lt;td&gt;—&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://jwilk.net/software/ocrodjvu"&gt;ocrodjvu&lt;/a&gt;&lt;/td&gt;&lt;td&gt;0.7.5&lt;/td&gt;&lt;td&gt;GPLv2&lt;/td&gt;&lt;td&gt;Cuneiform, GOCR, Ocrad, OCRopus, Tesseract&lt;/td&gt;&lt;td&gt;—&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://code.google.com/p/ocropus/"&gt;OCRopus&lt;/a&gt;&lt;/td&gt;&lt;td&gt;0.4&lt;/td&gt;&lt;td&gt;Apache 2.0&lt;/td&gt;&lt;td&gt;Tesseract&lt;/td&gt;&lt;td&gt;—&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="https://github.com/gkovacs/pdfocr"&gt;pdfocr&lt;/a&gt;&lt;/td&gt;&lt;td&gt;0.1.2&lt;/td&gt;&lt;td&gt;BSD-style&lt;/td&gt;&lt;td&gt;Cuneiform&lt;/td&gt;&lt;td&gt;—&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;a href="http://www.watchocr.com/"&gt;WatchOCR&lt;/a&gt;&lt;/td&gt;&lt;td&gt;0.8&lt;/td&gt;&lt;td&gt;GPL&lt;/td&gt;&lt;td&gt;Cuneiform&lt;/td&gt;&lt;td&gt;available only as a Debian binary package or Knoppix LiveCD&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;Since it's my intention to use Free Software wherever possible for this project, I installed Tesseract and a GPLv3-licensed graphical front end, &lt;a href="http://sourceforge.net/projects/gimagereader/"&gt;gImageReader&lt;/a&gt;.  (Rather than compiling from source, I used &lt;a href="http://en.opensuse.org/User:Malcolmlewis" title="Malcolm Lewis"&gt;Malcolm Lewis&lt;/a&gt;'s &lt;a href="http://download.opensuse.org/repositories/home:/malcolmlewis:/Python/" title="Malcolm Lewis's openSUSE repository"&gt;openSUSE RPMs&lt;/a&gt;.  These RPMs fail to specify all their dependencies; they also require the python-imaging and python-enchant packages, which must be installed separately.)  I then tried processing a couple of pages from the &lt;em&gt;Standard&lt;/em&gt; as test runs:  the front page of the September 1904 issue, and the third page of the July 1961 issue.  The former is a relatively poor-quality scan, and the latter is quite clean and has a simple layout.  You can see the results in the screenshots below:  the text OCR'd from the 1904 issue is almost complete gibberish, whereas the text for the 1961 issue is mostly correct (though still with lots of mistakes).&lt;/p&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://1.bp.blogspot.com/-zUKKaVwoy_0/TptDsHunepI/AAAAAAAAAIk/ibPvt3FDDRI/s1600/gImageReader_1904-09.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="261" width="400" src="http://1.bp.blogspot.com/-zUKKaVwoy_0/TptDsHunepI/AAAAAAAAAIk/ibPvt3FDDRI/s400/gImageReader_1904-09.png" /&gt;&lt;br /&gt;gImageReader and the September 1904 &lt;em&gt;Standard&lt;/em&gt;&lt;/a&gt;&lt;/div&gt;

&lt;div class="separator" style="clear: both; text-align: center;"&gt;
&lt;a href="http://4.bp.blogspot.com/-zhL0Lmh7PQ8/TptDsUrZZ2I/AAAAAAAAAI0/mQJlxVaEaRo/s1600/gImageReader_1961-07.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="261" width="400" src="http://4.bp.blogspot.com/-zhL0Lmh7PQ8/TptDsUrZZ2I/AAAAAAAAAI0/mQJlxVaEaRo/s400/gImageReader_1961-07.png" /&gt;&lt;br /&gt;gImageReader and the July 1961 &lt;em&gt;Standard&lt;/em&gt;&lt;/a&gt;&lt;/div&gt;



&lt;p&gt;Unfortunately, the gImageReader interface produces only plain text as its output, which is useless for my purposes.  What I need is for there to be a mapping of selectable, searchable text to the position it appears at in the original scan.  Apparently, there is an open standard, &lt;a href="http://en.wikipedia.org/wiki/HOCR"&gt;hOCR&lt;/a&gt;, for representing text layout, recognition confidence, style, and other OCR information.  Tesseract, Cuneiform, and other OCR packages can output to hOCR.  The problem is that the hOCR file doesn't itself contain the original scanned image; for this you need some extra software to produce (say) a PDF which combines the text information and the original image.  Only then will you have a searchable PDF.&lt;/p&gt;
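
&lt;p&gt;To make the idea of mapping text to positions concrete, here is a minimal Python sketch of reading the layout properties that hOCR attaches to each recognized element.  It assumes the usual hOCR convention of a semicolon-separated property string of the form "bbox x0 y0 x1 y1; x_wconf N".  The function names and the sample values are made up for illustration and do not come from any of the tools listed above.&lt;/p&gt;

```python
def parse_hocr_title(title):
    """Split an hOCR property string into a dict of property name to value list."""
    props = {}
    for chunk in title.split(";"):
        parts = chunk.split()
        if not parts:
            continue  # skip empty fragments, e.g. after a trailing semicolon
        props[parts[0]] = parts[1:]
    return props


def bounding_box(title):
    """Return (x0, y0, x1, y1) in pixels, or None if there is no usable bbox."""
    values = parse_hocr_title(title).get("bbox")
    if values is None or len(values) != 4:
        return None
    return tuple(int(v) for v in values)


# Hypothetical property string for a single recognized word.
sample = "bbox 120 340 310 372; x_wconf 91"
print(bounding_box(sample))
```

&lt;p&gt;A tool that builds a searchable PDF would, roughly speaking, place each word invisibly at its bounding box on top of the page image.&lt;/p&gt;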

&lt;p&gt;It turns out that only some of the engines and front ends support hOCR, and of the Free Software front ends, only two of them add text layers to PDFs: WatchOCR and pdfocr.  The ocrodjvu wrapper produces DjVu files instead of PDFs.  My next task will therefore be to install and test WatchOCR, pdfocr, and ocrodjvu.  I may also try out some proprietary packages for purposes of comparison.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8183298601630944795-1236702271863940409?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2011/10/ocr-on-gnulinux-survey.html</link><author>noreply@blogger.com (Tristan Miller)</author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/-zUKKaVwoy_0/TptDsHunepI/AAAAAAAAAIk/ibPvt3FDDRI/s72-c/gImageReader_1904-09.png' height='72' width='72'/><thr:total>2</thr:total></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-5336752128687823536</guid><pubDate>Thu, 29 Sep 2011 19:06:00 +0000</pubDate><atom:updated>2011-09-29T21:06:00.082+02:00</atom:updated><title>January 1968 and 1969</title><description>&lt;p&gt;&lt;a href="/2010/03/page-inventory.html"&gt;I first reported in March 2010 that the microfiche scans provided by LSE are missing the cover pages for the January 1968 and 1969 issues.&lt;/a&gt; Before I left London I visited the Socialist Party of Great Britain headquarters and picked up a copy of the January 1969 issue. The archive had no more loose January 1968 issues, so I had to photocopy it from a bound volume. 
Unfortunately, this meant that the page was cropped along the binding edge, but it's better than no page at all.&lt;/p&gt;&lt;p&gt;&lt;a href="/2010/03/page-and-image-size-analysis.html"&gt;I had also mentioned in March 2010 that the aspect ratio of the actual printed issues doesn't seem to correspond to that of the LSE scans.&lt;/a&gt;  This is something I had forgotten about when I finally scanned these issues in yesterday, and it caused me no end of confusion when I saw the discrepancy between what I had scanned myself and what all the other scans looked like.  Below are two images for comparison:  the left shows my scan of the January 1969 issue cover, and the right is the same image stretched to fit the aspect ratio of the LSE scans.&lt;/p&gt;&lt;div class="separator" style="clear: both; text-align: center;"&gt;&lt;a href="http://2.bp.blogspot.com/-MHqUzP5nBZ0/ToTAk61sQxI/AAAAAAAAAIU/sBVOnemia80/s1600/1969-01_originalcover.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="300" width="228" src="http://2.bp.blogspot.com/-MHqUzP5nBZ0/ToTAk61sQxI/AAAAAAAAAIU/sBVOnemia80/s400/1969-01_originalcover.png" /&gt;&lt;/a&gt; &lt;a href="http://2.bp.blogspot.com/-m93hysrHZ5M/ToTAk8UEVsI/AAAAAAAAAIc/O07djHpR-KY/s1600/1969-01_stretchedcover.png" imageanchor="1" style="margin-left:1em; margin-right:1em"&gt;&lt;img border="0" height="300" width="192" src="http://2.bp.blogspot.com/-m93hysrHZ5M/ToTAk8UEVsI/AAAAAAAAAIc/O07djHpR-KY/s400/1969-01_stretchedcover.png" /&gt;&lt;/a&gt;&lt;/div&gt;&lt;p&gt;Since all the other scans are distorted in this manner, I'm faced with the choice of similarly distorting my scans to match them, or else disproportionately scaling all the LSE scans to match the physical page size.  Much as I would like my archive to be as faithful as possible a copy of the printed issues, I am going to have to go with the first option, at least for now.  
The LSE scans aren't at a high enough resolution to withstand too much image manipulation, and besides that altering them would necessitate another careful inspection of each page to ensure that there are no further unpaper anomalies.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8183298601630944795-5336752128687823536?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2011/09/january-1968-and-1969.html</link><author>noreply@blogger.com (Tristan Miller)</author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/-MHqUzP5nBZ0/ToTAk61sQxI/AAAAAAAAAIU/sBVOnemia80/s72-c/1969-01_originalcover.png' height='72' width='72'/><thr:total>0</thr:total></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-1703834609627291702</guid><pubDate>Wed, 28 Sep 2011 20:21:00 +0000</pubDate><atom:updated>2011-09-28T22:21:33.230+02:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>jbig2enc</category><title>jbig2enc seeks a new maintainer</title><description>I just received the following message from Adam Langley, maintainer of &lt;a href="https://github.com/agl/jbig2enc"&gt;jbig2enc&lt;/a&gt;:&lt;br /&gt;
&lt;blockquote&gt;
I wrote jbig2enc many years ago and I'm aware that it's been very useful to some people, for which I'm glad. But I'm afraid that I simply don't have the time to maintain it any more. Please feel free to fork it in the usual open source style. If you think you have a sufficiently well maintained fork, let me know and I'll start directing people to it.&lt;/blockquote&gt;
As someone who has found jbig2enc quite valuable, I just want to say thanks to Adam, and hope that someone reading this blog might decide to take over the good work he's been doing!&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8183298601630944795-1703834609627291702?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2011/09/jbig2enc-seeks-new-maintainer.html</link><author>noreply@blogger.com (Tristan Miller)</author><thr:total>0</thr:total></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-745316345506498839</guid><pubDate>Sun, 25 Sep 2011 18:33:00 +0000</pubDate><atom:updated>2011-09-25T20:33:54.826+02:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>jbig2enc</category><category domain='http://www.blogger.com/atom/ns#'>pdf</category><category domain='http://www.blogger.com/atom/ns#'>cropping</category><title>Upsampling results</title><description>&lt;a href="http://www.blogger.com/2011/06/cropping-complete.html"&gt;In June I thought I had completed cropping&lt;/a&gt;, but upon reviewing the covers I spotted a further three with problems.  It's possible that some of the inner pages were likewise overlooked, though I'll leave that rather tedious inspection for later (or perhaps to eagle-eyed volunteers, if I can get any!)…&lt;br /&gt;
&lt;br /&gt;
Also in June I indicated that I had used &lt;a href="http://github.com/agl/jbig2enc"&gt;jbig2enc&lt;/a&gt; to produce three PDFs for each of the issues I have up to the end of 1969: one with no upsampling, one with 2× upsampling, and one with 4× upsampling.  The sizes of the three collections are as follows:&lt;br /&gt;
&lt;br /&gt;
&lt;table&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;th&gt;upsampling&lt;/th&gt;&lt;th&gt;size (MB)&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;none&lt;/td&gt;&lt;td style="text-align: right;"&gt;427&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;2×&lt;/td&gt;&lt;td style="text-align: right;"&gt;733&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;4×&lt;/td&gt;&lt;td style="text-align: right;"&gt;1257&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;br /&gt;
So even at the greatest upsampling the collection is still easily small enough to fit on a DVD-ROM.  However, from browsing the PDF covers, it seems a lot of the 4× upsampled issues are missing text (similar to how &lt;a href="http://www.blogger.com/2010/03/final-results-with-unpaper.html"&gt;unpaper sometimes omits text on pages with low contrast or a close black border&lt;/a&gt;).  I figure this could be fixed by playing with jbig2enc's threshold settings, though that could take several hours or days of experimentation and careful checking of the output.&lt;br /&gt;
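&lt;br /&gt;
For a rough sense of scale, the following Python sketch works out how much each collection grows relative to the non-upsampled one, and what share of a disc it would fill.  The sizes come from the table above; the 4700 MB capacity is my assumption for a single-layer DVD, and the function names are purely illustrative.&lt;br /&gt;

```python
# Collection sizes in MB, taken from the table above.
sizes_mb = {"none": 427, "2x": 733, "4x": 1257}

# Assumption: a single-layer DVD holds roughly 4700 MB.
DVD_CAPACITY_MB = 4700


def growth_factor(label):
    """Size of a collection relative to the non-upsampled one."""
    return sizes_mb[label] / sizes_mb["none"]


def dvd_fraction(label):
    """Fraction of the assumed DVD capacity the collection would occupy."""
    return sizes_mb[label] / DVD_CAPACITY_MB


for label in sizes_mb:
    print(label, round(growth_factor(label), 2), round(dvd_fraction(label), 2))
```

If the upsampling factor applies to both dimensions, 4× upsampling yields sixteen times as many pixels, yet the collection is less than three times the size of the original.&lt;br /&gt;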
&lt;br /&gt;
The PDFs with 2× upsampling seem to be of good quality—much better than the non-upsampled ones, and not obviously worse than the 4× upsampled ones.  And none of the covers had any obviously missing text; I hope this holds for the inside pages as well.  I think, then, that it will be best to go with 2× upsampling.  You can see some comparisons below.&lt;br /&gt;
&lt;br /&gt;
&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-nVfgapRszmQ/Tn9zEb5KjTI/AAAAAAAAAIE/qj1DgKbD3zM/s1600/1905-07_1x.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-nVfgapRszmQ/Tn9zEb5KjTI/AAAAAAAAAIE/qj1DgKbD3zM/s1600/1905-07_1x.png" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;No upsampling&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;br /&gt;
&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://1.bp.blogspot.com/-K_M2Bh0yQak/Tn9zEzhuQmI/AAAAAAAAAII/_I6PBcuZvaw/s1600/1905-07_2x.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" src="http://1.bp.blogspot.com/-K_M2Bh0yQak/Tn9zEzhuQmI/AAAAAAAAAII/_I6PBcuZvaw/s1600/1905-07_2x.png" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;2× upsampling&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;br /&gt;
&lt;table align="center" cellpadding="0" cellspacing="0" class="tr-caption-container" style="margin-left: auto; margin-right: auto; text-align: center;"&gt;&lt;tbody&gt;
&lt;tr&gt;&lt;td style="text-align: center;"&gt;&lt;a href="http://3.bp.blogspot.com/-6VSvbwqw-MI/Tn9zFgW7YNI/AAAAAAAAAIM/Zn-4qPV8Jkc/s1600/1905-07_4x.png" imageanchor="1" style="margin-left: auto; margin-right: auto;"&gt;&lt;img border="0" src="http://3.bp.blogspot.com/-6VSvbwqw-MI/Tn9zFgW7YNI/AAAAAAAAAIM/Zn-4qPV8Jkc/s1600/1905-07_4x.png" /&gt;&lt;/a&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td class="tr-caption" style="text-align: center;"&gt;4× upsampling&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;&lt;/table&gt;
&lt;br /&gt;
&lt;br /&gt;
My next steps will therefore be as follows:&lt;br /&gt;
&lt;br /&gt;
&lt;ul&gt;
&lt;li&gt;Locate and scan the missing covers of the January 1968 and January 1969 issues.  I got photocopies of these pages before I left London; I hope they didn't get lost in the move to Darmstadt!&lt;/li&gt;
&lt;li&gt;Start investigating OCR software so that the final collection can have full-text search.&lt;/li&gt;
&lt;li&gt;Start investigating DjVu to compare it with the PDFs.&lt;/li&gt;
&lt;/ul&gt;
&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8183298601630944795-745316345506498839?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2011/09/upsampling-results.html</link><author>noreply@blogger.com (Tristan Miller)</author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/-nVfgapRszmQ/Tn9zEb5KjTI/AAAAAAAAAIE/qj1DgKbD3zM/s72-c/1905-07_1x.png' height='72' width='72'/><thr:total>0</thr:total></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-1065634896946815358</guid><pubDate>Sun, 25 Sep 2011 10:03:00 +0000</pubDate><atom:updated>2011-09-25T12:03:00.959+02:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>hardware</category><title>A series of unfortunate events</title><description>&lt;p&gt;This spring I accepted a new job in Darmstadt, with a starting date in July, so in late May I quit my old job in London, hoping to use the time (except for that spent moving) to finish the digitization of at least the 1904–1970 issues.  Unfortunately, this plan was thwarted by a series of unfortunate events.  June saw one legal, one medical, and one veterinary emergency, which together consumed all my available time.  When I finally arrived in Darmstadt, it took nearly two months after signing up for Internet access before they came to install the cables.  And to top it all off, shortly after we got wired, my laptop became irreparably damaged, so I had to procure a new computer on short notice and transfer all my data onto it.&lt;/p&gt;

&lt;p&gt;The new computer is all set up, now, and the Internet is the fastest I've ever had.  Here's a comparison of the specs of my old machine and the new one I'll be working with:&lt;/p&gt;

&lt;table&gt;
&lt;tr&gt;&lt;th&gt;&lt;/th&gt;&lt;th&gt;sable (old machine)&lt;/th&gt;&lt;th&gt;ferret (new machine)&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th&gt;CPU&lt;/th&gt;&lt;td&gt;Intel Core2 Duo T8300 @ 2.4&amp;thinsp;GHz&lt;/td&gt;&lt;td&gt;AMD Athlon II X2 260 @ 3.2&amp;thinsp;GHz&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th&gt;RAM&lt;/th&gt;&lt;td&gt;4&amp;thinsp;GB DDR2&lt;/td&gt;&lt;td&gt;4&amp;thinsp;GB DDR3&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th&gt;Hard disk&lt;/th&gt;&lt;td&gt;250&amp;thinsp;GB SATA&lt;/td&gt;&lt;td&gt;500&amp;thinsp;GB SATA II&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th&gt;Display&lt;/th&gt;&lt;td&gt;39&amp;thinsp;cm TFT @ 1440×900 (WXGA+)&lt;/td&gt;&lt;td&gt;61&amp;thinsp;cm TFT @ 1920×1080 (1080p)&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th&gt;Graphics&lt;/th&gt;&lt;td&gt;Intel 965 GM&lt;/td&gt;&lt;td&gt;AMD Radeon HD3000&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th&gt;OS&lt;/th&gt;&lt;td&gt;openSUSE 11.3&lt;/td&gt;&lt;td&gt;openSUSE 11.4&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;So as you can see, the new machine is a modest improvement on the old one in every respect:  it's got a faster CPU and faster memory, a larger and faster hard drive, a larger and higher-resolution display, a better graphics card (not that that matters much for the 2D imaging work this project is concerned with), and a newer operating system.&lt;/p&gt;

&lt;p&gt;Because the machine's architecture is slightly different, and because it's a completely new install of the operating system, I'm going to have to recompile &lt;a href="http://github.com/agl/jbig2enc"&gt;jbig2enc&lt;/a&gt;, &lt;a href="http://www.leptonica.com/"&gt;Leptonica&lt;/a&gt;, and various support utilities I've written myself.  After that I can get back to work, so watch this space for updates in the hopefully very near future…&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8183298601630944795-1065634896946815358?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2011/09/series-of-unfortunate-events.html</link><author>noreply@blogger.com (Tristan Miller)</author><thr:total>0</thr:total></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-3947358968773537072</guid><pubDate>Sun, 05 Jun 2011 21:04:00 +0000</pubDate><atom:updated>2011-06-05T23:15:47.274+02:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>unpaper</category><category domain='http://www.blogger.com/atom/ns#'>jbig2enc</category><category domain='http://www.blogger.com/atom/ns#'>autocrop</category><category domain='http://www.blogger.com/atom/ns#'>pdf</category><category domain='http://www.blogger.com/atom/ns#'>leptonica</category><title>Cropping complete</title><description>&lt;p&gt;&lt;a href="/2010/03/page-inventory.html"&gt;As reported in March 2010&lt;/a&gt;, the microfiche images omit the cover pages for the January 1968 and January 1969 issues.  I visited the Socialist Party of Great Britain's head office yesterday and picked up a spare copy of the January 1969 issue.  There were no extra January 1968 issues remaining, but the issue did appear in the bound volumes in the archive, so I took a photocopy.  
The binding obscures about a centimetre of the left edge of the page, but my copy is better than nothing, I guess.  Now my only problem is getting these two pages digitized, as I no longer have access to a scanner.  I'll either have to find someone with a scanner, or see if I can photograph the covers myself.&lt;/p&gt;

&lt;p&gt;With most of the pages now in place, it's time to start thinking again about how to "bind" them into PDF or DjVu documents.  Since it's been a year since I last experimented with this, I downloaded the latest version of &lt;a href="http://github.com/agl/jbig2enc"&gt;jbig2enc&lt;/a&gt; and its dependency, &lt;a href="http://www.leptonica.com/"&gt;Leptonica&lt;/a&gt;.  I discovered that &lt;a href="https://github.com/agl/jbig2enc/issues/12"&gt;jbig2enc doesn't compile with Leptonica 1.68&lt;/a&gt;, but only because the parameters to the &lt;tt&gt;findFileFormat()&lt;/tt&gt; function have changed. This function is referenced once, in &lt;tt&gt;jbig2.cc&lt;/tt&gt;, where it's used to check something involving multi-page TIFFs. I don't use jbig2enc to process TIFFs so I just commented out these lines, and then jbig2enc compiled fine.&lt;/p&gt;

&lt;p&gt;My computer is now whirring away, generating three PDFs for each of the issues that I have up to the end of 1969: one with no upsampling, one with 2× upsampling, and one with 4× upsampling.  It will probably be busy doing this all night.  Once it's done, I'll examine the results to see what looks the best and what the file sizes are like.  Watch this space for further analysis of the results…&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8183298601630944795-3947358968773537072?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2011/06/cropping-complete.html</link><author>noreply@blogger.com (Tristan Miller)</author><thr:total>3</thr:total></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-4535728579625582619</guid><pubDate>Thu, 02 Jun 2011 13:20:00 +0000</pubDate><atom:updated>2011-06-02T15:22:09.801+02:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>unpaper</category><title>unpaper revisited</title><description>&lt;p&gt;The last few days have been spent reviewing and revising &lt;a href="/2010/03/final-results-with-unpaper.html"&gt;the results I obtained with unpaper in March of last year&lt;/a&gt;.  After double-checking the image output, I found I had missed some cases where unpaper had failed to properly process the images.  After applying the appropriate command-line options discussed previously, I was able to get unpaper to correctly process most of these; the rest I added to the list of images which unpaper cannot process.  I also double-checked this list, which originally had 142 images; I found that many of them could be processed successfully after a little more experimentation with the command-line options.&lt;/p&gt;
&lt;p&gt;In the end, there was a net increase of seven images to the list, so it now contains 149 images.  These cases are almost exclusively pages with illustrations (usually cover pages).  I will now have to do some preliminary tests to determine whether it would be more efficient to crop these images manually or use my autocrop tool.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8183298601630944795-4535728579625582619?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2011/06/unpaper-revisited.html</link><author>noreply@blogger.com (Tristan Miller)</author><thr:total>0</thr:total></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-3194849684768790381</guid><pubDate>Tue, 31 May 2011 10:47:00 +0000</pubDate><atom:updated>2011-05-31T12:47:53.485+02:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>bugs</category><category domain='http://www.blogger.com/atom/ns#'>dolphin</category><category domain='http://www.blogger.com/atom/ns#'>okular</category><category domain='http://www.blogger.com/atom/ns#'>poppler</category><category domain='http://www.blogger.com/atom/ns#'>pdf</category><title>PDF viewing woes: update</title><description>&lt;p&gt;In March 2010 &lt;a href="/2010/03/pdf-viewing-woes.html"&gt;I reported on a couple of problems viewing PDFs&lt;/a&gt;. The first problem was that my file manager, Dolphin, was unable to generate previews of some PDFs due to an arbitrary limit on file sizes.  The second problem was slow rendering of the PDFs in my viewer, Okular.  
I'm pleased to report that &lt;a href="https://bugs.kde.org/show_bug.cgi?id=230820#c2"&gt;the first of these issues has been fixed&lt;/a&gt;, and &lt;a href="https://bugs.freedesktop.org/show_bug.cgi?id=13518#c28"&gt;the second is due to be fixed in the next stable release series of Poppler&lt;/a&gt; (the PDF rendering library used by Okular).&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8183298601630944795-3194849684768790381?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2011/05/pdf-viewing-woes-update.html</link><author>noreply@blogger.com (Tristan Miller)</author><thr:total>0</thr:total></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-3747798299673684610</guid><pubDate>Wed, 31 Mar 2010 16:53:00 +0000</pubDate><atom:updated>2010-04-01T15:33:39.790+02:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>unpaper</category><title>Final results with unpaper</title><description>&lt;p&gt;I've finished processing the LSE JPEGs with &lt;a href="http://unpaper.berlios.de/"&gt;unpaper&lt;/a&gt;, at least for the time being.  The vast majority of the images were successfully processed using the following command lines.&lt;/p&gt;

&lt;dl&gt;
&lt;dt&gt;September 1904 to August 1918&lt;/dt&gt;
&lt;dd&gt;&lt;code&gt;unpaper --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 3500,2700 in.pgm out%d.pgm&lt;/code&gt;&lt;/dd&gt;
&lt;dt&gt;September 1918 to August 1932&lt;/dt&gt;
&lt;dd&gt;&lt;code&gt;unpaper --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 3700,2600 in.pgm out%d.pgm&lt;/code&gt;&lt;/dd&gt;
&lt;dt&gt;September 1932 to December 1950&lt;/dt&gt;
&lt;dd&gt;&lt;code&gt;unpaper --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 4000,2660 in.pgm out%d.pgm&lt;/code&gt;&lt;/dd&gt;
&lt;dt&gt;January 1951 to December 1969&lt;/dt&gt;
&lt;dd&gt;&lt;code&gt;unpaper --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 3450,2700 in.pgm out%d.pgm&lt;/code&gt;&lt;/dd&gt;
&lt;/dl&gt;

&lt;p&gt;As you can see, the only difference in the command lines was the sheet size, which &lt;a href="/2010/03/page-and-image-size-analysis.html"&gt;varied over the course of the &lt;em&gt;Standard&lt;/em&gt;'s print run&lt;/a&gt;.  The &lt;code&gt;--layout double&lt;/code&gt; and &lt;code&gt;--output-pages 2&lt;/code&gt; options simply specify that we are working with 2-up sheets that need to be split into two separate files, one for each page.  The &lt;code&gt;--pre-wipe&lt;/code&gt; option deletes the &lt;em&gt;ex libris&lt;/em&gt; banner LSE inserted at the bottom of the image.&lt;/p&gt;
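&lt;p&gt;Since only the sheet size changes between periods, the choice can be automated.  The following is a minimal Python sketch, not the script actually used for the project: the helper names &lt;code&gt;sheet_size&lt;/code&gt; and &lt;code&gt;unpaper_args&lt;/code&gt; are hypothetical, and the date boundaries and sizes are simply copied from the list above.&lt;/p&gt;

```python
import bisect

# First (year, month) of each period and its sheet size, copied from the
# list of command lines above.
PERIOD_STARTS = [(1904, 9), (1918, 9), (1932, 9), (1951, 1)]
SHEET_SIZES = ["3500,2700", "3700,2600", "4000,2660", "3450,2700"]


def sheet_size(year, month):
    """Return the --sheet-size value for an issue dated year/month."""
    index = bisect.bisect_right(PERIOD_STARTS, (year, month)) - 1
    if index == -1 or (year, month) > (1969, 12):
        raise ValueError("issue date outside September 1904 to December 1969")
    return SHEET_SIZES[index]


def unpaper_args(year, month, infile, outpattern):
    """Build the unpaper argument list for one 2-up sheet (hypothetical helper)."""
    return [
        "unpaper", "--layout", "double", "--output-pages", "2",
        "--pre-wipe", "0,2898,4263,3102",
        "--sheet-size", sheet_size(year, month),
        infile, outpattern,
    ]


# Print (rather than run) the command for a March 1921 sheet.
print(" ".join(unpaper_args(1921, 3, "in.pgm", "out%d.pgm")))
```

&lt;p&gt;The argument list could be handed to &lt;code&gt;subprocess.run&lt;/code&gt; instead of being printed; printing first makes it easy to check the command before committing to a long batch run.&lt;/p&gt;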

&lt;p&gt;Overall, 5445 of the 6078 sheets (90%) were processed more or less correctly with these default settings.  In some of these cases unpaper failed to properly deskew the page, and sometimes a small portion of the page header was cut off, but no significant amount of article text or graphics is missing.  A further 491 images (8%) required some additional options to correct the following processing errors:&lt;/p&gt;



&lt;dl&gt;

&lt;dt&gt;folded columns&lt;/dt&gt;
&lt;dd&gt;Unpaper's grey filter seemed to get confused with certain multicolumn layouts, especially if the column layout wasn't identical on both pages of a sheet.  It would end up deleting a column down the side or middle and then squeezing the two edges together, as if it had folded the page onto itself.  About 5% of sheets were so affected.  The problem was solved by using the &lt;code&gt;--no-grayfilter&lt;/code&gt; option.

&lt;table&gt;
  &lt;tr&gt;
    &lt;td style="text-align:center; font-size: 80%;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_Ld2fW7xry2I/S7N-qAvPD4I/AAAAAAAAAFo/2NIrpi3_RSM/s1600/1965-033a_foldedcolumn.png"&gt;&lt;img style="display:inline; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 255px; height: 400px;" src="http://1.bp.blogspot.com/_Ld2fW7xry2I/S7N-qAvPD4I/AAAAAAAAAFo/2NIrpi3_RSM/s400/1965-033a_foldedcolumn.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5454842833893527426" /&gt;&lt;br /&gt;folded column&lt;/a&gt;&lt;/td&gt;
    &lt;td style="text-align:center; font-size: 80%;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_Ld2fW7xry2I/S7N-qMw6XVI/AAAAAAAAAFw/omCF95ArvE4/s1600/1965-033a_fixed.png"&gt;&lt;img style="display:inline; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 255px; height: 400px;" src="http://4.bp.blogspot.com/_Ld2fW7xry2I/S7N-qMw6XVI/AAAAAAAAAFw/omCF95ArvE4/s400/1965-033a_fixed.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5454842837121785170" /&gt;&lt;br /&gt;fixed with &lt;code&gt;--no-grayfilter&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/dd&gt;

&lt;dt&gt;missing text&lt;/dt&gt;
&lt;dd&gt;About 4% of the sheets, mostly from 1921 and the 1960s, had blocks of text missing due to unpaper's grey filter misidentifying an area of a particularly dark scan. The issue was fixed by using &lt;code&gt;--black-threshold&lt;/code&gt; to adjust the luminance value under which unpaper considers a pixel to be black.

&lt;table&gt;
  &lt;tr&gt;
    &lt;td style="text-align:center; font-size: 80%;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_Ld2fW7xry2I/S7N-aLevytI/AAAAAAAAAFI/ZAzXDPguo0o/s1600/1921-003b_blackthreshold.png"&gt;&lt;img style="display:inline; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 284px; height: 400px;" src="http://1.bp.blogspot.com/_Ld2fW7xry2I/S7N-aLevytI/AAAAAAAAAFI/ZAzXDPguo0o/s400/1921-003b_blackthreshold.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5454842561899252434" /&gt;&lt;br /&gt;missing text&lt;/a&gt;&lt;/td&gt;
    &lt;td style="text-align:center; font-size: 80%;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_Ld2fW7xry2I/S7N-akJpYNI/AAAAAAAAAFQ/awiv5OAPmKQ/s1600/1921-003b_fixed.png"&gt;&lt;img style="display:inline; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 284px; height: 400px;" src="http://1.bp.blogspot.com/_Ld2fW7xry2I/S7N-akJpYNI/AAAAAAAAAFQ/awiv5OAPmKQ/s400/1921-003b_fixed.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5454842568521638098" /&gt;&lt;br /&gt;fixed with &lt;code&gt;--black-threshold&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/dd&gt;

&lt;dd&gt;In a further five cases, the missing text was due to a hair or other anomalous dark line leading to the black border area; unpaper then considered the text block to be part of the border and deleted it.  These cases were solved by adjusting &lt;code&gt;--blackfilter-intensity&lt;/code&gt;.

&lt;table&gt;
  &lt;tr&gt;
    &lt;td style="text-align:center; font-size: 80%;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_Ld2fW7xry2I/S7N-a6Rvj_I/AAAAAAAAAFY/OwWNuuNlSio/s1600/1944-007a_hair.png"&gt;&lt;img style="display:inline; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 301px; height: 400px;" src="http://4.bp.blogspot.com/_Ld2fW7xry2I/S7N-a6Rvj_I/AAAAAAAAAFY/OwWNuuNlSio/s400/1944-007a_hair.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5454842574461177842" /&gt;&lt;br /&gt;missing text&lt;/a&gt;&lt;/td&gt;
    &lt;td style="text-align:center; font-size: 80%;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_Ld2fW7xry2I/S7N-bIahDoI/AAAAAAAAAFg/4paJ4aT-WiE/s1600/1944-007a_fixed.png"&gt;&lt;img style="display:inline; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 301px; height: 400px;" src="http://3.bp.blogspot.com/_Ld2fW7xry2I/S7N-bIahDoI/AAAAAAAAAFg/4paJ4aT-WiE/s400/1944-007a_fixed.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5454842578256072322" /&gt;&lt;br /&gt;fixed with &lt;code&gt;--blackfilter-intensity&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/dd&gt;

&lt;dt&gt;misaligned pages&lt;/dt&gt;
&lt;dd&gt;Sometimes unpaper would split a sheet off centre, so that the rightmost edge of the left-hand page spilled over into the leftmost edge of the right-hand page.  Using &lt;code&gt;--pre-shift -100,0&lt;/code&gt; solved this problem, which affected nearly 3% of the images.

&lt;table&gt;
  &lt;tr&gt;
    &lt;td style="text-align:center; font-size: 80%;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_Ld2fW7xry2I/S7N-qlJjzsI/AAAAAAAAAF4/Y3CyELP-DfM/s1600/1966-080b_offset.png"&gt;&lt;img style="display:inline; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 255px; height: 400px;" src="http://3.bp.blogspot.com/_Ld2fW7xry2I/S7N-qlJjzsI/AAAAAAAAAF4/Y3CyELP-DfM/s400/1966-080b_offset.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5454842843667615426" /&gt;&lt;br /&gt;misaligned page&lt;/a&gt;&lt;/td&gt;
    &lt;td style="text-align:center; font-size: 80%;"&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_Ld2fW7xry2I/S7N-qz1izNI/AAAAAAAAAGA/S7ISzbrmR5M/s1600/1966-080b_fixed.png"&gt;&lt;img style="display:inline; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 255px; height: 400px;" src="http://3.bp.blogspot.com/_Ld2fW7xry2I/S7N-qz1izNI/AAAAAAAAAGA/S7ISzbrmR5M/s400/1966-080b_fixed.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5454842847610195154" /&gt;&lt;br /&gt;fixed with &lt;code&gt;--pre-shift&lt;/code&gt;&lt;/a&gt;&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;
&lt;/dd&gt;
&lt;/dl&gt;

&lt;p&gt;There were 142 images (just over 2%) which could not be easily fixed at all.  Nearly all of these failures were due to unpaper's black filter erasing images with large dark patches, or images or headlines which lay too close to the edge of the page.  I was unable to find any combination of options which would preserve the desired text and images while still erasing the black border around the page.  I may write to the author of unpaper to see if he has any suggestions; if this proves fruitless then I will have to take a different approach to these images.  Possibly I could use my own autocrop tool on them to eliminate the black border, and then pass the result to unpaper for deskewing, grey filtering, and noise filtering.&lt;/p&gt;
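&lt;p&gt;As a quick sanity check on the rounded percentages quoted in this post, the three counts can be tallied against the total of 6078 sheets.  This is only an arithmetic sketch in Python; the category labels are my own shorthand.&lt;/p&gt;

```python
TOTAL_SHEETS = 6078

# Counts quoted in this post.
OUTCOMES = {
    "default settings": 5445,
    "extra options": 491,
    "not fixed": 142,
}


def percentage(count, total=TOTAL_SHEETS):
    """Return count as a percentage of total, rounded to one decimal place."""
    return round(100 * count / total, 1)


# The three categories together account for every sheet.
assert sum(OUTCOMES.values()) == TOTAL_SHEETS

for label, count in OUTCOMES.items():
    print(f"{label}: {count} sheets ({percentage(count)}%)")
```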

&lt;p&gt;The following stacked bar chart shows the number of successfully and unsuccessfully processed JPEGs for each volume of issues from 1904 to 1969.&lt;/p&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_Ld2fW7xry2I/S7N-rV-YJ0I/AAAAAAAAAGI/io7_DLypiVI/s1600/unpaper.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 280px;" src="http://3.bp.blogspot.com/_Ld2fW7xry2I/S7N-rV-YJ0I/AAAAAAAAAGI/io7_DLypiVI/s400/unpaper.png" border="0" alt="[a stacked bar chart showing the proportion of images successfully and unsuccessfully processed by unpaper]" id="BLOGGER_PHOTO_ID_5454842856774051650" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As can be seen, the number of failures increases significantly in the 1960s—this is due to the increased use of photographs, particularly on the cover pages.  The 1970s issues used so many photographs that there were more failures than I cared to correct.  Since I scanned those images from paper myself, I will use my own much better scans instead of trying to unpaper the microfiche photos.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8183298601630944795-3747798299673684610?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2010/03/final-results-with-unpaper.html</link><author>noreply@blogger.com (Tristan Miller)</author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_Ld2fW7xry2I/S7N-qAvPD4I/AAAAAAAAAFo/2NIrpi3_RSM/s72-c/1965-033a_foldedcolumn.png' height='72' width='72'/><thr:total>2</thr:total></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-3892710098798838830</guid><pubDate>Sun, 28 Mar 2010 15:36:00 +0000</pubDate><atom:updated>2010-03-28T17:38:24.452+02:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>kuickshow</category><category domain='http://www.blogger.com/atom/ns#'>unpaper</category><category domain='http://www.blogger.com/atom/ns#'>gwenview</category><category domain='http://www.blogger.com/atom/ns#'>PNM</category><title>First results with unpaper</title><description>&lt;p&gt;The last few days have been spent figuring out how to get &lt;a href="http://unpaper.berlios.de/"&gt;unpaper&lt;/a&gt; to work.  Unlike my autocrop tool, the sheet size needs to be specified, which makes it a bit trickier to use with the LSE scans (see my earlier post &lt;a href="/2010/03/page-and-image-size-analysis.html"&gt;"Page and image size analysis"&lt;/a&gt;).  
It also handles only uncompressed &lt;a href="http://en.wikipedia.org/wiki/Netpbm_format"&gt;PNM&lt;/a&gt; files, which for some strange reason the author thinks of as a feature rather than a shortcoming.  So now my corpus has ballooned by another 74 GB.  Good thing I bought that 1.5 TB drive.&lt;/p&gt;

&lt;p&gt;Anyway, I've run unpaper on the September 1904 through August 1918 issues (whose pages are all 242&amp;thinsp;mm&amp;nbsp;× 460&amp;thinsp;mm).  Of the 836 uncropped JPEGs for these issues, unpaper seems to have processed 779 of them (93%) correctly with the following command line:&lt;/p&gt;

&lt;blockquote&gt;&lt;code&gt;unpaper --overwrite --layout double --output-pages 2 --pre-wipe 0,2898,4263,3102 --sheet-size 3500,2700 in.pgm out%d.pgm&lt;/code&gt;&lt;/blockquote&gt;

&lt;p&gt;Of the remaining 57 JPEGs, all but 6 were correctly processed with some extra options to modify the behaviour of the black or grey filters.  The remaining 6 images I will have to deskew and crop by hand, or just use my autocrop tool.&lt;/p&gt;

&lt;p&gt;Finding out the correct unpaper options for the 57 anomalous JPEGs was somewhat tedious.  I would run unpaper with various command-line options on the files, wait several seconds for it to process, launch an image viewer on the output files, and then if the output was not acceptable, I would have to quit the image viewer and start again with a different set of command-line options.  It would have been much easier if I could have just kept the image viewer open and set it to automatically refresh the images whenever they changed on disk.  Unfortunately, neither of the viewers I tried (&lt;a href="http://gwenview.sourceforge.net/"&gt;Gwenview&lt;/a&gt; and &lt;a href="http://userbase.kde.org/KuickShow"&gt;Kuickshow&lt;/a&gt;) has a "watch file" option. Gwenview does have a manual "refresh" command, but it does not refresh thumbnails.  I've therefore created and/or voted for these "watch file" and "refresh" feature requests on the KDE bug tracker:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://bugs.kde.org/show_bug.cgi?id=230972"&gt;Bug 230972 (Gwenview) -  GwenView does not refresh thumbnails on F5&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://bugs.kde.org/show_bug.cgi?id=131068"&gt;Bug 131068 (Gwenview) -  Feature request: Watch file&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://bugs.kde.org/show_bug.cgi?id=232407"&gt;Bug 232407 (Kuickshow) -  Add a "Watch file" option&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://bugs.kde.org/show_bug.cgi?id=232409"&gt;Bug 232409 (Kuickshow) -  Add "Reload" command to kuickshow&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
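&lt;p&gt;Until a viewer gains such a feature, one possible stopgap is a small wrapper that polls the file's modification time and restarts the viewer whenever it changes.  The Python sketch below is an untested-in-anger illustration rather than something I used for this project; &lt;code&gt;viewer_command&lt;/code&gt; is a placeholder for whatever viewer is installed, and the one-second polling interval is an arbitrary assumption.&lt;/p&gt;

```python
import os
import subprocess
import time


def changed_since(path, last_mtime):
    """Return True if path's modification time differs from last_mtime."""
    return os.stat(path).st_mtime != last_mtime


def watch(path, viewer_command, poll_seconds=1.0):
    """Restart viewer_command on path every time the file changes on disk.

    viewer_command is a placeholder argument list such as ["some-viewer"].
    The loop runs until interrupted with Ctrl-C.
    """
    while True:
        last_mtime = os.stat(path).st_mtime
        viewer = subprocess.Popen(viewer_command + [path])
        # Poll until the file on disk is newer (or older) than what is shown.
        while not changed_since(path, last_mtime):
            time.sleep(poll_seconds)
        viewer.terminate()
        viewer.wait()
```

&lt;p&gt;One caveat: the viewer may be restarted while unpaper is still writing the file, so a short extra delay before relaunching might be needed in practice.&lt;/p&gt;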

&lt;p&gt;In the meantime, does anyone know of a fast, lightweight image viewer for X11 which has a "watch file" feature?  It should be able to view PNM and PNG files.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8183298601630944795-3892710098798838830?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2010/03/first-results-with-unpaper.html</link><author>noreply@blogger.com (Tristan Miller)</author><thr:total>4</thr:total></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-8677326518346613455</guid><pubDate>Tue, 23 Mar 2010 13:45:00 +0000</pubDate><atom:updated>2010-03-23T14:50:16.348+01:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>unpaper</category><category domain='http://www.blogger.com/atom/ns#'>jbig2enc</category><category domain='http://www.blogger.com/atom/ns#'>pdf2djvu</category><category domain='http://www.blogger.com/atom/ns#'>pdf</category><category domain='http://www.blogger.com/atom/ns#'>djvulibre</category><category domain='http://www.blogger.com/atom/ns#'>imagemagick</category><title>PDFs: JPEG vs PNG vs JBIG2</title><description>&lt;p&gt;The goal of today's exercise is to see how to make the smallest possible PDF from the scans I have.  To this end, I experimented with the September 1954 issue, which is a good candidate because it is fairly long and contains both text and images.  The following table and graph summarize four approaches and their results.&lt;/p&gt;

&lt;table&gt;
&lt;tr&gt;&lt;th&gt;process&lt;/th&gt;&lt;th&gt;PDF creation command&lt;/th&gt;&lt;th&gt;PDF size (KB)&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;join the JPEGs into a PDF with &lt;a href="http://www.imagemagick.org/"&gt;ImageMagick&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;tt&gt;convert *.jpg JPEG.pdf&lt;/tt&gt;&lt;/td&gt;&lt;td style="text-align: right"&gt;43&amp;thinsp;777&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;convert the JPEGs to bilevel PNGs, then join them into a PDF with ImageMagick&lt;/td&gt;&lt;td&gt;&lt;tt&gt;convert *.png PNG.pdf&lt;/tt&gt;&lt;/td&gt;&lt;td style="text-align: right"&gt;6&amp;thinsp;907&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;convert the JPEGs to bilevel JBIG2s, then join them into a PDF with &lt;a href="http://github.com/agl/jbig2enc"&gt;jbig2enc&lt;/a&gt;&lt;/td&gt;&lt;td&gt;&lt;tt&gt;jbig2 -b J -d -p -s *.jpg; pdf.py J &gt; JBIG2.pdf&lt;/tt&gt;&lt;/td&gt;&lt;td style="text-align: right"&gt;947&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;upscale and convert the JPEGs to bilevel JBIG2s, then join them into a PDF with jbig2enc&lt;/td&gt;&lt;td&gt;&lt;tt&gt;jbig2 -b J -d -p -s -2 *.jpg; pdf.py J &gt; 2xJBIG2.pdf&lt;/tt&gt;&lt;/td&gt;&lt;td style="text-align: right"&gt;1&amp;thinsp;451&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_Ld2fW7xry2I/S6jGO07xR1I/AAAAAAAAAFA/78r0253tJEs/s1600-h/PDF_sizes.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 303px;" src="http://3.bp.blogspot.com/_Ld2fW7xry2I/S6jGO07xR1I/AAAAAAAAAFA/78r0253tJEs/s400/PDF_sizes.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5451825306961790802" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So the clear winner here is JBIG2.  The 2× upscaled version is actually much easier to read than the unscaled JBIG2 or PNG images, which are sometimes too faint.  If I were to use the 2× upscaled JBIG2 method to produce PDFs for all the 1904–1972 issues, the total would be about 450 MB in size, which would easily fit on a single CD-ROM.&lt;/p&gt;
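&lt;p&gt;To put the table in relative terms, the short Python sketch below computes how many times smaller each PDF is than the plain JPEG one.  The sizes are copied from the table; the dictionary keys are just my own labels for the four approaches.&lt;/p&gt;

```python
# PDF sizes in KB, copied from the table above.
PDF_SIZES_KB = {
    "JPEG": 43777,
    "bilevel PNG": 6907,
    "JBIG2": 947,
    "2x upscaled JBIG2": 1451,
}


def ratio_to_jpeg(method):
    """Return how many times smaller a method's PDF is than the JPEG PDF."""
    return PDF_SIZES_KB["JPEG"] / PDF_SIZES_KB[method]


for method in PDF_SIZES_KB:
    print(f"{method}: {ratio_to_jpeg(method):.1f}x smaller than JPEG")
```

&lt;p&gt;By this measure the unscaled JBIG2 file is roughly 46 times smaller than the JPEG one, and even the 2× upscaled version is about 30 times smaller.&lt;/p&gt;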

&lt;p&gt;However, I know that much better compression ratios can be achieved using DjVu: pages can typically be reduced to just a few kilobytes each.  The problem is that creating DjVu documents is a bit more involved.  I tried using &lt;a href="http://code.google.com/p/pdf2djvu/"&gt;pdf2djvu&lt;/a&gt;, but the DjVu files it created were even larger than the PDFs; clearly what I really need to do is to use the individual &lt;a href="http://djvu.sourceforge.net/"&gt;DjVuLibre&lt;/a&gt; tools to properly segment and compress the original cropped JPEGs.  Fortunately there appears to be some guidance, along with some scripts, on &lt;a href="http://en.wikisource.org/wiki/Help:DjVu_files"&gt;Wikisource&lt;/a&gt;.  The Wikisource guide also pointed me towards &lt;a href="http://unpaper.berlios.de/"&gt;unpaper&lt;/a&gt;, which apparently does a better job of autocropping scans than my own tool, and also deskews the pages.  So the next few days will probably be spent investigating these resources.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8183298601630944795-8677326518346613455?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2010/03/pdfs-jpeg-vs-png-vs-jbig.html</link><author>noreply@blogger.com (Tristan Miller)</author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_Ld2fW7xry2I/S6jGO07xR1I/AAAAAAAAAFA/78r0253tJEs/s72-c/PDF_sizes.png' height='72' width='72'/><thr:total>1</thr:total></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-8617120057524077869</guid><pubDate>Mon, 22 Mar 2010 23:27:00 +0000</pubDate><atom:updated>2010-03-23T00:29:12.556+01:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>jbig2enc</category><category domain='http://www.blogger.com/atom/ns#'>make</category><title>jbig2enc</title><description>&lt;p&gt;Today, on the 
recommendation of one of the readers of this blog, I decided to install &lt;a href="http://github.com/agl/jbig2enc"&gt;jbig2enc&lt;/a&gt; to see how it might be useful for my digitization project.  Unfortunately, it didn't seem to compile out of the box:&lt;/p&gt;

&lt;pre&gt;[psy@sable:~/src/agl-jbig2enc-edebc5a]$ make
g++ -c jbig2enc.cc -I../leptonlib-1.64/src -Wall -I/usr/include -L/usr/lib -O3 
g++ -c jbig2arith.cc -I../leptonlib-1.64/src -Wall -I/usr/include -L/usr/lib -O3 
g++ -c jbig2sym.cc -DUSE_EXT -I../leptonlib-1.64/src -Wall -I/usr/include -L/usr/lib -O3 
ar -rcv libjbig2enc.a jbig2enc.o jbig2arith.o jbig2sym.o
a - jbig2enc.o
a - jbig2arith.o
a - jbig2sym.o
g++ -o jbig2 jbig2.cc -L. -ljbig2enc ../leptonlib-1.64/src/liblept.a -I../leptonlib-1.64/src -Wall -I/usr/include -L/usr/lib -O3  -lpng -ljpeg -ltiff -lm
../leptonlib-1.64/src/liblept.a(gifio.o): In function `pixWriteStreamGif':
gifio.c:(.text+0x15c): undefined reference to `MakeMapObject'
gifio.c:(.text+0x21d): undefined reference to `EGifOpenFileHandle'
gifio.c:(.text+0x258): undefined reference to `EGifPutScreenDesc'
gifio.c:(.text+0x26f): undefined reference to `FreeMapObject'
gifio.c:(.text+0x277): undefined reference to `EGifCloseFile'
gifio.c:(.text+0x2b0): undefined reference to `FreeMapObject'
gifio.c:(.text+0x2df): undefined reference to `FreeMapObject'
gifio.c:(.text+0x2ff): undefined reference to `EGifPutImageDesc'
gifio.c:(.text+0x31a): undefined reference to `EGifCloseFile'
gifio.c:(.text+0x502): undefined reference to `EGifPutLine'
gifio.c:(.text+0x537): undefined reference to `EGifPutComment'
gifio.c:(.text+0x569): undefined reference to `EGifCloseFile'
gifio.c:(.text+0x582): undefined reference to `FreeMapObject'
gifio.c:(.text+0x5e0): undefined reference to `EGifCloseFile'
gifio.c:(.text+0x653): undefined reference to `EGifCloseFile'
gifio.c:(.text+0x682): undefined reference to `EGifCloseFile'
../leptonlib-1.64/src/liblept.a(gifio.o): In function `pixReadStreamGif':
gifio.c:(.text+0x6ce): undefined reference to `DGifOpenFileHandle'
gifio.c:(.text+0x6e6): undefined reference to `DGifSlurp'
gifio.c:(.text+0x878): undefined reference to `DGifCloseFile'
gifio.c:(.text+0x884): undefined reference to `DGifCloseFile'
gifio.c:(.text+0x9bc): undefined reference to `DGifCloseFile'
gifio.c:(.text+0xa2c): undefined reference to `DGifCloseFile'
gifio.c:(.text+0xa55): undefined reference to `DGifCloseFile'
../leptonlib-1.64/src/liblept.a(gifio.o):gifio.c:(.text+0xa86): more undefined references to `DGifCloseFile' follow
collect2: ld returned 1 exit status
make: *** [jbig2] Error 1
[psy@sable:~/src/agl-jbig2enc-edebc5a]$ 
&lt;/pre&gt;

&lt;p&gt;It seems there were a few things wrong with the Makefile.  The showstopper was that &lt;a href="http://www.leptonica.com/"&gt;Leptonica&lt;/a&gt;, the library upon which jbig2enc depends, is expecting to link to &lt;a href="http://sourceforge.net/projects/giflib/"&gt;giflib&lt;/a&gt;, but the Makefile doesn't specify this library.  This was solved by adding &lt;tt&gt;-lgif&lt;/tt&gt; to the command which compiles &lt;tt&gt;jbig2&lt;/tt&gt;.  The other problems were not fatal but somewhat irritating:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;it is assumed that Leptonica isn't installed in a standard location;&lt;/li&gt;
  &lt;li&gt;there are no &lt;tt&gt;install&lt;/tt&gt; and &lt;tt&gt;uninstall&lt;/tt&gt; targets for (un)installing the package;&lt;/li&gt;
  &lt;li&gt;the program is written in C++, but the compiler is invoked with a redefined &lt;tt&gt;$(CC)&lt;/tt&gt; rather than the standard &lt;tt&gt;$(CXX)&lt;/tt&gt;;&lt;/li&gt;
  &lt;li&gt;&lt;tt&gt;ar&lt;/tt&gt; is invoked directly rather than through the standard &lt;tt&gt;$(AR)&lt;/tt&gt;; and&lt;/li&gt;
  &lt;li&gt;the &lt;tt&gt;clean&lt;/tt&gt; target uses wildcards somewhat dangerously.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So here's an updated Makefile for jbig2enc 0.27.  It should work with little or no modification on most *nix systems.  On 64-bit systems which use &lt;tt&gt;lib64&lt;/tt&gt; directories, the &lt;tt&gt;libdir&lt;/tt&gt; variable should be changed appropriately, or else overridden on the command line (for example, &lt;tt&gt;make libdir=/usr/local/lib64 install&lt;/tt&gt;).&lt;/p&gt;

&lt;pre&gt;# Improved Makefile for jbig2enc by Tristan Miller, 2010-03-22

prefix=/usr/local
exec_prefix=$(prefix)
bindir=$(exec_prefix)/bin
libdir=$(exec_prefix)/lib
CFLAGS=-I/usr/local/include/liblept -I/usr/include/liblept -Wall -O3 ${EXTRA}

jbig2: libjbig2enc.a jbig2.cc
	$(CXX) -o jbig2 jbig2.cc -L. -ljbig2enc $(CFLAGS) -lpng -ljpeg -ltiff -lm -llept -lgif

libjbig2enc.a: jbig2enc.o jbig2arith.o jbig2sym.o
	$(AR) -rcv libjbig2enc.a jbig2enc.o jbig2arith.o jbig2sym.o

jbig2enc.o: jbig2enc.cc jbig2arith.h jbig2sym.h jbig2structs.h jbig2segments.h
	$(CXX) -c jbig2enc.cc $(CFLAGS)

jbig2arith.o: jbig2arith.cc jbig2arith.h
	$(CXX) -c jbig2arith.cc $(CFLAGS)

jbig2sym.o: jbig2sym.cc jbig2arith.h
	$(CXX) -c jbig2sym.cc -DUSE_EXT $(CFLAGS)

clean:
	rm -f jbig2enc.o jbig2arith.o jbig2sym.o jbig2 libjbig2enc.a

install:
	install -s jbig2 $(bindir)
	install pdf.py $(bindir)
	install -s libjbig2enc.a $(libdir)

uninstall:
	rm $(bindir)/jbig2
	rm $(bindir)/pdf.py
	rm $(libdir)/libjbig2enc.a
&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8183298601630944795-8617120057524077869?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2010/03/jbig2enc.html</link><author>noreply@blogger.com (Tristan Miller)</author><thr:total>4</thr:total></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-8665923421390757022</guid><pubDate>Sun, 21 Mar 2010 18:10:00 +0000</pubDate><atom:updated>2010-03-21T19:11:39.436+01:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>gimp</category><category domain='http://www.blogger.com/atom/ns#'>pdfpages</category><category domain='http://www.blogger.com/atom/ns#'>pdftex</category><category domain='http://www.blogger.com/atom/ns#'>cropping</category><category domain='http://www.blogger.com/atom/ns#'>imagemagick</category><category domain='http://www.blogger.com/atom/ns#'>jpegtran</category><title>Manual cropping</title><description>&lt;p&gt;This afternoon I used &lt;a href="http://en.wikipedia.org/wiki/GIMP"&gt;GIMP&lt;/a&gt; to find cropping coordinates for the 18 pages my autocrop program didn't successfully process.  Having passed these to &lt;a href="http://www.ijg.org/"&gt;jpegtran&lt;/a&gt;, I'm now in possession of 13&amp;thinsp;020 properly cropped JPEG images, of which 11&amp;thinsp;164 are unique pages of the &lt;em&gt;Socialist Standard&lt;/em&gt; (and the 1967 supplement) and the remaining 1856 are blank pages, microfiche title slides, indices, or duplicates.&lt;/p&gt;

&lt;p&gt;Cropping the images and discarding the irrelevant pages has brought the size of the corpus down from 17.58 GB to 15.13 GB, a saving of 13.94%.  Of course, if LSE had properly scanned them as high-resolution bilevel images rather than JPEGs in the first place, the size would have been about a third of this.  I am wondering if there is some way to convert the JPEGs to bilevel images, but given the relatively poor quality of the photographs and low resolution of the scans, this may not be possible.  I'll have a go at batch-converting them with &lt;a href="http://www.imagemagick.org/"&gt;ImageMagick&lt;/a&gt; and examine the results, but I am not optimistic that they will be acceptable.&lt;/p&gt;

&lt;p&gt;At any rate, the next step will be to assemble the individual pages into PDFs or DjVus, one issue per file.  I shall have to look around to see what software is available for this.  The only one I'm aware of is the &lt;a href="http://www.ctan.org/tex-archive/macros/latex/contrib/pdfpages/"&gt;pdfpages&lt;/a&gt; package for &lt;a href="http://www.tug.org/applications/pdftex/"&gt;pdfTeX&lt;/a&gt;, though I'm sure there are others more suitable for my task.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8183298601630944795-8665923421390757022?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2010/03/manual-cropping.html</link><author>noreply@blogger.com (Tristan Miller)</author><thr:total>1</thr:total></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-6877427974798242133</guid><pubDate>Sat, 20 Mar 2010 17:26:00 +0000</pubDate><atom:updated>2010-03-20T18:38:07.799+01:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>autocrop</category><category domain='http://www.blogger.com/atom/ns#'>libjpeg</category><category domain='http://www.blogger.com/atom/ns#'>cropping</category><category domain='http://www.blogger.com/atom/ns#'>jpegtran</category><title>Autocrop</title><description>&lt;p&gt;I have solved the problem of cropping the LSE images.&lt;/p&gt;

&lt;p&gt;First, a quick recap:  The microfiche scans of the &lt;em&gt;Socialist Standard&lt;/em&gt; from the London School of Economics Library were provided as 6510 &lt;a href="http://en.wikipedia.org/wiki/Discrete_cosine_transform"&gt;DCT&lt;/a&gt; images embedded into 69 PDF files.  The images are unsuitable for use as-is for several reasons.  First, each image depicts a spread of two physical pages—unless one has a particularly enormous, high-resolution monitor, it's not possible to read the text without doing a lot of tiresome scrolling.  Second, the images are uncropped photographs of bound volumes of the &lt;em&gt;Standard&lt;/em&gt;; they include a very thick and uneven black margin all around the page spread, which besides being ugly also reduces the resolution of the text when the images are displayed in a viewer at full width or height.  Third, LSE has unhelpfully tacked a rather garish &lt;em&gt;ex libris&lt;/em&gt; banner at the bottom of each page.  You can see a scaled-down copy of one of these DCT images below.&lt;/p&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_Ld2fW7xry2I/S6UFhi6DL5I/AAAAAAAAAE4/EPvwhCdHrV0/s1600-h/1941-025_uncropped.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 294px;" src="http://1.bp.blogspot.com/_Ld2fW7xry2I/S6UFhi6DL5I/AAAAAAAAAE4/EPvwhCdHrV0/s400/1941-025_uncropped.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5450768997865959314" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;My task, then, is to crop the DCT images in such a way as to remove the black border and banner, and then to cut the image down the middle to isolate the two physical pages.  I was afraid that, since the width of the DCT image and the position of the page spread therein varies from image to image, I would have to do the cropping manually.  Assuming it takes two minutes to crop an image manually, it would have taken about 217 hours to do the entire microfiche collection.&lt;/p&gt;
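&lt;p&gt;The 217-hour figure follows directly from the image count and the assumed two minutes per image, as this small Python check shows.  The constant names are mine; the two-minute estimate is, as stated, only an assumption.&lt;/p&gt;

```python
IMAGES = 6510            # DCT images supplied by LSE
MINUTES_PER_IMAGE = 2    # assumed manual cropping time per image


def manual_hours(images=IMAGES, minutes_each=MINUTES_PER_IMAGE):
    """Return the total manual cropping time in hours."""
    return images * minutes_each / 60


print(manual_hours())  # 217.0
```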

&lt;p&gt;Fortunately, I was able to devise an image processing algorithm, realized in the &lt;a href="http://www.ijg.org/"&gt;libjpeg&lt;/a&gt;-based C program below, which suggests the cropping region automatically.  It examines successive rows from the top of the image and calculates their average brightness; once it discovers a row with a brightness above a certain threshold, it has found the upper crop line.  It finds the bottom crop line similarly, but this time working upwards from just before the LSE banner.  The left and right crop lines are handled similarly, except that the algorithm examines columns instead of rows, working inwards from the left and right edges.  The cropping region is then passed to &lt;a href="http://www.ijg.org/"&gt;jpegtran&lt;/a&gt; for lossless cropping, as shown in the shell script which follows.&lt;/p&gt;
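&lt;p&gt;Before the C program itself, here is the scanning idea reduced to a few lines of Python on a toy greyscale image.  It is a simplified illustration, not a port: it ignores the LSE banner and the fixed offsets the C program uses, the function names are my own, and only the threshold of 15 is borrowed from the C source.&lt;/p&gt;

```python
def mean(values):
    """Average brightness of one row or column of pixel values."""
    return sum(values) / len(values)


def find_crop_line(lines, threshold, reverse=False):
    """Return the index of the first line whose mean brightness exceeds
    threshold, scanning forwards (or backwards if reverse is set).
    Returns None if no line is bright enough."""
    indices = range(len(lines))
    if reverse:
        indices = reversed(indices)
    for i in indices:
        if mean(lines[i]) > threshold:
            return i
    return None


def crop_box(image, threshold=15):
    """Return (top, bottom, left, right) for a list of equal-length rows."""
    columns = list(zip(*image))  # transpose so columns can be scanned like rows
    return (
        find_crop_line(image, threshold),
        find_crop_line(image, threshold, reverse=True),
        find_crop_line(columns, threshold),
        find_crop_line(columns, threshold, reverse=True),
    )


# Toy 6x5 image: a bright "page" (200) surrounded by a black border (0).
image = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 200, 200, 200, 0],
    [0, 0, 200, 200, 200, 0],
    [0, 0, 200, 200, 200, 0],
    [0, 0, 0, 0, 0, 0],
]
print(crop_box(image))  # (1, 3, 2, 4)
```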

&lt;pre class="brush:c"&gt;
#include &amp;lt;stdio.h&gt;
#include &amp;lt;stdlib.h&gt;
#include &amp;lt;jpeglib.h&gt;

#define THRESHOLD 15
#define MIN_X 65
#define MIN_Y 5
#define MAX_Y 2850

int main(int argc, char *argv[]) {

  struct jpeg_error_mgr jerr;
  struct jpeg_decompress_struct cinfo;
  FILE *infile;
  JSAMPARRAY buffer;
  int arg = 0;
  size_t row_stride;
  long x, y, top = 0 , bottom = 0, left = 0, right = 0;
  unsigned long v;

  /* Print usage information */
  if (argc &amp;lt;= 1) {
    fputs("Usage: autocrop file.jpg ...\n", stderr);
    return EXIT_FAILURE;
  }

  /* For each filename on the command line */
  while (++arg &amp;lt; argc) {

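    /* Reset the crop lines so that a file with no sufficiently bright
       row or column doesn't inherit values from the previous file */
    top = bottom = left = right = 0;

    /* Open the file */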
    if ((infile = fopen(argv[arg], "rb")) == NULL) {
      fprintf(stderr, "autocrop: can't open %s\n", argv[arg]);
      return EXIT_FAILURE;
    }

    /* Initialize JPEG decompression */
    cinfo.err = jpeg_std_error(&amp;jerr);
    jpeg_create_decompress(&amp;cinfo);
    jpeg_stdio_src(&amp;cinfo, infile);
    (void) jpeg_read_header(&amp;cinfo, TRUE);
    (void) jpeg_start_decompress(&amp;cinfo);
    row_stride = cinfo.output_width * cinfo.output_components;

    /* Slurp JPEG into memory */
    buffer = (*cinfo.mem-&gt;alloc_sarray)
      ((j_common_ptr) &amp;cinfo, JPOOL_IMAGE, row_stride, cinfo.output_height); 
    if (buffer == NULL) {
      fprintf(stderr, "autocrop: out of memory\n");
      return EXIT_FAILURE;
    }
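    while (cinfo.output_scanline &amp;lt; cinfo.output_height)
      jpeg_read_scanlines(&amp;cinfo, &amp;buffer[cinfo.output_scanline],
                          cinfo.output_height - cinfo.output_scanline);

    /* The scans below index rows MIN_Y..MAX_Y, so they must all exist */
    if (cinfo.output_height &amp;lt;= MAX_Y) {
      fprintf(stderr, "autocrop: %s is shorter than expected\n", argv[arg]);
      return EXIT_FAILURE;
    }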

    /* Find top crop */
    for (y = MIN_Y; y &amp;lt;= MAX_Y; y++) {
      v = 0;
      for (x = MIN_X * cinfo.output_components; x &amp;lt; row_stride; x++)
        v += buffer[y][x];
      if (v / row_stride &gt; THRESHOLD) {
        top = y;
        break;
      }
    }

    /* Find bottom crop */
    for (y = MAX_Y; y &gt;= MIN_Y; y--) {
      v = 0;
      for (x = MIN_X * cinfo.output_components; x &amp;lt; row_stride; x++)
        v += buffer[y][x];
      if (v / row_stride &gt; THRESHOLD) {
        bottom = y;
        break;
      }
    }

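    /* Find left crop.  As written, this loop and the right-crop loop
       divide the column sum by row_stride rather than by the number of
       rows summed, so the effective per-sample cutoff for columns
       differs from THRESHOLD. */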
    for (x = MIN_X * cinfo.output_components; x &amp;lt; row_stride; x++) {
      v = 0;
      for (y = MIN_Y; y &amp;lt;= MAX_Y; y++)
        v += buffer[y][x];
      if (v / row_stride &gt; THRESHOLD) {
        left = x / cinfo.output_components;
        break;
      }
    }

    /* Find right crop */
    for (x = row_stride - 1; x &gt;= MIN_X * cinfo.output_components; x--) {
      v = 0;
      for (y = MIN_Y; y &amp;lt;= MAX_Y; y++)
        v += buffer[y][x];
      if (v / row_stride &gt; THRESHOLD) {
        right = x / cinfo.output_components;
        break;
      }
    }

    /* Print the crop width, height, and upper left coordinates */
    printf("%s\t%ld\t%ld\t%ld\t%ld\n", argv[arg], 
           right - left, bottom - top, left, top);

    /* Clean up */
    (void) jpeg_finish_decompress(&amp;cinfo);
    jpeg_destroy_decompress(&amp;cinfo);
    fclose(infile);
  }

  return EXIT_SUCCESS;
}
&lt;/pre&gt;

&lt;pre class="brush:bash"&gt;
for p in */*.jpg; do
    w=-1
    pbase=$(basename "$p" .jpg)
    pdir=$(dirname "$p")
    # Left-hand page: the left half of the suggested crop region
    if [ ! -e "../LSE_JPEG_cropped/$pdir/${pbase}a.jpg" ]; then
        read -r filename w h x y &amp;lt; &amp;lt;(../bin/autocrop "$p")
        echo jpegtran -grayscale -crop "$((w / 2))x$h+$x+$y" "$p"
        jpegtran -grayscale -crop "$((w / 2))x$h+$x+$y" "$p" \
            &gt; "../LSE_JPEG_cropped/$pdir/${pbase}a.jpg"
    fi
    # Right-hand page: the right half, reusing the region if already known
    if [ ! -e "../LSE_JPEG_cropped/$pdir/${pbase}b.jpg" ]; then
        if [ "$w" -eq -1 ]; then
            read -r filename w h x y &amp;lt; &amp;lt;(../bin/autocrop "$p")
        fi
        echo jpegtran -grayscale -crop "$((w / 2))x$h+$((x + w / 2))+$y" "$p"
        jpegtran -grayscale -crop "$((w / 2))x$h+$((x + w / 2))+$y" "$p" \
            &gt; "../LSE_JPEG_cropped/$pdir/${pbase}b.jpg"
    fi
done
&lt;/pre&gt;

&lt;p&gt;This process is remarkably effective for these images.  Below you can see how it properly cropped the page spread shown above into two separate pages.&lt;/p&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_Ld2fW7xry2I/S6UFgkQwMtI/AAAAAAAAAEo/e7_9ijWlPyo/s1600-h/1941-025a.png"&gt;&lt;img style="display:inline; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 295px; height: 400px;" src="http://2.bp.blogspot.com/_Ld2fW7xry2I/S6UFgkQwMtI/AAAAAAAAAEo/e7_9ijWlPyo/s400/1941-025a.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5450768981049750226" /&gt;&lt;/a&gt;
&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_Ld2fW7xry2I/S6UFhMQTieI/AAAAAAAAAEw/IA6b7hk4UQU/s1600-h/1941-025b.png"&gt;&lt;img style="display:inline; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 296px; height: 400px;" src="http://3.bp.blogspot.com/_Ld2fW7xry2I/S6UFhMQTieI/AAAAAAAAAEw/IA6b7hk4UQU/s400/1941-025b.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5450768991785290210" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Out of the 11&amp;thinsp;168 &lt;em&gt;Socialist Standard&lt;/em&gt; pages, the autocrop algorithm cropped only 18 of them incorrectly, giving an error rate of just 0.161%.  Of these 18 failures, 7 were due to the page being overly skewed, 10 were due to a particularly dark cover image, and 1 was due to noise in the bottom margin.  Below are a couple examples of improperly cropped images.  As there are only 18 of them, I don't mind redoing these manually.&lt;/p&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://1.bp.blogspot.com/_Ld2fW7xry2I/S6UFgNSFFNI/AAAAAAAAAEg/lX3lGc0VOpM/s1600-h/1968-085b_badcrop.png"&gt;&lt;img style="display:inline; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 239px; height: 395px;" src="http://1.bp.blogspot.com/_Ld2fW7xry2I/S6UFgNSFFNI/AAAAAAAAAEg/lX3lGc0VOpM/s400/1968-085b_badcrop.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5450768974881297618" /&gt;&lt;/a&gt;
&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_Ld2fW7xry2I/S6UFfutNbxI/AAAAAAAAAEY/f6HhnZuFD0s/s1600-h/1928-099a_badcrop.png"&gt;&lt;img style="display:inline; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 279px; height: 400px;" src="http://3.bp.blogspot.com/_Ld2fW7xry2I/S6UFfutNbxI/AAAAAAAAAEY/f6HhnZuFD0s/s400/1928-099a_badcrop.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5450768966673592082" /&gt;&lt;/a&gt;
&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8183298601630944795-6877427974798242133?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2010/03/autocrop.html</link><author>noreply@blogger.com (Tristan Miller)</author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://1.bp.blogspot.com/_Ld2fW7xry2I/S6UFhi6DL5I/AAAAAAAAAE4/EPvwhCdHrV0/s72-c/1941-025_uncropped.png' height='72' width='72'/><thr:total>3</thr:total></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-794636025912696496</guid><pubDate>Fri, 19 Mar 2010 20:17:00 +0000</pubDate><atom:updated>2010-03-19T21:18:52.525+01:00</atom:updated><title>Page inventory</title><description>&lt;p&gt;I've completed an inventory of the pages in the 69 LSE PDFs.  That is, for each page in the PDF (or more specifically, for each of the two logical pages on every physical page), I noted whether it was a microfiche title slide, a regular &lt;em&gt;Socialist Standard&lt;/em&gt; page, a blank page, or something else.  Creating such an inventory was necessary so that I can later split up the pages by issue; I need to know where each issue starts and ends within the PDF.&lt;/p&gt;

&lt;p&gt;Below is a chart showing the 13&amp;thinsp;020 logical pages as they appear in sequence in each PDF.  The pages have been colour-coded as follows:  white, a blank page; grey, a microfiche title slide; green, a regular &lt;em&gt;Socialist Standard&lt;/em&gt; page; red, a "special supplement" that appears to have been distributed with the August or September 1967 issue; and blue, the official &lt;em&gt;Socialist Standard&lt;/em&gt; index which was included in some bound volumes.&lt;/p&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_Ld2fW7xry2I/S6Pb9Gd9vjI/AAAAAAAAAEQ/_7wSE-rrfdg/s1600-h/issue_index.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 375px;" src="http://3.bp.blogspot.com/_Ld2fW7xry2I/S6Pb9Gd9vjI/AAAAAAAAAEQ/_7wSE-rrfdg/s400/issue_index.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5450441816803229234" /&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As can be seen, the length of the &lt;em&gt;Standard&lt;/em&gt; varies a great deal, sometimes even within a single month (often due to special anniversary issues, but sometimes for no apparent reason).  Indices were included only from 1940–1952 and 1969–1971.  Every PDF except for 1915 starts with a title page, and further title pages appear in the middle of the 1922, 1927, and 1933 volumes.  The January issues for 1968, 1969, and 1971 are missing their covers; for the first two I will have to contact the Party to get a physical copy to scan myself.  Finally, the December 1966 issue appears twice.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8183298601630944795-794636025912696496?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2010/03/page-inventory.html</link><author>noreply@blogger.com (Tristan Miller)</author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://3.bp.blogspot.com/_Ld2fW7xry2I/S6Pb9Gd9vjI/AAAAAAAAAEQ/_7wSE-rrfdg/s72-c/issue_index.png' height='72' width='72'/><thr:total>2</thr:total></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-7609390299136056631</guid><pubDate>Wed, 17 Mar 2010 15:28:00 +0000</pubDate><atom:updated>2010-03-17T16:41:32.425+01:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>libjpeg</category><category domain='http://www.blogger.com/atom/ns#'>cropping</category><title>Page and image size analysis</title><description>&lt;p&gt;I have just discovered two anomalies regarding the page and image sizes; it remains to be seen how they will affect the cropping task.&lt;/p&gt;

&lt;p&gt;The &lt;em&gt;Socialist Standard&lt;/em&gt; has used at least five different page sizes throughout its print run.  However, as shown in the table below, the page dimension ratios don't seem to correspond with those of the scanned versions in the LSE PDFs.&lt;/p&gt;

&lt;table style="text-align: center"&gt;
&lt;tr&gt;&lt;th rowspan="2"&gt;period&lt;/th&gt;&lt;th colspan="2"&gt;physical page&lt;/th&gt;&lt;th colspan="2"&gt;scanned page&lt;/th&gt;&lt;th colspan="2"&gt;apparent DPI&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;th&gt;dimensions (mm)&lt;/th&gt;&lt;th&gt;ratio&lt;/th&gt;&lt;th&gt;dimensions (px)&lt;/th&gt;&lt;th&gt;ratio&lt;/th&gt;&lt;th&gt;horizontal&lt;/th&gt;&lt;th&gt;vertical&lt;/th&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1904-09 – 1918-08&lt;/td&gt;&lt;td&gt;242 × 460&lt;/td&gt;&lt;td&gt;0.526&lt;/td&gt;&lt;td&gt;1715 × 2670&lt;/td&gt;&lt;td&gt;0.642&lt;/td&gt;&lt;td&gt;180&lt;/td&gt;&lt;td&gt;147&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1918-09 – 1932-08&lt;/td&gt;&lt;td&gt;180 × 239&lt;/td&gt;&lt;td&gt;0.753&lt;/td&gt;&lt;td&gt;1815 × 2555&lt;/td&gt;&lt;td&gt;0.710&lt;/td&gt;&lt;td&gt;256&lt;/td&gt;&lt;td&gt;272&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1932-09 – 1950-12&lt;/td&gt;&lt;td&gt;208 × 276&lt;/td&gt;&lt;td&gt;0.754&lt;/td&gt;&lt;td&gt;1965 × 2625&lt;/td&gt;&lt;td&gt;0.749&lt;/td&gt;&lt;td&gt;240&lt;/td&gt;&lt;td&gt;242&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1951-01 – 1969-12&lt;/td&gt;&lt;td&gt;212 × 270&lt;/td&gt;&lt;td&gt;0.785&lt;/td&gt;&lt;td&gt;1715 × 2625&lt;/td&gt;&lt;td&gt;0.653&lt;/td&gt;&lt;td&gt;205&lt;/td&gt;&lt;td&gt;247&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;1970-01 – 1972-12&lt;/td&gt;&lt;td&gt;210 × 297&lt;/td&gt;&lt;td&gt;0.707&lt;/td&gt;&lt;td&gt;1725 × 2825&lt;/td&gt;&lt;td&gt;0.611&lt;/td&gt;&lt;td&gt;209&lt;/td&gt;&lt;td&gt;242&lt;/td&gt;&lt;/tr&gt;
&lt;/table&gt;
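&lt;p&gt;For reference, the ratio and apparent DPI columns can be recomputed from the two dimension columns.  The following is only a sketch in C: the function names are made up for illustration, and it assumes 25.4 mm to the inch and that the scanned dimensions span exactly the physical page.&lt;/p&gt;

```c
/* Hypothetical helpers; the names are illustrative and do not come
   from any library. */

/* Width-to-height ratio of a page, in whatever unit both are given. */
double aspect_ratio(double width, double height) {
  return width / height;
}

/* Apparent resolution in dots per inch along one axis, given the
   scanned length in pixels and the physical length in millimetres.
   Assumes the pixels span exactly that physical length. */
double apparent_dpi(double pixels, double millimetres) {
  return pixels * 25.4 / millimetres;
}
```

&lt;p&gt;For the first row of the table, &lt;code&gt;apparent_dpi(1715, 242)&lt;/code&gt; comes to about 180 and &lt;code&gt;apparent_dpi(2670, 460)&lt;/code&gt; to about 147, so the horizontal and vertical figures disagree in the same way as the two ratio columns.&lt;/p&gt;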

&lt;p&gt;I'm at a loss as to what might account for the discrepancy, especially since the scanned aspect ratio is sometimes greater and sometimes smaller than that of the original.  Keeping in mind that the microfiche was produced from photographs of bound collections of the &lt;em&gt;Socialist Standard&lt;/em&gt;, here are some possible causes:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The &lt;em&gt;Standard&lt;/em&gt;'s sheets may have been cropped to a different size for binding.&lt;/li&gt;
&lt;li&gt;The &lt;em&gt;Standard&lt;/em&gt; may have been reprinted on paper of a different size for binding.&lt;/li&gt;
&lt;li&gt;Portions of the physical sheets were obscured during photography (for instance, to hold the book open and in place for the camera), resulting in a cropped photo.&lt;/li&gt;
&lt;li&gt;The paper is much wider than it appears in the two-dimensional photographs due to the binding gutter.&lt;/li&gt;
&lt;li&gt;The horizontal and vertical DPI settings used for scanning the microfiche were not equal.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Not only are the page ratios different, but the dimensions of the entire scans (including the margins around the book and the LSE banner at the bottom) vary inexplicably.  The height is always 3102 pixels but, as can be seen in the graph below, the width varies from 3405 to 4263 pixels.  There is no obvious reason for this.&lt;/p&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://4.bp.blogspot.com/_Ld2fW7xry2I/S6D37CUZkoI/AAAAAAAAAEI/lAK6KPDExb4/s1600-h/widths.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 280px;" src="http://4.bp.blogspot.com/_Ld2fW7xry2I/S6D37CUZkoI/AAAAAAAAAEI/lAK6KPDExb4/s400/widths.png" border="0" alt=""id="BLOGGER_PHOTO_ID_5449628142725075586" /&gt;&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;To automatically extract the image dimensions, I wrote the following C program using &lt;a href="http://www.ijg.org/"&gt;libjpeg&lt;/a&gt;:&lt;/p&gt;

&lt;pre class="brush:c"&gt;
#include &amp;lt;stdio.h&gt;
#include &amp;lt;stdlib.h&gt;
#include &amp;lt;jpeglib.h&gt;

int main(int argc, char *argv[]) {

  struct jpeg_error_mgr jerr;
  struct jpeg_decompress_struct cinfo;
  FILE *infile;
  int arg = 0, status = EXIT_SUCCESS;

  /* Print usage information */
  if (argc &amp;lt;= 1) {
    fputs("Usage: jpegdims file.jpg ...\n", stderr);
    return EXIT_FAILURE;
  }

  /* For each filename on the command line */
  while (++arg &amp;lt; argc) {

    /* Open the file */
    if ((infile = fopen(argv[arg], "rb")) == NULL) {
      fprintf(stderr, "jpegdims: can't open %s\n", argv[arg]);
      status = EXIT_FAILURE;
      continue;
    }

    /* Initialize JPEG decompression */
    cinfo.err = jpeg_std_error(&amp;jerr);
    jpeg_create_decompress(&amp;cinfo);
    jpeg_stdio_src(&amp;cinfo, infile);
    (void) jpeg_read_header(&amp;cinfo, TRUE);
    (void) jpeg_start_decompress(&amp;cinfo);

    /* Both dimensions are JDIMENSION (unsigned int), hence %u */
    printf("%7u\t%7u\t%s\n", cinfo.output_width,
                             cinfo.output_height, argv[arg]);

    /* Clean up */
    jpeg_destroy_decompress(&amp;cinfo);
    fclose(infile);
  }

  return status;
}
&lt;/pre&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8183298601630944795-7609390299136056631?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2010/03/page-and-image-size-analysis.html</link><author>noreply@blogger.com (Tristan Miller)</author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://4.bp.blogspot.com/_Ld2fW7xry2I/S6D37CUZkoI/AAAAAAAAAEI/lAK6KPDExb4/s72-c/widths.png' height='72' width='72'/><thr:total>5</thr:total></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-1409630056769761981</guid><pubDate>Wed, 17 Mar 2010 12:33:00 +0000</pubDate><atom:updated>2010-03-17T13:37:18.333+01:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>hardware</category><title>New hard drive</title><description>&lt;p&gt;My new 1.5 TB external hard drive arrived today!  Considering I placed the order with standard 3–5 day shipping, I'm rather impressed that it was dispatched and delivered in less than two days.  
I'm currently in the process of moving my files over; it will be nice to be able to work now without constantly scrounging for free space.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8183298601630944795-1409630056769761981?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2010/03/new-hard-drive.html</link><author>noreply@blogger.com (Tristan Miller)</author><thr:total>0</thr:total></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-1857902510329208284</guid><pubDate>Mon, 15 Mar 2010 16:23:00 +0000</pubDate><atom:updated>2010-03-16T09:50:48.240+01:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>poppler</category><category domain='http://www.blogger.com/atom/ns#'>pdf</category><category domain='http://www.blogger.com/atom/ns#'>jpegtran</category><title>LSE PDF cropping</title><description>&lt;p&gt;Using the pdfimages tool from &lt;a href="http://poppler.freedesktop.org/"&gt;Poppler&lt;/a&gt;, I've confirmed that all the LSE PDFs consist of full-page RGB &lt;a href="http://en.wikipedia.org/wiki/Discrete_cosine_transform"&gt;DCT&lt;/a&gt; images.  There are a couple problems with this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;This format is &lt;a href="http://en.wikipedia.org/wiki/Lossy_compression"&gt;lossy&lt;/a&gt; and thus the quality of the images could suffer if I transform them.&lt;/li&gt;
&lt;li&gt;The files are needlessly large, since they are stored in full colour even though the scans are only in shades of grey.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Now, it is possible to extract the images as JPEGs and crop them &lt;a href="http://en.wikipedia.org/wiki/Lossless_compression"&gt;losslessly&lt;/a&gt; using &lt;a href="http://jpegclub.org/"&gt;jpegtran&lt;/a&gt;, but only when the upper left corner of the cropped region falls on an iMCU boundary.  Fortunately, the images are scanned at a high enough resolution that increasing the dimensions of the region by a few extra pixels to ensure boundary alignment shouldn't be a problem.  It's also possible to losslessly convert RGB JPEGs to greyscale, though the reduction in file size is negligible (for these images about 3.4%).&lt;/p&gt;
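&lt;p&gt;The boundary adjustment amounts to moving the upper left corner of the region up and to the left onto a multiple of the iMCU size, and growing the region by the same amount.  Here is a minimal sketch in C; the struct and function names are hypothetical, offsets are assumed to be non-negative, and the iMCU dimensions are passed in as parameters because they depend on how the JPEG was encoded (I am assuming 8×8 pixels without chroma subsampling and 16×16 with 2×2 subsampling).&lt;/p&gt;

```c
/* A crop region in pixels: width, height, and the offset of its upper
   left corner from the upper left corner of the image.  (Hypothetical
   type, for illustration only.) */
struct crop_region {
  long w, h, x, y;
};

/* Move the upper left corner of a crop region up and to the left onto
   the nearest multiple of the iMCU size, and enlarge the region by the
   same amount so that its lower right corner stays where it was.
   Assumes non-negative offsets and positive iMCU dimensions. */
struct crop_region align_to_imcu(struct crop_region r,
                                 long imcu_w, long imcu_h) {
  long dx = r.x % imcu_w;  /* pixels past the previous column boundary */
  long dy = r.y % imcu_h;  /* pixels past the previous row boundary */
  r.x -= dx;
  r.y -= dy;
  r.w += dx;
  r.h += dy;
  return r;
}
```

&lt;p&gt;For example, with a 16×16 iMCU, an offset of +70+13 moves to +64+0, and the region grows by 6 pixels in width and 13 in height, so the lower right corner stays put.&lt;/p&gt;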

&lt;p&gt;So the next order of business is to extract the images from all the LSE PDFs and crop them.  It's possible that the exact coordinates for the cropped regions may vary across the images, depending on the paper size (which changed throughout the &lt;em&gt;Standard&lt;/em&gt;'s run) and the positioning of the paper when the issues were originally photographed for microfiche.  I am hoping, though, that large runs of issues will use the same cropping coordinates, allowing me to do most of the work automatically rather than manually.&lt;/p&gt;

&lt;p&gt;I used the following &lt;a href="http://www.gnu.org/software/bash/bash.html"&gt;bash&lt;/a&gt; script to extract all the DCT images from the LSE PDFs as JPEGs:&lt;/p&gt;

&lt;pre class="brush:bash"&gt;
for year in {1904..1972}
do
  pdfimages -f 2 -j LSE_SocialistStandard_$year.pdf $year
done
&lt;/pre&gt;

&lt;p&gt;I'll then have to examine the resulting JPEG files manually to determine the cropping region for the left- and right-hand pages.  Once I have these, I can do batch greyscaling and cropping using the following script, where &lt;em&gt;W&lt;/em&gt; and &lt;em&gt;H&lt;/em&gt; are the pixel width and height of the region, respectively, and &lt;em&gt;X&lt;/em&gt;+&lt;em&gt;Y&lt;/em&gt; is the offset from the upper left corner of the original image:&lt;/p&gt;

&lt;pre class="brush:bash"&gt;
for f in *
do
  jpegtran -grayscale -crop W1xH1+X1+Y1 $f &gt;left-$f
  jpegtran -grayscale -crop W2xH2+X2+Y2 $f &gt;right-$f
done
&lt;/pre&gt;

&lt;p&gt;Given the amount of data I have, the above scripts can take about an hour to complete.  They also create a large amount of data, and I've found I'm running out of disk space.  I've therefore ordered a &lt;a href="http://www.amazon.co.uk/Samsung-Story-Station-1-5TB-External/dp/B002C3S6V8/ref=sr_1_1?ie=UTF8&amp;amp;s=electronics&amp;amp;qid=1268671120&amp;amp;sr=8-1"&gt;Samsung Story Station 1.5 TB USB 2.0 external hard drive&lt;/a&gt;.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8183298601630944795-1857902510329208284?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2010/03/using-pdfimages-tool-from-poppler-ive.html</link><author>noreply@blogger.com (Tristan Miller)</author><thr:total>0</thr:total></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-1793139349228271315</guid><pubDate>Mon, 15 Mar 2010 13:32:00 +0000</pubDate><atom:updated>2010-03-15T14:38:25.384+01:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>ghostscript</category><category domain='http://www.blogger.com/atom/ns#'>pdfjam</category><category domain='http://www.blogger.com/atom/ns#'>poppler-tools</category><category domain='http://www.blogger.com/atom/ns#'>pdfmod</category><category domain='http://www.blogger.com/atom/ns#'>qpdf</category><category domain='http://www.blogger.com/atom/ns#'>pdf2djvu</category><category domain='http://www.blogger.com/atom/ns#'>pdftk-qgui</category><category domain='http://www.blogger.com/atom/ns#'>gv</category><category domain='http://www.blogger.com/atom/ns#'>poppler</category><category domain='http://www.blogger.com/atom/ns#'>pdf</category><category domain='http://www.blogger.com/atom/ns#'>pdftk</category><category domain='http://www.blogger.com/atom/ns#'>pstoedit</category><category domain='http://www.blogger.com/atom/ns#'>pspdftool</category><title>PDF software 
inventory</title><description>Here's an inventory of the PDF manipulation software I have at my disposal for the project.  All of these packages are &lt;a href="http://www.gnu.org/philosophy/free-sw.html"&gt;Free Software&lt;/a&gt; in the sense that the user has the freedom to run, study, modify, and redistribute the program.

&lt;dl&gt;&lt;dt&gt;&lt;a href="http://okular.kde.org/"&gt;Okular&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;PDF viewer (GUI)&lt;/dd&gt;&lt;dt&gt;&lt;a href="http://www.gnu.org/software/gv/"&gt;GNU gv&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;PDF viewer (GUI)&lt;/dd&gt;&lt;dt&gt;&lt;a href="http://www.accesspdf.com/pdftk/"&gt;pdftk&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;merges and splits PDFs, rotates pages, edits metadata (command-line; GUI via pdftk-qgui)&lt;/dd&gt;&lt;dt&gt;&lt;a href="http://go.warwick.ac.uk/pdfjam"&gt;PDFjam&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;merges, rotates, and n-ups PDFs (command-line)&lt;/dd&gt;&lt;dt&gt;&lt;a href="http://live.gnome.org/PdfMod"&gt;PdfMod&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;adds, reorders, rotates, and removes pages; exports images; edits metadata (GUI)&lt;/dd&gt;&lt;dt&gt;&lt;a href="http://poppler.freedesktop.org/"&gt;poppler-tools&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;lists fonts; extracts images; prints metadata; converts PDF to HTML, PPM, PostScript, or text (command-line)&lt;/dd&gt;&lt;dt&gt;&lt;a href="http://qpdf.sourceforge.net/"&gt;QPDF&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;linearizes (web-optimizes) PDFs, various other low-level transformations (command-line)&lt;/dd&gt;&lt;dt&gt;&lt;a href="http://sourceforge.net/projects/pspdftool/"&gt;pspdftool&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;rearranges, deletes, scales, flips, numbers, crops, rotates, n-ups pages (command-line)&lt;/dd&gt;&lt;dt&gt;&lt;a href="http://www.pstoedit.net/"&gt;pstoedit&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;scales, shifts, splits, rotates, and resizes pages; converts PDF to other vector formats (command-line)&lt;/dd&gt;&lt;dt&gt;&lt;a href="http://pages.cs.wisc.edu/~ghost/"&gt;Ghostscript&lt;/a&gt;&lt;/dt&gt;&lt;dd&gt;PDF language interpreter; can do pretty much everything, though not necessarily easily (command-line)&lt;/dd&gt;&lt;/dl&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' 
src='https://blogger.googleusercontent.com/tracker/8183298601630944795-1793139349228271315?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2010/03/pdf-software-inventory.html</link><author>noreply@blogger.com (Tristan Miller)</author><thr:total>0</thr:total></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-2344611397948523912</guid><pubDate>Mon, 15 Mar 2010 11:53:00 +0000</pubDate><atom:updated>2010-03-15T16:58:43.270+01:00</atom:updated><category domain='http://www.blogger.com/atom/ns#'>bugs</category><category domain='http://www.blogger.com/atom/ns#'>dolphin</category><category domain='http://www.blogger.com/atom/ns#'>okular</category><category domain='http://www.blogger.com/atom/ns#'>poppler</category><category domain='http://www.blogger.com/atom/ns#'>gv</category><category domain='http://www.blogger.com/atom/ns#'>pdf</category><title>PDF viewing woes</title><description>&lt;p&gt;I've been running into problems with the programs I use to view the LSE PDFs.  For one thing, these PDFs are an average of 261 MB in size, whereas my file manager, &lt;a href="http://dolphin.kde.org/"&gt;Dolphin&lt;/a&gt;, won't generate previews for files larger than 100 MB.  This is a rather annoying and arbitrary upper limit, especially considering that it takes only a few milliseconds to generate thumbnails for 100 MB PDFs.  I've accordingly filed &lt;a title="KDE Bug 230820" href="https://bugs.kde.org/show_bug.cgi?id=230820"&gt;a bug report asking that the limit be removed&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The other problem is that my usual PDF viewer, &lt;a href="http://okular.kde.org/"&gt;Okular&lt;/a&gt;, is phenomenally slow at rendering the pages of the LSE PDFs—it takes between 10 and 25 seconds per page.  (By comparison, the proprietary &lt;a href="http://www.adobe.com/products/reader/"&gt;Adobe Reader&lt;/a&gt; renders them almost instantly.)  Okular, like many other &lt;a href="http://www.gnu.org/philosophy/free-sw.html"&gt;Free Software&lt;/a&gt; document viewers, renders PDFs using the FreeDesktop project's &lt;a href="http://poppler.freedesktop.org/"&gt;Poppler&lt;/a&gt; library, and it is there that the problem lies.  Most likely this is due to a known issue, &lt;a href="https://bugs.freedesktop.org/show_bug.cgi?id=13518" title="FreeDesktop Bug 13518"&gt;Bug 13518&lt;/a&gt;.  For now, then, I will be using &lt;a href="http://www.gnu.org/software/gv/"&gt;GNU gv&lt;/a&gt;, which isn't based on Poppler and is able to render the LSE PDFs quickly.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8183298601630944795-2344611397948523912?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2010/03/pdf-viewing-woes.html</link><author>noreply@blogger.com (Tristan Miller)</author><thr:total>2</thr:total></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-9053434479617438319</guid><pubDate>Sun, 14 Mar 2010 23:59:00 +0000</pubDate><atom:updated>2010-03-15T17:00:54.506+01:00</atom:updated><title>The state of things</title><description>&lt;p&gt;My project to digitize the &lt;em&gt;Standard&lt;/em&gt; began in 2005.  On 25 and 26 December of that year, Norbert Sanden and I, who possessed all issues dating back to February 1970, manually scanned them on a pair of high-end Ricoh &lt;a href="http://en.wikipedia.org/wiki/Multifunction_printer"&gt;office MFPs&lt;/a&gt; in Kaiserslautern.  
To save scanning time, and to avoid destroying the originals, we scanned two-page spreads into multi-page black-and-white &lt;a href="http://en.wikipedia.org/wiki/TIFF"&gt;TIFFs&lt;/a&gt;, usually one issue per file.  This means that in these scans, all the pages are in order, except for the front and back pages.  Additionally, from about April 1986 onwards, the cover sheet (including the front and back cover) was printed in &lt;a href="http://en.wikipedia.org/wiki/Spot_colour"&gt;spot colour&lt;/a&gt; (black plus one or two coloured inks), so the covers for these issues were scanned separately as colour &lt;a href="http://en.wikipedia.org/wiki/JPEG"&gt;JPEGs&lt;/a&gt;.  In total I have 2.3 GB in 1-bit 600 dpi TIFF files, and a further 2.6 GB in 300 dpi JPEGs.&lt;/p&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://2.bp.blogspot.com/_Ld2fW7xry2I/S514oEPGNRI/AAAAAAAAADw/-zJhbwHCndM/s1600-h/tiffscans.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 240px;" src="http://2.bp.blogspot.com/_Ld2fW7xry2I/S514oEPGNRI/AAAAAAAAADw/-zJhbwHCndM/s400/tiffscans.png" border="0" alt="" title="The TIFF scans as shown in my file browser" id="BLOGGER_PHOTO_ID_5448643753915331858" /&gt;&lt;div style="text-align:center; font-size: 80%;"&gt;The TIFF scans as shown in my file browser&lt;/div&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a onblur="try {parent.deselectBloggerImageGracefully();} catch(e) {}" href="http://3.bp.blogspot.com/_Ld2fW7xry2I/S516sFnzdcI/AAAAAAAAAD4/E5H8bus-TAw/s1600-h/jpegscans.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 240px;" src="http://3.bp.blogspot.com/_Ld2fW7xry2I/S516sFnzdcI/AAAAAAAAAD4/E5H8bus-TAw/s400/jpegscans.png" border="0" alt="" title="The JPEG scans as shown in my file browser" id="BLOGGER_PHOTO_ID_5448646022030128578" /&gt;&lt;div style="text-align:center; font-size: 80%;"&gt;The JPEG scans as shown in my file browser&lt;/div&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I moved to London in January 2006 and, with access to the Party's archives, intended to continue manually scanning issues back to 1904.  The Party informed me that the 1904 to 1972 issues of the &lt;em&gt;Standard&lt;/em&gt; had been microfiched by a third party, and that it might be easier to arrange for scanning directly from the microfiche.  I set about calling various libraries to see if I could obtain a copy of the microfiche, and various document archiving companies to inquire about microfiche scanning costs.  The first library I called, that of the &lt;a href="http://www2.lse.ac.uk/library/Home.aspx"&gt;London School of Economics&lt;/a&gt;, informed me that they were actually in the process of scanning much of their microfilm and microfiche collections anyway, and offered to include the &lt;em&gt;Standard&lt;/em&gt; in this project and provide the Party with a copy for a small fee (about £20).  I agreed to this as this was far cheaper than arranging for the scanning through a company, and far less effort than manually scanning the printed issues.  However, the scanning project at LSE proceeded very slowly, and the scans weren't made available to me until several years later—namely, a few weeks ago.&lt;/p&gt;

&lt;p&gt;I just got around to looking at the LSE scans yesterday.  They're on 6 DVD-ROMs, and comprise 18 GB in 69 PDF files, one for each year from 1904 to 1972.  The PDFs consist of two-page spreads scanned at 200 dpi greyscale; unlike the images I scanned, the front and back covers are in the correct order.  The scans have not been OCR'd, and I haven't yet determined how the PDFs encode the image data; possibly it is JPEG.  Each file includes a title page identifying the year of the archive, and some also include the index published in bound volumes of the &lt;em&gt;Standard&lt;/em&gt;.  All pages also include a banner at the bottom with the text, "London School of Economics &amp; Political Science 2007 / Socialist Standard &lt;em&gt;xxxx&lt;/em&gt;", where &lt;em&gt;xxxx&lt;/em&gt; is the year.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://2.bp.blogspot.com/_Ld2fW7xry2I/S516_T2ptBI/AAAAAAAAAEA/S04h0sRiZvU/s1600-h/1950pdf.png"&gt;&lt;img style="display:block; margin:0px auto 10px; text-align:center;cursor:pointer; cursor:hand;width: 400px; height: 240px;" src="http://2.bp.blogspot.com/_Ld2fW7xry2I/S516_T2ptBI/AAAAAAAAAEA/S04h0sRiZvU/s400/1950pdf.png" border="0" alt="" id="BLOGGER_PHOTO_ID_5448646352268014610"  title="Sample pages from one of the LSE PDFs" /&gt;&lt;div style="text-align:center; font-size: 80%;"&gt;Sample pages from one of the LSE PDFs&lt;/div&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Issues from 1998 onwards have been typeset digitally and are available as PDFs.  I should be able to obtain these directly from the Party's &lt;em&gt;Socialist Standard&lt;/em&gt; production team.  The Party also has a basic electronic index for the &lt;em&gt;Standard&lt;/em&gt; (probably including only title and author data, but possibly also subjects) which I hope to obtain later.  The index wasn't professionally produced, so I doubt it will be of much use when it comes to looking for specific subjects.  Since making a proper subject index would be a tremendous undertaking, I hope that OCR plus full text search will serve as a reasonable substitute for the time being.&lt;/p&gt;

&lt;p&gt;The next step will be to determine how best to crop the LSE scans such that there is a single physical page per image and no LSE footer.  This will be the subject of an upcoming post.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8183298601630944795-9053434479617438319?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2010/03/state-of-things.html</link><author>noreply@blogger.com (Tristan Miller)</author><media:thumbnail xmlns:media='http://search.yahoo.com/mrss/' url='http://2.bp.blogspot.com/_Ld2fW7xry2I/S514oEPGNRI/AAAAAAAAADw/-zJhbwHCndM/s72-c/tiffscans.png' height='72' width='72'/><thr:total>7</thr:total></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-2568742191515070969</guid><pubDate>Sun, 14 Mar 2010 16:31:00 +0000</pubDate><atom:updated>2010-03-15T16:58:16.794+01:00</atom:updated><title>Goals</title><description>&lt;p&gt;The goal of this project is to produce a digital archive of the &lt;em&gt;Socialist Standard&lt;/em&gt; which includes all issues from 1904 to the present, and which will be made available on DVD-ROM and online.  The archive will include a user-friendly interface for finding and viewing the issues, and should be accessible with any modern computer.&lt;/p&gt;

&lt;p&gt;In practice, this means that the issues will be provided in an accessible document archive format, such as &lt;a href="http://en.wikipedia.org/wiki/PDF"&gt;PDF&lt;/a&gt; or &lt;a href="http://en.wikipedia.org/wiki/DjVu"&gt;DjVu&lt;/a&gt;.  Older issues will need to be scanned and &lt;a href="http://en.wikipedia.org/wiki/Optical_character_recognition"&gt;OCR&lt;/a&gt;'d.  The interface will most likely take the form of a set of HTML pages viewed with a web browser, as this is the easiest way to ensure cross-platform compatibility.  The index will list the issues by date and cover image.  Ideally there will also be an index of article titles, authors, and subjects, as well as a searchable full-text index.&lt;/p&gt;

&lt;p&gt;Some of the goals, such as full-text searching, may not be achievable in the short term, so I plan to create various editions of the archive.  The first edition may include just the non-OCR'd issues and a simple index; later editions can add other features as work on them is completed.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8183298601630944795-2568742191515070969?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2010/03/goals.html</link><author>noreply@blogger.com (Tristan Miller)</author><thr:total>0</thr:total></item><item><guid isPermaLink='false'>tag:blogger.com,1999:blog-8183298601630944795.post-4023483830176692898</guid><pubDate>Sun, 14 Mar 2010 15:45:00 +0000</pubDate><atom:updated>2010-03-15T16:58:59.452+01:00</atom:updated><title>An introduction</title><description>&lt;p&gt;The &lt;a href="http://www.worldsocialism.org/spgb/"&gt;Socialist Party of Great Britain&lt;/a&gt; has been publishing its journal, the &lt;em&gt;Socialist Standard&lt;/em&gt;, without interruption since 1904.  This blog documents my project to produce a complete digital archive of the &lt;em&gt;Standard&lt;/em&gt;.  By doing so I hope to help myself and others track my progress, and to provide insight for fellow would-be amateur archivists into the process and challenges of digitizing a large newspaper archive.&lt;/p&gt;&lt;div class="blogger-post-footer"&gt;&lt;img width='1' height='1' src='https://blogger.googleusercontent.com/tracker/8183298601630944795-4023483830176692898?l=ssdigit.nothingisreal.com' alt='' /&gt;&lt;/div&gt;</description><link>http://ssdigit.nothingisreal.com/2010/03/introduction.html</link><author>noreply@blogger.com (Tristan Miller)</author><thr:total>0</thr:total></item></channel></rss>
