Digitizing the Socialist Standard archive involves running CPU-bound image processing tools on a large number of files. Since I've got a multicore CPU, it makes sense to run such operations in parallel rather than one after another. (A good rule of thumb I've heard is to always have twice as many processes running as you have cores or CPUs.) Up until now, I've been coding each batch of tasks in a makefile, and then invoking make with the -j argument for parallel execution. Needless to say, this is a bit inconvenient when I just have a one-off batch job to run, and it also prevents me from developing and testing bash-scripted tasks from the command line. For years I've wished that bash's looping statements could be parameterized by the number of loop bodies to run simultaneously. For example, instead of writing

    for x in a b c d e f; do somehugecommand $x; done

and waiting for somehugecommand to run six times, one after the other, I want to be able to write something like

    for x in a b c d e f; do -j3 somehugecommand $x; done

and have three instances of somehugecommand launch and run simultaneously.
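For comparison, plain bash can approximate this with background jobs, though it takes some bookkeeping. The following is only a minimal sketch, assuming bash 4.3 or later (for wait -n); the helper name run_limited and the stand-in body of somehugecommand are made up for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: run "<cmd> <item>" for every item, at most <max> at a time.
# Needs bash 4.3 or later for `wait -n`.
run_limited() {
    local max=$1 cmd=$2 running=0 item
    shift 2
    for item in "$@"; do
        if [ "$running" -ge "$max" ]; then
            wait -n                     # block until any one background job exits
            running=$((running - 1))
        fi
        "$cmd" "$item" &                # start the next job in the background
        running=$((running + 1))
    done
    wait                                # let the remaining jobs finish
}

# Stand-in for somehugecommand so the sketch runs anywhere.
somehugecommand() { sleep 0.1; echo "done $1"; }

run_limited 3 somehugecommand a b c d e f
```

The order of the output lines depends on which job finishes first, so it can differ from run to run.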
Well, apparently such a tool has existed for many years now, but no one told me about it. It's called GNU Parallel, and it works much like the old familiar xargs from GNU Findutils. You pass it a list of values on stdin, one per line, and pass as command-line arguments a command line to execute. As with xargs and its -I option, the character sequence {} gets replaced with the values from stdin. And of course, you also tell it how many simultaneous jobs to run with the -j option, just like with GNU Make. For example, whereas before I was calling the Tesseract OCR software on one file at a time with

    for f in $list_of_images; do tesseract $f.png $f -l eng hocr; done

I'm now executing them in parallel with

    printf '%s\n' $list_of_images | parallel -j4 tesseract {}.png {} -l eng hocr

(printf rather than echo, because parallel treats each input line as one value). What a fantastically useful utility!
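To try the stdin-plus-{} interface without an OCR setup, something like the following sketch may help. It assumes GNU Parallel is installed; the wrapper name ocr_batch and the page names are hypothetical, and echo stands in for actually running tesseract, so the script only prints the command lines that would be executed.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: build one tesseract command line per base name,
# running up to <jobs> at once. echo stands in for really invoking tesseract.
ocr_batch() {
    local jobs=$1
    shift
    # Check for the GNU tool specifically (moreutils ships a different `parallel`).
    if ! parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
        return 127
    fi
    # One value per line on stdin; {} is replaced by each value.
    # -k (--keep-order) prints results in input order even if jobs finish out of order.
    printf '%s\n' "$@" | parallel -j"$jobs" -k echo tesseract {}.png {} -l eng hocr
}

ocr_batch 4 page1 page2 page3 || echo "skipped: GNU Parallel not found" >&2
```

Dropping the echo would run tesseract for real. The -k flag is optional; it is used here only so the output order matches the input order.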
As might be surmised from its name, GNU Parallel is an official GNU project, so it's surprising that it's not better known and more widely available. (For example, as of this writing it's not packaged by openSUSE or other major distributions.) GNU Parallel's web page has some background which explains why:
In the years after 2005… I tried getting parallel accepted into GNU findutils. It was not accepted as it was written in Perl and the team did not want GNU findutils to depend on Perl…
In February 2009 I tried getting parallel added to the package moreutils. The author never replied to the email or the two reminders…
In 2010 parallel was adopted as an official GNU tool and the name was changed to GNU parallel. As GNU already had a tool for running jobs on remote computers (called pexec) it was a hard decision to include GNU parallel as well. I believe the decision was mostly based on GNU parallel having a more familiar user interface - behaving very much like xargs. Shortly after the release as GNU tool remote execution was added and all missing options from xargs were added to make it possible to use GNU parallel as a drop in replacement for xargs.
So to Ole Tange, the author of GNU Parallel, I just want to say thank you for this wonderful utility, and I'm sorry that you had so much trouble getting it adopted into a GNU package.