03 September 2012

GNU Parallel, where have you been all my life?

Digitizing the Socialist Standard archive involves running CPU-bound image processing tools on a large number of files. Since I've got a multicore CPU, it makes sense to run such operations in parallel rather than one after another. (A good rule of thumb I've heard is to always have twice as many processes running as you have cores or CPUs.) Up until now, I've been coding each batch of tasks in a makefile, and then invoking make with the -j argument for parallel execution. Needless to say, this is a bit inconvenient when I just have a one-off batch job to run, and it also prevents me from developing and testing bash-scripted tasks from the command line. For years I've wished that bash's looping statements could be parameterized by the number of loop bodies to run simultaneously. For example, instead of writing for x in a b c d e f;do somehugecommand $x;done and waiting for somehugecommand to run six times, one after the other, I want to be able to write something like for x in a b c d e f;do -j3 somehugecommand $x;done and have three instances of somehugecommand launch and run simultaneously.

Well, apparently such a tool has existed for many years now, but no one told me about it. It's called GNU Parallel, and it works much like the old familiar xargs from GNU Findutils. You pass it a list of values on stdin, one per line, and give it the command to execute as its command-line arguments. As with xargs -I {}, the character sequence {} gets replaced with each value from stdin. And of course, you also tell it how many simultaneous jobs to run with the -j option, just like with GNU Make. For example, whereas before I was calling the Tesseract OCR software on one file at a time with for f in $list_of_images;do tesseract $f.png $f -l eng hocr;done, I'm now executing them in parallel with printf '%s\n' $list_of_images | parallel -j4 tesseract {}.png {} -l eng hocr. (The printf puts each name on its own line; a plain echo would hand Parallel the whole list as a single value.) What a fantastically useful utility!
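To make the stdin-plus-{} pattern concrete, here is a minimal sketch. The function name run_batch and the echo stand-in for tesseract are hypothetical, chosen so the example is safe to run anywhere. As a further assumption, it falls back to xargs -P when GNU Parallel is not installed, which illustrates how similar the two interfaces are.

```shell
# Usage: run_batch JOBS NAME...
# Hypothetical helper: "echo processed" stands in for the real per-file
# command (tesseract in the post).
run_batch() {
    njobs=$1
    shift
    if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
        # One value per input line; {} is replaced by each value in turn.
        printf '%s\n' "$@" | parallel -j"$njobs" echo processed {}
    else
        # Assumed fallback when GNU Parallel is absent: xargs with -P.
        printf '%s\n' "$@" | xargs -P "$njobs" -I {} echo processed {}
    fi
}

# Jobs may finish in any order, so sort the output before reading it.
run_batch 4 page1 page2 page3 | sort
```

Swapping echo processed {} for tesseract {}.png {} -l eng hocr gives the command from the post.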

As might be surmised from its name, GNU Parallel is an official GNU project, so it's surprising that it's not better known and more widely available. (For example, it's not packaged by openSUSE or other major distributions.) GNU Parallel's web page has some background which explains why:

In the years after 2005… I tried getting parallel accepted into GNU findutils. It was not accepted as it was written in Perl and the team did not want GNU findutils to depend on Perl…
In February 2009 I tried getting parallel added to the package moreutils. The author never replied to the email or the two reminders…
In 2010 parallel was adopted as an official GNU tool and the name was changed to GNU parallel. As GNU already had a tool for running jobs on remote computers (called pexec) it was a hard decision to include GNU parallel as well. I believe the decision was mostly based on GNU parallel having a more familiar user interface - behaving very much like xargs. Shortly after the release as GNU tool remote execution was added and all missing options from xargs were added to make it possible to use GNU parallel as a drop in replacement for xargs.

So to Ole Tange, the author of GNU Parallel, I just want to say thank you for this wonderful utility, and I'm sorry that you had so much trouble getting it adopted into a GNU package.

5 comments:

  1. """Well, apparently such a tool has existed for many years now, but no one told me about it."""

    Who should have told you? In other words: What information channels affect you?

    GNU Parallel is now officially accepted for:

    * Fedora https://admin.fedoraproject.org/pkgdb/acls/name/parallel
    * Debian Unstable http://packages.debian.org/unstable/utils/parallel
    * Ubuntu Quantal http://packages.ubuntu.com/quantal/parallel
    * FreeBSD http://www.freshports.org/sysutils/parallel

    Packages for other distributions: https://build.opensuse.org/package/show?package=parallel&project=home%3Atange

  2. Hi Ole! I have no idea how you found my post so quickly. Evidently you are much better at finding information on GNU Parallel than I am. ;)

    I had been using the openSUSE package from the openSUSE Build Service page you linked to. However, as I understand it these are just user-contributed builds; Parallel isn't in any of the official openSUSE repositories (at least, not for the 11.4 version I'm running at home). My machine at work runs Ubuntu 10.04 LTS, and Parallel isn't in the official repositories for that either. I'm glad to see it's been officially accepted for some other distributions, though.

    As to who ought to have told me about it—well, no one, really; I guess it was a tongue-in-cheek comment. When such a great tool is developed I usually end up hearing about it one way or another. I would have expected to read about it on comp.unix.shell, but I haven't read any newsgroups ever since switching to an ISP without a Usenet feed. Or failing that, I would have just expected it to show up in my distributions' official package repositories; I browse the packages from time to time for utilities that look useful. So how did I actually discover Parallel? IIRC it was a few days ago from searching StackOverflow about parallel execution in bash.

  3. You can help by requesting openSUSE to adopt it officially.

  4. You could just do

    for x in a b c d e f;do (somehugecommand $x &);done

    Replies
    1. For just six arguments, maybe. For hundreds or thousands, what you posted is probably the fastest way to bring a computer to a thrashing halt.
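The point of this exchange can be sketched in plain shell. Backgrounding every job at once puts no limit on how many run together, whereas the hypothetical helper below (run_capped and fake_job are illustrative names, not from the thread) launches jobs in batches of at most MAX and waits for each batch to finish. It is a rough approximation only: slots sit idle until the slowest job in a batch completes, whereas GNU Parallel starts a new job as soon as any one finishes.

```shell
# Usage: run_capped MAX CMD ITEM...
# Runs "CMD ITEM" for every item, with at most MAX jobs alive at once.
run_capped() {
    max=$1
    cmd=$2
    shift 2
    running=0
    for item in "$@"; do
        "$cmd" "$item" &
        running=$((running + 1))
        if [ "$running" -ge "$max" ]; then
            wait        # block until the whole batch has finished
            running=0
        fi
    done
    wait                # collect jobs left in the final partial batch
}

# Hypothetical stand-in for somehugecommand.
fake_job() {
    echo "done $1"
}

# Completion order within a batch is not guaranteed, hence the sort.
run_capped 3 fake_job a b c d e f | sort
```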
