HMMER 3.2 release


The glorious master plan was to finish HMMER4 while hoping that HMMER3 stayed stable. Alas, HMMER4 development has been even slower than expected, and bugs and bitrot have accumulated on HMMER3. Here’s a new HMMER 3.2 release to tide us all over. I’m managing HMMER releases again, with Travis Wheeler having moved a while ago to a faculty position at U. Montana.

You can get the HMMER3.2.1 source tarball from here.

Graeme Mitchison

Astronomy began when the Babylonians mapped the heavens. Our descendants will certainly not say that biology began with today’s genome projects, but they may well recognize that a great acceleration in the accumulation of biological knowledge began in our era.

Graeme Mitchison wrote those opening lines of our book Biological Sequence Analysis in Richard Durbin’s parents’ house in London. We four coauthors had borrowed the house for a month to write together, knowing that we had to get Richard out of the Sanger Centre or no progress would be made. The living room looked like a spy ring’s safe house, drapes drawn and full of improvised desks, computers, printer, and papers. We paired off in warring alliances to write, to cook, to argue, and to take long walks on Hampstead Heath to cool down. At one point over a late dinner and wine, Anders Krogh proposed that one could make a hidden Markov model to recognize each of our writing styles. Richard proposed that mine could be recognized trivially by a high emission probability of the word “simple”. I recall snapping something back. I was struggling to draft our introduction and feeling defensive. At some point Graeme took it from me and in a few strokes replaced my clumsy efforts with the chapter that began with the beautiful lines above.


Fall 2018 MCB112 teaching fellows

I’m looking for four teaching fellows (TFs) for my course MCB112 Biological Data Analysis in the fall 2018 semester. TFs are typically Harvard G2 or G3 students (second- or third-year PhD students, in Harvard-speak), but can be more senior students or even postdocs. I teach the course in Python and Jupyter Notebook, using numpy and pandas, so experience in these things is a plus. Email me if you’re interested, or if you know someone else at Harvard who might be interested, let them know.

Thread count in HMMER3

It’s come to my attention (one helpful email, plus some snarky subtweets) that the --cpu flag of HMMER3 search programs may have a bad interaction with some cluster management software.

The --cpu n argument is documented as the “number of parallel CPU workers to use for multithreads”. Typically, you want to set n to the number of CPUs you want to use on average. But n is not the total number of threads that HMMER creates, because HMMER may also create additional threads that aren’t CPU-intensive. Most HMMER3 programs currently use n+1 threads, I believe, and the HMMER4 prototype currently uses n+3.

The reason for the +1 is that we have a master/worker parallelization scheme, with one master and n workers. The master is disk intensive (responsible for input/output), and the workers are CPU intensive.

The reason for the +3 is that we are making more and more use of a technique called asynchronous threaded input to accelerate reading of data from disk. We fork off a thread dedicated to reading input, and it reads ahead while other stuff is happening. Another thread, in our current design, is there for decompression, if necessitated by the input file.

Apparently some cluster management software requires that you state the maximum number of CPUs your job will use, and if the job ever uses more than that, your job is halted or killed. So if HMMER starts n+1 threads, and the +1 thread — however CPU-nonintensive it may be — gets allocated to a free CPU outside your allotment of n, then your job is halted or killed. Which is understandably annoying.

The workaround with HMMER3 is to tell your cluster management software that you need a maximum of n+1 CPUs, when you tell HMMER --cpu n. You won’t use all n+1 CPUs efficiently (at best, you’ll only use n of them on average), but then, HMMER3 is typically i/o bound on standard filesystems, so it doesn’t scale to more than 2-4 CPUs well anyway.
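Concretely, on a SLURM cluster (flag names differ under SGE, LSF, and other schedulers, and the filenames here are placeholders) the bookkeeping might look like:

```shell
#!/bin/bash
#SBATCH -c 5                 # reserve n+1 = 5 CPUs for the job...

hmmsearch --cpu 4 profile.hmm seqdb.fa > results.out   # ...for n = 4 workers
```

The extra reserved CPU covers the master thread, so the scheduler never sees the job exceed its allocation.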

I find it hard to believe that cluster management tools aren’t able to deal smartly with multithreaded software that combines CPU-intensive and i/o-intensive threads. I presume that there’s a good reason for these policies, and/or ways for cluster managers to configure or tune appropriately. I’m open to suggestions and pointers.