The glorious master plan was to finish HMMER4 while hoping that HMMER3 stayed stable. Alas, HMMER4 development has been even slower than expected, and bugs and bitrot have accumulated on HMMER3. Here’s a new HMMER 3.2 release to tide us all over. I’m managing HMMER releases again, with Travis Wheeler having moved a while ago to a faculty position at U. Montana.
It’s come to my attention (one helpful email, plus some snarky subtweets) that the
--cpu flag of HMMER3 search programs may have a bad interaction with some cluster management software.
--cpu n argument is documented as the “number of parallel CPU workers to use for multithreads”. Typically, you want to set
n to the number of CPUs you want to use on average. But it is not the total number of threads that HMMER creates, because HMMER may also create additional threads that aren’t CPU-intensive. The number of threads that most HMMER3 programs will use is currently
n+1, I believe. The HMMER4 prototype is currently using
n+3, I believe.
The reason for the
+1 is that we have a master/worker parallelization scheme, with one master and n workers. The master is disk intensive (responsible for input/output), and the workers are CPU intensive.
The reason for the
+3 is that we are making more and more use of a technique called asynchronous threaded input to accelerate reading of data from disk. We fork off a thread dedicated to reading input, and it reads ahead while other stuff is happening. Another thread, in our current design, is there for decompression, if necessitated by the input file.
Apparently some cluster management software requires that you state the maximum number of CPUs your job will use, and if the job ever uses more than that, your job is halted or killed. So if HMMER starts n+1 threads, and the +1 thread — however CPU-nonintensive it may be — gets allocated to a free CPU outside your allotment of n, then your job is halted or killed. Which is understandably annoying.
The workaround with HMMER3 is to tell your cluster management software that you need a maximum of
n+1 CPUs, when you tell HMMER
--cpu n. You won’t use all n+1 CPUs efficiently (at best, you’ll only use n of them on average), but then, HMMER3 is typically i/o bound on standard filesystems, so it doesn’t scale to more than 2-4 CPUs well anyway.
I find it hard to believe that cluster management tools aren’t able to deal smartly with multithreaded software that combines CPU-intensive and i/o-intensive threads. I presume that there’s a good reason for these policies, and/or ways for cluster managers to configure or tune appropriately. I’m open to suggestions and pointers.
We have two new open positions in the group. We are looking for a bioinformatics analyst and a scientific software engineer to join the teams that develop the HMMER and Infernal software packages for biological sequence analysis. The teams are growing, with a key new team member Nick Carter, who joined us from high performance computing research at Intel and a previous computer science faculty position at U Illinois. We’re aiming in particular to bring out the next release of HMMER, the long-fabled HMMER4. These positions offer the short-term opportunity to help us bring these existing projects to the next level, and a longer-term opportunity to participate in a variety of fundamental computational biology algorithms research and software development.
The scientific software engineer will work most closely with Nick and myself on HMMER, and later on Infernal in collaboration with Eric Nawrocki (NIH/NCBI). Our codebases are ANSI C99, we take advantage of SIMD vectorization instructions on multiple platforms, and we’re working hard on parallelization efficiency with multithreading (POSIX threads) and message passing (MPI), so we’re looking for someone with expertise in these technologies. We currently work primarily on Apple OS/X and Linux platforms ourselves, but our code has to build and work on any POSIX-compliant platform, so we’re also looking to expand our automated build/test procedures. Our codebases are open source and we work with standard open source tools such as git, autoconf, and GNU make, and development is distributed amongst a small group of people throughout the world, especially including collaborators at NIH NCBI, U Montana, HHMI Janelia Farm, and Cambridge UK; you can see our github repos here. The official advertisement for the position is online here, with instructions on how to apply.
The bioinformatics analyst will work most closely with our bioinformatics secret agent man Tom Jones, Nick, and myself. We’re looking for someone to bridge the gap between the computational engineering and the biological applications, someone who will have one hand in the development team (working on how our command line interfaces work, and how our output formats come out), and another hand on our collaborations within Harvard and with the outside world using and testing our software on real problems. We want someone thinking about benchmarking and testing, with expertise in a scripting language (Python, Perl); we want someone working on user-oriented documentation and tutorials (Jekyll, Markdown); we want someone thinking about ease and elegance of use; we want someone working on how our tools play well with others, including Galaxy and BioConductor, and liasing between us and the various package managers who bundle our software (such as Homebrew, MacPorts, all the various Linuxen). The official ad and instructions to apply are online here.
We are reading applications now, and will accept applications in a rolling fashion for at least another month or so; beyond that depends on how our candidate pool looks. The positions are available immediately. Our funding for them has just activated, supported by the NIH NHGRI under the NIH’s program (PA-14-156) for Extended Development, Hardening and Dissemination of Technologies in Biomedical Computing, Informatics, and Big Data Science; we gratefully acknowledge this new support, as I’ve rejoined the NIH community after ten years away at a monastery.
HMMER web servers were officially launched today at the EMBL European Bioinformatics Institute (EBI) in Cambridge UK. You can read an EBI press release here. This marks the completion of the pilot HMMER server project at Janelia Farm and its transition to the EBI. All of this has been led by Rob Finn, now the head of sequence family resources at EBI. A huge thank you goes out to Rob and his team at Janelia (Jody Clements, Bill Arndt, and Ben Miller), to HHMI for funding the pilot project, and to the EBI for agreeing to adopt the pilot project and making it all grown up and respectable.
Today we’ve separated the code development home (hmmer.org) from the servers (www.ebi.ac.uk/Tools/hmmer/). Nonetheless, the two sites are pointing back and forth at each other, so you can download the current HMMER release and documentation from EBI, and you navigate to EBI’s search pages starting from hmmer.org. We even think that RESTful URLs for the pilot servers at hmmer.org will continue to be forwarded and served properly by the new EBI servers. Let us know if you experience any glitches.
Rob’s team at EBI will run the EBI servers, and the Eddy/Rivas lab will continue to be responsible for hmmer.org. Because of the terrifyingly sophisticated planning processes we employ in the HMMER project, or maybe it’s just a coincidence, the EBI announcement comes just days before our move to Harvard. Everything HMMER-related at Janelia will now wind down quickly over the next few months. A big change for us. If you’re used to using hmmer.janelia.org, switch to using the project’s permanent URL at hmmer.org. We’re about to turn out the lights here at Janelia.