Thread count in HMMER3

It’s come to my attention (one helpful email, plus some snarky subtweets) that the --cpu flag of HMMER3 search programs may have a bad interaction with some cluster management software.

The --cpu n argument is documented as the “number of parallel CPU workers to use for multithreads”. Typically, you want to set n to the number of CPUs you want to use on average. But n is not the total number of threads that HMMER creates, because HMMER may also create additional threads that aren’t CPU-intensive. The total number of threads that most HMMER3 programs use is currently n+1, I believe; the HMMER4 prototype currently uses n+3.

The reason for the +1 is that we have a master/worker parallelization scheme, with one master and n workers. The master is disk intensive (responsible for input/output), and the workers are CPU intensive.

The reason for the +3 is that we are making more and more use of a technique called asynchronous threaded input to accelerate reading of data from disk. We fork off a thread dedicated to reading input, and it reads ahead while other stuff is happening. Another thread, in our current design, is there for decompression, if necessitated by the input file.
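
To make that concrete, here is a minimal C/pthreads sketch of the read-ahead idea: a dedicated input thread fills a shared one-block buffer while a consuming thread empties it, so i/o overlaps with computation. This is illustrative only, not HMMER’s actual code; the names, the one-block buffer, and the block size are invented for the example.

    /* Minimal sketch of asynchronous threaded input (illustrative only; not
     * HMMER's actual code). A dedicated reader thread keeps one block of input
     * read ahead in a shared buffer while the consuming thread(s) do CPU work.
     */
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define BLOCKSIZE 65536

    struct readahead {
        pthread_mutex_t lock;
        pthread_cond_t  ready;     /* signaled when a block (or EOF) is available     */
        pthread_cond_t  empty;     /* signaled when the consumer has taken the block  */
        char            buf[BLOCKSIZE];
        size_t          n;         /* bytes in buf; 0 means the slot is free          */
        int             eof;
        FILE           *fp;
    };

    /* The dedicated, non-CPU-intensive input thread: reads ahead while others compute. */
    static void *reader_thread(void *arg)
    {
        struct readahead *ra = arg;
        for (;;) {
            pthread_mutex_lock(&ra->lock);
            while (ra->n > 0) pthread_cond_wait(&ra->empty, &ra->lock);
            ra->n = fread(ra->buf, 1, BLOCKSIZE, ra->fp);
            if (ra->n == 0) ra->eof = 1;
            pthread_cond_signal(&ra->ready);
            pthread_mutex_unlock(&ra->lock);
            if (ra->eof) return NULL;
        }
    }

    /* The consumer (a master or worker) grabs the next read-ahead block when it needs one. */
    static size_t next_block(struct readahead *ra, char *dest)
    {
        size_t n;
        pthread_mutex_lock(&ra->lock);
        while (ra->n == 0 && !ra->eof) pthread_cond_wait(&ra->ready, &ra->lock);
        n = ra->n;
        if (n > 0) { memcpy(dest, ra->buf, n); ra->n = 0; pthread_cond_signal(&ra->empty); }
        pthread_mutex_unlock(&ra->lock);
        return n;
    }

    int main(int argc, char **argv)
    {
        struct readahead ra = { .n = 0, .eof = 0 };
        pthread_t        tid;
        char             block[BLOCKSIZE];
        size_t           n, total = 0;

        if (argc != 2 || (ra.fp = fopen(argv[1], "rb")) == NULL) return 1;
        pthread_mutex_init(&ra.lock, NULL);
        pthread_cond_init(&ra.ready, NULL);
        pthread_cond_init(&ra.empty, NULL);

        pthread_create(&tid, NULL, reader_thread, &ra);  /* the "+1"-style extra thread      */
        while ((n = next_block(&ra, block)) > 0)         /* CPU-intensive work would go here */
            total += n;
        pthread_join(tid, NULL);
        fclose(ra.fp);
        printf("read %zu bytes via the read-ahead thread\n", total);
        return 0;
    }

A real implementation would read several blocks ahead and feed multiple workers, but the accounting is the same: the process ends up with more threads than the number of CPU workers you asked for, even though the extra threads spend most of their time blocked on i/o.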

Apparently some cluster management software requires that you state the maximum number of CPUs your job will use, and if the job ever uses more than that, your job is halted or killed. So if HMMER starts n+1 threads, and the +1 thread — however CPU-nonintensive it may be — gets allocated to a free CPU outside your allotment of n, then your job is halted or killed. Which is understandably annoying.

The workaround with HMMER3 is to tell your cluster management software that you need a maximum of n+1 CPUs when you tell HMMER --cpu n. You won’t use all n+1 CPUs efficiently (at best, you’ll only use n of them on average), but then, HMMER3 is typically i/o bound on standard filesystems, so it doesn’t scale well beyond 2-4 CPUs anyway.
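
As a hypothetical illustration, assuming a SLURM-style scheduler (other schedulers have equivalent options) and made-up file names, the request and the command would pair up like this:

    # ask the scheduler for 5 CPUs, but give hmmsearch only 4 worker threads;
    # the extra slot covers the non-CPU-intensive master/input thread
    sbatch --cpus-per-task=5 --wrap="hmmsearch --cpu 4 myprofile.hmm targetdb.fa"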

I find it hard to believe that cluster management tools aren’t able to deal smartly with multithreaded software that combines CPU-intensive and i/o-intensive threads. I presume that there’s a good reason for these policies, and/or ways for cluster managers to configure or tune appropriately. I’m open to suggestions and pointers.

New open positions

We have two new open positions in the group. We are looking for a bioinformatics analyst and a scientific software engineer to join the teams that develop the HMMER and Infernal software packages for biological sequence analysis. The teams are growing, with a key new team member, Nick Carter, who joined us from high performance computing research at Intel and a previous computer science faculty position at U Illinois. We’re aiming in particular to bring out the next release of HMMER, the long-fabled HMMER4. These positions offer the short-term opportunity to help us bring these existing projects to the next level, and a longer-term opportunity to participate in a variety of fundamental computational biology algorithms research and software development.

The scientific software engineer will work most closely with Nick and myself on HMMER, and later on Infernal in collaboration with Eric Nawrocki (NIH/NCBI). Our codebases are ANSI C99, we take advantage of SIMD vectorization instructions on multiple platforms, and we’re working hard on parallelization efficiency with multithreading (POSIX threads) and message passing (MPI), so we’re looking for someone with expertise in these technologies. We currently work primarily on Apple OS X and Linux platforms ourselves, but our code has to build and work on any POSIX-compliant platform, so we’re also looking to expand our automated build/test procedures. Our codebases are open source and we work with standard open source tools such as git, autoconf, and GNU make, and development is distributed amongst a small group of people throughout the world, especially including collaborators at NIH NCBI, U Montana, HHMI Janelia Farm, and Cambridge UK; you can see our github repos here. The official advertisement for the position is online here, with instructions on how to apply.

The bioinformatics analyst will work most closely with our bioinformatics secret agent man Tom Jones, Nick, and myself. We’re looking for someone to bridge the gap between the computational engineering and the biological applications, someone who will have one hand in the development team (working on how our command line interfaces work, and how our output formats come out), and the other hand on our collaborations within Harvard and with the outside world using and testing our software on real problems. We want someone thinking about benchmarking and testing, with expertise in a scripting language (Python, Perl); we want someone working on user-oriented documentation and tutorials (Jekyll, Markdown); we want someone thinking about ease and elegance of use; we want someone working on how our tools play well with others, including Galaxy and BioConductor, and liaising between us and the various package managers who bundle our software (such as Homebrew, MacPorts, all the various Linuxen). The official ad and instructions to apply are online here.

We are reading applications now, and will accept applications in a rolling fashion for at least another month or so; beyond that depends on how our candidate pool looks. The positions are available immediately. Our funding for them has just been activated, supported by the NIH NHGRI under the NIH’s program (PA-14-156) for Extended Development, Hardening and Dissemination of Technologies in Biomedical Computing, Informatics, and Big Data Science; we gratefully acknowledge this new support, as I’ve rejoined the NIH community after ten years away at a monastery.

HMMER mission control: we are go for launch vehicle separation(s)

HMMER web servers were officially launched today at the EMBL European Bioinformatics Institute (EBI) in Cambridge UK. You can read an EBI press release here. This marks the completion of the pilot HMMER server project at Janelia Farm and its transition to the EBI. All of this has been led by Rob Finn, now the head of sequence family resources at EBI. A huge thank you goes out to Rob and his team at Janelia (Jody Clements, Bill Arndt, and Ben Miller), to HHMI for funding the pilot project, and to the EBI for agreeing to adopt the pilot project and making it all grown up and respectable.

Today we’ve separated the code development home (hmmer.org) from the servers (www.ebi.ac.uk/Tools/hmmer/). Nonetheless, the two sites point back and forth at each other, so you can download the current HMMER release and documentation from EBI, and you can navigate to EBI’s search pages starting from hmmer.org. We even think that RESTful URLs for the pilot servers at hmmer.org will continue to be forwarded and served properly by the new EBI servers. Let us know if you experience any glitches.

Rob’s team at EBI will run the EBI servers, and the Eddy/Rivas lab will continue to be responsible for hmmer.org.  Because of the terrifyingly sophisticated planning processes we employ in the HMMER project, or maybe it’s just a coincidence, the EBI announcement comes just days before our move to Harvard. Everything HMMER-related at Janelia will now wind down quickly over the next few months. A big change for us.  If you’re used to using hmmer.janelia.org, switch to using the project’s permanent URL at hmmer.org. We’re about to turn out the lights here at Janelia.

HMMER 3.1 beta test 2 released

The HMMER dev team is happy to announce a bugfix release of HMMER3.1, release 3.1 beta 2, aka 3.1b2. Following Google’s ineffable lead in having perpetual beta test periods, 3.1 has been in beta test now for two years. When we said before that 3.1 would be released reasonably soon… well, “reasonably soon” continues to be a term of art for the dev team. Did we mention, it’s a stable beta release?

Anyhoo, moving right along. The 3.1b2 code is publicly available as a tarball for download, or from hmmer.org, where you’ll also find precompiled binary releases for Mac and Linux.

The most significant upgrade in 3.1b2 is that the nhmmer program for DNA/DNA comparison now includes a somewhat radical heuristic acceleration technique that gets us about 10x more speed. Travis Wheeler has used an FM-index data structure to accelerate remote homology search in nhmmer. FM-index techniques are well known now in the computational biology community for fast near-exact matching (in read mappers, for example), and there have been some proofs of principle for accelerating Smith/Waterman especially with scoring systems set for close matches; Travis’s code is a full-fledged implementation in production code for remote homology. Travis is still working on it and writing it up. Meanwhile, you can try it out. If you format a DNA database with the new makehmmerdb command, and then use nhmmer --tformat hmmerfm to search the binary FM-index database, you’ll use the new acceleration.
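
Concretely, that workflow looks something like the following; the file names here are placeholders:

    # build a binary FM-index database from a DNA FASTA file
    makehmmerdb targetdna.fa targetdna.fm

    # then search it, telling nhmmer the target is in the FM-index format
    nhmmer --tformat hmmerfm query.hmm targetdna.fm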

Another significant upgrade is the inclusion of the hmmlogo program, which is essentially a commandline interface for producing the data underlying the Skylign profile logo server (skylign.org).

Also, eight, count ’em, eight bugs have been fixed. Of the ones we count, anyway.

Congratulations again to Travis Wheeler, who continues as 3.1’s build master, even though he is now afar in his new mountain lair faculty position at the University of Montana, as the HMMER dev team continues to scatter and flee from Virginia.

The horrible grinding noise you hear is the HMMER4 development code branch. Do not be alarmed. All is well. It will be ready… reasonably soon.

Detailed release notes for 3.1b2 are below the fold.

HMMER3.1 beta test 1 released

The HMMER dev team is happy to announce an upgrade release of HMMER3, release 3.1. A beta test version of the code is publicly available as a tarball for download, or from hmmer.org, where you’ll also find precompiled binary releases for Mac and Linux.

HMMER 3.1 includes nhmmer and nhmmscan, programs for DNA/DNA homology searches with profile HMMs. nhmmer has already been incorporated in RepeatMasker, in collaboration with Arian Smit and colleagues, and is the software underlying the Dfam database of profiles for mobile DNA elements.

HMMER 3.1 database searches are about twice as fast as HMMER 3.0 was, fulfilling old campaign promises.

HMMER 3.1 includes hmmpgmd, the parallel search daemon underlying HMMER Web Services at hmmer.org.

This code is expected to be stable, but we’re releasing it as a beta test just to be careful. After some time in the wild, we’ll make a release candidate, and if you folks haven’t chewed any of that up too badly, we’ll make the final 3.1 release reasonably soon.

Congratulations to Travis Wheeler, 3.1’s build master — note the TJW on the notes below the fold, not an SRE — the first HMMER release managed by someone besides me (Sean).

Meanwhile… slowly, slowly, HMMER4 takes shape, as the gnomes of HMMER Labs toil sleeplessly on their latest monstrosities. The long awaited return of glocal alignment has been delayed into HMMER4, because the changes required turned out to be, um, quite extensive.

Detailed release notes for 3.1b1 are below the fold.

Join Rob’s HMMER team

Rob Finn’s HMMER web services team is expanding. We’re looking for people to apply to two new positions to help Rob and Jody push forward on some important ideas for our services. We’re pushing in the direction of using more phylogenetic information (species trees) as we compute database homology searches and deliver the results — organizing everything on trees, rather than treating the protein database as a bag of unrelated sequences, as we (the community) have tended to do in the past. We’ll need help on the data visualization side (navigating search results organized on the tree of life), on the computing back end (accelerating our searches by searching representative subsets of complete proteomes, rather than “all” sequences — which will allow us to deliver fully interactive search times, measured in milliseconds), and on collaborative efforts with the primary protein sequence and genome data resources, as we (the community) get our data ecosystem organized around complete annotated genomes, not individual protein sequences. The positions, written in HR-speak, are advertised on HHMI’s web site here and here.