HMMER 3.0rc1: HMMER3's release candidate 1

After a year of alpha and beta testing, HMMER3 is ready. Release candidate 1 is available for download as a tarball. It includes precompiled binaries for Intel/Linux systems, for both 64-bit (x86_64) and 32-bit (ia32) platforms, and you may just copy these into some directory in your $PATH if you like. If the precompiled versions don't work for you, you can always compile your own version from the source.

Of course, "ready" is a euphemism for the balance of terror between the community complaining that Pfam (and Interpro, and more) are running on our beta test code, versus our never-ending to-do list of things we want H3 to do. We're at a stable release point, but we also have a lot left on our to-do lists.

Assuming we don't find anything more than a few typos in 3.0rc1, we'll upgrade to a full 3.0 release in a few days.

The release notes for 3.0rc1 follow:


HMMER 3.0rc1 release notes: HMMER 3.0 release candidate 1

http://hmmer.org/
SRE, Tue Feb 9 08:51:05 2010

This is the first release of HMMER 3.0, after a year of alpha and beta testing.

H3 is ready for production use. Indeed, it has been stable for many months. It is already widely deployed in its beta test versions for Pfam, Interpro, and other protein databases.

We are still not entirely happy with a number of issues, including the state of the documentation and the range of different formats (especially alignment formats) we read. And it would be nice if we'd finished the manuscripts describing how H3 works. But all this work will just have to continue.

Differences in 3.0rc1 relative to the previous 3.0b3 release are as follows.

New features and improvements:

**Better portability to different compilers.**
Feel free to compile HMMER3 from source with any C compiler you want. Previously, we strongly recommended the Intel compiler icc, and we distributed icc binaries, because icc-compiled H3 code vastly outperformed gcc-compiled code (up to two-fold faster). Bjarne Knudsen identified the issue that was slowing gcc-compiled code down: an obscure issue in floating point calculations having to do with what are called "denormal" numbers. gcc implements denormals correctly, whereas icc cuts corners and does not, and dealing with denormals is slow. We've made revisions, and gcc-compiled code is now close in performance to icc-compiled code.
**hmmbuild is now multithreaded by default.**
You can now build all of Pfam from its seed alignments in a few minutes. Previously only the four search programs were multithreaded.
**All programs now have standard UNIX man pages,**
and copies are included in the User Guide. Man pages are not yet installed by default, but it's easy to do this for yourself if you want.
**The test suites have again expanded modestly.**
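
To make the "denormal" issue mentioned above a little more concrete, here is a minimal Python sketch. It is not HMMER code and says nothing about how the fix was actually made; the function names `is_denormal` and `multiply_until_denormal` are hypothetical, and Python floats are IEEE 754 doubles, so the thresholds are those of double precision. It only shows what a denormal number is, and how a running product of probabilities can drift into that range.

```python
import sys

# Smallest positive *normal* double. Any nonzero value with a smaller
# magnitude is a denormal (also called subnormal) number.
SMALLEST_NORMAL = sys.float_info.min


def is_denormal(x: float) -> bool:
    """Return True if x is nonzero and below the normal range in magnitude."""
    return x != 0.0 and abs(x) < SMALLEST_NORMAL


def multiply_until_denormal(p: float, limit: int = 10_000) -> int:
    """Multiply a running product by p (0 < p < 1) until it goes denormal.

    Returns the step at which the product first becomes denormal, or -1 if
    that does not happen within `limit` steps (for example, because the
    product underflowed straight to zero).
    """
    product = 1.0
    for step in range(1, limit + 1):
        product *= p
        if is_denormal(product):
            return step
    return -1


if __name__ == "__main__":
    print(is_denormal(1e-310))           # below the normal range
    print(is_denormal(1.0))              # an ordinary normal number
    print(multiply_until_denormal(0.5))  # steps before 0.5**n goes denormal
```

Hardware typically handles denormal operands on a slower path than normal ones, which is why a compiler that flushes them to zero can produce faster code.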

Changes to watch out for, because they alter output:

  • hmmstat output now includes a new column for model accession.
  • One of the bugfixes (#h71, see below) slightly alters the function of the "bias filter" in the H3 acceleration pipeline, and can produce small differences in the list of sequence targets found in searches.

All reported bugs have been fixed:

  • #h76 Several unit tests could fail on rare old machines because of a problem with tmp file creation with esl_tmpfile*().
  • #h75 HMMER_NCPU environment variable had no effect, wasn't hooked up.
  • #h74 Different envelopes, same alignment: rarely, two different but overlapping envelopes could produce the same optimal-accuracy alignment, with the same alignment endpoints. This was a long-standing bug that I had noticed in May 2008, and waited to see if anyone else noticed; it occurred at a rate of about one in three trillion comparisons, but was easy to reproduce with an appropriately constructed pathological example.
  • #h73 hmmalign --mapali failed if msa contains fragments, because of an order-of-call issue with checksum calculations and fragment marking.
  • #h72 jackhmmer could segfault if target sequence database contains duplicate names.
  • #h71 model composition calculation was wrong (hmm->compo), error in iocc[] calculation. This affects the bias filter step in the pipeline; it can have a small effect on results by changing which sequences have passed the filter relative to 3.0b3.
  • #h70 hmmbuild gave a cryptic error message if msa contains internal ~ characters; clarified the error message.
  • #h69 hmmscan --tblout segfaulted if profile lacks accession/description.
  • #e4 Long options were called "ambiguous" if they were a prefix of another long option, even on an exact match. For example, you could never get a --s option to work if there was also a --seed option.
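
The behavior that #e4 restores can be sketched in a few lines of Python. This is only an illustration of the rule "an exact match beats a prefix match"; it is not the option parser HMMER actually uses, and `resolve_long_option` and `AmbiguousOption` are hypothetical names invented for the example.

```python
from typing import List, Optional


class AmbiguousOption(Exception):
    """Raised when an abbreviation matches more than one long option."""


def resolve_long_option(arg: str, known: List[str]) -> Optional[str]:
    """Resolve a possibly abbreviated long option against the known options.

    An exact match always wins, even when it is also a prefix of another
    option. Otherwise a unique prefix match is accepted. Returns None if
    nothing matches at all.
    """
    # Check for an exact match first; skipping this step is the kind of
    # mistake that makes "--s" unusable when "--seed" also exists.
    if arg in known:
        return arg
    candidates = [opt for opt in known if opt.startswith(arg)]
    if len(candidates) > 1:
        raise AmbiguousOption(f"{arg} could be any of: {', '.join(candidates)}")
    return candidates[0] if candidates else None


if __name__ == "__main__":
    print(resolve_long_option("--s", ["--s", "--seed"]))   # exact match
    print(resolve_long_option("--se", ["--s", "--seed"]))  # unique prefix
```

With the exact-match check in place, `--s` resolves to itself, while a true abbreviation such as `--se` is still expanded when it is unambiguous.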

Meanwhile, underneath the hood:

  • Support for DNA searches is in progress but not complete. You can see the work if you read the source code. We still don't recommend that you use H3 for DNA searches, even if it appears to work.
  • Elena Rivas has produced an independent reimplementation of the core of the H3 codebase for continuous-variable HMMs (as opposed to discrete residues), for analyzing recordings of mouse "song". As a result of Elena's line-by-line scrutiny of the H3 codebase, a number of small issues were fixed, including errant or misleading comments.
  • A lot of work is going on in our sequence format parsers. We will soon support binary sequence database formats including NCBI BLAST's and our own; we will also soon support a much wider variety of alignment file formats. We wanted this work to be done for the 3.0 release, but didn't quite make it.