HMMER 3.0rc1: HMMER3’s release candidate 1

After a year of alpha and beta testing, HMMER3 is ready. Release candidate 1 is available for download as a tarball. It includes precompiled binaries for Intel/Linux systems, both 64-bit (x86_64) and 32-bit (ia32) platforms; you can simply copy these into a directory in your $PATH if you like. If the precompiled versions don’t work for you, you can always compile your own version from the source.

Of course, “ready” is a euphemism for the balance of terror between the community complaining that Pfam (and Interpro, and more) are running on our beta test code, versus our never-ending to-do list of things we want H3 to do. We’re at a stable release point, but we also have a lot left on our to-do lists.

Assuming we don’t find anything more than a few typos in 3.0rc1, we’ll upgrade to a full 3.0 release in a few days.

The release notes for 3.0rc1 follow:


HMMER 3.0rc1 release notes: HMMER 3.0 release candidate 1

http://hmmer.org/
SRE, Tue Feb 9 08:51:05 2010

This is the first release of HMMER 3.0, after a year of alpha and beta testing.

H3 is ready for production use. Indeed, it has been stable for many months. It is already widely deployed in its beta test versions for Pfam, Interpro, and other protein databases.

We are still not entirely happy with a number of issues, including the state of the documentation and the range of different formats (especially alignment formats) we read. And it would be nice if we’d finished the manuscripts describing how H3 works. But all this work will just have to continue.

Differences in 3.0rc1 relative to the previous 3.0b3 release are as follows.

New features and improvements:

Better portability to different compilers.

Feel free to compile HMMER3 from source with any C compiler you want. Previously, we strongly recommended the Intel compiler icc, and we distributed icc binaries, because icc-compiled H3 code vastly outperformed gcc-compiled code (up to two-fold faster). Bjarne Knudsen identified the issue that was slowing gcc-compiled code down: an obscure floating-point issue having to do with what are called “denormal” numbers. gcc implements denormals correctly; icc cuts corners and does not, and dealing with denormals is slow. We’ve made revisions, and gcc-compiled code is now close in performance to icc-compiled code.

hmmbuild is now multithreaded by default.

You can now build all of Pfam from its seed alignments in a few minutes. Previously only the four search programs were multithreaded.

All programs now have standard UNIX man pages,

and copies are included in the User Guide. Man pages are not yet installed by default, but it’s easy to do this for yourself if you want.
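For example, one way to hand-install the pages into a per-user man directory; the source-tree location of the pages and the .man suffix are assumptions, so check your unpacked tarball for the actual layout:

```shell
# Hand-install the HMMER man pages into a per-user man directory.
# documentation/man/*.man is an assumed source-tree location; adjust as needed.
MANDIR="$HOME/.local/share/man/man1"
mkdir -p "$MANDIR"
for f in documentation/man/*.man; do
    [ -e "$f" ] || continue                     # skip if pattern matched nothing
    cp "$f" "$MANDIR/$(basename "$f" .man).1"   # hmmsearch.man -> hmmsearch.1
done
# Then read one with:  man -M "$HOME/.local/share/man" hmmsearch
```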

The test suites have again expanded modestly.

Changes to watch out for, because they alter output:

  • hmmstat output now includes a new column for model accession.
  • One of the bugfixes (#h71, see below) slightly alters the function of the “bias filter” in the H3 acceleration pipeline, and can produce small differences in the list of sequence targets found in searches.

All reported bugs have been fixed:

  • #h76 Several unit tests could fail on rare old machines because of a problem with tmp file creation in esl_tmpfile*().
  • #h75 The HMMER_NCPU environment variable had no effect; it wasn’t hooked up.
  • #h74 Different envelopes, same alignment: rarely, two different but overlapping envelopes could produce the same optimal-accuracy alignment, with the same alignment endpoints. This was a long-standing bug that I had noticed in May 2008 and waited to see if anyone else noticed; it occurred at a rate of about one in three trillion comparisons, but was easy to reproduce with an appropriately constructed pathological example.
  • #h73 hmmalign --mapali failed if the MSA contained fragments, because of an order-of-call issue between checksum calculation and fragment marking.
  • #h72 jackhmmer could segfault if the target sequence database contained duplicate names.
  • #h71 The model composition calculation (hmm->compo) was wrong, because of an error in the iocc[] calculation. This affects the bias filter step in the pipeline; it can have a small effect on results by changing which sequences pass the filter relative to 3.0b3.
  • #h70 hmmbuild gave a cryptic error message if the MSA contained internal ~ characters; the error message has been clarified.
  • #h69 hmmscan --tblout segfaulted if a profile lacked an accession/description.
  • #e4 Long options were called “ambiguous” if they were a prefix of another long option, even on an exact match. For example, you could never get a --s option to work if there was also a --seed option.

Meanwhile, underneath the hood:

  • Support for DNA searches is in progress but not complete. You can see the work if you read the source code. We still don’t recommend that you use H3 for DNA searches, even if it appears to work.
  • Elena Rivas has produced an independent reimplementation of the core of the H3 codebase for continuous-variable HMMs (as opposed to discrete residues), for analyzing recordings of mouse “song”. As a result of Elena’s line-by-line scrutiny of the H3 codebase, a number of small issues were fixed, including errant or misleading comments.
  • A lot of work is going on in our sequence format parsers. We will soon support binary sequence database formats including NCBI BLAST’s and our own; we will also soon support a much wider variety of alignment file formats. We wanted this work to be done for the 3.0 release, but didn’t quite make it.

11 thoughts on “HMMER 3.0rc1: HMMER3’s release candidate 1”

  1. Hey Sean, this is awesome!

    Quick Q … are the 32bit binaries broken? This is the output I get (sez x86-64?):

    binaries> file intel-linux-ia32/hmmsearch
    intel-linux-ia32/hmmsearch: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.9, not stripped

    Looks like the 32bit binaries are 64 bit binaries in disguise?

    Thank you!

  2. Oopsie, you’re right.

    Our ia32 compile environment changed recently and I didn’t notice; my scripted configuration had no effect, and it looks like stuff reverted to our default x86_64 compilation environment.

    We’ll fix that in the 3.0 release.

  3. I would be happier if I could simply switch the stdout output to a more ‘computer-parsable’ format with an option, like BLAST’s -o, instead of writing it to separate files. Also, the alignment can currently only be recovered from the ‘human-readable’ format, which is tedious and error-prone. I would appreciate it if the alignment could also be produced in a ‘computer-parsable’ format.

  4. Thanks, Hiroshi. My plan is to allow --domtblout and --tblout to accept “-” as an argument, meaning “standard output”.

    There would be three kinds of output: the normal ‘human-readable’ output, the .domtbl output, and the .tbl output. One and only one of these would be allowed to go to stdout; the other two (if used) must go to files.

    So:
    hmmsearch -o /dev/null --tblout - my.hmm my.database
    would send the tabular output to stdout, and discard the normal output.

    What would you think of that?

    Parseable alignment output is further down the to-do list. That’ll come when we start doing more systematic experiments on the accuracy of H3’s alignments and posterior probabilities.

  5. First of all, this is my personal opinion; I’ll follow your decision. I can always modify the source code myself if necessary.

    For writing a parser, the --domtblout output is nearly sufficient, provided alignment information is included. The --tblout information can be almost fully recovered from the --domtblout output. So I thought it would make more sense to switch between the two output formats (‘human-readable’ and --domtblout) with an option. Anyway, how to choose which one goes to stdout is not very critical; feeding ‘-’ is a nice idea, though it may not be a very simple implementation.

    However, showing the alignment in a way that lets users write parsers easily seems more important to me.

    The best for me would be if, at the end (or before the description field) of each --domtblout line, the program showed five more fields giving the alignment information: the HMM sequence, the middle line showing the match residues, the amino acid sequence, the homology line, and the secondary structure line if available, all without LF characters. I understand that the output would be very ugly, but not many people would care, as it’s for the computer.

    What do you think?

  6. Hey Sean,

    I was able to get 3.0rc1 to compile on BG/L last night; however, I had to hack the configure script a bit. The interactive nodes on the cluster we have access to are PPC-based. As a result, the configure script kept trying to use VMX, which is unavailable on the worker nodes. Passing --disable-vmx was not working either. My workaround was to force the configure script to use the dummy implementation.

    Would you expect the performance impact to be large enough to make it a waste of time to use large numbers of cores with the dummy implementation? We also have x86_64 clusters available, but significantly fewer cores to abuse. BG/L has SIMD instruction sets; have you guys investigated porting HMMER3 to that architecture?

    On a side note, I have been unsuccessful thus far at coercing GPU-HMMER to compile on Snow Leopard. I was not able to get CUDA v2 running, but was able to get CUDA v3.0.1beta… it is entirely possible that GPU-HMMER is not compatible with CUDA v3.

    Thanks!
    -Daniel

  7. ./configure --enable-dummy would work. (I’ll check into why --disable-vmx doesn’t work, and have that act like --enable-dummy.)

    The performance impact is large, about 100x or so. I wouldn’t recommend using the dummy implementation for real work. It’s there as a reference implementation in vanilla C, so someone can see how the algorithms work without the added complexity layer of vector instructions.

    Carlos Sosa at IBM has, I believe, been looking into porting H3 to the BG/L; we’ve exchanged some email about it.

    For GPU-HMMER problems, you’ll have to take that up with the authors of GPU-HMMER. It is a fork of HMMER2 code that we don’t have anything to do with. We have asked them not to use the name HMMER to avoid this kind of confusion.

  8. Pingback: Pfam, HMMER3 and the next release « Xfam Blog
