Graduation day: HMMER3 beta test (3.0b1) release

The alpha test is complete, and the beta test phase begins today. HMMER3 beta test code is publicly available as a tarball for download from our FTP site. Besides source code, the tarball includes a binaries/ subdirectory with precompiled binaries for Intel/Linux systems, both 64-bit (in binaries/intel-linux-x86_64) and 32-bit (in binaries/intel-linux-ia32); you can just copy these into some directory in your $PATH if you like. These should work on a wide variety of platforms, but if you have any trouble, you can always compile your own version from the source code.

My attention is now going to turn toward writing up some papers on all this, and writing the HMMER3 documentation, because (of course) nobody’s going to find any bugs in the beta test, right?

The release notes for 3.0b1:

HMMER 3.0b1 release notes (HMMER3 beta test, release 1)
http://hmmer.org/
SRE, Wed Jun 17 17:08:27 2009
________________________________________________________________

3.0b1 includes the following changes:

  • The format of HMM files has changed slightly. There are now six
    saved statistical parameters for calculating E-values, not
    three. This format is called the “3/b” format. Previous HMMER3
    flatfiles (3/a format) are read fine in a backward-compatible
    mode. However, any binary press’ed HMM databases you have will
    need to be repress’ed with “hmmpress”.
  • An important bug (#h48) was fixed that affected any target
    sequence containing degenerate residue codes. Such target
    sequences may have been filtered out of any searches, in an
    unpredictable and probably platform-dependent way. This bug was
    introduced in the 3.0a2 release, and was not present in the 3.0a1
    release.
  • The four search programs now have optional tabular output formats
    for easier automated parsing; see --tblout and --domtblout
    options.
  • One of the stages of the accelerated “filter pipeline” for the
    search programs has been reimplemented. The ViterbiFilter step is
    now implemented in 16-bit precision in 8-fold parallel vectors (it
    was previously in 8-bit precision in 16-fold parallel vectors).
    Testing/analysis of numerical roundoff error determined that the
    8-bit implementation had unacceptable error on a subset of Pfam
    models. You should observe little if any difference in speed as a
    result of this change. There may be small differences in sequences
    that do or don’t pass through the filter, but for sequences that pass
    the filter, there should be no changes in their called domain
    structure, alignment, scores, or E-values.
  • ‘*’ characters are accepted in input sequence files and
    interpreted as “not a residue”. ‘*’ is common (though technically
    an illegal residue code) in FASTA files of ORFs translated from
    DNA. Because of a technical detail of how H3 scores insertions (it
    assumes insertions score zero), it is possible for an alignment to
    span a * if the * is assigned to an insert state, but not if the *
    is assigned to a match state. This is arguably a bug; but you
    shouldn’t be using * characters to begin with.
  • You can now “checkpoint” the results of jackhmmer iterations. See
    --chkhmm and --chkali options.
  • hmmbuild is now tolerant of alignments that contain sequence
    fragments. If a sequence looks like a fragment (see the
    --fragthresh option), leading and trailing gaps are ignored,
    rather than being counted as deletions. (This is important for
    metagenomics datasets and other partial protein sequence data.)
  • The default relative sequence weighting method in hmmbuild is now
    Henikoff position-based weights, replacing
    Gerstein/Chothia/Sonnhammer weights, to optimize model
    construction speed.
  • The two tweaks in hmmbuild (fragment marking and PB weights) mean
    that models built with beta1 will produce slightly different
    scores and results, compared to alpha2 code.
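As an illustration of the position-based (PB) weighting mentioned above, here is a minimal Python sketch of the textbook Henikoff & Henikoff scheme. It is not HMMER's own code. The function name, the treatment of gap characters (skipped when counting residues in a column), and the normalization (weights sum to the number of sequences) are assumptions made for this example.

```python
from collections import Counter

def henikoff_pb_weights(msa, gap_chars="-."):
    """Textbook Henikoff position-based weights for an alignment.

    msa is a list of equal-length aligned sequence strings.
    Assumptions for this sketch: gaps are skipped when counting a column,
    and the returned weights are scaled to sum to len(msa).
    """
    nseq = len(msa)
    ncol = len(msa[0])
    weights = [0.0] * nseq
    for col in range(ncol):
        column = [seq[col] for seq in msa]
        # Count how often each residue type occurs in this column.
        counts = Counter(c for c in column if c not in gap_chars)
        k = len(counts)  # number of distinct residue types in the column
        if k == 0:
            continue     # all-gap column contributes nothing
        for i, c in enumerate(column):
            if c in counts:
                # Each residue type shares 1/k equally among its carriers.
                weights[i] += 1.0 / (k * counts[c])
    total = sum(weights)
    if total == 0.0:
        return [1.0] * nseq
    return [w * nseq / total for w in weights]

weights = henikoff_pb_weights(["AAAA", "AAAA", "CCCC"])
print(weights)
```

In this toy alignment the two identical sequences share their weight and the divergent third sequence is upweighted. The calculation is a single pass over the columns, which is consistent with the speed motivation given above.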

Other changes include:

  • A “hmmalign --mapali” option allows you to add the original training
    alignment (that an HMM was built from) into a new alignment.
  • --qformat and --tformat options are added to phmmer, jackhmmer,
    hmmsearch, and hmmscan, and --informat option to hmmbuild. These
    allow you to specify input sequence file formats, bypassing
    automatic format detection.
  • The target function for entropy weighting has been tweaked and
    simplified. It is now controlled by --esigma and --ere options.
    The --ere option no longer overrides the eweighting target
    function; the two-parameter function is always applied. (To set a
    fixed RE target without the tail of larger REs for smaller models,
    set a large negative argument to --esigma, such as --esigma
    -9999.) The new target function assigns somewhat higher relative
    entropy targets to small models, and should slightly alleviate
    H3’s lack of sensitivity with certain small, diverse Pfam models
    that were tuned to take advantage of glocal alignment’s somewhat
    better signal/noise.
  • The Viterbi filter now uses an approximation (the “3 nat
    approximation”) to account for NN,CC,JJ transitions, even though
    it has the numerical accuracy to do the exact calculation, because
    the approximation outperforms the exact calculation. I don’t want
    to talk about it.
  • Random number generators have been replaced with much faster
    versions, resulting in ten-fold speedups in stochastic sampling
    code.
  • hmmbuild output has changed; it now includes columns for effective
    sequence number and relative entropy/position.
  • Automated test coverage has increased.
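To make the two-parameter entropy weighting target described above more concrete, here is a hedged Python sketch. The formula is the one I recall from later HMMER3 source code, not something stated in these notes, so treat it as an assumption and check it against the source. The function name and the default values (esigma = 45.0, ere = 0.59 bits per position for protein models) are likewise assumptions.

```python
import math

def target_relent(M, esigma=45.0, ere=0.59):
    """Assumed target relative entropy per position for a model of length M.

    Assumed form: max(ere, (esigma + log2(M*(M+1)/2)) / M).
    The second term is the "tail" that raises the target for small models.
    """
    if M < 1:
        raise ValueError("model length M must be >= 1")
    tail = (esigma + math.log2(M * (M + 1) / 2.0)) / M
    return max(ere, tail)

for M in (10, 50, 100, 1000):
    print(M, round(target_relent(M), 3), target_relent(M, esigma=-9999.0))
```

Under these assumptions, small models get a target above the --ere floor, long models settle at the floor, and a large negative --esigma removes the tail so that every model gets the fixed --ere target.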

Numbered bugs fixed:

  • #h55 hmmbuild failed to build reasonable model on fragment-rich MSA
  • #h53 hmmalign --trim had memory corruption fault
  • #h52 hmmscan segfaulted on rare comparisons with nenvelopes=0
    after stochastic clustering.
  • #h51 “hmmconvert -2” created invalid HMMER2 model if ali len > 9999
  • #h50 “hmmbuild -n” (assign name to HMM) didn’t have any effect
  • #h49 domain definition was sometimes finding the same domain twice
  • #h48 Degenerate residues caused target sequence to be bias-filtered.
  • #h47 “jackhmmer -A” segfaulted.

Unnumbered bugs fixed (some of the things I found that you didn’t):

  • “hmmalign -o” now works as intended; it wasn’t working before.
  • Alignment display midlines were showing + signs everywhere,
    because of a typo in the code.

================================================================
= Open bugs
================================================================

  • #h44 Compiler “option -xk not supported” warnings on MacOS/X
  • #h43 “non-aligned pointer being freed” on MacOS/X x86_64 icc
  • #h42 Low information content models have poor sensitivity

6 thoughts on “Graduation day: HMMER3 beta test (3.0b1) release”

  1. “The four search programs now have optional tabular output formats
    for easier automated parsing; see --tblout and --domtblout
    options.”

    Is hmmscan not one of these four search programs? I don’t see this new option for hmmscan.

  2. Oops. You’re right. They should be in hmmscan, and they’re not.

    Apologies for that. They’ll be in hmmscan in beta2 — likely next month (I’ll be away for several weeks, else I’d do it sooner.)

  3. Great stuff Sean. Having played with the alpha versions of HMMER3, I found it already very robust, and I am sure that there will be very few issues with the beta release (right 🙂 ). I know a lot of the bugs and feature requests have come from me……So many thanks for dealing with them (and the rest of my stupid questions)! We will almost certainly make the next release of Pfam with HMMER 3.0b1, rather than the HMMER 3.0a2 release, as it is so fast that we can update (well, re-search the models) overnight.

    The increased sensitivity of HMMER3 has shown some really interesting relationships between Pfam families. Although there has been some pain going through all of the overlapping hits, and resolving the overlaps where necessary, I strongly believe that the pain has been well worth it. The increased coverage of our models and scope of the Pfam clans represents a significant step forward in terms of what is achievable with sequence analysis. When Pfam is released in the late summer on HMMER3 it should become an even more useful tool for the molecular biologist!

    Finally, if any other HMM-based protein family databases are wondering whether now is the time to switch to HMMER3 – I would say ‘YES!’.

  4. I am thrilled to see the Jackhmmer functionality in here! This will be very cool, and I cannot wait to put it up against PSI-BLAST in terms of speed and sensitivity.

    I wish, however, that it had a different name, as when I saw it I thought of the Wash. U. network processor implementation, which showed a small speedup over pentium chips a few years ago. Also interesting, but quite different.
