Graduation day: HMMER3 beta test (3.0b1) release

The alpha test is complete, and the beta test phase begins today. The HMMER3 beta test code is publicly available as a tarball on our FTP site. Besides source code, the tarball includes a binaries/ subdirectory with precompiled binaries for Intel/Linux systems, both 64-bit (in binaries/intel-linux-x86_64) and 32-bit (in binaries/intel-linux-ia32); you can just copy these into some directory in your $PATH if you like. These should work on a wide variety of platforms, but if you have any trouble, you can always compile your own version from the source code.

My attention is now going to turn toward writing up some papers on all this, and writing the HMMER3 documentation, because (of course) nobody's going to find any bugs in the beta test, right?

The release notes for 3.0b1:

HMMER 3.0b1 release notes (HMMER3 beta test, release 1)
http://hmmer.org/
SRE, Wed Jun 17 17:08:27 2009

3.0b1 includes the following changes:

  • The format of HMM files has changed slightly. There are now six
    saved statistical parameters for calculating E-values, not
    three. This format is called the "3/b" format. Previous HMMER3
    flatfiles (3/a format) are read fine in a backward-compatible
    mode. However, any binary press'ed HMM databases you have will
    need to be repress'ed with "hmmpress".
  • An important bug (#h48) was fixed that affected any target
    sequence containing degenerate residue codes. Such target
    sequences may have been filtered out of any searches, in an
    unpredictable and probably platform-dependent way. This bug was
    introduced in the 3.0a2 release, and was not present in the 3.0a1
    release.
  • The four search programs now have optional tabular output formats
    for easier automated parsing; see --tblout and --domtblout
    options.
  • One of the stages of the accelerated "filter pipeline" for the
    search programs has been reimplemented. The ViterbiFilter step is
    now implemented in 16-bit precision in 8-fold parallel vectors (it
    was previously in 8-bit precision in 16-fold parallel vectors).
    Testing/analysis of numerical roundoff error determined that the
    8-bit implementation had unacceptable error on a subset of Pfam
    models. You should observe little if any difference in speed as a
    result of this change. There may be small differences in sequences
    that do or don't pass through the filter, but for sequences that pass
    the filter, there should be no changes in their called domain
    structure, alignment, scores, or E-values.
  • '*' characters are accepted in input sequence files and
    interpreted as "not a residue". '*' is common (though technically
    an illegal residue code) in FASTA files of ORFs translated from
    DNA. Because of a technical detail of how H3 scores insertions (it
    assumes insertions score zero), it is possible for an alignment to
    span a * if the * is assigned to an insert state, but not if the *
    is assigned to a match state. This is arguably a bug; but you
    shouldn't be using * characters to begin with.
  • You can now "checkpoint" the results of jackhmmer iterations. See
    --chkhmm and --chkali options.
  • hmmbuild is now tolerant of alignments that contain sequence
    fragments. If a sequence looks like a fragment (see the
    --fragthresh option), leading and trailing gaps are ignored,
    rather than being counted as deletions. (This is important for
    metagenomics datasets and other partial protein sequence data.)
  • The default relative sequence weighting method in hmmbuild is now
    Henikoff position-based weights, replacing
    Gerstein/Chothia/Sonnhammer weights, to optimize model
    construction speed.
  • The two tweaks in hmmbuild (fragment marking and PB weights) mean
    that models built with beta1 will produce slightly different
    scores and results, compared to alpha2 code.
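
As a rough illustration of the position-based (PB) weighting scheme mentioned in the list above, here is a minimal Python sketch of the textbook Henikoff and Henikoff method. It is not HMMER's implementation: the function name position_based_weights, the set of gap characters, and the choice to rescale the weights so they sum to the number of sequences are all assumptions made for this example. In each column, a sequence earns 1/(r*s), where r is the number of distinct residues in the column and s is the number of sequences sharing its residue; a sequence's weight is the sum of these terms over all columns.

```python
from collections import Counter

# Assumption for this sketch: these characters mark gaps and are skipped.
GAP_CHARS = set("-.~_")


def position_based_weights(alignment):
    """Textbook Henikoff position-based weights for a list of aligned strings.

    Returns one weight per sequence, rescaled so the weights sum to the
    number of sequences (a normalization chosen for this example).
    """
    n = len(alignment)
    if n == 0:
        return []
    ncols = len(alignment[0])
    if any(len(row) != ncols for row in alignment):
        raise ValueError("all aligned sequences must have the same length")

    weights = [0.0] * n
    for col in range(ncols):
        column = [row[col].upper() for row in alignment]
        # Count each residue type in this column, ignoring gaps.
        counts = Counter(c for c in column if c not in GAP_CHARS)
        ntypes = len(counts)
        if ntypes == 0:
            continue  # all-gap column contributes nothing
        for i, c in enumerate(column):
            if c in counts:
                weights[i] += 1.0 / (ntypes * counts[c])

    total = sum(weights)
    if total == 0.0:
        return [1.0] * n  # degenerate case: fall back to uniform weights
    return [w * n / total for w in weights]


if __name__ == "__main__":
    # Two identical sequences share weight; the distinct one is upweighted.
    print(position_based_weights(["AAA", "AAA", "CCC"]))  # [0.75, 0.75, 1.5]
```

Each column is visited once, so the cost of this sketch grows linearly with the size of the alignment.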

Other changes include:

  • A "hmmalign --mapali" option allows you to add the original training
    alignment (that an HMM was built from) into a new alignment.
  • --qformat and --tformat options are added to phmmer, jackhmmer,
    hmmsearch, and hmmscan, and an --informat option to hmmbuild. These
    allow you to specify input sequence file formats, bypassing
    automatic format detection.
  • The target function for entropy weighting has been tweaked and
    simplified. It is now controlled by --esigma and --ere options.
    The --ere option no longer overrides the eweighting target
    function; the two-parameter function is always applied. (To set a
    fixed RE target without the tail of larger REs for smaller models,
    set a large negative argument to --esigma, such as --esigma
    -9999.) The new target function assigns somewhat higher relative
    entropy targets to small models, and should slightly alleviate
    H3's lack of sensitivity with certain small, diverse Pfam models
    that were tuned to take advantage of glocal alignment's somewhat
    better signal/noise.
  • The Viterbi filter now uses an approximation (the "3 nat
    approximation") to account for NN,CC,JJ transitions, even though
    it has the numerical accuracy to do the exact calculation, because
    the approximation outperforms the exact calculation. I don't want
    to talk about it.
  • Random number generators have been replaced with much faster
    versions, resulting in ten-fold speedups in stochastic sampling
    code.
  • hmmbuild output has changed; it now includes columns for effective
    sequence number and relative entropy/position.
  • Automated test coverage has increased.
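
One plausible reading of where the "3 nat" figure in the Viterbi filter item above comes from, offered as an assumption rather than a statement about the filter's code: if each of the roughly L flanking residues of a length-L target pays a self-loop transition of probability L/(L+3) (the NN, CC, and JJ loops), the total cost is L ln(L/(L+3)) nats, which tends to -3 as L grows. The function name flanking_loop_cost and the L/(L+3) loop probability are assumptions for this sketch.

```python
import math


def flanking_loop_cost(L: int) -> float:
    """Total log-probability, in nats, of L self-loop transitions that each
    have probability L/(L+3). The loop probability is an assumption here."""
    if L <= 0:
        raise ValueError("L must be a positive sequence length")
    return L * math.log(L / (L + 3))


if __name__ == "__main__":
    # The cost approaches -3 nats as the target length increases.
    for L in (50, 100, 400, 1000, 10000):
        print(f"L={L:6d}  cost={flanking_loop_cost(L):.4f} nats")
```

Under this assumption, a fixed charge of 3 nats would stand in for the length-dependent term at typical protein lengths.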

Numbered bugs fixed:

  • #h55 hmmbuild failed to build reasonable model on fragment-rich MSA
  • #h53 hmmalign --trim had memory corruption fault
  • #h52 hmmscan segfaulted on rare comparisons with nenvelopes=0
    after stochastic clustering.
  • #h51 "hmmconvert -2" created invalid HMMER2 model if ali len > 9999
  • #h50 "hmmbuild -n" (assign name to HMM) didn't have any effect
  • #h49 domain definition was sometimes finding the same domain twice
  • #h48 Degenerate residues caused target sequence to be bias-filtered.
  • #h47 "jackhmmer -A" segfaulted.

Unnumbered bugs fixed (some of the things I found that you didn't):

  • "hmmalign -o" now works as intended; it wasn't working before.
  • Alignment display midlines were showing + signs everywhere,
    because of a typo in the code.

Open bugs:

  • #h44 Compiler "option -xk not supported" warnings on MacOS/X
  • #h43 "non-aligned pointer being freed" on MacOS/X x86_64 icc
  • #h42 Low information content models have poor sensitivity