Graduation day: HMMER3 beta test (3.0b1) release

The alpha test is complete, and the beta test phase begins today. The HMMER3 beta test code is publicly available as a tarball for download from our FTP site. Besides source code, the tarball includes a binaries/ subdirectory with precompiled binaries for Intel/Linux systems, both 64-bit (in binaries/intel-linux-x86_64) and 32-bit (in binaries/intel-linux-ia32); you can just copy these into some directory in your $PATH if you like. These should work on a wide variety of platforms, but if you have any trouble, you can always compile your own version from the source code.

My attention is now going to turn toward writing up some papers on all this, and writing the HMMER3 documentation, because (of course) nobody's going to find any bugs in the beta test, right?

The release notes for 3.0b1:

HMMER 3.0b1 release notes (HMMER3 beta test, release 1)
http://hmmer.org/
SRE, Wed Jun 17 17:08:27 2009

3.0b1 includes the following changes:

The format of HMM files has changed slightly. There are now six
saved statistical parameters for calculating E-values, not
three. This format is called the "3/b" format. Previous HMMER3
flatfiles (3/a format) are read fine in a backward-compatible
mode. However, any binary press'ed HMM databases you have will
need to be repress'ed with "hmmpress".

An important bug (#h48) was fixed that affected any target
sequence containing degenerate residue codes. Such target
sequences may have been filtered out of any searches, in an
unpredictable and probably platform-dependent way. This bug was
introduced in the 3.0a2 release, and was not present in the 3.0a1
release.

The four search programs now have optional tabular output formats
for easier automated parsing; see the --tblout and --domtblout
options.

One of the stages of the accelerated "filter pipeline" for the
search programs has been reimplemented. The ViterbiFilter step is
now implemented in 16-bit precision in 8-fold parallel vectors (it
was previously in 8-bit precision in 16-fold parallel vectors).
Testing/analysis of numerical roundoff error determined that the
8-bit implementation had unacceptable error on a subset of Pfam
models. You should observe little if any difference in speed as a
result of this change. There may be small differences in which
sequences do or don't pass through the filter, but for sequences
that pass the filter, there should be no changes in their called
domain structure, alignment, scores, or E-values.

'*' characters are accepted in input sequence files and
interpreted as "not a residue". '*' is common (though technically
an illegal residue code) in FASTA files of ORFs translated from
DNA. Because of a technical detail of how H3 scores insertions (it
assumes insertions score zero), it is possible for an alignment to
span a * if the * is assigned to an insert state, but not if the *
is assigned to a match state. This is arguably a bug; but you
shouldn't be using * characters to begin with.

You can now "checkpoint" the results of jackhmmer iterations. See
the --chkhmm and --chkali options.

hmmbuild is now tolerant of alignments that contain sequence
fragments. If a sequence looks like a fragment (see the
--fragthresh option), leading and trailing gaps are ignored,
rather than being counted as deletions. (This is important for
metagenomics datasets and other partial protein sequence data.)

The default relative sequence weighting method in hmmbuild is now
Henikoff position-based weights, replacing
Gerstein/Chothia/Sonnhammer weights, to optimize model
construction speed.

The two tweaks in hmmbuild (fragment marking and PB weights) mean
that models built with beta1 will produce slightly different
scores and results, compared to alpha2 code.
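
For readers who haven't met position-based (PB) weights before, here is a minimal Python sketch of the Henikoff and Henikoff scheme named above. It is not the hmmbuild implementation. The function name, the set of gap characters, the choice to skip gaps, and the normalization (weights sum to the number of sequences) are all assumptions made for illustration.

```python
from collections import Counter

# Assumption: these characters mark gaps or missing data in the alignment.
GAP_CHARS = set("-.~")


def position_based_weights(alignment):
    """Henikoff-style position-based weights for a list of aligned strings.

    Each column with k distinct residue types gives 1 / (k * n) to every
    sequence, where n is how many sequences share that sequence's residue
    in the column. Gap characters are skipped (an assumption of this sketch).
    """
    nseq = len(alignment)
    if nseq == 0:
        return []
    alen = len(alignment[0])
    if any(len(seq) != alen for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")

    weights = [0.0] * nseq
    for col in range(alen):
        counts = Counter(
            seq[col].upper() for seq in alignment if seq[col] not in GAP_CHARS
        )
        k = len(counts)
        if k == 0:
            continue  # all-gap column contributes nothing
        for i, seq in enumerate(alignment):
            residue = seq[col]
            if residue in GAP_CHARS:
                continue
            weights[i] += 1.0 / (k * counts[residue.upper()])

    total = sum(weights)
    if total == 0.0:
        return [1.0] * nseq  # degenerate all-gap alignment: fall back to uniform
    # Assumption: rescale so the weights sum to the number of sequences.
    return [w * nseq / total for w in weights]


if __name__ == "__main__":
    toy = ["ACDE", "ACDE", "ACDF"]
    print(position_based_weights(toy))
```

Each column is visited once and no pairwise sequence comparisons are needed, which fits the stated motivation of faster model construction. In the toy alignment, the third sequence gets the largest weight because it alone carries the F in the last column.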

Other changes include:

A "hmmalign --mapali" option allows you to add the original training
alignment (that an HMM was built from) into a new alignment.

--qformat and --tformat options are added to phmmer, jackhmmer,
hmmsearch, and hmmscan, and --informat option to hmmbuild. These
allow you to specify input sequence file formats, bypassing
automatic format detection.

The target function for entropy weighting has been tweaked and
simplified. It is now controlled by --esigma and --ere options.
The --ere option no longer overrides the eweighting target
function; the two-parameter function is always applied. (To set a
fixed RE target without the tail of larger REs for smaller models,
set a large negative argument to --esigma, such as --esigma
-9999.) The new target function assigns somewhat higher relative
entropy targets to small models, and should slightly alleviate
H3's lack of sensitivity with certain small, diverse Pfam models
that were tuned to take advantage of glocal alignment's somewhat
better signal/noise.

The Viterbi filter now uses an approximation (the "3 nat
approximation") to account for NN,CC,JJ transitions, even though
it has the numerical accuracy to do the exact calculation, because
the approximation outperforms the exact calculation. I don't want
to talk about it.

Random number generators have been replaced with much faster
versions, resulting in ten-fold speedups in stochastic sampling
code.

hmmbuild output has changed; it now includes columns for effective
sequence number and relative entropy/position.

Automated test coverage has increased.
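
To make the --ere/--esigma behavior described above a little more concrete, here is a hedged Python sketch of one two-parameter target function that matches the description: a floor set by ere, plus a tail that raises the per-position relative entropy target for short models. The exact formula in the 3.0b1 source may differ. The function name and the default values used here (ere=0.59, esigma=45.0) are assumptions for illustration only.

```python
import math


def entropy_target(model_length, ere=0.59, esigma=45.0):
    """Per-position relative entropy target (bits) for a model of length M.

    Assumed form (illustrative, not taken from the release notes):
        target = max(ere, (esigma - log2(2 / (M * (M + 1)))) / M)
    The second term is the "tail" that grows as M gets smaller.
    """
    if model_length < 1:
        raise ValueError("model_length must be >= 1")
    m = float(model_length)
    tail = (esigma - math.log2(2.0 / (m * (m + 1.0)))) / m
    return max(ere, tail)


if __name__ == "__main__":
    for length in (10, 50, 100, 1000):
        print(length, round(entropy_target(length), 3))
    # A large negative esigma removes the tail, leaving the fixed ere target.
    print(entropy_target(10, esigma=-9999.0))
```

Under this assumed form, a large negative esigma makes the tail term negative for any model length, so the max() always returns ere. That is the "fixed RE target" behavior the notes describe for --esigma -9999.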

Numbered bugs fixed:

#h55 hmmbuild failed to build reasonable model on fragment-rich MSA
#h53 hmmalign --trim had memory corruption fault
#h52 hmmscan segfaulted on rare comparisons with nenvelopes=0
     after stochastic clustering.
#h51 "hmmconvert -2" created invalid HMMER2 model if ali len > 9999
#h50 "hmmbuild -n" (assign name to HMM) didn't have any effect
#h49 domain definition was sometimes finding the same domain twice
#h48 Degenerate residues caused target sequence to be bias-filtered.
#h47 "jackhmmer -A" segfaulted.

Unnumbered bugs fixed (some of the things I found that you didn't):

"hmmalign -o" now works as intended; it wasn't working before.

Alignment display midlines were showing + signs everywhere,
because of a typo in the code.

================================================================
= Open bugs
================================================================

#h44 Compiler "option -xk not supported" warnings on MacOS/X
#h43 "non-aligned pointer being freed" on MacOS/X x86_64 icc
#h42 Low information content models have poor sensitivity