HMMER 3.1 beta test 2 released

The HMMER dev team is happy to announce a bugfix release of HMMER3.1, release 3.1 beta 2, aka 3.1b2. Following Google's ineffable lead in having perpetual beta test periods, 3.1 has been in beta test now for two years. When we said before that 3.1 will be released reasonably soon... "reasonably soon" continues to be a term of art for the dev team. Did we mention, it's a stable beta release?

Anyhoo, moving right along. The 3.1b2 source code is available for download as a tarball, or from hmmer.org, where you'll also find precompiled binary releases for Mac and Linux.

The most significant upgrade in 3.1b2 is that the nhmmer program for DNA/DNA comparison now includes a somewhat radical heuristic acceleration technique that gets us about 10x more speed. Travis Wheeler has used an FM-index data structure to accelerate remote homology search in nhmmer. FM-index techniques are well known now in the computational biology community for fast near-exact matching (in read mappers, for example), and there have been some proofs of principle for accelerating Smith/Waterman especially with scoring systems set for close matches; Travis's code is a full-fledged implementation in production code for remote homology. Travis is still working on it and writing it up. Meanwhile, you can try it out. If you format a DNA database with the new makehmmerdb command, and then use nhmmer --tformat hmmerfm to search the binary FM-index database, you'll use the new acceleration.
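
For readers who have not met an FM-index before, here is a toy Python sketch of the exact-match "backward search" that the data structure supports. It is not nhmmer's implementation: the names build_fm_index and find_exact are made up for this example, the index is built by naive suffix sorting (fine for tiny inputs only), and it finds exact matches only. How nhmmer uses its index for remote homology is considerably more involved.

```python
def build_fm_index(text):
    """Build a toy FM-index (suffix array, C table, occurrence counts)."""
    text += "$"  # unique sentinel; sorts before A, C, G, T
    # Suffix array by naive sorting of all suffixes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)
    alphabet = sorted(set(text))
    # C[c] = number of characters in text strictly smaller than c.
    C, total = {}, 0
    for c in alphabet:
        C[c] = total
        total += text.count(c)
    # occ[c][i] = number of occurrences of c in bwt[:i].
    occ = {c: [0] * (len(bwt) + 1) for c in alphabet}
    for i, ch in enumerate(bwt):
        for c in alphabet:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return sa, C, occ


def find_exact(pattern, index):
    """Return sorted start positions of exact matches of pattern."""
    sa, C, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    # Extend the match one character at a time, right to left.
    for ch in reversed(pattern):
        if ch not in C:
            return []
        lo = C[ch] + occ[ch][lo]
        hi = C[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


index = build_fm_index("ACGTACGTTACG")
print(find_exact("ACG", index))  # [0, 4, 9]
```

The point of the structure is that each pattern character costs a constant number of table lookups, independent of the size of the indexed text.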

Another significant upgrade is the inclusion of the hmmlogo program, which is essentially a commandline interface for producing the data underlying the Skylign profile logo server (skylign.org).

Also, eight, count 'em eight bugs have been fixed. Of the ones we count, anyway.

Congratulations again to Travis Wheeler, who continues as 3.1's build master, even though he is now afar in his new ~~mountain lair~~ faculty position at the University of Montana, as the HMMER dev team continues to scatter and flee from Virginia.

The horrible grinding noise you hear is the HMMER4 development code branch. Do not be alarmed. All is well. It will be ready... reasonably soon.

Detailed release notes for 3.1b2 are below the fold.

HMMER 3.1b2 release notes

http://hmmer.org/
TJW, Sun Feb 22 07:59:45 2015
________________________________________________________________

3.1b2 includes the following large changes:

New heuristic for accelerating nhmmer roughly 10-fold.

We have developed a new algorithm that accelerates DNA search in
nhmmer. The acceleration can be tuned, such that greater speed will
tend to decrease sensitivity. The default settings yield roughly
10-fold acceleration while retaining nearly complete sensitivity
among hits with E-value < 1e-3 (with a modest loss in sensitivity
among marginal hits with E > 1e-3).

This algorithm requires that the sequence database first be
preprocessed into a binary file format. The new tool makehmmerdb
performs this task.
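
As a rough sketch of the two-step workflow, the Python helper below only assembles the two command lines; it does not run anything. The helper name and the file names are hypothetical, and the positional argument order (sequence file then output file for makehmmerdb; query then target database for nhmmer) is an assumption that should be checked against the user guide. The --tformat hmmerfm option is the one given in the announcement above.

```python
def fm_search_commands(seq_file, fm_db, query_hmm):
    """Return argv lists for (1) building the FM-index db and (2) searching it.

    Nothing is executed here. Argument order is assumed, not verified.
    """
    # Step 1: preprocess the DNA sequence database into the binary format.
    build = ["makehmmerdb", seq_file, fm_db]
    # Step 2: search the binary database with the accelerated pipeline.
    search = ["nhmmer", "--tformat", "hmmerfm", query_hmm, fm_db]
    return build, search


# Hypothetical file names, for illustration only.
build, search = fm_search_commands("genome.fa", "genome.hmmerfm", "query.hmm")
print(" ".join(build))
print(" ".join(search))
```

To actually run the steps, each list could be passed to subprocess.run(cmd, check=True) on a machine with HMMER 3.1b2 installed.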

New method in hmmbuild for deciding if a sequence is a fragment.

If hmmbuild determines that a sequence is a fragment, all leading and
trailing gap symbols (all gaps before the first residue and after the
last residue) are treated as missing data symbols, and thus do not
count as observed gaps.

In H3.0 and H3.1b1, a sequence was called a fragment if its length was
less than a specified fraction of the alignment length. In the case of
alignments with many sequences, this often resulted in all sequences
being labeled as fragments, which could lead to unexpected terminal
match states when a small fraction of sequences contained a long
terminal extension. Now, a sequence is labeled a fragment if its range
in the alignment (the number of alignment columns between the first
and last positions of the sequence) is not greater than a specified
fraction of the full alignment length. This should improve HMMER's
ability to model alignments with ragged ends.
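
The difference between the two rules can be sketched in a few lines of Python. This is an illustration of the description above, not hmmbuild's code: the function names, the set of gap characters, and the threshold value of 0.5 are all assumptions made for the example.

```python
GAP_CHARS = set("-.~_")  # assumed gap symbols, for illustration only


def residue_count(row):
    """Number of residues (non-gap characters) in one aligned sequence."""
    return sum(ch not in GAP_CHARS for ch in row)


def aligned_span(row):
    """Alignment columns from the first to the last residue, inclusive."""
    cols = [i for i, ch in enumerate(row) if ch not in GAP_CHARS]
    return cols[-1] - cols[0] + 1 if cols else 0


def is_fragment_old(row, fragthresh=0.5):
    """Old rule: sequence length below a fraction of alignment length."""
    return residue_count(row) < fragthresh * len(row)


def is_fragment_new(row, fragthresh=0.5):
    """New rule: aligned range not greater than that fraction."""
    return aligned_span(row) <= fragthresh * len(row)


# A sequence that spans the whole alignment but sits opposite a long
# insertion contributed by other sequences.
spanning = "AC" + "-" * 12 + "GT"
# A sequence that covers only the first quarter of the alignment.
partial = "ACGT" + "-" * 12

print(is_fragment_old(spanning), is_fragment_new(spanning))  # True False
print(is_fragment_old(partial), is_fragment_new(partial))    # True True
```

Under the old rule both rows are called fragments; under the new rule only the row that really covers a small part of the alignment is.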

Other changes include:

-:- The DNA search tool, nhmmer, depends on a value MAXL, which hmmbuild
computes as an assertion of the maximum length at which HMMER
expects to see an instance of the model. This value could previously
become excessively long when building a model from an alignment with
many long insertions. The MAXL value computed by hmmbuild for DNA
alignments is now limited to 20*M, where M is the number of match states.

-:- A new tool, called hmmlogo, that computes letter height and indel
parameters that can be used to produce a profile HMM logo. This tool
can be thought of as a command-line interface for the data underlying
the Skylign logo server (skylign.org).
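
The MAXL limit described in the first item above amounts to a simple clamp. A minimal Python sketch, with a hypothetical function name and made-up numbers (how hmmbuild computes the uncapped value is not shown here):

```python
def capped_maxl(computed_maxl, num_match_states, is_dna=True):
    """Limit MAXL to 20*M for DNA models; leave other models unchanged."""
    if not is_dna:
        return computed_maxl
    return min(computed_maxl, 20 * num_match_states)


# A 100-state DNA model whose alignment had many long insertions.
print(capped_maxl(5000, 100))  # 2000
# A value already under the limit is left alone.
print(capped_maxl(350, 100))   # 350
```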

Bugfixes:

-:- #h100 hmmalign would segfault on a zero length input sequence.

-:- #h101 hmmsearch would segfault when searching a DNA HMM against a
protein db (on Linux only).

-:- #h102 Marginal hits late in a target sequence database were subject
to being filtered in an nhmmer search. This was due to a score
filter that (a) was intended to accelerate search, but had
essentially no impact on speed, and (b) was an overly
aggressive filter. Removed the filter.

-:- #h103 Error printing very small E-values. Closely related to #h98,
but occurring in the main thread (#h98 fixed the same problem
in worker threads).

-:- #h104 HMMER would not compile on OpenBSD, because netinet/in.h was
not included. This header file is included via arpa/inet.h
on most other systems, but not on OpenBSD.

-:- #h105 Errors encountered while running 'make clean' and 'make distclean'
in binary builds. This was the result of the Makefile trying to
remove the userguide folder and LICENSE.txt file, which are
already removed in the release process. The Makefile now accounts
for this possibility.

-:- #h106 H3 failed to read some old H2 HMM files. This happened when
(1) there was an empty DESC field in the file, or (2)
the model was not normalized. Both cases have been resolved.

-:- #h107 hmmsim only worked for amino acid models. It now also works for
nucleotide models.