HMMER3.1 beta test 1 released

The HMMER dev team is happy to announce an upgrade release of HMMER3, release 3.1. A beta test version of the code is publicly available as a tarball available for download, or from hmmer.org, where you'll also find precompiled binary releases for Mac and Linux.

HMMER 3.1 includes nhmmer and nhmmscan, programs for DNA/DNA homology searches with profile HMMs. nhmmer has already been incorporated in RepeatMasker, in collaboration with Arian Smit and colleagues, and is the software underlying the Dfam database of profiles for mobile DNA elements.

HMMER 3.1 database searches are about twice as fast as HMMER 3.0 was, fulfilling old campaign promises.

HMMER 3.1 includes hmmpgmd, the parallel search daemon underlying HMMER Web Services at hmmer.org.

This code is expected to be stable, but we're releasing it as a beta test just to be careful. After some time in the wild, we'll make a release candidate, and if you folks haven't chewed any of that up too badly, we'll make the final 3.1 release reasonably soon.

Congratulations to Travis Wheeler, 3.1's build master -- note the TJW on the notes below the fold, not an SRE -- the first HMMER release managed by someone besides me (Sean).

Meanwhile... slowly, slowly, HMMER4 takes shape, as the gnomes of HMMER Labs toil sleeplessly on their latest monstrosities. The long awaited return of glocal alignment has been delayed into HMMER4, because the changes required turned out to be, um, quite extensive.

Detailed release notes for 3.1b1 follow.


HMMER 3.1 beta test 1, release notes

http://hmmer.org
TJW, Mon Apr 22 10:26:43 2013

HMMER 3.1 is an upgrade of HMMER3. It includes DNA/DNA searching, the hmmer.org web services daemon, and more speed; miscellaneous other improvements; and twenty fixes for reported HMMER3.0 bugs, as follows:

DNA searches.

The new tool nhmmer is used to search one or more queries (where a query can be an alignment, a sequence, or an HMM built by hmmbuild) against a database of (potentially long) DNA sequences. This is essentially the DNA analog of a merger of hmmsearch and phmmer. The new tool nhmmscan is used to search one or more query DNA sequences against a database of DNA HMMs (analogous to hmmscan for proteins).

More speed.

Bjarne Knudsen, Chief Scientific Officer of CLC bio in Denmark, contributed an important optimization of the MSV filter (the first stage in the accelerated "filter pipeline") that increases overall HMMER3 speed by about two-fold. This speed improvement has no impact on sensitivity.

The hmmer.org web server daemon.

We couldn't quite get enough speed out of standard cluster parallelization approaches like MPI, so we implemented our own hand-rolled IP socket communication protocol in a daemon called hmmpgmd.

In its current form, hmmpgmd will only work in homogeneous server environments, and documentation is sparse. Both of these traits will be addressed in future releases.

Two example client programs are also included in the release: one written in C (hmmc2), the other written in perl (hmmpgmd_client_example.pl).

Other improvements

  • A new tool to add a "mask line" to an alignment. A new tool called alimask is used to apply a mask line to a multiple sequence alignment, based on provided alignment- or model-coordinates. When hmmbuild receives a masked alignment as input, it produces a profile model in which the emission probabilities at masked positions are set to match the background frequency, rather than being set based on observed frequencies in the alignment.
  • HMM file format change. The format of HMM files has changed slightly. The new format is called the 3/f format. This format includes two new fields for each match state line: the CONS consensus residue and the MM mask value. In addition, there are three new header lines: (1) MAXL <n>, used to specify the maximum length at which HMMER expects to see an instance of the model (used in nhmmer DNA search), (2) CONS <yes|no>, used to indicate that per-position consensus positions are captured, and (3) MM <yes|no>, used to indicate that per-position masking is employed. Previous HMMER3 flatfiles are read fine in a reverse compatible mode. However, any binary press'ed HMM databases must be re-press'ed with hmmpress.
  • The programs phmmer, hmmsearch, and hmmscan offer a new tabular output format for easier automated parsing, --pfamtblout. This format is the one used internally by Pfam, but we make it more broadly available in case it is of use elsewhere. An analagous output format is available for nhmmer and nhmmscan, --dfamtblout.
  • Using new default parameterization for DNA, including a new mixture Dirichlet and changing the default relative entropy value (which can be set using --ere) from 0.59 to 0.45 bits.
  • Hit lists should now sort properly even for very high scores.
  • The program hmmstat includes new flags to convert between bit score and E-value for a given model (depending on input database size).
  • The program hmmbuild includes a new flag, --maxinsertlen, that causes weighted-average I->I transition counts to be limited to n.
  • For DNA and RNA alphabets, hmmconvert now writes only single GA/TC/NC values, since the second (for domains) is unnecessary.
  • The program hmmbuild now has a flag, --single, which, for single-sequence "alignments", produces models built using the substitution matrix that phmmer would have implicitly used.
  • Multiple alignment outputs now always include all consensus columns, even if that column is all gaps. This simplifies downstream processing for a lot of people's parsers, who may be expecting to be able to map a query's consensus coordinate system unambiguously onto a new alignment.
  • The hmmalign --allcol option (which did the above as an option) has been removed. To get the original behavior, you can use esl-reformat --mingap to remove all-gap columns from an alignment.
  • The "bias" score values in all output formats, for both per-sequence and per-domain outputs, were erroneously reported in nat units in 3.0, whereas all other scores are bit scores. 3.1 now corrects this; the bias values are in units of bits.
  • hmmbuild output now includes a column showing the value for MAXL, used as described in the "HMM file format change" section above.
  • hmmemit now has -a option for sampling alignments in Stockholm format.
  • hmmemit can now read more than one HMM from <hmmfile>.
  • Added hmmemit -C option: fancy consensus generation, showing no, weak, and strong conservation as n/x, lowercase, and uppercase. hmmemit --minl, --minu options control thresholds for the consensus.
  • Added --pnone option to hmmbuild, jackhmmer: no prior at all. The result is to parameterize as frequencies instead of MP estimates. This is useful when using hmmbuild/hmmemit to make easy-to-explain consensus sequences from alignments.
  • For hmmbuild and jackhmmer, changed --laplace to --plaplace.
  • Improved the text output of make to be easier on the eyes.

Bugfixes

  • #h79 Resizing a generic dynamic programming matrix with p7_gmx_GrowTo() could result in memory corruption. This function was not called in production H3 code, so the bug only appears in test code or other H3-based code derivatives (such as Elena Rivas' mouse vocalization recognizer, which is where the bug was caught).
  • #h80 hmmconvert was unable to read HMMER2 "Nucleic" save files, because H3 uses "DNA" or "RNA" as an alphabet label rather than H2's generic old "Nucleic".
  • #h81 "dombias" value in output files was being reported in nats. All score output is supposed to be reported in bits.
  • #h82 hmmbuild was corrupting resaved alignments if a target sequence contained all insert residues and no consensus.
  • #h83 Command line E-value threshold options only worked down to \~1e-30 or so because of some errant casts of internal P-values to single-precision floats. All P-values are supposed to be double precision, ranging down to \~1e-308.
  • #h84 hmmbuild was giving "composition fails pvector validation" error on some long (M > 10K or so) alignments. This was caused by a roundoff error accumulation.
  • #h85 hmmscan and hmmsearch would occasionally give different results for same model.
  • #h86 man pages had stray @KEYWORDS@ and weren't installed by make install
  • #h87 The documentation of --fragthresh and esl_msa_MarkFragments() incorrectly suggested that the rule was L < x * average_seqlen but the rule is actually L < x * alignment_len
  • #h88 Many classic score matrices are "invalid". The previously-used Yu/Altschul procedure would frequently fail, an effect of roundoff of the integer values used to store common score matrices. Matrices are now calculated using esl_scorematrix_ProbifyGivenBG()instead of the Yu/Altschul method. This change results in slightly diminished performance on internal benchmarks for phmmer using the default BLOSUM62 scoring matrix (converted to conditional probabilities), but is reliable and mathematically correct.
  • #h89 The environment variable HMMER_NCPU would create incompatibility with -n, --mpi
  • #h90 phmmer's displayed consensus sometimes differed from the query sequence. For example, under BLOSUM62 scoring, the most likely residue to align to an M is an L, not another M. The consensus of a model is now set in the model itself, at the time of model construction.
  • #h91 hmmscan failed to cleanly detect corrupted .h3* hmmpress files, and could fail with a floating point exception that was not at all helpful in identifying the actual problem.
  • #h92 jackhmmer failed in a late iteration on a large db containing translated ORFs with many * characters with the following:

    Fatal exception (source file ../../src/p7_alidisplay.c, line 429): backconverted subseq didn't end at expected length (GKCLLH401BTHCR_1/A9HXI5_BORPD-i4) Abort trap

    Same apparent phenotype as #h57, but #h57 was recorded closed 1 Jul 09 in the 3.0b3 release. Was apparently fixed by the solution to bug #h82. Recording as a separate bug, even though it was magically already fixed, so we have a record of the distinct phenotype.

  • #h93 Running out of disk space corrupts outputs. Return status of fprintf(), fputs(), etc. calls was not being checked, so HMMER was not detecting the problem.

  • #h94 phmmer assigned incorrect scores to degenerate residues.
  • #h96 Worker threads on OS X experienced denormal slow-down. Resolved with call to _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON)
  • #h97 Bias filter was inactive in phmmer.
  • #h98 Error printing very small E-values. If E was less than about 1e-708, it should have flushed to zero, but instead survived to experience denormal calculation, then highlighted a printf bug.
  • #h99 The --laplace flag was inactive in hmmbuild.