HMMER 3.0b3: the final beta test release

The final HMMER3 beta test release is available for download as a tarball. It includes precompiled binaries for Intel/Linux systems, on both 64-bit (x86_64) and 32-bit (ia32) platforms; you can just copy these into some directory in your $PATH if you like. If the precompiled versions don't work for you, you can always compile your own version from the source.
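
If you build from source, a standard configure-and-make build works. As a minimal sketch (the tarball name and the default /usr/local install prefix are assumptions; adjust them to whatever you downloaded and wherever you want the programs):

    % tar xzf hmmer-3.0b3.tar.gz
    % cd hmmer-3.0b3
    % ./configure
    % make
    % make install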

The main additions in 3.0b3 are parallelization and even more speed. It now includes support for multicore machines (POSIX multithreading) and for clusters (MPI). For the first time, we're seeing some searches exceed NCBI BLAST search speed. Yes, you read that right: profile HMM search speeds have started to exceed BLAST speed. And we're far from done.

The release notes for 3.0b3 follow:


HMMER 3.0b3 release notes : HMMER3 beta test, release 3
http://hmmer.org/
SRE, Fri Nov 13 16:22:28 2009


With this release, HMMER 3.0 is nearly complete, as far as our design goals for the first public release are concerned. This release, 3.0b3, is likely to be the final beta test release. The next release is planned for mid-December 2009, and is likely to be the 3.0 public release.

The main things missing now are good documentation and the two manuscripts we're planning to describe HMMER3 technology. We expect to focus on documentation and additional testing for the next few weeks. We may also add more support for different input formats, especially multiple alignment formats. We may also add a binary sequence database format and an executable to produce that format (akin to BLAST's xdformat or formatdb); disk input, not computation time, is now our most important speed bottleneck.

We welcome Michael Farrar and Travis Wheeler to the HMMER3 development team. This release includes their first code contributions. We also welcome Bjarne Knudsen (CLCbio, Aarhus, Denmark) as a collaborator on the HMMER project. We expect to begin including Bjarne’s contributions in the next release.

Most important new features:

Multithreaded parallelization for multicore machines.

The four search programs now support multicore parallelization. By default, they will use all available cores on your machine. You can control the number of cores each HMMER process will use with the --cpu command line option or the HMMER_NCPU environment variable. Accordingly, you should observe about a 2x or 4x speedup on dual-core or quad-core machines, relative to previous releases. Even with a single CPU (--cpu 1), HMMER will devote a separate execution thread to database input, resulting in significant speedup over serial execution.

To turn off POSIX threads support, recompile from source after giving the '--disable-threads' flag to ./configure.
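
As a concrete sketch of the threading controls described above (the query and database file names are placeholders, not files shipped with HMMER; the Pfam-A.hmm database is assumed to have been prepared with hmmpress):

    % hmmsearch --cpu 4 globins4.hmm uniprot_sprot.fasta    # use 4 worker threads for this search
    % export HMMER_NCPU=2                                   # or set a default for every HMMER program
    % hmmscan Pfam-A.hmm my_proteins.fa                     # picks up HMMER_NCPU from the environment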

MPI parallelization for clusters.

The four search programs and hmmbuild now also support MPI parallelization on clusters that have MPI installed. In binaries compiled for MPI support, an "--mpi" option is available for running a master/worker parallelization model under MPI.

To use MPI support, you need to compile your own HMMER binaries from source. We do not put MPI support in our precompiled binaries, because we can't be sure you have an MPI library (like OpenMPI) installed. To enable MPI support, add the '--enable-mpi' flag when you run ./configure.
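
For example, assuming an MPI library such as OpenMPI is installed and your cluster launches jobs with mpirun (the file names and the 16-process job size are placeholders):

    % ./configure --enable-mpi
    % make
    % mpirun -np 16 hmmsearch --mpi globins4.hmm uniprot_sprot.fasta > globins4.out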

Support for PowerPC and other processors.

Previously, HMMER3 only supported x86-compatible processors with SSE2 vector instructions. This includes Intel processors from the Pentium 4 on, and AMD processors from the K8 (Athlon 64) on; we believe this covers almost all Intel processors since 2000 and AMD processors since 2003.

HMMER now also supports PowerPC processors that support Altivec/VMX vector instructions. This includes Motorola G4, IBM G5, and IBM Power6 processors, which we believe includes almost all PowerPC-based desktop systems since 1999 and servers since 2007.

Your processor type is automatically detected at configuration time. You will see it choose one of two vector implementations: "impl_sse" or "impl_vmx". If you want (or if it fails to detect your machine properly), you can force the choice with './configure --enable-sse' or './configure --enable-vmx'.

If you have a processor that our vector implementations don’t currently support, the ./configure will detect this and set you up with the “dummy” implementation, impl_dummy. The dummy implementation should work on anything (that supports POSIX/ANSI C99, anyway). The dummy implementation is very portable, but because it does not use a vector instruction set at all, it is very slow. The intent of the dummy implementation is to make sure that you can reproduce HMMER3 results almost anywhere, even if you have to wait for them.

A change in the logic of how reporting thresholds are applied.

3.0b2 had a deliberate "design feature" in how it applied reporting thresholds, including Pfam GA/TC/NC bit score thresholds, that has disturbed many people. If a sequence passed the per-sequence score threshold, then at least one domain (the best-scoring one) was always reported in the per-domain output, regardless of whether that domain had passed the per-domain (GA2) threshold or not. The idea was that we didn't want you complaining that a sequence was reported in the per-sequence output, so where was it in the domain output? Trouble was, anyone trying to apply reporting cutoffs strictly (Pfam GA cutoffs, for example) got peeved, because domains appeared in the per-domain output that had not met the domain threshold.

3.0b3 changes this behavior. Reporting thresholds are now applied strictly.

A change in the logic of how filter thresholds are applied.

3.0b2 had a deliberate "design feature" in how it applied thresholds to the fast heuristic filters in its sequence processing pipeline. The MSV, Viterbi, and Forward filter scores are required to pass certain P-value thresholds (0.022, 0.001, and 1e-5 by default, respectively) before a sequence is subjected to the full Forward/Backward computations. When the number of target comparisons is small, the filter thresholds (especially the 1e-5 Forward filter threshold) could become more stringent than the reporting thresholds, so a sequence that passed the reporting thresholds wouldn't be reported, because it failed at the filters. For example, a sequence whose Forward score has a P-value of 0.0001 would have a reportable E-value of 0.01 if you only search 100 sequences, but a P-value of 0.0001 fails the Forward filter threshold. Therefore 3.0b2 applied an "OR test": it also allowed a sequence through the filters if its score and E-value passed the reporting thresholds. But this has all kinds of weird side effects, with sequences winking counterintuitively in and out of reported output. One is that a sequence might pass the filters in a small database, but not in a large one. Also, because E-value estimates during a search have to be lower-bound estimates based on the number of sequences seen so far (HMMER doesn't know the total number of sequences in a FASTA file 'til the search is done), a sequence could wink in and out of the output depending on the order of sequences in the sequence database.

The OR test is now removed. Comparisons must strictly pass specified P-value thresholds at each of the three filters. Search outputs are more strictly reproducible as a result.

If you compare to 3.0b2 outputs, you may notice fewer hits being reported with marginal to poor E-values, because 3.0b3 is failing some of these sequences at the filter steps.
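
If you're worried about losing a marginal hit at the filter stage, you can trade speed for sensitivity from the command line. A sketch, assuming the pipeline's filter options (--max, --F3) are available in your build; run 'hmmsearch -h' to check, and note the file names are placeholders:

    % hmmsearch --max query.hmm targetdb.fasta         # turn all filters off: slowest, most sensitive
    % hmmsearch --F3 1e-3 query.hmm targetdb.fasta     # relax only the Forward filter P-value threshold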

A change in the tabular output formats.

Both tabular output formats (--tblout and --domtblout) now include columns giving the accessions of the query and the target, in addition to their names. If an accession isn't available, a "-" (dash) appears in the accession column.
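
For example, a minimal parse of the per-sequence table with standard UNIX tools (a sketch; the field order assumed here is target name, target accession, query name, query accession, then the full-sequence E-value in field 5, with the free-text description trailing at the end of each line):

    % hmmsearch --tblout hits.tbl query.hmm targetdb.fasta > /dev/null
    % grep -v '^#' hits.tbl | awk '{print $1, $3, $5}'    # target, query, full-sequence E-value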

Small new features:

  • The programs can read and write aligned FASTA ("afa") format. They cannot autodetect it, though. To read a multiple alignment file in afa format, use "--informat afa" with programs that accept the --informat option.
  • hmmalign now has an --allcol option, which forces it to output an aligned column for every consensus position, even if a consensus column contains all gap characters. This can be useful when you're trying to keep track of some sort of stable labeling/numbering on aligned consensus columns for a model.
  • The search programs now have a --noali option, to suppress the voluminous alignment output. (Here, in big production work, we commonly send the main output to /dev/null, and use --tblout and --domtblout to capture the results in easily parsed space-delimited tabular form.)
  • The search programs now have an --acc option, which requests that accessions be displayed instead of names for queries and targets, whenever an accession is known. For queries or targets lacking accessions, the name is displayed as a fallback. Because this can result in a mix of accessions and names, you should probably make sure that all your queries/targets have accessions before you use --acc.
  • In output files, the section header that read "Domain and alignment annotation for each sequence:" now reads "Domain annotation for each sequence (and alignments):", and with the new --noali option, the " (and alignments)" phrase is omitted. If you had a parser keying off this line, watch out.
  • The installation process from source code now supports using separate build directories, using the GNU-standard VPATH mechanism. This allows you to maintain separate builds for different processors or with different configuration/compilation options. All you have to do is run the configure script from the directory you want to be the root of the build directory. For example:

    % mkdir my-hmmer-build
    % cd my-hmmer-build
    % /path/to/hmmer/configure
    % make
  • Sorting of top hits lists (both per-sequence and per-domain) now resolves ties in a consistent and reproducible fashion, by sorting on names, even in threaded and MPI parallel searches. This may produce small differences in order of hits seen in outputs compared to earlier H3 releases.
  • The test suites have expanded modestly.

All reported bugs have been fixed:

  • #h68 hmmalign fails on some seqs: Decoding ERANGE error
  • #h67 H2 models produced by H2 hmmconvert aren’t readable
  • #h66 hmmscan output of HMM names/descriptions corrupted
  • #h65 hmmpress -f corrupts HMM database instead of overwriting it
  • #h64 traces containing X (fragments) are misaligned
  • #h63 stotrace unit test may fail on i686 platforms
  • #h62 hmmscan segfaults on sequences containing * characters
  • #h61 hmmscan --cut_{ga,nc,tc} fails to impose correct thresholds
  • #h60 hmmbuild --mpi fails to record GA,TC,NC thresholds
  • #h59 seq descriptions containing ‘%’ cause segfault
  • #h58 hmmscan tabular domain output reverses target, query lengths
  • #h57 jackhmmer fails: “backconverted subseq didn’t end at expected length”

Some unreported bugs are fixed, including:

  • hmmsim was using a weak random number generator; could cycle/repeat
  • a bug in calculating checksums for multiple alignments is fixed.
    (HMMs built with 3.0b2 or earlier will have different alignment checksums recorded in their HMM files than 3.0b3 calculates. This means that hmmalign --mapali will fail if you try to use 3.0b3 hmmalign --mapali with models constructed with versions prior to 3.0b3.)

31 thoughts on “HMMER 3.0b3: the final beta test release”

  1. Pingback: HMMER3 released – with a pre-compiled binary! « The Lacunae

  2. Well … this version 3.0b3 of hmmscan gives really strange results using the --cut_ga option: applied to the Saccharomyces cerevisiae 288C predicted proteome, we found 160 new domains not present in the official proteome described in the Pfam database (using the same strain and the same version of the Pfam database). And when we submit the corresponding proteins (in which new domains have been found by hmmscan) to the Pfam search interface with the GA option, these new domains don't appear.
    It seems that there is still a problem with the --cut_ga option and that some domains are found significant whereas they shouldn't be.


  3. Further to Romain's comment about differences between hmmscan and the Pfam website – there is no problem with the --cut_ga option. He wrote to us directly, and he sees multiple matches to several members of the same clan with hmmscan. On the Pfam website, we filter out overlapping families if they belong to the same clan, and we display only the most significant match within the clan. This is why he sees more domains being reported with hmmscan than we show on the Pfam website.

    Jaina
    Pfam team


  4. In order to parse the table output from hmmsearch (option --tblout), I need at least two blank spaces between the columns (because the target description has only one blank space between words).
    In the newest beta there is only one blank space between the accession and the query name, so I can't parse it anymore. Could you fix this?


  5. Saddy, sorry, I don’t think there’s a problem with the tblout format. It’s a whitespace delimited file for the first 18 fields on each line; everything after field 18 is free text target description. We parse this with awk, perl, sort, cut, join, and other tools.

    And you certainly can’t count on a free text description never having two adjacent spaces, so your change wouldn’t work (assuming I understand what you’re suggesting).

    And it’s not true that there is “only one space” between accession and query name; there is *at least* one space, and additional spaces are added as needed to get things in columns.

    And (finally) I’m not sure what you mean by you can’t parse it “anymore”… it hasn’t substantively changed (two fields are added, for the target and query accessions).


  6. I can't seem to get hmmscan to use more than 2 cores on my i7 quad-core desktop. I have set the flag --cpu 4 and --cpu 8, but %CPU never gets above 200. The "number of worker threads" in the output shows the correct number (e.g. 4 or 8). Is this possibly due to disk being the bottleneck?


  7. Morgan – yes, that's possible. It's likely using the threads, but failing to keep them fed with enough work. Disk input is becoming rate limiting, especially in hmmscan and especially for small queries. Try giving a large (>1000 aa) query to hmmscan, if you have time to play around with it; that'll likely stress the worker threads more and the input thread less, and you should see the CPU load increase to the expected 4 on a quad-core.

    It may be advantageous to have the profile database on local disk, rather than on a network file system, depending on how fast or slow your network is.

    Michael Farrar’s optimization work here is now aiming more at disk input, of both sequences and models, for both threads and MPI. Eventually we’re likely to develop a client/daemon version of hmmscan that keeps the profile database resident in memory, to take the disk out of the equation altogether.


  8. Thanks for the response, Sean. I have looked at the hmmscan process a bit more and I don't think disk IO is the limitation. "top" shows that there is no wait on disk IO (~0.0%wa in CPU state). Using iotop also shows that the disk is not constantly reading or writing. It seems that the bottleneck must be memory.


  9. I confirm the answer that Jaina Mistry gave to my post: I forgot the clan-overlap filtering step performed by the Perl script provided by the Pfam team. Sorry for this erroneous information. I performed other tests on my data and I didn't detect any problem.


  10. How about providing the option to export results in JSON? It would then be possible to very easily load results in virtually any programming language that has JSON parser support (e.g. 2 lines in Perl vs. a custom, reinvented regex parser that everyone writes/tests/debugs for each desired programming language).

    BTW – I love how fast HMMER3 performs (25x speedup in my crude, unoptimized tests) – incredible!


  11. Hi, I have two questions:

    1. Do you have an estimated release date for the Cell processor version of H3? I’d appreciate an answer from the “coding reality” division 😉
    2. In the existing parallel versions (pthreads and MPI), which parts are parallel and which are not?

    Thanks!


  12. Luke: JSON sounds like a good idea; we’ll think about it seriously.

    Sebas: 1. We're no longer encouraged by the future prospects for the Cell processor; our work on the Cell is currently deprioritized. 2. Explaining the parallelization strategies in the various programs would take quite a long answer, one that probably wouldn't be much clearer than just reading the code, if I tried to answer you here. I'd rather leave that subject for future papers and documentation than try to explain it in a blog comment.


  13. Does jackhmmer use hmmalign to create an MSA (of significant hits) and then does it use hmmbuild to create the profile?

    The reason I'm asking is that I'd like to know whether jackhmmer is generating a true MSA (all sequences compared to one another) or whether it is generating a PSI-BLAST-like query-anchored alignment (sequences compared only to the query, with the alignment condensing all pairwise segments without comparing them to one another).

    I’d appreciate any insight you can give me. Thanks.


  14. Hi Mileidy,

    jackhmmer uses the same routines that hmmalign, hmmbuild, and hmmsearch use, to implement an iterative search.

    All profile HMM alignments are alignments of each individual sequence to one common profile: N sequences aligned by N profile/sequence comparisons. We have no NxN sequence comparison step.


  15. When trying to run hmmpress on a large concatenated .hmm file (3 GB, still have to figure out if this a good idea), I got this error:

    Fatal exception (source file ../../src/p7_hmmfile.c, line 782):
    ftello() failed

    I solved it by passing CFLAGS='-D _FILE_OFFSET_BITS=64' to ./configure and recompiling, so it looks like seeking beyond 2 GB doesn't work by default. Perhaps you could set this by default if HMMER is being built on a 64-bit architecture.


  16. I just thought it would be nicer to show the length of the domains (the information found in the .hmm files) in the hmmscan output. By comparing it with the span of the hit on the sequence, one can judge whether the domain is almost full-length or truncated. Although there is a field with "[]", "[.", ".]", or "..", it does not tell how many amino acids are left at the termini.


  17. Michael – thanks, we’re puzzled what’s happened there. The ./configure sets _FILE_OFFSET_BITS automatically. We’ve verified that it configures properly if you rebuild from source. Somehow the x86_64 precompiled binaries for 3.0b3 don’t have the flag set properly, and we’re not sure how that happened. We’ll keep an eye out for it in the 3.0 build.

    Hiroshi – the domtblout format has the query and target lengths, so use --domtblout to get what you want here. In the "human readable" standard output, line length is at a premium and we don't have room, hence the [], .], [., .. code.


  18. Sorry, I didn’t state that I compiled on x86_64 in the first place, and _FILE_OFFSET_BITS was not set. I just ran configure again without command line options, and this is part of the output:

    checking for _FILE_OFFSET_BITS value needed for large files… no

    So on my system, too, the test that configure uses doesn't reflect reality.


  19. Oi vey. OK, that’s an important clue. That puts the blame on GNU autoconf, and means we must have a system in our compile farm that behaves the same way yours does. We’ll look into it again, thanks.


  20. Hi,

    I ran hmmscan (HMMER 3.0b3) with an E-value of 0.1 but got some results mentioning "inclusion threshold". Can you please clarify what that is for? An example is appended below.

    *********************
    Scores for complete sequence (score includes all domains):
       --- full sequence ---    --- best 1 domain ---   -#dom-
        E-value  score  bias     E-value  score  bias    exp  N  Model  Description
        ------- ------ -----     ------- ------ -----   ---- --  -----  -----------
      ------ inclusion threshold ------
          0.012   12.2   0.0       0.014   12.0   0.0    1.1  1  51182
    **********************

    Thanks


  21. There are two kinds of thresholds: reporting and inclusion. The reporting threshold controls what appears in the default output. The inclusion threshold controls which domains get marked as "significant" in the domain output (! instead of ?), what appears in optional output alignments (-O) from the search programs, and what goes into the next round of jackhmmer. The reporting threshold would generally be less stringent than the inclusion threshold.
    In the per-sequence output, where you’re looking, the inclusion threshold gets marked; this is the line where HMMER thinks you should stop looking at the hits. Everything above it should be real (but isn’t necessarily, of course), and everything below it should be looked at with a jaundiced eye.


  22. I have created a profile HMM database, and hmmscan (HMMER3) is working fine on it, except when I want to use the --domtblout option. This option does work with the Pfam HMM db, though. I made the HMM db using hmmbuild and hmmpress.

    It would be great if you can give some insight on this.


  23. It's my very first time using HMMER 3.0 and I'm struggling to get hmmbuild to work on my stk file. It says
    “Error: Alignment file parse failed: parse failed (line 2): didn’t find // at end of alignment” although I have checked it and it does have // at the end.
    Also, I've tried generating the Stockholm file via a different converter and I still get the same error message.
    What am I doing wrong?
    Thanks !

