hmmscan vs. hmmsearch speed: the numerology

From today’s email…

Suppose, for example, you want to search 300 million metagenomic sequence reads, each about 200nt long, against the Pfam database. What’s the best way to do that task with HMMER3? The bottom line: use hmmsearch, not hmmscan. For the numerology of why (and chapter and verse on how hmmsearch and hmmscan scale to large multithreaded and MPI tasks, their limitations, advice on how we do it, and some clues about what’s coming in the future), keep reading…

Translated searches

The first thing you’ll notice is that HMMER3 doesn’t yet implement translated DNA searches (blastx/tblastn/tblastx equivalents). So you need to produce a six-frame translation of your reads and save that as a protein sequence database. For 300M 200nt reads, that’s 300M x 200 nt / 3 nt/aa x 6 frames = 120G residues (assuming you translate everything, with no restrictions on ORF length or such).
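
For concreteness, here’s a minimal sketch of the preferred flavor of six-frame translation (splitting each frame on stop codons into separate ORFs, so no * characters ever reach HMMER). It uses Biopython, and the function name, the FASTA naming scheme, and the lack of a minimum ORF length are my own arbitrary choices, not anything HMMER requires:

# Six-frame translation, "separate ORFs per frame" style: split each frame on
# stop codons and emit the fragments as plain protein sequences.
from Bio.Seq import Seq

def six_frame_orfs(read, min_len=1):
    """Yield (frame_label, orf) for all six reading frames of a DNA read."""
    for strand, s in (("+", Seq(read)), ("-", Seq(read).reverse_complement())):
        for offset in range(3):
            frame = s[offset:]
            frame = frame[: len(frame) - len(frame) % 3]   # trim partial codon
            protein = str(frame.translate())               # '*' marks stops
            for orf in protein.split("*"):                 # stops dropped here
                if len(orf) >= min_len:
                    yield (f"{strand}{offset + 1}", orf)

# e.g. writing a protein FASTA for hmmsearch:
# for i, (frame, orf) in enumerate(six_frame_orfs(read)):
#     print(f">read1_orf{i}_{frame}\n{orf}")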

There’s two ways people do six-frame translation. You can translate each frame into separate ORFs, or you can translate the read into exactly six “ORFs”, one per frame, with * chararacters marking stop codons. HMMER prefers that you do the former. Technically, * chararacters aren’t legal amino acid residue codes, and the author of HMMER3 is a pedantic nitpicker, passive-aggressive, yet also a pragmatist: so while HMMER3 pragmatically accepts * chararacters in input “protein” sequences just fine, it pedantically relegates them to somewhat suboptimal status, and it passively-aggressively figures that any suboptimal performance on *-containing ORFs is your own fault for using *’s in the first place.

We plan to implement translated searches. In fact, we’re strongly considering implementing translated searches with a full gene model, a la GeneWise and Exonerate. We’ve not set any timeline for this though.

This task isn’t big, cpu-wise

Now it might seem natural to run

% hmmscan Pfam.hmm myorfs.fa

If you’re careful, and you run a pilot experiment on a small subset of myorfs.fa to see how well it performs, and you check your CPU usage, you’ll notice something disturbing: hmmscan doesn’t seem to use all the CPU. How come? Because hmmscan speed is usually limited by the time it takes to input the profile database, not by CPU time.

Here’s some back-of-the-envelope numerology, so you can see the issues in play (here shorthand on units is K=thousands, M=millions, G=billions; or 1e6=10^6=million, etc.):

First, how long should this compute take? (The short script after this list reproduces the arithmetic.)

  • The size of your compute scales as (sequence db size, residues) X (profile db size, consensus residues).
  • Here the sequence db is 120G residues.
  • The Pfam profile database consists of about 12000 profiles, of mean length ~200: that’s about 2.4M residues.
  • So the total compute size is 120G x 2.4M = 3e17 “dynamic programming cells” (dpc).
  • HMMER3 compares profiles to sequences at a rate of about 10G dpc/sec/core, on typical commodity cores.
  • So H3 needs 3e17 / 1e10 = 3e7 cpu-sec for this compute.
  • (Side note: there’s almost exactly pi x 10^7 seconds in a year. So 3e7 cpu-sec ~ 1 cpu-year.)
  • If we scaled this (perfectly) across 1000 cpus, it’d take about 8 hours realtime.
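
Here’s the same back-of-the-envelope arithmetic as a trivial script, in case you want to redo it for your own read count, read length, or profile database. The 10G dpc/sec/core rate is the rough figure quoted above, not a promise:

# Back-of-the-envelope compute size (all numbers rough, from the bullets above)
seq_db_residues  = 300e6 * 200 / 3 * 6               # ~120G translated residues
pfam_residues    = 12000 * 200                       # ~2.4M consensus residues
dpc              = seq_db_residues * pfam_residues   # ~3e17 dynamic programming cells
dpc_per_sec_core = 10e9                              # ~10G dpc/sec/core, commodity cpu
cpu_sec          = dpc / dpc_per_sec_core
print(f"{cpu_sec:.1e} cpu-sec = {cpu_sec / 3.15e7:.1f} cpu-years")
print(f"across 1000 cores: {cpu_sec / 1000 / 3600:.1f} hours realtime")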

With hmmscan, this task is very big, disk-input-wise

But we have to consider not just cpu speed but also how fast we can read the target database. If we do the search with hmmscan (again, a short script after this list runs the numbers):

  • With hmmscan, Pfam is the target database, and we’ll search it with ~4e9 query ORFs (assuming ~2 ORFs/frame, 6 frames, 300M reads).
  • Pfam is about 1 GB of data.
  • A typical local disk (SATA-2) will read data at about 300MB/sec.
  • So, each time we read Pfam from disk, it takes us 1G/300M = about three seconds.
  • (At this point, you see the problem. A 30aa ORF vs. Pfam takes about 7 milliseconds of CPU time, but 3300 milliseconds of disk time!)
  • So, all 4e9 hmmscan queries will cost us 4e9 x 3 sec = 400 input-years. That’s a lot, compared to the 1 cpu-year we need.
  • This is why hmmpress exists. hmmscan doesn’t try to read all 1 GB of Pfam. It reads a packed binary 70M file (the .h3f file), 14x smaller, which contains only the information necessary for the first stage heuristic filter of H3 (the “MSV filter”). So it doesn’t really take 400 input-years; it takes a mere 400/14 = 30 input-years. Still a lot, compared to our 1 cpu-year.
  • Of course, once you’re searching multiple queries, your filesystem is caching Pfam, not reading it from disk each time. You might get 1-5 GB/sec from your cache; say it’s 3 GB/sec, about 10x faster than your disk. Now we’re down to 3 input-years.
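
The input-side arithmetic, as a script (same caveats as before; the bandwidth numbers are ballpark figures, not measurements):

# Rough input cost of hmmscan on this task (numbers from the bullets above)
n_queries     = 4e9        # ~2 ORFs/frame x 6 frames x 300M reads
h3f_bytes     = 70e6       # packed .h3f file that hmmpress writes for Pfam
cache_bw      = 3e9        # ~3 GB/sec from filesystem cache (raw disk: ~300 MB/sec)
sec_per_query = h3f_bytes / cache_bw
input_years   = n_queries * sec_per_query / 3.15e7
print(f"{input_years:.0f} input-years just streaming the .h3f past every query")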

Now it should be clear that what’s limiting hmmscan is input speed, not cpu speed. The problem is exacerbated when you have a lot of small queries, like here, where each ~40aa ORF is only 10% the size of a typical full-length protein and requires only 10% of the typical cpu time per protein comparison.

Now it should also be clear that it doesn’t help to multithread hmmscan across cpu cores, especially on lots of small queries. Once input is rate-limiting, it doesn’t matter how many cores you’re splitting the job across. You might even notice that hmmscan --cpu 1 outperforms a default hmmscan, because the single thread is fully capable of handling the full input rate.

So why does hmmscan by default use all the cores in your machine? Optimism. The odds that you’re searching a large query (~1000’s of aa’s), which will scale reasonably across cores, outweigh the cost of running input-starved cores on small queries.

Could we do better? Oh sure. Eventually we will, too. For example, instead of relying on the filesystem cache, we could have hmmscan cache the entire .h3f dataset (or indeed all Pfam). Now we should be able to read the data at full memory speed (maybe 10-30 GB/sec for DDR3 SDRAM). But that’d still only get us down to ~0.4 input-years, compared to ~1 cpu-year. hmmscan still would scale poorly to even small numbers of cpu threads.

The ultimate solution is to parallelize the hmmscan input itself, so each core works on a pre-cached chunk of Pfam. This is exactly what the hmmpgmd daemon does, behind the scenes at the hmmer.janelia.org site. Each of the 12 12-core systems in the Janelia HMMER compute farm has precached both Pfam and NR, and input is parallelized across the 12 systems: so we can read data at an aggregate speed of around 30GB/s x 12 = 360 GB/s. That’s why a Pfam search at hmmer.janelia.org snaps results back to you in a blink. hmmpgmd is one of the new programs that will appear Real Soon Now, in the upcoming 3.1 release.

You don’t normally notice that hmmscan isn’t using the cpus effectively, because it’s so fast. A typical Pfam search returns in about 200 milliseconds, even on an input-starved single core. You’d only notice if you wanted to search lots of query sequences against Pfam. But if you’re doing lots of seqs against lots of profiles, you can turn around and do lots of profiles against lots of sequences instead…

Use hmmsearch when comparing lots of seqs to lots of models, not hmmscan: here’s why

Super-important numerology: each residue in a profile is represented by about 40 bytes of data, whereas each residue in a sequence is one byte. Residue by residue, Pfam is 40x heavier than a sequence database.

So, in terms of input:

  • hmmscan with 4e9 seq queries vs 70M Pfam .h3f binary: 280 PB database searched
  • hmmsearch with 12K Pfam queries vs. 120G translated ORF database: 1.4 PB database searched

Yowza. You definitely don’t want to run 280 petabytes through your system when you’re input-limited and you can do the same task in a 1.4 PB flow; that’s a 200-fold difference. (The 200-fold arises from the 40x more bytes per profile residue, and the 5x difference between a mean profile length of ~200 and a mean ORF length of ~40.)
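
Again as arithmetic you can poke at (the byte counts come straight from the list above):

# Total bytes streamed past the cpu, hmmscan vs. hmmsearch, for this task
hmmscan_bytes   = 4e9 * 70e6       # 4e9 ORF queries, each streaming the 70M .h3f
hmmsearch_bytes = 12000 * 120e9    # 12K profile queries, each streaming 120G residues at 1 byte/residue
print(f"hmmscan:   {hmmscan_bytes / 1e15:.0f} PB")
print(f"hmmsearch: {hmmsearch_bytes / 1e15:.1f} PB")
print(f"ratio:     {hmmscan_bytes / hmmsearch_bytes:.0f}x")   # ~200-fold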

hmmscan and hmmsearch are doing exactly the same compute, at heart: comparing one profile to one sequence at a time. Their bit score results are identical. You can save hmmsearch tabular output files and use ’em just the same way you were going to use the hmmscan files. (Um, watch out for E-values: remember that E-values depend on the size of the database you search.)

Because of the greatly reduced input demand, hmmsearch typically scales to multiple threads (about four or so) per system whereas hmmscan doesn’t; and hmmsearch also runs very effectively in MPI mode.

So, bottom line, on our Janelia internal cluster (4000 cpus), we’d run this metagenome analysis task in a single job w/ something like this:

% hmmsearch --mpi Pfam.hmm myorfs.fa

and go watch Barcelona whup Manchester, and come back in a few hours.

24 thoughts on “hmmscan vs. hmmsearch speed: the numerology”

  1. Does the MPI functionality in HMMER use any of the specific MPI capabilities like:
    1) low latency
    2) high bandwidth
    3) fast barriers
    ???

    I ask because setting up MPI can be a huge administrative hassle. I’ve found that in most cases it’s much easier to set up a small number of SQL and web servers, or a batch queue, to handle parallelism using simple RPC-over-HTTP than it is to use MPI. It also translates better to cloud environments where machines fail or have highly variable performance; MPI makes it really painful to develop code for these environments.

    The more interesting question for me is how hard it’s going to be to adapt HMMER3 to non-POSIX data access and to cluster management tools like those provided by Hadoop (I’m not suggesting you take the inner loops and convert them to Mappers and Reducers, although it’s an interesting avenue to explore). I’m thinking more of leveraging data serialization libraries like AVRO, and data-parallel workflow tools like Pig.

  2. We don’t use any fancy MPI capabilities. (But the MPI implementation isn’t particularly efficient yet, either.)

    MPI has been administratively straightforward for us. YMMV.

    Running HMMER3 in a low-reliability environment will cause some amount of hassle in coding no matter what your communication protocol is. You’d need to have something guaranteeing that work units get turned into output results once and only once, with low latency.

    Adapting H3 to non-POSIX data access should be straightforward. I don’t have any experience with Hadoop, AVRO, or Pig, though.

  3. I’m a believer (in that hmmsearch is a lot faster on my large sequence files). But I couldn’t follow your arguments well enough to answer the following question for myself: where’s the tradeoff? In other words, when is it better to use hmmscan rather than hmmsearch? Is it a function of the ratio of the sequence to the HMM DB sizes?

  4. I expect so. It’s an empirical question. You’d have to test particular cases. hmmsearch is designed for one (or a few) profiles vs. a big seq database. hmmscan is designed for one (or a few) sequences vs. a big profile database. Since profiles are about 40x larger than single sequences, residue for residue, I’d expect hmmsearch to be preferable any time you’re in a situation of lots of profiles vs lots of sequences. It’s important to say that this is only a matter of how many passes the program has to make over the target database, not a matter of what the results look like; for any given profile/sequence comparison, hmmscan and hmmsearch give identical results. (E-values differ, because E-values depend on the size of the target database; but if you’re running lots of queries, you’ll want to be thinking about setting the -Z parameter anyway to the actual number of comparisons you’re running, # of queries * # of targets.)
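
    A trivial sketch of the -Z bookkeeping described here, with the made-up counts from the post standing in for whatever your own run actually uses:

    # E-values relative to the whole experiment: pass -Z the total number of
    # profile/sequence comparisons (# of queries x # of targets).
    n_profiles = 12000      # e.g. Pfam queries to hmmsearch
    n_orfs     = 4e9        # e.g. translated ORF targets
    Z = n_profiles * n_orfs
    print(f"hmmsearch -Z {Z:.0f} Pfam.hmm myorfs.fa")   # -Z 48000000000000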

  5. Hi Dr. Eddy,
    I know you must be very busy. So I leave my questions here.

    The HMM profile of each protein family was generated from your so-called “seed sequences”. The seed sequences may come from different organisms, so the phylogenetic distances between them may vary a lot.

    My focus is on the bacterial and fungal genome sequences from a metagenomics dataset.
    I have two questions.
    #1
    Can I remove the non-bacterial sequences from the seed sequences and generate a new HMM profile for my annotation?
    #2
    In the HMM process, the HMM seems to be unchangeable, which seems a little unrealistic. I mean, when I start my metagenomics annotation, I would add the annotated sequences (with high confidence) into the seed sequences to generate new profiles for other genes’ annotation. Is this reasonable?

  6. Thanks for your reply,

    I am sorry I did not clarify the second question.

    #2
    I get a lot of proteins in a metagenomics dataset, named Protein_1, Protein_2, … Protein_N.
    Now I begin to find the functional domains with HMMER3 against Pfam, one by one.
    The results show that Protein_1 may have a functional domain from PF0001.1 with high confidence (E-value < 0.001), so I want to add Protein_1 into the multiple sequence alignment originally generated from the seed sequences of PF0001.1, and make a new HMM profile for predicting the other proteins. Is this reasonable?
    Hope this clarifies my question.

    • This is a version of your question #1. Yes, you can give HMMER any alignment you want to use as a query. That includes alignments where you’ve subtracted sequences from a Pfam alignment, or added sequences to one. But whatever sequences you put in the query alignment, you need to be confident that they are homologous.

  7. Pingback: Finally – Metaxa 1.1 - Microbiology, Metagenomics and Bioinformatics

  8. Has anyone had problems with the E-value threshold?

    The command I am using goes like this:
    hmmsearch --notextw -E 1e-50 --cpu 6 profile.hmm prot.fasta > profile_protein
    where I want to have all sequences that have an E-value <= 1e-50

    And in the output file, even the header says: # sequence reporting threshold: E-value <= 1e-50
    But it shows results that are above the threshold, and I don't know what is happening.

    Any ideas?

  9. That’s a known bug in 3.0. 3.0 can’t handle an E-value cutoff of less than about 1e-30, because at one point in the code it stores it as a floating-point value. Fixed in the next release.

  10. I have a question about handling the stop codon.

    If HMMER3 meets a stop codon (*) near the beginning of the sequence, how will HMMER deal with it?
    For example, say there is a mutation near the beginning which results in a stop codon.

    Will it adjust the score according to the location of the stop codon? I guess in this situation, it should return a very low score.

    Thanks.

  11. Best answer: * is not an amino acid symbol, and shouldn’t occur in protein sequences.

    What HMMER will actually do: prohibit it (score it as -infinity) in any match state, and ignore it (score it as zero) in any other part of an alignment (insert state or flanking nonhomology).

  12. Hi, recently I came across a tool called Pfamscan from the Pfam website [ftp://ftp.sanger.ac.uk/pub/databases/Pfam/Tools/README], and have a question about filtering overlapping hits:

    As stated in their description: “We group together families which we believe to have a common evolutionary ancestor in clans. Where there are overlapping matches within a clan, pfam_scan.pl will only show the most significant (the lowest E-value) match within the clan. We perform the same clan filtering step on the Pfam website.”

    I have performed an hmmsearch of my sequences against the Pfam database as described in this post, and would like to do filtering similar to that described above. Is this possible with the output from hmmsearch?

    Thank you very much for your kind assistance in advance!

    Maria

  13. Sure, but you’d have to implement for yourself something like what the Pfam website does. HMMER itself just deals with one model at a time, and isn’t aware of Pfam’s clan hierarchy.
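
    A rough sketch of what that post-processing might look like, assuming you have already parsed hmmsearch --domtblout output into per-domain hits and mapped each Pfam family to its clan yourself (the tuple layout below is a placeholder of my own, not a pfam_scan.pl or Pfam format):

    # Within each target sequence, keep only the best-scoring (lowest E-value) hit
    # among overlapping hits whose families belong to the same clan.
    from collections import defaultdict

    def clan_filter(hits):
        # hits: iterable of (seq_id, family, clan_or_None, evalue, env_from, env_to)
        by_seq = defaultdict(list)
        for h in hits:
            by_seq[h[0]].append(h)
        kept = []
        for seq_hits in by_seq.values():
            seq_hits.sort(key=lambda h: h[3])                  # best E-value first
            accepted = []
            for h in seq_hits:
                clash = any(a[2] is not None and a[2] == h[2]  # same clan
                            and h[4] <= a[5] and a[4] <= h[5]  # envelopes overlap
                            for a in accepted)
                if not clash:
                    accepted.append(h)
            kept.extend(accepted)
        return kept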

  14. Dear Sean,

    Hi, to follow up on how the size of the protein sequence database affects E-values during hmmsearch (let’s say using the Pfam-A HMM profiles): in our case we searched metagenomes of different sizes (ranging from 100,000 to 1,000,000 reads), using default settings for all of them. Should we then filter our searches based on bit score instead?

    Thank you very much for your advice in advance.

    Maria

  15. Generally you want to interpret results of any sequence database search in terms of statistical significance, which means using E-values or the like.

    If you’re doing multiple searches, against different database sizes, and you want to evaluate statistical significance relative to the total aggregated number of comparisons, you can use the -Z option to specify that number, and HMMER will calculate the E-values using that search space size.

  16. I don’t understand your use of the words “query” and “target”. “With hmmscan, Pfam is the target database, and we’ll search it with ~4e9 query ORFs (assuming ~2 ORFs/frame, 6 frames, 300M reads).” I expect hmmscan to hold the target database in memory, and read in the query ORFs one by one from a file to compare against the target DB. But you seem to be indicating that the opposite happens.

    I don’t think the terms “query” and “target” imply which of these things will be kept in memory, and which on disk. When I run a blast job, both are on disk.

    • Think of it as a nested for loop. Queries are in the outermost loop. Target database is in the innermost loop. hmmsearch searches query profile(s) against a target sequence database. hmmscan searches query sequence(s) against a target profile database.

      As it happens, both are on disk, for hmmscan and hmmsearch. In our parallel search daemon (hmmpgmd), now released in 3.1 release candidate code, target databases are cached in memory.
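
      Schematically (just a sketch of the loop order, with compare() standing in for a single profile/sequence comparison; this isn’t the actual implementation):

      def compare(profile, seq):
          ...   # one profile/sequence comparison; identical either way

      def hmmsearch(query_profiles, target_seq_db):
          for profile in query_profiles:          # outer loop: queries
              for seq in target_seq_db:           # inner loop: target db, streamed once per query
                  compare(profile, seq)

      def hmmscan(query_seqs, target_profile_db):
          for seq in query_seqs:                  # outer loop: queries
              for profile in target_profile_db:   # inner loop: target db, streamed once per query
                  compare(profile, seq)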

  17. Pingback: Creating and searching with Hidden Markov Models (HMMs) | Jeff Kimbrel

  18. This was a very helpful post, but I’d like to ask for greater clarification about the handling of stop codons.

    Sean, I’m a long-time fan of your work, and your point about translation termination signals not being amino acids is irrefutable. That said, as the genomics community continues to branch out and sequence the genes and genomes of a growing number of poorly-studied organisms, there will be a growing number of nucleotide sequences for which the correct genetic code is unknown, making even naïve ORF-calling and translation increasingly difficult.

    Our lab is currently sequencing the transcriptomes of dozens of poorly-characterized organisms. We have found evidence for alternative genetic codes being used by several of them, but with so much data, and with no good reference sequences available for most of it, it can be very difficult to determine the correct genetic code for every sequence we’d like to annotate.

    We’ve been doing what we can to identify non-canonical codon usage by BLAST’ing against reference databases and looking for high-identity alignments that stretch through naïvely-translated stop codons. I was hoping to upgrade our pipeline by replacing BLAST with HMMer, but unless I’m misreading your previous comments, it seems that you’ve intentionally made heuristic decisions that will make it difficult to use HMMer for this purpose. This would be very disappointing, so I’m hoping that I’ve simply misunderstood something.

  19. Ted, HMMER is currently designed for protein/protein or DNA/DNA comparison. For protein/protein comparison, it naturally assumes that the target sequence is, well, a protein sequence.

    Correct translation from DNA to protein, in the face of genetic code issues, editing, splicing, and what have you, is currently beyond our design scope, but we hope to address it in the future with a version of HMMER designed for comparing protein profiles directly to DNA sequence. I entirely agree it would be a desirable thing to solve well.

  20. Hi… I have done an hmmsearch analysis and gotten some output. My problem is how to extract the hmmsearch hit sequences in FASTA format, using the accession numbers, from the target database installed locally. I have something like the output below, but I want the FASTA sequences for further analysis.
    # hmmsearch :: search profile(s) against a sequence database
    # HMMER 3.0 (March 2010); http://hmmer.org/
    # Copyright (C) 2010 Howard Hughes Medical Institute.
    # Freely distributed under the GNU General Public License (GPLv3).
    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
    # query HMM file:                  clustalO-multialign.out.hmm
    # target sequence database:        IRGSP-1.0_protein_2013-04-24.fasta
    # sequence reporting threshold:    E-value <= 1e-05
    # - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

    Query:       clustalO-multialign  [M=1220]
    Scores for complete sequences (score includes all domains):
        --- full sequence ---    --- best 1 domain ---    -#dom-
         E-value  score  bias     E-value  score  bias    exp  N  Sequence         Description
         -------  -----  ----     -------  -----  ----   ----  -  --------         -----------
        3.9e-262  874.2  38.1    5.6e-261  870.3  26.4    2.0  1  Os05t0492200-01  NB-ARC domain containing protein.
        3.6e-261  871.0  40.4    2.2e-225  752.4   7.0    2.2  2  Os11t0429100-00  Hypothetical conserved gene.
        7.4e-259  863.3  10.7    9.1e-259  863.0   7.4    1.0  1  Os03t0579200-00  Similar to NB-ARC domain containing protein.
        1.3e-258  862.5  55.3    3.6e-225  751.7  13.4    3.0  3  Os12t0481400-00  Similar to NBS-LRR disease resistance protei
        2.5e-258  861.6  37.8    1.1e-229  766.7  12.7    3.1  3  Os12t0481700-00  Disease resistance protein domain containing
        2.5e-258  861.6  37.8    1.1e-229  766.7  12.7    3.1  3  Os12t0482100-00  Disease resistance protein domain containing
        5.4e-253  843.9  34.5    1.6e-220  736.3   6.4    2.0  2  Os06t0707733-00  Disease resistance protein domain containing
