hmmscan vs. hmmsearch speed: the numerology

From today's email...

Suppose, for example, you want to search 300 million metagenomic sequence reads, each about 200nt long, against the Pfam database. What's the best way to do that task with HMMER3? The bottom line: use hmmsearch, not hmmscan. For the numerology of why (and chapter and verse on how hmmsearch and hmmscan scale to large multithreaded and MPI tasks, their limitations, advice on how we do it, and some clues about what's coming in the future), keep reading...

Translated searches

The first thing you'll notice is that HMMER3 doesn't yet implement translated DNA searches (blastx/tblastn/tblastx equivalents). So you need to produce a six-frame translation of your reads and save that as a protein sequence database. For 300M 200nt reads, that's 300M x 200 nt / 3nt/aa x 6 frames = 120G (assuming you translate everything, with no restrictions on ORF length or such).

There's two ways people do six-frame translation. You can translate each frame into separate ORFs, or you can translate the read into exactly six "ORFs", one per frame, with * chararacters marking stop codons. HMMER prefers that you do the former. Technically, * chararacters aren't legal amino acid residue codes, and the author of HMMER3 is a pedantic nitpicker, passive-aggressive, yet also a pragmatist: so while HMMER3 pragmatically accepts * chararacters in input "protein" sequences just fine, it pedantically relegates them to somewhat suboptimal status, and it passively-aggressively figures that any suboptimal performance on *-containing ORFs is your own fault for using *'s in the first place.

We plan to implement translated searches. In fact, we're strongly considering implementing translated searches with a full gene model, a la GeneWise and Exonerate. We've not set any timeline for this though.

This task isn't big, cpu-wise

Now it might seem natural to run

% hmmscan Pfam.hmm myorfs.fa

If you're careful, and you run a pilot experiment on a small subset of myorfs.fa looking at how well this is performing, and you check your CPU system usage, you'll notice something disturbing: hmmscan doesn't seem to use all the CPU. How come? hmmscan speed is usually limited by the time it takes to input the profile database, not by CPU time.

Here's some back-of-the-envelope numerology, so you can see the issues in play (here shorthand on units is K=thousands, M=millions, G=billions; or 1e6=10\^6=million, etc.):

First, how long should this compute take?

  • The size of your compute scales as (sequence db size, residues) X (profile db size, consensus residues).
  • Here the sequence db is 120G residues.
  • The Pfam profile database consists of about 12000 profiles, of mean length \~200: that's about 2.4M residues.
  • So the total compute size is 120G x 2.4M = 3e17 "dynamic programming cells" (dpc).
  • HMMER3 compares profiles to sequences at a rate of about 10G dpc/sec/core, on typical commodity cores.
  • So H3 needs 3e17 / 1e10 = 3e7 cpu-sec for this compute.
  • (Side node: there's almost exactly pi x 10\^7 seconds in a year. So 3e7 cpu-sec \~ 1 cpu-year.)
  • If we scaled this (perfectly) across 1000 cpus, it'd take about 8 hours realtime.

With hmmscan, this task is very big, disk input wise

But we also have to consider not just cpu speed, but also how fast we can read the target database. If we do the search with hmmscan:

  • With hmmscan, Pfam is the target database, and we'll search it with \~4e9 query ORFs (assuming \~2 ORFs/frame, 6 frames, 300M reads).
  • Pfam is about 1 GB of data.
  • A typical local disk (SATA-2) will read data at about 300MB/sec.
  • So, each time we read Pfam from disk, it takes us 1G/300M = about three seconds.
  • (At this point, you see the problem. A 30aa ORF vs. Pfam takes about 7 milliseconds of CPU time, but 3300 milliseconds of disk time!)
  • So, all 4e9 hmmscan queries will cost us 4e9 x 3sec = 400 input-years. That's a lot, compared to the 1 cpu-year we need.
  • This is why hmmpress exists. hmmscan doesn't try to read all 1 GB of Pfam. It reads a packed binary 70M file (the .h3f file), 14x smaller, which contains only the information necessary for the first stage heuristic filter of H3 (the"MSV filter"). So it doesn't really take 400 disk-years; it takes a mere 400/14 = 30 input-years. Still a lot, compared to our 1 cpu-year.
  • Of course, once you're searching multiple queries, your filesystem is caching Pfam, not reading it from disk each time. You might get 1-5 GB/sec from your cache; say it's 3 GB/s, about 10x faster than your disk. Now we're down to 3 input years.

Now it should be clear that what's limiting hmmscan is input speed, not cpu speed. The problem is exacerbated when you have a lot of small queries, like here, where each \~40aa ORF is only 10% the size of a typical full-length protein and requires only 10% of the typical cpu time per protein comparison.

Now it should also be clear that it doesn't help to multithread hmmscan across cpu cores, especially on lots of small queries. Once input is rate-limiting, it doesn't matter how many cores you're splitting the job across. You might even notice that hmmscan --cpu 1 outperforms a default hmmscan, because the single thread is fully capable of handling the full input rate.

So why does hmmscan by default use all the cores in your machine? Optimism. The odds that you're searching a large query (\~1000's of aa's), which will scale reasonably across cores, outweigh the cost of running input-starved cores on small queries.

Could we do better? Oh sure. Eventually we will, too. For example, instead of relying on the filesystem cache, we could have hmmscan cache the entire .h3f dataset (or indeed all Pfam). Now we should be able to read the data at full memory speed (maybe 10-30 GB/sec for DDR3 SDRAM). But that'd still only get us down to \~0.4 input-years, compared to \~1 cpu-year. hmmscan still would scale poorly to even small numbers of cpu threads.

The ultimate solution is to parallelize the hmmscan input itself, so each core works on a pre-cached chunk of Pfam. This is exactly what the hmmpgmd daemon does, behind the scenes at the site. Each of the 12 12-core systems in the Janelia HMMER compute farm has precached both Pfam and NR, and input is parallelized across the 12 systems: so we can read data at an aggregate speed of around 30GB/s x 12 = 360 GB/s. That's why a Pfam search at snaps results back to you in a blink. hmmpgmd is one of the new programs that will appear Real Soon Now, in the upcoming 3.1 release.

You don't normally notice that hmmscan isn't using the cpus effectively, because it's so fast. A typical Pfam search returns in about 200 milliseconds, even on an input-starved single core. You'd only notice if you were wanting to search lots of query sequences against Pfam. But if you're doing lots of seqs against lots of profiles -- you can turn around and do lots of profiles against lots of sequences...

Use hmmsearch when comparing lots of seqs to lots of models, not hmmscan: here's why

Super-important numerology: each residue in a profile is represented by about 40 bytes of data, whereas each residue in a sequence is one byte. Residue by residue, Pfam is 40x heavier than a sequence database.

So, in terms of input:

  • hmmscan with 4e9 seq queries vs 70M Pfam .h3f binary: 280 PB database searched
  • hmmsearch with 12K Pfam queries vs. 120G translated ORF database: 1.4 PB database searched\</li

Yowza. You definitely don't want to run 280 petabytes through your system when you're input-limited and you can do the same task in a 1.4 PB flow; that's a 200-fold difference. (The 200-fold arises from the 40x more bytes per profile residue, and the 5x difference between a mean profile length of \~200 and a mean ORF length of \~40.)

hmmscan and hmmsearch are doing exactly the same compute, at heart: comparing one profile to one sequence at a time. Their bit score results are identical. You can save hmmsearch tabular output files and use 'em just the same way you were going to use the hmmscan files. (Um, watch out for E-values: remember that E-values depend on the size of the database you search.)

Because of the greatly reduced input demand, hmmsearch typically scales to multiple threads (about four or so) per system whereas hmmscan doesn't; and hmmsearch also runs very effectively in MPI mode.

So, bottom line, on our Janelia internal cluster (4000 cpus), we'd run this metagenome analysis task in a single job w/ something like this:

% hmmsearch --mpi Pfam.hmm myorfs.fa

and go watch Barcelona whup Manchester, and come back in a few hours.