And how will it ever end?
unless the day finally arrives
when we have compared everything
in the world
to everything else in the world,
and there is nothing left to do
but quietly close our notebooks
and sit with our hands folded on our desks.
– Billy Collins, The Trouble With Poetry
Ryan Richt pointed out in a comment: so what if HMMER3 is as fast as BLAST? Even BLAST isn’t fast enough these days!
That’s an interesting comment. I think we’ve reached a turning point in the field. I think we will now see a schism between tools that do remote homology search and tools that do rapid mapping of reads to identical or near-identical reference genomes. We’ve used BLAST as our jack-of-all-trades for over a decade. It’s the lineal descendant of the first dynamic programming sequence alignment routines that were happy just to score alignments by percentage identity. Modern needs are now pressing it on two very different fronts.
To map a zillion Illumina reads to a near-identical reference genome, BLAST really isn’t the right tool. Instead I’d reach for Gusfield’s brilliant book Algorithms on Strings, Trees, and Sequences and I’d learn how to use suffix trees or suffix arrays. The computer science guys know how to find near-exact matches in a tearing hurry. There, the cleverness of the algorithm is the main thing, not the scoring system.
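To make the suffix array idea concrete, here is a minimal sketch of exact-match lookup in Python. It is an illustration only, not how any particular read mapper works: the function names are hypothetical, the construction is the naive sort-all-suffixes approach (real tools use linear-time construction and compressed indexes), and it handles only exact matches, not near-exact ones.

```python
def build_suffix_array(text):
    """Return suffix start positions, sorted by the suffix each one begins.

    Naive construction for illustration; far too slow for a real genome.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact(text, suffix_array, pattern):
    """Return all start positions where pattern occurs exactly in text."""
    m = len(pattern)

    # Binary search for the first suffix whose m-letter prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Binary search for the first suffix whose m-letter prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in [first, lo) begins with the pattern.
    return sorted(suffix_array[first:lo])


genome = "ACGTACGTTACG"          # toy reference, made up for this example
index = build_suffix_array(genome)
hits = find_exact(genome, index, "ACG")
```

The index is built once per reference, and each read then costs only a pair of binary searches, which is why the algorithm rather than the scoring does the heavy lifting here.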
But to peer as far back in evolutionary time as I can, and detect the remotest homologies, dealing with four billion years of evolutionary erosion and noise, BLAST really isn’t the right tool either (I’d argue). Instead I want to break out the hard-core Bayesian statistical inference machinery. For remote sequence comparison, it’s really the cleverness of the scoring system that’s your main worry, and your algorithms are secondary.
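As a rough illustration of what a probabilistic scoring system means, here is a toy log-odds score in Python: the log of how much more probable a sequence is under a position-specific model than under a background model. This is only a sketch of one ingredient, not HMMER's actual scoring. The profile and background probabilities are invented for the example, the function name is hypothetical, and insertions and deletions are ignored entirely.

```python
import math

# Assumed background residue frequencies (uniform DNA, for simplicity).
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

# Made-up three-column profile: per-position emission probabilities.
PROFILE = [
    {"A": 0.70, "C": 0.10, "G": 0.10, "T": 0.10},  # strongly prefers A
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # uninformative column
    {"A": 0.10, "C": 0.10, "G": 0.70, "T": 0.10},  # strongly prefers G
]


def log_odds_bits(seq, profile=PROFILE, background=BACKGROUND):
    """Score an ungapped sequence against the profile, in bits.

    Positive means the profile explains the sequence better than background.
    """
    if len(seq) != len(profile):
        raise ValueError("sequence length must match the profile length")
    return sum(
        math.log2(column[residue] / background[residue])
        for column, residue in zip(profile, seq)
    )


good = log_odds_bits("ACG")   # matches both informative columns
poor = log_odds_bits("CCC")   # mismatches both informative columns
```

Unlike percentage identity, a score like this weights each position by how informative it is, which is the kind of property that matters when the signal is faint.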
They’re different problems, with different demands.
Tools are appearing that optimize heavily for speed, to keep up with the increasingly massive demands of intraspecific (within-species) sequence comparison and mapping, while using simple scoring systems just sensitive enough to work for near-identical sequences. This really isn’t the problem domain I envision for HMMER.
HMMER’s aim is to deploy strong Bayesian inference for remote homology detection, using the most complex and realistic probabilistic models we think we can usefully manage, while staying fast enough and practical enough to keep up with the data demands of interspecific (cross-species) comparative sequence analysis.