And how will it ever end?
unless the day finally arrives
when we have compared everything
in the world
to everything else in the world,

and there is nothing left to do
but quietly close our notebooks
and sit with our hands folded on our desks.

– Billy Collins, The Trouble With Poetry

Ryan Richt pointed out in a comment: so what if HMMER3 is as fast as BLAST? Even BLAST isn’t fast enough these days!

That’s an interesting comment. I think we’ve reached a turning point in the field. I think we will now see a schism between tools that do remote homology search and tools that do rapid mapping of reads to identical or near-identical reference genomes. We’ve used BLAST as our jack-of-all-trades for over a decade. It’s the lineal descendant of the first dynamic programming sequence alignment routines that were happy just to score alignments by percentage identity. Modern needs are now pressing it on two very different fronts.

To map a zillion Illumina reads to a near-identical reference genome, BLAST really isn’t the right tool. Instead I’d reach for Gusfield’s brilliant book Algorithms on Strings, Trees, and Sequences and I’d learn how to use suffix trees or suffix arrays. The computer science guys know how to find near-exact matches in a tearing hurry. There, the cleverness of the algorithm is the main thing, not the scoring system.
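To make the suffix array idea concrete, here is a minimal sketch in Python. It is a toy, not how any production read mapper works: the construction is the naive sort-all-suffixes approach (real tools use linear-time construction and compressed indexes), it only finds exact matches, and the function names `build_suffix_array` and `find_exact` are made up for illustration. The point is just that once the reference is indexed, each read lookup is a binary search rather than an alignment.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order.

    Naive construction for illustration only; it compares whole suffixes,
    so it is far too slow for a real genome.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact(text, suffix_array, pattern):
    """Return sorted start positions of every exact occurrence of pattern."""
    m = len(pattern)

    # Lower bound: first suffix whose first m characters are >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose first m characters are > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[start:lo])


if __name__ == "__main__":
    reference = "GATTACAGATTACA"  # toy reference sequence
    sa = build_suffix_array(reference)
    print(find_exact(reference, sa, "TTA"))
```

Notice that there is no scoring system in there at all: a read either matches or it doesn’t. Tolerating a few mismatches takes more machinery, but the work still goes into the index and the search, not into the statistics.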

But to peer as far back in evolutionary time as I can, and detect the remotest homologies, dealing with four billion years of evolutionary erosion and noise, BLAST really isn’t the right tool either (I’d argue). Instead I want to break out the hard core Bayesian statistical inference machinery. For remote sequence comparison, it’s really the cleverness of the scoring system that’s your main worry, and your algorithms are secondary.
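As a toy illustration of what I mean by the scoring system being the main worry, here is a sketch of a log-odds score and the Bayesian posterior it implies. Everything in it is an assumption made for the example: the position-specific probabilities, the uniform background, the prior, and the function names are hypothetical, and the model is ungapped. This is not HMMER’s model, which is a profile HMM with insertions and deletions.

```python
import math


def log_odds_bits(seq, match_probs, background):
    """Log-odds score in bits of seq under an ungapped position-specific model.

    match_probs[i][r] is the (hypothetical) probability of residue r at
    position i under the homology model; background[r] is its probability
    under the null model of unrelated sequence.
    """
    return sum(
        math.log2(match_probs[i][residue] / background[residue])
        for i, residue in enumerate(seq)
    )


def posterior_homolog(score_bits, prior):
    """Posterior probability of homology, from a bit score and a prior.

    Bayes' rule in odds form: posterior odds = prior odds * likelihood ratio,
    where the likelihood ratio is 2 ** score_bits.
    """
    odds = prior / (1.0 - prior) * 2.0 ** score_bits
    return odds / (1.0 + odds)


if __name__ == "__main__":
    # Hypothetical two-column DNA model that prefers "AC".
    background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    match_probs = [
        {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    ]
    score = log_odds_bits("AC", match_probs, background)
    # Assumed prior: one true homolog per million candidates.
    print(score, posterior_homolog(score, prior=1e-6))
```

The algorithm here is a trivial sum. All the difficulty is in choosing the probabilities and the prior so that a weak but real signal still beats a database full of noise.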

They’re different problems, with different demands.

Tools are appearing that optimize heavily for speed, to keep up with the increasingly massive demands of intraspecific (within-species) sequence comparison and mapping, while using simple scoring systems just sensitive enough to work for near-identical sequences. This really isn’t the problem domain I envision for HMMER.

HMMER’s aim is to deploy strong Bayesian inference for remote homology detection, using the most complex and realistic probabilistic models we think we can usefully manage, while staying fast enough and practical enough to keep up with the data demands of interspecific (cross-species) comparative sequence analysis.

2 thoughts on “schism”

  1. Thank you for the post 🙂

    I completely see where you are coming from. I was just wondering about new uses for HMMER if it were (is) faster and how some of the computational approaches we take now might be suboptimal.

    For instance, suppose we get hundreds of millions of reads slightly longer than today’s (Illumina 200-mers or something), from sequencing total RNA rather than DNA, and say the sample is metagenomic. Is assembly really garbling a lot of information? Oh, and say these are all unculturable or undiscovered species for which there exists no reference sequence for fast near-exact matching.

    HMMER/pfam actually seems like the best tool to annotate novel genes from novel species that might be evolutionarily diverged from what we currently know about. Assembly will probably join homologous genes from the mix of species in the sample. What if we HMMER/pfam every read and count up gene expression that way? Would we get a different answer than assembling and counting domains/genes then?

    I think the most interesting possibility would be if there were two homologous genes from two species that an assembler would put together assuming discrepancies were mismatches, but HMMER would separate out as having distinct coding domains.

    Maybe that would never happen?

