A joy it will be one day,
to remember even this.
Through so many hard straits, so many
twists and turns,
our course holds firm for Latium.
There fate holds out
a homeland, calm, at peace.
There the gods decree
the kingdom of Troy will rise again.
Bear up.
-- Virgil, The Aeneid
Four years ago today, I started from a clean slate on a new version of HMMER, HMMER 3.0. It's been gestating longer than I'd planned, but at long last it's almost ready. After the Christmas holidays, the first alpha test versions will be released.
It's far from actually done, mind you. But it's about ready to be tested.
The idea that drove me to spend the last four years rewriting HMMER from scratch (again) is pretty simple. BLAST has been the workhorse of computational molecular biology for almost twenty years. The theoretical foundation of biological sequence analysis has been greatly improved since BLAST was written, because of the advent of probabilistic modeling methods such as hidden Markov models. We even wrote a book about these methods -- Biological Sequence Analysis is now a decade old. But people still use BLAST; I still use BLAST. Why is that? Theoretical advances in a computational science are all well and good, but someone has to write a practical implementation if these advances are going to be widely used. BLAST is a damn fine implementation: fast, robust, and beautifully supported by NCBI. There are reasonable implementations of HMM methods, including HMMER2 and the UCSC SAM package, but mostly because they're so much slower than BLAST, they have stayed in a different niche -- both in how they're used (for profile searches in protein domain databases such as Pfam and SMART) and in terms of mind share. HMMs are still seen as a black box, not as an improved statistical foundation for all of sequence analysis. David Lipman once commented that the only thing that made HMMs interesting was their name -- there's something hidden, and a Russian is involved.
So: all this gorgeous theory was languishing, and it was starting to piss me off.
So: HMMER3's ever-so-modest goal is to compete with BLAST.
The most immediately visible change in HMMER3 is that HMMER is now about as fast as BLAST. We've got a spiffy new acceleration algorithm, a little less heuristic than BLAST's and a little more suited to parallelization on modern hardware. We will be able to wring even more speed out of it as our initial implementation improves, and I'm expecting we should be able to make it even faster than BLAST by the time we're done.
The next most visible change in HMMER3 is that it doesn't calculate optimal alignment scores; it calculates theoretically more powerful log-likelihood ratios that sum over the uncertainty of any particular alignment. I'll expand on this more in future posts (and papers), but the main idea is that optimal alignment scores are the wrong score to use. They're an approximation that works ok when alignments are pretty certain, but the approximation breaks down and optimal alignment scores lose resolving power when we're comparing remote homologs. HMMER3 is using full probabilistic inference; it calculates the entire ensemble of possible alignments and reports confidence values (posterior probabilities) on every aligned residue. None of this is new! Theoretically speaking, we've known this is the Right Thing To Do since the 1990s. The UCSC SAM implementation has always run the Forward algorithm, and that's why it's even slower than HMMER2. Now in H3, we're doing the correct probabilistic inference calculations in the context of a fast, practical codebase. Our acceleration algorithms are sufficiently good that we can finally afford to deploy the expensive heavy artillery of probabilistic inference, in a way that you won't notice the cost; your results will snap back about as fast as BLAST ever did.
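To make the distinction between the two kinds of score concrete, here's a minimal sketch on a toy two-state HMM. This is not HMMER3's code or its profile architecture; the state names, alphabet, and probabilities (START, TRANS, EMIT) are made up purely for illustration. The same dynamic programming recursion gives the optimal-path (Viterbi) score when it combines incoming paths with max, and the Forward score when it sums over them instead.

```python
import math

# Hypothetical toy model: two hidden states emitting symbols "A" and "C".
# All numbers are invented for illustration only.
START = {"s1": 0.6, "s2": 0.4}
TRANS = {"s1": {"s1": 0.7, "s2": 0.3},
         "s2": {"s1": 0.4, "s2": 0.6}}
EMIT = {"s1": {"A": 0.8, "C": 0.2},
        "s2": {"A": 0.3, "C": 0.7}}


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def dp_score(seq, start, trans, emit, combine):
    """Log-space HMM recursion; `combine` is max (Viterbi) or logsumexp (Forward)."""
    states = list(start)
    # Initialization: start in each state and emit the first symbol.
    cur = {s: math.log(start[s]) + math.log(emit[s][seq[0]]) for s in states}
    # Recursion: extend every partial path by one symbol.
    for x in seq[1:]:
        cur = {
            s: math.log(emit[s][x])
            + combine([cur[r] + math.log(trans[r][s]) for r in states])
            for s in states
        }
    # Termination: combine over the final state.
    return combine(list(cur.values()))


def viterbi_score(seq, start, trans, emit):
    """Log probability of the single best state path."""
    return dp_score(seq, start, trans, emit, max)


def forward_score(seq, start, trans, emit):
    """Log probability of the sequence, summed over all state paths."""
    return dp_score(seq, start, trans, emit, logsumexp)


if __name__ == "__main__":
    seq = "ACCA"
    print("Viterbi log score:", viterbi_score(seq, START, TRANS, EMIT))
    print("Forward log score:", forward_score(seq, START, TRANS, EMIT))
```

The Forward score can never be lower than the Viterbi score, because the best path is just one term in the sum. The gap between the two is small when one path dominates and grows when many paths are about equally plausible, which is the remote homolog situation described above. A log-likelihood ratio would subtract the log probability of the same sequence under a null (background) model; that step is left out of the sketch.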
The result is sequence database searching that's much more powerful than BLAST, or even HMMER2, but at BLAST's speed. At least in our internal benchmarks so far, the new power in HMMER3 is dramatic.
I'll talk more about what's new in HMMER3 in the coming weeks, as we prepare to roll out the alpha test versions, and start laying a road map for public release.