Reboot

In July 2015, our laboratory will move to Harvard University. I’ve accepted an offer to join Harvard’s faculty in two departments: Molecular & Cellular Biology (in the Faculty of Arts and Sciences), and Applied Mathematics (in the School of Engineering and Applied Sciences). The laboratory will be on the first floor of the venerable, historic Biolabs building on the Cambridge campus. Much like we optimized our Janelia location to be the closest lab to the pub, now we will be strategically located closest to the food truck on Divinity Avenue. I’m told that our future space was once occupied by Eric Lander, so who knows what we’re going to find during construction. Charred effigies of Craig Venter, I trust.
Continue reading

Dfam: annotation of transposable elements with profile HMMs

We’re happy to announce the release of Dfam 1.0, a set of profile HMMs for genomic DNA annotation of transposable elements. This essentially constitutes an upgrade of repeat element annotation from using searches with single sequence consensuses to using searches with profile HMMs, now that the HMMER3 project has made DNA/DNA profile HMM searches sufficiently fast for whole genomes. Dfam is a collaboration between Jerzy Jurka and his Repbase resources (Genetic Information Research Institute), Arian Smit and his RepeatMasker software (Institute for Systems Biology, Seattle), the HMMER3 development team at Janelia Farm (particularly Travis Wheeler, leading nhmmer development), and the Xfam database consortium (particularly Rob Finn, here at Janelia). Among other effects of this work, we expect the widely used RepeatMasker software to include nhmmer, Dfam models, and profile HMM searches in the near future. A preprint of the first Dfam paper is available now on our preprint server, and the database itself is available for use at dfam.janelia.org.
Continue reading

Infernal 1.1: RNA alignment and database search, 10,000x faster

One of our lab’s goals is to make it possible to systematically search for homologs of RNAs in genomes, not just by looking for sequence conservation but also by looking for RNA secondary structure conservation. A powerful model framework for RNA structure/sequence comparison, called profile stochastic-context free grammars (profile SCFGs), was introduced in the mid-1990s both by Yasu Sakakibara and by us. But profile SCFG methods are among the most computationally intensive algorithms used in genome sequence analysis, requiring (in their textbook description, anyway) O(N^4) time and O(N^3) memory for an RNA of N residues. Profile SCFG implementations like our Infernal software have required immense computational power to get even the most basic sort of searches done.

We are happy to announce a new landmark in our work on these methods, with a new version of Infernal that is about 100x faster than the previous (1.0) version, and 10,000x faster than when Eric Nawrocki started working on making Infernal fast enough for routine use. Over at infernal.janelia.org, Eric has made available the first release candidate of Infernal 1.1, 1.1rc1, including source code and binaries for Linux and MacOS/X. A typical RNA homology search of a vertebrate genome that used to require a cpu-year can now be done in about an hour on a single CPU, or a few seconds on a cluster.

So really for the first time, Infernal has become practical for systematic RNA sequence analysis of whole genomes. Roughly speaking, Infernal 1.1 is running at a speed comparable to what HMMER2 ran at — we’ve brought the RNA search problem down from the utterly ridiculous to the merely difficult.

The next version of the Rfam RNA sequence family database will be the first to be computed entirely natively with Infernal RNA structure comparison, instead of using BLASTN as a prefilter. An all-vs-all comparison of all 2000 Rfam models against the entire EMBL DNA database (170 Gb) would take 30,000 cpu-years using Infernal 0.55; now with Infernal 1.1, that enormous Rfam compute is only going to take us about a day on Janelia’s cluster.

Like Infernal 1.0, 1.1 is achieving its speed by using profile HMMs as heuristic prefilters. Whereas 1.0 used HMMER2-like prefilters, 1.1 has now switched to using HMMER3‘s vector engine, sharing code with Travis Wheeler’s soon-to-be-announced nhmmer program for DNA/DNA comparison.

Happy RNA hunting — and don’t let anyone tell you that O(N^4) algorithms aren’t tractable!