Infernal 1.0

Infernal 1.0 -- the first production release of our software for RNA sequence/structure homology search and alignment -- has been released. Source code and documentation are available over at infernal.janelia.org.

Infernal development is now led by Eric Nawrocki in the lab.

Infernal's actually been used by the Rfam database since about 2002, in pre-1.0 versions. This is the first version that we think is really ready for production use, more than just a testbed for algorithms. The 1.0 code went through several release candidates in the past few months. We think we have all the obvious bugs shaken out of it.

The problem Infernal addresses is RNA homology search. You have an RNA sequence or a multiple alignment of related RNA sequences, and you want to search the sequence databases for homologs. Sequence similarity search (BLAST) might suffice, but you need about 60-70% sequence identity to detect significant RNA/DNA sequence alignments, and homologous sequences can erode below that detection level in as little as a few tens of millions of years.

If you were looking for a protein, you'd search by comparing amino acid sequences, not by comparing DNA/RNA sequences. Partly because of the larger 20-letter alphabet, you get more statistical power from amino acid comparisons: sequence tools can see down to about 20-30% pairwise amino acid identity, which will often be preserved across a billion years or more.

For a functional RNA, you obviously can't resort to amino acid sequence comparisons. What you can do, though, is to use conserved RNA secondary structure as additional signal -- at least, if your RNA of interest has a known conserved RNA secondary structure.

But how do you combine sequence conservation and RNA secondary structure conservation in a single consistent scoring system for homology search? That turns out to be a solved problem in other fields: there's a class of "formal grammars" called stochastic context-free grammars that solves it beautifully, provided we only capture classical "nested" secondary structure, and give up on capturing any information from RNA pseudoknots. That's acceptable; pseudoknotted base pairs are important, but always outnumbered (usually greatly outnumbered) by base pairs in standard stem-loops. Infernal implements models called profile stochastic context-free grammars. It's essentially like HMMER and profile HMMs, extended to RNA secondary structure.
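To make the idea of a single scoring system concrete, here is a minimal sketch in Python. It is not Infernal's code or its model: the probabilities, the function names (parse_pairs, score), and the restriction to an ungapped alignment against a fixed consensus are all assumptions made for illustration. A real profile SCFG also handles insertions and deletions and is aligned by dynamic programming. The sketch only shows how log-odds terms for unpaired positions and for base-paired positions can be summed into one score.

```python
import math

# All numbers below are hypothetical toy parameters, not Infernal's.
BACKGROUND = 0.25  # uniform background frequency for A, C, G, U
CANONICAL = {"AU", "UA", "GC", "CG", "GU", "UG"}  # Watson-Crick plus G-U wobble


def pair_prob(a, b):
    """Toy joint probability of residues a, b at a base-paired position pair."""
    # 6 canonical pairs * 0.15 + 10 other pairs * 0.01 = 1.0
    return 0.15 if a + b in CANONICAL else 0.01


def single_prob(residue, consensus):
    """Toy probability of a residue at an unpaired position."""
    # 0.7 for the consensus residue + 3 * 0.1 for the others = 1.0
    return 0.7 if residue == consensus else 0.1


def parse_pairs(structure):
    """Map opening index -> closing index for a nested dot-bracket string."""
    stack, pairs = [], {}
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs[stack.pop()] = i
    if stack:
        raise ValueError("unbalanced structure")
    return pairs


def score(seq, consensus, structure):
    """Log-odds score (bits) of an ungapped alignment of seq to the consensus."""
    if not (len(seq) == len(consensus) == len(structure)):
        raise ValueError("seq, consensus and structure must have equal length")
    pairs = parse_pairs(structure)
    paired = set(pairs) | set(pairs.values())
    total = 0.0
    # Base-paired positions are scored jointly, as a pair.
    for i, j in pairs.items():
        total += math.log2(pair_prob(seq[i], seq[j]) / (BACKGROUND * BACKGROUND))
    # Unpaired positions are scored one residue at a time.
    for i, residue in enumerate(seq):
        if i not in paired:
            total += math.log2(single_prob(residue, consensus[i]) / BACKGROUND)
    return total


CONSENSUS = "GGGAAAACCC"
STRUCTURE = "(((....)))"
compensated = "GCGAAAACGC"  # two substitutions, but the G-C pair became C-G
broken = "GCGAAAACCC"       # one substitution, leaving a C-C mismatch in the stem
```

Under these toy parameters, the compensated sequence scores the same as the consensus even though it differs at two positions, while the broken sequence scores lower despite differing at only one. That is the extra signal a sequence-only comparison would miss.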

The disadvantage of profile SCFG methods is that they're computationally intensive. We continue to work hard on accelerating Infernal, but it's still slow. The only people who can really make routine use of it are people with a lot of computing resources at their disposal -- like the Rfam team in Cambridge.