Dfam: annotation of transposable elements with profile HMMs

We're happy to announce the release of Dfam 1.0, a set of profile HMMs for genomic DNA annotation of transposable elements. This essentially constitutes an upgrade of repeat element annotation from using searches with single sequence consensuses to using searches with profile HMMs, now that the HMMER3 project has made DNA/DNA profile HMM searches sufficiently fast for whole genomes. Dfam is a collaboration between Jerzy Jurka and his Repbase resources (Genetic Information Research Institute), Arian Smit and his RepeatMasker software (Institute for Systems Biology, Seattle), the HMMER3 development team at Janelia Farm (particularly Travis Wheeler, leading nhmmer development), and the Xfam database consortium (particularly Rob Finn, here at Janelia). Among other effects of this work, we expect the widely used RepeatMasker software to include nhmmer, Dfam models, and profile HMM searches in the near future. A preprint of the first Dfam paper is available now on our preprint server, and the database itself is available for use at dfam.janelia.org.

Many genomes contain a lot of interesting "junk". Unlike the junk in my basement, much of the junk in genomes expands on its own, by self-replication, and thus is particularly hard to control, let alone get rid of. At least 40% of the human genome is composed of the decaying DNA remains of transposable elements (TEs), different species of which have replicated in great waves during the evolution of our genome. Our genome fights back with mechanisms to suppress TE replication, but there's only so much it can do against these invaders; our genome is forced to tolerate some load of these parasites and hitchhikers (fortunately it's a myth that eukaryotic genome size is concerned much with "efficiency"; the amount of DNA our cells carry is a tiny and insignificant player in our cellular energy budget). Very interestingly, it's become clear (though it's hardly surprising) that we have co-opted our genomic parasites in some instances, repurposing some of their genes and their regulatory regions for human genomic functions. How much of this code reuse has gone on? This will be interesting to figure out.

Most copies of TEs in genomes are dead, in the sense that they've mutated and lost their replicative ability. Most dead sequences decay rapidly, accumulating more mutations by neutral evolution. In a few tens of millions of years, mutational drift makes it hard to recognize old TE-derived sequence at all. We can recognize that 40% of the human genome is TE-derived, and maybe about 1-3% of it is genic. So what's the rest of our genome for, and where's it from? A good bet is that a lot of the rest is TE-derived sequence that's so old, we're just not able to recognize it.

So for reasons like these, it's important to crank up the resolution of the computational techniques we use to recognize and annotate TE-derived sequence in genomes. For a long time, our lab has wanted to apply the technological advances of probability models (profile HMMs) to genomic DNA annotation, but implementations of profile HMM technology have been too slow. That changed with the HMMER3 project and with Travis Wheeler's new nhmmer program for DNA/DNA comparison.

A tremendous amount of expert biological annotation has gone into the Repbase data resource, due to the work of Jerzy Jurka and his collaborators. Repbase consensus sequences are at the heart of Arian Smit's RepeatMasker software, a fundamental genome annotation tool for TEs. What the collaboration has done, in part, is to automate a procedure to upgrade Repbase consensus sequences to multiple alignments, and to profile HMMs. This first version (1.0) is only of human TEs, 1007 of them (plus 136 other models of human repeat sequences), for human genome annotation. We expect to expand the project to other genomes (at least to the rest of Repbase) in the future.
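
To give a rough feel for why moving from a consensus to a profile helps, here is a toy sketch in Python. It is not the Dfam pipeline and not a real profile HMM (there are no insert or delete states, and nhmmer is not involved); it only builds position-specific nucleotide probabilities from a made-up alignment. The alignment, the function names, and the additive pseudocount are all assumptions chosen for illustration.

```python
from collections import Counter
import math

# Toy alignment of hypothetical TE copies (invented for illustration).
ALIGNMENT = [
    "ACGTA",
    "ACGTA",
    "ACTTA",
    "ATGTA",
]
ALPHABET = "ACGT"


def column_probs(alignment, pseudocount=1.0):
    """Per-column nucleotide probabilities with a simple additive pseudocount."""
    probs = []
    for col in zip(*alignment):
        counts = Counter(col)
        total = len(col) + pseudocount * len(ALPHABET)
        probs.append({b: (counts[b] + pseudocount) / total for b in ALPHABET})
    return probs


def consensus(probs):
    """Most probable nucleotide in each column."""
    return "".join(max(ALPHABET, key=lambda b: p[b]) for p in probs)


def log_odds(seq, probs, background=0.25):
    """Log2-odds score of an ungapped sequence against a uniform background."""
    return sum(math.log2(p[b] / background) for b, p in zip(seq, probs))


PROBS = column_probs(ALIGNMENT)
CONSENSUS = consensus(PROBS)
```

Both "ACTTA" and "ACCTA" differ from the consensus at exactly one position, so a mismatch count against the consensus can't tell them apart. The position-specific scores can: `log_odds("ACTTA", PROBS)` comes out higher than `log_odds("ACCTA", PROBS)`, because a T at the third position was actually observed in the alignment and a C was not.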

The shakiest part of this collaboration is that we're concerned about how everyone gets funded. Not for ourselves; at Janelia, we're supported by the Howard Hughes Medical Institute. But we're like the exception that proves the rule. How do useful databases get sustained, stable long-term funding in biomedical research? Sometimes, they don't. And that's particularly true if your data resource gets overshadowed by a tool that encompasses it. I think it's fair to say that most people in the community know RepeatMasker, but how many people who use RepeatMasker know that most of the consensus sequences come from Jerzy Jurka's Repbase? I know (from experience) that there are people who think we've got RepeatMasker so we don't need Repbase, but this is a dire misunderstanding of how the dependencies flow in the genome data resource ecosystem. From a usability perspective, we want these dependencies to be invisible, so genomic data flows freely, openly, frictionlessly -- but when an important expertly curated resource is invisibly submerged beneath the tools and science that depend on it, how will funding agencies recognize its value? I'm not sure we have a satisfying answer to this in the field of genome data analysis.