ENCODE says what?

So I read in the newspaper this week that the ENCODE project has disproven the idea of junk DNA. I sure wish I'd gotten the memo, because this week a collaboration of labs led by myself, Arian Smit, and Jerzy Jurka just released a new data resource that annotates nearly 50% of the human genome as transposable element-derived, and transposon-derived repetitive sequence is the poster child for what we colloquially call "junk DNA".

The newspapers went on to say that ENCODE has revolutionized our understanding of noncoding DNA by showing that far from being junk, noncoding DNA contains lots of genetic regulatory switches. Well, that's also odd, because another part of my lab is (like a lot of other labs in biology these days) studying the regulation of genes in a model animal's brain (the fruit fly Drosophila). We and everyone else in biology have known for fifty years that genes are controlled by regulatory elements in noncoding DNA. (Well, I've only known for thirty years, not fifty, I admit -- only since Mrs. Dell'Antonio kicked me out of high school biology class and gave me a molecular genetics textbook to read by myself.)

Now, with all respect to my journalist friends, I've learned not to believe everything I read in the newspapers. I figured I'd better read the actual ENCODE papers. This is going to take a while. I've only read the main Nature paper carefully so far (there's 30+ of them, apparently, across multiple journals). But it's already clear that at least the main ENCODE paper doesn't say anything like what the newspapers say.

The ENCODE project and our existing knowledge of genomes are both vastly more substantial than the discussion the ENCODE authors are provoking in the press right now.

The human genome has a lot of junk

Genome size varies a lot. You might think that apparently more complex organisms like humans would have more DNA than simpler organisms like single-celled amoebae, but that turns out not to be true. Salamanders have 10-fold more DNA than us; lungfish, about 30-fold more.

So maybe we don't really know how to define or measure "complexity"; maybe we're just being anthropocentric when we think of ourselves as complex. Who's to say that amoebae are less complex than humans? Ever looked at an amoeba? (They're pretty awesome.) Still. The key observation isn't just that very different creatures have very different genome sizes; it's that similar species can have very different genome sizes. This fact, surprising at the time, begged for a good explanation. If two species are similar, yet their genomes are 10x different in size, what's all that extra DNA doing?

This observation about genome sizes (called the "C-value" paradox, for technical reasons) raised the idea that maybe genomes could expand (and shrink) rapidly (on an evolutionary timescale) as a result of some neutral (non-adaptive) processes -- that maybe organisms could tolerate DNA that didn't have a direct functional effect on the organism itself, but was instead being created and maintained by neutral or even parasitic mechanisms of evolution. Somebody (it's a good bet that T. Ryan Gregory knows who) dubbed this "junk" DNA, and that was probably an unfortunate term, because it's incited people's anger from the day it was coined. It's not polite to tell someone their beautiful house is full of junk. Even if it is.

A key discovery that satisfactorily explained the C-value paradox was the discovery that genomes, especially animal and plant genomes, contain large numbers of transposable (mobile) elements that replicate all by themselves, often at the (usually slight) expense of their host genome. For instance, about 10% of the human genome is composed of about a million copies of a small mobile element called Alu. Another big fraction of the genome is composed of a mobile element called L1. Transposons are related to viruses, and we think that for the most part they are parasitic in nature. They infect a genome, replicating, spreading, and multiplying; eventually they die, mutate, and decay away, leaving their DNA sequences. Sometimes when an Alu replicates and hops into a new place in our genome, it breaks something. Usually (partly because the genome is mostly nonfunctional) a new Alu just hops somewhere else in the junk, and has no appreciable effect on us.

So it turns out that when we look at all these different genome sizes, almost all of the puzzling size variation is explained by genomes having different "loads" of transposable elements. Some creatures, like pufferfish, have only low loads of transposons. Some creatures, like salamanders, lungfish, amoebae, corn, and lilies, are loaded with massive numbers of transposons. As it happens, the human genome is annotated as about 50% transposon-derived sequence -- right at that 50/50 borderline where someone can say "the human genome is mostly junk" and someone else can say "the human genome is mostly not junk".

In 1980, two key papers -- by Orgel and Crick, and by Sapienza and Doolittle -- nicely laid out the argument that genomes contain "selfish" or "junk" DNA, largely transposon-derived, sometimes quite large amounts of it. These papers are quite beautiful and scholarly. They are careful to say, for example, that it would be surprising if evolution did not sometimes co-opt useful functions from this great amount of extra DNA sequence slopping around. Indeed, we are now finding many interesting examples of transposon-derived stuff being co-opted for organismal function (but these are the exception, not the rule). Without trying to be snide or pedantically academic, I'll note that the main ENCODE paper cites neither Orgel/Crick nor Sapienza/Doolittle; what this means is, regardless of what we read in the newspapers, ENCODE is not actually trying to interpret their data in light of the current thinking about junk DNA, at least in the actual paper.

Transposon-derived sequences are the poster child for "junk DNA" because we can positively identify transposon-derived sequences by computational analysis, and reconstruct the evolutionary history of transposon invasions of genomes. There's likely to be other nonfunctional DNA "junk" too, in the DNA that we can't currently put any annotation at all on, but the key point is that the dead bones of many transposons are something we can affirmatively identify.

Noncoding DNA is part junk, part regulatory, part unknown

It is crucial to understand that "noncoding" DNA is not synonymous with "junk" DNA. The current view of the human genome, which ENCODE has now systematically and comprehensively confirmed and extended, is that it is about 1% protein-coding, in perhaps about 20,000 "genes" averaging about 1500 coding bases each (where the concept of a "gene" is amorphous, but useful; we know one when we see one). Genes are turned on and off by regulatory DNA regions, such as promoters and enhancers -- as has been worked out over fifty years, starting with how bacterial viruses work. In animals like humans, most people (ok, I) would guess that there are maybe 10-20 regulatory regions per gene, each maybe 100-300 bases long; so, very roughly, maybe on the order of about 1000-6000 bases of noncoding regulatory information per 1500 coding bases in a gene. I'm only giving hand-wavy back of the envelope notions here because it's actually quite difficult to pin these numbers down exactly; our current knowledge of regulatory DNA sequences in detail is distressingly incomplete. That's something that ENCODE's trying to help figure out, in systematic fashion, and where a lot of ENCODE's substantive value is. The point is, we already knew there was likely at least as much regulatory DNA as coding DNA, and probably more; we just don't have a very satisfying handle on it all yet, and we thought we needed an ENCODE project to survey things more comprehensively.
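For anyone who wants to poke at the hand-waving, here is the same back-of-the-envelope arithmetic as a small Python sketch. The gene count, coding length, and regulatory guesses are the ones given above; the genome size of about 3.1 billion bases is my assumption for illustration and is not stated in the paragraph. Treat the output as rough orders of magnitude, nothing more.

```python
# Back-of-the-envelope figures from the paragraph above.
GENES = 20_000                 # "perhaps about 20,000 genes"
CODING_BASES_PER_GENE = 1_500  # "averaging about 1500 coding bases each"
REGIONS_PER_GENE = (10, 20)    # guessed number of regulatory regions per gene
BASES_PER_REGION = (100, 300)  # guessed length of each regulatory region

# ASSUMPTION (not in the text): a haploid human genome of ~3.1 billion bases.
GENOME_BASES = 3_100_000_000


def regulatory_bases_per_gene():
    """Low and high guesses for regulatory bases per gene."""
    low = REGIONS_PER_GENE[0] * BASES_PER_REGION[0]
    high = REGIONS_PER_GENE[1] * BASES_PER_REGION[1]
    return low, high


def coding_fraction():
    """Fraction of the assumed genome that is protein-coding."""
    return GENES * CODING_BASES_PER_GENE / GENOME_BASES


if __name__ == "__main__":
    low, high = regulatory_bases_per_gene()
    print(f"regulatory bases per gene: {low}-{high}")
    print(f"regulatory-to-coding ratio per gene: "
          f"{low / CODING_BASES_PER_GENE:.2f}x to {high / CODING_BASES_PER_GENE:.2f}x")
    print(f"coding fraction of genome: {coding_fraction():.1%}")
```

The low end of the ratio comes out a bit under 1x and the high end at 4x, which is about as precise as a guess like this deserves to be.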

So when you read a Mike Eisen saying "those damn ENCODE people, we already knew noncoding DNA was functional", and a Larry Moran saying "those damn ENCODE people, there is too a lot of junk DNA", they aren't contradicting each other. They're talking about different (sometimes overlapping) fractions of human DNA. About 1% of it is coding. Something like 1-4% is currently expected to be regulatory noncoding DNA given what we know (and our knowledge about regulatory sites is especially incomplete). About 40-50% of it is derived from transposable elements, and thus affirmatively already annotated as "junk" in the colloquial sense that transposons have their own purpose (and their own biochemical functions and replicative mechanisms), like the spam in your email. And there's some overlap: some mobile-element DNA has been co-opted as coding or regulatory DNA, for example.
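As a sanity check on those fractions, here is a rough tally in Python. It scales the earlier guess of 1000-6000 regulatory bases per gene up to the whole genome, then subtracts the annotated categories from 100%. Two loud assumptions: the genome size of about 3.1 billion bases is mine, not from the text, and the tally ignores the overlaps between categories just mentioned, so the leftover range is only indicative.

```python
GENES = 20_000
REG_BASES_PER_GENE = (1_000, 6_000)  # per-gene guess from earlier in the post

# ASSUMPTION (not in the text): a haploid human genome of ~3.1 billion bases.
GENOME_BASES = 3_100_000_000


def regulatory_percent_range():
    """Genome-wide regulatory percentage implied by the per-gene guess."""
    low, high = REG_BASES_PER_GENE
    return (100 * GENES * low / GENOME_BASES,
            100 * GENES * high / GENOME_BASES)


def unaccounted_percent_range(coding=1.0,
                              regulatory=(1.0, 4.0),
                              transposon=(40.0, 50.0)):
    """Percent of the genome left without annotation, ignoring overlaps.

    Returns (smallest leftover, largest leftover).
    """
    largest = 100 - coding - regulatory[0] - transposon[0]
    smallest = 100 - coding - regulatory[1] - transposon[1]
    return smallest, largest


if __name__ == "__main__":
    reg_low, reg_high = regulatory_percent_range()
    print(f"implied regulatory fraction: {reg_low:.1f}%-{reg_high:.1f}%")
    left_low, left_high = unaccounted_percent_range()
    print(f"left over after coding, regulatory, transposon: "
          f"{left_low:.0f}%-{left_high:.0f}%")
```

The implied regulatory range lands close to the 1-4% quoted above, and somewhere around half the genome is left over with no annotation at all.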

Now that still leaves a lot of the genome. What's all that doing? Transposon-derived sequence decays rapidly, by mutation, so it's certain that there's some fraction of transposon-derived sequence we just aren't recognizing with current computational methods, so the 40-50% number must be an underestimate. So most reasonable people (ok, I) would say at this point that the human genome is mostly junk ("mostly" as in, somewhere north of 50%).

At the same time, we still have only a tenuous grasp on the details of gene regulation, even though we think we understand the broad strokes now. Nobody should bet against finding more and more regulatory noncoding DNA, either. The human genome surely contains a lot of unannotated functional DNA. The purpose of the ENCODE project was to help us sort this out. Its data sets, and others like them, will be fundamental in giving us a comprehensive view of the functional elements of the human genome.

ENCODE's definition of "functional" includes junk

ENCODE has assigned a "biochemical function" to 80% of the genome. The newspapers add, "therefore it's not junk", but that's a critically incorrect logical leap. It presumes that junk DNA doesn't have a "biochemical function" in the sense that ENCODE chose to operationally define "function". So in what sense did ENCODE define the slippery concept of biological function, to allow them to assign a human genome fraction (to two significant digits, ahem)?

ENCODE calls a piece of DNA "functional" if it reproducibly binds to a DNA-binding protein, is reproducibly marked by a specific chromatin modification, or if it is transcribed. OK. That's a fine, measurable operational definition. (One might wonder, why not just call "DNA replication" a function too, and define 100% of the genome as biochemically functional, but of course, as Ewan Birney (the ENCODE czar) would tell you, I would never be that petty. No sir.) I am quite impressed by the care that the ENCODE team has taken to define "reproducibility", and to process their datasets systematically.

But as far as questions of "junk DNA" are concerned, ENCODE's definition isn't relevant at all. The "junk DNA" question is about how much DNA has essentially no direct impact on the organism's phenotype - roughly, what DNA could I remove (if I had the technology) and still get the same organism. Are transposable elements transcribed as RNA? Do they bind to DNA-binding proteins? Is their chromatin marked? Yes, yes, and yes, of course they are - because at least at one point in their history, transposons are "alive" for themselves (they have genes, they replicate), and even when they die, they've still landed in and around genes that are transcribed and regulated, and the transcription system runs right through them.

Thought experiment: if you made a piece of junk for yourself -- a completely random DNA sequence! -- and dropped it into the middle of a human gene, what would happen to it? It would be transcribed, because the transcription apparatus for that gene would rip right through your junk DNA. ENCODE would call the RNA transcript of your random DNA junk "functional", by their technical definition. And even if it weren't transcribed, that would be because it acted as a different kind of functional element (your random DNA could accidentally create a transcriptional terminator).

The random genome project

So a-ha, there's the real question. The experiment that I'd like to see is the Random Genome Project. Synthesize a hundred million base chromosome of entirely random DNA, and do an ENCODE project on that DNA. Place your bets: will it be transcribed? bound by DNA-binding proteins? chromatin marked?

Of course it will.

The Random Genome Project is the null hypothesis, an essential piece of understanding that would be lovely to have before we all fight about the interpretation of ENCODE data on genomes. For random DNA (not transposon-derived DNA, not coding, not regulatory), what's our null expectation for all these "functional" ENCODE features, by chance alone, in random DNA?
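To get a feel for what "by chance alone" can mean, here is a toy Python sketch. It is emphatically not an ENCODE analysis: real protein binding, chromatin marks, and transcription are measured experimentally and are not exact string matches. The sketch only asks how often one fixed short sequence would turn up in uniformly random DNA, both by formula and in a small simulation. The 6-base motif `ACGTGA`, the function names, and the uniform base composition are all hypothetical choices of mine for illustration.

```python
import random


def expected_chance_matches(seq_len, motif_len, strands=2):
    """Expected exact matches to one fixed motif in uniform random DNA.

    Each of the (seq_len - motif_len + 1) start positions matches with
    probability (1/4)**motif_len; scanning both strands doubles the count
    (assuming the motif is not its own reverse complement).
    """
    positions = seq_len - motif_len + 1
    return strands * positions * 0.25 ** motif_len


def count_matches(seq, motif):
    """Count exact occurrences of motif in seq, overlapping ones included."""
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


def random_dna(length, seed=0):
    """Uniform random DNA string; seeded so runs are repeatable."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", k=length))


if __name__ == "__main__":
    MOTIF = "ACGTGA"  # hypothetical 6-base site, chosen only for illustration
    n = 1_000_000     # small enough to simulate quickly in pure Python
    seq = random_dna(n)
    print("expected, one strand, 1 Mb:",
          round(expected_chance_matches(n, len(MOTIF), strands=1), 1))
    print("observed, one strand, 1 Mb:", count_matches(seq, MOTIF))
    print("expected, both strands, 100 Mb:",
          round(expected_chance_matches(100_000_000, len(MOTIF))))
```

Even for this single exact 6-mer, the formula predicts tens of thousands of chance occurrences on a hundred-million-base random chromosome. That says nothing about whether any of them would be bound in a cell, which is exactly the thing the real experiment would have to measure.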

(Hat tip to The Finch and Pea blog, a great blog that I hadn't seen before the last few days, where you'll find essentially the same idea.)

Evolution works on junk

Even if you did the Random Genome Project and found that a goodly fraction of a totally random DNA sequence was "functional", transcribed and bound and chromatin-marked, would this somehow diminish your view of the human genome?

Personally, I don't think we can understand genomes unless we try to recognize all the different noisy, neutral evolutionary processes at work in them. Without "noise" -- without a background of specific but nonfunctional transcription, binding, and marking -- evolution would have less traction, less de novo material to grab hold of and refine and select, to make it more and more useful. Genomes are made of repurposed sequence, borrowed from whatever happened to be there, including the "junk DNA" of invading transposons.

As Sydney Brenner once said, there's a difference between junk and garbage; garbage is stuff you throw out, junk is stuff you keep because it just might be useful someday.

Conflict of interest/full disclosure: I was a member of the national advisory council to the NIH National Human Genome Research Institute at the time ENCODE was conceived and planned - so I'm not quite as innocent and disinterested in policy questions of NIH NHGRI big science projects and media engagement strategy as this post may have made it sound.