ENCODE says what?

So I read in the newspaper this week that the ENCODE project has disproven the idea of junk DNA. I sure wish I’d gotten the memo, because this week a collaboration of labs led by myself, Arian Smit, and Jerzy Jurka just released a new data resource that annotates nearly 50% of the human genome as transposable element-derived, and transposon-derived repetitive sequence is the poster child for what we colloquially call “junk DNA”.

The newspapers went on to say that ENCODE has revolutionized our understanding of noncoding DNA by showing that far from being junk, noncoding DNA contains lots of genetic regulatory switches. Well, that’s also odd, because another part of my lab is (like a lot of other labs in biology these days) studying the regulation of genes in a model animal’s brain (the fruit fly Drosophila). We and everyone else in biology have known for fifty years that genes are controlled by regulatory elements in noncoding DNA. (Well, I’ve only known for thirty years, not fifty, I admit — only since Mrs. Dell’Antonio kicked me out of high school biology class and gave me a molecular genetics textbook to read by myself.)

Now, with all respect to my journalist friends, I’ve learned not to believe everything I read in the newspapers. I figured I’d better read the actual ENCODE papers. This is going to take a while. I’ve only read the main Nature paper carefully so far (there’s 30+ of them, apparently, across multiple journals). But it’s already clear that at least the main ENCODE paper doesn’t say anything like what the newspapers say.

The ENCODE project and our existing knowledge of genomes are both vastly more substantial than the discussion the ENCODE authors are provoking in the press right now.

The human genome has a lot of junk DNA

Genome size varies a lot. You might think that apparently more complex organisms like humans would have more DNA than simpler organisms like single-celled amoebae, but that turns out not to be true. Salamanders have 10-fold more DNA than us; lungfish, about 30-fold more.

So maybe we don’t really know how to define or measure “complexity”; maybe we’re just being anthropocentric when we think of ourselves as complex. Who’s to say that amoebae are less complex than humans? Ever looked at an amoeba? (They’re pretty awesome.) Still. The key observation isn’t just that very different creatures have very different genome sizes; it’s that similar species can have very different genome sizes. This fact, surprising at the time, begged a good explanation. If two species are similar, yet their genomes are 10x different in size, what’s all that extra DNA doing?

This observation about genome sizes (called the “C-value” paradox, for technical reasons) raised the idea that maybe genomes could expand (and shrink) rapidly (on an evolutionary timescale) as a result of some neutral (non-adaptive) processes — that maybe organisms could tolerate DNA that didn’t have a direct functional effect on the organism itself, but was instead being created and maintained by neutral or even parasitic mechanisms of evolution. Somebody (it’s a good bet that T. Ryan Gregory knows who) dubbed this “junk” DNA, and that was probably an unfortunate term, because it’s incited people’s anger from the day it was coined. It’s not polite to tell someone their beautiful house is full of junk. Even if it is.

A key discovery that satisfactorily explained the C-value paradox was the discovery that genomes, especially animal and plant genomes, contain large numbers of transposable (mobile) elements that replicate all by themselves, often at the (usually slight) expense of their host genome. For instance, about 10% of the human genome is composed of about a million copies of a small mobile element called Alu. Another big fraction of the genome is composed of a mobile element called L1. Transposons are related to viruses, and we think that for the most part they are parasitic in nature. They infect a genome, replicating, spreading, and multiplying; eventually they die, mutate, and decay away, leaving their DNA sequences. Sometimes when an Alu replicates and hops into a new place in our genome, it breaks something. Usually (partly because the genome is mostly nonfunctional) a new Alu just hops somewhere else in the junk, and has no appreciable effect on us.

So it turns out that when we look at all these different genome sizes, almost all of the puzzling size variation is explained by genomes having different “loads” of transposable elements. Some creatures, like pufferfish, have only low loads of transposons. Some creatures, like salamanders, lungfish, amoebae, corn, and lilies, are loaded with massive numbers of transposons. As it happens, the human genome is annotated as about 50% transposon-derived sequence — right at that 50/50 borderline where someone can say “the human genome is mostly junk” and someone else can say “the human genome is mostly not junk”.

In 1980, two key papers — by Orgel and Crick, and by Sapienza and Doolittle — nicely laid out the argument that genomes contain “selfish” or “junk” DNA, largely transposon-derived, sometimes quite large amounts of it. These papers are quite beautiful and scholarly. They are careful to say, for example, that it would be surprising if evolution did not sometimes co-opt useful functions from this great amount of extra DNA sequence slopping around. Indeed, we are now finding many interesting examples of transposon-derived stuff being co-opted for organismal function (but these are the exception, not the rule). Without trying to be snide or pedantically academic, I’ll note that the main ENCODE paper cites neither Orgel/Crick nor Sapienza/Doolittle; what this means is, regardless of what we read in the newspapers, ENCODE is not actually trying to interpret their data in light of the current thinking about junk DNA, at least in the actual paper.

Transposon-derived sequences are the poster child for “junk DNA” because we can positively identify transposon-derived sequences by computational analysis, and reconstruct the evolutionary history of transposon invasions of genomes. There’s likely to be other nonfunctional DNA “junk” too, in the DNA that we can’t currently put any annotation at all on, but the key point is that the dead bones of many transposons are something we can affirmatively identify.

Noncoding DNA is part junk, part regulatory, part unknown

It is crucial to understand that “noncoding” DNA is not synonymous with “junk” DNA. The current view of the human genome, which ENCODE has now systematically and comprehensively confirmed and extended, is that it is about 1% protein-coding, in perhaps about 20,000 “genes” averaging about 1500 coding bases each (where the concept of a “gene” is amorphous, but useful; we know one when we see one). Genes are turned on and off by regulatory DNA regions, such as promoters and enhancers — as has been worked out over fifty years, starting with how bacterial viruses work. In animals like humans, most people (ok, I) would guess that there are maybe 10-20 regulatory regions per gene, each maybe 100-300 bases long; so, very roughly, maybe on the order of about 1000-6000 bases of noncoding regulatory information per 1500 coding bases in a gene. I’m only giving hand-wavy back of the envelope notions here because it’s actually quite difficult to pin these numbers down exactly; our current knowledge of regulatory DNA sequences in detail is distressingly incomplete. That’s something that ENCODE’s trying to help figure out, in systematic fashion, and where a lot of ENCODE’s substantive value is. The point is, we already knew there was likely at least as much regulatory DNA as coding DNA, and probably more; we just don’t have a very satisfying handle on it all yet, and we thought we needed an ENCODE project to survey things more comprehensively.
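The back-of-the-envelope numbers above can be checked with a few lines of arithmetic. This is only a sketch: the gene count, coding length, and regulatory-region ranges are the rough guesses from the paragraph, and the 3-billion-base genome size is an assumed round figure that this post does not state. The function and variable names are made up for the example.

```python
# Rough guesses taken from the paragraph above, plus one assumption (GENOME_SIZE).
GENOME_SIZE = 3_000_000_000    # ASSUMPTION: round figure for the human genome, in bases
N_GENES = 20_000               # "perhaps about 20,000 genes"
CODING_PER_GENE = 1_500        # "averaging about 1500 coding bases each"
REGIONS_PER_GENE = (10, 20)    # "maybe 10-20 regulatory regions per gene"
BASES_PER_REGION = (100, 300)  # "each maybe 100-300 bases long"


def envelope():
    """Return the hand-wavy coding and regulatory estimates as a dict."""
    # Regulatory bases per gene: low guess times low guess, high times high.
    reg_low = REGIONS_PER_GENE[0] * BASES_PER_REGION[0]
    reg_high = REGIONS_PER_GENE[1] * BASES_PER_REGION[1]
    return {
        "coding_fraction": N_GENES * CODING_PER_GENE / GENOME_SIZE,
        "regulatory_bases_per_gene": (reg_low, reg_high),
        "regulatory_fraction": (
            N_GENES * reg_low / GENOME_SIZE,
            N_GENES * reg_high / GENOME_SIZE,
        ),
    }


if __name__ == "__main__":
    for name, value in envelope().items():
        print(name, value)
```

Under these guesses the coding fraction comes out at about 1% of the genome, and regulatory DNA at somewhere between roughly 0.7% and 4%. The ranges are only as good as the guesses that go into them.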

So when you read a Mike Eisen saying “those damn ENCODE people, we already knew noncoding DNA was functional”, and a Larry Moran saying “those damn ENCODE people, there is too a lot of junk DNA”, they aren’t contradicting each other. They’re talking about different (sometimes overlapping) fractions of human DNA. About 1% of it is coding. Something like 1-4% is currently expected to be regulatory noncoding DNA given what we know (and our knowledge about regulatory sites is especially incomplete). About 40-50% of it is derived from transposable elements, and thus affirmatively already annotated as “junk” in the colloquial sense that transposons have their own purpose (and their own biochemical functions and replicative mechanisms), like the spam in your email. And there’s some overlap: some mobile-element DNA has been co-opted as coding or regulatory DNA, for example.

Now that still leaves a lot of the genome. What’s all that doing? Transposon-derived sequence decays rapidly, by mutation, so it’s certain that there’s some fraction of transposon-derived sequence we just aren’t recognizing with current computational methods, so the 40-50% number must be an underestimate. So most reasonable people (ok, I) would say at this point that the human genome is mostly junk (“mostly” as in, somewhere north of 50%).

At the same time, we still have only a tenuous grasp on the details of gene regulation, even though we think we understand the broad strokes now. Nobody should bet against finding more and more regulatory noncoding DNA, either. The human genome surely contains a lot of unannotated functional DNA. The purpose of the ENCODE project was to help us sort this out. Its data sets, and others like them, will be fundamental in giving us a comprehensive view of the functional elements of the human genome.

ENCODE’s definition of “functional” includes junk

ENCODE has assigned a “biochemical function” to 80% of the genome. The newspapers add, “therefore it’s not junk”, but that’s a critically incorrect logical leap. It presumes that junk DNA doesn’t have a “biochemical function” in the sense that ENCODE chose to operationally define “function”. So in what sense did ENCODE define the slippery concept of biological function, to allow them to assign a human genome fraction (to two significant digits, ahem)?

ENCODE calls a piece of DNA “functional” if it reproducibly binds to a DNA-binding protein, is reproducibly marked by a specific chromatin modification, or if it is transcribed. OK. That’s a fine, measurable operational definition. (One might wonder, why not just call “DNA replication” a function too, and define 100% of the genome as biochemically functional, but of course, as Ewan Birney (the ENCODE czar) would tell you, I would never be that petty. No sir.) I am quite impressed by the care that the ENCODE team has taken to define “reproducibility”, and to process their datasets systematically.

But as far as questions of “junk DNA” are concerned, ENCODE’s definition isn’t relevant at all. The “junk DNA” question is about how much DNA has essentially no direct impact on the organism’s phenotype – roughly, what DNA could I remove (if I had the technology) and still get the same organism. Are transposable elements transcribed as RNA? Do they bind to DNA-binding proteins? Is their chromatin marked? Yes, yes, and yes, of course they are – because at least at one point in their history, transposons are “alive” for themselves (they have genes, they replicate), and even when they die, they’ve still landed in and around genes that are transcribed and regulated, and the transcription system runs right through them.

Thought experiment: if you made a piece of junk for yourself — a completely random DNA sequence! — and dropped it into the middle of a human gene, what would happen to it? It would be transcribed, because the transcription apparatus for that gene would rip right through your junk DNA. ENCODE would call the RNA transcript of your random DNA junk “functional”, by their technical definition. And even if it weren’t transcribed, that would be because it acted as a different kind of functional element (your random DNA could accidentally create a transcriptional terminator).

The random genome project

So a-ha, there’s the real question. The experiment that I’d like to see is the Random Genome Project. Synthesize a hundred million base chromosome of entirely random DNA, and do an ENCODE project on that DNA. Place your bets: will it be transcribed? bound by DNA-binding proteins? chromatin marked?

Of course it will.

The Random Genome Project is the null hypothesis, an essential piece of understanding that would be lovely to have before we all fight about the interpretation of ENCODE data on genomes. For random DNA (not transposon-derived DNA, not coding, not regulatory), what’s our null expectation for all these “functional” ENCODE features, by chance alone, in random DNA?
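One small piece of that null expectation can be sketched on a computer: how often does a short sequence motif turn up in random DNA by chance alone? The sketch below is only an illustration, not a substitute for the experiment. The motif, the sequence length, the uniform base composition, and all function names are assumptions made for the example. Real binding sites are degenerate, and real genomes are not uniform.

```python
import random


def random_dna(length, seed=0):
    """Generate a uniform random DNA string (ASSUMPTION: equal base frequencies)."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(length))


def count_motif(seq, motif):
    """Count exact occurrences of motif in seq, allowing overlaps."""
    count = 0
    start = seq.find(motif)
    while start != -1:
        count += 1
        start = seq.find(motif, start + 1)
    return count


def expected_hits(length, motif):
    """Expected exact matches on one strand of uniform random DNA."""
    k = len(motif)
    return (length - k + 1) / 4 ** k


if __name__ == "__main__":
    length = 1_000_000
    motif = "TATAAA"  # hypothetical example motif, chosen only for illustration
    seq = random_dna(length)
    print("observed:", count_motif(seq, motif))
    print("expected:", round(expected_hits(length, motif), 1))
```

By the same formula, a six-base motif would be expected to match on the order of 24,000 times on one strand of a hundred-million-base random chromosome. Exact sequence matches are of course a much cruder thing than measured binding, transcription, or chromatin marks, so this only hints at why the null is unlikely to be zero.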

(Hat tip to The Finch and Pea blog, a great blog that I hadn’t seen before the last few days, where you’ll find essentially the same idea.)

Evolution works on junk

Even if you did the Random Genome Project and found that a goodly fraction of a totally random DNA sequence was “functional”, transcribed and bound and chromatin-marked, would this somehow diminish your view of the human genome?

Personally, I don’t think we can understand genomes unless we try to recognize all the different noisy, neutral evolutionary processes at work in them. Without “noise” — without a background of specific but nonfunctional transcription, binding, and marking — evolution would have less traction, less de novo material to grab hold of and refine and select, to make it more and more useful. Genomes are made of repurposed sequence, borrowed from whatever happened to be there, including the “junk DNA” of invading transposons.

As Sydney Brenner once said, there’s a difference between junk and garbage; garbage is stuff you throw out, junk is stuff you keep because it just might be useful someday.

Conflict of interest/full disclosure: I was a member of the national advisory council to the NIH National Human Genome Research Institute at the time ENCODE was conceived and planned – so I’m not quite as innocent and disinterested in policy questions of NIH NHGRI big science projects and media engagement strategy as this post may have made it sound.


  1. My undergrad molecular genetics is really rusty, my ecological knowledge less so. Even for someone such as myself this genetic stuff is quite complex, and I imagine for the layperson, somewhat unfathomable. I remember from my studies that junk DNA existed, but in those days new things were being discovered. I have been hoping by now some resolution of debates may have occurred, but if anything it seems to have got worse! From my vantage point, natural selection mainly (although not always) produces things that are functional.
    Thus I rather like the idea of junk DNA being a pillow, a sort of protective entity as described by Claudiu Bandea.
    However there is a possibility that it is relic DNA “left over” from ancestral selection processes. You would then expect an onion to have more of it than us, as we would hopefully be more refined! It may be left over and not harmful, but it may even be preferentially maintained if not conserved due to providing such a “pillow” effect.
    From my point of view, what I really need you guys to be all doing is focussing on regulatory genes and other control processes. I am not interested in what has basic biochemical function such as junk DNA, but what has possible translation to the outside world i.e. can influence ecological responses. Can we please stop the squabble and concentrate on what Darwinian selection actually provides at the level of the genome and proteins?



  2. @Sean Eddy: I don’t fully understand, but like other posters I admit this is not my field, so I’m still trying to sort out how this plays out. You say transposons are the poster child for junk DNA, but then mention that Drosophila telomeres are made out of transposons. Are those telomere transposons junk or functional? You seem happy for them to be both, which seems at odds with the definitions of junk and functional. Not ENCODE’s definition of function, but as I understand Comings/Ohno evolutionary fitness terms–surely selection clearly prefers a telomere?

    If they are functional, and not junk, then don’t you need more numbers to justify why you lumped in the entire amount of transposable elements as junk, say, numbers to suggest which fraction of transposable elements have not been co-opted for evolutionary fitness and which have? The OP says “Indeed, we are now finding many interesting examples of transposon-derived stuff being co-opted for organismal function (but these are the exception, not the rule).” The last clause makes a definitive statement (upper bound on prevalence), while the opening clause suggests the number is increasing: is there data to suggest an upper bound? Are you saying “these are currently the exception, not the rule, and it’s my belief this won’t change”?

    The OP says “The ‘junk DNA’ question is about how much DNA has essentially no direct impact on the organism’s phenotype – roughly, what DNA could I remove (if I had the technology) and still get the same organism.” That, too, is imprecise: is a human with HER2 mutation the same phenotype as one without? If you removed introns completely, would RNA splicing work? What if it worked only half the time, is that a different phenotype? What if the intron required a certain minimum number of base pairs for splicing to work, which part of the intron is functional and which is not, and could the standard measures of selective pressure identify function? If you removed everything in a 5′ UTR except the exact regulatory sequence, would that work, or is the “regulatory sequence” a sequence-specific part plus a non-sequence-specific spacing to allow a 3D molecule to bind a certain sequence and still have room to accommodate the rest of its structure? Which is functional, and can you measure the selective pressure for that? Is it about REMOVING that DNA, or is it in practice about which stretches of DNA repress point mutations, and thus are non-redundant and sequence-specific?

    Wandering further off, if you eliminated the introns and the spliceosome, such that the same transcript got generated, is that the same phenotype and organism? Or have you created a new, intron-and-spliceosome-free organism? If that organism proved horribly more susceptible to attack/disabling by transposable elements, such that it died out, would that constitute proof that those parts of DNA were not junk but selectively functional? What if it didn’t die out, but became obviously different? If there were multiple splicings, and in your non-splicing organism you duplicated each as a separate gene, does that extra DNA constitute proof that the non-splicing organism doesn’t need those additional base pairs (it’s a bunch of extra DNA in virtually identical organisms)?

    You wrote “I agree, and I think that’s part of the slipperiness of the term ‘function’, and why the term ‘junk’ is only a colloquialism. The junk on my desk is junk, but if you suddenly removed it, my coffee cup would fall over and spill into my laptop; the junk has become part of the system.” Wait…are you saying that spacer DNA has a secondary function, but not a selective one? Or are you acknowledging it can have a selective function, but would still consider it “junk”? If the spacer DNA has a selective function–let’s say complete removal of the spacer ultimately results in death of the fetus–is that identifiable and measurable right now? How is it classified according to the numbers you’ve given? It might even have a transposable element in the middle.

    D. Allan Drummond writes, “A better definition is ‘a change whose fitness effect is detectable by natural selection’. Set aside for the moment our inability to measure fitness effects; this is true, unfortunate, and irrelevant to the question of whether the definition is correct.” This is on a post which is essentially ranting about ENCODE’s numbers for functioning DNA, so if you concede you can’t actually measure your correctly defined “functional” entities, why are you quibbling about somebody else’s numbers? If your hugely conservative estimator function for your “correct definition” is known to underpredict, and they’ve chosen an optimistic “incorrect definition” that acts as a liberal estimator for your definition, why are you quibbling? They can point out DNA which your estimator wrongly excludes, and you can point out DNA their definition wrongly includes.

    nr comments about how little information it would take to encode spacing rather than sequence. If I understand his point, the density of information encoded by a base pair isn’t relevant when deciding whether it is under selective pressure or not. True, in information-theoretic terms it is wasteful to use say 100 base pairs to convey a binary message “yes, transcribe this DNA” or “no, do not transcribe”. But it isn’t _coding_ for a spacer, it _is_ the spacer. It acts in sort of an epistatic way.

    Lastly, in a more philosophic vein, I think the whole wasted time around “junk DNA” springs from using inappropriate terminology. When you use words in a substantially different manner than society at large, it should come as no surprise there will be much confusion and you will have to explain your precise meaning over and over. And you really don’t have much of a leg to stand on if some of your colleagues decide to use the word in a manner more closely fitting general usage.

    In particular, Brenner (and Ohno too?) differentiate between junk you keep, and garbage you throw away. Here are relevant meanings from a dictionary description:

    1. Discarded material, such as glass, rags, paper, or metal, some of which may be reused in some form.
    2. Informal
    a. Articles that are worn-out or fit to be discarded: broken furniture and other junk in the attic.
    b. Cheap or shoddy material.
    c. Something meaningless, fatuous, or unbelievable: nothing but junk in the annual report.
    tr.v. junked, junk·ing, junks
    To discard as useless or sell to be reused as parts; scrap.
    1. Cheap, shoddy, or worthless: junk jewelry.

    In particular, I would highlight the repeated theme that junk should be thrown out, has been thrown out, or is being thrown out. And yet, “junk DNA” is widely acknowledged to have all sorts of benefits and uses, it just is not currently under selective pressure, and is presumed to be separate from “garbage DNA” which more closely corresponds to the given definition for junk.
    It’s no wonder there’s confusion and scorn when conveyed to the general English-speaking public, and as with the musician Prince/artist-formerly-known-as, explaining the special meaning for the characters doesn’t completely fix things.



    1. Matt, I’ve started to notice that people who think that this is a debate about the semantics of the words “junk” and “function” tend not to talk about the data and observations that led to Ohno’s concept of junk DNA, and instead tend to argue from their intuition about how they think genomes should work. It is indeed complicated, for some of the reasons you discuss. You mentioned it’s not your field; if you’re interested in why someone might think the way I do, there’s good books on the subject. I recommend Michael Lynch’s book The Origins of Genome Architecture.

      But what you really ought to do is read about transposons, which are super cool; and once you see how transposons work, I think you’ll see why we don’t have to imagine that they all have advantageous functions for us, any more than we imagine that the cold or flu viruses are advantageous to us. For the most part we have them because we can’t get rid of them; they’re ‘alive’ for themselves, not for us.



  3. There is always a rebuttal to every argument isn’t there? People believe what they want to believe. What this article is really doing is providing a response to project results that threaten one’s belief in evolution and support intelligent design.


