A deeper confusion


John Parrington is an associate professor in the Department of Pharmacology at the
University of Oxford. However, as he details in the introduction section of the book,
he writes The Deeper Genome from the perspective of a science journalist rather than someone who is intimately
familiar with the field, and his interest in the subject was in large part sparked
by the announcement of the results from the second phase of the ENCODE Project on
September 5th, 2012, when he was spending time at The Times in London on a British Science Association Media Fellowship. The introduction also
states the main message of the book:

So while the original Human Genome Project provided the sequence of letters that make
up the DNA code, ENCODE appeared to have gone substantially further and told us what
all these different letters actually do. Perhaps most exciting was its claim to have
solved one of the biggest conundrums in biology: this is the fact that our genes,
which supposedly define us as a species, but also distinguish you or I or anyone else
on the planet from each other, make up only 2 % of our DNA. The other 98 per cent
had been written off as “junk”; however, this raised the question of why our cells
should spend vital energy replicating and storing something with no function. […]
By scanning through the whole genome rather than just the genes, and using multiple,
cutting-edge approaches to measure biochemical activity, ENCODE had come to the startling
conclusion that, far from being junk, as much as 80 per cent of these disregarded
parts of the genome had an important function (pp. 2–3)

This view is defended not just on the basis of the results from the ENCODE Project
(to be discussed below). The author also brings together discoveries from many other
hot research areas, which are presented as supporting the idea that the whole genome
is functional.

The first two chapters of The Deeper Genome take the reader on a brief walk through the history of genetics and genome biology
in the last 200 years. Chapter 1, “The Inheritors”, covers the problem of the mechanisms
of inheritance that early evolutionary theory faced and that was never resolved by
Darwin in his lifetime, through Mendel’s insights into that question and their rediscovery
at the turn of the 20th century, and the subsequent establishment of the chromosomal
theory of inheritance and modern genetics. Chapter 2, “Life as a Code”, goes over
the discovery of DNA, the identification of DNA as the carrier of genetic information,
the deciphering of its double helix structure, the molecular biology revolution of
the 1950s and 1960s and the establishment of the Central Dogma (Crick 1958).

Chapter 3, “Switches and Signals”, presents the early history of regulatory biology,
focusing on the pioneering work of Jacob and Monod, and ending with a brief discussion
of chromatin and epigenetic marks. Chapter 4, “The Spacious Genome”, continues with
the 1970s discoveries of enhancers, introns and splicing, before going into a discussion
of the C-value (the discrepancy between genome size and the perceived complexity of
organisms; Thomas 1971) and g-value (the discrepancy between the number of genes and organismal complexity;
Hahn and Wray 2002) paradoxes, junk DNA, and the development of some of the classic explanations for
their existence.

Up to the junk DNA part, these first chapters are an enjoyable read and will be useful
to general readers and possibly even working biologists, given that, as with most
scientific disciplines, the teaching of biology up to and including the graduate level
rarely includes the development of a good understanding of the intellectual history
of the field, a gap that students are left to fill on their own if they are so inclined.
However, the junk DNA section marks the beginning of the problems that plague the
rest of the book, specifically the imprecise presentation of facts and the omission
of important and powerful arguments against the position defended in the text. For
example, Parrington states:

In this sense, what more perfect demonstration is there that nature is “an excellent
tinkerer, not a divine artificer”, than the fact that 98 per cent of our own genome
is useless? […] This is a powerful argument, and one that I have much sympathy with,
guided as I am by the principle that both life and the universe can be explained by
purely materialist principles. However, using the uselessness of so much of the genome
for such a purpose is also risky, for what if the so-called junk turns out to have
an important function, but one that hasn’t yet been identified? Whether such important
functions exist within non-coding DNA has been one of the most hotly debated topics
in genetics over the last few years (p. 72).

However, no knowledgeable person has ever defended the position that 98 % of the human
genome is useless. The 98 % figure corresponds to the fraction of it that lies outside
of protein-coding genes, but the existence of distal regulatory elements, as nicely
narrated by the author himself, has by now been known for four decades, and numerous
comparative genomics studies have pointed to a fraction of the genome under selective
constraint several-fold larger than 2 % (Siepel et al. 2005; Lindblad-Toh et al. 2011;
Davydov et al. 2010; Meader et al. 2010), lying largely in noncoding regions. Thus there
is (and there has been) no real debate regarding whether noncoding DNA can have important
functions: it absolutely does, this is well known, and it is misleading to state otherwise,
let alone to later use that claim as an argument in favor of the functionality of the
whole genome.

Chapter 5, “RNA Out of the Shadows”, explores the wide variety of roles that noncoding
RNA plays in cells, from ribozymes (and the RNA world hypothesis), to small RNAs and
RNA interference, and finally to lincRNAs (long intergenic noncoding RNAs). While one
of the purposes of the chapter is to use the multitude of noncoding RNAs to support
the functionality of most of the genome, it actually underestimates their diversity;
it also gets some of its facts wrong:

Currently, there are four known classes of non-coding RNAs, although each class almost
certainly include many subclasses. First, there are the siRNAs, which, as we’ve just
discussed, regulate gene expression by destroying their target mRNAs. The second class
are known as microRNAs, or miRNAs for short. […] Third, there are the piRNAs […]
The fourth class are the long non-coding RNAs, or lcRNAs [sic]. These are defined
mainly by length, all being over two hundred bases long, in contrast to the other
three classes which are typically much smaller, at around twenty bases (pp. 83–84).

Of course, in reality there are many more functional types of noncoding RNAs than
just these four. Aside from the tRNAs and rRNAs, which are fundamental for gene expression,
the snRNAs (small nuclear RNAs, components of the spliceosome) and snoRNAs (small nucleolar
RNAs that guide the chemical modification of other RNAs) are also large classes of
noncoding RNAs that have been known for decades. We can then mention the RNA component
of telomerase, the 7SK RNA, the SRP RNA, Y RNAs, vault RNAs, RNase P, and numerous
others, and this is just within eukaryotes phylogenetically close to humans; prokaryotes
have a number of noncoding RNAs unique to them, as do various eukaryotic clades. Importantly,
many of these RNAs have been known for nearly three decades or more (Walter and Blobel
1982; Brown et al. 1991; Borsani et al. 1991; Greider and Blackburn 1987; Reddy et al.
1984; Lerner et al. 1981; Kedersha and Rome 1986; Blum et al. 1990; Ray and Apirion 1979;
Guerrier-Takada et al. 1983), and they occupy only a small fraction of the genome (Kellis
et al. 2014); for example, according to version 19 of the GENCODE annotation of the human
genome (Harrow et al. 2012), the exons of lincRNAs cover only 0.2 % of the human genome
and miRNAs comprise a minuscule 0.013 % (see Table 1 below), i.e., their existence is
hardly grounds for rejecting the notion of junk DNA.

Table 1. Fraction of the human genome occupied by the exons of annotated transcripts
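
Coverage figures of this kind are straightforward to recompute from the annotation itself.
The sketch below, which assumes locally downloaded copies of the GENCODE v19 GTF file and
a chromosome-sizes table (the file names and the use of the "gene_type" attribute are
assumptions about the inputs, not anything taken from the book), merges the exon intervals
of a given biotype and reports the fraction of the genome they cover; the same approach,
applied to whole transcript spans instead of exons, would yield numbers of the kind shown
in Table 2.

from collections import defaultdict

def merged_length(intervals):
    """Number of bases covered by a set of possibly overlapping (start, end) intervals."""
    total = 0
    cur_start = cur_end = None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            if cur_end is not None:
                total += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start
    return total

def exon_fraction(gtf_path, chrom_sizes_path, biotype):
    """Fraction of the genome covered by exons of the given GENCODE biotype."""
    genome_size = 0
    with open(chrom_sizes_path) as f:
        for line in f:
            chrom, size = line.split()[:2]
            genome_size += int(size)
    exons = defaultdict(list)  # chromosome -> list of (start, end), 0-based half-open
    with open(gtf_path) as f:
        for line in f:
            if line.startswith('#'):
                continue
            fields = line.rstrip('\n').split('\t')
            if fields[2] != 'exon':
                continue
            # GENCODE stores the biotype in the attribute column (9th field)
            if 'gene_type "%s"' % biotype not in fields[8]:
                continue
            # GTF coordinates are 1-based and end-inclusive
            exons[fields[0]].append((int(fields[3]) - 1, int(fields[4])))
    covered = sum(merged_length(ivs) for ivs in exons.values())
    return covered / genome_size

if __name__ == '__main__':
    for bt in ('lincRNA', 'miRNA', 'protein_coding'):
        frac = exon_fraction('gencode.v19.annotation.gtf', 'hg19.chrom.sizes', bt)
        print('%s exons cover %.3f%% of the genome' % (bt, 100 * frac))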

Chapter 6, “It’s a Jungle in There!”, is the centerpiece of the book, focusing on the
ENCODE Project and its results. Unfortunately, the author derives his information
mainly from press releases and interviews rather than from the primary literature, which
leads him to some erroneous conclusions not supported by the data, as a result of the
compounded overhyping of the results at each step of reporting. All of the content of
the chapter is based on the 2012 main integration paper of the ENCODE Consortium (ENCODE
Project Consortium 2012), and even that comes primarily from writings about the paper
rather than the paper itself, while the later ENCODE publication that is probably more
important with respect to the question of how much of the genome is functional (Kellis
et al. 2014) is ignored.

Instead of providing an accurate summary of the current understanding of the issue,
the book just repeats the claim that ENCODE has found “important function” for basically
the whole genome. But this is not really what the ENCODE paper, read on its own, claims.
Here are the key quotes from it (emphasis mine):

These data enabled us to assign biochemical functions for 80 % of the genome, in particular outside of the well-studied protein-coding
regions

…

Operationally, we define a functional element as a discrete genome segment that encodes a defined product (for example, protein
or non-coding RNA) or displays a reproducible biochemical signature (for example,
protein binding, or a specific chromatin structure)

…

The vast majority (80.4 %) of the human genome participates in at least one biochemical
RNA- and/or chromatin-associated event in at least one cell type

Given these definitions, and given the limitations imposed by the resolution of the
assays used, the claim that 80 % of the genome is “functional” is indeed correct (and
that 80 % is in effect equivalent to close to 100 %, as between 15 and 25 % of the genome
is not uniquely mappable with short sequencing reads and is thus “invisible” to these
analyses). But this is only under these particular definitions of function, following
the biochemical criterion for functionality, which on its own is not proof of function,
much less of an “important” one. Here is a quote from Kellis et al. (2014) (the ENCODE
publication explicitly dedicated to the question of assessing functionality):

However, biochemical signatures are often a consequence of function, rather than causal.
They are also not always deterministic evidence of function, but can occur stochastically.
For example, GATA1, whose binding at some erythroid-specific enhancers is critical
for function, occupies many other genomic sites that lack detectable enhancer activity
or other evidence of biological function (70). Likewise, although enhancers are strongly
associated with characteristic histone modifications, the functional significance
of such modifications remains unclear, and the mere presence of an enhancer-like signature
does not necessarily indicate that a sequence serves a specific function (71, 72).
In short, although biochemical signatures are valuable for identifying candidate regulatory
elements in the biological context of the cell type examined, they cannot be interpreted
as definitive proof of function on their own.

Reminiscent of the ways a mackerel can transform into a cetacean, somewhere along
the chain of transmission of information “biochemical function”, operationally defined,
transformed into “important function” in the sense in which the term is traditionally
understood.
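
As a side note, the arithmetic behind the parenthetical remark above, that 80 % of the
genome showing biochemical activity is effectively close to 100 % of what the assays can
see, is simple to verify. The sketch below just divides the active fraction by the uniquely
mappable fraction, taking the 15–25 % unmappable range quoted above as an assumed input.

# The 80% "biochemically active" figure divided by the uniquely mappable
# fraction of the genome (assumed here to be 75-85%, per the range quoted above)
# gives the fraction of the *visible* genome that shows activity.
active = 0.80
for unmappable in (0.15, 0.25):
    mappable = 1.0 - unmappable
    visible_active = min(active / mappable, 1.0)  # cap at 100%
    print('unmappable %2.0f%% -> %3.0f%% of the mappable genome is active'
          % (100 * unmappable, 100 * visible_active))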

Parrington lists four types of evidence that ENCODE used to “assess function” (the
correct term would be “identify candidate functional elements”). The first one is
mentioned as “identifying all the places in the genome where transcription factors
bind to the DNA”, which presumably refers to transcription factor ChIP-seq (chromatin
immunoprecipitation coupled with high-throughput sequencing; Johnson et al. 2007). The
second involves the mapping of open chromatin (DNase-seq and digital genomic footprinting,
or DGF; Thurman et al. 2012; Neph et al. 2012). The third approach he mentions is the
mapping of DNA methylation. Finally, the transcriptome maps generated using RNA-seq are
listed.

The inclusion of DNA methylation in this list is quite puzzling. The main ENCODE integration
paper indeed lists DNA methylation as one of the assays used; however, first, it was
not applied as a proxy for functionality but for other scientific purposes; second,
the particular technique used was Reduced Representation Bisulfite Sequencing (Meissner
et al. 2005), which does not give a truly genome-wide measurement of DNA methylation;
and third, DNA methylation can hardly be used as a criterion for functionality, because
most of the CpG sites in somatic mammalian genomes are usually methylated (Lister et al.
2009), with some important exceptions in regulatory elements and elsewhere (Jones 2012),
and because DNA methylation is one of the mechanisms used to silence one of the classic
examples of junk DNA, transposable elements (Yoder et al. 1997).

This is not the only problem, as the presentation of the other methods, of what they can
and cannot tell us, and of what they in fact do tell us about how much of the genome is
functional, is very incomplete. As a first example, the genome is indeed pervasively
transcribed; however, on its own this observation is an oversimplification that falls
far short of the complete story. Here are some more quotes from Kellis et al. (2014):

In agreement with prior findings of pervasive transcription (85, 86), ENCODE maps
of polyadenylated and total RNA cover in total more than 75 % of the genome. These
already large fractions may be underestimates, as only a subset of cell states have
been assayed. However, for multiple reasons discussed below, it remains unclear what
proportion of these biochemically annotated regions serve specific functions

…

For example, RNA transcripts of some kind can be detected from 75 % of the genome, but a significant portion of these are of low abundance […].
For polyadenylated RNA, where it is possible to estimate abundance levels, 70 % of
the documented coverage is below approximately one transcript per cell (100–103).
The abundance of complex nonpolyadenylated RNAs and RNAs from subcellular fractions,
which account for half of the total RNA coverage of the genome, is likely to be even
lower, although their absolute quantification is not yet achieved
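
To put the “one transcript per cell” threshold in perspective, converting an RNA-seq
abundance estimate into approximate copies per cell is a standard back-of-the-envelope
exercise; in the sketch below the total mRNA content per cell is an assumed round number
(a few hundred thousand molecules), not a value reported by ENCODE or the book.

TOTAL_MRNA_PER_CELL = 300_000  # assumed order of magnitude for a mammalian cell

def copies_per_cell(tpm, total_mrna=TOTAL_MRNA_PER_CELL):
    """Approximate transcript copies per cell for a given TPM value."""
    return tpm * total_mrna / 1_000_000

# Roughly 3 TPM corresponds to ~1 copy per cell under this assumption,
# so transcripts far below that are present in only a minority of cells
# at any given moment.
print(copies_per_cell(3.0))   # ~0.9 copies per cell
print(copies_per_cell(0.1))   # ~0.03 copies per cell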

That a large fraction of the genome is transcribed is not surprising—after all, while
annotated exons might occupy only 2 % of it, the introns of those same genes cover
a much larger fraction of the genome (Table 2).

Table 2. Fraction of the human genome occupied by annotated transcripts (exons + introns)

This is DNA that is transcribed in order to produce mRNAs, and many of the products
of its transcription are present in the various subcellular fractions assayed by ENCODE
(in addition to polyadenylated RNA, which is the mature state of mRNAs, ENCODE also
analyzed polyA+ and non-polyA transcripts from whole-cell, cytosolic, nuclear, nucleoplasmic,
nucleolar, and chromatin subcellular fractions). But we cannot expect a complete absence
of transcription outside of annotated genes either. Another quote from Kellis et al.
(2014):

At present, we cannot distinguish which low-abundance transcripts are functional,
especially for RNAs that lack the defining characteristics of known protein coding,
structural, or regulatory RNAs. A priori, we should not expect the transcriptome to
consist exclusively of functional RNAs. Zero tolerance for errant transcripts would
come at high cost in the proofreading machinery needed to perfectly gate RNA polymerase
and splicing activities, or to instantly eliminate spurious transcripts

No serious attention is given in the book to the fact that much of the observed transcription
is at low levels, or to the fact that, as shown in Kellis et al., the strength of the
biochemical signal correlates quite well with evolutionary conservation (i.e. regions of
the genome expressed at high levels or more strongly occupied by transcription factors
are more likely to be conserved than those with low levels of signal), or to what all
this means for the question of the extent of functionality of the genome (Kellis et al.
2014):

Thus, one should have high confidence that the subset of the genome with large signals
for RNA or chromatin signatures coupled with strong conservation is functional and
will be supported by appropriate genetic tests. In contrast, the larger proportion
of genome with reproducible but low biochemical signal strength and less evolutionary
conservation is challenging to parse between specific functions and biological noise.

Another issue that is ignored is the resolution of the assays used and how it inflates
the 80 % figure. The biggest contribution to that figure comes from the transcriptome,
but the fraction of the genome occupied by ChIP-seq peaks is also quite large. However
(Kellis et al. 2014):

Biochemical methods, such as ChIP or DNase hypersensitivity assays, capture extended
regions of several hundred bases, whereas the underlying transcription factor binding
elements are typically only 6–15 bp in length

The upward bias that technical limitations impose on biochemical estimates of functionality
is even more of an issue with histone marks, where the resolution of the assay is less
of a problem than the fact that a single enhancer or promoter with a limited number of
functionally constrained base pairs can induce changes in the chromatin state of several
neighboring nucleosomes.
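
The size of this resolution effect is easy to illustrate with round numbers. In the sketch
below the number of occupied transcription factor sites and the peak and motif widths are
assumed, illustrative figures (not ENCODE’s actual counts); the point is only that counting
peaks rather than motifs inflates apparent coverage by well over an order of magnitude.

GENOME_SIZE = 3.1e9   # approximate human genome size, bp
N_SITES     = 1e6     # assumed number of occupied transcription factor sites
PEAK_WIDTH  = 300     # typical ChIP-seq peak width, bp
MOTIF_WIDTH = 10      # typical binding motif width, bp (6-15 bp range)

# Overlaps between sites are ignored; this is only an order-of-magnitude sketch.
peak_coverage = N_SITES * PEAK_WIDTH / GENOME_SIZE
motif_coverage = N_SITES * MOTIF_WIDTH / GENOME_SIZE

print('counted as peaks : %.1f%% of the genome' % (100 * peak_coverage))   # ~9.7%
print('counted as motifs: %.1f%% of the genome' % (100 * motif_coverage))  # ~0.3%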

The best available assay for accurately constraining the size of the whole regulatory
lexicon is DGF (digital genomic footprinting, which provides “footprints” of the occupancy
of transcription factors and other regulatory proteins on DNA thanks to the protection
against DNase digestion that they confer on the DNA they occupy), even if the footprints
derived from it are often also slightly extended relative to the actually occupied site.
Indeed, a very large number of footprints are identified; however, they occupy only 10 %
of the genome, and the transcription factor binding motifs residing in them cover 5 %,
i.e. numbers much smaller than the whole genome or even the majority of it.

Parrington notes that a large fraction of the identified putative regulatory elements
show little conservation between human and mouse, as revealed by the parallel mouse
ENCODE project (Yue et al. 2014; Cheng et al. 2014; Stergachis et al. 2014). This is
indeed a fascinating and very important observation, but its real significance is not
that it highlights the uniqueness of humans, as interpreted in the book, but that it
actually supports the view that mammalian genomes are shaped in large part by neutral
evolutionary forces (Villar et al. 2014; Marinov 2014).

This is the final major issue with the chapter—the results of the ENCODE Project are
presented as rejecting the junk DNA theory, without much real discussion of what that
theory is based on and why so many scientists hold it to be true. A brief overview
of its main components is in order here:

1. Based on early estimates of the mutation rate in humans and of the size of the human
genome, simple calculations done decades ago showed that only a small fraction of the
genome could be constrained at the sequence level, as otherwise there would be too many
deleterious mutations in every generation for the species to survive (Ohno 1972); a
back-of-the-envelope version of this calculation is sketched after this list. The estimates
of the mutation rate have been revised somewhat since then, and empirical estimates of
constraint within the human population have become available too (Ward and Kellis 2012),
but this has not raised the estimate of the fraction of the genome that could be selectively
constrained to anything remotely close to the majority of it.

2. The C-value paradox revealed wide disparities between genome sizes in different
organisms that are difficult to explain by any means other than most of the DNA in large
genomes being junk. More recently, the “onion test” has been formulated (Gregory 2007)
as a means of testing alternative theories aiming to explain these paradoxes (it consists
of asking such proposals to explain why the onion needs much more DNA than humans for
regulation, structural maintenance, or protection against mutagens, and why some species
of onion need five times more DNA than other members of the Allium genus for the same
purposes).

3. The discovery of selfish DNA elements and of the fact that they occupy a large fraction
of large genomes (Orgel and Crick 1980; Doolittle and Sapienza 1980).

4. The understanding of the limits on the power of natural selection imposed by the
population genetic environment of a species, and in particular by the effective population
size (Ne), which determines the relative influence of selection and drift in the evolution
of a lineage. It so happens (partly for obvious ecological reasons having to do with
the physical size of organisms) that large-bodied multicellular organisms are among
the lineages with the lowest Ne, meaning that the power of natural selection is weakest
within their populations, which readily explains many of the nonadaptive or even maladaptive
features of their genomes (Lynch and Conery 2003; Lynch 2007a, b), such as the presence
of large amounts of junk DNA.
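
A back-of-the-envelope version of the mutational load calculation from point 1 above can
be sketched as follows; the mutation rate and genome size are round present-day figures,
and the judgment of how many new deleterious mutations per generation is tolerable is left
deliberately loose.

MUTATION_RATE = 1.2e-8   # new mutations per bp per generation, approximate
GENOME_SIZE   = 3.1e9    # haploid human genome size, bp

def new_deleterious_per_generation(constrained_fraction):
    """Expected new deleterious mutations per haploid genome per generation,
    assuming mutations in constrained sequence are deleterious."""
    return MUTATION_RATE * GENOME_SIZE * constrained_fraction

for frac in (0.02, 0.10, 0.80):
    print('%2.0f%% constrained -> ~%4.1f new deleterious mutations per generation'
          % (100 * frac, new_deleterious_per_generation(frac)))
# ~37 new mutations arise per haploid genome each generation in total;
# if 80% of the genome were under sequence-level constraint, ~30 of them
# would be deleterious, an implausibly heavy load.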

Regrettably, there is no engagement with these arguments in the book, when in fact the
ENCODE data fit comfortably within that framework, while at the same time providing
a much richer understanding of the process of genome evolution. For example, the complexity
of the regulatory apparatus in metazoans and its rapid evolution, as evidenced by its
divergence between rodents and primates, revealed by ENCODE and mouse ENCODE and by other
studies (Villar et al. 2014), can be understood as a consequence of the low-Ne population
genetic environment of these organisms, which facilitates the evolution of new regulatory
elements and the complexification of regulatory networks, as many of the intermediates
in the process are either effectively neutral or maladaptive (Lynch 2007c) and are not
as readily tolerated in organisms with large Ne. These lines of reasoning, and the more
general concept of constructive neutral evolution (Stoltzfus 1999, 2012; Lukes et al.
2011), are entirely absent from the book.
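
The population-genetic reasoning invoked above can be made concrete with the standard
drift-barrier approximation: selection can only “see” a mutation whose selection coefficient
exceeds roughly 1/(2Ne). The Ne values in the sketch below are rough, assumed orders of
magnitude used purely for illustration.

def effectively_neutral_threshold(ne):
    """Approximate |s| below which drift overwhelms selection."""
    return 1.0 / (2.0 * ne)

for lineage, ne in (('bacteria', 1e8), ('invertebrates', 1e6), ('humans', 1e4)):
    print('%-13s Ne ~ %.0e -> |s| < %.1e is effectively neutral'
          % (lineage, ne, effectively_neutral_threshold(ne)))
# The smaller Ne, the larger the class of slightly deleterious insertions and
# spurious regulatory elements that can drift to fixation, which is the basis
# of the argument that large-bodied organisms accumulate junk DNA.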

Chapter 7, “The Genome in 3D”, moves on to recent work on characterizing the three-dimensional
organization of the genome and its role in gene regulation. This is a very interesting
topic, and Parrington does a reasonably good job of presenting some of the basics, but
it has no real relevance to the question of whether there is junk DNA or not. First,
if genes and regulatory elements are separated by large physical distances in linear
space, it becomes in a way a necessity for complex gene regulation to happen in 3D; and
second, the fact that the genome is folded in a complex and regulated manner does not
in any way imply that all of it is functional, as all of that can be accomplished with
a small fraction of it serving as anchor points for chromatin looping and for nuclear
matrix and nuclear lamina attachment.

Chapter 8, “The Jumping Genes”, is dedicated to transposable elements (TEs). The story
of their discovery by Barbara McClintock and a brief exposition on the main classes
of TEs are followed by examples of both the negative effects they often have on their
host and of TEs being exapted into various functions in the cell. Of course, these
examples do not mean that all TEs are functional, and fortunately, Parrington does
not make that claim.

The chapter on TEs also serves as a prelude to Chapter 9, “The Marks of Lamarck”, the
main idea of which is that much evidence has accumulated in recent years for Lamarckian
evolution being a real phenomenon. This is indeed true (Koonin and Wolf 2009); however,
the chapter does not discuss the bona fide examples of Lamarckian evolution, such as the
CRISPR systems of prokaryotes, but focuses almost entirely on epigenetic inheritance in
metazoans, a phenomenon that is fascinating but still very poorly characterized, and very
far from being proven to play a significant role in vertebrate evolution.

Chapter 10, “Code, Non-Code, Garbage, and Junk”, revisits the subject of junk DNA,
again presenting numerous appeals for much of it being functional and playing a regulatory
role. Not much needs to be said about it, except that it also features a bizarre argument
for the functionality of pseudogenes. One would have expected the ceRNA (competitive
endogenous RNA) hypothesis (the idea that noncoding RNAs can regulate the expression
of other RNAs by acting as “sponges” for miRNAs; Salmena et al. 2011) to be used for
that purpose, even though powerful arguments have been presented against it (Denzler et
al. 2014). Instead, Parrington brings up pseudoenzymes, which are proteins that clearly
belong to a family of functional enzymes but lack catalytic activity. However, pseudoenzymes
are not pseudogenes at all, as they are the normal products of functional genes!

Chapter 11, “Genes and Disease”, is dedicated to the genetic basis of disease. Somewhat
surprisingly, it devotes very little space to the exciting area of research emerging
at the intersection of the ENCODE results and genome-wide association studies (GWAS),
given that this intersection has been one of the major accomplishments of the ENCODE
Project, which is also the inspiration for the book.

Chapter 12, “What Makes Us Human?”, talks about human evolution, in particular in
the light of paleogenomics, while Chapter 13, “The Genome That Became Conscious”,
discusses the cellular foundations of human brain function, with emphasis on epigenetics
and gene regulation.

The concluding chapter, “The Case for Complexity”, reiterates how much more complex
genome biology is than previously thought. This is indeed the case, and it is also
true that the last decade has seen a technological revolution that has allowed us
to dig deeper than ever before into it. Popular books that accurately convey that
complexity in an accessible manner are much needed. However, The Deeper Genome misses
the opportunity to be the book that fills that gap, by making the unwarranted conclusion
that all of the genome is functional its core message and by overhyping the importance
of the findings it discusses. In short, the argument boils down to the following:

1. Junk DNA theory predicts that most of the genome would be completely biochemically
inert.

2. Biochemical activity can be equated with function in the traditional sense of the
word.

3. Human genome biology is extremely complex and much of the genome shows at least
a trace of biochemical activity.

4. Therefore junk DNA theory has been falsified.

However, the premises do not hold—junk DNA theory predicts no such thing, biochemical
activity can only identify candidate functional elements, and the complexity of genome
biology is not mutually exclusive with most of the genome being junk.