Relationships between putative G-quadruplex-forming sequences, RecQ helicases, and transcription

PQS mapping near TSS

GENCODE version 7 gene annotations (produced using GRCh37) were downloaded from www.gencodegenes.org/releases/7.html. The list of annotations was filtered to remove all entries with a tag of “cds_start_NF”
or “low_sequence_quality”, leaving a set of 156,253 GENCODE v7 transcripts. From this
list of transcripts, 139,758 unique transcription start sites (TSS) were identified,
representing 59,005 unique genes. PQS were then mapped to the 2 kbp upstream and downstream
of identified TSS using PQS midpoint reference genomic coordinates (Additional file
1). R code for this procedure is provided in Additional file 2. A total of 88,058 unique PQS were mapped to within 2 kbp of a TSS. Since many genes
have multiple TSS located within a few kbp of each other, some PQS were counted multiple
times, with a total of 251,386 PQS mapping assignments to known TSS (Additional file
3).

BS and WS gene expression data

Differential gene expression data for Bloom Syndrome patient fibroblasts was obtained
from the supplemental information section of reference 16]. Genes in this dataset had been identified by Affymetrix GeneChip Human Exon 1.0
ST arrays using manufacturer-recommended protocols, with the criterion for differential
expression set at???1.5 absolute fold expression change with an adjusted p-value??0.05
and FDR??0.1. Datasets for genes differentially expressed in Werner Syndrome patient
fibroblasts were obtained from reference 17] uploaded to GEO accession GSE48761. Genes in this dataset had been identified using
Human Gene 1.0 ST Array (Affymetrix) using standard procedures. Datasets were downloaded
from the GEO repository and processed in R using microarray analysis packages available
from Bioconductor. Packages included hgu95av2cdf, hgu95av2.db, limma, marray, affy,
affyQCReport, and affyPLM. WS datasets were normalized using the robust multiple-array
average (RMA) algorithm, and genes differentially-expressed in WS patient samples
were identified with the criterion of???1.5 absolute fold expression change compared
to controls, with an adjusted p-value of??0.05 and FDR??0.05 (Additional file 4).

Analysis of PQS in genes differentially expressed in BS and WS

Genes differentially-expressed in BS and WS were divided into up- and down-regulated
gene sets. All TSS were identified in genes differentially-expressed in BS and WS,
together with all PQS positioned within 2 kbp of these TSS. PQS were annotated for
sense or antisense strand and positions assigned in 200 bp bins repeated at a 10 bp
interval from ?2 kbp to +2kbp relative to TSS. This approach was also applied to map
all PQS in the human genome, and an additional 100 times in a statistical bootstrapping
method using a randomly-generated collection of genes of the same size as the test
dataset. Test datasets and randomly-generated datasets were compared to the genome-wide
dataset using a p-value generated from the prop.test function in R. P-values and the ratio of mean PQS per TSS for the comparison between the test datasets
and genome-wide controls were plotted as a function of bin position relative to the
TSS. The threshold for statistical significance was picked to be the p-value below which a data point in the randomly generated dataset has a 1 % chance
of false rejection of the null hypothesis. FRD were calculated as the ratio of predicted
number of false positives data points (3.81 per 381 data points) to the number of
data points in the test dataset that pass the threshold p-value. R code for this analysis can be found in Additional file 5.

Epigenetic prediction of gene expression

The generation of predictive models for gene expression based on epigenetic data followed
a method similar to that previously described 19]. Epigenetic and gene expression data were obtained from the ENCODE project through
the NCBI Gene Expression Omnibus online repository (accession GSE34448, GSE32970,
and GSE29611). Cell lines used in the epigenetic modeling of gene expression include
H1-hESC (embryonic stem cell), HeLa-S3 (cervical cancer), K562 (immortalized myelogenous
leukemia), HUVEC (human umbilical vein endothelial cell), HepG2 (hepatocellular carcinoma),
NHEK (normal human epidermal keratinocyte), and GM12878 (EBV-transformed lymphoblastoid
cell). For each of these cell lines, genome-wide tracks for histone modifications
H3K9ac, H3K4me3, H3K4me2, H3K27ac, H3K79me2, H3K36me3, H3K4me1, H3K27me3, H4K20me1,
histone variant H2A.Z, and chromatin accessibility (quantified by digital DNAse I
hypersensitivity) were obtained, along with gene expression datasets quantified by
cap analysis of gene expression (CAGE) technology.

GENCODE version 7 transcript annotations were used to map individual CAGE-quantified
transcript levels to known TSS. Total transcriptional activity for each TSS was calculated
as the sum of the expression values for all transcripts that originate from that TSS.
Since many genes contain multiple TSS, only the TSS with the highest aggregate expression
level for each of the 59,005 unique genes in the genome was retained for analysis.
R code for these data manipulations can be found in Additional files 6, 7, and 8.

The GC fraction for the 5 kbp upstream of each of the identified strongest TSS was
calculated using tools in the R package Biostrings produced by the open source Bioconductor
software project. Computation for this part of the analysis was run on the Mayo Research
Computing Facility (RCF) shared-resource, Beowulf-style Linux cluster using R version
3.0.2, with scripts written to accommodate batch mode execution managed by the Open
Grid Engine open-source batch-queuing system.

ENCODE data for DNAse I hypersensitivity, histone modification, and histone variant
tracks were obtained from the GEO repository as described above. Track signal files
were downloaded in .bigwig format and processed using utilities in the R package rtracklayer
from Bioconductor. Track signals within 1 kbp upstream and downstream of each TSS
were extracted and the signal within this region was split and averaged into twenty
100-bp bins, spanning from 1 kbp upstream to 1 kbp downstream of each TSS. R code
for this procedure can be found in Additional files 9 and 10. For each of the 20 bins constructed for each genomic track, a correlation coefficient
was calculated between the log
2
-transformed bin signal (with a small pseudocount added to avoid the log
2
(0) issue) and log
2
-transformed gene expression signal quantified via CAGE, excluding TSS with a expression
value of 0. For each genomic feature, the bin that had the highest correlation with
expression was selected as the “best bin” for analysis. The optimal pseudocount value
for each genomic track was determined by repeating the correlation analysis described
above, but with pseudocounts ranging from 0.25-5 % of the maximal binned signal for
that track feature. For each genomic track, the best pseudocount and “bestbin” combination
was selected by identifying the bin and pseudocount combination that resulted in the
highest absolute correlation with gene expression. R code for this procedure can be
found in Additional file 11.

“Bestbin” and pseudocount combinations for each epigenetic track, along with the GC%
5 kbp upstream of each TSS were then used to construct predictive models of gene expression.
TSS with missing values for any of the genomic tracks were excluded, leaving between
4,011 and 9,614 TSS, depending on the cell line. Datasets for the cell lines used
in the analysis can be found in Additional files 12, 13, 14, 15, 16, 17, and 18. The subset of TSS containing no PQS within 2 kbp of the TSS was then divided into
equal training and test datasets using a random sampling approach. The subset of TSS
containing at least one PQS was included in the test dataset. All data channels in
training and test datasets were normalized to be mean-centered at 0 and to have a
standard deviation of 1. A Bayesian linear regression model for predicting gene expression
from epigenetic parameters and GC fraction was then generated using the training dataset
and the R package tgp. This model was then applied to the test dataset (not used in
the training of the model) to generate gene expression predictions.

Modeling PQS correlation with transcription

For each TSS in the epigenetic modeling of gene expression test dataset, a prediction
error was calculated as the CAGE-determined TSS-specific expression value minus the
Bayesian linear regression model prediction from the epigenetic data. The portion
of the test dataset TSS containing at least one PQS was extracted and the position
and strand of the nearest PQS to each TSS was determined. Using an iterative approach,
TSS-specific prediction error data for sense and antisense strands were aggregated
based on the location of the nearest PQS, using a 200 bp bin size, and running 2 kbp
upstream to 2 kbp downstream at an iteration interval of 10 bp. For each bin, the
distribution of prediction errors was compared to the prediction errors of the TSS
in the test dataset containing no PQS using the oneway.test() function in R for one-way
ANOVA. As a statistical control to assess the likelihood of obtaining a low p-value in the comparison by random chance, a statistical bootstrapping approach was
employed in which the prediction errors in the PQS dataset were replaced with random
prediction errors from the test dataset, and the p-values were calculated as above. This comparison was repeated 100 times and the threshold
p-value for determining statistical significance of the test dataset comparison was
picked to be the p-value below which a data point in the control dataset has a 1 % chance of false rejection
of the null hypothesis. P-values in the test dataset below this threshold value were selected as statistically
significant. FDR were calculated as the ratio of significant p-value data points in the test dataset compared to the average number of false positives
in the control datasets (3.81 false discoveries per 381 data points). R code for this
entire analysis can be found in Additional file 19.

PQS motif analysis

The number of G stacks containing at least 3 contiguous guanine nucleotides in each
PQS within 2 kbp of a TSS was calculated for all human genes using a pattern-matching
algorithm. The nucleotide fractions of adenine, thymine, and cytosine in each PQS
were calculated using the package Biostrings. For the subset of PQS containing 4 G
stacks, with each G stack containing 3 bases, lengths of loop sequences between G
stacks were also calculated. In order to avoid ambiguity in this analysis, it was
required that the first and last base in a loop sequence not be guanine. Computation
was performed on the Mayo Research Computing Facility (RCF) shared-resource, Beowulf-style
Linux cluster using R version 3.0.2, with scripts written to accommodate batch mode
execution managed by the Open Grid Engine open-source batch-queuing system.

PQS were binned based on position, using a 200 bp bin width calculated at a 50 bp
interval, spanning from 2 kbp upstream to 2 kbp downstream of the TSS. Subsets of
PQS data for genes up- and down-regulated in BS and WS were extracted, and PQS features
(total length, loop lengths, number of G-stacks, fractions adenine, cytosine, and
thymine) were compared individually between genes differentially expressed in BS and
WS and the remainder of the genome, bin by bin, using one-way ANOVA. Comparisons were
done independently for PQS in the sense and antisense DNA strands. This bin-based
comparison was repeated 100 times in a statistical bootstrapping method using a randomly-generated
collection of genes of the same size as the test dataset. The threshold for statistical
significance was picked to be the p-value below which a data point in the randomly generated dataset has a 1 % chance
of false rejection of the null hypothesis. False discovery rates were calculated as
the ratio of predicted number of false positives data points to the number of data
points in the test dataset that pass the threshold p-value. R code for this analysis can be found in Additional file 20.

Self-organizing map multidimensional classification of PQS

From the entire human genome, the subset of PQS with 4 G-stacks and 3 loops within
2 kbp of a TSS was selected, excluding PQS in which the first or last base in a loop
was guanine. This selection criterion reduced the number of surveyed PQS from 88,058
to 17,795. Each PQS in this selected dataset was classified for total sequence length,
loop lengths, and fractions of adenine, cytosine, and thymine. Self-organizing map
artificial neural network analysis was implemented in R using the package kohonen.
Maps consisting of 25 nodes in a hexagonal grid were trained using the PQS dataset,
100 iterative presentations of the data to the model, and a learning rate with linear
decline from 0.05 to 0.01. The average number of PQS per gene per node was calculated
and compared to the same calculation repeated for the PQS and gene subset for genes
up- and down-regulated in BS and WS. Log
2
enrichment ratios were calculated for each node in the BS and WS datasets. As a means
to ascertain whether the enrichment of PQS in certain nodes could arise by chance,
a statistical bootstrapping method was employed in which enrichment ratios of nodes
on the PQS map were calculated for 100 randomly-generated gene sets of the same size
as the BS and WS datasets. Mean and 2? values were calculated node by node for these
random distributions. Statistical significance for enrichment ratios of nodes in the
BS and WS datasets was assigned on the basis of lying outside of the 95 % CI for the
random distributions. R code for this analysis can be found in Additional file 21. Motif classifications of all genomic PQS can be found in Additional file 22.