Seqinspector: position-based navigation through the ChIP-seq data landscape to identify gene expression regulators

Common regulators of co-expressed genes

seqinspector can be used for different types of analyses. One is to study clusters
of co-expressed genes to find their putative molecular regulators. To demonstrate
this functionality, we utilized the results from gene expression profiling of mouse
astroglial primary cultures treated with dexamethasone [13]. Dexamethasone is an agonist of a nuclear transcription factor, the glucocorticoid
receptor (GR). The list of 24 gene symbols for dexamethasone-regulated transcripts
was submitted to seqinspector (seqinspector.cremag.org) (see Additional file 1). The genome assembly was set to Mus musculus (mm9), and the default background was used. seqinspector correctly identified GR
(P = 0.0024; Bonferroni corrected) as a true-positive regulator of the genes from the
list (see Table 1) and provided a profile of the GR binding sites for each analyzed gene.

Table 1. Top five enriched ChIP-seq tracks in the promoters of genes regulated by dexamethasone

Protein-protein interactions

seqinspector can be applied to study protein-protein interactions. seqinspector
calculates the average coverage for all stored ChIP-seq tracks for the submitted genomic
ranges. The obtained coverages are then compared with a reference dataset using a
two-sample t-test followed by correction for multiple testing. To demonstrate this functionality,
we utilized data from a ChIP-seq analysis of SP1 binding in GM12878 human lymphoblastoid
cells (data available at GEO GSM803363). We submitted the top 358 peaks (3000 signal
value) from this dataset to seqinspector as genomic ranges in BED format (Additional
file 2). We submitted the lowest 358 peaks as background. We used the Homo sapiens hg19 assembly. The transcription factors ATF3, SP2, NFYA, NFYB, E2F4, IRF1 and SRF
indirectly bind to SP1 binding sites [14]. seqinspector correctly identified the enrichment of the following factors as interacting
proteins: (1) ATF3 (P = 0.013), (2) SP2 (P = 1.7 × 10^-33), (3) NFYA (P = 8.9 × 10^-47), (4) NFYB (P = 9.6 × 10^-41), (5) E2F4 (P = 1.8 × 10^-16), (6) IRF1 (P = 2.2 × 10^-7) and (7) SRF (P = 1.2 × 10^-6). The tool also identified other transcription factors binding to the same sites,
including IRF3 (P = 3.0 × 10^-43), C-FOS (P = 4.0 × 10^-42) and CHD2 (P = 6.8 × 10^-41). All of the presented P-values are Bonferroni corrected.
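
The comparison described in this section (mean coverage of the submitted ranges versus a reference set, a two-sample t-test per track, then Bonferroni correction) can be illustrated with a minimal sketch. This is not the seqinspector implementation: the Welch (unequal-variance) form of the t statistic, the normal approximation used for the P-value, the function names and the coverage values are all assumptions made for illustration. The normal approximation is only reasonable for large samples such as the 358 ranges used here; for small samples a t distribution should be used instead.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def welch_t(query, reference):
    """Two-sample t statistic (Welch form, an assumption) for mean coverage."""
    se = sqrt(variance(query) / len(query) + variance(reference) / len(reference))
    return (mean(query) - mean(reference)) / se


def approx_two_sided_p(t):
    """Two-sided P-value from a normal approximation to the t distribution.

    Assumption: adequate only for large samples.
    """
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


def bonferroni(p_values):
    """Multiply each P-value by the number of tests, capped at 1."""
    m = len(p_values)
    return {name: min(1.0, p * m) for name, p in p_values.items()}


# Hypothetical per-range coverages for two tracks: (query ranges, reference ranges).
# The numbers are invented and kept tiny for readability.
tracks = {
    "track_A": ([9.0, 11.0, 10.0, 12.0, 8.0], [1.0, 2.0, 1.5, 2.5, 1.0]),
    "track_B": ([3.0, 2.0, 4.0, 3.0, 3.0], [3.0, 3.5, 2.5, 3.0, 3.0]),
}

raw = {name: approx_two_sided_p(welch_t(q, r)) for name, (q, r) in tracks.items()}
corrected = bonferroni(raw)

for name, p in corrected.items():
    print(f"{name}: Bonferroni-corrected P = {p:.3g}")
```

In this toy input, track_A has clearly higher coverage in the query ranges than in the reference ranges, whereas track_B has identical means in both sets, so only track_A remains significant after correction.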

Cell type enrichment

Another straightforward application of seqinspector is the study of transcript expression
in various tissue and cell types. For this purpose, tracks generated by the FANTOM5
project using cap analysis of gene expression were added to the seqinspector database
(264 tracks for human and 81 for mouse after manual curation). This type of analysis
reveals active transcription start sites and gene variants expressed in particular
cell populations. Lists of genes or transcripts derived from microarray or RNA-seq
profiling experiments can be inspected for cell-type-specific gene expression. To
provide an example of this utility, we used results of gene expression profiles in
different cellular compartments of the nervous system [15]. Submission of neuron- and astrocyte-specific lists of genes (Additional file 3) confirmed cell-type enrichment and indicated which transcriptional start sites are
utilized in these two cell populations (Fig. 2a). For astrocyte-specific genes, only one significantly over-represented track was
noted – the Hippocampal Astrocytes CAGE track generated by the FANTOM5 consortium
(CNhs12129.11709-123B8, P = 0.0034). For neuron-specific genes, 14 over-represented tracks were noted, including
13 FANTOM5 tracks generated from neural tissue or isolated neurons (e.g., Olfactory
brain, CNhs10489.18-22I9, P = 2.4 × 10^-5, and Raphe neurons, CNhs12631.11722-123D3, P = 3.1 × 10^-4) and one ENCODE track for the neuron-restrictive silencer factor (NRSF, GSM915175,
P = 0.026).

Fig. 2. Identification of cell-type-specific transcript expression start site. The plots display
the mean coverage for the selected gene sets in various cell types based on the CAGE
data. The x-axis represents the genomic region around transcription start sites from
5’ to 3’. The y-axis represents the coverage that has been normalized to the number
of aligned tags per million. (a) Coverage histograms for neuron-specific (blue) and astrocyte-specific (red) gene lists from Zhang et al. [15]. The upper panel displays the mean coverage of raphe neurons CAGE tags, whereas the
bottom histogram refers to the hippocampal astrocyte CAGE tags. (b) Coverage histograms for genes up-regulated (blue) and down-regulated (red) in ruptured intracranial aneurysm from Pera et al. [16]. The upper panel presents the mean coverage of neutrophil CAGE tags, whereas the
bottom (average) histogram refers to CD8+ T lymphocytes CAGE tags

Another possible use of seqinspector is to estimate how transcriptional
alterations are distributed among various cell populations, based on the results of
gene expression profiling in a heterogeneous tissue. To demonstrate this functionality,
we used a list of genes from expression profiling in whole-blood samples obtained
from patients after ruptured intracranial aneurysms and a control group (Additional
file 4) [16]. The lists of up-regulated and down-regulated genes were compared against each other.
seqinspector identified over-representation of CD8+ T-lymphocyte-specific transcripts
among the down-regulated genes (CNhs12178.12191.129B4, P = 0.005) and neutrophil-specific genes among the up-regulated genes (CNhs10862.11233.116C9,
P = 0.12) (Fig. 2b). All of the presented P-values are Bonferroni corrected.

Comparison with other tools

To demonstrate the effectiveness of seqinspector, we compared this tool with available
ChIP-seq data-based online tools (CSCAN [10], ENCODE ChIP-Seq Significance Tool [17] and Enrichr with ENCODE ChIP-seq and ChIP-x gene set libraries [18]) as well as tools based on in silico predicted transcription factor binding sites (oPOSSUM 3.0 [19] and Cremag [12]). For this purpose, we used the following five example sets of genes regulated by
various transcriptional mechanisms, one from each tool excluding Enrichr (no list
of genes with a specified transcription factor was provided with this tool): (a) an
example gene set provided in this paper, GR-dependent genes regulated in mouse; (b)
a list of human genes regulated by dexamethasone from the ENCODE ChIP-Seq Significance
Tool website; (c) BDP1 target gene set from the CSCAN website; (d) liver-specific
gene set from the oPOSSUM 3.0 website and (e) a list of genes regulated by SRF from
the Cremag website. We converted the original lists into Ensembl Transcript IDs and
gene symbols using BioMart [20] to meet the tool-specific input requirements. We used the default settings for all
of the tools for the comparison. As a score, we used the rank of the expected transcription
factor in the obtained results: one to ten points were awarded, with the maximum of ten points
given for the first position on the list (Fig. 3). seqinspector received the highest summary score (score = 40) followed by Enrichr
using ENCODE data (score = 37) and oPOSSUM 3.0 (score = 27). Thus, the seqinspector
tool, which is based on parametric statistics for enrichment calculation, was comparable
to other promoter analysis methods. All of the gene sets with the original IDs, RefSeq
numbers, promoter sequences and results are provided in Additional file 5.
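
The rank-based scoring used in this comparison can be written down as a short sketch. The function names and the example ranks below are hypothetical and are not the values behind Fig. 3. Awarding zero points when the expected factor ranks below tenth place or is absent from the results is an assumption, since the text only specifies the points for ranks one to ten.

```python
def rank_score(rank, max_points=10):
    """Return 10 points for rank 1, 9 for rank 2, down to 1 point for rank 10.

    Assumption: ranks beyond the tenth, or a missing factor (None), score 0.
    """
    if rank is None or rank < 1 or rank > max_points:
        return 0
    return max_points - rank + 1


def summary_score(ranks):
    """Sum the per-gene-set scores for one tool."""
    return sum(rank_score(r) for r in ranks)


# Hypothetical ranks of the expected factor in five gene sets for one tool;
# None means the factor did not appear in the results.
example_ranks = [1, 2, 1, None, 4]
total = summary_score(example_ranks)  # 10 + 9 + 10 + 0 + 7
print(total)
```

With five gene sets, the maximum attainable summary score under this scheme is 50.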

Fig. 3. Comparison of seqinspector to other online tools. The heatmap presents the scores
from seqinspector, Enrichr, oPOSSUM, Cremag, CSCAN and ENCODE ChIP-Seq Significance
Tool (in columns) for five selected gene sets (in rows). The scores were calculated
based on the rank in the results of the expected transcription factor (on the right) with ten points for first rank (dark green color), nine points for second and down to one point for the tenth rank (white color). The sum of the scores is presented at the bottom. The tools are ordered by the
sum of their scores in decreasing order