DNA microarray integromics analysis platform

Case study 1: functional links between gene products and metabolites

Publicly available lipidomics and transcriptomics data from a murine nutrigenomics
study 48] were used to demonstrate the performance of the proposed analytical protocol using
two independent data sets. The data consist of the expression levels of 120 genes
(X) and the concentrations of 21 fatty acids (Y), both spanning a total of 40 observations
of several genotypes and diets.

Following data import, analytical procedures were executed using the Integromics platform
(Fig. 4). Depending on the number of observations compared to the number of variables (dependent
and independent), appropriate methods of canonical correlation (i.e., classical canonical
correlation, such as CCA or rCCA) may be applied.

Fig. 4. The “Analysis Setup” panel enables the selection of analytical methods, such as canonical
analysis (CCA or rCCA) or PLS

CCA is appropriate for cases involving a relatively large number of observations and
a low number of correlated variables (dependent and independent) in the two datasets.
rCCA is a modification of the classical canonical correlation method that permits
analysis of a small number of observations and a large number of variables in both
sets. PLS is not restricted by the number of observations or highly correlated variables.

Case study using YACCA (canonical analysis) and FRCCA (regulated canonical analysis)
packages in R.

The canonical analysis procedure (CCA) generates specific data corresponding to the
analysed datasets. This case study focused on the most basic parameters of canonical
analysis and their interpretation. Conducting CCA initially requires the removal of
correlated predictors (independent variables, or the transcriptomics data (X)) and
responses (dependent variables, or the lipidomics data (Y)) as highly correlated variables
may significantly skew the results. The correlation threshold is set arbitrarily and
depends on the type of data. In the life sciences domain, the correlation coefficient
is considered to be greater than 0.7.

Once highly correlated data are removed from X and Y, the correlation between the
dependent and independent variables can be calculated. The Integromics platform produces
a correlation map for X and Y, identifying the variables that display strong correlations
(Fig. 5).

Fig. 5. Map of the Pearson correlations between two datasets: gene transcripts (X) and fatty
acids (Y). The radius of the circle indicate the strength of correlations. Negative
correlations are marked in blue while positive correlations are marked in red

The appropriately prepared data sets were analysed using the canonical correlation
procedure which identifies linear combinations of variables in the analysed sets such
that the correlations between them are as high as possible. The results of the CCA
procedure are accessible as text (numerical data) or PNG (figures) files several seconds
after the analysis is complete.

Detailed information about the CCA results, such as the canonical correlations, the
X coefficients, the Y coefficients, the structural correlations (loadings) for variables
X and Y, and the aggregate redundancy coefficients (total variance explained) are
accessible in the cca.fit-YACCA.txt file. Statistics that describe datasets X and
Y are saved in the summaryX-YACCA.txt and summaryY-YACCA.txt files respectively. The
significance of the calculated canonical correlation is performed using the F-test
(Rao’s F approximation), and these results are provided in the F_test_CCA.txt file.

The results obtained from the murine nutrigenomics datasets reveal high canonical
correlations (CV1:1, CV2:0.997, CV3:0.995). These canonical correlations were found
to be statistically significant at a significance level of 0.05. The calculated values
of aggregate redundancy were X | Y: 0.42 and Y | X: 0.89, indicating that Y explains
X in 42 % of cases and that X explains Y in 89 % of cases.

The platform supplies graphical files which present the results of the analyses. For
example, a diagram of the impact loadings in both sets is presented in Fig. 6.

Fig. 6. Graphical representation of the canonical loadings of the X (gene transcript) and
Y (fatty acids) datasets for the first canonical variates (CV1)

High canonical loading of a particular variable indicates the significance of that
variable in the interpretation of the canonical variate. In this case, among the gene
transcripts, the greatest impact on this model was obtained for genes PRG4 (?0.637),
Il2 (0.56), NR1I2 (?0.58), and RARB (0.59) and for fatty acids 28364 (?0.78), 28125
(?0.75), and 36023 (0.66). These results are reflected by the heights of the bars
in the graph.

As mentioned above, CCA should be performed for a large number of observations. If
the number of observations is lower than the number of transcriptomics or lipidomics
variables, classical canonical analysis is not appropriate (the number of observations
should be 10 times the number of variables). In the life sciences domain the obtained
data often displays a structure in which the number of observations (samples) is small
while the number of variables (e.g., transcripts of genes) is large. In such situations,
a modified classical canonical analysis method (rCCA) should be used.

For the analysed dataset, CCA can be used when the correlation threshold is at least
0.7 for the transcriptomics andlipidomics data. When r?=?0.8, CCA is not performed due to the insufficient number of observations compared
to the number of X and Y variables. In such casesrCCA can be performed instead. The
results of the rCCA calculation are available in a similar form to the CCA results.

The canonical correlation and the canonical weights and loadings for X and Y are accessible
in the my_res-FRCCA.txt file. Descriptive statistics are provided in the summaryX-FRCCA.txt
and summaryY_FRCCA.txt files. In FRCCA the results do not include information regarding
the significance calculations. The values of the canonical correlations for the first
three sets of canonical variables are as follows: CV1: 0.99; CV2: 0.98; and CV3: 0.98.
The graphical representation depicts the variables as points on a two-dimensional
plane in which the axes represent the canonical loadings, as shown in Fig. 7.

Fig. 7. Canonical variable representation

PLS procedure

For the integration of partially correlated transcriptomics and lipidomics data the
PLS procedure was used. The first step of the PLS procedure involves constructing
a predictive model. For this purpose the entire dataset was incorporated into the
PLS model.

To demonstrate the prediction values, a subset of murine nutrigenomics data were used
(50 transcriptomics and 10 lipidomics variables). The PLS model was fit to the 10
latent variables, followed by cross-validation. The root mean squared error of prediction
(RMSEP) parameters were calculated for the specific lipidomics variables (fit method:
kernelPLS; number of components considered: 10). This step yields information about
the quality of the fit of the model to particular components. In the analysed dataset,
some variables were classified as poor (RMSEP too high), such as ChEBI 16196, in which
RMSEP was 4.5 for ncomp?=?8 and even higher for ncomp??8 (Fig. 8). Other variables remained good, such as ChEBI 15756, exhibiting the lowest RMSEP
in 5 components.

Fig. 8. Cross-validated RMSEP curves for the murine nutrigenomics dataset (results for the
first 9 lipidomics variables). “CV” is the cross-validation estimate while“adjCV”
(for RMSEP) is the bias-corrected cross-validation estimate. They can only be calculated
if the model has been cross-validated

To confirm the quality of the variables included in this model, cross-validated predictions
using 5 components (the components for which most of the variables displayed the lowest
RMSEP) were visually compared to the measured values (Fig. 9).

Fig. 9. Cross-validated predictions for the murine nutrigenomics dataset (results for the
first 9 lipidomics variables)

The charts of some of the variables exhibited poor alignment to the corresponding
target profiles – e.g., fatty acid ChEBI ID 36036 (Fig. 9). This result suggests that the X variables did not appropriately describe the given
lipidomics variable. Another graphical representation of the results of the PLS procedure
consists of a pairwise plot of score values. Such plots help reveal patterns, groups
and outliers. In our example (5 components plotted), there was no clear indication
of any grouping or outliers (Fig. 10).

Fig. 10. Score plot for the murine nutrigenomics dataset

The explained variances for each component in this example were as follows: Comp1?=?32 %,
Comp2?=?19.5 %, Comp3?=?5.5 %, Comp4?=?13.2 %, Comp5?=?6.7 %, Comp6?=?4.2 %, Comp7?=?3.6 %,
Comp8?=?2.1 %, Comp9?=?1.4 %, and Comp10?=?1.1 %. The canonical loadings are crucial
for the interpretation step as higher loading values indicate greater contribution
of the given variable to the model. Figure 11 presents the top 10 genes, ranked according to their loading values.

Fig. 11. Top 10 genes ranked by their loading values

Case study 2: discovery of significant miRNA interactions

The other use case concerns the Normal Human Dermal Fibroblasts (NHDF) experiment
49]. This study compared the differential expression of the entire transcriptome, including
miRNAs, between different cell lines (12 samples): a control group of NHDF cells (NHDF),
NHDF cells treated with LPS (NHDF LPS), NHDF cells infected with porcine endogenous
retroviruses (PERVs) (NHDF PK15) and NHDF PERV-infected cells treated with LPS (NHDF
LPS PK15). For each sample, the paired expression levels of both mRNA and miRNA were
measured using AffymetrixGeneChip miRNA 2.0 and HGU133A2 microarrays. This study measured
the changes in the expression levels of both mRNAs and miRNAs between specific groups
and, when changes were found, attempted to establish which RNAi process regulated
their expression in PERV-infected cells either with or without LPS treatment.

Preliminary miRNA microarray data preparation was performed using the Affymetrix miRNA
QCTool 50]. The procedure included background detection, background adjustment (BC-CG Adjust),
quantile normalization, addition of small constants (CVStab), and summarization using
a median polishing approach. The presented platform can be used to import such pre-normalized
data for further analysis. Normalization of the mRNA microarray data were performed
using the presented platform, including RMA background correction and quantile normalization.

The next step involved microarray significance analysis to calculate significant differences
in the transcript levels. Statistically significant differences in the miRNA levels
were found for only two pairs of cell groups, NHDF PK15 (P) versus NHDF (K) giving
hsa-let-7e, hsa-miR-199a-5p, hsa-miR-99a and NHDF LPS PK15 (LP) versus NHDF LPS (L)
giving hsa-miR-3197. The resulting list of statistically significant miRNA subsets
was selected using a Median of False Discovery Rate (FDR) threshold of 0 (the threshold
used is relative, which means that genes are significantly different at a level greater
than 0.05) for the prediction of putative miRNA targets among the differentially expressed
mRNAs.

The Integromics platform supports two miRNA target prediction methods. The first is
based on a single expression sample (group). Due to the low diversity of the compared
groups we selected the most liberal settings for the conservation of miRNA and target
sites (Fig. 12). However, the results were found to be insignificant (Fig. 13).

Fig. 12. Selection panel for miRNA target prediction based on a single expression sample

Fig. 13. Section of the output table listing the ProMISe for the NHDF expression levels

The available parameters of this procedure include:

the number of cells representing the highest probability. This parameter affects
the dimension of the output matrix; e.g., a value of 10 results in an output matrix
with diagonal length 10;

the granularity level of the probe set (either transcript or gene);

VB-GMM: the threshold that defines termination/convergence of the VB expectation
maximization (VB-EM) algorithm, which is used to optimize the parameters;

the maximum number of EM iterations of VB-GMM.

The ProMISe output tables for NHDF PK15 (P), NHDF LPS (L) and NHDF LPS PK15 (LP) were
very similar to the NHDF results (see Fig. 13). The maximum probability (approximately 0.2) was obtained for the same pair (TMSB10
and hsa-miR-1469) from each sample.

The final component of the workflow used TargetScore for miRNA target prediction based
on the variability of the expression of the compared groups. Using the TargetScore,
our system prepares sequence-based information, the context score and PCT for each
miRNA in the selected list. The exact approach depends on the selected level of analysis
at either the gene or transcript level. The system informs the user whether sequence-based
information was obtained for a given miRNA sample.

The applied settings and parameters are presented above. We selected the transcript
analysis level and the default values for the remaining settings (convergence threshold
of the VB-EM algorithm and maximum number of EM iterations). The peak size of the
resulting matrix was set at 20 (Fig. 14).

Fig. 14. Selection panel for miRNA target prediction based on expression variability

As a result of the above steps, we expected to answer the following questions: 1)
which transcript (gene) displays the highest probability of being a target of any
miRNAs; 2) which miRNA displays the highest probability of interfering with any transcripts;
and 3) which miRNA-transcript (gene) pair displays the highest probability of interacting
with each other?

The resulting tables present the peak probability values for the genes/transcripts.
The attached histograms help distinguish this limited “peak” dataset from the entire
population of results. If the “transcript” option is selected, the table also contains
other transcripts associated with a specific gene according to the described multi-transcript
protocol.

The full list of output files and miRNA target prediction charts based on the variability
of expression includes TargetScore probabilities calculated only for log(FC) input:

list of miRNA samples that contain no sequence-based information (context score or
PCT) in the TargetScan database

list of transcripts that are linked to more than one gene

vector of the peak-probability genes computed only using log(FC)

histogram of the TargetScore probabilities for the fold-change in expression and
TargetScore probabilities calculated using the sequence-based information:

table of cell types displaying the greatest probabilities of miRNA-mRNA interaction

histogram of the TargetScore-computed probabilities of miRNA-mRNA interaction

table of rows (genes/transcripts) containing the maximum values of the sum of the
probabilities (see Fig. 15)

Fig. 15. The most important part of result of TargetScore analysis NHDF PK15 (P) compared
to NHDF (K). (a) The Table of Max Probabilities showing cells displaying the greatest probability
of miRNA-mRNA interaction with NM_5824 (LRRC17 gene) marked. This transcript exhibits
the highest probability from among mRNA/miRNA pairs (b) Table of rows with Max Sum of Probabilities containing rows (genes/transcripts)
with the maximum values of the sum of probabilities, with NM_006122 (MAN2A2 gene)
and NM_18621 (the same MAN2A2 gene) transcripts highlighted. (c) Table of columns with Max Sum of Probabilities containing columns (miRNAs) with
the maximum values of the sums of probabilities (arranged in the order of decreasing
probability: hsa-miR-199a-5p, hsa-miR-99a, hsa-let7e). (d) Histogram of TargetScores containing the TargetScore-computed probabilities of miRNA-mRNA
interaction. (e) Histogram of Sum of Rows containing the sums of the TargetScore probabilities for
each row (gene/transcript)

histogram of the sums of the TargetScore probabilities for each row (gene/transcript)

table of columns (miRNAs) containing the maximum values of the sums of probabilities

The output files support the following interpretation. Transcript NM_5824 (LRRC17
gene) exhibits the highest probability among mRNA/miRNA pairs (this conclusion is
based on the full TargetScore results and the Table of Max Probability; see Fig. 15). The distribution of probabilities shows that the highest value is not an outlier.
This suggests that we could expect many miRNA/mRNA pairs with slightly lower probability
values. We select the transcripts with the best miRNA interaction probabilities from
among the full TargetScore results and from the table of rows listing maximum probability
sums (see Fig. 15). The selected transcript, NM_006122, is encoded by the MAN2A2 gene. The corresponding
histogram facilitates assessment of the specificity of each value. By analogy, we
could assess miRNAs by using maximum probability sum columns to select the highest
probability of interference with any transcripts. The resulting miRNAs, arranged in
the order of decreasing probability, are: hsa-miR-199a-5p, hsa-miR-99a and hsa-let7e
(see Fig. 15).