A. Orthologs linked either to pathogenicity or synergy with the host
Annotation of pathogenic bacterial strains
Our literature search detailed in the Methods section identified 949 bacterial strains
as pathogenic, meaning they had been reported in the literature as pathogenic to an
animal host at least once. The label pathogenic, as we use it, does not mean
that the bacterial strain will cause disease in an animal host under all circumstances.
Additional file 1 presents the list of bacterial strains deemed as pathogenic, with evidence provided
in the file in the form of references or database citations. This supplemental file
also contains labeling of pathogenic strains found as antibiotic resistant in the
Antibiotic Resistance Genes Database (ARDB) [25]. Table 1 shows a sample of the bacterial genera possessing both pathogenic and nonpathogenic
strains. Strains of the same bacterial genus often separated into pathogenic and nonpathogenic
clusters.
Table 1. List of bacterial genera highlighting the number of strains that have been associated
with the pathogenic state, as well as the strains associated with the commensal state
in the present study
There was no KEGG- or literature-recorded evidence of pathogenicity for the remaining
1578 decoded bacterial strain genomes in the KEGG database. Thus, they were deemed
nonpathogenic for this study. Additional file 2 lists these 1578 bacterial strains.
Orthology contents of pathogenic and nonpathogenic strains
Next, we created two sets of matrices for orthologous genes, one set for the genomes
of pathogenic strains and the other for the genomes of nonpathogenic strains. In each matrix,
columns correspond to bacterial strains and rows to orthologs; each entry records whether the ortholog is present
(1) or absent (0) in that strain. In total, 7194 different orthologs were
present in at least one of the genomes of the 2527 bacterial strains under study.
These large matrices are used for the abundance computations presented in this study
and are hence included as Additional file 3.
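The matrix construction described above can be sketched as follows. This is a minimal illustration only: the function name `build_presence_matrix`, the strain names, and the ortholog identifiers are hypothetical, and the actual matrices are those in Additional file 3.

```python
from typing import Dict, List, Optional, Set, Tuple


def build_presence_matrix(
    strain_orthologs: Dict[str, Set[str]],
    orthologs: Optional[List[str]] = None,
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Return (ortholog_ids, strain_ids, matrix).

    matrix[i][j] is 1 if ortholog i is present in strain j, else 0,
    so rows are orthologs and columns are strains.
    """
    strains = sorted(strain_orthologs)
    if orthologs is None:
        # Default row set: every ortholog seen in at least one strain.
        orthologs = sorted(set().union(*strain_orthologs.values()))
    matrix = [
        [1 if ortholog in strain_orthologs[strain] else 0 for strain in strains]
        for ortholog in orthologs
    ]
    return orthologs, strains, matrix


# Hypothetical toy input: strain name -> set of ortholog identifiers.
pathogenic = {"strain_A": {"ko1", "ko2"}, "strain_B": {"ko2"}}
nonpathogenic = {"strain_C": {"ko2", "ko3"}}

# Share one row ordering so the two matrices can be compared row by row.
all_orthologs = sorted(set().union(*pathogenic.values(), *nonpathogenic.values()))
_, path_strains, path_matrix = build_presence_matrix(pathogenic, all_orthologs)
_, nonpath_strains, nonpath_matrix = build_presence_matrix(nonpathogenic, all_orthologs)
```

Using one shared row ordering for both matrices is a design choice of this sketch; it makes the per-ortholog fractions in the two strain groups directly comparable.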
For a given ortholog, fractions of pathogenic and nonpathogenic bacterial strains
presenting the ortholog in their genomes are represented by the symbols Ap and Anp, respectively. The scatter diagram shown in Fig. 1 presents the Ap and Anp values for the 7194 orthologs present in bacterial strains. It appears that most
of the orthologs have comparable presence in both groups of strains, whereas a small
portion (red and green dots for pathogen-abundant and nonpathogen-abundant orthologs, respectively)
is biased towards one of the two phenotypes.
Fig. 1. Scatter diagram of relative abundance of 7194 orthologs found in 2527 decoded bacterial
genomes. The horizontal axis represents the percentage of nonpathogenic strains presenting
the ortholog (Anp) whereas the vertical axis represents the corresponding percentage in pathogenic
strain subpopulation (Ap). The pathogen-abundant (PA ≥ 4) and nonpathogen-abundant orthologs (PA ≤ 0.25) were marked in red and green, respectively. Note that PA = Ap/Anp
The histogram shown in Fig. 2 is another view of the data presented in the scatter diagram in Fig. 1. Here, we plotted the frequency of occurrence against the pathogen abundance score
log PA for all the orthologs under consideration. The parameter PA = Ap / (Anp + 0.0001) is a measure of the relative abundance of the ortholog in pathogenic strains.
In cases where Anp equals zero, the 0.0001 term in
the denominator keeps the ratio defined. The two tail ends of the distribution indicate those orthologs abundant
in pathogenic but rarely found in nonpathogenic strains (red) and vice versa (green). The
cutoff values we used (PA ≥ 4 and PA ≤ ¼), although somewhat arbitrary, were placed at the inner edges of the tails of the
histogram.
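The score and cutoffs defined above can be illustrated with a short sketch. The function names are hypothetical, the cutoffs are treated as inclusive by assumption, and returning negative infinity for log PA when Ap is zero is a choice made here rather than something stated in the text.

```python
import math

EPSILON = 0.0001  # small constant added to the denominator, as in the text


def pa_score(ap: float, anp: float, eps: float = EPSILON) -> float:
    """Pathogen abundance score PA = Ap / (Anp + eps)."""
    return ap / (anp + eps)


def log_pa(pa: float) -> float:
    """log10 of PA; -inf when PA is zero (an assumption of this sketch)."""
    return math.log10(pa) if pa > 0 else float("-inf")


def classify(pa: float, high: float = 4.0, low: float = 0.25) -> str:
    """Label an ortholog by its PA score, treating both cutoffs as inclusive."""
    if pa >= high:
        return "pathogen-abundant"
    if pa <= low:
        return "nonpathogen-abundant"
    return "common"


# Hypothetical fractions: present in 75% of pathogenic and 10% of
# nonpathogenic strains.
example_pa = pa_score(0.75, 0.10)
example_label = classify(example_pa)

# An ortholog absent from all nonpathogenic strains still gets a finite score.
exclusive_pa = pa_score(0.5, 0.0)
```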
Fig. 2. Histogram showing the frequency of occurrence of orthologs with respect to the pathogen
abundance score (PA). The two edges of the histogram (PA ≥ 4, PA ≤ 0.25) are marked in red and green, respectively
Shown in Fig. 3a are the overall characteristics of the PA distribution among the orthologs. In brief,
there were 229 pathogen-exclusive orthologs and an additional 379 orthologs for which PA ≥ 4, together representing about 8 percent of the orthologs found in our bacterial strain library. We deemed this combined group pathogen-abundant or pathogen-linked. The total
number of genes in the 608 pathogen-abundant orthologs was 18,982, indicating their
presence in a diverse set of bacterial species. Pathogen-exclusive orthologs comprised
only 1,518 of this set of genes, suggesting that most genes previously linked to pathogenicity
are not exclusive to disease-causing bacterial strains.
Fig. 3. Pie charts indicating the distribution of orthologs of the present study with respect
to ortholog abundance (3A) and the virulence factors presented by the VFDB web platform
(3B) as a function of the pathogen abundance score PA
The number of orthologs exclusive to nonpathogenic strains was much larger at 879,
and an additional 485 had PA ≤ ¼. The rest, a total of 5222 orthologs, were commonly present among pathogenic and
nonpathogenic strains. It is expected that these numbers will change as the number
of available bacterial genomes in the literature increases from thousands to tens of
thousands.
Next, we identified those orthologs in our list, a total of 1308, which were also
present in the Virulence Factor Database (VFDB), by matching either the gene names or the gene descriptions. As indicated in Fig. 3b, VFDB orthologs are significantly biased towards pathogen-abundant orthologs. Additional
file 4 lists the pathogen-abundant and nonpathogen-abundant orthologs in accordance with
the PA ranking, along with VFDB labeling if present in that database. Overall, our study
indicates the absence of a one-to-one match between known virulence factors and pathogen-abundant
orthologs.
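A minimal sketch of the name-or-description matching step is given below. The text does not state how strings were compared, so exact matching after case and whitespace normalization is an assumption here. The identifiers and records are hypothetical toy data, not actual VFDB entries.

```python
from typing import Dict, Iterable, List, Set, Tuple


def _norm(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(text.lower().split())


def match_to_vfdb(
    orthologs: Dict[str, Tuple[List[str], str]],
    vfdb: Iterable[Tuple[str, str]],
) -> Set[str]:
    """Return ortholog ids whose gene name OR description matches a VFDB record.

    orthologs: id -> (list of gene names, description)
    vfdb: iterable of (gene name, description) records
    """
    records = list(vfdb)
    vf_names = {_norm(name) for name, _ in records}
    vf_descriptions = {_norm(description) for _, description in records}
    matched = set()
    for ortholog_id, (names, description) in orthologs.items():
        name_hit = any(_norm(name) in vf_names for name in names)
        description_hit = _norm(description) in vf_descriptions
        if name_hit or description_hit:
            matched.add(ortholog_id)
    return matched


# Hypothetical toy records for illustration only.
toy_orthologs = {
    "ortholog_1": (["yscF"], "secretion needle protein"),
    "ortholog_2": (["abcX"], "hypothetical transporter"),
    "ortholog_3": (["defY"], "Example  Toxin subunit"),
}
toy_vfdb = [("YscF", "needle component"), ("tox1", "example toxin subunit")]
vfdb_hits = match_to_vfdb(toy_orthologs, toy_vfdb)
```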
Orthologs enriched in pathogenic, antibiotic resistant, and nonpathogenic bacterial
strains
Statistical enrichment of KEGG pathways was assessed with the hypergeometric
test for the ortholog sets with PA ≥ 4 and PA ≤ ¼, respectively. Results are shown in Fig. 4. Orthologs abundant in pathogenic strains crowd pathogen-linked cellular pathways:
Staphylococcus aureus, Legionellosis, Pertussis, Salmonella, Shigellosis, and Escherichia coli infections, as well as epithelial signaling in H. pylori infection. Pathogen-abundant orthologs are also found in pathways involving bacterial
secretion systems, NOD-like receptor signaling, bacterial invasion of epithelial cells,
and plant-pathogen interactions.
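The pathway enrichment test can be sketched with an exact hypergeometric tail computed from binomial coefficients. This is an illustration under stated assumptions: the background is taken to be the full ortholog list, the function names are hypothetical, and the toy pathways below do not correspond to real KEGG content. No multiple-testing correction is applied, since none is described in the text.

```python
from math import comb
from typing import Dict, Set


def hypergeom_tail(k: int, population: int, successes: int, draws: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws)."""
    upper = min(successes, draws)
    numerator = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(max(k, 0), upper + 1)
    )
    return numerator / comb(population, draws)


def pathway_enrichment(
    selected: Set[str], background: Set[str], pathways: Dict[str, Set[str]]
) -> Dict[str, float]:
    """Enrichment p value of each pathway for the selected ortholog set."""
    selected = selected & background
    p_values = {}
    for name, members in pathways.items():
        members = members & background
        overlap = len(members & selected)
        p_values[name] = hypergeom_tail(
            overlap, len(background), len(members), len(selected)
        )
    return p_values


# Hypothetical toy data: 20 background orthologs, 4 of them selected.
background = {f"o{i}" for i in range(20)}
selected = {"o0", "o1", "o2", "o3"}
pathways = {
    "toy_secretion": {"o0", "o1", "o2", "o10"},
    "toy_metabolism": {"o4", "o5", "o6", "o7", "o8", "o9"},
}
p_values = pathway_enrichment(selected, background, pathways)
```

In the toy data, three of the four selected orthologs fall in `toy_secretion`, so that pathway receives a small p value, while `toy_metabolism` has no overlap and a p value of 1.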
Fig. 4. Pie chart for statistical enrichment of KEGG pathways by pathogen-abundant (PA ≥ 4, red) and nonpathogen-abundant (PA ≤ 0.25, blue) orthologs, respectively
Orthologs found exclusively in nonpathogenic strains occupy nodes in metabolic pathways
(Fig. 4). These pathways include the biosynthesis of peptidoglycans, macrolides, carotenoids, ansamycins, and nonribosomal peptides. The GO cellular component analyses (not shown in the figure) indicate
that pathogen-abundant orthologs are enriched in crosstalk positions of contact with
the host, whereas nonpathogen-exclusive orthologs encode proteins involved in events
in the cell interior.
Next, we looked at molecular function enrichments of pathogen-associated orthologs
and compared the results with the corresponding enrichments obtained using the VFDB database.
Figure 5 shows that our annotation and VFDB contain roughly equal numbers of orthologs
in the secretion, toxin, peptidase, and pilin categories. However, the pathogen-abundant list of the present study is significantly
more enriched in enzyme categories such as oxidoreductases, transferases, hydrolases, lyases, and ligases. Some of the orthologs in our list are also enriched in regulatory functions, particularly
those involving transcription and translation.
Fig. 5. Gene ontology molecular function distributions in pathogen-abundant orthologs (PA ≥ 4, blue) and in VFDB (red), respectively
Pathway enrichment was also conducted within the population of pathogenic strains
for the subset identified as antibiotic resistant using the ARDB [25]. The hypergeometric test revealed the pathways shown in Table 2 as particularly enriched in antibiotic-resistant strains. These included sphingolipid
metabolism, producing bioactive metabolites that regulate cell function [26]; the PI3K-Akt signaling pathway, an intracellular pathway important in apoptosis [27]; and aminoacyl-tRNA biosynthesis [28]. Some of the modules in the enriched pathways also appear in eukaryotic processes
for drug resistance against chemotherapy. One must caution, however, that the results
could potentially change with the updating of ARDB, even if the p values in these
enrichments are vanishingly small.
Table 2. KEGG reference pathways that are statistically enriched by orthologs abundant in
antibiotic-resistant bacteria. The hypergeometric test assumes as background the set
of pathogen-linked orthologs.
B. Gene circuits linked either to pathogenicity or synergy
This section presents results on genetic circuits statistically enriched in pathogenic
and nonpathogenic bacteria. We used two different types of comparison to achieve our
results: a) analyzing the entire set of genomes partitioned into pathogenic and nonpathogenic
phenotypes; and b) conducting the same operation within genera for the 17 genera identified
in Table 1 with a star. In the first approach, we mapped the lists of orthologous genes linked to
pathogenicity (PA ≥ 4) and to nonpathogenicity (PA ≤ ¼) onto KEGG reference pathways and identified,
based on the KEGG repository, those multiply connected clusters of genes (gene circuits)
containing at least three pathogen-linked or nonpathogen-linked orthologs. Results
are shown in Tables 3 and 4, respectively. Additional file 5 presents the corresponding results for within-genera comparisons, both for pathogen-
and nonpathogen-linked circuits, along with the genera containing such circuits.
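One way to picture the circuit identification step is as a graph search. The sketch below treats a pathway as an undirected graph of orthologs, takes each connected component as a candidate circuit, and keeps components holding at least three linked orthologs. Equating "circuit" with "connected component" is an assumption of this sketch and may differ from how clusters were actually delineated on the KEGG maps. The edge list is hypothetical, including the placeholder node `geneX`.

```python
from collections import defaultdict, deque
from typing import Iterable, List, Set, Tuple


def connected_components(edges: Iterable[Tuple[str, str]]) -> List[Set[str]]:
    """Connected components of an undirected graph given as an edge list."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen: Set[str] = set()
    components = []
    for start in list(adjacency):
        if start in seen:
            continue
        component = set()
        queue = deque([start])
        seen.add(start)
        while queue:  # breadth-first search from the start node
            node = queue.popleft()
            component.add(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        components.append(component)
    return components


def linked_circuits(
    edges: Iterable[Tuple[str, str]], linked: Set[str], min_linked: int = 3
) -> List[Tuple[Set[str], Set[str]]]:
    """Return (circuit, linked members) for circuits with >= min_linked linked orthologs."""
    circuits = []
    for component in connected_components(edges):
        hits = component & linked
        if len(hits) >= min_linked:
            circuits.append((component, hits))
    return circuits


# Hypothetical pathway fragment; "geneX" is a placeholder for a ubiquitous ortholog.
toy_edges = [("yscF", "yscC"), ("yscC", "yscW"), ("yscW", "geneX"), ("a", "b")]
toy_linked = {"yscF", "yscC", "yscW", "a"}
circuits = linked_circuits(toy_edges, toy_linked)
```

The retained circuit still contains `geneX`, mirroring the point that circuits hold both linked and more ubiquitous orthologs.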
Table 3. Gene circuits linked to pathogenic phenotype. Gene symbols in the table indicate orthologs
in the circuit, which are abundant in pathogenic but rarely found in nonpathogenic
strains
Table 4. Nonpathogen-linked gene circuits in bacterial strains. The columns identify pathways,
circuits, nonpathogen-linked orthologs within the circuit; and genera expressing the
circuit
Gene clusters more common in pathogenic strains
Table 3 presents a set of gene circuits with the orthologous genes linked to pathogenicity and
also indicates the pathway to which each gene circuit belongs. Fig. 6 accompanies Table 3 and shows the wiring diagrams for the gene circuitry in the form of cutouts
from the KEGG reference pathways. The circuits in the figure carry the same ordering
numbers used in Table 3. Note also that the actual circuits contain not only pathogen-linked orthologs (shown
in pink and orange) but also others, some preferentially found in pathogenic strains
and others more ubiquitous. The p value from the hypergeometric test for a bacterial
strain containing at least one pathogen-linked ortholog in a circuit shown in Table 3 was less than 0.01.
Fig. 6. Examples of gene circuitry containing pathogen-linked ortholog clusters in KEGG reference
pathways. Orthologs with PA ≥ 4 but not present in VFDB were shaded in pink whereas orthologs with PA ≥ 4 and also in VFDB were shaded in orange. The numbers indicating specific circuitry correspond
to their identification numbers in Table 3
The circuitry in Table 3 and Fig. 6 falls into the following categories:
Gene circuits for bacterial secretion and invasion pathways:
The type III bacterial secretion system pathway mediates toxin and protein delivery to host cells.
Table 3 shows the existence of multiple clusters of pathogen-linked orthologs in this pathway.
Consistent with our findings, the type III pathway is listed in the literature as
modulating pathogenic interactions with host organisms including animals and plants
[29]–[32].
Also shown in Fig. 6 are examples of pathogen-linked circuits in the secretion system. One such circuit
contains the pathogen-linked orthologs yscF, yscO, yscP, yscX, yscC, and yscW. Subsets of the pathogen-linked orthologs of this cluster are present in 242 pathogenic
and 106 nonpathogenic strains, resulting in vanishingly small p values in the hypergeometric test.
Moreover, the bias towards pathogenicity increases dramatically with the increasing number
of pathogen-linked orthologs of this cluster in the genome of the bacterial strain.
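As a worked illustration of why the p value is so small, assume the test population is the full set of 2527 strains, of which 949 are pathogenic, and that the 348 strains (242 + 106) carrying subsets of the cluster play the role of the draw. This framing is our reading of the text rather than a stated specification. The upper tail of the hypergeometric distribution is then

```latex
P(X \ge 242) \;=\; \sum_{k=242}^{348}
  \frac{\binom{949}{k}\,\binom{1578}{348-k}}{\binom{2527}{348}},
\qquad
\mathbb{E}[X] \;=\; 348 \cdot \frac{949}{2527} \;\approx\; 130.7 .
```

Under the null hypothesis, about 131 of the 348 carrier strains would be expected to be pathogenic; the observed count of 242 lies far in the upper tail.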
Another secretion-linked circuit is the type IV gene circuit, some of whose genes exist in both pathogenic and nonpathogenic
strains. The circuit functions in the translocation of DNA and protein substrates to target
cells via direct cell-to-cell contact. In our study, the complete circuit is preferentially
present in pathogenic strains. Consistent with these observations, recent investigations
uncovered a role in pathogenicity for this circuit [33]–[35].
Pathogen-linked gene clusters in the two-component system:
The two-component regulatory system is a stimulus–response coupling pathway, which
enables bacteria to sense and respond to changes in their environment [36]–[40]. Membrane-bound histidine kinases are major building blocks of the pathway.
These signal transduction systems modulate crosstalk between species within the microbiome.
Table 3 contains multiple gene clusters (circuits) in the two-component system containing
orthologs linked to pathogenicity: the cluster (devS, reA, reB, arT) involved in hypoxia, oxygen, and nitrogen assimilation; the cluster (uhpC, uhpA, uhpT) modulating hexose phosphate uptake; the cluster (pagC, pagO, pagD, pagK, pgtE) involved in Mg2+ starvation; and others. See also Fig. 6 for the wiring diagrams of these clusters. Elements of the metabolite assimilation
cluster have been linked in the literature to the pathogenicity of Mycobacterium tuberculosis
[41], [42]. The second cluster in the list, mediating hexose phosphate
uptake, plays an important role in the sodium-dependent D-glucose transport protein
of Helicobacter pylori [43]. The third cluster, involved in Mg2+ starvation, was shown to play a role in
the pathogenicity of Salmonella enterica [44]. Mg2+ starvation is also involved in quorum sensing of Pseudomonas fluorescens [45] and in the biosynthesis of complex lipids needed for virulence of Mycobacterium tuberculosis
[46].
Metabolic circuits linked to pathogenicity:
A metabolic gene circuit whose genes are commonly found in pathogenic strains is
the CMP-Pse metabolism circuit cluster belonging to the amino sugar and nucleotide sugar metabolism pathway. Pathogen-linked orthologous
genes in this circuit consist of pseC, pseH, pseF, and UAP1 (Table 3). This circuit is linked to the synthesis of glycoconjugates, which are typically
expressed on the surfaces of pathogenic bacteria. The protein products of the circuit
have already been identified as virulence factors in the VFDB database and in the
literature [47]–[49].
Nodal elements of the Peptidoglycan biosynthesis circuit cluster shown in Table 3 are also preferentially present in pathogenic strains. Pathogen-linked orthologs
in this gene circuit consist of the genes sgtA, sgtB, femA, pbpA, femB, pbp3, femX, and fmhB. Peptidoglycans are polymers consisting of sugars and amino acids forming a mesh
scaffold external to the plasma membrane. Recent studies in the literature point to
the role of peptidoglycans in the pathogen phenotype of different bacteria [50]–[52].
The sorbose to sorbose 1-phosphate circuit of the phosphotransferase system (PTS), also
shown in Table 3, contains the pathogen-linked genes PTS-Sor-EIIC, sorA, PTS-Sor-EIID, sorM, PTS-Sor-EIIA, sorF, PTS-Sor-EIIB, and sorB. The PTS circuit encodes a group translocation process present in many bacteria, transporting
sugars from the environment into the bacterial cell. The circuit has been linked in
the literature to Streptococcus invasion [53].
Our statistical computations based on the hypergeometric test indicate that the likelihood
of pathogenic identification of a strain increases dramatically with increasing numbers
of the circuit genes present in the strain’s genome. Reflecting this finding,
Table 3 contains 15 gene circuits for which bacterial strains containing at least 75 percent
of the circuit elements are always pathogenic. Hence a signature for pathogenicity
may be derived from the study of clusters of pathogen-linked orthologs in bacterial
strains.
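The 75 percent criterion can be expressed as a simple coverage check. The sketch below is illustrative only: the function names, gene labels, and strain labels are hypothetical, and requiring at least one carrier strain before calling a circuit a signature is an assumption added here to avoid a vacuous result.

```python
from typing import Dict, Set


def circuit_coverage(genome: Set[str], circuit: Set[str]) -> float:
    """Fraction of the circuit's elements that are present in the genome."""
    return len(genome & circuit) / len(circuit)


def is_pathogen_signature(
    circuit: Set[str],
    genomes: Dict[str, Set[str]],
    pathogenic: Set[str],
    threshold: float = 0.75,
) -> bool:
    """True if every strain with coverage >= threshold is labeled pathogenic.

    Returns False when no strain reaches the threshold.
    """
    carriers = [
        strain
        for strain, genome in genomes.items()
        if circuit_coverage(genome, circuit) >= threshold
    ]
    return bool(carriers) and all(strain in pathogenic for strain in carriers)


# Hypothetical toy data.
toy_circuit = {"g1", "g2", "g3", "g4"}
toy_genomes = {
    "s1": {"g1", "g2", "g3"},        # 75% coverage, pathogenic
    "s2": {"g1"},                    # 25% coverage, nonpathogenic
    "s3": {"g1", "g2", "g3", "g4"},  # 100% coverage, pathogenic
}
toy_pathogenic = {"s1", "s3"}
signature = is_pathogen_signature(toy_circuit, toy_genomes, toy_pathogenic)
```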
Figure 6 presents other examples of ortholog groupings listed in Table 3 and acting in tandem in pathogenic processes. One such circuit shown in the figure
is involved in the biosynthesis of the siderophore group of nonribosomal peptides. These
are high-affinity iron-binding compounds [54] and were found to play an important role in virulent bacterial infections [55]. Also shown in the figure are pathogen-associated ortholog circuit clusters crowding
the bacterial secretion system that were not discussed above in detail. As noted in the literature,
the secretion system facilitates the transport, injection, and release of effector compounds,
including enzymes and toxins, in bacteria [56], [57].
Additional pathogen-linked circuitry identified through comparisons of genomes belonging
to the same genera:
Additional circuits linked to pathogenicity could be identified using within-genera
genome comparisons. We conducted comparisons of the ortholog contents of strains
belonging to the same genus for the seventeen genera with the largest numbers of strains
in our dataset, shown in Table 1. Again, the clusters of pathogen-linked orthologs forming on KEGG reference pathways
were identified. However, in this case, we relaxed the pathogen-linkage criterion
from PA ≥ 4 to PA ≥ 2, since genomes belonging to the same genus are more or less
similar. In addition, we are looking here for circuitry common across genera.
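The within-genus comparison can be sketched by computing the PA score separately inside each genus and recording the genera in which an ortholog passes the relaxed cutoff. This is a hedged illustration: reusing the 0.0001 denominator term, skipping genera that lack one of the two phenotypes, and the function name and toy data are all assumptions of this sketch.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def within_genus_linked(
    strains: Iterable[Tuple[str, bool, Set[str]]],
    threshold: float = 2.0,
    eps: float = 0.0001,
) -> Dict[str, Set[str]]:
    """Map each ortholog to the genera in which its within-genus PA >= threshold.

    strains: iterable of (genus, is_pathogenic, set of orthologs) records.
    """
    by_genus = defaultdict(lambda: {True: [], False: []})
    for genus, is_pathogenic, orthologs in strains:
        by_genus[genus][is_pathogenic].append(orthologs)

    linked = defaultdict(set)
    for genus, groups in by_genus.items():
        pathogenic, nonpathogenic = groups[True], groups[False]
        if not pathogenic or not nonpathogenic:
            continue  # assumption: both phenotypes are needed for a comparison
        for ortholog in set().union(*pathogenic):
            ap = sum(ortholog in g for g in pathogenic) / len(pathogenic)
            anp = sum(ortholog in g for g in nonpathogenic) / len(nonpathogenic)
            if ap / (anp + eps) >= threshold:
                linked[ortholog].add(genus)
    return dict(linked)


# Hypothetical toy strains from two placeholder genera.
toy_strains = [
    ("G1", True, {"a", "b"}), ("G1", True, {"a"}),
    ("G1", False, {"b"}), ("G1", False, {"c"}),
    ("G2", True, {"a", "c"}),
    ("G2", False, {"c"}), ("G2", False, {"c"}), ("G2", False, {"a", "c"}),
]
genera_per_ortholog = within_genus_linked(toy_strains)
```

Orthologs linked in several genera, such as `a` in the toy data, are the candidates for circuitry common across genera.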
Results of these computations are presented in Additional file 5, in rows 1 to 21 for pathogen-linked circuits. The table shows not only
the circuitry but also the genera associated with the specified circuitry. The circuit
clusters most common across genera in this table lie in the pathways for glycine,
serine and threonine metabolism and for sulfur metabolism (Fig. 7). These pathways have been implicated in playing important roles in pathogenicity
[58]–[61]. Also in this category is the gene circuit in Additional file 5, row 1, linked to Alzheimer’s disease via Amyloid B and Mitochondrial Dysfunction.
Fig. 7. Examples of gene circuitry in KEGG Reference pathways, which are linked to pathogenicity
via within-genus comparison. The orthologs linked to pathogenicity in these circuits
are shaded in pink. The numbers indicating specific circuitry correspond to their
identification numbers in Additional file 5
Another pathogen-linked gene circuit that emerges in genera-specific comparisons
is the Amyloid B and Mitochondrial Dysfunction circuitry in the KEGG Alzheimer’s pathway
(Additional file 5). The pathogen-linked orthologs in this circuitry (UQCRFS1, RIP1, petA, MME, IDE, ide, CALM, NDUFV2) are found in 16 of the 17 genera under consideration. This observation suggests
that a diverse range of bacterial infections could act as possible modulators
of Alzheimer’s disease [62]–[64].
Gene circuits found in nonpathogenic strains
Circuits linked to nonpathogenicity are shown in Table 4 and Additional file 5, respectively, for across-genera and within-genera comparisons. Our detailed results
shown in Table 4 and Fig. 8 are summarized below.
Fig. 8. Examples of gene circuitry containing nonpathogen-linked ortholog clusters in KEGG
reference pathways. Orthologs with PA ≤ 1/4 are shaded in pink. The numbers indicating specific circuitry correspond to
their identification numbers in Table 4
Antibiotic- and metabolite-producing circuits:
The presence of clusters of nonpathogen-linked orthologs in metabolic circuits such as
steroid biosynthesis, arginine and proline metabolism, and the insulin signaling pathway
indicates the importance of these pathways in establishing synergy with the host in
all the major genera considered (Table 4). Some of the orthologs in the nonpathogen-linked bacterial gene circuits have orthologs
in humans. Other synergy circuits in Table 4 are involved in radiation survival [65]. KEGG reference metabolic pathways contain large numbers of nodes, creating thousands
of clusters for bacterial species; hence the relative lack of literature for some
of the clusters shown in Table 4.
The polyketide circuit shown in Fig. 8 and presented in Table 4 facilitates the synthesis of common antibiotics [68]. Polyketides are complex organic compounds that are highly active biologically.
Many pharmaceuticals are derived from or inspired by polyketides. In addition to the
polyketide circuits, the circuit shown in Fig. 8 composed of scyllo-inosamine orthologs (stsE, strB1, E2.4.2.27, strK, K12570, aphD, strA) is involved in the biosynthesis of streptomycin and similar anti-mycobacterial antibiotics.
Recent studies show that bacterial virulence factors in the type III secretion pathway are targeted
by virulence inhibitors such as those illustrated in Fig. 8 [69]. Also shown in the figure is the one-carbon pool by folate pathway, which activates one-carbon units for biosynthesis [70]. It plays a major role in amino acid metabolism [71]. It has been shown to affect proofreading of DNA replication, DNA methylation, and
chromatin structure [72]–[74]. The list of commensal circuitry presented in Table 4 is not complete, but is representative of the diversity of commensal circuits found
in bacterial strains.
Genera-specific genome comparisons reveal additional circuit clusters found almost
exclusively in nonpathogenic bacterial strains. Shown in Additional file 5 in rows 22 to 24 are the clusters for benzoate degradation and the cluster for dopamine
circuitry in isoquinoline alkaloid biosynthesis. Benzoate degradation is an important
factor in reducing drug-induced toxicity [66]. It is not clear how dopamine-inducing bacterial gene circuits drive synergy with
the host, yet modulations in dopamine circuitry in bacteria were previously linked
to Alzheimer’s disease stage progression via Borrelia infection [67].
