Large-scale contamination of microbial isolate genomes by Illumina PhiX control

The presence of PhiX sequences within individual genomes first attracted our attention
while we were manually curating a small number of isolate genomes. We initially took
these scaffolds for an exciting biological phenomenon or the result of horizontal gene
transfer, but careful analyses showed them to be nothing but sequencing artifacts.
Sequencing centers generate massive amounts of data, which calls for strict quality
control measures. The sheer volume of data being generated on a daily basis necessitates
well-defined, automatic quality control protocols at source. Contaminated sequences,
once released to public databases, typically feed into thousands of analysis routes and
can propagate errors and lead to incorrect hypotheses [16]. Thus, it is extremely important
to detect contaminated sequences at the source and prevent them from affecting
downstream analyses.

Contamination and sequence artifacts can come from multiple sources including, but
not limited to, sequencing controls such as PhiX, cloning vectors, adapters, PCR primers,
nucleic acid impurities present in reagents required for sample isolation and preparation,
and human error. Salter et al. [17] identified a wide range of contaminants from DNA
extraction kits and other laboratory reagents affecting the outcome of culture-independent
microbiota research, while Lusk [18] detected widespread contamination in four independent
high-throughput sequencing experiments. A study [19] scanning DNA sequences from the
1000 Genomes Project [20] identified significant contamination by Mycoplasma sequences.
While DNA contamination has been a long-standing issue in research laboratories,
its potential long-term implications were highlighted recently in light of developments
in high-throughput sequencing and human microbiome research. A recent commentary published
in Nature [21] summarizes the problem well.

Several tools have been developed over the years for quality control of raw sequence
reads, such as Phred [22] and Sequence Scanner [23] (specifically for first-generation
sequence data), and NCBI’s VecScreen and UniVec [24,25] to get rid of contaminants of
vector origin. More recent programs designed for analyzing NGS data include TileQC [26],
FastQC [27], PRINSEQ [28] and NGS-QC [29]; programs to detect contamination, such as
DeconSeq [30]; and multi genome alignment (MGA) [31] and QC-Chain [32], which can provide
both rapid QC and contamination filtering of NGS data. Such programs
are meant to prevent the release of contaminated sequences. However, our results from
scanning publicly available microbial isolate genome sequences for contamination show
that a large number of errors can be detected in spite of the easy availability of multiple
quality control measures. The sheer number of PhiX-contaminated genomes is alarming
and calls for the implementation of stricter quality control measures, especially at large
genome centers with high rates of sequence turnover.

Detection of PhiX contamination encouraged us to expand our search further; we performed
additional analyses looking for other sources of contamination and identified
genomes in public databases that fall into the following categories:

(a) partial or complete mixtures of two or more strains

(b) genomes contaminated with short fragments of two or more species

(c) ‘isolate’ genomes where a complete genome is cloned inside another

The list of such genomes is available in Additional file 4 and their nucleotide sequences
are available on a JGI public ftp site [33]. The IMG database has already implemented a
quality control step to identify and
remove these artifacts during data submission, and the sequence data in the system
is free of PhiX contamination. We are currently in the process of cleaning up additional
contaminated genomes. Most have already been removed from IMG completely or are being
reinstated after removal of the contaminated scaffolds. At the same time, most of
the PhiX-contaminated genomes continue to exist in other public databases such as
NCBI/RefSeq or GenBank and are easily accessible to researchers around the world. While
we welcome the technological advances associated with NGS platforms and acknowledge
their long-term benefits, we expect principal investigators (PIs) of large-scale sequencing
projects to be aware of the possible pitfalls and take corrective measures as necessary.
For the genomes contaminated with PhiX sequences, we recommend that individual PIs
retract the corresponding sequences, remove the contaminating scaffolds, and re-upload
the clean sequences to public databases.
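
The recommended clean-up step (separating contaminating scaffolds from the rest of an assembly) can be illustrated with a minimal sketch. This is not the screening procedure used in this study. It assumes that scaffolds are already loaded as a name-to-sequence dictionary and that the PhiX reference sequence is supplied by the user, and it flags a scaffold when a large fraction of its k-mers also occur in the reference on either strand. The function names, the k-mer size and the threshold are hypothetical choices made for illustration.

```python
from typing import Dict, Iterator, Set, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement; non-ACGT symbols are left unchanged."""
    return seq.translate(_COMPLEMENT)[::-1]


def kmers(seq: str, k: int) -> Iterator[str]:
    """Yield every overlapping k-mer of seq in upper case."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_index(reference: str, k: int) -> Set[str]:
    """Collect the k-mers of the reference from both strands."""
    index = set(kmers(reference, k))
    index.update(kmers(reverse_complement(reference), k))
    return index


def shared_fraction(scaffold: str, index: Set[str], k: int) -> float:
    """Fraction of the scaffold's k-mers that also occur in the index."""
    total = len(scaffold) - k + 1
    if total <= 0:
        return 0.0  # scaffold shorter than k: nothing to compare
    hits = sum(1 for kmer in kmers(scaffold, k) if kmer in index)
    return hits / total


def split_scaffolds(
    scaffolds: Dict[str, str],
    reference: str,
    k: int = 21,             # assumed k-mer size, not taken from the study
    threshold: float = 0.5,  # assumed cut-off, not taken from the study
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (clean, flagged) scaffolds relative to the control reference."""
    index = build_index(reference, k)
    clean: Dict[str, str] = {}
    flagged: Dict[str, str] = {}
    for name, seq in scaffolds.items():
        target = flagged if shared_fraction(seq, index, k) >= threshold else clean
        target[name] = seq
    return clean, flagged
```

Exact k-mer matching is used here only because it keeps the example self-contained. In practice an alignment-based search against the control sequence would tolerate sequencing errors better, and flagged scaffolds would still need manual review before removal.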

Additional file 4. List of non-PhiX contaminations that were detected and removed from the public IMG
database.
