Identify and map genomic islands with xenoGI


Genomic islands (GIs) are parts of a genome that have been transferred horizontally between organisms, typically bacteria. They code for many functions, including symbiosis, pathogenesis, and antibiotic resistance. GIs are often identified by base composition analysis and phylogeny estimation.

Dr. Eliot Bush has developed xenoGI, a tool that helps identify islands of genes that entered via common horizontal transfer events, map those events onto the phylogenetic tree, and more. Here, he briefly describes the features and benefits of his tool.

Features and benefits of xenoGI

Microbes have acquired many important traits through the horizontal transfer of genomic islands. Understanding the evolution of these traits often requires us to understand the adaptive path that has produced them.

The goal of xenoGI is to reconstruct the history of genomic island insertions in a clade of closely related microbes. It takes as input a set of sequenced genomes and a tree specifying their phylogenetic relationships.

It then identifies genomic islands and maps their origin on the phylogenetic tree, determining which branch they inserted on.
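
As a rough illustration of what placing an island on a branch involves, the sketch below finds the most recent common ancestor (MRCA) of the strains that carry a gene family. Under the simplifying assumption of a single gain and no later losses, the insertion would sit on the branch leading to that node. The tree, the strain names, and the `mrca` helper are all hypothetical; this is not xenoGI's code.

```python
# Toy species tree stored as child -> parent links (names are invented).
PARENT = {
    "strainA": "n2",
    "strainB": "n2",
    "n2": "n1",
    "strainC": "n1",
    "n1": "root",
    "strainD": "root",
}


def path_to_root(node):
    """Return the nodes visited when walking from `node` up to the root."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def mrca(strains):
    """Most recent common ancestor of a non-empty collection of tips."""
    strains = list(strains)
    first_path = path_to_root(strains[0])
    shared = set(first_path)
    for strain in strains[1:]:
        shared &= set(path_to_root(strain))
    # Walking up from the first tip, the first shared node is the MRCA.
    return next(node for node in first_path if node in shared)


# A family found in strains A, B and C but not in the outgroup strain D.
carriers = ["strainA", "strainB", "strainC"]
print(mrca(carriers))  # node whose incoming branch would carry the insertion
```

In this toy tree the family is assigned to the branch leading to `n1`, the ancestor of strains A, B and C.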

The key challenge in this problem is to accurately identify the origin of genes. Every gene in the input genomes must have one of two origins. Either it is a core gene present in the most recent common ancestor of the strains, or it arrived via a horizontal transfer event. The algorithm seeks to determine which is which by creating gene families in a way that takes account of both the species tree and synteny information. It then identifies families whose members are adjacent and whose most recent common ancestor is shared, and merges them into islands reflecting a common origin.
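
The merging step can be pictured with a small sketch. It assumes each gene family has already been assigned an origin node on the species tree, and it simply groups runs of adjacent families that share that origin. The family names, node labels, and the `merge_into_islands` helper are hypothetical, and xenoGI's actual procedure works across several genomes and is more involved than this.

```python
from itertools import groupby

# (family_id, origin_node) pairs in chromosomal order along one genome.
# "root" marks core families; "n1" and "n2" are invented internal nodes.
genome = [
    ("fam1", "root"),
    ("fam2", "n1"),
    ("fam3", "n1"),
    ("fam4", "n1"),
    ("fam5", "root"),
    ("fam6", "n2"),
    ("fam7", "n1"),
]


def merge_into_islands(ordered_families):
    """Group runs of adjacent families that share an origin node."""
    islands = []
    for origin, run in groupby(ordered_families, key=lambda item: item[1]):
        islands.append((origin, [family for family, _ in run]))
    return islands


for origin, families in merge_into_islands(genome):
    print(origin, families)
```

Note that `fam7` shares its origin with `fam2` to `fam4` but is not adjacent to them, so in this sketch it stays in a separate group.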

The figure below shows an example of this sort of analysis related to acid tolerance in the Escherichia clade. gadB is a glutamate decarboxylase enzyme known to be involved in acid tolerance in E. coli. In an analysis of eleven enteric species, xenoGI finds that gadB is part of an island of eight genes that inserted on the branch leading to Escherichia, before the divergence of E. fergusonii.

Example of analysis with xenoGI

The fact that it operates in the context of a clade makes xenoGI distinctive compared with previous genomic island finding methods. Other distinctive features are that it is gene-based, doesn’t depend on the aligner MAUVE, and integrates species tree and synteny information from an early stage of its analysis.

In the past, reconstructing the history of GI insertions into a clade typically required heavy human involvement. xenoGI provides an automated solution to this problem. Beyond this, a thorough comparative analysis is the gold standard for genomic island finding (even if one’s goal is to find islands in only a single genome). Our hope is that xenoGI will make this sort of analysis accessible to more users.


Bush et al. (2017). xenoGI: reconstructing the history of genomic island insertions in clades of closely related bacteria. bioRxiv.


Towards standardized protocols for microbiome studies


The study of the microbiome, that is, the ensemble of microbial communities living inside us, has become a major application for high-throughput DNA sequencing. Changes in the composition and function of the gut microbiome have been implicated in multiple human diseases.

Because microbiome studies are complex, analyzing their sequencing data typically involves many different protocols and bioinformatics tools. From sample collection and DNA extraction to sequencing and computational analysis, technical errors and bias can occur at each step, which makes standardizing protocols a complex task.

To this end, two consortia recently set out to examine the sources of inter-laboratory variability in various aspects of microbiome data generation. Their work was published in the latest issue of Nature Biotechnology.

Steps in the microbiome data generation process and technical sources of error and bias at each step. From Gohl D. M. 2017

The Microbiome Quality Control (MBQC) project consortium

The first study, published by Sinha et al. and the MBQC consortium, focuses on two main sources of variation: data handling (extraction, amplification, and sequencing) and bioinformatics processing. To assess these potential sources of bias, they sent human samples to 15 laboratories and subjected the resulting dataset to analysis by 9 bioinformatics protocols.

Microbiome Quality Control Project baseline study design.

DNA extraction and library preparation showed the highest degree of variation among laboratories, while the different bioinformatics analyses introduced little variability.

The authors also provide guidelines for using bioinformatics protocols in ways that mitigate this variability, such as preferring relative (rather than absolute) diversity measures, phylogenetic (rather than taxonomic) analyses, and quantitative (rather than presence/absence-based) measures.
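
To make the contrast between quantitative and presence/absence measures concrete, here is a minimal sketch comparing Bray-Curtis dissimilarity with Jaccard distance on two invented count profiles of the same community sequenced at different depths. The counts, the function names, and the choice to normalize to relative abundances are illustrative assumptions, not the MBQC analysis itself.

```python
def relative_abundance(counts):
    """Scale raw counts so that they sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def bray_curtis(a, b):
    """Quantitative dissimilarity between two abundance profiles."""
    numerator = sum(abs(x - y) for x, y in zip(a, b))
    return numerator / (sum(a) + sum(b))


def jaccard_distance(a, b):
    """Presence/absence dissimilarity between two profiles."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, x in enumerate(b) if x > 0}
    return 1 - len(present_a & present_b) / len(present_a | present_b)


# Invented counts per taxon: lab B sequenced ten times deeper and picked up
# a rare fourth taxon that lab A missed.
lab_a = [500, 300, 200, 0]
lab_b = [5000, 3000, 2000, 10]

print("Bray-Curtis on raw counts:     ", bray_curtis(lab_a, lab_b))
print("Bray-Curtis on relative counts:",
      bray_curtis(relative_abundance(lab_a), relative_abundance(lab_b)))
print("Jaccard (presence/absence):    ", jaccard_distance(lab_a, lab_b))
```

In this toy case the two profiles look nearly identical once counts are normalized, while the raw-count and presence/absence measures are dominated by sequencing depth and by a single rare taxon.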

The next phase of the MBQC project will consist of systematic surveys of microbiome assay protocols, with the goal of establishing a shared library of positive- and negative-control standards for different microbial habitats.

Assessing DNA extraction protocols

In a second study, Costea et al. tested 21 representative DNA extraction protocols on the same fecal samples and quantified differences in terms of microbial community composition.

The authors identified three protocols that performed better in yield and integrity of the extracted DNA, and in their ability to represent hard-to-extract Gram-positive species.

The details of these protocols, together with standard methods for sample and library preparation can be found here:

DNA extraction methods were the sole source of variation investigated in this study. Nonetheless, it provides a benchmark for the future development of new extraction methods, as well as a set of recommendations to improve cross-study comparability.

Together, these studies underline the need to work towards uniform and standardized protocols for microbiome studies.


Process 16S rRNA sequences with the sl1p tool


Advances in DNA sequencing technologies have encouraged a surge of microbiome studies. The microbiome, the set of microbes (bacteria, viruses, archaea) that live in a particular environmental niche, has been extensively studied, including in the context of human disease, changes in ecological environments, and progressive oxygen gradients in the deep sea. One of the most popular methods for these types of studies is sequencing segments of the 16S rRNA gene, a highly conserved gene among bacteria that allows researchers to characterize the taxonomic diversity within a given bacterial niche.

Drs. Whelan and Surette have recently come up with a new tool, sl1p, that helps automate the processing of 16S rRNA gene sequencing data and provides analyses which allow the user to jump right into answering their own microbiome-related research questions without extensive bioinformatics training. Here, they describe the main features and benefits of their tool.

The need for a better tool

Many tools and pipelines exist for the processing of microbial marker gene data. Many of these, such as the popular QIIME and mothur, process data using different approaches and algorithms, or provide the user with a choice of approaches for these various steps. Further, these tools often consist of a set of command line steps which are both time consuming and prone to irreproducibility. To address these issues, we developed the short-read library 16S rRNA gene sequencing pipeline (sl1p; pronounced “slip”), a stand-alone pipeline which automates these steps into an easy-to-use, reproducible approach.

sl1p processes 16S rRNA gene sequencing data with the most biologically accurate tools

Processing 16S rRNA gene sequencing data requires a series of steps. These include, but are not limited to, quality filtering, checking for chimeras, picking operational taxonomic units (OTUs), and assigning taxonomy to OTUs (Fig. 1). sl1p implements a wide variety of algorithms and options for each of these processing steps. Importantly, the defaults of sl1p were carefully chosen to represent the tools and approaches that worked best in a comprehensive comparison using mock human microbiome sequencing datasets and cultured isolates. Detailed information about these comparisons can be found in Whelan FJ & Surette MG (2017) Microbiome.

Figure 1. Processing steps implemented in sl1p
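
For readers new to these steps, the toy sketch below shows what the first of them, quality filtering, amounts to, followed by exact-match dereplication as a crude stand-in for OTU picking (real pipelines cluster at a similarity threshold and also screen for chimeras). The reads, the threshold, and the function names are invented for illustration; sl1p delegates these steps to dedicated tools, and this is not how it implements them.

```python
from collections import Counter


def mean_phred(quality_string, offset=33):
    """Mean Phred score of a FASTQ quality string (Phred+33 encoding)."""
    scores = [ord(ch) - offset for ch in quality_string]
    return sum(scores) / len(scores)


def quality_filter(reads, min_mean_quality=20):
    """Keep (name, sequence) for reads whose mean quality passes the cutoff."""
    return [
        (name, seq)
        for name, seq, qual in reads
        if mean_phred(qual) >= min_mean_quality
    ]


def dereplicate(filtered_reads):
    """Count identical sequences (a simplistic proxy for OTU picking)."""
    return Counter(seq for _, seq in filtered_reads)


# Invented reads: (name, sequence, quality string).
reads = [
    ("read1", "ACGTACGT", "IIIIIIII"),  # 'I' encodes Phred 40
    ("read2", "ACGTACGT", "########"),  # '#' encodes Phred 2
    ("read3", "ACGTTCGT", "IIII####"),  # mean Phred 21
    ("read4", "ACGTACGT", "IIIIIIII"),
]

kept = quality_filter(reads)
print([name for name, _ in kept])
print(dereplicate(kept))
```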

sl1p conducts preliminary analyses of microbial community data

sl1p’s output includes preliminary analyses that give the user a broad understanding of their data immediately after sl1p has been run. These include a summary of the number of non-bacterial reads in each sample, taxonomic summaries of each sample at various taxonomic levels (phylum, class, order, family, and genus), and alpha- and beta-diversity outputs using 3 different distance metrics (Fig. 2). Importantly, these outputs are produced using both QIIME and R, and the raw commands for both are included so that users can reuse them as they further interrogate their data to answer questions specific to their research. This makes the analyses more approachable to the non-bioinformatician.

Figure 2. Preliminary analyses provided in sl1p
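
As an indication of what a taxonomic summary and an alpha-diversity value look like in practice, the sketch below collapses a tiny invented OTU table to a chosen taxonomic level and computes the Shannon index for the sample. The table, counts, and function names are assumptions made for illustration; sl1p itself produces these outputs through QIIME and R.

```python
import math
from collections import defaultdict

LEVELS = ["phylum", "class", "order", "family", "genus"]

# Invented OTU table for one sample: OTU -> (read count, lineage).
otu_table = {
    "otu1": (60, ["Firmicutes", "Bacilli", "Lactobacillales",
                  "Streptococcaceae", "Streptococcus"]),
    "otu2": (30, ["Firmicutes", "Clostridia", "Clostridiales",
                  "Lachnospiraceae", "Blautia"]),
    "otu3": (10, ["Bacteroidetes", "Bacteroidia", "Bacteroidales",
                  "Bacteroidaceae", "Bacteroides"]),
}


def summarize(table, level):
    """Relative abundance of each taxon at the requested level."""
    index = LEVELS.index(level)
    counts = defaultdict(int)
    for count, lineage in table.values():
        counts[lineage[index]] += count
    total = sum(counts.values())
    return {taxon: count / total for taxon, count in counts.items()}


def shannon(table):
    """Shannon diversity index (natural log) over OTU counts."""
    counts = [count for count, _ in table.values() if count > 0]
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts)


print(summarize(otu_table, "phylum"))
print(summarize(otu_table, "genus"))
print(round(shannon(otu_table), 3))
```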

sl1p promotes reproducibility

The main goal of sl1p was to make reproducible and accurate microbiome research more accessible. sl1p produces a comprehensive logfile (Fig. 3) which records exactly how sl1p was called, the version of each software dependency, and how each processing step was conducted. This logfile is valuable for reproducing a given sl1p run and for understanding how small changes in the processing workflow can alter the resulting output. Further, sl1p provides an R markdown file detailing each step taken in sl1p’s preliminary analyses of the data. Not only is this file an appropriate place for users to start their own analyses, but it also provides transparency about how the sl1p outputs are generated.

Figure 3. sl1p logfile produced after analysis
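
The idea behind such a logfile can be sketched in a few lines: record the command that was run, the versions of the dependencies, and the steps performed. The field names, the example command, and the `build_run_log` helper below are hypothetical and do not reflect the actual format of sl1p's logfile.

```python
import json
import platform
from datetime import datetime, timezone
from importlib import metadata


def build_run_log(argv, packages, steps):
    """Collect the details needed to repeat a run later."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = "not installed"
    return {
        "started": datetime.now(timezone.utc).isoformat(),
        "command": " ".join(argv),
        "python": platform.python_version(),
        "dependencies": versions,
        "steps": list(steps),
    }


log = build_run_log(
    argv=["pipeline.py", "--input", "reads/"],  # hypothetical invocation
    packages=["numpy"],                          # example dependency to record
    steps=["quality filtering", "OTU picking", "taxonomy assignment"],
)
print(json.dumps(log, indent=2))
```

Writing the log as structured text keeps it readable while still letting a later run be compared against it automatically.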


Whelan FJ & Surette MG. (2017). A comprehensive evaluation of the sl1p pipeline for 16S rRNA gene sequencing analysis. Microbiome.