Process 16S rRNA sequences with the sl1p tool


Advancing DNA sequencing technologies have encouraged a surge of microbiome studies. The microbiome, the set of microbes (bacteria, viruses, archaea) that live in a particular environmental niche, has been extensively studied, including in the context of human disease, changes in ecological environments, and progressive oxygen gradients in the deep sea. One of the most popular methods for these studies is sequencing segments of the 16S rRNA gene, a gene highly conserved among bacterial populations, which allows researchers to identify the taxonomic diversity within a given bacterial niche.

Drs. Whelan and Surette have recently developed a new tool, sl1p, which automates the processing of 16S rRNA gene sequencing data and provides analyses that allow users to jump right into answering their own microbiome-related research questions without extensive bioinformatics training. Here, they describe the main features and benefits of their tool.

The need for a better tool

Many tools and pipelines exist for processing microbial marker gene data. Many of these, such as the popular QIIME and mothur, process data using different approaches and algorithms, or provide the user with a choice of approaches for the various processing steps. Further, these tools often consist of a series of command-line steps which are both time-consuming and prone to irreproducibility. To address these issues, we developed the short-read library 16S rRNA gene sequencing pipeline (sl1p; pronounced “slip”), a stand-alone pipeline which automates these steps into an easy-to-use, reproducible approach.

sl1p processes 16S rRNA gene sequencing data with the most biologically accurate tools

In order to process 16S rRNA gene sequencing data, a variety of processing steps must be implemented. These include but are not limited to quality filtering, checking for chimeras, picking operational taxonomic units (OTUs), and assigning taxonomy to OTUs (Fig. 1). sl1p implements a wide variety of algorithms and options for each of these processing steps. Importantly, the defaults of sl1p were carefully chosen to represent the tools and approaches which worked best in a comprehensive comparison using mock human microbiome sequencing datasets and cultured isolates. Detailed information about these comparisons can be found in Whelan FJ & Surette MG (2017) Microbiome.

Figure 1. Processing steps implemented in sl1p

sl1p conducts preliminary analyses of microbial community data

Included in sl1p’s output are preliminary analyses that allow the user to quickly obtain a broad understanding of their data immediately after sl1p has run. These include a summary of the proportion of non-bacterial reads in each sample, taxonomic summaries of each sample at various taxonomic levels (phylum, class, order, family, and genus), and alpha- and beta-diversity outputs using three different distance metrics (Fig. 2). Importantly, these outputs are produced using both QIIME and R, and the raw commands for both are included so that users can further interrogate their data to answer questions specific to their research, making these analyses more approachable for the non-bioinformatician.
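To make the alpha- and beta-diversity outputs concrete, here is a minimal Python sketch of two commonly used measures, the Shannon index and the Bray-Curtis dissimilarity. This is purely illustrative: sl1p generates its diversity analyses through QIIME and R, and the OTU counts below are hypothetical.

```python
import math

def shannon_diversity(counts):
    """Alpha diversity: Shannon index H = -sum(p_i * ln(p_i)) over observed OTUs."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in props)

def bray_curtis(x, y):
    """Beta diversity: Bray-Curtis dissimilarity between two samples' OTU counts."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den

# Hypothetical OTU count vectors for two samples
sample_a = [30, 10, 5, 0]
sample_b = [10, 20, 0, 15]
print(round(shannon_diversity(sample_a), 3))   # within-sample diversity
print(round(bray_curtis(sample_a, sample_b), 3))  # between-sample dissimilarity
```

A Bray-Curtis value of 0 means two samples share identical OTU profiles, while 1 means they share no OTUs at all.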

Figure 2. Preliminary analyses provided in sl1p

sl1p promotes reproducibility

The main goal of sl1p was to make reproducible and accurate microbiome research more accessible. sl1p produces a comprehensive logfile (Fig. 3) which outlines exactly how sl1p was called, important version information for each of the software dependencies, and how each processing step was conducted. This logfile is valuable for reproducing a given sl1p run and for understanding how small changes in the processing workflow can alter the resulting output. Further, sl1p provides an R Markdown file detailing each step taken in sl1p’s preliminary analyses of the data. Not only is this file an appropriate starting point for the user’s own analyses, it also provides transparency in how the sl1p outputs are generated.

Figure 3. sl1p logfile produced after analysis


Whelan FJ & Surette MG. (2017). A comprehensive evaluation of the sl1p pipeline for 16S rRNA gene sequencing analysis. Microbiome.

Your top 3 RNA-seq quality control tools


RNA-sequencing (RNA-seq) is currently the leading technology for transcriptome analysis. Its applications range from the study of alternative splicing and post-transcriptional modifications to the comparison of relative gene expression between different biological samples.

To help you prepare and analyse your RNA-seq experiments under the best conditions, we have launched a new series of surveys focused on the best tools for each fundamental step of an RNA-seq experiment.

Starting your analysis with quality control

The first step in the analysis of an RNA-seq experiment is quality control. This crucial step ensures that your data are of sufficient quality for the subsequent steps of your analysis. Quality control usually covers sequence quality, sequencing depth, read duplication rates (clonal reads), alignment quality, nucleotide composition bias, etc.
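As a simple illustration of what “sequence quality” means in practice, the sketch below computes the mean Phred quality at each base position across the reads of a FASTQ file, assuming the standard Phred+33 encoding. This is the kind of per-position summary that dedicated QC tools report; the two-read FASTQ here is hypothetical.

```python
import io

def mean_quality_per_position(fastq_handle, offset=33):
    """Mean Phred quality at each base position across all reads (Phred+33)."""
    sums, counts = [], []
    for i, line in enumerate(fastq_handle):
        if i % 4 == 3:  # every 4th line of a FASTQ record is the quality string
            for pos, ch in enumerate(line.strip()):
                if pos == len(sums):
                    sums.append(0)
                    counts.append(0)
                sums[pos] += ord(ch) - offset  # decode the quality character
                counts[pos] += 1
    return [s / c for s, c in zip(sums, counts)]

# Two hypothetical reads; 'I' encodes Q40 and '#' encodes Q2 in Phred+33
fastq = io.StringIO(
    "@read1\nACGT\n+\nIIII\n"
    "@read2\nACGT\n+\nII##\n"
)
print(mean_quality_per_position(fastq))  # [40.0, 40.0, 21.0, 21.0]
```

The characteristic drop in quality toward the 3' end of Illumina reads shows up directly in this per-position profile, which is why quality trimming typically works from the 3' end inward.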

We therefore start this series by presenting the best QC tools, as chosen by the OMICtools community!

Your number 1 tool: NGS QC Toolkit

NGS QC Toolkit was the favorite tool for 79% of OMICtools members.

This standalone, open-source application offers several QC tools to quality-check and filter your NGS data. The toolbox is divided into four major groups of tools:

  • Quality control tools for Illumina or Roche 454 data
  • Trimming tools
  • Format conversion tools
  • Statistics tools

All QC tools can generate graphs as outputs, as well as diverse statistics, such as average quality scores at each base position, GC content distribution, etc.
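GC content, one of the statistics mentioned above, is straightforward to compute per read; QC tools then plot its distribution across all reads, since a skewed or bimodal GC distribution can indicate contamination. A minimal sketch, with hypothetical reads:

```python
def gc_content(seq):
    """Percentage of G and C bases in a single read."""
    seq = seq.upper()
    return 100 * sum(seq.count(b) for b in "GC") / len(seq)

# Hypothetical reads; a QC tool would histogram these values over millions of reads
reads = ["ACGT", "GGGG", "ATAT", "GCGC"]
print([gc_content(r) for r in reads])  # [50.0, 100.0, 0.0, 100.0]
```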

NGS QC Toolkit toolbox

The application can be downloaded here: Link and runs on Windows and Linux operating systems, provided ActivePerl is installed.

Your second favorite tool: RseqFlow

RseqFlow is an RNA-seq analysis pipeline that covers pre- and post-mapping quality control, as well as other analysis steps. The pipeline is divided into four branches, which can be run individually or in a workflow mode.

  • Branch 1: Quality Control and SNP calling based on the merging of alignments to the transcriptome and genome.
  • Branch 2: Expression level quantification for Gene/Exon/Splice Junctions based on alignment to the transcriptome.
  • Branch 3: Some file format conversions for easy storage, backup and visualization.
  • Branch 4: Differentially expressed gene identification based on the output of the expression level quantification from Branch 2.

RseqFlow provides a downloadable Virtual Machine (VM) image managed with Pegasus, which allows users to run the pipeline easily on different computational resources, available here: Link

RseqFlow can also be run in a Unix shell mode that allows users to execute each branch of the analysis with a Unix command (the following software must be pre-installed: Python 2.7 or higher, R 2.11 or higher, and GCC).

Third place for Trim Galore!

Trim Galore! is a wrapper script that automates quality and adapter trimming as well as quality control, with added functionality to remove biased methylation positions from RRBS sequence files (for directional, non-directional, or paired-end sequencing).

Its main features include:

  • For adapter trimming, Trim Galore! uses the first 13 bp of Illumina standard adapters (‘AGATCGGAAGAGC’) by default (suitable for both ends of paired-end libraries), but accepts other adapter sequences, too
  • For MspI-digested RRBS libraries, Trim Galore! performs quality and adapter trimming in two subsequent steps. This allows it to remove 2 additional bases that contain a cytosine which was artificially introduced in the end-repair step during the library preparation
  • For any kind of FastQ file other than MspI-digested RRBS, Trim Galore! can perform single-pass adapter- and quality trimming
  • The Phred quality of basecalls and the stringency for adapter removal can be specified individually
  • And more…
Example of a dataset downloaded from the SRA which was trimmed with a Phred score threshold of 20
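The two operations described above, 3' quality trimming at a Phred threshold (e.g. 20) followed by removal of the 13 bp Illumina adapter, can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Trim Galore!'s actual implementation (which delegates to Cutadapt and also handles partial adapter matches at the 3' end); the read and quality string are hypothetical.

```python
ADAPTER = "AGATCGGAAGAGC"  # first 13 bp of the standard Illumina adapter

def quality_trim_3prime(seq, quals, threshold=20, offset=33):
    """Trim 3' bases whose Phred+33 quality falls below the threshold."""
    end = len(seq)
    while end > 0 and (ord(quals[end - 1]) - offset) < threshold:
        end -= 1
    return seq[:end], quals[:end]

def adapter_trim(seq, adapter=ADAPTER):
    """Remove an exact adapter occurrence and everything 3' of it, if present."""
    idx = seq.find(adapter)
    return seq[:idx] if idx != -1 else seq

# Hypothetical read: insert + adapter read-through + two low-quality tail bases
seq = "ACGTACGT" + ADAPTER + "TTTT"
quals = "I" * (len(seq) - 2) + "##"  # 'I' = Q40, '#' = Q2
seq, quals = quality_trim_3prime(seq, quals)
print(adapter_trim(seq))  # ACGTACGT
```

Real trimmers are more forgiving: they score partial adapter overlaps at the read's 3' end and tolerate mismatches, which matters when the read only just runs into the adapter.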

Trim Galore! is built around Cutadapt and FastQC, and thus requires both tools to be installed to function properly.

The tool is downloadable here: Link and comes with a comprehensive and illustrated User Guide.


Evaluating the functional impact of genetic variants with COPE software


Dr Ge Gao, developer of the COPE-PCG software tool, talks here about his tool and how it can assist researchers to analyze sequencing data.

COPE: A new framework of context-oriented prediction for variant effects

Evaluating functional impacts of genetic variants is a key step in genomic studies. Whilst most popular variant annotation tools take a variant-centric strategy and assess the functional consequence of each variant independently, multiple variants in the same gene may interfere with each other and have different effects in combination than individually (e.g., a frameshift caused by an indel can be “rescued” by another downstream variant). The COPE framework, Context-Oriented Predictor for variant Effect, was developed to accurately annotate multiple co-occurring variants.

This new gene-centric annotation tool integrates the entire sequence context to evaluate the bona fide impact of multiple intra-genic variants in a context-sensitive approach.

COPE handles complex effects of multiple variants

Unlike the current variant-centric approach that assesses the functional consequence of each variant independently, COPE takes each functional element as the basic annotation unit and considers that multiple variants in the same functional element may interfere with each other and have different effects in combination than individually (complementary rescue effect).
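The “rescued stop-gained SNV” case is easy to demonstrate. In the illustrative sketch below (the codon and variants are hypothetical, and the genetic-code lookup covers only the codons used), a variant-centric annotation of the first SNV alone reports a stop gain, while evaluating both SNVs of the same codon together, as COPE does, yields an ordinary missense change.

```python
# Minimal genetic-code lookup for the codons used below (illustrative subset)
CODON_TABLE = {"AAG": "Lys", "TAG": "Stop", "ACG": "Thr", "TCG": "Ser"}

def apply_variants(codon, variants):
    """Apply (position, alt_base) substitutions to a codon string."""
    bases = list(codon)
    for pos, alt in variants:
        bases[pos] = alt
    return "".join(bases)

ref = "AAG"            # Lys in the reference
stop_gain = (0, "T")   # AAG -> TAG: a stop-gained call when viewed in isolation
second_snv = (1, "C")  # AAG -> ACG: on its own, a simple missense (Thr)

# Variant-centric view: each variant annotated independently
print(CODON_TABLE[apply_variants(ref, [stop_gain])])            # Stop
# Context-oriented view: both variants on the same haplotype
print(CODON_TABLE[apply_variants(ref, [stop_gain, second_snv])])  # Ser
```

The same logic, applied across whole transcripts rather than single codons, is what lets a gene-centric annotator catch frameshifts restored by a downstream indel or splice sites rescued by a nearby cryptic site.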

Overview of COPE: COPE uses each transcript as a basic annotation unit. The variant mapping step identifies variants within transcripts. The coding region inference step removes introns from each transcript; all possible splicing patterns are taken into consideration for splice-altering transcripts (in this case, the red dot indicates a splice acceptor site SNP, and intron retention and exon skipping are taken into consideration). The sequence comparison step compares a ‘mutant peptide’ against a reference protein sequence to obtain the final amino acid alteration.

Applying COPE software to genomic data

Screening the official 1000 Genomes variant set, COPE identified a considerable number of false-positive Loss-of-Function calls: 23.21% of splice-disrupting variants, 6.45% of frameshift indels, and 2.10% of stop-gained variants, as well as several false-negative Loss-of-Function variants in 38 genes.

To the best of our knowledge, COPE is the first fully gene-centric tool for annotating the effects of variants in a context-sensitive approach.


Schematic diagram of typical types of annotation corrections implemented in COPE. A rescued stop-gained SNV indicates that another SNV (‘A’ to ‘C’) in the same codon rescues a variant-centric stop-gained SNV (‘A’ to ‘T’). Stop-gained MNV indicates that two or more SNVs result in a stop codon (‘A’ to ‘T’ and ‘C’ to ‘G’). A rescued frameshift indel indicates that another indel in the same haplotype recovers the original open reading frame. A splicing-rescued stop-gained/frameshift variant indicates that a stop-gained or frameshift variant is rescued by a novel splicing isoform. A rescued splice-disrupting variant indicates that a splice-disrupting variant is rescued by a nearby cryptic site (as shown in the figure) or a novel splice site. The asterisk in the figure indicates a stop codon.

Evaluating the quality of COPE: availability, usability and flexibility

  • Free software
  • Publicly available online server and stand-alone package for large-scale analysis


Screenshot of the COPE web server. Example of input (A) and annotation by COPE (B)

  • Software documentation: A detailed guideline for installation and setup is available
  • Recent updates: COPE-PCG has been online since June 2016, and COPE-TFBS since March 2017 on a new website
  • Analysis of protein-coding genes (COPE-PCG), transcription factor binding sites (COPE-TFBS) and more… the COPE framework may also be extended and adapted to non-coding RNAs and miRNAs in the near future.

About the author

Dr Ge Gao is a principal investigator at the Center for Bioinformatics of Peking University. His team focuses primarily on developing novel computational techniques to analyze, integrate, and visualize high-throughput biological data effectively and efficiently, with applications for deciphering the function and evolution of gene regulatory systems. Dr Gao specializes in large-scale data mining, using a combination of statistical learning, high-performance computing, and data visualization.


Cheng et al., 2017. Accurately annotate compound effects of genetic variants using a context-sensitive framework. Nucleic Acids Research.

Cheng et al., in preparation. Systematically identify and annotate multiple-variant compound effect at transcription factor binding sites in the human genome.