Process 16S rRNA sequences with the sl1p tool


Advancing DNA sequencing technologies have encouraged a surge of microbiome studies. The microbiome, the set of microbes (bacteria, viruses, and archaea) that live in a particular environmental niche, has been studied extensively, including in the context of human disease, changes in ecological environments, and progressive oxygen gradients in the deep sea. One of the most popular methods for these types of studies is sequencing segments of the 16S rRNA gene, a gene highly conserved among bacterial populations that allows researchers to identify the taxonomic diversity within a given bacterial niche.

Drs. Whelan and Surette have recently developed a new tool, sl1p, which automates the processing of 16S rRNA gene sequencing data and provides analyses that allow users to jump right into answering their own microbiome-related research questions without extensive bioinformatics training. Here, they describe the main features and benefits of their tool.

The need for a better tool

Many tools and pipelines exist for the processing of microbial marker gene data. Many of these, such as the popular QIIME and mothur, process data using different approaches and algorithms, or provide the user with a choice of approaches for these various steps. Further, these tools often consist of a series of command-line steps which are both time-consuming and prone to irreproducibility. To address these issues, we developed the short-read library 16S rRNA gene sequencing pipeline (sl1p; pronounced "slip"), a stand-alone pipeline which automates these steps into an easy-to-use, reproducible approach.

sl1p processes 16S rRNA gene sequencing data with the most biologically accurate tools

In order to process 16S rRNA gene sequencing data, a variety of processing steps must be implemented. These include, but are not limited to, quality filtering, checking for chimeras, picking operational taxonomic units (OTUs), and assigning taxonomy to OTUs (Fig. 1). sl1p implements a wide variety of algorithms and options for each of these processing steps. Importantly, the defaults of sl1p were carefully chosen to represent the tools and approaches which worked best in a comprehensive comparison using mock human microbiome sequencing datasets and cultured isolates. Detailed information about these comparisons can be found in Whelan FJ & Surette MG (2017) Microbiome.

Figure 1. Processing steps implemented in sl1p
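One of these steps, OTU picking, can be illustrated with a toy greedy clustering at 97% sequence identity. This is a conceptual sketch only: sl1p delegates OTU picking to established tools, and real OTU pickers use alignment-based identity rather than the naive position-wise comparison used here.

```python
# Toy greedy OTU clustering at 97% identity (conceptual sketch only;
# not sl1p's actual implementation, which uses established OTU pickers).

def identity(a: str, b: str) -> float:
    """Fraction of matching positions between two sequences (no alignment)."""
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))

def pick_otus(reads, threshold=0.97):
    """Assign each read to the first OTU seed it matches at >= threshold;
    otherwise the read becomes the seed of a new OTU."""
    seeds, otus = [], {}
    for read in reads:
        for i, seed in enumerate(seeds):
            if identity(read, seed) >= threshold:
                otus[i].append(read)
                break
        else:
            seeds.append(read)
            otus[len(seeds) - 1] = [read]
    return otus

reads = [
    "ACGT" * 10,               # 40 bp reference read
    "ACGT" * 9 + "ACGA",       # one mismatch: 39/40 = 97.5% identity
    "TTTT" * 10,               # unrelated read
]
otus = pick_otus(reads)
print(len(otus))  # → 2: the first two reads cluster, the third is its own OTU
```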

sl1p conducts preliminary analyses of microbial community data

Included in sl1p’s output are preliminary analyses that the user can use to quickly obtain a broad understanding of their data immediately after sl1p has been run. The preliminary analyses produced by sl1p include a summary of the number of non-bacterial reads in each sample, taxonomic summaries of each sample at various taxonomic levels (phylum, class, order, family, and genus), as well as alpha- and beta-diversity outputs using three different distance metrics (Fig. 2). Importantly, these outputs are produced using both QIIME and R, and the raw commands for both are included so that the user can further interrogate their data to answer questions specific to their research, making these analyses more approachable to the non-bioinformatician.

Figure 2. Preliminary analyses provided in sl1p
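To illustrate the kind of measures behind such diversity outputs, the sketch below computes two standard examples: Shannon diversity (alpha diversity, within a sample) and Bray-Curtis dissimilarity (beta diversity, between samples). These are common choices in microbiome work, though the source does not name the specific metrics sl1p uses by default.

```python
# Standard diversity measures, sketched for illustration; the specific
# metrics sl1p reports are chosen by the pipeline, not shown here.
import math

def shannon(counts):
    """Shannon alpha diversity of one sample from raw OTU counts."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p) for p in props)

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two samples (beta diversity)."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    return 1 - 2 * shared / (sum(a) + sum(b))

sample1 = [10, 10, 10]   # evenly distributed OTUs -> higher alpha diversity
sample2 = [28, 1, 1]     # dominated by one OTU -> lower alpha diversity
print(round(shannon(sample1), 3))              # ln(3) ≈ 1.099
print(round(shannon(sample2), 3))
print(round(bray_curtis(sample1, sample2), 3)) # 0.6
```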

sl1p promotes reproducibility

The main goal of sl1p was to make reproducible and accurate microbiome research more accessible. sl1p produces a comprehensive logfile (Fig. 3) which outlines exactly how sl1p was called, the version of each software dependency, and how each processing step was conducted. This logfile is valuable for reproducing a given sl1p run and for understanding how small changes in the processing workflow can alter the resulting output. Further, sl1p provides an R Markdown file detailing each step taken in sl1p’s preliminary analyses of the data. Not only is this file a natural starting point for the user’s own analyses, but it also provides transparency in how the sl1p outputs are generated.

Figure 3. sl1p logfile produced after analysis


Whelan FJ & Surette MG. (2017). A comprehensive evaluation of the sl1p pipeline for 16S rRNA gene sequencing analysis. Microbiome.

Analyse co-expression gene modules with CEMiTool


Identifying changes in the expression levels of individual genes is a common analysis step after a microarray or RNA-Seq experiment. The expression levels of co-expressed genes can also be analyzed and visualized with gene co-expression networks (GCNs), undirected graphs that represent co-expression relationships between pairs of genes across samples.

Dr. Helder Nakaya from the University of São Paulo has recently developed CEMiTool, an easy-to-use method to automatically run gene co-expression analyses in R. Here, he describes the features provided by CEMiTool.

Analyse your transcriptomic data for co-expression modules

The analysis of co-expression gene modules can help uncover the mechanisms underlying disease and infection. CEMiTool is a fast and easy-to-use Bioconductor package that unifies the discovery and the analysis of co-expression modules.

Among its features, CEMiTool evaluates whether modules contain genes that are over-represented in specific pathways or altered in a specific sample group. It also integrates transcriptomic data with interactome information, identifying the potential hubs in each network.

In addition, CEMiTool provides users with a novel unsupervised gene filtering method and automated parameter selection for identifying modules. The tool then reports everything in HTML pages with high-quality plots and interactive tables.

CEMiTool features

Several functions can be run independently, or all at once using the cemitool function.

Using a simple command, CEMiTool can generate a plot that displays the expression of each gene within a module:

Expression of each gene within a module

CEMiTool can also determine which biological functions are associated with a module by performing an over-representation analysis (ORA). For this command, a pathway list must be provided in the form of a GMT file:


Biological functions associated with the module.
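In principle, an over-representation test asks whether a module's overlap with a pathway gene set is larger than expected by chance, typically via a hypergeometric test. The sketch below parses the tab-separated GMT format and computes that test by hand; the pathway and gene names are hypothetical, and this is not CEMiTool's code.

```python
# Sketch of an over-representation analysis over a GMT pathway file
# (hypothetical pathway/gene names; not CEMiTool's implementation).
from math import comb

def parse_gmt(lines):
    """GMT format: one pathway per line; tab-separated fields are
    name, description, then the member genes."""
    pathways = {}
    for line in lines:
        name, _desc, *genes = line.rstrip("\n").split("\t")
        pathways[name] = set(genes)
    return pathways

def hypergeom_pvalue(module, pathway, universe):
    """P(X >= k) for the module/pathway overlap under the hypergeometric null."""
    N, K, n = len(universe), len(pathway & universe), len(module)
    k = len(module & pathway)
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(K, n) + 1)) / comb(N, n)

gmt = ["INFLAMMATION\tdesc\tg1\tg2\tg3"]            # toy one-pathway GMT file
pathways = parse_gmt(gmt)
universe = {f"g{i}" for i in range(1, 11)}          # 10 genes measured in total
module = {"g1", "g2", "g4"}                         # 3-gene module, overlap k=2
p = hypergeom_pvalue(module, pathways["INFLAMMATION"], universe)
print(round(p, 4))  # → 0.1833  (= 22/120)
```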

Finally, interaction data, such as protein-protein interactions, can be visualized in annotated module graphs:

Annotated graph showing interactions within a module.

Overall, CEMiTool provides the following benefits:

  • An easy-to-use package that automates the entire module discovery process, including gene filtering and functional analyses, within a single R function (cemitool)
  • Comprehensive modular analysis
  • A fully automated process

A comprehensive instruction guide for CEMiTool is provided on Bioconductor: Link


Russo P, Ferreira G, Bürger M, Cardozo L and Nakaya H (2017). CEMiTool: Co-expression Modules identification Tool. R package version 1.1.1.


MEDLINE queries made easy with the MEDOC tool


MEDOC (MEdline DOwnloading Contrivance) is a Python program designed to download data from MEDLINE over FTP and to load all extracted information into a local MySQL database, thus making MEDLINE searches easy.

MEDLINE, the biomedical data keeper

Since MEDLINE’s database was released almost 50 years ago, the number of indexed publications has risen from 1 million in 1970 to 27 million this year. The aim of this repository is to facilitate access to the scientific literature for everyone.

Evolution of the number of documents on PubMed

The NIH (National Institutes of Health, USA) also provides a powerful search engine, which allows users to query this database through the well-known web interface PubMed. This search engine supports complex queries using logical operators (OR, AND) and indexes different text blocks (such as title and abstract) for refined searches. Moreover, different API services have been released to allow routine searches, programmatic parsing of the results, and data extraction.

However, to query these APIs (eUtilities), the user needs to write a different script for every search (which can become time-consuming when many datasets requiring different parsing are needed) and to query the API many times to retrieve individual data from a single article.
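The per-search scripting this refers to starts with building a request against the eUtilities `esearch` endpoint; each new question means a new request and new result parsing. A minimal example of just constructing the request URL for a boolean PubMed query (no network call is made, and the query term is only an example):

```python
# Building an eUtilities esearch request URL for PubMed.
# Each distinct search needs its own request and its own result parsing.
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_url(term, retmax=10):
    """Return the esearch URL for a boolean PubMed query."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS + "?" + urlencode(params)

url = esearch_url("antioxidants[Title] AND 2017[PDAT]")
print(url)
```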

To make data mining easier, the NIH now allows MEDLINE’s data to be downloaded from an FTP server containing XML-tagged files.

Relational database to the rescue

Even though NoSQL databases have been on the rise in recent years, a local, relational version of the MEDLINE database is useful for complex and frequent queries. The idea behind MEDOC was thus to build a relational schema and load the XML files into this MySQL version.


The figure above presents every step executed by the Python 3 wrapper to construct this local database. Thirteen tables were created to store all data contained in the XML files extracted from the NIH FTP (authors, chemical products, MeSH terms, corrections, citation subsets, publication types, languages, grants, data banks, personal name subjects, other IDs, and investigators).

Example query

It took 113 hours (4 days and 17 hours) for MEDOC to load the 1,174 files contained on the FTP into the MySQL database (representing 61.3 GB of disk space used).

Querying this version is almost instantaneous, even when joining several tables. In the example provided below, the 10 most recent publications about antioxidants indexed in PubMed were retrieved with SQL queries.


The following result was returned in 0.022 seconds:

Result obtained with the example query
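A miniature version of this kind of query can be sketched against a hypothetical two-table schema. MEDOC's real schema has 13 tables and runs on MySQL; the table and column names below are illustrative only, using SQLite so the example is self-contained.

```python
# Sketch of a "10 most recent articles about a MeSH term" query against a
# hypothetical two-table schema (MEDOC's actual schema has 13 MySQL tables;
# names here are illustrative only).
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE article (pmid INTEGER PRIMARY KEY, title TEXT, pub_date TEXT);
    CREATE TABLE mesh (pmid INTEGER, term TEXT);
""")
con.executemany("INSERT INTO article VALUES (?, ?, ?)", [
    (1, "Antioxidants in aging", "2017-09-01"),
    (2, "Unrelated study", "2017-08-15"),
    (3, "Dietary antioxidants", "2017-07-20"),
])
con.executemany("INSERT INTO mesh VALUES (?, ?)", [
    (1, "Antioxidants"), (2, "Genomics"), (3, "Antioxidants"),
])

# Join articles to their MeSH terms, newest first, capped at 10 rows.
rows = con.execute("""
    SELECT a.pmid, a.title
    FROM article a JOIN mesh m ON a.pmid = m.pmid
    WHERE m.term = 'Antioxidants'
    ORDER BY a.pub_date DESC
    LIMIT 10
""").fetchall()
print(rows)  # → [(1, 'Antioxidants in aging'), (3, 'Dietary antioxidants')]
```

Because every field lives in an indexed relational table, refining the question (another term, a date range, an author join) is a matter of editing the SQL rather than re-scripting an API workflow.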

In summary, this indexed relational database allows the user to build complex and rapid queries. All fields can thus be searched for desired information, a task that is difficult to accomplish through the PubMed graphical interface. MEDOC is free and publicly available on GitHub.