Your top 3 RNA-seq quality control tools

RNA-sequencing (RNA-seq) is currently the leading technology for transcriptome analysis. RNA-seq has a wide range of applications, from the study of alternative splicing and post-transcriptional modifications to the comparison of relative gene expression between biological samples.

To help you prepare and analyse your RNA-seq experiments in the best conditions, we have launched a new series of surveys focused on the best tools for each fundamental step of an RNA-seq experiment.

Starting your analysis with quality control

The first step in the analysis of an RNA-seq experiment is quality control. This crucial step ensures that your data are of sufficient quality for the subsequent steps of your analysis. Quality control usually covers sequence quality, sequencing depth, read duplication rates (clonal reads), alignment quality, nucleotide composition bias, etc.
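
To make two of these metrics concrete, here is a minimal Python sketch that computes the GC content of each read and the fraction of exact duplicate reads. The function names and the toy reads are illustrative assumptions, not part of any tool discussed below, and dedicated QC tools use more refined definitions (for example, duplication estimated on a subset of reads).

```python
from collections import Counter


def gc_content(read: str) -> float:
    """Fraction of G and C bases in a read (0.0 for an empty read)."""
    if not read:
        return 0.0
    read = read.upper()
    return (read.count("G") + read.count("C")) / len(read)


def duplication_rate(reads: list[str]) -> float:
    """Fraction of reads that are exact copies of another read in the list."""
    if not reads:
        return 0.0
    counts = Counter(reads)
    # For each distinct sequence, every copy beyond the first is a duplicate.
    duplicates = sum(n - 1 for n in counts.values())
    return duplicates / len(reads)


# Toy reads (made up for illustration).
reads = ["ACGT", "ACGT", "GGCC", "ATAT"]
print([gc_content(r) for r in reads])  # [0.5, 0.5, 1.0, 0.0]
print(duplication_rate(reads))         # 0.25
```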

We therefore start this series by presenting the best QC tools, as chosen by the OMICtools community!

Your number 1 tool: NGS QC Toolkit

NGS QC Toolkit was the favorite tool for 79% of OMICtools members.

This standalone, open-source application offers several tools to quality check and filter your NGS data. The toolbox is divided into four major groups of tools:

  • Quality control tools for Illumina or Roche 454 data
  • Trimming tools
  • Format conversion tools
  • Statistics tools

All QC tools can generate graphs as outputs, as well as diverse statistics, such as average quality scores at each base position, GC content distribution, etc.
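
As an illustration of one such statistic, the sketch below computes the average quality score at each base position from FASTQ records. It is not NGS QC Toolkit code (the toolkit itself is written in Perl); it assumes four-line FASTQ records and Phred+33 quality encoding, and the function name and toy data are hypothetical.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Average Phred score at each read position.

    Assumes Phred+33 encoding by default; reads may differ in length.
    """
    totals, counts = [], []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):  # first read reaching this position
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


# Two toy FASTQ records: header, sequence, separator, quality string.
fastq_text = "@r1\nACG\n+\nII5\n@r2\nAC\n+\nI5\n"
quality_lines = fastq_text.splitlines()[3::4]  # every 4th line, starting at the 4th
print(mean_quality_per_position(quality_lines))  # [40.0, 30.0, 20.0]
```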

NGS QC Toolkit toolbox

The application can be downloaded here: Link. It runs on Windows and Linux operating systems, provided ActivePerl is installed.

Your second favorite tool: RseqFlow

RseqFlow is an RNA-seq analysis pipeline that covers pre- and post-mapping quality control, as well as other analysis steps. The pipeline is divided into four branches that can be run individually or in workflow mode.

  • Branch 1: Quality Control and SNP calling based on the merging of alignments to the transcriptome and genome.
  • Branch 2: Expression level quantification for Gene/Exon/Splice Junctions based on alignment to the transcriptome.
  • Branch 3: Some file format conversions for easy storage, backup and visualization.
  • Branch 4: Differentially expressed gene identification based on the output of the expression level quantification from Branch 2.

RseqFlow provides a downloadable Virtual Machine (VM) image managed with Pegasus, which allows users to run the pipeline easily on different computational resources. It is available here: Link

RseqFlow can also be run in a Unix shell mode that allows users to execute each branch of the analysis with a Unix command. The following software must be pre-installed: Python 2.7 or higher, R 2.11 or higher, and GCC.

Third place for Trim Galore!

Trim Galore! is a wrapper script that automates quality and adapter trimming as well as quality control, with some added functionality to remove biased methylation positions from RRBS (reduced representation bisulfite sequencing) sequence files, for directional, non-directional or paired-end sequencing.

Its main features include:

  • For adapter trimming, Trim Galore! uses the first 13 bp of Illumina standard adapters (‘AGATCGGAAGAGC’) by default (suitable for both ends of paired-end libraries), but accepts other adapter sequences, too
  • For MspI-digested RRBS libraries, Trim Galore! performs quality and adapter trimming in two subsequent steps. This allows it to remove 2 additional bases that contain a cytosine which was artificially introduced in the end-repair step during the library preparation
  • For any kind of FastQ file other than MspI-digested RRBS, Trim Galore! can perform single-pass adapter- and quality trimming
  • The Phred quality of basecalls and the stringency for adapter removal can be specified individually
  • And more…

Example of a dataset downloaded from the SRA which was trimmed with a Phred score threshold of 20

Trim Galore! is built around Cutadapt and FastQC, and thus requires both tools to be installed to function properly.
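
To give an idea of what single-pass quality and adapter trimming involves, here is a simplified Python sketch. It is not the code of Trim Galore! or Cutadapt: the quality step uses a running-sum approach of the kind described in the Cutadapt documentation, while the adapter step only looks for exact matches (real tools tolerate mismatches). The function names, the `min_overlap` parameter (standing in for the stringency setting) and its default value are assumptions made for this illustration.

```python
ADAPTER = "AGATCGGAAGAGC"  # first 13 bp of the Illumina standard adapter


def quality_trim_3prime(qualities, cutoff=20):
    """Number of bases to keep after trimming low-quality bases from the 3' end.

    Running-sum approach: walk backwards, accumulate (quality - cutoff),
    and cut where the sum is lowest; stop once the sum becomes positive.
    """
    running, best, keep = 0, 0, len(qualities)
    for i in range(len(qualities) - 1, -1, -1):
        running += qualities[i] - cutoff
        if running > 0:
            break
        if running < best:
            best, keep = running, i
    return keep


def adapter_trim_position(seq, adapter=ADAPTER, min_overlap=3):
    """Position where the adapter starts, or len(seq) if none is found.

    Simplification: exact matches only (no mismatches or indels).
    """
    pos = seq.find(adapter)
    if pos != -1:
        return pos
    # Otherwise look for a partial adapter hanging off the 3' end.
    for overlap in range(min(len(adapter), len(seq)) - 1, max(min_overlap, 1) - 1, -1):
        if seq.endswith(adapter[:overlap]):
            return len(seq) - overlap
    return len(seq)


def trim_read(seq, qualities, cutoff=20, min_overlap=3):
    """Quality-trim first, then remove the adapter (order assumed here)."""
    keep = quality_trim_3prime(qualities, cutoff)
    seq, qualities = seq[:keep], qualities[:keep]
    keep = adapter_trim_position(seq, min_overlap=min_overlap)
    return seq[:keep], qualities[:keep]


# Toy read: 8 bp insert + adapter + 2 low-quality bases.
read = "ACGTACGT" + ADAPTER + "TT"
quals = [38] * 21 + [8, 5]
print(trim_read(read, quals))
```

With a Phred threshold of 20, the two trailing low-quality bases are removed first, and the adapter is then clipped, leaving only the 8 bp insert.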

The tool is downloadable here: Link and comes with a comprehensive and illustrated User Guide.

Analyse co-expression gene modules with CEMiTool

Identifying changes in the expression levels of individual genes is a common analysis step after a microarray or RNA-Seq experiment. The expression levels of co-expressed genes can also be analyzed and visualized with gene co-expression networks (GCNs), which are undirected graphs used to represent co-expression relationships between pairs of genes across samples.
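
As a rough illustration of what a GCN is, the following Python sketch links two genes with an undirected edge when the Pearson correlation of their expression profiles across samples exceeds a threshold. This hard-threshold construction, the function names, the threshold of 0.9 and the toy expression values are all assumptions made for the example; it is not the method used by any particular package.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if one is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0
    return cov / (sx * sy)


def coexpression_edges(expression, threshold=0.9):
    """Undirected edges (gene1, gene2, r) for pairs with |r| >= threshold."""
    edges = []
    for g1, g2 in combinations(sorted(expression), 2):
        r = pearson(expression[g1], expression[g2])
        if abs(r) >= threshold:
            edges.append((g1, g2, r))
    return edges


# Toy expression matrix: gene -> expression across four samples (made up).
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.0, 4.0, 6.0, 8.5],
    "geneC": [4.0, 1.0, 3.0, 2.0],
}
for g1, g2, r in coexpression_edges(expression):
    print(g1, g2, round(r, 3))
```

Here only geneA and geneB end up connected, because their profiles rise together across the samples while geneC follows a different pattern.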

Dr. Helder Nakaya from the University of São Paulo has recently developed CEMiTool, an easy-to-use method to automatically run gene co-expression analyses in R. Here, he describes the features provided by CEMiTool.

Analyse your transcriptomic data for co-expression modules

The analysis of co-expression gene modules can help uncover the mechanisms underlying diseases and infection. CEMiTool is a fast and easy-to-use Bioconductor package that unifies the discovery and the analysis of co-expression modules.

Among its features, CEMiTool evaluates whether modules contain genes that are over-represented in specific pathways or that are altered in a specific sample group. It also integrates transcriptomic data with interactome information, identifying the potential hubs in each network.

In addition, CEMiTool provides users with a novel unsupervised gene filtering method, and automated parameter selection for identifying modules. The tool then reports everything in HTML web pages with high-quality plots and interactive tables.

CEMiTool features

Several functions can be run independently, or all at once using the cemitool function.

Using a simple command line, CEMiTool can generate a plot that displays the expression of each gene within a module:

Expression of each gene within a module

CEMiTool can also determine which biological functions are associated with the module by performing an over-representation analysis (ORA). For this command, a pathway list must be provided in the form of a GMT file:

Biological functions associated with the module.
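
For readers unfamiliar with ORA, here is a minimal Python sketch of the idea. It parses gene sets from GMT-formatted text (one tab-separated line per set: name, description, then the genes) and computes a one-sided hypergeometric p-value, a common choice for ORA. This is not CEMiTool code, and its implementation may differ; the function names, gene names and pathway below are made up for the example.

```python
from math import comb


def parse_gmt(text):
    """Parse GMT text into {set_name: set_of_genes}."""
    gene_sets = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 3:  # skip blank or malformed lines
            continue
        gene_sets[fields[0]] = set(fields[2:])
    return gene_sets


def ora_p_value(module, gene_set, universe):
    """P(overlap >= observed) under the hypergeometric distribution."""
    universe = set(universe)
    module = set(module) & universe
    gene_set = set(gene_set) & universe
    N, K, n = len(universe), len(gene_set), len(module)
    k = len(module & gene_set)
    upper = min(K, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / comb(N, n)


# Toy data: 10 background genes, one 3-gene pathway, one 4-gene module.
universe = {f"g{i}" for i in range(1, 11)}
gmt_text = "toy_pathway\tmade-up description\tg1\tg2\tg3\n"
gene_sets = parse_gmt(gmt_text)
module = {"g1", "g2", "g3", "g4"}

for name, genes in gene_sets.items():
    print(name, round(ora_p_value(module, genes, universe), 4))  # toy_pathway 0.0333
```

All three pathway genes fall in the module, which happens by chance in only 7 of the 210 possible 4-gene modules, hence the small p-value.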

Finally, interaction data, such as protein-protein interactions, can be visualized in annotated module graphs:

Annotated graph showing interactions within a module.

Overall, CEMiTool provides the following benefits:

  • An easy-to-use package that automates the entire module discovery process, including gene filtering and functional analyses, within a single R function (cemitool)
  • Comprehensive modular analysis
  • A fully automated process

A comprehensive instruction guide for CEMiTool is provided on Bioconductor: Link

Reference:

Russo P, Ferreira G, Bürger M, Cardozo L and Nakaya H (2017). CEMiTool: Co-expression Modules identification Tool. R package version 1.1.1.

Your top 3 gene clustering software tools

Clustering is a fundamental step in the analysis of biological and omics data. It is used to construct groups of objects (genes, proteins) that share related functions or expression patterns, or that are known to interact. In microarray or RNA-seq experiments, gene clustering is often combined with a heatmap representation for data visualization.

Choosing the right clustering tool for your analysis

Many clustering methods and algorithms have been developed and are classified into partitioning (k-means), hierarchical (connectivity-based), density-based, model-based and graph-based approaches.
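
As an example of the partitioning family, here is a bare-bones k-means sketch in Python. It is deliberately simplified: the initial centroids are just the first k points (real implementations use better initialization and convergence checks), and the function name and toy expression profiles are made up for illustration.

```python
def kmeans(points, k, iterations=10):
    """Very small k-means: returns (assignment, centroids).

    Simplification: the first k points serve as initial centroids.
    """
    centroids = [list(p) for p in points[:k]]
    assignment = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids


# Toy expression profiles (two conditions per gene, made up).
profiles = [[1.0, 1.1], [5.0, 5.2], [0.9, 1.0], [5.1, 4.9]]
assignment, centroids = kmeans(profiles, k=2)
print(assignment)  # [0, 1, 0, 1]
```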

To help you choose between all the existing clustering tools, we asked OMICtools members to vote for their favorite software. Here are the top 3 tools, chosen by 23 voters.

First place for ClustEval

ClustEval is a web-based clustering analysis platform developed at the Max Planck Institute for Informatics and the University of Southern Denmark. It is designed to objectively compare the performance of various clustering methods across different datasets.

More precisely, ClustEval has compared the performance of 18 of the most widely used clustering methods on 24 different datasets. These datasets include gene expression data, protein sequence similarity, protein structure similarity, social networks, word sense disambiguation, etc. The performance of a clustering method is then evaluated by an F1-score (the harmonic mean of precision and recall).
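
To show how an F1-score can be computed for a clustering result, the Python sketch below compares a predicted clustering with a gold standard by counting pairs of objects. The pair-counting definition of precision and recall is an assumption chosen for this example, since the text above does not specify how ClustEval matches clusters to reference classes; the function names and labels are also illustrative.

```python
from itertools import combinations


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def pairwise_f1(predicted, truth):
    """F1 over object pairs: a pair is 'positive' if both objects share a cluster."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_pred = predicted[i] == predicted[j]
        same_true = truth[i] == truth[j]
        if same_pred and same_true:
            tp += 1
        elif same_pred:
            fp += 1
        elif same_true:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return f1_score(precision, recall)


# Toy labels: one object out of five is placed in the wrong cluster.
truth = [0, 0, 0, 1, 1]
predicted = [0, 0, 1, 1, 1]
print(pairwise_f1(predicted, truth))  # 0.5
```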

Finally, ClustEval can be downloaded and installed by users to perform their own clustering comparisons, using a VirtualBox image, Docker and Docker Compose, or as an R package.

Performance of all clustering tools on all nonartificial data sets on the basis of F1 scores.

Second position for Babelomics

Babelomics is a web application developed by the Computational Genomics Department of the Príncipe Felipe Research Center in Valencia. It performs a wide range of functional analyses of gene expression and genomic data, from processing to expression analysis and gene set enrichment.

In its current version, Babelomics 5, the website offers a user-friendly and intuitive interface for clustering microarray or RNA-seq data using one of three methods: UPGMA, SOTA, and k-means. The result can then be visualized as a heatmap. Example datasets and analyses are provided for every functionality of the application, and tutorials are available here.

Babelomics clustering tool.

Third place for AltAnalyze

AltAnalyze is a comprehensive application for the analysis of single-cell and bulk RNA-seq data that can automate every step of gene expression and splicing analysis, including clustering and heatmap representation. It was developed in the Nathan Salomonis laboratory at Cincinnati Children’s Hospital Medical Center and the University of Cincinnati.

AltAnalyze offers many options for clustering algorithms and normalization, as well as unique features such as finding optimized clusters for single-cell analysis.

AltAnalyze can be downloaded and run on all operating systems, and comes with useful documentation (tutorials, blog, FAQ).

Heatmap and clustering generated with AltAnalyze

References

(Wiwie et al., 2015) Comparing the performance of biomedical clustering methods. Nature Methods.

(Alonso et al., 2015) Babelomics 5.0: functional interpretation for new generations of genomic data. Nucleic Acids Research.

(Emig et al., 2010) AltAnalyze and DomainGraph: analyzing and visualizing exon expression data. Nucleic Acids Research.

(Olson et al., 2016) Single-cell analysis of mixed-lineage states leading to a binary cell fate choice. Nature.