Your top 3 RNA-seq read alignment tools


RNA-sequencing (RNA-seq) is currently the leading technology for transcriptome analysis. RNA-seq has a wide range of applications, from the study of alternative gene splicing and post-transcriptional modifications to the comparison of relative gene expression between different biological samples.

To help you perform your RNA-seq experiments in the best conditions, we are continuing our series of surveys by asking the OMICtools community to choose their favorite analysis tools step by step.

Mapping reads to a reference genome

After a first quality control step (see the previous blog post), the next step in the analysis of your RNA-seq experiment is the alignment of reads to a reference genome or a transcriptome database.

There are two types of aligners: splice-unaware and splice-aware. Splice-unaware aligners can align continuous reads to a reference genome, but are not aware of exon/intron junctions. In RNA-sequencing, their use is therefore limited to the analysis of expression of known genes, or to alignment against a transcriptome. Splice-aware aligners, on the other hand, map reads across exon/intron junctions and are therefore used for discovering new splice forms, as well as for analyzing gene expression levels.
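One practical way to see the difference is in the alignment records themselves. In the SAM format, a spliced alignment carries an `N` operation in its CIGAR string, marking reference bases skipped over an intron. The short Python sketch below extracts intron coordinates from a CIGAR string; the function name and the two example alignments are made up for illustration, and real pipelines would normally rely on a dedicated SAM/BAM library instead.

```python
import re

# CIGAR operations that consume reference bases, per the SAM specification.
REF_CONSUMING = set("MDN=X")


def splice_junctions(pos, cigar):
    """Return the introns implied by one alignment.

    `pos` is the 1-based leftmost reference position of the alignment.
    Each intron is reported as (first_base, last_base), 1-based inclusive.
    """
    junctions = []
    ref = pos
    for length, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar):
        length = int(length)
        if op == "N":
            # N = region skipped on the reference, i.e. an intron for RNA-seq.
            junctions.append((ref, ref + length - 1))
        if op in REF_CONSUMING:
            ref += length
    return junctions


# Hypothetical alignments, for illustration only.
print(splice_junctions(1000, "100M"))         # [] (contiguous read)
print(splice_junctions(1000, "60M2500N40M"))  # [(1060, 3559)]
```

A splice-unaware aligner would never produce the second record: it would either fail to place the read or clip one of its two halves.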

With that in mind, we asked OMICtools members to vote for their favorite read alignment tools (among splice-aware and splice-unaware aligners). Here are the results of the survey.

Your number 1 read aligner: STAR

Though it did not appear in the original survey, so many of you mentioned this tool that we thought it deserved the top spot!

Spliced Transcripts Alignment to a Reference (STAR) is a standalone software that uses sequential maximum mappable seed search followed by seed clustering and stitching to align RNA-seq reads. It is able to detect canonical junctions, non-canonical splices, and chimeric transcripts.
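To make the seed search more concrete, here is a toy Python sketch of a maximal mappable prefix search. It is not STAR's implementation: STAR searches uncompressed suffix arrays, whereas this sketch uses a naive string search. The sequences, the function names and the rule of skipping an unmappable base are all assumptions made for illustration.

```python
def maximal_mappable_prefix(read, genome, start=0):
    """Longest prefix of read[start:] that occurs exactly in `genome`.

    Returns (length, genome_position); length is 0 if not even the first
    base can be found. Naive search, for illustration only.
    """
    best_len, best_pos = 0, -1
    length = 1
    while start + length <= len(read):
        pos = genome.find(read[start:start + length])
        if pos == -1:
            break  # a longer prefix cannot match either
        best_len, best_pos = length, pos
        length += 1
    return best_len, best_pos


def seed_read(read, genome):
    """Cut a read into consecutive seeds by repeating the prefix search.

    Each seed is (read_offset, length, genome_position).
    """
    seeds, start = [], 0
    while start < len(read):
        length, pos = maximal_mappable_prefix(read, genome, start)
        if length == 0:
            start += 1  # simplification: skip a base that maps nowhere
            continue
        seeds.append((start, length, pos))
        start += length
    return seeds


# Toy genome: two exons separated by an intron with GT...AG boundaries.
exon1, intron, exon2 = "ACGTACGGTC", "GTAAGTTTTTTTTCAG", "CTAGGCATTA"
genome = exon1 + intron + exon2
read = exon1[-6:] + exon2[:6]  # a read spanning the exon-exon junction

print(seed_read(read, genome))  # [(0, 6, 4), (6, 6, 26)]
```

The two seeds end and restart exactly at the exon boundaries, 16 bases apart on the genome. Clustering and stitching such seeds is what turns them into a single spliced alignment.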

The main advantages of STAR are its high speed, accuracy, and efficiency (Engström et al.).

Figure: Schematic representation of the Maximum Mappable Prefix search in the STAR algorithm for detecting (a) splice junctions, (b) mismatches and (c) tails.

STAR is implemented as standalone C++ code and is freely available at https://github.com/alexdobin/STAR/releases.

Your second favorite tool: TopHat

54% of you chose TopHat as your favorite RNA-seq aligner.

TopHat aligns RNA-seq reads to mammalian-sized genomes by first mapping them with the short-read aligner Bowtie, and then analyzing the mapping results to discover RNA splice sites de novo.

Figure: The TopHat pipeline. RNA-Seq reads are mapped against the whole reference genome, and those reads that do not map are set aside.

TopHat has been widely used in RNA-seq protocols and is often paired with the software Cufflinks for a full analysis of sequencing data (Trapnell et al.). Initially launched in 2009, TopHat was updated to TopHat2 in 2013, and is now being progressively replaced by HISAT.

Bronze medal for HISAT

HISAT completes the podium, chosen by 30% of voters.

HISAT (and its newer version HISAT2) is the next generation of spliced aligner from the same group that developed TopHat.

HISAT uses an indexing scheme based on the Burrows-Wheeler transform and the Ferragina-Manzini (FM) index, employing two types of indexes for alignment: a whole-genome FM index to anchor each alignment and numerous local FM indexes for very rapid extensions of these alignments.

HISAT's most interesting features include its high speed and its low memory requirements.
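For readers who have never met an FM index, the toy Python sketch below builds one from the Burrows-Wheeler transform of a short sequence and uses backward search to locate exact matches. It illustrates a single small index only, not HISAT's hierarchy of global and local indexes, and it stores everything uncompressed; the function names and the example sequence are assumptions chosen for illustration.

```python
def build_fm_index(text):
    """Build a tiny, uncompressed FM index for `text`."""
    text += "$"  # sentinel, sorts before A, C, G and T
    # Suffix array by plain sorting: fine for a toy, far too slow for genomes.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Burrows-Wheeler transform: the character preceding each sorted suffix.
    bwt = "".join(text[i - 1] for i in sa)

    # first[c]: number of characters in the text strictly smaller than c.
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]

    # occ[c][i]: number of occurrences of c in bwt[:i].
    occ = {ch: [0] * (len(bwt) + 1) for ch in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (1 if c == ch else 0)
    return sa, first, occ


def fm_search(pattern, sa, first, occ):
    """Return the sorted 0-based positions where `pattern` occurs."""
    lo, hi = 0, len(sa)  # half-open range of matching suffixes
    for ch in reversed(pattern):  # backward search: last character first
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


sa, first, occ = build_fm_index("ACGTACGGTC")
print(fm_search("ACG", sa, first, occ))  # [0, 4]
print(fm_search("TT", sa, first, occ))   # []
```

Each character of the pattern costs only two table lookups, whatever the size of the indexed text, which is why FM-index-based aligners can be both fast and memory-efficient once the index is compressed.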

Figure: Alignment speed of spliced alignment software for 20 million simulated 100-bp reads.

HISAT is open-source software freely available at http://www.ccb.jhu.edu/software/hisat/.

References:

Pär G Engström et al. (2013). Systematic evaluation of spliced alignment programs for RNA-seq data. Nature Methods.

Cole Trapnell et al. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics.

Alexander Dobin et al. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics.

Daehwan Kim et al. (2015). HISAT: a fast spliced aligner with low memory requirements. Nature Methods.

Link splice-isoform expression to cancer metabolism with GEMsplice


Metabolic models rely on gene and protein expression to estimate or predict a metabolic cell phenotype. In the case of cancer, it is now accepted that metabolic dysregulation plays a crucial role in cancer onset and proliferation. However, most metabolic models rely only on gene expression, and do not account for splice-isoform expression and/or alteration.

To fill this gap, Claudio Angione developed GEMsplice, a desktop application that links splice-isoform gene expression data to cancer metabolism. Here, he describes the features and benefits of GEMsplice.

Filling the gap in cancer metabolism models

Despite being often perceived as the main contributors to cell fate and physiology, genes alone cannot predict the cellular phenotype. A genome-scale analysis of cancer metabolism captures many effects that cannot be identified using standard transcriptomic analysis.

However, although metabolic models have been successfully integrated with transcriptomic data to provide a mechanistic link between genotype and phenotype in cancer, there is no method for integration of splice isoform expression levels into such models. As a result, transcriptomic data in metabolic models can only be integrated at the gene level. Expression data at the splice-isoform level is currently neglected or simply averaged within the same gene to approximate the expression at the gene level.
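A small numerical example shows what is lost when isoforms are averaged. In the Python sketch below, the gene, the isoform identifiers and the expression values are entirely hypothetical; the point is only that two samples with opposite isoform usage can look identical at the gene level.

```python
from statistics import mean

# Hypothetical transcript-level expression (arbitrary units) for one gene.
normal = {"GENE1-201": 90.0, "GENE1-202": 10.0}
tumor = {"GENE1-201": 10.0, "GENE1-202": 90.0}


def gene_level(isoforms):
    """Collapse isoforms to one gene-level value by simple averaging."""
    return mean(isoforms.values())


def isoform_fractions(isoforms):
    """Share of the gene's total expression carried by each isoform."""
    total = sum(isoforms.values())
    return {name: value / total for name, value in isoforms.items()}


print(gene_level(normal), gene_level(tumor))  # 50.0 50.0
print(isoform_fractions(normal))  # {'GENE1-201': 0.9, 'GENE1-202': 0.1}
print(isoform_fractions(tumor))   # {'GENE1-201': 0.1, 'GENE1-202': 0.9}
```

If the two isoforms encode enzymes with different activity, a gene-level model sees no change between the samples, while an isoform-aware model can.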

This issue has been outlined in a number of recent reviews, and acknowledged by the scientific community as one of the main issues of metabolic modelling. In fact, the incorporation of splice isoforms is needed to understand complex diseases like cancer, where alternative splicing plays a crucial role.

GEMsplice features

GEMsplice is the first method for the incorporation of splice-isoform expression data into genome-scale metabolic models. It is validated by generating cancer-versus-normal predictions on metabolic pathways and by comparing them with available literature on pathways affected by breast cancer.

GEMsplice takes gene-level and transcript-level expression information and incorporates both into the model (Figure 1). As a result, it exploits the full potential of next-generation sequencing in the context of genome-scale metabolic reconstructions. A set of phenotype-specific RNA-Seq transcript expression levels in a variety of breast cancer types and stages from the Cancer RNA-Seq Nexus dataset (Li et al., 2016), including data from TCGA, GEO and SRA, is then mapped onto the model using constraint-based modeling. Cancer-specific metabolic models are finally generated and investigated using multilevel linear programming, leading to phenotype prediction for different types of breast cancer (Figure 1).

Figure 1: GEMsplice incorporates RNA-Seq data into genome-scale metabolic models at the splice-isoform level.

GEMsplice is freely available for academic use on GitHub.

Compared with state-of-the-art methods, GEMsplice enables, for the first time, computational analyses of metabolism at the transcript level with splice-isoform resolution.

References:

Claudio Angione. (2018). Integrating splice-isoform expression into genome-scale models characterizes breast cancer metabolism. Bioinformatics.

Towards standardized protocols for microbiome studies


The study of the microbiome – that is, the ensemble of microbial communities living inside us – has become a major application for high-throughput DNA sequencing. Functional changes in the composition of the gut microbiome have been implicated in multiple human diseases.

Due to its complexity, the analysis of sequencing data from microbiome studies typically involves many different protocols and bioinformatics tools. From sample collection and DNA extraction to sequencing and computational analysis, technical errors and bias can occur at each step, making the standardization of protocols a complex task.

To this end, two consortia recently set out to examine the sources of inter-laboratory variability in various aspects of microbiome data generation. Their work was published in the latest issue of Nature Biotechnology.

Figure: Steps in the microbiome data generation process and technical sources of error and bias at each step. From Gohl D. M. (2017).

The Microbiome Quality Control (MBQC) project consortium

The first study, published by Sinha et al. and the MBQC consortium, focuses on two main sources of variation: data handling (extraction, amplification, and sequencing) and bioinformatics processing. To assess these potential sources of bias, they sent human samples to 15 laboratories and subjected the resulting dataset to analysis by 9 bioinformatics protocols.

Figure: Microbiome Quality Control Project baseline study design.

DNA extraction and library preparation showed the highest degree of variation among laboratories, while different bioinformatics analyses introduced little variability.

The authors also provide guidelines for optimal use of bioinformatics protocols to mitigate this variability, such as using relative (rather than absolute) diversity measures, phylogenetic (rather than taxonomic) analyses, and quantitative (rather than presence/absence-based) measures.

The next phase of the MBQC project will consist of carrying out systematic surveys of microbiome assay protocols, with the goal of establishing a shared library of positive- and negative-control standards for different microbial habitats.
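To illustrate why quantitative measures tend to be more robust than presence/absence ones, the Python sketch below compares a Bray-Curtis dissimilarity computed on relative abundances with a Jaccard distance computed on presence/absence. The read counts are invented, and these two measures are simply common textbook choices; they are not necessarily the ones evaluated by the MBQC authors.

```python
def relative_abundance(counts):
    """Convert raw read counts to proportions that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def bray_curtis(a, b):
    """Quantitative dissimilarity between two abundance profiles (0 to 1)."""
    taxa = set(a) | set(b)
    num = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in taxa)
    den = sum(a.get(t, 0) + b.get(t, 0) for t in taxa)
    return num / den


def jaccard_distance(a, b):
    """Presence/absence distance: 1 - shared taxa / all observed taxa."""
    present_a = {t for t, n in a.items() if n > 0}
    present_b = {t for t, n in b.items() if n > 0}
    return 1 - len(present_a & present_b) / len(present_a | present_b)


# Hypothetical counts for the same sample processed by two laboratories.
# Lab B picks up a single read from one extra taxon.
lab_a = {"Bacteroides": 700, "Faecalibacterium": 250, "Bifidobacterium": 50}
lab_b = dict(lab_a, Akkermansia=1)

quantitative = bray_curtis(relative_abundance(lab_a), relative_abundance(lab_b))
binary = jaccard_distance(lab_a, lab_b)
print(round(quantitative, 4))  # 0.001
print(binary)                  # 0.25
```

A single stray read barely moves the quantitative measure, but it changes the presence/absence distance by a quarter, which is the kind of technical noise the guidelines aim to dampen.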

Assessing DNA extraction protocols

In a second study, Costea et al. tested 21 representative DNA extraction protocols on the same fecal samples and quantified differences in terms of microbial community composition.

The authors identified three protocols that performed better in yield and integrity of the extracted DNA, and in their ability to represent hard-to-extract Gram-positive species.

The details of these protocols, together with standard methods for sample and library preparation, can be found here: http://www.microbiome-standards.org/

DNA extraction methods were the sole source of variation investigated in this study. Nonetheless, it provides a benchmark for the future development of new extraction methods, as well as a set of recommendations to improve cross-study comparability.

Together, these works underline the need to work towards uniform and standardized protocols for microbiome studies.

References: