Your top 3 RNA-seq read alignment tools

RNA-sequencing (RNA-seq) is currently the leading technology for transcriptome analysis. RNA-seq has a wide range of applications, from the study of alternative gene splicing and post-transcriptional modifications to the comparison of relative gene expression between different biological samples.

To help you perform your RNA-seq experiments in the best conditions, we are continuing our series of surveys by asking the OMICtools community to choose their favorite analysis tools step by step.

Mapping reads to a reference genome

After a first step of quality control (covered in the previous blog post), the next step in the analysis of your RNA-seq experiment is the alignment of reads to a reference genome or a transcriptome database.

There are two types of aligners: splice-unaware and splice-aware. Splice-unaware aligners can align continuous reads to a reference genome, but are not aware of exon/intron junctions. Therefore, in RNA-sequencing, their use is limited to the analysis of expression of known genes, or to alignment against a transcriptome. Splice-aware aligners, on the other hand, map reads across exon/intron junctions and are therefore used for discovering new splice forms, as well as for analyzing gene expression levels.
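To make the distinction concrete, here is a minimal Python sketch of how a spliced alignment can be recognized in aligner output. It assumes the output is in SAM format, where the CIGAR operation `N` marks a skipped reference region (an intron). The function names are hypothetical and chosen only for illustration.

```python
import re

# SAM CIGAR operations; "N" marks a skipped reference region, which
# splice-aware aligners use to represent an intron.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]


def intron_lengths(cigar):
    """Return the lengths of all skipped regions (introns) in an alignment."""
    return [n for n, op in parse_cigar(cigar) if op == "N"]


def is_spliced(cigar):
    """True if the alignment crosses at least one exon/intron junction."""
    return bool(intron_lengths(cigar))
```

A read reported as `50M1200N50M` spans a 1200 bp intron, whereas `100M` is a continuous alignment of the kind a splice-unaware aligner produces.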

With that in mind, we asked OMICtools members to vote for their favorite read alignment tools (among splice-aware and splice-unaware aligners). Here are the results of the survey.

Your number 1 read aligner: STAR

Though it did not appear in the original survey, so many of you mentioned this tool that we thought it deserved the top spot!

Spliced Transcripts Alignment to a Reference (STAR) is a standalone software that uses sequential maximum mappable seed search followed by seed clustering and stitching to align RNA-seq reads. It is able to detect canonical junctions, non-canonical splices, and chimeric transcripts.
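The seed search idea can be illustrated with a toy Python sketch. This is not STAR's implementation: STAR searches uncompressed suffix arrays, whereas the code below uses a naive substring scan and exact matches only, and skips the clustering and stitching steps. All names and the example sequences are invented for illustration.

```python
def longest_mappable_prefix(read, genome, start):
    """Return (length, genome_position) of the longest prefix of read[start:]
    that occurs exactly in genome. Naive scan, for illustration only."""
    best_len, best_pos = 0, -1
    length = 1
    while start + length <= len(read):
        pos = genome.find(read[start:start + length])
        if pos == -1:
            break  # a longer prefix cannot match if this one does not
        best_len, best_pos = length, pos
        length += 1
    return best_len, best_pos


def sequential_mmp(read, genome):
    """Cover the read with consecutive maximal exact seeds.

    Each seed is (read_offset, seed_length, genome_position). After one seed
    ends, the search restarts from the first unmapped base of the read."""
    seeds = []
    start = 0
    while start < len(read):
        length, pos = longest_mappable_prefix(read, genome, start)
        if length == 0:
            start += 1  # base absent from the genome: skip it
            continue
        seeds.append((start, length, pos))
        start += length
    return seeds


# Toy genome: exon 1, a 14 bp "intron", exon 2 (hypothetical sequences).
genome = "TT" + "GATTACAGG" + "GTAAGTTTTTTCAG" + "CCTGAACTT" + "AA"
read = "TACAGG" + "CCTGAA"  # last 6 bases of exon 1 + first 6 of exon 2
seeds = sequential_mmp(read, genome)
```

In this toy case the read splits into two seeds whose genome positions are separated by the intron, which is the signal a stitching step would interpret as a splice junction.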

The main advantages of STAR are its high speed, accuracy, and efficiency (Engström et al.).

Figure: Schematic representation of the Maximum Mappable Prefix search in the STAR algorithm for detecting (a) splice junctions, (b) mismatches and (c) tails.

STAR is implemented as standalone C++ code and is freely available at https://github.com/alexdobin/STAR/releases.

Your second favorite tool: TopHat

54% of you chose TopHat as your favorite RNA-seq aligner.

TopHat aligns RNA-seq reads to mammalian-sized genomes using the short read aligner Bowtie, and then analyzes the mapping results to discover RNA splice sites de novo.

Figure: The TopHat pipeline. RNA-Seq reads are mapped against the whole reference genome, and those reads that do not map are set aside.

TopHat has been widely used in RNA-seq protocols and is often paired with the software Cufflinks for a full analysis of sequencing data (Trapnell et al.). Initially launched in 2009, TopHat was updated to TopHat2 in 2013, and has since been progressively replaced by HISAT.

Bronze medal for HISAT

We finish our podium with HISAT, chosen by 30% of voters.

HISAT (and its newer version HISAT2) is the next generation of spliced aligner from the same group that developed TopHat.

HISAT uses an indexing scheme based on the Burrows-Wheeler transform and the Ferragina-Manzini (FM) index, employing two types of indexes for alignment: a whole-genome FM index to anchor each alignment and numerous local FM indexes for very rapid extensions of these alignments.

HISAT's most interesting features include its high speed and its low memory requirement.
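For readers unfamiliar with FM indexes, the following Python sketch shows the basic principle on a tiny sequence: build the Burrows-Wheeler transform from a suffix array, then locate a pattern by backward search. It is a teaching sketch under simplifying assumptions (one uncompressed index, exact matching, full occurrence tables) and does not reproduce HISAT's global/local index layout. All function names are hypothetical.

```python
from collections import Counter


def bwt_via_suffix_array(text):
    """Return (bwt, suffix_array) for text + '$' using a naive sort."""
    text += "$"  # sentinel, sorts before A, C, G and T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    return bwt, sa


def build_fm_index(text):
    """Build the tables needed for backward search."""
    bwt, sa = bwt_via_suffix_array(text)
    counts = Counter(bwt)
    # first[c]: number of characters in the text strictly smaller than c
    first, total = {}, 0
    for c in sorted(counts):
        first[c] = total
        total += counts[c]
    # occ[c][i]: number of occurrences of c in bwt[:i]
    occ = {c: [0] * (len(bwt) + 1) for c in counts}
    for i, ch in enumerate(bwt):
        for c in occ:
            occ[c][i + 1] = occ[c][i] + (ch == c)
    return first, occ, sa


def find_occurrences(pattern, index):
    """Return sorted start positions of exact matches of pattern."""
    first, occ, sa = index
    lo, hi = 0, len(sa)  # half-open range of matching suffixes
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


index = build_fm_index("GATTACAGATTACA")
hits = find_occurrences("TTA", index)
```

Each step of the backward search narrows the range of matching suffixes using only table lookups, which is why a small local index can be searched very quickly once an anchor position is known.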

Figure: Alignment speed of spliced alignment software for 20 million simulated 100-bp reads.

HISAT is open-source software freely available at http://www.ccb.jhu.edu/software/hisat/.

References:

Pär G Engström et al. (2013). Systematic evaluation of spliced alignment programs for RNA-seq data. Nature Methods.

Cole Trapnell et al. (2009). TopHat: discovering splice junctions with RNA-Seq. Bioinformatics.

Alexander Dobin et al. (2013). STAR: ultrafast universal RNA-seq aligner. Bioinformatics.

Daehwan Kim et al. (2015). HISAT: a fast spliced aligner with low memory requirements. Nature Methods.

Simulate immune responses to vaccines with C-IMMSIM

Because of the diversity of immune repertoires, it is very challenging to predict how well a vaccine will stimulate all components of the immune system and ultimately protect against infectious diseases, that is, its immunogenicity.

To aid researchers in the design and setup of their vaccination/infection protocols, Dr. Filippo Castiglione and colleagues from the Institute for Applied Computing in Rome have developed C-IMMSIM. Here, he talks about the features and benefits of his tool.

A novel and free tool to perform in silico experiments about vaccinations and/or infections

To date, this is the only simulation tool for the immune response which combines epitope/peptide prediction algorithms with agent-based methodology to predict the follow-up of virtual infection experiments.

It represents the immune system through its most fundamental, universal requirements: diversity in all specific repertoires (the model is polyclonal), stochastic cell encounters and cooperation, an affinity potential defined on the basis of amino-acid avidity, clone division and immune memory (i.e., the clonal selection theory), specific controls of innate and adaptive immunity to avoid self-reactions, thymus selection, and the concept of danger, among others.

Figure: Overall architecture of the simulation tool

Predicting tailored responses to the antigen

The user can specify the antigen to be injected in terms of the primary structures of its constituent proteins, that is, their linear sequences of amino acids. The model also allows a certain degree of “patient” specificity by indicating the Major Histocompatibility Complexes (both class I and II).

The algorithm identifies the portions of the linear sequences composing the antigen that are likely to be seen by the immune system. In other words, the algorithm detects the B-cell epitopes and the T-cell peptides and assigns each a score, which is then used throughout the simulation to identify the immunogenic portion of the antigen.

Figure2-C-IMMSIM-Omictools
Simulation of an immunization experiment

The user can choose to inject a vaccine, a bacterium or a virus. These kinds of antigens can also be combined to simulate, for instance, vaccination and challenge, or combined infections by different pathogens. The simulated volume and the time horizon can be chosen as well.

The outcome is a detailed description of the epitopes and peptides used to mount the specific immune response, and a series of plots showing time-dependent cell counts and cytokine concentrations.

Since the web tool exposes only a small fraction of the toggles that can be used to specify the characteristics of both antigen and immune system, the tool could appear somewhat limited to some users. In this case, one could contact the author (me) to start a collaboration on a specific, well-reasoned scientific question.

Challenges to face

The scientific challenge this simulation tool addresses is simulating the immune response to vaccination while including the elements of the immune system relevant to vaccine design (such as the use of specific adjuvants).

Another challenge is to use the simulator as an alternative to animal models for comparing vaccine formulations, to highlight the strengths and weaknesses of this approach, and to identify points of intervention that would increase the biological fidelity of the results.

Conclusion

The combination of genomic information and simulation of the dynamics of the immune system, in one single tool, can offer new perspectives for a better understanding of the immune system.

Predict transcription factor binding from DNase footprints with Sasquatch

Predicting the impact of regulatory sequence variation on transcription factor (TF) binding is an important challenge, as the vast majority of disease-associated SNPs are found in the non-coding genome (Vaquerizas, Kummerfeld, Teichmann, & Luscombe, 2009). Most existing approaches rely on large catalogs of cell-type- and TF-specific functional annotations. As only a minority of TFs is well characterized (Maurano et al., 2015; Rockman & Kruglyak, 2006), identifying the relevant factors and probing them in the appropriate cell types represents a major limitation of TF-centric approaches.

With this in mind, Ron Schwessinger and colleagues from the University of Oxford have developed the Sasquatch tool, which uses DNase footprinting data to estimate and visualize the effects of non-coding variants on TF binding. Here, they talk about the features and benefits of their tool.

Sasquatch – predicting TF binding from average, k-mer based DNase footprints

DNase I cuts the genome preferentially in accessible regions, which are associated with regulatory function. By mapping only the cut sites instead of entire fragments, DNase-seq can reveal protein occupancy at base-pair resolution. Sasquatch analyses DNase footprints to comprehensively determine any k-mer’s potential for cell-type-specific TF binding in the context of open chromatin, and how this may be changed by sequence variants. Sasquatch is an unbiased approach: it is independent of known TF binding sites and motifs, and requires only a single DNase-seq dataset per cell type.

Probe TF binding potential

Querying any k-mer from the tissue repository retrieves the relative DNase cut profile over a 250 bp window around that k-mer. By automatically detecting shoulders and footprints in the profile, Sasquatch quantifies the Shoulder-to-Footprint Ratio (SFR) and thus the average protein occupancy of that k-mer within open chromatin. The SFR is cell-type specific: tissue-specific TFs only yield a footprint and a high SFR in relevant cell types, while housekeeping TFs score consistently high SFRs.
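The idea behind a shoulder-to-footprint ratio can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the footprint boundaries and shoulder width are supplied by hand rather than detected automatically, and the ratio is taken between mean cut frequencies, which may differ from the exact definition used by Sasquatch. The function name and arguments are hypothetical.

```python
def shoulder_to_footprint_ratio(profile, footprint, shoulder_width):
    """Ratio of mean cut frequency in the shoulders to that in the footprint.

    profile        -- list of relative cut frequencies, one per base pair
    footprint      -- (start, end) half-open interval of the protected region
    shoulder_width -- number of positions taken on each side of the footprint
    """
    start, end = footprint
    left = profile[max(0, start - shoulder_width):start]
    right = profile[end:end + shoulder_width]
    shoulders = left + right
    core = profile[start:end]
    if not shoulders or not core:
        raise ValueError("footprint and shoulders must lie inside the profile")
    shoulder_mean = sum(shoulders) / len(shoulders)
    footprint_mean = sum(core) / len(core)
    if footprint_mean == 0:
        return float("inf")  # fully protected footprint
    return shoulder_mean / footprint_mean


# Toy profile: a protected dip at positions 4-5 flanked by high-cut shoulders.
toy_profile = [1, 1, 4, 4, 1, 1, 4, 4, 1, 1]
toy_sfr = shoulder_to_footprint_ratio(toy_profile, (4, 6), 2)
```

With this definition, a flat profile gives a ratio of 1, and the ratio grows as the footprint becomes deeper relative to its shoulders.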

Predict impact of sequence variation

Comparing the footprint profiles of a reference and a variant sequence yields a total and a relative damage score for a particular variant. To do this, Sasquatch compares the SFRs using a sliding window approach. Variants with high damage scores are predicted to strongly alter TF binding potential in open chromatin in a cell-type-specific manner.
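A sliding window comparison of this kind might look like the Python sketch below. It assumes that an SFR is already available for each k-mer in a lookup table, and it sums the SFR differences over every k-mer window that overlaps the variant. Both the summation and the default value for unseen k-mers are illustrative assumptions, not the published scoring formula, and all names are hypothetical.

```python
def kmers_overlapping(sequence, position, k):
    """All k-mers of `sequence` that contain index `position`."""
    first = max(0, position - k + 1)
    last = min(position, len(sequence) - k)
    return [sequence[i:i + k] for i in range(first, last + 1)]


def damage_score(reference, position, alt_base, sfr_lookup, k, default=1.0):
    """Sum of (reference SFR - variant SFR) over all windows covering the variant.

    sfr_lookup maps k-mers to precomputed SFR values (hypothetical table);
    k-mers missing from the table get `default`."""
    variant = reference[:position] + alt_base + reference[position + 1:]
    ref_kmers = kmers_overlapping(reference, position, k)
    var_kmers = kmers_overlapping(variant, position, k)
    return sum(
        sfr_lookup.get(ref, default) - sfr_lookup.get(var, default)
        for ref, var in zip(ref_kmers, var_kmers)
    )


# Toy example: only the k-mer "GTA" shows a footprint (SFR of 3.0).
toy_lookup = {"GTA": 3.0}
toy_damage = damage_score("ACGTACG", 3, "A", toy_lookup, k=3)
```

In the toy example, the substitution destroys the only k-mer with a footprint, so the score is positive; a variant that creates a footprinted k-mer would give a negative score under this convention.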

Prioritize non-coding sequence variants

By querying sequence variants in batch mode, Sasquatch can quickly prioritize thousands of variants by their potential to alter TF binding. Importantly, Sasquatch assumes an open-chromatin context. Therefore, when dealing with many variants, it is advisable to first filter them for location in potential open chromatin.

In silico mutate entire regions

By predicting the damage score of every possible base substitution at every base pair, Sasquatch can create in silico mutation plots. Peaks of high damage predict the location of likely binding sites within open chromatin that can be damaged by sequence variants. By analyzing entire genomic elements, Sasquatch can help to characterize regulatory elements.
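The enumeration behind such a plot can be sketched in Python as follows. The scoring function is deliberately left as a caller-supplied argument, because the actual damage score is computed by Sasquatch from DNase footprints. The names, and the choice to keep the highest score per position, are assumptions made for illustration.

```python
BASES = "ACGT"


def all_substitutions(sequence):
    """Yield (position, ref_base, alt_base, mutated_sequence) for every
    possible single-base substitution."""
    for pos, ref in enumerate(sequence):
        for alt in BASES:
            if alt != ref:
                yield pos, ref, alt, sequence[:pos] + alt + sequence[pos + 1:]


def mutation_profile(sequence, score):
    """Highest score per position over the three possible substitutions.

    `score(reference, mutated)` is supplied by the caller (hypothetical)."""
    profile = [0.0] * len(sequence)
    for pos, _ref, _alt, mutated in all_substitutions(sequence):
        profile[pos] = max(profile[pos], score(sequence, mutated))
    return profile


def toy_score(reference, mutated):
    """Toy scorer: 1.0 when a substitution destroys a GATA motif."""
    return 1.0 if "GATA" in reference and "GATA" not in mutated else 0.0


toy_plot = mutation_profile("CCGATACC", toy_score)
```

Plotting the resulting list against position gives a peak over the four motif bases, mirroring how peaks of high damage point to likely binding sites.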

Concluding Remarks

We implemented Sasquatch as a web tool for fast and straightforward usage (http://apps.molbiol.ox.ac.uk/sasquatch/cgi-bin/foot.cgi). To allow for more flexible and customized usage, we also made it available as an R implementation (https://github.com/Hughes-Genome-Group/sasquatch). We pre-processed all human ENCODE DNase data to supply a large repository of cell types for your analysis. Custom DNase-seq data can easily be pre-processed and run in our R implementation.

References

Maurano, M. T. et al. (2015). Large-scale identification of sequence variants influencing human transcription factor occupancy in vivo. Nature Genetics.
Rockman, M. V., and Kruglyak, L. (2006). Genetics of global gene expression. Nature Reviews Genetics.
Schwessinger, R. et al. (2017). Sasquatch: predicting the impact of regulatory SNPs on transcription factor binding from cell- and tissue-specific DNase footprints. Genome Research.
Vaquerizas, J. M., Kummerfeld, S. K., Teichmann, S. A., & Luscombe, N. M. (2009). A census of human transcription factors: function, expression and evolution. Nature Reviews Genetics.