Evaluating the functional impact of genetic variants with COPE software


Dr Ge Gao, developer of the COPE-PCG software tool, talks here about his tool and how it can help researchers analyze sequencing data.

COPE: A new framework of context-oriented prediction for variant effects

Evaluating functional impacts of genetic variants is a key step in genomic studies. Whilst most popular variant annotation tools take a variant-centric strategy and assess the functional consequence of each variant independently, multiple variants in the same gene may interfere with each other and have different effects in combination than individually (e.g., a frameshift caused by an indel can be “rescued” by another downstream variant). The COPE framework, Context-Oriented Predictor for variant Effect, was developed to accurately annotate multiple co-occurring variants.

This new gene-centric annotation tool integrates the entire sequence context to evaluate the bona fide impact of multiple intra-genic variants in a context-sensitive approach.

COPE handles complex effects of multiple variants

Unlike current variant-centric approaches that assess the functional consequence of each variant independently, COPE takes each functional element as its basic annotation unit, recognizing that multiple variants in the same functional element may interfere with each other and have different effects in combination than individually (e.g., complementary rescue effects). A toy sketch of this rescue logic follows below.
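The rescue logic is easy to demonstrate. Here is a toy Python sketch (hypothetical sequences and helper function, not COPE's actual implementation): each of two indels alone shifts the reading frame, while both applied together on the same haplotype leave it intact.

    # Toy illustration: two frameshift indels on one haplotype can cancel out.
    def apply_variants(seq, variants):
        # variants: (position, ref_allele, alt_allele), 0-based;
        # applied right-to-left so earlier positions stay valid
        for pos, ref, alt in sorted(variants, reverse=True):
            assert seq[pos:pos + len(ref)] == ref
            seq = seq[:pos] + alt + seq[pos + len(ref):]
        return seq

    reference = "ATGGCTGCTGCTTAA"   # 15 bases, frame intact
    deletion  = (3, "GC", "G")      # 1-bp deletion: frameshift on its own
    insertion = (10, "C", "CT")     # 1-bp insertion: frameshift on its own

    for label, variants in [("deletion only", [deletion]),
                            ("insertion only", [insertion]),
                            ("both on one haplotype", [deletion, insertion])]:
        mutant = apply_variants(reference, variants)
        print(f"{label:22s} frame intact: {len(mutant) % 3 == 0}")

A variant-centric annotator would call both indels loss-of-function; evaluating the whole sequence context shows the combination preserves the reading frame.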

Overview of COPE: COPE uses each transcript as a basic annotation unit. The variant mapping step identifies variants within transcripts. The coding region inference step removes introns from each transcript; all possible splicing patterns are taken into consideration for splice-altering transcripts (in this case, the red dot indicates a splice acceptor site SNP, and intron retention and exon skipping are taken into consideration). The sequence comparison step compares a ‘mutant peptide’ against a reference protein sequence to obtain the final amino acid alteration.

Applying COPE software to genomic data

Screening the official 1000 Genomes variant set, COPE identified a considerable number of false-positive loss-of-function calls, affecting 23.21% of splice-disrupting variants, 6.45% of frameshift indels and 2.10% of stop-gained variants, as well as several false-negative loss-of-function variants in 38 genes.

To the best of our knowledge, COPE is the first fully gene-centric tool for annotating the effects of variants in a context-sensitive approach.


Schematic diagram of typical types of annotation corrections implemented in COPE. A rescued stop-gained SNV indicates that another SNV (‘A’ to ‘C’) in the same codon rescues a variant-centric stop-gained SNV (‘A’ to ‘T’). Stop-gained MNV indicates that two or more SNVs result in a stop codon (‘A’ to ‘T’ and ‘C’ to ‘G’). A rescued frameshift indel indicates that another indel in the same haplotype recovers the original open reading frame. A splicing-rescued stop-gained/frameshift variant indicates that a stop-gained or frameshift variant is rescued by a novel splicing isoform. A rescued splice-disrupting variant indicates that a splice-disrupting variant is rescued by a nearby cryptic site (as shown in the figure) or a novel splice site. The asterisk in the figure indicates a stop codon.

Evaluating the quality of COPE: availability, usability and flexibility

  • Free software
  • Publicly available online server and stand-alone package for large-scale analysis


Screenshot of the COPE web server. Example of input (A) and annotation by COPE (B)

  • Software documentation: A detailed guideline for installation and setup is available
  • Recent updates: COPE-PCG has been online since June 2016, and COPE-TFBS since March 2017 on a new website
  • Analysis of protein-coding genes (COPE-PCG), transcription factor binding sites (COPE-TFBS) and more: the COPE framework may also be extended and adapted to non-coding RNAs and miRNAs in the near future.

About the author

Dr Ge Gao is a principal investigator at the Center for Bioinformatics of Peking University. His team focuses primarily on developing novel computational techniques to analyze, integrate and visualize high-throughput biological data effectively and efficiently, with applications for deciphering the function and evolution of the gene regulatory system. Dr Gao specializes in large-scale data mining, using a combination of statistical learning, high-performance computing and data visualization.

References

Cheng et al., 2017. Accurately annotate compound effects of genetic variants using a context-sensitive framework. Nucleic Acids Research.

Cheng et al., in preparation. Systematically identify and annotate multiple-variant compound effect at transcription factor binding sites in the human genome.

Your Top CRISPR/Cas9 software tools


The development of CRISPR-Cas9 systems has revolutionized genome engineering in living organisms. This novel technology opens up a new era in genomics, along with a wide range of applications. Several bioinformatics tools have recently been developed for researchers designing CRISPR/Cas9 experiments, and analyzing and evaluating CRISPR/Cas9 genome editing.

A few weeks ago, we asked OMICtools members to choose their top 3 favorite CRISPR/Cas9 tools among those most used by the scientific community. Here are the results of your votes.

Gold medals for CRISPR-GA, CROP-IT and CRISPRTarget tools

Three web applications came out equally on top, each voted a #1 tool by 45% of the users surveyed: CRISPR-GA (CRISPR Genome Analyzer), CROP-IT (CRISPR/Cas9 Off-target Prediction and Identification Tool) and CRISPRTarget.

The CRISPR-GA platform has become an essential tool for anyone wanting to assess the quality of their CRISPR/Cas9 experiment. It provides an easy (three mouse clicks), sensitive (detection limit of 0.1%) and comprehensive analysis of gene editing results. The platform maps the reads, estimates and locates insertions and deletions, computes the allele replacement efficiency, and then provides a report integrating all this information.

CRISPR-GA pipeline. (A) From experiment to report. Schematic pipeline of a gene editing assessment. (B) Output of CRISPR-GA estimating a range of information. Deletions, insertions, homologous recombination (HR) and corresponding efficiencies. Upper panels estimate the number of insertions and deletions and each corresponding size. Middle panels estimate the number of insertions and deletions, and their corresponding location within the genomic locus of interest. The bottom panel shows the number of deletions and HRs at each corresponding location, and outputs the HR and NHEJ (non-homologous end-joining) efficiency. (C) Experimental results assessed by CRISPR-GA from testing several mutants of cas9, gRNAs and a DNA template. HR and NHEJ values are shown. From Güell et al., 2014. Genome editing assessment using CRISPR Genome Analyzer (CRISPR-GA). Bioinformatics.

  • CROP-IT (CRISPR/Cas9 Off-Target Prediction and Identification Tool)

CROP-IT is a user-friendly web application where users can design optimal sgRNA guiding sequences and search for potential off-target binding or cleavage sites. The CROP-IT tool integrates knowledge from experimentally identified Cas9 binding and cleavage sites, as well as information on chromatin state (data from multiple studies and 125 cell types). CROP-IT scores predicted off-target Cas9 binding and cleavage sites and outputs a list of the top sites.


Schematic of CROP-IT algorithm based on a computational model where each position of the guiding RNA sequence is differentially weighted based on experimental Cas9 binding and cleavage site information from multiple independent sources. Furthermore, it incorporates chromatin state information for the human genome by analyzing accessible chromatin regions from 125 human cell types. By integrating observed information from Cas9 DNA binding, CROP-IT performs significantly better than existing computational prediction tools. From Singh et al., 2015. Cas9-chromatin binding information enables more accurate CRISPR off-target prediction. Nucleic Acids Research. 
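To make the idea of position-dependent weighting concrete, here is a minimal, generic Python sketch. The weights below are made up for illustration and are not CROP-IT's trained values or its actual model; the only point is that a mismatch near the PAM-proximal ("seed") end costs more than the same mismatch at the distal end.

    # Generic position-weighted mismatch scoring (illustration only).
    def off_target_score(guide, site):
        assert len(guide) == len(site)
        n = len(guide)
        # Hypothetical weights: PAM-proximal 3' positions penalized more.
        weights = [0.5 + 0.5 * i / (n - 1) for i in range(n)]
        penalty = sum(w for g, s, w in zip(guide, site, weights) if g != s)
        return max(0.0, 1.0 - penalty / sum(weights))  # 1.0 = perfect match

    guide = "GACGCATAAAGATGAGACGC"
    site  = "GACGCATAAAGATGAGACGT"  # one PAM-proximal mismatch
    print(round(off_target_score(guide, site), 3))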

CRISPRTarget is one of the first tools developed for predicting the targets of CRISPR RNA spacers. This web application interactively explores diverse databases. CRISPRTarget provides the flexibility to search for matches in either or both orientations of the input, and to discover targets with protospacer adjacent motifs, as well as any adjacent pairing potential.
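The core idea of such a search is easy to sketch in Python. The snippet below is an illustration only, not CRISPRTarget's database search or scoring, and the NGG-style PAM is a placeholder (PAM motifs differ between CRISPR systems, as the WTTCTNN example in the figure below shows).

    # Toy search for exact spacer matches on both strands, with a PAM check.
    import re

    def revcomp(seq):
        return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]

    def find_protospacers(spacer, target, pam=r"[ACGT]GG"):
        hits = []
        for strand, seq in (("+", target), ("-", revcomp(target))):
            for m in re.finditer(spacer, seq):
                if re.match(pam, seq[m.end():m.end() + 3]):
                    hits.append((strand, m.start()))
        return hits

    print(find_protospacers("GATTACA", "CCGATTACATGGAA"))  # [('+', 2)]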


Graphical output of CRISPRTarget. Output of a search for targets of the Streptomyces thermophilus DGCC7710 CRISPR array. The direction of transcription is known; however, both strands are shown in the diagram as if it were unknown. Two relatively low-scoring matches using these interactive settings are shown (ranks 44–45). They have good spacer-protospacer base pairing but lack a WTTCTNN PAM. Match 45 is a match to a phage to which this strain is sensitive (Φ2972). Yellow indicates spacer/protospacer, blue shows flanking sequences, and mismatches between the crRNA and the target DNA protospacer are indicated in red. From Biswas et al., 2013. CRISPRTarget: bioinformatic prediction and analysis of crRNA targets. RNA Biology.

Silver medal for ZiFiT

Second place went to ZiFiT (Zinc Finger Targeter v4.1), with 36% of the votes.

Originally developed to identify potential zinc finger nuclease (ZFN) sites in target sequences, ZiFiT also provides support for the identification of CRISPR/Cas target sites and reagents, as well as user-friendly guidance for the construction of TALEN-encoding plasmids.

(Sander et al., 2010. ZiFiT (Zinc Finger Targeter): an updated zinc finger engineering tool. Nucleic Acids Research.)

Bronze medals for Crass and MAGeCK tools

Equal third place went to Crass (CRISPR Assembler) and MAGeCK (Model-based Analysis of Genome-wide CRISPR/Cas9 Knockout), with 31% of the votes each.

  • Crass (CRISPR Assembler)

Crass identifies and reconstructs CRISPR loci and spacers from raw metagenomic data without the need for assembly or prior knowledge of CRISPR in the data set. The sensitivity, specificity and speed of Crass facilitate analysis of metagenomic data, phage-host interactions and co-evolution within microbial communities.


Comparison between different CRISPR loci visualization techniques. (A) Traditional approach to visualization where the spacers are shown as differently colored rectangles (the same color refers to the same spacer) anchored to the leader sequence (white triangle). (B) The same CRISPR loci reconstructed by Crass into a spacer graph. From Skennerton et al., 2013. Crass: identification and reconstruction of CRISPR from unassembled metagenomic data. Nucleic Acids Research.

  • MAGeCK (Model-based Analysis of Genome-wide CRISPR/Cas9 Knockout)

The MAGeCK algorithm was developed by Li et al. (Genome Biology, 2014) for prioritizing single-guide RNAs, genes and pathways in genome-scale CRISPR/Cas9 knockout screens. It identifies both positively and negatively selected genes simultaneously, and reports robust results across different experimental conditions. This computational method, with a low false discovery rate (FDR) and high sensitivity, provides new leads for answering biological questions and addressing therapeutic needs.
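As a usage illustration (the count-table name, sample labels and output prefix below are all hypothetical), a typical screen comparison runs MAGeCK's test subcommand on a table of sgRNA read counts:

    mageck test -k counts.txt -t treatment1,treatment2 -c control1,control2 -n demo

This writes sgRNA- and gene-level summaries ranking hits under positive and negative selection.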

Follow this tutorial to see how the MAGeCK algorithm works.

Stay tuned for more feedback from the OMICtools community on the latest and best tools to use!

Snakemake: taking parallelization a step further


Written by Raoul Raffel from Bioinfo-fr.net, translated by Sarah Mackenzie.

Hello and welcome back to a new episode of the series of Snakemake tutorials, this one dealing with parallelization. If you missed the first episode introducing you to Snakemake for Dummies, check out the article to catch up on it before you read on.

Here we are going to see how easy Snakemake makes it to parallelize data. The general idea revolves around splitting the raw files at the start of your pipeline and then merging the pieces back together after the calculation-intensive steps. We are also going to find out how to use a JSON configuration file. This file is the equivalent of a dictionary / hash table and can be used to store global variables or parameters used by the rules. It will make it easier to generalize your workflow and modify the parameters of the rules without touching the Snakefile.

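The screenshot of the configuration file did not survive, so here is a minimal reconstruction consistent with the rest of this tutorial (the reference path and the experiment/sample names under "pairs" are hypothetical):

    {
        "ref": "reference/genome.fa",
        "pairs": {
            "exp1": ["sampleA", "sampleB"],
            "exp2": ["sampleC"]
        }
    }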

To use it in your Snakefile, you need to add the following keyword:

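Reconstructed from the missing screenshot, assuming the configuration above is saved as configfile.json next to your Snakefile:

    configfile: "configfile.json"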

You can then access its elements as you would those of a simple dictionary:

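For instance, with the configuration sketched above:

    SAMPLES = config["pairs"]["exp1"]   # ["sampleA", "sampleB"]
    REFERENCE = config["ref"]           # "reference/genome.fa"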

A single keyword for parallelizing

Only one new keyword (dynamic) and two rules (cut and merge) are needed to parallelize.

It’s easiest to illustrate this using the workflow example from the previous tutorial. In it, the limiting step was the Burrows-Wheeler Aligner step (rule bwa_aln), as this command has no built-in parallelization option. We can overcome this limitation with the two rules below.

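The original code screenshot is missing; the sketch below reconstructs the two rules around dynamic(), the mechanism Snakemake offered at the time (chunk size, directory layout and the BAM-producing alignment rules are assumptions). cut splits each FASTQ into a number of chunks that is not known in advance, and merge gathers the per-chunk alignments once they are done:

    rule cut:
        input:
            "{sample}.fastq"
        output:
            dynamic("parts/{sample}_{part}.fastq")
        shell:
            "mkdir -p parts && "
            "split -l 4000 -d --additional-suffix=.fastq "  # 1000 reads per chunk
            "{input} parts/{wildcards.sample}_"

    # The alignment rules from the previous tutorial are adapted to consume
    # parts/{sample}_{part}.fastq and to produce parts/{sample}_{part}.bam.

    rule merge:
        input:
            dynamic("parts/{sample}_{part}.bam")
        output:
            "{sample}.bam"
        shell:
            "samtools merge {output} {input}"

Because each chunk becomes an independent job, Snakemake can schedule as many alignment jobs in parallel as you allow it.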

In this case I have simplified as much as possible to show the power of this functionality; however, if you want to use them in the workflow from the previous tutorial, you will have to adapt these two rules.

Note: the option --cluster allows the use of a scheduler (e.g. --cluster 'qsub').
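For example, to dispatch up to 64 jobs at a time through a scheduler (the qsub resource string is only an illustration):

    snakemake --jobs 64 --cluster "qsub -l nodes=1:ppn=1"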

Taking automation further

The file configfile.json allows automatic generation of target files (i.e., the files you want at the end of your workflow). The following example uses the configuration file presented earlier to generate the target files. Note that {exp} and {samples} come from the "pairs" entry.

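A reconstruction sketch of that step, assuming the hypothetical configfile.json shown earlier:

    configfile: "configfile.json"

    rule all:
        input:
            [f"{exp}/{sample}.bam"
             for exp, samples in config["pairs"].items()
             for sample in samples]

With that configuration, the list expands to exp1/sampleA.bam, exp1/sampleB.bam and exp2/sampleC.bam.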

Here’s an example of the workflow with parallelization:

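The original screenshots are missing; below is a compact reconstruction of the complete parallelized Snakefile that ties the pieces together. The reference path, directory layout and chunk size are all assumptions, and the pre-parallelization steps of the previous tutorial are condensed into two alignment rules:

    configfile: "configfile.json"

    # Target files: one merged BAM per experiment/sample pair.
    rule all:
        input:
            [f"{exp}/{sample}.bam"
             for exp, samples in config["pairs"].items()
             for sample in samples]

    # Split each raw FASTQ into chunks of 1000 reads (4000 lines).
    rule cut:
        input:
            "raw/{exp}/{sample}.fastq"
        output:
            dynamic("parts/{exp}/{sample}_{part}.fastq")
        shell:
            "mkdir -p parts/{wildcards.exp} && "
            "split -l 4000 -d --additional-suffix=.fastq "
            "{input} parts/{wildcards.exp}/{wildcards.sample}_"

    # Align each chunk independently; these jobs run in parallel.
    rule bwa_aln:
        input:
            "parts/{exp}/{sample}_{part}.fastq"
        output:
            "parts/{exp}/{sample}_{part}.sai"
        params:
            ref=config["ref"]
        shell:
            "bwa aln {params.ref} {input} > {output}"

    rule bwa_samse:
        input:
            sai="parts/{exp}/{sample}_{part}.sai",
            fastq="parts/{exp}/{sample}_{part}.fastq"
        output:
            "parts/{exp}/{sample}_{part}.bam"
        params:
            ref=config["ref"]
        shell:
            "bwa samse {params.ref} {input.sai} {input.fastq} "
            "| samtools view -b - > {output}"

    # Put the chunks back together after the calculation-intensive steps.
    rule merge:
        input:
            dynamic("parts/{exp}/{sample}_{part}.bam")
        output:
            "{exp}/{sample}.bam"
        shell:
            "samtools merge {output} {input}"

Running, say, snakemake --jobs 8 against this file would then process up to eight chunks simultaneously during the alignment steps.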

So to sum up, we’ve taken another step forward in learning the functionalities of Snakemake, using only a single keyword and two new rules. This is an easy way to improve the efficiency of your workflows by reducing the time spent on calculation-intensive steps.

However it’s important to keep in mind that excessive parallelization is not necessarily the optimal strategy. For example, if I decide to cut a file of 1000 lines into 1000 single-line files and I have only two poor processors at my disposal, I’m likely to lose time rather than gain it. So it’s up to you to choose the most judicious parallelization strategy, on the basis of the machine(s) available, the size of the files to split, and the extra software/script overhead that the two new rules add to the workflow.

But if you are facing particularly demanding tasks, and a computer cluster is available, you may well see an impressive gain in time.