Evaluating the functional impact of genetic variants with COPE software


Dr Ge Gao, developer of the COPE-PCG software tool, talks here about his tool and how it can assist researchers to analyze sequencing data.

COPE: A new framework of context-oriented prediction for variant effects

Evaluating functional impacts of genetic variants is a key step in genomic studies. Whilst most popular variant annotation tools take a variant-centric strategy and assess the functional consequence of each variant independently, multiple variants in the same gene may interfere with each other and have different effects in combination than individually (e.g., a frameshift caused by an indel can be “rescued” by another downstream variant). The COPE framework, Context-Oriented Predictor for variant Effect, was developed to accurately annotate multiple co-occurring variants.

This new gene-centric annotation tool integrates the entire sequence context to evaluate the bona fide impact of multiple intra-genic variants in a context-sensitive approach.

COPE handles complex effects of multiple variants

Unlike the current variant-centric approach that assesses the functional consequence of each variant independently, COPE takes each functional element as the basic annotation unit and considers that multiple variants in the same functional element may interfere with each other and have different effects in combination than individually (complementary rescue effect).
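This rescue effect is easy to see on a toy example. The sketch below is purely illustrative and not COPE's actual implementation: a 1-bp deletion shifts the reading frame, and a downstream 1-bp insertion restores it, so the pair must be evaluated together.

```python
# Illustrative sketch (not COPE's implementation): applying two indels
# to a toy coding sequence shows why variants must be assessed jointly
# rather than one at a time.

def apply_variants(seq, variants):
    """Apply (position, ref, alt) substitutions/indels, right to left
    so that earlier coordinates stay valid."""
    for pos, ref, alt in sorted(variants, reverse=True):
        assert seq[pos:pos + len(ref)] == ref
        seq = seq[:pos] + alt + seq[pos + len(ref):]
    return seq

def frame_intact(cds):
    """A CDS keeps its reading frame if its length is a multiple of 3."""
    return len(cds) % 3 == 0

cds = "ATGGCTGCAGCTGCAGCTTAA"          # toy 21-nt coding sequence
deletion = (6, "GC", "G")              # 1-bp deletion -> frameshift
insertion = (12, "G", "GT")            # 1-bp insertion downstream

# Each indel alone breaks the frame...
assert not frame_intact(apply_variants(cds, [deletion]))
assert not frame_intact(apply_variants(cds, [insertion]))
# ...but together they restore it: the "rescue" effect.
assert frame_intact(apply_variants(cds, [deletion, insertion]))
```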

Overview of COPE: COPE uses each transcript as a basic annotation unit. The variant mapping step identifies variants within transcripts. The coding region inference step removes introns from each transcript; all possible splicing patterns are taken into consideration for splice-altering transcripts (in this case, the red dot indicates a splice acceptor site SNP, and intron retention and exon skipping are taken into consideration). The sequence comparison step compares a ‘mutant peptide’ against a reference protein sequence to obtain the final amino acid alteration.

Applying COPE software to genomic data

Screening the official 1000 Genomes variant set, COPE identified a considerable number of false-positive loss-of-function calls: 23.21% of splice-disrupting variants, 6.45% of frameshift indels and 2.10% of stop-gained variants. It also recovered several false-negative loss-of-function variants in 38 genes.

To the best of our knowledge, COPE is the first fully gene-centric tool for annotating the effects of variants in a context-sensitive approach.


Schematic diagram of typical types of annotation corrections implemented in COPE. A rescued stop-gained SNV indicates that another SNV (‘A’ to ‘C’) in the same codon rescues a variant-centric stop-gained SNV (‘A’ to ‘T’). Stop-gained MNV indicates that two or more SNVs result in a stop codon (‘A’ to ‘T’ and ‘C’ to ‘G’). A rescued frameshift indel indicates that another indel in the same haplotype recovers the original open reading frame. A splicing-rescued stop-gained/frameshift variant indicates that a stop-gained or frameshift variant is rescued by a novel splicing isoform. A rescued splice-disrupting variant indicates that a splice-disrupting variant is rescued by a nearby cryptic site (as shown in the figure) or a novel splice site. The asterisk in the figure indicates a stop codon.

Evaluating the quality of COPE: availability, usability and flexibility

  • Free software
  • Publicly available online server and stand-alone package for large-scale analysis


Screenshot of the COPE web server. Example of input (A) and annotation by COPE (B)

  • Software documentation: A detailed guideline for installation and setup is available
  • Recent updates: COPE-PCG has been online since June 2016, and COPE-TFBS since March 2017 on a new website
  • Analysis of protein-coding genes (COPE-PCG), transcription factor binding sites (COPE-TFBS) and more… the COPE framework may also be extended and adapted to non-coding RNAs and miRNAs in the near future.

About the author

Dr Ge Gao is a principal investigator at the Center for Bioinformatics of Peking University. His team focuses primarily on developing novel computational techniques to analyze, integrate and visualize high-throughput biological data effectively and efficiently, with applications for deciphering the function and evolution of gene regulatory systems. Dr Gao specializes in large-scale data mining, using a combination of statistical learning, high-performance computing, and data visualization.


Cheng et al., 2017. Accurately annotate compound effects of genetic variants using a context-sensitive framework. Nucleic Acids Research.

Cheng et al., in preparation. Systematically identify and annotate multiple-variant compound effect at transcription factor binding sites in the human genome.

Guidelines for RNA-seq data analysis


Adaptation and translation by Sarah Mackenzie and Helene Perrin of the French articles on bioinfo-fr.net: « L’analyse de données RNA-seq : mode d’emploi » (“RNA-seq data analysis: a user guide”) by Julien Delafontaine, bioinformatician at Lausanne University, and « RNA-seq : plus de profondeur ou plus d’échantillons ? » (“RNA-seq: more depth or more samples?”) by Isabelle Stévant, PhD student in bioinformatics at Geneva University.

High-throughput sequencing currently produces massive quantities of data, which are often extremely difficult for biologists to analyze; they thus look to bioinformaticians for help. From these tens of gigabytes of data, the biologist hopes to extract results which are statistically relevant (preferably with very small p-values!), biologically interpretable, and which justify the time and money invested in the study.

A biologist typically has multiple aims, such as:

  • to demonstrate that certain genes are transcribed
  • to demonstrate that certain genes are transcribed under specific conditions, but not others
  • to demonstrate alternative splicing of these genes, etc.

The second case is the most common. The “conditions” referred to may represent the presence or absence of a transcription factor, the presence or absence of a certain phenotype/disease, different culture conditions, different tissues or cell types, etc. The aim is to demonstrate, for example, that a particular transcription factor is associated with some diseases when expressed in the liver. The researcher should know the species he/she is working with, whether the reads are “single-end” or “paired-end” (see below), and what sort of outcome can be expected.

What is RNA-seq?

Let’s start with a brief reminder of the experimental methodology for RNA-seq (1): RNA is isolated from several cells at the same time and cut into fragments, which are then amplified by PCR before being processed in a sequencer. This provides short genomic sequences, termed “reads”: tens of millions of short phrases composed of A, T, G, and C. Working from the principle that the number of reads is proportional to the amount of corresponding RNA in the cell, the aim is to estimate that amount.

RNA-seq is a relatively recent technique belonging to the family of methods known as next-generation sequencing (NGS) or high-throughput sequencing (HTS), along with ChIP-seq, DNA-seq and FAIRE-seq, all of which allow analysis of the regulation of gene expression.

Note: the mRNAs studied with RNA-seq are considered “mature”, i.e. polyadenylated and already spliced. This is important given that a gene may produce large quantities of RNA which are rapidly degraded before reaching maturity. However, some of these RNAs may be active, for example as regulators of other genes. This nascent RNA can be studied using other technologies such as NET-seq (2).

What do RNA-seq data look like?

The files obtained from sequencing are typically in the FASTQ (text) format. They may also be in SRA (binary) format, however this format is usually used at the end of the analysis, when the data are made publicly available online.


In the FASTQ file shown here, each block of 4 lines represents a read: an identifier line starting with “@”, the sequence of an RNA fragment of fixed length, a “+” separator line, and the quality (reliability) of the sequencing of each base. The sequencer produces short reads of approximately 30 to 150 nucleotides (nt).
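This four-line structure is simple enough to parse by hand; here is a minimal stdlib sketch (the read shown is invented):

```python
# Minimal FASTQ parser sketch: each record is 4 lines
# (@identifier, sequence, "+" separator, per-base qualities).

def parse_fastq(lines):
    """Yield (identifier, sequence, quality) tuples from FASTQ lines."""
    it = iter(lines)
    for header in it:
        seq = next(it).strip()
        next(it)                      # "+" separator line, ignored
        qual = next(it).strip()
        yield header.strip().lstrip("@"), seq, qual

example = [
    "@read_001",
    "GATTACAGATTACA",
    "+",
    "IIIIIIIIIHHHGG",
]
for name, seq, qual in parse_fastq(example):
    print(name, len(seq))             # one quality character per base
```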

Given the size of these files (hundreds of millions of lines), a sample is often divided into multiple files of equal sizes.

Sequencing can be carried out as “single-end”, where each read is independent of the others, or as “paired-end”, where reads come in pairs, one from each end of the fragment, at a fixed mean distance from each other. Paired-end sequencing is required to properly detect splicing and, in a quite different context, indels. For everything else, single-end sequencing is adequate (and much cheaper). Some single-end techniques also allow information about the original strand (reference or complementary) to be conserved, and thus the direction of transcription.

RNAseq data preprocessing and quality control

1. Preparation

Copy the data to a location with a lot of available storage space and a maximum amount of RAM – access to a storage server and a computer cluster is ideal.

Obtain the genomic sequence of the relevant species in FASTA (.fa) format, such as from NCBI. This will be used by different programs to construct the index which will be used for mapping (see below).

2. Quality Control

Firstly ensure that the sequencer has functioned correctly. The first base pairs of a read are sequenced with high reliability, however the further you advance along the sequence, the less precise it becomes. Programs such as FastQC produce different graphs which allow you to identify whether you should shorten the reads – known as “trimming” – and to what extent.

This step is extremely important because it is difficult to align poor quality reads with the genome, rendering all analyses pointless: there have been cases where the proportion of mapped reads has increased from 2% to 60% with trimming.
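The per-position quality decay that FastQC plots can be computed directly; a minimal sketch assuming Phred+33 encoding (the quality strings are invented):

```python
# Sketch: mean Phred quality per read position, the statistic behind
# FastQC's "per base sequence quality" plot (Phred+33 encoding assumed).

def mean_quality_per_position(quality_strings):
    sums, counts = {}, {}
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            sums[i] = sums.get(i, 0) + (ord(ch) - 33)   # decode Phred+33
            counts[i] = counts.get(i, 0) + 1
    return [sums[i] / counts[i] for i in sorted(sums)]

quals = ["IIII?", "IIHH#"]            # quality typically decays at the 3' end
print(mean_quality_per_position(quals))
```

A steep drop at the tail of this curve is the signal that trimming the reads is worthwhile.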

Read alignment

3. Alignment (mapping)

We know the sequence of each read but not, a priori, which part of the genome it comes from. Each read is therefore aligned to the genome of the species, i.e. we locate the position on the genome of a sub-sequence similar to that of the read. This position should theoretically be unique for a read that is sufficiently long (>30 nt), thus giving the original position of the transcript.

You can think of the genome as a long straight line, composed of billions of characters, and the reads as small segments which align with this line where the sequence is similar. If a transcript was entirely sequenced, all the exons would be “covered” by reads, and in some cases by multiple reads.

The mean number of reads per mapped position is termed the “sequencing depth”. To guarantee adequate depth, hundreds of millions of reads need to be sequenced, and their quality ensured, as depth is the basis of the statistical power when determining the expression level of a transcript.

Alignment is never perfect, due to sequencing errors which can occur during the amplification phase, due to repeat regions, and to mutations, etc. We find reads in non-transcribed regions, as well as in introns (and in some cases non-spliced RNA). Also, the reference genome itself is not exact; it is the result of a process of assembly, which was also derived by sequencing, and which is a consensus between dozens of individuals.

3.1. Working with an existing annotation

Let’s assume we already have the files (GTF, BED, etc.) with the location of each exon/gene/transcript. These kinds of annotation can be found on Ensembl (BioMart) or NCBI.

Several programs can align short reads, such as Bowtie, BWA, and a multitude of others. They are all useful, Bowtie and BWA being the fastest and most widely used. We have not personally tested the commercial ones.

Among the options, the most important to check are the number of reported alignments (if a read maps to several locations) and the number of errors allowed in the sequence (mismatches). The sequencer has an error rate of approximately 1%, so we can expect at least one error in a read of 100 nt. Furthermore, while 100-nt sequences would be essentially unique if the genome were a random sequence, in reality several regions are repeated or similar.
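That expectation follows from simple arithmetic; a quick sanity check on the ~1% error rate claim:

```python
# With a ~1% per-base error rate, the chance that a 100-nt read contains
# at least one sequencing error is 1 - 0.99**100, roughly 63%.

def p_at_least_one_error(read_length, per_base_error=0.01):
    return 1 - (1 - per_base_error) ** read_length

print(round(p_at_least_one_error(100), 3))
```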

3.2. Reconstructing the annotation

If you are not confident about the available annotations, or are looking for new exons/isoforms, the genome structure can be completely or partially reconstructed using the reads, at the time of alignment. Programs such as TopHat or Cufflinks can guess the limits of exons and reconstruct possible isoforms from the data.

However, reconstructing everything from scratch in this way is extremely slow and requires an enormous amount of memory. A good compromise is to provide the program with an existing annotation (as a GTF file) for it to complete.

3.3. Alignment result

In all cases, the result of this step is a BAM (binary) or SAM (text) file. SAM files are rarely useful and take up a lot of space, which is why mostly BAMs are used, manipulated with SAMtools. The content of both formats is identical, and resembles the following:


Each line represents an alignment: the read identifier, the name of the reference sequence, the alignment position on the reference, the read sequence, the mapping quality, and various indicators (flags). For more details, see the SAM format specification.

Note: the quality of the alignment is not related to the read quality described previously. Here it reflects how confident we can be that the read really comes from the reported position on the genome.
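Split on tabs, the columns described above map directly onto a small dictionary; a sketch (the alignment line itself is invented, fields per the SAM specification):

```python
# Sketch: extracting the main SAM columns from one alignment line.

sam_line = "read_001\t0\tchr1\t11874\t42\t14M\t*\t0\t0\tGATTACAGATTACA\tIIIIIIIIIHHHGG"

fields = sam_line.split("\t")
alignment = {
    "qname": fields[0],               # read identifier
    "flag": int(fields[1]),           # bitwise flags (strand, pairing...)
    "rname": fields[2],               # reference sequence name
    "pos": int(fields[3]),            # 1-based alignment position
    "mapq": int(fields[4]),           # mapping quality
    "cigar": fields[5],               # alignment shape (matches, gaps...)
    "seq": fields[9],                 # read sequence
}
print(alignment["rname"], alignment["pos"], alignment["mapq"])
```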

3.4. Cufflinks and other tools

You will see the term Cufflinks in several places throughout this article. This tool is everywhere, with the well-known drawbacks of:

  • generating a complicated (even disheartening) output, at least on the initial viewing;
  • creating bugs which destroy a job after 3 days on the cluster (according to some colleagues);
  • not amenable to customization, since it isn’t your own code.

It’s up to you, although we do recommend you test its latest version; be aware also that alternatives such as RSEM exist.

Transcript quantification and annotation

4. Counting

According to the hypothesis that the number of reads from a given gene is proportional to the abundance of its RNA in the cell, we need to count the reads from each gene, transcript or exon of the genome.

4.1. Preparing a table

Using the BAM files, one solution is to use SAMtools with an in-house script. For example, on the command line “samtools mpileup” provides an exploitable format. The Python wrapper, pysam, offers more flexibility and efficiency if you code in this language. There are wrappers for other languages too (R, Ruby, C++, Java, Perl, Haskell, Lisp).

If you do end up doing this part yourself (which is justifiable), it doesn’t matter how you go about it; the main thing is to produce a readable text file – already usable by the biologist – which resembles a table, with an explicit header, so it can be easily loaded into R, such as:
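For instance, a minimal sketch producing such a table (gene names and counts are invented):

```python
# Sketch: writing the kind of tab-separated count table meant above,
# with an explicit header so it loads cleanly into R.
import csv
import io

counts = {
    "GeneA": {"control": 1200, "treated": 3400},
    "GeneB": {"control": 87, "treated": 95},
    "GeneC": {"control": 0, "treated": 41},
}

out = io.StringIO()
writer = csv.writer(out, delimiter="\t", lineterminator="\n")
writer.writerow(["gene_id", "control", "treated"])   # explicit header
for gene in sorted(counts):
    writer.writerow([gene, counts[gene]["control"], counts[gene]["treated"]])

table = out.getvalue()
print(table)
```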


Alternatively, you can do the same thing with the Python library HTSeq. You can also do it with Cufflinks alone, whether from a complete annotation or from the one it constructs itself.

4.2. The problem of splice junctions

When aligning against the genome or the exome, reads coming from a splice junction (overlapping two exons joined after RNA splicing) cannot find their original position, as the two halves of the read sequence are separated by an intron in the reference. This is somewhat unfortunate as it means we lose some of the data. To remedy this, we can align the reads against the transcriptome rather than the genome, use a splice-aware aligner (such as TopHat) that can split a read across an intron, or infer the junctions from the spacing of the read pairs.

The two latter options work infinitely better with paired-end sequencing, as an exaggerated space between two reads of a pair provides proof that a part of the transcript was cut, as is the case with splicing.

4.3. The problem of alternative transcripts

When looking to quantify the expression of the various alternative transcripts (isoforms) of a gene, we cannot simply count the number of reads between the genomic coordinates of each of them, such as can be found on Ensembl. Many isoforms share the same exons and if a read maps to a common exon it is impossible to know which transcript it comes from.

We can nonetheless attempt to distribute the reads optimally between isoforms, or look for the most likely distribution. In concrete terms, the possible solutions are:

  • Least squares (including non-negative least squares) optimally resolves the overdetermined linear system E = CT, where T is the vector of counts per transcript, E is the vector of counts per exon, and C is the correspondence matrix (composed of ones and zeros) indicating which exons belong to which transcript.
  • The maximum likelihood method applied to the sum of the probabilities of each read to belong to the relevant transcript. This method is used in Cufflinks for example.
  • An EM algorithm, the subject of several recent publications (RSEM, iReckon).

These methods give relatively different results: the first, the most unbiased, mainly applies the “benefit of the doubt”, while the second tends to privilege one particular transcript. We have not yet tested the third. In some cases a LASSO penalty can be added to “correct” certain errors in the method.

It is very challenging to know which of these methods is the most realistic, having few cases (or none at all) of genes for which the relative expression of isoforms has been measured by another method (and which furthermore is with your cell type, under your conditions, etc). The results are thus quite hypothetical, and the answer to this type of question will likely be found using an analysis other than RNA-seq. We can nonetheless use qRT-PCR to more precisely estimate the abundance of a small number of transcripts, or to perform a simulation. The subject is still up for discussion.
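The least-squares option can be sketched on a toy gene with two isoforms sharing one exon; here the two-unknown normal equations are solved by hand to keep the example dependency-free (all numbers invented):

```python
# Sketch of the least-squares option on a toy gene: two isoforms, three
# exons, one exon shared. We solve the normal equations (C^T C) T = C^T E
# for the per-transcript counts T, given per-exon counts E.

C = [[1, 0],            # exon 1: isoform 1 only
     [1, 1],            # exon 2: shared by both isoforms
     [0, 1]]            # exon 3: isoform 2 only
E = [30, 40, 10]        # observed reads per exon

a = sum(row[0] * row[0] for row in C)          # C^T C entries
b = sum(row[0] * row[1] for row in C)
d = sum(row[1] * row[1] for row in C)
u = sum(row[0] * e for row, e in zip(C, E))    # C^T E entries
v = sum(row[1] * e for row, e in zip(C, E))

det = a * d - b * b
T = ((d * u - b * v) / det, (a * v - b * u) / det)
print(T)                # estimated reads per isoform
```

On real annotations the system is larger and solved with a numerical library (and constrained to non-negative values), but the principle is the same.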

4.4. PCR duplicates

During the RNA fragment amplification step (PCR) which precedes sequencing, some fragments may be over-amplified by error. This results in an abnormal number of identical reads, which may bias the results. Fortunately these “PCR duplicates” are relatively easy to identify after visualization (see following section), as we observe disproportionate columns of several hundred identical superimposed reads. It is significantly harder to detect this in advance and for all genes, which obviously poses a statistical problem when the samples are compared if one fragment is over-amplified in one sample but not in another.
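A rough sketch of duplicate detection, flagging reads that share both mapping position and sequence (read data invented; real tools such as samtools markdup also use mate information):

```python
# Sketch: likely PCR duplicates are reads sharing both mapping position
# and sequence, producing the tall columns visible in a genome browser.
from collections import Counter

reads = [("chr1", 11874, "GATTACA"),
         ("chr1", 11874, "GATTACA"),
         ("chr1", 11874, "GATTACA"),
         ("chr1", 11880, "ACAGATT")]

occurrences = Counter(reads)
duplicates = {read: n for read, n in occurrences.items() if n > 1}
print(duplicates)
```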


Visualization and interpretation

5. Visualization

5.1. Genome browsers

To preview your data, the best option is to construct for each sample what is known as a “profile” or a “density” of the signal, in other words, a type of bar plot which indicates the number of reads mapped at each position of the genome. This is provided as a well formatted file that a visualization program (genome browser) such as UCSC (online) or IGV (local) will be able to interpret to produce a graph.

An example image from UCSC. From top to bottom: RNA-seq profiles, CpG islands, repeat elements, RefSeq and Ensembl genes, GC content, and ChIP-seq peaks.

This is extremely useful for several reasons:

  • We can visualise the “mountains” in the reads which localise the exons and other transcripts.
  • We can visually compare the expression levels to compare different samples and genes.
  • We can add profiles of other types of signals, such as ChIP-seq, methylation, conservation, motifs, etc., which can explain our signal. This is a very recent approach and is often required to justify a new discovery: if a gene is expressed under one set of conditions and not another, we can expect to observe Polymerase II, a peak of methylated H3K4 at the start, a peak of … in the middle, and a peak of … at the end…

To obtain these densities, a multitude of programs are available, which each bioinformatics laboratory has recoded for its own use – the function “genomeCoverageBed” from BEDTools is apparently able to do this, otherwise you can code an nth “bam2wig” yourself, knowing that it takes a BAM as input (again, you can use SAMtools or the pysam wrapper) and should produce a bed/bigBed/bedGraph/wig/bigWig.
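The density computation itself is simple; a dependency-free sketch of the piling-up step behind any “bam2wig” (the intervals are invented):

```python
# Sketch: pile up read intervals into a per-position coverage vector,
# the "profile" a genome browser draws as a bar plot.

def coverage(intervals, length):
    """intervals: (start, end) pairs, 0-based, end exclusive."""
    depth = [0] * length
    for start, end in intervals:
        for pos in range(start, end):
            depth[pos] += 1
    return depth

reads = [(0, 4), (2, 6), (3, 7)]
print(coverage(reads, 8))
```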

For online visualization, firstly you should be aware that you can view BAMs directly, and that the preferred formats are those where there is no need to download the entire file: you upload only an “index”, and when you browse to a new region, only the essential part is downloaded to your computer. This is the case for bigBed/bigWig (formats in fact invented for the UCSC browser, which provides conversion tools, e.g. bedGraphToBigWig) and for BAMs. This is thus dramatically faster.

5.2. MA-plot

Once the counts have been obtained, one type of graph is frequently used to compare samples: the MA-plot. The letter M represents the log ratios, the letter A stands for average. As the name implies, the MA-plot is a plot of the mean (on the x-axis) versus the ratio (on the y-axis) of two values: gene expression under two conditions, with each gene being a point on the graph.

In other words, if x1 is the number of reads which map to a gene under a given condition, and x2 is the number of reads for the same gene under another condition, then the horizontal axis will be (x1 + x2)/2 and the vertical axis will be x1/x2.

However, the graph will in fact be difficult to read (all concentrated in a corner), which is why we transform the scale to log:

log10(√(x1·x2)) on the x-axis, and log2(x1/x2) on the y-axis.
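These log-scale coordinates are computed per gene; a minimal sketch (the counts are invented):

```python
# Sketch: the MA coordinates for one gene.
# A (x-axis): log10 of the geometric mean of the two counts.
# M (y-axis): log2 of their ratio.
import math

def ma_point(x1, x2):
    a = math.log10(math.sqrt(x1 * x2))
    m = math.log2(x1 / x2)
    return a, m

a, m = ma_point(400, 100)
print(a, m)
```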

MA-plot: the x-axis is the log10 of the geometric mean; the y-axis is the log2 of the ratio. Purple shows different quantiles (quasi-replicates are shown here). Source: HTSstation

This is interpreted as the further to the right the point is, the more strongly the gene is expressed overall, and vice versa. The higher the point is on the graph, the more it is overexpressed in the first condition relative to the second, and vice versa. Genes of interest are thus located further to the right (with more convincing ratios if the counts are higher) and as far as possible toward the top or the bottom of the graph, visibly separated from the mass of other points.

Two technical replicates will show for example very little variation around the horizontal axis, while two relatively different samples will give a graph in the form of a well-rounded fish. We can also check if the median is truly zero, without which we will see a bias: the samples will be incorrectly normalized (see below).

This is a purely visual approach which does not demonstrate anything, however if a gene which is thought to play an important role is noticeable in such a graph, we already know that the experiment was successful and there will be a pretty figure for the publication – an important point!

The R limma library is well adapted for this, however you can easily produce your own script or use other visualization programs such as Glimma and Goulphar which have the notable advantage of being interactive.

6. Calculation of an expression differential

For a given gene, we may wish to know if its expression is significantly higher under one condition compared to another.

For example, if I have 100 reads under Condition 1 and 500 under Condition 2, we could imagine that they are significantly different (but in fact we don’t know at all). And what if there were 10 and 50 reads? Or 1 and 5? The proportion is the same, yet the difference may be due to chance. And what if there are 100 versus 150 reads? At this point, we need mathematics to be convinced.

6.1. Normalisation

In order to compare samples, they must first be made comparable. Biases such as the total number of reads aligned for a sample, PCR artefacts etc can invalidate results. For example, imagine a case where there are ten times more reads under one condition than another; in such a case almost all genes will be considered significantly enriched.

One possible remedy would be to divide each sample by its total number of reads. Unfortunately, in RNA-seq a few small regions tend to accumulate a large proportion of the reads, so this creates more bias than it corrects. Imagine two samples where one has twice as many reads as the other, but where half of its reads come from a single gene. By dividing by two, all gene counts are divided by two, and thus all appear significantly less expressed.

We could also consider dividing the counts of each gene by the size of the gene to obtain a density of reads. This transformation is necessary to compare genes from the same sample, but not for comparing different samples. It should however be noted that some publications (3) show a bias in the detection of differentially expressed genes, favoring the longest genes. Despite a loss of statistical power, it may be preferable to divide the counts by the transcript length.

By linking the two, we obtain RPKM, for Reads Per Kilobase per Million (of mapped reads). The formula is:

RPKM = (number of reads mapped to the gene × 10^9) / (gene length in bp × total number of mapped reads)

Biologists like this because they’ve been told it’s been normalized. In reality, for the above-mentioned reason, this is unfortunately not suitable for comparing samples in RNA-seq.
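RPKM simply divides the read count by the gene length (in kilobases) and the library size (in millions of reads); a one-line sketch with invented numbers:

```python
# Sketch of the standard RPKM formula:
# reads * 1e9 / (gene length in bp * total mapped reads).

def rpkm(gene_reads, gene_length_bp, total_mapped_reads):
    return gene_reads * 1e9 / (gene_length_bp * total_mapped_reads)

# 1,000 reads on a 2 kb gene in a library of 50 million mapped reads:
print(rpkm(1000, 2000, 50e6))   # -> 10.0
```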

The best solution is presented in the article about DESeq (4), which we will discuss later. In short, we build a reference sample using the (more robust) geometric mean of the samples, and normalize them all relative to this reference.

Note: RNA-seq does not in fact allow us to differentiate a natural amplification of the expression of all genes of an individual, from amplification due to a technical artefact. In the first case normalization is not needed, and in the second it is. We generally work from the hypothesis that the level of expression of the majority of genes is not different between two samples – most of the time.
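The DESeq-style reference idea can be sketched as follows (toy counts invented; the real implementation lives in the DESeq R package):

```python
# Sketch of DESeq-style size factors: a pseudo-reference sample is built
# from per-gene geometric means, and each sample is scaled by the median
# of its ratios to that reference.
import math
import statistics

counts = {                      # gene -> counts in (sample1, sample2)
    "g1": (100, 200),
    "g2": (50, 100),
    "g3": (80, 160),
}

def size_factors(counts):
    genes = list(counts)
    n_samples = len(next(iter(counts.values())))
    # geometric mean per gene = the "reference" sample
    ref = {g: math.exp(sum(math.log(c) for c in counts[g]) / n_samples)
           for g in genes}
    return [statistics.median(counts[g][j] / ref[g] for g in genes)
            for j in range(n_samples)]

print(size_factors(counts))     # sample2, with doubled counts, is scaled down
```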

6.2. Statistics

Each read has a (very small) probability p of coming from a given gene G. For two reads, the probability that both come from G is p·p, and that neither comes from G is (1−p)·(1−p), etc. For N reads, the probability that exactly k of them come from G and not the others is proportional to C(N,k)·p^k·(1−p)^(N−k). This is known as the binomial distribution. The problem with this model is that with N in the tens of millions the binomial coefficients become astronomically large, enough to crash any computing cluster.

Fortunately we can approximate the binomial with a Poisson distribution. Its formula gives the probability of observing a certain number of reads (known) given a parameter λ (unknown), which according to the theory is the mean of the distribution. We are not really interested in this probability; on the contrary, we assume that our observation is probable (because it occurred) and estimate the parameter λ from the number of reads. If replicates are present, so much the better: we can estimate λ using the arithmetic mean of the counts across replicates. If we have a single sample for a condition, the mean is estimated from the single observed count – which is statistically questionable in every sense of the term. We then compare the λ of the two conditions to determine whether the means are statistically the same.

Note: in reality, the Poisson distribution is a bit too restrictive and we tend to use instead a negative binomial distribution in order to add an additional variance term. This change can lead to complications, as we also need to estimate this variance. But rest assured, a program will cheerfully do this for you, so in the worst case scenario you can comfortably forget all that and press the red button!
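The overflow problem mentioned above disappears when probabilities are computed in log space (via lgamma); the sketch below also shows how closely the Poisson approximation tracks the binomial (all numbers invented):

```python
# Sketch: with N in the millions the binomial coefficient overflows naive
# arithmetic, but log-space computation (via lgamma) stays finite, and
# the Poisson approximation with lambda = N*p agrees closely.
import math

def binom_log_pmf(k, n, p):
    log_coeff = (math.lgamma(n + 1) - math.lgamma(k + 1)
                 - math.lgamma(n - k + 1))
    return log_coeff + k * math.log(p) + (n - k) * math.log(1 - p)

def poisson_pmf(k, lam):
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

n, p = 10_000_000, 2e-7          # 10M reads, tiny per-read probability
lam = n * p                      # Poisson parameter, here 2.0
for k in range(5):
    print(k, math.exp(binom_log_pmf(k, n, p)), poisson_pmf(k, lam))
```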

Currently the best programs for doing these calculations are R libraries such as DESeq and edgeR. Both have a very clear manual, a readable publication, and full documentation. Given your previously obtained table, the program returns a new tabulated file with a column of adjusted p-values, from which you can extract a list of the most significantly over- or under-expressed genes in a given condition. Furthermore, if necessary it can also take care of normalizing your table before you start. Cufflinks also claims to be able to do all this in its new version. The mathematical model is globally the same in all cases.

7. Interpretation

At this point you now need to determine if the genes identified as significant are in fact related to your experiment. In itself, a list of 100 genes is not particularly useful. Often the biologist knows which genes he/she can expect to find in a list, and furthermore hopes they are among the first in the list. If on the other hand he/she has no clues about what is being looked for, the next step is purely exploratory, and Bayesian (5) arguments have demonstrated that in all cases results have no value as proof.

Finally, if the adjusted p-values are too big, the genes can be filtered simply by p-value, or even by “fold change” (ratio) to allow the biologist to believe that all hope is not lost. The effect is particularly apparent in the absence of replicates.

If that’s all that can be taken from the RNA-seq itself, analyses can involve other types of data, making the picture more complex and likely more interesting. The specific constraints may also call for a change in mapping parameters, p-value limits, etc.

8. Conclusion

RNA-seq is a relatively recent method, and consequently the available tools are not only less fine-tuned than those for microarrays, but are also rapidly evolving and regularly being replaced, notably those proposing a complete pipeline. Nonetheless, it is a very promising technique and a successor to microarrays, already offering more structural information with comparable or superior sensitivity using existing tools. The main drawback is its cost; however, costs are decreasing rapidly, which makes the technique increasingly popular, and thus makes creating efficient pipelines increasingly essential.


RNA-seq: more depth or more samples?

When you throw yourself into the adventure of high-throughput transcriptome sequencing, you are forced to ask yourself THE question, yes THE question, that all of us who are on a tight budget fear:


To what depth should I sequence my samples?

All publications agree: the more replicates you have, the greater the statistical power to detect differentially expressed genes. However, while the cost of sequencing has dropped considerably, not everyone has the means to sequence tens (or even hundreds) of samples. If your project is focused on gene expression, a good compromise between statistical power and cost is to determine the optimal sequencing depth, so as to sequence the maximum number of samples without losing information.

A sequencer such as the Illumina HiSeq 2000 can only sequence a finite number of reads. The HiSeq 2000 performs sequencing on 2 flow cells, each of which contains 8 lanes (see the video for more information about the Illumina sequencer). Each lane can produce up to 200 million single-end reads (where only one end of the cDNA is read) or 400 million paired-end reads (where both ends of the cDNA are read). It is possible to combine several libraries (or samples) on the same lane using indexing systems which allow each cDNA fragment to be tagged to identify which sample it came from. The equation is simple: the more samples combined on a lane, the fewer the reads per sample.

If too many samples are combined, there is a risk that weakly expressed genes (low-abundance RNA) will be missed, but the per-sample cost of sequencing is low. Conversely, if few samples are combined, more reads than necessary to detect gene expression may be sequenced, and the per-sample cost is very high.

If your project is based on global gene expression and not on the detection of alternative transcripts, it may not be necessary to sequence 25-30 million reads per sample as is typically done. It is thus possible to reduce the number of reads per sample and put more samples in the sequencer for the same budget.

Evaluating sequencing depth

I’m going to use a concrete example to demonstrate this point. For my thesis project, I am performing single-cell RNA-seq on mouse embryonic tissues to identify cell lineages and follow their differentiation over time. In all, I have 7 conditions and a total of 640 libraries in stock (~90 cells per condition). The laboratory could only finance one sequencing run (2 flow cells). The problem: should I sequence only some of my samples, or all of them?

At the start of the project, we performed a test with paired-end sequencing of about 30 cells at a depth of 25 million reads. This preliminary study showed that we could detect many of the genes required to identify our different cell lineages, but the statistical power was quite low. One of the lineages was very poorly represented, and the genes specific to this population did not come out as significant. We thus knew that we needed more cells per stage to obtain more robust analyses.

In order to calculate how many samples I can include in the sequencer without losing too much information, I had to evaluate the optimal sequencing depth.

For this, I chose 3 FastQ files from my initial sequencing (raw files containing the reads, not yet mapped) representing 3 different cell types, and I randomly sampled a percentage of the reads from each to simulate increasingly shallow sequencing depths.

For this random read sampling, I used the HTSeq Python package, which offers many tools for manipulating files, including FastQ and BAM. Since it took me some time to find the method, I’m sharing the code with you here:
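As a minimal sketch of the idea using only the Python standard library (HTSeq's FastqReader can be substituted for more robust FastQ parsing), read the file four lines at a time — one FASTQ record — and keep each record with the desired probability:

```python
import random

def subsample_fastq(in_path, out_path, fraction, seed=42):
    """Randomly keep ~`fraction` of the reads in a FastQ file.

    A FastQ record spans 4 consecutive lines (header, sequence,
    separator, qualities), so we draw one random number per record
    to keep the 4 lines together.
    """
    rng = random.Random(seed)  # fixed seed: reproducible subsampling
    with open(in_path) as fin, open(out_path, "w") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]
            if not record[0]:  # end of file reached
                break
            if rng.random() < fraction:
                fout.writelines(record)
```

For example, `subsample_fastq("cell1.fastq", "cell1_10M.fastq", 0.40)` keeps about 40% of the reads, simulating a 10M-read sequencing from a 25M-read file (the file names here are hypothetical).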


If I combined 16 samples, I theoretically sequenced at 400M/16 = 25 million reads per sample. Thus, if I want to simulate a 10M sequencing, I take (10/25)*100 = 40% of the reads from each of my 3 FastQ files.

I chose to simulate sequencing runs at theoretical depths from 0.1 to 20 million reads, and wrote a small quick’n’dirty shell script to launch the simulations in parallel on a compute server.
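The same fan-out can be sketched in Python with `concurrent.futures` (in place of a shell script); the file names, depth list, and the body of `run_job` are placeholders to fill in with your own subsampling routine:

```python
from concurrent.futures import ProcessPoolExecutor
from itertools import product

FASTQS = ["cellA.fastq", "cellB.fastq", "cellC.fastq"]  # hypothetical input files
DEPTHS_M = [0.1, 0.5, 1, 2, 5, 10, 15, 20]              # simulated depths (millions)
SEQUENCED_M = 25                                        # actual depth of each file

def plan_jobs(fastqs, depths, sequenced):
    """One job per (file, depth): input, fraction of reads to keep, output name."""
    return [(fq, d / sequenced, f"{fq}.{d}M") for fq, d in product(fastqs, depths)]

def run_job(job):
    fastq_in, fraction, fastq_out = job
    # Placeholder: call your subsampling routine (HTSeq, etc.) here.
    return fastq_out

if __name__ == "__main__":
    # One worker process per CPU core; each simulation runs independently.
    with ProcessPoolExecutor() as pool:
        done = list(pool.map(run_job, plan_jobs(FASTQS, DEPTHS_M, SEQUENCED_M)))
```

With 3 files and 8 depths, this launches 24 independent subsampling jobs across the available cores.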

Once the simulations were done, I needed to map them all onto the reference genome (in my case, the mouse) and count the number of genes that can be detected at each simulated sequencing depth. For this, use your alignment software of choice (Bowtie, BWA, GEM mapper, etc.), then quantify gene expression. I chose Cufflinks to obtain FPKM values (Fragments Per Kilobase of exon per Million fragments mapped). This relatively well-known measure normalises the number of reads mapped to a gene by the length of the gene, and also by the total number of reads mapped to the genome. This normalised measure thus allows us to estimate the abundance of a given RNA in our sample; we consider that an FPKM of 1 corresponds to the presence of one RNA molecule in the sample.

We count the number of genes detected with at least 1 FPKM, then plot the results to obtain saturation curves (thanks to R and ggplot2!). Working on the principle that there is a sequencing depth at which all the RNA, and thus all the genes expressed in the initial sample, can be detected, sequencing beyond this depth will leave the number of detected genes unchanged, because the maximum detectable number has been reached. The saturation plot allows us to define the depth needed to reach this plateau.
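The gene-counting step can be sketched as follows — a simplified FPKM computation on hypothetical `counts`/`lengths` dictionaries standing in for the aligner and Cufflinks output (Cufflinks itself works on fragments and uses more sophisticated length estimates):

```python
def fpkm(read_count, gene_length_bp, total_mapped):
    """FPKM: reads on the gene, normalised by the gene's exonic length
    (in kilobases) and by the total number of mapped reads (in millions)."""
    return read_count / ((gene_length_bp / 1e3) * (total_mapped / 1e6))

def genes_detected(counts, lengths, threshold=1.0):
    """Count genes whose FPKM reaches the detection threshold (1 FPKM here).

    counts  -- dict: gene -> number of reads mapped to that gene
    lengths -- dict: gene -> exonic length in bp
    """
    total = sum(counts.values())
    return sum(1 for gene, c in counts.items()
               if fpkm(c, lengths[gene], total) >= threshold)
```

Running `genes_detected` on each simulated depth, then plotting detected genes against depth, gives the saturation curve.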

Estimation of optimal sequencing depth for analysis of gene expression by RNA-seq (Isabelle Stévant, CC BY)

On this plot, it is clear that the plateau is reached at about 5 million reads for the 3 cells; sequencing at a greater depth will not detect more genes. In my case, if I want to sequence my 640 cells on the 16 available lanes of the sequencer, I can combine at most 640 cells / 16 lanes = 40 cells per lane, corresponding to a depth of 10 million reads per cell (400M reads per lane / 40 cells). According to the saturation plot, I should be able to detect the same number of genes as with deeper sequencing. As a final check, I used linear regression to verify that the FPKM values for each gene are the same at a sequencing depth of 20 million reads and at 10 million reads:

Consistency between sequencing at 20M reads vs 10M reads (simulated data, Isabelle Stévant, CC BY)

With a correlation value extremely close to 1, meaning that the two data sets are nearly identical, I can go ahead with a sequencing depth of 10 million reads instead of 25 million without risking the loss of gene expression data — and I can thus increase the number of samples from 256 to 640 for a single sequencing run!
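Such a check can be reproduced without R with a small pure-Python helper (the function names are my own; the comparison is typically done on log-transformed FPKM values):

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def least_squares(x, y):
    """Slope and intercept of the least-squares regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return slope, my - slope * mx
```

A slope close to 1 and an intercept close to 0 (i.e., y = x), together with r ≈ 1, indicate that per-gene FPKM values are preserved at the lower depth.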


If, like me, you want to study gene expression but not necessarily in great detail (e.g., alternative variants and transcripts), it is better to increase the number of replicates than to put all your resources into deep sequencing. The sequencing depth can be evaluated beforehand if you have preliminary or sufficiently similar data (same tissue, same organism). This step will undoubtedly allow you to perform more experiments without blowing your sequencing budget, and to improve your results thanks to more robust differential expression analyses.


Update: I sequenced 8 samples at both 25M and 10M reads. The sequencing libraries came from two different preparations (same initial cDNA, but 2 libraries created separately by different people, using kits from different batches). This means that any variation observed may be due to the library preparation, to the sequencing depth, or to both.

Below are the comparative results from one of the samples:

All 8 samples gave a profile similar to the one shown here: a correlation above 0.99 between the two sequencing depths, and a regression line close to perfect (y = x).

Conclusion: I did not lose information by sequencing at a lower depth, and I was able to sequence more than twice as many samples as we had originally anticipated.

