6 ways biorepositories support clinical research


We are pleased to publish this guest blog post from Geneticist Inc.

Biorepositories help research institutions by providing tissue samples for clinical studies. Human tissue samples play a critical role in disease research by enabling assessments of molecular expression, prediction of toxicity, and identification of biomarkers. They help clarify and expand field-of-use claims, guide the selection of appropriate species for preclinical studies, and assist in the clinical trial stages of drug development. Below is a list of key areas where the availability of tissues (both from humans and preclinical species) can support pharmaceutical and other research.

Assessment of Molecular Expression

Biobanks contain vast libraries of human tissue samples, allowing for the assessment of expression levels of biological target molecules such as proteins and RNA. Methods include immunohistochemistry, in situ hybridization, western blotting, PCR and tissue microarrays, all of which can be applied to both normal and diseased tissues. Determining expression levels across a large volume of tissue samples provides critical information to drug developers, allowing an assessment of the appropriateness of a potential drug target. Excluding inappropriate drug targets saves millions in funding and years of wasted research.

FFPE DNA and RNA analysis leads the way as a source of comprehensive tissue information. It enables the stratification of tissues, thus advancing our understanding of heterogeneous diseases like cancer that were previously treated without an appreciation of their inherent molecular heterogeneity.

Toxicity Predictions

By illuminating altered levels of target molecule expression in organs and tissues beyond those targeted by a drug, data gathered from testing tissue samples can warn researchers of unanticipated toxicity in drugs under development.

Biomarker Studies

In addition to assessing expression levels, human tissue samples provide an excellent source for the identification and clarification of biomarkers. Well-annotated tissues offer an opportunity for disease stratification that can help identify appropriate personalized therapy for patients exhibiting similar biomarker profiles.

Field of Use Claims

Once drug targets have been accurately identified in tissue samples, they can be searched for in well-classified samples from patients with different diseases. The enormous quantities of well-annotated FFPE blocks could serve as a means to expand the use of existing drugs for diseases that exhibit similarities in biomarkers.

Preclinical Species Selection

The selection of appropriate species for preclinical evaluation of pipeline drugs can be aided by tissue procurement from biorepository collections, particularly procurement of FFPE tissue. This is done by analyzing differences and selecting the species with the most similar target compound expression profile, as determined by tissue arrays. Efficiently modeling human diseases helps drug developers avoid costly dead-end avenues, such as testing compounds in preclinical stages that will prove ineffective or toxic in human trials.

Clinical Trials

Once drug development reaches the clinical stage, embedded tissue blocks can continue to play a critical role in furthering research. Tissue samples enable patient stratification, prognostic assessments and pharmacological studies that would otherwise require impractically large numbers of trial participants.

While in vitro studies lay the foundation for a biochemical, molecular and genetic understanding of the biology of diseases, human tissue samples provide a source of information from which fundamental knowledge is transformed into actionable information.

Related publications

Conversant Bio. Well-Annotated Tissue Samples: An Essential Part of Drug Discovery.

Roswell Park Cancer Institute Blog. The Importance of Tissue Samples in Research.

McDonald, 2010. Principles of Research Tissue Banking and Specimen Evaluation from the Pathologist’s Perspective.


Evaluating the functional impact of genetic variants with COPE software


Dr Ge Gao, developer of the COPE-PCG software tool, talks here about his tool and how it can help researchers analyze sequencing data.

COPE: A new framework of context-oriented prediction for variant effects

Evaluating functional impacts of genetic variants is a key step in genomic studies. Whilst most popular variant annotation tools take a variant-centric strategy and assess the functional consequence of each variant independently, multiple variants in the same gene may interfere with each other and have different effects in combination than individually (e.g., a frameshift caused by an indel can be “rescued” by another downstream variant). The COPE framework, Context-Oriented Predictor for variant Effect, was developed to accurately annotate multiple co-occurring variants.

This new gene-centric annotation tool integrates the entire sequence context to evaluate the bona fide impact of multiple intra-genic variants in a context-sensitive approach.

COPE handles complex effects of multiple variants

Unlike the current variant-centric approach that assesses the functional consequence of each variant independently, COPE takes each functional element as the basic annotation unit and considers that multiple variants in the same functional element may interfere with each other and have different effects in combination than individually (complementary rescue effect).
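As a minimal illustration of why such rescue effects require evaluating variants together, here is a toy sketch (not the COPE implementation; the codon table is deliberately reduced to four entries) showing how a second SNV in the same codon can rescue an apparent stop-gain:

```python
# Toy sketch of codon-level rescue: a variant-centric annotator calls
# each SNV alone, while a context-oriented view applies co-occurring
# SNVs on the same haplotype before translating.
CODON_TABLE = {"AAA": "Lys", "TAA": "Stop", "AAC": "Asn", "TAC": "Tyr"}

def annotate(ref_codon, variants):
    """Apply all (0-based position, alt base) variants, then translate."""
    codon = list(ref_codon)
    for pos, alt in variants:
        codon[pos] = alt
    return CODON_TABLE["".join(codon)]

ref = "AAA"  # Lys in the reference
# Variant-centric view: each SNV assessed independently.
assert annotate(ref, [(0, "T")]) == "Stop"  # A>T alone looks stop-gained
assert annotate(ref, [(2, "C")]) == "Asn"   # A>C alone looks missense
# Context-oriented view: both SNVs occur in the same codon.
assert annotate(ref, [(0, "T"), (2, "C")]) == "Tyr"  # stop-gain rescued
```

The same codon thus yields three different calls depending on whether the two SNVs are annotated separately or jointly, which is the core motivation for COPE's gene-centric design.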

Overview of COPE: COPE uses each transcript as a basic annotation unit. The variant mapping step identifies variants within transcripts. The coding region inference step removes introns from each transcript; all possible splicing patterns are taken into consideration for splice-altering transcripts (in this case, the red dot indicates a splice acceptor site SNP, and intron retention and exon skipping are taken into consideration). The sequence comparison step compares a ‘mutant peptide’ against a reference protein sequence to obtain the final amino acid alteration.

Applying COPE software to genomic data

Screening the official 1000 Genomes variant set, COPE identified a considerable number of false-positive Loss-of-Function calls: 23.21% of splice-disrupting variants, 6.45% of frameshift indels and 2.10% of stop-gained variants, as well as several false-negative Loss-of-Function variants in 38 genes.

To the best of our knowledge, COPE is the first fully gene-centric tool for annotating the effects of variants in a context-sensitive approach.


Schematic diagram of typical types of annotation corrections implemented in COPE. A rescued stop-gained SNV indicates that another SNV (‘A’ to ‘C’) in the same codon rescues a variant-centric stop-gained SNV (‘A’ to ‘T’). Stop-gained MNV indicates that two or more SNVs result in a stop codon (‘A’ to ‘T’ and ‘C’ to ‘G’). A rescued frameshift indel indicates that another indel in the same haplotype recovers the original open reading frame. A splicing-rescued stop-gained/frameshift variant indicates that a stop-gained or frameshift variant is rescued by a novel splicing isoform. A rescued splice-disrupting variant indicates that a splice-disrupting variant is rescued by a nearby cryptic site (as shown in the figure) or a novel splice site. The asterisk in the figure indicates a stop codon.
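The rescued frameshift case described above reduces to simple arithmetic: what matters is the net length change of all indels on a haplotype, modulo 3. A small sketch (hypothetical helper, not part of COPE) makes this concrete:

```python
def net_frameshift(indel_lengths):
    """Net reading-frame shift from co-occurring indels on one haplotype.

    Positive values are insertion sizes, negative values deletion sizes.
    A result of 0 means the open reading frame is preserved downstream.
    """
    return sum(indel_lengths) % 3

assert net_frameshift([+1]) != 0       # lone 1-bp insertion: frameshift
assert net_frameshift([-2]) != 0       # lone 2-bp deletion: frameshift
assert net_frameshift([+1, -1]) == 0   # together: original frame restored
```

A variant-centric annotator would flag both indels in the last case as Loss-of-Function, even though the combined haplotype keeps the reading frame intact.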

Evaluating the quality of COPE: availability, usability and flexibility

  • Free software
  • Publicly available online server and stand-alone package for large-scale analysis


Screenshot of the COPE web server. Example of input (A) and annotation by COPE (B)

  • Software documentation: A detailed guideline for installation and setup is available
  • Recent updates: COPE-PCG has been online since June 2016, and COPE-TFBS since March 2017 on a new website
  • Analysis of protein-coding genes (COPE-PCG), transcription factor binding sites (COPE-TFBS) and more: the COPE framework may also be extended and adapted to non-coding RNAs and miRNAs in the near future.

About the author

Dr Ge Gao is a principal investigator at the Center for Bioinformatics of Peking University. His team focuses primarily on developing novel computational techniques to analyze, integrate and visualize high-throughput biological data effectively and efficiently, with applications for deciphering the function and evolution of gene regulatory systems. Dr Gao specializes in large-scale data mining, using a combination of statistical learning, high-performance computing, and data visualization.


Cheng et al., 2017. Accurately annotate compound effects of genetic variants using a context-sensitive framework. Nucleic Acids Research.

Cheng et al., in preparation. Systematically identify and annotate multiple-variant compound effect at transcription factor binding sites in the human genome.

Evaluating biomedical data production with text mining


Estimating biomedical data

Evaluating the impact of a scientific study is a difficult and controversial task. Recognition of the value of a biomedical study is widely measured by traditional bibliographic metrics such as the number of citations of the paper and the impact factor of the journal.

However, a more relevant criterion for the success of a research study likely lies in the production of biological data itself, both in terms of quality and in how these datasets can be reused to validate (or reject!) hypotheses and support new research projects. Although biological data can be deposited in dedicated repositories such as the GEO database, ImmPort, ENA, etc., most data are primarily disseminated within the text, figures and tables of articles. This raises a question: how can we find and measure the production of biomedical data diffused in scientific publications?

To address this issue, Gabriel Rosenfeld and Dawei Lin developed a novel text-mining strategy that identifies articles producing biological data. They published their method, “Estimating the scale of biomedical data generation using text mining”, this month on bioRxiv.

Text mining analysis of biomedical research articles

Using the Global Vector for Word Representation (GloVe) algorithm, the authors identified term usage signatures for 5 types of biomedical data: flow cytometry, immunoassays, genomic microarray, microscopy, and high-throughput sequencing.

They then analyzed the free text of 129,918 PLOS articles published between 2013 and 2016. They found that nearly half of them (59,543) generated one or more of the five data types tested, producing 81,407 datasets.
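The idea of a term usage signature can be sketched with cosine similarity over word vectors. The snippet below is purely illustrative: the 3-dimensional vectors and seed terms are toy values standing in for the authors' trained GloVe model, and the scoring rule is an assumption, not their published pipeline.

```python
# Toy sketch: score an article against per-data-type term signatures
# built by averaging word vectors, then compare with cosine similarity.
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d "embeddings" standing in for trained GloVe vectors.
vectors = {
    "cytometry":  [0.9, 0.1, 0.0],
    "facs":       [0.8, 0.2, 0.1],
    "sequencing": [0.1, 0.9, 0.2],
    "reads":      [0.0, 0.8, 0.3],
}

def signature(terms):
    """Average the vectors of a data type's seed terms."""
    dims = zip(*(vectors[t] for t in terms))
    return [sum(d) / len(terms) for d in dims]

flow_sig = signature(["cytometry", "facs"])
seq_sig = signature(["sequencing", "reads"])

article_vec = vectors["reads"]  # stand-in for an article's mean vector
# The article scores closer to the sequencing signature than to flow cytometry.
assert cosine(article_vec, seq_sig) > cosine(article_vec, flow_sig)
```

In practice the embeddings would come from GloVe trained on a large biomedical corpus, and an article would be represented by aggregating the vectors of its full text rather than a single term.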


Estimating PLOS articles generating each biomedical data type over time (from “Estimating the scale of biomedical data generation using text mining”, bioRxiv).

This text-mining method was tested on manually annotated articles and provided a valuable balance of precision and recall. The obvious, and exciting, next step is to apply this approach to evaluate the amount and types of data generated within the entire PubMed repository of articles.



A step beyond data dissemination

Evaluating the exponentially growing amount and diversity of datasets is currently a key aspect of determining the quality of a biomedical study. However, in today’s era of bioinformatics, fully exploiting the data requires going a step beyond the publication and dissemination of datasets and tools, towards the critical goal of improving data reproducibility and transparency (data provenance, collection, transformation, computational analysis methods, etc.).

Open-access, community-driven projects such as the online bioinformatics tools platform OMICtools provide access not only to a large number of repositories for locating valuable datasets, but also to the best software tools for re-analyzing and exploiting the full potential of these datasets.

In a virtuous circle of discovery, previously generated datasets could be repurposed for new data production, interactive visualization, machine learning and artificial intelligence, allowing us to answer new biomedical questions.