MEDLINE queries made easy with the MEDOC tool

MEDOC (MEdline DOwnloading Contrivance) is a Python program designed to download MEDLINE data from an FTP server and to load all the extracted information into a local MySQL database, thus making MEDLINE searches easy.

MEDLINE, the biomedical data keeper

Since the MEDLINE database was released almost 50 years ago, the number of indexed publications has risen from 1 million in 1970 to 27 million this year. The aim of this repository is to make the scientific literature easier to access for everyone.

Evolution of the number of documents on PubMed

The NIH (National Institutes of Health, USA) also provides a powerful search engine, which allows users to query this database through the well-known PubMed web interface. This search engine supports complex queries using logical operators (OR, AND) and indexes different text blocks (such as title and abstract) for refined searches. Moreover, several API services have been released to allow routine searches, programmatic parsing of the results, and data extraction.

However, to query these APIs (E-utilities), the user needs to write a different script for every search (which can become time-consuming when many data items requiring different parsing are needed) and to query the API many times to retrieve individual data from a single article.
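As a rough illustration of what one such script involves, the sketch below only builds the URL of an ESearch request, without sending it. The search term is a made-up example, and `build_esearch_url` is a hypothetical helper name, not part of MEDOC or of any NCBI library.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=10):
    """Return an ESearch URL for a PubMed query (no request is sent)."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH_URL + "?" + urlencode(params)


# Made-up query: field tags restrict each word to the title or abstract,
# and logical operators combine the conditions.
query = "antioxidants[Title] AND (diet[Title/Abstract] OR nutrition[Title/Abstract])"
url = build_esearch_url(query)
print(url)
```

A second call (for example to EFetch) would then be needed to retrieve the records themselves, which is where the per-search parsing code starts to accumulate.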

To make data mining easier, the NIH now allows users to download MEDLINE data from an FTP server containing XML-tagged files.
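To give an idea of what parsing such a file looks like, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The record is a tiny made-up sample that only mimics a few elements of the PubMed XML layout, and `parse_articles` is a hypothetical helper, not MEDOC's actual parser.

```python
import xml.etree.ElementTree as ET

# Tiny made-up record that mimics part of the layout of a MEDLINE XML file.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>00000001</PMID>
      <Article>
        <ArticleTitle>A placeholder title</ArticleTitle>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
          <Author><LastName>Roe</LastName><ForeName>John</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def parse_articles(xml_text):
    """Yield one dict per <PubmedArticle> element."""
    root = ET.fromstring(xml_text.strip())
    for article in root.iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        authors = [
            (a.findtext("LastName"), a.findtext("ForeName"))
            for a in citation.iter("Author")
        ]
        yield {
            "pmid": citation.findtext("PMID"),
            "title": citation.findtext("Article/ArticleTitle"),
            "authors": authors,
        }


records = list(parse_articles(SAMPLE_XML))
print(records)
```

For the real files, which are large, an incremental parser such as `ET.iterparse` would probably be a better fit than loading a whole document at once.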

Relational database to the rescue

Although NoSQL databases have been on the rise in recent years, a local, relational version of the MEDLINE database is useful for complex and frequent queries. The idea behind MEDOC was thus to build a relational schema and load the XML files into this MySQL version.

Steps executed by MEDOC to build the local database

The figure above presents every step executed by the Python 3 wrapper to construct this local database. 13 tables were created to store all the data contained in the XML files extracted from the NIH FTP server (authors, chemical products, MeSH, corrections, citation subset, publication type, language, grant, data bank, personal name subject, other ID and investigator).
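The one-table-per-entity layout described above can be sketched as follows. This is only an illustration under stated assumptions: the two tables and their column names are hypothetical (the real MEDOC schema has its own), and Python's built-in `sqlite3` stands in for MySQL so that the example runs without a server.

```python
import sqlite3

# Hypothetical two-table excerpt of a relational layout; the real MEDOC
# schema has more tables and its own names. sqlite3 stands in for MySQL.
SCHEMA = """
CREATE TABLE article (
    pmid  INTEGER PRIMARY KEY,
    title TEXT NOT NULL
);
CREATE TABLE author (
    author_id INTEGER PRIMARY KEY AUTOINCREMENT,
    pmid      INTEGER NOT NULL REFERENCES article(pmid),
    last_name TEXT,
    fore_name TEXT
);
"""


def load_record(conn, record):
    """Insert one parsed article and its authors (one row per author)."""
    conn.execute(
        "INSERT INTO article (pmid, title) VALUES (?, ?)",
        (record["pmid"], record["title"]),
    )
    conn.executemany(
        "INSERT INTO author (pmid, last_name, fore_name) VALUES (?, ?, ?)",
        [(record["pmid"], last, fore) for last, fore in record["authors"]],
    )


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Made-up record, shaped like the output of an XML parsing step.
load_record(conn, {"pmid": 1, "title": "A placeholder title",
                   "authors": [("Doe", "Jane"), ("Roe", "John")]})
conn.commit()
author_count = conn.execute("SELECT COUNT(*) FROM author").fetchone()[0]
print(author_count)
```

Storing repeated elements such as authors in their own table, keyed by the article identifier, is what later makes joins across fields possible.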

Example query

It took MEDOC 113 hours (4 days and 17 hours) to load the 1174 files hosted on the FTP server into the MySQL database, which uses 61.3 GB of disk space.

Querying this version is almost instantaneous, even when joining several tables. In the example provided below, the 10 most recent publications about antioxidants indexed on PubMed were retrieved with SQL queries.

Example SQL query run against the MEDOC database

The following result was returned in 0.022 seconds:

Result obtained with the example query
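Since the original query survives only as an image, the sketch below shows what a comparable query might look like. Everything in it is an assumption made for illustration: the table and column names are hypothetical, the rows are made up, the article is assumed to be matched through a MeSH descriptor, and `sqlite3` again stands in for MySQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical tables and columns; sqlite3 stands in for the MySQL server.
conn.executescript("""
CREATE TABLE article (pmid INTEGER PRIMARY KEY, title TEXT, date_created TEXT);
CREATE TABLE mesh (pmid INTEGER REFERENCES article(pmid), descriptor TEXT);
CREATE INDEX idx_mesh_descriptor ON mesh (descriptor);
""")
# Made-up rows: 12 articles tagged "Antioxidants", plus one unrelated article.
for i in range(1, 13):
    conn.execute("INSERT INTO article VALUES (?, ?, ?)",
                 (i, f"Placeholder title {i}", f"2017-01-{i:02d}"))
    conn.execute("INSERT INTO mesh VALUES (?, 'Antioxidants')", (i,))
conn.execute("INSERT INTO article VALUES (13, 'Unrelated', '2017-02-01')")
conn.execute("INSERT INTO mesh VALUES (13, 'Genomics')")

# Join the two tables, filter on the descriptor, keep the 10 newest articles.
LATEST_QUERY = """
SELECT a.pmid, a.title, a.date_created
FROM article AS a
JOIN mesh AS m ON m.pmid = a.pmid
WHERE m.descriptor = ?
ORDER BY a.date_created DESC
LIMIT 10
"""
rows = conn.execute(LATEST_QUERY, ("Antioxidants",)).fetchall()
for row in rows:
    print(row)
```

The index on the filtered column is the kind of detail that keeps such joins fast on a table with millions of rows.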

In summary, this indexed relational database allows the user to build complex and rapid queries. All fields can thus be searched for desired information, a task that is difficult to accomplish through the PubMed graphical interface. MEDOC is free and publicly available on GitHub.

Your top 3 gene clustering software tools

Clustering is a fundamental step in the analysis of biological and omics data. It is used to construct groups of objects (genes, proteins) that have related functions or expression patterns, or that are known to interact. In microarray or RNA-seq experiments, gene clustering is often combined with a heatmap representation for data visualization.

Choosing the right clustering tool for your analysis

Many clustering methods and algorithms have been developed and are classified into partitioning (k-means), hierarchical (connectivity-based), density-based, model-based and graph-based approaches.
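As an example of the partitioning family, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm). The expression values are made up, the initial centroids are chosen by hand, and the function is not taken from any of the tools discussed in this post.

```python
import math


def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: returns (labels, centroids) for n-dimensional points."""
    centroids = [tuple(c) for c in centroids]
    labels = []
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == k]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster in place
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return labels, centroids


# Made-up expression values for six genes under two conditions.
genes = [(1.0, 1.2), (0.8, 1.0), (1.1, 0.9), (5.0, 5.2), (5.3, 4.9), (4.8, 5.1)]
labels, centers = kmeans(genes, centroids=[genes[0], genes[3]])
print(labels)
```

Hierarchical, density-based, model-based and graph-based methods differ mainly in how they define "close" and whether the number of clusters must be fixed in advance, as it is here.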

To help you choose between all the existing clustering tools, we asked OMICtools members to vote for their favorite software. Here are the top 3 tools, chosen by 23 voters.

First place for ClustEval

ClustEval is a web-based clustering analysis platform developed at the Max Planck Institute for Informatics and the University of Southern Denmark. It is designed to objectively compare the performance of various clustering methods on different datasets.

More precisely, ClustEval compared the performance of 18 of the most widely used clustering methods on 24 different datasets. These datasets include gene expression data, protein sequence similarity, protein structure similarity, social networks, word sense disambiguation, etc. The performance of a clustering method is then evaluated by an F1-score (harmonic mean of precision and recall).
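The F1-score can be illustrated with a short sketch. It assumes a pairwise definition, in which precision and recall are computed over pairs of objects placed in the same cluster; this is one common convention and may differ from the exact computation used by ClustEval. The labels are made up.

```python
from itertools import combinations


def co_clustered_pairs(labels):
    """Return the set of index pairs that share a cluster label."""
    return {
        (i, j)
        for i, j in combinations(range(len(labels)), 2)
        if labels[i] == labels[j]
    }


def pairwise_f1(predicted, reference):
    """F1-score: harmonic mean of pairwise precision and recall."""
    pred_pairs = co_clustered_pairs(predicted)
    ref_pairs = co_clustered_pairs(reference)
    true_positives = len(pred_pairs & ref_pairs)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_pairs)
    recall = true_positives / len(ref_pairs)
    return 2 * precision * recall / (precision + recall)


# Made-up labels for six objects: gold standard versus a clustering result.
reference = ["a", "a", "a", "b", "b", "b"]
predicted = [0, 0, 1, 1, 1, 1]
score = pairwise_f1(predicted, reference)
print(round(score, 3))
```

Because it is a harmonic mean, the score stays low whenever either precision or recall is low, so a method cannot score well simply by merging everything into one cluster.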

Finally, ClustEval can be downloaded and installed by users to perform their own clustering analysis comparisons, as a VirtualBox image, with Docker and Docker Compose, or as an R package.

Performance of all clustering tools on all nonartificial data sets on the basis of F1 scores.

Second place for Babelomics

Babelomics is a web application developed by the Computational Genomics Department of the Principe Felipe Research Center in Valencia. It performs a wide range of functional analyses of gene expression and genomic data, from processing to expression analysis and gene set enrichment.

In its current version, Babelomics 5, the website offers a user-friendly and intuitive interface for clustering microarray or RNA-seq data using one of three methods: UPGMA, SOTA, and k-means. The result can then be visualized as a heatmap. Example datasets and analyses are provided for every functionality of the application, and tutorials are also available.

Babelomics clustering tool.

Third place for AltAnalyze

AltAnalyze is a comprehensive application for the analysis of single-cell and bulk RNA-seq data that can automate every step of gene expression and splicing analysis, including clustering and heatmap representation. It was developed in the Nathan Salomonis laboratory at Cincinnati Children’s Hospital Medical Center and the University of Cincinnati.

AltAnalyze offers many options for clustering algorithms and normalization, as well as unique features such as finding optimized clusters for single-cell analysis.

AltAnalyze can be downloaded and run on all operating systems, and comes with useful documentation (tutorials, blog, FAQ).

Heatmap and clustering generated with AltAnalyze

References

(Wiwie et al., 2015) Comparing the performance of biomedical clustering methods. Nature Methods.

(Alonso et al., 2015) Babelomics 5.0: functional interpretation for new generations of genomic data. Nucleic Acids Research.

(Emig et al., 2010) AltAnalyze and DomainGraph: analyzing and visualizing exon expression data. Nucleic Acids Research.

(Olson et al., 2016) Single-cell analysis of mixed-lineage states leading to a binary cell fate choice. Nature.

 

Your top 3 Venn diagram tools

Venn diagrams are very simple, yet incredibly useful tools used to show all logical relations between finite collections of different sets of data. In Venn diagrams, sets of data are often represented as overlapping circles. Data that are shared between two different sets will reside at the intersection, while unique data remain outside the intersection.

Venn diagrams in biology

In biology and omics, Venn diagrams can be used for a variety of purposes, such as the comparison of different lists of genes or proteins (generally 2 or 3) to identify similarities and represent them in two dimensions. Most software tools allow easy extraction of the data, and let you customize the diagrams.
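The comparison behind a two-list Venn diagram boils down to a few set operations, sketched here with Python's built-in `set` type. The gene lists are made up for illustration.

```python
# Made-up gene lists, e.g. hits from two different experiments.
list_a = ["TP53", "BRCA1", "MYC", "EGFR"]
list_b = ["MYC", "EGFR", "KRAS"]

set_a, set_b = set(list_a), set(list_b)

common = set_a & set_b   # intersection: genes present in both lists
only_a = set_a - set_b   # genes unique to the first list
only_b = set_b - set_a   # genes unique to the second list

print(sorted(common), sorted(only_a), sorted(only_b))
```

The three resulting sets correspond to the overlap and the two outer regions of the diagram.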

Continuing our series on data visualization tools, OMICtools members voted for their favorite Venn diagram software and websites. Here are the results from 54 voters.

Your number 1 Venn diagram generation tool: Venny

50% of you chose Venny as your number 1 tool to generate Venn diagrams.

Venny is a web server created by Juan Carlos Oliveros from the BioinfoGP service at the Spanish National Biotechnology Centre. It can be used online or offline to generate Venn diagrams from up to four lists.

Its straightforward usage lets you create diagrams and extract data in 3 basic steps:

  1. Paste your lists of data (one element per row) and rename the lists
  2. Click on the numbers to get exclusive and common data between lists
  3. Right-click the figure to view and save the diagram

Venny allows basic customization of your diagrams (line weight, font size and style).

Example Venn diagram generated with Venny

Shared second place for BioVenn and Venn diagram

43% of the OMICtools community voted for BioVenn and Venn diagram as their favorite tools!

BioVenn

BioVenn is a web application developed at the Centre for Molecular and Biomolecular Informatics that enables creation of Venn diagrams from up to 3 sets of data. Unlike in Venny, the diagrams in BioVenn are area-proportional, which means that the size of the circles and the overlaps correspond to the sizes of the data sets. BioVenn also comes with interesting features, such as the ability to directly upload a data set from a tab file, or to support a wide range of identifiers which can be linked to biological databases.
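To make the idea of area-proportional diagrams concrete, the sketch below computes circle radii and a center-to-center distance such that the two circle areas and their overlap match given set sizes. It uses the standard circle-circle intersection formula and a bisection search; this is an illustration only, not BioVenn's implementation, and the function names and set sizes are made up.

```python
import math


def _clamp(x, low, high):
    """Guard against tiny floating-point overshoots."""
    return max(low, min(high, x))


def lens_area(r1, r2, d):
    """Area of overlap between two circles whose centers are d apart."""
    if d >= r1 + r2:
        return 0.0                            # circles do not overlap
    if d <= abs(r1 - r2):
        return math.pi * min(r1, r2) ** 2     # one circle inside the other
    c1 = _clamp((d * d + r1 * r1 - r2 * r2) / (2 * d * r1), -1.0, 1.0)
    c2 = _clamp((d * d + r2 * r2 - r1 * r1) / (2 * d * r2), -1.0, 1.0)
    kite = (-d + r1 + r2) * (d + r1 - r2) * (d - r1 + r2) * (d + r1 + r2)
    return (r1 * r1 * math.acos(c1) + r2 * r2 * math.acos(c2)
            - 0.5 * math.sqrt(max(kite, 0.0)))


def proportional_layout(size_a, size_b, size_common, tol=1e-9):
    """Return (r_a, r_b, d) so that the areas equal the given set sizes."""
    r_a = math.sqrt(size_a / math.pi)   # circle area equals set size
    r_b = math.sqrt(size_b / math.pi)
    lo, hi = abs(r_a - r_b), r_a + r_b  # from full overlap to no overlap
    # The overlap shrinks as the centers move apart, so bisection applies.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if lens_area(r_a, r_b, mid) > size_common:
            lo = mid   # still too much overlap: move the centers apart
        else:
            hi = mid
    return r_a, r_b, (lo + hi) / 2


# Made-up sizes: sets of 100 and 60 elements, 20 of them shared.
r_a, r_b, d = proportional_layout(100, 60, 20)
print(r_a, r_b, d)
```

With three sets, an exact area-proportional layout with circles is not always possible, which is one reason such tools limit the number of sets they support.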

Venn diagram generated with BioVenn

Venn diagram

Venn diagram was developed by the VIB-UGent Center for Plant Systems Biology at Ghent University. This web application allows users to draw Venn diagrams from up to 6 data lists, in a symmetric or non-symmetric fashion. The diagrams can then be downloaded in SVG or PNG format. Moreover, Venn diagram is able to calculate the intersections of up to 30 different lists, making it a useful tool to identify common values between multiple data sets.
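Computing intersections across many lists can be sketched in a few lines of Python. The approach below groups each element by the exact combination of lists that contain it; it is one possible way to do this, not the method used by the Venn diagram web tool, and the identifiers and function names are made up.

```python
from collections import defaultdict


def venn_regions(named_lists):
    """Map each combination of list names to the elements found in
    exactly those lists (one entry per non-empty Venn region)."""
    membership = defaultdict(set)
    for name, items in named_lists.items():
        for item in items:
            membership[item].add(name)
    regions = defaultdict(set)
    for item, names in membership.items():
        regions[frozenset(names)].add(item)
    return dict(regions)


def common_to_all(named_lists):
    """Elements shared by every list (the central intersection)."""
    return set.intersection(*(set(items) for items in named_lists.values()))


# Made-up identifiers spread over three lists.
lists = {
    "A": ["g1", "g2", "g3"],
    "B": ["g2", "g3", "g4"],
    "C": ["g3", "g5"],
}
regions = venn_regions(lists)
shared = common_to_all(lists)
print(shared)
```

Because only non-empty regions are stored, the same code works for 30 lists without enumerating all possible combinations.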

Venn diagram generated with the Venn diagram software

Bronze medal for InteractiVenn

A very close third place goes to InteractiVenn, chosen by 41% of voters as their favorite tool.

InteractiVenn is a sophisticated and flexible web-based tool that allows creation of Venn diagrams from up to 6 lists and analysis of set unions, while preserving the shape of the diagram. By displaying partial unions, the user is able to locate regions that combine unions of sets and their intersections, thus providing additional observations on the interactions between joined sets.

With InteractiVenn, the user can choose text size, color and opacity, and can export the diagram in a vector format. Datasets can also be saved locally for later use on the website.

Venn diagram generated with InteractiVenn

References

(Oliveros, J.C., 2007-2015) Venny. An interactive tool for comparing lists with Venn’s diagrams. http://bioinfogp.cnb.csic.es/tools/venny/index.html

(Hulsen et al., 2008) BioVenn – a web application for the comparison and visualization of biological lists using area-proportional Venn diagrams. BMC Genomics.

Venn diagram: http://bioinformatics.psb.ugent.be/webtools/Venn/

(Heberle et al., 2015) InteractiVenn: a web-based tool for the analysis of sets through Venn diagrams. BMC Bioinformatics.