MEDLINE queries made easy with the MEDOC tool


MEDOC (MEdline DOwnloading Contrivance) is a Python program designed to download MEDLINE data from an FTP server and to load all extracted information into a local MySQL database, thus making MEDLINE searches easy.

MEDLINE, the biomedical data keeper

Since the MEDLINE database was released almost 50 years ago, the number of indexed publications has risen from 1 million in 1970 to 27 million this year. The aim of this repository is to make the scientific literature accessible to everyone.

Evolution of the number of documents on PubMed

The NIH (National Institutes of Health, USA) also provides a powerful search engine, which lets users query this database through the well-known web interface PubMed. This search engine supports complex queries by using logical operators (OR, AND) and indexes different text blocks (such as title, abstract) for refined search. Moreover, several API services have been released to allow routine searches, programmatic parsing of the results, and data extraction.

However, to query these APIs (E-utilities), the user has to write a separate script for every search, which becomes time-consuming when many kinds of data requiring different parsing are needed, and has to call the API many times to retrieve the individual data for each article.
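As a rough illustration of what one such script involves, here is a minimal sketch that only builds the URL of an E-utilities ESearch request, without sending it. The endpoint and the `db`, `term`, `retmax` and `retmode` parameters are part of the public E-utilities interface; the helper name `build_esearch_url` and the query string are assumptions made for this example.

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=10):
    """Return an ESearch URL for a PubMed query (no request is sent)."""
    params = {
        "db": "pubmed",     # search the PubMed database
        "term": term,       # query string, may use AND/OR and field tags
        "retmax": retmax,   # maximum number of PMIDs to return
        "retmode": "json",  # ask for a JSON response
    }
    return ESEARCH_URL + "?" + urlencode(params)


# Hypothetical query: the field tag restricts the match to title/abstract.
query = "antioxidants[Title/Abstract] AND (diet OR nutrition)"
url = build_esearch_url(query, retmax=10)
print(url)
```

ESearch only returns identifiers, so fetching the records themselves and parsing each response would need further calls. That is the overhead described above.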

To make data mining easier, the NIH now allows users to download MEDLINE data from an FTP server containing XML-tagged files.

Relational database to the rescue

Even though NoSQL databases have been on the rise in recent years, a local, relational version of the MEDLINE database is useful for complex and frequent queries. The idea behind MEDOC was thus to build a relational schema and load the XML files into this MySQL version.


The figure above presents every step executed by the Python 3 wrapper to construct this local database. 13 tables were created to store all the data contained in the XML files extracted from the NIH FTP server (authors, chemical products, MeSH, corrections, citation subset, publication type, language, grant, data bank, personal name subject, other ID and investigator).
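The general idea of turning XML records into relational rows can be sketched as follows. This is not MEDOC's actual code: it uses an in-memory SQLite database as a stand-in for MySQL, a hand-written XML fragment with a made-up record, and hypothetical table names (`citation`, `author`). Only the XML element names follow the MEDLINE/PubMed layout.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Tiny hand-written fragment in the MEDLINE/PubMed XML layout (made-up record).
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>12345</PMID>
      <Article>
        <ArticleTitle>A made-up article about antioxidants</ArticleTitle>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
          <Author><LastName>Roe</LastName><ForeName>Richard</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def load_articles(xml_text, conn):
    """Parse the XML and insert one row per article and one per author."""
    # Hypothetical two-table schema; the real database uses 13 tables.
    conn.execute("CREATE TABLE citation (pmid INTEGER PRIMARY KEY, title TEXT)")
    conn.execute(
        "CREATE TABLE author ("
        "pmid INTEGER REFERENCES citation(pmid), last_name TEXT, fore_name TEXT)"
    )
    root = ET.fromstring(xml_text)
    for citation in root.iter("MedlineCitation"):
        pmid = int(citation.findtext("PMID"))
        title = citation.findtext("Article/ArticleTitle")
        conn.execute("INSERT INTO citation VALUES (?, ?)", (pmid, title))
        for author in citation.iterfind("Article/AuthorList/Author"):
            conn.execute(
                "INSERT INTO author VALUES (?, ?, ?)",
                (pmid, author.findtext("LastName"), author.findtext("ForeName")),
            )
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for the local MySQL database
load_articles(SAMPLE_XML, conn)
rows = conn.execute(
    "SELECT c.pmid, a.last_name FROM citation c "
    "JOIN author a ON a.pmid = c.pmid ORDER BY a.last_name"
).fetchall()
print(rows)
```

Splitting repeated elements such as authors into their own table, keyed by the article identifier, is what later makes joins across fields possible.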

Example of request

It took 113 hours (4 days and 17 hours) for MEDOC to load the 1174 files from the FTP server into the MySQL database (representing 61.3 GB of disk space used).

Querying this version is almost instantaneous, even when joining several tables together. In the example provided below, the 10 most recent publications about antioxidants indexed on PubMed were retrieved with SQL queries.


The following result was returned in 0.022 seconds:

Result obtained with the example query

In summary, this indexed relational database allows the user to build complex and rapid queries. All fields can thus be searched for desired information, a task that is difficult to accomplish through the PubMed graphical interface. MEDOC is free and publicly available on GitHub.
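To give a feel for the kind of multi-table query described above, here is a minimal sketch that retrieves the most recent publications carrying a given subject heading. It is not the query used in the article: the schema (`citation`, `mesh`), the column names and the rows are all hypothetical, and SQLite stands in for MySQL so that the example runs without a server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for the local MySQL database
conn.executescript("""
CREATE TABLE citation (pmid INTEGER PRIMARY KEY, title TEXT, date_created TEXT);
CREATE TABLE mesh (pmid INTEGER REFERENCES citation(pmid), heading TEXT);
""")

# Made-up records, for illustration only.
citations = [
    (1, "Made-up title A", "2017-01-10"),
    (2, "Made-up title B", "2017-03-02"),
    (3, "Made-up title C", "2016-11-25"),
]
mesh = [(1, "Antioxidants"), (2, "Antioxidants"), (2, "Diet"), (3, "Neoplasms")]
conn.executemany("INSERT INTO citation VALUES (?, ?, ?)", citations)
conn.executemany("INSERT INTO mesh VALUES (?, ?)", mesh)

# Join the two tables, keep one heading, newest first, at most 10 rows.
LATEST_BY_HEADING = """
SELECT c.pmid, c.title, c.date_created
FROM citation AS c
JOIN mesh AS m ON m.pmid = c.pmid
WHERE m.heading = ?
ORDER BY c.date_created DESC
LIMIT 10
"""
latest = conn.execute(LATEST_BY_HEADING, ("Antioxidants",)).fetchall()
for row in latest:
    print(row)
```

On a table with millions of rows, indexes on the join and filter columns are what keep such a query fast.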

Your top 3 gene clustering software tools


Clustering is a fundamental step in the analysis of biological and omics data. It is used to construct groups of objects (genes, proteins) that have related functions or expression patterns, or that are known to interact. In microarray or RNA-seq experiments, gene clustering is often associated with heatmap representation for data visualization.

Choosing the right clustering tool for your analysis

Many clustering methods and algorithms have been developed and are classified into partitioning (k-means), hierarchical (connectivity-based), density-based, model-based and graph-based approaches.
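As an example of the partitioning family, here is a minimal pure-Python sketch of k-means (Lloyd's algorithm) on 2-D points. The data are made-up expression values for six hypothetical genes in two conditions, and the starting centroids are fixed by hand to keep the run deterministic; real tools pick them differently and handle many more dimensions.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm on 2-D points, starting from the given centroids."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(
                range(len(centroids)),
                key=lambda k: (x - centroids[k][0]) ** 2 + (y - centroids[k][1]) ** 2,
            )
            for x, y in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == k]
            if members:  # keep the old centroid if a cluster is empty
                centroids[k] = (
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                )
    return labels, centroids


# Made-up expression values (condition 1, condition 2) for six genes.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.2), (5.0, 5.0), (5.2, 4.8), (4.8, 5.2)]
labels, centroids = kmeans(points, centroids=[points[0], points[3]])
print(labels)
print(centroids)
```

The number of clusters is fixed in advance by the number of starting centroids. Hierarchical and density-based methods do not need this choice, which is one reason they are treated as separate families.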

To help you choose between all the existing clustering tools, we asked OMICtools members to vote for their favorite software. Here are the top 3 tools, chosen by 23 voters.

First place for ClustEval

ClustEval is a web-based clustering analysis platform developed at the Max Planck Institute for Informatics and the University of Southern Denmark. It is designed to objectively compare the performance of various clustering methods on different datasets.

More precisely, ClustEval has compared the performance of 18 of the most widely used clustering methods on 24 different datasets. These datasets include gene expression data, protein sequence similarity, protein structure similarity, social networks, word sense disambiguation, etc. The performance of a clustering method is then evaluated by an F1-score (harmonic mean of precision and recall).
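To make the F1-score concrete, here is a minimal sketch of one common way to score a clustering against a reference: count pairs of objects placed together in both, then take the harmonic mean of precision and recall. This pair-counting formulation is an assumption chosen for illustration and is not necessarily the exact procedure ClustEval uses; the function name and the labels are made up.

```python
from itertools import combinations


def pairwise_f1(predicted, reference):
    """Pair-counting F1 between two clusterings given as label lists."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(predicted)), 2):
        same_pred = predicted[i] == predicted[j]
        same_ref = reference[i] == reference[j]
        if same_pred and same_ref:
            tp += 1  # pair grouped together in both clusterings
        elif same_pred:
            fp += 1  # grouped by the method but not in the reference
        elif same_ref:
            fn += 1  # grouped in the reference but split by the method
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Five made-up objects: reference classes versus one method's clusters.
reference = ["a", "a", "a", "b", "b"]
predicted = [0, 0, 1, 1, 1]
score = pairwise_f1(predicted, reference)
print(score)
```

Because the harmonic mean is dominated by the smaller of the two values, a method cannot score well by merging everything into one cluster (high recall, low precision) or by splitting everything apart.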

Finally, ClustEval can be downloaded and installed by users to perform their own clustering analysis comparison, as a VirtualBox image, with Docker and Docker Compose, or as an R package.

Performance of all clustering tools on all nonartificial data sets on the basis of F1 scores.

Second position for Babelomics

Babelomics is a web application developed by the Computational Genomics Department of the Principe Felipe Research Center in Valencia. It performs a wide range of functional analyses of gene expression and genomic data, from processing to expression analysis and gene set enrichment.

In its current version, Babelomics 5, the website offers a user-friendly and intuitive interface for the clustering of microarray or RNA-seq data using one of three different methods: UPGMA, SOTA, and k-means. The result can then be visualized as a heatmap. Example data sets and analyses are provided for every functionality of the application, and tutorials are available.

Babelomics clustering tool.

Third place for AltAnalyze

AltAnalyze is a comprehensive application for the analysis of single-cell and bulk RNA-seq data that can automate every step of gene expression and splicing analysis, including clustering and heatmap representation. It was developed in the Nathan Salomonis laboratory at Cincinnati Children’s Hospital Medical Center and the University of Cincinnati.

AltAnalyze offers many options for clustering algorithms and normalization, as well as unique features such as finding optimized clusters for single-cell analysis.

AltAnalyze can be downloaded and run on all operating systems, and comes with useful documentation (tutorials, blog, FAQ).

Heatmap and clustering generated with AltAnalyze


(Wiwie et al., 2015) Comparing the performance of biomedical clustering methods. Nature Methods.

(Alonso et al., 2015) Babelomics 5.0: functional interpretation for new generations of genomic data. Nucleic Acids Research.

(Emig et al., 2010) AltAnalyze and DomainGraph: analyzing and visualizing exon expression data. Nucleic Acids Research.

(Olson et al., 2016) Single-cell analysis of mixed-lineage states leading to a binary cell fate choice. Nature.


OMICtools uses code versioning to enhance tool traceability


Software tools on the Web evolve over time, with developers adding or removing features to improve their product, making code versioning an essential aspect of sustainable software development. Reliable code versioning allows developers to automatically track their work and revert to previous versions when needed.

Discover here how OMICtools can help you facilitate the traceability of your tool, and learn how to successfully upload your work to a community-controlled repository.

Benefits of version control

  • Provides a mechanism to keep track of code changes
  • Allows you to track the history of changes, work on the same code files, and merge code from different branches
  • Shows conflicts on code merges, allowing you to resolve them quickly

Following the FAIR data principles

The FAIR system provides recommendations for scientific data management and stewardship. OMICtools applies the FAIR guidelines to bioinformatics tools, by making them easily Findable, Accessible, Interoperable and Reusable.

  • Findability: Software and code versions are easy to find with the OMICtools advanced search engine. We continuously collect and update information from original articles, websites and repositories to make available the latest bioinformatics tools.
  • Accessibility: OMICtools ensures ongoing access to software tools, helping to make knowledge and support available for users. Clear, accessible and relevant information, as well as a direct link to the original source, is provided for each tool. We also provide metadata about the tool's maintenance and use (name and email of the tool developer, forum and feedback from the biomedical community).
  • Interoperability: As far as possible, we record all tools which can be combined with other datasets by either users or computer systems. Maintainability of software is only one of the quality dimensions. Each tool on the OMICtools website also has a unique Research Resource Identifier (RRID), developed under the Resource Identification Initiative, which is transferred to the Neuroscience Information Framework (NIF) registry.
  • Reusability: OMICtools keeps a complete history of code versions so that they can be easily accessed and/or downloaded. This long-term software archive allows all users easy access to a previous version.

Uploading and versioning your source code

If you want to version your source code, you can get started by finding your tool in the OMICtools repository using the search engine. Once you find it, click the upload version button and follow the instructions. All you need to do is indicate the version of the source code, the operating system and architecture, and add the publication linked to the code. It’s as quick and easy as that, and of course you can contact us with any questions.

Once your code is uploaded, a corresponding DOI is generated automatically through programmatic access to DataCite’s API. This unique identifier is defined by the International DOI Foundation and assigned by OMICtools to allow precise, long-term preservation of your tool. If a DOI has already been attributed to your code version, you can let us know and it will be directly imported from the software platform you used. The DOI and files can’t be modified later.


This version control repository service is designed to facilitate the development, maintenance and follow-up of bioinformatics tools by the designers themselves.

An overview of the variety of distribution channels of the tools

Remember that each published source code version is registered with a unique DOI, which provides a permanent identification of your resource even if material is moved or rearranged. Hence OMICtools not only supports scientists in the analysis and understanding of biological datasets, but also makes it possible to cite more precisely the bioinformatics methods used to produce and reproduce results, thereby promoting the quality of scientific publications in accordance with the FAIR guiding principles for scientific data management.
