Analyse co-expression gene modules with CEMiTool


Identifying changes in the expression levels of individual genes is a common analysis step after a microarray or RNA-Seq experiment. The expression levels of co-expressed genes can also be analyzed and visualized with gene co-expression networks (GCNs), which are undirected graphs that represent co-expression relationships between pairs of genes across samples.
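To make the idea of a co-expression network concrete, here is a minimal sketch in Python. It assumes that co-expression is measured with the Pearson correlation and that an edge is drawn when the absolute correlation reaches a fixed cutoff. The cutoff, the function names and the toy data are all illustrative assumptions, and this is not the procedure CEMiTool itself uses.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def coexpression_edges(expression, threshold=0.9):
    """Return undirected edges (gene_a, gene_b, r) where |r| >= threshold."""
    edges = []
    # combinations() yields each unordered gene pair once: the graph is undirected.
    for gene_a, gene_b in combinations(sorted(expression), 2):
        r = pearson(expression[gene_a], expression[gene_b])
        if abs(r) >= threshold:
            edges.append((gene_a, gene_b, round(r, 3)))
    return edges


# Invented toy data: expression of four genes across five samples.
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],
    "geneB": [2.1, 3.9, 6.2, 8.0, 9.9],
    "geneC": [5.0, 4.0, 3.0, 2.0, 1.0],
    "geneD": [3.0, 1.0, 4.0, 1.0, 5.0],
}

edges = coexpression_edges(expression, threshold=0.9)
for gene_a, gene_b, r in edges:
    print(f"{gene_a} -- {gene_b} (r = {r})")
```

In this toy example geneA, geneB and geneC end up connected to each other, while geneD stays isolated. Using the absolute value treats strongly anti-correlated genes as co-expressed, which is a design choice rather than a requirement.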

Dr. Helder Nakaya from the University of São Paulo recently developed CEMiTool, an easy-to-use method to automatically run gene co-expression analyses in R. Here, he describes the features provided by CEMiTool.

Analyse your transcriptomic data for co-expression modules

The analysis of co-expression gene modules can help uncover the mechanisms underlying diseases and infection. CEMiTool is a fast and easy-to-use Bioconductor package that unifies the discovery and the analysis of co-expression modules.

Among its features, CEMiTool evaluates whether specific pathways are over-represented among the genes of a module and whether a module is altered in a specific sample group. It also integrates transcriptomic data with interactome information, identifying the potential hubs in each network.

In addition, CEMiTool provides users with a novel unsupervised gene filtering method, and automated parameter selection for identifying modules. The tool then reports everything in HTML web pages with high-quality plots and interactive tables.

CEMiTool features

Several functions can be run independently, or all at once using the cemitool function.

With a single command, CEMiTool can generate a plot that displays the expression of each gene within a module:

Expression of each gene within a module.

CEMiTool can also determine which biological functions are associated with the module by performing an over-representation analysis (ORA). For this command, a pathway list must be provided in the form of a GMT file:
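As an illustration of what an over-representation analysis computes, the sketch below reads gene sets from GMT-formatted text (one set per line: name, description, then gene identifiers, separated by tabs) and applies a one-sided hypergeometric test. The function names, the toy gene sets and the choice of test are assumptions made for this example and do not describe CEMiTool's internal implementation.

```python
from math import comb


def parse_gmt(text):
    """Parse GMT text: name<TAB>description<TAB>gene1<TAB>gene2... per line."""
    gene_sets = {}
    for line in text.strip().splitlines():
        fields = line.split("\t")
        name, genes = fields[0], fields[2:]
        gene_sets[name] = {gene for gene in genes if gene}
    return gene_sets


def ora_p_value(module, pathway, universe):
    """Probability of seeing at least the observed overlap by chance."""
    module, pathway = module & universe, pathway & universe
    n_universe, n_pathway, n_module = len(universe), len(pathway), len(module)
    overlap = len(module & pathway)
    # Sum the hypergeometric probabilities for overlaps >= the observed one.
    favourable = sum(
        comb(n_pathway, i) * comb(n_universe - n_pathway, n_module - i)
        for i in range(overlap, min(n_pathway, n_module) + 1)
    )
    return favourable / comb(n_universe, n_module)


# Invented toy data: 20 background genes, two pathways, one module.
GMT_TEXT = (
    "pathway_x\ttoy pathway\tg1\tg2\tg3\tg4\tg5\n"
    "pathway_y\ttoy pathway\tg11\tg12\tg13\tg14\tg15\n"
)
universe = {f"g{i}" for i in range(1, 21)}
module = {"g1", "g2", "g3", "g4", "g10"}

gene_sets = parse_gmt(GMT_TEXT)
p_values = {name: ora_p_value(module, genes, universe) for name, genes in gene_sets.items()}
for name, p in sorted(p_values.items()):
    print(f"{name}: p = {p:.4g}")
```

Four of the five module genes belong to pathway_x, so its p-value is small, whereas pathway_y shares no gene with the module. In a real analysis the p-values would also need correction for multiple testing.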


Biological functions associated with the module.

Finally, interaction data, such as protein-protein interactions, can be visualized in annotated module graphs:

Annotated graph showing interactions within a module.

Overall, CEMiTool provides the following benefits:

  • Easy-to-use package that automates the entire module discovery process, including gene filtering and functional analyses, within a single R function (cemitool)
  • Comprehensive modular analysis
  • Fully automated process

A comprehensive instruction guide for CEMiTool is provided on Bioconductor.


Russo P, Ferreira G, Bürger M, Cardozo L and Nakaya H (2017). CEMiTool: Co-expression Modules identification Tool. R package version 1.1.1.


MEDLINE queries made easy with the MEDOC tool


MEDOC (MEdline DOwnloading Contrivance) is a Python program designed to download MEDLINE data from an FTP server and to load all the extracted information into a local MySQL database, thus making MEDLINE searches easy.

MEDLINE, the biomedical data keeper

Since the MEDLINE database was released almost 50 years ago, the number of indexed publications has risen from 1 million in 1970 to 27 million this year. The aim of this repository is to facilitate access to the scientific literature for everyone.

Evolution of the number of documents on PubMed

The NIH (National Institutes of Health, USA) also provides a powerful search engine that lets users query this database through the well-known PubMed web interface. This search engine supports complex queries using logical operators (OR, AND) and indexes different text blocks (such as title and abstract) for refined searches. Moreover, several API services have been released to allow routine searches, programmatic parsing of the results, and data extraction.

However, to query these APIs (E-utilities), the user needs to write a different script for every search, which becomes time-consuming when many kinds of data requiring different parsing are needed, and must query the API many times to retrieve individual pieces of data from a single article.

To make data mining easier, the NIH now makes MEDLINE data available for download from an FTP server as XML files.

Relational database to the rescue

Although NoSQL databases have been on the rise in recent years, a local relational version of the MEDLINE database is useful for complex and frequent queries. The idea behind MEDOC was thus to build a relational schema and load the XML files into this MySQL version.


The figure above presents every step executed by the Python 3 wrapper to construct this local database. 13 tables were created to store all the data contained in the XML files extracted from the NIH FTP server (authors, chemical products, MeSH, corrections, citation subset, publication type, language, grant, data bank, personal name subject, other ID and investigator).
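The general pattern of turning XML records into relational rows can be sketched as follows. This is only an illustration, not MEDOC's actual code: it uses Python's built-in sqlite3 module as a stand-in for MySQL, a two-table schema (citation and author) invented for the example, and a tiny hand-written XML fragment that mimics the MEDLINE element names (MedlineCitation, PMID, ArticleTitle, Author).

```python
import sqlite3
import xml.etree.ElementTree as ET

# Hand-written fragment imitating the structure of a MEDLINE XML file.
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <ArticleTitle>Toy article about antioxidants</ArticleTitle>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
          <Author><LastName>Roe</LastName><ForeName>Richard</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""

# Hypothetical schema: one row per citation, one row per (citation, author).
SCHEMA = """
CREATE TABLE citation (pmid INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE author (
    pmid INTEGER REFERENCES citation(pmid),
    last_name TEXT,
    fore_name TEXT
);
"""


def load_articles(xml_text, connection):
    """Insert every citation and its authors from the XML text."""
    root = ET.fromstring(xml_text.strip())
    for citation in root.iter("MedlineCitation"):
        pmid = int(citation.findtext("PMID"))
        title = citation.findtext("Article/ArticleTitle")
        connection.execute(
            "INSERT INTO citation (pmid, title) VALUES (?, ?)", (pmid, title)
        )
        for author in citation.iterfind("Article/AuthorList/Author"):
            connection.execute(
                "INSERT INTO author (pmid, last_name, fore_name) VALUES (?, ?, ?)",
                (pmid, author.findtext("LastName"), author.findtext("ForeName")),
            )
    connection.commit()


connection = sqlite3.connect(":memory:")
connection.executescript(SCHEMA)
load_articles(SAMPLE_XML, connection)
print(connection.execute("SELECT * FROM author").fetchall())
```

Storing authors in their own table, keyed by PMID, is what later makes it possible to join and filter on any field with plain SQL.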

Example of request

It took MEDOC 113 hours (4 days and 17 hours) to load the 1174 files on the FTP server into the MySQL database, which occupies 61.3 GB of disk space.

Querying this version is almost instantaneous, even when joining several tables. In the example provided below, the 10 most recent publications about antioxidants indexed on PubMed were retrieved with SQL queries.


The following result was returned in 0.022 seconds:

Result obtained with the example query
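For readers who want to try this kind of request, here is a hedged sketch of a join query that retrieves the 10 most recent publications tagged with a given term. It runs against an in-memory SQLite database filled with invented rows. The table and column names (citation, mesh_heading, date_created, descriptor) are assumptions chosen for the example and may differ from the schema MEDOC actually builds.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
# Hypothetical two-table schema with an index on the searched column.
connection.executescript("""
    CREATE TABLE citation (pmid INTEGER PRIMARY KEY, title TEXT, date_created TEXT);
    CREATE TABLE mesh_heading (pmid INTEGER REFERENCES citation(pmid), descriptor TEXT);
    CREATE INDEX idx_mesh_descriptor ON mesh_heading (descriptor);
""")

# Invented rows: 15 toy articles, every fifth one tagged with another term.
for pmid in range(1, 16):
    connection.execute(
        "INSERT INTO citation VALUES (?, ?, ?)",
        (pmid, f"Toy article {pmid}", f"2017-01-{pmid:02d}"),
    )
    descriptor = "Antioxidants" if pmid % 5 else "Neoplasms"
    connection.execute("INSERT INTO mesh_heading VALUES (?, ?)", (pmid, descriptor))
connection.commit()

QUERY = """
    SELECT c.pmid, c.title, c.date_created
    FROM citation AS c
    JOIN mesh_heading AS m ON m.pmid = c.pmid
    WHERE m.descriptor = ?
    ORDER BY c.date_created DESC
    LIMIT 10
"""

rows = connection.execute(QUERY, ("Antioxidants",)).fetchall()
for pmid, title, date_created in rows:
    print(pmid, title, date_created)
```

The index on the searched column is what keeps such lookups fast as the tables grow; the placeholder (?) keeps the search term out of the SQL string itself.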

In summary, this indexed relational database allows the user to build complex and rapid queries. All fields can thus be searched for the desired information, a task that is difficult to accomplish through the PubMed graphical interface. MEDOC is free and publicly available on GitHub.

Your top 3 gene clustering software tools


Clustering is a fundamental step in the analysis of biological and omics data. It is used to construct groups of objects (genes, proteins) with related function, expression patterns, or known to interact together. In microarrays or RNA-seq experiments, gene clustering is often associated with heatmap representation for data visualization.

Choosing the right clustering tool for your analysis

Many clustering methods and algorithms have been developed and are classified into partitioning (k-means), hierarchical (connectivity-based), density-based, model-based and graph-based approaches.
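To give a feel for the partitioning family, here is a minimal k-means sketch (Lloyd's algorithm) written in plain Python. It is a teaching example under simplifying assumptions: the first k points serve as initial centroids, distances are Euclidean, and the expression profiles are invented. Real analyses would normally rely on an established library and several random restarts.

```python
def nearest(point, centroids):
    """Index of the centroid closest to the point (squared Euclidean distance)."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances))


def kmeans(points, k, max_iterations=50):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(p) for p in points[:k]]  # simplistic initialisation
    labels = []
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [nearest(point, centroids) for point in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for index in range(k):
            members = [p for p, label in zip(points, labels) if label == index]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[index])  # keep an empty cluster in place
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


# Invented toy data: six genes measured in two conditions.
profiles = [(1.0, 1.2), (5.0, 5.1), (0.9, 1.0), (5.2, 4.9), (1.1, 0.8), (4.8, 5.0)]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

On this toy data the lowly and highly expressed genes separate into two groups. The number of clusters k has to be chosen in advance, which is one practical difference from hierarchical approaches.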

To help you choose between all the existing clustering tools, we asked OMICtools members to vote for their favorite software. Here are the top 3 tools, chosen by 23 voters.

First place for ClustEval

ClustEval is a web-based clustering analysis platform developed at the Max Planck Institute for Informatics and the University of Southern Denmark. It is designed to objectively compare the performance of various clustering methods on different datasets.

More precisely, ClustEval has compared the performance of 18 of the most widely used clustering methods on 24 different datasets. These datasets include gene expression data, protein sequence similarity, protein structure similarity, social networks, word sense disambiguation, etc. The performance of a clustering method is then evaluated by an F1-score (the harmonic mean of precision and recall).
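As a worked illustration of the F1-score, the sketch below scores a clustering against a reference grouping by counting pairs of objects. A pair is a true positive when both groupings put the two objects together. This pair-counting formulation is one common way to apply precision and recall to clusterings; it is an assumption made for the example, and the exact definition used by ClustEval may differ.

```python
from itertools import combinations


def pairwise_f1(predicted, reference):
    """F1-score over object pairs: harmonic mean of pair precision and recall."""
    true_pos = false_pos = false_neg = 0
    for i, j in combinations(range(len(predicted)), 2):
        same_predicted = predicted[i] == predicted[j]
        same_reference = reference[i] == reference[j]
        if same_predicted and same_reference:
            true_pos += 1
        elif same_predicted:
            false_pos += 1  # grouped together by the method only
        elif same_reference:
            false_neg += 1  # grouped together in the reference only
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return 2 * precision * recall / (precision + recall)


# Invented toy labels for six objects.
reference = [0, 0, 0, 1, 1, 1]
predicted = [0, 0, 1, 1, 1, 1]

score = pairwise_f1(predicted, reference)
print(f"F1 = {score:.3f}")
```

Here one object is placed in the wrong cluster, giving a pair precision of 4/7 and a pair recall of 4/6. Because the harmonic mean is dominated by the smaller of the two values, a method cannot reach a high F1 by maximising only precision or only recall.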

Finally, ClustEval can be downloaded and installed by users to perform their own clustering analysis comparisons, as a VirtualBox image, with Docker and Docker Compose, or as an R package.

Performance of all clustering tools on all nonartificial data sets on the basis of F1 scores.

Second position for Babelomics

Babelomics is a web application developed by the Computational Genomics Department of the Príncipe Felipe Research Center in Valencia. It performs a wide range of functional analyses of gene expression and genomic data, from processing to expression analysis and gene set enrichment.

In its current version, Babelomics 5, the website offers a user-friendly and intuitive interface for clustering microarray or RNA-seq data using one of three methods: UPGMA, SOTA, and k-means. The result can then be visualized as a heatmap. Example datasets and analyses are provided for every functionality of the application, and tutorials are available.

Babelomics clustering tool.

Third place for AltAnalyze

AltAnalyze is a comprehensive application for the analysis of single-cell and bulk RNA-seq data that can automate every step of gene expression and splicing analysis, including clustering and heatmap representation. It was developed in the Nathan Salomonis laboratory at Cincinnati Children’s Hospital Medical Center and the University of Cincinnati.

AltAnalyze offers many options for clustering algorithms and normalization, as well as unique features such as finding optimized clusters for single-cell analysis.

AltAnalyze can be downloaded and run on all operating systems, and comes with useful documentation (tutorials, blog, FAQ).

Heatmap and clustering generated with AltAnalyze


(Wiwie et al., 2015) Comparing the performance of biomedical clustering methods. Nature Methods.

(Alonso et al., 2015) Babelomics 5.0: functional interpretation for new generations of genomic data. Nucleic Acids Research.

(Emig et al., 2010) AltAnalyze and DomainGraph: analyzing and visualizing exon expression data. Nucleic Acids Research.

(Olson et al., 2016) Single-cell analysis of mixed-lineage states leading to a binary cell fate choice. Nature.