Evaluating biomedical data production with text mining


Estimating biomedical data

Evaluating the impact of a scientific study is a difficult and controversial task. Recognition of the value of a biomedical study is widely measured by traditional bibliographic metrics such as the number of citations of the paper and the impact factor of the journal.

However, a more relevant success criterion for a research study likely lies in the production of biological data itself, both in terms of quality and in how these datasets can be reused to validate (or reject!) hypotheses and support new research projects. Although biological data can be deposited in specific repositories such as the GEO database, ImmPort, ENA, etc., most data are primarily disseminated within the text, figures and tables of articles. This raises the question: how can we find and measure the production of biomedical data diffused in scientific publications?

To address this issue, Gabriel Rosenfeld and Dawei Lin developed a novel text-mining strategy that identifies articles producing biological data. They published their method, “Estimating the scale of biomedical data generation using text mining”, this month on BioRxiv.

Text mining analysis of biomedical research articles

Using the Global Vectors for Word Representation (GloVe) algorithm, the authors identified term usage signatures for 5 types of biomedical data: flow cytometry, immunoassays, genomic microarray, microscopy, and high-throughput sequencing.
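To give a feel for what a term usage signature could look like, here is a minimal sketch. It is not the authors' pipeline: the word vectors are tiny hand-written toy values standing in for trained GloVe vectors, and the names `VECTORS`, `build_signature`, `signature_hits` and the 0.95 similarity cut-off are all assumptions made for illustration. The idea is simply that terms whose vectors lie close to a seed term form a signature, and that an article can then be scanned for those terms.

```python
import math

# Toy 3-dimensional word vectors (hypothetical values). Real GloVe vectors
# are learned from a corpus and usually have 50 to 300 dimensions.
VECTORS = {
    "cytometry": [0.9, 0.1, 0.0],
    "facs": [0.8, 0.2, 0.1],
    "gating": [0.85, 0.15, 0.05],
    "sequencing": [0.1, 0.9, 0.1],
    "reads": [0.05, 0.95, 0.2],
    "patient": [0.3, 0.3, 0.9],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def build_signature(seed, vectors, min_similarity=0.95):
    """Collect the terms whose vectors are close to the seed term's vector."""
    seed_vec = vectors[seed]
    return {
        term
        for term, vec in vectors.items()
        if cosine(seed_vec, vec) >= min_similarity
    }


def signature_hits(text, signature):
    """Count tokens in the text that belong to the signature."""
    tokens = (token.strip(".,;:()") for token in text.lower().split())
    return sum(1 for token in tokens if token in signature)


flow_signature = build_signature("cytometry", VECTORS)
hits = signature_hits("Cells were analysed by FACS after gating.", flow_signature)
```

With these toy vectors, the flow cytometry signature groups "cytometry", "facs" and "gating", and the example sentence contains two signature terms. How many hits are needed before an article counts as producing a data type is a separate threshold that this sketch leaves open.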

They then analyzed the free text of 129,918 PLOS articles published between 2013 and 2016. What they found was that nearly half of them (59,543) generated 1 or more of the 5 data types tested, producing 81,407 data sets.
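The figures above are easy to put in proportion. The short calculation below uses only the three counts quoted in this post; the variable names are ours, and reading the ratio of data sets to articles as an average per data-producing article is our interpretation.

```python
# Counts quoted in the post.
total_articles = 129_918
data_articles = 59_543
data_sets = 81_407

# Share of analyzed articles that generated at least one data type.
share = data_articles / total_articles

# Average number of data sets per data-producing article.
per_article = data_sets / data_articles

print(f"{share:.1%} of articles, {per_article:.2f} data sets per data-producing article")
```

This works out to roughly 45.8% of the analyzed articles, with about 1.37 data sets per data-producing article on average.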


Estimating PLOS articles generating each biomedical data type over time (from “Estimating the scale of biomedical data generation using text mining”, BioRxiv).

This text-mining method was tested on manually annotated articles and provided a valuable balance of precision and recall. The obvious, and exciting, next step is to apply this approach to evaluate the amount and types of data generated within the entire PubMed repository of articles.
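For readers less familiar with these two measures, the sketch below shows how precision and recall can be computed when a method's predictions are compared with manual annotations. The article identifiers and the function name `precision_recall` are made up for the example, and the numbers are not results from the study.

```python
def precision_recall(predicted, annotated):
    """Compare predicted positives with manually annotated positives.

    precision: fraction of predicted articles that are truly positive
    recall:    fraction of truly positive articles that were predicted
    """
    true_positives = len(predicted & annotated)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(annotated) if annotated else 0.0
    return precision, recall


# Hypothetical article IDs flagged by text mining as containing microscopy data.
predicted = {"a1", "a2", "a3", "a4"}
# Hypothetical article IDs that curators annotated as containing microscopy data.
annotated = {"a2", "a3", "a4", "a5", "a6"}

precision, recall = precision_recall(predicted, annotated)
```

Here three of the four flagged articles are correct (precision 0.75) and three of the five annotated articles are found (recall 0.6). Tightening the detection threshold usually raises precision at the cost of recall, which is why a balance between the two matters.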



A step beyond data dissemination

Evaluating the exponentially growing amount and diversity of datasets is currently a key aspect of determining the quality of a biomedical study. However, in today’s era of bioinformatics, fully exploiting the data means going a step beyond the publication and dissemination of datasets and tools, towards the critical goal of improving data reproducibility and transparency (data provenance, collection, transformation, computational analysis methods, etc.).

Open-access and community-driven projects such as the online bioinformatics tools platform OMICtools provide access not only to a large number of repositories to locate valuable datasets, but also to the best software tools for re-analyzing and exploiting the full potential of these datasets.

In a virtuous circle of discovery, previously generated datasets could be repurposed for new data production, interactive visualization, machine learning and artificial intelligence enhancement, allowing us to answer new biomedical questions.

Upgrade your search experience


High-throughput technologies are generating increasingly vast amounts of biological data. Thousands of new bioinformatics tools have been designed for their management and analysis. Finding the appropriate program for a specific need is typically a challenging and time-consuming task.

Find the right answer for your biological data analysis

OMICtools addresses this challenge and empowers you to locate and access information for more than 17,000 software tools and databases in a single place. Manually curated and updated, OMICtools resources are methodically organized in a didactic hierarchical classification. The number of available tools is constantly growing. From more than 1,100 categories, you can easily retrieve a relevant list of tools which precisely relate to a specific step of data analysis for the biotechnology you are using. To better meet the needs of our users, OMICtools launched a powerful customized search engine in December 2016.

Now you can directly type in your biological question and get the computational answer.

OMICtools has an interactive query interface offering a user-friendly method to find the tools you need, whatever your level of expertise in bioinformatics. The OMICtools search engine has been optimized for speed, precision and recall. It rapidly identifies a list of relevant tools matching your query, no matter whether you use precise vocabulary from a specific field or a common expression.

You can search the OMICtools website the same way you search on Google – type a few keywords, and let the search engine handle the rest. If you want to narrow down your results, just add a word or two to your search terms. What’s more, once you find a tool that interests you, we systematically propose related tools that are potentially of interest to perform your data analysis. OMICtools seeks to understand query intent and deliver our users the best-adapted results to answer their question quickly. The new search engine responds to this need, reducing the inherent ambiguity of scientific language and more precisely articulating information-seeking intent.

OMICtools enhances its search retrieval quality

We have developed a tool ranking system to facilitate your navigation of potential tools for your analyses. When you submit a request, the resulting list of tools is organized according to this ranking. Our methodology to calculate a “tool score” is based not only on the query-term proximity but also includes several practical and social metrics.

Tool ranking weights are attributed to several parameters, with the tool’s domains of application being the primary parameter, as per its classification in OMICtools. Parameters relating to availability and quality of the tool’s website are also key; these include functional links, documentation, support, tutorials, maintenance, etc. Our ranking calculation also takes into account the number and quality of associated publications. As a community search platform, we promote social interactions with bioinformatics tools, so OMICtools users’ ratings, reviews and comments also impact ranking.
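One simple way to picture such a ranking is a weighted sum of normalized parameters. The sketch below is only an illustration: the actual OMICtools parameters, weights and formula are not described here, so the field names, the weight values and the two example tools are all hypothetical. The only detail taken from the text is that the domain of application carries the largest weight.

```python
from dataclasses import dataclass


@dataclass
class Tool:
    name: str
    domain_match: float     # fit with the queried domain of application, 0 to 1
    website_quality: float  # links, documentation, support, maintenance, 0 to 1
    publications: float     # number and quality of associated publications, 0 to 1
    user_rating: float      # community ratings, reviews and comments, 0 to 1


# Hypothetical weights; the domain of application is the primary parameter.
WEIGHTS = {
    "domain_match": 0.4,
    "website_quality": 0.3,
    "publications": 0.2,
    "user_rating": 0.1,
}


def tool_score(tool):
    """Weighted sum of the tool's normalized parameters."""
    return sum(weight * getattr(tool, field) for field, weight in WEIGHTS.items())


def rank(tools):
    """Return tools sorted from highest to lowest score."""
    return sorted(tools, key=tool_score, reverse=True)


older = Tool("older_tool", domain_match=0.6, website_quality=0.3,
             publications=0.9, user_rating=0.5)
newer = Tool("newer_tool", domain_match=0.9, website_quality=0.9,
             publications=0.3, user_rating=0.7)

ranked_names = [tool.name for tool in rank([older, newer])]
```

In this toy example the newer, well-maintained tool outranks the older, heavily published one, because publications are only one of several weighted parameters.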

Filters will soon be available to make your search more precise. You will have the option to select tools according to your level of computer skills, the type of biotechnology, your operating system, resource type, software type, programming languages, stability, interface, restrictions for use, license, parallelization and database management system.

All too often, the most used tools are the first ones developed rather than the best ones available, so our ranking strategy has been designed to emphasize tool quality, allowing new, effective and promising tools to be highlighted. And as you can guess, the better known a tool is, the more likely it is to be retrieved. With our ranking system, relevance, usability, and popularity come together!