MEDOC (MEdline DOwnloading Contrivance) is a Python program designed to download MEDLINE data from an FTP server and to load all extracted information into a local MySQL database, thus making MEDLINE searches easy.
MEDLINE, the biomedical data keeper
Since the MEDLINE database was released almost 50 years ago, the number of indexed publications has risen from 1 million in 1970 to 27 million this year. The aim of this repository is to make the scientific literature accessible to everyone.
The NIH (National Institutes of Health, USA) also provides a powerful search engine for querying this database through the well-known PubMed web interface. This search engine supports complex queries using logical operators (OR, AND) and indexes different text blocks (such as title and abstract) for refined searches. Moreover, several API services have been released to allow routine searches, programmatic parsing of the results, and data extraction.
However, to query these APIs (E-utilities), the user needs to write a different script for every search, which becomes time-consuming when many kinds of data requiring different parsing are needed, and to query the API many times to retrieve individual data items from a single article.
To make data mining easier, the NIH now allows MEDLINE data to be downloaded from an FTP server containing XML-tagged files.
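To illustrate why the API route takes several calls, here is a minimal sketch that only builds the two request URLs usually involved: one ESearch call to obtain PMIDs, then one EFetch call to obtain the records. The helper names `esearch_url` and `efetch_url`, the search term, and the placeholder PMIDs are assumptions made for this example. No network request is sent.

```python
from urllib.parse import urlencode

# Public base URL of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, retmax=10):
    """Build an ESearch URL that returns the PMIDs matching a query."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(pmids):
    """Build an EFetch URL that returns full records for the given PMIDs."""
    params = {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Step 1: search. Step 2: fetch the records for the PMIDs found in step 1.
search = esearch_url("antioxidants[Title] AND review[Publication Type]")
fetch = efetch_url(["12345", "67890"])  # placeholder PMIDs, for illustration only
print(search)
print(fetch)
```

Each new kind of search needs its own term, its own follow-up fetch, and its own parsing of the returned XML, which is the overhead described above.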
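As a rough idea of what parsing such files involves, the sketch below reads a tiny hand-written fragment that follows the MEDLINE XML layout (`PubmedArticle`, `MedlineCitation`, `PMID`, `ArticleTitle`, `AuthorList`). The sample record and the `parse_articles` helper are invented for illustration and are not taken from MEDOC's code.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the MEDLINE XML layout (not real data).
SAMPLE_XML = """
<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>1</PMID>
      <Article>
        <ArticleTitle>An illustrative title</ArticleTitle>
        <AuthorList>
          <Author><LastName>Doe</LastName><ForeName>Jane</ForeName></Author>
          <Author><LastName>Roe</LastName><ForeName>Richard</ForeName></Author>
        </AuthorList>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>
"""


def parse_articles(xml_text):
    """Return one dict per PubmedArticle with its PMID, title and authors."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for article in root.iter("PubmedArticle"):
        citation = article.find("MedlineCitation")
        authors = [
            (a.findtext("LastName"), a.findtext("ForeName"))
            for a in citation.iterfind("Article/AuthorList/Author")
        ]
        records.append({
            "pmid": citation.findtext("PMID"),
            "title": citation.findtext("Article/ArticleTitle"),
            "authors": authors,
        })
    return records


records = parse_articles(SAMPLE_XML)
print(records)
```

Real MEDLINE files are large, so a production loader would more likely stream them (for example with `ET.iterparse`) than read each file into memory at once.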
Relational database to the rescue
Even though NoSQL databases have been on the rise in recent years, a local, relational version of the MEDLINE database is useful for complex and frequent queries. The idea behind MEDOC was thus to build a relational schema and load the XML files into this MySQL version.
The figure above presents every step executed by the Python 3 wrapper to construct this local database. Thirteen tables were created to store all the data contained in the XML files extracted from the NIH FTP server (authors, chemical products, MeSH, corrections, citation subset, publication type, language, grant, data bank, personal name subject, other ID and investigator).
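The general idea of spreading one article across related tables can be sketched as follows. This is only an assumption-laden illustration: it uses Python's built-in `sqlite3` module in place of MySQL so that it runs without a server, and the table names (`citation`, `author`), column names, and the `load_record` helper are hypothetical, not MEDOC's actual schema.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE citation (
    pmid  INTEGER PRIMARY KEY,
    title TEXT
);
CREATE TABLE author (
    id        INTEGER PRIMARY KEY AUTOINCREMENT,
    pmid      INTEGER NOT NULL REFERENCES citation(pmid),
    last_name TEXT,
    fore_name TEXT
);
-- Index the foreign key so joins on pmid stay fast.
CREATE INDEX idx_author_pmid ON author(pmid);
""")


def load_record(conn, record):
    """Insert one parsed article and its authors in a single transaction."""
    with conn:
        conn.execute(
            "INSERT INTO citation (pmid, title) VALUES (?, ?)",
            (record["pmid"], record["title"]),
        )
        conn.executemany(
            "INSERT INTO author (pmid, last_name, fore_name) VALUES (?, ?, ?)",
            [(record["pmid"], last, fore) for last, fore in record["authors"]],
        )


# Hypothetical parsed record, for illustration only.
load_record(conn, {
    "pmid": 1,
    "title": "An illustrative title",
    "authors": [("Doe", "Jane"), ("Roe", "Richard")],
})
author_count = conn.execute(
    "SELECT COUNT(*) FROM author WHERE pmid = 1"
).fetchone()[0]
print(author_count)
```

Keeping repeated elements such as authors in their own table, linked by PMID, is what lets a single SQL join answer questions that would otherwise need several API calls.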
Example of request
It took 113 hours (4 days and 17 hours) for MEDOC to load the 1174 files from the FTP server into the MySQL database (representing 61.3 GB of disk space).
Querying this local version is almost instantaneous, even when joining several tables. In the example provided below, the last 10 publications about antioxidants indexed on PubMed were retrieved with SQL queries.
The following result was returned in 0.022 seconds:
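A query of this kind might look like the sketch below. It is not the authors' query: the tables (`citation`, `mesh_heading`), their columns, and the generated rows are assumptions made for illustration, and `sqlite3` again stands in for MySQL so the example runs anywhere.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE citation (pmid INTEGER PRIMARY KEY, title TEXT, date_created TEXT);
CREATE TABLE mesh_heading (pmid INTEGER, descriptor_name TEXT);
""")

# Generated placeholder rows: 30 citations, one per day of January 2018.
conn.executemany(
    "INSERT INTO citation VALUES (?, ?, ?)",
    [(i, f"Illustrative title {i}", f"2018-01-{i:02d}") for i in range(1, 31)],
)
# Every citation is tagged "Humans"; odd PMIDs are also tagged "Antioxidants".
conn.executemany(
    "INSERT INTO mesh_heading VALUES (?, ?)",
    [(i, "Humans") for i in range(1, 31)]
    + [(i, "Antioxidants") for i in range(1, 31, 2)],
)

# Join citations to their MeSH headings and keep the 10 most recent matches.
LATEST_BY_MESH = """
SELECT c.pmid, c.title, c.date_created
FROM citation AS c
JOIN mesh_heading AS m ON m.pmid = c.pmid
WHERE m.descriptor_name = ?
ORDER BY c.date_created DESC
LIMIT 10
"""
rows = conn.execute(LATEST_BY_MESH, ("Antioxidants",)).fetchall()
for row in rows:
    print(row)
```

With a MySQL driver the statement would be essentially the same, although the common Python MySQL drivers use `%s` rather than `?` as the parameter placeholder. The timing reported for MEDOC refers to the authors' MySQL database, not to this sketch.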
In summary, this indexed relational database allows the user to build complex queries that run quickly. All fields can thus be searched for the desired information, a task that is difficult to accomplish through the PubMed graphical interface. MEDOC is free and publicly available on GitHub.