Snakemake for dummies (or how to create a pipeline easily?)


Written by Raoul Raffel from Bioinfo-fr.net, translated by Sarah Mackenzie.

If you haven’t already heard about this tool, you have certainly not read the article Formalising your protocols with Snakemake by Louise-Amélie Schmitt (perhaps not surprising if you don’t understand French!). So, what are the advantages of rewriting your ready-to-go pipelines as Snakefiles?

The answer? Code readability, resource management, and reproducibility.

Once you are ready to publish, you will have to explain to your future readers exactly how you obtained your data, so that other bioinformaticians can take your raw data and reproduce the same results. This is a critical aspect of bioinformatics – indeed of all scientific research: reproducibility. Which tools were used, which versions, which parameters, etc. – every single one of even the tiniest steps that allowed you to obtain your data. It is imperative that any scientific article reporting key results obtained using bioinformatics is accompanied by a pipeline/script allowing identical results to be reproduced.

The last few years have seen a massive, almost exponential, increase in the volume of data being produced. In parallel, the resources available (such as computer clusters) have been continuously growing in size (number of CPUs) and in computing power (CPU speed), keeping bioinformatics tools ahead in the race. To exploit these resources optimally, some tasks need to be performed in parallel; many bioinformatics tools have therefore developed options allowing several CPUs to be used simultaneously (e.g. bwa, STAR, etc.), while others were not designed for a multi-CPU environment (awk, macs2, bedtools, etc.). In the latter case, you either have to parallelize the tasks manually (by launching several tasks in the background, with an ‘&’ at the end of the command), or queue the operations and under-exploit the machine you are using (sequential task management). An upcoming article will explain how you can – very easily – use Snakemake to take parallelization a step further.

The creator of Snakemake, Johannes Köster [1], figured out how to combine the advantages of Python (clear syntax) and GNU Make (parallel execution, resource management, system of rules). With the addition of numerous functionalities, such as version/parameter tracking, creation of HTML reports, and visualization of tasks and their dependencies as a graph (DAG), this language has everything it needs for a promising future.

The general principles of Snakemake

As with GNU Make, Snakemake works on the principle of rules. A rule is a group of elements that allows a file to be created, and can be considered as one step of the pipeline. Rules are written in a Snakefile (the equivalent of a Makefile).

Each rule has at least one output file and one (or several) input file(s) – I deliberately mention the output first because that is Snakemake’s standpoint: it starts from the end in order to get back to the start.

The first rule of a Snakefile defines the files that you want to have at the end of your data processing (the target files). The order (and the choice) of rules is established automatically from the target file/folder name(s): Snakemake climbs back up the dependency chain from one rule to the next until it finds a rule whose input is more recent than its output. That rule, and all those that follow it, will then be executed.

Snakemake functions using file names

The easiest approach is to start with an example:

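A minimal Snakefile for this example could look as follows (a sketch reconstructed from the description below; the exact gzip command is an assumption):

rule all:
    input:
        "data/raw/s1.fq.gz",
        "data/raw/s2.fq.gz"

# generic compression rule: builds any .fq.gz from the matching .fq
rule gzip:
    input:
        "{name}.fq"
    output:
        "{name}.fq.gz"
    shell:
        "gzip -c {input} > {output}"

The first rule (all) lists the target files; the gzip rule then describes how to build any .fq.gz file from the corresponding .fq.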

As you can probably guess, if data/raw/s1.fq and data/raw/s2.fq are more recent than data/raw/s1.fq.gz and data/raw/s2.fq.gz, the gzip rule will create/replace the targets. Furthermore, we can execute the operations in parallel by passing the number of processors to use as a parameter (option -j N, --jobs=N).
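
For example, assuming the sketch above, both files can be compressed in parallel on two cores with:

snakemake -j 2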

Each rule can have several keywords; these keywords are what Snakemake uses to define the rule.

Input and output

  • input = the file(s) used to create the output
  • output = the file(s) generated by the rule

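For instance, a mapping rule built around two wildcards might be sketched as follows (the bwa/samtools command line and the file layout are illustrative assumptions):

rule mapping:
    input:
        fq = "data/raw/{sample}.fq",
        ref = "genome/{genome}.fa"
    output:
        # the regex after the comma restricts what {genome} may match
        "mapping/{genome,hg[0-9]+}/{sample}.sort.bam"
    shell:
        "bwa mem {input.ref} {input.fq} | samtools sort -o {output} -"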

  • wildcards = variables, defined with {} as part of the name/path of the output files, which allow the rules to be generalized. In the example above we use two wildcards (genome and sample); we can also use a regular expression to define precisely what a wildcard may match, such as "hg[0-9]+" for genome.

Using the wildcards object (by section)

  • log: "mapping/{genome}/{sample}.sort.log"
  • shell/run/message: "mapping/{wildcards.genome}/{wildcards.sample}.sort.log"
  • params: param1 = lambda wildcards: wildcards.genome + '.fa'

Explanation: in the input/output/log sections we can use the variable name directly, because the wildcards object is only known inside shell/run/message; there, we need to switch to "wildcards.variable". Outside shell/run/message (e.g. in params), we can use "lambda wildcards: wildcards.variable": the lambda function allows the wildcards object to be used directly in these sections.
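
Putting the three cases together in one rule (a sketch; the samtools command is an illustrative assumption):

rule sort:
    input:
        "mapping/{genome}/{sample}.bam"
    output:
        "mapping/{genome}/{sample}.sort.bam"
    log:
        "mapping/{genome}/{sample}.sort.log"                      # variable names used directly
    params:
        ref = lambda wildcards: wildcards.genome + '.fa'          # lambda gives access to wildcards
    message:
        "Sorting {wildcards.sample} (genome {wildcards.genome})"  # wildcards.<name> here
    shell:
        "samtools sort -o {output} {input} 2> {log}"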

Processing the created files

  • temp = temporary file, deleted once it is no longer needed.
  • protected = file that will be write-protected after its creation (both are illustrated in the sketch below).
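
In practice these are functions wrapped around the file names in the output section, as in this sketch (the samtools commands are illustrative assumptions):

rule sam_to_bam:
    input:
        "mapping/{sample}.sam"
    output:
        bam = temp("mapping/{sample}.bam"),            # deleted once no rule needs it
        sort = protected("mapping/{sample}.sort.bam")  # write-protected after creation
    shell:
        "samtools view -b {input} > {output.bam} && "
        "samtools sort -o {output.sort} {output.bam}"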

Commands to generate the output file(s)

  • shell = use a UNIX command to create the output file(s).
  • run = use Python code to build the output file(s).

note: you will have to choose between run and shell
note 2: within run you can use the function shell()
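
For example, a run section can mix Python code with shell() calls (a sketch; the read-counting logic is an illustrative assumption):

rule count_reads:
    input:
        "data/raw/{sample}.fq"
    output:
        "stats/{sample}.count.txt"
    run:
        # pure Python: a FASTQ record spans 4 lines
        n = sum(1 for line in open(input[0])) // 4
        with open(output[0], "w") as out:
            out.write("{}\t{}\n".format(wildcards.sample, n))
        # and a UNIX command launched from within run
        shell("cat {output}")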

Other rule parameters

  • threads = maximum number of processors that can be used
  • message = message displayed when the rule is executed
  • log = log file
  • priority = allows rules to be prioritized (e.g. when several rules use the same input)
  • params* = parameters of the commands used
  • version* = version of the rule (or of the command used)

* These two keywords allow the parameters and the version of the rule to be associated with the output files. Output files can thus be (re)generated if a parameter or the version of the rule has been modified. Combined with code tracking, this lets you follow the evolution of the pipeline and of the files it generates.
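
A rule using most of these keywords might be sketched as follows (the bwa/samtools command line and the chosen values are illustrative assumptions):

rule mapping:
    input:
        fq = "data/raw/{sample}.fq",
        ref = "genome/{genome}.fa"
    output:
        protected("mapping/{genome}/{sample}.sort.bam")
    log:
        "mapping/{genome}/{sample}.log"
    threads: 8                  # upper bound, reduced if fewer cores are available
    priority: 50
    version: "1.0"              # changing it marks the outputs for regeneration
    params:
        quality = "-q 20"       # tracked with the outputs, like the version
    message:
        "Mapping {wildcards.sample} on {wildcards.genome}"
    shell:
        "bwa mem -t {threads} {input.ref} {input.fq} 2> {log} "
        "| samtools view {params.quality} -b - "
        "| samtools sort -o {output} -"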

A concrete example

We are going to create a genomic data analysis pipeline (e.g. ChIP-seq data analysis) which will align/map the short sequences produced by high-throughput sequencing (reads) onto a reference genome.

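A condensed skeleton of such a pipeline could look like this (the tool choices, bwa and macs2, and the file layout are assumptions for illustration):

SAMPLES = ["s1", "s2"]
GENOME = "hg38"

rule all:
    input:
        expand("peaks/{genome}/{sample}_peaks.narrowPeak",
               genome=GENOME, sample=SAMPLES)

rule mapping:
    input:
        fq = "data/raw/{sample}.fq",
        ref = "genome/{genome}.fa"
    output:
        "mapping/{genome}/{sample}.sort.bam"
    log:
        "mapping/{genome}/{sample}.log"
    threads: 8
    shell:
        "bwa mem -t {threads} {input.ref} {input.fq} 2> {log} "
        "| samtools sort -o {output} -"

rule peak_calling:
    input:
        "mapping/{genome}/{sample}.sort.bam"
    output:
        "peaks/{genome}/{sample}_peaks.narrowPeak"
    shell:
        "macs2 callpeak -t {input} -n {wildcards.sample} "
        "--outdir peaks/{wildcards.genome}"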

At approximately 130 lines, the full version of this code is high-performance and ready to use on a PC, a server or a computer cluster. From the Snakefile, we can also generate a graph representing the dependencies between the rules, with the command:

snakemake --rulegraph | dot | display

Graph showing the dependencies between rules


We can also generate a more complete graph, which takes the wildcards into account, using the following command:

snakemake --dag | dot | display

Graph representing the complete pipeline


Based on the recent paper:

[1] Köster J, Rahmann S (2012). Snakemake – a scalable bioinformatics workflow engine. Bioinformatics, 28(19), 2520–2522.

Job opportunities in scientific curation


omicX is an innovative and dynamic French bioinformatics start-up located in Rouen. We have developed the OMICtools website, a cutting-edge platform with a global reputation, used by the scientific community worldwide. The website is designed to help researchers and clinicians identify the optimal bioinformatics tools for their biological analyses. It is based on a collaborative community model, classifying nearly 20,000 bioinformatics software tools and databases.

We are actively seeking a biocurator to join our team of data scientists, data librarians, bioinformaticians and curators to help develop the OMICtools website. Currently in a critical growth phase, we are looking for independent, motivated collaborators with a strong understanding of current bioinformatics tools, ready to join us in investing in the future of our project.

Your role as a biocurator

  • Develop the scientific content to enrich the database used by the scientific community.
  • Participate in the enrichment of a corpus of bioinformatics pipelines, extracted by manual curation from the scientific literature, to permit the analysis of biological data derived from sequencing, microarrays and/or proteomics.
  • Monitor the scientific literature.
  • Participate in team meetings.

Your profile

  • Strong interest in following the latest scientific publications and developments.
  • Experience in creating or developing bioinformatics pipelines and biological data analysis.
  • Interest in sequencing, microarray and/or proteomics technologies.
  • Willingness to interact as part of a team as well as to work independently.
  • Good communication, interpersonal and organizational skills.
  • Rigorous, with close attention to detail.
  • Excellent level of English (written and spoken) and a good level of French (working language).

Education

MSc/PhD (French “Bac+5/+8”) in biology and/or bioinformatics (a double-degree is a plus)

Working conditions: permanent full-time contract (French “Contrat à Durée Indéterminée”)

Remuneration

30–40 K€ gross annual salary, according to your level of experience. The package includes health insurance and educational benefits.

Location

Seine Innopolis, Le Petit Quevilly, Rouen, France

Our work environment has an impressive concentration of technical and scientific bioinformatics expertise and an informal culture.

Interested in joining our young, dynamic and passionate team? Contact us now! Please send your CV with a cover letter expressing your interest in the position to recrutement@omictools.com.

Community members: your publication on arXiv.org!

Community experts, we’re pleased to inform you that you’re co-authors of the article titled OMICtools: a community-driven search engine for biological data analysis. The preprint version of this article has been assigned the permanent arXiv identifier 1707.03659 and is available on arXiv.org.

This publication gives us the opportunity to thank all the OMICtools biocurators and users for their helpful collaboration in providing high-quality and updated information on bioinformatics tools.


Overview of the collaborative functionalities offered on the OMICtools platform

Here’s the abstract: With high-throughput biotechnologies generating unprecedented quantities of data, researchers are faced with the challenge of locating and comparing an exponentially growing number of programs and websites dedicated to computational biology, in order to maximize the potential of their data.

OMICtools is designed to meet this need with an open-access search engine offering an easy means of locating the right tools for each researcher and their specific biological data analyses. The OMICtools website centralizes more than 18,500 software tools and databases, manually classified by a team of biocurators, including many scientific experts. Key information, a direct link, and access to discussions and evaluations by the biomedical community are provided for each tool. Anyone can join the OMICtools community and create a profile page to share their expertise and comment on tools. In addition, developers can directly upload their code versions, which are registered for identification and citation in scientific publications, improving research traceability.

The OMICtools community links thousands of life scientists and developers worldwide, who use this bioinformatics platform to accelerate their projects and biological data analyses.