Snakemake: taking parallelization a step further


Written by Raoul Raffel from Bioinfo-fr.net, translated by Sarah Mackenzie.

Hello and welcome back to a new episode of the series of Snakemake tutorials, this one dealing with parallelization. If you missed the first episode introducing you to Snakemake for Dummies, check out the article to catch up on it before you read on.

Here we are going to see how easy Snakemake makes it to parallelize data processing. The general idea is to split the raw files at the start of your pipeline and then merge the results back together after the computation-intensive steps. We are also going to find out how to use a JSON configuration file. This file is the equivalent of a dictionary / hash table and can be used to store global variables or parameters used by the rules. It will make it easier to generalize your workflow and to modify the parameters of the rules without touching the Snakefile.

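As an illustration, such a configuration file could look like the following (the file name and its contents are only an assumed example):

configfile.json:

{
    "genome": "hg19",
    "samples": {
        "exp1": ["s1", "s2"],
        "exp2": ["s3", "s4"]
    }
}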

To use it in a Snakefile, you need to add the following keyword:

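A minimal sketch, assuming the configuration file above is saved as configfile.json:

# the contents of the JSON file become available in the global dictionary called config
configfile: "configfile.json"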

You can then access its elements as if it were a simple dictionary:

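For instance, with the configuration file sketched above:

GENOME = config["genome"]              # "hg19"
SAMPLES = config["samples"]["exp1"]    # ["s1", "s2"]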

A single keyword for parallelizing

Only one new keyword (dynamic) and two rules (cut and merge) are needed to parallelize.

It’s easiest to illustrate this using the workflow example from the previous tutorial. In it, the limiting step was Burrows-Wheeler Aligner (rule bwa_aln), as this command doesn’t have a parallelization option. We can overcome this limitation with the following two rules.

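The original screenshot of these two rules is not reproduced here; the sketch below is an assumed reconstruction using the dynamic keyword as it existed when this article was written (newer Snakemake versions replace dynamic files with checkpoints). The file layout, chunk size and commands are illustrative:

# cut: split each raw fastq into numbered chunks (00, 01, ...) of 100,000 reads
rule cut:
    input:
        "data/raw/{sample}.fastq"
    output:
        dynamic("data/chunks/{sample}.{part}.fastq")
    shell:
        "split -l 400000 -d --additional-suffix=.fastq {input} data/chunks/{wildcards.sample}."

# merge: gather the chunk-level results produced by the compute-intensive step
rule merge:
    input:
        dynamic("data/mapped/{sample}.{part}.sam")
    output:
        "data/mapped/{sample}.sam"
    shell:
        "cat {input} > {output}"

The number of chunks does not need to be known in advance: once cut has run, Snakemake re-evaluates the list of matching files, which is exactly what the dynamic keyword provides. Simply concatenating SAM files is part of the simplification; a real merge step would handle the headers (for example with samtools merge on BAM files).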

Here I have simplified these rules as much as possible to show the power of this functionality; if you want to use them in the workflow from the previous tutorial, you will have to adapt them.

Note: the option --cluster allows the use of a scheduler (e.g. --cluster 'qsub').
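For example, to distribute the jobs via a scheduler (the job limit is arbitrary):

snakemake --jobs 100 --cluster 'qsub'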

Taking automation further

The file configfile.json allows automatic generation of target files (i.e., the files you want at the end of your workflow). The following example uses the configuration file presented earlier to generate the target files. Note that {exp} and {samples} come from pairs defined in the configuration file.

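A sketch of such a target rule, assuming, as in the configuration file shown earlier, that config["samples"] maps each experiment to its list of samples (the result paths are illustrative):

configfile: "configfile.json"

# build one target file per (experiment, sample) pair declared in the configuration file
TARGETS = ["mapping/{exp}/{samples}.sort.bam".format(exp=exp, samples=sample)
           for exp, sample_list in config["samples"].items()
           for sample in sample_list]

rule all:
    input: TARGETS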

Here’s an example of the workflow with parallelization:

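The screenshots of the full parallelized Snakefile are not reproduced here. As an assumed illustration only (not the author's exact code), the alignment step can simply be rewritten to work on one chunk at a time, sitting between the cut and merge rules shown above:

# align one chunk at a time; the genome is assumed to have been indexed with bwa index
rule bwa_map_chunk:
    input:
        fastq = "data/chunks/{sample}.{part}.fastq",
        ref = "data/genome/hg19.fa"
    output:
        temp("data/mapped/{sample}.{part}.sam")
    shell:
        "bwa aln {input.ref} {input.fastq} > {output}.sai && "
        "bwa samse {input.ref} {output}.sai {input.fastq} > {output} && "
        "rm {output}.sai"

With the final file requested through the merge rule, Snakemake runs one bwa_map_chunk job per chunk, in parallel up to the limit given with -j (or up to the number of jobs allowed with --cluster).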

So to sum up, we’ve taken another step forward in learning the functionalities of Snakemake, using only a single keyword and two new rules. This is an easy way to improve the efficiency of your workflows by reducing the running time of the computation-intensive steps.

However it’s important to keep in mind that excessive parallelization is not necessarily the optimal strategy. For example, if I decide to cut a file containing 1000 lines into 1000 files of one line each and I only have two poor processors at my disposal, I am likely to lose time rather than gain it. So it’s up to you to choose the most judicious parallelization strategy on the basis of the machine(s) available, the size of the files to cut up, the software/scripts involved, and the extra time that the two new rules will add to the workflow.

But if you are facing particularly demanding tasks, and a computer cluster is available, you may well see an impressive gain in time.

Snakemake for dummies (or how to create a pipeline easily)


Written by Raoul Raffel from Bioinfo-fr.net, translated by Sarah MacKenzie.

If you haven’t already heard about this tool, you have certainly not read the article Formalising your protocols with Snakemake by Louise-Amélie Schmitt (maybe not surprising if you don’t understand French!). So, what are the advantages of rewriting your ready-to-go pipelines in Snakefile?

The answer? Code readability, resource management, and reproducibility

Once you are ready to publish your data, you will have to prepare yourself to explain to your future readers exactly how you have obtained your data in order to allow other bioinformaticians to take your raw data and reproduce the same results. This is a critical aspect of bioinformatics – indeed of all scientific research: reproducibility. Which tools were used, which versions, which parameters etc., i.e., details of each and every one of even the tiniest steps that allow you to obtain your data. It is imperative that any scientific article reporting key results identified using bioinformatics is associated with a pipeline/script allowing the identical results to be reproduced.

The last few years have seen a massive, almost exponential, increase in the volume of data being produced. In parallel, the resources made available (such as computer clusters) have been continuously increasing in size (number of CPUs) and in computing power (CPU speed), allowing bioinformatics tools to keep up. To exploit these resources optimally, some tasks need to be performed in parallel; many bioinformatics tools have therefore developed options allowing simultaneous use of several CPUs (e.g. bwa, STAR, etc.), while others were not designed to be used in a multi-CPU environment (awk, macs2, bedtools, etc.). In the latter case, the tasks either need to be parallelized manually (by launching several tasks in the background with an ‘&’ at the end of the command), or you have to queue the operations and under-exploit the machine you are using (sequential task management). An upcoming article will explain how you can, very easily, use Snakemake to take parallelization a step further.

The inventor of Snakemake, Johannes Köster [1], figured out how to take advantage of both Python (clear syntax) and GNU Make (parallel execution, resource management, rule system). Along with numerous additional functionalities, such as tracking of versions/parameters, creation of HTML reports, and visualization of tasks and their dependencies as a graph (DAG), this language has great potential and a promising future.

The general principles of Snakemake

As is the case for GNU Make, Snakemake functions using the principle of rules. A rule is a group of elements that allows one or more files to be created, and can be considered as one step of the pipeline. Rules are written in a Snakefile (the equivalent of a Makefile).

Each rule has at least one output file and one (or several) input file(s). I deliberately mention the output first because that is Snakemake’s standpoint: it starts from the end in order to work back to the start.

The first rule of a Snakefile defines the files that you want to obtain at the end of your data processing (the target files). The order (and the choice) of the rules is established automatically from the target file/folder name(s): Snakemake works its way back up from file to file, rule by rule, until it finds a rule whose input is more recent than its output (or whose output does not yet exist). That rule, and all those that follow it, will then be executed.

Snakemake functions using file names

The easiest approach is to start with an example:

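A possible Snakefile for this example might look as follows (a sketch; the sample names are the ones used in the next paragraph):

rule all:
    input:
        "data/raw/s1.fq.gz",
        "data/raw/s2.fq.gz"

# gzip -c keeps the original fastq and writes the compressed copy to the output file
rule gzip:
    input:
        "data/raw/{sample}.fq"
    output:
        "data/raw/{sample}.fq.gz"
    shell:
        "gzip -c {input} > {output}"

Running snakemake -j 2 would then compress both files at the same time.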

As you can probably guess, if data/raw/s1.fq and data/raw/s2.fq are more recent than data/raw/s1.fq.gz and data/raw/s2.fq.gz, the gzip rule will create/replace the targets. Furthermore, we can execute both operations in parallel by passing the number of processors to use as a parameter (option -j N, --jobs=N).

Snakemake uses keywords to define the rules, and each rule can contain several of them.

Input and output

  • input = file(s) to use to create the output
  • output = file(s) that will be generated with the rule


  • wildcards = variables defined with {} as part of the name/path of the output files, which allow the rules to be generalized. In the example we use two wildcards (genome and sample); a regular expression can also be used to define a wildcard more precisely, such as "hg[0-9]+" for genome (as sketched below).
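As a sketch (paths and command are illustrative), a rule using these keywords and the two wildcards mentioned above could look like this; the regular expression constraining genome is written directly after the wildcard name:

# the regular expression after the comma constrains the genome wildcard to hg[0-9]+
rule bwa_aln:
    input:
        fastq = "data/raw/{sample}.fastq",
        ref = "data/genome/{genome}.fa"
    output:
        "mapping/{genome,hg[0-9]+}/{sample}.sai"
    shell:
        "bwa aln {input.ref} {input.fastq} > {output}"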

Use of the wildcards object (by section)

  • log : "mapping/{genome}/{sample}.sort.log"
  • shell/run/message : "mapping/{wildcards.genome}/{wildcards.sample}.sort.log"
  • params : param1 = lambda wildcards: wildcards.genome + '.fa'

Explanation: in the input, output, and log sections the variable name can be used directly (e.g. {genome}), whereas the wildcards object itself is only available in shell, run, and message, so in those three sections you have to write wildcards.variable (e.g. {wildcards.genome}). In the other sections, such as params, you can use "lambda wildcards: wildcards.variable": the lambda function gives access to the wildcards object there as well.
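Putting the three cases together in a single illustrative rule (paths and the samtools command are assumptions):

rule sort_bam:
    input:
        "mapping/{genome}/{sample}.bam"
    output:
        "mapping/{genome}/{sample}.sort.bam"
    # in input/output/log the variable name is used directly
    log:
        "mapping/{genome}/{sample}.sort.log"
    # in params, a lambda gives access to the wildcards object
    # (param1 only illustrates the syntax; the command below does not use it)
    params:
        param1 = lambda wildcards: wildcards.genome + '.fa'
    # in shell/run/message the wildcards object must be used explicitly
    message:
        "sorting {wildcards.sample} mapped on {wildcards.genome}"
    shell:
        "samtools sort -o {output} {input} 2> {log}"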

Processing the created files

  • temp = temporary file, deleted once it is no longer needed.
  • protected = file that will be write-protected after its creation.
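For example (paths and commands are illustrative):

rule sam_to_bam:
    input:
        "mapping/{sample}.sam"
    # the intermediate BAM is deleted as soon as no rule needs it any more
    output:
        temp("mapping/{sample}.bam")
    shell:
        "samtools view -b {input} > {output}"

rule bam_sort:
    input:
        "mapping/{sample}.bam"
    # the final file is write-protected once it has been created
    output:
        protected("mapping/{sample}.sort.bam")
    shell:
        "samtools sort -o {output} {input}"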

Commands to generate the output file(s)

  • shell = use a UNIX command to create the output file(s).
  • run = use Python code to build the output file(s).

Note: you will have to choose between run and shell.
Note 2: within run you can still call the shell() function.
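A sketch of a rule written with run rather than shell; the statistic computed here is just an illustration:

rule count_reads:
    input:
        "data/raw/{sample}.fastq"
    output:
        "stats/{sample}.count"
    run:
        # a run section contains plain Python code, and shell() remains available (note 2)
        shell("wc -l {input} > {output}")
        with open(output[0]) as f:
            n_lines = int(f.read().split()[0])
        # overwrite the file with a small summary (4 fastq lines = 1 read)
        with open(output[0], "w") as f:
            f.write("{}\t{}\n".format(wildcards.sample, n_lines // 4))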

Other rule parameters

  • threads = maximum number of processors that can be used
  • message = message to show at the start
  • log = log file
  • priority = allows rules to be prioritized (e.g. when several rules use the same input)
  • params* = parameters of the commands used
  • version* = version of the rule (or of the command used)

* These two keywords allow you to associate the parameters and the version of a rule with its output files. Output files can thus be (re)generated if a parameter or the version of the rule has been modified. In addition, code tracking allows you to follow the evolution of the pipeline and of the files generated.
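An illustrative rule combining these keywords (the bwa options, values and paths are assumptions, not recommendations):

rule bwa_aln:
    version: "0.1"
    input:
        fastq = "data/raw/{sample}.fastq",
        ref = "data/genome/hg19.fa"
    output:
        temp("mapping/{sample}.sai")
    # params are tracked together with the outputs (see the note above)
    params:
        quality = "20"
    threads: 4
    priority: 50
    log:
        "mapping/{sample}.bwa_aln.log"
    message:
        "mapping {wildcards.sample} with bwa aln on {threads} threads"
    shell:
        "bwa aln -t {threads} -q {params.quality} {input.ref} {input.fastq} > {output} 2> {log}"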

A concrete example

We are going to create a genomic data analysis pipeline (e.g. ChIP-seq data analysis) which will align/map the short sequences produced by high-throughput sequencing (reads) to a reference genome.

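The complete Snakefile (around 130 lines) is shown as a series of screenshots in the original article and is not reproduced here; the abridged sketch below only gives its general shape, and every path, command and option in it is an assumption rather than the author's exact code:

configfile: "configfile.json"

# every sample from every experiment listed in the configuration file
SAMPLES = [s for sample_list in config["samples"].values() for s in sample_list]

rule all:
    input:
        expand("mapping/{genome}/{sample}.sort.bam",
               genome=[config["genome"]], sample=SAMPLES)

# index the reference genome once
rule bwa_index:
    input:
        "data/genome/{genome}.fa"
    output:
        "data/genome/{genome}.fa.bwt"
    shell:
        "bwa index {input}"

# align the reads of one sample
rule bwa_aln:
    input:
        fastq = "data/raw/{sample}.fastq",
        index = "data/genome/{genome}.fa.bwt"
    output:
        temp("mapping/{genome}/{sample}.sai")
    threads: 4
    shell:
        "bwa aln -t {threads} data/genome/{wildcards.genome}.fa {input.fastq} > {output}"

# convert the alignment to SAM
rule bwa_samse:
    input:
        sai = "mapping/{genome}/{sample}.sai",
        fastq = "data/raw/{sample}.fastq"
    output:
        temp("mapping/{genome}/{sample}.sam")
    shell:
        "bwa samse data/genome/{wildcards.genome}.fa {input.sai} {input.fastq} > {output}"

# sort the alignments and keep the final BAM write-protected
rule sam_to_sorted_bam:
    input:
        "mapping/{genome}/{sample}.sam"
    output:
        protected("mapping/{genome}/{sample}.sort.bam")
    shell:
        "samtools sort -O bam -o {output} {input}"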

With approximately 130 lines, this high-performance code is ready to use on a PC, a server or a computer cluster. From the Snakefile, we can generate a graph representing the dependencies between the rules with the following command:

snakemake --rulegraph | dot | display

Graph showing the dependencies between rules


We can also generate a more complete graph, which takes the wildcards into account, using the following command:

snakemake --dag | dot | display

Graph representing the complete pipeline


Based on the recent paper:

[1] (Köster et al., 2012) Parallelization, Scalability, and Reproducibility in Next-Generation Sequencing Analysis. Bioinformatics.