Analyzing population genomics using ENDOG software

Dr. Juan Pablo Gutiérrez, developer of the ENDOG software, talks here about his tool and how you can use it to perform various demographic and genealogical analyses of genomics data.

ENDOG: one of the most popular software packages for genealogical analysis

The program ENDOG has become one of the most popular software tools for genealogical analysis. It computes not only classical population-genetics parameters but also newer parameters based on the individual increase in inbreeding or coancestry.

[Screenshot: ENDOG's Inbreeding per Generation submenu]

ENDOG allows you to conduct several demographic and genetic analyses including:

  • Individual inbreeding and average relatedness coefficients
  • Effective population size
  • Parameters characterizing the concentration of the origin of genes and individuals, such as the effective number of founders and ancestors, the effective number of founder herds, etc.
  • F-statistics and pairwise genetic distances for each subpopulation under study
  • Descriptors of the genetic importance of herds in a population
  • Generation intervals

The program helps breeders and researchers monitor changes in genetic variability and population structure, with limited dataset-preparation costs.

The current version of ENDOG calculates effective population size using several methodologies, including regression approaches and, in particular, calculation from the individual increase in inbreeding, modified to account for the avoidance of self-fertilization.
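To make the individual-increase-in-inbreeding idea concrete, here is a minimal Python sketch with made-up toy data (this is not ENDOG's actual code): for each animal, ΔF = 1 − (1 − F)^(1/(t − 1)), where F is its inbreeding coefficient and t its number of equivalent generations, and the realized effective population size is Ne = 1/(2 · mean ΔF).

```python
# Sketch of the "individual increase in inbreeding" approach to realized
# effective population size (Ne). The data below are invented for
# illustration; they are not from any real pedigree.

def delta_f(f_i, t_i):
    """Individual increase in inbreeding for an animal with inbreeding
    coefficient f_i and t_i equivalent discrete generations of pedigree."""
    return 1.0 - (1.0 - f_i) ** (1.0 / (t_i - 1.0))

def realized_ne(pedigree):
    """Realized Ne = 1 / (2 * mean individual increase in inbreeding)."""
    deltas = [delta_f(f, t) for f, t in pedigree]
    mean_delta = sum(deltas) / len(deltas)
    return 1.0 / (2.0 * mean_delta)

# (F, equivalent generations) pairs for a toy reference population
toy = [(0.02, 4.0), (0.05, 5.0), (0.01, 3.5), (0.03, 4.5)]
print(round(realized_ne(toy), 1))
```

Averaging ΔF over the animals of a reference population, rather than using a single population-level F, is what lets this estimate work reasonably well on shallow, unbalanced pedigrees.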

[Screenshot: ENDOG's Individual Pedigree submenu]

Highlights of the ENDOG program

Why has ENDOG become a popular software for scientists and breeders?

Using ENDOG for your genetic analyses allows you to fit your data to real-world populations. It is specifically designed for analyzing diploid populations in which selfing is not possible.

It allows you to compute reliable genetic parameters, in particular effective population sizes, even when pedigrees are shallow (having accumulated, on average, three or more generations). The authors have also made available a version of the program in which selfing is allowed, so that plant breeders can carry out genealogical analyses.

What about ENDOG quality: availability, usability and flexibility?

  • Free software: ENDOG is freely available. You can download the latest version, 4.8 (10 November 2010, in English), an additional file for the selfing version (28 September 2011), and ENDOG 4.0 (15 November 2006, in Spanish).
  • Intuitive: users can upload bulk pedigree data with little need for formatting, and ENDOG provides tools to help check for errors. The interface, Access tables and .txt files generated by ENDOG are user-friendly and self-explanatory.
  • Software documentation: the ENDOG users’ guide describes the methods implemented in the software and also gives tips to help users trying the software for the first time via the ENDOG interface. In addition, the authors are known to be very responsive to users’ questions.
  • Windows only: a compiled version of ENDOG is available for the Microsoft Windows environment only.
  • Tested in several studies: The ENDOG program has been cited 191 times in papers indexed in the Web of Knowledge.

[Screenshot: ENDOG's Founders submenu]

About the author:

Dr. Juan Pablo Gutiérrez, DVM, graduated from the School of Veterinary Medicine of the UCM (Complutense University of Madrid), Spain, in 1987, and completed a post-graduate degree at the same university in 1991. He also completed a degree in computer engineering at the UNED (National University of Distance Education) in Spain, with a specialization in Animal Breeding in 1989.

He is currently a Full Professor at the Department of Animal Production at the UCM, and the Director of the UCM Consolidated Research Group MOSEVAR (Animal Selection and Genetic Evaluation Models). As of July 2017, he has 30 years of experience in the field of animal breeding and has published approximately 200 research papers, 91 of them in journals listed in the JCR (Journal Citation Reports). He has worked in genetic evaluation in multiple species, including many breeds of cattle, sheep, mice, alpacas and horses.

Félix Goyache and Isabel Cervantes are co-authors of the software.

References:

Gutiérrez JP and Goyache F (2005). A note on ENDOG: a computer program for analysing pedigree information. Journal of Animal Breeding and Genetics.

 

How to visualize your data using R

Data visualization and R

Data visualization is one of the key steps of any data-science process. There are hundreds of possible ways to visualize data, and choosing and using the right one is a determining factor in getting your message across accurately. The R programming language is one of the most common tools for this task.

R is fabulous for creating charts. It allows you to create all types of visualization, and many libraries have been developed to help users produce them more quickly. However, it can sometimes be challenging, and extremely frustrating, to find the right code to produce a desired plot.

The R Graph Gallery by its creator, Yan Holtz

The R graph gallery is a collection of R graph examples, organized by chart type, searchable by R function, and accompanied by reproducible code and explanations.

[Screenshot: the R graph gallery presents hundreds of dataviz possibilities]

This gallery is a collection of over 300 R graphics. The website displays them with explanations and reproducible code, allowing users to rapidly understand each example and apply it to their own dataset. To make searching easier, charts are classified by type (such as boxplot, scatterplot, map or histogram), and the search bar helps you find specific R functions.

[Screenshot: the R graph gallery home page]

Browsing the All graphs page is also a good way to find inspiration for new ways to visualize your data: why not try stream graphs, radar charts or circular plots? If you are already familiar with the basic functions of R, check out the interactive charts section, which contains examples of HTML widgets. Interactive graphics are highly accessible in R and can greatly improve your dataviz skills. Another plus is the complete section dedicated to the popular ggplot2 library!

Last but not least, the gallery offers several useful examples for “omic people”. R is widely used by biologists, and a lot of packages have been developed for omic data analysis. Suppose, for example, that you need to draw a Manhattan plot to show the relationship between SNPs and a phenotype: try the qqman library and graph #101, which shows you how to use it!

[Screenshot: a Manhattan plot]

The R graph gallery is new and growing rapidly, notably thanks to the numerous contributors (whom I warmly thank). Please feel free to contact me if you have any suggestions to improve this project or if you detect any malfunctioning. And of course, any new R charts are more than welcome!

 

Snakemake: taking parallelization a step further

Written by Raoul Raffel from Bioinfo-fr.net, translated by Sarah Mackenzie.

Hello and welcome back to a new episode of the series of Snakemake tutorials, this one dealing with parallelization. If you missed the first episode introducing you to Snakemake for Dummies, check out the article to catch up on it before you read on.

Here we are going to see how easy Snakemake makes it to parallelize your data processing. The general idea revolves around splitting the raw files at the start of your pipeline and then merging the results back together after the calculation-intensive steps. We are also going to find out how to use a JSON configuration file. This file is the equivalent of a dictionary / hash table and can be used to store global variables or parameters used by the rules. It makes it easier to generalize your workflow and to modify rule parameters without touching the Snakefile.
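As a quick illustration of the "dictionary" analogy, here is a plain-Python sketch of loading a JSON configuration (the content and keys below are made-up examples, not the tutorial's actual configfile.json):

```python
# A JSON configuration file behaves like a nested dictionary once loaded.
# Everything in config_text is an invented example.
import json

config_text = '''
{
  "basedir": "/data/project",
  "samples": {"exp1": ["s1", "s2"], "exp2": ["s3"]},
  "bwa_threads": 4
}
'''

config = json.loads(config_text)
print(config["samples"]["exp1"])  # access like a plain dict
print(config["bwa_threads"])
```

Snakemake does exactly this for you when you declare a configuration file, exposing the result as a global dictionary named config.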

[Screenshot: an example JSON configuration file]

To use it in your Snakefile, you need to add the following key word:

[Screenshot: the configfile key word]
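For reference, the key word looks like this at the top of a Snakefile (the file name here is an assumption):

```
# load the JSON configuration into the global "config" dictionary
configfile: "configfile.json"
```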

You can then access elements as if it were a simple dictionary:

[Screenshot: accessing configuration elements]
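For example, a rule can read a value from config like this (the bwa_threads key is a made-up example, not one of the tutorial's keys):

```
rule example:
    output: "threads.txt"
    params: n = config["bwa_threads"]   # plain dictionary access
    shell: "echo {params.n} > {output}"
```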

A single key word for parallelizing

Only one new key word (dynamic) and two rules (cut and merge) are needed to parallelize.

It’s easiest to illustrate this using the workflow example from the previous tutorial. There, the limiting step was the Burrows-Wheeler Aligner (rule bwa_aln), as this command does not have a parallelization option. We can overcome this limitation with the following two rules.

[Screenshot: the cut and merge rules]
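As a rough sketch of what such a pair of rules can look like (the file names, chunk size and use of split are my assumptions, not the tutorial's actual code), dynamic() tells Snakemake that the number of chunk files is unknown until the cut rule has actually run:

```
# Hedged sketch of the cut/align/merge pattern -- adapt names to your workflow.
rule cut:
    input: "{sample}.fastq"
    output: dynamic("chunks/{sample}_{part}.fastq")
    shell: "split -l 4000 -d --additional-suffix=.fastq {input} chunks/{wildcards.sample}_"

rule bwa_aln_chunk:
    input: "chunks/{sample}_{part}.fastq"
    output: "chunks/{sample}_{part}.sai"
    shell: "bwa aln ref.fa {input} > {output}"

rule merge:
    input: dynamic("chunks/{sample}_{part}.sai")
    output: "{sample}.sai"
    shell: "cat {input} > {output}"
```

Each chunk then flows through the calculation-intensive rule as an independent job, so Snakemake can run as many of them in parallel as you allow. (In recent Snakemake versions, dynamic() has been deprecated in favour of checkpoints, but the idea is the same.)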

In this case I have simplified things as much as possible to show the power of this functionality; if you want to use these two rules in the workflow from the previous tutorial, you will have to adapt them.

Note: the --cluster option allows the use of a job scheduler (e.g. --cluster 'qsub').

Taking automation further

The configfile.json file allows automatic generation of the target files (i.e., the files you want at the end of your workflow). The following example uses the configuration file presented earlier to generate the target files. Note that the {exp} and {samples} wildcards are taken pairwise.

[Screenshot: generating the target files from the configuration]
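A sketch of what such a target-generating rule can look like, assuming a hypothetical "samples" section in the configuration file that maps each experiment to its sample names (again, these names are illustrative, not the tutorial's):

```
configfile: "configfile.json"

# build two parallel lists of experiments and samples from the config
EXP, SAMPLES = [], []
for exp, samples in config["samples"].items():
    for s in samples:
        EXP.append(exp)
        SAMPLES.append(s)

rule all:
    # "zip" makes expand() pair each {exp} with its matching {samples}
    # value instead of taking the full cross-product
    input: expand("results/{exp}/{samples}.sorted.bam", zip, exp=EXP, samples=SAMPLES)
```

This is what "come pairwise" means above: with zip, expand() walks the two lists in lockstep, so each experiment is only combined with its own samples.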

Here’s an example of the workflow with parallelization:

[Screenshots: the complete workflow with parallelization]

So, to sum up, we’ve taken another step forward in learning the functionalities of Snakemake, using only a single new key word and two new rules. This is an easy way to improve the efficiency of your workflows by reducing the run time of calculation-intensive steps.

However, it’s important to keep in mind that excessive parallelization is not necessarily the optimal strategy. For example, if I decide to cut a file containing 1,000 lines into 1,000 files of a single line each, and I have only two poor processors at my disposal, I am likely looking at a loss of time rather than a gain. So it’s up to you to make the most judicious choice of parallelization strategy on the basis of the machine(s) available, the size of the files to cut up, and the software/scripts and extra time that your two new rules will add to the workflow.

But if you are facing particularly demanding tasks, and a computer cluster is available, you may well see an impressive gain in time.