Predict transcription factor binding from DNase footprints with Sasquatch


Predicting the impact of regulatory sequence variation on transcription factor (TF) binding is an important challenge, as the vast majority of disease-associated SNPs are found in the non-coding genome (Vaquerizas, Kummerfeld, Teichmann, & Luscombe, 2009). Most existing approaches rely on large catalogs of cell-type- and TF-specific functional annotations. As only a minority of TFs are well characterized (Maurano et al., 2015; Rockman & Kruglyak, 2006), identifying the relevant factors and probing them in the appropriate cell types is a major limitation of TF-centric approaches.

With this in mind, Ron Schwessinger and colleagues from the University of Oxford have developed the Sasquatch tool, which uses DNase footprinting data to estimate and visualize the effects of non-coding variants on TF binding. Here, they talk about the features and benefits of their tool.

Sasquatch – predicting TF binding from average, k-mer based DNase footprints

DNase I preferentially cuts the genome in accessible regions associated with regulatory function. By mapping the exact cut sites rather than entire fragments, DNase-seq can reveal protein occupancy at base-pair resolution. Sasquatch analyses DNase footprints to comprehensively determine any k-mer’s potential for cell-type-specific TF binding in the context of open chromatin, and how this may be changed by sequence variants. Sasquatch is an unbiased approach: it is independent of known TF binding sites and motifs, and requires only a single DNase-seq dataset per cell type.

Probe TF binding potential

Querying any k-mer from the tissue repository retrieves the relative DNase cut profile over a 250 bp window around that k-mer. By automatically detecting shoulders and footprints in the profile, Sasquatch quantifies the shoulder-to-footprint ratio (SFR), and thus the average protein occupancy of that k-mer within open chromatin. The SFR is cell type specific: tissue-specific TFs yield a footprint and a high SFR only in the relevant cell types, while housekeeping TFs score consistently high SFRs across cell types.
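The idea behind the SFR can be sketched in a few lines. The function below is a simplified stand-in, not Sasquatch's actual implementation: it assumes a fixed-width central footprint window rather than detecting shoulders and footprints automatically, and takes a toy cut-count profile as input.

```python
import numpy as np

def shoulder_to_footprint_ratio(profile, footprint_width=7):
    """Toy SFR: mean shoulder cut rate divided by mean footprint cut rate.

    `profile` is a 1-D array of relative DNase cut counts centred on a
    k-mer. Here the footprint is a fixed-width central window and the
    shoulders are the flanking positions; Sasquatch itself detects
    shoulders and footprints automatically, so this is only a sketch.
    """
    profile = np.asarray(profile, dtype=float)
    centre = len(profile) // 2
    half = footprint_width // 2
    footprint = profile[centre - half : centre + half + 1]
    shoulders = np.concatenate([profile[: centre - half],
                                profile[centre + half + 1 :]])
    return shoulders.mean() / footprint.mean()

# A protected (bound) k-mer: low cutting in the centre, high at the shoulders.
bound = [9, 8, 9, 8, 2, 1, 1, 1, 2, 8, 9, 8, 9]
# An unbound k-mer: roughly uniform cutting across the window.
unbound = [5] * 13

print(shoulder_to_footprint_ratio(bound))    # high SFR: occupied
print(shoulder_to_footprint_ratio(unbound))  # SFR near 1: no footprint
```

A bound k-mer's protected centre depresses the denominator, so its SFR is well above 1, while a flat profile scores close to 1.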


Predict impact of sequence variation

Comparing the footprint profiles of a reference and a variant sequence yields a total and a relative damage score for a particular variant. To do so, Sasquatch compares the SFRs in a sliding-window approach. Variants with high damage scores are predicted to strongly alter TF binding potential in open chromatin in a cell-type-specific manner.
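A minimal sketch of such a sliding-window comparison is shown below. The `sfr` dict is a toy stand-in for Sasquatch's precomputed per-tissue k-mer repository, and the normalisation used for the relative score is one plausible choice for illustration, not necessarily the one Sasquatch uses.

```python
def damage_score(ref_seq, var_seq, sfr, k=7):
    """Toy sliding-window damage score for a single variant.

    Slides a k-mer window across the reference and variant sequences and
    sums the change in SFR at each offset. `sfr` maps a k-mer to its
    precomputed shoulder-to-footprint ratio; unseen k-mers default to a
    neutral SFR of 1.0 (no footprint).
    """
    assert len(ref_seq) == len(var_seq)
    total = 0.0
    ref_total = 0.0
    for i in range(len(ref_seq) - k + 1):
        ref_kmer = ref_seq[i : i + k]
        var_kmer = var_seq[i : i + k]
        total += sfr.get(var_kmer, 1.0) - sfr.get(ref_kmer, 1.0)
        ref_total += sfr.get(ref_kmer, 1.0)
    relative = total / ref_total if ref_total else 0.0
    return total, relative

# Toy repository: only TGA carries a footprint; TGC is neutral.
sfr = {"TGA": 2.5, "TGC": 1.0}
# A single A>C substitution destroys the TGA footprint.
total, relative = damage_score("ATGAT", "ATGCT", sfr, k=3)
```

A negative total indicates lost binding potential; a positive one would indicate a gained footprint.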


Prioritize non-coding sequence variants

By querying sequence variants in batch mode, Sasquatch can quickly prioritize thousands of variants by their potential to alter TF binding. Importantly, Sasquatch assumes an open-chromatin context; when dealing with many variants, it is therefore advisable to first filter them for location in potential open chromatin.
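Such a pre-filter amounts to a simple interval-overlap check against DNase peaks. The sketch below is a hypothetical helper, not part of Sasquatch; it assumes peaks have been parsed from a BED-style file into half-open intervals.

```python
def in_open_chromatin(variants, peaks):
    """Keep only variants falling inside open-chromatin intervals.

    `variants` is a list of (chrom, pos) tuples; `peaks` maps a chromosome
    to a list of (start, end) half-open DNase peak intervals, e.g. parsed
    from a BED file. A linear scan is fine for a sketch; an interval tree
    would be the idiomatic choice for genome-scale data.
    """
    kept = []
    for chrom, pos in variants:
        for start, end in peaks.get(chrom, []):
            if start <= pos < end:
                kept.append((chrom, pos))
                break
    return kept

peaks = {"chr1": [(100, 200), (500, 800)]}
variants = [("chr1", 150), ("chr1", 300), ("chr2", 150)]
kept = in_open_chromatin(variants, peaks)  # only chr1:150 survives
```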

In silico mutate entire regions

By predicting the damage score of every possible base substitution at every bp, Sasquatch can create in silico mutation plots. Peaks of high damage predict the locations of likely binding sites within open chromatin that can be disrupted by sequence variants. By analyzing entire genomic elements in this way, Sasquatch can help characterize regulatory elements.
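The scanning loop behind such a plot is straightforward to sketch. The scoring function below is a deliberately crude stand-in (it just counts occurrences of a hypothetical motif lost relative to the reference) used in place of Sasquatch's damage score; only the enumeration structure is the point here.

```python
def in_silico_mutation(seq, score, bases="ACGT"):
    """Score every possible single-base substitution along `seq`.

    `score(ref_seq, var_seq)` is any per-variant damage function. Returns
    a dict mapping (position, alternative base) to its score; positions
    with consistently high scores mark likely binding sites.
    """
    scores = {}
    for i, ref_base in enumerate(seq):
        for alt in bases:
            if alt == ref_base:
                continue
            var = seq[:i] + alt + seq[i + 1 :]
            scores[(i, alt)] = score(seq, var)
    return scores

# Toy damage function: copies of a "motif" lost relative to the reference.
MOTIF = "TGA"
toy_score = lambda ref, var: ref.count(MOTIF) - var.count(MOTIF)

scores = in_silico_mutation("ATGAC", toy_score)
```

Substitutions inside the motif (e.g. position 1, T>A) score 1, while substitutions in the flanks score 0, so the motif shows up as a block of high scores in the matrix.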


Concluding Remarks

We implemented Sasquatch as a web tool for fast and straightforward usage. To allow for more flexible and customized usage, we also made it available as an R implementation. We pre-processed all human ENCODE DNase data to supply a large repository of cell types for your analyses. Custom DNase-seq data can easily be pre-processed and run in our R implementation.


Maurano, M. T. et al. (2015). Large-scale identification of sequence variants influencing human transcription factor occupancy in vivo. Nature Genetics.
Rockman, M. V., & Kruglyak, L. (2006). Genetics of global gene expression. Nature Reviews Genetics.
Schwessinger, R. et al. (2017). Sasquatch: predicting the impact of regulatory SNPs on transcription factor binding from cell- and tissue-specific DNase footprints. Genome Research.
Vaquerizas, J. M., Kummerfeld, S. K., Teichmann, S. A., & Luscombe, N. M. (2009). A census of human transcription factors: function, expression and evolution. Nature Reviews Genetics.

Identify and map genomic islands with xenoGI


Genomic islands (GIs) are regions of a genome that have been transferred horizontally between organisms, typically bacteria. They encode many functions, such as symbiosis, pathogenesis, and antibiotic resistance. GIs are often identified through base composition analysis and phylogeny estimation.

Dr. Eliot Bush has developed xenoGI, a tool that helps identify islands of genes that entered a genome via common horizontal transfer events, maps those events onto a phylogenetic tree, and more. Here, he briefly describes the features and benefits of his tool.

Features and benefits of xenoGI

Microbes have acquired many important traits through the horizontal transfer of genomic islands. Understanding the evolution of these traits often requires us to understand the adaptive path that has produced them.

The goal of xenoGI is to reconstruct the history of genomic island insertions in a clade of closely related microbes. It takes as input a set of sequenced genomes and a tree specifying their phylogenetic relationships.

It then identifies genomic islands and maps their origin on the phylogenetic tree, determining which branch they inserted on.

The key challenge in this problem is to accurately identify the origin of genes. Every gene in the input genomes must have one of two origins. Either it is a core gene present in the most recent common ancestor of the strains, or it arrived via a horizontal transfer event. The algorithm seeks to determine which is which by creating gene families in a way that takes account of both the species tree and synteny information. It then identifies families whose members are adjacent and whose most recent common ancestor is shared, and merges them into islands reflecting a common origin.
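The final merging step can be illustrated with a much-simplified sketch. This is not xenoGI's actual algorithm (which works on gene families and also weighs synteny support); it only shows the core idea of merging adjacent genes whose families trace to the same origin branch, with all names below being illustrative.

```python
def merge_into_islands(genes):
    """Merge adjacent genes sharing an origin branch into islands.

    `genes` is an ordered list (genome order) of (gene_name, origin_branch)
    pairs, where origin_branch is the species-tree branch a gene's family
    is inferred to have inserted on, or "core" for genes present in the
    most recent common ancestor. Returns (branch, [genes]) islands,
    dropping core runs.
    """
    islands = []
    for gene, branch in genes:
        if islands and islands[-1][0] == branch:
            islands[-1][1].append(gene)   # extend the current run
        else:
            islands.append((branch, [gene]))  # start a new run
    return [(b, g) for b, g in islands if b != "core"]

# Hypothetical ordered genes with inferred origin branches.
genes = [("g1", "core"), ("g2", "br_Escherichia"), ("g3", "br_Escherichia"),
         ("g4", "core"), ("g5", "br_Ecoli")]
islands = merge_into_islands(genes)
```

Here g2 and g3 are adjacent and share an origin branch, so they merge into one two-gene island, while g5 forms an island of its own.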

The figure below shows an example of this sort of analysis related to acid tolerance in the Escherichia clade. gadB encodes a glutamate decarboxylase enzyme known to be involved in acid tolerance in E. coli. In an analysis of eleven enteric species, xenoGI finds that gadB is part of an island of eight genes that inserted on the branch leading to Escherichia, before the divergence of E. fergusonii.

Example of analysis with xenoGI

The fact that it operates in the context of a clade makes xenoGI distinctive compared with previous genomic island finding methods. Other distinctive features include the fact that it is gene based, doesn’t depend on the aligner MAUVE, and integrates species tree and synteny information from an early stage of its analysis.

In the past, reconstructing the history of GI insertions into a clade typically required heavy human involvement. xenoGI provides an automated solution to this problem. Beyond this, a thorough comparative analysis is the gold standard for genomic island finding (even if one’s goal is to find islands in only a single genome). Our hope is that xenoGI will make this sort of analysis accessible to more users.


Bush, E. C. et al. (2017). xenoGI: reconstructing the history of genomic island insertions in clades of closely related bacteria. bioRxiv.


Towards standardized protocols for microbiome studies


The study of the microbiome – that is, the ensemble of microbial communities living inside us – has become a major application for high-throughput DNA sequencing. Functional changes in the composition of the gut microbiome have been implicated in multiple human diseases.

Due to its complexity, the analysis of sequencing data from microbiome studies typically involves many different protocols and bioinformatics tools. From sample collection and DNA extraction to sequencing and computational analysis, technical errors and biases can occur at each step, making the standardization of protocols a complex task.

To this end, two consortia recently set out to examine the sources of inter-laboratory variability in various aspects of microbiome data generation. This work was published in the latest issue of Nature Biotechnology.

Steps in the microbiome data generation process and technical sources of error and bias at each step. From Gohl D. M. 2017

The Microbiome Quality Control (MBQC) project consortium

The first study, published by Sinha et al. and the MBQC, focuses on two main sources of variation: sample handling (extraction, amplification, and sequencing) and bioinformatics processing. To assess these potential sources of bias, they sent human samples to 15 laboratories and subjected the resulting dataset to analysis with 9 bioinformatics protocols.

Microbiome Quality Control Project baseline study design.

DNA extraction and library preparation showed the highest degree of variation among laboratories, while different bioinformatics analyses introduced little variability.

The authors also provide guidelines for the optimal use of bioinformatics protocols to mitigate this variability, such as using relative (rather than absolute) diversity measures, phylogenetic (rather than taxonomic) analyses, and quantitative (rather than presence/absence-based) measures.
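The "relative rather than absolute" guideline can be made concrete with a tiny example. The function and the sample counts below are illustrative, not taken from the study: sequencing depth differs between runs, so per-sample normalization is the minimal step before comparing compositions.

```python
def relative_abundance(counts):
    """Convert raw per-taxon read counts to relative abundances.

    Absolute read counts depend on sequencing depth and are therefore not
    comparable across samples or laboratories; dividing by the per-sample
    total yields fractions that are. `counts` maps taxon -> read count
    for one sample.
    """
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}

# Hypothetical per-taxon read counts for one sample.
sample = {"Bacteroides": 600, "Firmicutes": 300, "Proteobacteria": 100}
ra = relative_abundance(sample)  # fractions summing to 1
```

Two samples sequenced at different depths but with the same composition would yield identical relative abundances, which is exactly what cross-laboratory comparisons need.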

The next phase of the MBQC project will consist of systematic surveys of microbiome assay protocols, with the goal of establishing a shared library of positive- and negative-control standards for different microbial habitats.

Assessing DNA extraction protocols

In a second study, Costea et al. tested 21 representative DNA extraction protocols on the same fecal samples and quantified differences in terms of microbial community composition.

The authors identified three protocols that performed better in yield and integrity of the extracted DNA, and in their ability to represent hard-to-extract Gram-positive species.

The details of these protocols, together with standard methods for sample and library preparation can be found here:

In this study, however, DNA extraction methods were the sole source of variation investigated. Nonetheless, it provides a benchmark for the future development of new extraction methods, as well as a set of recommendations to improve cross-study comparability.

Together, these works underline the need to work towards uniform, standardized protocols for microbiome studies.