Welcome to SuperTAD


Introduction

SuperTAD is an open-source command-line TAD detection package written in C++. It takes either raw or normalized Hi-C contact maps as input. Given an input matrix, SuperTAD offers two modes, both of which find an optimal coding tree for the input. If the user supplies an integer parameter h, SuperTAD constructs the optimal coding tree of height at most h, denoted SuperTAD(h); otherwise, it constructs the optimal tree among all possible heights. Given the optimal tree, SuperTAD also provides a filter that prunes non-TAD nodes from the tree.

Analysis of simulated data shows that SuperTAD achieves higher accuracy and robustness under high noise ratios and large size variance. Under the two-layer constraint, our experiments show that SuperTAD(2) finds a structure with lower structure entropy than deDoc. A comparison with seven other methods shows that SuperTAD has a significant enrichment of structural proteins around predicted boundaries and of histone modifications within TADs, and displays high consistency between different resolutions of the same Hi-C matrix, which indicates that SuperTAD has the potential to identify the essential structure of Hi-C data.


Overview of the SuperTAD pipeline

With the same input matrix, SuperTAD provides two modes. SuperTAD (the first mode) requires no user-defined parameters and determines the height of the coding tree automatically. SuperTAD(h) (the second mode) takes a manually selected h as its only parameter and finds the optimal coding tree of height at most h. In both modes, many candidate coding trees with different numbers of leaves k are created, and the optimal coding tree is selected by determining the most appropriate k. For SuperTAD, optional node filtering can then prune false-positive TADs from the optimal binary coding tree. The result after pruning is referred to as SuperTAD(F).
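To make the notion of scoring a coding tree more concrete, here is a minimal Python sketch of the standard two-level structure entropy of a contact matrix under a partition into contiguous blocks. It is an illustration only and is not taken from the SuperTAD source: it assumes that "optimal" means lowest structure entropy (as the comparison with deDoc suggests), it uses a toy matrix with a zero diagonal, and the names two_level_entropy and toy_matrix are hypothetical. How SuperTAD treats the diagonal, taller trees, or the search over k is not covered here.

```python
import math

def two_level_entropy(A, blocks):
    """Structure entropy of a height-2 coding tree.

    A      : symmetric matrix (list of lists) with a zero diagonal (assumption).
    blocks : list of half-open bin ranges (start, end) that partition the bins.
    """
    deg = [sum(row) for row in A]          # weighted degree of each bin
    vol = sum(deg)                         # total volume of the graph
    h = 0.0
    for start, end in blocks:
        members = range(start, end)
        v_block = sum(deg[i] for i in members)
        if v_block == 0:
            continue                       # empty block contributes nothing
        inside = sum(A[i][j] for i in members for j in members)
        g_block = v_block - inside         # weight of contacts leaving the block
        h -= (g_block / vol) * math.log2(v_block / vol)
        for i in members:
            if deg[i] > 0:
                # For a single bin, every contact leaves the node, so g = degree.
                h -= (deg[i] / vol) * math.log2(deg[i] / v_block)
    return h

def toy_matrix():
    """Six bins forming two dense groups of three with weak contacts between them."""
    n = 6
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                A[i][j] = 10.0 if (i < 3) == (j < 3) else 1.0
    return A

A = toy_matrix()
flat = two_level_entropy(A, [(0, 6)])          # one block: no domain structure
good = two_level_entropy(A, [(0, 3), (3, 6)])  # blocks match the dense groups
bad = two_level_entropy(A, [(0, 2), (2, 6)])   # boundary shifted by one bin
print(f"flat={flat:.4f} good={good:.4f} bad={bad:.4f}")
```

On this toy input the partition that matches the two dense groups scores lower than the shifted partition, and both score lower than the single-block tree, which is the behavior a TAD caller based on structure entropy relies on.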


Install

Use git:

git clone https://github.com/deepomicslab/SuperTAD SuperTAD

or download the source archive:

wget https://supertad.deepomics.org/home/download_src -O SuperTAD.tar.gz
tar -xzvf SuperTAD.tar.gz

then build:

cd ./SuperTAD
mkdir build
cd build
cmake ..
make


Usage

COMMANDS:
  binary: The first mode requires no user-defined parameters and runs node filtering by default
    e.g. ./SuperTAD binary <input Hi-C matrix> [-option values]
    OPTIONS:
      --no-filter: If given, do not filter TADs after TAD detection

  multi: The second mode requires a parameter h to determine the number of layers
    e.g. ./SuperTAD multi <input Hi-C matrix> -h <height> [-option values]
    OPTIONS:
      -h <int>: The height of coding tree, default: 2

  SHARED OPTIONS for binary and multi COMMAND:
    -K <int>: The number of leaves in the coding tree, default: nan (determined by the algorithm)
    --chrom1 <string>: chrom1 label, default: chr1
    --chrom2 <string>: chrom2 label, default: the same as chrom1
    --chrom1-start <int>: start pos on chrom1, default: 0
    --chrom2-start <int>: start pos on chrom2, default: the same as --chrom1-start
    -r/--resolution <int>: bin resolution, default: 10000

  filter: The nodes filter for optimal coding tree:
    e.g. ./SuperTAD filter <input Hi-C matrix> -i <original result>
    OPTIONS:
      -i <string>: The list of TAD candidates

  compare: The symmetric metric overlapping ratio to assess the agreement between two results
    e.g. ./SuperTAD compare <result1> <result2>

GLOBAL OPTIONS:
  -w <string>: Working directory path, default: the directory where the input Hi-C matrix is located
  -v/--verbose: Print verbose


Input and Output

SuperTAD currently supports only Hi-C contact matrices as input. Two example Hi-C matrices from Rao et al., Cell 2014, together with their results, are provided in ./data.

The binary mode's result before filtering is stored in *.binary.original.tsv;
The binary mode's result after filtering or the filter mode's result is stored in *.binary.filter.tsv;
The multi mode's result is stored in *.multi.tsv;
All TAD results use an eight-column format that records the bin indexes of the detected boundaries and their genomic start and end coordinates.

An example output is shown below (resolution=1kb):

chr1 1 0 1000 chr1 44 43000 44000
chr1 9 8000 9000 chr1 16 15000 16000
chr1 17 16000 17000 chr1 44 43000 44000
...

The columns are:
1st-the chromosome of the left boundary
2nd-the bin index identified as the left boundary (start bin)
3rd-the start coordinate of the start bin, in bp
4th-the end coordinate of the start bin, in bp
5th-the chromosome of the right boundary
6th-the bin index identified as the right boundary (end bin)
7th-the start coordinate of the end bin, in bp
8th-the end coordinate of the end bin, in bp
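As a small illustration of the eight-column format, the following Python sketch parses the example rows above and checks that the coordinates follow from the bin indexes. It assumes whitespace-separated fields and 1-based bin indexes, which is what the example suggests (bin 1 spans 0 to 1000 at 1 kb). The names Tad, parse_tads and bin_to_coords are hypothetical, and the way a non-zero start offset shifts the coordinates is an assumption.

```python
from typing import List, NamedTuple, Tuple

class Tad(NamedTuple):
    chrom1: str
    start_bin: int
    start_bin_begin: int   # bp
    start_bin_end: int     # bp
    chrom2: str
    end_bin: int
    end_bin_begin: int     # bp
    end_bin_end: int       # bp

def parse_tads(text: str) -> List[Tad]:
    """Parse eight-column TAD rows; lines without exactly 8 fields are skipped."""
    tads = []
    for line in text.splitlines():
        f = line.split()
        if len(f) != 8:
            continue
        tads.append(Tad(f[0], int(f[1]), int(f[2]), int(f[3]),
                        f[4], int(f[5]), int(f[6]), int(f[7])))
    return tads

def bin_to_coords(bin_index: int, resolution: int, chrom_start: int = 0) -> Tuple[int, int]:
    """Genomic span of a 1-based bin (assumed convention, inferred from the example)."""
    begin = chrom_start + (bin_index - 1) * resolution
    return begin, begin + resolution

example = """chr1 1 0 1000 chr1 44 43000 44000
chr1 9 8000 9000 chr1 16 15000 16000
chr1 17 16000 17000 chr1 44 43000 44000"""

tads = parse_tads(example)
for t in tads:
    # The stored coordinates should match what the bin indexes imply at 1 kb.
    assert bin_to_coords(t.start_bin, 1000) == (t.start_bin_begin, t.start_bin_end)
    assert bin_to_coords(t.end_bin, 1000) == (t.end_bin_begin, t.end_bin_end)
    print(f"{t.chrom1}:{t.start_bin_begin}-{t.end_bin_end} spans bins {t.start_bin}..{t.end_bin}")
```

The full extent of a TAD runs from the 3rd column (start of the start bin) to the 8th column (end of the end bin).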


Interpret Result

An example result and its input Hi-C contact map are shown on the left; the coding tree formed from the example result is shown on the right.
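The nesting that the coding tree expresses can also be recovered from a result file. The sketch below is one way to do it in Python, assuming the TADs are properly nested (no partial overlaps) and are given as inclusive (start bin, end bin) pairs; build_hierarchy and print_tree are hypothetical helper names, not part of SuperTAD.

```python
def build_hierarchy(intervals):
    """Map each TAD index to the indexes of its direct children.

    intervals : list of (start_bin, end_bin), inclusive, assumed properly nested.
    Key -1 stands for a virtual root that holds the top-level TADs.
    """
    # Visit outer intervals before the intervals they contain.
    order = sorted(range(len(intervals)),
                   key=lambda i: (intervals[i][0], -intervals[i][1]))
    children = {-1: []}
    stack = []  # indexes of currently open (enclosing) intervals
    for i in order:
        start, _ = intervals[i]
        # Close every interval that ends before this one starts.
        while stack and intervals[stack[-1]][1] < start:
            stack.pop()
        parent = stack[-1] if stack else -1
        children.setdefault(parent, []).append(i)
        children.setdefault(i, [])
        stack.append(i)
    return children

def print_tree(intervals, children, node=-1, depth=0):
    """Print the hierarchy with two spaces of indentation per level."""
    for child in children[node]:
        start, end = intervals[child]
        print("  " * depth + f"bins {start}-{end}")
        print_tree(intervals, children, child, depth + 1)

# Bin pairs taken from the example output in the previous section.
intervals = [(1, 44), (9, 16), (17, 44)]
tree = build_hierarchy(intervals)
print_tree(intervals, tree)
```

For the three example rows, the TAD covering bins 1 to 44 becomes the parent of the two TADs covering bins 9 to 16 and 17 to 44.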


Examples

./build/SuperTAD binary ./data/example_sub_GM12878_chr19_KR25kb_matrix.txt --chrom1 chr19 -r 25000 --chrom1-start 30000000
This command runs binary mode (SuperTAD) on the GM12878 chr19 contact map at 25 kb resolution and saves all TADs to example_sub_GM12878_chr19_KR25kb_matrix.txt.binary.original.tsv.
Because --no-filter is not given, node filtering runs by default and the filtered TADs are saved to example_sub_GM12878_chr19_KR25kb_matrix.txt.binary.filter.tsv.

./build/SuperTAD multi ./data/example_sub_GM12878_chr19_KR25kb_matrix.txt -h 2 --chrom1 chr19 -r 25000 --chrom1-start 30000000
This command runs multi mode (SuperTAD(h)) with h = 2 on the GM12878 chr19 contact map at 25 kb resolution and saves all TADs to example_sub_GM12878_chr19_KR25kb_matrix.txt.multi.tsv.

./build/SuperTAD filter ./data/example_sub_GM12878_chr19_KR25kb_matrix.txt -i ./data/example_sub_GM12878_chr19_KR25kb_matrix.txt.binary.original.tsv
This command runs node filtering on its own for the TADs in the result file given by -i and saves the selected TADs to *.binary.filter.tsv.

./build/SuperTAD compare ./data/example_sub_GM12878_chr19_KR25kb_matrix.txt.multi.tsv ./data/example_sub_IMR90_chr19_KR25kb_matrix.txt.multi.tsv
This command computes the overlapping ratio between the two results.
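When the commands above are run from a script, it can help to assemble the argument list and the expected output paths in one place. The Python sketch below does only that and never launches the program; it uses the flags listed under Usage, and the helper names supertad_command and expected_outputs, as well as the default binary path ./build/SuperTAD, are assumptions for illustration. The output names follow the Input and Output section and assume the default working directory (the directory of the input matrix).

```python
import shlex

def supertad_command(mode, matrix, height=None, chrom=None, resolution=None,
                     chrom_start=None, no_filter=False, binary="./build/SuperTAD"):
    """Build the argument list for the binary or multi mode."""
    if mode not in ("binary", "multi"):
        raise ValueError("mode must be 'binary' or 'multi'")
    cmd = [binary, mode, matrix]
    if mode == "multi" and height is not None:
        cmd += ["-h", str(height)]
    if mode == "binary" and no_filter:
        cmd.append("--no-filter")
    if chrom is not None:
        cmd += ["--chrom1", chrom]
    if resolution is not None:
        cmd += ["-r", str(resolution)]
    if chrom_start is not None:
        cmd += ["--chrom1-start", str(chrom_start)]
    return cmd

def expected_outputs(mode, matrix, no_filter=False):
    """Result files the run should produce next to the input matrix."""
    if mode == "multi":
        return [matrix + ".multi.tsv"]
    outputs = [matrix + ".binary.original.tsv"]
    if not no_filter:
        outputs.append(matrix + ".binary.filter.tsv")
    return outputs

matrix = "./data/example_sub_GM12878_chr19_KR25kb_matrix.txt"
cmd = supertad_command("multi", matrix, height=2, chrom="chr19",
                       resolution=25000, chrom_start=30000000)
print(shlex.join(cmd))
print(expected_outputs("multi", matrix))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the binary has been built.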