Genome folding in evolution and disease

5.5 Methods

5.5.1 CTCF motifs in the human genome

The recognition motif of CTCF is well defined and available from the JASPAR database (MA0139.1) (Mathelier et al. 2015). We downloaded TF binding site predictions with the CTCF motif (MA0139.1) in the human genome hg19 from the JASPAR database (http://expdata.cmmt.ubc.ca/JASPAR/downloads/UCSC_tracks/2018/hg19/tsv/MA0139.1.tsv.gz). Motif hits were filtered for p-value ≤ 2.5 x 10^-6, resulting in 38,316 highly significant CTCF motif hits genome-wide and 717,137 motif pairs within 1 Mb genomic distance that are considered as potential loop interaction anchors in this study.

5.5.2 Loop interaction data for training and validation

For training and validating the prediction model we used 9,448 published loops derived from high-resolution in-situ Hi-C experiments (Rao et al. 2014) and 206,399 CTCF and Pol2 ChIA-PET interactions (Tang et al. 2015) in human GM12878 cells. We considered each CTCF motif pair as positive (true looping interaction) if there was at least one measured looping interaction for which each loop anchor overlapped one of the CTCF motifs. Overlaps were calculated using the R package InteractionSet (Lun et al. 2016). This resulted in 30,025 (4.2%) of 717,137 candidate motif pairs that were labeled as true looping interactions in GM12878. For the prediction validation in HeLa cells we used the 3,094 Hi-C loops and 402,722 ChIA-PET interactions for CTCF and Pol2 in HeLa from the same studies (Rao et al. 2014; Tang et al. 2015) and labeled 12,480 (1.7 %) of motif pairs as true loops in HeLa cells.

5.5.3 ChIP-seq datasets in GM12878 cells

We downloaded publicly available ChIP-seq data from the ENCODE data portal (Dunham et al. 2012; Davis et al. 2017) by requiring the assay to be ChIP-seq, the target to be a transcription factor, the biosample term name to be GM12878, the genome assembly to be hg19, and the file-type to be bigWig. Furthermore, we filtered the data to have output type ‘fold change over control’ or ‘signal’ and to be built from two replicates. Then we selected for each TF only one unique experiment as bigWig file with either output type ‘fold change over control’ or, if unavailable, output type ‘signal’. This resulted in 124 ChIP-seq experiments for different TFs (Table S1). ChIP-seq data for HeLa were retrieved analogously and filtered for the selected targets: RAD21, CTCF, ZNF143, STAT1, EP300, and ZNF143 (Table S2).

5.5.4 ChIP-seq data types

To analyze the effect of different ChIP-seq signal types and other genomic assays on loop prediction performance, we selected five TFs (ZNF143, STAT1, SMC3, RAD21, and CTCF) and downloaded the mapped reads of ChIP-seq experiments as BAM files from the ENCODE data portal (Davis et al. 2017) and from the UCSC Genome Browser (Hinrichs et al. 2006). Furthermore, we downloaded signal tracks as bigWig files for ChIP-seq input control experiment and DNase-seq experiments in GM12878 cells. File accession identifiers and download links are provided in Table S3. We used the ChIP-seq peak caller Q (Hansen et al. 2015) with option ‘-w’ for each human chromosome to generate signal tracks in BED format of shifted reads and qfrags. ‘Shifted reads’ are counts of mapped reads that are shifted in 5’ direction by half of the estimated fragment size. ‘qfrags’ are pairs of forward and reverse mapped reads within a given distance (Hansen et al. 2015) and are shown to improve signal to noise ratio in ChIP-seq peak calling (Hansen et al. 2015). We then combined resulting BED files from all chromosomes and converted them to the bedGraph and bigWig formats using the bedtools (Quinlan and Hall 2010) and bedGrpahtoBigWig tools from the UCSC Genome Browser (Kent et al. 2010).

5.5.5 ChIP-nexus data processing for RAD21 and SCM3

ChIP-nexus data for RAD21 and SMC3 in GM12878 cells were published recently (Tang et al. 2015). We downloaded the corresponding raw reads from the Sequence Read Archive (SRA) (Run IDs SRR2312570 and SRR2312571). Reads were processed using felxcut for barcode removal and adapter trimming as recommended in the user guide of the Q-nexus tool (Hansen et al. 2016). Reads were than mapped to human genome hg19 using Bowtie version 2.3.2 with default settings. Duplicate reads were removed using nexcat (Hansen et al. 2016). Finally, we created shifted-reads and qfraq profiles using Q-nexus (Hansen et al. 2016) with options ‘–nexus-mode’ and ‘-w’ for each chromosome and combined them to bigWig files as described above.

5.5.6 Similarity of ChIP-seq profiles as correlation of coverage around motifs

For each CTCF sequence motif in the human genome, we quantified the number of reads overlapping each base within +/- 500 bp around the motif center. This results in a vector \(x_i = (x_{i,1}, x_{i,2}, ..., x_{i,n})\) where \(x_{i,k}\) is the coverage signal at position \(k\) around CTCF motif \(i\). Coverage vectors for motif hits reported on the minus strand were reversed because CTCF motif sites are assumed to be symmetrically aligned to each other when cooperating at loop anchors (Fig. 5.1A) (Rao et al. 2014; Tang et al. 2015; Guo et al. 2015; Wit et al. 2015; Sanborn et al. 2015). For all considered pairs of CTCF motifs \(i\) and \(j\), we calculated the ChIP-seq profile similarity as Pearson correlation coefficient \(r_{i,j}\) of the corresponding coverage vectors \(x_i\) and \(x_j\).

5.5.7 Genomic sequence features of chromatin loops

Besides the correlation of ChIP-seq profiles, we used genomic features of motif pairs as features to predict interactions. The distance d is the number of bp between the two motif centers. The categorical variable orientation o is either, convergent, forward, reverse, or divergent, depending on the orientation of CTCF motifs in the pair (+-, ++, –, and -+, respectively). The motif hit similarity s is the minimum of the two motif hit scores in each pair; we derived these motif scores from the JASPAR motif hit tracks as \(-\log_{10}\) transformed p-values (Khan et al. 2018).

5.5.8 Chromatin Loop prediction model

We used a logistic regression model to predict the log-likelihood probability of CTCF motif pairs to perform chromatin looping interactions. The probability \(p\) that two sites interact is modeled as: \[ \ln (\frac{p}{1-p}) = \beta_0 + \beta_1 x_1, + ... + \beta_k x_k \] where \(\beta\) are the unknown model parameters and \(x_1, ..., x_k\) the features.

More specifically, for the 7C model with a single ChIP-seq experiment as input, the logistic regression model for the interaction probability p is: \[ \ln (\frac{p}{1-p}) = \beta_0 + \beta_1 d, + \beta_2 o + \beta_3 s + \beta_4 r \]

Parameters were estimated using the function ‘glm()’ with option ‘family=binomial()’ in R during model training as described below.

5.5.9 Training and validation of prediction model

We used the R package rsample for 10-fold cross-validation. Thereby, we randomly split the dataset of CTCF motif pairs into ten equal sized subsets. For each round of cross-validation one subset is held out (test dataset) and the model parameters are trained on the remaining 90% of the samples (training dataset). The model parameters are shown for six selected TFs and combined models in Figure S1A. For each split, the performance of the model is than evaluated on the test dataset. For prediction performance in HeLa cells, we trained on all motif pairs using ChIP-seq and true loops from GM12878 cells and evaluated performance on all motif pairs using the true loop data in HeLa.

5.5.10 Analysis of prediction performance

We quantified prediction performance using the receiver operating characteristic (ROC) and precision recall curves (PRC) as implemented in the R package precrec (Saito and Rehmsmeier 2016).

Given the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), the sensitivity is defined as TP/(TP+FN), specificity as TN/(TN+FP), precision as TP/(TP+FP), and recall as TP/(TP+FN). For each cross-validation split, the area under the curve is computed separately, and the mean across splits together with the standard deviation reported. To get binary prediction outputs, we computed the f1-score as harmonic mean of precision and recall for all prediction scores on all cross-validation folds using the R package ROCR (Sing et al. 2005). Then we computed the prediction score that maximizes the f1-score as default cutoff for binary prediction output (Fig. D.1B).

5.5.11 Implementation of 7C and compatibility to other tools

We implemented 7C as R package, termed sevenC, by using existing infrastructure for chromatin interaction data from the interactionSet package (Lun et al. 2016) and functionality for reading bigWig files from the rtracklayer package (Lawrence et al. 2009) from the Bioconductor project (Huber et al. 2015). Predicted loops can be written as interaction tracks for visualization in the WashU Epigenome Browser (Zhou et al. 2013) or as BEDPE format using the GenomicInteractions package (Harmston et al. 2015) for visualization in the Juicebox tool (Durand et al. 2016). The package is freely available and easy to install from GitHub under https://ibn-salem.github.io/sevenC/ and has been submitted to the Bioconductor project (Huber et al. 2015). All analysis presented in this work were implemented in R and all scripts used have been made available in a separate GitHub repository: https://github.com/ibn-salem/sevenC_analysis.