Genome folding in evolution and disease

2.2 Materials and methods

2.2.1 Selection of pairs of paralog genes

All human genes and human paralog gene pairs were retrieved from Ensembl GRCh37 (Ensembl 75) database by using the biomaRt package (Durinck et al. 2009 b, 2005) from within the statistical programming environment R. For each gene we downloaded the Ensembl gene ID, HGNC symbol, transcription sense, transcription start site (TSS) coordinates, and gene length. We only considered protein coding genes with “KNOWN” status that are annotated in the 22 autosomes or the 2 sexual chromosomes. For each gene we used the earliest TSS coordinate. Within this set of genes, all pairs of human paralog genes were downloaded from Ensembl (Vilella et al. 2009). This resulted in a total of 19,430 human genes; more than half of those had at least one human paralog gene (Fig. A.1A).

However, many human genes have more than one paralog (Fig. A.1B). To avoid overrepresentation of genes, we filtered the pairs such that each gene occurred only once. Thereby we selected the pairs by minimizing the rate of synonymous mutations (dS) between them using a maximum-weighted matching graph algorithm implement in the python package NetworkX (Galil 1986). The number of synonymous mutations between paralogs has been used to approximate the duplication age (Lan and Pritchard 2016). Therefore our implementation favours the selection of young paralog pairs for larger paralog families and guaranties that each gene occurs only once. This filtering strategy resulted in \(6256\) unique paralog pairs for downstream analysis (Table 2.1). We observed that modications of this strategy to select unique paralog genes did not affect essentially the results of our study (e.g. by selecting pairs while maximising dS; Fig. A.2).

Analogously to the human data we downloaded all pairs of protein coding paralog genes from the Mus musculus (GRCm38.p2) and Canis lupus familiaris (CanFam3.1) genomes from Ensembl. The numbers of filtered gene pairs are shown in Table 2.1 . Furthermore, we related human paralog genes to orthologs in mouse and dog only if there was a unique one-to-one orthology relationship reported in the Ensembl database.

Table 2.1: Filtering of human paralog gene pairs
Paralog pairs	Human	Mouse	Dog
All paralog pairs	46546	110490	28293
One pair per gene	6256	7323	4959
On the same chromosome	1560	2397	658
Close pairs (TSS distance \(\leq\) 1 Mb)	1114	1774	455
Distal pairs (TSS distance \(>\) 1 Mb)	446	623	203

2.2.2 Enhancers to gene association

Human enhancer annotations, including their genome locations and the corresponding genes they regulate, were obtained from the supplementary data of a recent CAGE analysis (Andersson et al. 2014). In this study, the activity of enhancers and genes was correlated within 500kb over hundreds of human cell types to provide a regulatory interaction map between 27,451 enhancers and 11,604 genes consisting of 66,942 interactions.

2.2.3 Topological associating domains

We obtained topological associating domain (TAD) calls from two recently published Hi-C studies in human cells (Dixon et al. 2012; Rao et al. 2014). TAD locations mapped to the hg18 genome assembly were converted to hg19 using the UCSC liftOver tool (Hinrichs et al. 2006). A/B-compartment and sub-compartment annotations were obtained from high-resolution Hi-C experiments in human GM12878 cells (Rao et al. 2014).

2.2.4 Hi-C interaction maps

Individual chromatin-chromatin contact frequencies from IMR90 cells at 5 kb resolution were retrieved from (Rao et al. 2014)(NCBI GEO accession: GSE63525). We used only reads with mapping quality \(\geq\) 30 and normalized the raw contact matrices applying the provided normalization vectors for KR normalization by the matrix balancing approach (Knight and Ruiz 2013). We only considered pairwise gene interactions if the TSSs of the two genes were located in different bins of the Hi-C matrix with normalized contacts \(\geq\) 0. Capture Hi-C data between promoter regions in human GM12878 cells were downloaded from ArrayExpress (accession: E-MTAB-2323) (Mifsud et al. 2015).

2.2.5 Randomization

We analysed the distribution of paralog pairs over chromosomes depending on the linear distance between them. For doing so, we sampled gene pairs from all human genes with equal and independent probability and refer to them as random gene pairs.

For strand analysis, co-localisation in TADs, and Hi-C contact quantification between paralog pairs, we constructed a carefully sampled control set of gene pairs as null-model. Thereby we accounted for the linear distance bias observed for paralog pairs. First, we calculated all possible non-overlapping pairs of human genes on the same chromosome. From the resulting set of gene pairs we randomly sampled pairs according to the linear distance distribution of paralog gene pairs. Therefore, we assigned to each gene pair a sampling weight that is proportional to the probability to sample the pair. The sampling weight \(w(g_{i}, g_{j})\) for a given pair of genes \(g_{i}\) and \(g_{j}\) with absolute distance \(d_{i,j}\) is defined as

\[ w(g_{i}, g_{j}) = \frac{ f_{\mathrm{paralogs}}(d_{i,j}) }{f_{\mathrm{all}}(d_{i,j})} \]

where \(f_{\mathrm{paralogs}}\) is the observed frequencies of distances in the paralog genes and \(f_{\mathrm{all}}(d_{i,j})\) the frequency of pairwise distances in the population of gene pairs from which we sample. We computed the observed frequencies by dividing the distances into 90 equal-sized bins after \(log_{10}\) distance transformation and counted occurrences of gene pairs for each bin. The resulting sampling weights for all gene pairs are normalized to sum up 1 and were then used as probabilities for sampling:

\[ p_{\mathrm{dist}}(g_{i}, g_{j}) = \frac{ w(g_{i}, g_{j}) }{ \sum_{i,j} w(g_{i}, g_{j}) } \tag{2.1} \]

Next, for comparison of shared enhancers we slightly modified the sampling of gene pairs to account for the observation that paralogs tend to be associated to more enhancers than non-paralogs (Fig. A.1D). Assuming that the number of enhancers associated to genes is independent from the distance, we computed sampling probabilities by \[ p_{\mathrm{dist+eh}}(g_{i}, g_{j}) = p_{\mathrm{dist}}(g_{i}, g_{j}) \cdot p_{\mathrm{eh}}(n_{i}) \cdot p_{\mathrm{eh}}(n_{j}) \] whereby \(n_{i}\) and \(n_{j}\) are the number of enhancers associated to \(g_{i}\) and \(g_{j}\), respectively and \(p_{\mathrm{eh}}(n)\) is the probability to sample a gene associated to n enhancers:

\[ p_{\mathrm{eh}}(n) = \frac{ w_{\mathrm{eh}}(n) }{ \sum_{i=0}^{N} w_{\mathrm{eh}}(i) } \tag{2.2} \]

and \[ w_{\mathrm{eh}}(n) = \frac{ f_{\mathrm{paralogs}}(n) }{f_{\mathrm{all}}(n)} \]

where \(f_{\mathrm{paralogs}}(n)\) and \(f_{\mathrm{all}}(n)\) gives the frequency of genes associated to \(n\) enhancers observed in the paralog pairs and all gene pairs, respectively.

Analogously, we sampled sets of pairs accounting additionally for the observed bias in paralog pairs to be in the same strand. \[ p_{\mathrm{dist+eh+strand}}(g_{i}, g_{j}) = p_{\mathrm{dist}}(g_{i}, g_{j}) \cdot p_{\mathrm{eh}}(n_{i}) \cdot p_{\mathrm{eh}}(n_{j}) \cdot p_{\mathrm{strand}}(s_{i,j}) \] whereby \(s_{i,j}\) is \(1\) if both genes, \(g_{i}\) and \(g_{j}\), are transcribed from the same strand and \(0\) otherwise. The probability \(p_{\mathrm{strand}}(s_{i,j})\) is computed in the same way as the probability by number of enhancers \(p_{\mathrm{eh}}(n)\) in equation (2.2).

Lastly, we sampled a set of gene pairs by taking additionally the gene length into account and computed sampling probabilities by \[ p_{\mathrm{dist+eh+len}}(g_{i}, g_{j}) = p_{\mathrm{dist}}(g_{i}, g_{j}) \cdot p_{\mathrm{eh}}(n_{i}) \cdot p_{\mathrm{eh}}(n_{j}) \cdot p_{\mathrm{len}}(l_{i}) \cdot p_{\mathrm{len}}(l_{j}) \]

whereby \(p_{\mathrm{len}}(l)\) for gene length \(l\) is computed in the same way as for distances between gene pairs (equation (2.1)) and by dividing gene lengths into 20 equal sized binds after \(log_{10}\) transformation of gene lengths in bp.

For each paralog pair on the same chromosome within 1 Mb distance, we sampled \(10\) random gene pairs with this procedures each resulting in \(n=156,000\) sampled gene pairs that served as background in our statistical analysis. These sampling approaches resulted in similar distribution of linear distances (Fig. A.3), associated enhancers of each gene (Fig. A.4), same strand (Fig. A.5), and gene lengths (Fig. A.6).

2.2.6 Statistical tests

We compared observed fractions of gene pairs, on the same chromosome, with the same transcription sense, within the same TAD or compartment, and with at least one shared enhancer between pairs of paralogs and random or sampled pairs using the Fisher’s exact test. Hi-C contact frequencies and genomic distances between TSS of gene pairs were compared using a Wilcoxon rank-sum test. All analyses were carried out in the statistic software R version 3.2.2.