Genome folding in evolution and disease

3.5 Methods

3.5.1 Rearrangement breakpoints from whole-genome alignments

Rearrangement breakpoints were identified between human and 12 selected vertebrate species from whole-genome-alignment data (Table 3.1). Alignment data were downloaded as net files from UCSC Genome Browser for human genome hg38 and the genomes listed in Table 3.1. The whole-genome data consists of consecutive alignment blocks that are chained and hierarchically ordered in the so-called nets (Kent et al. 2003). Chains represent blocks of interrupted syntenic regions and may include larger gaps. When hierarchically arranged in a net file, child chains can complement their parents when they align nearby segments that fill the alignment gaps of their parents but may also break the synteny when incorporating distal segments. We implemented a computer program to extract rearrangement breakpoints from net files based on the length and type of fills. Start and end points of top-level or non-syntenic fills are reported as rearrangement breakpoint if the fill exceeds a given size threshold. We used different size thresholds to optimize both the number of identified breakpoints and to avoid biases of transposable elements that might be responsible for many small interruptions of alignment chains. In this way, we extracted rearrangement breakpoints between human and 12 genomes using size thresholds of 10 kb, 100 kb, and 1000 kb. To compare breakpoints to TADs we converted the breakpoint coordinates from hg38 to hg19 genome assembly using the liftOver tool from UCSC Genome Browser (Hinrichs et al. 2006).

Table 3.1: Species used for breakpoint identification from whole-genome alignments with human.
Common name	Species	Genome Assembly	Divergence to human (mya)
Chimpanzee	Pan troglodytes	panTro5	6.65
Gorilla	Gorilla gorilla gorilla	gorGor5	9.06
Orangutan	Pongo abelii	ponAbe2	15.76
Rhesus	Macaca mulatta	rheMac8	29.44
Mouse lemur	Microcebus murinus	micMur2	74
Mouse	Mus musculus	mm10	90
Cattle	Bos taurus	bosTau8	96
Manatee	Trichechus manatus latirostris	triMan1	105
Opossum	Monodelphis domestica	monDom5	159
Chicken	Gallus gallus	galGal5	312
Clawed frog	Xenopus tropicalis	xenTro7	352
Zebrafish	Danio rerio	danRer10	435

3.5.2 Topologically associating domains and contact domains

We obtained topologically associating domain (TAD) calls from published Hi-C experiments in human embryonic stem cells (hESC) (Dixon et al. 2012) and contact domains from published in situ Hi-C experiments in human GM12878 cells (Rao et al. 2014). Genomic coordinates of hESC TADs were converted from hg18 to hg19 genome assembly using the UCSC liftOver tool (Hinrichs et al. 2006).

3.5.3 Breakpoint distributions at TADs

To quantify the number of breakpoints around TADs and TAD boundaries we enlarged TAD regions by 50% of their total length on each side. The range was then subdivided into 20 equal sized bins and the number of overlapping breakpoints computed. This results in a matrix in which rows represent individual TADs and columns represent bins along TAD regions. The sum of each column indicates the number of breakpoints for corresponding bins and therefore the same relative location around TADs. For comparable visualization between different data sets, the column-wise summed breakpoint counts were further normalized as percent values of the total breakpoint number in the matrix.

3.5.4 Quantification of breakpoint enrichment

To quantify the enrichment of breakpoints at domain boundaries, we generated random breakpoints as background control. For each chromosome, we placed the same number of actual breakpoints at a random position of the chromosome. For each breakpoint data set we simulated 100 times the same number of random breakpoints. We then computed the distribution of random breakpoints around TADs in the same way as described above for actual breakpoints. To compute enrichment of actual breakpoints compared to simulated controls, we classified each breakpoint located in a window of 400 kb around TAD borders in either close to a TAD boundary, if distance between breakpoint and TAD boundary was smaller or equal to 40 kb or as distant, when distance was larger than 40 kb. This results in a contingency table of actual and random breakpoints that are either close or distal to TAD boundaries. We computed log odds ratios as effect size of enrichment and p-values according to Fishers two-sided exact test. Additionally, we compared the distance of all actual and random breakpoints to their nearest TAD boundary using the Wilcoxon’s rank-sum test.

3.5.5 Expression data for mouse and human orthologs

Promoter based expression data from CAGE analysis in human and mouse tissues from the FANTOM5 project (Forrest et al. 2014) were retrieved from the EBI Expression Atlas (Hinrichs et al. 2006) as baseline expression values per gene and tissue. The meta data of samples contains tissue annotations as term IDs from Uberon, an integrated cross-species ontology covering anatomical structures in animals (Herrero et al. 2016). Human and mouse samples were assigned to each other if they had the same developmental stage and matching Uberon term IDs. This resulted in 19 samples for each organism with corresponding tissues.

We used the R package biomaRt to retrieve all human genes in the Ensembl database (version grch37.ensembl.org) and could assign 13,065 to ortholog genes in mouse by allowing only the one-to-one orthology type (Herrero et al. 2016). Of these ortholog pairs, 12,696 are contained in the expression data described above. For each pair of orthologs we computed the correlation of expression values across matching tissues as Pearson’s correlation coefficient.

3.5.6 Classification of TADs and genes according to rearrangements and GRBs

We classified hESC TADs according to rearrangements between human and mouse genomes. We define a TAD as conserved if it is completely enclosed within a fill in the net file and no rearrangement breakpoint from any size threshold is located in the TAD region with a distance larger than 80 kb from the TAD boundary. A TAD is defined as rearranged, if the TAD is not enclosed completely by any fill in the net file, overlaps at least one breakpoint inferred using a 1000 kb fill size threshold, and this breakpoint is further than 80 kb away from each TAD boundary. TADs were also classified according to their overlap with GRBs as in (Harmston et al. 2017). A given TAD is a GRB-TAD if it overlaps with more than 80% of the TAD size with a GRB. A TAD is classified as non-GRB if it has less than 20% overlap with GRBs. The 12,696 human genes with mouse ortholog and expression data were grouped according to their location with respect to hESC TADs. We used the transcription start site (TSS) of the longest transcript per gene to group each gene as within TAD if the TSS overlaps a hESC TAD or as outside TADs, if not. Furthermore, we grouped genes in TADs according to conserved or rearranged TADs and separately according to GRB and non-GRB TADs.

3.5.7 Source code and implementation details

The source code of the entire analysis described here is available on GitHub: https://github.com/Juppen/TAD-Evolution. The identification of breakpoints and extraction of fills from whole-genome alignment data was implemented in Python scripts. Reading of BED files and overlap calculations with TADs and TAD bins were computed in R with Bioconductor (Huber et al. 2015) packages rtracklayer (Lawrence et al. 2009) and GenomicRanges (Lawrence et al. 2013). Gene coordinates and ortholog assignments were retrieved from Ensemble data base (version grch37.ensembl.org) using the package biomaRt (Durinck et al. 2009 a). For data integration and visualization we used R packages from tidyverse (Wickham and Grolemund 2017).