Genome folding in evolution and disease

2.4 Discussions

The generation of large gene expression datasets spanning multiple tissues allowed the observation of clusters of pairs and triplets of co-expressed genes in higher eukaryotes, e.g. in Drosophila (Boutanaev et al. 2002) and in mammals (Purmann et al. 2007). It was previously suspected that chromatin structure was involved in this clustering (Sproul et al. 2005), particularly through cis-acting units (Purmann et al. 2007). The discovery and characterization of topologically associating domains (TADs) has finally brought to light a chromatin structure that could be responsible for this co-regulation.

To study the interplay between TADs, gene co-regulation and evolution in the human genome, we decided to focus on pairs of paralogs because they tend to be produced by tandem duplication (Newman et al. 2015) and, because of their homology, encode proteins with related functions. However, the particular emergence and evolution of paralogs are probably responsible for the special properties that, as we described, distinguish them from non-paralog genes: greater gene length, more enhancers, and a shorter distance to the nearest enhancer. These differences, which could be partially explained by the observation that paralogs are more often tissue-specific (Fig. A.1F), complicated the choice of meaningful control pairs (see section 2.2).

Once we had generated appropriate background sets, we could study the position of pairs of paralogs with respect to TADs. This allowed us to test, on the one hand, the resilience of TADs to genome shuffling and, on the other hand, the rate of accommodation and gain of functionally related genes. Possibly, the generation of paralogs by tandem duplication continuously imposes a strain on the pre-existing genomic and regulatory structure, but also offers a chance for the evolution of new functionality.

On the one hand, we observed many pairs of paralogs within TADs. On the other hand, pairs of paralogs in different TADs, however distant from each other, tend to have more contacts than control gene pairs. This suggests a multi-step mechanism: first, tandem duplication produces paralog pairs that fit within the TAD structure; then, subsequent chromosomal rearrangements relocate the paralogs to larger distances while their contacts are kept; eventually, reorganization of regulatory control allows their increased independence, so that they may even end up on different chromosomes, where contact is no longer necessary. Thus, TADs are units of co-regulation but do not show a strong preference for retaining co-regulated genes within them during evolution. This model agrees with the recent work of Lan and Pritchard, who reported that young pairs of paralogs are generally close in the genome (Lan and Pritchard 2016).

A second effect that we observed was that close pairs of paralogs have fewer contacts than comparable pairs of non-paralog genes, particularly if they are in the same TAD (Fig. 2.4B), while sharing more enhancers (Fig. 2.4E). This result could reflect the existence of pairs of paralogs encoding proteins that replace each other, for example subunits that occupy the same position in a protein complex but are expressed in different cells. One such case is exemplified by CBX2, CBX4 and CBX8, which occupy neighbouring positions within the same TAD on human chromosome 17 and encode replaceable subunits of the polycomb repressive complex 1 (PRC1), which is involved in the epigenetic regulation of cell specification (Becker et al. 2015). The expression of such groups of paralogs requires active coordination to ensure the exclusive expression of only one gene, or a subset of genes, per condition, resulting in patterns of divergent expression. Since there might also be conditions where none of these genes is expressed, such divergent expression patterns are different from negative correlation.
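The distinction between divergent expression and negative correlation can be illustrated with a minimal sketch. The expression values below are invented for illustration and are not taken from our data; the function names `pearson` and `mutually_exclusive` are hypothetical helpers, not part of the analysis described in this chapter.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def mutually_exclusive(x, y, threshold=0.0):
    """True if no condition has both genes expressed above the threshold."""
    return not any(a > threshold and b > threshold for a, b in zip(x, y))


# Toy expression values (invented) across six conditions: each gene is
# expressed in exactly one condition, and neither is expressed in the
# remaining four.
gene_a = [1, 0, 0, 0, 0, 0]
gene_b = [0, 1, 0, 0, 0, 0]

r = pearson(gene_a, gene_b)
exclusive = mutually_exclusive(gene_a, gene_b)
print(f"Pearson r = {r:.2f}, mutually exclusive = {exclusive}")
```

In this toy case the two genes are never expressed together, yet their correlation is only about -0.2, because the conditions in which both are silent count as agreement. A correlation-based measure alone would therefore understate this kind of exclusive expression.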

Previous studies of the expression of duplicated genes have shown that paralogs tend to diverge in their expression after duplication (Makova and Li 2003; Huminiecki 2004; Rogozin et al. 2014), although some paralogs are co-expressed while others are negatively correlated across tissues (Makova and Li 2003). Our interpretation of these observations, together with our results, is that tandem duplication is advantageous because it situates the new copy in an environment that allows its controlled regulation, ideally under the same regulatory elements as the original copy; this can be attained by duplicating both the gene and its surrounding regulatory elements. This would preclude the duplication of genes with very entangled regulatory associations. Once the duplication has happened, if the new protein evolves into a replacement, then the regulatory constraints on its coding gene are strong and there would be a tendency to keep it in the vicinity of the older gene, so that a divergent pattern of expression can be ensured.

To support this hypothesis, we contrasted our data with the data collected in the HIPPIE database of experimentally verified human protein-protein interactions (Schaefer et al. 2012). We observed the well-known fact that paralog pairs generally encode proteins that interact more often than non-paralog proteins (Fig. A.14). But, most importantly, we observed that the chances of close pairs of genes encoding interacting proteins rise \(2.3\)-fold if they are in the same TAD while, in contrast, if these genes are paralogs the difference is much smaller (\(1.2\)-fold, Fig. A.14). We interpret this result as evidence for a significant population of within-TAD paralog pairs encoding non-interacting proteins, which supports our hypothesis that paralog pairs within the same TAD would have a tendency to encode proteins that replace each other.
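For readers who wish to reproduce this kind of comparison, a fold change of this type can be computed as the ratio between the fraction of interacting pairs inside the same TAD and the fraction of interacting pairs in different TADs. The sketch below is only illustrative: the function name `fold_change` is hypothetical, the counts are invented and do not correspond to the values behind Fig. A.14, and the exact definition used for that figure may differ.

```python
def fold_change(hits_same, total_same, hits_diff, total_diff):
    """Ratio of interacting-pair fractions: same TAD over different TADs.

    hits_*  : number of gene pairs whose proteins interact
    total_* : number of gene pairs in that group
    """
    if total_same == 0 or hits_diff == 0:
        raise ValueError("fold change is undefined for these counts")
    # (hits_same / total_same) / (hits_diff / total_diff), written as a
    # single division to limit floating-point rounding.
    return (hits_same * total_diff) / (total_same * hits_diff)


# Invented counts, for illustration only.
non_paralog_fold = fold_change(hits_same=30, total_same=200,
                               hits_diff=10, total_diff=200)
paralog_fold = fold_change(hits_same=60, total_same=200,
                           hits_diff=40, total_diff=200)

print(f"non-paralog pairs: {non_paralog_fold:.1f}-fold")
print(f"paralog pairs:     {paralog_fold:.1f}-fold")
```

With these invented counts, the paralog pairs interact more often overall but gain less from sharing a TAD, which is the qualitative pattern discussed above.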