6.5 Towards predicting regulatory pathomechanisms of structural variants
Structural variants, whether inherited or arising from de novo germline or somatic mutations, account for the majority of nucleotides that differ among human genomes (Sudmant et al. 2015; Auton et al. 2015). Through their ability to alter gene dosage, promote gene fusions, unmask recessive alleles, or disrupt associations between genes and their regulatory elements, structural variants are involved in numerous congenital syndromes and cancers (Stankiewicz and Lupski 2010). However, many structural variants affect only non-coding regions of the genome and are therefore difficult to interpret in molecular diagnostics.
In several syndromes, structural variants disrupt TADs and rewire regulatory interactions between enhancers and genes (Section 6.4). These effects suggest that molecular diagnostics can benefit from taking genome folding into account when identifying the underlying pathogenic mechanism.
In this thesis (Chapter 4) and in previous work (Ibn-Salem et al. 2014), we demonstrated how publicly available chromatin interaction data can be combined with diverse genomic and structured phenotype data to predict regulatory effect mechanisms and candidate genes. We discovered 16 genes for 11 DGAP cases that are top-ranking position effect candidates for the subjects’ clinical phenotypes. Importantly, we also applied our computational pipeline to DGAP cases with known pathogenic genes and correctly predicted 52 out of 57 genes (about 91%), indicating a high sensitivity of this approach. Our approach cannot prove causal relationships. Compared to more elaborate experimental screenings, however, the computational analysis is rapid and can provide additional information to benefit the clinical assessment of both coding and non-coding genome variants. Cost-efficient computational integration of genomic and phenotypic data is therefore a crucial step towards predicting the pathogenic consequences of genetic variation observed in prenatal samples.
However, the computational prediction method presented here is limited by the quality of the genomic and phenotypic data. An essential step in our analysis was quantifying the similarity between the phenotypes annotated to genes and the phenotypes observed in the subject. Such prioritization of candidate genes was only possible because the subjects' phenotypes were annotated in detail using the Human Phenotype Ontology (HPO). The HPO provides a standardized vocabulary of phenotypic abnormalities encountered in human disease and of the relations between them. Furthermore, the HPO links phenotype terms to genes using data from monogenic diseases (Köhler et al. 2014). The broad application of such deeply structured phenotypic information will be essential for correctly interpreting genomic variants that cause individual abnormalities observed in heterogeneous patient cohorts (Brookes and Robinson 2015). While continuously decreasing sequencing costs allow variant detection at high resolution in large cohorts and populations, detailed and precise phenotypic annotation often lags behind. Therefore, standardized phenotype vocabularies and electronic health records need to become the standard in clinical practice.
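The idea of comparing gene-associated phenotypes with a subject's phenotypes can be illustrated with a minimal sketch. It is not the scoring used in Chapter 4: it substitutes a plain Jaccard index over ancestor-closed term sets for illustration, and the toy ontology, the term labels, and the gene names `GENE_A` and `GENE_B` are hypothetical placeholders rather than real HPO identifiers or annotations.

```python
from typing import Dict, List, Set, Tuple

# Toy ontology: term -> direct parent terms.
# Hypothetical labels for illustration only, not real HPO identifiers.
PARENTS: Dict[str, Set[str]] = {
    "abnormality": set(),
    "limb_abnormality": {"abnormality"},
    "hand_abnormality": {"limb_abnormality"},
    "polydactyly": {"hand_abnormality"},
    "heart_abnormality": {"abnormality"},
    "septal_defect": {"heart_abnormality"},
}


def with_ancestors(terms: Set[str]) -> Set[str]:
    """Return the given terms together with all of their ancestors."""
    closed: Set[str] = set()
    stack = list(terms)
    while stack:
        term = stack.pop()
        if term not in closed:
            closed.add(term)
            stack.extend(PARENTS[term])
    return closed


def similarity(a: Set[str], b: Set[str]) -> float:
    """Jaccard index of the ancestor-closed term sets (an assumed, simple measure)."""
    closed_a, closed_b = with_ancestors(a), with_ancestors(b)
    union = closed_a | closed_b
    return len(closed_a & closed_b) / len(union) if union else 0.0


def rank_genes(subject: Set[str], gene_terms: Dict[str, Set[str]]) -> List[Tuple[str, float]]:
    """Rank genes by phenotype similarity to the subject, best match first."""
    scores = [(gene, similarity(subject, terms)) for gene, terms in gene_terms.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    subject_terms = {"polydactyly"}
    annotations = {"GENE_A": {"hand_abnormality"}, "GENE_B": {"septal_defect"}}
    for gene, score in rank_genes(subject_terms, annotations):
        print(gene, round(score, 3))
```

Closing each term set over its ancestors lets a specific subject phenotype match a more general gene annotation, which is why a structured ontology is more useful here than free-text phenotype descriptions.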
The crucial role of genome folding in determining the effect of genetic variants highlights the importance of taking genome folding data into account in clinical practice. However, an apparent limitation in predicting the effect of genetic variants on genome folding is the lack of high-resolution chromatin interaction data that are specific to the tissue of interest. While TAD boundaries are mostly stable between cell types (Dixon et al. 2012; Rao et al. 2014; Schmitt et al. 2016) and during differentiation (Dixon et al. 2015), many regulatory interactions are cell type-specific (Le Dily et al. 2014; Dixon et al. 2015). This tissue specificity raises the question of whether chromatin interaction data from other tissues can be used to predict changes in regulatory interactions in the cell types and tissues relevant for pathogenesis. Although genome-wide contact maps have become increasingly available in recent years, high heterogeneity between proximity-ligation methods makes it challenging to train prediction methods on data from different studies. Therefore, it will be crucial to computationally predict long-range interactions in the tissue of interest from other tissue-specific genomic data that are measured along the linear genome.
Another critical challenge is to predict not only the disrupted chromatin interactions but also the gained interactions and the resulting domain organization. Similar to the creation of neo-TADs by tandem duplication of TAD boundaries, inversions and translocations can create novel folding structures by connecting loci that were previously separated by TAD boundaries or even located on different chromosomes (Fig. 6.3). However, which regions in such fused TADs interact across rearrangement breakpoints is not clear and can currently only be measured experimentally in transgenic mouse models. Computational models that predict long-range interactions from genomic sequence could help here. A prediction model could be trained on the unaffected genome to learn the sequence features that are predictive of its folding. Applying the model to genomic sequence altered by structural variation could then predict the resulting folding structure. Such predictive modeling can help to better understand if and how new TADs are formed and which regulatory interactions are gained upon structural variation. We provide initial support for such an approach by developing 7C to predict chromatin looping interactions from protein binding data and genomic sequence features (Chapter 5). Approaches like these allow genome folding to be predicted while taking into account both the tissue specificity and the potentially altered genomic sequence. However, their application to predicting the folding of rearranged genomes still needs to be demonstrated.
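How an inversion can bring an enhancer and a gene into the same domain can be sketched with a deliberately naive coordinate model. The sketch assumes that TAD boundaries move with the inverted sequence and that domains simply re-form between neighboring boundaries, which is exactly the kind of assumption that remains to be tested experimentally or with sequence-based folding models. All coordinates and function names are hypothetical.

```python
from bisect import bisect_right
from typing import List


def tad_index(pos: int, boundaries: List[int]) -> int:
    """Index of the domain containing pos, given sorted boundary positions."""
    return bisect_right(boundaries, pos)


def invert(pos: int, start: int, end: int) -> int:
    """Map a position through an inversion of [start, end]; positions outside stay put."""
    return start + end - pos if start <= pos <= end else pos


def same_tad_after_inversion(
    enhancer: int, gene: int, boundaries: List[int], start: int, end: int
) -> bool:
    """Naive check whether enhancer and gene share a domain after the inversion.

    Assumption: boundaries inside the inverted segment are relocated with it,
    and domains are re-drawn between the resulting boundary positions.
    """
    new_boundaries = sorted(invert(b, start, end) for b in boundaries)
    new_enhancer = invert(enhancer, start, end)
    new_gene = invert(gene, start, end)
    return tad_index(new_enhancer, new_boundaries) == tad_index(new_gene, new_boundaries)


if __name__ == "__main__":
    boundaries = [100, 200, 300]  # hypothetical boundary coordinates
    enhancer, gene = 150, 250     # located in adjacent domains before the inversion
    print(tad_index(enhancer, boundaries) == tad_index(gene, boundaries))
    print(same_tad_after_inversion(enhancer, gene, boundaries, 160, 280))
```

In this toy example the inversion has one breakpoint in each of two adjacent domains, so the gene is moved across the relocated boundary into the enhancer's domain. The model says nothing about whether the two loci would actually interact, which is the open question raised above.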