Association of DNA modifications with somatic mutations in cancer


FACULDADE DE CIÊNCIAS

DEPARTAMENTO DE QUÍMICA E BIOQUÍMICA


João Pedro Agostinho de Sousa

Master's in Biochemistry, Specialization in Biochemistry

Dissertation supervised by:

Professor Doutor Francisco Rodrigues Pinto

Professora Doutora Ana Rita Grosso


I begin by thanking Doctor Ana Rita Grosso for supporting me through every stage of this work and for always remaining interested and engaged, even when we were in different laboratories. Her kindness, sense of humor, and knowledge made carrying out this work a pleasant experience and allowed me to develop skills that will be useful to me in the future.

I also thank Doctor Francisco Rodrigues Pinto for agreeing to be my internal supervisor and for always being available during the development of this work.

I thank Doctor Sérgio de Almeida for believing in my abilities and for always giving me freedom and support in his laboratory at the Instituto de Medicina Molecular. I also want to mention the important help of my laboratory colleagues, which made it possible to solve the problems that arose during the work: in particular, Ana Margarida Ferreira, Mafalda Matos, and João Sabino, who gave me the tools and knowledge needed to start the project, but also the remaining colleagues, Alexandra Vítor, Ana Duarte, Beatriz Lucas, Cristiana Morgado, Inês Faleiro, Lim Swert, Sílvia Carvalho, Sreerama Sridhara, and Robert Martin, and the others who took part in the laboratory's activities. I also thank the laboratories of Claus M. Azzalin, Nuno Morais, and Mário Ramirez for answering my questions and helping with whatever was needed; the departments of the Instituto de Medicina Molecular for welcoming me and providing all the tools for carrying out this work, with special thanks to the information systems department; and the board of the Instituto de Medicina Molecular for conveying motivation and an inquisitive spirit in scientific research.

I also want to thank my parents and family for always supporting me and believing in me, even in the most difficult moments. Without them, completing this work and finishing the degree would not have been possible. I also thank all my friends for standing by me throughout the year and for providing the support and distraction I needed at times of greatest stress and anxiety.

Finally, I want to thank the Serviços de Ação Social da Universidade de Lisboa (SAS-UL) for giving me the financial support that was so necessary for completing my master's degree. This program greatly eased the financial burden, and I am very grateful that this kind of funding exists and that the Universidade de Lisboa considers it important to support students in financial need in obtaining a higher education.


Cancer develops from the accumulation of mutations and epigenetic alterations that modify cellular patterns of gene expression, giving cells a selective growth advantage in their environment. Identifying the conditions and mechanisms that gave rise to those mutations and epigenetic alterations is extremely important for understanding carcinogenesis, and fundamental knowledge about the processes that control and/or disrupt the information contained in chromatin would make it possible to develop strategies to prevent, mitigate, and treat cancer. In recent years, the growth of new sequencing techniques has made it possible to identify features common to the various cancer types. For example, Hodgkinson et al. demonstrated, using sequencing data, that the density of somatic mutations was correlated among the three cancer types analyzed and also with germline mutations. This study highlighted that certain properties of the human genome can shape the distribution of mutations in a non-random way, and it encouraged Schuster-Böckler et al. to compare diverse genetic and epigenetic factors with mutation density, using samples from different cancer types. That analysis showed that regions with a high mutation frequency are related to a heterochromatin-associated histone modification, H3K9me3 (trimethylated histone H3 lysine 9). This modification accounted for more than 40% of the variation in mutation rate, indicating that chromatin structure strongly influences mutation density in human somatic cells. Another study, by Polak et al., investigated the distribution of mutations in several cancer types and compared it with the cell-type-specific distribution of epigenetic modifications. The results revealed that chromatin accessibility and modification, together with replication timing, explain up to 86% of the variation in mutation rates, with regions of high mutation density associated with a condensed chromatin structure. Surprisingly, the epigenetic features derived from the cell type of origin of the cancers were better determinants of the mutation distribution than the epigenetic features of the cancer cell lines tested. These studies highlighted the relevance of chromatin structure, and of the associated epigenetic modifications, to mutation density. However, recent studies suggest that epigenetic modifications may play a more direct role in shaping the distribution of mutations in cancer: an association has been observed between one type of epigenetic modification and two mechanisms that can affect the genome in opposite ways, R-loop formation and DNA repair. That epigenetic modification is 5-hydroxymethylcytosine (5hmC), a predominantly stable DNA modification that is present at high levels in the brain and in embryonic stem cells and that regulates several cellular and developmental processes, including the pluripotency of embryonic stem cells, neuron development, and tumorigenesis in mammals. Beyond these functions, this DNA modification has been observed to contribute to DNA repair and the maintenance of genome integrity. This contribution was first observed by the group of Jiali Li, when 5hmC was produced in mouse Purkinje cells, mediated by TET1 and dependent on ATM, in response to induced DNA damage. The same group later demonstrated that when DNA damage is induced by ultraviolet light or by a topoisomerase I inhibitor, the ATR kinase phosphorylates TET3 and promotes a moderate increase in global 5hmC levels. Although these studies did not reveal whether the increase in 5hmC is located specifically in the DNA regions where the damage occurred, Kafer et al. demonstrated the colocalization of 5hmC with several repair factors (53BP1, Rad51, γ-H2AX) in human HeLa, HCC827, and A594 cells, with TET2 being required for 5hmC production after damage induction, showing that enrichment of this DNA modification can occur specifically in the regions where the DNA was damaged. On the other hand, preliminary data recently obtained in the laboratory where this thesis was carried out indicated a connection between the presence of 5hmC and the formation of R-loops, hybrid DNA-RNA structures that form during transcription when the nascent RNA invades stretches of DNA with separated strands upstream of the RNA polymerase and then binds the complementary DNA strand. R-loops have important functions in transcription initiation and termination, but they are also threats to genome integrity. One way they can create genomic instability is when their inappropriate formation causes transcription and replication to collide and, consequently, generates DNA breaks that, if not properly repaired, can produce mutations. These results created an apparent conflict between the observed function of 5hmC in DNA repair and its association with R-loop formation, a potential source of genomic instability.

This conflict prompted a bioinformatic analysis to evaluate the density of somatic mutations in regions enriched for 5hmC, with the aim of clarifying the promising role of this modification in genomic instability and/or DNA repair and, ultimately, in the development or prevention of cancer. In addition, the same type of analysis was carried out for regions enriched for another cytosine modification, 5-methylcytosine (5mC), to distinguish how different types of DNA modification affect mutation density. The bioinformatic analysis was performed with sequencing data containing information on the distribution of 5hmC and on the location and type of mutations in the genomes of patients with acute myeloid leukemia (AML) and patients with skin cutaneous melanoma (SKCM), collected from public databases of previously published studies.

Using these data, a comparative analysis was conducted between the number of mutations in exons located in 5hmC-enriched regions and in exons located in regions without cytosine modifications (without 5mC and 5hmC). The results suggested that there may be a relationship between the presence of 5hmC and DNA repair mechanisms only for C>T mutations, since a significant reduction in the number of mutations of this type was observed in exons located in 5hmC-enriched regions. However, the analysis also revealed that the presence of 5hmC may be associated with genomic instability, since an increase in deletions was observed in AML patients in exons located in 5hmC-enriched regions. An increase in deletions was also observed, however, in exons located in 5mC-enriched regions, indicating that both types of modification may play a role in genomic instability. Indeed, there was a higher density of virtually all mutation types in exons located in 5mC-enriched regions, strengthening the previously observed correlation between mutations and repressive chromatin, since 5mC is known to play a role in heterochromatin regulation. In conclusion, the results did not clearly indicate a relationship between 5hmC and DNA repair or genomic instability. Further studies will be needed to understand the influence of 5hmC on the cancer genome and the role of DNA modifications in the formation of somatic mutations. This type of analysis should be extended with more cancers, a larger number of patients, diverse epigenetic modifications, chromatin accessibility data, and transcription profiles, so that all the processes that may be involved in mutation formation are considered and the mechanisms that affect mutation density in cancer genomes are determined. Even so, both this study and others that relate epigenetic modifications to mutations underline the need to investigate the fundamental and dynamic processes that shape and interact with the genome, so that in the future we have a detailed understanding of the mechanisms that lead to carcinogenesis.

Keywords: 5-hydroxymethylcytosine; somatic mutations; DNA repair; R-loops; cancer.


Cancer develops from an accumulation of mutations and epigenetic alterations that modify cellular patterns of gene expression, granting the cells a selective growth advantage in their microenvironment.

Identifying the conditions and mechanisms that originated those abnormal alterations is remarkably important for understanding carcinogenesis. In recent years, the development and growth of high-throughput sequencing techniques has allowed common features between cancers to be identified by comparing cancer genomes. Indeed, Hodgkinson et al., using sequencing data gathered from three types of cancer, showed that the density of somatic mutations is correlated between all three cancers and also with germline mutations. This intriguing observation instigated research into the influence of chromatin organization on the mutation distribution observed in cancers, with the aim of uncovering the factors that shape the cancer genome. The search for potential effects of the regional chromatin architecture on the incidence of mutations revealed that high mutation density is located in regions with epigenetic and genetic features related to heterochromatin, and that the best predictors of local somatic mutation density are epigenomic features derived from the most likely cell type of origin of the corresponding cancer. Those studies highlighted the importance of the chromatin environment, and of the associated epigenetic modifications, in affecting mutation density. However, recent evidence suggests that epigenetic modifications might have a more direct role in shaping the cancer mutation landscape, with one type of epigenetic modification being associated with DNA damage and repair. That epigenetic modification is 5-hydroxymethylcytosine (5hmC), a predominantly stable DNA modification that is present at high levels in the brain and in embryonic stem cells and has been shown to regulate many cellular and developmental processes, including the pluripotency of embryonic stem cells and neuron development. Previous studies showed that upon DNA damage induction there is a global increase in the levels of 5hmC and a colocalization with DNA damage response markers, opening the possibility of a role for 5hmC in DNA repair and, consequently, in reducing mutation formation. Conversely, recent preliminary research showed that regions with 5hmC could be related to R-loops, which are sources of genomic instability, and, therefore, to an increase in DNA damage and mutation formation. These conflicting results prompted a bioinformatic analysis to evaluate the density of somatic mutations in regions enriched for 5hmC, with the objective of clarifying the promising role of this DNA modification in DNA damage and/or DNA repair and, ultimately, in promoting or protecting against cancer formation. To evaluate the density of somatic mutations in regions enriched for 5hmC, high-throughput sequencing data with somatic mutations and with 5hmC were collected from acute myeloid leukemia (AML) patients and from skin cutaneous melanoma (SKCM) patients, and a comparative analysis was performed between the number of mutations in exons containing 5hmC and in exons without DNA modifications.

The results suggest that 5hmC could be associated with DNA repair mechanisms for C>T mutations only, since a significant decrease in the number of mutations of that type was observed in exons containing 5hmC. However, 5hmC could also be associated with DNA damage, given an increase in deletions in AML. The data for this type of mutation were limited, though, and another cytosine modification, 5-methylcytosine (5mC), also showed an increase in deletions. The results also revealed a higher density of virtually all mutation types in exons containing 5mC, possibly providing additional evidence for the relationship between mutations and repressive chromatin, since 5mC is known to have a role in heterochromatin regulation. In conclusion, further experiments are necessary to understand the influence of 5hmC on the cancer genome and the direct role of DNA modifications in the formation of somatic mutations.

Keywords: 5-hydroxymethylcytosine; somatic mutations; DNA repair; R-loops; cancer.


Acknowledgments

Summary

Abstract

Table of contents

List of figures and tables

List of abbreviations and acronyms

1 Introduction

1.1 Cancer mutational landscape

1.2 Influence of chromatin organization on mutation density

1.3 DNA methylation

1.4 DNA demethylation

1.5 5hmC in DNA repair

1.6 5hmC in R-loop formation

2 Aim of the dissertation

2.1 Open questions

3 Materials and methods

3.1 Experimental design

3.2 Databases and samples

3.2.1 Epigenomic data

3.2.2 Somatic mutation data

3.3 Quality assessment and data filtering

3.4 Genome alignment and peak calling

3.5 Leveraging biological replicates

3.6 Exome definition and categorization

3.7 Statistical methods

4 Results

4.1 Annotated exons qualification and quantification

4.2 Mutations qualification and quantification

4.3 Top mutated genes

4.4 Comparative analysis between mutations in epigenomic-categorized exons and mutations in unmethylated exons

5 Discussion

Bibliography

Supplementary figures

Supplementary tables


Figures

Figure 1.1. Results from the comparison between the location of 5hmC and R-loop peaks in E14 cells.

Figure 3.1. Scheme of the methodology used for merging the overlapping 5hmC enriched regions from the AML hMe-Seal patients.

Figure 3.2. Scheme of the annotated exons categorization using the epigenomic enriched regions.

Figure 4.1. Annotated exons length distribution and exon category percentages by length.

Figure 4.2. Mutation type quantities and SNVs frequency distribution in trinucleotide context.

Figure 4.3. Mutation signatures generated with SNVs from all annotated exons. AML had only one signature, likely due to the low amount of mutations, and SKCM had three distinct signatures.

Figure 4.4. Top mutated genes based on the number of mutations per length of the gene exome.

Figure 4.5. Graphical results from the comparative analysis between mutations in 5mC only, 5hmC only, and 5mC & 5hmC exons and mutations in unmethylated exons for All, SNVs, INDELs, CpG>T, Cp(A/T/C)>T, and non-C>T mutations in AML and SKCM.

Figure 4.6. Graphical results from the comparative analysis between mutations in 5mC only, 5hmC only, and 5mC & 5hmC exons and mutations in unmethylated exons for C>A, C>G, C>T, T>A, T>C, T>G, INS, and DEL mutations in AML and SKCM.

Figure 4.7. Graphical results from the comparative analysis between mutations in 5mC only, 5hmC only, and 5mC & 5hmC exons and mutations in unmethylated exons for All, SNVs, INDELs, CpG>T, Cp(A/T/C)>T, and non-C>T mutations in AML with 5hmC data merged and AML with 5hmC data separated.

Figure 4.8. Graphical results from the comparative analysis between mutations in 5mC only, 5hmC only, and 5mC & 5hmC exons and mutations in unmethylated exons for C>A, C>G, C>T, T>A, T>C, T>G, INS, and DEL mutations in AML with 5hmC data merged and AML with 5hmC data separated.

Supplementary figures

Supplementary Figure 1. Distribution of peak regions by length from the epigenomic data.

Supplementary Figure 2. SNV subtypes frequency in a trinucleotide context for each category of annotated exons.

Supplementary Figure 3. AML top mutated genes based on number of mutations per length of the gene exome in each of the annotated exon categories.

Supplementary Figure 4. SKCM top mutated genes based on the number of mutations per length of the gene exome in each of the annotated exon categories.

Tables

Table 3.1. Summary of the methods used for data collection, processing, and analysis.

Table 4.1. The number of exons and their percentages in each exon category for AML and SKCM.

Table 4.2. The number of base pairs and their percentages in each exon category for AML and SKCM.

Table 4.3. The number of mutations, their percentages, and the number of mutations per base pair in each exon category for AML and SKCM.

Supplementary tables

Supplementary Table 1. Results from the comparative analysis between mutations in 5mC only, 5hmC only, and 5mC & 5hmC exons and mutations in unmethylated exons for All, SNVs, INDELs, CpG>T, Cp(A/T/C)>T, non-C>T, and non-C>T (SNV only) mutations in AML and SKCM. Calculated using the Cochran–Mantel–Haenszel test.

Supplementary Table 2. Results from the comparative analysis between mutations in 5mC only, 5hmC only, and 5mC & 5hmC exons and mutations in unmethylated exons for All, SNVs, INDELs, CpG>T, Cp(A/T/C)>T, non-C>T, and non-C>T (SNV only) mutations for AML and SKCM. Calculated using the Cochran–Mantel–Haenszel test.

Supplementary Table 3. Results from the comparative analysis between mutations in 5mC only, 5hmC only, and 5mC & 5hmC exons and mutations in unmethylated exons for All, SNVs, INDELs, CpG>T, Cp(A/T/C)>T, non-C>T, and non-C>T (SNV only) mutations and each SNVs subtype, INS, and DEL. The hMe-Seal-seq raw data was collected from the AML patient sample GSM1278419 and the MeDIP-seq from the AML patient sample GSM700408. Calculated using the Cochran–Mantel–Haenszel test.

Supplementary Table 4. Results from the comparative analysis between mutations in 5mC only, 5hmC only, and 5mC & 5hmC exons and mutations in unmethylated exons for All, SNVs, INDELs, CpG>T, Cp(A/T/C)>T, non-C>T, and non-C>T (SNV only) mutations and each SNVs subtype, INS, and DEL.

Supplementary Table 5. Results from the comparative analysis between mutations in 5mC only, 5hmC only, and 5mC & 5hmC exons and mutations in unmethylated exons for All, SNVs, INDELs, CpG>T, Cp(A/T/C)>T, non-C>T, and non-C>T (SNV only) mutations and each SNVs subtype, INS, and DEL.

Supplementary Table 6. Comparison between the odds ratio of the AML (merged) results with the odds ratio average of the three AML 5hmC data patients from the AML (separated) results.

Supplementary Table 7. Information about called peaks generated with MACS2 for the AML MeDIP-seq data sample GSM700408 and for the AML hMe-Seal-seq data samples GSM1278419 (patient P1), GSM1278420 (patient P2), and GSM1278421 (patient P3).

Supplementary Table 8. Information about called peaks generated with MACS2 for the SKCM MeDIP-seq data sample GSM937085 and for the SKCM hMeDIP-seq data sample GSM937080.


5caC 5-carboxylcytosine

5fC 5-formylcytosine

5hmC 5-hydroxymethylcytosine

5mC 5-methylcytosine

AML Acute Myeloid Leukemia

ASCII American Standard Code for Information Interchange

ATM Ataxia-telangiectasia mutated

ATR Ataxia-telangiectasia and Rad3 related

BED Browser Extensible Data

BER Base Excision Repair

ChIP-seq Chromatin Immunoprecipitation sequencing

DDR DNA damage response

DEL Deletions

DNA Deoxyribonucleic Acid

FDR False Discovery Rate

GDC Genomic Data Commons

GEO Gene Expression Omnibus

GRCh38 Genome Reference Consortium Human Build 38

hESC Human embryonic stem cells

hMeDIP-seq Hydroxymethylated DNA immunoprecipitation sequencing

hMe-Seal 5-hydroxymethylcytosine selective chemical labeling

INDEL Insertions and Deletions

INS Insertions

lncRNA Long non-coding RNA

MAF Mutation Annotation Format

MBD4 Methyl-CpG-binding protein 4

MeDIP-seq Methylated DNA immunoprecipitation sequencing

mESC Mouse embryonic stem cells

NCBI National Center for Biotechnology Information

PCR Polymerase Chain Reaction

PGC Primordial Germ Cell

PIKK Phosphatidylinositol 3-kinase-related kinase

RNA Ribonucleic Acid

SKCM Skin Cutaneous Melanoma

SNV Single Nucleotide Variant

SRA Sequence Read Archive

TCGA The Cancer Genome Atlas

TDG Thymine-DNA glycosylase

UV Ultraviolet

WES Whole-Exome Sequencing


1.1 Cancer mutational landscape

Genetic and epigenetic alterations are the drivers of carcinogenesis [1–4]. Genetic alterations are changes in the inherited DNA code, a four-letter code composed of the four canonical DNA nucleobases: two pyrimidines, ‘C’ for cytosine and ‘T’ for thymine; and two purines, ‘A’ for adenine and ‘G’ for guanine. Epigenetic alterations are changes in the epigenetic code, a combination of chemical groups that are added to, removed from, or replaced on the chromatin and that have essential functions in chromatin organization and transcription regulation [5]. The accumulation of these alterations modifies the cellular patterns of gene expression and, consequently, the signaling pathways, granting the cells a selective growth advantage in their microenvironment by overriding cell cycle regulation and evading the immune system. This allows their undesirable survival, rapid growth, and uncontrolled proliferation, and even the colonization of different organs [6, 7]. Furthermore, the metabolic changes during carcinogenesis lead to additional epigenetic alterations that, in combination with genetic alterations, provide cancer cells with enhanced adaptive capabilities during the stages of carcinogenesis [8]. For these reasons, identifying the conditions and mechanisms that originated those abnormal alterations is remarkably important. It would help to prevent, mitigate, and treat cancer by improving our fundamental knowledge about the processes that control and/or disrupt the information contained in the chromatin.

The role in carcinogenesis of genetic alterations (mutations), i.e. changes in the DNA sequence such as point mutations, chromosomal mutations, and copy number variations, has been the main focus of cancer research at the genomic level [9, 10]. Mutations are the result of the inappropriate repair of DNA sequence alterations generated by internal or external chemical substances that interact with the DNA, spontaneous processes, exposure to ionizing radiation, and perturbations or inaccuracies during DNA replication, repair, and proofreading [11]. These mechanisms often generate single-stranded DNA breaks, double-stranded DNA breaks, or nucleotide mismatches (i.e. when the nucleobase in one strand is not complementary to the nucleobase in the opposite strand), and each can be detected and repaired through different mechanisms with characterized efficiencies [12, 13]. A change in the sequence can also occur via transposable elements, DNA sequences that move from one genomic location to another, which can be deregulated in cancer cells and even be responsible for driver mutations in carcinogenesis [14, 15]. Depending on the damage-causing or error-prone conditions to which the DNA is subjected, certain types of mutations occur more frequently than others, creating a specific mutational pattern or ‘mutational signature’ [16]. For instance, cancers with high rates of UV-induced DNA damage have a mutational pattern with a higher frequency of nucleotide substitutions from C to T (cytosine to thymine) at dipyrimidine sites [17, 18]. Mutational signatures, complemented with studies that experimentally induce DNA damage or perturb DNA repair pathways, can offer detailed information about the processes responsible for the different types of mutations observed in cancer genomes and improve our understanding of the causes of carcinogenesis. Research into the sources of mutations is rapidly improving with an increase in the availability of cancer data and enhancements in computational tools that can discriminate signatures from a large aggregate of mutation data [19, 20].
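To make the idea of a mutational pattern concrete, the following minimal sketch tabulates single nucleotide substitutions by their trinucleotide context, reporting each one relative to the pyrimidine of the mutated base pair, as is common practice for mutational signatures. The function names (`classify_snv`, `signature_counts`, `is_uv_like`) and the input format are hypothetical illustrations and are not taken from the pipeline used in this dissertation.

```python
from collections import Counter

# Translation table used to complement a DNA sequence.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def classify_snv(context: str, alt: str) -> str:
    """Label an SNV such as 'T[C>T]G'.

    context: three reference bases centred on the mutated position.
    alt: the mutated base. Purine references are flipped to the
    opposite strand so that the reference base is always C or T.
    """
    context, alt = context.upper(), alt.upper()
    if len(context) != 3 or len(alt) != 1 or set(context + alt) - set("ACGT"):
        raise ValueError("expected a 3-base context and a 1-base alt allele")
    if context[1] == alt:
        raise ValueError("alt allele equals the reference base")
    if context[1] in "AG":
        context = reverse_complement(context)
        alt = reverse_complement(alt)
    five_prime, ref, three_prime = context
    return f"{five_prime}[{ref}>{alt}]{three_prime}"


def signature_counts(snvs):
    """Count SNVs per trinucleotide-context label."""
    return Counter(classify_snv(context, alt) for context, alt in snvs)


def is_uv_like(label: str) -> bool:
    """True for C>T changes where the C has a pyrimidine neighbour."""
    return label[2:5] == "C>T" and (label[0] in "CT" or label[-1] in "CT")


if __name__ == "__main__":
    # Hypothetical toy input: (reference context, alt base) pairs.
    toy = [("TCG", "T"), ("CGA", "A"), ("ATA", "G")]
    for label, count in sorted(signature_counts(toy).items()):
        print(label, count, "UV-like" if is_uv_like(label) else "")
```

Collapsing both strands onto the pyrimidine reduces the twelve possible substitutions to six classes, which, combined with the sixteen possible flanking base pairs, gives the 96 categories in which SNV frequencies are usually displayed.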

Each cancer type usually has a typical mutation pattern due to the more prominent sources of DNA damage or abnormal function of proteins involved in regulatory or repair pathways [21]. Interestingly, it has been observed that, although the mutation patterns are different, the density of mutations is correlated between cancer types, and that genes associated with cancer occupy regions of the genome with significantly lower mutation rates than the average [22]. This surprising result highlighted that certain properties of the human genome might shape the mutational landscape, producing a non-random distribution as genetic alterations occur over time.

1.2 Influence of chromatin organization on mutation density

The mammalian genome is packaged inside the cell nucleus at several levels of organization, aided by DNA-interacting proteins that, together with the DNA, form the chromatin. The processes of transcription, replication, and repair are influenced by the local chromatin organization and require precise coordination to maintain proper cellular functions. Regions where the DNA is highly compacted, i.e. where the nucleosomes are closer together with long stretches of DNA packaged into a small region of the nucleus, are termed ‘heterochromatin’, and regions with lightly compacted DNA, where the nucleosomes are not confined to such tight regions, are termed ‘euchromatin’ [23–26].

Histones, proteins that interact with the DNA and form the nucleosomes, are the main proteins involved in chromatin organization and have important roles in transcription and repair [27, 28]. Their function can change with post-translational modifications at the N-terminal tail residues that, together with specific chemical modifications of cytidine nucleosides in the DNA, regulate the packaging and accessibility of the DNA. The combination of modifications in the histone tails and in cytosine nucleobases constitutes the previously mentioned epigenetic code (or ‘histone code’ if only the modifications in the histone tails are considered) [29]. Those modifications are essential for chromatin organization and gene expression regulation, and their perturbation can significantly contribute to pathologies ranging from obesity and autoimmunity to neurodegenerative diseases [30].

Because epigenetic modifications have regulatory functions in the DNA [31], there has been an effort to determine which modifications associate with chromatin architectural features, regulatory sequences (e.g. enhancers and promoters), active or inactive transcription, and other genomic regions where epigenetics may play a role. It has been observed that specific combinations of epigenetic modifications modulate the function of regulatory sequences: active and poised states of gene enhancer elements can be partitioned based on specific combinations of histone modifications [32]; promoters, and their role in the regulation of transcription, can be classified based on distinct histone modification patterns [33]; and the formation of heterochromatin, packaging the genome in a permanently inactive form, is accomplished through an interaction between three modifications: histone hypoacetylation, histone H3 lysine 9 methylation, and cytosine methylation [34]. This enriched knowledge of the relationship between epigenetic modifications and genomic features, together with the observation that the genomic distribution of somatic mutations is correlated between cancer types, led to the hypothesis that regions with higher mutation density have specific associated epigenetic modifications and genome properties.

Schuster-Böckler et al. [35] tested this hypothesis by comparing diverse genetic and epigenetic features with the mutation distribution, using samples from different cancer tissues, and showed that regions with high mutation rates in cancer genomes are correlated with levels of the heterochromatin-associated histone modification H3K9me3 (histone H3 lysine 9 trimethylation). This modification alone, at a megabase scale, accounted for more than 40% of the mutation-rate variation, indicating that the chromatin architecture has a strong influence on the density of mutations in human somatic cells. The authors speculated that the association between heterochromatin and a higher mutation rate could be due to the less accessible nature of the DNA in those regions impeding efficient repair of DNA damage (by reducing accessibility to repair signaling pathways and/or DNA repair complexes), or to increased exposure to mutagens at the nuclear periphery, since heterochromatin is generally located at the nuclear periphery and lamina-associated domains can be hotspots for de novo point mutations [36, 37].
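As a worked illustration of what it means for a single feature to account for a fraction of the mutation-rate variation, the sketch below computes the coefficient of determination of a simple linear relationship between one feature and the mutation density across genomic windows, which equals the squared Pearson correlation. The function name `variance_explained` and the toy numbers are hypothetical; this is an assumption about how such a figure can be obtained in general, not a reproduction of the analysis in [35].

```python
def variance_explained(feature, mutation_density):
    """Squared Pearson correlation between two equally long sequences.

    For a simple linear regression with one predictor, this is the
    fraction of the variance of `mutation_density` explained by `feature`.
    """
    n = len(feature)
    if n != len(mutation_density) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(feature) / n
    mean_y = sum(mutation_density) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(feature, mutation_density))
    sxx = sum((x - mean_x) ** 2 for x in feature)
    syy = sum((y - mean_y) ** 2 for y in mutation_density)
    if sxx == 0 or syy == 0:
        raise ValueError("both sequences must vary")
    return (sxy * sxy) / (sxx * syy)


if __name__ == "__main__":
    # Hypothetical per-window values (not real data).
    h3k9me3_signal = [1.0, 2.0, 3.0, 4.0]
    mutations_per_mb = [1.0, -1.0, 1.0, -1.0]
    print(variance_explained(h3k9me3_signal, mutations_per_mb))
```

A value of 0.4 from such a calculation would correspond to the statement that the feature accounts for 40% of the variation across windows.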

Recently, Polak et al. [38] published a study in which they investigated the distribution of mutations from several cancer types and compared it with cell-type-specific epigenetic features. The goal of this study was to determine whether the cell-type-specific variation of epigenetic features is correlated with somatic mutations. The results showed that chromatin accessibility and modification, together with replication timing, explain up to 86% of the variance in mutation rates along cancer genomes, with the high density of mutations corresponding to less accessible chromatin. Surprisingly, the epigenetic features from the cell type of origin of the corresponding malignancy were stronger determinants of cancer mutation profiles than the epigenetic features of matched cancer cell lines. Furthermore, by comparing the mutation distribution data from an unidentified cancer cell type to the enrichment of epigenomic variables from a normal single cell type, they were able to correctly predict the tissue of origin for individual cancers in 88% of the samples tested, providing evidence that sequencing the DNA of a tumor of unknown primary origin can allow the precise identification or categorization of the cell type of origin of that tumor.

An important detail from those two studies is that the mutation data used contained only single nucleotide variants (SNVs). Therefore, the results described above are only applicable for that mutation type, and the association between mutations and closed chromatin may not be a general feature of the mutational process. In fact, a study inferred from human-orangutan alignments showed that very high levels of SNVs and insertions and deletions (INDELs) are enriched in regions of open chromatin [39].

Nevertheless, overall these studies emphasized the significant interplay between chromatin packaging (with the associated epigenetic modifications) and mutations, highlighting the importance of studying the molecular mechanisms of carcinogenesis by complementing the mutation data with data on genetic processes, epigenetic modifications and chromatin organization (reviewed by Makova et al. [40]).

The studies described above show that the relationship between epigenetic modifications and cancer mutations is a promising line of research. However, the results obtained seem to be related to the types of chromatin features observed in the regions of high or low mutation density rather than to the epigenetic modifications themselves. Still, it has been known for years that one epigenetic modification is a mutation hotspot: a cytosine with a methyl group added to the 5th carbon position, an epigenetic modification termed ‘5-methylcytosine’ (5mC), which has high rates of C to T (C>T) mutations [41]. More recently, there has been some evidence indicating that another epigenetic modification, 5-hydroxymethylcytosine (5hmC), an oxidation derivative of 5mC, might influence the density of mutations through an association with DNA repair and/or with the formation of DNA-RNA hybrid structures that can cause genomic instability.

Previous research projects that studied the relationship between epigenetic modifications and somatic mutations in cancer did not include 5hmC in their analysis, because the fundamental roles of this modification have only recently been researched. The only exception found during the literature research for this thesis was a study performed in the Schuster-Boeckler Lab, which compared the frequency of C>T mutations in 5mC nucleotides with the frequency observed in 5hmC nucleotides. The results showed that 5hmC is associated with an up to 53% decrease in C>T mutations compared to 5mC, similar to the results obtained with unmodified cytosines [42]. This thesis, instead of focusing on mutations at the sites of the modified cytosines (since this has been studied previously), aims to determine the regional influence that 5hmC might have on the density of somatic mutations in cancer, since it appears to be associated with DNA damage and/or repair. For that, it is necessary to understand the origin of this cytosine modification, its prevalence in the genome, and the influence that it might have on genomic integrity.

1.3 DNA methylation

‘DNA methylation’ is a commonly used term to describe 5-methylcytosine (5mC), the product of the addition of a methyl group to the carbon in the 5th position of the nucleobase cytosine. It is the epigenetic modification that has, for a long time, been directly linked with cancer mutations [43, 44].

5mC can be copied and maintained through somatic cell generations [45]. To achieve this, in mammals, the cytosine modification is typically, but not always [46], located in 5’–C–phosphate–G–3’ dinucleotides [47] termed ‘CpG’ dinucleotides, where the ‘p’ indicates the phosphate group between the nucleotides on the same DNA strand. The fact that DNA methylation is located in a symmetrical motif, with CpG dinucleotides in both strands due to the complementary nature of DNA, is what makes it possible to maintain it after cell division. If the cytosines of a CpG are methylated on both strands, then after replication each of the two new duplexes carries a methylated cytosine on one strand and an unmethylated cytosine on the other, known as a ‘hemimethylated’ site. This hemimethylated state can be recognized by enzymes that add a methyl group to the cytosine in the opposite strand, thus maintaining the methylation pattern after each division. The process is, in reality, more nuanced than described here [48]. De novo and maintenance methylation are performed by a family of enzymes called DNA methyltransferases (DNMTs). These enzymes catalyze the transfer of the methyl group to the cytosine and are typically distinguished by their role in DNA methylation: DNMT3A, DNMT3B, and DNMT3L are associated with de novo methylation, and DNMT1 is associated with methylation maintenance.

Although this model, in which DNMTs have separate roles in methylation, has been the commonly held view in molecular biology, recent experimental evidence showed that the model needs revision and new concepts have been proposed [45, 48].
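The classical maintenance picture can be illustrated with a toy calculation. The sketch below is not taken from the cited studies: the function name and the single `maintenance_efficiency` parameter are assumptions made for illustration, and the model ignores de novo methylation and active removal.

```python
def methylated_strand_fraction(generations: int,
                               maintenance_efficiency: float,
                               start: float = 1.0) -> list[float]:
    """Fraction of DNA strands carrying 5mC at one CpG site after each division.

    Toy model (assumption): after replication every duplex pairs one parental
    strand with one new, unmethylated strand; the new strand is methylated
    with probability `maintenance_efficiency` when the parental one is.
    """
    if not 0.0 <= maintenance_efficiency <= 1.0:
        raise ValueError("maintenance_efficiency must be between 0 and 1")
    fractions = [start]
    for _ in range(generations):
        parental = fractions[-1]
        # Parental strands keep their mark; new strands gain it only through
        # maintenance, so the strand-level fraction is the average of the two.
        fractions.append(parental * (1 + maintenance_efficiency) / 2)
    return fractions


print(methylated_strand_fraction(3, 1.0))  # perfect maintenance
print(methylated_strand_fraction(3, 0.0))  # no maintenance: halves each division
```

With perfect maintenance the mark is kept indefinitely, whereas without maintenance it is diluted by half at every division.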

Only about 1% of the human genome consists of CpG dinucleotides, an approximate total of 28 million CpGs [49], much less than the frequency expected by random chance (4.41% of the total genome). This lack of CpGs is due to the mutation-prone properties of 5mC: mutations at methylated cytosines, occurring continually during vertebrate evolution, led to the depletion currently observed in the human genome.
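One way to arrive at an expected frequency of this size is to multiply the individual frequencies of C and G. The sketch below assumes a genome-wide GC content of 42% (split evenly between G and C) and a genome size of roughly 3.1 billion base pairs; both values are assumptions for illustration and are not stated in the text.

```python
# Assumed values (not from the text): GC content and approximate genome size.
GC_CONTENT = 0.42
GENOME_SIZE_BP = 3.1e9
# Approximate number of CpGs quoted in the text.
OBSERVED_CPGS = 28e6


def expected_cpg_fraction(gc_content: float) -> float:
    """Chance that a random dinucleotide is CpG if bases were independent."""
    p_c = gc_content / 2  # frequency of C
    p_g = gc_content / 2  # frequency of G
    return p_c * p_g


expected = expected_cpg_fraction(GC_CONTENT)
observed = OBSERVED_CPGS / GENOME_SIZE_BP
print(f"expected CpG fraction: {expected:.2%}")
print(f"observed CpG fraction: {observed:.2%}")
print(f"observed / expected:   {observed / expected:.2f}")
```

Under these assumptions the observed CpG frequency is roughly a fifth of the expected one, which is the depletion described above.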

There are three potential mechanisms for the mutations to occur specifically in methylated cytosines:

1. The first is widely accepted and often mentioned in the literature: spontaneous deamination [50, 51]. The molecular structure of the nucleobase cytosine is similar to that of uracil (one of the four nucleobases present in RNA), with one difference: at the 4th carbon, cytosine has an amine group and uracil has a ketone group instead. Therefore, through spontaneous deamination, the amine group is hydrolyzed and the cytosine is transformed into uracil. 5mC has the same molecular structure as cytosine, with an additional methyl group at the 5th carbon, and differs from thymine just as cytosine differs from uracil: 5mC has an amine group at the 4th carbon, and thymine has a ketone group. This means that, through spontaneous deamination, 5mC changes into thymine. Uracil in DNA, appearing by misincorporation or deamination of a cytosine, is efficiently removed by uracil-DNA glycosylase (UDG), one of the base-excision DNA-repair enzymes [52]. In the case of deamination of 5mC to thymine, since thymine is a canonical DNA nucleobase, there are specific base-excision repair enzymes for T:G mismatches: thymine DNA glycosylase (TDG) and methyl-CpG-binding protein 4 (MBD4) [53];

2. Another potential mechanism for the higher mutation rates in 5mC is enzymatic deamination. During class switch recombination and somatic hypermutation, in antigen-dependent antibody diversification, the enzyme activation-induced cytidine deaminase (AID) deaminates cytosines in single-stranded DNA. Also, the APOBEC family of enzymes plays a role in innate immunity, using cytosine deamination in defense against viruses, retroviruses, and transposable elements, and has been proposed to be involved in the deamination of 5mC. Specifically, APOBEC3A appears to efficiently deaminate 5mC [54, 55]. Further studies are still necessary to determine whether enzymatic deamination contributes significantly to the mutation rates of 5mC [56]. Two explanations have been proposed for the higher mutation rates in 5mC due to deamination (either spontaneous or enzymatic): (1) the deamination of 5mC occurs at higher rates than the deamination of C, generating more T:G mismatches than U:G mismatches; (2) the repair of T:G mismatches is less efficient than the repair of U:G mismatches [57];

3. The third possible mechanism arose from a recent study showing that tumors with somatic mutations in DNA mismatch-repair genes or in the proofreading domain of DNA polymerase ε exhibit more 5mC to T transitions than would be expected, given the kinetics of hydrolytic deamination [58]. In this study, the authors located regions of the genome where replication starts and, based on the direction of replication, determined the template strand and the lagging strand during replication. By comparing the mutation bias between the template strand and the lagging strand, they observed that the template strand has a much higher rate of C>T mutations than would be expected from a mutational process acting prior to replication, like spontaneous deamination, and from the usual rate of mistakes during replication. This was a surprising result and opened another possible explanation for the high rate of C>T mutations observed in methylated cytosines.

The modification of the methylation landscape can have strong implications for a cell. DNA methylation has essential functions in regulatory elements, gene transcription, X chromosome inactivation, genomic imprinting and repetitive sequences [59]. In a normal cell, DNA methylation is spread along the genome, with around 60-80% of the CpGs being methylated. However, there are regions of the genome where CpG dinucleotides are abundant and non-methylated, termed “CpG islands”. CpG islands are defined as stretches of DNA longer than 200 bp, with a GC content greater than 50% and a ratio of observed to expected CpG greater than 0.6 [60, 61]. The majority are located in gene promoter regions, although some CpG islands are found in gene bodies, with possible roles in regulating alternative promoters [62]. The formation of CpG islands is related to the properties of 5mC, whose frequent mutation led to the global decrease of CpG motifs in the human genome. CpG islands are regions of the genome that evaded this global depletion of CpG dinucleotides by maintaining a hypomethylated state, which protected them from the mutations associated with DNA methylation [63].
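The three criteria can be checked on a sequence with a few lines of code. The sketch below assumes that the 0.5 threshold refers to GC content (the fraction of bases that are G or C) and that the observed-to-expected CpG ratio is computed as (number of CpGs × length) / (number of Cs × number of Gs); the function names are hypothetical.

```python
def cpg_island_stats(seq: str) -> dict:
    """Length, GC content and observed/expected CpG ratio of a sequence."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # CpG dinucleotides on the given strand
    gc_content = (c + g) / n if n else 0.0
    # Assumed formula: observed CpGs relative to the number expected from
    # the C and G counts alone.
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return {"length": n, "gc_content": gc_content, "obs_exp_cpg": obs_exp}


def looks_like_cpg_island(seq: str, min_len: int = 200,
                          min_gc: float = 0.5,
                          min_obs_exp: float = 0.6) -> bool:
    """Apply the three thresholds quoted in the text to one sequence."""
    stats = cpg_island_stats(seq)
    return (stats["length"] > min_len
            and stats["gc_content"] > min_gc
            and stats["obs_exp_cpg"] > min_obs_exp)


# Synthetic examples: a CpG-rich stretch, and a GC-rich but CpG-poor stretch.
print(looks_like_cpg_island("CG" * 150))
print(looks_like_cpg_island(("C" * 5 + "G" * 5) * 30))
```

The second example shows why the observed-to-expected ratio is needed in addition to GC content: a sequence can be rich in G and C while containing few CpG dinucleotides.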

Methylation in promoter regions is negatively correlated with gene expression, although not all repressed genes have methylated promoters [64]. The transcriptional regulation of a gene is dependent on transcription factors, but if a gene has to be active or repressed for longer periods, epigenetic modifications establish the effects of activation or repression. For example, some histone modifications are associated with repressive promoters and others with active promoters [33], and DNA methylation in promoters serves as a memory signal for long-term maintenance of gene silencing [65]. In contrast with promoter regions, gene body methylation is correlated with active gene transcription [66]. This is possibly due to the presence of alternative promoters in gene bodies which, once methylated, silence cryptic transcription initiation and allow proper transcription of the gene that harbors that promoter [67]. Another potential reason is that some of the studies that measured methylation on active gene bodies used microarrays that do not separate 5mC from other cytosine modifications, making it difficult to determine whether it is 5mC or another cytosine modification that correlates with active transcription in gene bodies. Additionally, methylation in gene bodies is often found close to the boundaries between exons and introns, indicating a putative role of DNA methylation in splicing regulation [68].

1.4 DNA demethylation

DNA methylation is a dynamic process. During cellular differentiation and development, some methylated regions are demethylated and others are de novo methylated to perform their function along the differentiation stages [69]. Also, immediately following fertilization in the zygote and during the establishment of the primordial germ cells (PGCs), there is a global demethylation, allowing the zygote to erase the epigenetic signature inherited from the gametes (except for imprinted genes that maintain their methylation pattern), and the PGCs to reprogram their epigenetic landscape restoring the developmental potential and the erasure of parental imprints. DNA methylation patterns are subsequently re-established with the commitment towards a distinct cell fate [70].

DNA demethylation is the process of replacing 5mC with C and is separated into two mechanisms: active and passive. Active demethylation is mediated by an enzymatic process independent of DNA replication, and passive demethylation occurs upon failure to maintain both cytosines methylated after replication of a methylated CpG, so that after several rounds of replication the methylation is diluted at that CpG site. Until recently, it was uncertain how active demethylation unfolded, since no enzyme had been discovered that could hydrolyze the added methyl group of 5mC. One possible mechanism often proposed was deamination followed by removal of the mismatched base and its replacement with a cytosine by TDG, MBD4 and base excision repair (BER). However, this mechanism remains speculative, since there is no strong evidence for an involvement of AID or enzymes from the APOBEC family in genome-wide active demethylation processes [71, 72]. In 2009, Tahiliani et al. discovered that TET1, one of the members of the ten-eleven translocation (TET) family of Fe(II) and 2-oxoglutarate-dependent DNA dioxygenases, can modify 5mC through oxidation [73]. Soon after, two more TET proteins, TET2 and TET3, were observed to have the same function, oxidizing 5mC to 5hmC [74].

Surprisingly, all three TET proteins were found to successively oxidize 5mC to 5hmC, 5hmC to 5-formylcytosine (5fC), and 5fC to 5-carboxylcytosine (5caC), with 5caC being specifically recognized and excised by TDG, revealing a new pathway for active DNA demethylation [75, 76]. Afterward, it was demonstrated that TDG also rapidly excises 5fC [77]. In this active demethylation pathway, 5fC and 5caC are the only modified cytosines that are base-excised during demethylation, generating an abasic site as part of the BER process that regenerates unmodified cytosine. The modified bases 5mC and 5hmC are not directly removed and replaced during this process; they remain in the genome unless oxidized by one of the TET proteins, or unless the region where they are located is repaired after damage and the modified cytosine is replaced by an unmethylated cytosine. The discovery of demethylation intermediates led to research into their function, and there is growing evidence suggesting that these base modifications may possess unique regulatory functions and can indeed be stable epigenetic marks [78]. In mammals, 5mC accounts for about 1–6% of the total nucleotides of the genome, with the vast majority of 5mC occurring at CpG dinucleotides [79]. 5hmC, however, is highly tissue-specific, ranging from 0.03% of all cytosines in the spleen to 0.7% in the brain, and is reduced up to eightfold in cancer tissues relative to healthy ones [80]. 5fC and 5caC are found in many cell types and all major organs, yet they are present at a level of 0.002 to 0.02% of cytosines, much lower than 5mC and 5hmC [81, 82].

The functional role of all demethylation intermediates has been an active topic of research, showing that those epigenetic marks, rather than being merely intermediates of DNA demethylation, have important functional roles [78]. 5hmC in particular, a predominantly stable DNA modification [80] present at high levels in the brain and in embryonic stem cells, and altered in cancer tissues, was shown to regulate many cellular and developmental processes, including the pluripotency of embryonic stem cells, neuron development, and tumorigenesis in mammals [83]. Furthermore, this demethylation intermediate was observed to contribute to DNA repair and genome integrity maintenance [84].

1.5 5hmC in DNA repair

The contribution of 5hmC to DNA repair is a recent topic of research in the field of epigenetics. The Jiali Li group reported, in 2015, an ATM-dependent TET1-mediated 5hmC production in mouse Purkinje cells [85] and later, in 2017, that the DNA damage response (DDR)-activated ATR kinase (induced by ultraviolet light or the DNA topoisomerase I inhibitor camptothecin) phosphorylates TET3 in mammalian cells, promoting DNA demethylation and a moderate increase in the global levels of 5hmC [86]. These results provided evidence that PIKKs (phosphatidylinositol 3-kinase-related kinases), such as ATM and ATR, can phosphorylate the TET1 and TET3 enzymes, respectively, and stimulate 5hmC production. In TET-deficient cells, the ATR- and ATM-induced DDR signaling remained unaffected, but the repair of the DNA lesions was compromised. Unfortunately, from these studies, it was unclear whether PIKK-induced DNA demethylation occurred specifically at sites of DNA damage, and the colocalization of 5hmC with γ-H2AX, a known DNA damage response marker, was minimal. Thus, the global 5hmC production might have no functional significance, but simply reflect global DNA demethylation due to DNA repair-associated gene expression activation. Recently, Kafer et al. demonstrated that DNA damage caused an enrichment of 5hmC over broad chromosomal domains and that TET enzymes promote correct chromosome segregation during replication stress. Using laser microirradiation and prolonged aphidicolin treatment, they showed the colocalization of 5hmC with several repair factors (53BP1, Rad51, γ-H2AX) in human HeLa, HCC827 and A594 cells, and that TET2 was responsible for this DNA damage-induced 5hmC production. 5hmC was enriched in 53BP1 nuclear bodies even in non-treated cells [87]. These results were not in perfect agreement with the ones from the Jiali Li group, since they showed an enrichment of 5hmC at DNA damage/repair foci. Consequently, it is still important to determine whether TET activation and 5hmC production depend on the type of DNA damage; whether 5hmC production is a by-product of TET oxidation at sites of DNA damage to remove 5mC; and whether the contribution to DNA repair is made mostly by the TET enzymes rather than by 5hmC. Nevertheless, this possible relationship between 5hmC and DNA repair motivates further research into the functional role of this demethylation intermediate and how it may affect genomic integrity.

Conversely, preliminary data collected prior to this study indicate that regions with 5hmC associate with regions containing R-loops, DNA-RNA hybrid structures that form during transcription when the nascent RNA being transcribed anneals with a single-stranded complementary DNA sequence upstream of the RNA polymerase. Since those hybrid structures, if formed inappropriately, are sources of genomic instability, 5hmC could be considered a mutagenic epigenetic modification by association with R-loop formation.

1.6 5hmC in R-loop formation

The modification of the chemical structure of cytosines, with the formation of modified cytosine bases, has an effect on DNA flexibility and nucleosome mechanical stability [88, 89] and can alter the structure of the DNA double helix [90]. Severin et al., in 2011, published a study using a molecular force assay and single-molecule force spectroscopy, revealing that 5mC can either inhibit or facilitate strand separation depending on methylation level and sequence context, and providing the first evidence that methylation may regulate gene expression by changing the mechanical properties of DNA [91]. Later, in 2013, the same group published a paper using a similar methodology to find out how 5hmC affects strand separation. They observed significant effects of 5hmC on stretching-induced strand separation, showing that 5hmC can either upregulate or downregulate the strand separation propensity of DNA. Both studies measured and simulated DNA stretching in shear and zipper geometry to cover the different force directions that actually arise in the cell: the zipper geometry might be representative of the mechanical force exerted by DNA helicases, and the shear geometry might be more representative of the mechanical manipulation of DNA in transcription initiation. The results demonstrated a pronounced effect of 5mC and 5hmC on DNA strand separation, but the general properties and the biological relevance were not uncovered.

In 2015, a team from the Zymo Research Corporation published a paper, in the company’s biotechnical newsletter, reporting their findings on the effects of 5mC and 5hmC base modifications on DNA stability. They used high resolution melting to measure the DNA stability of an 897 bp DNA fragment with relatively evenly distributed G, A, T, and C. The C was either 100% native C, 100% 5mC, or 100% 5hmC. The results showed that for the DNA fragment containing 5mC there was an increase in DNA melting temperature, with an average of 92.397 °C, and for 5hmC there was a decrease, with an average of 84.203 °C. The DNA fragment containing only unmodified cytosines had an average of 86.51 °C. The measurements were done in triplicate. They also performed the same melting experiment using a 52 bp fragment with only one modified cytosine in the middle and obtained similar results: the fragment with 5mC showed an increase in the melting temperature, with a value of 70.27423282 °C, in comparison with the fragment with C, with a value of 70.06067317 °C; and the fragment with 5hmC showed a decrease in the melting temperature, with a value of 69.94249309 °C [92]. This paper from the Zymo Research Corporation was an internal study without peer review, the experiments were limited and reported without tests of statistical significance, and it was not discussed whether the differences in the melting temperatures are structurally significant in the cellular environment.

Nevertheless, the results had enough appeal to pursue research into DNA stability in regions of the genome with 5hmC. The reasoning behind this research was that if stretches of DNA with 5hmC have a lower melting temperature (indicating a lower binding stability between the DNA strands, in comparison with having C or 5mC), then during transcription those regions will have a higher propensity to separate the strands upon the formation of negative supercoils upstream of the RNA polymerase [93]. If this is true, a higher rate of R-loops would be expected in those regions, since the nascent RNA could more readily invade the separated DNA strand region and form the DNA-RNA hybrid. In the Sérgio de Almeida laboratory, where this thesis was conducted, a bioinformatic analysis was performed to compare 5hmC peak regions (from hMeDIP-seq data) and R-loop peak regions (from DRIP-seq data) in E14 cells, a mouse embryonic stem cell line. The results indicated that there is an association between 5hmC and R-loops in E14 cells, with the number of intersecting 5hmC and R-loop peak regions having a p-value lower than 0.05 (Fig. 1.1). Although these results are still preliminary, they reveal a new possible role for 5hmC in genomic instability or, perhaps, in DNA repair, if these results reflect a repair process that occurred in regions with R-loops.

Figure 1.1. Results from the comparison between the location of 5hmC and R-loop peaks in E14 cells. a, Graphical representation of the peaks generated with the sequencing data from 5hmC and R-loops in selected genes. b, Distribution of 5hmC and R-loop peak regions in a 20 Kbp window, color-coded with the respective Z-score. c, Diagram with the number of 5hmC and R-loop peak regions that intersected and that did not intersect, inside highly transcribed genes. (* p-value < 0.05).


2 Aim of the dissertation

The aim of this dissertation is to evaluate the density of cancer somatic mutations in genomic regions enriched for 5hmC.

Previous studies showed an increase in the levels of 5hmC upon DNA damage and an association of the genomic distribution of 5hmC with DNA damage response markers, opening the possibility for a role of 5hmC in DNA repair and, consequently, in reducing mutation formation [84, 85, 87, 94].

Additionally, 5hmC has been shown to decrease the frequency of C>T mutations in a CpG context, in comparison with 5mC, likely due to a lower propensity to deaminate or to cause mistakes during replication [42, 56]. However, preliminary research recently showed that regions with 5hmC could be related to R-loops, which are sources of genomic instability, and, therefore, to an increase in DNA damage and mutation formation. Because of these conflicting results, it was decided to perform an analysis to determine the density of somatic mutations in regions enriched for 5hmC, with the objective of revealing a relationship between the presence of 5hmC and mutations in cancer that could clarify the role of this DNA modification in DNA damage and/or DNA repair and, ultimately, in promoting or protecting against cancer formation. Furthermore, the same type of analysis was performed for regions with 5mC, and for regions with both 5mC and 5hmC, to compare the results with regions enriched for 5hmC and to distinguish how different types of cytosine modifications affect mutation density.

2.1 Open questions

• Does the presence of 5hmC influence the local density of mutations?

• Does it increase or reduce mutation density?

• Does the association differ or disappear for different mutation types?

• Does the presence of 5mC influence the local density of mutations?

• How different is it from the association between the presence of 5hmC and somatic mutations?

• Does the presence of both 5mC and 5hmC have a different influence on the local density of mutations compared to the presence of only 5mC or only 5hmC?


3.1 Experimental design

To study the association between cancer somatic mutations and regions of the genome with 5hmC or with 5mC, it was first necessary to collect data from cancer patients and then perform a bioinformatic analysis of the data. However, the data for this study was not generated in the laboratory where the thesis was conducted. 5hmC has been a recent research interest of the laboratory and the goal of this study was to clarify an uncertainty in the field and, hopefully, to provide a direction for further experiments in the relationship between DNA modifications and cancer. So, the data for the thesis was collected from public databases and publications available online.

In an ideal experimental setting, each cancer patient would have whole-genome sequencing data with somatic mutations, with 5mC, and with 5hmC, so that the association between genomic locations, if it exists, could be detected more accurately. However, each patient would need to have a large number of mutations for statistically significant effects to be observed. This ideal situation could not be achieved in this study because 5hmC sequencing data from cancer patients’ tissues is scarce and has not been generated together with 5mC and somatic mutation data for the same patient (at least none was found during the database search). Thus, the focus shifted to finding 5mC and 5hmC sequencing data from the same cancer patients and somatic mutation data from other patients with the same cancer type, while being aware that an analysis with this type of data assumes that the patients in the somatic mutation data have an epigenomic distribution similar to that of the patients in the 5mC and 5hmC data. The only cancer types with available 5mC and 5hmC sequencing data and somatic mutation data were acute myeloid leukemia (AML) and skin cutaneous melanoma (SKCM). Unfortunately, for AML, the patients in the 5mC data were different from the patients in the 5hmC data, and the number of patients differed between the two modification types.

The sequencing data with 5mC and 5hmC was generated using MeDIP-seq (Methylated DNA immunoprecipitation sequencing), hMeDIP-seq (Hydroxymethylated DNA immunoprecipitation sequencing) and hMe-Seal (5-hydroxymethylcytosine selective chemical labeling). The first two sequencing techniques enrich for methylated or hydroxymethylated DNA sequences using an antibody against 5mC (in MeDIP-seq) or 5hmC (in hMeDIP-seq) producing, after processing, peak regions with approximately 200 bp, indicating a region of the genome where the cytosine modification is located.

The third sequencing technique, hMe-Seal, is a chemical labeling enrichment technique based on β-glucosyltransferase (βGT)-catalyzed 5hmC glucosylation that produces peak regions similar in length to the ones from MeDIP-seq and hMeDIP-seq. The techniques used to generate the AML data were MeDIP-seq and hMe-Seal, because the data were collected from different datasets. It is important to note that using two different enrichment methods to produce epigenomic data that will be used in combination can affect the results of the analysis and their interpretation. Preferably, both would be generated using DIP-seq with antibodies for each cytosine modification, as in the SKCM data, with MeDIP-seq and hMeDIP-seq.

Moreover, the data with the somatic mutations in patients with AML and SKCM was generated using whole-exome sequencing (WES) and the 5mC and 5hmC data was generated using whole-genome sequencing (WGS). Because of this disparity, the data analysis had to be restricted to the exome.
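Restricting the analysis to the exome amounts to working with exon intervals: each exon can be labeled according to the peaks it overlaps, and mutations are counted only when they fall inside an exon. The sketch below illustrates this on a single chromosome with made-up coordinates; the function names, the half-open interval convention and the linear scans are assumptions for illustration, not the procedure used in this thesis.

```python
from collections import Counter


def overlaps(start, end, intervals):
    """True if half-open [start, end) overlaps any (start, end) in intervals."""
    return any(s < end and start < e for s, e in intervals)


def categorize_exon(exon, peaks_5mc, peaks_5hmc):
    """Label one exon by the kind of peak regions it overlaps."""
    start, end = exon
    has_5mc = overlaps(start, end, peaks_5mc)
    has_5hmc = overlaps(start, end, peaks_5hmc)
    if has_5mc and has_5hmc:
        return "5mC & 5hmC"
    if has_5mc:
        return "5mC only"
    if has_5hmc:
        return "5hmC only"
    return "Unmethylated"


def count_mutations_by_category(exons, peaks_5mc, peaks_5hmc, positions):
    """Count mutations per exon category; mutations outside exons are dropped."""
    counts = Counter()
    for pos in positions:
        for exon in exons:
            if exon[0] <= pos < exon[1]:
                counts[categorize_exon(exon, peaks_5mc, peaks_5hmc)] += 1
                break
    return counts


# Hypothetical coordinates on one chromosome.
exons = [(100, 200), (300, 400), (500, 600), (700, 800)]
peaks_5mc = [(150, 350)]
peaks_5hmc = [(320, 520)]
mutations = [120, 310, 550, 750, 900]  # 900 lies outside every exon
print(count_mutations_by_category(exons, peaks_5mc, peaks_5hmc, mutations))
```

For genome-scale data a sorted-interval or interval-tree lookup would replace the linear scans, but the labeling logic stays the same.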


The design of the experiment was conditioned by the data available and the necessary boundaries implemented in the analysis. An overview of the experimental design is represented in Table 3.1.

Table 3.1. Summary of the methods used for data collection, processing, and analysis. The details of each method are described throughout this section.

Steps: ① Databases and samples; ② Quality assessment and data filtering; ③ Genome alignment and peak calling; ④ Leveraging biological replicates; ⑤ Exome definition and categorization; ⑥ Statistical methods.

5mC data
① AML: MeDIP-seq (1 patient). SKCM: MeDIP-seq (1 patient, same as hMeDIP-seq).
② FastQC → fastq_quality_filter → Trim Galore!
③ Bowtie 2 → only uniquely mapped reads → MarkDuplicates → MACS2

5hmC data
① AML: hMe-Seal (3 patients). SKCM: hMeDIP-seq (1 patient, same as MeDIP-seq).
② FastQC → fastq_quality_filter → Trim Galore!
③ Bowtie 2 → only uniquely mapped reads → MarkDuplicates → MACS2
④ Merging overlapping 5hmC enriched regions between patients (AML hMe-Seal only).

Somatic mutations (WES)
② Comparison and filtering with human reference genome.
⑤ Selecting only mutations inside annotated exons.

All data
⑤ Categorizing annotated exons based on the overlap with epigenomic regions: 5mC only exons, 5hmC only exons, 5mC & 5hmC exons, unmethylated exons.
⑥ Mutations in 5mC only, 5hmC only, and 5mC & 5hmC exons vs. mutations in unmethylated exons: Cochran–Mantel–Haenszel test.
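The Cochran–Mantel–Haenszel test listed in Table 3.1 combines several 2×2 tables (strata) into a single test of association. The sketch below implements the standard statistic with a continuity correction in plain Python. The layout of each table (rows for exon category, columns for mutated versus non-mutated positions), the choice of strata and all counts are hypothetical, since they are not specified in this part of the text.

```python
import math


def cmh_test(tables, continuity=True):
    """Cochran-Mantel-Haenszel test for a list of 2x2 tables [[a, b], [c, d]].

    Returns (statistic, p_value); the statistic is compared to a chi-square
    distribution with 1 degree of freedom.
    """
    sum_a = sum_expected = sum_var = 0.0
    for (a, b), (c, d) in tables:
        n = a + b + c + d
        if n < 2:
            continue  # a stratum this small carries no information
        row1, row2 = a + b, c + d
        col1, col2 = a + c, b + d
        sum_a += a
        sum_expected += row1 * col1 / n
        sum_var += row1 * row2 * col1 * col2 / (n * n * (n - 1))
    if sum_var == 0:
        raise ValueError("no informative strata")
    diff = abs(sum_a - sum_expected)
    if continuity:
        diff = max(diff - 0.5, 0.0)
    statistic = diff ** 2 / sum_var
    # Survival function of the chi-square distribution with 1 df.
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical strata. Rows: modified exons vs. unmethylated exons;
# columns: mutated vs. non-mutated positions.
strata = [
    [[30, 70], [10, 90]],
    [[25, 75], [8, 92]],
]
statistic, p_value = cmh_test(strata)
print(f"CMH statistic = {statistic:.2f}, p-value = {p_value:.2g}")
```

Pooling the evidence across strata in this way avoids the distortion that can arise when tables with different baseline mutation rates are simply added together.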

3.2 Databases and samples

3.2.1 Epigenomic data

In this study, the term ‘epigenomic data’ refers to the 5mC and 5hmC data collected to perform the analysis.

Acute myeloid leukemia (AML)

Methylation data: MeDIP-seq [95] raw data, from tissue samples of patients, was collected from the GEO database with the GEO Series number GSE28314. The data contained a total of 12 biological replicates sequenced from different patients, although the corresponding input DNA data was not available in the database for any of them. Of the 12 biological replicates, just one passed the quality assessment (the sample with the GEO accession number GSM700408) and was used for the analysis. The platform that generated the MeDIP-seq raw data was the Illumina Genome Analyzer II (GPL9115).


Hydroxymethylation data: hMe-Seal [96] raw data was collected from the GEO database with the GEO Series number GSE52945. There was a total of 3 biological replicates from different patients, and the corresponding input DNA data for each sample was available. All samples passed the quality assessment and were used for the analysis. The platform that generated the hMe-Seal raw data was the Illumina HiSeq 2000 (GPL11154). Note: the patients from this dataset were different from the MeDIP-seq raw data patients.
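Table 3.1 lists a step in which overlapping 5hmC enriched regions from these patients are merged. A minimal sketch of such a merge is shown below, assuming that merging means taking the union of overlapping (or touching) regions on the same chromosome; whether this is the exact rule applied in the thesis is not stated here, and the patient labels and coordinates are made up.

```python
from itertools import chain


def merge_overlapping(regions):
    """Fuse overlapping or touching (start, end) regions on one chromosome."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1]:
            # Extend the previous region instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Hypothetical peak regions from three patients on the same chromosome.
peaks_by_patient = {
    "patient_1": [(100, 300), (900, 1000)],
    "patient_2": [(250, 400)],
    "patient_3": [(1000, 1100), (2000, 2200)],
}
all_peaks = list(chain.from_iterable(peaks_by_patient.values()))
print(merge_overlapping(all_peaks))
```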

Skin cutaneous melanoma (SKCM)

Methylation and hydroxymethylation data: MeDIP-seq and hMeDIP-seq [97] raw data were collected from the same study, available in the GEO database with the GEO Series number GSE38231. There was 1 biological replicate from one patient together with the input DNA data (same patient for MeDIP-seq and hMeDIP-seq). The samples passed the quality assessment and were used for the analysis. The platform that generated the raw data was the Illumina Genome Analyzer II (GPL9115).

3.2.2 Somatic mutation data

The data with the somatic mutations, from AML and SKCM patients, was collected from the GDC Data Portal (https://portal.gdc.cancer.gov/). The experimental strategy used was whole-exome sequencing (WES) and the data type was ‘Masked Somatic Mutation’. The method used for the identification of somatic point mutations in the next generation sequencing data was MuTect [98]. The mutations in the dataset were small-scale mutations: single nucleotide variants/substitutions (SNVs), small insertions (INS) and small deletions (DEL). The somatic mutations for both cancers were obtained using the same methodology, had the same data type and mutation types.

3.3 Quality assessment and data filtering

3.3.1 Epigenomic data

Sequencing quality

The raw sequencing data from high-throughput sequencing platforms are stored in the Sequence Read Archive (SRA) file format. The SRA files for this project were downloaded from the GEO database and then converted to fastq files using the tool ‘fastq-dump’ (version 2.8.2 from NCBI’s SRA Toolkit: https://www.ncbi.nlm.nih.gov/sra/docs/toolkitsoft/). The software FastQC (version 0.11.5 from https://www.bioinformatics.babraham.ac.uk/projects/fastqc/) was then used to assess the quality of the high-throughput sequence data collected [99]. All the epigenomic high-throughput sequencing raw data in this project were sequenced using single-read sequencing.
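The two steps can be chained from a small script. The sketch below only builds and prints the command lines; the file name `sample.sra` and the output directory are placeholders, and the exact options used in this thesis are not stated, so the `--outdir` flags shown are one plausible invocation rather than the one actually run.

```python
import shlex
import subprocess  # only needed if the commands are actually executed
from pathlib import Path


def build_commands(sra_file: str, out_dir: str) -> list[list[str]]:
    """Command lines for SRA-to-fastq conversion followed by FastQC.

    Assumes single-read data, for which fastq-dump writes one
    <name>.fastq file into the output directory.
    """
    fastq = Path(out_dir) / (Path(sra_file).stem + ".fastq")
    return [
        ["fastq-dump", "--outdir", out_dir, sra_file],
        ["fastqc", "--outdir", out_dir, str(fastq)],
    ]


for cmd in build_commands("sample.sra", "fastq"):
    print(shlex.join(cmd))
    # Uncomment where the SRA Toolkit and FastQC are installed and the
    # output directory already exists:
    # subprocess.run(cmd, check=True)
```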