Detection of Copy Number Variation (CNV) and its characterization in Brazilian population


Detection of Copy Number Variation (CNV) and its characterization in the Brazilian population

Ana Cláudia Martins Ciconelle

Dissertation presented to the Instituto de Matemática e Estatística of the Universidade de São Paulo to obtain the degree of Master in Bioinformatics.

Program: Master's in Bioinformatics
Advisor: Prof. Dr. Júlia Maria Pavan Soler

During the development of this work the author received financial support from CNPq and CAPES.

São Paulo, January 2018.

Detection of Copy Number Variation (CNV) and its characterization in Brazilian population

This version of the dissertation contains the corrections and changes suggested by the Examining Committee during the defense of the original version of the work, held on 6 February 2018. A copy of the original version is available at the Instituto de Matemática e Estatística of the Universidade de São Paulo.

Examining Committee:
• Prof. Dr. Júlia Maria Pavan Soler - IME-USP
• Prof. Dr. Alexandre da Costa Pereira - HCFMUSP
• Prof. Dr. Benilton de Sá Carvalho - UNICAMP

Acknowledgments

I thank my parents, Claudio and Marcia, my brother Lucas, my grandmothers, Dorga and Isabel, and all the other family members who have always supported me and who make me believe in the meaning of family. I especially thank my professor, advisor and friend, Júlia M. P. Soler, who since my undergraduate research has always been available to teach, guide and advise me with great patience, care and support. I also thank my friends from the undergraduate program in Ciências Moleculares, especially Chico, Otto and Leo, and my friends from IME, who have always helped me in every possible way. I thank the professors of Ciências Moleculares and of IME for showing me the path of science. This work would not have been possible without the support of INCOR/FMUSP, which provided the data from the Baependi Heart Study, and of the agencies CNPq and CAPES, which provided financial support.


Summary

CICONELLE, A. C. M. Detection of Copy Number Variation (CNV) and its characterization in the Brazilian population. Bioinformatics Program - Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo, 2018.

Genome-wide association studies (GWAS) are a fundamental tool for associating genetic markers, genes and genomic regions with diseases and complex phenotypes, making it possible to understand this regulatory network in more detail, to map genes and, in turn, to develop diagnostic and treatment techniques. Currently, the main genetic variant used in association studies is the SNP (Single Nucleotide Polymorphism), a variation that affects only one base of the DNA and is the most common type of variation both between individuals and within the genome.

Despite the different techniques available for association studies, many diseases and complex traits still have part of their heritability unexplained. To contribute to these studies, reference genetic databases were created, such as HapMap and 1000 Genomes, which contain representatives of the common genetic variants of the world's populations (European, Asian and African). In recent years, two of the solutions adopted to try to explain the heritability of diseases and complex phenotypes have been to use different types of genetic variants and to include variants that are rare and specific to a given population. The CNV (Copy Number Variation) is a structural variant that has been gaining ground in association studies in recent years. This variant is characterized by the deletion or duplication of a region of DNA that can range from just a few base pairs to entire chromosomes, as in the case of Down syndrome.

In partnership with the Heart Institute (InCor-FMUSP), this work uses data from the Baependi Heart Study to establish a methodology for characterizing CNVs in the Brazilian population from SNP data and for associating them with height. The project includes genetic and phenotypic data from 1,120 related individuals (structured in families). For CNV detection, the resources of the PennCNV software are used, and the processing, normalization, identification and analysis methodologies involved are reviewed. The characterization of the CNVs obtained includes information on location, size and frequency in the population, and patterns of genetic inheritance in trios. The association of CNVs with height is carried out using linear mixed models and information on family structure.

The results indicated that the Brazilian population contains (unique) regions with copy number variation that are not identified in the literature. General characteristics of the CNVs, such as size and frequency per individual, were similar to those reported in the literature. It was also observed that CNV transmission may not follow Mendelian laws, since the frequency of trios in which one parent has a deletion/duplication and the child is normal was higher than the frequency of trios in which the child carries the same variation. This work also identified a region on chromosome 9 that may be associated with height: carriers of a duplication in this region may have an expected decrease of approximately 3 cm in height.

Keywords: CNV, heritability, complex phenotypes, SNPs, family data.

Abstract

CICONELLE, A. C. M. Detection of Copy Number Variation (CNV) and its characterization in Brazilian population. 2016. Master in Bioinformatics - Instituto de Matemática e Estatística, Universidade de São Paulo, São Paulo, 2018.

Genome-wide association studies (GWAS) are an important tool for associating genetic markers, genes and genomic regions with complex phenotypes and diseases. They make it possible to understand the regulation of gene expression in detail, to map genes, and then to develop new techniques for the diagnosis and treatment of diseases. Nowadays, the main genetic marker used in GWAS is the SNP (single nucleotide polymorphism), a variation that affects only one base of the DNA and is the most common type of variation both between individuals and within the genome.

Even though multiple techniques are available for GWAS, several complex traits still have unexplained heritability. To contribute to these studies, reference genetic maps have been created, such as HapMap and 1000 Genomes, which contain common genetic variants from worldwide populations (including European, Asian and African populations). In recent years, two solutions adopted to address the missing heritability have been to use different types of genetic variants and to include rare and population-specific markers. Copy number variation (CNV) is a structural variant whose use in GWAS has been increasing in recent years. This variant is characterized by the deletion or duplication of a region of DNA, and its length can range from a few base pairs to a whole chromosome, as in Down syndrome.

In collaboration with the Heart Institute (InCor-FMUSP), this work uses the dataset from the Baependi Heart Study to establish a methodology to characterize CNVs in the Brazilian population using SNP array data and to associate them with height. The project uses genetic and phenotype data from 1,120 related samples (family structure). For CNV calling, resources from the software PennCNV are used, and the methodologies for preprocessing, normalization, identification and other analyses are reviewed. The characterization of CNVs includes information about location, size and frequency in our population, and the patterns of inheritance in trios. The association between CNVs and height is assessed using linear mixed models together with information on family structure.

The results indicate that the Brazilian population has regions with copy number variation that are not reported in the literature. General characteristics, such as length and frequency in samples, are similar to those found in the literature. In addition, it was observed that the transmission of CNVs may not follow Mendelian laws, since the frequency of trios in which one parent has a deletion/duplication and the offspring is normal is higher than the frequency of trios in which both one parent and the offspring have a deletion/duplication. This work also identified a region on chromosome 9 that could be associated with height: carriers of a duplication in this region can have an expected height reduced by approximately 3 cm.

Keywords: CNV, heritability, complex phenotypes, SNPs, family data.

Contents

List of Abbreviations
List of Figures
List of Tables

1 Introduction

2 Copy Number Variation (CNV)
  2.1 Biological background
  2.2 Mechanisms for CNVs Generation and CNV Transmission
  2.3 CNV Calling
  2.4 Association Studies and CNVs
    2.4.1 CNVs and Height

3 Materials and Methods
  3.1 Dataset
    3.1.1 SNP array platform
  3.2 Methodology Overview
  3.3 Preprocessing of SNP data
    3.3.1 Quantile normalization
    3.3.2 Median polish
    3.3.3 SNP Genotype Calling
  3.4 Log R Ratio (LRR) and B Allele Frequency (BAF)
    3.4.1 Log R Ratio (LRR)
    3.4.2 B Allele Frequency (BAF)
  3.5 Hidden Markov Models (HMM)
  3.6 Selection of CNV Regions
    3.6.1 Quality Control
    3.6.2 Minimal Regions
    3.6.3 Filtering CNV Regions
  3.7 Association Study and Polygenic Mixed Model

4 Application in Baependi Heart Study
  4.1 Log R Ratio (LRR) and B Allele Frequency (BAF)
  4.2 CNV calling
  4.3 Quality Control
  4.4 Minimal Regions
  4.5 CNV Filter
  4.6 Baependi Samples
  4.7 CNVs in Brazilian Population
    4.7.1 How many CNVs does an individual have?
    4.7.2 How long are the CNVs?
    4.7.3 Where are the CNVs?
  4.8 CNV Inheritance
    4.8.1 CNV occurrences in trios
    4.8.2 CNV Trait Heritability
  4.9 CNVs and Height

5 Final Considerations

A Quantile Normalization
B Median Polish
C Minimal Regions
D CNV Filter
E Pedigree and Kinship Matrix
F Frequency of CNVs by Chromosome
G Proportion of CNV occurrences in trios
H IDs and CEL Files Correspondence

Bibliography

List of Abbreviations

BAF      B Allele Frequency
CGH      Comparative Genomic Hybridization
CN       Copy Number
CNP      Copy Number Polymorphism
CNV      Copy Number Variation
CNVRs    CNV-containing regions
FoSTeS   Fork Stalling and Template Switching
GWAS     Genome-Wide Association Study
HMM      Hidden Markov Model
indel    Small insertions/deletions
LCR      Region-specific Low-Copy-Repeat
LD       Linkage Disequilibrium
LRR      Log R Ratio
LRT      Likelihood Ratio Test
MAF      Minor Allele Frequency
MCF      Minor Copy Frequency
NAHR     Nonallelic Homologous Recombination
NGS      Next Generation Sequencing
NHEJ     Nonhomologous End Joining
SNP      Single Nucleotide Polymorphism
SD       Segmental Duplications
SV       Structural Variants
VNTR     Variable-number Tandem Repeat


List of Figures

1.1 Illustration of a single nucleotide polymorphism (SNP).

2.1 Types of structural variants.
2.2 Illustration of a CNV.
2.3 Example of aneuploidy.
2.4 Proportion of CNVs in each chromosome.
2.5 Illustration of the four major mechanisms underlying human genomic rearrangements and CNV formation.
2.6 Publications of SNPs and CNVs.
2.7 CNV-containing region (CNVRs).

3.1 Illustration of DNA Microarray.
3.2 Illustration of signal extraction for a given molecular marker.
3.3 Flowchart of the pipeline.
3.4 Illustration of SNP clustering.
3.5 Values of LRR and BAF for each case of CNV.
3.6 Representation of the procedure to find minimal regions across samples.
3.7 Example of the kinship matrix (φ) given the family represented by the pedigree.

4.1 Intensity of probes A and B of one SNP from 1,120 samples.
4.2 LRR and BAF of one SNP from 1,120 samples.
4.3 X and Y probe intensities for all 1,120 samples.
4.4 Histogram of the standard deviation of Log R Ratio.
4.5 Histogram of the BAF mean.
4.6 Histogram of the BAF drifting.
4.7 Histogram of waviness factor.
4.8 CNV region with four categories.
4.9 CNV region with three categories.
4.10 Distribution of the age and height of all the 910 samples and for males and females.
4.11 Distribution of individual ancestry.
4.12 Total of CNVs in each procedure.
4.13 Absolute frequency of samples based on the individual number of detected CNVs.
4.14 Distribution of individual number of detected CNVs for all samples with less than 100 CNVs.
4.15 Distribution of CNVs regarding the number of copies.
4.16 Absolute frequency of samples based on the 8,794 CNVs.
4.17 Number of CNVs according to the age.
4.18 Histograms of CNV length.
4.19 Histogram of filtered CNVs length.
4.20 Proportion of CNVs in each chromosome based on total of base pairs.
4.21 Frequency of CNVs per region after finding the minimal regions.
4.22 Cases of CNV transmission.
4.23 Manhattan plot of the intraclass correlation coefficient for each CNV.
4.24 Distribution of the intraclass correlation coefficient.
4.25 Manhattan plot of the p-values from the first model.
4.26 Manhattan plot of the heritability from the first model.
4.27 Manhattan plot of the p-values from the second model.
4.28 Manhattan plot of the heritability from the second model.
4.29 Manhattan plot of the p-values from the third model.
4.30 Manhattan plot of the heritability from the third model.
4.31 Distribution of the height based on the number of copies.

A.1 Raw data in a regular plot (a) and in a quantile-quantile plot (b).
A.2 New quantile-quantile plot.

E.1 Genogram corresponding to the family data.

F.1 Frequency of CNVs per region after finding the minimal regions.
F.2 Frequency of CNVs per region after finding the minimal regions.
F.3 Frequency of CNVs per region after finding the minimal regions.

List of Tables

2.1 Number of associations between human genome variants and phenotypes.
2.2 Examples of GWAS.

3.1 States defined for the HMM.
3.2 Hypothetical example of the merged and cleaned outputs from PennCNV.
3.3 Copy number of each sample for all minimal regions.

4.1 Example of the file containing the CNVs from sample 1.
4.2 Quality control measurements from PennCNV.
4.3 Cumulative frequency of samples based on the number of CNVs.
4.4 Absolute frequency of CNV based on relative frequency of samples.
4.5 Distribution of CNV for 910 samples.
4.6 Mean relative frequency (%) for CNV occurrences in trios with one normal parent and another with single deletion.
4.7 Mean of the relative frequencies per chromosome (%) for CNV occurrences in trios with one normal parent and another with single duplication.
4.8 Mean of the relative frequency (%) for CNV occurrences in trios with two normal parents.
4.9 Models used for heritability estimation.
4.10 The top-20 CNVs with lower p-values for the Model 4.2 with CNV as a dichotomous covariate.
4.11 The top-20 CNVs with lower p-values for the Model 4.2 with CNV as a continuous variable.
4.12 The top-20 CNVs with lower p-values for the Model 4.2 with CNV as a categorical variable.
4.13 Number of individuals with normal genotype, duplication and double duplication for the two regions of chromosome 9.

G.1 Relative frequency of occurrences of CNVs in trios.
G.2 Relative frequency of occurrences of CNVs in trios.
G.3 Relative frequency of occurrences of CNVs in trios.
G.4 Relative frequency of occurrences of CNVs in trios.
G.5 Relative frequency of occurrences of CNVs in trios.

H.1 Correspondence between IDs of samples and CEL Files.
H.2 Correspondence between IDs of samples and CEL Files.

Chapter 1

Introduction

Research in genetics is growing every day, and new technologies to understand genes, genetic variation and trait heritability in living organisms are being developed and are generating a massive quantity of genetic information. In this context, genome-wide association studies (GWAS) aim to associate genetic markers, candidate genes or genome regions with complex traits and diseases, such as height and diabetes, which are likely derived from multiple genes and the environment (Lewis and Knight, 2012; Nature Education, 2014). In addition, discovering the associations between diseases and genetic factors is an important step to understand the pathogenesis of diseases and to facilitate diagnosis and treatment (Lewis and Knight, 2012; The International HapMap Consortium, 2003).

Several methods for performing genome-wide association studies are described in the literature. They vary according to the type of genetic marker used. Genetic markers are DNA sequences with known physical and molecular locations on chromosomes, such as single nucleotide polymorphisms (SNPs), microsatellites and copy number probes (National Cancer Institute; Ziegler and König, 2006). They can also be called genetic variants when they indicate a common variation in the genome, usually one present in at least 1-5% of the population (National Cancer Institute).

The most used genetic variant for GWAS is the SNP, but other variants, such as microsatellites, small insertions/deletions (indels), variable-number tandem repeats (VNTRs) and copy number variations (CNVs), are also available (Lewis and Knight, 2012). Single nucleotide polymorphisms (SNPs) are variations at a single position in DNA that are present in at least 1% of the population (Figure 1.1). On average, they occur once every 300 nucleotides, meaning that a human genome (about 3 billion nucleotides) can contain roughly 10 million SNPs. Some of them are documented to have a direct influence on some phenotypes, while others still have no known effect (Genetics Home Reference, 2017). In Chapter 2, we detail the role of SNPs and CNVs in GWAS.

Figure 1.1: Illustration of a single nucleotide polymorphism (SNP). A SNP is a change in the genetic code of an individual where a single nucleotide is replaced by another nucleotide in the DNA sequence. In this hypothetical example, at position 5, one individual has a C nucleotide, while the other two have a T nucleotide.

Several studies are being performed to catalogue human genetic variants and thereby facilitate GWAS. A pioneering project is the HapMap Project, which aims to find the patterns of DNA sequence variation in the human genome based on SNPs, to make this information freely available in the public domain, and to help investigators discover the genetic factors that contribute to susceptibility to disease, to protection against illness and to drug response (The International HapMap Consortium, 2003). This project was developed using 270 DNA samples from African, Asian and European populations, including unrelated individuals and trios. Populations from different ancestral geographic locations were chosen to ensure that the database would contain most of the common variation and the rare variants from each population. In addition, this design makes it possible to obtain information about patterns of linkage disequilibrium (LD), since common SNPs tend to be older than rare SNPs and LD reflects historical recombination and demographic events (The International HapMap Consortium, 2003).

As the name suggests, HapMap uses the information from 6.8 million SNPs to create a map of haplotypes, which are DNA sequences within an organism that are inherited together from a single parent because they are in linkage disequilibrium (The International HapMap Consortium, 2003; Ziegler and König, 2006). HapMap therefore maps groups of SNPs that usually occur in blocks, allowing researchers to genotype only some SNPs and then impute the remaining SNPs from the same block.

In 2016, the HapMap Project was merged into the 1000 Genomes Project. With a goal similar to that of HapMap, 1000 Genomes identifies genetic variants with frequencies of at least 1% in the studied populations (1000 Genomes Project Consortium et al., 2016), including not only SNPs but also structural variants and small insertions/deletions. This project comprises a total of 2,504 samples from 26 populations, which is a modest number per population; consequently, it detects only the most common variants in the worldwide population.

Even though gene discovery has been a major success, the percentage of variance explained by GWAS loci is relatively low for many traits. Thus, a substantial part of the trait variation is still unexplained. This phenomenon is called missing heritability. One example of a trait with high missing heritability is height, as described in Chapter 2. In Manolio et al. (2010), two of the solutions cited for revealing the missing heritability are to use different types of genetic variants and to include both common and rare variants.

Based on these scenarios, our focus in this work is on CNV detection, since this kind of variant is not as well characterized as SNPs but is expected to be associated with several traits and diseases. Copy number variation (CNV) occurs when the number of copies of a particular region (one or more loci) of the DNA differs from two in autosomes, or from one/two in allosomes, and it plays an important role in human genetic variability. The effects of CNVs on human diseases are not yet well known (National Institutes of Health, 2017), although several traits and diseases have been associated with this kind of polymorphism, such as uric acid levels (Scharpf et al., 2014), pancreatitis (Maréchal et al., 2006) and nervous system disorders (Lee and Lupski, 2006). Also, the procedure to obtain CNV information is not simple and involves different statistical and computational procedures. In Chapter 2, we give a detailed description of CNVs and describe the biological mechanisms by which this variant is formed, as well as the technologies available to detect it. In addition, we present some association studies that include this kind of structural variant of the human genome.
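The definition of a CNV given above (a copy number that differs from two in autosomes, or from one/two in allosomes) can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the "F"/"M" sex coding and the treatment of the Y chromosome in females are assumptions made for the example, and pseudoautosomal regions are ignored.

```python
def expected_copies(chrom: str, sex: str) -> int:
    """Expected copy number in the normal state.

    Assumptions of this sketch (not taken from the text): chromosomes are
    named "1".."22", "X", "Y"; sex is coded "F" or "M"; pseudoautosomal
    regions are ignored.
    """
    if chrom == "X":
        return 2 if sex == "F" else 1
    if chrom == "Y":
        return 0 if sex == "F" else 1
    return 2  # autosomes


def cnv_state(observed: int, chrom: str, sex: str) -> str:
    """Label an observed copy number relative to the expected one."""
    expected = expected_copies(chrom, sex)
    if observed < expected:
        return "deletion"
    if observed > expected:
        return "duplication"
    return "normal"


# One copy on an autosome is a deletion; one copy of X in a male is normal.
print(cnv_state(1, "7", "F"))
print(cnv_state(1, "X", "M"))
```

Comparing against an expected value, rather than against a fixed 2, is what keeps a single X copy in a male from being flagged as a deletion.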

Genome-wide association studies are usually based on reference maps such as 1000 Genomes, which do not take population-specific and rare variants into account, since they do not have a representative number of samples per worldwide population. In addition, Sanna et al. (2011) show that adding rare variants to association studies doubled the explained heritability, and Tennessen et al. (2012) report that around 82% of rare SNPs (present in less than 1% of the population) are population specific. Therefore, identifying different types of variants and including data from specific populations can help explain the missing heritability of traits and diseases. This motivates the creation of genomic reference maps for specific populations, for example the Genome of the Netherlands (Boomsma et al., 2014), a project similar to the 1000 Genomes Project that aims to characterize genetic variants from the Dutch population, including rare variants.

Motivated by the unknown influence of CNVs on anthropometric measurements and by the lack of studies based on the Brazilian population, this work was developed in collaboration with the Laboratório de Genética e Cardiologia Molecular/InCor-FMUSP. Using the database from the Baependi Heart Study (de Oliveira et al., 2008; Egan et al., 2016), described in Chapter 3, we analyzed the genotype (SNP data) and phenotype data from 80 families to characterize the CNVs in the Brazilian population and to understand their association with phenotypes, such as height.

In summary, the main purpose of this project is to present methodologies to quantify and call CNVs from SNP platforms and to analyze such data considering family-based designs. To this end, the project focuses on the following specific aims:
• Set up a methodology that includes bioinformatics and statistical tools and makes it possible to process and quantify CNVs from SNP array data;
• Apply this methodology to data from the Baependi Heart Study and characterize the patterns of the CNVs detected in this population;
• Study the inheritance patterns of CNVs;
• Associate the CNVs with height.

The methodology involved in this work is composed of two procedures: CNV calling and CNV analysis, as described in Chapter 3. CNV calling consists of quantifying and genotyping CNVs from the SNP array data obtained from blood samples of individuals (human population). In other words, for each sample, we identify the CNV regions and classify them based on the number of copies. CNV analysis focuses on characterizing the CNVs from Brazilian samples and identifying CNVs that might be associated with height. In addition, given that the Baependi Heart Study data contain the family structure between samples, we explore the inheritance of CNVs, which is scarcely described in CNV studies.

In Chapter 4, we illustrate some steps of the CNV calling, including some outputs obtained during the procedure. We also describe the main characteristics of the detected CNVs, such as length, location in the genome and number of CNVs per sample, as well as some inheritance patterns occurring in trio data. The results of the association between CNVs and height are presented with the annotation of the 20 most significant CNVs. Chapter 5 presents the final considerations and conclusions, including suggestions for further analysis.
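As a rough illustration of how inheritance patterns in trios could be tallied at a single CNV region, the sketch below labels each father-mother-child trio by how many parents carry a non-normal copy number and whether the child does, and then reports relative frequencies. The copy numbers are made-up toy data, the function names and labels are hypothetical, and this is not necessarily the procedure applied in Chapter 4.

```python
from collections import Counter


def trio_pattern(father: int, mother: int, child: int, normal: int = 2) -> str:
    """Summarize one trio at one region (autosomal, so normal copy number is 2)."""
    carriers = sum(cn != normal for cn in (father, mother))
    child_label = "child_normal" if child == normal else "child_carrier"
    return f"{carriers}_parent_carriers/{child_label}"


def tally_trios(trios):
    """Relative frequency of each trio pattern."""
    counts = Counter(trio_pattern(f, m, c) for f, m, c in trios)
    total = sum(counts.values())
    return {pattern: n / total for pattern, n in counts.items()}


# Toy copy numbers (father, mother, child) for one hypothetical region.
toy_trios = [(2, 2, 2), (1, 2, 2), (1, 2, 1), (2, 3, 2)]
frequencies = tally_trios(toy_trios)
for pattern, freq in sorted(frequencies.items()):
    print(pattern, freq)
```

Keeping the parental count and the child's status in one label makes it easy to compare, for example, trios with one carrier parent and a normal child against trios with one carrier parent and a carrier child.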


(23) Chapter 2 Copy Number Variation (CNV) This chapter explains the biological background of Copy Number Variations (CNVs), including its definition, characteristics and mechanisms of formation and detection by new technologies. The importance and role of CNVs in genome-wide association studies (GWAS) is also described to introduce the methods explained in Chapter 3.. 2.1. Biological background. Segments of DNA can show different kinds of structural variants (SVs). They can be modifications in orientation (inversions), chromosomal location (translocations) or copy number (deletions, insertions and duplications). SVs can be either balanced, with no loss or gain of genetic material, or unbalanced, where a part of the genome is lost or duplicated. The formers comprise inversions or translocations of a stretch of DNA within or between chromosomes while the later is termed copy variation number (CNV). A representation of these variations is in Figure 2.1 (Escaramís et al., 2015). This work is focused on the study of copy number variation (CNV), a subtype of SVs. CNV is an alteration in the number of copies of a segment of the DNA (Figure 2.2), unsettling the normal biological balance of the diploid state in humans at any given locus. The segment can include from a single nucleotide polymorphism (SNP) to several genes. A CNV can be classified in three groups: Duplication (when there is three or more copies of the segment), deletion (when the number of copies is below 2) or complex (when there. 7.

(24) 8. COPY NUMBER VARIATION (CNV). 2.1. Figure 2.1: Types of structural variants. Unbalanced SVs are represented in the top two rows including deletion, insertion of novel sequence and duplication (interspersed duplication and tandem duplication). Balanced SVs are represented in the third row and include inversions and translocations. Examples of complex SVs are presented in the bottom row. Source: Escaramís et al. (2015).. is a combination of deletions and duplications). Insertion is a mutation that increases the number of DNA bases. It can be considered as a CNV when the added sequence is the same as the neighbor sequences, being equivalent to a duplication. In the literature, it is possible to find CNVs named as copy number polymorphisms (CNPs), in which the only difference lies on the frequency of the variation in the population. CNV is a rare mutation present in less than 1% of the population (Campbell et al., 2011), while CNPs are more common mutations, present in more than 1% of the population. In this project, we don’t distinguish between CNVs and CNPs. Studies usually define the size of CNV as 1kb or larger (Feuk et al., 2006). However, some works using Watson and Venter genomes describe CNVs ranging from 300 to 350bp (Levy et al., 2007; Wheeler et al., 2008). Therefore, there is no consensus about the size of CNVs. As long as analyses of higher resolution are performed, many more CNVs of smaller size ranges are likely to be discovered (Zhang et al., 2009). The CNVs with less than 1kb can be described as small insertions/deletions (indels),.

and when a CNV spans a whole chromosome (Figure 2.3), causing the human somatic cell to contain more or fewer than the normal 46 chromosomes, it is called aneuploidy and is considered an extreme case of unbalanced SV (Escaramís et al., 2015), as in trisomy 21 in patients with Down syndrome and monosomy X in Turner syndrome (Stankiewicz and Lupski, 2010). Due to the difficulty of identifying CNVs and their real length and location, some works deal with CNV-containing regions (CNVRs) instead of CNVs, considering that the region contains at least one CNV whose location is not precisely known (McCarroll and Altshuler, 2007).

Figure 2.2: Illustration of a CNV. A human carries two copies of each chromosome, one from the mother and another from the father, so the total number of copies of each segment is two (middle). When a deletion of segment II occurs (left), the individual has only one copy of segment II instead of two. When a duplication of segment II occurs (right), the individual has three copies in total. The segment can range from a single SNP to several genes.

Association studies involving CNVs are an important step towards understanding complex diseases, traits and evolution. As described in Zhang et al. (2009), the Database of Genomic Variants included 38,406 SVs ranging from 100 bp to 3 Mb, which cover 29.74% of the reference genome. In contrast, the SNP database comprises 14,708,752 SNPs, but they cover less than 1% of the reference genome. Therefore, SVs can account for a large part of the genetic diversity in humans.

For a better understanding of the variability of the human genome in healthy individuals

and the role of CNVs, Zarrei et al. (2015) developed a CNV map covering data from various populations, of which less than 10% came from South American populations. They created two maps: one including all CNVs and CNV-containing regions (CNVRs) from several published studies (the inclusive map) and another containing only CNVs and CNVRs found in at least two subjects in two independent studies (the stringent map). From the inclusive map, they estimated that 9.5% of the human genome contains gains or losses, while in the stringent map this value dropped to 4.8%.

Figure 2.3: Example of aneuploidy. In this type of variant, the chromosome has three copies instead of two. For chromosome 21, this case corresponds to Down syndrome.

The maps created by Zarrei et al. (2015) also show that CNVs are not evenly distributed across the chromosomes (Figure 2.4), with chromosomes 19, 22 and Y having, in general, the largest proportions of CNVs. This proportion is the total number of base pairs of CNVs in a chromosome divided by the total number of base pairs of the chromosome. Comparing the proportions of losses and gains, they found that the chromosomes are more susceptible to losses than to gains, with proportions ranging from 4.3% to 19.2% and from 1.1% to 16.4%, respectively.
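As a minimal sketch of the proportion defined above (CNV base pairs of a chromosome divided by the length of the chromosome), the Python function below computes the covered fraction from a list of CNV intervals. The function name, the half-open (start, end) coordinate convention and the choice to count overlapping CNVs only once are assumptions made for illustration; Zarrei et al. (2015) may treat overlaps differently.

```python
def cnv_proportion(cnv_intervals, chromosome_length):
    """Fraction of a chromosome covered by at least one CNV.

    cnv_intervals: iterable of (start, end) pairs in base pairs, half-open
                   (assumed convention, not taken from the source).
    chromosome_length: total number of base pairs of the chromosome.
    """
    covered = 0
    current_start, current_end = None, None
    # Merge overlapping or touching intervals so no base pair is counted twice.
    for start, end in sorted(cnv_intervals):
        if current_end is None or start > current_end:
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered / chromosome_length


# Hypothetical toy chromosome of 1,000 bp with three CNVs, two of which overlap.
proportion = cnv_proportion([(0, 100), (50, 150), (300, 400)], 1000)
```

In this toy example the first two intervals merge into a single 150 bp block, so 250 of the 1,000 base pairs are covered.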

Figure 2.4: Proportion of CNVs in each chromosome. The horizontal dashed lines indicate the genome average for the inclusive map (upper line) and the stringent map (lower line). Source: Zarrei et al. (2015).

2.2. Mechanisms for CNV Generation and CNV Transmission

CNVs can be de novo or transmitted from the parents, but the de novo CNV rate per transmission ($\mu$) is only around $2 \times 10^{-2}$ (Itsara et al., 2010). Therefore, most of the CNVs of an individual are due to the presence of the CNV in the parents' haplotypes. As described in Itsara et al. (2010), this rate can change when two groups are compared. Regarding family data, for example, cases of multiplex autism showed $\mu = 2.2 \times 10^{-2}$, while unaffected siblings showed $\mu = 5.4 \times 10^{-3}$, suggesting that the presence of de novo CNVs increases the risk of autism.

The generation of de novo SVs can occur both meiotically and mitotically. Thus, monozygotic twins can carry differences in SVs, and individuals can be mosaic carriers of structural variants, between tissues and even within tissues (Escaramís et al., 2015). For this reason, a CNV can be found either in one type of cell or in all the somatic cells of the individual.

Stankiewicz and Lupski (2002, 2010) describe four possible sporadic mechanisms for the formation of CNVs: NAHR, NHEJ, FoSTeS and L1-mediated retrotransposition.

• Nonallelic Homologous Recombination (NAHR): Region-specific low-copy repeats (LCRs), also called segmental duplications (SDs),

are DNA blocks of ∼10–400 kb with ≥97% identity that exist in multiple locations as a result of duplication events. Due to their size and similarity, SDs often result in forms of chromosomal rearrangement and can cause genome instability. Although they are rare in most mammals, LCRs comprise a large portion of the human genome owing to a significant expansion during primate evolution (Stankiewicz and Lupski, 2002). When LCRs are located less than ∼10 Mb from each other, they can lead to misalignment of chromosomes or chromatids and mediate nonallelic homologous recombination (NAHR), which can result in unequal crossing-over, with recombination hotspots, gene conversion and apparent minimal efficient processing segments. NAHR between directly oriented LCRs results in deletions or reciprocal duplications of the genomic segment between them. This molecular mechanism has been shown to be responsible for the vast majority of the common-sized recurrent rearrangements (reciprocal deletions and duplications, or inversions) (Stankiewicz and Lupski, 2010); an example is illustrated in Figure 2.5. NAHR can occur both in meiosis and in mitosis, and only in the presence of substrates (LCRs or SDs). In meiosis, NAHR can lead to unequal crossing-over and genomic rearrangements that will be present in all the cells. In mitosis, on the other hand, it leads to mosaic populations of somatic cells carrying copy number variants or other SVs (Zhang et al., 2009).

• Nonhomologous End Joining (NHEJ): In NHEJ, double-strand breaks are detected, and both broken DNA ends are then bridged, modified and finally ligated. The product of the repair often contains additional nucleotides at the DNA end junction, leaving a "molecular scar" (Stankiewicz and Lupski, 2010), as shown in Figure 2.5. This process does not require a substrate and usually leads to deletions or small insertions.
• Replication-Error Mechanisms (FoSTeS): Fork stalling and template switching (FoSTeS) is a mechanism based on DNA replication errors. During DNA replication, the fork of one DNA region stalls and the

strand is released from its original template, resuming DNA synthesis at another replication fork in physical proximity (Stankiewicz and Lupski, 2010; Zhang et al., 2009). As shown in Figure 2.5, this process can repeat multiple times (FoSTeS ×2, FoSTeS ×3, ...). FoSTeS does not require a substrate. However, a short sequence of base pairs must be identical in both forks (microhomology) so that DNA synthesis can resume. This is the only mechanism that creates complex CNVs.

• L1-mediated retrotransposition: This mechanism generates only insertions, meaning that the result is not necessarily a CNV. Long interspersed element-1 (L1) is the only autonomous transposable element still active in the human genome. It comprises approximately 17% of the genome and contains two open reading frames (ORFs), regions that indicate where translation starts and ends. The insertion happens when RNA polymerase II transcribes the region between ORFs.

Figure 2.5: Illustration of the four major mechanisms underlying human genomic rearrangements and CNV formation: Nonallelic Homologous Recombination (NAHR); Nonhomologous End Joining (NHEJ); Fork Stalling and Template Switching (FoSTeS); retrotransposition. Figure from Zhang et al. (2009).

Given that CNVs can be transmitted, understanding their transmission is valuable for association studies, which detect transmitted CNVs and explore how they underlie Mendelian diseases in families (McCarroll and Altshuler, 2007). For example, a triplication of an approximately 605 kb segment containing the PRSS1 and PRSS2 genes was reported that causes

hereditary pancreatitis, which has 80% penetrance (80% of the CNV carriers develop the disease) (Maréchal et al., 2006). In addition, a research group identified a CNV responsible for Pelizaeus-Merzbacher disease, in which 65% of the cases had inherited the condition (Lee and Lupski, 2006).

Findings on CNV transmission indicate that it is consistent with normal Mendelian inheritance (Locke et al., 2006), and this information can be used by some CNV calling algorithms, such as those described in Wang et al. (2008a) and Chu et al. (2013). However, in Locke et al. (2006), the CNVs considered were located within duplicated regions of the human genome, and the 269 samples (individuals) studied had European, Yoruba and Asian ancestry. On the other hand, as described in Palta et al. (2015), when all regions of the genome with a CNV are considered, the transmission rate is 45.5%, a statistically significant deviation from the expected Mendelian transmission rate of 50%, especially when the regions are smaller than 10 kb. Thus, in this project, we aim to evaluate the CNV transmission rate in the Brazilian population, taking the whole genome into account. For this analysis, we use the family data from the Baependi Heart Study, but consider only trio data.

2.3. CNV Calling

Copy number variations can be detected by different technologies, and the methods documented in the literature can be divided into three groups depending on the type of data: Comparative Genomic Hybridization (CGH), SNP array and Next Generation Sequencing (NGS).

The first method for the identification of CNVs was array-CGH (array comparative genomic hybridization). It uses two genomes, a sample and a control, which are hybridized against the same oligonucleotides. The fluorescent signal intensity ratio between them can be compared across each chromosome to identify copy number changes (Theisen, 2017).
Although it is offered by many companies and supports targeted arrays for clinical tests, array-CGH cannot detect the absolute number of copies (Escaramís et al., 2015).

For an in-depth study of structural variants, NGS methods are preferred, since their four strategies (Read Depth (RD), Paired Read (PR), Split Reads (SR)/Clip Reads (CR) and de novo Sequence Assembly (AS)) have different advantages that can be combined to detect all kinds of CNVs.

Another method is based on SNP array data. Single nucleotide polymorphisms (SNPs) are the most common type of genetic variation in humans and represent a difference in a single nucleotide of the DNA (Laframboise, 2009), as described in Chapter 1. Using the SNP information, it is possible to infer the presence of CNVs and, owing to the large amounts of SNP data available from genome-wide association studies, several algorithms have been developed, such as PennCNV (Wang et al., 2007b) and CRLMM/VanillaIce (Carvalho et al., 2007; Scharpf et al., 2011). However, the major disadvantage is the poor overlap between the results of different software packages (Eckel-Passow et al., 2011).

As described in Wang et al. (2008b), the algorithms for CNV calling based on SNP arrays can be summarized as a 3-step process that analyzes each individual separately:

1. Preprocessing: Quantify the intensities of each allele (A and B) of a given SNP. The most commonly used values are the Log R Ratio (LRR) and the B Allele Frequency (BAF), which are inferred from the raw data (Escaramís et al., 2015).
2. CNV calling: The values obtained in the preprocessing are used to estimate a "copy-number measurement", which is usually continuously distributed across populations. Based on these values, it is possible to call the "copy-number genotypes", which can range from a simple "loss" or "gain" qualification to discrete values such as (0, 1, 2) referring to the number of copies of a given allele (McCarroll and Altshuler, 2007).
3. Smoothing across the chromosome: Smoothing and quality control techniques are applied to reduce the noise and to obtain better CNV calls.

Each step can vary from one software package to another.
The complete description of the resources and methods used in this project can be found in Chapter 3.

2.4. Association Studies and CNVs

Genome-wide association studies (GWAS) are a gene mapping approach that involves the identification of candidate genes or genome regions that contribute to a specific disease by testing for association between disease status and genetic variants. They are the main tool for identifying genes that contribute to complex traits and diseases, such as diabetes and heart diseases, which are termed "complex" because they are explained by genetic and environmental factors as well as their interaction (Lewis and Knight, 2012).

Although several types of genetic markers can be used for this kind of study, SNPs are the most commonly used. The public online database ClinVar archives published associations between human genetic variants and phenotypes. Table 2.1 summarizes the total number of results by type of genetic variant (Landrum et al., 2016), showing that 76% of the reported associations involve SNPs, while CNVs (deletions and duplications) account for 19%. The massive use of SNPs in comparison with CNVs can also be seen in Figure 2.6, which represents the results of our search for papers with the keywords SNP and CNV, separately.

Table 2.1: Number of associations between human genome variants and phenotypes. It includes conditions and diseases, such as obesity and cancer.

| Genetic Variant | Associations with Phenotypes | Relative Frequency (%) |
|---|---|---|
| SNP | 266,709 | 76 |
| Deletion | 40,963 | 12 |
| Duplication | 24,107 | 7 |
| Indel | 2,392 | 0.6 |
| Insertion | 15,454 | 4.4 |

CNVs can represent benign polymorphic variations or modify expected phenotypes by mechanisms such as altered gene dosage and gene disruption (Zhang et al., 2009). In addition, CNVs have an important role in genomic variation.
Several studies have already reported associations between CNVs and specific conditions, as presented in Table 2.2. For a more complete database, DECIPHER (DatabasE of genomiC varIation and Phenotype in Humans using Ensembl Resources) includes a list of genetic variants (sequence variants or copy number variants) that have a role in the pathology of some specified syndromes (Firth et al., 2009; Wellcome Trust Sanger Institute, 2009). Therefore, this polymorphism

is as relevant as SNPs for association studies, although, as shown above, the literature has given greater emphasis to SNP studies than to CNV studies.

Figure 2.6: Publications on SNPs and CNVs. Number of publications with the keywords SNP and CNV in Google Scholar.

The statistical analysis for GWAS varies according to the subject as well as the type of data and dependent variable (phenotype); some examples are described in Table 2.2. The statistical model used in this work is described in Section 4.9. Regardless of the analysis applied to infer the association between trait and genetic marker, significant genetic associations can have three different interpretations (Lewis and Knight, 2012):

1. Direct association, in which the genetic marker is the true causal variant conferring disease susceptibility;
2. Indirect association, in which a genetic marker in linkage disequilibrium (LD) with the true causal variant is genotyped;

3. A false-positive result, in which there is either chance or systematic confounding, such as population stratification. One way to minimize this problem is to adjust for ancestry coefficients, commonly calculated as principal components (Price et al., 2006).

Table 2.2: Examples of GWAS. For each study, the table lists the phenotype, the type of data and software used in the CNV calling, the CNV associated with the phenotype and the statistical model applied in the association analysis.

| Phenotype (Reference) | Data for CNV calling | Genetic Variation (CNV) | Statistical Analysis |
|---|---|---|---|
| Uric acid concentrations (Scharpf et al., 2014) | SNP array data (Software: APT/PennCNV/VanillaICE) | Duplication/Deletion in 4p16.1 | Adjusted serum log uric acid concentrations (continuous) in a mixed effects regression model with fixed effects for copy number (modeled as a continuous quantitative variable with scale 0-4), age, log-transformed BMI, gender and study center. Chemistry plate was added to the model as a random effect. |
| Williams-Beuren Syndrome (WBS) (Dutra et al., 2011) | Microsatellites | Deletion in 7q11.23 | Fisher's exact test between clinical features of WBS and CNV traits. |
| Miller-Dieker lissencephaly syndrome (Cardoso et al., 2003; Ledbetter et al., 1992) | Data from fluorescence in situ hybridization (FISH) (similar to array-CGH) | Deletion in 17p13.3 | Contingency table analysis between affection and CNV presence. |
| Severe obesity (Wheeler et al., 2013) | SNP array (Software: APT) | Duplication/Triplication in LEPR, POMC, MC4R, BDNF and SH2B1 | A likelihood ratio test that models the distribution of per-sample CNV measurements as a Gaussian mixture and compares the goodness of fit with and without association to affected status. |

As described in Yang et al. (2010), two problems found in association studies using SNPs are the small effect of each marker on the phenotype and the incomplete linkage disequilibrium between causal variants and the SNPs.
These factors require studying the effect of blocks of SNPs on the phenotype instead of the effect of individual SNPs.

The overview by McCarroll and Altshuler (2007) discusses the evidence for the effects of

CNVs on phenotypes and describes the challenges of association studies involving CNVs. Although some of the listed problems have since been solved, for instance through the development of new technologies for CNV detection, others remain open, such as the transmission rate of CNVs.

Unlike SNPs, whose population frequencies and precise locations are very well characterized, CNVs are still poorly characterized. For example, the Affymetrix 6.0 platform used in this project has approximately 1.8 million markers, including "202,000 probes targeting 5,677 CNV regions" (Affymetrix, 2008), i.e., CNV probe locations in fact correspond to the locations of CNV-containing regions (CNVRs) (McCarroll and Altshuler, 2007). As shown in Figure 2.7, the defined position can represent more than one kind of CNVR. When SNP arrays are used for CNV calling, this problem is minimized, since the calling takes SNP information and linkage disequilibrium into account. As described in Section 3.5, PennCNV includes the SNP information in the transition probabilities of the hidden Markov model, giving more precision to the CNV calling.

Figure 2.7: CNV-containing regions (CNVRs). Possible CNV locations relative to the coordinates of a reported CNV-containing region (CNVR). For a bacterial artificial chromosome (BAC) probe (red region), it is possible to find a duplication in only part of the region or in its vicinity (blue regions). The same occurs for deletions (the last four lines). Source: McCarroll and Altshuler (2007).

As mentioned before, CNV calling techniques have low overlap among them (Eckel-Passow et al.,

2011), and only a small percentage of the thousands of CNVs detected can be genotyped in the available samples. For example, one could identify thousands of CNVs from a set of samples, but only a few of them can actually be genotyped, since it is difficult to detect CNVs at the same DNA position in a sufficient number of samples (McCarroll and Altshuler, 2007). This can be observed in Section 4.5, in which we filter the CNVs to keep only those present in at least 2% of the population.

These problems are related to the quality of the CNV calling. Although all data used in an association study must be as accurate as possible, obtaining the CNV categories is the biggest challenge of the study. Once this step is completed, different statistical models can be applied to test the association between the CNV and the desired phenotype, depending on the subject and the data used. Some software and R packages were developed to facilitate the implementation of the statistical analysis in GWAS, such as SOLAR (Blangero et al., 2015) and the kinship2 R package (Therneau et al., 2015). They can deal with several kinds of data, including family information, and are described in more detail in Chapter 3.

2.4.1. CNVs and Height

Published studies estimate height heritability as approximately 80%, which means that 80% of height variation in populations is due to genetic effects. GWAS have succeeded in identifying around 50 individual variants that may be associated with height and that could explain up to 5% of the total heritability when considered independently. However, this value increases to 45% if groups of polymorphisms are used to describe the phenotypic variation (Yang et al., 2010). Thus, height was chosen in this work out of interest in characterizing the pattern of missing heritability for this phenotype in the Brazilian population through CNV analysis.
Association studies between copy number variation (CNV) and human height have already been performed (Dauber et al., 2011; Kim et al., 2013; Li et al., 2010). In all these studies, two of which were performed in Asian populations (Kim et al., 2013; Li et al., 2010), only unrelated individuals were considered. Their findings suggest an association of short height with combined deletions (Dauber et al., 2011), specifically the

ones in the region 12q24.33, a neighbor of the gene GPR133, which in turn contains SNPs previously associated with height (Kim et al., 2013). Other CNV regions associated with human height were found (Li et al., 2010), but none of them was validated after multiple-testing correction.


Chapter 3 Materials and Methods

This chapter describes the dataset and the SNP array platform used in the Baependi Heart Study. The following sections present the methodologies that make up the pipeline we propose. They are divided into two parts: CNV calling and CNV analysis. The first part is described in Sections 3.3 and 3.4, where we explain the procedures included in the preprocessing of the genetic data and obtain two new values (Log R Ratio and B Allele Frequency) from the SNP array data. Section 3.5 introduces the hidden Markov model (HMM) used for CNV calling. The second part of the methodology begins with the preprocessing of the HMM output and ends with the description of the models used to infer the heritability of height and the CNV transmission rate (Sections 3.5, 3.6.2, 3.6.3 and 2.4). Illustrations of some procedures are given in Chapter 4.

3.1. Dataset

Due to multiple waves of immigration, Brazil has a highly admixed population, which may be reflected in the genetic and environmental influences on several traits. The Baependi Heart Study has been conducted by the Heart Institute since 2005 as a longitudinal family-based cohort study for understanding the variation of cardiovascular risk factors within the Brazilian population and disentangling its genetic and environmental components. The study comprises two data collection steps in accordance with a planned sample design. The first

wave was performed between December 2005 and January 2006, and the second wave followed in 2010 (details are described in de Oliveira et al. (2008); Egan et al. (2016)). The data considered in this work are from the first wave of the study and provide information about 105 families (1,666 individuals, 723 males and 943 females) living in the village of Baependi, in the state of Minas Gerais, Brazil. Data from 631 nuclear families were available, with the number of offspring ranging from 1 to 14. The number of generations per family varied from 2 to 4 (54% of the families had 3 generations, and 45% had 2 generations). Only individuals aged 18 years or older were considered eligible to participate in the study. The mean age was 44 years, with a range of 18 to 100 years (de Oliveira et al., 2008).

For each participant, a questionnaire was used to obtain information regarding family relationships, demographic characteristics, medical history and environmental risk factors. Anthropometric measurements, physical examination and electrocardiogram of the participants were performed by trained medical students. In addition, fasting blood glucose, total cholesterol, lipoprotein fractions and triglycerides were measured by standard techniques in blood samples. Serum samples were stored at −80 °C and genomic DNA was extracted by standard procedures. From the DNA samples, genotyping was performed with the Affymetrix 6.0 SNP array platform, and 1,120 CEL files were obtained.

3.1.1. SNP array platform

DNA microarray is a technology used to perform experiments on multiple genes at the same time. This tool makes it possible to determine whether the DNA of an individual contains a genetic variant. For this purpose, a DNA microarray contains multiple unique spots, each with several identical strands of DNA, as illustrated in Figure 3.1 (Genetic Science Learning Center, 2013; National Human Genome Research Institute (NHGRI)).
A SNP array is a type of DNA microarray used to detect single nucleotide polymorphisms within a population. The Affymetrix 6.0 platform includes 906,600 SNP markers and 946,000 CN probes, of which 202,000 target 5,677 CNV regions from the Toronto Database of Genomic Variants (Affymetrix, 2008).

The genomes of any two individuals are 99.9% identical at the nucleotide level, and the presence of SNPs in the genome is the largest source of genetic diversity among humans. However, some regions of the genome can have few or no SNPs (Laframboise, 2009; Shen et al., 2008). In addition, as described in Section 2.4, a single probe can represent multiple CNV regions. For these reasons, CN probes evenly spaced along the genome were added to the platform to cover the regions that would not be covered by SNPs.

Figure 3.1: Illustration of a DNA microarray. The microarray chip contains several spots, each of which includes multiple identical DNA strands. During the experiment, the isolated genetic material of a sample binds to the DNA strands corresponding to genes that are turned on. Given the information from each spot, the final output lists the active genes. Source: Genetic Science Learning Center (2013).

The procedure to obtain the allele intensities for a given marker using the Affymetrix assay is illustrated in Figure 3.2. For a given SNP, different oligonucleotides of 25 nucleotides (25-mer probes) for both alleles, containing the SNP at different positions, are used to bind to the DNA strand. When the probe is complementary to all 25 bases, a brighter signal is detected. Otherwise, if there is a mismatch at the SNP site, the signal is dimmer (Laframboise, 2009). These intensity values are stored in CEL files, which are used for the CNV analysis as well as for the SNP analysis.

Figure 3.2: Illustration of signal extraction for a given molecular marker. In this hypothetical example, there is a sequence containing a SNP that can be A or C. For this SNP, the SNP array chip has different DNA strands (probes) that include the SNP with the reference and the altered nucleotide, which are defined as alleles A and B. When the denatured DNA of the sample is placed on the chip, a perfect match occurs when all the bases of the sample DNA and the probe bind, and a higher-intensity signal is detected (orange). Otherwise, a lower-intensity signal (yellow) is detected. Source: Laframboise (2009).

3.2. Methodology Overview

The methodology used is summarized in Figure 3.3, which describes the preprocessing of the SNP data, the CNV calling and the CNV analysis. For the preprocessing of the SNP data and the CNV calling, Affymetrix Power Tools, PennCNV and packages from the R environment were used.

Briefly, the CEL files from the Affymetrix 6.0 platform are used to obtain the signal intensities of alleles A and B for each single nucleotide polymorphism (SNP). Based on these values, the genotypes (AA, AB and BB) are predicted by unsupervised clustering algorithms. In addition, two new values are obtained from the intensity values by polar transformation: log R ratio (LRR) and B Allele frequency

(BAF). This new information is used in a hidden Markov model for CNV estimation.

Figure 3.3: Flowchart of the pipeline. The numbers indicate which function was used. Box I indicates the CNV calling and box II indicates the CNV analysis.

The following functions were used in the preprocessing described in Figure 3.3 (Wang et al., 2007a):

1. Function apt-probeset-summarize from Affymetrix Power Tools. Given the CEL files, the signal intensity values of the probes are normalized using quantile normalization (Section 3.3.1). Then, the function applies median polish (Section 3.3.2) to obtain the final cleaned intensity values of alleles A and B for each SNP.
2. Function apt-probeset-genotype from Affymetrix Power Tools. Given the CEL files, this function generates the individual genotype calls using the Birdseed algorithm. For each SNP in each sample, the genotype is coded as 0, 1 and 2 for AA, AB and BB, respectively, and as -1 for missing values, together with the corresponding confidence score. In addition, a final report infers the sex of each sample.
3. Function generate_affy_geno_cluster.pl from PennCNV. This function generates canonical genotype clustering files based on the output files from functions 1 and 2. These files contain the cluster positions of each SNP for each canonical genotype (AA, AB and BB).
4. Function normalize_affy_geno_cluster.pl from PennCNV. The LRR and BAF values of each SNP and each sample are calculated using the genotype clustering file and the intensity values of alleles A and B. More details are given in Section 3.4.

The preprocessing phase generates auxiliary files, such as the genotype clustering file and the genotype confidence scores, as well as the file containing the LRR and BAF for all markers and samples. This information is used for the CNV calling (Figure 3.3). The following functions were used in this process (Wang et al., 2007a):

5. Function kcolumn.pl from PennCNV. The output of the previous function is a table with markers in the rows and the LRR and BAF values of each sample in the columns. Since PennCNV detects the CNVs separately for each sample, this function splits the table into one three-column table (Marker, LRR and BAF) per sample.
6. Function detect_cnv.pl from PennCNV. The CNV calling is performed for each sample (Section 3.5). Additional information from the HapMap reference is used, for example the SNP coordinates and the probability of the B allele.
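The splitting performed in step 5 can be pictured with the short Python sketch below. It is only an illustration of the idea of turning one wide table into one three-column table per sample; it is not the kcolumn.pl script, and the function name as well as the header layout (a marker column followed by hypothetical "<sample>.LRR" and "<sample>.BAF" column pairs) are assumptions that may not match the actual PennCNV file format.

```python
def split_by_sample(header, rows):
    """Split a wide LRR/BAF table into one (marker, LRR, BAF) table per sample.

    header: list such as ['Marker', 'S1.LRR', 'S1.BAF', 'S2.LRR', 'S2.BAF']
            (hypothetical layout assumed for this sketch).
    rows:   list of rows, each with one marker name followed by the values.
    """
    tables = {}
    # Columns come in (LRR, BAF) pairs after the marker column.
    for col in range(1, len(header), 2):
        sample = header[col].rsplit(".", 1)[0]
        tables[sample] = [(row[0], row[col], row[col + 1]) for row in rows]
    return tables


# Hypothetical toy input: two markers, two samples.
header = ["Marker", "S1.LRR", "S1.BAF", "S2.LRR", "S2.BAF"]
rows = [
    ["rs1", -0.1, 0.5, 0.3, 1.0],
    ["rs2", 0.2, 0.0, -0.6, 0.5],
]
per_sample = split_by_sample(header, rows)
```

Each entry of per_sample then holds the three columns that a per-sample CNV caller would read.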

PennCNV is used because it estimates locus-level copy number, performs segmentation and evaluates CNV-specific quality-control metrics within a single software package, has relatively small bias and variability, and detects regions while maintaining an estimated false-positive rate (Eckel-Passow et al., 2011).

The identified CNV regions are specific to each sample (individual). As shown in Figure 3.3, we excluded the samples that did not pass the quality control (Section 4.3). Then, a new set of minimal regions, defined by the overlapping regions across all samples, was built. Finally, a filter was applied to remove all minimal regions with a low frequency of CNVs. The final regions are then ready for the CNV analysis of this work.

7. Function filter_cnv.pl from PennCNV. As PennCNV does not filter the samples, this function returns the mean, median and standard deviation of LRR and BAF for each sample. This information makes it possible to evaluate the quality of the CNV calling based on the criteria described in Section 3.6.1. The output of filter_cnv.pl is loaded in R to select the samples that passed the quality control.
8. CNTools package from Bioconductor (Zhang, 2017). Each sample contains its own CNV regions. However, for subsequent analysis, we need the same variables for all samples. The solution adopted was to identify the minimal regions (Section 3.6.2), which are the overlaps of all identified regions. The CNTools package was created with a similar aim, so we adapted it to our needs.
9. Basic functions from R. The minimal regions obtained take into account all CNVs identified in all samples. Thus, if a CNV is present in only one sample, the minimal region corresponding to this CNV will have a single sample with a mutation. For this reason, we used basic functions from R to filter the regions so that at least 2% of the samples have a mutation. The procedure is described in Section 3.6.3.
10. Basic functions from R and polygenic from SOLAR (Blangero et al., 2015).

The script to analyze the identified CNVs uses simple functions from R. To estimate the CNV transmission rate, we use the polygenic function from Solar, which is described in Section 3.7.
11. Function polygenic from Solar (Blangero et al., 2015) and kinship2 package from R (Therneau and Sinnwell, 2015). To estimate the heritability of phenotypes using CNVs as covariates, we use the polygenic function from Solar and the function lmkin from the kinship2 package, as described in Section 3.7.

3.3. Preprocessing of SNP data

The CEL file is an Affymetrix-specific output file with the information for each sample. It stores the intensity value of each probe in the array and its standard deviation, a flag to indicate an outlier, user-defined flags, and the number of pixel values collected from an Affymetrix GeneArray scanner (Affymetrix, 2009a,b).

To generate the intensity values and SNP genotype calls from the CEL files of our samples, the Affymetrix Power Tools (APT) software was used. The procedure, described in McCall et al. (2010), involves:
• Quantile normalization;
• Median polish;
• SNP genotype calling (Birdseed).

Quantile normalization and median polish are used to normalize the intensity values of alleles A and B and to remove outliers from them. For a given SNP, when these intensity values from several samples are plotted, it is easy to observe the formation of three clusters, since there are three genotypic classes (AA, AB, BB). An example can be seen in Figure 3.4. SNP genotype calling is the procedure that predicts the genotypes based on the intensity values of alleles A and B.

3.3.1. Quantile normalization

Quantile normalization is a procedure to normalize two or more vectors so that they have the same or a similar distribution. This method is widely used in biostatistics for the analysis of data generated from experiments on DNA, RNA, and protein microarrays. In this analysis, the quantile normalization is performed to remove unknown variation possibly due to target preparation and hybridization (McCall et al., 2010). The step aims to make the distributions of SNP intensities of the individuals more comparable.

The algorithm for this procedure is described in Bolstad et al. (2003). It is based on the fact that two data vectors with the same distribution show, in a quantile-quantile plot, a straight diagonal line through the origin with direction given by the unit vector $(1/\sqrt{2}, 1/\sqrt{2})$. This can be generalized to n data vectors with the same distribution, whose n-dimensional quantile-quantile plot is a straight line through the origin with direction given by the unit vector $(1/\sqrt{n}, \ldots, 1/\sqrt{n})$. Thus, quantile normalization is the procedure of projecting the points of an n-dimensional quantile-quantile plot onto the diagonal, following a five-step algorithm:
1. Given n p-dimensional vectors to be normalized, create the matrix $X$ of dimension $p \times n$ in which each vector is a column;
2. Sort each column of $X$ to give $X_{\text{sort}}$;
3. Take the means across the rows of $X_{\text{sort}}$;
4. Substitute each element in a row by the row mean to get $X'_{\text{sort}}$;
5. Get $X_{\text{normalized}}$ by rearranging each column of $X'_{\text{sort}}$ to have the same ordering as the original $X$.

Although a quantile normalization function is available for R (package preprocessCore; Bolstad, 2001), a simplified example of how quantile normalization works, written as an R script, can be found in Appendix A. In our case, the quantile normalization is performed across chips; thus, n is the number of samples and p is the number of SNP/CN probes.
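The five steps above can be sketched in a few lines. The following is a minimal illustration in Python with NumPy, not the implementation used by the Affymetrix Power Tools or by preprocessCore; the function name quantile_normalize and the toy matrix are assumptions made for this example. Tied values within a column are ranked by their order of appearance rather than averaged, which is a simplification.

```python
import numpy as np


def quantile_normalize(X):
    """Quantile-normalize the columns of a p x n matrix (hypothetical helper).

    Rows play the role of probes and columns the role of samples (chips).
    """
    X = np.asarray(X, dtype=float)

    # Step 2: sort each column, remembering where each value came from.
    order = np.argsort(X, axis=0, kind="stable")
    X_sort = np.take_along_axis(X, order, axis=0)

    # Step 3: mean across each row of the sorted matrix.
    row_means = X_sort.mean(axis=1)

    # Steps 4-5: the value of rank k in every column is replaced by the
    # k-th row mean, written back at the value's original position.
    X_normalized = np.empty_like(X)
    np.put_along_axis(X_normalized, order, row_means[:, None], axis=0)
    return X_normalized


# Toy data: 4 probes (rows) measured on 3 samples (columns).
X_toy = np.array([[5.0, 4.0, 3.0],
                  [2.0, 1.0, 4.0],
                  [3.0, 7.0, 6.0],
                  [4.0, 2.0, 8.0]])
X_toy_normalized = quantile_normalize(X_toy)
```

After normalization every column contains the same set of values (the row means of the sorted matrix), while the ranking of the probes within each sample is preserved.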

3.3.2. Median polish

Following the robust multiarray analysis (RMA) (McCall et al., 2010), the median polish is performed after the quantile normalization to remove possible outliers in the array. This technique, described in Tukey (1970), extracts the effects of the row and column factors of a two-way table. It is similar to ANOVA, but uses medians instead of means. A matrix $X_{p \times n} = (x_{ij})$, for $i = 1, \ldots, p$ and $j = 1, \ldots, n$, can be decomposed as in Equation 3.1, in which $\alpha$ is a constant, $r_i$ is the effect associated with the i-th row, $c_j$ is the effect associated with the j-th column and $\epsilon_{ij}$ is the residual associated with the element $x_{ij}$ of the matrix.

$x_{ij} = \alpha + r_i + c_j + \epsilon_{ij}$    (3.1)

The median polish follows the algorithm:
1. Given a matrix $X_{p \times n}$, set $\alpha = 0$ as the constant, $r = (r_1, \ldots, r_p)$ and $c = (c_1, \ldots, c_n)$ as vectors of zeros of lengths p and n for the row and column effects, respectively, and $\delta = 0$ and $X^{(t)} = X$ as auxiliary variables;
2. For each row i of $X$, compute the median $m_{i\cdot}$, subtract $m_{i\cdot}$ from each element of the row $X_{i\cdot}$ and add $m_{i\cdot}$ to $r_i$;
3. Define $\delta$ as the median of $c$. Subtract $\delta$ from each element of $c$ and add $\delta$ to $\alpha$;
4. For each column j of $X$, compute the median $m_{\cdot j}$, subtract $m_{\cdot j}$ from each element of the column $X_{\cdot j}$ and add $m_{\cdot j}$ to $c_j$. This new $X$ represents the residual matrix;
5. Now, define $\delta$ as the median of $r$. Subtract $\delta$ from each element of $r$ and add $\delta$ to $\alpha$;
6. If the sum of the absolute values of $X$ is equal to 0 or $X$ is very similar to $X^{(t)}$, the values of the constant, the row and column effects and the residuals are taken as the final estimates. Otherwise, set $X^{(t)} = X$ and repeat steps 2-5 until one of the criteria is reached.
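The algorithm above can be written as a short loop. The following is a minimal sketch in Python with NumPy that follows steps 1-6 as listed; it is not the implementation used by the Affymetrix Power Tools. The function name median_polish, the iteration cap max_iter and the tolerance tol are assumptions introduced for this illustration (the text only asks that the residual matrix be "very similar" between iterations).

```python
import numpy as np


def median_polish(X, max_iter=10, tol=1e-8):
    """Decompose X as alpha + r_i + c_j + residual using medians (sketch)."""
    resid = np.asarray(X, dtype=float).copy()   # becomes the residual matrix
    p, n = resid.shape

    # Step 1: initialise the constant and the row and column effects.
    alpha = 0.0
    r = np.zeros(p)
    c = np.zeros(n)

    for _ in range(max_iter):
        previous = resid.copy()                 # plays the role of X^(t)

        # Step 2: sweep the row medians into the row effects.
        m_row = np.median(resid, axis=1)
        resid -= m_row[:, None]
        r += m_row

        # Step 3: move the median of the column effects into the constant.
        delta = np.median(c)
        c -= delta
        alpha += delta

        # Step 4: sweep the column medians into the column effects.
        m_col = np.median(resid, axis=0)
        resid -= m_col[None, :]
        c += m_col

        # Step 5: move the median of the row effects into the constant.
        delta = np.median(r)
        r -= delta
        alpha += delta

        # Step 6: stop when the residuals vanish or no longer change.
        if np.abs(resid).sum() == 0 or np.abs(resid - previous).sum() <= tol:
            break

    return alpha, r, c, resid


# Toy additive table with one artificial outlier in the first cell.
X_toy = np.array([[107.0, 9.0, 11.0],
                  [8.0, 10.0, 12.0],
                  [9.0, 11.0, 13.0]])
alpha_hat, r_hat, c_hat, resid_hat = median_polish(X_toy)
fitted = X_toy - resid_hat   # alpha + r_i + c_j, without the residual part
```

At every iteration the identity of Equation 3.1 holds exactly, so subtracting the residual matrix from the input leaves only the additive part alpha + r_i + c_j. In this toy table the outlier ends up entirely in the residual of the first cell.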

The residual matrix, $X^{(t)}$, is subtracted from the original $X$. This means that the outlier remains in the dataset, but the residual associated with it is removed. In our application, the row and column factors represent the effects of the probe (genetic variant) and of the chip (sample), and the procedure is used to protect against outlier probes (Affymetrix, 2017; Irizarry et al., 2003; Rabbee and Speed, 2006).

3.3.3. SNP Genotype Calling

Birdseed is a genotyping and clustering algorithm for Affymetrix SNP array platforms. For a given SNP, three genotypes are expected among the samples of a population: AA, AB and BB. When the intensities of alleles A and B are plotted for several samples, this can be observed as the formation of one to three clusters associated with those genotypes. As we can see in Figure 3.4, the first SNP has the three expected genotypes, while the second one has no samples with genotype BB.

Figure 3.4: Illustration of SNP clustering. Given the values of intensities A and B of a SNP, the plot shows the formation of clusters. The plot on the left forms three clusters, while the plot on the right forms two clusters. Source: Korn et al. (2008) (Supplemental Material).

Birdseed uses a customized Expectation-Maximization (EM) algorithm to fit two-dimensional Gaussians to the normalized and summarized (median polish) SNP data (A-signal vs. B-signal) and identify the clusters. This procedure gives a genotype and confidence scores for every