Quantifying the genetic predisposition to a complex disease through genome-wide association


Ana Margarida Carrapatoso Macedo

Master's degree thesis presented to Faculdade de Ciências da Universidade do Porto in Mathematical Engineering, 2019


Mathematical Engineering

Department of Mathematics 2019

Supervisor

Alexandra Lopes, Assistant Researcher, i3S – Instituto de Investigação e Inovação em Saúde, IPATIMUP - Instituto de Patologia e Imunologia Molecular da Universidade do Porto

Co-supervisor

Nádia Pinto, Junior Researcher, i3S – Instituto de Investigação e Inovação em Saúde, IPATIMUP - Instituto de Patologia e Imunologia Molecular da Universidade do Porto, CMUP – Centro de Matemática da Universidade do Porto

The President of the Jury,

Acknowledgements

To Ipatimup and the Population Genetics group, who welcomed me into the world of scientific research.

To my supervisors, Alexandra Lopes and Nádia Pinto, who were tireless and always available for my questions, believed in me and gave me the opportunity to be part of this project until the end.

To my brother, who corrected my strangest paragraphs whenever the English language failed me.

To my everyday companion, who never stops believing in me and encourages me to always do more.

To my parents, who provided me with a higher education and never doubted my choices.

Abstract

The main goal of this work was to contextualize and apply the methods used in genome-wide association studies, as well as studies of target regions with functional relevance to the phenotype under analysis.

The methods explored in detail included the quality control steps of genetic data, statistical tests for common-variant association (Pearson's chi-squared test, Fisher's exact test and the Cochran-Armitage test for trend) and the SKAT-O method, which combines burden and non-burden approaches to the study of rare-variant association.

Subsequently, the methods were applied to a data set of Alzheimer's disease (AD) patients and healthy controls from the north of the Iberian Peninsula, in the scope of the multicenter study "AD-EEGWA". It features new data from a population that is still understudied with regard to genetic association with this disease. Besides the genetic component and biographic data, we had access to electroencephalography measures for most of the study participants. The project is currently ongoing, and more biological samples are being collected to increase the power of the genetic analyses.

The SKAT-O method allowed us to identify one gene (PLEKHA5) with significantly different minor allele frequencies between cases and controls in our sample: collectively, the rare alleles were present in 10% of controls, but in just 1% of cases. This gene had previously been identified as having differential gene expression in astrocytes between AD cases and controls.

The common-variant association methods allowed us to identify nine SNPs with nominally significant differences in allele and genotype distribution between cases and controls, three of which had been associated with Alzheimer's disease in previous studies. We then investigated the possibility of an association between these genetic variants and the brainwaves obtained from EEGs. Four out of nine SNPs showed significant differences for some brainwaves, concerning mean relative power values within cases and controls with different alleles/genotypes. However, the obtained results did not match what was expected, considering the EEG brainwave behavior in AD cases and controls, and the possible risk allele of each SNP.

All in all, we concluded that, even in small samples, it is possible to find association between phenotypes and the aggregate effects of rare variants. It will be interesting to replicate the study in a larger sample and see whether the results hold. The lack of coherence between the obtained results and what was expected from the analysis of the genetics-EEG relation motivates new approaches to the problem. The complexity of this disease strongly calls for an interdisciplinary approach that explores the effect of genotypes on well-defined disease endophenotypes, to help diagnosis.

Keywords: association study, Alzheimer’s disease, complex phenotype, genetic heterogeneity, rare


Summary

The main objective of this work was to contextualize and apply the methods used in genome-wide association studies, as well as in studies of target regions with functional relevance to the phenotype under study.

The methods studied in more detail included the quality control steps for genetic data, statistical tests for common-variant association (Pearson's chi-squared test, Fisher's exact test and the Cochran-Armitage test for trend) and the SKAT-O method, which combines burden and non-burden approaches to the study of rare-variant association.

Subsequently, the methods were applied to a data set of Alzheimer's patients and healthy controls from the northern region of the Iberian Peninsula, within the scope of the multicenter project "AD-EEGWA". This is a new data set from a population that is still little studied with regard to genetic association with the disease. In addition to the genetic component and biographic data, we had access to electroencephalogram measures for most of the study participants. The project is still ongoing, and more biological samples are being collected in order to bring more power to the genetic analyses.

With the SKAT-O method, it was possible to identify one gene (PLEKHA5) with significant differences in the frequencies of its rare alleles between cases and controls in our sample: collectively, the rare alleles were present in about 10% of controls, but in only 1% of cases. This gene had previously been identified as having differential gene expression in astrocytes between Alzheimer's cases and controls.

The common-variant association methods made it possible to identify nine SNPs with nominally significant differences in the distribution of alleles and genotypes between cases and controls, three of which had already been associated with Alzheimer's disease in previous studies. We then investigated the possibility of an association between these genetic variants and the brainwaves obtained from the EEGs. Of the nine, four SNPs showed significant differences for some brainwaves, with respect to the mean "relative power" values in cases and controls with different alleles/genotypes. However, the results obtained did not match what was expected, taking into account the EEG values in Alzheimer's cases and controls and the possible risk allele of each SNP.

All in all, we concluded that, even in small samples, it is possible to find association between phenotypes and the aggregate effect of rare variants. It will be interesting to replicate the study in a larger sample and see whether the results hold. The lack of coherence between the results of the analysis of the genetics-EEG relation and what was expected motivates new approaches to this problem. The complexity of this disease strongly calls for an interdisciplinary approach that explores the effect of genotypes on well-defined endophenotypes of the disease, to assist in its diagnosis.

Keywords: association study, Alzheimer's disease, complex phenotype, heterogeneity

List of Tables

1.1 (a) Possible genetic constitution of the offspring resulting from crossing (AA) with (aa) individuals (1st hybrid generation); (b) Possible genetic constitution of the offspring resulting from crossing (Aa) individuals among each other (2nd hybrid generation).

1.2 Blood group (phenotype) of the offspring, depending on the ABO alleles (genotype) inherited from the parents.

1.3 Observed (left) and expected (right) haplotype frequencies in the population, under total linkage equilibrium.

1.4 Observed haplotype frequencies in the population, considering linkage disequilibrium.

1.5 Probability values of IBS state I = i conditional on IBD state Z = z for each value of I and Z.

1.6 Mating outcomes assuming Hardy-Weinberg equilibrium.

1.7 Causes of Alzheimer's disease.

1.8 APOE allele according to the genotype for SNPs rs429358 and rs7412.

3.1 (a) Contingency table of allele counts; (b) Contingency table of genotype counts.

3.2 Counts of cases and controls in each of the n exposure categories $E_j$, in a sample of c individuals.

4.1 Summary table of the individual counts after each per-individual QC step, according to disease status, gender and country of origin (note: whenever male and female counts did not add up to the "Total" column, it was due to the presence of individuals with unknown gender; this issue was overcome as of the sex check step).

4.2 Summary table of probe/variant counts at each per-marker QC step.

4.4 Identifier and description of each gene list to be tested for association, and number of genes and rare variants they contain.

4.5 Number of significant genes without and with p-value correction in each gene list and model (Model 1: sex as the only covariate; Model 2: sex, age, PC1 and PC2 as covariates).

4.6 Frequency and properties of PLEKHA5 rare variants identified in our sample. (*) in complete LD ($r^2 = 1.0$). Notes: "PHRED" refers to the PHRED-scaled CADD score; MA - minor allele; MAF - minor allele frequency; "gnomAD NFE" refers to the Non-Finnish European population of the gnomAD database (the number of genotyped alleles for each SNP was, respectively, 129 088, 129 088, 75 296 and 113 128); the p-values refer to Fisher's exact test for differences between control frequencies in our data and the gnomAD database.

4.7 P-values obtained in ANOVA tests when testing for differences in RP of each of the brainwaves between cases and controls.

4.8 P-values obtained in Tukey test for multiple comparisons of the RP in each brainwave between controls and cases in each disease stage (CON - controls; MIL - mild AD; MOD - moderate AD; SEV - severe AD). The values below $0.05/30 = 1.67 \times 10^{-3}$ are underlined and in bold.

4.9 Gene and p-values obtained in the allelic and genotypic tests for the SNPs which were nominally significant in both at the α = 0.05 level. (*) SNPs previously associated with AD.

4.10 Wald test p-values for the variation of relative power across each frequency band in cases and controls with different alleles for each of the considered SNPs. The values below 0.05 are underlined and in bold.

4.11 Wald test p-values for the variation of relative power across each frequency band in cases and controls with different genotypes for each of the considered SNPs. The values below 0.05 are underlined and in bold.

4.12 Minor and alternative alleles of each SNP and their respective frequencies of the minor allele in the sample, in cases and in controls. (*) SNP previously associated with AD.

C.1 Nominally significant genes in SKAT-O under null Model 1 tested for "All Genes" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

C.2 Nominally significant genes in SKAT-O under null Model 1 tested for "Dementia" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

C.3 Nominally significant genes in SKAT-O under null Model 1 tested for "DGE Brain" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

C.4 Nominally significant genes in SKAT-O under null Model 1 tested for "AD Disgenet" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

C.5 Nominally significant genes in SKAT-O under null Model 2 tested for "All Genes" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

C.6 Nominally significant genes in SKAT-O under null Model 2 tested for "Dementia" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

C.7 Nominally significant genes in SKAT-O under null Model 2 tested for "DGE Brain" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

C.8 Nominally significant genes in SKAT-O under null Model 2 tested for "AD Disgenet" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).

List of Figures

1.1 Mendel's fundamental experiment.

4.1 Missing data rate vs. heterozygosity across individuals passing the DQC and QCCR steps. Shading indicates sample density; the dashed lines represent the defined heterozygosity threshold; the outliers are highlighted in red.

4.2 (a) Plot of the first principal component against the second, calculated with the software EIGENSOFT; (b) Plot of the same PCs as in (a), zoomed in on the cluster which contains the European populations (the outliers are encircled in red). In the legends, PT and ES represent the Portuguese and Spanish individuals in our sample, respectively; the remaining populations come from the 1KGP dataset.

4.3 Distribution of age at the time of sample collection.

4.4 Proportion of variance explained by the first i principal components; (b) is a zoom-in of (a) on the first 100 PCs.

4.5 Plot of the first and second principal components, restricted to the sample subjects.

4.6 Probability density function of Beta(p, 1, 25).

4.7 Distribution of the relative power of frequency bands delta, theta, alpha, beta-1 and beta-2 in cases and controls.

4.8 Distribution of the relative power of frequency bands delta, theta, alpha, beta-1 and beta-2 in controls and mild, moderate and severe AD cases (CON - controls; MIL - mild AD; MOD - moderate AD; SEV - severe AD).

4.9 Plot of the mean relative power distribution in the different frequency bands of controls with each (a) allele or (b) genotype for SNP rs71336232.

4.10 Plot of the mean relative power distribution in the different frequency bands of controls

4.11 Plot of the mean relative power distribution in the different frequency bands of cases with each (a) allele or (b) genotype for SNP rs10833211.

4.12 Plot of the mean relative power distribution in the different frequency bands of cases with

List of Abbreviations

1KGP The 1000 Genomes Project

AD Alzheimer’s disease

APT Affymetrix Power Tools

ANOVA analysis of variance

CADD combined annotation dependent depletion

CDCV common disease common variant

CDRV common disease rare variant

CG candidate gene

CT computed tomography

df degrees of freedom

DGE differential gene expression

DNA deoxyribonucleic acid

DQC dish quality control

EEG electroencephalogram

GUI graphical user interface

GWA genome-wide association

HWE Hardy-Weinberg equilibrium

IBD identity by descent

IBS identity by state

indel insertion/deletion variant

LD linkage disequilibrium

MAF minor allele frequency

MCI mild cognitive impairment

MIL mild Alzheimer’s disease

MMSE Mini Mental State Examination

MOD moderate Alzheimer’s disease

MRI magnetic resonance imaging


OR odds ratio

PC(A) principal component (analysis)

QC quality control

QCCR quality control call rate

RNA ribonucleic acid

SEV severe Alzheimer’s disease

SKAT sequence kernel association test

SNP single nucleotide polymorphism

SNV single nucleotide variant


Introduction

This work aimed to describe the general procedures and methods used in genome-wide association studies, and to subsequently apply them to a data set composed of cases and controls of Alzheimer's disease from the Iberian Peninsula. The work is reflected in the present dissertation, which is structured as follows.

To place the work in context, we start by introducing in Chapter 1 some basic concepts of biology and population genetics, and the general idea behind association studies. We also briefly introduce Alzheimer’s disease, its symptoms, causes and means of diagnosis.

In Chapter 2, we describe the procedures relative to study design and quality control of genomic data. Here, we present relevant thresholds for various quality measures, as criteria to keep or discard individuals and genetic variants from the study.

Chapter 3 describes in detail the models of association used to study the effect of both common and rare variation on a complex phenotype. We focus on case-control studies, but also consider the case of quantitative traits whenever relevant.

Chapter 4 applies the previously described methods to a data set of Alzheimer's patients and controls. It is split into two distinct phases of analysis. In a first stage, given the modest sample size, we focus on the aggregate effect of rare variation on the disease, using the SKAT-O method. In a second and final stage, we assess the behavior of different brainwaves at various stages of the disease; we also seek a possible association between a set of genetic variants and the values of EEG at different frequency bands.

Finally, in Chapter 5, we make some considerations about the work, namely on the importance of an approach combining genetic data and other endophenotypes associated with the disease, for a hopefully

Chapter 1 Theoretical framework

1.1 Introductory concepts of biology and genetics

All our ideas about the transmission of specific characters and changes in the characteristics of populations make use of concepts introduced in 1865 by the one who is often referred to as the Father of Genetics, Gregor Mendel [1]. Working in a small garden with nothing but peas as his material, he was able to formulate a hypothesis that explains the inheritance of some traits in a very simple way.

The simplified experiment was as follows: Mendel crossed purebred plants with green peas and purebred plants with yellow peas, obtaining a first hybrid generation of all yellow peas; he then crossed these hybrid plants among each other and obtained peas of both colors, in a proportion of approximately 3 yellow to 1 green. This experiment is illustrated in Figure 1.1 and Table 1.1.

[Figure 1.1: Mendel's fundamental experiment. Initial generation: (AA) and (aa); 1st hybrid generation: (Aa), (Aa); 2nd hybrid generation: (AA), (Aa), (Aa), (aa).]

From these results, Mendel formulated his hypothesis for sexual reproduction, which can be expressed as follows:

1. Each character of an individual is controlled by two "factors", the alleles, one of which the individual receives from his father, and the other from his mother.

2. From the two alleles carried by the individual, one is expressed (dominant), while the effect of the other may not be apparent (recessive).

3. A reproductive cell (egg and sperm, in humans) produced by an individual bears, for each character, one and only one of the two alleles which the individual carries.

In the initial generation, the plants with yellow peas carried only "yellow" genes; their genetic representation for this trait is (AA). The plants with green peas carried only "green" genes, hence having genetic constitution (aa). Crossing individuals of these two types can only generate individuals of one type, (Aa). When crossing (Aa) individuals with each other, it is possible to obtain individuals with genetic constitution (AA), (aa) or (Aa), with probabilities 1/4, 1/4 and 1/2, respectively. Because A is dominant over a, the (Aa) peas are yellow, hence the obtained proportions of all yellow peas in the 1st hybrid generation, and 3 yellow peas to one green in the 2nd.

(a)      a      a          (b)      A      a
 A     (Aa)   (Aa)          A     (AA)   (Aa)
 A     (Aa)   (Aa)          a     (Aa)   (aa)

Tbl. 1.1 – (a) Possible genetic constitution of the offspring resulting from crossing (AA) with (aa) individuals (1st hybrid generation); (b) Possible genetic constitution of the offspring resulting from crossing (Aa) individuals among each other (2nd hybrid generation).
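The two crosses in Table 1.1 can also be enumerated programmatically. The following Python sketch is only an illustration under the assumptions of Mendel's hypothesis (each gamete carries one of the parent's two alleles with probability 1/2); the function names `cross` and `dominant_fraction` are hypothetical and not part of the original work.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent1: str, parent2: str) -> dict:
    """Offspring genotype probabilities for a single-locus cross.

    Each parent is a two-letter genotype such as "Aa"; every gamete
    carries one of the two parental alleles with probability 1/2.
    """
    counts = Counter()
    for allele1, allele2 in product(parent1, parent2):
        # Sort so that "aA" and "Aa" are counted as the same genotype
        # (uppercase letters sort before lowercase ones).
        counts["".join(sorted((allele1, allele2)))] += 1
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


def dominant_fraction(genotype_probs: dict, dominant: str = "A") -> Fraction:
    """Probability that the dominant phenotype (yellow peas) is shown."""
    return sum(p for g, p in genotype_probs.items() if dominant in g)


first_hybrid = cross("AA", "aa")    # panel (a) of Table 1.1
second_hybrid = cross("Aa", "Aa")   # panel (b) of Table 1.1
print(first_hybrid)
print(second_hybrid)
print(dominant_fraction(second_hybrid))
```

Exact fractions are used instead of floating-point numbers so that the 1/4, 1/2, 1/4 genotype probabilities and the 3-to-1 phenotype ratio can be read off directly.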

Mendel's experiment was a starting point for modeling the inheritance of other, more complex, phenotypes, namely diseases, involving the contribution and/or interaction of various genes [2].

The human body is composed of trillions of cells of many different types. Cells are constantly dying and being replaced by newly formed ones through the process of cell division, the source of an individual's development and growth. The very first cell of an individual already carries all of the genetic information he/she will bear throughout his/her whole life, encoded in their DNA.

DNA, or deoxyribonucleic acid, is a double helix molecule about 3 billion base pairs long, made of two complementary chains of nucleotides. Each nucleotide contains one of four bases, adenine (A), thymine (T), cytosine (C) and guanine (G), attached to one sugar molecule and one phosphate molecule. In each nucleotide chain, A pairs up with T in the opposite chain, and C with G. Therefore, taking one of the chains as reference, DNA can essentially be seen as a single sequence of As, Ts, Cs and Gs. During cell division, DNA replicates itself, providing each new cell with an identical copy of all genetic material (if no mutation occurs). This process is called "replication".
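As a small illustration of why one chain suffices to describe the molecule, the Python sketch below applies the pairing rule (A with T, C with G) position by position. The sequence and the function name `complement` are hypothetical examples, not data or code from this work.

```python
# Base-pairing rule from the text: A pairs with T, C pairs with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(chain: str) -> str:
    """Return the base that faces each position of `chain` on the opposite chain."""
    return "".join(PAIRING[base] for base in chain)


reference = "ATGCCGTA"            # hypothetical reference chain
opposite = complement(reference)
print(opposite)

# Complementing twice recovers the reference chain, which is why the
# reference alone carries all the information of the double helix.
assert complement(opposite) == reference
```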

DNA is tightly coiled into structures called "chromosomes". Most human cells, the somatic cells, contain 23 pairs of chromosomes, for a total of 46, and are hence called "diploid". The exceptions are the reproductive cells or gametes, which carry only half of an individual's genetic material, i.e., 23 unpaired chromosomes, thus being "haploid".

During cell division, a phenomenon called "genetic recombination" occurs, which is the crossover of the arms of each chromosome in the cell, leading to the exchange of genetic material between them. Recombination occurs between homologous regions, meaning that the alleles or genes are in a similar order of arrangement in both chromosomes.

In somatic cells, all chromosome pairs but one are termed "autosomes"; the one pair that differs in its structure, usually referred to as the 23rd pair, is the pair of sex chromosomes, which determines human gender: females have two X chromosomes, whereas males have one X and one Y. In each reproductive cell, since only half of the genetic material is present, there is an X chromosome in female gametes and either an X or a Y in male ones; indeed, it is the information carried in the paternal gamete that determines the sex of the offspring.

The set of all genetic information of an individual is called his/her "genome". The "exome" is the small fraction (about 1%) of the genome known to encode proteins. The information to produce a functional protein is encoded in a special unit of DNA at a determined genetic site ("locus"), called a "gene". Each gene can have various modes of action, determined by variants of its sequence. The multiple versions of a gene are called "alleles". A gene is said to be "polymorphic" if, for its locus, the minor allele frequency (MAF) is at least 1% within a population [3].
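To make the 1% criterion concrete, the sketch below computes the MAF of a biallelic locus from genotype counts and checks it against the threshold. The counts and the function names are hypothetical and serve only as an illustration; real pipelines would typically rely on dedicated software for this step.

```python
def minor_allele_frequency(n_AA: int, n_Aa: int, n_aa: int) -> float:
    """MAF of a biallelic locus, given the number of individuals per genotype."""
    total_alleles = 2 * (n_AA + n_Aa + n_aa)   # two alleles per individual
    freq_A = (2 * n_AA + n_Aa) / total_alleles
    return min(freq_A, 1.0 - freq_A)           # frequency of the rarer allele


def is_polymorphic(maf: float, threshold: float = 0.01) -> bool:
    """Apply the 'MAF of at least 1%' criterion stated in the text."""
    return maf >= threshold


# Hypothetical genotype counts in a sample of 1000 individuals.
common_locus = minor_allele_frequency(n_AA=900, n_Aa=95, n_aa=5)
rare_locus = minor_allele_frequency(n_AA=995, n_Aa=5, n_aa=0)
print(common_locus, is_polymorphic(common_locus))
print(rare_locus, is_polymorphic(rare_locus))
```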

Because humans have twenty-two pairs of autosomes, most genes are represented twice in our genome, through alleles that may or may not be identical. A "genotype" is defined as the set of alleles found at an individual's locus. A "phenotype" is an observable trait that concerns a particular locus (or the combined action of several loci). For example, in Mendel's experiment described in 1.1, the color of the peas is a phenotype.

An "endophenotype" is somewhere between the previous two definitions: it is a quantitative, non-observable trait that also shows a genetic connection. A set of genotypes observed at linked loci of one individual is called his/her "haplotype".

For each autosomal locus, an individual is said to be "homozygous" if he/she carries two copies of the same allele, and "heterozygous" if the alleles are different. We can also refer to homozygous individuals as "homozygotes", and to heterozygous individuals as "heterozygotes".

It is the union of one male and one female reproductive cell that generates a "zygote", i.e., a fertilized egg cell, with a full complement of hereditary information necessary for the development of a human being. At the moment of fertilization, the new individual receives for each autosomal locus one allele from his father and one from his mother; as for the 23rd pair, if the new individual is female, then she has inherited one X chromosome from each of her parents, while if he is male, he has inherited an X chromosome from his mother and a Y from his father. Women can be homozygous or heterozygous for genes in the 23rd pair; men can only be "hemizygous", because X and Y are not homologous along their whole length, except for short "pseudoautosomal regions" at their tips.

Individuals with different genotypes may show similar phenotypes due to genetic dominance-recessiveness relationships. Allele A is said to be "dominant" to a (or, equivalently, a is "recessive" to A) if the action of A, but not that of a, is manifested in the phenotype of an (Aa) heterozygote. Alleles A and B are said to be co-dominant if they are both expressed in an (AB) individual's phenotype. An example of co-dominant expression is the AB blood type in the human ABO blood system, as shown in Table 1.2 [3]. This table also illustrates the dominance of alleles A and B over allele O.

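        A      B      O
A       A      AB     A
B       AB     B      B
O       A      B      O

Tbl. 1.2 – Blood group (phenotype) of the offspring, depending on the ABO alleles (genotype) inherited from the parents.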
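The dominance and co-dominance rules of the ABO system can be written as a short genotype-to-phenotype mapping. The Python sketch below is a hedged illustration that regenerates the entries of Table 1.2; the function name `blood_group` is a hypothetical choice.

```python
def blood_group(allele1: str, allele2: str) -> str:
    """Blood group (phenotype) for a pair of ABO alleles (genotype)."""
    alleles = {allele1, allele2}
    if alleles == {"A", "B"}:
        return "AB"   # A and B are co-dominant: both are expressed
    if "A" in alleles:
        return "A"    # A is dominant over O
    if "B" in alleles:
        return "B"    # B is dominant over O
    return "O"        # only (OO) individuals show group O


# Print the same 3 x 3 layout as the table: one row per first allele.
for first in "ABO":
    print(first, [blood_group(first, second) for second in "ABO"])
```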

1.2 Population genetics concepts

Population genetics involves the study of genetic variation within and between populations, by examining allele frequencies at different loci over time and space. Mathematical models are used to investigate and predict the occurrence of specific alleles (or combinations of alleles) in populations, based on the ever increasing understanding of genetics and evolution. As such, it becomes necessary to introduce some concepts regarding genetic variations, and how they relate to each other and influence the evolution of species.

1.2.1 Genetic variants

Chromosomes are not perfectly stable entities: changes in the DNA sequence may occur as a result of external or internal factors, such as the interaction with radiation, chemicals or viruses, or simply an error during the replication process. These changes are called "mutations", and give rise to genetic variants; if their frequency among a population is above 1%, they are considered common and called "polymorphisms" [3].

For example, the replacement of one nucleotide by another is called a "single nucleotide variant" (SNV), or a "single nucleotide polymorphism" (SNP) in case it is common. The most common single base-pair changes occur within each of the two classes of nucleotide bases, purines (A↔G) and pyrimidines (C↔T); thus, most SNPs in a population are "biallelic". Another class of well-known variants are indels, which are insertions or deletions of a portion of DNA, no larger than 1 000 bases, in the genome.

Variants are part of the evolution of species. Some variants do not change an individual's phenotype, while others may greatly affect it; some variants can increase an individual's fitness in the surrounding environment, while others can have deleterious effects and generate disease. "Monogenic" diseases result from deleterious variants in a single gene. They are inherited according to Mendel's laws, hence also being called "Mendelian" diseases. If a disease results from the joint contribution of a number of independently acting or interacting genes, it is called "polygenic".

Variants can have numerous classifications depending on their length, placement and function. Exonic variants are located in portions of a gene that will encode a part of the final mature RNA produced by

regions but may have an important role in regulating gene expression. These are variants of potential functional importance and could be good candidates for further analysis in association studies.

An SNV that is in a coding region of the genome but results in no change to the encoded amino acid is called a "synonymous" substitution; when an SNV changes the encoded amino acid, and thus affects the protein, it is termed "non-synonymous". Indel variants may yield "frameshift" variations, which are potentially deleterious. "Stop-gain" and "stop-loss" variations result, respectively, in a premature termination and an abnormal extension of the protein translation process, and thus alter the protein itself. Of these classes of genetic variants, the synonymous substitutions are the least functionally relevant, and it is not uncommon to prioritize the analysis of variants falling in the remaining classes when searching for an association with a phenotype or disease [3].

Some genetic variants are known to be more likely to "travel together" from generation to generation than would be expected if different loci associated in a random manner. This phenomenon of non-random association is termed "linkage disequilibrium".

1.2.2 Linkage disequilibrium

Various studies have confirmed that the inheritance of certain alleles within a population is often correlated, causing many individuals to share the same haplotype. The alleles are thus said to be in linkage disequilibrium (LD). Even though genetic distance influences LD, it does not necessarily cause it; two loci being in LD simply means that the alleles appear together in the same population more (or less) frequently than chance would have us expect.

Suppose that allele A at locus 1 and allele B at locus 2 are found at frequencies p and q, respectively, in the population. If the two loci were independent, then we would expect to see the [AB] haplotype at frequency pq; however, if the frequency of the [AB] haplotype was either higher or lower than pq, then the two loci could be in LD.

Let us consider two biallelic loci on the same chromosome, with alleles A and a at the first locus, and B and b at the second. Their allelic frequencies in the population are $p_A$, $p_a$, $p_B$ and $p_b$ (note that, because the loci are biallelic, $p_a = 1 - p_A$ and $p_b = 1 - p_B$); the haplotype frequencies are $p_{AB}$, $p_{Ab}$, $p_{aB}$ and $p_{ab}$. Table 1.3 shows the observed and expected haplotype frequencies under linkage equilibrium; Table 1.4 shows the observed haplotype frequencies when linkage disequilibrium is considered.

     Observed frequencies                 Expected frequencies
         B            b                       B             b
A    $p_{AB}$     $p_{Ab}$           A    $p_A p_B$     $p_A p_b$
a    $p_{aB}$     $p_{ab}$           a    $p_a p_B$     $p_a p_b$

Tbl. 1.3 – Observed (left) and expected (right) haplotype frequencies in the population, under total linkage equilibrium.

Observed frequencies
             B                  b                Total
A        $p_A p_B + D$      $p_A p_b - D$      $p_A$
a        $p_a p_B - D$      $p_a p_b + D$      $p_a$
Total    $p_B$              $p_b$

Tbl. 1.4 – Observed haplotype frequencies in the population, considering linkage disequilibrium.

The measure of linkage disequilibrium D is the difference between the observed haplotype frequency and the one expected under equilibrium, defined as

$$D = p_{AB} - p_A p_B.$$

In order to standardize D, we need to find its boundaries. Using the fact that the observed frequencies must be non-negative, we obtain

$$
\begin{aligned}
p_A p_B + D \ge 0 &\iff D \ge -p_A p_B \\
p_a p_b + D \ge 0 &\iff D \ge -p_a p_b \\
p_A p_b - D \ge 0 &\iff D \le p_A p_b \\
p_a p_B - D \ge 0 &\iff D \le p_a p_B
\end{aligned}
$$

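Thus we can define

$$D' = \frac{D}{D_{\max}},$$

where

$$
D_{\max} =
\begin{cases}
\min\{p_A p_b,\; p_a p_B\}, & \text{if } D > 0 \\
\max\{-p_A p_B,\; -p_a p_b\}, & \text{if } D < 0.
\end{cases}
$$

This normalization bounds $D'$ by 1 in absolute value. Because $D$ and $D_{\max}$ share the same sign under the definition above, $D'$ lies between 0 and 1; under the alternative convention in which $D_{\max}$ is taken as the positive bound $\min\{p_A p_B,\; p_a p_b\}$ for $D < 0$, $D'$ ranges between −1 and 1. When $|D'| = 1$, at least one of the haplotypes was not observed; if allele frequencies are similar, a high $|D'|$ value means the markers are good surrogates for each other.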

Another widely used measure of LD between loci, preferred by population geneticists, is Pearson’s coefficient of correlation r,

\[ r = \frac{D}{\sqrt{p_A\, p_a\, p_B\, p_b}} \]

or, more commonly, its squared value ($r^2$). When $r^2 = 1$, the two loci are in total linkage disequilibrium, i.e., they both provide identical information; if $r^2 = 0$, they are in perfect equilibrium, i.e., the genetic information is transmitted independently [4]. Typically, two loci are considered to be correlated when an $r^2$ value greater than 0.2 is achieved [5].
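As a minimal sketch of the measures above, the following Python function computes D, D′, r and r² from one haplotype frequency and the two allele frequencies. The function name and argument names are illustrative only. For negative D the bound is taken as a positive magnitude, so that D′ keeps the sign of D and lies between −1 and 1; both loci are assumed to be polymorphic.

```python
from math import sqrt


def ld_measures(p_AB, p_A, p_B):
    """Return (D, D', r, r^2) for two biallelic loci.

    p_AB is the observed frequency of the [AB] haplotype; p_A and p_B are
    the allele frequencies. Both loci are assumed polymorphic (0 < p < 1).
    """
    p_a, p_b = 1.0 - p_A, 1.0 - p_B
    D = p_AB - p_A * p_B                      # observed minus expected
    if D >= 0:
        d_max = min(p_A * p_b, p_a * p_B)     # upper bound on D
    else:
        d_max = min(p_A * p_B, p_a * p_b)     # magnitude of the lower bound
    d_prime = D / d_max
    r = D / sqrt(p_A * p_a * p_B * p_b)
    return D, d_prime, r, r * r


# Haplotype [Ab] is absent here (p_Ab = p_A - p_AB = 0),
# while the allele frequencies at the two loci differ.
D, d_prime, r, r2 = ld_measures(p_AB=0.3, p_A=0.3, p_B=0.6)
```

In this example D′ equals 1 because one haplotype is missing, yet r² is only about 0.29, which illustrates why a high D′ indicates good surrogate markers only when allele frequencies are similar.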

Linkage disequilibrium is of major importance in association studies, namely at the level of marker selection in the study design phase. Indeed, mapping LD across the human genome has made it possible to infer an individual’s genotype at a given locus from other loci in high disequilibrium with it. This is done by strategically choosing single tagSNPs to represent entire haplotypes of regions in high LD, which results in a less costly study.

1.2.3 Identity by descent

Even after taking LD into account, loci which are independent within a population may still show

significant similarities among individuals, introducing a degree of relatedness which must be accounted for, especially when performing an association study with a sample of unrelated individuals.

If relatives are present, a bias may be introduced to the study, because the genotypes within families will be over-represented and the sample may no longer be an accurate reflection of the allele frequencies

in the entire population.

An important measure of relatedness used to identify such cases is identity by descent (IBD): two alleles are IBD if they are descended from the same ancestral allele. Mutation breaks identity by descent. Two individuals are said to be related if they may share IBD alleles. There is some point in the past beyond which individuals are assumed to be unrelated.

Identical twins are expected to have a proportion of shared IBD alleles equal to 1; first-degree relatives, 0.5; second-degree relatives, 0.25; and so on [5].

A similar concept is that of identity by state (IBS), which is based on the average proportion of indistinguishable alleles shared at genotyped variants for each pair of individuals. Therefore, two alleles which are IBD are also IBS, but the opposite may not be true, because alleles IBS may not originate from the same common ancestor; similarly, a pair of individuals may share more alleles IBS than IBD, but the opposite can never occur.

Purcell et al. [6] considered a method-of-moments approach to estimate the probability of sharing 0, 1, or 2 IBD alleles for any pair of individuals from the same homogeneous, random-mating population. Denoting IBS states as I and IBD states as Z (in both cases, the possible states being 0, 1, and 2), we have that

\[
\begin{aligned}
P(Z=0) &= \frac{N(I=0)}{N(I=0 \mid Z=0)} \\
P(Z=1) &= \frac{N(I=1) - P(Z=0)\,N(I=1 \mid Z=0)}{N(I=1 \mid Z=1)} \\
P(Z=2) &= \frac{N(I=2) - P(Z=0)\,N(I=2 \mid Z=0) - P(Z=1)\,N(I=2 \mid Z=1)}{N(I=2 \mid Z=2)}
\end{aligned}
\]

where $N(I=i)$ is the observed count of variants with IBS state $I = i$, and $N(I=i \mid Z=z)$ is the expected count of variants with IBS state $I = i$ conditional on IBD state $Z = z$ for the entire genome, defined as

\[ N(I=i \mid Z=z) = \sum_{m=1}^{L} P(I=i \mid Z=z) \]

where the summation is over all L variants with genotype data on both individuals, and the conditional probabilities are calculated as in Table 1.5.

We can thus define the proportion of alleles shared IBD as

\[ \hat{\pi} = \frac{P(Z=1)}{2} + P(Z=2). \]

     I    Z    P(I | Z)
     0    0    2p²q²
     1    0    4p³q + 4pq³
     2    0    p⁴ + 4p²q² + q⁴
     0    1    0
     1    1    2p²q + 2pq²
     2    1    p³ + p²q + pq² + q³
     0    2    0
     1    2    0
     2    2    1

Tbl. 1.5 – Probability values of IBS state I = i conditional on IBD state Z = z for each value of I and Z.

Due to genotyping errors, LD and population structure, a $\hat{\pi}$ value higher than 0.98 is considered sufficient to treat two samples under analysis as duplicates. The usual procedure is to remove one individual from each pair with $\hat{\pi} > 0.1875$ (a value halfway between the expected IBD for second- and third-degree relatives) [5].
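The method-of-moments estimates can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names are hypothetical, p denotes the frequency of one allele at each variant, the proportion of alleles shared IBD is taken as P(Z = 1)/2 + P(Z = 2) as in Purcell et al., and the adjustments that software implementations typically apply (for example, constraining the estimates to valid ranges) are omitted.

```python
def expected_ibs_counts(allele_freqs):
    """Sum P(I = i | Z = z) from Table 1.5 over all variants.

    Returns exp[z][i], the expected count of variants in IBS state i
    given IBD state z; allele_freqs holds one allele frequency p per variant.
    """
    exp = [[0.0] * 3 for _ in range(3)]
    for p in allele_freqs:
        q = 1.0 - p
        exp[0][0] += 2 * p**2 * q**2
        exp[0][1] += 4 * p**3 * q + 4 * p * q**3
        exp[0][2] += p**4 + 4 * p**2 * q**2 + q**4
        exp[1][1] += 2 * p**2 * q + 2 * p * q**2
        exp[1][2] += p**3 + p**2 * q + p * q**2 + q**3
        exp[2][2] += 1.0
    return exp


def ibd_method_of_moments(obs, exp):
    """Estimate P(Z=0), P(Z=1), P(Z=2) and pi_hat for one pair of individuals.

    obs[i] is the observed count of variants in IBS state i (i = 0, 1, 2).
    """
    z0 = obs[0] / exp[0][0]
    z1 = (obs[1] - z0 * exp[0][1]) / exp[1][1]
    z2 = (obs[2] - z0 * exp[0][2] - z1 * exp[1][2]) / exp[2][2]
    pi_hat = z1 / 2 + z2          # proportion of alleles shared IBD
    return z0, z1, z2, pi_hat
```

A pair whose estimate `pi_hat` exceeds the chosen threshold (0.1875 in the procedure described above) would then be flagged, and one of the two individuals removed.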

1.2.4 The Hardy-Weinberg equilibrium

The Hardy-Weinberg equilibrium (HWE) is a law of genetics which states that allele and genotype frequencies in a population will remain constant from generation to generation, under the following

assumptions:

(a) the population size is so large that it can be treated as infinite;

(b) generations are discrete, and individuals from different generations do not breed together;

(c) mating is at random;

(d) migration does not occur;

(e) selection does not occur (i.e., individuals with different genotypes are assumed to have equal fitness);

(f) mutations do not occur (i.e., individuals with genotype (AiAj) can only produce gametes with an

Ai or an Aj allele at that locus);

(g) initial genotype frequencies are equal in the two sexes.

The equilibrium in autosomal loci

Let us suppose, for simplicity and because most human loci are biallelic, that there are n = 2 observed alleles, $A_1$ and $A_2$, with proportions p and q = 1 − p, respectively, for a given locus in a population. There are 3 possible genotypes, $(A_1A_1)$, $(A_1A_2)$ (identical to $(A_2A_1)$) and $(A_2A_2)$, with initial proportions u, v and w, respectively.

From the genotype proportions, it is possible to deduce the allele proportions:

\[ p = u + \tfrac{1}{2}v, \qquad q = w + \tfrac{1}{2}v. \]

Under the stated assumptions, the next generation will be composed as shown in Table 1.6.

     Mating type          Frequency    Nature of offspring
     (A1A1) × (A1A1)      u²           (A1A1)
     (A1A1) × (A1A2)      2uv          ½(A1A1) + ½(A1A2)
     (A1A1) × (A2A2)      2uw          (A1A2)
     (A1A2) × (A1A2)      v²           ¼(A1A1) + ½(A1A2) + ¼(A2A2)
     (A1A2) × (A2A2)      2vw          ½(A1A2) + ½(A2A2)
     (A2A2) × (A2A2)      w²           (A2A2)

Tbl. 1.6 – Mating outcomes assuming Hardy-Weinberg equilibrium.

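The frequencies of genotypes $(A_1A_1)$, $(A_1A_2)$ and $(A_2A_2)$ in the first generation are, respectively,

\[
\begin{aligned}
u^2 + uv + \tfrac{1}{4}v^2 &= \left(u + \tfrac{1}{2}v\right)^2 = p^2 \\
uv + 2uw + \tfrac{1}{2}v^2 + vw &= 2\left(u + \tfrac{1}{2}v\right)\left(w + \tfrac{1}{2}v\right) = 2pq \\
\tfrac{1}{4}v^2 + vw + w^2 &= \left(w + \tfrac{1}{2}v\right)^2 = q^2
\end{aligned}
\tag{1.1}
\]

and, for the second generation,

\[
\begin{aligned}
\left(p^2 + \tfrac{1}{2}\,2pq\right)^2 &= [p(p+q)]^2 = p^2 \\
2\left(p^2 + \tfrac{1}{2}\,2pq\right)\left(q^2 + \tfrac{1}{2}\,2pq\right) &= 2p(p+q)\,q(p+q) = 2pq \\
\left(q^2 + \tfrac{1}{2}\,2pq\right)^2 &= [q(p+q)]^2 = q^2
\end{aligned}
\tag{1.2}
\]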

meaning that, after a single round of random mating under the conditions above, the genotype frequencies stabilize at Hardy-Weinberg proportions [7].
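The result can be checked numerically with a small sketch. The function below (a hypothetical helper, not part of any package) sums the contributions of the six mating types of Table 1.6 to each offspring genotype; applying it twice to arbitrary starting proportions shows that the frequencies no longer change after the first round.

```python
def next_generation(u, v, w):
    """Genotype frequencies (A1A1, A1A2, A2A2) after one round of random mating.

    u, v, w are the current frequencies of A1A1, A1A2 and A2A2; each term is
    a mating-type frequency times the offspring fraction from Table 1.6.
    """
    a1a1 = u * u + 0.5 * (2 * u * v) + 0.25 * v * v
    a1a2 = 0.5 * (2 * u * v) + 2 * u * w + 0.5 * v * v + 0.5 * (2 * v * w)
    a2a2 = 0.25 * v * v + 0.5 * (2 * v * w) + w * w
    return a1a1, a1a2, a2a2


# Arbitrary starting proportions that are not in Hardy-Weinberg proportions.
u, v, w = 0.5, 0.2, 0.3
p = u + v / 2                      # allele frequency of A1 (0.6 here)
first = next_generation(u, v, w)
second = next_generation(*first)
```

With these starting values, `first` and `second` both agree with (p², 2pq, q²) up to floating-point rounding.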

Testing for equilibrium

Departures from HWE are generally measured at a given SNP using a χ² goodness-of-fit test between the observed and expected genotypes. The χ² statistic is defined as

\[ \chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \tag{1.3} \]

where $O_i$ and $E_i$ are the observed and expected absolute frequencies of each of the n genotypes in a population at that locus. This test statistic has a χ² distribution with n − 1 degrees of freedom [8] when the expected frequencies are fully specified in advance; when the allele frequencies are estimated from the same sample, as is usual for a biallelic SNP, one further degree of freedom is lost. A deviation from HWE implies a violation of at least one of the assumptions stated above; it is usually
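The goodness-of-fit test for a single biallelic SNP can be sketched as follows. This is an illustrative implementation rather than the procedure of any specific package: the function name is hypothetical, the allele frequency is estimated from the sample itself, and the p-value is therefore taken from a χ² distribution with one degree of freedom (an assumption of this sketch), for which the survival function can be written with `math.erfc`. The marker is assumed to be polymorphic, so that all expected counts are positive.

```python
from math import erfc, sqrt


def hwe_chi2(n_11, n_12, n_22):
    """Chi-square test for HWE from genotype counts (A1A1, A1A2, A2A2)."""
    n = n_11 + n_12 + n_22
    p = (2 * n_11 + n_12) / (2 * n)          # estimated frequency of A1
    q = 1.0 - p
    observed = (n_11, n_12, n_22)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = erfc(sqrt(chi2 / 2))           # chi-square survival function, 1 df
    return chi2, p_value


# A sample with fewer heterozygotes than expected for p = 0.6.
chi2, p_value = hwe_chi2(400, 400, 200)
```

In practice, a SNP would be flagged when `p_value` falls below a threshold chosen by the analyst.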


1.2.5 Population substructure

Population substructure, also referred to as population admixture or population stratification, is the

presence of genetic differences between subpopulations of an apparently homogeneous population due to genetic history (e.g., migration, selection, and/or ethnic integration). Principal component analysis

(PCA) is widely used to detect and visualize hidden population substructure that is not apparent in the data and which may lead to spurious results when analyzing characteristics of the population as a whole [8].

The central idea of PCA is to reduce the dimensionality of a data set consisting of a large number of interrelated variables (in the case of an association study, these are the thousands of genetic markers),

while retaining as much of the variation present in the data set as possible. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated, and which

are ordered so that the first few retain most of the variation present in all of the original variables.

Assuming our markers as biallelic, the data can be seen as a large rectangular matrix C, with rows indexed by individuals, and columns indexed by polymorphic markers. For each marker, there is a

reference and an alternative allele. We suppose there are n markers and m individuals, and that the number of markers is much larger than the number of samples, n ≫ m.

Let C(i, j) be the number of reference alleles for marker j, individual i. Thus, for autosomal loci, we have C(i, j) ∈ {0, 1, 2}. For each column of C, we calculate its mean µ(j) and standard deviation σ(j), and obtain a new matrix M,

\[ M(i,j) = \frac{C(i,j) - \mu(j)}{\sigma(j)}, \]

generally called the variance-standardized genetic relationship matrix. This step of normalization is intended to make the markers (co-variables) comparable, reducing their mean and variance to 0 and 1,

respectively. With this matrix, we can now define

\[ X = M M^{T}, \]

a square m × m matrix, with dimensions equal to the number of sampled individuals. We then compute the eigenvalues of matrix X and the corresponding eigenvectors, which are called the PCs. The eigenvector corresponding to the highest eigenvalue is PC1; the eigenvector corresponding to the second highest eigenvalue is PC2; and so on. The variation

explained by PCs decreases, with the first PC explaining the most variation [9].

Plotting PCs against each other can show evidence of population substructure, by clustering the individual data across these new axes of variation. Those PCs which are found to be significant can subsequently be used as co-variables in regression models (see Chapter 3).
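The construction described above can be sketched with NumPy. This is a simplified illustration and not the EIGENSOFT implementation: the function name is hypothetical, the column standard deviation is the plain sample value returned by `numpy.std`, and monomorphic markers (zero variance) are dropped because they cannot be standardized.

```python
import numpy as np


def genotype_pcs(C, n_pcs=2):
    """Principal components of an (individuals x markers) genotype matrix.

    C[i, j] is the number of reference alleles (0, 1 or 2) of individual i
    at marker j. Returns the first n_pcs eigenvectors of X = M M^T and the
    fraction of total variation each one explains.
    """
    C = np.asarray(C, dtype=float)
    mu = C.mean(axis=0)
    sigma = C.std(axis=0)
    keep = sigma > 0                          # drop monomorphic markers
    M = (C[:, keep] - mu[keep]) / sigma[keep]
    X = M @ M.T                               # m x m matrix
    eigvals, eigvecs = np.linalg.eigh(X)      # eigh: X is symmetric
    order = np.argsort(eigvals)[::-1]         # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = eigvals / eigvals.sum()
    return eigvecs[:, :n_pcs], explained[:n_pcs]
```

Plotting the first column of the returned array against the second gives the usual PC1 versus PC2 scatter plot, in which individuals from differentiated subpopulations tend to form separate clusters.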

1.3 Association studies

Variation in a DNA sequence can influence the risk of developing disease. Early studies investigated genetic variants underlying rare conditions that showed clear Mendelian inheritance patterns in families,

and turned out to be very successful due to these variants carrying 100% disease risk [10].

Scientific efforts have been made to put together as much information about the human genome variation as possible, allowing for better design and less costly studies. Such efforts include the Human

Genome Project (1990-2003), the International HapMap Project (2002-2009) and, more recently, the 1000 Genomes Project (2008-2015).

Investigating the causes of complex diseases has proven to be a much more difficult task, because

there is not one single cause, but rather the combined action of many causal factors, genetic and/or environmental, that predispose to disease development. This means that even variants with a low increased relative risk, when found together in the same genome, may contribute significantly to the manifestation of the disorder in question. Genetic association studies aim to detect such variants involved in complex diseases.

The fundamental idea behind association studies is the comparison of allele or genotype frequencies between cases and controls, in order to relate genetic variants to a certain phenotype (such as a

disease); if a particular allele/genotype is more common among cases than controls, it may be a risk factor and may be subject to further study [8].
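To make the comparison concrete, the sketch below contrasts allele counts between cases and controls at a single variant. It assumes one common choice of test, Pearson's χ² on the 2 × 2 table of allele counts (one degree of freedom), together with the allelic odds ratio; the function name and the counts are invented for illustration, and all four counts are assumed to be positive.

```python
from math import erfc, sqrt


def allelic_test(case_counts, control_counts):
    """Allelic association test on a 2 x 2 table of allele counts.

    Each argument is (count of the candidate allele, count of the other allele).
    Returns the odds ratio, the chi-square statistic and its p-value (1 df).
    """
    a, b = case_counts
    c, d = control_counts
    n = a + b + c + d
    odds_ratio = (a * d) / (b * c)
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = erfc(sqrt(chi2 / 2))        # chi-square survival function, 1 df
    return odds_ratio, chi2, p_value


# Hypothetical counts: 500 cases and 500 controls, i.e. 1000 alleles per group.
odds_ratio, chi2, p_value = allelic_test((300, 700), (200, 800))
```

Here the candidate allele has frequency 0.30 in cases and 0.20 in controls, giving an odds ratio of about 1.71.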

There are two main theories for disease-associated variants: the “common disease common variant” (CDCV) hypothesis, and the “common disease rare variant” (CDRV) hypothesis. These hypotheses argue contrary views concerning which variants carry the most penetrance, i.e., the proportion of individuals carrying a variant who express the associated phenotype. The first argues that genetic variations with appreciable frequency in the population, but relatively low penetrance, are the major contributors to genetic susceptibility to common diseases; the second reasons that multiple rare DNA sequence variations, each with relatively high penetrance, are the major contributors to genetic susceptibility to common diseases. Both hypotheses stand on empirical evidence, and each uses specific methods for association analysis [11].

In the early days of association studies, the initial choice was to focus on common variants, mainly due to genome-wide surveys of rare variation requiring many more assays than the arrays available at

the time could support. However, there was strong motivation to support the CDRV hypothesis, namely the idea that deleterious variants are likely to be rare due to purifying selection; indeed, loss-of-function

variants, which prevent the generation of functional proteins, are especially rare [12].

A number of software tools have been developed to deal with genetic data files, which are typically very large due to the thousands of variants analyzed in most genetic studies. These programs are mostly command-line based, which makes dealing with these types of files more computationally efficient than working with GUI-based software; they are also readily and freely available online. One such program is PLINK [6], which offers many basic commands, such as calculating allele frequencies, IBD and heterozygosity, converting between multiple file types and performing basic allelic/genotypic chi-squared association tests. It also performs PCA, but a more specific command-line program for this purpose is EIGENSOFT. Among others, this program contains the EIGENSTRAT stratification correction method, which uses principal component analysis to explicitly model ancestry differences between cases and controls along continuous axes of variation.

One last command-line program worth mentioning is ANNOVAR. This tool, given a list of variants and their corresponding genomic coordinates, uses a number of databases to functionally annotate them. The resulting features of each variant include the gene it belongs to, whether or not it falls in a coding region, and whether or not it changes the encoded amino acid, among others. An important feature provided by ANNOVAR is the Combined Annotation Dependent Depletion (CADD) score, a measure of the deleteriousness of SNVs and indels in the human genome. The CADD scores are “PHRED-scaled”, meaning that they express the rank of a variant in order-of-magnitude terms rather than as the precise rank itself. For example, variants in the top 10% of CADD scores are assigned a score of 10 or more (CADD-10), the top 1% a score of 20 or more (CADD-20), the top 0.1% a score of 30 or more (CADD-30), and so on. ANNOVAR also provides information from the Genome Aggregation

are Non-Finnish Europeans (NFE) wherein the Iberian population is included.
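Returning to the PHRED scaling of CADD scores mentioned above, the relation between rank and scaled score can be written as −10 · log10(rank / total), where rank 1 is the most deleterious variant. The short sketch below illustrates this mapping; the function name is hypothetical and the snippet is not part of ANNOVAR or CADD.

```python
from math import log10


def phred_scaled(rank, total):
    """PHRED-like scaled score for a variant ranked `rank` out of `total`.

    rank = 1 corresponds to the most deleterious variant.
    """
    return -10 * log10(rank / total)


# A variant at the boundary of the top 10%, top 1% and top 0.1%:
scores = [phred_scaled(r, 1000) for r in (100, 10, 1)]
```

The three values in `scores` are 10, 20 and 30 up to floating-point rounding, matching the CADD-10, CADD-20 and CADD-30 thresholds.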

Several psychiatric disorders, such as schizophrenia [13], depression [14] or Alzheimer’s disease (AD) [15], have been found to be polygenic and were associated with several genetic variants. In this work, we describe the general procedures of association studies and the corresponding statistical methods,

followed by an application to the case of AD in patients from the Iberian Peninsula.

1.4 An introduction to Alzheimer’s disease

1.4.1 Symptoms, causes and available diagnosis

Alzheimer’s disease (AD) is the most common type of dementia. It usually begins with subtle memory failure, which worsens over time and begins to affect an individual’s daily living. A person suffering from

this condition will eventually have trouble recognizing people, naming objects, dealing with everyday chores and personal care, behaving appropriately in social situations, among others. At an advanced

stage of the disease, the patient will require constant care. After the first symptoms appear, an individual usually survives 8 to 10 years, but the course of the disease can go up to 25 years, ending in death by

pneumonia, malnutrition or general inanition [16].

There are three stages to AD. The early stage is mild Alzheimer’s disease, when a person can still function independently but has a few memory lapses, such as forgetting familiar words or the location of everyday objects. Individuals with mild AD are first diagnosed with a condition called mild cognitive impairment (MCI), whose symptoms are similar to those of the early stage of AD; deciding whether the MCI observed in an individual is due to AD relies on brain imaging and cerebrospinal fluid tests. The middle stage, or moderate Alzheimer’s disease, is typically the longest stage; the symptoms become more pronounced and the patient will require more care. The late stage is called severe Alzheimer’s disease, when individuals lose the ability to respond to the environment and need constant help in performing daily activities. It is often not easy to place an individual at a specific stage, as the stages may overlap [16].

AD can be classified according to the age of onset. The most common type is late-onset Alzheimer’s disease, which constitutes approximately 95% of the cases, and affects individuals whose first symptoms


the age of onset is below 65 [17].

     Cause                                                  % of cases
     Late-onset familial                                    15-25
     Early-onset familial                                   <2
     Down syndrome                                          <1
     Unknown (includes genetic/environment interactions)    ∼75

Tbl. 1.7 – Causes of Alzheimer’s disease.

The main causes of AD are described in Table 1.7. Approximately 25% of all AD is familial (i.e., ≥3

persons in a family have AD) and 75% is nonfamilial (i.e., an individual with AD and no known family history of AD); the onset of nonfamilial Alzheimer’s disease is usually at an advanced age. Because

familial and nonfamilial AD appear to have the same clinical and pathologic phenotypes (observable manifestation), they can only be distinguished by family history and/or by molecular genetic testing [16].

Most cases of early-onset AD are due to genetic factors transmitted from parent to child. Research

has shown that this form of the disease mostly results from a variation in one of these three genes: APP, PSEN1 or PSEN2. When any of these genes is altered, large amounts of amyloid β-peptide, a

toxic protein fragment, are produced in the brain. This peptide builds up to form clumps called “amyloid plaques”, characteristic of Alzheimer’s disease, which lead to the death of nerve cells and the progressive

signs and symptoms of this disorder [16].

Some evidence indicates that essentially all persons with Down syndrome develop the neuropathologic hallmarks of AD after age 40. Down syndrome, a condition characterized by intellectual disability and

other health problems, occurs when a person is born with an extra copy of chromosome 21 in each cell. The presumed reason for the association between these two conditions is the lifelong overexpression of

APP on chromosome 21, and the resultant overproduction of β-amyloid in the brain [17].

Research has come to support the concept that late-onset Alzheimer’s disease is a complex disorder, with many susceptibility genes involved, as well as environmental factors (such as higher education, or

exposure to electromagnetic fields [18]).

The gene APOE has been extensively studied and proven to have great influence in the manifestation of AD. APOE is polymorphic, with three major alleles: ε2, ε3 and ε4. The presence of the ε4 allele


respectively. The APOE ε2 allele has been shown to have a protective effect [16]. APOE alleles are determined

by the two SNPs rs429358 and rs7412 as shown in Table 1.8.

rs429358 rs7412 Allele

C T ε1

T T ε2

T C ε3

C C ε4

Tbl. 1.8 – APOE allele according to the genotype for SNPs rs429358 and rs7412.
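Table 1.8 translates directly into a lookup table. The sketch below assumes that the two SNPs have already been phased, so that each chromosome is described by a pair (allele at rs429358, allele at rs7412); the dictionary and function names are illustrative only.

```python
# (rs429358 allele, rs7412 allele) -> APOE allele, as in Table 1.8
APOE_ALLELE = {
    ("C", "T"): "ε1",
    ("T", "T"): "ε2",
    ("T", "C"): "ε3",
    ("C", "C"): "ε4",
}


def apoe_genotype(haplotype_1, haplotype_2):
    """Return the APOE genotype (e.g. 'ε3/ε4') from two phased haplotypes."""
    alleles = sorted(APOE_ALLELE[h] for h in (haplotype_1, haplotype_2))
    return "/".join(alleles)


def count_e4(haplotype_1, haplotype_2):
    """Number of ε4 copies (0, 1 or 2) carried by an individual."""
    return sum(APOE_ALLELE[h] == "ε4" for h in (haplotype_1, haplotype_2))
```

For instance, an individual carrying the haplotypes (T, C) and (C, C) has genotype ε3/ε4 and one copy of ε4.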

However, the presence of APOE ε4 does not determine that an individual will develop the disease; in fact, approximately 42% of individuals with AD do not have any APOE ε4 allele. Similarly, the absence

of APOE ε4 does not rule out the possibility of one developing AD.

Currently, the only definitive way to establish a diagnosis of AD is to microscopically examine a section of the person’s brain tissue after death. However, there are still several approaches that have been

proven to be highly effective in diagnosing Alzheimer’s disease in a living patient.

The initial step is to consult a specialized doctor (psychiatrist), who will review the individual’s medical history and analyze the symptoms, as well as conduct a series of tests of the individual’s cognitive and physical abilities. The Mini Mental State Examination (MMSE) is one such test widely used for this purpose. It is not unusual for the doctor to interview friends and family of the patient, to better understand their behavioral changes over time. This series of clinical assessments often provides enough information to perform a correct diagnosis [16]; however, the diagnosis is not always clear, and may require further, more advanced testing.

Analysis of electroencephalograms (EEGs) can also be used as a means of diagnosis. EEGs are used to register electrical activity in the brain, focusing namely on spectral measures, which include the

classical brainwaves in delta, theta, alpha, beta and gamma frequencies. Each one of these brainwaves is an endophenotype of Alzheimer’s disease. Brainwaves are activated according to our actions, feelings,

circadian rhythm, and some disorders may trigger the over-expression or inhibition of a given brainwave. In order to interpret the EEG, it is important to understand which behaviors lead to certain variations in

activity of each brainwave.

generated in deep meditation and dreamless sleep, when healing and regeneration processes are triggered. Theta (θ, 4-8 Hz) brainwaves are connected with the learning, memory, and intuition functions. Alpha (α, 8-13 Hz) waves aid overall mental coordination and learning. Beta (β, 13-30 Hz) brainwaves are present when we are alert, engaged in problem solving, judgment, decision making, or focused mental activity; they dominate our normal waking state of consciousness. Gamma (γ, >30 Hz) brainwaves are the fastest of brain waves, and relate to simultaneous processing of information from different brain areas. More detailed information can be found online at https://brainworksneurotherapy.com/what-are-brainwaves.

Other means of diagnosis include laboratory testing, which is usually performed as a way of ruling out conditions that cause similar symptoms to Alzheimer’s, such as nutritional deficiencies or other diseases

that could be affecting the person’s memory. These tests make use of blood, urine and cerebrospinal fluid samples [19].

One final diagnosis method worth mentioning is brain-imaging testing, such as computed tomography

(CT) or magnetic resonance imaging (MRI) scans. These make it possible to look for evidence of trauma, tumors, and stroke that could be causing dementia, and for brain atrophy, a shrinkage that may be present

later in the Alzheimer disease progression. These tests require that the person remain still for a period of time [19].

All the methods above provide information that makes it possible to rule out a series of conditions that cause symptoms similar to AD. Such conditions are, for instance, past strokes, Parkinson’s disease and depression [19].

1.4.2 The association studies approach

The methods of diagnosis described above require that the individual is showing symptoms of the

disease. Some of them may be considered invasive, such as lab testing (which requires lumbar puncture and collection of spinal fluid); others may be inaccessible to the majority of the population due to their high cost, such as an MRI scan. In addition, imaging techniques may provide poor quality results, as Alzheimer’s patients tend to have a hard time remaining still even for short periods of time, especially at

an advanced stage of the disease.


which increase risk of developing the disease, it could be possible to make an early diagnosis, since the

genetic material we carry is the same throughout our lifetime (except for mutations that may occur). This way, the disease could be prevented even before the appearance of symptoms, and thus prolong the

quality of life of a potential future AD patient.

This would also be a cheaper alternative, and can be made less invasive, while maintaining sample


Chapter 2 Study design and data quality control

Genetic association studies can essentially be divided into candidate gene (CG) and genome-wide

association (GWA) studies. CG studies are based on the prior hypothesis of a potential role of selected genes or genetic regions on a specific phenotype or disease, taking into consideration their biological

function or association in previous studies. Genome-wide association studies, on the other hand, make use of information on the variation across the entire human genome, and are useful for

hypothesis-generating purposes [10].

GWA analyses usually target relatively common SNPs. CG studies, however, focus on the effects of rare variants, which may be hard to detect, especially when dealing with small sample sizes. Besides

being more cost effective than sequencing an entire human genome, studying the part that rare variants play on disease has been largely motivated by the CDRV hypothesis [11]. Indeed, if a certain variant

has a large deleterious effect, it may also impact fitness and thus become less and less frequent in each generation.

This chapter describes the general procedures of an association study from the study design, through the process of data collection and up to the quality control steps, taking into account which methods suit


2.1 Study design

Describing the phenotype accurately

The phenotype of interest must be defined as accurately and specifically as possible, in a way that minimizes the likely causal heterogeneity based on existing clinical and biological evidence. Such

definitions may change, as more information becomes available. This will increase power of detection of an effect and allow for replication studies [10].

Checking disease heritability

Heritability is a measure of how well differences between individuals’ genes account for differences in their traits, i.e., how much of the variation in a given trait can be attributed to genetic variation (as

opposed to environmental causes) [3].

Heritability is assessed by studying disease patterns in family members, namely by comparing monozygotic with dizygotic twins. Because monozygotic twins are genetically identical (the two alleles in each locus are IBD), while dizygotic twins are expected to share, on average, half of their alleles, comparing disease status in twins can shed light on the role of genetic factors [10].
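One classical way of turning the twin comparison into a number, not described in the text above and given here only as an illustrative assumption, is Falconer's formula: heritability is approximated as twice the difference between the trait correlation in monozygotic pairs and that in dizygotic pairs. The function name and the example correlations below are hypothetical.

```python
def falconer_heritability(r_mz, r_dz):
    """Falconer's estimate: h^2 = 2 * (r_MZ - r_DZ).

    r_mz and r_dz are the trait (or liability) correlations observed in
    monozygotic and dizygotic twin pairs, respectively.
    """
    return 2 * (r_mz - r_dz)


# Hypothetical correlations, for illustration only.
h2 = falconer_heritability(r_mz=0.8, r_dz=0.5)
```

With these invented values the estimate is 0.6, i.e., roughly 60% of the variation in the trait would be attributed to genetic variation.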

Diseases which have been shown to have low heritability will likely need very large sample sizes in order to find etiological genetic variants. Moreover, in diseases with heritability close to zero, there isn’t

much advantage in conducting a genetic case-control study [10].

Choosing the best approach to the problem

Concerning sample relatedness, association studies can be sorted into two categories: population-based case-control studies, and family-based studies. The first approach may require several thousands of cases of the phenotype of interest. This number can be decreased, and power of the study increased, by recruiting cases with family history of the condition, or even multiple cases from the same family (adjusting for familial correlation), for a sample with a more homogeneous genetic background; this is called “enrichment sampling” [20]. This sampling method does not always increase power in genetic

studies, as familial aggregation may be due to shared environmental factors, for example [10].


more of the underlying genetic variants are common. Moderately rare variants could also be detected,

but only if they carry a large effect. A prior hypothesis that all undetected variants are rare and of small effects would require an unfeasibly large sample size, in order to have power to detect the effect of single

variants [21].

If the case definition is a phenotype that shows clear segregation in families, then a population-based case-control approach is no longer suitable, and a family-based study is preferable [10].

Control selection

The golden rule of control selection for any case-control study is that cases and controls should belong to the same population, and that controls should be representative of those members of that population who would have become cases, according to the case definition and the recruitment strategies for the study, had they developed the disease. This minimizes false

positives and confounding [22].

Bias due to environmental factors is generally not a problem in association studies; the most important type of bias is related to the ethnic origin of cases and controls. This is commonly referred to as

“population stratification”, and is an example of a confounding variable. In this situation, differences in allelic frequencies between cases and controls are due to the underlying sampling scheme, rather than

an actual effect of the variant on disease risk [23].

The effects of population stratification can sometimes be avoided at the study design level (by matching controls to cases on potentially important confounders) or the data analysis level (by adjusting the results

for these confounders). Matching is only essential when the effect of the confounder cannot be accurately measured or is too large to be adjusted for in the analysis [24].

Population stratification is minimized when controls are matched to cases on ethnicity, or when the

sample is restricted to a particular ethnic group. Further matching on sex can reduce population stratification in situations where there are gender differences in disease prevalence. Matching on age may

improve power of the study by ensuring that controls had the same opportunity as cases to develop (and be diagnosed with) the disease. This could be a problem when dealing with age-related diseases

such as Alzheimer’s disease. Whether or not further matching is necessary and decreases population stratification will depend on the disease in question [10]. Remaining stratification can be investigated

and controlled (to some extent) by analytical methods [25, 26].


healthy controls, mainly due to it being a much more economical approach. It is important that basic

characteristics of such panels are known, such as ethnicity, sex, age and area of recruitment, so that they can be matched to in the design or adjusted for in the analysis [10].

The described methods of control selection are specific to studies intended to assess genetic risk, and are no longer suitable if we incorporate environmental factors [10].

Sample size

Sample sizes for each study will depend on the existence of case sub-groups and a priori hypotheses

to be tested, on whether it is a CG or a GWA approach, among (many) other factors. Estimating the required sample size often relies on empirical results from simulation studies [10].

The lack of availability of genetic information from cases for an association study often relates to economic issues. When testing many SNPs, a one-stage design can be very expensive, so one can

resort to a multi-stage design, where all SNPs are tested in a random subset of cases and controls, and those found significant are taken through to be tested in the remainder of the study sample [27]. The

power of a study can also be potentially improved with an increased control/case ratio [24].

Replication studies

Theoretical considerations show that, when true discovery is claimed based on crossing a threshold of statistical significance and the discovery study is underpowered, the observed effects are expected to be inflated. Furthermore, flexible analyses coupled with selective reporting may inflate the published discovered effects. Therefore, a study designed to replicate a finding should base sample size calculations on smaller effect sizes [28].

A true replication study must be performed on a population comparable to the original, i.e., it must involve the analysis of the same polymorphism in the same direction of the effect, in the same ethnic

population measured on the same phenotype. Failure to replicate findings in a different population does not allow judgement of the validity of the results in the original study; it can only elucidate on the lack of


2.2 Data collection and variant calling

Once the study has been designed, the next step is to collect the data for analysis. Genotyping individuals in association

studies is usually done with DNA microarrays. These consist of specific DNA sequences (known as probes) corresponding to a short section of a gene or other DNA sequence of the human genome.

Probes are usually 100 to 10 000 bases long and fluorescently labeled. Among the manufacturers of DNA microarrays was Affymetrix, Inc., a company now owned by Thermo Fisher Scientific. This company

developed the GeneChip array technology and the Affymetrix Power Tools (APT), which can be used for variant calling, quality control and genotyping.

A GeneChip array can contain up to thousands of DNA probes, designed to vary in specific locations matching those of known human genome variation. When placing the probes and a DNA sample in

the same environment, DNA breaks up into fragments which attach to the corresponding probe in a process called hybridization, emitting a measurable fluorescent signal that allows the identification of the nucleotide

sequence in each fragment and thus determine the DNA sample sequence.

Two important measures to consider when assessing the quality of the variant calling process are the

dish quality control (DQC) and the quality control call rate (QCCR). DQC is a measure of the contrast between the adenine-thymine (AT) and cytosine-guanine (CG) signals, and is defined as

\[
\mathrm{DQC} = \frac{\text{AT Signal} - \text{CG Signal}}{\text{AT Signal} + \text{CG Signal}}
\]

QCCR is the proportion of non-missing data for each individual.
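The two measures can be written down directly from their definitions. The sketch below follows the DQC formula exactly as stated in the text and does not reproduce the actual computation performed by APT; the function names, the use of `None` for a missing call, and the example values are assumptions for illustration.

```python
def dish_qc(at_signal, cg_signal):
    """DQC as defined in the text: (AT - CG) / (AT + CG)."""
    return (at_signal - cg_signal) / (at_signal + cg_signal)


def qc_call_rate(genotype_calls):
    """Proportion of non-missing calls for one individual.

    Missing calls are represented by None (an assumed encoding).
    """
    called = sum(call is not None for call in genotype_calls)
    return called / len(genotype_calls)


# Hypothetical values for a single sample.
print(dish_qc(at_signal=3.0, cg_signal=1.0))       # contrast between channels
print(qc_call_rate(["AA", "AB", None, "BB"]))      # 3 of 4 calls are present
```

Each value would then be compared against the thresholds recommended for the array in use.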

The "Axiom Genotyping Analysis Guide" by Affymetrix provides guidelines for these measures, and any subject falling below these values should be eliminated from further study. Their best practices

guide can be found online at https://assets.thermofisher.com/TFS-Assets/LSG/manuals/axiom_genotyping_solution_analysis_guide.pdf.

Another relevant measure when doing probe QC is the heterozygosity rate. When this value is too high for a given individual (usually higher than µ + 3σ, where µ and σ are the mean and standard deviation of the heterozygosity rate across the sample), it could hint at sample contamination; when it is too low (below µ − 3σ), it could mean that there are related individuals in the sample. In any case, it is

recommended that the individuals falling outside this interval be discarded from further analysis [5].
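The µ ± 3σ rule can be sketched as follows. This is a minimal example, assuming the per-individual heterozygosity rates have already been computed and that σ is estimated with the sample standard deviation; the function name `heterozygosity_outliers`, the individual IDs and the rates are hypothetical.

```python
from statistics import mean, stdev


def heterozygosity_outliers(het_rates, k=3.0):
    """Return IDs whose heterozygosity rate lies outside mean +/- k * sd.

    het_rates: dict mapping individual ID -> heterozygosity rate.
    """
    values = list(het_rates.values())
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (an assumed choice)
    return [ind for ind, h in het_rates.items() if abs(h - mu) > k * sigma]


# Hypothetical sample: 19 typical individuals and one with excess heterozygosity.
rates = {f"ind{i:02d}": 0.30 for i in range(1, 20)}
rates["ind20"] = 0.60
print(heterozygosity_outliers(rates))
```

The returned IDs are the individuals that would be discarded from further analysis under this rule.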

One final quality control step for variants before genotyping is to sort them into categories according
