Quantifying the genetic
predisposition to a
complex disease
through genome-wide
association
Ana Margarida Carrapatoso Macedo
Master's degree thesis presented to Faculdade de Ciências da Universidade do Porto in Mathematical Engineering
2019
Mathematical Engineering, Department of Mathematics, 2019
Supervisor
Alexandra Lopes, Assistant Researcher, i3S – Instituto de Investigação e Inovação em Saúde, IPATIMUP - Instituto de Patologia e Imunologia Molecular da Universidade do Porto
Co-supervisor
Nádia Pinto, Junior Researcher, i3S – Instituto de Investigação e Inovação em Saúde, IPATIMUP - Instituto de Patologia e Imunologia Molecular da Universidade do Porto, CMUP – Centro de Matemática da Universidade do Porto
Acknowledgements
To Ipatimup and the Population Genetics group, who welcomed me into the world of scientific research.
To my supervisors Alexandra Lopes and Nádia Pinto, who were tireless and always available for my questions, believed in me and gave me the opportunity to be part of this project until the end.
To my brother, who corrected my strangest paragraphs whenever the English language failed me.
To my everyday companion, who never stops believing in me and encourages me to always do more.
To my parents, who provided me with a higher education and never doubted my choices.
Abstract
The main goal of this work was to contextualize and apply the methods used in genome-wide association studies, as well as in studies of target regions with functional relevance to the phenotype under analysis.
The methods explored in detail included the quality control steps of genetic data, statistical tests for common-variant association (Pearson's chi-squared test, Fisher's exact test and the Cochran-Armitage test for trend) and the SKAT-O method, which combines burden and non-burden approaches to the study of rare-variant association.
Subsequently, the methods were applied to a data set of Alzheimer's disease (AD) patients and healthy controls from the north of the Iberian Peninsula, in the scope of the multicenter study "AD-EEGWA". It features new data from a population that is still understudied with regard to genetic association with this disease. Besides the genetic component and biographic data, we had access to electroencephalography measures for most of the study participants. The project is currently ongoing, and more biological samples are being collected to increase the power of the genetic analyses.
The SKAT-O method allowed us to identify one gene (PLEKHA5) with significantly different minor allele frequencies between cases and controls in our sample: collectively, the rare alleles were present in 10% of controls, but in just 1% of cases. This gene had previously been identified as having differential gene expression in astrocytes between AD cases and controls.
The common-variant association methods allowed us to identify nine SNPs with nominally significant differences in allele and genotype distribution between cases and controls, three of which had been associated with Alzheimer's disease in previous studies. We then investigated the possibility of an association between these genetic variants and the brainwaves obtained from EEGs. Four out of nine SNPs showed significant differences for some brainwaves, concerning mean relative power values within cases and controls with different alleles/genotypes. However, the obtained results did not match what was expected, considering the EEG brainwave behavior in AD cases and controls, and the possible risk allele of each SNP.
All in all, we concluded that, even in small samples, it is possible to find association between phenotypes and the aggregate effects of rare variants. It will be interesting to replicate the study in a larger sample and see whether the results hold. The lack of coherence between the results obtained and what was expected from the analysis of the genetics-EEG relation motivates new approaches to the problem. The complexity of this disease strongly calls for an interdisciplinary approach that explores the effect of genotypes on well-defined disease endophenotypes, to help diagnosis.
Keywords: association study, Alzheimer’s disease, complex phenotype, genetic heterogeneity, rare
Resumo
The main goal of this work was to contextualize and apply the methods used in genome-wide association studies, as well as in studies of target regions with functional relevance to the phenotype under study.
The methods studied in most detail included the quality control steps of genetic data, statistical tests for common-variant association (Pearson's chi-squared test, Fisher's exact test and the Cochran-Armitage test for trend) and the SKAT-O method, which combines the burden and non-burden approaches to the study of rare-variant association.
Subsequently, the methods were applied to a data set of Alzheimer's patients and healthy controls from the northern region of the Iberian Peninsula, in the scope of the multicenter project "AD-EEGWA". This is a new data set from a population that is still understudied from the point of view of genetic association with the disease. Besides the genetic component and biographic data, we had access to electroencephalogram measures for most of the study participants. The project is still ongoing, and more biological samples are being collected in order to bring more power to the genetic analyses.
With the SKAT-O method, it was possible to identify one gene (PLEKHA5) with significant differences in the frequencies of its rare alleles between cases and controls in our sample: collectively, the rare alleles were present in about 10% of controls, but in only 1% of cases. This gene had previously been identified as having differential gene expression in astrocytes between Alzheimer's cases and controls.
The common-variant association methods allowed us to identify nine SNPs with nominally significant differences in the distribution of alleles and genotypes between cases and controls, three of which had already been associated with Alzheimer's disease in previous studies. We then investigated the possibility of an association between these genetic variants and the brainwaves obtained from the EEGs. Of the nine, four SNPs showed significant differences for some brainwaves, regarding the mean "relative power" values in cases and controls with different alleles/genotypes. However, the results obtained did not match what was expected, considering the EEG values in Alzheimer's cases and controls and the possible risk allele of each SNP.
All in all, we concluded that, even in small samples, it is possible to find association between phenotypes and the aggregate effect of rare variants. It will be interesting to replicate the study in a larger sample and see whether the results hold. The lack of coherence between the results of the analysis of the genetics-EEG relation and what was expected motivates new approaches to this problem. The complexity of this disease strongly calls for an interdisciplinary approach that explores the effect of genotypes on well-defined endophenotypes of the disease, to help in its diagnosis.
Keywords: association study, Alzheimer's disease, complex phenotype, heterogeneity
Contents
Acknowledgements
Abstract
Resumo
List of Tables
List of Figures
List of Abbreviations
Introduction
1 Theoretical framework
1.1 Introductory concepts of biology and genetics
1.2 Population genetics concepts
1.3 Association studies
1.4 An introduction to Alzheimer's disease
2 Study design and data quality control
2.1 Study design
2.2 Data collection and variant calling
2.3 Variant quality control
3 Models of association
3.1 Statistical tests for common variants
3.2 Rare variant association approaches
3.3 P-value adjustment
4 An application of association studies to Alzheimer's disease
4.1 Aim and objectives
4.2 Subjects and methods
4.3 Data quality control
4.4 Rare-variant analysis
4.5 Analysis of electroencephalography data
4.6 Discussion
Conclusion
Appendices
A Informed consent of participation in research study
B Mini Mental State Examination (MMSE)
C Lists of nominally significant variants in SKAT-O
C.1 Model 1 (Sex as covariate)
List of Tables
1.1 (a) Possible genetic constitution of the offspring resulting from crossing (AA) with (aa) individuals (1st hybrid generation); (b) Possible genetic constitution of the offspring resulting from crossing (Aa) individuals among each other (2nd hybrid generation).
1.2 Blood group (phenotype) of the offspring, depending on the ABO alleles (genotype) inherited from the parents.
1.3 Observed (left) and expected (right) haplotype frequencies in the population, under total linkage equilibrium.
1.4 Observed haplotype frequencies in the population, considering linkage disequilibrium.
1.5 Probability values of IBS state I = i conditional on IBD state Z = z for each value of I and Z.
1.6 Mating outcomes assuming Hardy-Weinberg equilibrium.
1.7 Causes of Alzheimer's disease.
1.8 APOE allele according to the genotype for SNPs rs429358 and rs7412.
3.1 (a) Contingency table of allele counts; (b) Contingency table of genotype counts.
3.2 Counts of cases and controls in each of the n exposure categories Ej, in a sample of c individuals.
4.1 Summary table of the individual counts after each per-individual QC step, according to disease status, gender and country of origin (note: whenever male and female counts did not add up to the "Total" column, it was due to the presence of individuals with unknown gender; this issue was overcome as of the sex check step).
4.2 Summary table of probe/variant counts at each per-marker QC step.
4.4 Identifier and description of each gene list to be tested for association, and number of genes and rare variants they contain.
4.5 Number of significant genes without and with p-value correction in each gene list and model (Model 1 – sex as the only covariate; Model 2 – sex, age, PC1 and PC2 as covariates).
4.6 Frequency and properties of PLEKHA5 rare variants identified in our sample. (*) in complete LD ($r^2 = 1.0$). Notes: "PHRED" refers to the PHRED-scaled CADD score; MA - minor allele; MAF - minor allele frequency; "gnomAD NFE" refers to the Non-Finnish European population of the gnomAD database (the number of genotyped alleles for each SNP was, respectively, 129 088, 129 088, 75 296 and 113 128); the p-values refer to Fisher's exact test for differences between control frequencies in our data and the gnomAD database.
4.7 P-values obtained in ANOVA tests when testing for differences in RP of each of the brainwaves between cases and controls.
4.8 P-values obtained in Tukey test for multiple comparisons of the RP in each brainwave between controls and cases in each disease stage (CON - controls; MIL - mild AD; MOD - moderate AD; SEV - severe AD). The values below $0.05/30 = 1.67 \times 10^{-3}$ are underlined and in bold.
4.9 Gene and p-values obtained in the allelic and genotypic tests for the SNPs which were nominally significant in both at the α = 0.05 level. (*) SNPs previously associated to AD.
4.10 Wald test p-values for the variation of relative power across each frequency band in cases and controls with different alleles for each of the considered SNPs. The values below 0.05 are underlined and in bold.
4.11 Wald test p-values for the variation of relative power across each frequency band in cases and controls with different genotypes for each of the considered SNPs. The values below 0.05 are underlined and in bold.
4.12 Minor and alternative alleles of each SNP and their respective frequencies of the minor allele in the sample, in cases and in controls. (*) SNP previously associated to AD.
C.1 Nominally significant genes in SKAT-O under null Model 1 tested for "All Genes" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).
C.2 Nominally significant genes in SKAT-O under null Model 1 tested for "Dementia" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).
C.3 Nominally significant genes in SKAT-O under null Model 1 tested for "DGE Brain" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).
C.4 Nominally significant genes in SKAT-O under null Model 1 tested for "AD Disgenet" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).
C.5 Nominally significant genes in SKAT-O under null Model 2 tested for "All Genes" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).
C.6 Nominally significant genes in SKAT-O under null Model 2 tested for "Dementia" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).
C.7 Nominally significant genes in SKAT-O under null Model 2 tested for "DGE Brain" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).
C.8 Nominally significant genes in SKAT-O under null Model 2 tested for "AD Disgenet" list, their respective rare variants, obtained p-values and values of ρ (ρ refers to the weight of the Burden test on SKAT-O, or the correlation between regression coefficients).
List of Figures
1.1 Mendel's fundamental experiment.
4.1 Missing data rate vs. heterozygosity across individuals passing the DQC and QCCR steps. Shading indicates sample density; the dashed lines represent the defined heterozygosity threshold; the outliers are highlighted in red.
4.2 (a) Plot of the first principal component against the second, calculated with the software EIGENSOFT; (b) Plot of the same PCs as in (a), zoomed in on the cluster which contains the European populations (the outliers are encircled in red). In the legends, PT and ES represent the Portuguese and Spanish individuals in our sample, respectively; the remaining populations come from the 1KGP dataset.
4.3 Distribution of age at the time of sample collection.
4.4 Proportion of variance explained by the first i principal components; (b) is a zoom-in of (a) on the first 100 PCs.
4.5 Plot of the first and second principal components, restricted to the sample subjects.
4.6 Probability density function of Beta(p, 1, 25).
4.7 Distribution of the relative power of frequency bands delta, theta, alpha, beta-1 and beta-2 in cases and controls.
4.8 Distribution of the relative power of frequency bands delta, theta, alpha, beta-1 and beta-2 in controls and mild, moderate and severe AD cases (CON - controls; MIL - mild AD; MOD - moderate AD; SEV - severe AD).
4.9 Plot of the mean relative power distribution in the different frequency bands of controls with each (a) allele or (b) genotype for SNP rs71336232.
4.10 Plot of the mean relative power distribution in the different frequency bands of controls
4.11 Plot of the mean relative power distribution in the different frequency bands of cases with each (a) allele or (b) genotype for SNP rs10833211.
4.12 Plot of the mean relative power distribution in the different frequency bands of cases with
List of Abbreviations
1KGP The 1000 Genomes Project
AD Alzheimer’s disease
APT Affymetrix Power Tools
ANOVA analysis of variance
CADD combined annotation dependent depletion
CDCV common disease common variant
CDRV common disease rare variant
CG candidate gene
CT computed tomography
df degrees of freedom
DGE differential gene expression
DNA deoxyribonucleic acid
DQC dish quality control
EEG electroencephalogram
GUI graphical user interface
GWA genome-wide association
HWE Hardy-Weinberg equilibrium
IBD identity by descent
IBS identity by state
indel insertion/deletion variant
LD linkage disequilibrium
MAF minor allele frequency
MCI mild cognitive impairment
MIL mild Alzheimer’s disease
MMSE Mini Mental State Examination
MOD moderate Alzheimer’s disease
MRI magnetic resonance imaging
OR odds ratio
PC(A) principal component (analysis)
QC quality control
QCCR quality control call rate
RNA ribonucleic acid
SEV severe Alzheimer’s disease
SKAT sequence kernel association test
SNP single nucleotide polymorphism
SNV single nucleotide variant
Introduction
This work aimed to describe the general procedures and methods used in genome-wide association studies, and their subsequent application to a data set composed of cases and controls of Alzheimer's disease from the Iberian Peninsula. The work is reflected in the present dissertation, which is structured as follows.
To place the work in context, we start by introducing in Chapter 1 some basic concepts of biology and population genetics, and the general idea behind association studies. We also briefly introduce Alzheimer’s disease, its symptoms, causes and means of diagnosis.
In Chapter 2, we describe the procedures relative to study design and quality control of genomic data. Here, we present relevant thresholds for various quality measures, as criteria to keep or discard
individuals and genetic variants from the study.
Chapter 3 describes in detail the models of association used to study the effect of both common and rare variation on a complex phenotype. We focus on case-control studies, but also consider the case of quantitative traits whenever relevant.
Chapter 4 is the application of the previously described methods to a data set of Alzheimer's patients and controls. It is split into two distinct phases of analysis. In a first stage, given the modest sample size, we focus on the aggregate effect of rare variation on the disease, using the SKAT-O method. In a second and final stage, we assess the behavior of different brainwaves at various stages of the disease; we also seek a possible association between a set of genetic variants and the values of EEG at different frequency bands.
Finally, in Chapter 5, we make some considerations about the work, namely on the importance of an approach combining genetic data and other endophenotypes associated with the disease, for a hopefully
Chapter 1
Theoretical framework
1.1 Introductory concepts of biology and genetics
All our ideas about the transmission of specific characters and changes in the characteristics of populations make use of concepts introduced in 1865 by the one who is often referred to as the Father of Genetics - Gregor Mendel [1]. Working in a small garden with nothing but peas as his material, he was able to formulate a hypothesis that explains the inheritance of some traits in a very simple way.
The simplified experiment was as follows: Mendel crossed purebred plants with green peas and purebred plants with yellow peas, obtaining a first hybrid generation of all yellow peas; he then crossed these hybrid plants with each other and obtained peas of both colors, in a proportion of approximately 3 yellow to 1 green. This experiment is illustrated in Figure 1.1 and Table 1.1.
[Fig. 1.1 – Mendel's fundamental experiment. Initial generation: AA and aa; 1st hybrid generation: Aa and Aa; 2nd hybrid generation: AA, Aa, Aa and aa.]
From these results, Mendel formulated his hypothesis for sexual reproduction, which can be expressed as follows:
1. Each character of an individual is controlled by two "factors", the alleles, one of which the individual receives from the father, and the other from the mother.
2. Of the two alleles carried by the individual, one is expressed (dominant), while the effect of the other may not be apparent (recessive).
3. A reproductive cell (egg or sperm, in humans) produced by an individual bears, for each character, one and only one of the two alleles which the individual carries.
In the initial generation, the plants with yellow peas carried only "yellow" alleles; their genetic representation for this trait is (AA). The plants with green peas carried only "green" alleles, hence having genetic constitution (aa). Crossing individuals of these two types can only generate individuals of one type, (Aa). When crossing (Aa) individuals with each other, it is possible to obtain individuals with genetic constitution (AA), (aa) or (Aa), with probabilities 1/4, 1/4 and 1/2, respectively. Because A is dominant over a, the (Aa) peas are yellow, hence the obtained proportions of all yellow peas in the 1st hybrid generation, and 3 yellow peas to one green in the 2nd.
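The probabilities above can be checked by enumerating the gametes of each parent. The following is a minimal sketch in Python, not part of the original analysis; the function names `cross` and `yellow_fraction` are hypothetical, and each gamete is assumed to carry either allele of its parent with probability 1/2.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def cross(parent1, parent2):
    """Genotype distribution of the offspring of two parents.

    Each parent is a two-character genotype such as "Aa". Every pairing
    of one allele from each parent is taken as equally likely.
    """
    counts = Counter(
        "".join(sorted(g1 + g2)) for g1, g2 in product(parent1, parent2)
    )
    total = sum(counts.values())
    return {genotype: Fraction(n, total) for genotype, n in counts.items()}


def yellow_fraction(distribution, dominant="A"):
    """Fraction of offspring carrying at least one dominant allele."""
    return sum(p for g, p in distribution.items() if dominant in g)


first_generation = cross("AA", "aa")    # only (Aa) offspring
second_generation = cross("Aa", "Aa")   # (AA), (Aa) and (aa) offspring
print(first_generation)
print(second_generation)
print(yellow_fraction(second_generation))
```

Exact fractions are used instead of floating-point numbers so that the 1/4, 1/2, 1/4 split and the 3-to-1 phenotype ratio can be compared without rounding.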
(a)       a      a          (b)       A      a
     A   (Aa)   (Aa)             A   (AA)   (Aa)
     A   (Aa)   (Aa)             a   (Aa)   (aa)
Tbl. 1.1 – (a) Possible genetic constitution of the offspring resulting from crossing (AA) with (aa) individuals (1st hybrid generation); (b) Possible genetic constitution of the offspring resulting from crossing (Aa) individuals among each other (2nd hybrid generation).
Mendel's experiment was a starting point for modeling the inheritance of other, more complex, phenotypes, namely diseases, involving the contribution and/or interaction of various genes [2].
The human body is composed of trillions of cells of many different types. Cells are constantly dying and being replaced by newly formed ones through the process of cell division – the source of an individual's development and growth. The very first cell of an individual already carries all of the genetic information he/she will bear throughout his/her whole life, encoded in its DNA.
DNA, or deoxyribonucleic acid, is a double helix molecule about 3 billion base pairs long, made of two complementary chains of nucleotides. Each nucleotide consists of one of four bases – adenine (A), thymine (T), cytosine (C) and guanine (G) –, attached to one sugar molecule and one phosphate molecule. Each A in one nucleotide chain pairs up with a T in the opposite chain, and each C with a G. Therefore, taking one of the chains as reference, DNA can essentially be seen as a single sequence of As, Ts, Cs and Gs. During cell division, DNA replicates itself, providing each new cell with an identical copy of all genetic material (if no mutation occurs). This process is called "replication".
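As a small illustration of the base-pairing rule, the opposite chain can be derived from the reference chain one base at a time. This is a minimal sketch, assuming a sequence written only with the letters A, T, C and G; the function name `complementary_strand` is hypothetical, and strand orientation (the reverse complement used by sequence tools) is deliberately ignored.

```python
# Pairing rule described in the text: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complementary_strand(sequence):
    """Base-by-base complement of a DNA sequence, in the same reading order."""
    return "".join(COMPLEMENT[base] for base in sequence.upper())


reference = "ATCGGA"  # made-up example sequence
opposite = complementary_strand(reference)
print(reference)
print(opposite)
```

Applying the function twice returns the reference sequence, which is why a single chain is enough to describe the whole molecule.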
DNA is tightly coiled into structures called "chromosomes". Most human cells, the somatic cells, contain 23 pairs of chromosomes, for a total of 46, and are hence called "diploid". The exceptions are the reproductive cells or gametes, which carry only half of an individual's genetic material, i.e., 23 unpaired chromosomes, thus being "haploid".
During cell division, a phenomenon called "genetic recombination" occurs: the crossover of chromosome arms, leading to the exchange of genetic material between them. Recombination occurs between homologous regions, meaning that the alleles or genes are arranged in a similar order in both chromosomes.
In somatic cells, all chromosome pairs but one are termed "autosomes"; the one pair that differs in its structure, usually referred to as the 23rd pair, is the pair of sex chromosomes, which determines an individual's sex: females have two X chromosomes, whereas males have one X and one Y. In each reproductive cell, since only half of the genetic material is present, there is an X chromosome in female gametes and either an X or a Y in male ones; indeed, it is the information carried in the paternal gamete that determines the sex of the offspring.
The set of all genetic information of an individual is called his/her "genome". The "exome" is the small fraction (about 1%) of the genome known to encode protein production. The information to produce a functional protein is encoded in a special unit of DNA at a determined genetic site ("locus"), called a "gene". Each gene can have various modes of action, determined by variants of its sequence. The multiple versions of a gene are called "alleles". A gene is said to be "polymorphic" if, for its locus, the minor allele frequency (MAF) is at least 1% within a population [3].
Because humans have twenty-two pairs of autosomes, most genes are represented twice in our genome, through alleles that may or may not be identical. A "genotype" is defined as the set of alleles found at a given locus of an individual. A "phenotype" is an observable trait that concerns a particular locus (or the combined action of several loci). For example, in Mendel's experiment described in 1.1, the color of the peas is the phenotype.
An "endophenotype" is somewhere between the previous two definitions: it is a quantitative, non-observable trait that also shows a genetic connection. A set of alleles at linked loci that an individual carries on the same chromosome is called a "haplotype".
For each autosomal locus, an individual is said to be ”homozygous” if he/she carries two copies of the same allele, and ”heterozygous” if the alleles are different. We can also refer to homozygous individuals
as ”homozygotes”, and to heterozygous individuals as ”heterozygotes”.
It is the union of one male and one female reproductive cell that generates a "zygote", i.e., a fertilized egg cell, with a full complement of hereditary information necessary for the development of a human being. At the moment of fertilization, the new individual receives for each autosomal locus one allele from the father and one from the mother; as for the 23rd pair, if the new individual is female, then she has inherited one X chromosome from each of her parents, while if he is male, he has inherited an X chromosome from his mother and a Y from his father. Women can be homozygous or heterozygous for genes in the 23rd chromosome; men can only be "hemizygous", because X and Y are not homologous along their full length, except for short "pseudoautosomal regions" on their tips.
Individuals with different genotypes may show similar phenotypes due to genetic dominance-recessiveness relationships. Allele A is said to be "dominant" to a (or, equivalently, a is "recessive" to A) if the action of A, but not that of a, is manifested in the phenotype of an (Aa) heterozygote. Alleles A and B are said to be co-dominant if they are both expressed in an (AB) individual's phenotype. An example of co-dominant expression is the AB blood type in the human ABO blood system, as shown in Table 1.2 [3]. This table also illustrates the dominance of alleles A and B over allele O.
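         A     B     O
    A    A     AB    A
    B    AB    B     B
    O    A     B     O
Tbl. 1.2 – Blood group (phenotype) of the offspring, depending on the ABO alleles (genotype) inherited from the parents.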
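The dominance and co-dominance rules behind the ABO table can be written as a short function. This is a minimal sketch for illustration only; the function name `blood_group` is hypothetical, and it covers just the three alleles A, B and O discussed in the text.

```python
def blood_group(allele1, allele2):
    """ABO blood group (phenotype) for a pair of inherited alleles (genotype)."""
    alleles = {allele1, allele2}
    if alleles == {"A", "B"}:
        return "AB"  # A and B are co-dominant: both are expressed
    if "A" in alleles:
        return "A"   # A is dominant over O
    if "B" in alleles:
        return "B"   # B is dominant over O
    return "O"       # O shows only when both alleles are O


# Rebuild the table row by row.
for first in "ABO":
    print(first, [blood_group(first, second) for second in "ABO"])
```

Because the rule depends only on which alleles are present, the table is symmetric: the result does not depend on which parent contributed which allele.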
1.2 Population genetics concepts
Population genetics involves the study of genetic variation within and between populations, by examining allele frequencies at different loci over time and space. Mathematical models are used to investigate and predict the occurrence of specific alleles (or combinations of alleles) in populations, based on the ever increasing understanding of genetics and evolution. As such, it becomes necessary to introduce some concepts regarding genetic variations, and how they relate to each other and influence the evolution of species.
1.2.1 Genetic variants
Chromosomes are not perfectly stable entities: changes in the DNA sequence may occur as a result
of external or internal factors, such as the interaction with radiation, chemicals or viruses, or simply an error during the replication process. These changes are called ”mutations”, and give rise to genetic
variants; if their frequency among a population is above 1%, they are considered common and called ”polymorphisms” [3].
For example, the replacement of one nucleotide by another is called a "single nucleotide variant" (SNV), or a "single nucleotide polymorphism" (SNP) in case it is common. The most common single base-pair changes occur within each of the two existing classes of nucleotides – purines (A↔G) and pyrimidines (C↔T) –, and most SNPs in a population are "biallelic". Another class of well-known variants are indels, which are insertions or deletions of a portion of DNA, no larger than 1 000 bases, in the genome.
Variants are part of the evolution of species. Some variants do not change an individual’s phenotype, while others may greatly affect it; some variants can increase an individual’s fitness in the surrounding
environment, while others can have deleterious effects and generate disease. ”Monogenic” diseases result from deleterious variants in a single gene. They are inherited according to Mendel’s laws, hence
also being called ”Mendelian” diseases. If a disease results from the joint contribution of a number of independently acting or interacting genes, it is called ”polygenic”.
Variants can have numerous classifications depending on their length, placement and function. Exonic variants are located in portions of a gene that will encode a part of the final mature RNA produced by
regions but may have an important role in regulating gene expression. These are variants of potential
functional importance and could be good candidates for further analysis in association studies.
A SNV that is in a coding region of the genome but results in no change to the encoded amino acid is called a "synonymous" substitution; when a SNV changes the encoded amino acid, and thus the protein, it is termed "non-synonymous". Indel variants may yield "frameshift" variations, which are potentially deleterious. "Stop-gain" and "stop-loss" variations result, respectively, in a premature termination and an abnormal extension of the protein translation process, and thus alter the protein itself. Of these classes of genetic variants, the synonymous substitutions are the least functionally relevant, and it is not uncommon to prioritize the analysis of variants falling in the remaining classes when searching for an association with a phenotype or disease [3].
Some genetic variants are known to be more likely to ”travel together” from generation to generation
than would be expected if different loci associated in a random manner. This phenomenon of non-random association is termed ”linkage disequilibrium”.
1.2.2 Linkage disequilibrium
Various studies have confirmed that the inheritance of certain alleles within a population is often correlated, causing many individuals to share the same haplotype. The alleles are thus said to be in linkage disequilibrium (LD). Even though genetic distance influences LD, it does not necessarily cause it; two loci being in LD simply means that the alleles appear together in the same population more (or less) frequently than chance would have us expect.
Suppose that allele A at locus 1 and allele B at locus 2 are found at frequencies p and q, respectively,
in the population. If the two loci were independent, then we would expect to see the [AB] haplotype at frequency pq; however, if the frequency of the [AB] haplotype was either higher or lower than pq, then
the two loci could be in LD.
Let us consider two biallelic loci on the same chromosome, with alleles A and a at the first locus, and B and b at the second. Their allelic frequencies in the population are $p_A$, $p_a$, $p_B$ and $p_b$ (note that, because the loci are biallelic, $p_a = 1 - p_A$ and $p_b = 1 - p_B$); the haplotype frequencies are $p_{AB}$, $p_{Ab}$, $p_{aB}$ and $p_{ab}$. Table 1.3 shows the observed and expected haplotype frequencies under linkage equilibrium; Table 1.4 shows the observed haplotype frequencies considering linkage disequilibrium.
         Observed frequencies           Expected frequencies
         B        b                     B          b
    A    p_AB     p_Ab             A    p_A p_B    p_A p_b
    a    p_aB     p_ab             a    p_a p_B    p_a p_b
Tbl. 1.3 – Observed (left) and expected (right) haplotype frequencies in the population, under total linkage equilibrium.
Observed frequencies
        B               b               Total
A       p_A p_B + D     p_A p_b − D     p_A
a       p_a p_B − D     p_a p_b + D     p_a
Total   p_B             p_b
Tbl. 1.4 – Observed haplotype frequencies in the population, considering linkage disequilibrium.
The measure of linkage disequilibrium D is the difference between the observed frequency of the [AB] haplotype and the frequency expected under linkage equilibrium, defined as
$$D = p_{AB} - p_A p_B.$$
In order to standardize D, we need to find its boundaries. Using the fact that the observed frequencies must be non-negative, we obtain
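$$\begin{aligned}
p_A p_B + D \ge 0 &\iff D \ge -p_A p_B \\
p_a p_b + D \ge 0 &\iff D \ge -p_a p_b \\
p_A p_b - D \ge 0 &\iff D \le p_A p_b \\
p_a p_B - D \ge 0 &\iff D \le p_a p_B
\end{aligned}$$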
Thus we can define
$$D' = \frac{D}{D_{\max}}, \qquad
D_{\max} = \begin{cases}
\min\{p_A p_b,\ p_a p_B\}, & \text{if } D > 0 \\
\max\{-p_A p_B,\ -p_a p_b\}, & \text{if } D < 0.
\end{cases}$$
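Because D and Dmax always have the same sign under this definition, the normalization causes D′ to range between 0 and 1 (conventions that keep the sign of D give a range of −1 to 1 instead). When |D′| = 1, at least one of the haplotypes was not observed; if allele frequencies are similar, a high D′ value means the markers are
good surrogates for each other.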
Another widely used measure of LD between loci, preferred by population geneticists, is Pearson’s coefficient of correlation r,
$$r = \frac{D}{\sqrt{p_A\, p_a\, p_B\, p_b}},$$
or, more commonly, its squared value (r²). When r² = 1, the two loci are in total linkage disequilibrium,
i.e., they both provide identical information; if r² = 0, they are in perfect equilibrium, i.e., the genetic information is transmitted independently [4]. Typically, two loci are considered to be correlated when an
r² value greater than 0.2 is achieved [5].
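As a minimal sketch of the measures defined in this section, the following Python function computes D, D′ and r² from the frequency of the [AB] haplotype and the two allele frequencies. The function name `ld_measures` and the example frequencies are hypothetical and serve only as an illustration; both loci are assumed to be polymorphic.

```python
import math

def ld_measures(p_ab, p_a, p_b):
    """Return D, D' and r^2 for two biallelic loci.

    p_ab: observed frequency of the [AB] haplotype
    p_a, p_b: frequencies of allele A (locus 1) and allele B (locus 2)
    Both loci are assumed to be polymorphic (0 < p < 1).
    """
    q_a, q_b = 1.0 - p_a, 1.0 - p_b        # frequencies of alleles a and b
    d = p_ab - p_a * p_b                   # observed minus expected frequency

    # Bounds obtained from the non-negativity of the haplotype frequencies
    if d > 0:
        d_prime = d / min(p_a * q_b, q_a * p_b)
    elif d < 0:
        d_prime = d / max(-p_a * p_b, -q_a * q_b)
    else:
        d_prime = 0.0                      # D = 0: linkage equilibrium

    r = d / math.sqrt(p_a * q_a * p_b * q_b)
    return {"D": d, "D_prime": d_prime, "r2": r * r}

# Hypothetical example: allele B is only ever seen together with allele A
example = ld_measures(p_ab=0.30, p_a=0.6, p_b=0.3)
print(example)
```

In this example the [aB] haplotype is never observed, so D′ is at its maximum, yet r² is only about 0.29 because the allele frequencies at the two loci differ.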
Linkage disequilibrium is of major importance in association studies, namely at the level of marker selection in the study design phase. Indeed, mapping LD across the human genome has made it possible
to deduce an individual’s genotype at a given locus from others in high disequilibrium. This is done by strategically choosing single tagSNPs to represent entire haplotypes of regions in high LD, which results
in a less costly study.
1.2.3
Identity by descent
Even after taking LD into account, loci which are independent within a population may still show
significant similarities among individuals, introducing a degree of relatedness which must be accounted for, especially when performing an association study with a sample of unrelated individuals.
If relatives are present, a bias may be introduced to the study, because the genotypes within families will be over-represented and the sample may no longer be an accurate reflection of the allele frequencies
in the entire population.
An important measure of relatedness used to identify such cases is identity by descent (IBD), a degree
descended from the same ancestral allele. Mutation breaks identity by descent. Two individuals are said
to be related if they may share IBD alleles. There is some point in the past beyond which individuals are assumed to be unrelated.
Identical twins are expected to have a proportion of shared IBD alleles equal to 1; first-degree relatives, 0.5; second-degree relatives, 0.25; and so on [5].
A similar concept is that of identity by state (IBS), which is based on the average proportion of indistinguishable alleles shared at genotyped variants for each pair of individuals. Therefore, two alleles which
are IBD are also IBS, but the opposite may not be true, because alleles IBS may not originate from the same common ancestor; similarly, an individual may have more alleles IBS than IBD, but the opposite
can never occur.
Purcell et al. [6] considered a method-of-moments approach to estimate the probability of sharing 0,
1, or 2 IBD alleles for any pair of individuals from the same homogeneous, random-mating population. Denoting IBS states as I and IBD states as Z (in both cases, the possible states being 0, 1, and 2), then
we have that
$$P(Z=0) = \frac{N(I=0)}{N(I=0 \mid Z=0)}$$
$$P(Z=1) = \frac{N(I=1) - P(Z=0)\,N(I=1 \mid Z=0)}{N(I=1 \mid Z=1)}$$
$$P(Z=2) = \frac{N(I=2) - P(Z=0)\,N(I=2 \mid Z=0) - P(Z=1)\,N(I=2 \mid Z=1)}{N(I=2 \mid Z=2)}$$
where N(I = i | Z = z) is the expected count of variants with IBS state I = i conditional on IBD state Z = z for the entire genome, and is defined as
$$N(I=i \mid Z=z) = \sum_{m=1}^{L} P(I=i \mid Z=z)$$
where the summation is over all variants with genotype data on both individuals, and the conditional probabilities are calculated as in Table 1.5.
We can thus define the proportion of alleles shared IBD as
$$\hat{\pi} = \frac{P(Z=1)}{2} + P(Z=2)$$
I   Z   P(I | Z)
0   0   2p²q²
1   0   4p³q + 4pq³
2   0   p⁴ + 4p²q² + q⁴
0   1   0
1   1   2p²q + 2pq²
2   1   p³ + p²q + pq² + q³
0   2   0
1   2   0
2   2   1
Tbl. 1.5 – Probability values of IBS state I = i conditional on IBD state Z = z for each value of I and Z.
Due to genotyping errors, LD and population structure, a π̂ value higher than 0.98 is considered sufficient
to treat two samples under analysis as duplicates. The usual procedure is to remove one individual from each pair with π̂ > 0.1875 (a value halfway between the expected IBD for second- and third-degree
relatives) [5].
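The method-of-moments estimates can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function names are hypothetical, allele frequencies are taken as known, the estimates are not constrained to the interval [0, 1] as a full implementation would do, and π̂ is computed as P(Z = 1)/2 + P(Z = 2), the definition used by PLINK.

```python
def ibs_given_ibd(p):
    """Table 1.5 for one variant: rows are IBD state z, columns are IBS state i."""
    q = 1.0 - p
    return [
        [2 * p**2 * q**2, 4 * p**3 * q + 4 * p * q**3, p**4 + 4 * p**2 * q**2 + q**4],
        [0.0, 2 * p**2 * q + 2 * p * q**2, p**3 + p**2 * q + p * q**2 + q**3],
        [0.0, 0.0, 1.0],
    ]

def estimate_ibd(geno1, geno2, freqs):
    """Method-of-moments estimates of P(Z=0), P(Z=1), P(Z=2) and pi-hat.

    geno1, geno2: allele counts (0, 1 or 2) per variant for two individuals
    freqs: population frequency of the counted allele at each variant
    """
    observed = [0, 0, 0]                        # N(I = i)
    expected = [[0.0] * 3 for _ in range(3)]    # expected[z][i] = N(I = i | Z = z)
    for g1, g2, p in zip(geno1, geno2, freqs):
        observed[2 - abs(g1 - g2)] += 1         # IBS state at this variant
        table = ibs_given_ibd(p)
        for z in range(3):
            for i in range(3):
                expected[z][i] += table[z][i]

    z0 = observed[0] / expected[0][0]
    z1 = (observed[1] - z0 * expected[0][1]) / expected[1][1]
    z2 = (observed[2] - z0 * expected[0][2] - z1 * expected[1][2]) / expected[2][2]
    pi_hat = z1 / 2 + z2                        # proportion of alleles shared IBD
    return z0, z1, z2, pi_hat

# Hypothetical duplicated sample: identical genotypes at every variant
print(estimate_ibd([0, 1, 2, 1], [0, 1, 2, 1], [0.5, 0.5, 0.5, 0.5]))
```

For a duplicated sample every variant is IBS 2, so the sketch returns P(Z = 2) = 1 and π̂ = 1, in line with the expectation for identical twins.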
1.2.4
The Hardy-Weinberg equilibrium
The Hardy-Weinberg equilibrium (HWE) is a law of genetics which states that allele and genotype frequencies in a population will remain constant from generation to generation, under the following
assumptions:
(a) the population size is so large that it can be treated as infinite;
(b) generations are discrete, and individuals from different generations do not breed together;
(c) mating is at random;
(d) migration does not occur;
(e) selection does not occur (i.e., individuals with different genotypes are assumed to have equal
(f) mutations do not occur (i.e., individuals with genotype (AiAj) can only produce gametes with an
Ai or an Aj allele at that locus);
(g) initial genotype frequencies are equal in the two sexes.
The equilibrium in autosomal loci
Let us suppose, for simplicity and because most human loci are biallelic, that there are n = 2 observed alleles, A1 and A2, with proportions p and q = 1 − p, respectively, for a given locus in a population. There
are 3 possible genotypes, (A1A1), (A1A2) (identical to (A2A1)) and (A2A2), with initial proportions u, v and w, respectively.
From the genotype proportions, it is possible to deduce the allele proportions:
$$p = u + \tfrac{1}{2}v, \qquad q = w + \tfrac{1}{2}v$$
Under the stated assumptions, the next generation will be composed as shown in Table 1.6.
Mating Type          Frequency   Nature of Offspring
(A1A1) × (A1A1)      u²          (A1A1)
(A1A1) × (A1A2)      2uv         1/2 (A1A1) + 1/2 (A1A2)
(A1A1) × (A2A2)      2uw         (A1A2)
(A1A2) × (A1A2)      v²          1/4 (A1A1) + 1/2 (A1A2) + 1/4 (A2A2)
(A1A2) × (A2A2)      2vw         1/2 (A1A2) + 1/2 (A2A2)
(A2A2) × (A2A2)      w²          (A2A2)
Tbl. 1.6 – Mating outcomes assuming Hardy-Weinberg equilibrium.
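The frequencies of genotypes (A1A1), (A1A2) and (A2A2) in the next generation are, respectively,
$$\begin{aligned}
u^2 + uv + \tfrac{1}{4}v^2 &= \left(u + \tfrac{1}{2}v\right)^2 = p^2 \\
uv + 2uw + \tfrac{1}{2}v^2 + vw &= 2\left(u + \tfrac{1}{2}v\right)\left(w + \tfrac{1}{2}v\right) = 2pq \\
\tfrac{1}{4}v^2 + vw + w^2 &= \left(w + \tfrac{1}{2}v\right)^2 = q^2
\end{aligned} \tag{1.1}$$
and, for the second generation,
$$\begin{aligned}
\left(p^2 + \tfrac{1}{2}\,2pq\right)^2 &= [p(p+q)]^2 = p^2 \\
2\left(p^2 + \tfrac{1}{2}\,2pq\right)\left(q^2 + \tfrac{1}{2}\,2pq\right) &= 2p(p+q)\,q(p+q) = 2pq \\
\left(q^2 + \tfrac{1}{2}\,2pq\right)^2 &= [q(p+q)]^2 = q^2
\end{aligned} \tag{1.2}$$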
meaning that, after a single round of random mating under the conditions above, the genotype frequencies stabilize at Hardy-Weinberg proportions [7].
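The derivation can be checked numerically with a short Python sketch that applies the mating scheme of Table 1.6 to arbitrary starting proportions. The function name `next_generation` and the starting values of u, v and w are hypothetical.

```python
def next_generation(u, v, w):
    """Genotype frequencies (A1A1, A1A2, A2A2) after one round of random mating."""
    # (frequency of the mating type, offspring proportions), as in Table 1.6
    matings = [
        (u * u,     (1.0, 0.0, 0.0)),     # A1A1 x A1A1
        (2 * u * v, (0.5, 0.5, 0.0)),     # A1A1 x A1A2
        (2 * u * w, (0.0, 1.0, 0.0)),     # A1A1 x A2A2
        (v * v,     (0.25, 0.5, 0.25)),   # A1A2 x A1A2
        (2 * v * w, (0.0, 0.5, 0.5)),     # A1A2 x A2A2
        (w * w,     (0.0, 0.0, 1.0)),     # A2A2 x A2A2
    ]
    return tuple(sum(freq * off[k] for freq, off in matings) for k in range(3))

u, v, w = 0.5, 0.2, 0.3            # arbitrary starting proportions (not in HWE)
p, q = u + v / 2, w + v / 2        # allele frequencies: p = 0.6, q = 0.4
gen1 = next_generation(u, v, w)    # first generation
gen2 = next_generation(*gen1)      # second generation
print(gen1, gen2)
```

Both generations come out at approximately (0.36, 0.48, 0.16), that is (p², 2pq, q²), even though the starting proportions were not in equilibrium.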
Testing for equilibrium
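Departures from HWE are generally measured at a given SNP using a χ² goodness-of-fit test between
the observed and expected genotype counts. The χ² statistic is defined as
$$\chi^2 = \sum_i \frac{(O_i - E_i)^2}{E_i} \tag{1.3}$$
where O_i and E_i are the observed and expected absolute frequencies of each of the n genotypes in a
population at that locus. This test statistic has a χ² distribution with n − 1 degrees of freedom [8]; when the expected counts are computed from allele frequencies estimated in the same sample, one further degree of freedom is lost for each estimated frequency, which leaves 1 degree of freedom for a biallelic locus. A deviation from HWE implies a violation of at least one of the assumptions stated above; it is usually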
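As an illustration of the goodness-of-fit test in equation (1.3), the Python sketch below computes the statistic for one biallelic SNP from its three genotype counts. The function name and the example counts are hypothetical; the p-value assumes 1 degree of freedom, which is an assumption of this sketch reflecting that the allele frequency is estimated from the same sample.

```python
import math

def hwe_chi2(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for HWE at one biallelic SNP.

    n_aa, n_ab, n_bb: observed counts of the three genotypes.
    Returns the chi-square statistic and a p-value with 1 degree of freedom.
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)       # estimated frequency of the first allele
    q = 1.0 - p
    observed = [n_aa, n_ab, n_bb]
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

print(hwe_chi2(25, 50, 25))   # counts in exact Hardy-Weinberg proportions
print(hwe_chi2(50, 0, 50))    # no heterozygotes at all: strong departure
```

The first call gives a statistic of 0, while the second gives a statistic of 100 and a vanishingly small p-value.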
1.2.5
Population substructure
Population substructure, also referred to as population admixture or population stratification, is the
presence of genetic differences between subpopulations of an apparently homogeneous population due to genetic history (e.g., migration, selection, and/or ethnic integration). Principal component analysis
(PCA) is widely used to detect and visualize hidden population substructure that is not apparent in the data and which may lead to spurious results when analyzing characteristics of the population as a
whole [8].
The central idea of PCA is to reduce the dimensionality of a data set consisting of a large number of interrelated variables (in the case of an association study, these are the thousands of genetic markers),
while retaining as much of the variation present in the data set as possible. This is achieved by transforming to a new set of variables, the principal components (PCs), which are uncorrelated, and which
are ordered so that the first few retain most of the variation present in all of the original variables.
Assuming biallelic markers, the data can be seen as a large rectangular matrix C, with rows indexed by individuals, and columns indexed by polymorphic markers. For each marker, there is a
reference and an alternative allele. We suppose there are n markers and m individuals, and that the number of markers is much larger than the number of samples, n ≫ m.
Let C(i, j) be the number of reference alleles for marker j, individual i. Thus, for autosomal loci, we have C(i, j) ∈ {0, 1, 2}. For each column of C, we calculate its mean µ(j) and standard deviation σ(j),
and obtain a new matrix M
$$M(i, j) = \frac{C(i, j) - \mu(j)}{\sigma(j)},$$
generally called the variance-standardized genetic relationship matrix. This step of normalization is intended to make the markers (co-variables) comparable, reducing their mean and variance to 0 and 1,
respectively. With this matrix, we can now define
$$X = M M^{T},$$
a square m × m matrix, with dimensions equal to the number of sampled individuals. We then compute the eigenvalues of matrix X and the corresponding eigenvectors, which are called the PCs. The
eigenvector corresponding to the highest eigenvalue is PC1; the eigenvector corresponding to the second highest eigenvalue is PC2; and so on. The variation
explained by PCs decreases, with the first PC explaining the most variation [9].
Plotting PCs against each other can show evidence of population substructure, by clustering the individual data across these new axes of variation. Those PCs which are found to be significant can
subsequently be used as co-variables in regression models (see Chapter 3).
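The procedure described in this section can be sketched with NumPy (assumed to be available). The function name `genotype_pcs`, the simulated allele frequencies and the sample sizes are hypothetical; dedicated tools such as EIGENSOFT implement considerably more than this illustration, and monomorphic markers are simply dropped here because they cannot be standardized.

```python
import numpy as np

def genotype_pcs(C, k=2):
    """Top-k principal components of an m x n genotype matrix with entries 0, 1, 2."""
    C = np.asarray(C, dtype=float)
    mu = C.mean(axis=0)                    # per-marker mean
    sigma = C.std(axis=0)                  # per-marker standard deviation
    keep = sigma > 0                       # drop monomorphic markers
    M = (C[:, keep] - mu[keep]) / sigma[keep]
    X = M @ M.T                            # m x m matrix
    eigvals, eigvecs = np.linalg.eigh(X)   # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]  # indices of the k largest eigenvalues
    explained = eigvals[order] / eigvals.sum()
    return eigvecs[:, order], explained

# Simulated data: two subpopulations of 20 individuals with very different allele frequencies
rng = np.random.default_rng(0)
n_markers = 500
freq_a = rng.uniform(0.1, 0.4, n_markers)
freq_b = 1.0 - freq_a
C = np.vstack([
    rng.binomial(2, freq_a, size=(20, n_markers)),
    rng.binomial(2, freq_b, size=(20, n_markers)),
])
pcs, explained = genotype_pcs(C, k=2)
print(explained)
```

With such strongly differentiated subpopulations, the sign of PC1 separates the two groups, which is the kind of clustering one would look for in a plot of PC1 against PC2.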
1.3
Association studies
Variation in a DNA sequence can influence the risk of developing disease. Early studies investigated genetic variants underlying rare conditions that showed clear Mendelian inheritance patterns in families,
and turned out to be very successful due to these variants carrying 100% disease risk [10].
Scientific efforts have been made to put together as much information about human genome variation as possible, allowing for better designed and less costly studies. Such efforts include the Human
Genome Project (1990-2003), the International HapMap Project (2002-2009) and, more recently, the 1000 Genomes Project (2008-2015).
Investigating the causes of complex diseases has proven to be a much more difficult task, because
there is not one single cause, but rather the combined action of many causal factors, genetic and/or environmental, that predispose to disease development. This means that even variants that confer only a small increase in
relative risk may, when found together in the same genome, contribute significantly to the manifestation of the disorder in question. Genetic association studies aim to detect such variants involved in complex
diseases.
The fundamental idea behind association studies is the comparison of allele or genotype frequencies between cases and controls, in order to relate genetic variants to a certain phenotype (such as a
disease); if a particular allele/genotype is more common among cases than controls, it may be a risk factor and may be subject to further study [8].
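The comparison of allele frequencies between cases and controls can be illustrated with a basic allelic chi-squared test on a 2 × 2 table of allele counts. The sketch below is a simplified illustration, not the procedure of any particular software; the function name and the genotype counts are hypothetical.

```python
import math

def allelic_test(case_counts, control_counts):
    """Allelic chi-square test from genotype counts (AA, Aa, aa) in cases and controls.

    Returns the chi-square statistic, its p-value (1 degree of freedom)
    and the odds ratio for allele A.
    """
    def allele_counts(genotypes):
        n_AA, n_Aa, n_aa = genotypes
        return 2 * n_AA + n_Aa, 2 * n_aa + n_Aa   # counts of allele A and allele a

    a, b = allele_counts(case_counts)        # cases: A alleles, a alleles
    c, d = allele_counts(control_counts)     # controls: A alleles, a alleles
    n = a + b + c + d
    # Pearson chi-square for a 2 x 2 table, without continuity correction
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    p_value = math.erfc(math.sqrt(chi2 / 2))
    odds_ratio = (a * d) / (b * c)
    return chi2, p_value, odds_ratio

# Hypothetical counts: allele A is more common among cases than among controls
print(allelic_test((40, 40, 20), (20, 40, 40)))
```

For these counts the statistic is 16 and the odds ratio is 2.25, so allele A would be flagged as a candidate risk allele for further study.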
There are two main theories for disease associated variants: the ”common disease common
variant” (CDCV) hypothesis, and the ”common disease rare variant” (CDRV) hypothesis. These hypotheses argue contrary views concerning which variants carry the most penetrance, i.e., the proportion of
genetic variations with appreciable frequency in the population, but relatively low penetrance, are the
major contributors to genetic susceptibility to common diseases; the second reasons that multiple rare DNA sequence variations, each with relatively high penetrance, are the major contributors to genetic
susceptibility to common diseases. Both hypotheses stand on empirical evidence, and each uses specific methods for association analysis [11].
In the early days of association studies, the initial choice was to focus on common variants, mainly due to genome-wide surveys of rare variation requiring many more assays than the arrays available at
the time could support. However, there was strong motivation to support the CDRV hypothesis, namely the idea that deleterious variants are likely to be rare due to purifying selection; indeed, loss-of-function
variants, which prevent the generation of functional proteins, are especially rare [12].
Several software packages have been developed to deal with genetic data files, which are typically very large due to the thousands of variants analyzed in most genetic studies. These programs are
mostly command-line based, which makes dealing with these types of files more computationally efficient than working with GUI-based software; they are also freely available online. One
such program is PLINK [6], which provides basic commands for tasks such as calculating allele frequencies, IBD and heterozygosity proportions, converting between multiple file types and performing basic
allelic/genotypic chi-squared association tests. It also performs PCA, but a more specialized command-line program for this purpose is EIGENSOFT. Among others, this program contains the EIGENSTRAT
stratification correction method, which uses principal component analysis to explicitly model ancestry differences between cases and controls along continuous axes of variation.
One last command-line program worth mentioning is ANNOVAR. This tool, given a list of variants and
their corresponding genetic coordinates, uses a number of databases to functionally annotate them. The resulting features of each given variant include the gene it belongs to, whether or not it falls in a coding
region, and whether or not it changes the encoded amino acid, among others. An important feature provided by ANNOVAR is the Combined Annotation Dependent Depletion (CADD) score, a measure of
the deleteriousness of SNVs and indels in the human genome. The CADD scores are ”PHRED-scaled”, meaning that each value expresses a variant’s rank in order-of-magnitude terms rather than the precise rank itself. For
example, variants at the top 10% of CADD scores are assigned to CADD-10, top 1% to CADD-20, top 0.1% to CADD-30, and so on. ANNOVAR also provides information from the Genome Aggregation
are Non-Finnish Europeans (NFE) wherein the Iberian population is included.
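The PHRED scaling of CADD scores mentioned above amounts to a logarithmic transformation of a variant's rank. A minimal sketch follows; the function name `phred_scaled` and the total of 1000 variants are hypothetical.

```python
import math

def phred_scaled(rank, total):
    """PHRED-like score of the variant ranked `rank` (1 = most deleterious) among `total`."""
    return -10 * math.log10(rank / total)

# Variants at the boundary of the top 10%, 1% and 0.1% of a hypothetical ranking
for rank in (100, 10, 1):
    print(rank, phred_scaled(rank, 1000))
```

The three boundaries map to scores of approximately 10, 20 and 30, matching the CADD-10, CADD-20 and CADD-30 thresholds.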
Several psychiatric disorders, such as schizophrenia [13], depression [14] or Alzheimer’s disease (AD) [15], have been found to be polygenic and were associated with several genetic variants. In this work, we describe the general procedures of association studies and the corresponding statistical methods,
followed by an application to the case of AD in patients from the Iberian Peninsula.
1.4
An introduction to Alzheimer’s disease
1.4.1
Symptoms, causes and available diagnosis
Alzheimer’s disease (AD) is the most common type of dementia. It usually begins with subtle memory failure, which worsens over time and begins to affect an individual’s daily living. A person suffering from
this condition will eventually have trouble recognizing people, naming objects, dealing with everyday chores and personal care, behaving appropriately in social situations, among others. At an advanced
stage of the disease, the patient will require constant care. After the first symptoms appear, an individual usually survives 8 to 10 years, but the course of the disease can go up to 25 years, ending in death by
pneumonia, malnutrition or general inanition [16].
There are three stages to AD. The early stage is mild Alzheimer’s disease, when a person can still function independently but has a few memory lapses, such as forgetting familiar words or the location of
everyday objects. Individuals with mild AD are first diagnosed with a condition called mild cognitive impairment (MCI), which has symptoms similar to those of the early stage of AD; deciding whether
the MCI observed in an individual is due to AD relies on brain imaging and cerebrospinal fluid tests. The middle stage, or moderate Alzheimer’s disease, is typically the longest stage; the symptoms become
more pronounced and the patient will require more care. The late stage is called severe Alzheimer’s disease, when individuals lose the ability to respond to the environment and need constant help in performing
daily activities. It is often not easy to place an individual at a specific stage, as they may overlap [16].
AD can be classified according to the age of onset. The most common type is late-onset Alzheimer’s disease, which constitutes approximately 95% of the cases, and affects individuals whose first symptoms
the age of onset is below 65 [17].
Cause                                                  % of cases
Late-onset familial                                    15-25
Early-onset familial                                   <2
Down syndrome                                          <1
Unknown (includes genetic/environment interactions)    ∼75
Tbl. 1.7 – Causes of Alzheimer’s disease.
The main causes of AD are described in Table 1.7. Approximately 25% of all AD is familial (i.e., ≥3
persons in a family have AD) and 75% is nonfamilial (i.e., an individual with AD and no known family history of AD); the onset of nonfamilial Alzheimer’s disease is usually at an advanced age. Because
familial and nonfamilial AD appear to have the same clinical and pathologic phenotypes (observable manifestation), they can only be distinguished by family history and/or by molecular genetic testing [16].
Most cases of early-onset AD are due to genetic factors transmitted from parent to child. Research
has shown that this form of the disease mostly results from a variation in one of these three genes: APP, PSEN1 or PSEN2. When any of these genes is altered, large amounts of amyloid β-peptide, a
toxic protein fragment, are produced in the brain. This peptide builds up to form clumps called ”amyloid plaques”, characteristic of Alzheimer’s disease, which lead to the death of nerve cells and the progressive
signs and symptoms of this disorder [16].
Some evidence indicates that essentially all persons with Down syndrome develop the neuropathologic hallmarks of AD after age 40. Down syndrome, a condition characterized by intellectual disability and
other health problems, occurs when a person is born with an extra copy of chromosome 21 in each cell. The presumed reason for the association between these two conditions is the lifelong overexpression of
APP on chromosome 21, and the resultant overproduction of β-amyloid in the brain [17].
Research has come to support the concept that late-onset Alzheimer’s disease is a complex disorder, with many susceptibility genes involved, as well as environmental factors (such as higher education, or
exposure to electromagnetic fields [18]).
The gene APOE has been extensively studied and proven to have great influence in the manifestation of AD. APOE is polymorphic, with three major alleles: ε2, ε3 and ε4. The presence of the ε4 allele
respectively. The APOE ε2 allele has been shown to have a protective effect [16]. APOE alleles are determined
by the two SNPs rs429358 and rs7412 as shown in Table 1.8.
rs429358 rs7412 Allele
C T ε1
T T ε2
T C ε3
C C ε4
Tbl. 1.8 – APOE allele according to the genotype for SNPs rs429358 and rs7412.
However, the presence of APOE ε4 does not determine that an individual will develop the disease; in fact, approximately 42% of individuals with AD do not have any APOE ε4 allele. Similarly, the absence
of APOE ε4 does not rule out the possibility of one developing AD.
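The correspondence in Table 1.8 can be written as a simple lookup. The sketch below assumes that the two SNPs have already been phased into haplotypes, which is an assumption of this illustration; the names `APOE_ALLELE` and `apoe_genotype` are hypothetical, and the alleles are written as e1 to e4 for ε1 to ε4.

```python
# Haplotype (rs429358 allele, rs7412 allele) -> APOE allele, as in Table 1.8
APOE_ALLELE = {
    ("C", "T"): "e1",
    ("T", "T"): "e2",
    ("T", "C"): "e3",
    ("C", "C"): "e4",
}

def apoe_genotype(hap1, hap2):
    """APOE genotype from two phased haplotypes of (rs429358, rs7412) alleles."""
    alleles = sorted(APOE_ALLELE[hap] for hap in (hap1, hap2))
    return "/".join(alleles)

def carries_e4(hap1, hap2):
    """True if at least one of the two haplotypes corresponds to the e4 allele."""
    return "e4" in (APOE_ALLELE[hap1], APOE_ALLELE[hap2])

print(apoe_genotype(("T", "C"), ("C", "C")))   # one e3 and one e4 haplotype
```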
Currently, the only definitive way to establish a diagnosis of AD is to microscopically examine a section of the person’s brain tissue after death. However, there are several approaches that have
proven to be highly effective in diagnosing Alzheimer’s disease in a living patient.
The initial step is to consult a specialist doctor (psychiatrist), who will review the individual’s medical history and analyze the symptoms, as well as conduct a series of tests of the cognitive and
physical abilities of the individual. The Mini Mental State Examination (MMSE) is one such test widely used for this purpose. It is not unusual for the doctor to interview friends and family of the patient, to
better understand their behavioral changes over time. This series of clinical assessments often provides enough information to perform a correct diagnosis [16]; however, the diagnosis is not always clear, and may require
further, more advanced testing.
Analysis of electroencephalograms (EEGs) can also be used as a means of diagnosis. EEGs are used to register electrical activity in the brain, focusing namely on spectral measures, which include the
classical brainwaves in delta, theta, alpha, beta and gamma frequencies. Each one of these brainwaves is an endophenotype of Alzheimer’s disease. Brainwaves are activated according to our actions, feelings,
circadian rhythm, and some disorders may trigger the over-expression or inhibition of a given brainwave. In order to interpret the EEG, it is important to understand which behaviors lead to certain variations in
activity of each brainwave.
generated in deep meditation and dreamless sleep, when healing and regeneration processes are triggered.
Theta (θ, 4-8 Hz) brainwaves are connected with the learning, memory, and intuition functions. Alpha (α, 8-13 Hz) waves aid overall mental coordination and learning. Beta (β, 13-30 Hz) brainwaves
are present when we are alert, engaged in problem solving, judgment, decision making, or focused mental activity; they dominate our normal waking state of consciousness. Gamma (γ, >30 Hz)
brainwaves are the fastest of brain waves, and relate to simultaneous processing of information from different brain areas. More detailed information can be found online at https://brainworksneurotherapy.com/what-are-brainwaves.
Other means of diagnosis include laboratory testing, which is usually performed as a way of ruling out conditions that cause similar symptoms to Alzheimer’s, such as nutritional deficiencies or other diseases
that could be affecting the person’s memory. These tests make use of blood, urine and cerebrospinal fluid samples [19].
One final diagnosis method worth mentioning is brain-imaging testing, such as computed tomography
(CT) or magnetic resonance imaging (MRI) scans. These make it possible to look for evidence of trauma, tumors, and stroke that could be causing dementia, and to look for brain atrophy (shrinkage) that may be present
later in the progression of Alzheimer’s disease. These tests require that the person remain still for a period of time [19].
All the methods above provide information that makes it possible to rule out a series of conditions that cause symptoms similar to AD. Such conditions are, for instance, past strokes, Parkinson’s disease and depression [19].
1.4.2
The association studies approach
The methods of diagnosis described above require that the individual is showing symptoms of the
disease. Some of them may be considered invasive, such as lab testing (which requires lumbar puncture and spinal fluid collection); others may be inaccessible to the majority of the population due to their
high cost, such as an MRI scan. In addition, imaging techniques may provide poor quality results, as Alzheimer’s patients tend to have a hard time staying still even for short periods of time, especially at
an advanced stage of the disease.
which increase risk of developing the disease, it could be possible to make an early diagnosis, since the
genetic material we carry is the same throughout our lifetime (except for mutations that may occur). This way, the disease could be prevented even before the appearance of symptoms, and thus prolong the
quality of life of a potential future AD patient.
This would also be a cheaper alternative, and can be made less invasive, while maintaining sample
Chapter 2
Study design and data quality control
Genetic association studies can essentially be divided into candidate gene (CG) and genome-wide
association (GWA) studies. CG studies are based on the prior hypothesis of a potential role of selected genes or genetic regions on a specific phenotype or disease, taking into consideration their biological
function or association in previous studies. Genome-wide association studies, on the other hand, make use of information on the variation across the entire human genome, and are useful for
hypothesis-generating purposes [10].
GWA analyses usually target relatively common SNPs. CG studies, however, focus on the effects of rare variants, which may be hard to detect, especially when dealing with small sample sizes. Besides
being more cost effective than sequencing an entire human genome, studying the part that rare variants play in disease has been largely motivated by the CDRV hypothesis [11]. Indeed, if a certain variant
has a large deleterious effect, it may also impact fitness and thus become less and less frequent in each generation.
This chapter describes the general procedures of an association study from the study design, through the process of data collection and up to the quality control steps, taking into account which methods suit
2.1
Study design
Describing the phenotype accurately
The phenotype of interest must be defined as accurately and specifically as possible, in a way that minimizes the likely causal heterogeneity based on existing clinical and biological evidence. Such
definitions may change, as more information becomes available. This will increase power of detection of an effect and allow for replication studies [10].
Checking disease heritability
Heritability is a measure of how well differences between individuals’ genes account for differences in their traits, i.e., how much of the variation in a given trait can be attributed to genetic variation (as
opposed to environmental causes) [3].
Heritability is assessed by studying disease patterns in family members, namely by comparing
monozygotic with dizygotic twins. Because monozygotic twins are genetically identical (the two alleles in each locus are IBD), while dizygotic twins are expected to share, on average, half of their alleles, comparing
disease status in twins can shed light on the role of genetic factors [10].
Diseases which have been shown to have low heritability will likely need very large sample sizes in order to find etiological genetic variants. Moreover, in diseases with heritability close to zero, there isn’t
much advantage in conducting a genetic case-control study [10].
Choosing the best approach to the problem
Concerning sample relatedness, association studies can be sorted into two categories: population-based case-control studies, and family-based studies. The first approach may require several thousands
of cases of the phenotype of interest. This number can be decreased, and power of the study increased, by recruiting cases with family history of the condition, or even multiple cases from the same family
(adjusting for familial correlation), for a sample with a more homogeneous genetic background; this is called ”enrichment sampling” [20]. This sampling method does not always increase power in genetic
studies, as familial aggregation may be due to shared environmental factors, for example [10].
more of the underlying genetic variants are common. Moderately rare variants could also be detected,
but only if they carry a large effect. A prior hypothesis that all undetected variants are rare and of small effects would require an unfeasibly large sample size, in order to have power to detect the effect of single
variants [21].
If the case definition is a phenotype that shows clear segregation in families, then a population-based case-control approach is no longer suitable, and a family-based study is preferable [10].
Control selection
The golden rule of control selection for any case-control study is that cases and controls should belong to the same population, and controls must be representative of the members of that population who would have become cases had they developed the disease, according to the case definition and the recruitment strategies for the study. This minimizes false
positives and confounding [22].
Bias due to environmental factors is generally not a problem in association studies; the most important type of bias is related to the ethnic origin of cases and controls. This is commonly referred to as
”population stratification”, and is an example of a confounding variable. Under this situation, differences in allelic frequencies between cases and controls are due to the underlying sampling scheme, rather than
an actual effect of the variant on disease risk [23].
The effects of population stratification can sometimes be avoided at the study design level (by matching controls to cases on potentially important confounders) or the data analysis level (by adjusting the results
for these confounders). Matching is only essential when the effect of the confounder cannot be accurately measured or is too large to be adjusted for in the analysis [24].
Population stratification is minimized when controls are matched to cases on ethnicity, or when the
sample is restricted to a particular ethnic group. Further matching on sex can reduce population stratification in situations where there are gender differences in disease prevalence. Matching on age may
improve power of the study by ensuring that controls had the same opportunity as cases to develop (and be diagnosed with) the disease. This could be a problem when dealing with age-related diseases
such as Alzheimer’s disease. Whether or not further matching is necessary and decreases population stratification will depend on the disease in question [10]. Remaining stratification can be investigated
and controlled (to some extent) by analytical methods [25, 26].
healthy controls, mainly due to it being a much more economical approach. It is important that basic
characteristics of such panels are known, such as ethnicity, sex, age and area of recruitment, so that they can be matched to in the design or adjusted for in the analysis [10].
The described methods of control selection are specific to studies intended to assess genetic risk, and
are no longer suitable if we incorporate environmental factors [10].
Sample size
Sample sizes for each study will depend on the existence of case sub-groups and a priori hypotheses
to be tested, on whether it is a CG or a GWA approach, among (many) other factors. Estimating the required sample size often relies on empirical results from simulation studies [10].
The lack of availability of genetic information from cases for an association study often relates to economic issues. When testing many SNPs, a one-stage design can be very expensive, so one can
resort to a multi-stage design, where all SNPs are tested in a random subset of cases and controls, and those found significant are taken through to be tested in the remainder of the study sample [27]. The
power of a study can also be potentially improved with an increased control/case ratio [24].
Replication studies
Theoretical considerations prove that, when true discovery is claimed based on crossing a threshold
of statistical significance and the discovery study is underpowered, the observed effects are expected to be inflated. Furthermore, flexible analyses coupled with selective reporting may inflate the published
discovered effects. Therefore, a study designed to replicate a finding should base sample size calculations on smaller effect sizes [28].
A true replication study must be performed on a population comparable to the original, i.e., it must involve the analysis of the same polymorphism, with the same direction of effect, in the same ethnic
population and measured on the same phenotype. Failure to replicate findings in a different population does not allow the validity of the results of the original study to be judged; it can only elucidate the lack of
2.2 Data collection and variant calling
Once the study has been designed, the next step is to collect the data for analysis. Genotyping of individuals in association
studies is usually done with DNA microarrays. These consist of specific DNA sequences (known as probes), each corresponding to a short section of a gene or other DNA sequence of the human genome.
Probes are usually 100 to 10 000 bases long and fluorescently labeled. Among the manufacturers of DNA microarrays was Affymetrix, Inc., a company now owned by Thermo Fisher Scientific. This company
developed the GeneChip array technology and the Affymetrix Power Tools (APT), which can be used for variant calling, quality control and genotyping.
A GeneChip array can contain up to thousands of DNA probes, designed to vary in specific locations matching those of known human genome variation. When the probes and a DNA sample are placed in
the same environment, the DNA breaks up into fragments, which attach to the corresponding probes in a process called hybridization. This emits a measurable fluorescent signal that allows the nucleotide
sequence in each fragment to be identified, and thus the sequence of the DNA sample to be determined.
Two important measures to consider when assessing the quality of the variant calling process are the
dish quality control (DQC) and the quality control call rate (QCCR). DQC is a measure of the contrast between the adenine-thymine (AT) and cytosine-guanine (CG) signals, and is defined as
$$\mathrm{DQC} = \frac{\text{AT Signal} - \text{CG Signal}}{\text{AT Signal} + \text{CG Signal}}$$
QCCR is the proportion of non-missing data for each individual.
The “Axiom Genotyping Analysis Guide” by Affymetrix provides threshold values for these measures, and any subject falling below them should be eliminated from further study. Their best practices
guide can be found online at https://assets.thermofisher.com/TFS-Assets/LSG/manuals/axiom_genotyping_solution_analysis_guide.pdf.
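The two measures and the resulting sample filter can be sketched as follows. This is a minimal illustration of the definitions given above, not the APT implementation: the function names are hypothetical, the signals are treated as single summary values per sample, missing calls are assumed to be encoded as `None`, and the thresholds must be supplied by the caller from the guide.

```python
def dish_qc(at_signal, cg_signal):
    """Contrast between AT and CG signals, as defined in the text."""
    total = at_signal + cg_signal
    if total == 0:
        raise ValueError("AT and CG signals sum to zero; DQC is undefined")
    return (at_signal - cg_signal) / total


def qc_call_rate(genotype_calls, missing=None):
    """Proportion of non-missing genotype calls for one individual."""
    if not genotype_calls:
        raise ValueError("no genotype calls provided")
    called = sum(1 for g in genotype_calls if g != missing)
    return called / len(genotype_calls)


def passes_sample_qc(dqc, qccr, dqc_min, qccr_min):
    """Keep a sample only if neither measure falls below its threshold."""
    return dqc >= dqc_min and qccr >= qccr_min


# Hypothetical sample; the thresholds below are illustrative placeholders
# only and should be replaced by the values recommended in the guide.
dqc = dish_qc(at_signal=3.0, cg_signal=1.0)
qccr = qc_call_rate(["AA", None, "AB", "BB"])
print(dqc, qccr, passes_sample_qc(dqc, qccr, dqc_min=0.8, qccr_min=0.95))
```

Passing the thresholds as arguments keeps the sketch independent of any particular array type or version of the guide.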
Another relevant measure when doing probe QC is the heterozygosity rate, i.e., the proportion of heterozygous genotypes of an individual. Let µ and σ denote the mean and standard deviation of this rate across all individuals in the sample. When the rate is too high (usually higher than µ + 3σ) for a given individual, it could hint at sample contamination; when it is too low (below µ − 3σ), it could indicate inbreeding. In any case, it is
recommended that the individuals falling outside this interval be discarded from further analysis [5].
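The µ ± 3σ rule can be sketched as follows. This is a minimal illustration assuming that genotypes are stored as pairs of alleles, that missing calls are encoded as `None`, and that σ is estimated with the sample standard deviation. The function names and the example data are hypothetical.

```python
from statistics import mean, stdev


def heterozygosity_rate(genotypes):
    """Proportion of heterozygous calls among non-missing genotypes.

    genotypes: list of (allele1, allele2) tuples, with None for missing calls.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no non-missing genotypes")
    return sum(1 for a, b in called if a != b) / len(called)


def heterozygosity_outliers(rates, n_sd=3.0):
    """Return the sample IDs whose rate lies outside mean +/- n_sd * sd.

    rates: dict mapping sample ID to heterozygosity rate.
    """
    values = list(rates.values())
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation (an assumption)
    lower, upper = mu - n_sd * sigma, mu + n_sd * sigma
    return sorted(s for s, r in rates.items() if r < lower or r > upper)


# Hypothetical data: 19 typical samples and one with excess heterozygosity.
rates = {f"s{i}": 0.30 + 0.001 * (i % 5) for i in range(19)}
rates["contaminated"] = 0.60
print(heterozygosity_outliers(rates))
```

Note that µ and σ are computed from all samples, including the outliers themselves, so in small samples an extreme value inflates σ and may mask itself.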
One final quality control step for variants before genotyping is to sort them into categories according