The Role of Non-coding DNA Structural Information in Phylogeny, Evolution and Disease

João Miguel Sotto Maior Faria Carneiro

Biology

Department of Biology of the Faculty of Sciences of Porto, 2013

Supervisor

António Manuel Amorim dos Santos, Full Professor, Faculty of Sciences, University of Porto

Co-supervisor

Maria João Ribeiro Nunes Ramos, Full Professor, Faculty of Sciences, University of Porto

ACKNOWLEDGEMENTS

I thank my supervisor, António Amorim, for his constant availability to discuss the ideas related to this thesis. Above all, I am grateful for the trust he always placed in me. To my co-supervisor, Maria João Ramos, my thanks for the support and availability she showed in discussing topics that were crucial to the thesis.

I thank everyone who works at IPATIMUP and collaborated with me in some way, since part of the thesis was developed on the premises and with the resources of this institute. I thank the theoretical and computational biochemistry group of the Faculty of Sciences of the University of Porto for the support provided with computational resources and with computing questions. Special thanks to Irina Moreira for all the support she gave me.

To the whole population genetics group working at IPATIMUP, my thanks for the support and for the availability to answer the questions that arose. Special thanks to Filipe Pereira, Luísa Azevedo, Raquel Silva, Rune Matthiesen and Ricardo Araújo for their constant support.

I also thank my whole family, especially my father and my mother. To my wife, Selma, who has always been by my side, thank you for the trust you place in me. Special thanks also to her parents, Joaquim and Celeste. To all my family and friends, thank you for your support.

Thesis submitted to the Faculty of Sciences of the University of Porto for the degree of Doctor in Biology.

1. Summary

Non-coding deoxyribonucleic acid (DNA) regions represent approximately 98% of the human genome and a relevant part of mitochondrial DNA (mtDNA). Coding and non-coding DNA regions contrast clearly in their levels of genetic diversity, genomic architecture and distribution of regulatory elements. The distinctive features of coding and non-coding regions were assessed using recently developed methodologies for DNA analysis. For this purpose, four genetic models were used in this thesis: a) metallothioneins (MT), where specific mutational events converted a transcribed coding region into a non-coding region; b) nicotinamidase (PNC) and nicotinamide phosphoribosyltransferase (NAMPT) genes, which present critical structural hotspots related to the functionality of the respective proteins and might have implications for the maintenance of expressed coding regions; c) non-coding mtDNA regions; and d) non-coding short tandem repeats (STRs).

The contrasts between coding regions (protein-coding genes) and non-coding regions (pseudogenes) were addressed through a phylogenetic analysis of duplicated genes (model a). The history of post-duplication events in mammalian evolution was explored by studying MT family members, in which different mutational events can lead either to a new function or to pseudogenisation.

Analysis of NAMPT and PNC homologous genes (model b) in different species was used to establish the relationships between mutations that occurred during evolution and their consequences for metabolic pathways and pathological conditions (e.g., cancer). The critical residues at the active site of invertebrate NAMPTs, and those interacting with the substrate, nicotinamide, were maintained, according to both protein-docking analysis and expression data. Nevertheless, additional hydrogen bonds and hydrophobic contacts were found in PNCs, which can be explained by complementary amino acid changes resulting from epistatic (compensatory) interactions. Structural conservation, validated by experimental expression data, was used to ascertain the current functional status and the evolutionary time depth of transcriptional loss of both NAMPT and PNC proteins in different species. This was useful for understanding the molecular behaviour of specific chemical bonds (e.g., H-bonds) in proteins; the same bonds were also analysed in the non-B DNA conformations (models c and d) located in non-coding regions, even though proteins and DNA are different types of molecules. In this way, knowledge of computational molecular systems applied to proteins can be used to build models of the DNA structures found in non-coding regions.

The study of structural changes in non-B DNA conformations is very important since, as in proteins, DNA can adopt different structures related to specific properties. Furthermore, the genome architecture (coding versus non-coding) led us to analyse the specificities of non-B conformation formation in the complete mtDNA genome and their implications for biological processes (model c). Non-coding regions played a critical role in the process that generates the different deleted mtDNA molecules associated with disease.

Finally, a new methodology for detecting secondary and tertiary DNA structures in non-coding regions was developed (model d). Available data for Y-chromosome short tandem repeats (STRs) were investigated using structure prediction software and new algorithms to identify non-B DNA conformations. Evaluation of these structures was attempted using molecular dynamics simulations and molecular mechanics Poisson-Boltzmann surface area (MM-PBSA) calculations. Single-stranded and UNAFold-predicted DNA conformations were analysed using computational chemistry methodologies. Molecular structural features present in nuclear DNA (STRs) were inferred and correlated with different biological processes and diseases. Our analysis predicted hairpins that can arise in single-stranded STRs. The occurrence of these non-B DNA conformations in non-coding regions might influence or regulate transcription in protein-coding regions, as well as processes that depend on specific folding potential, such as DNA replication.

There was a clear contrast between the protein-coding (models a and b) and non-coding (models c and d) parts of the genome. The ability of these two types of region to give rise to three-dimensional molecular structures was assessed. This thesis demonstrates that non-B DNA, like proteins, can adopt different conformations. In non-coding regions, the formation of non-B DNA conformations has implications for evolution, deletions, replication and disease (models c and d).

Summary (translation of the Portuguese Sumário)

Non-coding regions of deoxyribonucleic acid (DNA) represent about 98% of the human genome and a relevant part of mitochondrial DNA (mtDNA). There is a clear contrast between coding and non-coding DNA regions in their levels of genetic diversity, genomic architecture and distribution of regulatory elements. Using recently developed methodologies for DNA analysis, the unique features of coding and non-coding regions were determined. For this purpose, four genetic models were used in this work: a) metallothioneins (MT), where specific mutation patterns can convert a transcribed region into a non-coding region; b) the genes encoding the enzymes nicotinamidase (PNCs) and nicotinamide phosphoribosyltransferase (NAMPTs), which present critical structural hotspots related to the functionality of the respective proteins, with implications for the maintenance of expressed coding regions; c) the non-coding regions of mtDNA; and d) the repetitive non-coding regions in microsatellites.

Using model a, the contrasts between coding regions (genes) and non-coding regions (pseudogenes) were analysed through a phylogenetic analysis of duplicated genes. The evolution of post-duplication events in mammalian MT genes was explored by studying different mutational events that can determine whether or not a gene is functional.

The analysis of the homologous NAMPT and PNC genes (model b) in different species was used to establish the relationships between residues resulting from mutations during evolution and their consequences for metabolic pathways and pathological conditions (for example, cancer). The critical residues of the active site and the interactions of NAMPTs with the substrate, nicotinamide, were maintained, considering both the docking analysis and protein expression. However, additional hydrogen bonds and hydrophobic contacts were found in PNCs, which can be explained by complementary amino acid changes resulting from epistatic interactions. Structural conservation validated by experimental expression data was used to evaluate the functional status and the evolutionary time depth of transcriptional loss in these proteins. This was useful for understanding the molecular behaviour of specific chemical bonds (for example, hydrogen bonds) in proteins, which were also analysed in the non-canonical DNA conformations (models c and d) located in non-coding regions. In this way, knowledge of computational molecular systems applied to proteins can be used to build models of the DNA structures found in non-coding regions.

The study of structural changes in non-B DNA conformations is very important since, as in proteins, they can adopt different structures related to specific properties. In addition, the architecture of the genome (coding versus non-coding) led us to analyse the specificities of non-B conformation formation in the complete mitochondrial genome and their implications for biological processes (model c). These structures, located in non-coding regions, appear to play a critical role in the process that generates deletions in mitochondrial genome molecules associated with certain diseases.

Finally, a new methodology for detecting non-B DNA structures in non-coding regions was developed (model d). The available data for Y-chromosome microsatellites were studied using computer programs for structure prediction and new algorithms to identify new non-B DNA conformations. The evaluation of these structures was attempted by means of molecular dynamics simulations, thermodynamic integration and MMPBSA (Molecular Mechanics Poisson-Boltzmann Surface Area) calculations. The molecular structural features present in nuclear DNA (microsatellites) were inferred and correlated with different biological processes and diseases. This analysis resulted in the prediction of specific structures that can arise in single-stranded DNA. The occurrence of these non-B DNA conformations in non-coding regions may influence or regulate the transcription processes that occur in protein-coding regions, or processes that depend on specific folding potential, such as DNA replication.

There is a clear association between the protein-coding regions (models a and b) and the non-coding regions of genomes (models c and d). The possibility that these two different regions generate or form three-dimensional molecular arrangements was studied in this thesis. Non-B DNA can adopt different conformations, as in protein systems, which was demonstrated in this thesis. Although proteins and non-B DNA structures each have unique structural features, both molecular systems can adopt three-dimensional conformations. In non-coding regions, the formation of non-B DNA conformations has implications for evolution in general, and specifically for deletions, the aetiology of several diseases, and the replication of genetic material (models c and d).

2. General Introduction and Discussion

List of Figures

Figure 1: Molecular structure of double-helix DNA showing the base-pairing between nucleotides and backbone conformation (figure generated with VMD [1, 2] software).

Figure 2: C-value paradox in eukaryotes: the increase of genome size depends on the increase of non-coding elements (figure from Lynch 2007, “The origins of genome architecture”).

Figure 3: Description of the de novo pathways that synthesize NAD from tryptophan or aspartic acid and of the salvage pathways that recycle NAD from nicotinamide (Nam), nicotinic acid (Na) and their ribosides (source figure from Revollo et al. [3]).

Figure 4: Schematic view of the strand-slippage replication mechanism (figure from Jobling et al. [69]).

Figure 5: Basic wxPython application structure (adapted from wxPython in Action [122]).

Figure 6: Interactions between coding and non-coding genome.

3. General Introduction and Discussion

List of Tables

Table 1: Mutation rates of unique DNA sequences and STR sequences (order of magnitude).

Table 2: Allele range, repeat motif, GenBank accession numbers and reference alleles of Y-STR loci. Repeat motif abbreviations A, T, G, C, W, Y, R, S correspond respectively to adenine, thymine, guanine, cytosine, weak (A or T), pyrimidine, purine, strong (G or C), following the International Union of Pure and Applied Chemistry (IUPAC).

Table 3: Relevant Python modules in molecular data analyses.

Table 4: Common widgets and dialogs implemented in wxPython.

Table 5: Values of enthalpy variation (∆H), entropy variation (∆S), free energy variation (∆G), and free energy variation relative to single-stranded DNA (∆∆G) for tested molecular systems, calculated in AMBER. *The reference is single-stranded DNA (ss).

4. General Introduction and Discussion

List of Abbreviations

Adenosine diphosphate ADP

Adenosine triphosphate ATP

Cytochrome c oxidase CO

Encyclopedia of DNA Elements ENCODE

Entropic contribution TS

Expressed sequence tag EST

Graphical user’s interface GUI

Hypervariable regions HVR

Hydrogen bonds H-bonds

International Union of Pure and Applied Chemistry IUPAC

Ion mobility spectrometry IMS

Metallothionein MT

Messenger Ribonucleic acid mRNA

Nanoelectrospray mass spectrometry Nano-ESI-MS

National Center for Biotechnology Information NCBI

Nicotinamide Nam

Nicotinamide Adenine Dinucleotide NAD

Nicotinamide phosphoribosyltransferase NAMPT

Nicotinic acid Na

Nucleic Acid Builder NAB

Nucleic acids database NDB

Operating system OS

Poly ADP-ribose polymerases PARPs

Polymerase chain reaction PCR

Protein databank PDB

Research Collaboratory for Structural Bioinformatics RCSB

Ribonucleic acid RNA

Ribosomal ribonucleic acid 12S and 16S rRNA

Root-Mean-Square-Deviation RMSD

Short tandem repeats STRs

Stepwise mutation model SMM

Base pairs bp

Transfer RNAs tRNAs

Ubiquinolcytochrome c oxidase reductase Cyt b

Ubiquinone oxidoreductase NADH

Water WAT

Watson-Crick W-C

Molecular Dynamics MD

Molecular mechanics Poisson-Boltzmann solvent accessible surface area MMPBSA

Generalized Born GB

Poisson-Boltzmann PB

Solvent accessible surface area SASA

5. General Introduction

5.1. Coding versus Non-coding DNA

Although the molecular structure of DNA was described long ago [4, 5] (Figure 1), the structural architecture of the genome remains difficult to understand because of its sequence plasticity. Processes such as DNA replication, recombination, mutation and retrotransposition make it difficult to disentangle cause and consequence in the local or global conformational behaviour of DNA. Several approaches have been used to reveal how these processes act and what kinds of change they can produce [6-8]. These studies suggested that several sections of the genome are non-coding regions without any relevance for living cells, and therefore for organisms, with the exception of non-coding functional ribonucleic acids (RNAs) and microRNAs. Classically, non-coding regions are the sections of the genome where transcription does not occur. Transcription is the first step of gene expression: it converts a DNA sequence into a specific molecule (messenger ribonucleic acid, mRNA) that can then be converted into protein. A gene is organized into exons, which are retained in the mature mRNA, and introns, which are transcribed but removed by splicing; introns nevertheless have important roles in alternative splicing and other processes [9]. The conversion of mRNA into protein, called translation, is a process in which each group of three nucleotides in the mRNA specifies one amino acid of the protein. A transcribed gene can also give rise to non-coding molecules (e.g., ribosomal RNAs, microRNAs). The highly organized architecture of coding regions, where genes are present and transcription occurs, was reported to be absent from a large percentage of non-coding regions, and that part of the genome was called “junk DNA”. Several articles have since demonstrated that the so-called non-coding regions are indeed relevant and play a role in many cellular mechanisms, from prokaryotes to eukaryotes [10-19].
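
The triplet reading of mRNA described above can be illustrated with a minimal Python sketch. It is only an illustration, not code from this thesis: the function names `transcribe` and `translate` are hypothetical, the codon table is the standard nuclear genetic code written in its usual compact T, C, A, G ordering, and reading simply starts at the first base and stops at the first stop codon.

```python
# Standard genetic code, written compactly: codons are enumerated with the
# bases in the order T, C, A, G at each of the three positions.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_strand):
    """Return the mRNA matching a DNA coding strand (T replaced by U)."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna):
    """Read an mRNA three nucleotides at a time, from its first base,
    and stop at the first stop codon (hypothetical helper)."""
    protein = []
    for start in range(0, len(mrna) - 2, 3):
        codon = mrna[start:start + 3].upper().replace("U", "T")
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)
```

For example, `translate(transcribe("ATGGCCTAA"))` reads the codons AUG, GCC and UAA and returns `"MA"`. Note that this table is the standard nuclear code; as mentioned later in this chapter, mtDNA uses a different genetic code, so the table would have to be changed for mitochondrial genes.
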
Recently, the Encyclopedia of DNA Elements (ENCODE) project studied transcription, transcription factor association, chromatin structure and histone modification, and revealed biochemically functional regions in almost 80% of the human genome [13-20]. The functional DNA detected does not coincide with the protein-coding regions (exome), yet it plays an important role in the regulation of the genome. The ENCODE database of functional elements might be used to better understand different types of disease (e.g., cancer, rare genetic disorders, common diseases with a genetic component), and therefore to elucidate the relationship between functional non-coding DNA and coding DNA. Herein, the term non-coding will be used for regions that are not protein-coding but where transcription can occur, that might regulate transcription in protein-coding regions (e.g., acting like transcription factors), or that might present a biochemical signature in the cell. This thesis focuses mainly on non-coding regions that are not transcribed (mtDNA control region and STRs) but present a biochemical signature in the cell, and also on the relevance of mutation events in coding regions (MT, NAMPT and PNC genes) to the understanding of the non-coding genome.

The proliferation of “selfish elements” (mainly segments of DNA called mobile elements), continuing until it becomes prohibitive for the survival of the organism, was suggested as an explanation for the expansion of non-coding regions, but it cannot explain spliceosomal introns, small repetitive DNAs and random insertions [21]. Others argue that non-coding DNA results from natural selection and that genome size should have a direct impact on nuclear volume, cell size and cell division rate [22].

There are approximately 250 fully sequenced genomes from different prokaryotic species, with 350-8000 genes each. Eukaryotic genomes are represented by 2455 species already sequenced, and the number of genes in each is higher than 13000 [23-26]. The coding DNA described for prokaryotic and eukaryotic organisms represents only part of the total genome. In eukaryotes, genome size varies widely but the percentage of coding regions in the genomes remains the same (C-value paradox) (Figure 2) [23]. The increase or decrease in genome size results mainly from the expansion of introns and mobile elements (non-coding regions). The variation in complexity is possibly explained by differences in gene deployment: patterns of transcriptional regulation and alternative splicing [23, 27]. The data from the ENCODE project corroborate that differences in gene deployment and transcriptional regulation depend not only on coding segments of the genome but also on non-coding functional elements that interact with gene regions [16, 19, 20].

Figure 1: Molecular structure of double-helix DNA showing the base-pairing between nucleotides and backbone conformation (figure generated with VMD [1, 2] software).

Figure 2: C-value paradox in eukaryotes: the increase of genome size depends on the increase of non-coding elements (figure from Lynch 2007, “The origins of genome architecture”).

In humans, coding regions (≈24000 genes) represent 1% of the genome. The non-coding part, which includes non-coding functional RNA, cis-regulatory elements, telomeres, introns, pseudogenes, repeat sequences, transposons and viral elements, represents about 99% of the total genome [13-17, 19, 23]. In this thesis, some of these non-coding elements (e.g., pseudogenes, model a; mtDNA control region, model c; repeat sequences, model d) were analysed, together with coding regions that can become non-coding or non-functional (models a and b). Different gene regions that are expressed as proteins were analysed, suggesting that they can become non-functional (by accumulation of critical mutations) in different model species. The relationships between different genomic regions and several biological processes (e.g., gene expression, gene pathways) were also assessed. Transcribed regions of the genome can become non-functional through mutational events, or even become non-transcribed elements of the genome (models a and b). On the other hand, regions that are not transcribed, and are therefore not under selective pressure, can have relevant roles associated with structural features of DNA (models c and d). DNA regions prone to form any conformation other than the canonical right-handed Watson-Crick B-form (non-B DNA) play an important role in critical biological processes (e.g., replication, deletions, transcription) that are only now being understood. In this thesis, the structural features of DNA analysed are the non-B DNA conformations (hairpin, cruciform, cloverleaf-like elements and other secondary structures).
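To make the notion of a hairpin-prone sequence concrete, the following Python sketch scans a single strand for a short stem whose reverse complement occurs a few bases downstream, so that the strand could in principle fold back on itself. This is a deliberately simplified, exact-match scan written for illustration only: it is not the algorithm developed in this thesis, it does not use UNAFold, and it ignores thermodynamics. The function names and the default stem and loop lengths are assumptions.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_hairpin_candidates(seq, stem_len=4, min_loop=3, max_loop=8):
    """Return (stem_start, loop_length) pairs where a stem of stem_len bases
    is followed, after a loop of min_loop..max_loop bases, by its exact
    reverse complement. Illustrative only: no mismatches, no energy model."""
    seq = seq.upper()
    hits = []
    for i in range(len(seq) - 2 * stem_len - min_loop + 1):
        target = reverse_complement(seq[i:i + stem_len])
        for loop in range(min_loop, max_loop + 1):
            j = i + stem_len + loop  # start of the candidate second arm
            if j + stem_len > len(seq):
                break
            if seq[j:j + stem_len] == target:
                hits.append((i, loop))
    return hits
```

For instance, `find_hairpin_candidates("GCGCTTTTGCGC")` returns `[(0, 4)]`: the stem GCGC at position 0 can pair with the GCGC arm that follows the four-base TTTT loop. A realistic predictor would also score stem stability and allow imperfect pairing.
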

We used four genetic models to address the questions related to non-coding regions: the metallothioneins (model a), NAD pathway relevant genes (model b), mtDNA (model c), and nuclear short tandem repeats (model d). These models are described briefly in section 5.3.

5.2. Contributions to Articles.

JC's contribution to the article on metallothioneins (model a) consisted of the bioinformatics experiments and the computational analysis of the data (e.g., phylogenetic analysis). JC had no participation in the RT-PCR and expressed sequence tag (EST) analyses.

JC's contribution to the article on NAD pathway relevant genes (model b) was the bioinformatics analysis (e.g., protein-ligand binding, calculations of active site interactions). JC did not participate in the laboratory experiments.

In the analysis of mtDNA non-B conformations (model c), JC helped to develop the bioinformatics tools used to analyse the data (e.g., Python scripts for UNAFold, Circos diagrams) and helped to interpret the results. JC had no participation in collecting the data or in the statistical analysis.

JC and ISM designed the experiments and analysed the data for the Y-STRs article (model d). All authors helped to interpret the results and to write the article.

JC designed the software and implemented the algorithms for the SPInDel workbench article. All authors helped to write the article.

5.3. Genetic Models.

5.3.1. Gene Families: The Metallothioneins

Lineage-specific traits and novel biological functions may arise from pre-existing genes [28-30]. The chance that a novel biological function emerges (neofunctionalisation) is expected to be lower than the chance of inactivation (pseudogenisation) [31]. Adaptive changes are therefore less frequent, since most amino acid replacements are neutral or deleterious. Models such as the mammalian metallothionein family (MT family) can be used to study particular pathways of neofunctionalisation, pseudogenisation or subfunctionalisation [32-34]. Since several mammalian genomes are currently available, it is possible to study MT clusters and the evolutionary steps underlying the expansion of this gene family. MTs are metal-binding proteins involved in the homeostasis and transport of essential metals. They are also relevant in protecting cells against heavy metal toxicity [35, 36], and thus have a critical role in many biological processes. The reconstruction of the evolutionary history of MT clusters, combined with the expression profile of MT genes and the behaviour of structural interactions of specific residues, can help us to understand the relevant features of non-coding regions that result from specific duplications in mammalian genomes.

5.3.2. NAD Pathway Relevant Genes: NAMPT and PNC.

Several redox reactions (chemical reactions in which atoms have their oxidation state changed) occurring in the cells of prokaryotes and eukaryotes use nicotinamide adenine dinucleotide (NAD) as a cofactor [37-42]. NAD mediates the regulation of metabolism and energy production, and it can also act as a substrate for NAD-consuming enzymes, such as poly (ADP-ribose) polymerases (PARPs) and sirtuins. NAD is involved in DNA repair, transcriptional silencing and cell survival [3].

NAD synthesis has been studied along different routes that depend on alternative precursors. De novo pathways synthesize NAD from tryptophan or aspartic acid, and the salvage pathways recycle NAD from nicotinamide (Nam), nicotinic acid (Na) and their ribosides [39] (Figure 3).

In humans, the major source of intracellular NAD is the nicotinamide salvage pathway [38], but several microorganisms also need this pathway to grow [42-44]. Mammalian cells lack nicotinamidases, which makes these enzymes a target for the development of drugs against infectious diseases and of anti-parasitic therapies [43-47].

In yeast and invertebrates, the nicotinamidase gene PNC1 has been described as a biomarker of stress and a regulator of sirtuin [37, 48]. Some studies have tried to correlate these enzymes with aging [49] and infection [43-45, 49].

Inflammation and disease have also been associated with the functional homologue of nicotinamidase in vertebrates, nicotinamide phosphoribosyltransferase (NAMPT) [50, 51]. Nicotinamidase expression protects human neural cells, and an increase in PNC1 and sirtuin activity also protects against proteotoxic stress in yeast and C. elegans [52, 53].

Figure 3: Description of the de novo pathways that synthesize NAD from tryptophan or aspartic acid and of the salvage pathways that recycle NAD from nicotinamide (Nam), nicotinic acid (Na) and their ribosides (source figure from Revollo et al. [3]).

The two enzymes described above can be present in the same organism [40, 41], raising the question of which of them is expressed in these species.

5.3.3. MtDNA

MtDNA is a circular double-stranded molecule with a length of approximately 16,569 base pairs (bp) in humans and is normally present in all nucleated animal cells. It is contained in a double-membrane intracellular organelle (the mitochondrion) that is responsible for the energy-generating process of oxidative phosphorylation. The mtDNA usually encodes thirteen important polypeptides of the respiratory complexes (NADH - ubiquinone oxidoreductase: NADH1-NADH6 and NADH4L for complex I; Cyt b - ubiquinolcytochrome c oxidase reductase for complex III; CO - cytochrome c oxidase: COI-III for complex IV; ATP - adenosine triphosphate: ATPase6 and ATPase8 for complex V), two ribosomal ribonucleic acids (12S and 16S rRNA) and twenty-two transfer RNAs (tRNAs) [54]. This genome also has a non-coding region with regulatory functions, referred to in the literature as the control region [55]. Two hypervariable regions (HVRI and HVRII) can be identified in the control region.

The human mtDNA has a few unique characteristics, namely a) maternal inheritance [56-58], b) discrete origins of replication, c) intronless genes, d) absence of dispersed repeats, e) little intergenic DNA, f) polycistronic transcripts, g) a different genetic code and h) a high copy number per cell [55].

Previous studies have determined the importance of several non-B DNA conformations in the mtDNA [59].

5.3.4. Short Tandem Repeats Model: Features and Mutation Mechanism

Short tandem repeats (STRs) represent 3% of the human genome [23]. Most are located in non-coding regions. Because they were assumed to have no biological function, they have been classified as “junk DNA”. However, there are clues pointing to an influence of STRs on gene expression (e.g., [CA]n and [CT]n repeats near a gene), recombination and the maintenance of chromatin spatial organization [60, 61]. The mutation rate of these sequences is higher than that of unique DNA sequences (Table 1) [60].

Table 1: Mutation rates of unique DNA sequences and STR sequences (order of magnitude).

| Sequence type | Mutation rate, order of magnitude (nucleotides per generation) |
| --- | --- |
| Unique DNA sequences | 10^-9 |
| STR sequences | 10^-2 to 10^-6 |

Y-chromosome STRs are widely used as genetic markers (e.g., in population genetics, evolution and forensics) [62-67]. The Y chromosome is one of the two sex-determining chromosomes in most mammals and is a good model for studying the contrasts between non-coding and coding regions, since there is no recombination except in the pseudo-autosomal region. The Y-STRs used in our study were retrieved from the National Institute of Standards and Technology (NIST) [68] and are described in Table 2.

Table 2: Allele range, repeat motif, GenBank accession numbers and reference alleles of Y-STR loci. Repeat motif abbreviations A, T, G, C, W, Y, R, S correspond respectively to adenine, thymine, guanine, cytosine, weak (A or T), pyrimidine, purine, strong (G or C), following the International Union of Pure and Applied Chemistry (IUPAC).

Marker Name Allele Range* (repeat

numbers) Repeat Motif Accession GenBank Reference Allele

DYS19 10-19 TAGA AC017019 15

DYS385 a/b 7-28 GAAA AC022486 11

DYS389 I 9-17 (TCTG) (TCTA) (TCTG) (TCTA) AC004617 12

DYS389 II 24-34 (TCTG) (TCTA) (TCTG) (TCTA) AC004617 29

DYS390 17-28 (TCTA) (TCTG) AC011289 24

DYS391 6-14 TCTA AC011302 11

DYS392 6-17 TAT AC011745 13

DYS393 9-17 AGAT AC006152 12

YCAII a/b 11-25 CA AC015978 23

DYS388 10-18 ATT AC004810 12

DYS425 10-14 TGT AC095380 10

DYS426 10-12 GTT AC007034 12

DYS434 9-12 TAAT (CTAT) AC002992 10

DYS435 9-13 TGGA AC002992 9

DYS436 9-15 GTT AC005820 12

DYS437 13-17 TCTA AC002992 16

DYS438 6-14 TTTTC AC002531 10

DYS439 9-14 AGAT AC002992 13

DYS441 12-18 TTCC AC004474 14

DYS442 10-14 (TATC)2(TGTC)3(TATC)12 AC004810 17

DYS443 12-17 TTCC AC007274 13

DYS444 11-15 TAGA AC007043 14

DYS445 10-13 TTTA AC009233 12

DYS446 10-18 TCTCT AC006152 14

DYS447 22-29 TAAWA AC005820 23

DYS448 20-26 AGAGAT AC025227 22

DYS449 26-36 TTTC AC051663 29

DYS450 8-11 TTTTA AC051663 9

DYS452 27-33 YATAC AC010137 31

DYS453 9-13 AAAT AC006157 11

DYS454 10-12 AAAT AC025731 11

DYS455 8-12 AAAT AC012068 11

DYS456 13-18 AGAT AC010106 15

DYS458 13-20 GAAA AC010902 16

DYS459 a/b 7-10 TAAA AC010682 9

DYS460 (A7.1) 7-12 ATAG AC009235 10

DYS461 (A7.2) 8-14 (TAGA) CAGA AC009235 12

DYS462 8-14 TATG AC007244 11

DYS463 18-27 AARGG AC007275 24

DYS464 a/b/c/d 11-20 CCTT X17354 13

DYS481 20-30 CTT 22

DYS485 10-18 TTA 16

DYS490 TTA AC019058 12

DYS495 12-18 AAT AC004474 15

DYS497 13-16 TTA 14

DYS504 11-19 TCCT AC006157 18

DYS505 9-15 TCCT AC012078 12

DYS508 8-15 TATC AC006462 11

DYS520 18-26 ATAS AC007275 20

DYS522 8-17 GATA AC007247 10

DYS525 TAGA AC010104 10

DYS531 9-13 AAAT 11

DYS532 9-17 CTTT AC016991 14

DYS533 9-14 ATCT AC053516 12

DYS534 10-20 CTTT AC053516 15

DYS540 TTAT AC010135 12

DYS549 10-14 GATA AC010133 13

DYS556 AATA AC011745 11

DYS557 TTTC AC007876 16

DYS565 9-14 ATAA AC010726 12

DYS570 12-23 TTTC AC012068 17

DYS572 8-12 AAAT 10

DYS573 8-11 TTTA 10

DYS576 13-21 AAAG AC010104 17

DYS594 9-14 AAATA AC010137 10

DYS607 [GAAG]15[GAAA][GAAG][GAAA][GAAG] 19

DYS612 [CCT]5[CTT][TCT]4[CCT][TCT]25 AC006383 36

DYS626 AAAG 18

DYS632 CATT AC006371 9

DYS635 (C4) 17-27 TSTA compound AC004772 23

DYS641 TAAA AC018677 10

DYS643 7-15 CTTTT AC007007 11

Y-GATA-H4 8-13 (25-30) TAGA AC011751 12

Y-GATA-C4 20-25 TSTA compound G42673 21

Y-GATA-A10 13-18 (TCCA)2(TATC)13 AC011751 15

Figure 4: Schematic view of strand-slippage replication mechanism (figure from Jobling et al.[69]).

Several factors influence STR mutation in different ways, including repeat number, repeat base composition, repeat size, flanking sequence, recombination, and the sex and age of the individual [60].

Accurate STR replication depends on the diverse cellular machinery used during cell division, DNA repair and recombination. DNA polymerases are essential to maintain the integrity of the genome at different stages of cell development [70, 71].

One of the models used to explain the Y-STR mutation mechanism is the stepwise mutation model (SMM) [69, 72-78]. This model (Figure 4) assumes that (i) only small changes in repeat number occur (when each change is restricted to one repeat unit at a time, the model is called the single-step SMM), (ii) increases and decreases in repeat number are equally probable, (iii) allele size is unbounded, and (iv) the rate and size of mutations are independent of the repeat number [60, 69].
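The assumptions of the single-step SMM can be illustrated with a minimal simulation sketch. The function name, the mutation rate and the number of generations below are hypothetical choices made for illustration only; they are not values or code from this thesis.

```python
import random


def simulate_single_smm(start_allele, generations, mutation_rate, rng=None):
    """Follow one Y-STR lineage under the single-step stepwise mutation model.

    start_allele  -- initial repeat number (e.g., 15 for DYS19)
    generations   -- number of generations to simulate
    mutation_rate -- per-generation mutation probability (illustrative value)
    """
    rng = rng or random.Random()
    allele = start_allele
    history = [allele]
    for _ in range(generations):
        # The mutation probability does not depend on the current repeat number.
        if rng.random() < mutation_rate:
            # A gain and a loss of one repeat unit are equally likely.
            allele += rng.choice((-1, 1))
        history.append(allele)
    return history


if __name__ == "__main__":
    lineage = simulate_single_smm(15, 1000, 0.003, random.Random(42))
    print("final allele:", lineage[-1])
```

Because the model places no bound on allele size, a very long simulation can drift to biologically implausible repeat numbers; a bounded variant would have to be stated as a separate assumption.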

The biological mechanism that appears to be involved, and that can explain the observed STR mutation results, is strand-slippage replication [69, 79, 80] (Figure 3). This process occurs during replication. After a single-stranded DNA template is generated at the 'origin' points, which are recognized by proteins (helicases) that separate the two strands, folding of the template or of the copied strand can occur and give rise to a final DNA fragment that differs by one repeat unit in allele size (one-step mutation) [69, 79-84]. Differences in STR allele size are detected by polymerase chain reaction (PCR) followed by electrophoresis. The development of PCR has significantly improved the efficiency of laboratory diagnostic procedures by allowing the in vitro production of a large number of DNA copies (amplification) using a specific genomic region as template [85].

STRs have been characterized by different experimental approaches, such as nanoelectrospray mass spectrometry (nano-ESI-MS) and ion mobility spectrometry (IMS) [100], as well as by different in silico approaches [101-103]. STR repetitive motifs can interfere with basic molecular mechanisms such as DNA replication [70, 79, 104-110].

5.4. Non-B DNA Conformations Prediction

Primary nucleotide sequences are just the tip of the iceberg concerning the role of DNA in cellular processes [10, 11, 86-89]. Little attention has been given to levels of genetic information beyond the primary DNA sequence. It has been shown that non-B DNA conformations (any DNA conformation that is not the orthodox right-handed Watson-Crick B-form) can have important roles in DNA replication, transcription and recombination. The existence of conserved structural DNA stretches suggests that such local DNA conformations can be used to estimate phylogenetic relationships. Several methods have been proposed in the literature for phylogenetic inference from DNA primary sequences [90]. However, these methodologies usually rely on simple genetic distances [91] and/or models of nucleotide substitution [92], disregarding structural DNA information.

By studying evolutionary constraints in secondary and tertiary DNA structures, it is possible to glimpse how selective pressures modulate mutation patterns in DNA and their implications for understanding complex protein–nucleic acid interactions [87, 93]. The study of non-coding DNA structures is facilitated by the large number of nuclear and mitochondrial genomes now available for many species (genome database of the National Center for Biotechnology Information, NCBI, at www.ncbi.nlm.nih.gov/sites/entrez?db=genome). The quality of the data concerning non-coding regions is improving rapidly, allowing good predictions of structural DNA parameters [10, 86, 87, 93].

It has been shown that a main cause of mutagenic instability is the occurrence of non-B conformations stabilized by negative supercoiling [87]. Large genome rearrangements, deletions or structural polymorphic states could have relevant phenotypic consequences for the organism [86, 87, 94, 95]. For instance, slipped DNA structures play a prominent role in several hereditary neurological diseases (e.g., Friedreich's ataxia, Huntington disease or myotonic dystrophy) and in some mtDNA deletion syndromes [86, 87, 94-97]. It has already been described that DNA-binding proteins, phenotype-associated SNPs and predicted enhancers are functionally relevant [10, 98, 99].

There is a lack of studies incorporating structural mutagenic patterns in non-coding genomic regions. Databases such as the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank (PDB) and the Nucleic Acid Database (NDB) can be used to better understand DNA structural features in different genomic regions (coding versus non-coding). Specific non-B DNA conformations (hairpin, pseudoknot and cruciform) have been associated with errors occurring during replication [100].

5.4.1. Thermodynamics of DNA and UNAFold

The thermodynamic parameters of DNA have been studied for a long time [101-104]. Dynamic programming algorithms for DNA secondary structure prediction were used in this thesis. UNAFold [105] is a software package that predicts DNA secondary structures for Watson-Crick (W-C) pairings, wobble and non-canonical states under a variety of salt conditions, using empirical equations for the monovalent (sodium) and magnesium dependence of the thermodynamics. Nearest-neighbour energy rules for Watson-Crick base pairs, internal mismatches, terminal mismatches and dangling ends are used to calculate the predicted structures, based on a database of experimentally determined free energies that considers different motifs [106]. Energies are also assigned to loops (pseudoknots and base triplets are excluded): experimental energy values are assigned to internal, bulge and hairpin loops. A hairpin loop is an unpaired loop at the end of a structure that is closed by a paired double-helical stem of DNA. The complete UNAFold database of parameters for base pairs, mismatches, terminal dangling ends, terminal mismatches, coaxial stacking and a variety of loop motifs, including hairpins, bulges, internal loops and multibranched loops, was used. Methods for measuring the thermodynamic parameters have been reviewed elsewhere [102, 107-110]. The salt-dependence equations implemented in UNAFold allow accurate calculations of non-B DNA conformations (e.g., secondary structures) for different solution conditions, namely sodium and magnesium concentrations.
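To give an idea of how dynamic programming is applied to secondary structure prediction, the sketch below implements a simplified base-pair maximization recursion (Nussinov-style). It is not the energy-minimization algorithm of UNAFold: it uses no nearest-neighbour free energies, no loop penalties and no salt corrections, and the function name and the minimum hairpin loop size of three bases are assumptions made for illustration.

```python
def max_base_pairs(seq, min_loop=3):
    """Maximum number of nested Watson-Crick pairs in a single DNA strand.

    min_loop is the minimum number of unpaired bases enclosed by a pair
    (an assumed value; real folding programs use energy rules instead).
    """
    pairs = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G")}
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0
    # dp[i][j] holds the best score for the subsequence seq[i..j].
    dp = [[0] * n for _ in range(n)]
    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            best = dp[i][j - 1]  # case 1: base j is left unpaired
            # case 2: base j pairs with some base k, leaving a loop > min_loop
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in pairs:
                    left = dp[i][k - 1] if k > i else 0
                    best = max(best, left + 1 + dp[k + 1][j - 1])
            dp[i][j] = best
    return dp[0][n - 1]


if __name__ == "__main__":
    # A small hairpin: a three-pair GC stem closing a four-base loop.
    print(max_base_pairs("GGGAAATCCC"))
```

Replacing the "+ 1" score by stacking and loop free energies is what turns this counting scheme into an energy-based prediction of the kind described above.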

5.4.2. AmberTools: Molecular Dynamics of Nucleic Acids, NAB and MMPBSA.

AmberTools [111, 112] is a set of tools that performs different calculations on three-dimensional (3D) models of proteins or DNA (e.g., building, solvating and neutralizing molecular systems; root-mean-square deviation, end-to-end distance and hydrogen bond (H-bond) calculations; molecular mechanics Poisson-Boltzmann solvent accessible surface area calculations) and reads the molecular simulation data produced by Amber [111, 112] molecular dynamics calculations.

NAB [113, 114] is a programming language designed to generate models of "unusual" DNA and RNA, such as those predicted by UNAFold. DNA has an almost infinite number of possible conformations, with a repeat unit (sugar) that contains seven rotatable bonds (flexible backbone) and a rigid planar base (nucleotide base) [112]. These features hinder the accurate prediction of non-canonical structures (e.g., secondary structures) using refinement methods such as molecular mechanics, since there are no 3D structures with high homology to our predicted models. In this high-level programming language, residues, strands and molecules can be treated as objects, and several routines can be performed over these objects. Manipulation of axis systems, including rotations and translations, can be implemented with NAB. Different types of models can be built using distance geometry methods, with additional coordinate manipulation for specific constrained systems. Molecular dynamics simulations can be easily implemented using ab initio models with the AMBER [111, 112] force field.

The prediction of free energy differences directly linked to conformational equilibria is usually vital to understand the molecular basis of crucial biological functions [115]. The different conformations of DNA that result from the properties of the phosphodiester backbone and the nucleic base pairs can be analysed with computational methods, which are also able to determine the associated free energies. Methods such as molecular mechanics Poisson-Boltzmann solvent accessible surface area (MMPBSA) can therefore be used to determine the free energy of the individual end-points of each DNA molecular system. The entropy and the enthalpy [Generalized Born (GB) and Poisson-Boltzmann (PB)] [116, 117] terms must be determined to calculate the contributions to the free energy. Typical contributions to the free energy include the internal energy (bond, dihedral and angle), the electrostatic and the van der Waals interactions, the free energy of polar solvation, the free energy of nonpolar solvation, and the entropic contribution (TS):

$$G_{\text{molecule}} = E_{\text{internal}} + E_{\text{electrostatic}} + E_{\text{vdW}} + G_{\text{polar solvation}} + G_{\text{non-polar solvation}} - TS \qquad (1)$$

For the calculation of relative free energies between closely related complexes, it is assumed that the total entropic term in equation 1 is negligible, as the partial contributions essentially cancel each other [118]. The first three terms of equation 1 can be calculated with no cut-off. The nonpolar contribution to the solvation free energy, due to van der Waals interactions between the solute and the solvent, is usually modelled as a term dependent on the solvent accessible surface area (SASA) of the molecule [119].
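Equation 1 can be written as a short sketch that sums the individual contributions and models the nonpolar term as a linear function of the SASA. The linear form and the default coefficients `gamma` and `beta` are assumptions used here only as placeholders; the appropriate values depend on the solvation model and must be taken from the documentation of the program used. The function names are hypothetical.

```python
def nonpolar_solvation(sasa, gamma=0.00542, beta=0.92):
    """Nonpolar solvation free energy as a linear function of SASA.

    sasa  -- solvent accessible surface area (A^2)
    gamma -- surface tension coefficient (placeholder value, kcal/mol/A^2)
    beta  -- offset (placeholder value, kcal/mol)
    """
    return gamma * sasa + beta


def molecule_free_energy(e_internal, e_electrostatic, e_vdw,
                         g_polar, g_nonpolar, ts=0.0):
    """Sum the terms of equation 1; ts is the entropic contribution T*S.

    ts defaults to zero, mirroring the assumption that the entropic term
    cancels when closely related systems are compared.
    """
    return e_internal + e_electrostatic + e_vdw + g_polar + g_nonpolar - ts


if __name__ == "__main__":
    # Made-up energy components (kcal/mol) for two conformations.
    g_a = molecule_free_energy(10.0, -20.0, -5.0, -30.0,
                               nonpolar_solvation(1000.0))
    g_b = molecule_free_energy(11.0, -22.0, -4.0, -29.0,
                               nonpolar_solvation(980.0))
    print("relative free energy:", g_b - g_a)
```

The numbers in the usage example are invented to show the call pattern and carry no physical meaning.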

5.5. Python Programming in DNA Analysis

A wide range of computer programs is now available to deal with the huge amount of genetic information generated in thousands of laboratories around the world. The appropriate choice of a program for a given task depends both on the data and on the goals of the experiment. For instance, many open and closed source programs are available to make phylogenetic and evolutionary inferences from genetic data [120, 121].

The first step in building a computer program for the management and analysis of genetic information is to choose an appropriate programming language (e.g., Python, Java, Perl or C++). Another important aspect of a software development effort is the user interface. To build an easy-to-use program based on point-and-click actions over window buttons, an appropriate graphical user interface (GUI) must be developed [122]. Conversely, GUI development is not needed if the main users of the program are familiar with issuing commands at the operating system (OS) console prompt. The most important issues for the core of a program are therefore the data to be analysed (input data), the implementation of the algorithms that perform the calculations or simulations over the data, and the format of the final results (output data). Different file formats are normally used to store molecular data such as DNA or protein sequences, namely Phylip [123], FASTA, MEGA [124], NEXUS, GenBank and Protein Data Bank (PDB). These formats are commonly used as input or output formats in several programs (e.g., Phylip, MEGA, PAML [125], MrBayes [126], DnaSP, BioEdit [127]) and standard molecular databases (e.g., GenBank, FASTA, eXtensible Markup Language, XML). Conversion between file formats is usually possible and extremely useful if the user wants to carry out different kinds of analyses over the data.
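As an illustration of format conversion, the following sketch reads aligned sequences in FASTA format and writes them in a sequential PHYLIP-like layout, using only the Python standard library. The function names are hypothetical, and the ten-character name field is an assumption based on the strict PHYLIP convention; dedicated libraries handle many more format variants.

```python
def parse_fasta(text):
    """Return a list of (name, sequence) tuples from FASTA-formatted text."""
    records = []
    name, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                records.append((name, "".join(chunks)))
            header = line[1:].split()
            # Keep only the first word of the header as the sequence name.
            name, chunks = (header[0] if header else ""), []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        records.append((name, "".join(chunks)))
    return records


def to_phylip(records):
    """Write aligned records in sequential PHYLIP-like format."""
    lengths = {len(seq) for _, seq in records}
    if len(lengths) != 1:
        raise ValueError("PHYLIP output needs aligned sequences of equal length")
    header = " {} {}".format(len(records), lengths.pop())
    # Names are truncated or padded to ten characters (assumed convention).
    rows = ["{:<10}{}".format(name[:10], seq) for name, seq in records]
    return "\n".join([header] + rows)


if __name__ == "__main__":
    fasta = ">seq1 example\nACGT\nAC\n>seq2\nAC-TAC\n"
    print(to_phylip(parse_fasta(fasta)))
```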

5.5.1. Python

Python (freely available at www.python.org) is an object-oriented language created by Guido van Rossum [128] that has gained attention in recent years. Like other high-level programming languages, its source code must be processed by an interpreter before it can be executed. Although slower than low-level programming languages, Python has some important advantages: a) reduced programming time, b) shorter and easier-to-read source code, c) high productivity and d) multiplatform capability (Windows, Linux and Mac). Different third-party Python modules can be installed for a large variety of tasks, including molecular data handling and analysis: BioPython [129], PyCogent [130], Matplotlib, GenomeDiagram [131], NetworkX, py2exe, NumPy, Psyco, SciPy and wxPython (Table 3). These packages share the common terminology of the Python language, although each provides specific modules and built-in functions. Extensive documentation comes as part of the Python distribution [128] and can also be found in dedicated books and articles [132, 133].

Table 3: Relevant Python modules in molecular data analyses.

| Module | Functionality | Requirements | Documentation | Major flaws |
|---|---|---|---|---|
| BioPython | Parse bioinformatic files into Python for several formats; management and manipulation of genetic and protein data; code to perform searches in common on-line bioinformatics databases (e.g., NCBI) | Python 2.3 or later; Numerical Python | Good and well written documentation | Some bugs resulting from poor maintenance of some functions |
| PyCogent | Same as BioPython | Python 2.4 or later | Good and well written documentation | Some bugs |
| GenomeDiagram | Graphic representation of genomes and DNA sequences | Python 2.4 or later | Good documentation | - |
| Pythia | Thermodynamic calculations | Python 2.4 or later | Bad documentation | Some bugs |
| Matplotlib | Plot and save graphics in different formats; handle geographic maps | Python 2.4 or later; Numpy 1.1; Libpng 1.1; Freetype 1.4; Basemap 0.99.2 | Extensive documentation; very good examples | Some problems with integration with other major modules, namely wxPython |
| NetworkX | Construct phylogenetic relationships through network design and visualization | Python 2.4 or later | Insufficient documentation; lack of good examples | Bugs and lack of flexibility related to visualization and drawing |
| NumPy | N-dimensional array object; linear algebra functions; basic Fourier transforms | Python 2.4 or later | Nice and exhaustive documentation | - |
| Psyco | Speed up the execution of any Python code | Python 2.4 or later | Bad documentation | Maintenance and updates very limited |
| Py2exe | Converts Python scripts into executable Windows programs able to run without requiring a Python installation | Python 2.3 or later | Good documentation | Poor stability of executables |
| SciPy | Language extension that uses NumPy to do advanced math, signal processing, optimization and statistics | Python 2.4 or later | Very well organized documentation; a lot of cookbook examples | - |
| wxPython | Allows easy creation of robust, highly functional graphical user interfaces | Python 2.3 or later | Good wxPython reference documentation; wxPython demo with examples for the code | Slow performance; does not include a rapid application development (RAD) tool |

5.5.2. Python Language and WxPython: Code and Common Terminology

As an object-oriented language, the functionality of Python is based on objects. Objects can be primitive data (integer, float, Boolean and complex), collection data (string, list, tuple, set, dictionary) or even more complex data structures (e.g., SQL databases) [128]. In Python, almost everything is an object. Normally, a routine process in Python is packaged in a module (a Python file, or files, saved in plain text with the extension *.py) that contains executable statements as well as definitions of functions, classes and methods. These Python files are normally edited in an integrated development environment (IDE), such as VisualWX (http://visualwx.altervista.org/), Eclipse (http://www.eclipse.org/) or NetBeans (http://www.netbeans.org/). IDEs allow common statements and built-in functions to be easily identified in the code. Another relevant feature of Python is its mandatory indentation, which makes the code easier to read and write.

wxPython is a Python interface to the C++ toolkit wxWidgets. Cross-platform applications can be created with the functionality of the C++ widgets and the simplicity of the Python language [122]. The range of possibilities for building a GUI can be increased by additional widgets (Table 4) that are written directly in wxPython.

Figure 5: Basic wxPython application structure (adapted from wxPython in Action [122]).

Table 4: Common widgets and dialogs implemented in wxPython.

| Name | Features | Type |
|---|---|---|
| wx.Window | wxWindow is the base class for all windows and represents any visible object on screen. It includes controls and top-level windows. | Frame |
| wx.FlexGridSizer | Lays out its children in a two-dimensional table | Sizer |
| wx.StaticBoxSizer | Rectangle drawn around other panel items to denote a logical grouping of items | Sizer |
| wx.Button | Control that contains a text string | Widget |
| wx.ComboBox | Displays a static list with an editable or read-only text field; or a drop-down list with a text field; or a drop-down list without a text field | Widget |
| wx.Grid | wxGrid and its related classes are used for displaying and editing tabular data. They provide a rich set of features for display, editing, and interacting with a variety of data sources, namely genetic data. | Widget |
| wx.Notebook | Manages multiple windows with associated tabs | Widget |
| wx.StaticText | Displays one or more lines of read-only text | Widget |
| wx.TextCtrl | A text control allows text to be displayed and edited; it may be single-line or multi-line | Widget |
| wx.FileDialog | File chooser dialog | Dialog |
| wx.MessageDialog | Dialog that shows a single or multi-line message, with a choice of OK, Yes, No and Cancel buttons | Dialog |
| wx.ProgressDialog | Dialog that shows a short message and a progress bar | Dialog |
| wx.SingleChoiceDialog | Shows a list of strings and allows the user to select one | Dialog |

5.5.3. BioPython, PyCogent, GenomeDiagram, and Pythia

As described before, Python implements the routine operations in the core of the software, and wxPython provides the tools for GUI construction. A set of packages that can easily deal with the most common operations required in molecular data manipulation is then necessary. BioPython is a package consisting of a set of modules to read and manipulate molecular data (DNA and proteins). The most relevant functionalities of BioPython for computational molecular biology are: a) the capacity to parse bioinformatic files from several formats into Python (BLAST output, ClustalW, FASTA, GenBank, PubMed and Medline, ExPASy, SCOP, UniGene, SwissProt); b) code to perform searches in common on-line bioinformatics destinations (NCBI, ExPASy); c) the easy management of sequence features (sequence translation, transcription, weight calculation, alignments); and d) the easy integration with BioPerl and BioJava modules through BioCorba [129]. To use DNA and protein sequences as input data, it is not necessary to write parsing code, since BioPython already provides the SeqIO system, which works with SeqRecord objects to manipulate these data and is normally very fast at reading and manipulating sequences. BioPython is installed as the package 'Bio': it is loaded with 'import Bio' at the beginning of a Python file, and specific modules are imported with 'from Bio import <module>' (e.g., 'from Bio import SeqIO'). The BioPython documentation has many useful examples for all functions. PyCogent [130] is a Python module that can perform all the functions implemented in BioPython but is focused on genomic biology. GenomeDiagram [131] produces graphic representations of genomic data. Pythia (http://sourceforge.net/projects/pythia/) includes modules that calculate DNA binding and folding energies of specific DNA sequences.

5.5.4. Matplotlib, NumPy and SciPy

Mathematical routines can be implemented during software development with a great number of external Python modules, the most commonly used being Matplotlib, NumPy and SciPy. With these modules it is possible to call mathematical functions, represent data graphically, iterate over different numerical data and perform statistical analyses. Global statistics of a sequence or alignment, namely the proportion of each nucleotide and the GC content, can be calculated with these modules.
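The global statistics mentioned above can be sketched in a few lines. For simplicity, and so that it runs without third-party packages, this example uses only the Python standard library rather than NumPy or SciPy; the function names are hypothetical, and the decision to ignore gaps and ambiguous symbols is an assumption.

```python
from collections import Counter


def nucleotide_proportions(sequence):
    """Proportion of A, C, G and T, ignoring gaps and ambiguous symbols."""
    bases = [b for b in sequence.upper() if b in "ACGT"]
    total = len(bases)
    if total == 0:
        return {b: 0.0 for b in "ACGT"}
    counts = Counter(bases)
    return {b: counts[b] / total for b in "ACGT"}


def gc_content(sequence):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    proportions = nucleotide_proportions(sequence)
    return proportions["G"] + proportions["C"]


if __name__ == "__main__":
    print(nucleotide_proportions("ATGC--ATNN"))
    print(gc_content("ATGC--ATNN"))
```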

5.5.5. SPInDel Workbench

The SPInDel workbench is a computational platform developed in the object-oriented Python language using BioPython (http://biopython.org/), SciPy (http://www.scipy.org/), GenomeDiagram (http://bioinf.scri.ac.uk/lp/programs.php), Matplotlib (http://matplotlib.sourceforge.net/), NumPy (http://numpy.scipy.org/) and PyCogent (http://pycogent.sourceforge.net/). It can import alignments of specific targeted genome regions (e.g., ribosomal RNA gene regions), showing regions of nucleotide conservation and variation. The variation introduces gaps (-) in the alignment that can be used as a source of information to characterize and classify different species. The classification of each species is based on the different lengths of the sequence in the alignment that result from insertion/deletion (indel) events.

Theoretically, all eukaryotic species on Earth (5-15 million) can be discriminated using 6 hypervariable regions with 20 alleles each, since these yield 20^6 = 64 million possible combinations. The SPInDel analysis was based on ribosomal RNA gene regions, but the analysis of other regions with the same pattern of sequence evolution (e.g., non-coding regions) is also possible. Different statistical approaches were implemented in this multi-platform software (Windows, Linux, or compilation in other operating systems using the SPInDel Workbench source code) by using Python algorithms and modules. The application of this software can be extended to other fields where the identification of species is relevant (e.g., ecology, forensics).
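The idea of classifying species by the lengths of hypervariable regions can be sketched as follows. This is not the SPInDel workbench code: the function names and the toy alignment are hypothetical, and the region boundaries are assumed to be given as half-open column ranges of the alignment.

```python
def region_lengths(aligned_seq, regions):
    """Ungapped length of each region of one aligned sequence.

    regions -- list of (start, end) half-open column ranges (assumed input)
    """
    return tuple(sum(1 for c in aligned_seq[start:end] if c != "-")
                 for start, end in regions)


def max_profiles(n_regions, alleles_per_region):
    """Number of distinct length profiles that can theoretically be formed."""
    return alleles_per_region ** n_regions


def profiles_by_species(alignment, regions):
    """Map each species name to its length profile."""
    return {name: region_lengths(seq, regions)
            for name, seq in alignment.items()}


if __name__ == "__main__":
    toy_alignment = {"species_A": "AC--GTTA-C", "species_B": "ACGTGTT--C"}
    print(profiles_by_species(toy_alignment, [(0, 5), (5, 10)]))
    print(max_profiles(6, 20))
```

Two species are distinguishable in this scheme whenever their profiles differ in at least one region.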

5.5.6. NABpy

NABpy (NAB Python implementation) is a Python module that automates all the processes related to building initial protein and DNA three-dimensional molecular systems (in vacuum or with explicit solvation) using Matplotlib, BioPython, PyCogent, UNAFold and AmberTools. The following functions are implemented in the module:

• Create protein and DNA molecular systems taking into consideration specific unconstrained and constrained models.

• Solvate systems with explicit water (WAT) and neutralize them with sodium ions (Na+).

• Generate input files to run AMBER [112] molecular dynamics (MD) simulations, including the *.prmtop (topology file) and *.mdcrd (simulation parameters file).

• Calculate H-bonds along trajectories and calculate parameters for DNA base-stacking.

• Calculate the end-to-end distances of all atoms and backbone atoms.

• Run structural analysis of DNA molecular systems using Curves+ [134] and 3DNA [135-137] (helical and backbone parameters).

• Calculate free energy parameters using MMPBSA [112] and Delphi [138].
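One of the listed analyses, the end-to-end distance, can be illustrated with a minimal sketch. This is not the NABpy implementation: the function names are hypothetical, and each frame is assumed to be a list of (x, y, z) atom coordinates in chain order, already extracted from a trajectory. It requires Python 3.8 or later for `math.dist`.

```python
import math


def end_to_end_distance(frame):
    """Euclidean distance between the first and last atoms of one frame.

    frame -- list of (x, y, z) coordinates in chain order (assumed input)
    """
    return math.dist(frame[0], frame[-1])


def end_to_end_series(trajectory):
    """End-to-end distance for every frame of a trajectory."""
    return [end_to_end_distance(frame) for frame in trajectory]


if __name__ == "__main__":
    # Two made-up frames with three atoms each.
    trajectory = [
        [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (3.0, 4.0, 0.0)],
        [(0.0, 0.0, 0.0), (1.0, 0.5, 1.0), (6.0, 8.0, 0.0)],
    ]
    print(end_to_end_series(trajectory))
```

Restricting the input to backbone atoms gives the backbone end-to-end distance with the same function.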

6. Research Questions and Objectives

The main objective of this thesis is to study non-coding DNA regions in comparison with the already well-studied protein-coding regions and thus to infer which biological processes occur in non-coding genomic tracts of living cells. Using four different research models, the specific objectives of this work were:

• Analyse the duplicated genes of the MT cluster considering their coding/non-coding status (model a).

• Study the NAMPT and PNC genes and the respective functional proteins involved in NAD pathways, using different model species (model b).

• Assess the current functional status of NAMPT and PNC homologous genes, using a computational methodology with model organisms (model b).

• Perform protein-ligand docking using homology-modelled structures of NAMPT and PNC (model b).

• Detect and identify conserved structural patterns (non-B DNA conformations) in non-coding DNA regions of mammalian mitochondrial (model c) and nuclear genomes (model d).

• Evaluate the association between conserved non-B DNA conformations and specific types of non-coding regions, such as mitochondrial control regions (model c) and STRs (model d).

• Measure the degree of randomness of structural conservation across genomes using statistical methodologies to validate the identified structures. Infer evolutionary constraints and mutagenic patterns in the identified structures (model c, model d).

• Identify secondary structures in mtDNA and their role in different biological processes (model c).

• Determine the structural features of different regions in mtDNA (coding and non-coding), and ascertain how these non-B DNA conformations might influence genetic disorders, replication and transcription (model c).

• Implement a 3D structural analysis of DNA using Python algorithms, UNAFold and a non-B DNA conformations database (model d).

• Perform a structural analysis of DNA using previously described computational methodologies, such as molecular dynamics (model d).

• Correlate the size, localization and physical parameters of the predicted structures with specific genomic features: replication origins, transcription and mutagenic instability (model d).

• Design software able to use different regions of the genome (e.g., non-coding regions) in order to identify taxonomic groups at various levels (SPInDel workbench).

• Design software (NABpy) to automate DNA molecular dynamics simulations and MMPBSA free energy analysis.

The next chapters present the work carried out to address these research questions and objectives.

7. Publication I: Gains, Losses and Changes of Function after Gene Duplication: Study of the Metallothionein Family

Gains, Losses and Changes of Function after Gene Duplication: Study of the Metallothionein Family

Ana Moleirinho1, João Carneiro1,2, Rune Matthiesen1, Raquel M. Silva1, António Amorim1,2, Luísa Azevedo1*

1 IPATIMUP - Institute of Molecular Pathology and Immunology of the University of Porto, Porto, Portugal

2 Faculty of Sciences of the University of Porto, Porto, Portugal

*E-mail: [email protected]

Abstract

Metallothioneins (MT) are small proteins involved in heavy metal detoxification and protection against oxidative stress and cancer. The mammalian MT family originated through a series of duplication events which generated four major genes (MT1 to MT4). MT1 and MT2 encode for ubiquitous proteins, while MT3 and MT4 evolved to accomplish specific roles in brain and epithelium, respectively. Herein, phylogenetic, transcriptional and polymorphic analyses are carried out to expose gains, losses and diversification of functions that characterize the evolutionary history of the MT family. The phylogenetic analyses show that all four major genes originated through a single duplication event prior to the radiation of mammals. Further expansion of the MT1 gene has occurred in the primate lineage reaching in humans a total of 13 paralogs, five of which are pseudogenes. In humans, the reading frame of all five MT1 pseudogenes is reconstructed by sequence homology with a functional duplicate revealing that loss of invariant cysteines is the most frequent event accounting for pseudogenisation. Expression analyses based on EST counts and RT-PCR experiments show that, as for MT1 and MT2, human MT3 is also ubiquitously expressed while MT4 transcripts are present in brain, testes, esophagus and mainly in thymus. Polymorphic variation reveals two deleterious mutations (Cys30Tyr and Arg31Trp) in MT4 with frequencies reaching about 30% in African and Asian populations suggesting the gene is inactive in some individuals and physiological compensation for its loss must arise from a functional equivalent. Altogether our findings provide novel data on the evolution and diversification of MT gene duplicates, a valuable resource for understanding the vast set of biological processes in which these proteins are involved.

PLoS ONE 6(4): e18487. doi:10.1371/journal.pone.0018487

Editor: Vincent Laudet, Ecole Normale Supérieure de Lyon, France

Introduction

When a particular gene is constrained to a specific function, the appearance of biological novelty demands genetic redundancy. Duplication of pre-existing genes may lead to the establishment of lineage-specific traits and to the development of novel biological functions [1,2,3,4,5]. However, the probability of widening biological functions (neofunctionalisation) is expectedly lower than the chance of inactivation (pseudogenisation) [6,7,8] as most amino acid replacements are more likely neutral or deleterious, rather than leading to any particular adaptive change. Although the majority of gene duplicates result in pseudogenes, many remain functionally active longer than it would be expected by chance. This observation led to the development of the subfunctionalisation model [9,10], according to which the accumulation of complementary loss-of-function mutations within regulatory segments of both members would facilitate their preservation while maintaining the original function. In case of preserving the parental function, duplicates may act as backup compensation copies to buffer against the loss of a functionally related gene [11,12].

The current availability of several genome sequences allows the study of the evolutionary steps underlying the expansion of a gene family by detailed characterisation of lineage-specific expansions. MTs are metal-binding proteins involved in homeostasis and the transport of essential metals, more specifically, in protecting cells against heavy metals toxicity [13,14], having thus a critical role in many biological processes. In mammals, four tandemly clustered genes (MT1 to MT4) are known. Although all genes encode for conserved peptide chains that retain 20 invariant metal-binding cysteines, MT3 and MT4 seem to have developed additional properties relatively to MT1 and MT2, such as protection against brain injuries [15,16] and epithelial differentiation [17], respectively. Finally, during the evolution of the lineage that led to modern humans, MT1 has undergone further duplication events that have resulted in 13 younger duplicate isoforms [18]. The co-existence of younger and older duplicates is thus an opportunity to reconstruct the evolutionary history behind the divergence of the MT family in mammals.

Materials and Methods

Phylogenetic analyses

Coding sequences annotated as orthologues of the human MT genes were extracted from the Ensembl database (www.ensembl.org, release 56: Sep 2009) [19]. The final set of sequences (Table S1) does not include shortened sequences and those annotated in non-human species as representing the orthologue of distinct human MT1 genes. Codon sequences were aligned using MUSCLE [20,21] incorporated in Geneious software v5.1.3 (http://www.geneious.com). Coding MT sequences from four fish species (Danio rerio, Oryzias latipes, Tetraodon nigroviridis and Takifugu rubripes), two birds (Gallus gallus and Taeniopygia guttata) and a reptile (Anolis carolinensis) were used to outgroup the phylogeny. Two methods were used to reconstruct the tree topology: maximum likelihood (ML) and Bayesian. In both cases, the model of nucleotide substitution used was HKY+G as determined in jModelTest [22]. The program BEAST [23] was used to estimate the Bayesian phylogeny in two runs (50 million generations each) using a Bioportal at the University of Oslo (http://www.bioportal.uio.no). The resulting log file was analyzed in Tracer [24]. The tree was obtained in TreeAnnotator from the BEAST software using a threshold for clade credibility of 0.5. For all the statistics obtained, the effective sample size (ESS) was always within the recommended threshold. The ML topology (Figure S1) was obtained with PHYML (http://www.bioportal.uio.no) [25] using the transition/transversion ratio, the proportion of invariable sites and the gamma parameter estimated by the program. Bootstrap branch support was estimated using 1000 data sets.

Tree visualization and final edition were performed in FigTree v1.3.1 (http://tree.bio.ed.ac.uk/software/figtree).
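The bootstrap data sets mentioned above are generated by resampling alignment columns with replacement. The sketch below illustrates only this resampling step, under the assumption that the alignment is held as a dictionary of equal-length strings; it is not the PHYML implementation, and the function name is hypothetical. Each pseudo-replicate would then be passed to a tree-building program.

```python
import random


def bootstrap_alignment(alignment, rng=None):
    """Return one pseudo-replicate with columns resampled with replacement.

    alignment -- dict mapping sequence names to aligned strings of equal
                 length (assumed input format)
    """
    rng = rng or random.Random()
    n_cols = len(next(iter(alignment.values())))
    # The same sampled columns are used for every sequence, so that the
    # homology of positions across sequences is preserved.
    columns = [rng.randrange(n_cols) for _ in range(n_cols)]
    return {name: "".join(seq[c] for c in columns)
            for name, seq in alignment.items()}


if __name__ == "__main__":
    toy = {"human": "ATGGAC", "mouse": "ATGGAT", "chick": "ATGAAC"}
    replicates = [bootstrap_alignment(toy, random.Random(i)) for i in range(3)]
    for rep in replicates:
        print(rep)
```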

Organization of the human and mouse MT family

The chromosomal organization of the MT family and flanking neighbours (BBS2 and NUP93) in humans and mice was performed using NCBI (Homo sapiens build 36.2 and Mus
