Deciphering the transcriptomics of the Conus species’ natural venoms

(1)

I

Deciphering the

transcriptomics of the Conus

species’ natural venoms

José Manuel Ramos Morim

Mestrado em Aplicações em Biotecnologia e Biologia Sintética

Departamento de Química e Bioquímica e Departamento de Biologia 2021-2022

Orientador

Prof. Agostinho Antunes, FCUP

(2)

Todas as correções determinadas pelo júri, e só essas, foram efetuadas.

O Presidente do Júri,

Porto, ______/______/_________

(3)

“Farà male il dubio di non essere nessuno, sarai qualcuno se resterai diverso dagli altri.”

– Måneskin

(4)

Sworn Statement

I, José Manuel Ramos Morim, enrolled in the Master Degree Applications in Biotechnology and Synthetic Biology at the Faculty of Sciences of the University of Porto hereby declare, in accordance with the provisions of paragraph a) of Article 14 of the Code of Ethical Conduct of the University of Porto, that the content of this dissertation perspectives, research work and my own interpretations at the time of its submission.

By submitting this dissertation, I also declare that it contains the results of my own research work and contributions that have not been previously submitted to this or any other institution.

I further declare that all references to other authors fully comply with the rules of attribution and are referenced in the text by citation and identified in the bibliographic references section. This dissertation does not include any content whose reproduction is protected by copyright laws.

I am aware that the practice of plagiarism and self-plagiarism constitute a form of academic offense.

José Manuel Ramos Morim 29/09/2022

(5)

Acknowledgments

My effort for this Master’s degree began well before even entering it. Every day I remember the strife to get the grades back in the bachelors, the endeavour of being accepted for matriculation and the amazing feeling of accomplishment after I succeeded. I recall all the labour for the completion of every task, of every exam, of every objective in it. From the very first day I felt challenged. Often, I was pushed to my limit. After all my work, I finally pose to end this chapter. Completing this Master’s program will be the greatest academic achievement of my life up until this point.

And I did not reach this point alone. Along my path, there were those who stood by my side when it most mattered, did not leave then and are here now. They supported me, guided me, helped me throughout the way. I do not forget it, will not forget. Anything recorded by writing is permanent, immutable, and therefore immortal. In this way I want to dedicate my achievements to the people I most profoundly trust and admire.

First, I would like to thank Professor Agostinho Antunes for accepting me in this project, giving me the opportunity to embrace this dissertation in a theme I was fascinated about. In a more human aspect, I want to thank for the sympathy and concern the Professor always demonstrated towards me and for the focus in directing my work.

Next, I want to thank my laboratorial colleague and friend, Yihe Zhao. Nearly everything I learned or did for this dissertation was as a member of a team I formed with him leading. Thank you for always helping, tutoring, and directing me to where it was best.

To Professor Paula Gomes, I want to thank for the opportunity of entering the Masters, as without her advice I would not have made it.

To my father, my mother, my sister and my brother I want to thank for always being there as a true family, supporting and backing my efforts, educating me as a person with the values I will forever hold and be true to.

To my three truest friends, I want to dedicate a very special thank you. In everything we have been and will continue to go through together, I will always have you as true brothers.

To my dearest, truest love, Sofia, in my life you represent love, passion, virtue, justice and balance. You have rendered my life more beautiful than I ever imagined it could be. Thank you for everything.

At last, I would also like to express an important acknowledgement to everyone from FCUP for helping me whenever I needed and supporting the project. This work was supported in part by the Portuguese Foundation for Science and Technology (FCT) project PTDC/CTA-AMB/31774/2017 (POCI-01-0145-FEDER/031774/2017).

(6)

Resumo

Os gastrópodes Conus são espécies icónicas e conhecidas há séculos principalmente pelas suas bonitas conchas cheias de padrões coloridos, as quais são muito procuradas para troféus e efeitos de decoração. Desde tempos ancestrais, as conchas, que são prezados objetos de joalharia, são também um autêntico tesouro para muitos colecionadores, para os quais ainda existe um grande mercado. Apenas recentemente especial atenção foi dedicada às capacidades do veneno destas espécies. Além de terem conchas muito belas estes gastrópodes marinhos produzem um veneno predatório com imensos compostos complexos, cada qual com extrema afinidade para um grupo específico de recetores. Estes compostos atuam de formas muito diversas, desde alívio de dor a paralisia, sendo, portanto, muito relevantes para várias áreas de aplicação biomédica se mais estudos lhes fossem dedicados. Nesse sentido, o objetivo desta dissertação é contribuir para decifrar a diversificação molecular do veneno predatório natural destes gastrópodes marinhos ao estudar todos os transcriptomas disponíveis das espécies do género Conus utilizando uma abordagem de genómica comparativa e bioinformática.

Nesta dissertação, foi desenvolvida uma metodologia para descodificar o contexto genómico e as relações entre as proteínas do veneno destes predadores marinhos.

Após a recolha e estudo de toda a informação transcriptómica fidedigna disponível, a montagem dos transcriptomas e a anotação funcional das proteínas revelaram a função dos genes expressos. Análises e comparações baseadas nos resultados obtidos permitiram retirar várias conclusões. Resumidamente, neste trabalho é reportado (1) o número e função dos genes partilhados apenas expressos na glândula do veneno das 20 espécies de Conus analisadas, (2) uma relação entre o tamanho da montagem do transcriptoma e o número de genes únicos encontrados, (3) evidência de relações simbióticas de microrganismos na glândula venenosa e (4) uma congruência com estudos prévios relativamente aos níveis detetados de duplicação no transcriptoma.

Adicionalmente, a crescente evidência científica a apontar para um papel central por parte de nAChRs na infeção do vírus SARS-CoV-2 deu azo à ideia de que conotoxinas podem ter um papel no tratamento da doença Covid-19. Por conseguinte, neste trabalho também se fez uma tentativa preliminar de anotar todas as sequências provenientes dos transcriptomas de espécies Conus e do SARS-Cov-2 que revelassem algum grau de correspondência entre si, mas a evidência de relação genómica está ainda por ser encontrada.

(7)

Palavras-chave: Conus, transcriptómica, glândula venenosa, GO IDs partilhados, simbiose, Covid-19.

(8)

Abstract

Cone snails are iconic species known for centuries for their beautifully coloured pattern conical seashells, which were and still are very sought after for trophies and decorative assets. Since ancient times, the shells which are prized objects used for jewellery, are also a treasure for many collectors, for who there is still a thriving exchanging market. It was only recently that special attention was diverted to the capacities of these species’

venom. As it turns out, other than having good-looking seashells, these snails produce a predatory venom with many complex compounds, each with incredible affinity for a certain class of receptors. These compounds act in a variety of ways from pain relieving to paralysation, thus being highly relevant for a broad field of biomedical applications if more effort is directed to their studies. With such goal in mind, this dissertation aims to contribute .to deciphering the molecular diversification of natural predatory venoms from these marine gastropods by studying available transcriptomes of Conus genus’ species using comparative genomics and bioinformatics assessments.

For this dissertation a methodology was developed to decode the genomic background as well as the relationships among the proteins of these marine predators’ venoms.

After gathering and polishing all trustworthy transcriptomic data at disposal, transcriptome assembly and functional annotation of proteins revealed the functions of the expressed genes. Multiple-step analysis and comparisons based on the results obtained enabled the weaving of several conclusions. In short, in this work it is reported (1) the number and functions of the shared genes found to be uniquely expressed in the venom glands of all 20 Conus species analysed, (2) a correlation of assembly size with the number of unique genes found, (3) evidence for symbiotic microorganism relationships within the venom gland, and (4) an agreement with previous works regarding transcriptomic duplication levels.

Furthermore, increasing scientific evidence pointing for a central role of nAChRs in the SARS-CoV-2 infection gave rise to the idea of possible applications for conotoxins in the Covid-19 disease treatment. Therefore, in this work a preliminary attempt was made at reporting any matching sequences between the transcriptomes of Conus species and SARS-Cov-2, but evidence of a genomic relation is yet to be found.

Key words: Conus, transcriptomics, venom gland, shared GO IDs, symbiosis, Covid- 19.

(9)

List of tables

Table 1 – The 29 genes shared only by venom related tissues with their respective functions. ... 35

(14)

List of figures

Fig. 1 – Distribution of Conus species throughout the world. The top map pictures the worldwide distribution of cone snails: molluscivous (M) in orange dots, piscivorous (P) in red, and vermivorous (V) in green. Dots in purple indicate the feeding habit of V+P;

blue and black dots do it for the M+P; and the solely white dot represents the V+M. The lower map zooms in on all Southeast Asia and the northern coast of Australia, spotlighting the collection of habitats with the greatest diversification of Conus species.

Image adapted from (56). ... 2

Fig. 2 – Macroscopic anatomy of a cone snail. Image taken from (139). ... 3

Fig. 3 – Beautiful cone snail shell patterns belonging to the 20 most abundant species in the South China Sea (13). ... 4

Fig. 4 – Diagram outlining the research’ workflow of this dissertation. ... 10

Fig. 5 – Comprehensive diagram illustrating the methodology workflow... 11

Fig. 6 – Annotation pipeline... 18

Fig. 7 – Description of the 25 transcriptomic samples removed from the dataset, including (from left to right) the samples’ origin study, names, tissues, species, date of collection and reason for removal. ... 24

Fig. 8 – List of the assembled data, featuring (from left to right) the samples’ names, tissues, species, and assembly size (in megabytes), as well as total numbers of samples (76), tissues (7), and species present (20). ... 25

Fig. 9 – UpSet plot illustrating the common GO IDs for the 7 transcriptomes of various tissues in the top bar chart. The first vertical blue line intersecting all 7 samples shows the 5,913 IDs that are shared by these tissues from all over the body and are thus perceived as housekeeping genes. Additionally, in the side black-bar chart are the total GO IDs attributed to each sample individually. ... 30

Fig. 10 – UpSet plot picturing the unique GO IDs for each of the 7 transcriptomes collected from various body parts of cone snails in the top bar chart, while also showing the total GO IDs attributed to each sample in the side bar chart. Organized in a crescent order of degree, this image illustrates the opposite edge of the sequence started in Fig. 8, where the IDs were organized in a decrescent order – the order meaning the number of intersections. Thus, in this plot it is possible to observe the genes with zero intersections (only dots without lines connecting), meaning the unique genes present in each of the transcriptomic samples. ... 31 Fig. 11 – UpSet plot illustrating the common GO IDs for the 69 transcriptomes of the venom-related group in the top black vertical bar. The vertical blue line intersecting all

(15)

69 samples shows the 2,104 IDs shared by these transcriptomes. Additionally, in the side black bar chart are the total GO IDs attributed to each sample individually. ... 33 Fig. 12 – UpSet plot picturing the amount of unique GO IDs for each of the 69 transcriptomes collected from venom ducts, venom bulb and a venom gland of various cone snails in the top bar chart, while also showing the total GO IDs attributed to each sample in the side bar chart. Organized in a crescent order of degree, this image illustrates the opposite edge of the sequence started in Fig. 10, where the IDs were organized in a decrescent order – order meaning the number of intersections. Thus, in this plot it is possible to observe the genes with zero intersections (no lines connecting the blue dots), meaning the unique genes present in each of the samples. ... 34 Fig. 13 – Venn diagram displaying the different groups of shared genes. In the left circle are the housekeeping genes (5,913) and in the right circle are the genes shared by the venom-related transcriptomes (2,104). The cross area indicates the number of genes shared in the venom related transcriptomes which are also expressed in other tissues (2,075). In this way, it is possible to visualise the group of shared genes expressed in various body parts but not the venom apparatus (3838) and most importantly the shared genes only expressed in the venom apparatus (29). ... 35 Fig. 14 – DE of the 29 genes commonly expressed in all venom-related transcriptomes normalized for the 20 species of cone snails present in the dataset. ... 37 Fig. 15 – DE of the 29 GO IDs commonly expressed in all venom-related transcriptomes normalized for the 3 different feeding habits of cone snails. ... 39 Fig. 16 – Three circular charts illustrating the percentage of transcriptomes, divided by assembly size, with their respective number of unique genes (GO IDs) compared to the mean value. A – chart representing the transcriptomes with assembly size below 50M and their respective variable number of unique GO IDs relatively to the mean; B – chart representing the transcriptomes with assembly size from 50 up to 75M and their respective variable number of unique GO IDs relatively to the mean; C – chart representing the transcriptomes with assembly size greater than 75M and their respective variable number of unique GO IDs relatively to the mean. ... 41 Fig. 17 – Maximum likelihood phylogenetic tree of 20 Conus species constructed using two rRNA 16S genes for each species. The branch colours represent feeding habits:

blue for vermivorous, red for piscivorous, and green for molluscivorous. Bootstrap values are written next to the branches in a colour reflecting their number according to the legend in the top left corner. ... 43 Fig. 18 – FastQC status check heatmap of the first MultiQC report; this heatmap illustrates the state of the transcriptomic data right after being acquired from the NCBI

(16)

platform. Along the vertical axis in the right are the files with the transcriptomic data.

Along the horizontal axis in the bottom are the categories evaluated, from left to right:

Basic Statistics, Per Base Sequence Quality, Per Sequence Quality, Per Base Sequence content, Per Sequence GC Content, Per Base N Content, Sequence Length Distribution, Sequence Duplication, Overrepresented Sequences, Adapter Content and Per Tile Sequence Quality. The red colour represents bad quality, yellow is medium quality and the green represents good quality. ... 69 Fig. 19 – FastQC status check heatmap of the second MultiQC report; this heatmap illustrates the state of the transcriptomic data after being processed by the software Trimmomatic. Along the vertical axis in the right are the files with the transcriptomic data. Along the horizontal axis in the bottom are the categories evaluated, from left to right: Basic Statistics, Per Base Sequence Quality, Per Sequence Quality, Per Base Sequence content, Per Sequence GC Content, Per Base N Content, Sequence Length Distribution, Sequence Duplication, Overrepresented Sequences, Adapter Content and Per Tile Sequence Quality. The red colour represents bad quality, yellow is medium quality and the green represents good quality. ... 70 Fig. 20 – FastQC status check heatmap of the third MultiQC report; this heatmap illustrates the state of the transcriptomic data after being processed by the software Cutadapt. Along the vertical axis in the right are the files with the transcriptomic data.

Along the horizontal axis in the bottom are the categories evaluated, from left to right:

Basic Statistics, Per Base Sequence Quality, Per Sequence Quality, Per Base Sequence content, Per Sequence GC Content, Per Base N Content, Sequence Length Distribution, Sequence Duplication, Overrepresented Sequences, Adapter Content and Per Tile Sequence Quality. The red colour represents bad quality, yellow is medium quality and the green represents good quality. ... 71 Fig. 21 – BUSCO assessment performed against BUSCO’s Mollusca set for all 76 transcriptome assemblies. ... 72 Fig. 22 – BUSCO assessment performed against BUSCO’s Metazoa set for all 76 transcriptome assemblies. ... 73 Fig. 23 – BUSCO assessment performed against BUSCO’s Mollusca set for the 69 transcriptome assemblies from venom related tissues. ... 74 Fig. 24 – BUSCO assessment performed against BUSCO’s Metazoa set for the 69 transcriptome assemblies from venom related tissues. ... 75 Fig. 25 – BUSCO assessment performed against BUSCO’s Mollusca set for the 7 transcriptome assemblies from various tissues. ... 76

(17)

Fig. 26 – BUSCO assessment performed against BUSCO’s Metazoa set for the 7 transcriptome assemblies from various tissues. ... 76 Fig. 27 – Chart illustrating the predicted coding sequences in blue, and the sequences recognized by the database in green for all the transcriptomes in decrescent order of ratio. ... 77 Fig. 28 – Chart showing the predicted and recognized coding sequences in blue and green respectively, for the venom-related transcriptomes in decrescent order of ratio. 78 Fig. 29 – Chart showing the predicted and recognized coding sequences in blue and green respectively, for the samples of other body parts in decrescent order of ratio. .. 79 Fig. 30 – Table with the venom-related transcriptomes divided in categories of assembly size: bellow 50M in light green, between 50 and 75M in light yellow, and above 75M in blue. Each type of sequencing instrument has his own colour: blue for Illumina HiSeq 1500, green for Illumina HiSeq 2000, orange for Illumina HiSeq 2500, orange for Illumina HiSeq 4000 and yellow for Illumina Genome Analizer II. Information based on the variables of assembly size and number of unique GO IDs is presented from column 7 up to 10. ... 80 Fig. 31 – ggplot2 depicting the DE of the 29 shared GO IDs in the venom-related transcriptomes normalized by the GO category and feeding habit. ... 81 Fig. 32 – Full annotation result of the entire SARS-Cov-2 genome processed against the Pfam database. In yellow are the proteins related to cell’s life cycle; in green are proteins with mechanistic and metabolic functions; in blue are proteins involved in the pathogenic pathways of the virus; finally, in orange are the proteins which are recognized but whose function is unknown. ... 82

(18)

List of abbreviations

ACE2 – Angiotensin-Converting Enzyme 2 BAM – Binary Alignment Map

BLAST – Basic Local Alignment Search Tool

BUSCO – Benchmarking Universal Single-Copy Orthologs COI – Cytochrome c oxidase subunit I

CPU – Central Processing Unit DE – Differential expression DNA – Deoxyribonucleic Acid

EggNOG – Evolutionary genealogy of genes: Non-supervised Orthologous Groups FCUP – Faculdade de Ciências da Universidade do Porto/ Faculty of Sciences of the University of Porto

GO – Gene Ontology knowledgebase HMMs – Hidden Markov Models HTML – HyperText Markup Language IO – Input/Output

M – Megabytes

MEGA – Molecular Evolutionary Genetics Analysis mRNA – Messenger Ribonucleic Acid

nAChRs – nicotinic acetylcholine receptors

NCBI – National Center for Biotechnology Information PD – Peptidase Domain

PTM – Posttranscriptional Modification RBD – Receptor Binding Domain RNA – Ribonucleic Acid

RNA-seq – High-throughput Sequencing of Transcriptomes

(19)

rRNA – Ribossomal Ribonucleic Acid S protein – Spike glycoprotein

SAM – Sequence Alignment Map

SARS-CoV-2 – Severe Acute Respiratory Syndrome Coronavirus-2 SRA – Sequence Read Archive

UP – Universidade do Porto/ University of Porto

(20)

1. Introduction

1.1. Natural venoms: overview of the research field

Scientific interest in natural venoms started many decades ago mainly for health reasons in a search for medicines (1). The logic behind this being that the understanding of one venom’s mechanism of action provides a perfect insight on how to effectively counter its poisonous nature and ultimately render it harmless. Beyond this simplistic counter-poison reasoning however, the study of a natural venom serves purposes other than helping with the finding of a medicine treatment for that specific venom.

Integrated in a defensive or offensive strategy, each natural venom is a very sophisticated biological weapon. Possessing a singular mixture of often unique components, mainly proteins and peptides, a venom interferes in the functioning of specific or diverse biological targets. Consequently, the accurate cataloguing and description of a venom component’s nature and function potentially unveils previously unknown molecules and mechanisms. In turn, this discovery of novel molecules and its functions provides knowledge of incalculable value for a variety of disciplines and scientific fields such as health, genomics, proteomics and cellular, molecular and neurobiology (2) (3) (4).

Over the years, venom originating from all biological kingdoms was isolated and characterized. Fruitfully, it already prompted the creation and development of biochemical tools for health treatments, illustrating well how venom’s studies are far reaching, pushing for the development of treatments for diseases other than the poisoning by the original venom from which the components originated from (2) (3) (4) (5) (6) (7) (8) (9) (10).

1.2. Conus species ecology

Particularly fascinating for this field of research is the venom of Conus species. From an ecological point of view, the Conus genus is a highly diverse natural group with more than 700 species of sea snails living preferentially in the intertidal zone of tropical and subtropical regions worldwide [Fig. 1] (11) (12) (13) (14).

Being natural predators, they can be found in rocky shores and coral reefs hunting with a specialized harpoon-like radular tooth to inject their shuddering, paralyzing venom onto fishes (piscivorous), molluscs (molluscivorous) or worms (vermivorous) (15) (16).

Although not as widely studied, the Conus genus also produces a defensive venom for

(21)

Fig. 1 – Distribution of Conus species throughout the world. The top map pictures the worldwide distribution of cone snails: molluscivous (M) in orange dots, piscivorous (P) in red, and vermivorous (V) in green. Dots in purple indicate the feeding habit of V+P; blue and black dots do it for the M+P; and the solely white dot represents the V+M. The lower map zooms in on all Southeast Asia and the northern coast of Australia, spotlighting the collection of habitats with the greatest diversification of Conus species. Image adapted from (56).

situations where the animal perceives to be in imminent danger or under attack (17).

The defensive venom is also produced in the same tissues as the offensive one, but as it is harder to obtain, it is not as well studied and is not subject of this dissertation.

(22)

Fig. 2 – Macroscopic anatomy of a cone snail. Image taken from (139).

1.3. Characteristics of the venom and the genome

The predatory venom is a cocktail composed of roughly a couple hundred bioactive well-structured polypeptides (~10-40 amino acid residues) known as conotoxins (18).

They are secreted in the venom duct – a long and convoluted tubular structure [Fig. 2]

– by the epithelial secretory cells before being pushed by muscle peristalsis of the venom bulb to be loaded into the harpoon-like tooth (19) (20). Conotoxins are synthesized as precursors with a three-domain structure: one is a conserved signal region; another is a pro-peptide region involved in the processing of the precursor; and lastly there is a highly variable, cysteine-rich mature region, which is the functional toxin (18) (21) (22). After their synthesis, these mostly disulphide-rich peptides suffer a series of PTMs enabling each of them to interact specifically with their intended target – an important aspect since it means minor side effects in disease treatment (18) (23)

(24) (25) (26) (27) (28).

Targets of conopeptides include the presynaptic membrane calcium channel or G protein receptor, voltage-gated potassium and calcium channels, the receptors of serotonin, somatostatin, norepinephrine and adrenal hormone, and others (18) (29).

The conserved signal region is currently used to classify precursors into “superfamilies”

– a superfamily may consist of several families, each family targeting a specific ion channel and/or receptor (23) – although there is some degree of debate as to whether or not the current classification is well suited and pertinent (30).

(23)

Nowadays, there are 30 gene superfamilies and more than 7,000 conopeptides discovered out of an estimated 50,000 to 1 million different bioactive conotoxins across all these beautifully patterned shell species [Fig. 3] (31) (32) (33) (34) (35). Although only a few percent of this polypeptides have been sequenced and studied, the genus Conus’ venoms are already well established in the grand venom’s research field.

Thoroughly reflecting the great potential as drug precursors are the proven results and direct practical applications in various scientific contexts such as neuroscience and pharmacology (36) (37) (38). For instance, in neuroscience, there are some venom’s molecules in use as molecular tools, several are in clinical trials, and one is even already approved as a drug against chronic pain (13) (39) (40) (41).

Concerning the genome, Conus specimen’s total genome size tends to have around 3.5Gb, being smaller than expected from previous fluorometric assays and flow cytometry (42) (43). Assemblies conducted for C. betulinus (the first intact genome assembled) and C. ventricosus showcase this as they have 86% and 87.6% of their corresponding size estimation, respectively (42) (43). As for genome structure, gastropods in general have an ample range of chromosome numbers and Conidae genomes are no different, varying from 16 pairs in C. magus to 35 pairs in C. coronatus (44) (45) (46).

Regarding the conotoxin genes it was revealed by the venom gland transcriptome that genes scattered in different chromosomes and located within repetitive regions encode conotoxin precursors, hormones, and other venom-related proteins (42) (43). The genes encoding conotoxin precursors are normally structured into 3 exons (mean~85

Fig. 3 – Beautiful cone snail shell patterns belonging to the 20 most abundant species in the South China Sea (13).

(24)

base pairs) which do not necessarily coincide with the 3 structural domains of the corresponding conotoxins – generally the boundaries of the first and second exons do not always coincide with the boundaries of signal and pro-peptide domains, but the third exon does exclusively encode the mature domain (43). Curiously, there is a big discrepancy between number of conotoxin genes versus transcripts and this may be due to natural variation among individuals coupled with the PTMs as while highly expressed transcripts are common to both specimens, variation of moderately and weakly expressed precursor sequences is surprisingly big (26) (42) (43) (47) (48) (49).

Further illustrating this is the fact that, when conotoxin variability is compared quantitatively, highly expressed peptides show a strong correlation between transcription and translation, whereas peptides expressed at lower levels show a poor correlation (42) (43) (47). This in turn suggests that major transcripts are subject to stabilizing selection, while minor transcripts are subject to diversifying selection (47).

Additionally, on account of gene duplication, accelerated substitution rates, recombination, alternative splicing and differential expression, these neuroactive toxins are strikingly structurally diverse from species to species even before undergoing extremely complex PTMs (42) (43). Furthermore, some venom-related proteins have expression levels one order of magnitude higher in the foot than in the venom gland indicating that these hormones and proteins may be endogenous, having a physiological function common to different tissues and not restricted to the venom gland (43). The detection of low expression levels of toxin genes in different tissues outside the venom gland has also been demonstrated in snakes and in platypus (50) (51). To explain the evolutionary origin of this, it is suggested that toxin genes emerge through one out of two main paths: either by gene duplication and adaptive neofunctionalization of physiological genes in the venom gland coupled with reduction of expression levels in other tissues; or instead by sub functionalization through neutral evolution and restriction to the venom gland (47) (50) (52) (53).

At organ level, previous studies on dissected venom ducts also revealed that modern Conus species can produce venom peptides with different functions at different parts of the venom gland and even for different parts of the venom duct (17) (30). All along the length of it, qualitative and quantitative differences in conotoxin components were found, suggesting specialization of duct sections for biosynthesis of conotoxins (30) (54) (55). Further evidence for this specialization lies in the variation in epithelial composition in the proximal, central and distal portions of the duct (30) (54). These anatomic variations and differential expressions explain the compositional differences noted in multiple injections during single feeding events (30) (54) (55). Moreover, the

(25)

defence conotoxin group has been identified and distinguished from the predation- evoked conotoxin group, the former being present in the bulb-proximal and the latter in the bulb-distal end of the specialized venom ducts (17) (30). It is this astonishing capacity to diversify the arsenal of conotoxins that allowed the Conus snails to become specialized in attacking, defending, hunting and intimidating throughout all geographic locations of their habitats. Because of this remarkable flexibility, conotoxin compositions vary and are expected to do so from natural conditions to laboratorial cultures, causing difficulty in research assessments (56).

During predation some peptides assume greater importance than others, materializing the diverse conotoxin compositions throughout successive venom injections. Studies on the function of conotoxins and biomechanics of prey capture attribute a higher potency to the venom from the posterior end of the duct (the end attached to the venom bulb) comparatively to the venom extracted from the anterior portion near the proboscis (30). As the first injections tend to have peptides predominantly found in regions of the duct proximal to its insertion, whereas later injections are composed of peptides found in more distal regions near a large muscular venom bulb connected to the end of the duct, it can be concluded that the last injections are the most dangerous (16) (19) (42) (43). In general, compounds from A, M, O1 and T gene superfamilies account for the bulk of venom cocktails (26) (42) (43) (48). Peptides from the O- superfamily are prominent in all venom profiles and are present in multiple shots from single feeding events, indicating the importance of O-superfamily members throughout prey capture (13) (26) (42) (43) (54) (57) (58). These peptides block calcium and potassium channels and slowly inactivate sodium channels (13) (57) (58). In subsequent injections, peptides of the M- and T-superfamilies are more prevalent (54).

Members of the T-superfamily target the presynaptic membrane calcium channel or G protein receptor, the somatostatin receptor and block noradrenaline transporters or voltage-gated sodium channels while venom peptides from the M-superfamily are known blockers of sodium channels, potassium channels and acetylcholine receptors (13) (57) (58). In this context, it is also important to note that the expression of some conopeptides is so low that their presence cannot be detected by traditional proteomic experiments (56).

1.4. The transcriptomics approach

Analysis of transcriptomes in a transcriptomics (transcriptome + bioinformatics) approach provided all this knowledge as in order to understand the mechanism of

(26)

action of a venom, first there is the need to discover what kind of molecules are composing it.

In this regard, comprehension of the transcriptome – being the set of all RNA transcripts including mRNA – is fundamental as the RNA sequences in the sample provide knowledge of the genes being actively expressed while also indicating the composition and structure of the proteins composing the venom. Thus, data processing and analysis is done utilising the raw RNA sequences’ samples through software programs based on RNA-seq (59) (60).

In essence, RNA-Seq enables the study of genetic and functional information regarding any organism at an unprecedented speed and scale. Together, these two features immensely facilitate functional genomics research in species, especially in those for which genetic or financial resources are limited. This less spotlighted group includes many non-model organisms — organisms that, although they have not been exten- sively studied in a research setting, are nevertheless of substantial ecological, evolutionary or therapeutic importance — such as Conus and other venomous species (13) (59) (61).

In light of this dissertation, the most relevant aspect of RNA-Seq is that it makes it possible to simultaneously study the transcript structure and expression with a high resolution and a broad dynamic range. In this wise, investigation is conducted following predetermined strategies made in accordance with the species of interest. Generally, the primary workflow can be resumed to three steps: sample collection and polishing, transcriptome assembly and functional annotation. Upon annotation completion, the dataset composed of putative toxins, novel genes, and already known venom proteins and genes is then open to a variety of further analysis (59) (62) (63).

1.5. Possible relevance in the Covid-19 disease treatment

Since late 2019, a novel strain of coronavirus has enveloped our world in a pandemic.

As of September of 2022, the severe acute respiratory syndrome coronavirus-2 that causes the Covid-19 disease has infected a confirmed 518 million people worldwide, claiming the lives of over 6 million of them (64).

This coronavirus is a positive-strand RNA virus, and its infection starts in the epithelial cells of the respiratory system, mainly in the lungs and trachea, via its S protein. Firstly, the trimeric viral S protein has its 3 heterodimers cleaved into its subunits: the S1s and S2s. Afterwards, while the S2s play a membrane fusion role, the RBD of the S1 subunits binds with the PD of the host ACE2. The establishment of this early

(27)

connection allows the virus to enter the host cells and is therefore crucial for the viral infection (65) (66).

Frequently, the infection reaches organs beyond the respiratory system ones as, among other aspects, there are ACE2 receptors in the heart, kidneys and intestine (67) (68) (69). Thus, the infection might and often does interfere in the functioning of the circulatory, renal, urogenital, digestive and even central nervous system as through the vascular system’s blood vessels the virus is able to attack the peripheral nerves and disrupt the blood-brain barrier (70) (71) (72) (73).

In the present, the universally adopted counter measure against this contagious virus is vaccines (74). Through various doses of vaccination, the obvious aim is to render the populace ultimately immune to the Covid-19 disease. However, there are multiple possible side effects related to the treatment with all vaccines currently being administrated. For example, the Pfizer vaccine, being one of the most given vaccines, has an enormous list of possible side effects (75) (76). For these reasons, it is important to conduct further research in order to raise the standard of safety, comfort and trustworthiness of the methods used to combat the pandemic.

Interestingly, as Covid-19 is primarily classified as a respiratory disease, the theoretically expected main risk group would logically include anyone with a less healthy respiratory system, such as the elderly, the asthmatics or the smokers (77) (78). However, while this is true for the first two aforementioned groups of people, the reported low prevalence of smoking patients hospitalized due to Covid-19 really stood out and caught the attention of the professionals (79) (80) (81). On account of this early observation, it was suggested that nicotine might mitigate or even prevent the virus’

infection (79). This hypothesis is being currently studied under different perspectives, with one of them being the direct administration of medicinal nicotine – already undergoing clinical trials (with no results posted until now) (82) (83).

Following this important suggestion, another one was made that stated a central role for nAChRs in the SARS-CoV-2 infection (70). This link was first based upon the discovery of a sequence homology between the furin cleavage site of the S protein and a neurotoxin motif that targets nAChRs (84). Reasoning that if Covid-19 can be, in some degree, controlled using nicotine to compete with the virus for binding to nAChRs, these receptors could be central in the process of infection (72) (84) (85).

Indeed, evidence shows that the viral glycoprotein possesses a favourable affinity towards nAChRs, thus supporting this view and validating the importance of the initial link (70) (86) (87). Therefore, the neurotoxin in question assumes great relevance.

(28)

Curiously, it belongs to a family of powerful neurotoxins only present in the natural venom of a fascinating yet neglected group of species: the Conus species.

1.6. Objectives and strategy

For this Masters’ thesis, a multitude of databases and bioinformatic procedures were respectively accessed and applied with the ultimate objective of contributing to the deciphering of the evolutionary genomics and transcriptomics of the Conus genus’

natural venom.

In addition, as the possible application in the treatment of COVID-19 disease gradually took shape along the research work, the objective of helping in the endeavour of the fight against the pandemic also materialized. As stated previously, the sequence similarity between some conotoxins and a key virus protein (the S protein) gave the impression of an interaction of the latter with nAChRs – now confirmed. Aforestated studies unequivocally demonstrate that the S protein promptly binds with the α9- nAChR subunit, just like α-conotoxin. In virtue of these facts, further investigation connecting the Conus genus’ toxins and the SARS-CoV-2 virus is thoroughly necessary, not to say potentially essential.

In order to reach the objectives, a four-step pipeline was designed and is briefly illustrated bellow [Fig. 4]. As before mentioned, the first step is the collection of all possible transcriptomic data regarding cone snails as well as the genome sequences of both the SARS-Cov-2 virus and the viral Spike protein. Within this step, it is required to polish the Conus’ dataset properly and carefully before advancing further. The assembly of the acquired transcriptome follows, being conducted exclusively in the de novo method on account of limitations of time and computer storage space. Next, the functional annotation of the assembled transcriptome is performed, starting by sorting the data into two cured datasets: one exclusively with samples recovered from the venom apparatus and another with samples retrieved from other tissues and organs.

Afterwards, the finding of the biological functions of the genes encountered empowers the jump to the final gene and protein comparison and matching analysis. Thus, in the fourth and last step aims to find relationships amidst the Conus genome datasets, and with the coronavirus’ genome. The results are presented in the form of narrow category heatmaps, intersection plots, expression charts and a phylogeny tree.

Hence, starting with all the Conus’ transcriptomic data publicly available and utilizing the latest bioinformatic tools in a combined transcriptomics approach to process it, the main objective of this dissertation is to produce a cohesive and coherent functional

(29)

annotation of the Conus species natural predatory venom, as well as reporting any relationships found within the wider genome of Conus comparatively to the narrower venom gland transcriptome. Extending from the original purpose, research on matching sequences between the cone snails’ venoms and the coronavirus that caused the terrible pandemic was conducted.

Fig. 4 – Diagram outlining the research’ workflow of this dissertation.

(30)

2. Materials and Methods

2.1. Materials

All materials utilized in the performing of this research work were property of the UP. All procedures were conducted from one laboratorial computer located in the FCUP’s laboratory 2.49. Conveniently, in times of mobility restriction or for progress security check-up, this computer was remotely accessed from my own personal computer using Anydesk – a closed-source remote desktop application (88). Being platform- independent, this software program provides remote access to other computers and devices running the host application.

The laboratorial computer operated on Linux system (for system specifications, see the Annexes section 7.1.), with every procedure made and all bioinformatic tools called for using a Konsole terminal – an open-source terminal emulator (89). All command functions employed are listed in the Annexes section (see 7.2.) exactly as they were written manually in the terminal.

2.2. Methods

The methodology of this dissertation is firstly illustrated bellow [Fig. 5] and then thoroughly explained in the following subchapters. In Fig. 5, all actions performed as well as all software utilized are listed in linear and chronological order.

Fig. 5 – Comprehensive diagram illustrating the methodology workflow.

(31)

2.2.1. Obtention of the transcriptomic data

The transcriptomic data was acquired from the NCBI platform in the “SRA” format (see 7.2.1.). The SRA is a publicly available archive hosted by NCBI, providing a public repository for mostly raw DNA sequencing data (90) (91) (92). Previously known as Short Read Archive, this repository was initially for the "short reads" generated by high- throughput sequencing, which are typically less than 1,000 base pairs in length. It includes data submitted to NCBI, the European Bioinformatics Institute, and the DNA Data Bank of Japan. All data in the SRA format has full, per-base quality scores (93).

Originally, only 101 out of 105 total files were effectively downloaded as 4 of them had download access denied. Fortunately, after investigating the content of the files, it was verified that all the 4 missing ones were in fact equal to another downloaded file.

Therefore, every single available data was obtained, considered, and processed.

2.2.2. Conversion of the dataset to the standard FASTQ format

Next there was the need to convert the downloaded data to a standard format, so, the data set was converted from SRA to FASTQ format – the de facto standard text-based format for storing the output of high-throughput sequencing instruments (see 7.2.2.) (94). The regular computational function can sometimes be – and indeed in this case was being – very slow, regardless of the technological resources (e.g., network, IO, CPU) available. To face this problem and increase the speed of the process, the tool was parallelized. Basically, this way the procedure is accelerated by first dividing the work into a requested, personalized number of threads, then running multiple functions in parallel and in the end concatenating the results back together (95). In practical terms, it makes the process much faster without affecting the result.

The chosen thread number was 40 (out of 56 possible for the system used) and, just as intended, the parallelization delivered the end results much faster than normal functions previously attempted.

2.2.3. Quality control analysis

The tool used for quality control analysis was FastQC. This is a very popular bioinformatic instrument used to provide an overview of basic quality control metrics for raw next generation sequencing data. It imports data from a variety of file formats (any variant of BAM, SAM or, most importantly, FASTQ files) and exports the results to an HTML based permanent report. This report contains a set of summary analyses in a group of coloured bar charts and histograms to provide a clear impression on the state of the research data while also precisely indicating the areas where the quality falters (96). General statistics of the data such as an estimative percentage of duplicates, GC

(32)

content, read lengths and total sequences (in the millions for the samples used for this work) can be found right in the beginning of the report. Then, many other parameters are summarily presented but the most important are: the mean quality values across each base position in the read, the number of reads with average quality scores (per sequence quality scores), the sequence length distribution (which is expected to be around 100 base pairs) and the adapter content and per base N content (the percentage of each base position for which an N was called), both of which should be minimum to none. Other parameters such as duplication values and GC content are not very important since the former does not necessarily translate neither good or bad quality, and the latter should mostly present a normal distribution – a “bell shaped”

graph – but deviations in the case of this research are to be expected since the data is of transcriptomic nature. Lastly, in the end of each report there is a heatmap compiling all the categories.

In this work, FastQC analysis was undertaken a total of three times: the first (see 7.2.3.) was done on the converted data as soon as the conversion from SRA to FASTQ format finished, the second was executed after the data was processed by the software Trimmomatic (see 7.2.4.), and in a similar way the third analysis was made after the dataset being further processed by the software Cutadapt (see 7.2.5.). The first analysis was a routine one made to generally access the state of the dataset without polishing. In contrast, the second and third quality controls were executed after examining the previous reports and submitting the dataset to trimming software a first and second time, respectively.

Following the third quality control, all data still not suitable for further analysis was discarded, but only after being properly investigated (see Results section 3.1.).

2.2.4. Polishing data with trimming software

The need for raw read filtering and trimming can be explain by demonstrations of previous studies that indicate these processes improve the quality of the future assembly (59) (97) (98) (99). Usually, this need is made clear by a quality control report on the newly acquired data. In this sense, after being acquired from NCBI and the routine quality control was made, the files were then processed by the software Trimmomatic. This software is a fast, multithreaded command line tool that can be used to trim and crop Illumina (FASTQ) data as well as to remove adapters (100).

According to the problems found in the first quality report, Trimmomatic was run with adjusted parameters (see 7.2.6.) in the manner of the following reference:

(33)

❖ ILLUMINACLIP: <fastq_data> : <seed_mismatches> :

<palindrome_clip_threshold> : <simple_clip_threshold> :

<min_Adapter_Length> : <keep_both_reads> LEADING: <quality> TRAILING:

<quality> SLIDINGWINDOW: <window_size> : <required_quality> MINLEN:

<length> THREADS: <number>

The function of the options and the specified number attributed are thoroughly explained bellow (101):

• ILLUMINACLIP: <fastqData>:<seed mismatches>:<palindrome clip threshold>:<simple clip threshold>

➢ fastqData: specifies the path to the FASTQ files. In this work, the suggested adapter sequences used were the ones provided for TruSeq3 (as used by HiSeq and MiSeq machines); since the data is paired ended, this was also specified by writing “PE”;

➢ seed_mismatches: set as 2, it specifies the maximum mismatch count which will still allow a full match to be performed;

➢ palindrome_clip_threshold: set as 30, specifies how accurate the match between the two 'adapter ligated' reads must be for PE palindrome read alignment;

➢ simple_clip_threshold: set as 10, specifies how accurate the match between any adapter etc. sequence must be against a read;

➢ min_adapter_length: in addition to the alignment score, palindrome mode can confirm that a minimum length of adapter has been detected. Because the palindrome mode has a very low false positive rate, this number can be safely set even to 1 (reducing drastically from a default of 8, for historical reasons), as in this case, to allow shorter adapter fragments to be removed;

➢ keep_both_reads: after read-though has been detected by palindrome mode, and the adapter sequence removed, the reverse read still contains the same sequence information as the forward read. For this reason, the default behaviour is to entirely drop the reverse read. This parameter avoids that, retaining the reverse read – which may be useful e.g., if the downstream tools cannot handle a combination of paired and unpaired reads;

• LEADING: <quality>

➢ quality: set as 20, it specifies the minimum quality required to keep a base in the beginning of the sequence;

(34)

• TRAILING: <quality>

➢ quality: set as 20, it specifies the minimum quality required to keep a base in the end of the sequence;

• SLIDINGWINDOW: <window_size>:<required_quality>

➢ window_size: set as 4, it specifies the number of the group of bases permanently on the scanning window, across which quality is averaged;

➢ required_quality: set as 15, it specifies the average quality required to keep the whole group of bases in the scanning window previously defined;

• MINLEN: <length>

➢ length: set as 51, specifies the minimum length of reads to be kept;

• THREADS: <number>

➢ set as 48 (from a maximum possible of 56).

In short, the previous parameters act in the removal of leading, trailing low quality (below or above quality 20, respectively) and N bases. They also dictate a scan of the read with a 4-base wide sliding window, cutting it when the average quality per base drops below 15 and dropping reads below the 51 bases long.

Following the Trimmomatic’s editing, the second analysis report revealed the need for additional correction and so the software Cutadapt was used. In essence, Cutadapt is another trimming software that also finds and removes adapter sequences, primers, poly-A tails, and other types of unwanted sequences (102). In this case, it was used to just chop out the first and last 15 bases from the sequences (see 7.2.7.). This software also supports parallel processing – to enable it, the option “-j N” was used, where “N” is the number of cores to use and was set as 30.

As previously stated, after finishing the second trimming process, a third quality control analysis was called and subsequently all data at a less acceptable state was discarded.

2.2.5. Assembly of the transcriptome

The assembly of the transcriptome was done recurring to the software Trinity. This software processes large volumes of RNA-Seq reads through a combination of three independent modules which are applied consecutively to extract full-length splicing isoforms (103). The first module, Inchworm, generates transcript contigs for the second module, Chrysalis, to cluster together, which it does by constructing complete de Bruijn graphs for each cluster. Every single cluster represents the full transcriptional complexity for a given gene or groups of genes if they have the sequences in common.

Then, Chrysalis partitions the full read set among these clusters that are in disjoint

(35)

Bruijn graphs. The last module, Butterfly, processes the individual graphs in parallel, eventually reporting full-length transcripts for alternatively spliced isoforms and setting aside transcripts that correspond to paralogous genes (103) (104).

There are two primary methods for assembly: through the guidance of previously assembled genomic sequences or via de novo assembly (104). For model organisms, the standard approach to transcriptome studies is the genome-guided one. However, as this approach cannot be applied to organisms for which a well-assembled genome does not exist – like most venomous species – a de novo transcriptome assembly is required (104) (105). For the genus Conus, the genome guided approach became possible only recently, with the intact assemblies conducted for C. betulinus and C.

ventricosus made publicly available for reference in 2021 (42) (43).

In this work, although both methods of assembly are possible, only the de novo was carried out, partially due to time constrains but mainly as a consequence of limitations in computer storage space. The de novo assembly performed out of the whole transcriptomic data and without genome reference was therefore attempted (see 7.2.8.). The function ran on 25 CPUs and with an attached maximum memory of 25 Gigabytes. Unfortunately, the task could not be completed at first and stopped mid-way because the storage capacity of the laboratorial computer was all used up.

Consequently, the dataset was compressed to highly compressed archive files.

Fortunately, it succeeded in arranging enough storage space for the remainder of the operations to occur without any further delay of this sort.

2.2.6. Assembly completeness assessment

Evaluating the quality of the assembly and check the execution of the process before moving on is of major importance since every subsequent step utilizes the assembled dataset as a basis. In this regard, the BUSCO proved to be the elite choice to both quickly and reliably produce a quantitative and qualitative assessment of the transcriptome assembly completeness. Based on evolutionarily informed expectations of highly conserved gene content from near omnipresent single copy orthologs, this software performs an evaluation on genome, proteins, or, in this case, transcriptome even without a reference genome (106). Moreover, as its most time-consuming steps are parallelized, the software is very speedy. In the end, the assessment is presented in the form of a bar chart with the following categories:

• “Complete and single copy”: for the orthologs whose aligned sequence length is within 2 standard deviations of the BUSCO group’s mean (i.e., 95%).

(36)

• “Complete and duplicated”: when multiple copies of the same orthologs are found in the gene set being assessed.

• “Fragmented”: for orthologs whose length of their aligned sequence is beyond 2 standard deviations of the BUSCO group’s mean length (i.e., <95%).

• “Missing”: for any BUSCO without a BUSCO-matching gene meeting the

‘expected score’ cut-off.

For this research, two assessments were made in succession on the whole assembled transcriptome against two of the BUSCO’s datasets: the “Mollusca” and the “Metazoa”

(see 7.2.9.a and 7.2.9.b, respectively). The intention was to choose datasets of animal groups that included the Conus species but on a different magnitude. Therefore, the datasets chosen correspond to the phylum and kingdom where cone snails are inserted.

When analysing completeness assessments, the results should be interpreted by comparing both assessments given with the two BUSCO datasets. The reason for this lies with the fact of the completeness assessment being itself a comparison, searching the assemblies for highly conserved genes present in certain animal groups. Hence, as a relative value, it should always be analysed in perspective, meaning particular attention should not be placed on the percentages of completeness alone, but rather on which comparative dataset the tendency for a greater level of completeness is.

Moreover, by focusing on specific tissues as in this case, a transcriptomic experiment is unlikely to produce a BUSCO-complete transcriptome since the level of differentiation is very high. For these reasons, the desired outcome is consistency across the assembled dataset.

After analysing the first two assessments, the dataset was split in two groups based on one category: the origin of the transcriptomic samples. Naturally, the transcriptome previously acquired was recovered from many organs and tissues of Conus species.

As the interest of this work lies with the venom-related data, the dataset was divided in a group of venom-related transcriptomes and another for samples from other tissues.

In this way, the assessment results are easier to interpret. Thus, after the origin of the raw data was inspected on NCBI, both groups had their completeness evaluated against each of the two BUSCO datasets used before (see 7.2.9.c, 7.2.9.d, 7.2.9.e, and 7.2.9.f). On a final note, as the graphs made by the software were very disfigured, the values were used to make better graphs with excel.

(37)

2.2.7. Annotation of the transcriptome

Annotation is a complex multi-step procedure that utilizes several different software and protein databases in sequential order to attain the functions of the proteins. The first stage is to identify and predict coding regions from the assembled transcriptome.

Afterwards, the functionalities of the coding sequences are obtained through crosschecking and comparison across various reviewed databases. The perfected method chosen for annotation is illustrated bellow [Fig. 6] and is detailly explained in the following subchapters.

2.2.7.1. Identifying likely coding sequences

The first of said software is TransDecoder, more specifically the TransDecoder.LongOrfs component. It serves to identify candidate coding regions within transcript sequences such as those generated by transcript assembly using Trinity (107). These likely coding regions are then confronted at various steps with one or more databases to provide an ever-more complete and accurate classification of protein families and domains. In this strategy, the most important output of TransDecoder.LongOrfs is the ‘Trinity.fasta.transdecoder.pep’ file, which contains the protein sequences corresponding to the predicted coding regions within the transcripts.

This “.pep” file obtained for each of the assembled transcriptomes with TransDecoder.LongOrfs (see 7.2.10.) was used to elaborate the subsequent sequence homology and other bioinformatic analyses.

2.2.7.2. Support annotation with UniProt and Pfam

Having the intention of utilizing the greatest tools in conjunction with the best databases available, the strategy for annotation splits up at this point and follows two methods.

Fig. 6 – Annotation pipeline