Ingénierie des Systèmes d'Information. Volume X - n°X/2002, pages Y à Z

transcriptome

Laure Berti-Equille*, Fouzia Moussouni, Anne Arcade

*IRISA, Campus de Beaulieu, 35042 Rennes, France

**INSERM U522, Campus de Villejean, 35000 Rennes, France {Prénom.Nom}@rennes.inserm.fr

ABSTRACT: A major concern in modern biology and medical research is the use of a high-throughput technology, known as bio-arrays or DNA chips, that allows thousands of genes to be studied simultaneously. The medical research institute INSERM U522, which specializes in the liver, uses transcriptome techniques to diagnose liver disease states and to point the way towards new therapies. To this end, the design of an integrated bioinformatics environment named Gedaw (Gene Expression DAta Warehouse) has been initiated for storing, managing and analyzing such specific data. As an object-oriented data warehouse, it includes knowledge and complex data on genes expressed in the liver. The concept of ontology is the keystone of the application: it is used to integrate both genomic data available in public databanks and experimental data on genes delivered by laboratory experiments and clinical statements. This paper describes the data modeling and processing that make it possible (i) to capture data on genes from public databanks (e.g., GenBank), (ii) to extract relevant information by selecting objects imported in XML format, and (iii) to make them persistent in the object-oriented warehouse.

RÉSUMÉ: De nouvelles techniques d'analyse biologique, dites "à haut débit" génèrent une masse considérable de données qu'il est nécessaire d'organiser, de stocker et de gérer. L'unité de recherche U522 de l'INSERM, utilisant ces techniques pour l'étude du transcriptome hépatique, a initié le développement d'un environnement intégré nommé Gedaw (Gene Expression DAta Warehouse), dédié à la gestion, à l'intégration et à l'analyse de ces nombreuses données. Entrepôt de données orienté objet, il regroupe des connaissances et des données complexes sur les gènes du foie. Le concept d'ontologie, au centre de l'application, permet d'intégrer à la fois les données sur les séquences génomiques issues des banques de données publiques, ainsi que les données issues des expériences du laboratoire et des relevés cliniques. Cet article présente la problématique de l'intégration des données biologiques liées au transcriptome. Il décrit le modèle de données associé à ce cas d'application et implanté sous forme de classes d’objets persistants au sein d'une base orientée objet. La chaîne de traitements développée dans Gedaw permet : (i) d'intégrer les données génomiques à partir de banques publiques (telles que GenBank) (ii) d’extraire les informations pertinentes par sélection d'objets au format XML (iii) de les rendre persistantes dans l'entrepôt.

KEY-WORDS: data integration, biological data, ontology, object-oriented data warehouse, XML, transcriptome.

MOTS-CLÉS: intégration, données biologiques, ontologie, entrepôt de données objet, XML, transcriptome.

1. Introduction

Current projects in genomics and modern biology deliver, on a daily basis, extremely rich and diverse knowledge on genes and related entities. One of the major challenges that bioinformatics faces with this plethora of data is how to tame it efficiently, in a way that makes it computationally accessible and user-friendly and that allows new biological knowledge to be extracted and discovered. The management of biological data is challenging: the data are heterogeneous, complex and multi-disciplinary, and include, for example, experiment details, raw data, scientific interpretations and hypotheses, images and literature. The challenge lies in finding relevant methods in information retrieval, scientific text mining and knowledge discovery to handle knowledge that is conceptually so rich and complex.

One of the major concerns in modern biology and medical research is the use of bio-array technologies and transcriptome studies to diagnose disease states and to point the way towards new therapies. Because this technology allows thousands of genes to be studied simultaneously, it marks a turning point in delivering new knowledge on their dynamics and interrelationships. However, the massive production of data makes their management and analysis difficult. Biologists currently gather, seek and compare heterogeneous information items from different information sources to carry out this analysis. They spend considerable time selecting tools and sources, phrasing their questions and deciphering the results received from each source. This process requires considerable manual and intellectual effort, which is a constant barrier to progress. It may also be plagued by erroneous data and misinterpretations. The data of interest on the expressed genes, together with everything that is publicly known about them (expression, sequence, protein function, family, interactions, bibliography, etc.), are becoming increasingly bulky, heterogeneous and distributed over several scientific information sources.

This paper deals with the problems of biological data integration and presents our approach, focusing on the integration processing implemented in our Gene Expression DAta Warehouse (Gedaw) project. Gedaw is an object-oriented environment for: i) storing, managing and integrating the data generated by transcriptome experiments and ii) analyzing them along with the knowledge on the expressed genes that is publicly available on the Web. Other aspects of the Gedaw environment, such as biological data quality control, mining and refreshing, are not described in this paper, even though they may contribute significantly to the data integration processing.

To integrate biological data properly, many questions must be addressed. How can we reconcile all the semantics and the different viewpoints available around one biological concept such as the notion of gene (i.e., the metabolic, the chemical and the functional perspectives)? How can we unify the available knowledge on a gene in a way that makes it computationally tractable by bioinformatics applications, in our case an integrated object-oriented environment that manages relevant information on gene expression?

We use XML as an exchange format for mediating and integrating data coming from multiple semi-structured sources into a single, unified data description model called bio-ontology. The concept of ontology is the keystone of the application: it is used to integrate both genomic data available in public databanks and experimental data on genes delivered by laboratory experiments and clinical statements.

The rest of the paper is organized as follows. Section 2 presents the application context of transcriptome study, the needs, and the underlying problems raised by biological data integration. Section 3 describes the issues of biological data integration processing and presents the architecture and the objectives of Gedaw for warehousing transcriptome data. Section 4 proposes our ontological approach for reconciling biological data. Section 5 details, with an example, the integration processing implemented in Gedaw. Section 6 presents related work on data mediation. Section 7 concludes and presents perspectives for future work.

2. Challenges of biological data integration for the transcriptome study

2.1. Transcriptome experiment

Transcriptome study examines the transcriptional response of the cell to different environmental conditions [BRO 99], such as growth or stress factors, chemicals, food treatments, genetic disturbance, etc. In our case, the condition studied is iron overload in the hepatocyte (liver cell). The response of the cell is reflected in how strongly certain genes are expressed, or not, under two different conditions: normal vs. pathologic (Figure 1).

Single-stranded RNA molecules are extracted from the two situations and converted into a double-stranded DNA version by reverse transcription. The experiment is carried out by immobilizing the DNA copy of the selected genes on a solid support (called a micro-array or DNA chip) and hybridizing it with a sample of RNA extracted from the two tissues being studied (e.g., the normal vs. injured liver tissues). If transcripts of the deposited genes are present in the RNA extract, a hybridization signal is seen, which provides a measure of the gene expression level (green or red in Figure 1).

Figure 1. Gene Expression Experiment (steps: RNA extraction, reverse transcription, hybridization on the micro-array; samples: normal and pathologic; output: spot intensity measurements as gene expression levels)

The objective of transcriptome experiments is to find out which genes are abnormally expressed in injured tissues, which provides a means of diagnosing disease states and may point the way towards new therapies. Determining whether a gene is expressed or not is therefore a routine task for a biologist. Things become complicated when the generated data become massive: when DNA chips are used, thousands of genes may be deposited on a two-dimensional grid. Each gene is represented by a spot, and its expression level is measured by means of the spot intensity (Figure 1). Each spot thus potentially represents an expressed gene, which has multiple facets. In fact, to interpret a single spot, the biologist must consult all the public knowledge available on this gene in different sources: literature and public databanks for sequence description, sequence components, functional proteins, known gene interactions, gene variation, tissue distribution, chromosomal localization, etc. Furthermore, this available biological knowledge is constantly being reshaped and modified, mainly because of the rapid growth of the literature.
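As a rough illustration of how spot intensities could be turned into a list of candidate genes, the following Python sketch compares the intensity of each spot under the normal and pathologic conditions. The use of a log2 ratio, the threshold of 1.0 (a two-fold change), and the function and gene names are assumptions made for this example; they are not taken from the Gedaw implementation.

```python
import math


def flag_abnormal_genes(normal, pathologic, log2_threshold=1.0):
    """Return {gene: log2 ratio} for spots whose intensity changes strongly.

    normal, pathologic: dicts mapping a gene identifier to a positive
    spot intensity. The threshold is an illustrative assumption.
    """
    flagged = {}
    for gene, n_intensity in normal.items():
        p_intensity = pathologic.get(gene)
        if p_intensity is None or n_intensity <= 0 or p_intensity <= 0:
            continue  # spot missing or unusable in one of the conditions
        ratio = math.log2(p_intensity / n_intensity)
        if abs(ratio) >= log2_threshold:
            flagged[gene] = ratio
    return flagged


# Hypothetical intensities for three spots.
normal = {"geneA": 100.0, "geneB": 200.0, "geneC": 50.0}
pathologic = {"geneA": 400.0, "geneB": 210.0, "geneC": 12.5}
result = flag_abnormal_genes(normal, pathologic)
```

With these invented values, geneA (over-expressed) and geneC (under-expressed) are flagged, while geneB is not.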

2.2. Biological issues underlying data integration

To cope with the generated experimental data, biologists expect an integrated environment that captures and integrates their own raw data and any available information on the expressed genes, so that they can focus on analysis and scientific interpretation.

From the image delivered by the transcriptome experiment (Figure 1), the biologist may need to address biological questions such as:

- Which genes are expressed (or co-expressed) or have changed their expression under a pair of conditions or condition degrees?

- When dealing with time series, what is the biological causality of the relations between genes? Which genes regulate actively and which genes are actively regulated?

- What is the functional classification known for some genes of interest? Is there a correlation between known functional classes and clusters of expression levels? Are changes clustered in particular classes?

Biologists may also need to examine the sequences of the co-expressed genes thoroughly to discover new motifs. Because these genes share similar expression profiles, they must share transcription regulation mechanisms that include common transcription factors. In the next section, we illustrate the extraction of this part of the biological knowledge, which is available on genes in public sources. Biologists may also need to examine disease information and clinical follow-up thoroughly, in order to find correlations between expression patterns and to know whether particular mutant phenotypes correlate with expression patterns.

With respect to the integration of data of interest, biologists, and more specifically medical science and health researchers, need to focus on specific data, such as a given metabolism or pathology, and to compare them with several experiment results on the same genes and the same topic of interest. In fact, in addition to the data delivered by in-house experiments, they may wish to compare them with data issued by public transcriptome experiments [MIAME 01] with a closely related interest: same organ, same pathology, same species, etc. This list of user requirements is not exhaustive, but such requirements certainly lead to large data sets and extensive knowledge on genes, along with experiment results.

A selected part of this extremely rich knowledge available on the expressed genes needs to be integrated before analysis, for the sake of optimization. At present, biologists spend considerable time and effort searching heterogeneous sources and tools on the Internet to get this information. The challenge is how to automatically capture, organize and integrate the data of interest along with capitalized biological knowledge as an added value. Clearly, to address all the issues raised by biologists, the functionality of such an environment has to include signal processing, heterogeneous data management (capturing, integrating, clustering and mining) and synthetic result presentation.

2.3. Biological data exchange and interoperability for transcriptome analysis

Micro-array experiments are used to study the programmed expression of genes after exposure to different factors at various points within the time course. This means that several dozen or, more probably, several hundred micro-arrays have to be analyzed for a biomedical research project. As a micro-array generates thousands of data points, efficient "real-time" processing and management of the data is a major challenge. In addition to expression profiling, experiments involving two or more situations may be analyzed, and functional similarities are likely to reveal clusters of genes. This process requires computational techniques for clustering, predicting and visualizing patterns of gene expression. The development of micro-array technology allows systematic gene expression analysis of biological material and provides an integrated overview of the global response of the cell in terms of expression level. The derived data are of major importance for identifying the genes responsible for genetic variation within a range of factors.

A micro-array project requires the capture, management and integration of various types of genomic data that are useful for interpreting gene expression profiles. A standardization effort is currently under intensive development to provide a clear and controlled specification of genomic knowledge and to allow data exchange between biologists [MAM 01][OMG 01][MIAME 01]. The specification of a data warehouse system specifically meant for gene expression data seems as necessary as the adoption of these standards, such as LSR [OMG 01] and MAML/GEML [MAM 01], for the interoperation of biological data.

3. Biological data integration

3.1. Biological information sources

Searching across distributed, disparate biological databanks is increasingly difficult and time-consuming for biomedical researchers. Bioinformatics is coming to the forefront to address the problem of drawing information effectively and efficiently from a growing collection of 511 multiple and distributed databanks¹. Data describing genomic sequences are available in several public databanks via the Internet: banks for nucleic acids (DNA, RNA), banks for proteins (polypeptides, proteins) such as SWISS-PROT², and generalist or specialized databanks such as GenBank³, EMBL⁴ (European Molecular Biology Laboratory) and DDBJ (DNA DataBank of Japan).

¹ See the Public Catalog of Databases: http://www.infobiogen.fr/services/dbcat

² SWISS-PROT, http://www.expasy.ch/sprot/sprot-top.html

Databanks collect information about genomic sequences submitted from across the world by research laboratories and genome projects, and they exchange these data daily. Each record in a databank describes a sequence with several annotations (Figure 2a).

Figure 2a. Default format of the GenBank record for gene HFE (labeled parts: accession number, GI number, annotations)

Each record is identified by a unique accession number and may be retrieved by keywords. Annotations include the description of the sequence: its function (if it is known), its size, the species for which it has been determined, the related scientific publications (authors and references) and the description of the regions constituting the sequence (codon start, codon stop, introns, exons, ORF, etc.).

GenBank (with over 12.8 million records of different sequences in August 2001) [NAR 02] is one of the few banks that offer an XML format for their records, with a well-defined DTD that specifies the structure and the domain terminology for the records of genes and submitted sequences (Figure 2b).

⁴ EMBL (European Molecular Biology Laboratory), http://www.ebi.ac.uk/embl/

Figure 2b. The corresponding XML GenBank record for HFE gene

Several criteria may be used to classify biological information sources and determine the strategies to apply for integrating their data:

- the type of the source with regard to i) the refreshment process, ii) the query modes, iii) the data structure, and iv) the data management: some sources provide triggers or maintain a log that can be inspected, so that notifications on the changes of interest can be detected automatically. Other sources provide off-line periodic dumps, or snapshots of data, and changes are detected by comparing successive snapshots. The challenge is to compare very large database dumps, detecting the changes of interest in an efficient and scalable way.

- the coverage and the cross-referencing of the source,

- the quality of the source and the metadata management.
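The snapshot comparison mentioned in the first criterion can be sketched as follows. This is a minimal Python illustration that assumes each snapshot fits in memory as a dictionary keyed by accession number; the record texts, the accession numbers and the choice of SHA-1 fingerprints are hypothetical, and a real dump of millions of records would need a streaming or sorted-merge variant.

```python
import hashlib


def fingerprint(record_text):
    """Hash a record so that successive versions can be compared cheaply."""
    return hashlib.sha1(record_text.encode("utf-8")).hexdigest()


def diff_snapshots(old, new):
    """Compare two snapshots given as {accession number: record text}."""
    old_fp = {acc: fingerprint(text) for acc, text in old.items()}
    new_fp = {acc: fingerprint(text) for acc, text in new.items()}
    common = set(old_fp) & set(new_fp)
    return {
        "added": sorted(set(new_fp) - set(old_fp)),
        "removed": sorted(set(old_fp) - set(new_fp)),
        "modified": sorted(acc for acc in common if old_fp[acc] != new_fp[acc]),
    }


# Hypothetical snapshots of the same source at two dates.
old_snapshot = {"ACC001": "record v1", "ACC002": "record v1", "ACC003": "record v1"}
new_snapshot = {"ACC001": "record v1", "ACC002": "record v2", "ACC004": "record v1"}
changes = diff_snapshots(old_snapshot, new_snapshot)
```

Only the records listed under "added", "removed" and "modified" would then need to be passed on to the integration step.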

How and when data are populated into the local sources plays an important role in integrating local data. Source dependencies and data quality dimensions such as freshness, accuracy, completeness or coverage are crucial for integration and for the validation of query results by biologists. We argue that database techniques, such as an expressive internal data model and query language together with a meta-information repository and meta-information analysis techniques, constitute a necessary foundation for a biological data mediation system. Such a system should also take into account the quality dimensions of biological data, with operational techniques and tools to evaluate, control and improve the quality of public databanks.

3.2. Integration issues and functions

Various research problems arise when specifying a biological data warehouse system whose components must have the following interrelated functionalities:

Biological entity identification: the identification problem arises when data from different information sources related to the same entity have to be merged. Biological databanks may have inconsistent values in equivalent attributes of tuples referring to the same real-world object. They may also include mismatched attributes at different levels of abstraction, or a combination of both. For example, in the Genome DataBase (GDB), there were 9 record IDs for the same segment of chromosome 17: the same segment could be a clone, a marker, a genomic sequence, etc. Anyone is able to submit biological information to public databanks, through more or less formalized submission protocols that usually do not include name standardization or data quality controls. Erroneous data may easily be entered and cross-referenced. The available data sources may also have overlapping scopes with different levels of data quality. Even if some tools propose clusters of records that identify the same biological concept across different biological databanks as being semantically related, biologists must still validate the correctness of the clusters and resolve interpretation differences among the records.
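To make the identification problem concrete, the following Python sketch groups records from several sources under a normalized gene name and lists the attributes whose values disagree, so that a biologist can review them. The normalization rule, the record layout and the source names are assumptions chosen for the example; real databanks would need far richer matching than name normalization.

```python
from collections import defaultdict


def normalize(name):
    """Crude name normalization: upper case, alphanumeric characters only."""
    return "".join(ch for ch in name.upper() if ch.isalnum())


def cluster_records(records):
    """Group records that appear to describe the same gene."""
    clusters = defaultdict(list)
    for record in records:
        clusters[normalize(record["gene"])].append(record)
    return dict(clusters)


def conflicting_attributes(cluster, ignored=("source", "id", "gene")):
    """Return {attribute: set of values} where sources disagree."""
    conflicts = {}
    keys = set().union(*(record.keys() for record in cluster)) - set(ignored)
    for key in sorted(keys):
        values = {record[key] for record in cluster if key in record}
        if len(values) > 1:
            conflicts[key] = values
    return conflicts


# Hypothetical records from two hypothetical sources.
records = [
    {"source": "bank_a", "id": "A1", "gene": "HFE", "organism": "Homo sapiens", "chromosome": "6"},
    {"source": "bank_b", "id": "B7", "gene": "hfe ", "organism": "human", "chromosome": "6"},
]
clusters = cluster_records(records)
conflicts = conflicting_attributes(clusters["HFE"])
```

Here the two records fall into one cluster and the "organism" attribute is reported as conflicting; deciding which value to keep remains a task for the expert.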

Data translation: at the schema level, format heterogeneity makes it necessary to transform data so that they conform to the data model used by the biologist's warehousing system. In our case, the information sources consist of a set of flat files (XML, HTML) and the data model is object-oriented. This translation problem is inherent in almost all data integration approaches, but it becomes much more complex in the biological domain, because the potentially different (and not yet formalized) biological interpretations and the current state of knowledge (involving the metabolic, chemical or functional views on a biological concept) reinforce the perpetual schema evolution of the integrated model.

Data cleaning: at the instance level, it is necessary to transform the data to be imported before integrating them into the warehouse, in order to prevent different representations or interpretations of values, different aggregation levels or reference points, duplicates and contradictory values. Transformations might include, for example, aggregating or summarizing the data, sampling the data to reduce the size of the warehouse, discarding or correcting data suspected of being erroneous, inserting default values, or eliminating duplicate, inconsistent and outdated information. This step may intensively involve biologists, for their expertise and for reaching a consensus. A feature of the data warehouse is that it may contain historical information even when that information is not maintained in the sources. Techniques are needed here to ensure that outdated information is automatically and efficiently purged from the warehouse. However, up-to-date data are not necessarily the most accurate or the most complete. One user might be interested in the most complete data, and another in the most accurate data. In both cases, it should be possible for these users to specify tolerance thresholds for data and source quality (or, at least, to be given the technical means to estimate the quality of query results from the different sources).
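The idea of user-specified tolerance thresholds can be sketched as a simple filter. In the Python example below, the quality dimensions are scored between 0 and 1; this scoring scale, the record identifiers and the scores themselves are assumptions introduced for illustration, since the text does not say how quality would be measured.

```python
from dataclasses import dataclass


@dataclass
class QualityScores:
    """Hypothetical quality scores of a record, each between 0 and 1."""
    accuracy: float
    completeness: float
    freshness: float


def meets_thresholds(scores, thresholds):
    """True if every dimension named in thresholds reaches its minimum."""
    return all(getattr(scores, dim) >= minimum for dim, minimum in thresholds.items())


def select_records(records, thresholds):
    """Keep the identifiers of records that satisfy the user's thresholds."""
    return [rec_id for rec_id, scores in records.items() if meets_thresholds(scores, thresholds)]


records = {
    "rec1": QualityScores(accuracy=0.9, completeness=0.4, freshness=0.8),
    "rec2": QualityScores(accuracy=0.6, completeness=0.95, freshness=0.7),
    "rec3": QualityScores(accuracy=0.95, completeness=0.9, freshness=0.2),
}
# One user favors accuracy, another favors completeness.
accurate = select_records(records, {"accuracy": 0.85})
complete = select_records(records, {"completeness": 0.85})
```

The same set of records thus yields different answers for the two users, which is the behavior the paragraph above asks for.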

Source tracing and monitoring: the warehousing architecture must gracefully handle changes to the information sources: schema changes, as well as the addition of new information sources and the removal of old ones. In addition, it is likely that biologists will demand schema changes at the warehouse itself. All of these changes should be handled with as few disruptions or modifications to the other components of the warehousing system as possible. Monitoring an information source consists of detecting modifications of the data that are relevant to the warehouse and of propagating those changes to the data warehouse integration module. One approach is to ignore the change detection issue altogether: the refreshment then simply consists of periodically propagating entire copies of the relevant data from the information source to the warehouse. The main problem with the main public biological databanks is that the update periodicity varies from bank to bank, and that updates may potentially concern around 10 million records per bank every day. The integration module can combine the data with existing warehouse data from other sources, or it may request complete information from all sources and recompute the warehouse data from scratch. Ignoring change detection may be acceptable in certain scenarios, for example when it is not important for the warehouse data to be current and it is acceptable for the warehouse to be off-line occasionally. However, in our context, currency, efficiency and continuous access to public databanks are required; we therefore believe that detecting and propagating changes and incrementally folding them into the warehouse is the preferred, albeit costly, solution.
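The two refreshment strategies discussed above can be contrasted in a few lines of Python. The dictionary-based warehouse, the accession numbers and the way changes are delivered (a set of upserts and a list of deletions) are simplifying assumptions for this sketch; they say nothing about how Gedaw stores or refreshes its objects.

```python
def full_refresh(source_snapshot):
    """Ignore change detection: recompute the warehouse from a full copy."""
    return dict(source_snapshot)


def incremental_refresh(warehouse, upserts, deletions):
    """Fold only the detected changes into the existing warehouse."""
    updated = dict(warehouse)
    updated.update(upserts)          # new and modified records
    for accession in deletions:
        updated.pop(accession, None)  # records removed at the source
    return updated


# Hypothetical warehouse content and source state.
warehouse = {"ACC001": "v1", "ACC002": "v1", "ACC003": "v1"}
source_now = {"ACC001": "v1", "ACC002": "v2", "ACC004": "v1"}

by_copy = full_refresh(source_now)
by_delta = incremental_refresh(
    warehouse,
    upserts={"ACC002": "v2", "ACC004": "v1"},
    deletions=["ACC003"],
)
```

Both strategies reach the same state; the incremental one touches three records instead of copying the whole source, at the price of having to detect the changes first.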

Selective updating and views: all data modifications at a source that may be relevant to the warehouse have to be propagated to the integration module. The biological data warehouse may contain multiple views, for example to support different types of analysis. When these views are related to each other, e.g., if they are defined over overlapping portions of the database, it may be more efficient not to materialize all of the views, but rather to materialize certain shared "subviews" or portions of the database, from which the warehouse views can be derived.

Role of mediators and wrappers: traditionally, the role of mediators is mainly (i) to determine and select relevant resources, optimizing the access strategy to provide low cost or short response times, (ii) to resolve differences in domain terminology and ontology and mismatches in scope, and (iii) to eliminate duplicate information. The main role of wrappers is to address the change detection problem with cooperative or logged sources. In the context of biological databanks, sources are usually not cooperative and their logs cannot be queried or inspected: a query-based approach is necessary to compare old and new data. It is undesirable to hard-code a wrapper-monitor for each biological information source participating in the warehousing system, especially if new information sources become available frequently and if the schema evolution of biological sources is coupled with the management of biological knowledge, which evolves constantly. Since the classical functionality of wrappers and monitors depends on the type and the evolution of the source, as well as on the data provided by that source, a significant research issue is to develop techniques and tools that automate or semi-automate the implementation of specific wrappers and mediators for biological data integration, through a toolkit or specification-based approach.
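A specification-based approach can be pictured as a single generic wrapper driven by one declarative description per source, instead of one hard-coded wrapper per source. The Python sketch below is only meant to convey that idea: the source names, field names and warehouse attributes are all hypothetical.

```python
# One declarative specification per source:
# warehouse attribute -> (field name in the source, conversion function).
SOURCE_SPECS = {
    "bank_a": {"accession": ("ACCESSION", str), "length": ("SIZE", int)},
    "bank_b": {"accession": ("acc", str), "length": ("seq_len", int)},
}


def wrap(source_name, raw_record, specs=SOURCE_SPECS):
    """Translate a raw source record into warehouse attributes."""
    spec = specs[source_name]
    return {
        attribute: convert(raw_record[field])
        for attribute, (field, convert) in spec.items()
    }


# Two hypothetical raw records describing the same sequence.
from_a = wrap("bank_a", {"ACCESSION": "ACC001", "SIZE": "3043"})
from_b = wrap("bank_b", {"acc": "ACC001", "seq_len": 3043})
```

Adding a new source, or following a schema change in an existing one, then amounts to editing an entry in the specification table rather than writing new wrapper code.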

3.3. Gedaw: An object-oriented environment for integrating transcriptome data

Considering the integration issues described above, a specific data warehouse called Gedaw (Gene Expression DAta Warehouse) has been designed for integrating and managing: i) data on the expressed genes that are produced in public databanks and the literature, ii) the raw data produced by in-house or external experiments (micro-array, PCR or SAGE experiments) and iii) complementary clinical data. This environment aims to support complex analyses of such data and is intended to be a unified infrastructure for the different bioinformatics technologies used to measure gene expression (Figure 3).

Gedaw has been implemented using the object DBMS POET 6.0. POET provides a direct description of the conceptual models of genes and their expression, with Java (or C++) as a binding language. Objects are then made persistent in an object database, which is the central element of Gedaw. Data representation for browsing and user interface programming is quite intuitive.

To date, the warehouse integrates experimental data on selected genes (2000 deposited sequences) expressed in three models of iron overload in the hepatocyte (liver cell). The database includes two parts of the knowledge, representing the transcriptome data and the sequence data of the expressed genes. A full description of the conceptual schema is given in Section 4.

Figure 3. Gedaw: A unified infrastructure for the management of heterogeneous data on gene expression (components: experimental data from micro-arrays, SAGE, PCR and clinical data; public sources GenBank, EMBL, SWISS-PROT; bibliographic sources, Medline; integration by data cleaning, change detection and data translation into the Gene Expression Data Warehouse; analysis)

We have been able to compare data coming from different transcriptome experiments on different samples, and to answer queries using OQL and the Java API in order to find genes that are co-expressed in different situations or different time courses.

To conduct more specific analyses on the co-expressed genes, the Gedaw database has been populated with related gene sequences from public data sources such as GenBank, so that known sequence search tools for new motifs can be applied to the extracted sequences.

A screenshot is given in Figure 4 for a query that browses genes with expression levels between a given minimum and maximum measure, over three different experiments. By clicking on the GenBank accession numbers, users may either view the sequence information directly at the source, or access the full sequence and the sequence component data in the database to conduct more complex analyses on them.
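The kind of range query described above can be sketched in Python as follows. Gedaw itself answers such queries with OQL and the Java API, so this is only a stand-in that shows the selection logic; the gene identifiers, experiment names and expression values are invented for the example.

```python
def genes_in_range(levels, experiments, minimum, maximum):
    """Return the genes whose level lies in [minimum, maximum] in every experiment.

    levels: {gene: {experiment name: expression level}}
    """
    selected = []
    for gene, by_experiment in levels.items():
        if all(
            exp in by_experiment and minimum <= by_experiment[exp] <= maximum
            for exp in experiments
        ):
            selected.append(gene)
    return sorted(selected)


# Hypothetical expression levels over three experiments.
levels = {
    "gene1": {"exp1": 2.1, "exp2": 2.5, "exp3": 3.0},
    "gene2": {"exp1": 0.4, "exp2": 2.2, "exp3": 2.8},
    "gene3": {"exp1": 2.0, "exp2": 3.9},
}
hits = genes_in_range(levels, ["exp1", "exp2", "exp3"], 2.0, 4.0)
```

In this example only gene1 is returned: gene2 falls below the minimum in one experiment and gene3 has no measurement in the third.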

Figure 4. Browsing genes and expression levels

4. The ontological approach for biological data integration

4.1. Modeling Biological Data

As well as having access to sequence-related data on each gene, biologists may have access to its expression levels in a range of experiments, each conducted under a certain condition. Other specific data could be added to the core gene conceptual schema (gene interactions, gene variations).

Figure 5a graphically presents the simplified biological mechanisms involving the gene and the basic components we had to model. A gene consists of a succession of fragments that are either transcribed and then translated into proteins, which do most of the work in the cell, or not transcribed but have other functions related to regulation or termination. While introns (in yellow) are spliced out, exons (in green) are joined to form a transcript, which leads to messenger RNA (or other mature RNA), a polypeptide chain and, finally, a functional protein.

The corresponding gene transcription and translation scenarios are represented in our conceptual data model with UML formalism in Figure 5b.

Figure 5a. Simplified biological mechanisms of the gene (labels: GENE with promoter, exon1, intron, exon2, intron, exon3 and terminator; transcribed regions; precursor mRNA with exon1, intron, exon2, intron, exon3; mature mRNA with 5'UTR, exon1, exon2, exon3 and 3'UTR; ORF translated into POLYPEPTIDE; FUNCTIONAL PROTEIN)

The conceptual model in Figure 5b captures expression data across all the genes in one or many experiments. This is a new category of information, since it is experimental. To relate it to the gene data model, an experiment has a collection of measurement points, each of which associates a collection of expression levels with a description of the conditions under which the levels were recorded. An expression level is the quantity of mRNA of a particular gene expressed at the time of the measurement point.

Note that the mRNA class is common to the two domains (Gene and Experiment in Figure 5b) and is a central concept in our warehouse. We refer to this object either to obtain data on its expression levels in different situations or to obtain its detailed sequence (if available) from public databanks.
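The relationships described above (an experiment has measurement points, each measurement point groups expression levels recorded under given conditions, and each expression level refers to an mRNA) can be sketched with Python dataclasses. Gedaw implements its model as persistent classes in an object database with a Java binding; the class and attribute names below are assumptions that only mirror the description in the text.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(frozen=True)
class MRNA:
    """Concept shared by the Gene and Experiment domains."""
    accession: str
    sequence: Optional[str] = None  # filled in when available from a databank


@dataclass
class ExpressionLevel:
    mrna: MRNA
    quantity: float  # mRNA quantity at the time of the measurement point


@dataclass
class MeasurementPoint:
    conditions: str
    levels: List[ExpressionLevel] = field(default_factory=list)


@dataclass
class Experiment:
    name: str
    points: List[MeasurementPoint] = field(default_factory=list)

    def levels_of(self, mrna):
        """Return (conditions, quantity) pairs recorded for one mRNA."""
        return [
            (point.conditions, level.quantity)
            for point in self.points
            for level in point.levels
            if level.mrna == mrna
        ]


# Hypothetical experiment with two measurement points.
m1, m2 = MRNA("mrna-1"), MRNA("mrna-2")
experiment = Experiment("iron-overload-model", [
    MeasurementPoint("normal", [ExpressionLevel(m1, 1.0), ExpressionLevel(m2, 3.0)]),
    MeasurementPoint("pathologic", [ExpressionLevel(m1, 4.0)]),
])
profile = experiment.levels_of(m1)
```

Starting from one mRNA object, the same structure gives access both to its expression profile and, through the optional sequence attribute, to its sequence data.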

Figure 5b. Conceptual data model in UML (Experiment Domain and Gene Domain)

4.2. Specification of the bio-ontology for transcriptome study

In molecular biology, providing a clear specification of the structure of the information produced in a field is an essential means of exchanging knowledge between biologists. In this discipline, current knowledge is the essential engine of new discoveries. Researchers must share their knowledge effectively and, at a higher level, the associated ontologies.

More precisely, a successful exchange and integration of information depends on a shared language for communication (a terminology) and a shared understanding of what the data means (an ontology).

As an exchange format, we have been using XML for integrating data coming from semi-structured data sources into a unified data description model that we call bio-ontology. A bio-ontology is used as a standard terminology to share, in an unambiguous way, biological knowledge of any kind and from any source [ELL 98]. A bio-ontology has been defined for Gedaw [GUE 00][PAT 00][MOU 99]; it is described in XML format and is associated with the conceptual data model of Gedaw. The model describes the Gene and Experiment domains of the transcriptome.

4.3. Specification of mapping rules

In order to define an appropriate aggregation of all the available information items, data conflicts have to be resolved using rules for mapping the sources’ records and reconciling the different values recorded for the same concept.

Traditional data integration approaches suggest a conflict resolution method that chooses one value over the others. A global query on a gene sequence would retrieve one data value from the sources according to specified data integration rules. This theoretical approach is clearly difficult to implement in the biological context, because biologists’ expertise plays an important part in the process: data integration rules change with the biologist’s interpretation and focus, and also with the current state of scientific knowledge on a particular biological phenomenon. Because our software environment is designed to provide exhaustive information on the genes studied in transcriptome experiments, it queries different existing data sources (GenBank, EMBL and SWISS-PROT) to retrieve any information on genes, and related entities, that play a prominent part in hepatic pathologies. The experimental raw values and public information items are automatically extracted by scripts that use the DTD (Document Type Definition) of each data source, translated into the Gedaw conceptual data model. Correspondence rules are defined to allow data exchange from the public databanks (such as GenBank) to the Gedaw data warehouse, and are used for the automatic import of values (Figure 6: Mapping the GenBank DTD and the gene data model of Gedaw).


<!ELEMENT Bioseq ( Bioseq_id , Bioseq_descr? , Bioseq_inst , Bioseq_annot? )>
<!ELEMENT Bioseq_id ( Bioseq_id__E+ )>
<!ELEMENT Bioseq_descr ( Seq-descr )>
<!ELEMENT Bioseq_inst ( Seq-inst )>
<!ELEMENT Bioseq_annot ( Seq-annot* )>
<!ELEMENT Seq-descr ( Seqdesc+ )>
<!ELEMENT Seqdesc ( Seqdesc_mol-type | … Seqdesc_title | … Seqdesc_molinfo )>
<!ELEMENT Seqdesc_title ( #PCDATA )>
<!ELEMENT MolInfo ( MolInfo_biomol? , MolInfo_tech? , MolInfo_techexp? , MolInfo_completeness? )>
<!ELEMENT MolInfo_biomol ( %INTEGER; )>
<!ATTLIST MolInfo_biomol value (

[Diagram labels: GenBank DTD; Gene Data Model of Gedaw; Mapping rule R1; Mapping rule R2]

Figure 6. Correspondences between GenBank DTD and Gedaw data model

Rules can be classified into three categories, according to the knowledge level and the perspective required for biological interpretation: structural mapping rules, semantic mapping rules and cognitive mapping rules. Each category includes mapping rules at:

- the schema level: for reconciling naming conflicts and structural conflicts between the information sources (mainly DTD elements) and our conceptual model;

- the instance level: for reconciling different representations or interpretations of values and different aggregation levels or reference points, and for eliminating duplicates and contradictory values.


The structural mapping rules have been defined with respect to the Gedaw model by identifying the correspondences with the relevant DTD elements of each source. For example, the Seqdesc_title element of the GenBank DTD is used to extract the name ("nom") of the gene, and the MolInfo_biomol value gives its type of molecule, through mapping rules R1 and R2 respectively (Figure 6). The records of interest are then selectively structured and the data are extracted.
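As an illustration only, the two structural rules named above can be sketched as a small rule table applied to an XML record. This is not the Gedaw implementation: the attribute name `molecule_type`, the function `apply_structural_rules` and the sample record are assumptions made for the example; only the element names Seqdesc_title and MolInfo_biomol and the attribute "nom" come from the text.

```python
import xml.etree.ElementTree as ET

# Hypothetical rule table: Gedaw attribute -> path of the source element.
# Only rules R1 and R2 are taken from the paper (Figure 6).
STRUCTURAL_RULES = {
    "nom": ".//Seqdesc_title",             # rule R1: name of the gene
    "molecule_type": ".//MolInfo_biomol",  # rule R2: type of molecule
}


def apply_structural_rules(record_xml, rules=STRUCTURAL_RULES):
    """Return {attribute: value} for every rule whose element is present."""
    root = ET.fromstring(record_xml)
    extracted = {}
    for attribute, path in rules.items():
        element = root.find(path)
        if element is None:
            continue  # many elements are optional in the DTD
        # Prefer the 'value' attribute when the element declares one,
        # otherwise fall back to the text content.
        extracted[attribute] = element.get("value") or (element.text or "").strip()
    return extracted


# Invented sample record, shaped after the DTD fragment of Figure 6.
SAMPLE = """
<Bioseq>
  <Bioseq_descr><Seq-descr>
    <Seqdesc><Seqdesc_title>Homo sapiens hemochromatosis (HFE), mRNA</Seqdesc_title></Seqdesc>
    <Seqdesc><Seqdesc_molinfo><MolInfo>
      <MolInfo_biomol value="mRNA">3</MolInfo_biomol>
    </MolInfo></Seqdesc_molinfo></Seqdesc>
  </Seq-descr></Bioseq_descr>
</Bioseq>
"""

if __name__ == "__main__":
    print(apply_structural_rules(SAMPLE))
```

Keeping the rules in a table rather than in code is one way to let them evolve with the biologist's interpretation, as the section argues they must.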

For example, let us consider the three distinct, selectively structured records that we may obtain from the EMBL and GenBank databanks when querying the same gene, HFE (see Figure 7). The first record, identified by the accession number AF204869, describes a partial sequence (size = 3043) of the HFE gene with no annotation except one relevant information item: the position of the promoter region. The second record, identified by the accession number AF184234, describes a partial sequence (size = 772) of the protein precursor of the HFE gene with a detailed but incomplete annotation. The third record, identified by the accession number Z92910, describes the complete sequence (size = 12146) of the HFE gene with a complete annotation.

Semantic and cognitive mapping rules are then used for data reconciliation at the instance level. Several rules may rely on available tools to determine analogies between homologous data, such as sequence alignment: the result of the BLAST algorithm (Basic Local Alignment Search Tool, implemented in a set of similarity search programs) makes it possible to decide whether two sequences match. The scores assigned in a BLAST search have a well-defined statistical interpretation, which makes real matches easier to distinguish from random background hits. BLAST is therefore able to detect relationships among sequences that share only isolated regions of similarity.

In our example, BLAST(sequence(Z92910), sequence(AF184234)) = 100% indicates that the sequences in both records are perfectly homologous and can be merged.

Other mapping rules are used in this example to reconcile data, such as:

- Descriptive Inclusion: record(Z92910) contains record(AF184234)

- Position Offset: position(Z92910.exon) = 6364 + position(AF184234.exon)
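The two instance-level rules can be illustrated with a short sketch. It is an assumption-laden simplification: exact substring search stands in for a BLAST alignment (it is not BLAST and tolerates no mismatch), the toy sequences are invented, and the function names are hypothetical. Only the offset 6364 and the exon positions of AF184234 are taken from the example.

```python
def find_offset(complete_seq, partial_seq):
    """Descriptive inclusion, naive version: return the offset at which
    partial_seq occurs inside complete_seq, or None if it is not contained.

    With 1-based sequence positions, position p in the partial sequence
    corresponds to position offset + p in the complete sequence.
    """
    index = complete_seq.find(partial_seq)
    return None if index < 0 else index


def shift_interval(interval, offset):
    """Position offset rule: translate a (start, end) interval by offset."""
    start, end = interval
    return (start + offset, end + offset)


# Toy sequences (invented) to show how an offset could be derived.
toy_complete = "ggatccctcctgtagc"
toy_partial = "ctcctg"
toy_offset = find_offset(toy_complete, toy_partial)

# Offset stated in the paper for the pair (Z92910, AF184234).
OFFSET_Z92910 = 6364
exons_af184234 = {4: (130, 405), 5: (564, 677)}  # positions from Figure 7
exons_in_z92910 = {
    number: shift_interval(interval, OFFSET_Z92910)
    for number, interval in exons_af184234.items()
}

if __name__ == "__main__":
    print(toy_offset)       # 6
    print(exons_in_z92910)  # {4: (6494, 6769), 5: (6928, 7041)}
```

Once the exon positions of the partial record are expressed in the coordinates of the complete record, the annotations of both records can be compared and merged.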


<description>Homo sapiens hemochromatosis protein (HFE) gene, promoter region and partial sequence</description>
<protein><name>hemochromatosis protein</name></protein>
<nontranscribed_fragment>
<promotor><position>1..>3043</position></promotor>
<sequence>ggtacctgta … cagagctggg gaa</sequence>
</nontranscribed_fragment>
</gene>

Relevant information item
No annotation

<gene AC="AF184234" SOURCE="GenBank"><name>HFE</name>
<description>Homo sapiens hereditary haemochromatosis protein precursor (HFE) gene, partial cds</description>
<transcribed_fragment><position><1..>772</position>
<exon number="4"><position>130..405</position></exon>
<mRNA><position>join(<130..405,564..>677)</position></mRNA>
<primary_polypeptide>hereditary haemochromatosis protein precursor</primary_polypeptide>
<protein AC="AAF01222.1"><name>type I membrane protein; HLA-H</name></protein>
<exon number="5"><position>564..677</position></exon>
<sequence>ctcctgtagc ttgttttttt … tgggaagcat tt</sequence>
</transcribed_fragment>
</gene>

Incomplete annotation

<gene AC="Z92910" SOURCE="EMBL">
<name>Homo sapiens HFE gene</name>
<description>haemochromatosis; HFE gene</description>
<nontranscribed_fragment>polyA_signal<position>10617..10622</position>
<sequence>ggatccttta...atttgtacat taaaagtttt ccatgg</sequence>
</gene>

Complete annotation


5. Detailed Integration Processing in Gedaw

The application, developed in Java, is a software layer on top of Gedaw's object-oriented database. The processing of genomic data breaks down into the following four steps (Figure 8):

1. Coordinating: the script starts the application. It looks up in the data warehouse the accession numbers of the incomplete objects of interest (e.g. mRNA objects with missing information),

2. Importing: for each accession number, the script imports the corresponding descriptive record from the GenBank databank in XML format,

3. Cleaning: the script parses the XML files, extracts the information to be integrated into the warehouse, and writes it to a text file,

4. Storing: the script parses the text file containing the extracted information and ensures that the data are stored and made persistent in the POET object-oriented database of Gedaw.

5.1. Coordinating

The first step of the genomic data processing opens a session on the object-oriented database and uses OQL queries to select the objects whose attribute values are missing (e.g. mRNA in Figure 8).
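The selection performed in this step can be pictured with a minimal sketch. It is only a stand-in: Gedaw runs OQL queries against the POET database, whereas the code below filters plain in-memory objects. The class `MRNA`, its attributes and the accession numbers are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class MRNA:
    """Hypothetical, simplified stand-in for a persistent mRNA object."""
    accession: str
    definition: Optional[str] = None
    organism: Optional[str] = None
    chromosome: Optional[str] = None


def missing_attributes(obj):
    """Names of the attributes that have no value yet."""
    return [f.name for f in fields(obj) if getattr(obj, f.name) is None]


def incomplete_accessions(objects):
    """Accession numbers of the objects that need to be completed."""
    return [obj.accession for obj in objects if missing_attributes(obj)]


# Invented warehouse content: one complete and one incomplete object.
warehouse = [
    MRNA("ACC0001", "example definition", "Homo sapiens", "6"),
    MRNA("ACC0002", organism="Homo sapiens"),
]

if __name__ == "__main__":
    print(incomplete_accessions(warehouse))
```

The resulting list of accession numbers is what the importing step would take as input.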

5.2. Data importing

This step opens a first connection to the GenBank site, using the URL and the corresponding accession number. The GI number is then extracted (GI is the identification number of the sequence in the databank, see Figure 2a). The GI number appears in the URL of the complete record describing the sequence. A second connection to the GenBank site is then established to reach, with the GI number, the complete record, which is imported locally in XML (Figure 2b).

5.3. Semi-structured objects cleaning

The cleaning script uses the XML QueryEngine component to search the XML documents for the elements of interest: one query is built for each relevant element to extract. Each query declares the path that leads to the relevant element, using filters and selection operators. The tags defined by GenBank have a recursive structure; the depth of the same element may therefore vary from one record to another, according to the complexity of the sequence.


Figure 8. Data Integration Processing

A compromise thus had to be found between generic paths, which work regardless of the complexity of the records, and specific paths, which retain the relevant tag among several tags that have the same name but different semantics. XML QueryEngine can deliver the result of a query in three presentation formats. The 'StandardListener' format was selected to deliver the content of the retrieved elements together with the corresponding tags. The cleaning step (i) creates a new XMLEngine object, (ii) submits the query (expressed as a path) to the XMLEngine object and (iii) redirects the query result to a file (Figure 8). If an element is found, it is presented in the form:

<name of the last tag of the path>content</name of the last tag of the path>

Figure 9 shows the result of three queries: the definition of the mRNA (tag: <Seqdesc_title>), the organism (tag: <Org-ref_taxname>) and the chromosome number (tag: <SubSource_name>).

<?xml version="1.0"?>
<xql:result query="//Seq-entry_seq/Bioseq/Bioseq_descr/Seq-descr/Seqdesc/Seqdesc_title" hitCount="1" elemCount="1" docCount="1" xmlns:xql="http://www.fatdog.com/Standard_Listener.html">
<Seqdesc_title>Homo sapiens hemochromatosis (HFE), mRNA</Seqdesc_title>
</xql:result>
<?xml version="1.0"?>
<xql:result query="//Seq-descr/Seqdesc/Seqdesc_source/BioSource/BioSource_org/Org-ref/Org-ref_taxname" hitCount="1" elemCount="1" docCount="1" xmlns:xql="http://www.fatdog.com/Standard_Listener.html">
<Org-ref_taxname>Homo sapiens</Org-ref_taxname>
</xql:result>
<?xml version="1.0"?>
<xql:result query="//Seq-descr/Seqdesc/Seqdesc_source/BioSource/BioSource_subtype/SubSource[./SubSource_subtype/@value=' chromosome']/SubSource_name" hitCount="1" elemCount="1" docCount="1" xmlns:xql="http://www.fatdog.com/Standard_Listener.html">
<SubSource_name>6</SubSource_name>

Figure 9. Extract of the resulting file produced by the cleaning step
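For readers without access to XML QueryEngine, the three queries of Figure 9 can be approximated with the Python standard library. This is a sketch under assumptions, not the authors' code: `xml.etree.ElementTree` supports only a subset of XPath, so the filter on the SubSource_subtype attribute is applied in Python rather than in the path, and the sample record and helper names are invented.

```python
import xml.etree.ElementTree as ET

# Invented record that follows the element paths quoted in Figure 9.
RECORD = """
<Seq-entry><Seq-entry_seq><Bioseq><Bioseq_descr><Seq-descr>
  <Seqdesc><Seqdesc_title>Homo sapiens hemochromatosis (HFE), mRNA</Seqdesc_title></Seqdesc>
  <Seqdesc><Seqdesc_source><BioSource>
    <BioSource_org><Org-ref><Org-ref_taxname>Homo sapiens</Org-ref_taxname></Org-ref></BioSource_org>
    <BioSource_subtype>
      <SubSource><SubSource_subtype value="map"/><SubSource_name>example-map</SubSource_name></SubSource>
      <SubSource><SubSource_subtype value="chromosome"/><SubSource_name>6</SubSource_name></SubSource>
    </BioSource_subtype>
  </BioSource></Seqdesc_source></Seqdesc>
</Seq-descr></Bioseq_descr></Bioseq></Seq-entry_seq></Seq-entry>
"""


def first_text(root, path):
    """Text of the first element matching path, or None."""
    element = root.find(path)
    if element is None or element.text is None:
        return None
    return element.text.strip()


def subsource_name(root, subtype):
    """Name of the SubSource whose SubSource_subtype has the given value.

    Several SubSource elements share the same tag names but carry different
    semantics, so the 'specific' part of the path is the subtype filter.
    """
    for sub in root.iterfind(".//BioSource_subtype/SubSource"):
        kind = sub.find("SubSource_subtype")
        if kind is not None and kind.get("value") == subtype:
            name = sub.find("SubSource_name")
            return None if name is None else (name.text or "").strip()
    return None


root = ET.fromstring(RECORD)
title = first_text(root, ".//Seq-entry_seq/Bioseq/Bioseq_descr/Seq-descr/Seqdesc/Seqdesc_title")
taxname = first_text(root, ".//Seq-descr/Seqdesc/Seqdesc_source/BioSource/BioSource_org/Org-ref/Org-ref_taxname")
chromosome = subsource_name(root, "chromosome")

if __name__ == "__main__":
    print(title, taxname, chromosome, sep="\n")
```

The first two paths are generic (the leading `.//` absorbs a variable nesting depth), while the third is specific: without the subtype filter it would return the first SubSource_name, whatever its meaning.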

Lastly, the file is transformed into the format expected by the next processing step, in which the loader class instantiates the objects in the database.

5.4. Data loading

This stage parses the clean file by searching for the character strings that correspond to specific tags. When an information item exists, it is extracted, and the classes of the Gedaw data model are instantiated and/or the values are assigned to the attributes of the objects. Lastly, the stage ensures the persistence of the data in the database.
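The loading logic described above can be sketched as follows. The sketch makes several assumptions: a dictionary stands in for the POET database, the class `MRNA`, the tag-to-attribute table and the accession number are hypothetical, and the clean file is assumed to contain elements of the form `<tag>content</tag>` as in Figure 9.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical correspondence between tags of the clean file and attributes.
TAG_TO_ATTRIBUTE = {
    "Seqdesc_title": "definition",
    "Org-ref_taxname": "organism",
    "SubSource_name": "chromosome",
}

# Matches <tag>content</tag> for tags without attributes.
ELEMENT = re.compile(r"<([\w.-]+)>([^<]*)</\1>")


@dataclass
class MRNA:
    """Hypothetical, simplified stand-in for a persistent mRNA object."""
    accession: str
    definition: Optional[str] = None
    organism: Optional[str] = None
    chromosome: Optional[str] = None


def load(accession, clean_text, store):
    """Create or update the object for accession from the clean file text."""
    obj = store.setdefault(accession, MRNA(accession))
    for tag, content in ELEMENT.findall(clean_text):
        attribute = TAG_TO_ATTRIBUTE.get(tag)
        value = content.strip()
        if attribute and value:  # assign only when the information item exists
            setattr(obj, attribute, value)
    return obj


CLEAN_FILE = """
<Seqdesc_title>Homo sapiens hemochromatosis (HFE), mRNA</Seqdesc_title>
<Org-ref_taxname>Homo sapiens</Org-ref_taxname>
<SubSource_name>6</SubSource_name>
"""

store = {}  # dictionary used in place of the object-oriented database
loaded = load("ACC0001", CLEAN_FILE, store)

if __name__ == "__main__":
    print(loaded)
```

Tags that are absent from the clean file simply leave the corresponding attribute empty, so the object remains a candidate for the next run of the coordinating step.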


5.5. Objects exploring

New object instances and new values of attributes can be visualized by navigating within the POET object-oriented database (Figure 10).


[Figure 10. Annotations recoverable from the figure: classes hierarchy; stored objects (mRNA, ORF, UTR3, UTR5 and a Polypeptide); description of the mRNA object (class, identity, values of attributes); complete description of the ORF; list of mRNA fragment objects (FragARNm_), with ORF, UTR5 and UTR3; activating listeFragARNm_ displays each mRNA fragment.]


6. Related Works

Among the research proposals for classical mediation systems (such as Information Manifold [ORD 96, LEV 95], SIMS with the LOOM language [CHE 93], Context Interchange [GOH 94] and Garlic [CAR 95]) based on a standard mediator-wrapper architecture [WIE 95][FER 98], the domain-model-based integration approach, which relies on a common domain model described at the mediator level, seems to be the best suited to biological data integration. The domain model (or metadata schema) captures the basic vocabulary used to describe the information held in the multiple databases. The mediator is used to resolve semantic conflicts among independent information sources, and the knowledge necessary for this resolution is usually stored as shared ontologies. However, an important condition for integrating a new information source is that an exhaustive description of its structure be provided in terms of the domain model. The independence of the resource-to-resource and mediator-to-resource description models is another essential constraint [HUL 97].

In the context of biological databanks, these conditions cannot be met: many sources depend on one another and offer redundant information items as a result of partial information exchanges and translations.

Shared ontologies are used to reconcile data conflicts, even though the formalization of the biological and genomic domains is still under intensive development [ASH 00][ASH 98]. Another difficulty is that many data sources do not represent biological objects optimally for the kinds of queries that investigators typically want to pose (e.g. GenBank is sequence-centric but not gene-centric, SwissProt is sequence-centric but not domain-centric) [SIE 01]. Biological data sources differ in their representation of key concepts, and each has its own ontology. Integrated queries must operate on the most up-to-date versions of the data sources in order to avoid being "scooped" on important biological discoveries. The results of applications such as BLAST must also be integrated with the large variety of sequence annotations found in the data sources. However, wide-ranging multi-source queries, particularly those containing joins on the results of BLAST searches, often return unmanageably large result sets, which require non-traditional methods to identify and exclude extraneous data. These obstacles may explain the difficulty of biological data integration and the small number of research proposals in this specific domain [MOU 99][SIE 01][ECK 01].

7. Conclusion

The Gedaw application presented in this paper allows the massive import of biological data into an object-oriented data warehouse dedicated to transcriptome analysis for the human liver. The sequence data are extracted from public databanks and integrated with the experimental data in the warehouse. The data model defined for the genes involved in transcriptome experiments is a conceptual description that is faithful to biological knowledge. It was worked out to take into account the immediate needs of the biomedical researchers of the INSERM U522 research unit. It can be extended to new concepts (functional classifications, protein/protein interactions, clinical hypotheses, etc.) without calling the application into question. The descriptive richness and the flexibility of the object-oriented approach are two advantages of the proposed model, which make it possible to build evolvable bioinformatics applications.

Our interest in the conceptual decomposition of the genomic sequence is to isolate the distinct regions of genes so that they can be treated individually and correlated with the raw experimental data and with information derived from the literature. This decomposition, which is close to the biologist's approach, respects the principle that each region of a gene must first be individualized because it has a precise biological function.

There is no doubt that the development of a universal terminology or ontology is a serious and expensive undertaking. In this context, the XML format is a powerful means for data retrieval and exchange, and for developing effective techniques for recognizing and extracting biological information. Other functionalities, which relate to data refreshment, processing and cleaning, are currently in progress and were not discussed in this paper. Another aspect under development relates to the reconciliation and unification of heterogeneous data through knowledge acquisition and the formalization of expertise rules for mapping data. Unifying data, and the multiple points of view on their interpretation, before integration is a difficult problem in bioinformatics, and the classical solutions proposed in existing mediation systems do not provide a satisfactory answer to it.

8. Acknowledgements

The authors are grateful to Brice Courselaud, Emilie Guérin and Olivier Loreal for their contributions to the specification, development and use of the prototype. We also thank our colleagues of INSERM U522 for their insights on transcriptome data integration.

9. References

[ASH 98] ASHBURNER M., “On the representation of gene function in genetic databases”. ISMB, Montreal, 1998. http://www.geneontology.org/gene_ontology_discussion.html

[ASH 00] ASHBURNER M., BALL A.C., BLAKE J.A. ET AL. (The Gene Ontology Consortium). “Gene Ontology: tool for the unification of biology”. Nature Genetics, 25: 25-28, 2000.

[BRO 99] BROWN T.A., Genomes, BIOS Scientific publisher, 1999.

[CAR 95] CAREY M., HAAS L., SCHWARZ P. ET AL. “Towards heterogeneous multimedia information systems: The Garlic approach”. In Proc. of RIDE-DOM, p. 124-131, 1995.


[CHE 93] CHEE C., ARENS Y., KNOBLOCK C., AND HSU C., “Retrieving and integrating data from multiple information sources”. Intl. J. of Intelligent and Cooperative Information Systems, 2(2):127-158, 1993.

[ECK 01] ECKMAN B.A., KOSKY A.S., AND LAROCO L.A., “Extending traditional query-based integration approaches for functional characterization of post-genomic data”. Bioinformatics, 17: 587-601, 2001.

[ELL 98] ELLIS L., SPEEDIE S., MCLEISH R., “Representing metabolic pathway information: an object-oriented approach”. Bioinformatics, 14(9):803-806, 1998.

[FER 98] FERNANDEZ M., FLORESCU D., KANG J. ET AL. “Catching the boat with Strudel: Experiences with a web-site management system”. In Proc. of ACM SIGMOD Conf. on Management of Data, p. 414-425, 1998.

[GOH 94] GOH C., MADNICK S., AND SIEGEL M., “Context Interchange: overcoming the challenges of the large-scale interoperable database systems in a dynamic environment”. In Proc. of CIKM'94, p. 337-346, 1994.

[GUE 00] GUERIN E., Contribution à la modélisation d'un entrepôt de données dédié à l'analyse du transcriptome hépatique. Rapport de DEA Génomique et informatique, Université de Rennes I, 2001.

[HUL 97] HULL R., “Managing semantic heterogeneity in databases: a theoretical perspective”. In Proc. of PODS'97, p. 51-61, 1997.

[LEV 95] LEVY A., SRIVASTAVA D., KIRK T., “Data model and query evaluation in global information system”. J. of Intelligent Information Systems, 5(2):121-143, 1995.

[MAM 01] MICROARRAY MARKUP LANGUAGE (MAML), Specification Primer, DRAFT 2001-01-19, http://www.ncbi.nlm.nih.gov/geo/maml/

[MIAME 01] MGED, MIAME, Nature Genetics, vol 29, no. 4, pp 365-371, 2001. http://www.mged.org/Annotations-wg/index.html

[MOU 99] MOUSSOUNI F., PATON N.W., HAYES A., ET AL., “Database Challenges for Genome Information in the Post Sequencing Phase”. In Proc. of DEXA'99, Lecture Notes in Computer Science, Vol. 1677, p. 540-549, 1999.

[NAR 02] NUCLEIC ACIDS RESEARCH, Special Issue on Databases, Vol. 30, No. 1, Oxford University Press, 2002.

[PAT 00] PATON N.W., KHAN S.A., HAYES A., MOUSSOUNI F., ET AL. “Conceptual Modelling of genomic Information”. Bioinformatics, 16(6):548-558, 2000.

[OMG 01] OBJECT MANAGEMENT GROUP, Life Science research Gene Expression LSR RFP-7, Request for Proposal, 2001-08-21, http://www.ncbi.nlm.nih.gov/geo/maml

[ORD 96] ORDILLE J., LEVY A., RAJARAMAN A., “Querying heterogeneous information sources using source descriptions”. In Proc. of VLDB'96, p. 251-262, 1996.


[SIE 01] SIEPEL A., FARMER A., TOLOPKO A., ZHUANG M., MENDES P., BEAVIS W., AND SOBRAL B., “ISYS: a decentralized, component-based approach to the integration of heterogeneous bioinformatics resources”. Bioinformatics, 17: 83-94, 2001.

[WIE 95] WIEDERHOLD G., “Mediation in information systems”. ACM Computing Surveys, 27(2):265-267, June 1995.