
Scientific Models and Hypotheses

Fabio Porto¹, Stefano Spaccapietra²

¹ LNCC – National Laboratory of Scientific Computing, Petropolis, Brazil

² EPFL-IC – Database Laboratory, Lausanne, Switzerland

Abstract. New instruments and techniques for capturing scientific data are exponentially increasing the volume of data consumed by in-silico research, a trend that has been referred to as the data deluge. Once captured, scientific data goes through a cleaning workflow before it is ready for the analysis that will eventually confirm the scientist's hypothesis. The whole process is nevertheless complex, and it shifts the scientist's attention away from his/her research and towards the complexity of managing computing products. Moreover, as the research evolves, references to previous results and workflows are needed as a source of provenance data. Based on these observations, we claim that in-silico experiments must be supported by a holistic hypotheses data model. The latter covers hypotheses formulation and validation and scientific model description, in addition to supporting simulations. Adopting a data perspective to represent hypotheses allows high-level references to experiments and provides support for hypotheses evolution. Hypotheses are associated with computational models that, once run by simulations, allow for quantitative validation of hypotheses. Finally, using the data model as a basis, we sketch the architecture of a scientific model management system.

1. Introduction

The availability of important experimental and computational facilities nowadays allows many large-scale scientific projects to produce an unprecedented amount of experimental and simulation data. This wealth of data needs to be structured and controlled in a way that readily makes sense to scientists, so that relevant knowledge may be extracted to contribute to the scientific investigation process. Current data management technologies are clearly unable to cope with scientists' requirements [1]. In particular, traditional database technology, which has been designed for supporting business-oriented applications, is poorly suited to scientific applications. Only a few scientific projects, such as the Sloan Digital Sky Survey, claim to successfully use relational databases [2]. More frequently, science applications require complex data models [1] to represent meshes, sequences and other scientific abstractions, support for knowledge evolution, and close integration with simulators and scientific workflow engines. Globally, the list of scientific data management requirements is long and variable, but would also include support for intense world-wide collaboration, and reasoning with uncertain knowledge and uncertain models for an incremental elaboration of scientific knowledge.

A promising approach to better scientific data management is based on the concept of scientific model. Building on the informal introduction to the concept given in [3], we define a scientific model as a formal representation that 1) synthesizes the understanding about a studied phenomenon, entity or process, and 2) allows scientists to eventually simulate the behaviour of the real world processes they investigate and compare the expected behaviour against experimental results. Both goals, understanding and simulating, build upon the formalization of the process at hand, e.g. as a set of mathematical formulae or as logic programs. Formalization, indeed, ensures unambiguous interpretation of what is being stated and precisely specifies the implementation that has to be achieved in view of running the experiments that will support or contradict the scientists' assumptions. The formalization stands at the core of the scientific model. The latter, however, includes much more than the formal definitions. Its aim is to include all the knowledge that in one way or another is related to or has contributed to the elaboration and assessment of the formalization. This is a fundamental requisite for collaboration between different teams in the elaboration of the scientific model and for its reuse by other groups. Scientists indeed do not adopt a scientific result without examining how and by whom it has been established, what the underlying hypotheses and supporting previous works were, and what experimental frameworks allowed producing and assessing the result. It is this holistic view of scientific work that the scientific model concept intends to support.

From a data management perspective, scientific models encompass all the information used and produced during a scientific exploration process. This information is contextualized by including a link to the corresponding scientific domain, bibliographic references, provenance data, and a description of the observed phenomenon that is represented by the formalization. Part of a scientific model is its computational model, which we define here as an implementation-oriented specification that translates the formalization into a computable process, including the details about the software and hardware resources. Experiments, essential to the scientific process, are run as defined by the computational model to simulate the behaviour of the studied phenomenon and its interaction with the interfacing environment, according to the specifications in the scientific model. The elaboration of the computational model is a complex but necessary task. It allows scientists to confront their simulation hypotheses against experimental results, reiterate in different experimental settings and with possibly updated specifications reflecting changes in the hypotheses, eventually and hopefully leading to an accurate representation of the studied phenomenon. This retry and refine process, known as model tuning, includes both modifying programs (i.e. changing the behaviour) and fitting new input parameter and set-up values (i.e. re-specifying the initial simulation state). In many cases when experimental data are available, regression analysis provides formal procedures for model fitting, i.e. the process of fitting the model to match existing experimental results [4].

The complexity of specifying and running a computational model and managing all the resources produced during a scientific endeavour diverts the scientist's attention from the phenomenon being investigated to implementation concerns [5]. State-of-the-art approaches to computational simulation resort to scientific workflow languages, such as SCUFL [8] and BPEL [7], to specify workflows, and to their corresponding running environments [6, 9, 10] for model evaluation. Tailored environments like MATLAB [11] and Neuron [12] can be used as support for small-scale, non data-intensive simulations. While some of these offerings have attracted a large user community [13], they fail to provide an integrated scientific model environment such as the one proposed in this paper, which aims at providing integrated management of all the facets of the studied phenomenon, the derived models and the data produced therein.

This work intends to support scientists in specifying, running, analysing and sharing scientific models and model data, i.e. data produced by running simulations. Adopting a data-oriented perspective for scientific model representation, we provide formal definitions for the scientific model concept and its components, as informally proposed in [3]. Beyond the resulting data model, with which users describe scientific models and derived computational models, we propose a query language with which scientists specify simulation queries. The model is grounded in XML and tightly related to domain ontologies, which provide formal domain descriptions and uniform terminology. Scientists may search for existing scientific models and run simulations that automatically invoke the underlying programs on provided inputs. The results of a simulation may generate complex data that can be queried in the context of the scientific model. The proposal provides a framework for individual scientific models. We suggest extending this basic framework in two directions. One is intended to support complex scientific models, which integrate into their definition other scientific models (in full or in part), leading to a compositional network of scientific models. Higher-level models can then be specified e.g. through views that export a unified representation of underlying scientific models. The other extension is intended to support scientists' work in progress through the specification of uncertainties in a scientific model that has not yet been fully assessed by experimental data. We call this hypothesis modelling. Finally, we sketch a scientific model management system supporting the whole process.

The remainder of this chapter is structured as follows. Section 2 discusses related work. Section 3 presents a running example extracted from the neuroscience domain to be explored throughout the chapter. Section 4 covers scientific model driven modelling and section 5 details the simulation language. Section 6 extends scientific models with hypotheses and section 7 discusses some advanced model elements. Finally, section 7 introduces a general architecture for a scientific model management system. The conclusion discusses our achievements so far and future work.

2. Related Work

Data and knowledge management supporting in-silico scientific research is a comprehensive topic that has appeared under the eScience or computational science label. It encompasses the semantic description of the scientific domain, experiment evaluation through scientific workflow systems, and results analysis through a myriad of different techniques, among other in-silico related tasks.

(4)

Given the broad class of application domains that may benefit from eScience related data management techniques, it has been postulated that there is only a small chance that a single solution would cover the diverse set of requirements coming from these domains [1]. The semantic description of scientific domains through ontologies [28] is one exception that has attracted the attention of the scientific community as a means to support collaboration through common conceptual agreement. In this line, GeneOntology [29] is probably the best known and most successful example of practical adoption of ontologies in the scientific domain. Similarly, scientific workflows have become the de facto standard for expressing and running in-silico experiments, using execution environments such as [6, 9, 10]. Despite that, we believe that in-silico experiments require a more comprehensive model that can offer scientists a holistic view of their research. In particular, we propose a data model with which scientists may define scientific hypotheses, describe scientific models [3] and run simulations using computational models. Integrating hypotheses in a data model is, however, not trivial.

Hypothesis modeling was introduced in databases back in the 1980s [23]. In that context, one envisions a hypothetical database state, produced by delete and insert operations, and verifies whether queries are satisfied on that hypothetical state. This approach is, however, far from the experimental semantics that we are interested in. Closer to our objective is the logical model proposed in the context of the HyBrow project [26, 27] for modeling hypotheses in the biology domain. Hypotheses (H) are represented as a set of first-order predicate calculus sentences with free quantifiers. In conjunction with an axiom set, specified as rules that model known biological facts over the same universe, and experimental data, the knowledge base may contradict or validate some of the sentences in H, leaving the remaining ones as candidates for new discovery. As more experimental data is obtained and rules are identified, discoveries become positive facts or are contradicted. In the case of contradictions, the offending rules must be identified and eliminated from the theory formed by H.

The approach adopted by HyBrow supports hypotheses validation in the spirit of what we aim to represent, i.e. a formal definition to be confronted with experimental results and extending the scientific knowledge base. According to what has been discussed above concerning the modeling of hypotheses in a scientific framework, this model needs to be extended. Critically, the adopted model-theoretical approach for hypotheses validation does not seem adequate for representing hypotheses-oriented research, in contrast with similarity-based models. Moreover, this work aims at integrating hypotheses with the simulation environment, bridging the gap between qualitative and quantitative representation.

3. A Neuroscience Scientific Model

In this section we introduce a running example taken from scientific models developed for the neuroscience domain. According to Kandel [14], “the task of neural science is to explain behavior in terms of the activities of the brain”. Numerous computational neuroscience groups investigate scientific models that aim at explaining some specific behavior. A classical example of these scientific models is the "axon membrane action potential" model proposed by Hodgkin and Huxley [15], which describes how action potentials traverse the cell membrane. An action potential is a pulse-like wave of voltage that can traverse certain types of cell membranes, such as the membrane of the axon of a neuron. The model quantitatively computes the potential between the interior and the extracellular liquid, based on the flow of ions of sodium (Na+), potassium (K+) and a leakage flow representing all other channel types.

The formalization in the model is given by the following mathematical equation:

$$I = m^{3} h\, g_{Na} (E - E_{Na}) + n^{4} g_{K} (E - E_{K}) + g_{L} (E - E_{L}) \qquad (0)$$

where $g_i$, $i \in \{Na, K, L\}$, is a time-dependent variable that represents the membrane conductance for sodium, potassium and leakage, $E_i$ models the equilibrium potential for each ion channel, $E$ is the membrane potential, and $n$, $m$ and $h$ are parameters controlling the probability of the sodium or potassium gates being open. The total ionic current across the membrane of the cell is modeled by the variable $I$.

This action potential model simulates the variation in voltage in the cell when applied to a neuron compartment. It has been used, for example, in the formulation of the scientific model of a single neuron. Figure 1 shows the domain ontology that supports the conceptual representation of a single neuron with its various compartments and the membrane behaviour given by the Hodgkin&Huxley (HH) model. The Hodgkin&Huxley class in the ontology represents the corresponding model with the input and output parameters conforming to equation (0).

From the scientific model specification, a computational neuroscientist will conceive programs that implement the behaviour defined by equation (0). Next, when running a simulation, the program receives input values for the parameters identified in equation (0) and produces the total ionic current across the membrane (variable I).
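As an illustration of such a program, the following minimal Python sketch evaluates equation (0) directly. The function name `total_ionic_current` and its argument names are assumptions made for this example; they are not taken from any published implementation of the model.

```python
def total_ionic_current(m, h, n, g_na, g_k, g_l, e, e_na, e_k, e_l):
    """Return the total ionic current I of equation (0).

    m, h, n          gating parameters
    g_na, g_k, g_l   membrane conductances (sodium, potassium, leakage)
    e                membrane potential
    e_na, e_k, e_l   equilibrium potentials of the three channels
    """
    i_na = m**3 * h * g_na * (e - e_na)   # sodium term
    i_k = n**4 * g_k * (e - e_k)          # potassium term
    i_l = g_l * (e - e_l)                 # leakage term
    return i_na + i_k + i_l
```

Keeping the three terms in separate variables mirrors the structure of the equation and makes it easy to inspect the contribution of each channel.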

Fig. 1. Single Neuron Model using the Hodgkin&Huxley model domain ontology.

4. Data Models for Scientific Models

During in-silico scientific investigation, scientists explore, produce and register data and metadata describing a phenomenon and associated simulations. Different scientific products are generated and need to be managed. Scientific model metadata covers provenance, contextual and descriptive information that drives scientific model searching and querying. Similarly, computational model metadata are used as the basis for the automatic instantiation and evaluation of simulations and serve as context to qualify input and output data. In order to structure this wealth of information, we propose a data model for scientific model data and metadata, which is presented in this section.

4.1 The Observed Phenomenon

The starting point of an in-silico scientific investigation is the clear specification of the phenomenon one attempts to explain. The formal description of a phenomenon includes a domain ontology, setting the formal conceptual representation of the domain in which the phenomenon is inserted, a phenomenon title and an informal textual description. A Phenomenon is represented in the data model as:

Ph=< R, OPD, title> (1)

• R identifies the resource (in the web sense). It consists of a tuple <LSID, D>, where

o D is a free text description.

o LSID is the Life Science Identifier, i.e. a unified identifier for the life sciences, specified using the standard format urn:lsid:authority.org:namespace:object:revision¹ [16].
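The phenomenon tuple and its resource can be sketched as plain data classes. This is only an illustration under stated assumptions: the class and field names are hypothetical, the domain ontology is stored as an opaque reference string, the example URN is invented, and the parser requires all six colon-separated fields of the format quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LSID:
    authority: str
    namespace: str
    object_id: str
    revision: str

    @classmethod
    def parse(cls, urn):
        """Split urn:lsid:authority:namespace:object:revision into fields."""
        parts = urn.split(":")
        if len(parts) != 6 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
            raise ValueError(f"not a six-field LSID: {urn!r}")
        return cls(*parts[2:])

    def __str__(self):
        return (f"urn:lsid:{self.authority}:{self.namespace}:"
                f"{self.object_id}:{self.revision}")


@dataclass(frozen=True)
class Resource:
    lsid: LSID          # LSID
    description: str    # D, free text


@dataclass(frozen=True)
class Phenomenon:
    resource: Resource      # R
    domain_ontology: str    # assumption: a reference to the ontology document
    title: str


# Hypothetical identifier, used only to exercise the sketch.
ph = Phenomenon(
    resource=Resource(LSID.parse("urn:lsid:example.org:phenomenon:ph01:1"),
                      "Action potential on an axon membrane"),
    domain_ontology="neuron-ontology.xml",
    title="Axon membrane action potential",
)
```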

4.2 The Scientific Model

A scientific model provides a comprehensive description of the scientist's interpretation of the observed phenomenon. Of prime importance is a formal representation of that interpretation, possibly using mathematical formulae, and a reference to the phenomenon it attempts to explain. Once a scientific model is inserted into the same semantic domain as the phenomenon it refers to, the Phenomenon ontology covers its domain description. In addition, the scientific model (SM) description includes bibliographic references and other metadata supporting model presentation.

Formally, a scientific model is defined as a sextuple

SM = <R, LSIDPh, OMF, B, I, A> (2)

where

• R as described above,

• LSIDPh denotes the phenomenon identifier,

• OMF is the formalization, e.g. one or more mathematical formulae,

• B is a list of bibliographic references, if any,

• I is a list of images, if any, that are used to illustrate the scientific model, and

• A is a list of annotations, if any, that convey any additional information that contributes to the understanding and description of the model.

¹ urn:lsid is the LSID protocol identifying label, authority.org is a DNS reference, and the remaining field names are self-explanatory.

We recommend using domain ontologies to unambiguously describe the domain covered by the scientific model (OPh) as well as to explain the mathematical formulae (OMF). The LSIDSM identification attribute, specified in R, hooks the scientific model to its computational models and simulations.
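The sextuple in (2) maps naturally onto a record type. The sketch below is a hedged illustration: the field names, the string encoding of the formalization and the helper `models_for_phenomenon` are assumptions introduced here, and identifiers are kept as plain strings for brevity.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ScientificModel:
    lsid: str                 # part of R: the model's own identifier (LSIDSM)
    description: str          # part of R: free-text description
    phenomenon_lsid: str      # LSIDPh
    formalization: str        # OMF, e.g. one or more formulae
    bibliography: List[str] = field(default_factory=list)   # B
    images: List[str] = field(default_factory=list)         # I
    annotations: List[str] = field(default_factory=list)    # A


def models_for_phenomenon(models, phenomenon_lsid):
    """Return the scientific models that explain the given phenomenon."""
    return [sm for sm in models if sm.phenomenon_lsid == phenomenon_lsid]
```

A lookup such as `models_for_phenomenon` is the kind of basic search query that the metadata in (2) is meant to support.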

The Hodgkin&Huxley model, presented in section 3, can be depicted as a scientific model. Its data view is illustrated in Figure 2. It provides metadata to support basic search queries over a scientific model database and reasoning capabilities on the theory specified by the scientific model ontology.

Fig. 2. The Hodgkin&Huxley scientific model representation

4.3 The Computational Model

The definition of the scientific model suggests an explanation for a phenomenon using some formal language. In order to run in-silico experiments, a scientist needs to construct a computational representation of the scientific model. Although desirable, an automatic mapping from a formal description of an SM to its computational model is still not feasible, and requires engineering effort from the scientific group. Nevertheless, once a computational model has been specified and built, an engine may read such a specification and automatically instantiate an execution, if input data is provided. The computational model (CM) description in our data model provides the metadata required for such automatic instantiation.

In this context, the Environment ontology and the Domain ontology contribute to disambiguating CM specifications. The Environment ontology describes the execution environment associated with the CM, including: programming language specification, programming environment parameters, libraries, documentation, input and output parameters, initialization procedure, and executing programs. The set of input parameters expected by the CM façade program includes set-up and evaluation parameters. Set-up parameters comprise state information that should hold during the various simulations. They may also introduce run-time control values, such as the frequency of output graph updates. The set-up parameters are referred to in the data model as IS = {i1, i2, …, in}, where each instance ij = <v, P> of IS corresponds to a value v and its corresponding parameter P in Parameters (i.e. a class in the Environment ontology). Complementing the set-up parameters, evaluation parameters specify the simulation instance to be run. Thus, given a state specified in IS, the evaluation parameters E complete the CM running state specification.

On the output side, an outputWrapper specifies methods for obtaining the experimental results. Given the different formats and procedures used by generic programs when producing output, each CM shall indicate the outputWrapper class that knows how to capture its output and return it to the system to feed an eventual pipeline.

A CM is formally defined as a 7-tuple

CM = <R, LSIDSM, XOE, XOD, Mi, Mo, A> (3).

In (3), R is the CM resource identification; LSIDSM is a reference to the associated scientific model; XOE and XOD are the XML serializations of the Environment ontology, of XML type Environment, and of the Phenomenon domain ontology, of XML type Domain, respectively². Mi and Mo are the mappings between the underlying program input and output parameters and the corresponding domain ontology properties (XML tree leaf nodes). Finally, A corresponds to annotations identifying authoring information.
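To make the role of the mappings Mi and Mo concrete, the sketch below shows how an engine could instantiate an execution from a CM description. It is a simplified illustration, not the system's implementation: the XML serializations XOE and XOD are omitted, input documents are flattened into dictionaries keyed by leaf path, and the `leak_cm` model, its identifiers and its paths are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ComputationalModel:
    lsid: str                       # R: resource identification
    scientific_model_lsid: str      # LSIDSM
    input_mapping: Dict[str, str]   # Mi: program input parameter -> leaf path
    output_mapping: Dict[str, str]  # Mo: program output parameter -> leaf path
    program: Callable[..., Dict[str, float]]   # stands in for the facade program
    annotations: List[str] = field(default_factory=list)   # A


def instantiate(cm, input_document):
    """Run the CM program on values read from an input document.

    input_document maps ontology leaf paths to values; the result maps
    ontology leaf paths to the values produced by the program.
    """
    arguments = {param: input_document[path]
                 for param, path in cm.input_mapping.items()}
    raw_output = cm.program(**arguments)
    return {cm.output_mapping[name]: value for name, value in raw_output.items()}


# Hypothetical CM wrapping only the leakage term of equation (0).
leak_cm = ComputationalModel(
    lsid="urn:lsid:example.org:cm:leak:1",
    scientific_model_lsid="urn:lsid:example.org:sm:leak:1",
    input_mapping={
        "g_l": "/Neuron/Axon/Hodgkin-Huxley/gL",
        "e": "/Neuron/Axon/Hodgkin-Huxley/E",
        "e_l": "/Neuron/Axon/Hodgkin-Huxley/EL",
    },
    output_mapping={"i": "/Neuron/Axon/Hodgkin-Huxley/I"},
    program=lambda g_l, e, e_l: {"i": g_l * (e - e_l)},
)
```

Because the engine only touches the mappings, the same `instantiate` function works for any program whose parameters have been described this way.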

Figure 3 illustrates the representation of a computational model implementing the scientific model SM01 from Figure 2.

² The serialization of ontologies into an XML structure follows a detailed technique not presented here. The main intuition is to form a tree structure having concepts as nodes and aiding in semantically qualifying the program's parameters.

Fig. 3. A Hodgkin-Huxley Computational Model

4.4 Simulation

The previous elements of the model present metadata used in scientific model formalization and querying, and in support of automatic computational model instantiation. We now turn to the expression of simulations. Simulations are in-silico experiments run to assess the scientific model against the observed phenomenon. By analogy with databases, where users' data is intensionally expressed in queries, we call the specification of a simulation a simulation query.

Let us define a simulation database DBS = {CM1, CM2, …, CMm}, where CMi, 1 ≤ i ≤ m, are n-ary data views on the computational models. Given a computational model, the corresponding n-ary data view CMi abstracts the behavior of the software programs associated with the CM by exposing its input and output parameters as data attributes and completely hiding its implementing programs. This is similar to the modeling of user-defined functions as relations in databases [17].

Consider, for instance, the data view below corresponding to the CM in Figure 3:

HodgkinHuxley(i:(m, n, h, gNa, gK, gL, ENa, EK, EL),o:(I)) (4).

The HodgkinHuxley data view presents one attribute for each input/output CM parameter. Querying a data view requires binding attributes in the input parameter set (prefixed with i:) to input values and retrieving results associated with output parameters (prefixed with o:). In this context, a simulation query (S) interrogates a data view CMi by providing binding values and obtaining results from output attributes.
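The idea of a data view that exposes attributes and hides its program can be sketched as follows. The `DataView` class and the `Leak` view are hypothetical names introduced for this example; the view deliberately covers only the leakage term of equation (0) so that every attribute it needs is listed explicitly.

```python
class DataView:
    """Relation-like view over the program of a computational model."""

    def __init__(self, name, inputs, outputs, program):
        self.name = name
        self.inputs = tuple(inputs)     # attributes prefixed with i:
        self.outputs = tuple(outputs)   # attributes prefixed with o:
        self._program = program         # implementation hidden from callers

    def query(self, **bindings):
        """Bind every input attribute and return the output attributes."""
        missing = [a for a in self.inputs if a not in bindings]
        unknown = [a for a in bindings if a not in self.inputs]
        if missing or unknown:
            raise ValueError(
                f"missing inputs {missing}, unknown attributes {unknown}")
        result = self._program(**bindings)
        return {name: result[name] for name in self.outputs}


# Hypothetical view Leak(i:(gL, E, EL), o:(I)).
leak_view = DataView(
    "Leak",
    inputs=("gL", "E", "EL"),
    outputs=("I",),
    program=lambda gL, E, EL: {"I": gL * (E - EL)},
)
```

Rejecting partially bound queries reflects the requirement that all input attributes be bound before results can be retrieved.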

Section 5 formally presents the simulation query language.  


5. Simulation Language

In-silico experiments are commonly expressed using workflow or some sort of scripting languages. We aim to leverage the expression of simulations by providing a high-level query language with which scientists may express a large class of workflows, notably those that can be modeled as directed acyclic graphs.

In this context, a simulation query is specified as an expression in non-recursive Datalog [19] comprising a head and a body. The body is a boolean expression composed of a conjunction of predicates, whereas the head specifies a predicate holding variables containing the expected simulation results, each necessarily appearing in one of the predicates in the body. Users interface with simulation queries by providing the input parameters and set-up values needed for the evaluation of the predicates, and getting in return the output values in the variables defined in the head. We first present the syntax and semantics of simulation query predicates.

5.1 Simulation Predicate

A simulation query predicate is specified as:

Si ( (Vi,Wi) ; (Xi’, Xo’) ; (Ii,Oi) ; IS ) (5).

In (5), Si labels the simulation query predicate and associates it with the corresponding CM resource identification. Vi and Wi are the two sets of variables defined to refer to values provided as input or produced as output when running the underlying CM program. The sets of input and output parameter values are provided by the XML documents Xi and Xo, respectively. Note that the associated CM definition specifies the schemas for Xi and Xo. For example, using the CM in Figure 3, the Xi document can be obtained from the result of the XPath expression “/CM/DomainOntology” over the Hodgkin&Huxley CM element XOD, and by filling its leaf nodes with the input values. Thus, /Neuron/Axon/Hodgkin-Huxley/m = 0.1 illustrates a possible value assignment for the input parameter m. Ii and Oi are the mappings defining the correspondence (see Definition 1) between the input and output variables in Vi and Wi and the input and output parameter values in Xi and Xo. Finally, IS represents simulation set-up parameters.

Definition 1: Correspondence assertions in Ii and Oi are specified as $x = Path, where $x is a variable in {Vi ∪ Wi} and Path is an XPath [20] expression pointing to a data element in Xk’, k={i,o}, whose leaf node is either an input parameter value or an output value.
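Correspondence assertions of this form can be resolved with a few lines of Python. The sketch below relies on the limited XPath subset supported by the standard `xml.etree.ElementTree` module; the function name `bind_variables`, the sample document and the value given to h are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def bind_variables(document, assertions):
    """Resolve each '$x = Path' assertion against an XML document.

    document is the root element of Xi or Xo; assertions maps variable
    names to absolute paths. Returns a mapping variable -> leaf text.
    """
    prefix = f"/{document.tag}/"
    bindings = {}
    for variable, path in assertions.items():
        if not path.startswith(prefix):
            raise ValueError(f"{path} is not rooted at /{document.tag}")
        # ElementTree resolves paths relative to the root element.
        node = document.find(path[len(prefix):])
        if node is None:
            raise ValueError(f"{path} matches no element")
        if len(node) != 0:
            raise ValueError(f"{path} does not point to a leaf node")
        bindings[variable] = node.text
    return bindings


# Illustrative input document; the value of h is made up for the example.
xi = ET.fromstring(
    "<Neuron><Axon><Hodgkin-Huxley>"
    "<m>0.1</m><h>0.6</h>"
    "</Hodgkin-Huxley></Axon></Neuron>"
)
bindings = bind_variables(xi, {
    "$m": "/Neuron/Axon/Hodgkin-Huxley/m",
    "$h": "/Neuron/Axon/Hodgkin-Huxley/h",
})
```

The leaf check enforces the condition in Definition 1 that a path must end at a parameter or output value rather than at an intermediate concept node.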

Having described the syntax for individual simulation predicates, we can proceed to define the semantics of body expressions. First we define the semantics of a single simulation predicate. The definition for complete body expressions follows.


5.2 Semantics of a single simulation predicate

A single simulation predicate evaluates to a boolean value according to Definition 2 below, with respect to its syntax in (5) and the CM specification in (3).

Definition 2: A simulation predicate Si evaluates to true iff, given an Xi holding the set of parameter values input to the program implementing the corresponding CM, according to Mi, there exists an Xo whose leaf values are produced by the evaluation of the referenced program and that is built from the mappings in Mo.

5.3 Semantics of the body of a Simulation expression

More elaborate conjunctive expressions can be composed from single simulation predicates to form the body of a simulation. The semantics of a conjunction of simulation predicates in the body of a simulation is defined in Definition 3.

Definition 3: Given a conjunction of simulation predicates s = s1 ∧ s2 ∧ … ∧ sn, s is considered to hold true if the conjunctive expression on the right evaluates to true. Moreover, if two or more simulation predicates si and sj in s refer to the same variable, for instance $x, then they share a single associated value. In addition, the shared variable must hold a single binding to a value, either provided as input or produced as output by an underlying program computation.

Note that the restriction regarding sharing variables among simulation predicates leads to data-dependency relationships, in which the simulation predicate holding the value associated with the shared variable shall precede, in evaluation order, the remaining simulation predicates sharing that particular variable. Moreover, variable sharing introduces a particular mode of value assignment to data elements in Xi, replacing that of the node corresponding to its associated path.

Finally, given a body that evaluates to true, the head of the simulation identifies the variables in the query body whose values are returned as the simulation results, such that if K is the set of variables in the head then K ⊆ (Vi ∪ Wi), for 1 ≤ i ≤ n, with n being the number of simulation predicates in the body.
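The data dependencies induced by shared variables determine an evaluation order, which can be computed with a topological sort. The sketch below is one possible reading of that rule, using `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later); the function name and the dictionary encoding of predicates are assumptions, and the predicate names are borrowed from the two-predicate example used later in the text.

```python
from graphlib import TopologicalSorter


def evaluation_order(predicates):
    """Order simulation predicates so that producers precede consumers.

    predicates maps a predicate name to a pair (input_vars, output_vars).
    A variable may be produced by at most one predicate (single binding).
    """
    producers = {}
    for name, (_, outputs) in predicates.items():
        for var in outputs:
            if var in producers:
                raise ValueError(
                    f"{var} is produced by both {producers[var]} and {name}")
            producers[var] = name

    sorter = TopologicalSorter()
    for name, (inputs, _) in predicates.items():
        # A predicate depends on whichever predicate produces its inputs.
        dependencies = {producers[v] for v in inputs if v in producers}
        sorter.add(name, *dependencies)
    # static_order raises graphlib.CycleError on circular dependencies.
    return list(sorter.static_order())


order = evaluation_order({
    "CM02": (["$I"], ["$z"]),
    "CM01": (["$m", "$h", "$n"], ["$I"]),
})
```

Variables that no predicate produces, such as `$m` here, are treated as values supplied by the user and impose no ordering constraint.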

5.4 A Simulation query

A simulation query combines the head and its body into a clause as illustrated in (6), according to definitions 1, 2 and 3.

S(K) := S1 ((V1, W1) ; (Xi1, Xo1) ; (I1, O1) ; IS1) ∧
        S2 ((V2, W2) ; (Xi2, Xo2) ; (I2, O2) ; IS2) ∧
        … ∧
        Sn ((Vn, Wn) ; (Xin, Xon) ; (In, On) ; ISn)          (6)

An example of a simulation query is given in Figure 4. This particular query returns the total ionic current across the membrane ($I) according to the parameter values specified in the input document HHCM01I. As discussed before, the user must provide a mapping from each query variable to the corresponding data element of the domain ontology XML serialization document. In this example, the input and output XML documents, Xi and Xo, are illustrated by documents HHCM01I and HHCM01O, respectively, both of type Neuron.

S($I, $z) := CM01((i:($m,$h,$n,$gNa,$gK,$gL,$ENa,$EK,$EL), o:($I)); (HHCM01I, HHCM01O); ($m = /Neuron/Axon/Hodgkin-Huxley/m, …³, $I = /Neuron/Axon/Hodgkin-Huxley/I)) ∧ CM02⁴(($I, $z); (ACM02I, ACM02O); ($z = /Analysis/result))

Fig. 4. A simulation query example.

6. Hypothesis Modelling

Up to now, we have focused our discussion on scientific models and the entities that derive from them during a research exploration. An important aspect still missing from a conceptual point of view is the expression of scientific hypotheses, which drive research by proposing an explanation for a studied phenomenon. Indeed, according to Wikipedia [24], a scientific hypothesis is used as a tentative explanation of an observation, but one which has not yet been fully tested by the prediction validation process for a scientific theory. A hypothesis is used in the scientific method to predict the results of further experiments, which will be used either to confirm or disprove it. A successfully tested hypothesis achieves the status of a scientific theory. This is at odds with mathematics, whose role is to prove theorems, whereas in experimental and simulation science one wants to test hypotheses [27]. In this context, scientific hypotheses play a fundamental role in experimental science by bringing rigor to results validation. In addition, as explanations of phenomena, hypotheses should get as close as possible to the object they explain, to the point where one may conceptually replace the other with a degree of uncertainty. On the other hand, computational science tests hypotheses through simulations, whose results, which we name here hypothesis instances, associate quantitative values with hypotheses. In doing so, scientific hypotheses introduce a valuable contribution by bridging the gap between the qualitative description of the phenomenon domain and the corresponding quantitative valuation obtained through simulations. Thus, in our modeling approach we aim at coming up with a representation of scientific hypotheses that may be used in qualitative (i.e. ontological) assertions and, with minimal adaptation, can be quantitatively confronted with phenomenon observations. Having said that, bridging this gap is a challenging job that we are starting to tackle and whose first results we report in this section.

³ The remaining mappings are not shown due to lack of space.

⁴ The CM02 computational model has purposely not been described.

In order to integrate hypotheses within our scientific model framework, the following conceptual entities must be considered:

• phenomenon – representing a set of observations that a scientist records of a certain phenomenon that he/she wants to explain;

• hypothesis – representing a set of explanations for a given phenomenon based on a computational model;

• competing hypotheses – different hypotheses for the same phenomenon; • hypotheses provenance – a causal relationship between hypotheses or

phe-nomena and a new formulated hypothesis. The relationship highlights the logi-cal basis of a hypothesis explaining its scientific origin and providing prove-nance information about the latter;

6.1 Running Example

Let’s consider a small variation on the scenario presented in section 3. Suppose we want to feed a scientific visualization application with the temporal variation of the value of the ionic current (I); in other words, the ionic current is a function of time. The result is a time series showing the variation of the ionic current during an interval of time Δt. In addition, we will assume that independent scientific models are conceived to model the ionic current on each gate (i.e. sodium, potassium and leakage). In this revised scenario, the formulae in (0) can be re-written as:

$$I = \int_{1}^{d} m^{3} h \,(E - E_{Na})\, g_{Na}(t)\,dt + \int_{1}^{d} n^{4} (E - E_{K})\, g_{K}(t)\,dt + \int_{1}^{d} (E - E_{L})\, g_{L}(t)\,dt \qquad (7)$$

In (7), d is the duration of the simulation and the membrane conductance gi is a function of the simulation time instant. The ionic current on each gate is modeled by a different scientific model, leading to the following computational models: ionicChannelNa, ionicChannelK and ionicChannelL.
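As a rough illustration of (7), the Python sketch below sums the three per-channel terms and integrates them over the simulation interval with the trapezoidal rule. The function names mirror the computational models listed above, but their signatures are assumptions made for this example. The integral is read as running from 1 to d, which is also an assumption, and the gating variables m, n and h are held constant during the integration for simplicity.

```python
from typing import Callable

# g_i as a function of the simulation time instant t
Conductance = Callable[[float], float]


def ionic_channel_na(t: float, m: float, h: float, E: float,
                     E_na: float, g_na: Conductance) -> float:
    """Sodium term of (7): m^3 * h * (E - E_Na) * g_Na(t)."""
    return m ** 3 * h * (E - E_na) * g_na(t)


def ionic_channel_k(t: float, n: float, E: float,
                    E_k: float, g_k: Conductance) -> float:
    """Potassium term of (7): n^4 * (E - E_K) * g_K(t)."""
    return n ** 4 * (E - E_k) * g_k(t)


def ionic_channel_l(t: float, E: float, E_l: float, g_l: Conductance) -> float:
    """Leakage term of (7): (E - E_L) * g_L(t)."""
    return (E - E_l) * g_l(t)


def trapezoid(f: Callable[[float], float], start: float, stop: float,
              steps: int = 1000) -> float:
    """Composite trapezoidal rule on a uniform grid."""
    width = (stop - start) / steps
    total = 0.5 * (f(start) + f(stop))
    for i in range(1, steps):
        total += f(start + i * width)
    return total * width


def total_ionic_current(d: float, m: float, n: float, h: float, E: float,
                        E_na: float, E_k: float, E_l: float,
                        g_na: Conductance, g_k: Conductance, g_l: Conductance,
                        start: float = 1.0) -> float:
    """Integrate the sum of the three channel terms from `start` to `d`."""
    def integrand(t: float) -> float:
        return (ionic_channel_na(t, m, h, E, E_na, g_na)
                + ionic_channel_k(t, n, E, E_k, g_k)
                + ionic_channel_l(t, E, E_l, g_l))
    return trapezoid(integrand, start, d)
```

Keeping one function per channel reflects the idea that each gate is described by an independent model whose output feeds the total current.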

Given this new scenario, a scientist may formulate the following hypothesis concerning the Hodgkin-Huxley model: The total ionic current on a membrane compartment is a time function of the ionic current on the sodium, potassium and leakage channels.

Under the scenario described above, the following entities have been identified:

• phenomena:

a) Total ionic current on a neural membrane compartment;

b) Time dependent ionic current on a sodium ionic channel;

c) Time dependent ionic current on a potassium ionic channel;

d) Time dependent ionic current on a leakage ionic channel;

• computational models:

a) ionicChannelNa

b) ionicChannelK

c) ionicChannelL

• hypothesis: The total ionic current on a membrane compartment is a time function of the ionic current on the sodium, potassium and leakage channels.

6.2 Hypotheses Model

In order to integrate scientific hypotheses into the scientific model data model, we formally define a Hypotheses Data Model (HDM).

An HDM describes an experiment domain and is defined as HDM={Ph, PhO, H, HI, HP, CM, SM, E, V}, where:

• Ph – is a set of phenomena, as defined in (1);

• PhO – is a set of phenomenon observations, see below;

• H – is a set of hypotheses, see below;

• HI – is a set of hypothesis instances, see below;

• SM – is a set of scientific models, as defined in (2);

• CM – is a set of computational models, as defined in (3);

• E – experiments, see section 7;

• V – simulation views, see section 7.

Elements of HDM are n-ary relations with attributes defined in D, a set of concrete domains, such as integers, strings, floats, etc.

Definition: A phenomenon observation (PhO) is a temporal record of a phenomenon, quantitatively described by its attribute values.

Thus, a phenomenon observation models the observable entity in different states and time instants. The observed variation is recorded as PhO attribute values.

In the example of section 6.1, a scientist may register the following observations concerning totalIonicCurrent:

totalIonicCurrent (ob1, t1, <m1,n1,h1>, 0.001);    (8)
totalIonicCurrent (ob2, t1, <m2,n2,h2>, 0.002);

In the representations above (8), two instances of the phenomenon totalIonicCurrent are depicted. Each observation includes an identifier, the time instant of the observation, a set of initial state values specifying the context in which the phenomenon was observed, and an attribute that quantitatively describes it. The latter serves as the basis for assessing hypotheses.

We extend the specification in (1) for phenomenon observations in the HDM as

ph_i(obid, date, V, U, A), ph_i ∈ Ph,

where ph_i is a phenomenon observation set label, obid is an observation identifier, V = <v_1, v_2, …, v_k> and U = <a_1, a_2, …, a_l>, with v_i ∈ D_n and a_j ∈ D_m, D_m, D_n ⊆ D, for all 1 ≤ i ≤ k, 1 ≤ j ≤ l. V represents a list of initial set-up values and U a list of phenomenon attributes. Finally, A is a list of annotations.
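To make the structure of a phenomenon observation concrete, the following Python sketch encodes the tuple (obid, date, V, U, A) as a small record type and registers the two observations of (8). The class and field names are illustrative choices, and the numeric set-up values stand in for the symbolic <m1,n1,h1> and <m2,n2,h2>; none of them come from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class PhenomenonObservation:
    phenomenon: str                      # observation set label, e.g. "totalIonicCurrent"
    obid: str                            # observation identifier
    date: str                            # time instant of the observation
    setup: Tuple[float, ...]             # V: initial set-up values
    attributes: Tuple[float, ...]        # U: observed phenomenon attributes
    annotations: Tuple[str, ...] = ()    # A: free-form annotations


def observations_of(phenomenon: str,
                    observations: List[PhenomenonObservation]
                    ) -> List[PhenomenonObservation]:
    """Select the observations recorded for one phenomenon."""
    return [o for o in observations if o.phenomenon == phenomenon]


# Placeholder set-up values for <m, n, h>; assumptions for illustration only.
ob1 = PhenomenonObservation("totalIonicCurrent", "ob1", "t1", (0.05, 0.32, 0.60), (0.001,))
ob2 = PhenomenonObservation("totalIonicCurrent", "ob2", "t1", (0.06, 0.30, 0.58), (0.002,))
```

Separating `setup` from `attributes` keeps the distinction the model draws between the context of an observation and the values later compared against hypotheses.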

Syntactically, we distinguish initial set-up values from observable attributes with an underline on the latter. Thus, in light of the example in section 6.1, we would have the following phenomena observation schemas involved in computing the ionic current in a membrane compartment:

totalIonicCurrent (obid, date, INA, IK, IL, totalIonic)
ionicChannelNa (obid, date, t, m, h, INA)
ionicChannelK (obid, date, t, n, IK)
ionicChannelL (obid, date, t, IL)

Phenomenon observations are the basis for the modeling activity. A scientist formulates hypotheses that may or may not be validated when compared to observations.

Definition: A scientific hypothesis is the specification of a possible explanation for a given phenomenon.

In this context, a scientific hypothesis is specified as

H=<id, Phid, F, CMid>    (9)

where id is a hypothesis identifier; Phid is a phenomenon identifier (LSIDPh); F is a comparison function, used in measuring the accuracy of hypotheses with respect to observations; and CMid is an identifier for the corresponding computational model. Referring to our running example, the total ionic current phenomenon hypothesis schema would be defined as:

totalIonicCurrent (hid , LSIDtotalIonicCurrent , fi, LSIDCM);
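The tuple in (9) can be pictured as a record that carries the comparison function alongside the two identifiers. The Python sketch below is one possible encoding. The mean absolute difference used as comparison function and the LSID strings are assumptions for illustration, since the text does not fix a particular function or identifier format.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Comparison = Callable[[Sequence[float], Sequence[float]], float]


def mean_absolute_distance(computed: Sequence[float],
                           observed: Sequence[float]) -> float:
    """One possible comparison function F (assumed, not prescribed by the model)."""
    return sum(abs(c - o) for c, o in zip(computed, observed)) / len(computed)


@dataclass(frozen=True)
class Hypothesis:
    hid: str               # hypothesis identifier
    phenomenon_id: str     # Phid: identifier of the explained phenomenon
    compare: Comparison    # F: comparison function
    model_id: str          # CMid: identifier of the computational model


# Hypothetical LSID strings standing in for LSIDtotalIonicCurrent and LSIDCM.
h1 = Hypothesis("h1", "LSID:totalIonicCurrent", mean_absolute_distance, "LSID:CM")
```

Storing the function in the record means that two hypotheses about the same phenomenon may be assessed with different accuracy measures.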

Observe that, by providing the computational model LSID in a hypothesis declaration, we are indicating that hypotheses are associated to a computational representation, expressed by the entity pointed to by LSIDCM, and whose evaluation simulates the phenomenon identified by Phid. In this context, a hypothesis instance set corresponds to the set of results obtained by running the computational model associated with the hypothesis, and forms the basis for quantitative hypothesis validation. We highlight two aspects of hypothesis instances: hypothesis provenance and the hypothesis’ comparable attributes. On the one hand, hypothesis provenance explains the logical formulae that define the computation of a hypothesis instance. On the other hand, the hypothesis’ comparable attribute values export the simulation data view and are confronted with the phenomenon observation attribute values, according to the comparison function fi.

Using the example in section 6.1, we illustrate both concepts with their instances:

Hypothesis instance:

totalIonicCurrent (hi1, t1, <m1,n1,h1>, 0.003, <ob1, ob2>, dist1);

Hypothesis provenance:

totalIonicCurrent($totalIonic) = totalIonicCurrent ($INA, $IK, $IL, $totalIonic) ∧ ionicChannelNa (t1, m1, h1, $INA) ∧ ionicChannelK (t1, n1, $IK) ∧ ionicChannelL (t1, $IL)

Definition: A scientific hypothesis instance holds the same initial condition values as the phenomenon observation it attempts to explain and characteristic values computed by the corresponding computational model. The latter are compared against phenomenon observation attribute values through the hypothesis’ comparison function.

Thus, scientific hypothesis instances are formally specified as:

hi_i (h_j, date, p_j, c, q, dist)    (10)

where hi_i ∈ HI; p_j = (p_1, p_2, …, p_n), p_k ∈ D_k, D_k ⊆ D, for all 1 ≤ k ≤ n, is the set of initial condition values; c = (c_1, c_2, …, c_m), c_l ∈ D_l, D_l ⊆ D, for all 1 ≤ l ≤ m, is the set of attribute values used in validation against phenomenon observations; h_j is a hypothesis instance identifier; q is a set of phenomenon observations, defining the observations comparison set; and dist is a measure of distance, computed by the hypothesis’ comparison function f, between the hypothesis instance and the explained phenomenon observations ph_j.
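The sketch below shows, in Python, how a hypothesis instance of the form (10) might be assembled: a computational model is run on the initial condition values, and the comparison function turns the computed attributes and the selected observations into a distance. All names are hypothetical. The model is a stand-in callable, and averaging the absolute differences over the comparison set is an assumption, since the text leaves the form of the comparison function open.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Sequence, Tuple

Model = Callable[[Tuple[float, ...]], Sequence[float]]
Comparison = Callable[[Sequence[float], Sequence[Sequence[float]]], float]


@dataclass(frozen=True)
class HypothesisInstance:
    hid: str                         # hypothesis instance identifier
    date: str
    setup: Tuple[float, ...]         # p: initial condition values
    computed: Tuple[float, ...]      # c: attribute values produced by the model
    compared_with: Tuple[str, ...]   # q: identifiers of the observations used
    dist: float                      # distance returned by the comparison function


def mean_abs_over_set(computed: Sequence[float],
                      observed_sets: Sequence[Sequence[float]]) -> float:
    """Average absolute difference over every observation in the comparison set."""
    diffs = [abs(c - o) for obs in observed_sets for c, o in zip(computed, obs)]
    return sum(diffs) / len(diffs)


def build_instance(hid: str, date: str, setup: Tuple[float, ...], model: Model,
                   observations: Dict[str, Tuple[float, ...]],
                   compare: Comparison) -> HypothesisInstance:
    """Run the model on the set-up values and score it against the observations."""
    computed = tuple(model(setup))
    dist = compare(computed, list(observations.values()))
    return HypothesisInstance(hid, date, setup, computed, tuple(observations), dist)
```

For the example instance above, a model returning 0.003 compared with observed values 0.001 and 0.002 would receive a distance of 0.0015 under this particular comparison function.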

We motivated the need for a scientific data model that permits knowledge evolution in sync with the progress of the investigation. Hypotheses foster evolution by allowing competing hypotheses to be specified and individually assessed against phenomenon observations. Similarly, by modifying a computational model a new hypothesis may be specified. Finally, each instance of a hypothesis instance set represents a new scientific trial with a different input value set. Thus, hypotheses, hypothesis instances and competing hypotheses are three different dimensions through which scientists can register the evolution of the investigation, keeping provenance information about the progress of the research.
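One simple way to assess competing hypotheses individually is to order them by the distances recorded in their instances. The Python sketch below ranks hypotheses by mean distance, lowest first. This ranking criterion and the identifiers used are assumptions for illustration, not part of the model definition.

```python
from statistics import mean
from typing import Dict, List, Sequence


def rank_competing(distances_by_hypothesis: Dict[str, Sequence[float]]) -> List[str]:
    """Order competing hypotheses by the mean distance of their instances."""
    scored = [(mean(dists), hid) for hid, dists in distances_by_hypothesis.items()]
    return [hid for _, hid in sorted(scored)]


# Hypothetical distances collected from the instances of two competing hypotheses.
ranking = rank_competing({"hA": [0.002, 0.004], "hB": [0.001, 0.002]})
```

Because every instance keeps its own distance, the ranking can be recomputed whenever a new trial is registered.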

The introduction of hypotheses and hypothesis instances completes the core entities of our scientific model data model.

7. Advanced Elements of HDM

In this section we briefly introduce two advanced elements of the scientific model data model: provenance management and the representation of complex scientific models. An important element of a scientific model environment is the ability to provide provenance [25] information regarding simulation query evaluation. In this respect, an experiment refers to a simulation query evaluation and includes the simulation query text, a list of the computational model LSIDs appearing in the simulation query body, and an identifier.

Experiment = <LSIDe, list< CMi>, query-text> (11)

Thus, for an instance of a simulation query running the Hodgkin-Huxley model, we would have:

S($I) := CM01((i:($m,$h,$n,$gNa,$gK,$gL,$ENa,$EK,$EL), o:($I)));

leading to the provenance information:

Experiment (LSIDi, CM01, “S($I) := CM01((i:($m,$h,$n,$gNa,$gK,$gL,$ENa,$EK,$EL), o:($I)))”);    (12)

In (12), the experiment element is shown filled in with a simulation query instance. Experiments support research evolution by allowing scientists to recover previous trials and to apply data fitting to computational model parameter values.
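As an illustration of the experiment element in (11), the Python sketch below stores the identifier, the list of computational models and the query text, and derives the model list from the query body. The regular expression assumes that model names follow the CM01/CM02 naming pattern, and the LSID string is a placeholder. Both are assumptions made for this example.

```python
import re
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Experiment:
    lsid: str                 # experiment identifier
    models: Tuple[str, ...]   # computational models appearing in the query body
    query_text: str           # the simulation query as submitted


def models_in_query(query_text: str) -> Tuple[str, ...]:
    """Collect model names (assumed to look like CM01) to the right of ':='."""
    body = query_text.split(":=", 1)[1]
    found = re.findall(r"\b(CM\d+)\s*\(", body)
    return tuple(dict.fromkeys(found))  # drop duplicates, keep first-seen order


def record_experiment(lsid: str, query_text: str) -> Experiment:
    """Build the provenance record for one simulation query evaluation."""
    return Experiment(lsid, models_in_query(query_text), query_text)


query = "S($I) := CM01((i:($m,$h,$n,$gNa,$gK,$gL,$ENa,$EK,$EL), o:($I)));"
experiment = record_experiment("LSID:experiment-1", query)  # placeholder identifier
```

Keeping the query text verbatim is what lets a scientist re-run or refit an earlier trial later on.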

A second advanced element of the model supports the composition of computational models in the same vein as database views [21] and, for this reason, is named simulation view (V). It allows simulation queries to be memorized so that they can be re-executed later on or included in a more complex simulation. V provides users with an external perspective of a simulation through the set of input parameter values that configure the participating computational models. In addition, a simulation view establishes correspondences between the exported parameters and the ones specified on each simulation predicate taking part in the body of the simulation describing the view. A simulation view is expressed as follows:

Sv((V, W) ; (Xiv’, Xov’) ; (Iv, Ov) ; (ISv, Ms)) =
S1((V1,W1) ; (Xi1’, Xo1’) ; (I1,O1) ; IS1) ∧
S2((V2,W2) ; (Xi2’, Xo2’) ; (I2,O2) ; IS2) ∧
… ∧
Sn((Vn,Wn) ; (Xin’, Xon’) ; (In,On) ; ISn).    (13)

The body of the simulation view is like that of ordinary simulations, expressing a conjunction of simulation predicates. The difference appears in the head of the formula, which exports an integrated view of the simulation predicates’ input and output parameters appearing in the body of the formula and specified in Xik and Xok, 1 ≤ k ≤ n. The two sets of correspondences, Ik and Ok, map the external view in {Xiv, Xov} to the corresponding parameters in the simulation predicates in the body, {Xik, Xok}. Thus, a correspondence assertion is expressed as Sv.path/dataelement ≡ Si.path/dataelement, where path is an XPath expression. Along the same lines as the input/output parameters, ISv expresses the uniform view of the set-up parameter values appearing in the body of the formula, and Ms asserts the correspondences between the set-up data elements in ISv and those in the body.
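The role of the correspondence assertions can be sketched in Python as a lookup table that routes each parameter exported by the view to the predicate and parameter it stands for in the body. For brevity the sketch uses plain names instead of XPath expressions, and the function and variable names are hypothetical.

```python
from typing import Any, Dict, Tuple

# view parameter name -> (simulation predicate, parameter name in that predicate)
Correspondences = Dict[str, Tuple[str, str]]


def distribute_inputs(view_inputs: Dict[str, Any],
                      correspondences: Correspondences) -> Dict[str, Dict[str, Any]]:
    """Route values given on the view to the predicates in the view body."""
    per_predicate: Dict[str, Dict[str, Any]] = {}
    for name, value in view_inputs.items():
        predicate, parameter = correspondences[name]
        per_predicate.setdefault(predicate, {})[parameter] = value
    return per_predicate


# Hypothetical correspondences for a view over two channel models.
routing = distribute_inputs(
    {"m": 0.05, "n": 0.3},
    {"m": ("ionicChannelNa", "m"), "n": ("ionicChannelK", "n")},
)
```

A user of the view only supplies the exported names; the table hides how the body predicates name their own parameters.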

8. Managing Scientific Models

A scientific model management system (SMMS) supports scientists in designing, creating and searching scientific model entities, and in managing the results of simulations and analyses. Figure 6 depicts the main system functions structured into four layers. A user layer provides the interface for scientists to create and edit elements of the model and to request system services, such as running simulations, querying and reasoning. Users may query scientific model metadata as well as hypotheses data and simulation results.

Fig. 6. Scientific Model Management System Architecture

The metadata management layer stores scientific model metadata and supports metadata management services. In this work, scientific model metadata is based on a set of ontologies that guarantees uniform terminology among models and scientists. A transformation and selection service allows scientists to map ontology fragments to XML trees, which are then used in the description of data model elements. The catalog service manages metadata about scientific models and data, as well as supporting information such as ontologies, views (see section 5.5) and transformation rules.

The service layer supports simulation evaluation, querying of simulation results and reasoning. We have extended the query processing system CoDIMS [10] to cope with the evaluation of simulation queries. Finally, a data management layer supports distributed scientific model management and wrappers implementing complex data types, offering access to simulation results, such as graphs and time series. In this paper, the details of the architecture are not further explored. Similarly, the details regarding ontology management, transformation and alignment are left to future work.

9. Conclusion

Managing in-silico simulations has become a major challenge for eScience applications. As science increasingly depends on computational resources to help solve extremely complex questions, it becomes paramount to offer scientists mechanisms to manage the wealth of knowledge produced during a scientific endeavor. This chapter presented initial results aiming to contribute to this idea. We propose a data-centric, semantics-based data model with which scientists represent scientific hypotheses, scientific models and associated computational models. Scientific hypotheses are explanations of observable phenomena expressed through the results of computer simulations, which can be compared against phenomena observations. The model allows scientists to record the existing knowledge about an observable investigated phenomenon, including a formal mathematical interpretation of it, if one exists. Additionally, it intends to serve as the basis for the formal management of the scientific exploration products, as well as supporting model evolution and model sharing. Traditionally, scientific computation takes either a mathematical or a computational view when modeling is discussed. Scientific workflows are the most practical example of the computational view. By taking a declarative perspective on scientific models we envisage various benefits. Firstly, a higher-level declarative language allows scientists to concentrate on the scientific questions they try to answer, saving precious time that would otherwise be spent on workflow definition and evaluation. Secondly, hypotheses are good candidates to bridge the gap between an ontological description of studied phenomena and the simulations that aim at explaining them. Finally, data views on scientific entities allow for querying and searching for models, supporting the sharing of scientific models among different scientific groups.

There are various opportunities for future work. In this chapter we have not introduced the ontological modeling of hypotheses. Different relationships among hypotheses may be expressed, such as composition, similarity and use. Investigating how to express these and other relationships among hypotheses and the phenomena they attempt to explain is an interesting topic. Furthermore, hypotheses evolve and eventually become a theory. It would be interesting to model and reason about such evolution. Regarding hypothesis validation, the distance between a hypothesis instance and the referred phenomenon may be probabilistically interpreted. Thus, the use of probabilistic inference models when evaluating hypothesis formulae needs to be investigated. Another path of research explores the use of provenance information to support the evolution of the investigation.

Considering data management on the results of simulations, there is a long path to follow. Computational models usually deal with complex data types, such as meshes, grids and time series. Integrating these data types into our model is future work. Besides, all research involving heterogeneous data is relevant to enabling automatic communication between the computational models involved in a simulation.

We have developed a first prototype system that implements the data model and the simulation query language on top of the CoDIMS system. The system is designed in the context of the scientific model management system architecture, with a minimal set of services that scientists may expect from such an environment.

References

[1] Stonebraker, M., Becla, J., DeWitt, D., et al., “Requirements for Science Data Base and SciDB”, Conference on Innovative Data Systems Research, CIDR, 2009.

[2] Szalay, A., Kunszt, P., Thakar, A., Gray, J., et al., “Designing and Mining Multi-Terabyte Astronomy Archives: The Sloan Digital Sky Survey”, ACM SIGMOD, Dallas, Tx, USA, 2000, pp. 451-462.

[3] Hunter, J., Scientific Models – A User-oriented Approach to the Integration of Scientific Data and Digital Libraries, VALA 2006, Melbourne, February, 2006.

[4] Jaqaman, K., Danuser, G., Linking data to models: data regression, Nature Reviews, Molecular Cell Biology, V(7), November 2006, pp. 813-819.

[5] Silvert, W., Modelling as a discipline, International Journal General Systems, V30(3), pp. 1-22, 2000.

[6] Oinn, T., Greenwood, M., Addis, M., Taverna: Lessons in creating a workflow environment for the life sciences, Concurrency and Computation: Pract. Exper., 1-7, 2000.

[7] Akram, A., Meredith, D., Allan, R., Evaluation of BPEL for Scientific Workflows, Cluster Computing and the Grid, CCGRID, V1(16-19), May, 2006.

[8] http://www.gridworkflow.org/snips/gridworkflow/space/XScufl

[9] Altintas, I., Berkley, C., Jaeger, E., Jones, M., Ludascher, B., Kepler, M., An Extensible System for Design and Execution of Scientific Workflows. In SSDBM, 2004.

[10] Porto, F., Tajmouati, O., Silva, V.F.V., Schulze, B., Ayres, F.M., “QEF Supporting Complex Query Applications”, 7th Int’l Symposium on Cluster Computing and the Grid, Rio de Janeiro, Brazil, pp. 846-851.

[11] MATLAB, http://en.wikipedia.org/wiki/Matlab, last access 24/06/2008.

[12] Neuron, http://www.neuron.yale.edu, last access 24/06/2008.

[13] Roure, D., Goble, C., Stevens, R., Designing the myExperiment Virtual Research Environment for the Social Sharing of Workflows. Science 2007 - Third IEEE Int. Conf. on e-Science and Grid Computing. Bangalore, India, 10-13 December 2007, pp. 603-610.

[14] Kandel, E., Schwartz, J., Jessell, T., Principles of Neural Science, 4th ed. McGraw-Hill, New York, 2000.

[15] Hodgkin, A., Huxley, A., A quantitative description of ion currents and its applications to conduction and excitation in nerve membranes, J. Physiol. (Lond.), 117:500-544, 1952.

[16] http://lsids.sourceforge.net/, last accessed 26/04/2008.

[17] Chaudhuri, S., Shim, K., Query Optimization in the presence of Foreign Functions, Proc. of the 19th Very Large Database Conference, Dublin, Ireland, 1993, pp. 529-542.

[18] Grosof, B., Horrocks, I., Volz, R., Decker, S., Description Logic Programs: Combining Logic Programs with Description Logic, Proc. WWW2003, Budapest, May 2003.

[19] Ullman, J., “Principles of Database and Knowledge-Base Systems”, Vol.1, Computer Science Press, 1988.

[20] http://www.w3.org/TR/xpath, last accessed 26/04/2008.

[21] Elmasri, R., Navathe, S., Fundamentals of Database Systems, 2nd Edition. Benjamin/Cummings 1994.

[22] Christiansen H., Andreasen T., A practical Approach to Hypothetical Database Queries, Transactions and Change in Logic DBs, LNCS 1472, pp. 340-355, Springer-Verlag, Berlin, 1988.

[23] Bonner A. J., Hypothetical Datalog: Complexity and Expressibility, Theoretical Computer Science 76 (1990), pp. 3-51, North-Holland.

[24] http://en.wikipedia.org/wiki/Scientific_hypothesis, last access 04/09/2009.

[25] Davidson, S., Freire, J., “Provenance and Scientific Workflows: Challenges and Opportunities”, Proc. 2008 ACM SIGMOD Int’l Conf on Management of Data, Vancouver, CA, 2008, pp. 1345-1350.

[26] Racunas, S.A., Shah, N.H., Albert, I., Fedoroff, N.V., Hybrow: a prototype system for computer-aided hypothesis evaluation, Bioinformatics, Vol.20, Suppl.1 2004, pp. 257-264.

[27] Racunas, S., Griffin, C., Shah, N., A Finite Model Theory for Biological Hypotheses, Proc. of the 2004 IEEE Computational Systems Bioinformatics Conferences, 2004.

[28] T. R. Gruber, “Toward Principles for the Design of Ontologies Used for Knowledge Sharing”. Stanford University. International Journal of Human-Computer Studies 1995.

[29] http://www.geneontology.org/