
Data Management for eScience

Fabio Porto (fporto@lncc.br)

LNCC

CCC

http://dexl.lncc.br

Fábio Porto eScience 2009

A word of caution

• This presentation is a personal reflection, certainly biased by my own experience and far from complete

Agenda

• Part I: And Prof. Heuser said
• Part II: Plato's cave
• Part III: Shedding light

Part I: And Prof. Heuser said

SBBD2009 - invited talk

• “To see what is in front of one’s nose needs a constant struggle”
George Orwell

A need for collaboration


A time of questions

• What is the correlation between the greenhouse effect and the food shortage?
• Is the current worldwide food shortage associated with the increasing cultivation of crops for green fuel?
• In the trajectory of a flock of birds, which are the dense regions?
• In a galaxy, where are the dark energy foci?
• Which DNA sequences match a certain model?
• How can I model a metabolic network with hundreds of compounds?
• Can one produce the neural network from an initial connection set and the electrophysiology data?

A challenge on volume

• The Dark Energy Survey (DES) project expects to produce 100 PB in 10 years (source: personal communication)
– 5000° sky cover, “all” objects, “perfect” accuracy
• Yahoo claims to manage 2 PB of click data in a modified PostgreSQL
• EMBL nucleotide database: 260 Gbases
• High-throughput sequencing with 454 Roche technology
– Sequences 400-600 million bases in 10 hours
– E.g. a project at the Max Planck Institute aims at sequencing the whole Neanderthal genome (about 3 billion base pairs) and is expected to take 2 years to finish
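To put these figures in perspective, a back-of-the-envelope calculation can help. The sketch below is only illustrative: it assumes decimal petabytes (10^15 bytes), 365-day years, and a single pass over the genome (1x coverage), none of which is stated on the slide. The function names are hypothetical.

```python
import math

PB = 10**15  # assumption: decimal petabyte
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: 365-day year


def average_rate_bytes_per_s(total_bytes: float, years: float) -> float:
    """Average sustained data rate needed to accumulate total_bytes in `years`."""
    return total_bytes / (years * SECONDS_PER_YEAR)


def runs_needed(genome_bases: int, bases_per_run: int) -> int:
    """Number of sequencing runs for one pass over the genome (1x coverage)."""
    return math.ceil(genome_bases / bases_per_run)


# DES: 100 PB in 10 years, expressed as an average rate.
des_rate = average_rate_bytes_per_s(100 * PB, 10)
print(f"DES average rate: {des_rate / 1e6:.0f} MB/s")

# 454 technology: 400-600 million bases per 10-hour run, 3 billion base pairs.
for bases_per_run in (400_000_000, 600_000_000):
    runs = runs_needed(3_000_000_000, bases_per_run)
    print(f"{bases_per_run:,} bases/run -> {runs} runs, {runs * 10} hours")
```

A single pass is only a lower bound: real projects read each region many times, which is one reason the elapsed time quoted on the slide is far longer than the raw machine time.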

GenBank exponential growth

1982 - 2008

 

Growth in nucleotide sequences submitted to GenBank between 1982 and 2005. The notes from each release of GenBank reveal the total number of nucleotides submitted to the database. This graph uses one data point from each year to show the exponential growth rate of nucleotide sequences in this international database. Data from whole genome sequencing is not included in these figures, but expressed sequence tag data and data from sequencing centers is included. Copyright 2008 Nature Education.

Where are all these data coming from?

• Better instrumentation generates larger and more accurate datasets
• Highly scalable systems process data faster and generate higher volumes to be processed and analysed
• Faster networks make data sharing easier
• New research becomes easier and consequently generates more data

A quest for quality

Source: EMBL

• Many of the sequences in GenBank are badly annotated. How can one trust such a database?
– Confidence as important metadata to be computed
– How to computationally model the quality of data

A need for models

• In traditional databases, the schema models the semantics of the data it reflects
• In eScience, data reflects models that simulate phenomena
• Thus, managing data corresponds to managing models

A view on Computer Science

• The challenge for computer science in the next decades will be to produce computational models that permit users to interpret the evolution of facts and discoveries and reactively and collaboratively produce knowledge

The Bottom-line

• “Scientists are spending most of their time manipulating, organizing, finding and moving data, instead of researching. And it’s going to get worse”
– Office of Science Data Management Challenge

Data Management for eScience

• What are the options?
– From a system viewpoint:
  • Adapt object-relational DBMSs to support any new data management requirement (Oracle, PostgreSQL, …)
  • Develop specific solutions (SRB, …)
  • Adopt hybrid solutions (flat files, DFS + DBMS)
  • Use search technology (coupled with file-system data management)
  • Use semantic technology (Pellet, Racer, Oracle, …)
  • Highly distributed and parallelizable techniques (MapReduce, multi-core, cloud, …)
  • Workflow systems (Taverna, VisTrails, CoDIMS, …)
  • Adaptive query processing techniques

Data Management for eScience

• What are the options?
– From a data (K) point of view:
  • Semantic representation and reasoning
  • Uncertainty representation and similarity-based models
  • Data and knowledge integration with quality measurements
  • Hypotheses, scientific models and simulations
  • Provenance
  • Time-series data analysis
  • Multi-dimensional data analysis (KD-tree)
  • Integration with trust and quality
  • Declarative workflows

And Prof. Heuser said

• No single data model can represent the myriad data characteristics produced in scientific research
• No single DBMS can support the different needs of scientific applications
• But many research results in databases can be applied to science data

Part II: Plato's cave

Semantic Representation & Reasoning

Domain specification

Terminology


Model of the brain with functions

• Problem solving, emotion, complex thought
• Coordination of complex movement
• Initiation of voluntary movement
• Receives tactile information from the body
• Processing of multisensory information
• Complex processing of visual information
• Detection of simple visual stimuli
• Language comprehension
• Complex processing of auditory information
• Detection of sound quality (loudness, tone)
• Speech production and articulation

Domain specification

Domain terminology

From a macro to micro level


From simulated data to semantics

a) Single neuron (soma, axon, dendrites)
b) A neuron column
c) Layer 4 neuron

Images from the Blue Brain Project

Terminology

• Terminology initiatives in neuroscience
– BrainML (http://brainml.org)
– Mouse Brain Atlas
– Allen Brain Atlas
– Cocomac (http://www.cocomac.org)
• In medicine
– UMLS (www.nlm.nih.gov/research/umls)
• In bio-ontologies
– Compiled at OBO (www.obofoundry.org)
– GeneOntology (www.geneontology.org)

Probabilistic Querying

• The ubiquitous availability of scientific data leads to uncertainty about their content.
– BLAST returns the similarity between a query sequence and the sequences in a set
– A computer simulation approximates (within some distance) the phenomenon it simulates

Simulations

Large coast swell (wave height > 3 m) caused by strong wind (> 20) and rain (precipitation > 20 mm)

Rain precip.         Prec.
25 mm                0.95
14 mm                0.8

Wind speed           Prec.
15’                  0.9
23’                  0.95

Coast-swell height   Prec.
4 m                  ??
2.5 m                ??
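The slide leaves the confidence of the derived coast-swell value open ("??"). One simple way to propagate it, shown here purely as an assumption and not as the method proposed in the talk, is to treat the input confidences as independent probabilities and multiply them. The `Measurement` type and the `combine_independent` function are hypothetical names.

```python
from dataclasses import dataclass
from math import prod


@dataclass(frozen=True)
class Measurement:
    """A value together with the confidence ("Prec.") attached to it."""
    value: float
    confidence: float


def combine_independent(*inputs: Measurement) -> float:
    """Confidence of a derived fact, assuming independent inputs (product rule)."""
    return prod(m.confidence for m in inputs)


# First row of the rain and wind tables on the slide.
rain = Measurement(value=25.0, confidence=0.95)  # mm
wind = Measurement(value=15.0, confidence=0.9)   # unit as written on the slide
swell_confidence = combine_independent(rain, wind)
print(f"derived confidence: {swell_confidence:.3f}")
```

Other choices, such as taking the minimum of the input confidences or a model-specific error propagation, are equally defensible. The point is only that confidence becomes metadata computed alongside the data.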

Drug Resistance

Which drug to prescribe to an HIV-1 patient?

Query: atggaaaagg …

GenBank (sequence, gene)
attgcc..     pol
attggcc..    pol
ccgttgcc..   pol
Attgggcc..   pol

BLAST
attgccc      0.99   12AI,345GI,..
Attggg…      0.95   123AD,222GI
attgag       0.9    444TI,555TI

Drug resistance
drug1        0.88   12AI,345GI,..
drug2        0.8    123AD,233GI
drug3        0.9    444TI,556TD

And also

• Data sources of different quality
– Which data shall we trust?
– Automatic evaluation of cost models
– Semantic descriptions

Hypothesis-based research

Scientific applications

• Scientific exploration is one particularly relevant hypothesis-based application
• Current in silico scientific endeavours rely on scientific workflows
• Scientific workflows have become a standard for computer simulations, data analysis, and transformations (e.g. Taverna, Kepler, ...) in eScience
• But:
– They drive the scientist's attention towards the computational model, to the detriment of the scientific problem pursued;
– They hardly evolve with new findings;
– Managing the produced data is tough;
– Reproducing results is difficult;
– Sharing the graph is not sharing the experience;
– Semantics are in the problem domain and not in the computational

Hypotheses in eScience

Hypotheses are formal representations synthesizing the understanding a scientist has of a studied phenomenon, entity or process. They allow scientists to simulate the behaviour of the real world and compare it against experimental results.

Corollary:
The exploratory nature of science requires continuously handling evolution while maintaining a holistic and integrated view of the information used and produced during a scientific endeavour.

Hypothesis driven, data-oriented

[Figure: diagram with nodes Hypothesis, Model, Data and KB, connected through a Validation step (outcome "ok").]

Databases for scientific models

• Managing scientific models involves managing all the resources involved in a scientific endeavour;
• Some ‘databases’ are already available on the web;
• Published metadata, programs, documentation, data, etc.
• Weak querying mechanisms
• Weak integration, no composition
• Running environments in the form of a workflow or simulation system (e.g. NEURON, MATLAB)

ModelDB - database of models

http://senselab.med.yale.edu/modeldb/


SMs include programs and graphs

Conceptual model for SM single neuron

Querying simulation results

a) Find circuits with a specific topology
b) What are the possible connections?
c) How do simulation results correlate with experimental results?

Simulation results management

[Figure: models M1 to M4, each with Inputs, Outputs and Extra; an XML file provides the model inputs for M1, and each model's output is kept as DATA/XML.]

Managing high volume of data

• Distributed in grids
– Fragmentation policies
  • spatial, temporal, spatio-temporal
  • target-oriented: e.g. genes, galaxies
  • artificial key -> value
– Use of Distributed Hash Table [ ]
• Locally
– Indexing (R-Tree family, VA-File, VP-Tree) (Eduardo Valle - Tutorial - SBBD2009)
  • R-Tree - data partitioning in minimum bounding rectangles
  • VA-File - grid with binary indexes
  • VP-Tree - metric space data partitioning based on the choice
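The "artificial key -> value" fragmentation policy can be sketched with plain hash partitioning. This is a simplified stand-in for a real Distributed Hash Table (no consistent hashing, no nodes joining or leaving), and the node names and keys are made up for illustration.

```python
import hashlib
from collections import defaultdict


def node_for_key(key: str, nodes: list[str]) -> str:
    """Map a key to a node by hashing it (simple modulo placement)."""
    digest = hashlib.sha1(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


def fragment(records: dict[str, object], nodes: list[str]) -> dict[str, dict[str, object]]:
    """Split key -> value records into one fragment per node."""
    fragments: dict[str, dict[str, object]] = defaultdict(dict)
    for key, value in records.items():
        fragments[node_for_key(key, nodes)][key] = value
    return dict(fragments)


nodes = ["node-a", "node-b", "node-c"]  # hypothetical grid nodes
records = {f"galaxy-{i}": {"id": i} for i in range(1000)}  # hypothetical objects
fragments = fragment(records, nodes)
for node, part in sorted(fragments.items()):
    print(node, len(part))
```

Because placement depends only on the key, any node can locate a record without a central catalogue. The price is that spatial or temporal locality is lost, which is why the slide lists this policy next to the spatial and temporal ones.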

Data management

• To manage:
– Scientific model metadata
– Computational model invocation
– Output data
• To integrate:
– Metadata
– Raw data with metadata
– Data
• To hide:
– From scientists, the underlying complexity involved in setting up, running and integrating models
• To offer:
– A data-oriented, semantics-based view of scientific models

Part III: Shedding light

Objective

• Support hypothesis-based research by
– managing data and knowledge
– raising the level of abstraction
– automatically generating workflows
– managing intermediate and final results

The DiLeS approach

[Figure labels: Ontology, Hypotheses, Scientific Model, Computational Model, Simulation, Distributed data management]

A data-oriented declarative view of scientific models

The basis

• An ontology-based domain description guides modelling and experiments
• Annotated images enhance domain semantics
• Scientific models are represented through their data view (XML grounding)
• Experiments are queries whose results are managed by the system
• Complex data formats (matrices, graphs, 3D graphs, ...)
• Integration/alignment of data and metadata

Hypotheses Relationships

Hypotheses model

• The choice between an ontological and a logical view on ontologies is open

Rain precip.         Prec.
25 mm                0.95
14 mm                0.8

Wind speed           Prec.
15’                  0.9
23’                  0.95

Coast-swell height   Prec.
4 m                  ??
2.5 m                ??

Simulations

Scientific model (SM)

• Specifies the research to be pursued to prove a hypothesis
– Specifies the domain of investigation
– References the bibliography
– Describes the mathematical equations modelling the studied phenomenon

SM data Model - Scientific Model

[Figure: class diagram. Labels: Hodgkin-Huxley (G_Na, G_K, G_L, E_Na, E_K, E_L, h); Neuron; LSID; CompartmentType (diameter, length, capacitance); Axon, Soma, Dendrite; CompartmentBehavior; MembraneChannel; totalIonicCurrent; associations with multiplicity 1..*.]

Computational Model (CM)

• Describes the computational implementation of an SM
– Intentional view of a simulation
• Composes the metadata needed for running simulations
• Domain ontology (terminology) mapped into an XML view
– Mappings associate elements of the ontology with nodes and arrows of the XML view
• Exports a data view of an SM
– Input and output contextualized by the scientific domain (XML view)
– Description of the execution environment, with pointers to the required software and hardware
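The mapping between ontology terms and the XML view can be pictured as a table from term to path. The sketch below assumes a hypothetical XML layout and hypothetical term names, loosely based on the neuron example. It relies on the limited XPath subset supported by Python's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML view of a computational model's input (not from the talk).
XML_VIEW = """
<simulation>
  <neuron lsid="n1">
    <compartment type="soma"><diameter>20.0</diameter><length>20.0</length></compartment>
    <compartment type="axon"><diameter>1.0</diameter><length>500.0</length></compartment>
  </neuron>
</simulation>
"""

# Hypothetical mapping: ontology term -> path of the node that grounds it.
ONTOLOGY_TO_XML = {
    "Soma.diameter": "./neuron/compartment[@type='soma']/diameter",
    "Axon.length": "./neuron/compartment[@type='axon']/length",
}


def lookup(root: ET.Element, term: str) -> list[float]:
    """Return the values of all XML nodes mapped to an ontology term."""
    return [float(node.text) for node in root.findall(ONTOLOGY_TO_XML[term])]


root = ET.fromstring(XML_VIEW)
for term in ONTOLOGY_TO_XML:
    print(term, lookup(root, term))
```

With such a table in place, a scientist can ask for data using domain terms while the system resolves them to positions in the model's inputs and outputs.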

Computational Model


Simulation

• Is an experiment request (query)
• Fills in input parameters and input variables with experiment values
• Declaratively combines computational models with input data -> scientific workflow
• Output data is stored and managed for future analysis
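One way to read "declaratively combines computational models with input data -> scientific workflow" is that the execution order is derived from what each model consumes and produces, rather than drawn by hand. The sketch below illustrates that idea with Python's standard `graphlib` module. The model names and data items are invented, and this is not the algorithm used by the system described in the talk.

```python
from graphlib import TopologicalSorter

# Hypothetical computational models: name -> (inputs consumed, outputs produced).
MODELS = {
    "rain_model": ({"satellite_images"}, {"precipitation"}),
    "wind_model": ({"pressure_maps"}, {"wind_speed"}),
    "swell_model": ({"precipitation", "wind_speed"}, {"swell_height"}),
}


def derive_workflow(models: dict[str, tuple[set[str], set[str]]]) -> list[str]:
    """Order models so that every model runs after the producers of its inputs."""
    # Which model produces each data item.
    producer = {out: name for name, (_, outs) in models.items() for out in outs}
    # Dependency graph: model -> models that must run before it.
    # Inputs with no producer are treated as external data.
    graph = {
        name: {producer[item] for item in inputs if item in producer}
        for name, (inputs, _) in models.items()
    }
    return list(TopologicalSorter(graph).static_order())


workflow = derive_workflow(MODELS)
print(workflow)
```

The scientist only declares inputs and outputs. Adding a model or changing one model's inputs changes the derived workflow without anyone editing a graph.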

Simulation query

• Uses the Datalog model
– Programs are n-ary predicates
  • Bound variables
  • Variables bound to input and output data
– Elegant model, but it needs extensions:
  • Integrate loops
    – Requires extensions to the syntax and semantics of the language
  • Integrate if-else conditions

SM data Model - Simulation query


Simulation query

Drug(D, P) := BLAST(X, M, P1), DR(M, P1, D, P)

Query: atggaaaagg …

GenBank (sequence, gene)
attgcc..     pol
attggcc..    pol
ccgttgcc..   pol
Attgggcc..   pol

BLAST
attgccc      0.99   12AI,345GI,..
Attggg…      0.95   123AD,222GI
attgag       0.9    444TI,555TI

Drug resistance
drug1        0.88   12AI,345GI,..
drug2        0.8    123AD,233GI
drug3        0.9    444TI,556TD
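The rule above treats programs as predicates whose bound variables are inputs and whose free variables are outputs. A minimal sketch of how such a rule could be evaluated left to right follows. The toy relations use values adapted from the slide, while the matching rule (a shared mutation) and the way confidences are combined (a product) are assumptions made only for illustration.

```python
from typing import Iterator

# Toy relations standing in for the BLAST program and the drug-resistance (DR)
# table; values are illustrative, and the matching rule is an assumption.
BLAST_HITS = {
    "atggaaaagg": [({"12AI", "345GI"}, 0.99), ({"123AD", "222GI"}, 0.95)],
}
DRUG_RESISTANCE = [
    ("drug1", 0.88, {"12AI", "345GI"}),
    ("drug2", 0.80, {"123AD", "233GI"}),
]


def blast(x: str) -> Iterator[tuple[frozenset[str], float]]:
    """BLAST(X, M, P1): X is bound; yields mutation sets M with similarity P1."""
    for mutations, similarity in BLAST_HITS.get(x, []):
        yield frozenset(mutations), similarity


def dr(m: frozenset[str], p1: float) -> Iterator[tuple[str, float]]:
    """DR(M, P1, D, P): M and P1 are bound; yields drugs D with confidence P.

    Assumption: a drug matches when it shares a mutation with M, and P is the
    product of the BLAST similarity and the table's own confidence.
    """
    for drug_name, confidence, mutations in DRUG_RESISTANCE:
        if m & mutations:
            yield drug_name, p1 * confidence


def drug(x: str) -> list[tuple[str, float]]:
    """Drug(D, P) := BLAST(X, M, P1), DR(M, P1, D, P), evaluated left to right."""
    return [(d, p) for m, p1 in blast(x) for d, p in dr(m, p1)]


for d, p in drug("atggaaaagg"):
    print(d, round(p, 3))
```

Each body atom becomes a generator, and the shared variables `M` and `P1` carry the output of one program into the input of the next, which is the sense in which the query is also a workflow.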

Scientific Model Management Architecture (SMMA)

• To support:
– SM specification, modification and evolution
– SM sharing
– Design and management of scientific domain ontologies
– Hypothesis and SM reasoning
– SM semantic search and metadata querying
– Evaluation of simulation queries (workflows) and management of simulation results (CoDIMS)
– Querying of simulation results data
– Integrating/aligning SM data and metadata with domain ontologies

SMM architecture

[Figure: layered architecture. Layers: User Layer, Service Layer, Metadata Management Layer, Data Management Layer. Components: Scientific Model definition, Domain ontology definition, Simulation Request, Data Analysis, Hypotheses reasoning, Simulation request Analysis & Optimization, Simulation Evaluation & Querying, Reasoning, Ontology to XML, Ontology to Ontology, Ontology to Data, Catalog management, Datatype plugins, Distributed Data Management.]

Conclusion

Concluding remarks

• Scientific applications seem to be one of the next great waves in data management
• A high-level data model for managing simulations seems relevant
• Various database techniques and theories are applicable
