Data Management for eScience
Fábio Porto (fporto@lncc.br)
LNCC – CCC
http://dexl.lncc.br
Fábio Porto
eScience 2009
A word of caution
• This presentation is a personal reflection, certainly biased by my own experience and far from being complete
Agenda
• Part I: And Prof. Heuser said
• Part II: Plato's cave
• Part III: Shedding light
Part I: And Prof. Heuser said
SBBD2009 - invited talk
• “To see what is in front of one’s nose needs a constant struggle”
George Orwell
A need for collaboration
A time of questions
• What is the correlation between the greenhouse phenomenon and the food shortage?
• Is the current worldwide shortage of food associated with the increasing cultivation of green fuel?
• In the trajectory of a flock of birds, which are the dense regions?
• In a galaxy, where are the dark energy foci?
• Which are the DNA sequences that match a certain model?
• How can I model a metabolic network with hundreds of compounds?
• Can one produce the neural network from an initial connection set and the electrophysiology data?
A challenge on volume
• Dark Energy Survey (DES) project expects to produce 100 PB in 10 years (source: personal comm.)
– 5000° sky cover, “all” objects, “perfect” accuracy
• Yahoo claims to manage 2 PB of click data in a modified PostgreSQL
• EMBL nucleotide database: 260 Gbases
• High-throughput sequencing with Roche 454 technology
– Sequences 400-600 million bases in 10 hours
– E.g., a project at the Max Planck Institute that aims at sequencing the whole Neanderthal genome (3 billion base pairs) is expected to take 2 years to finish.
GenBank exponential growth
1982 - 2008
Growth in nucleotide sequences submitted to GenBank between 1982 and 2005. The notes from each release of GenBank reveal the total number of nucleotides submitted to the database. This graph uses one data point from each year to show the exponential growth rate of nucleotide sequences in this international database. Data from whole genome sequencing is not included in these figures, but expressed sequence tag data and data from sequencing centers is included. Copyright 2008 Nature Education.
Where are all these data coming from?
• Better instrumentation generates more accurate and larger datasets
• Highly scalable systems process data faster and generate higher volumes to be processed and analysed
• Faster networks make data sharing easier
• New research becomes easier and consequently generates more data
A quest for quality
Source:EMBL
• Many of the sequences in GenBank are badly annotated. How can one trust such a “database”?
– Confidence as an important piece of metadata to be computed
– How to computationally model quality of data
A need for models
• In traditional databases, the schema models the semantics of the data it reflects
• In eScience, data reflects models that simulate phenomena
• Thus management of data corresponds to managing models
A view on Computer Science
• The challenge for computer science in the next decades will be to produce computational models that permit users to interpret the evolution of facts and discoveries, and to produce knowledge reactively and collaboratively
The Bottom-line
• “Scientists are spending most of their time manipulating, organizing, finding and moving data, instead of researching. And it’s going to get worse”
– Office of Science Data Management Challenge
Data Management for eScience
• What are the options?
– From a system viewpoint:
    • Adapt object-relational DBMSs to support any new data management requirement (Oracle, PostgreSQL, …)
    • Develop specific solutions (SRB, …)
    • Adopt hybrid solutions (flat files, DFS + DBMS)
    • Use search technology (coupled with file system data management)
    • Use semantic technology (Pellet, Racer, Oracle, …)
    • Highly distributed and parallelizable techniques (MapReduce, multi-core, cloud, …)
    • Workflow systems (Taverna, VisTrails, CoDIMS, …)
    • Adaptive query processing techniques
Data Management for eScience
• What are the options?
– From a data (K) point of view:
    • Semantic representation and reasoning
    • Uncertainty representation and similarity based models
    • Data and knowledge integration with quality measurements
    • Hypotheses, scientific models and simulations
    • Provenance
    • Time series data analysis
    • Multi-dimensional data analysis (KD-tree)
    • Integration with trust and quality
    • Declarative workflows
And Prof. Heuser said
• No single data model can represent the myriad of data characteristics produced in scientific research
• No single DBMS can support the different needs of scientific applications
• But many research results in databases can be applied to scientific data
Part II: Plato's cave
Semantic Representation & Reasoning
Domain specification
Terminology
Model of the brain with functions
– Problem solving, emotion, complex thought
– Coordination of complex movement
– Initiation of voluntary movement
– Receives tactile information from the body
– Processing of multisensory information
– Complex processing of visual information
– Detection of simple visual stimuli
– Language comprehension
– Complex processing of auditory information
– Detection of sound quality (loudness, tone)
– Speech production and articulation
Domain specification
Domain terminology
From a macro to micro level
From simulated data to semantics
a) Single neuron (soma, axon, dendrites)
b) A neuron column
c) Layer 4 neuron
Images from the Blue Brain Project
Terminology
• Terminology initiatives in Neuroscience
– BrainML (http://brainml.org)
– Mouse Brain Atlas
– Allen Brain Atlas
– CoCoMac (http://www.cocomac.org)
• In medicine
– UMLS (www.nlm.nih.gov/research/umls)
• In Bio-ontologies
– Compiled at OBO (www.obofoundry.org)
– GeneOntology (www.geneontology.org)
Probabilistic Querying
• The ubiquitous availability of scientific data leads to uncertainty about their content.
– Blast returns the similarity between a query sequence and the sequences in a set
– A computer simulation approximates (within some distance) the phenomenon it simulates
Simulations
Large coast swell (wave height > 3 m) caused by strong wind (> 20’) and rain (precipitation > 20 mm)

Rain precip.          Prec.
25 mm                 0.95
14 mm                 0.8

Wind speed            Prec.
15’                   0.9
23’                   0.95

Coast-swell height    Prec.
4 m                   ??
2.5 m                 ??
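One simple way to fill in the ?? entries can be sketched as follows. The sketch assumes (this is not stated on the slide) that "Prec." can be read as a probability and that the rain and wind readings are independent, so their confidences multiply. The pairing of the 25 mm and 23’ readings is hypothetical, and the names `Reading` and `swell_confidence` are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Reading:
    value: float      # measured or simulated value (mm for rain, ’ for wind)
    precision: float  # the "Prec." column, read here as a probability in [0, 1]


def swell_confidence(rain: Reading, wind: Reading) -> Optional[float]:
    """Confidence in a large coast swell, or None if the rule does not fire.

    ASSUMPTION: rain and wind are independent, so confidences multiply.
    """
    # Rule from the slide: wind > 20’ and precipitation > 20 mm.
    if rain.value > 20 and wind.value > 20:
        return rain.precision * wind.precision
    return None


# Hypothetical pairings of the readings shown in the tables.
fired = swell_confidence(Reading(25, 0.95), Reading(23, 0.95))
not_fired = swell_confidence(Reading(14, 0.8), Reading(15, 0.9))
print(fired, not_fired)
```

Other combination rules (for example taking the minimum of the input confidences) would be equally easy to plug in; the product is only one possible choice.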
Drug Resistance
Which drug to prescribe to an HIV-1 patient?
[Figure: the query sequence (atggaaaagg …) is run through Blast against GenBank, and the hits are matched with drug resistance data]
GenBank (sequence, gene): attgcc.., attggcc.., ccgttgcc.., Attgggcc.. (gene: pol)
Blast:
  attgccc    0.99    12AI,345GI,..
  Attggg…    0.95    123AD,222GI
  attgag     0.9     444TI,555TI
Drug resistance:
  drug1      0.88    12AI,345GI,..
  drug2      0.8     123AD,233GI
  drug3      0.9     444TI,556TD
And also
• Data sources with different quality
– Which data shall we trust?
– Automatic evaluation of cost models
– Semantic descriptions
Hypotheses based research
Scientific applications
• Scientific exploration is one particularly relevant hypothesis based application
• Current in silico scientific endeavours rely on scientific workflows
• Scientific workflows have become a standard for computer simulations, data analysis, and transformations (e.g. Taverna, Kepler, …) in eScience
• But:
– They drive the scientist's attention towards the computational model, to the detriment of the scientific problem pursued;
– They hardly evolve with new findings;
– Managing produced data is tough;
– Reproducing results is difficult;
– Sharing the graph is not sharing the experience;
– Semantics are in the problem domain and not in the computational
Hypotheses in eScience
Hypotheses are formal representations synthesizing the understanding a scientist has about a studied phenomenon, entity or process. They allow scientists to simulate the behaviour of the real world and compare it against experimental results.
Corollary:
The exploratory nature of science requires continuously handling evolution while maintaining a holistic and integrated view of the information used and produced during a scientific endeavour.
Hypothesis driven, data-oriented
[Figure: diagram relating Hypothesis, Model, Data, KB and Validation (ok)]
Databases for scientific models
• Managing scientific models involves managing all resources involved in a scientific endeavour
• Some ‘databases’ are already available on the web
• Published metadata, programs, documentation, data, etc.
• Weak querying mechanisms
• Weak integration, no composition
• Running environments in the form of a workflow or simulation system (e.g. NEURON, MATLAB)
ModelDB - database of models
http://senselab.med.yale.edu/modeldb/
SMs include programs and graphs
Conceptual model for SM single neuron
Querying simulation results
a) Find circuits with a specific topology
b) What are the possible connections?
c) How do simulation results correlate with experimental results?
Simulation results management
[Figure: models M1 to M4, each with Inputs, Outputs and Extra; an XML file provides the model inputs and each model produces DATA/XML output]
Managing high volume of data
• Distributed in grids
– Fragmentation policies
    • spatial, temporal, spatial-temporal
    • target oriented: e.g. genes, galaxies
    • Artificial key -> value
– Use of Distributed Hash Table [ ]
• Locally
– Indexing (R-Tree family, VA-File, VP-Tree) (Eduardo Valle - Tutorial - SBBD2009)
    • R-Tree - data partitioning in minimum bounding rectangles
    • VA-File - grid with binary indexes
    • VP-Tree - metric space data partitioning based on the choice
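The VP-Tree entry in the list above can be illustrated with a minimal sketch. It assumes a Euclidean metric, a randomly picked vantage point at each node and a median split on the distance to that point; none of these choices comes from the slides, and the names `build` and `nearest` are illustrative only.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

Point = Tuple[float, ...]
Metric = Callable[[Point, Point], float]


def euclidean(a: Point, b: Point) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


@dataclass
class VPNode:
    vantage: Point
    radius: float                # median distance from the vantage point
    inside: Optional["VPNode"]   # points at distance <= radius
    outside: Optional["VPNode"]  # points at distance >= radius


def build(points: Sequence[Point], dist: Metric = euclidean,
          rng: Optional[random.Random] = None) -> Optional[VPNode]:
    """Recursively partition the points around randomly chosen vantage points."""
    if not points:
        return None
    rng = rng or random.Random(0)
    pts = list(points)
    vantage = pts.pop(rng.randrange(len(pts)))
    if not pts:
        return VPNode(vantage, 0.0, None, None)
    pts.sort(key=lambda p: dist(vantage, p))
    mid = len(pts) // 2
    radius = dist(vantage, pts[mid])
    return VPNode(vantage, radius,
                  build(pts[:mid], dist, rng),
                  build(pts[mid:], dist, rng))


def nearest(node: Optional[VPNode], query: Point, dist: Metric = euclidean,
            best: Optional[Tuple[float, Point]] = None) -> Optional[Tuple[float, Point]]:
    """Return (distance, point) of the nearest neighbour of query."""
    if node is None:
        return best
    d = dist(query, node.vantage)
    if best is None or d < best[0]:
        best = (d, node.vantage)
    # Visit the side that contains the query first.
    if d < node.radius:
        first, second = node.inside, node.outside
    else:
        first, second = node.outside, node.inside
    best = nearest(first, query, dist, best)
    # Triangle inequality: the other side can only hold a closer point
    # if the current best ball crosses the partition boundary.
    if abs(d - node.radius) <= best[0]:
        best = nearest(second, query, dist, best)
    return best


points = [(0.0, 0.0), (1.0, 1.0), (2.0, 2.0), (5.0, 5.0), (9.0, 1.0)]
tree = build(points)
d, p = nearest(tree, (1.9, 2.2))
print(p)
```

Because the partitioning relies only on the triangle inequality, any metric can be passed as `dist`, which is what makes this family of indexes usable for non-vector scientific data.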
Data management
• To manage:
– Scientific model metadata
– Computational model invocation
– Output data
• To integrate:
– Metadata
– Raw data with metadata
– Data
• To hide:
– From scientists, the underlying complexity involved in setting up, running and integrating models
• To offer:
– A data-oriented, semantic based view on scientific models
Part III: Shedding light
Objective
• Support hypotheses based research by
– managing data and knowledge
– raising the level of abstraction
– automatically generating workflows
– managing intermediary and final results
The DiLeS approach
[Figure: Ontology, Hypotheses, Scientific Model, Computational Model, Simulation, Distributed data management]
A data-oriented declarative view of scientific models
The basis
• An ontology-based domain description guides modelling and experiments
• Annotated images enhance domain semantics
• Scientific models are represented through their data view (XML grounding)
• Experiments are queries whose results are managed by the system
• Complex data formats (matrices, graphs, 3D graphs, …)
• Integration/alignment of data and metadata
Hypotheses Relationships
Hypotheses model
• The choice between an ontological or a logical view on ontologies is open

Rain precip.          Prec.
25 mm                 0.95
14 mm                 0.8

Wind speed            Prec.
15’                   0.9
23’                   0.95

Coast-swell height    Prec.
4 m                   ??
2.5 m                 ??

Scientific model (SM)
• Specifies the research to be pursued to prove a hypothesis
– Specifies the domain of investigation
– References the bibliography
– Describes the mathematical equations modelling the studied phenomenon
SM data Model - Scientific Model
[Figure: class diagram with Hodgkin-Huxley (G_Na, G_K, G_L, E_Na, E_K, E_L, h), Neuron (LSID), CompartmentType (diameter, length, capacitance), Axon, Soma, Dendrite, CompartmentBehavior and MembraneChannel (totalIonicCurrent), linked by 1..* associations]
SM data Model - Scientific Model
Computational Model (CM)
• Describes the computational implementation of a SM
– Intensional view of a simulation
• Composes the metadata needed for running simulations
• Domain ontology (terminology) mapped into an XML view
– Mappings associate elements of the ontology to nodes and arrows of the XML view
• Exports a data view of a SM
– Input and output contextualized by the scientific domain (XML view)
– Description of the execution environment with pointers to required software and hardware
Computational Model
Simulation
• Is an experiment request (query)
• Fills in input parameters and input variables with experiment values
• Declaratively combines computational models with input data -> scientific workflow
• Output data is stored and managed for future analysis
Simulation query
• Uses a Datalog model
– Programs are n-ary predicates
    • Bound variables
    • Variables bound to input and output data
– Elegant model but needs extensions:
    • Integrate loops
        – Requires extensions to the syntax and semantics of the language
    • Integrate if-else conditions
SM data Model - Simulation query
Simulation query
Drug(D, P) := BLAST(X, M, P1) ∧ DR(M, P1, D, P)
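A minimal sketch of how such a rule might be evaluated, treating each program as a predicate whose input variables are bound and whose output variables are produced. The functions `blast` and `dr` are hypothetical stubs returning fixed rows loosely based on the slide's example; they do not call the real tools, and how DR derives P from P1 is not stated in the slides, so the stub simply looks up a fixed score.

```python
from typing import Dict, Iterator, List, Tuple

# Stub data (ASSUMPTION: illustrative values, not real program output).
BLAST_HITS: List[Tuple[str, float]] = [
    ("12AI,345GI", 0.99),
    ("123AD,222GI", 0.95),
    ("444TI,555TI", 0.9),
]
RESISTANCE: Dict[str, Tuple[str, float]] = {
    "12AI,345GI": ("drug1", 0.88),
    "123AD,222GI": ("drug2", 0.8),
    "444TI,555TI": ("drug3", 0.9),
}


def blast(x: str) -> Iterator[Tuple[str, float]]:
    """BLAST(X, M, P1): X is bound on input; yields bindings for (M, P1).

    The stub ignores x and returns the same hits for any query.
    """
    yield from BLAST_HITS


def dr(m: str, p1: float) -> Iterator[Tuple[str, float]]:
    """DR(M, P1, D, P): M and P1 are bound; yields bindings for (D, P)."""
    if m in RESISTANCE:
        yield RESISTANCE[m]


def drug(x: str) -> List[Tuple[str, float]]:
    """Drug(D, P) := BLAST(X, M, P1) ∧ DR(M, P1, D, P), evaluated left to right."""
    return [(d, p) for m, p1 in blast(x) for d, p in dr(m, p1)]


result = drug("atggaaaagg")
print(result)
```

The conjunction becomes a nested iteration in which the outputs of one program bind the inputs of the next, which is the sense in which a declarative query can be turned into a workflow.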
[Figure: the query sequence (atggaaaagg …) is run through Blast against GenBank, and the hits are matched with drug resistance data]
GenBank (sequence, gene): attgcc.., attggcc.., ccgttgcc.., Attgggcc.. (gene: pol)
Blast:
  attgccc    0.99    12AI,345GI,..
  Attggg…    0.95    123AD,222GI
  attgag     0.9     444TI,555TI
Drug resistance:
  drug1      0.88    12AI,345GI,..
  drug2      0.8     123AD,233GI
  drug3      0.9     444TI,556TD
Scientific Model Management Architecture (SMMA)
• To support:
– SM specification, modification and evolution
– SM sharing
– Scientific domain ontologies design and management
– Hypothesis and SM reasoning
– SM semantic search and metadata querying
– Simulation queries (workflows) evaluation and simulation results management (CoDIMS)
– Simulation results data querying
– Integrating/aligning SM data, metadata and/with domain ontologies
SMM architecture
[Figure: layered architecture]
Layers: User Layer, Service Layer, Metadata Management Layer, Data Management Layer
Components: Simulation Request, Data Analysis, Hypotheses reasoning, Scientific Model definition, Domain ontology definition, Simulation request Analysis & Optimization, Simulation Evaluation & Querying, Reasoning, Ontology to XML, Ontology to Ontology, Ontology to Data, Catalog management, Datatype plugins, Distributed Data Management