Scientific Workflow
Fabio Porto
http://dexl.lncc.br
Outline
• From wet lab to in silico
• Scientific Workflow Concepts
• Scientific Workflow Model (introduction)
• Example
Scientific Workflow
15/05/12 2
From wet lab to in-silico research
• As science moves in silico, the scientific life cycle must be supported;
• The availability of data enables multiple experiments and analyses;
• The scientific notebook is the fruit of provenance records;
In-silico science
• The overwhelming amount of produced data to be analysed
• Initial strategies based on Unix-like scripting languages
• Complexity in managing, transforming,
Scientific Workflows
A high-level description of the process used to carry out computational and analytical experiments
• Provide high-level modelling approaches for specifying analyses
• Offer generic services to optimize execution
• Manage intermediate results
• Hide distribution and parallelization
• Store provenance information
Scientific Workflow Life-cycle
Ludäscher et al. 2009, updated by F. Porto 2012
[Figure: life-cycle phases: Hypothesis, experiment goals; Experiment, workflow design; Workflow preparation; Workflow execution; Post-execution analysis. Supporting components: workflow repository, data sources, provenance store, monitoring, hypotheses database.]
Phases
• Hypotheses Formulation
– Phenomenon description
• Observed physical quantities
• Topology
– Hypothesis description
• Validation criteria
• Constraints
• Space-time, scale
• Workflow Design
– Identify existing resources to be reused:
• Workflow templates
• Programs
– Model activities and dependencies;
– Define data requirements
• Formats, location, selection criteria;
• Specify data transformation procedures
Phases (cont.)
• Workflow Preparation
– Choose data sources and parameter values;
– Select execution environment;
– Reserve physical resources (e.g. cluster);
– Define execution model of activations;
• Workflow Execution
– Run workflows;
– Store provenance information;
– Keep final and intermediate results;
Phases (cont.)
• Post-execution analysis
– Does the result make sense?
• Examine execution traces
– Which results were tainted by this input dataset?
• Data dependencies
– Why did this step fail?
• Debug runs
– Which step took the longest time?
• Performance
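The "tainted results" question above is, in essence, a reachability query over recorded data dependencies. A minimal sketch in Python, assuming provenance is kept as a simple mapping from each data item to the items derived from it; the store layout and all file names are hypothetical:

```python
from collections import deque

def tainted_by(derived_from, source):
    """Return every data item that depends, directly or indirectly, on source.

    derived_from maps a data item to the items computed from it
    (a hypothetical, simplified provenance store).
    """
    seen = set()
    queue = deque([source])
    while queue:
        item = queue.popleft()
        for child in derived_from.get(item, ()):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# Hypothetical provenance records: input -> outputs derived from it.
provenance = {
    "raw.csv": ["clean.csv"],
    "clean.csv": ["stats.json", "plot.png"],
    "other.csv": ["report.txt"],
}
print(sorted(tainted_by(provenance, "raw.csv")))
```

Real provenance stores record far more (activity, parameters, timestamps); the sketch only illustrates the dependency traversal.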
Roles
• A workflow involves different roles:
– Domain scientists
• Act as (high-level) workflow designers;
• May act as workflow operators;
– Workflow Engineers
• Implement new activities
• Integrate the workflow into a particular workflow system
• Analyse and fix bugs
Types of Scientific Workflows
• There is no agreed classification, but the following types are commonly distinguished:
– Exploratory workflows
• The design phase is very preliminary
• The language should allow easy reformulation
• Techniques to stop and analyse intermediate results
– Production workflows
• Detailed design phase
• Less prone to modifications
• Model a known scientific procedure or protocol
– Science-oriented workflows
• Reflect scientific experiments
– Engineering workflows
• Deal with data movement and job management
Concepts and System Functions
• Integrated Workflow Environment
– Environment to support the scientific life cycle
– Use a visual programming interface for workflow design
– Provide libraries of existing components
– Templates and workflows from workflow repositories
Concepts and System Functions
• Workflow Preparation and Execution Support
– Support for parameter sweeps;
– Smart rerun, avoiding costly recomputation;
– Runtime monitoring;
– Fault tolerance with "smart resume";
– Mapping to distributed nodes;
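Parameter sweep and smart rerun can be illustrated together: a sweep enumerates parameter combinations, and a rerun recomputes only the combinations not seen before. A minimal sketch in Python, assuming results can be cached in memory by parameter values; the `simulate` task and its parameters are hypothetical:

```python
import itertools

def sweep(task, grid, cache):
    """Run task for every parameter combination, reusing cached results."""
    names = sorted(grid)
    results = {}
    for values in itertools.product(*(grid[n] for n in names)):
        key = tuple(zip(names, values))
        if key not in cache:  # smart rerun: compute only what is missing
            cache[key] = task(**dict(key))
        results[key] = cache[key]
    return results

calls = []  # records which combinations were actually computed

def simulate(step, viscosity):
    """Hypothetical costly task."""
    calls.append((step, viscosity))
    return step * viscosity

cache = {}
sweep(simulate, {"step": [1, 2], "viscosity": [10, 20]}, cache)
# Extending the sweep only computes the two new combinations (step=3).
sweep(simulate, {"step": [1, 2, 3], "viscosity": [10, 20]}, cache)
print(len(calls))
```

A workflow system would typically persist such a cache and key it on input data and program versions as well, which this sketch leaves out.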
Concepts and System Functions
• Data-driven
– Although workflow design focuses on processing steps, workflow computations are data driven
• Without a workflow system, scientists spend their time
– Reading, reformatting, transferring and saving datasets
• Each step defines a data transformation activity
– Dataflow-oriented workflows [Lee and Parks 1995, Dataflow Process Networks]
• Emphasize the central role of data
• Data flows through the workflow steps
Process vs Data driven
[Figure: process graph over steps A, B, C, D. Step A is at the same time a processing step and a data splitting step.]
[Figure: flow graph over steps A, B, C, D with data s1, s2, s3, s4. The flow graph specifies how data is guided through the processes.]
Workflows in different levels of abstraction (Ogasawara et al 2009)
• Scientific workflows are usually defined at two levels of abstraction: abstract and concrete;
• Abstract – refers to the specification of activities with no reference to physical resources or a particular implementation.
– Workflow represented as a DAG of conceptual activities;
• Concrete – defines the technological characteristics and computational resources required to run the workflow;
– Is a particular instantiation of an abstract workflow
– Activities are named concrete activities or tasks
• Workflow system supporting abstract workflows
Concrete vs abstract workflows (Taverna)
[Figure: a) Concrete workflow in Scufl (Taverna); b) Abstract workflow (GExpline). Labels: Sequence to align, Protein Type, Align Sequence, Alignment result.]
Implications of Abstract and Concrete Workflows
• The same abstract workflow may be implemented differently:
– Changing tasks that implement activities
• Implementing AlignSequence using another alignment tool;
• Keep provenance information about the experiment evolution
– Usually this is implemented by designing a new workflow
Scientific Workflow Model
Dealing with heterogeneity
• Integration of autonomously defined programs poses constraints on task integration;
• Proposed solutions
– Shims – processing steps that act as adaptors, changing the format of data between tasks with heterogeneous formats (Oinn et al. 2006 – Taverna)
• Control operator in QEF
– Data Model – make the data model explicit
• Describe input and output data schemas;
• Use mappings between schemas to handle data transformation declaratively
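A shim, as described above, is just an adaptor step placed between two tasks whose formats do not match. A minimal sketch in Python, assuming one task emits comma-separated text and the next expects typed records; both tasks and their formats are hypothetical and not taken from Taverna or QEF:

```python
import csv
import io

def producer():
    """Hypothetical task emitting comma-separated text."""
    return "id,length\ns1,120\ns2,98\n"

def consumer(records):
    """Hypothetical task expecting a list of dicts with integer lengths."""
    return sum(r["length"] for r in records)

def csv_to_records(text):
    """Shim: adapts the producer's output format to the consumer's input format."""
    rows = csv.DictReader(io.StringIO(text))
    return [{"id": row["id"], "length": int(row["length"])} for row in rows]

total = consumer(csv_to_records(producer()))
print(total)
```

The declarative alternative mentioned above would replace the hand-written `csv_to_records` with a mapping between the two schemas.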
Scientific Workflow Model
• Current State:
– No standard language for modelling scientific workflows
• BPEL – used for business workflows
– Data has no schema and is passed as files between activities
• General View:
– A set of activities A={a1, a2, …, an}
• Activities are:
– Data manipulation (e.g. transformation, formatting, …)
– Data analyses
• Partial order among activities
– Modelled as a directed acyclic graph
• Some scientific workflow systems support cycles: Kepler, QEF
– Data are modelled as input/output of activities
• A Data Unit is a set of elements that compose the input to an activity
– Data Unit list - DU={du1, du2, …, duk}
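The general view above (a set of activities, a partial order modelled as a DAG, data units as inputs) can be sketched in a few lines. This is an illustration in Python 3.9+ using the standard `graphlib` module, not the model of any particular workflow system; the activity names a1..a4 and their bodies are hypothetical:

```python
from graphlib import TopologicalSorter

# Partial order: each activity maps to the set of activities it depends on.
depends_on = {
    "a1": set(),
    "a2": {"a1"},
    "a3": {"a1"},
    "a4": {"a2", "a3"},
}

# Hypothetical activities; a source activity receives the data unit itself.
activities = {
    "a1": lambda du: [x + 1 for x in du],
    "a2": lambda xs: [x * 2 for x in xs],
    "a3": lambda xs: [x * 3 for x in xs],
    "a4": lambda left, right: [a + b for a, b in zip(left, right)],
}

def run_workflow(depends_on, activities, data_unit):
    """Execute each activity once, in an order compatible with the partial order."""
    outputs = {}
    # static_order() raises graphlib.CycleError if the graph is not acyclic.
    order = list(TopologicalSorter(depends_on).static_order())
    for name in order:
        preds = sorted(depends_on[name])
        inputs = [outputs[p] for p in preds] if preds else [data_unit]
        outputs[name] = activities[name](*inputs)
    return order, outputs

order, outputs = run_workflow(depends_on, activities, [1, 2, 3])
print(order[0], order[-1], outputs["a4"])
```

Systems that support cycles, such as Kepler and QEF, need a different execution strategy than a single topological pass.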
Scientific Workflow Execution Model
• Each activity becomes a Task
• The workflow engine executes an instance of the workflow
– Schedules tasks based on the partial ordering among the corresponding activities
– Evaluates a single data unit set (DU)
– If the workflow system supports cycles
• Parameter sweep, turbulence simulation, particle tracing
• A single instance of the workflow evaluates a set of data unit sets
Risers Fatigue Analysis (RFA) scientific workflow
[Figure: RFA workflow; activities exchange files through a shared disk. Steps recovered from the figure:]
1. Extraction of Riser Data
call ExtractRD(rd.zip)
2. Run Static Analysis Preprocessing
For each rd in RD Set, x in sdat files, y in ddat files call PSRiser(rd, x, y)
3. Run Static Analysis
For each rd in RD Set, x in sdat files, y in FTR files call SRiser(rd, x, y)
4. Run Dynamic Analysis Preprocessing
For each rd in RD Set, x in ddat files, y in SiSai files, z in SsSai files, w in SAV files call PDRiser(rd, x, y, z, w)
5. Run Dynamic Analysis
For each rd in RD Set, x in ddat files, y in DiSai files, z in FTE files, w in SAV files call Driser(rd, x, y, z, w)
6. Run Tension Analysis
For each rd in RD Set, x in DdSai files, y in Env files call Tanalysis(rd, x, y)
7. Run Curvature Analysis
For each rd in RD Set, z in SsSai files call Canalysis(rd, z)
8. Merge Data
For each rdT in Accepted Tension RD Set, rdC in Accepted Curvature RD Set call Match(rdT, rdC)
9. Compression of Riser's Results
call CompressRD(Merged Rd, SsSai files, DdSai files, MEnv files)
Figure labels (files): Compressed Riser Data (rd.zip) (environmental conditions, physical data and riser geometry); Static Data (sdat) files; Dynamic Data (ddat) files; SAV files; SsSai files; SiSai files; DiSai files; DdSai files; Env files; FTE files; FTR files; RdResult.zip
Figure labels (sets): Riser Data (RD) set; Accepted Tension RD Set; Accepted Curvature RD Set; Merged RD Set
Scientific Workflow Systems
• VisTrails – scientific visualization, design tool, provenance management
• Kepler – high-level data model, design tool, execution model, provenance management
• Taverna – biology, design tools, web services, myExperiment, XScufl, provenance
• Swift – biology, HPC, scripting language
• QEF – algebra-based scientific workflow
Taverna Scufl
• Taverna language components:
– Inputs – entry points for the data for the workflow
– Outputs – exit points for the data for the workflow
– Processors – an individual step in the workflow (task)
• Input ports and output ports
– Data links – link data sources to data destinations
• Data sources – workflow inputs or processor output ports
• Data destinations – workflow outputs or processor input ports
Kepler (Altintas et al 2006)
• Workflow language:
– Actors – specify what processing occurs
• Kepler comes with more than 350 actors
• Define the tasks to be done
• User-defined actors can be added to the Kepler repository
– Directors – specify when the processing occurs
• Implement a computational model (e.g. synchronous, parallel)
• Provide execution parameters
• Example: SDF – processes simple pipeline-type workflows; CT – workflows that evolve as a continuous function of time
– Composite/individual Actors
• Composite – subworkflows (reuse)
• Individual – single task
– Parameters – configure actor behavior
– Relation – allows data to be sent to multiple consumers
• Multidisciplinary
• Design and execution of workflows
• Capture of provenance data, both prospective and retrospective
The Lotka-Volterra Model
Predator-prey
• Two equations modelling the change in the populations of prey (n1) and predator (n2)
– dn1/dt = r*n1 - a*n1*n2
– dn2/dt = -d*n2 + b*n1*n2
• The workflow:
– 6 actors = 2 plotters, 2 equations and 2 integral functions
– 1 director – defines execution parameters
• time, step-size, max-iterations, ODESolver
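To make the two equations and the director's parameters (step size, max iterations) concrete, here is a minimal sketch in Python. It uses fixed-step forward Euler as a stand-in for whatever ODE solver the Kepler director is configured with, and all parameter values are illustrative assumptions:

```python
def lotka_volterra(n1, n2, r, a, d, b, step, iterations):
    """Integrate the predator-prey equations with fixed-step forward Euler.

    n1 is the prey population, n2 the predator population.
    Returns the list of (n1, n2) states, including the initial one.
    """
    trajectory = [(n1, n2)]
    for _ in range(iterations):
        dn1 = r * n1 - a * n1 * n2    # dn1/dt = r*n1 - a*n1*n2
        dn2 = -d * n2 + b * n1 * n2   # dn2/dt = -d*n2 + b*n1*n2
        n1, n2 = n1 + step * dn1, n2 + step * dn2
        trajectory.append((n1, n2))
    return trajectory

# Illustrative values only; not taken from the Kepler example.
traj = lotka_volterra(2.0, 1.0, r=1.0, a=0.5, d=1.0, b=0.5, step=0.01, iterations=1000)
print(traj[-1])
```

Forward Euler drifts on this system over long runs, so a real experiment would use a higher-order solver; the sketch only shows how the equations map onto an iteration loop.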
QEF (LNCC)
• Algebra-based workflow engine
• Components
– Logical operators – implement application logic
– Control operators – implement data manipulation
– DataSources – implement input/output data
– DataUnit – data communication between operators
• Parallelism, loops
• No design tool
• No provenance management
eScience 2009
Adaptive and Extensible Query Engine
• Extensible to data types
• Extensible to application algebra
• Extensible to execution model
• Schedules operations on grid nodes
• Adaptive execution model
Objective
• Offer a query processing framework that can be extended to adapt to data-centric grid application needs;
• Offer transparency in using resources to answer queries;
• Query optimization transparently introduced
• Standardize remote communication using web services, even when dealing with large amounts of unstructured data
• Run-time performance monitoring and decision
Control Operators
• Add data-flow and transformation operators
• Isolate application-oriented operators from execution model data-flow concerns
• Parallel grid-based execution model:
• Split/Merge - controls the routing of tuples to parallel nodes and the corresponding unification of multiple routes into a single flow
• Send/Receive - marshalling/unmarshalling of tuples and interface with communication mechanisms
• B2I/I2B - blocks and unblocks tuples
• Orbit - implements a loop in a data-flow
The Execution Model
[Figure: example of a simple QEF workflow. Labels: data sources (input); output operator; possibly distributed over a Grid environment; integration unit (tuple) containing data source units.]
Iteration Model
[Figure: operators A, B, C over a DataSource, shown in three phases: OPEN, GETNEXT and CLOSE, each passed along A, B, C to the DataSource. Label: Results.]
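The OPEN, GETNEXT and CLOSE labels above suggest the classical pull-based iterator model, in which each operator forwards the call to its producer. A minimal sketch in Python, assuming that reading of the figure; the class names and the per-tuple functions are hypothetical and not QEF's actual API:

```python
class DataSource:
    """Leaf of the pipeline: serves tuples from an in-memory list (hypothetical)."""
    def __init__(self, rows):
        self.rows = rows
        self.pos = None

    def open(self):
        self.pos = 0

    def get_next(self):
        if self.pos >= len(self.rows):
            return None  # end of stream
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        self.pos = None


class Operator:
    """Applies fn to each tuple pulled from its producer."""
    def __init__(self, producer, fn):
        self.producer = producer
        self.fn = fn

    def open(self):
        self.producer.open()

    def get_next(self):
        row = self.producer.get_next()
        return None if row is None else self.fn(row)

    def close(self):
        self.producer.close()


def run(root):
    """Drive the pipeline from the root: OPEN once, GETNEXT until exhausted, CLOSE."""
    root.open()
    results = []
    while (row := root.get_next()) is not None:
        results.append(row)
    root.close()
    return results


# Chain A <- B <- C <- DataSource, as in the figure.
c = Operator(DataSource([1, 2, 3]), lambda x: x + 1)
b = Operator(c, lambda x: x * 10)
a = Operator(b, str)
print(run(a))
```

Because each call pulls one tuple through the whole chain, no operator needs to materialize its full input.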
Distribution and Parallelization
Operator distribution
A Query Optimizer selects a set of operators in the QEP to execute over a Grid environment.
[Figure: operator A; parallel instances B1, B2, B3; operator C; DataSource.]
General Parallel Execution Model
Remote QEP
In order to parallelize an execution, the initial QEP is modified and sent to remote nodes to handle the distributed execution.
[Figure legend: control operator; distributed operator; R: Receiver; S: Sender; Sp: Split; Initial]
Modifying IQEP to adapt to execution model
The query optimizer adds control operators according to the execution model and IQEP statistics.
[Figure: operators Particles, Geometry, Velocity, A (TCP), SJ, TJ, Orbit, Merge, Split, Send, Receive, B2I, I2B. Legend: local dataflow; remote dataflow; logical operator; control operator; control node; remote node i.]
Grid node allocation algorithm (G2N)
Grid Greedy Node scheduling algorithm (G2N)
• Offers maximum usage of scheduled resources during query evaluation.
• Basic idea: "an optimal parallel allocation strategy for an independent query operator … is the one in which the computed elapsed-time of its execution is as close as possible to the maximum sequential time in each node evaluating an instance of the operator".
[Figure: operator A and operator instance Bn; t1 + t2 = tx; t(Bn): operator cost on this node Bn]
Implementation
• Core development in Java 1.5.
• Globus Toolkit 4.
• Derby DBMS (catalog).
• Tomcat, AJAX and Google Web Toolkit for user interface.
• Runs on Windows, Unix and Linux.
• Source code, demo, user guide available at: