Scientific Workflow

Fabio Porto

[email protected]

http://dexl.lncc.br

Outline

• From wet lab to in-silico
• Scientific workflow concepts
• Scientific workflow model (introduction)
• Example

Scientific Workflow

15/05/12

From wet lab to in-silico research

• As science moves in-silico, the scientific life cycle must be supported;
• The availability of data enables multiple experiments and analyses;
• The scientific notebook becomes the fruit of provenance records;

In-silico science

• The overwhelming amount of produced data to be analysed
• Initial strategies based on Unix-like scripting languages
• Complexity in managing, transforming,

Scientific Workflows

A high-level description of the process used to carry out computational and analytical experiments

• Provide high-level modelling approaches for specifying analyses
• Offer generic services to optimize execution
• Manage intermediate results
• Hide distribution and parallelization
• Store provenance information

Scientific Workflow Life-cycle

Ludäscher et al. 2009, updated by F. Porto 2012

[Figure: life-cycle diagram. Phases: hypothesis and experiment goals; experiment and workflow design; workflow preparation; workflow execution; post-execution analysis. Supporting elements: workflow repository, data sources, provenance store, monitoring, hypotheses database.]

Phases

• Hypotheses Formulation
  – Phenomenon description
    • Observed physical quantities
    • Topology
  – Hypothesis description
    • Validation criteria
    • Constraints
    • Space-time, scale
• Workflow Design
  – Identify existing resources to be reused:
    • Workflow templates
    • Programs
  – Model activities and dependencies;
  – Define data requirements
    • Formats, location, selection criteria;
    • Specify data transformation procedures

Phases (cont.)

• Workflow Preparation
  – Choose data sources and parameter values;
  – Select execution environment;
  – Reserve physical resources (e.g. cluster);
  – Define execution model of activations;
• Workflow Execution
  – Run workflows;
  – Store provenance information;
  – Keep final and intermediate results;

Phases (cont.)

• Post-execution analysis
  – Does the result make sense?
    • Examine execution traces
  – Which results were tainted by this input dataset?
    • Data dependencies
  – Why did this step fail?
    • Debug runs
  – Which step took the longest time?
    • Performance
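The question "which results were tainted by this input dataset?" amounts to a reachability query over recorded data dependencies. The following is a minimal sketch, assuming provenance is available as simple (input, output) pairs; the function name and the dataset names are hypothetical and not taken from any particular workflow system.

```python
from collections import defaultdict, deque

def tainted_by(dependencies, source):
    """Return every data item derived, directly or indirectly, from `source`.

    `dependencies` is an iterable of (input_item, output_item) pairs,
    an assumed and simplified form of a provenance record.
    """
    derived = defaultdict(set)
    for used, produced in dependencies:
        derived[used].add(produced)

    # Breadth-first traversal of the "was derived from" graph.
    seen, queue = set(), deque([source])
    while queue:
        item = queue.popleft()
        for nxt in derived[item]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Hypothetical provenance records from one workflow run.
records = [
    ("raw.csv", "clean.csv"),
    ("clean.csv", "model.bin"),
    ("params.json", "model.bin"),
    ("model.bin", "report.pdf"),
    ("other.csv", "other_report.pdf"),
]
```

Calling `tainted_by(records, "raw.csv")` walks the dependencies downstream from the suspect dataset; the same structure read in the opposite direction would answer "what was this result derived from?".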

Roles

• A workflow involves different roles:
  – Domain scientists
    • Act as (high-level) workflow designers;
    • May act as workflow operators;
  – Workflow engineers
    • Implement new activities
    • Integrate the workflow into a particular workflow system
    • Analyse and fix bugs

Types of Scientific Workflows

• There is no agreed classification, but the following types are commonly distinguished:
  – Exploratory workflows
    • The design phase is very preliminary
    • The language should allow easy reformulation
    • Techniques to stop and analyse intermediate results
  – Production workflows
    • Detailed design phase
    • Less prone to modifications
    • Model a known scientific procedure or protocol
  – Science-oriented workflows
    • Reflect scientific experiments
  – Engineering workflows
    • Deal with data movement and job management

Concepts and System Functions

• Integrated Workflow Environment
  – Environment to support the scientific life cycle
  – Use a visual programming interface for workflow design
  – Provide libraries of existing components
  – Templates and workflows from workflow repositories

Concepts and System Functions

• Workflow Preparation and Execution Support
  – Support for parameter sweeps;
  – Smart rerun, avoiding costly recomputation;
  – Runtime monitoring;
  – Fault tolerance with "smart resume";
  – Mapping to distributed nodes;
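Parameter sweep and smart rerun can be illustrated together. The sketch below assumes that an activity is a pure function of its parameters, so a result can be reused whenever the same parameters reappear; `parameter_sweep`, `SmartRerun` and `simulate` are hypothetical names introduced only for this example.

```python
import itertools

def parameter_sweep(grid):
    """Yield one parameter dict per point of the Cartesian product of `grid`."""
    names = sorted(grid)
    for values in itertools.product(*(grid[n] for n in names)):
        yield dict(zip(names, values))

class SmartRerun:
    """Cache activity results by parameters so repeated runs skip recomputation."""

    def __init__(self, activity):
        self.activity = activity
        self.cache = {}
        self.executions = 0  # how many times the activity really ran

    def run(self, params):
        key = tuple(sorted(params.items()))
        if key not in self.cache:
            self.cache[key] = self.activity(**params)
            self.executions += 1
        return self.cache[key]

def simulate(alpha, steps):
    # Stand-in for an expensive program; purely illustrative.
    return alpha * steps

runner = SmartRerun(simulate)
grid = {"alpha": [0.1, 0.2], "steps": [10, 100]}
first = [runner.run(p) for p in parameter_sweep(grid)]
second = [runner.run(p) for p in parameter_sweep(grid)]  # served from the cache
```

A real system would also have to key the cache on input data versions and program versions, which this sketch leaves out.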

Concepts and System Functions

• Data-driven
  – Although workflow design focuses on processing steps, workflow computations are data driven
    • Without a workflow system, scientists spend their time reading, reformatting, transferring and saving datasets
    • Each step defines a data transformation activity
  – Dataflow-oriented workflows [Lee, Parks 1995, Dataflow Process Networks]
    • Emphasize the central role of data
    • Data flows through the workflow steps

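Process vs Data driven

[Figure: process-driven graph over steps A, B, C, D.] Step A is at the same time a processing step and a data splitting step.

[Figure: data-driven graph over steps A, B, C, D with flows s1, s2, s3, s4.] The flow graph specifies how data is guided through the processes.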

Workflows in different levels of abstraction (Ogasawara et al. 2009)

• Scientific workflows are usually defined at two levels of abstraction: abstract and concrete;
• Abstract: the specification of activities with no reference to physical resources or a particular implementation.
  – Workflow represented as a DAG of conceptual activities;
• Concrete: defines the technological characteristics and computational resources required to run the workflow;
  – Is a particular instantiation of an abstract workflow
  – Activities are named concrete activities or tasks
• workflow system supporting abstract workflow

Concrete vs abstract workflows (Taverna)

[Figure: (a) concrete workflow in Scufl (Taverna); (b) abstract workflow (GExpline). Labels: Sequence to align, Protein Type, Align Sequence, Alignment result.]

Implications of Abstract and Concrete Workflows

• The same abstract workflow may be implemented differently:
  – Changing tasks that implement activities
    • Implementing AlignSequence using another alignment tool;
    • Keep provenance information about the experiment evolution
  – Usually this is implemented by designing a new workflow
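The relation between one abstract workflow and several concrete ones can be sketched as a binding step. This is only an illustration under simple assumptions: the abstract workflow is a dictionary of activity names and dependencies, tasks are plain Python callables, and the two "alignment" functions are toy stand-ins, not real alignment tools.

```python
# Abstract workflow: activity names and their dependencies, no tools or resources.
abstract_workflow = {
    "AlignSequence": [],                 # no upstream activity
    "ReportAlignment": ["AlignSequence"],
}

def instantiate(abstract, bindings):
    """Bind every abstract activity to a concrete task (a callable)."""
    missing = set(abstract) - set(bindings)
    if missing:
        raise ValueError(f"no task bound for: {sorted(missing)}")
    return {name: (bindings[name], deps) for name, deps in abstract.items()}

# Two hypothetical alignment tasks; real ones would wrap external tools.
def align_exact(a, b):
    return sum(x == y for x, y in zip(a, b))

def align_case_insensitive(a, b):
    return sum(x.lower() == y.lower() for x, y in zip(a, b))

def report(score):
    return f"score={score}"

concrete_v1 = instantiate(
    abstract_workflow,
    {"AlignSequence": align_exact, "ReportAlignment": report},
)
concrete_v2 = instantiate(
    abstract_workflow,
    {"AlignSequence": align_case_insensitive, "ReportAlignment": report},
)
```

Both concrete workflows share the same activities and dependencies; only the task bound to AlignSequence differs, which is the kind of change that provenance about experiment evolution would need to record.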

Scientific Workflow Model

Dealing with heterogeneity

• Integration of autonomously defined programs poses constraints on task integration;
• Proposed solutions
  – Shims: processing steps that act as adaptors, changing the format of data between tasks with heterogeneous formats (Oinn et al. 2006, Taverna)
    • Control operator in QEF
  – Data model: make the data model explicit
    • Describe input and output data schemas;
    • Use mappings between schemas to handle data transformation declaratively
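A shim can be as small as a format converter placed between two tasks. The sketch below assumes an upstream task that emits CSV text and a downstream task that expects JSON; both tasks and the shim name are hypothetical.

```python
import csv
import io
import json

def task_a():
    """Hypothetical upstream task that emits CSV text."""
    return "id,length\ns1,120\ns2,87\n"

def task_b(payload):
    """Hypothetical downstream task that expects a JSON list of records."""
    records = json.loads(payload)
    return sum(r["length"] for r in records)

def csv_to_json_shim(text):
    """Shim: adapt task_a's CSV output to task_b's JSON input."""
    rows = csv.DictReader(io.StringIO(text))
    return json.dumps([{"id": r["id"], "length": int(r["length"])} for r in rows])

total = task_b(csv_to_json_shim(task_a()))
```

The shim carries no application logic of its own. With an explicit data model, the same conversion could instead be derived from a declared mapping between the two schemas.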

Scientific Workflow Model

• Current state:
  – No standard language for modelling scientific workflows
    • BPEL is used for business workflows
  – Data has no schema and is passed as files through activities
• General view:
  – A set of activities A = {a1, a2, …, an}
    • Activities are:
      – Data manipulation (e.g. transformation, formatting, …)
      – Data analyses
    • Partial order among activities
      – Modelled as a directed acyclic graph
      – Some scientific workflow systems support cycles: Kepler, QEF
  – Data are modelled as input/output of activities
    • A data unit is a set of elements that compose the input to an activity
      – Data unit list: DU = {du1, du2, …, duk}

Scientific Workflow Execution Model

• Each activity becomes a task
• The workflow engine executes an instance of the workflow
  – Schedules tasks based on the partial ordering among the corresponding activities
  – Evaluates a single data unit set (DU)
  – If the workflow system supports cycles
    • Parameter sweep, turbulence simulation, particle tracing
    • A single instance of the workflow evaluates a set of data unit sets
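Scheduling tasks according to the partial order among activities corresponds to a topological ordering of the DAG. The following is a minimal sequential sketch using Python's standard `graphlib` module (Python 3.9 or later); the `run_workflow` function, the activity names and the lambdas are assumptions made for illustration, and a real engine would also dispatch independent tasks in parallel.

```python
from graphlib import TopologicalSorter  # Python 3.9+

def run_workflow(activities, depends_on, data_unit):
    """Run each activity as a task, respecting the partial order.

    activities: name -> callable(inputs_dict) returning the task output
    depends_on: name -> list of upstream activity names
    data_unit:  value handed to activities that have no upstream activity
    """
    results = {}
    # static_order() yields every activity after all of its predecessors.
    for name in TopologicalSorter(depends_on).static_order():
        upstream = depends_on.get(name, [])
        inputs = {u: results[u] for u in upstream} or {"input": data_unit}
        results[name] = activities[name](inputs)
    return results

# Hypothetical three-step workflow.
activities = {
    "extract": lambda i: i["input"] * 2,
    "analyse": lambda i: i["extract"] + 1,
    "plot": lambda i: f"plot({i['analyse']})",
}
depends_on = {"extract": [], "analyse": ["extract"], "plot": ["analyse"]}

results = run_workflow(activities, depends_on, 10)
```

One call evaluates one data unit. Evaluating a set of data units, as in a parameter sweep, would repeat the call once per unit, for example `[run_workflow(activities, depends_on, du) for du in (1, 2, 3)]`.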

Risers Fatigue Analysis (RFA) scientific workflow

[Figure: RFA workflow. Nine activities exchange Riser Data (RD) sets and files through a shared disk.]

Input: compressed riser data (rd.zip) with environmental conditions, physical data and riser geometry. Output: RdResult.zip.

1. Extraction of Riser Data
   call ExtractRD(rd.zip)
2. Run Static Analysis Preprocessing
   For each rd in RD Set, x in sdat files, y in ddat files call PSRiser(rd, x, y)
3. Run Static Analysis
   For each rd in RD Set, x in sdat files, y in FTR files call SRiser(rd, x, y)
4. Run Dynamic Analysis Preprocessing
   For each rd in RD Set, x in ddat files, y in SiSai files, z in SsSai files, w in SAV files call PDRiser(rd, x, y, z, w)
5. Run Dynamic Analysis
   For each rd in RD Set, x in ddat files, y in DiSai files, z in FTE files, w in SAV files call Driser(rd, x, y, z, w)
6. Run Tension Analysis
   For each rd in RD Set, x in DdSai files, y in Env files call Tanalysis(rd, x, y)
7. Run Curvature Analysis
   For each rd in RD Set, z in SsSai files call Canalysis(rd, z)
8. Merge Data
   For each rdT in Accepted Tension RD Set, rdC in Accepted Curvature RD Set call Match(rdT, rdC)
9. Compression of Riser's Results
   call CompressRD(Merged Rd, SsSai files, DdSai files, MEnv files)

Data labelled on the edges of the figure: Static Data (sdat) files, Dynamic Data (ddat) files, SAV, SsSai, SiSai, DiSai, DdSai, Env, FTE and FTR files; RD Set, Accepted Tension RD Set, Accepted Curvature RD Set, Merged RD Set.

Scientific Workflow Systems

• VisTrails – scientific visualization, design tool, provenance management
• Kepler – high-level data model, design tool, execution model, provenance management
• Taverna – biology, design tools, web services, myExperiment, XScufl, provenance;
• Swift – biology, HPC, scripting language
• QEF – algebra-based scientific workflow

Taverna Scufl

• Taverna language components:
  – Inputs – entry points for the data for the workflow
  – Outputs – exit points for the data for the workflow
  – Processors – an individual step in the workflow (task)
    • Input ports and output ports
  – Data links – link data sources to data destinations
    • Data sources – workflow inputs or processor output ports
    • Data destinations – workflow outputs or processor input ports

Kepler (Altintas et al. 2006)

• Workflow language:
  – Actors – specify what processing occurs.
    • Kepler comes with more than 350 actors
    • Define the tasks to be done
    • User-defined actors can be added to the Kepler repository
  – Directors – specify when the processing occurs
    • Implement the computational model (e.g. synchronous, parallel)
    • Provide execution parameters
    • Example: SDF processes simple pipeline workflows; CT handles workflows that evolve as a continuous function of time
  – Composite/individual actors
    • Composite – subworkflows (reuse)
    • Individual – single task
  – Parameters – configure actor behavior
  – Relations – allow data to be sent to multiple consumers
• Multidisciplinary
• Design and execution of workflows
• Capture of provenance data, both prospective and retrospective

The Lotka-Volterra Model

Predator-prey

• Two equations modeling the change in the populations of prey and predator
  – dn1/dt = r*n1 - a*n1*n2
  – dn2/dt = -d*n2 + b*n1*n2
• The workflow:
  – 6 actors: 2 plotters, 2 equations and 2 integral functions
  – 1 director – defines execution parameters
    • time, step-size, max-iterations, ODESolver
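Outside a workflow system, the two equations can be integrated directly. The sketch below uses a classical fourth-order Runge-Kutta step as a stand-in for the ODE solver configured in the director; the parameter values, initial populations and step size are illustrative assumptions, not values from the Kepler example.

```python
def lotka_volterra(n1, n2, r, a, d, b):
    """Right-hand side of the two equations: returns (dn1/dt, dn2/dt)."""
    return r * n1 - a * n1 * n2, -d * n2 + b * n1 * n2

def rk4_step(n1, n2, dt, **p):
    """One classical Runge-Kutta step for the predator-prey system."""
    k1 = lotka_volterra(n1, n2, **p)
    k2 = lotka_volterra(n1 + dt / 2 * k1[0], n2 + dt / 2 * k1[1], **p)
    k3 = lotka_volterra(n1 + dt / 2 * k2[0], n2 + dt / 2 * k2[1], **p)
    k4 = lotka_volterra(n1 + dt * k3[0], n2 + dt * k3[1], **p)
    n1 += dt / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
    n2 += dt / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return n1, n2

def simulate(n1, n2, dt, max_iterations, **p):
    """Return the list of (prey, predator) pairs, starting with the initial state."""
    trajectory = [(n1, n2)]
    for _ in range(max_iterations):
        n1, n2 = rk4_step(n1, n2, dt, **p)
        trajectory.append((n1, n2))
    return trajectory

params = dict(r=1.0, a=0.1, d=1.0, b=0.05)  # illustrative values only
trajectory = simulate(n1=20.0, n2=5.0, dt=0.01, max_iterations=2000, **params)
```

The two derivative expressions play the role of the equation actors, `rk4_step` the role of the integrators, and `dt` and `max_iterations` mirror the step-size and max-iterations parameters held by the director; plotting `trajectory` would replace the two plotter actors.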

QEF (LNCC)

• Algebra-based workflow engine
• Components
  – Logical operators – implement application logic
  – Control operators – implement data manipulation
  – DataSources – implement input/output data
  – DataUnit – data communication between operators
• Parallelism, loops
• No design tool
• No provenance management

eScience 2009

Adaptive and Extensible Query Engine

• Extensible to data types
• Extensible to application algebra
• Extensible to execution model
• Schedule operations in grid nodes
• Adaptive execution model

Objective

• Offer a query processing framework that can be extended to adapt to data-centric grid application needs;
• Offer transparency in using resources to answer queries;
• Query optimization transparently introduced
• Standardize remote communication using web services even when dealing with large amounts of unstructured data
• Run-time performance monitoring and decision

Control Operators

• Add data-flow and transformation operators
• Isolate application-oriented operators from execution model data-flow concerns
• Parallel grid-based execution model:
  – Split/Merge - controls the routing of tuples to parallel nodes and the corresponding unification of multiple routes into a single flow
  – Send/Receive - marshalling/unmarshalling of tuples and interface with communication mechanisms
  – B2I/I2B - blocks and unblocks tuples
  – Orbit - implements a loop in a data-flow
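The roles of the Split/Merge and blocking operators can be pictured with plain lists. This is a sketch of the general idea only, not QEF code: round-robin routing is an assumed policy, and all function names are hypothetical.

```python
def split(tuples, n_routes):
    """Route tuples round-robin to n_routes parallel flows (assumed policy)."""
    routes = [[] for _ in range(n_routes)]
    for i, t in enumerate(tuples):
        routes[i % n_routes].append(t)
    return routes

def merge(routes):
    """Unify several routes into a single flow, route by route."""
    return [t for route in routes for t in route]

def block(tuples, size):
    """Group tuples into blocks of at most `size` before sending."""
    return [tuples[i:i + size] for i in range(0, len(tuples), size)]

def unblock(blocks):
    """Flatten received blocks back into single tuples."""
    return [t for b in blocks for t in b]

# Seven tuples processed by three parallel instances of a squaring operator.
routes = split(list(range(7)), 3)
processed = [[x * x for x in route] for route in routes]
merged = merge(processed)
```

Because these steps only move and regroup tuples, the application operator (here the squaring step) needs no knowledge of how many parallel routes exist, which is the isolation the slide describes.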

The Execution Model

Example of a simple QEF workflow

[Figure: example QEF workflow. Labels: data sources (input); output; operator, possibly distributed over a Grid environment; integration unit (tuple) containing data source units.]

Iteration Model

[Figure: operators A, B, C and a DataSource shown in three phases. OPEN, GETNEXT and CLOSE calls are each propagated through A, B and C to the DataSource; results are returned.]
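The OPEN / GETNEXT / CLOSE protocol in the figure is the classic iterator model of query execution, in which each operator pulls tuples from its child. A minimal sketch follows, assuming unary operators and an in-memory data source; the class names and per-tuple functions are hypothetical.

```python
class DataSource:
    """Leaf of the plan: serves tuples from an in-memory list."""

    def __init__(self, rows):
        self.rows = rows
        self.pos = None

    def open(self):
        self.pos = 0

    def get_next(self):
        if self.pos >= len(self.rows):
            return None  # end of stream
        row = self.rows[self.pos]
        self.pos += 1
        return row

    def close(self):
        self.pos = None

class Operator:
    """Unary operator that applies `fn` to each tuple pulled from `child`."""

    def __init__(self, child, fn):
        self.child, self.fn = child, fn

    def open(self):
        self.child.open()

    def get_next(self):
        row = self.child.get_next()
        return None if row is None else self.fn(row)

    def close(self):
        self.child.close()

def execute(root):
    """Drive the plan from the root: open, pull until exhausted, close."""
    root.open()
    results = []
    while (row := root.get_next()) is not None:
        results.append(row)
    root.close()
    return results

# Plan A <- B <- C <- DataSource, as in the figure; the functions are made up.
c = Operator(DataSource([1, 2, 3]), lambda x: x + 1)
b = Operator(c, lambda x: x * 10)
a = Operator(b, lambda x: f"t{x}")
```

Calling `execute(a)` opens the whole chain, pulls one tuple at a time through C, B and A, and closes the chain when the data source is exhausted.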

Distribution and Parallelization

Operator distribution

A query optimizer selects a set of operators in the QEP (query execution plan) to execute over a Grid environment.

[Figure: plan with operators A, B1, B2, B3, C and a DataSource; operator B runs as parallel instances B1 to B3.]

General Parallel Execution Model

Remote QEP

To parallelize an execution, the initial QEP is modified and sent to remote nodes to handle the distributed execution.

[Figure legend: control operator; distributed operator; R: Receiver; S: Sender; Sp: Split; Initial]

Modifying the IQEP to adapt to the execution model

[Figure: modified plan. Labelled elements: Particles, Geometry, Velocity, A (TCP), SJ, TJ, Orbit, Merge, Split, Send, Receive, B2I, I2B. Legend: local dataflow, remote dataflow, logical operator, control operator, control node, remote node i.]

The query optimizer adds control operators according to the execution model and IQEP statistics.

Grid node allocation algorithm (G2N)

Grid Greedy Node scheduling algorithm (G2N)

• Offers maximum usage of scheduled resources during query evaluation.
• Basic idea: "an optimal parallel allocation strategy for an independent query operator … is the one in which the computed elapsed-time of its execution is as close as possible to the maximum sequential time in each node evaluating an instance of the operator".

[Figure: operator A feeding instances Bn; t(Bn) is the cost of operator Bn on this node, with t1 + t2 = tx.]

Implementation

• Core development in Java 1.5.
• Globus Toolkit 4.
• Derby DBMS (catalog).
• Tomcat, AJAX and Google Web Toolkit for the user interface.
• Runs on Windows, Unix and Linux.
• Source code, demo and user guide available at:
