TY - GEN
T1 - Design and implementation of a fault tolerant job flow manager using job flow patterns and recovery policies
AU - Kalayci, Selim
AU - Ezenwoye, Onyeka
AU - Viswanathan, Balaji
AU - Dasgupta, Gargi
AU - Sadjadi, S. Masoud
AU - Fong, Liana
PY - 2008
Y1 - 2008
N2 - Currently, many grid applications are developed as job flows that are composed of multiple jobs. The execution of job flows requires the support of a job flow manager and a job scheduler. Due to the long-running nature of job flows, the support for fault tolerance and recovery policies is especially important. This support is inherently complicated due to the sequencing and dependency of jobs within a flow, and the required coordination between workflow engines and job schedulers. In this paper, we describe the design and implementation of a job flow manager that supports fault tolerance. First, we identify and label job flow patterns within a job flow during deployment time. Next, at runtime, we introduce a proxy that intercepts and resolves faults using job flow patterns and their corresponding fault-recovery policies. Our design has the advantages of separation of the job flow and fault handling logic, requiring no manipulation at the modeling time, and providing flexibility with respect to fault resolution at runtime. We validate our design with a prototypical implementation based on the ActiveBPEL workflow engine and the GridWay Meta-scheduler, with the Montage application as the case study.
AB - Currently, many grid applications are developed as job flows that are composed of multiple jobs. The execution of job flows requires the support of a job flow manager and a job scheduler. Due to the long-running nature of job flows, the support for fault tolerance and recovery policies is especially important. This support is inherently complicated due to the sequencing and dependency of jobs within a flow, and the required coordination between workflow engines and job schedulers. In this paper, we describe the design and implementation of a job flow manager that supports fault tolerance. First, we identify and label job flow patterns within a job flow during deployment time. Next, at runtime, we introduce a proxy that intercepts and resolves faults using job flow patterns and their corresponding fault-recovery policies. Our design has the advantages of separation of the job flow and fault handling logic, requiring no manipulation at the modeling time, and providing flexibility with respect to fault resolution at runtime. We validate our design with a prototypical implementation based on the ActiveBPEL workflow engine and the GridWay Meta-scheduler, with the Montage application as the case study.
UR - http://www.scopus.com/inward/record.url?scp=58049109366&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=58049109366&partnerID=8YFLogxK
U2 - 10.1007/978-3-540-89652-4_8
DO - 10.1007/978-3-540-89652-4_8
M3 - Conference contribution
AN - SCOPUS:58049109366
SN - 3540896473
SN - 9783540896470
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 54
EP - 69
BT - Service-Oriented Computing - ICSOC 2008 - 6th International Conference, Proceedings
PB - Springer Verlag
T2 - 6th International Conference on Service-Oriented Computing, ICSOC 2008
Y2 - 1 December 2008 through 5 December 2008
ER -