Design and implementation of a fault tolerant job flow manager using job flow patterns and recovery policies

Selim Kalayci; Onyeka Ezenwoye; Balaji Viswanathan; Gargi Dasgupta; S. Masoud Sadjadi; Liana Fong

doi:10.1007/978-3-540-89652-4_8

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

7 Scopus citations

Abstract

Many grid applications are currently developed as job flows composed of multiple jobs. Executing a job flow requires the support of a job flow manager and a job scheduler. Because job flows are long-running, support for fault tolerance and recovery policies is especially important. This support is inherently complicated by the sequencing and dependencies of jobs within a flow and by the coordination required between workflow engines and job schedulers. In this paper, we describe the design and implementation of a job flow manager that supports fault tolerance. First, we identify and label job flow patterns within a job flow at deployment time. Next, at runtime, we introduce a proxy that intercepts and resolves faults using job flow patterns and their corresponding fault-recovery policies. Our design has three advantages: it separates the job flow logic from the fault-handling logic, requires no manipulation at modeling time, and provides flexibility in fault resolution at runtime. We validate our design with a prototype implementation based on the ActiveBPEL workflow engine and the GridWay Meta-scheduler, with the Montage application as the case study.
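
The two-step approach in the abstract (label patterns at deployment time, then let a proxy resolve faults by pattern-specific policy at runtime) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and is not taken from the paper: the pattern labels "sequence" and "fan_in", the retry-count policy table, the JobFault exception, and the stand-in scheduler. The actual system is built on ActiveBPEL and GridWay, whose interfaces are not modeled.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


class JobFault(Exception):
    """Hypothetical fault raised by the stand-in scheduler when a job fails."""


@dataclass
class Job:
    name: str
    pattern: str  # label assigned at deployment time


def label_patterns(flow: Dict[str, List[str]]) -> Dict[str, Job]:
    """Deployment-time step: tag each job with a coarse flow pattern.

    `flow` maps a job name to the names of the jobs it depends on.
    The labels "sequence" and "fan_in" are illustrative only.
    """
    return {
        name: Job(name, "fan_in" if len(deps) > 1 else "sequence")
        for name, deps in flow.items()
    }


# Hypothetical policy table: pattern label -> allowed resubmissions.
RETRY_POLICY = {"sequence": 2, "fan_in": 1}


class FaultProxy:
    """Runtime step: sits between the flow engine and the scheduler."""

    def __init__(self, submit: Callable[[str], str], policy: Dict[str, int]):
        self._submit = submit
        self._policy = policy

    def run(self, job: Job) -> str:
        retries_left = self._policy.get(job.pattern, 0)
        while True:
            try:
                return self._submit(job.name)
            except JobFault:
                if retries_left == 0:
                    raise  # policy exhausted: surface the fault to the engine
                retries_left -= 1  # resubmit according to the pattern's policy


def make_flaky_scheduler(failures: int) -> Callable[[str], str]:
    """Stand-in scheduler that fails `failures` times, then succeeds."""
    remaining = {"n": failures}

    def submit(job_name: str) -> str:
        if remaining["n"] > 0:
            remaining["n"] -= 1
            raise JobFault(job_name)
        return f"{job_name}: done"

    return submit
```

Because the policy table lives in the proxy rather than in the flow definition, the flow description itself stays free of fault-handling logic, which mirrors the separation the abstract claims as an advantage.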

Original language	English (US)
Title of host publication	Service-Oriented Computing - ICSOC 2008 - 6th International Conference, Proceedings
Publisher	Springer Verlag
Pages	54-69
Number of pages	16
ISBN (Print)	3540896473, 9783540896470
DOIs	https://doi.org/10.1007/978-3-540-89652-4_8
State	Published - 2008
Externally published	Yes
Event	6th International Conference on Service-Oriented Computing, ICSOC 2008 - Sydney, Australia
Duration	Dec 1 2008 → Dec 5 2008

Publication series

Name	Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume	5364 LNCS
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Other

Other	6th International Conference on Service-Oriented Computing, ICSOC 2008
Country/Territory	Australia
City	Sydney
Period	12/1/08 → 12/5/08

ASJC Scopus subject areas

Theoretical Computer Science
General Computer Science

Access to Document

10.1007/978-3-540-89652-4_8

Cite this

Kalayci, S., Ezenwoye, O., Viswanathan, B., Dasgupta, G., Sadjadi, S. M., & Fong, L. (2008). Design and implementation of a fault tolerant job flow manager using job flow patterns and recovery policies. In Service-Oriented Computing - ICSOC 2008 - 6th International Conference, Proceedings (pp. 54-69). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5364 LNCS). Springer Verlag. https://doi.org/10.1007/978-3-540-89652-4_8

Design and implementation of a fault tolerant job flow manager using job flow patterns and recovery policies. / Kalayci, Selim; Ezenwoye, Onyeka; Viswanathan, Balaji et al.
Service-Oriented Computing - ICSOC 2008 - 6th International Conference, Proceedings. Springer Verlag, 2008. p. 54-69 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5364 LNCS).

Kalayci, S, Ezenwoye, O, Viswanathan, B, Dasgupta, G, Sadjadi, SM & Fong, L 2008, Design and implementation of a fault tolerant job flow manager using job flow patterns and recovery policies. in Service-Oriented Computing - ICSOC 2008 - 6th International Conference, Proceedings. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 5364 LNCS, Springer Verlag, pp. 54-69, 6th International Conference on Service-Oriented Computing, ICSOC 2008, Sydney, Australia, 12/1/08. https://doi.org/10.1007/978-3-540-89652-4_8

Kalayci S, Ezenwoye O, Viswanathan B, Dasgupta G, Sadjadi SM, Fong L. Design and implementation of a fault tolerant job flow manager using job flow patterns and recovery policies. In Service-Oriented Computing - ICSOC 2008 - 6th International Conference, Proceedings. Springer Verlag. 2008. p. 54-69. (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)). doi: 10.1007/978-3-540-89652-4_8

Kalayci, Selim ; Ezenwoye, Onyeka ; Viswanathan, Balaji et al. / Design and implementation of a fault tolerant job flow manager using job flow patterns and recovery policies. Service-Oriented Computing - ICSOC 2008 - 6th International Conference, Proceedings. Springer Verlag, 2008. pp. 54-69 (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)).

@inproceedings{eb48e2817d63480fafbff6cb9e5dbac0,
title = "Design and implementation of a fault tolerant job flow manager using job flow patterns and recovery policies",
abstract = "Currently, many grid applications are developed as job flows that are composed of multiple jobs. The execution of job flows requires the support of a job flow manager and a job scheduler. Due to the long running nature of job flows, the support for fault tolerance and recovery policies is especially important. This support is inherently complicated due to the sequencing and dependency of jobs within a flow, and the required coordination between workflow engines and job schedulers. In this paper, we describe the design and implementation of a job flow manager that supports fault tolerance. First, we identify and label job flow patterns within a job flow during deployment time. Next, at runtime, we introduce a proxy that intercepts and resolves faults using job flow patterns and their corresponding fault-recovery policies. Our design has the advantages of separation of the job flow and fault handling logic, requiring no manipulation at the modeling time, and providing flexibility with respect to fault resolution at runtime. We validate our design with a prototypical implementation based on the ActiveBPEL workflow engine and GridWay Meta-scheduler, and Montage application as the case study.",
author = "Selim Kalayci and Onyeka Ezenwoye and Balaji Viswanathan and Gargi Dasgupta and Sadjadi, {S. Masoud} and Liana Fong",
year = "2008",
doi = "10.1007/978-3-540-89652-4_8",
language = "English (US)",
isbn = "3540896473",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
publisher = "Springer Verlag",
pages = "54--69",
booktitle = "Service-Oriented Computing - ICSOC 2008 - 6th International Conference, Proceedings",
note = "6th International Conference on Service-Oriented Computing, ICSOC 2008 ; Conference date: 01-12-2008 Through 05-12-2008",
}

TY  - GEN
T1  - Design and implementation of a fault tolerant job flow manager using job flow patterns and recovery policies
AU  - Kalayci, Selim
AU  - Ezenwoye, Onyeka
AU  - Viswanathan, Balaji
AU  - Dasgupta, Gargi
AU  - Sadjadi, S. Masoud
AU  - Fong, Liana
PY  - 2008
Y1  - 2008
N2  - Currently, many grid applications are developed as job flows that are composed of multiple jobs. The execution of job flows requires the support of a job flow manager and a job scheduler. Due to the long running nature of job flows, the support for fault tolerance and recovery policies is especially important. This support is inherently complicated due to the sequencing and dependency of jobs within a flow, and the required coordination between workflow engines and job schedulers. In this paper, we describe the design and implementation of a job flow manager that supports fault tolerance. First, we identify and label job flow patterns within a job flow during deployment time. Next, at runtime, we introduce a proxy that intercepts and resolves faults using job flow patterns and their corresponding fault-recovery policies. Our design has the advantages of separation of the job flow and fault handling logic, requiring no manipulation at the modeling time, and providing flexibility with respect to fault resolution at runtime. We validate our design with a prototypical implementation based on the ActiveBPEL workflow engine and GridWay Meta-scheduler, and Montage application as the case study.
AB  - Currently, many grid applications are developed as job flows that are composed of multiple jobs. The execution of job flows requires the support of a job flow manager and a job scheduler. Due to the long running nature of job flows, the support for fault tolerance and recovery policies is especially important. This support is inherently complicated due to the sequencing and dependency of jobs within a flow, and the required coordination between workflow engines and job schedulers. In this paper, we describe the design and implementation of a job flow manager that supports fault tolerance. First, we identify and label job flow patterns within a job flow during deployment time. Next, at runtime, we introduce a proxy that intercepts and resolves faults using job flow patterns and their corresponding fault-recovery policies. Our design has the advantages of separation of the job flow and fault handling logic, requiring no manipulation at the modeling time, and providing flexibility with respect to fault resolution at runtime. We validate our design with a prototypical implementation based on the ActiveBPEL workflow engine and GridWay Meta-scheduler, and Montage application as the case study.
UR  - http://www.scopus.com/inward/record.url?scp=58049109366&partnerID=8YFLogxK
UR  - http://www.scopus.com/inward/citedby.url?scp=58049109366&partnerID=8YFLogxK
U2  - 10.1007/978-3-540-89652-4_8
DO  - 10.1007/978-3-540-89652-4_8
M3  - Conference contribution
AN  - SCOPUS:58049109366
SN  - 3540896473
SN  - 9783540896470
T3  - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP  - 54
EP  - 69
BT  - Service-Oriented Computing - ICSOC 2008 - 6th International Conference, Proceedings
PB  - Springer Verlag
T2  - 6th International Conference on Service-Oriented Computing, ICSOC 2008
Y2  - 1 December 2008 through 5 December 2008
ER  -
