Supporting fault-tolerance in streaming grid applications

Qian Zhu, Liang Chen, Gagan Agrawal

Research output: Contribution to conferencePaperpeer-review

11 Scopus citations

Abstract

This paper considers the problem of supporting and efficiently implementing fault-tolerance for tightly-coupled and pipelined applications, especially streaming applications, in a grid environment. We provide an alternative to basic checkpointing and use the notion of Light-weight Summary Structure(LSS) to enable efficient failure-recovery. The idea behind LSS is that at certain points during the execution of a processing stage, the state of the program can be summarized by a small amount of memory. This allows us to store copies of LSS for enabling failure-recovery, which causes low overhead fault-tolerance. Our work can be viewed as an optimization and adaptation of the idea of application-level checkpointing to a different execution environment, and for a different class of applications. Our implementation and evaluation of LSS based failure-recovery has been in the context of the GATES (Grid-based AdapTive Execution on Streams) middleware. An observation we use for providing very low overhead support for fault-tolerance is that algorithms analyzing data streams are only allowed to take a single pass over data, which means they only perform approximate processing. Therefore, we believe that in supporting fault-tolerant execution for these applications, it is acceptable to not analyze a small number of packets of data during failure-recovery. We show how we perform failure-recovery and also demonstrate how we could use additional buffers to limit data loss during the recovery procedure. We also present an efficient algorithm for allocating a new computation resource for failure-recovery at runtime. We have extensively evaluated our implementation using three stream data processing applications, and shown that the use of LSS allows effective and low-overhead failure-recovery.

Original languageEnglish (US)
Pages156
DOIs
StatePublished - 2008
Externally publishedYes
EventIPDPS 2008 - 22nd IEEE International Parallel and Distributed Processing Symposium - Miami, FL, United States
Duration: Apr 14 2008Apr 18 2008

Conference

ConferenceIPDPS 2008 - 22nd IEEE International Parallel and Distributed Processing Symposium
Country/TerritoryUnited States
CityMiami, FL
Period4/14/084/18/08

ASJC Scopus subject areas

  • Hardware and Architecture
  • Software
  • Electrical and Electronic Engineering

Fingerprint

Dive into the research topics of 'Supporting fault-tolerance in streaming grid applications'. Together they form a unique fingerprint.

Cite this