Reliable, Efficient Recovery for Complex Services with Replicated Subsystems

Edward Tremel, Sagar Jha, Weijia Song, David Chu, Ken Birman

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Applications with internal substructure are common in the cloud, where many systems are organized as independently logged and replicated subsystems that interact via flows of objects or some form of RPC. Restarting such an application is difficult: a restart algorithm needs to efficiently provision the subsystems by mapping them to nodes with needed data and compute resources, while simultaneously guaranteeing that replicas are in distinct failure domains. Additional failures can occur during recovery, hence the restart process must itself be a restartable procedure. In this paper we present an algorithm for efficiently restarting a service composed of sharded subsystems, each using a replicated state machine model, into a state that (1) has the same fault-tolerance guarantees as the running system, (2) satisfies resource constraints and has all needed data to restart into a consistent state, (3) makes safe decisions about which updates to preserve from the logged state, (4) ensures that the restarted state will be mutually consistent across all subsystems and shards, and (5) ensures that no committed updates will be lost. If restart is not currently possible, the algorithm will await additional resources, then retry.
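The provisioning constraint described in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: it is a simple greedy placement, written under the assumptions that each node serves at most one shard, that every shard needs the same number of replicas, and that a shard can restart only if at least one chosen node holds its log. The names `Node`, `failure_domain`, `has_log_for`, and `plan_restart` are hypothetical. A greedy pass may report failure even when a more careful search would find a valid placement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    failure_domain: str      # e.g. a rack or zone (hypothetical label)
    has_log_for: frozenset   # shard ids whose logged state this node holds


def plan_restart(shards, replicas_per_shard, nodes):
    """Greedy sketch: return {shard: [Node, ...]}, or None if restart
    is not currently possible with the available nodes."""
    free = list(nodes)
    plan = {}
    for shard in shards:
        chosen, used_domains = [], set()
        # Prefer nodes that already hold this shard's log (stable sort).
        for node in sorted(free, key=lambda n: shard not in n.has_log_for):
            if node.failure_domain in used_domains:
                continue  # replicas must sit in distinct failure domains
            chosen.append(node)
            used_domains.add(node.failure_domain)
            if len(chosen) == replicas_per_shard:
                break
        # Assumption: need a full replica set and at least one log holder.
        has_data = any(shard in n.has_log_for for n in chosen)
        if len(chosen) < replicas_per_shard or not has_data:
            return None
        plan[shard] = chosen
        free = [n for n in free if n not in chosen]
    return plan
```

A caller that receives `None` would wait for more nodes to become available and call `plan_restart` again, mirroring the "await additional resources, then retry" behavior the abstract mentions.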

Original language: English (US)
Title of host publication: Proceedings - 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 172-183
Number of pages: 12
ISBN (Electronic): 9781728158099
State: Published - Jun 2020
Externally published: Yes
Event: 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020 - Valencia, Spain
Duration: Jun 29 2020 → Jul 2 2020

Publication series

Name: Proceedings - 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020

Conference

Conference: 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020
Country: Spain
City: Valencia
Period: 6/29/20 → 7/2/20

Keywords

  • Fault Tolerance
  • Recovery
  • Replication
  • State Machine Replication

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems
  • Information Systems and Management
  • Safety, Risk, Reliability and Quality
