TY - GEN
T1 - Reliable, Efficient Recovery for Complex Services with Replicated Subsystems
AU - Tremel, Edward
AU - Jha, Sagar
AU - Song, Weijia
AU - Chu, David
AU - Birman, Ken
N1 - Funding Information:
This work was supported, in part, by a grant from AFRL Wright-Patterson.
Publisher Copyright:
© 2020 IEEE.
PY - 2020/6
Y1 - 2020/6
N2 - Applications with internal substructure are common in the cloud, where many systems are organized as independently logged and replicated subsystems that interact via flows of objects or some form of RPC. Restarting such an application is difficult: a restart algorithm needs to efficiently provision the subsystems by mapping them to nodes with needed data and compute resources, while simultaneously guaranteeing that replicas are in distinct failure domains. Additional failures can occur during recovery, hence the restart process must itself be a restartable procedure. In this paper we present an algorithm for efficiently restarting a service composed of sharded subsystems, each using a replicated state machine model, into a state that (1) has the same fault-tolerance guarantees as the running system, (2) satisfies resource constraints and has all needed data to restart into a consistent state, (3) makes safe decisions about which updates to preserve from the logged state, (4) ensures that the restarted state will be mutually consistent across all subsystems and shards, and (5) ensures that no committed updates will be lost. If restart is not currently possible, the algorithm will await additional resources, then retry.
KW - Fault Tolerance
KW - Recovery
KW - Replication
KW - State Machine Replication
UR - http://www.scopus.com/inward/record.url?scp=85090413867&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85090413867&partnerID=8YFLogxK
U2 - 10.1109/DSN48063.2020.00035
DO - 10.1109/DSN48063.2020.00035
M3 - Conference contribution
AN - SCOPUS:85090413867
T3 - Proceedings - 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020
SP - 172
EP - 183
BT - Proceedings - 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020
Y2 - 29 June 2020 through 2 July 2020
ER -