Abstract
This work considers the problem of performing t tasks in a distributed system of p fault-prone processors. This problem, called DO-ALL herein, was introduced by Dwork, Halpern and Waarts. The solutions presented here are for the model of computation that abstracts a synchronous message-passing distributed system with processor stop-failures and restarts. We present two new algorithms based on a new aggressive coordination paradigm by which multiple coordinators may be active as the result of failures. The first algorithm tolerates f < p stop-failures and does not allow restarts. Its available processor steps (work) complexity is S = O((t + p log p / log log p) · log f) and its message complexity is M = O(t + p log p / log log p + fp). Unlike prior solutions, our algorithm uses redundant broadcasts when encountering failures and, for p = t and large f, it achieves better work complexity. This algorithm is used as the basis for another algorithm that tolerates stop-failures and restarts. This new algorithm is the first solution for the DO-ALL problem that efficiently deals with processor restarts. Its available processor steps complexity is S = O((t + p log p + f) · min{log p, log f}), and its message complexity is M = O(t + p log p + fp), where f is the total number of failures.
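To get a feel for how the stated bounds scale, the following minimal sketch evaluates the expressions inside the O(·) for both algorithms. It is an illustration only: the function names are hypothetical, logarithms are assumed to be base 2, hidden constants are ignored, and the inputs are assumed to satisfy p > 2 and f ≥ 2 so that every logarithm is positive. The returned values are growth terms, not actual step or message counts.

```python
import math

def work_no_restarts(t, p, f):
    # Growth term of S for the first algorithm (stop-failures only):
    # (t + p log p / log log p) * log f
    return (t + p * math.log2(p) / math.log2(math.log2(p))) * math.log2(f)

def messages_no_restarts(t, p, f):
    # Growth term of M for the first algorithm:
    # t + p log p / log log p + f p
    return t + p * math.log2(p) / math.log2(math.log2(p)) + f * p

def work_with_restarts(t, p, f):
    # Growth term of S for the second algorithm (failures and restarts):
    # (t + p log p + f) * min{log p, log f}
    return (t + p * math.log2(p) + f) * min(math.log2(p), math.log2(f))

def messages_with_restarts(t, p, f):
    # Growth term of M for the second algorithm:
    # t + p log p + f p
    return t + p * math.log2(p) + f * p

if __name__ == "__main__":
    # Hypothetical parameters: p = t = 16 processors/tasks, f = 4 failures.
    t, p, f = 16, 16, 4
    print(work_no_restarts(t, p, f), messages_no_restarts(t, p, f))
    print(work_with_restarts(t, p, f), messages_with_restarts(t, p, f))
```

Varying f while holding p = t fixed shows how the log f factor in the work terms and the fp term in the message terms respond to the number of failures.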
Original language | English (US) |
---|---|
Article number | 1 |
Pages (from-to) | 49-64 |
Number of pages | 16 |
Journal | Distributed Computing |
Volume | 14 |
Issue number | 1 |
State | Published - 2001 |
Externally published | Yes |
Keywords
- Distributed systems
- Fault-tolerance
- Load balancing
- Processor restarts
- Work
ASJC Scopus subject areas
- Theoretical Computer Science
- Hardware and Architecture
- Computer Networks and Communications
- Computational Theory and Mathematics