## Abstract

This work considers the problem of performing t tasks in a distributed system of p fault-prone processors. This problem, called DO-ALL herein, was introduced by Dwork, Halpern and Waarts. The solutions presented here are for the model of computation that abstracts a synchronous message-passing distributed system with processor stop-failures and restarts. We present two new algorithms based on a new aggressive coordination paradigm by which multiple coordinators may be active as the result of failures. The first algorithm tolerates f < p stop-failures and does not allow restarts. Its available processor steps (work) complexity is S = O((t + p log p/log log p) · log f) and its message complexity is M = O(t + p log p/log log p + fp). Unlike prior solutions, our algorithm uses redundant broadcasts when encountering failures and, for p = t and large f, it achieves better work complexity. This algorithm is used as the basis for another algorithm that tolerates stop-failures and restarts. This new algorithm is the first solution for the DO-ALL problem that efficiently deals with processor restarts. Its available processor steps complexity is S = O((t + p log p + f) · min{log p, log f}), and its message complexity is M = O(t + p log p + fp), where f is the total number of failures.
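To get a feel for how the four bounds above scale, the following is a minimal sketch that evaluates only the expressions inside the O(·), with all hidden constants set to 1. The function names are hypothetical, and the choice of base-2 logarithms is an assumption (the abstract does not fix a base), so the numbers indicate relative growth only, not actual step or message counts.

```python
import math


def _check(t: int, p: int, f: int) -> None:
    # Guards so that every logarithm below is defined and positive.
    if t < 1 or p < 4 or f < 2:
        raise ValueError("sketch assumes t >= 1, p >= 4 and f >= 2")


def work_no_restarts(t: int, p: int, f: int) -> float:
    """Growth term of S = O((t + p log p / log log p) * log f)."""
    _check(t, p, f)
    lg_p = math.log2(p)  # assumption: base-2 logarithms
    return (t + p * lg_p / math.log2(lg_p)) * math.log2(f)


def messages_no_restarts(t: int, p: int, f: int) -> float:
    """Growth term of M = O(t + p log p / log log p + f p)."""
    _check(t, p, f)
    lg_p = math.log2(p)
    return t + p * lg_p / math.log2(lg_p) + f * p


def work_with_restarts(t: int, p: int, f: int) -> float:
    """Growth term of S = O((t + p log p + f) * min{log p, log f})."""
    _check(t, p, f)
    lg_p = math.log2(p)
    return (t + p * lg_p + f) * min(lg_p, math.log2(f))


def messages_with_restarts(t: int, p: int, f: int) -> float:
    """Growth term of M = O(t + p log p + f p)."""
    _check(t, p, f)
    return t + p * math.log2(p) + f * p


if __name__ == "__main__":
    t, p, f = 16, 16, 4
    print(work_no_restarts(t, p, f))        # 96.0
    print(messages_no_restarts(t, p, f))    # 112.0
    print(work_with_restarts(t, p, f))      # 168.0
    print(messages_with_restarts(t, p, f))  # 144.0
```

In the restart model f counts all failures over the execution, so it may exceed p; the `min{log p, log f}` factor then caps the multiplicative overhead at log p.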

Original language | English (US) |
---|---|
Pages (from-to) | 49-64 |
Number of pages | 16 |
Journal | Distributed Computing |
Volume | 14 |
Issue number | 1 |
State | Published - Jan 2001 |
Externally published | Yes |

## Keywords

- Distributed systems
- Fault-tolerance
- Load balancing
- Processor restarts
- Work

## ASJC Scopus subject areas

- Theoretical Computer Science
- Hardware and Architecture
- Computer Networks and Communications
- Computational Theory and Mathematics