Performing tasks on synchronous restartable message-passing processors

Bogdan S. Chlebus, Roberto De Prisco, Alex A. Shvartsman

Research output: Contribution to journal (Article)

39 Scopus citations

Abstract

This work considers the problem of performing t tasks in a distributed system of p fault-prone processors. This problem, called DO-ALL herein, was introduced by Dwork, Halpern and Waarts. The solutions presented here are for the model of computation that abstracts a synchronous message-passing distributed system with processor stop-failures and restarts. We present two new algorithms based on a new aggressive coordination paradigm by which multiple coordinators may be active as the result of failures. The first algorithm is tolerant of f < p stop-failures and does not allow restarts. Its available processor steps (work) complexity is S = O((t + p log p / log log p) · log f) and its message complexity is M = O(t + p log p / log log p + fp). Unlike prior solutions, our algorithm uses redundant broadcasts when encountering failures and, for p = t and large f, it achieves better work complexity. This algorithm is used as the basis for another algorithm that tolerates stop-failures and restarts. This new algorithm is the first solution for the DO-ALL problem that efficiently deals with processor restarts. Its available processor steps complexity is S = O((t + p log p + f) · min{log p, log f}), and its message complexity is M = O(t + p log p + fp), where f is the total number of failures.
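
To give a feel for how the bounds stated in the abstract scale, the following is a minimal sketch that evaluates only the growth terms inside the O(·) expressions. It is not an implementation of either algorithm. The function names, the choice of base-2 logarithms, the omission of hidden constants, and the assumption that p ≥ 4 and f ≥ 2 (so every logarithm is positive) are illustrative choices, not taken from the paper.

```python
import math


def work_no_restarts(t: int, p: int, f: int) -> float:
    """Growth term of S = O((t + p log p / log log p) * log f).

    Hypothetical helper: constants are dropped and logs are base 2.
    Assumes p >= 4 and f >= 2 so that all logarithms are positive.
    """
    log_p = math.log2(p)
    return (t + p * log_p / math.log2(log_p)) * math.log2(f)


def messages_no_restarts(t: int, p: int, f: int) -> float:
    """Growth term of M = O(t + p log p / log log p + f p)."""
    log_p = math.log2(p)
    return t + p * log_p / math.log2(log_p) + f * p


def work_with_restarts(t: int, p: int, f: int) -> float:
    """Growth term of S = O((t + p log p + f) * min{log p, log f})."""
    log_p = math.log2(p)
    return (t + p * log_p + f) * min(log_p, math.log2(f))


def messages_with_restarts(t: int, p: int, f: int) -> float:
    """Growth term of M = O(t + p log p + f p)."""
    return t + p * math.log2(p) + f * p


if __name__ == "__main__":
    # Illustrative parameters only: 16 tasks, 16 processors, 4 failures.
    t, p, f = 16, 16, 4
    print("no restarts:   work", work_no_restarts(t, p, f),
          "messages", messages_no_restarts(t, p, f))
    print("with restarts: work", work_with_restarts(t, p, f),
          "messages", messages_with_restarts(t, p, f))
```

Because the constants are dropped, these values are only useful for comparing how the two sets of bounds grow as t, p, and f change, not as predictions of actual step or message counts.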

Original language: English (US)
Pages (from-to): 49-64
Number of pages: 16
Journal: Distributed Computing
Volume: 14
Issue number: 1
Publication status: Published - Jan 2001
Externally published: Yes

Keywords

  • Distributed systems
  • Fault-tolerance
  • Load balancing
  • Processor restarts
  • Work

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Hardware and Architecture
  • Computer Networks and Communications
  • Computational Theory and Mathematics
