Fault tolerant parallel data-intensive algorithms

Mucahid Kutlu; Gagan Agrawal; Oguz Kurt

doi:10.1145/2287076.2287099

Research output: Contribution to conference › Paper › peer-review

2 Scopus citations

Abstract

Fault tolerance is rapidly becoming a crucial issue in high-end and distributed computing, as the increasing number of cores decreases the mean time to failure of these systems. While checkpointing, including checkpointing of parallel programs such as MPI applications, provides a general solution, its overhead is becoming increasingly unacceptable. Algorithm-based fault tolerance therefore offers a practical alternative, though it is less general. Although this approach has been studied for many applications, there is no existing work on algorithm-based fault tolerance for the growing class of data-intensive parallel applications. In this paper, we present an algorithm-based fault tolerance solution that handles fail-stop failures for a class of data-intensive algorithms. We divide the dataset into smaller data blocks and, in the replication step, distribute the replicated blocks so as to minimize the maximum data intersection between any two processors. This minimizes data loss when multiple failures occur. In addition, our approach enables better load balance after a failure and decreases the amount of lost data that must be re-processed. We have evaluated our approach using two popular parallel data mining algorithms, k-means and apriori. We show that our approach has negligible overhead when there are no failures and gracefully handles different numbers of failures and failures at different points of processing. We also compare our approach with the MapReduce-based solution for fault tolerance and show that we outperform Hadoop both in the absence and in the presence of failures.
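The replication idea in the abstract can be illustrated with a small sketch. This is not the authors' placement scheme, which the abstract does not specify. It is a hypothetical round-robin placement, assuming one replica per block and an equal number of blocks per processor. It is compared against a naive "copy everything to the next neighbor" placement. The function names (`place_spread`, `place_neighbor`, `max_pairwise_intersection`, `lost_blocks`) are invented for this illustration.

```python
from itertools import combinations


def _place(num_procs, blocks_per_proc, offset_fn):
    """Build {processor: set of block ids held}, one replica per block."""
    if num_procs < 2:
        raise ValueError("replication needs at least two processors")
    holdings = {p: set() for p in range(num_procs)}
    for p in range(num_procs):
        for j in range(blocks_per_proc):
            block = (p, j)  # j-th block whose primary copy lives on p
            holdings[p].add(block)
            # offset is always in 1..num_procs-1, so the replica never lands on p
            target = (p + offset_fn(j)) % num_procs
            holdings[target].add(block)
    return holdings


def place_neighbor(num_procs, blocks_per_proc):
    """Naive baseline: every replica goes to the next processor."""
    return _place(num_procs, blocks_per_proc, lambda j: 1)


def place_spread(num_procs, blocks_per_proc):
    """Hypothetical scheme: spread replicas round-robin over all other processors."""
    return _place(num_procs, blocks_per_proc,
                  lambda j: 1 + j % (num_procs - 1))


def max_pairwise_intersection(holdings):
    """Largest number of blocks shared by any two processors."""
    return max(len(holdings[a] & holdings[b])
               for a, b in combinations(holdings, 2))


def lost_blocks(holdings, failed):
    """Blocks with no surviving copy after the processors in `failed` stop."""
    all_blocks = set().union(*holdings.values())
    surviving = set().union(*(blocks for p, blocks in holdings.items()
                              if p not in failed))
    return all_blocks - surviving


if __name__ == "__main__":
    for name, place in [("neighbor", place_neighbor), ("spread", place_spread)]:
        h = place(4, 6)
        print(name, max_pairwise_intersection(h), len(lost_blocks(h, {0, 1})))
```

With 4 processors and 6 blocks each, the neighbor placement lets two processors share 6 blocks, so losing processors 0 and 1 together destroys 6 blocks. The spread placement caps the shared set at 4 blocks, so the same double failure loses 4. Under the one-replica assumption, the blocks lost to a double failure are exactly the blocks the two failed processors share, which is why reducing the maximum pairwise intersection bounds the worst-case loss.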

Original language	English (US)
Pages	133
DOIs	https://doi.org/10.1145/2287076.2287099 https://doi.org/10.1109/HiPC.2012.6507503
State	Published - 2012
Externally published	Yes
Event	2012 19th International Conference on High Performance Computing, HiPC 2012 - Pune, India Duration: Dec 18 2012 → Dec 21 2012

Conference

Conference	2012 19th International Conference on High Performance Computing, HiPC 2012
Country/Territory	India
City	Pune
Period	12/18/12 → 12/21/12

ASJC Scopus subject areas

Software

Access to Document

http://dl.acm.org/citation.cfm?doid=2287076.2287099

Cite this

@conference{1752596080e448baa0165ffea2bccfc7,

title = "Fault tolerant parallel data-intensive algorithms",

abstract = "Fault-tolerance is rapidly becoming a crucial issue in high-end and distributed computing, as increasing number of cores are decreasing the mean-time to failure of the systems. While checkpointing, including checkpointing of parallel programs like MPI applications, provides a general solution, the overhead of this approach is becoming increasingly unacceptable. Thus, algorithm-based fault-tolerance provides a nice practical alternative, though it is less general. Although this approach has been studied for many applications, there is no existing work for algorithm-based fault-tolerance for the growing class of data-intensive parallel applications. In this paper, we present an algorithm-based fault tolerance solution that handles fail-stop failures for a class of data intensive algorithms. We divide the dataset into smaller data blocks and in replication step, we distribute the replicated blocks with the aim of keeping the maximum data intersection between any two processors minimum. This allows us to have minimum data loss when multiple failures occur. In addition, our approach enables better load balance after failure, and decreases the amount of re-processing of the lost data. We have evaluated our approach by using two popular parallel data mining algorithms, which are k-means and apriori. We show that our approach has negligible overhead when there are no failures, and allows us to gracefully handle different number of failures, and failures at different points of processing. We also provide the comparison of our approach with the MapReduce based solution for fault tolerance, and show that we outperform Hadoop both in absence and presence of failures.",

author = "Mucahid Kutlu and Gagan Agrawal and Oguz Kurt",

year = "2012",

doi = "10.1145/2287076.2287099",

language = "English (US)",

pages = "133",

note = "2012 19th International Conference on High Performance Computing, HiPC 2012 ; Conference date: 18-12-2012 Through 21-12-2012",

}

TY - CONF

T1 - Fault tolerant parallel data-intensive algorithms

AU - Kutlu, Mucahid

AU - Agrawal, Gagan

AU - Kurt, Oguz

PY - 2012

Y1 - 2012

AB - Fault-tolerance is rapidly becoming a crucial issue in high-end and distributed computing, as increasing number of cores are decreasing the mean-time to failure of the systems. While checkpointing, including checkpointing of parallel programs like MPI applications, provides a general solution, the overhead of this approach is becoming increasingly unacceptable. Thus, algorithm-based fault-tolerance provides a nice practical alternative, though it is less general. Although this approach has been studied for many applications, there is no existing work for algorithm-based fault-tolerance for the growing class of data-intensive parallel applications. In this paper, we present an algorithm-based fault tolerance solution that handles fail-stop failures for a class of data intensive algorithms. We divide the dataset into smaller data blocks and in replication step, we distribute the replicated blocks with the aim of keeping the maximum data intersection between any two processors minimum. This allows us to have minimum data loss when multiple failures occur. In addition, our approach enables better load balance after failure, and decreases the amount of re-processing of the lost data. We have evaluated our approach by using two popular parallel data mining algorithms, which are k-means and apriori. We show that our approach has negligible overhead when there are no failures, and allows us to gracefully handle different number of failures, and failures at different points of processing. We also provide the comparison of our approach with the MapReduce based solution for fault tolerance, and show that we outperform Hadoop both in absence and presence of failures.

UR - http://www.scopus.com/inward/record.url?scp=84880272368&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84880272368&partnerID=8YFLogxK

U2 - 10.1145/2287076.2287099

DO - 10.1145/2287076.2287099

M3 - Paper

SP - 133

T2 - 2012 19th International Conference on High Performance Computing, HiPC 2012

Y2 - 18 December 2012 through 21 December 2012

ER -
