TY - GEN
T1 - Supporting fault tolerance in a data-intensive computing middleware
AU - Bicer, Tekin
AU - Jiang, Wei
AU - Agrawal, Gagan
PY - 2010
Y1 - 2010
N2 - Over the last 2-3 years, the importance of data-intensive computing has increasingly been recognized, closely coupled with the emergence and popularity of map-reduce for developing this class of applications. Besides programmability and ease of parallelization, fault tolerance is clearly important for data-intensive applications, because of their long-running nature, and because of the potential for using a large number of nodes for processing massive amounts of data. Fault tolerance has been an important attribute of map-reduce as well in its Hadoop implementation, where it is based on replication of data in the file system. Two important goals in supporting fault tolerance are low overheads and efficient recovery. With these goals, this paper describes a different approach for enabling data-intensive computing with fault tolerance. Our approach is based on an API for developing data-intensive computations that is a variation of map-reduce, and it involves an explicit programmer-declared reduction object. We show how more efficient fault-tolerance support can be developed using this API. Particularly, as the reduction object represents the state of the computation on a node, we can periodically cache the reduction object from every node at another location and use it to support failure recovery. We have extensively evaluated our approach using two data-intensive applications. Our results show that the overheads of our scheme are extremely low, and our system outperforms Hadoop both in the absence and presence of failures.
KW - Cloud computing
KW - Data-intensive computing
KW - Fault tolerance
KW - Map-Reduce
UR - http://www.scopus.com/inward/record.url?scp=77953974120&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77953974120&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2010.5470462
DO - 10.1109/IPDPS.2010.5470462
M3 - Conference contribution
AN - SCOPUS:77953974120
SN - 9781424464432
T3 - Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010
BT - Proceedings of the 2010 IEEE International Symposium on Parallel and Distributed Processing, IPDPS 2010
T2 - 24th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2010
Y2 - 19 April 2010 through 23 April 2010
ER -