Shared memory parallelization of data mining algorithms: Techniques, programming interface, and performance

Ruoming Jin, Ge Yang, Gagan Agrawal

Research output: Contribution to journal > Article > peer-review

82 Scopus citations

Abstract

With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In this paper, we focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of popular data mining algorithms. In addition, we propose a reduction-object-based interface for specifying a data mining algorithm. We show how our runtime system can apply any of the techniques we have developed starting from a common specification of the algorithm. We have carried out a detailed evaluation of the parallelization techniques and the programming interface. We have experimented with apriori and fp-tree-based association mining, k-means clustering, k-nearest neighbor classifier, and decision tree construction. The main results from our experiments are as follows: 1) Among full replication, optimized full locking, and cache-sensitive locking, there is no clear winner. Each of these three techniques can outperform others depending upon machine and dataset parameters. These three techniques perform significantly better than the other two techniques. 2) Good parallel efficiency is achieved for each of the four algorithms we experimented with, using our techniques and runtime system. 3) The overhead of the interface is within 10 percent in almost all cases. 4) In the case of decision tree construction, combining different techniques turned out to be crucial for achieving high performance.
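As a rough illustration of two of the techniques named in the abstract, the sketch below counts item occurrences (a simple reduction) in two ways. It assumes that full replication means each thread updates a private copy of the reduction object that is merged at the end, and that full locking means a single shared copy with one lock per element. The abstract itself does not spell out these definitions, and the function names and the counting task are hypothetical, not the paper's interface or runtime system.

```python
import threading
from collections import Counter


def _run_threads(worker, num_threads):
    """Start one thread per id and wait for all of them to finish."""
    threads = [threading.Thread(target=worker, args=(tid,))
               for tid in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def count_full_replication(items, num_threads=4):
    """Assumed 'full replication': private copy per thread, merged afterwards."""
    chunks = [items[i::num_threads] for i in range(num_threads)]
    local_copies = [Counter() for _ in range(num_threads)]

    def worker(tid):
        for item in chunks[tid]:
            # No lock needed: this copy is only touched by thread `tid`.
            local_copies[tid][item] += 1

    _run_threads(worker, num_threads)

    merged = Counter()
    for copy in local_copies:
        merged.update(copy)  # Counter.update adds counts together
    return merged


def count_full_locking(items, keys, num_threads=4):
    """Assumed 'full locking': one shared copy, one lock per element."""
    chunks = [items[i::num_threads] for i in range(num_threads)]
    counts = {k: 0 for k in keys}
    locks = {k: threading.Lock() for k in keys}

    def worker(tid):
        for item in chunks[tid]:
            # Only threads updating the same element contend for this lock.
            with locks[item]:
                counts[item] += 1

    _run_threads(worker, num_threads)
    return counts


if __name__ == "__main__":
    data = ["a", "b", "a", "c", "b", "a"] * 100
    print(count_full_replication(data))
    print(count_full_locking(data, keys=set(data)))
```

Both functions return the same counts. The trade-off the sketch is meant to surface is memory and merge cost (replication) versus lock storage and contention (locking), which is in line with the abstract's remark that no technique is a clear winner across machines and datasets.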

Original language: English (US)
Pages (from-to): 71-89
Number of pages: 19
Journal: IEEE Transactions on Knowledge and Data Engineering
Volume: 17
Issue number: 1
State: Published - Jan 2005

Keywords

  • Association mining
  • Clustering
  • Decision tree construction
  • Programming interfaces
  • Shared memory parallelization

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Computational Theory and Mathematics
