Impact of data distribution, level of parallelism, and communication frequency on parallel data cube construction

Ge Yang; Ruoming Jin; Gagan Agrawal

doi:10.1109/IPDPS.2003.1213162

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

2 Scopus citations

Abstract

Data cube construction is a commonly used operation in data warehouses. Because of the volume of data that is stored and analyzed in a data warehouse and the amount of computation involved in data cube construction, it is natural to consider parallel machines for this operation. We have developed a set of parallel algorithms for data cube construction using a new data structure called the aggregation tree. Our experience has shown that a number of performance trade-offs arise in developing a parallel data cube implementation. We focus on three important issues: (1) data distribution, i.e., how the original array is distributed among the processors; (2) level of parallelism, i.e., which parts of the computation are parallelized and which are sequentialized; and (3) frequency of communication, i.e., whether the implementation requires frequent interprocessor communication (and less memory) or less frequent communication (and more memory). We present a detailed experimental study evaluating these trade-offs. We consider parallel data cube construction with different cube sizes and sparsity levels. Our experimental results show the following: (1) In all cases, reducing the frequency of communication and using more memory gave better performance, though the difference was relatively small. (2) Choosing the data distribution to minimize communication volume made a substantial difference in performance in most cases. (3) Finally, using parallelism at all levels gave better performance, even though it increases the total communication volume.
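To make the terms in the abstract concrete, the following is a minimal Python sketch of data cube construction over a sparse array that is block-distributed along one dimension. It is not the aggregation tree algorithm from the paper: it naively enumerates every group-by, and all names (`group_bys`, `local_cube`, `partition`, `merge`, `reduction_volume`) as well as the simple dense cost model are assumptions made for illustration only.

```python
from itertools import combinations
from math import prod


def group_bys(ndims):
    """Yield every subset of dimensions to retain (2**ndims group-bys)."""
    for k in range(ndims + 1):
        yield from combinations(range(ndims), k)


def local_cube(cells, ndims):
    """Sum sparse cells {coord: value} into every group-by."""
    cube = {dims: {} for dims in group_bys(ndims)}
    for coord, value in cells.items():
        for dims, table in cube.items():
            key = tuple(coord[d] for d in dims)
            table[key] = table.get(key, 0) + value
    return cube


def partition(cells, dim, size, nprocs):
    """Block-distribute the cells along one dimension (hypothetical layout)."""
    block = -(-size // nprocs)  # ceiling division
    parts = [{} for _ in range(nprocs)]
    for coord, value in cells.items():
        parts[coord[dim] // block][coord] = value
    return parts


def merge(cubes, ndims):
    """Combine per-processor partial cubes; stands in for communication."""
    total = {dims: {} for dims in group_bys(ndims)}
    for cube in cubes:
        for dims, table in cube.items():
            for key, value in table.items():
                total[dims][key] = total[dims].get(key, 0) + value
    return total


def reduction_volume(shape, dim):
    """Dense element count of all group-bys that aggregate away `dim`.

    Only these group-bys hold partial sums on several processors when the
    array is partitioned along `dim`, so only they need a reduction.
    """
    return prod(1 + n for d, n in enumerate(shape) if d != dim)


shape = (4, 3, 2)
cells = {(0, 0, 0): 5, (1, 2, 1): 3, (3, 1, 0): 7, (3, 2, 1): 2}

sequential = local_cube(cells, len(shape))
parts = partition(cells, dim=0, size=shape[0], nprocs=2)
parallel = merge([local_cube(p, len(shape)) for p in parts], len(shape))
assert parallel == sequential

volumes = {d: reduction_volume(shape, d) for d in range(len(shape))}
best_dim = min(volumes, key=volumes.get)
```

Under this simplified model, partitioning along the largest dimension leaves the fewest elements to reduce across processors, which illustrates why the choice of data distribution affects communication volume. The sketch ignores sparsity in the cost estimate, the number of processors, and the memory versus communication-frequency trade-off that the paper evaluates.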

Original language	English (US)
Title of host publication	Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2003
Publisher	Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic)	0769519261, 9780769519265
DOIs	https://doi.org/10.1109/IPDPS.2003.1213162
State	Published - 2003
Externally published	Yes
Event	International Parallel and Distributed Processing Symposium, IPDPS 2003 - Nice, France
Duration	Apr 22 2003 → Apr 26 2003

ASJC Scopus subject areas

Computational Theory and Mathematics
Theoretical Computer Science
Software

Cite this

Yang, G., Jin, R., & Agrawal, G. (2003). Impact of data distribution, level of parallelism, and communication frequency on parallel data cube construction. In Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2003, Article 1213162 (Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2003). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/IPDPS.2003.1213162

@inproceedings{55bb86ed23dc4a9e9f9b638e554ac65b,

title = "Impact of data distribution, level of parallelism, and communication frequency on parallel data cube construction",

author = "Ge Yang and Ruoming Jin and Gagan Agrawal",

note = "Publisher Copyright: {\textcopyright} 2003 IEEE.; International Parallel and Distributed Processing Symposium, IPDPS 2003 ; Conference date: 22-04-2003 Through 26-04-2003",

year = "2003",

doi = "10.1109/IPDPS.2003.1213162",

language = "English (US)",

series = "Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2003",

publisher = "Institute of Electrical and Electronics Engineers Inc.",

booktitle = "Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2003",

}

TY - GEN

T1 - Impact of data distribution, level of parallelism, and communication frequency on parallel data cube construction

AU - Yang, Ge

AU - Jin, Ruoming

AU - Agrawal, Gagan

PY - 2003

Y1 - 2003

UR - http://www.scopus.com/inward/record.url?scp=84889833536&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84889833536&partnerID=8YFLogxK

U2 - 10.1109/IPDPS.2003.1213162

DO - 10.1109/IPDPS.2003.1213162

M3 - Conference contribution

AN - SCOPUS:84889833536

T3 - Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2003

BT - Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2003

PB - Institute of Electrical and Electronics Engineers Inc.

T2 - International Parallel and Distributed Processing Symposium, IPDPS 2003

Y2 - 22 April 2003 through 26 April 2003

ER -
