TY - GEN
T1 - Practical loop transformations for tensor contraction expressions on multi-level memory hierarchies
AU - Ma, Wenjing
AU - Krishnamoorthy, Sriram
AU - Agrawal, Gagan
PY - 2011
Y1 - 2011
AB - Modern architectures are characterized by deeper levels of memory hierarchy, often explicitly addressable. Optimizing applications for such architectures requires careful management of the data movement across all these levels. In this paper, we focus on the problem of mapping tensor contractions to memory hierarchies with more than two levels, specifically addressing placement of memory allocation and data movement statements, choice of loop fusions, and tile size selection. Existing algorithms to find an integrated solution to this problem even for two-level memory hierarchies have been shown to be expensive. We improve upon this work by focusing on the first-order cost components, simplifying the analysis required and reducing the number of candidates to be evaluated. We have evaluated our framework on a cluster of GPUs. Using five candidate tensor contraction expressions, we show that fusion at multiple levels improves performance, and our framework is effective in determining profitable transformations.
UR - http://www.scopus.com/inward/record.url?scp=79953267209&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79953267209&partnerID=8YFLogxK
U2 - 10.1007/978-3-642-19861-8_15
DO - 10.1007/978-3-642-19861-8_15
M3 - Conference contribution
AN - SCOPUS:79953267209
SN - 9783642198601
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 266
EP - 285
BT - Compiler Construction - 20th International Conference, CC 2011, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2011, Proceedings
T2 - 20th International Conference on Compiler Construction, CC 2011, Held as Part of the Joint European Conferences on Theory and Practice of Software, ETAPS 2011
Y2 - 26 March 2011 through 3 April 2011
ER -