BifurKTM: Approximately consistent distributed transactional memory for GPUs

Samuel Irving; Lu Peng; Costas Busch; Jih Kwon Peir

doi:10.4230/OASIcs.PARMA-DITAM.2021.2

Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

1 Scopus citation

Abstract

We present BifurKTM, the first read-optimized Distributed Transactional Memory system for GPU clusters. The BifurKTM design includes: GPU KoSTM, a new software transactional memory conflict detection scheme that exploits relaxed consistency to increase throughput; and KoDTM, a Distributed Transactional Memory model that combines the Data-flow and Control-flow models to greatly reduce communication overheads. Despite the allure of huge speedups, GPUs are limited in use due to their programmability and extreme sensitivity to workload characteristics. These become daunting concerns when considering a distributed GPU cluster, wherein a programmer must design algorithms to hide communication latency by exploiting data regularity, high compute intensity, etc. The BifurKTM design allows GPU programmers to exploit a new workload characteristic: the percentage of the workload that is Read-Only (i.e., reads but does not modify shared memory), even when this percentage is not known in advance. Programmers designate transactions that are suitable for Approximate Consistency, in which transactions “appear” to execute at the most convenient time for preventing conflicts. By leveraging Approximate Consistency for Read-Only transactions, the BifurKTM runtime system offers improved performance, application flexibility, and programmability without introducing any errors into shared memory. Our experiments show that Approximate Consistency can improve BifurKTM (BkTM) performance by up to 34x in applications with moderate network communication utilization and a read-intensive workload. Using Approximate Consistency, BkTM can reduce GPU-to-GPU network communication by 99%, reduce the number of aborts by up to 100%, and achieve an average speedup of 18x over a similarly sized CPU cluster while requiring minimal effort from the programmer.

Original language	English (US)
Title of host publication	12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms, PARMA-DITAM 2021
Editors	Joao Bispo, Stefano Cherubin, Jose Flich
Publisher	Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing
ISBN (Electronic)	9783959771818
DOIs	https://doi.org/10.4230/OASIcs.PARMA-DITAM.2021.2
State	Published - Mar 1 2021
Event	12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms, PARMA-DITAM 2021 - Budapest, Hungary
Duration	Jan 19 2021 → …

Publication series

Name	OpenAccess Series in Informatics
Volume	88
ISSN (Print)	2190-6807

Conference

Conference	12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms, PARMA-DITAM 2021
Country/Territory	Hungary
City	Budapest
Period	1/19/21 → …

Keywords

Approximate Consistency
Distributed Transactional Memory
GPU

ASJC Scopus subject areas

Geography, Planning and Development
Modeling and Simulation

Access to Document

10.4230/OASIcs.PARMA-DITAM.2021.2

Cite this

Irving, S., Peng, L., Busch, C., & Peir, J. K. (2021). BifurKTM: Approximately consistent distributed transactional memory for GPUs. In J. Bispo, S. Cherubin, & J. Flich (Eds.), 12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms, PARMA-DITAM 2021 Article 2 (OpenAccess Series in Informatics; Vol. 88). Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing. https://doi.org/10.4230/OASIcs.PARMA-DITAM.2021.2

BifurKTM: Approximately consistent distributed transactional memory for GPUs. / Irving, Samuel; Peng, Lu; Busch, Costas et al.
12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms, PARMA-DITAM 2021. ed. / Joao Bispo; Stefano Cherubin; Jose Flich. Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing, 2021. 2 (OpenAccess Series in Informatics; Vol. 88).

Irving, S, Peng, L, Busch, C & Peir, JK 2021, BifurKTM: Approximately consistent distributed transactional memory for GPUs. in J Bispo, S Cherubin & J Flich (eds), 12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms, PARMA-DITAM 2021., 2, OpenAccess Series in Informatics, vol. 88, Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing, 12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms, PARMA-DITAM 2021, Budapest, Hungary, 1/19/21. https://doi.org/10.4230/OASIcs.PARMA-DITAM.2021.2

Irving S, Peng L, Busch C, Peir JK. BifurKTM: Approximately consistent distributed transactional memory for GPUs. In Bispo J, Cherubin S, Flich J, editors, 12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms, PARMA-DITAM 2021. Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing. 2021. 2. (OpenAccess Series in Informatics). doi: 10.4230/OASIcs.PARMA-DITAM.2021.2

Irving, Samuel ; Peng, Lu ; Busch, Costas et al. / BifurKTM : Approximately consistent distributed transactional memory for GPUs. 12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms, PARMA-DITAM 2021. editor / Joao Bispo ; Stefano Cherubin ; Jose Flich. Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing, 2021. (OpenAccess Series in Informatics).

@inproceedings{14ed6c259faf439082018bd072b61f49,

title = "BifurKTM: Approximately consistent distributed transactional memory for GPUs",

abstract = "We present BifurKTM, the first read-optimized Distributed Transactional Memory system for GPU clusters. The BifurKTM design includes: GPU KoSTM, a new software transactional memory conflict detection scheme that exploits relaxed consistency to increase throughput; and KoDTM, a Distributed Transactional Memory model that combines the Data- and Control- flow models to greatly reduce communication overheads. Despite the allure of huge speedups, GPUs are limited in use due to their programmability and extreme sensitivity to workload characteristics. These become daunting concerns when considering a distributed GPU cluster, wherein a programmer must design algorithms to hide communication latency by exploiting data regularity, high compute intensity, etc. The BifurKTM design allows GPU programmers to exploit a new workload characteristic: the percentage of the workload that is Read-Only (e.g. reads but does not modify shared memory), even when this percentage is not known in advance. Programmers designate transactions that are suitable for Approximate Consistency, in which transactions “appear” to execute at the most convenient time for preventing conflicts. By leveraging Approximate Consistency for Read-Only transactions, the BifurKTM runtime system offers improved performance, application flexibility, and programmability without introducing any errors into shared memory. Our experiments show that Approximate Consistency can improve BkTM performance by up to 34x in applications with moderate network communication utilization and a read-intensive workload. Using Approximate Consistency, BkTM can reduce GPU-to-GPU network communication by 99%, reduce the number of aborts by up to 100%, and achieve an average speedup of 18x over a similarly sized CPU cluster while requiring minimal effort from the programmer.",

keywords = "Approximate Consistency, Distributed Transactional Memory, GPU",

author = "Samuel Irving and Lu Peng and Costas Busch and Peir, {Jih Kwon}",

note = "Publisher Copyright: {\textcopyright} Samuel Irving, Lu Peng, Costas Busch, and Jih-Kwon Peir; licensed under Creative Commons License CC-BY 4.0; 12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms, PARMA-DITAM 2021 ; Conference date: 19-01-2021",

year = "2021",

month = mar,

day = "1",

doi = "10.4230/OASIcs.PARMA-DITAM.2021.2",

language = "English (US)",

series = "OpenAccess Series in Informatics",

publisher = "Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing",

editor = "Joao Bispo and Stefano Cherubin and Jose Flich",

booktitle = "12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms, PARMA-DITAM 2021",

address = "Germany",

}

TY - GEN

T1 - BifurKTM

T2 - 12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms, PARMA-DITAM 2021

AU - Irving, Samuel

AU - Peng, Lu

AU - Busch, Costas

AU - Peir, Jih Kwon

N1 - Publisher Copyright: © Samuel Irving, Lu Peng, Costas Busch, and Jih-Kwon Peir; licensed under Creative Commons License CC-BY 4.0

PY - 2021/3/1

Y1 - 2021/3/1

N2 - We present BifurKTM, the first read-optimized Distributed Transactional Memory system for GPU clusters. The BifurKTM design includes: GPU KoSTM, a new software transactional memory conflict detection scheme that exploits relaxed consistency to increase throughput; and KoDTM, a Distributed Transactional Memory model that combines the Data- and Control- flow models to greatly reduce communication overheads. Despite the allure of huge speedups, GPUs are limited in use due to their programmability and extreme sensitivity to workload characteristics. These become daunting concerns when considering a distributed GPU cluster, wherein a programmer must design algorithms to hide communication latency by exploiting data regularity, high compute intensity, etc. The BifurKTM design allows GPU programmers to exploit a new workload characteristic: the percentage of the workload that is Read-Only (e.g. reads but does not modify shared memory), even when this percentage is not known in advance. Programmers designate transactions that are suitable for Approximate Consistency, in which transactions “appear” to execute at the most convenient time for preventing conflicts. By leveraging Approximate Consistency for Read-Only transactions, the BifurKTM runtime system offers improved performance, application flexibility, and programmability without introducing any errors into shared memory. Our experiments show that Approximate Consistency can improve BkTM performance by up to 34x in applications with moderate network communication utilization and a read-intensive workload. Using Approximate Consistency, BkTM can reduce GPU-to-GPU network communication by 99%, reduce the number of aborts by up to 100%, and achieve an average speedup of 18x over a similarly sized CPU cluster while requiring minimal effort from the programmer.

KW - Approximate Consistency

KW - Distributed Transactional Memory

KW - GPU

UR - http://www.scopus.com/inward/record.url?scp=85115834732&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85115834732&partnerID=8YFLogxK

U2 - 10.4230/OASIcs.PARMA-DITAM.2021.2

DO - 10.4230/OASIcs.PARMA-DITAM.2021.2

M3 - Conference contribution

AN - SCOPUS:85115834732

T3 - OpenAccess Series in Informatics

BT - 12th Workshop on Parallel Programming and Run-Time Management Techniques for Many-Core Architectures and 10th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms, PARMA-DITAM 2021

A2 - Bispo, Joao

A2 - Cherubin, Stefano

A2 - Flich, Jose

PB - Schloss Dagstuhl- Leibniz-Zentrum fur Informatik GmbH, Dagstuhl Publishing

Y2 - 19 January 2021

ER -
