A linear speedup analysis of distributed deep learning with sparse and quantized communication

Peng Jiang; Gagan Agrawal

Research output: Contribution to journal › Conference article › peer-review

Abstract

The large communication overhead has imposed a bottleneck on the performance of distributed Stochastic Gradient Descent (SGD) for training deep neural networks. Previous works have demonstrated the potential of using gradient sparsification and quantization to reduce the communication cost. However, there is still a lack of understanding about how sparse and quantized communication affects the convergence rate of the training algorithm. In this paper, we study the convergence rate of distributed SGD for non-convex optimization with two communication reducing strategies: sparse parameter averaging and gradient quantization. We show that O(1/√(MK)) convergence rate can be achieved if the sparsification and quantization hyperparameters are configured properly. We also propose a strategy called periodic quantized averaging (PQASGD) that further reduces the communication cost while preserving the O(1/√(MK)) convergence rate. Our evaluation validates our theoretical results and shows that our PQASGD can converge as fast as full-communication SGD with only 3%-5% communication data size.
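As a rough illustration of the gradient quantization idea mentioned in the abstract, the sketch below implements a generic unbiased stochastic quantizer in plain Python. It is an assumption for illustration only and not necessarily the quantizer analyzed in the paper; the names `stochastic_quantize`, `average_quantized`, and the `levels` parameter are hypothetical.

```python
import random


def stochastic_quantize(vec, levels=4, rng=random):
    """Randomly round each coordinate onto a uniform grid (hypothetical helper).

    A coordinate v becomes sign(v) * scale * q / levels, where scale is the
    largest magnitude in vec and q is an integer in [0, levels] chosen by
    randomized rounding, so the expected output equals v.
    """
    scale = max((abs(v) for v in vec), default=0.0)
    if scale == 0.0:
        return [0.0] * len(vec)
    out = []
    for v in vec:
        x = abs(v) / scale * levels  # position on the grid, in [0, levels]
        lower = int(x)               # floor, since x is non-negative
        # Round up with probability equal to the fractional part of x.
        q = lower + (1 if rng.random() < x - lower else 0)
        sign = 1.0 if v >= 0 else -1.0
        out.append(sign * scale * q / levels)
    return out


def average_quantized(worker_vecs, levels=4, rng=random):
    """Average the quantized vectors of several workers, coordinate by coordinate."""
    quantized = [stochastic_quantize(v, levels, rng) for v in worker_vecs]
    m = len(quantized)
    return [sum(col) / m for col in zip(*quantized)]
```

With `levels` grid steps per sign, each coordinate can be sent as a small integer plus a sign, with one shared scale per vector, which is where the communication saving would come from in a scheme of this kind.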

Original language	English (US)
Pages (from-to)	2525-2536
Number of pages	12
Journal	Advances in Neural Information Processing Systems
Volume	2018-December
State	Published - 2018
Externally published	Yes
Event	32nd Conference on Neural Information Processing Systems, NeurIPS 2018 - Montreal, Canada
Duration	Dec 2 2018 → Dec 8 2018

ASJC Scopus subject areas

Computer Networks and Communications
Information Systems
Signal Processing

Access to Document

https://dblp.org/rec/conf/nips/JiangA18

Cite this

@article{8bf45d72b93d44898b40ae59daed4582,

title = "A linear speedup analysis of distributed deep learning with sparse and quantized communication",

abstract = "The large communication overhead has imposed a bottleneck on the performance of distributed Stochastic Gradient Descent (SGD) for training deep neural networks. Previous works have demonstrated the potential of using gradient sparsification and quantization to reduce the communication cost. However, there is still a lack of understanding about how sparse and quantized communication affects the convergence rate of the training algorithm. In this paper, we study the convergence rate of distributed SGD for non-convex optimization with two communication reducing strategies: sparse parameter averaging and gradient quantization. We show that $O(1/\sqrt{MK})$ convergence rate can be achieved if the sparsification and quantization hyperparameters are configured properly. We also propose a strategy called periodic quantized averaging (PQASGD) that further reduces the communication cost while preserving the $O(1/\sqrt{MK})$ convergence rate. Our evaluation validates our theoretical results and shows that our PQASGD can converge as fast as full-communication SGD with only 3%-5% communication data size.",

author = "Peng Jiang and Gagan Agrawal",

note = "Publisher Copyright: {\textcopyright} 2018 Curran Associates Inc. All rights reserved.; 32nd Conference on Neural Information Processing Systems, NeurIPS 2018; Conference date: 02-12-2018 Through 08-12-2018",

year = "2018",

language = "English (US)",

volume = "2018-December",

pages = "2525--2536",

journal = "Advances in Neural Information Processing Systems",

issn = "1049-5258",

}

TY - JOUR

T1 - A linear speedup analysis of distributed deep learning with sparse and quantized communication

AU - Jiang, Peng

AU - Agrawal, Gagan

PY - 2018

Y1 - 2018

N2 - The large communication overhead has imposed a bottleneck on the performance of distributed Stochastic Gradient Descent (SGD) for training deep neural networks. Previous works have demonstrated the potential of using gradient sparsification and quantization to reduce the communication cost. However, there is still a lack of understanding about how sparse and quantized communication affects the convergence rate of the training algorithm. In this paper, we study the convergence rate of distributed SGD for non-convex optimization with two communication reducing strategies: sparse parameter averaging and gradient quantization. We show that O(1/√(MK)) convergence rate can be achieved if the sparsification and quantization hyperparameters are configured properly. We also propose a strategy called periodic quantized averaging (PQASGD) that further reduces the communication cost while preserving the O(1/√(MK)) convergence rate. Our evaluation validates our theoretical results and shows that our PQASGD can converge as fast as full-communication SGD with only 3%-5% communication data size.

UR - http://www.scopus.com/inward/record.url?scp=85064232043&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85064232043&partnerID=8YFLogxK

M3 - Conference article

SN - 1049-5258

VL - 2018-December

SP - 2525

EP - 2536

JO - Advances in Neural Information Processing Systems

JF - Advances in Neural Information Processing Systems

T2 - 32nd Conference on Neural Information Processing Systems, NeurIPS 2018

Y2 - 2 December 2018 through 8 December 2018

ER -
