December 12, 2017

Distributing control of deep learning training delivers 10x performance improvement

by Wei Zhang, IBM

My IBM Research AI team and I recently completed the first formal theoretical study of the convergence rate and communications complexity associated with a decentralized distributed approach in a deep learning training setting. The empirical evidence shows that, in specific configurations, a decentralized approach can deliver a 10x performance boost over a centralized approach without additional complexity. A paper describing our work has been accepted for oral presentation at the NIPS 2017 Conference, one of only 40 out of 3,240 submissions selected for an oral presentation.

Supervised machine learning generally consists of two phases: 1) training (building a model) and 2) inference (making predictions with the model). The training phase involves finding optimal values for a model's parameters such that error on a set of training examples is minimized, and the model generalizes to new data. There are several algorithms for finding optimal values (training cost vs. model accuracy) for model parameters; however, gradient descent, in its various flavors, is one of the most popular.

In its simplest form, gradient descent is a serial algorithm that iteratively tweaks a model's parameters using the entire training set at each step and, therefore, is prohibitively expensive for large datasets. Stochastic Gradient Descent (SGD) is a computational simplification that cycles through the training samples in random order to arrive at optimal parameter values (convergence). In fact, SGD is the de facto optimization method used during model training in most deep learning frameworks. Accelerating SGD can help lower training costs and improve the overall user experience, and is an important field of study in the deep learning domain.
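As a minimal sketch of the idea, the following fits a straight line with SGD in plain Python. The toy data, the function name `sgd_linear_fit`, the learning rate and the epoch count are all assumptions chosen for illustration; they are not taken from our work.

```python
import random

def sgd_linear_fit(samples, lr=0.05, epochs=200, seed=0):
    """Fit y = w*x + b with SGD, visiting samples in a fresh random order each epoch."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)              # random order is what makes it "stochastic"
        for i in order:
            x, y = samples[i]
            err = (w * x + b) - y       # prediction error on ONE sample
            # Gradient of 0.5 * err**2 with respect to w and b.
            w -= lr * err * x
            b -= lr * err
    return w, b

# Hypothetical noise-free toy data drawn from y = 2x + 1.
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
w, b = sgd_linear_fit(data)
print(round(w, 3), round(b, 3))
```

Each update touches a single sample, so the cost per step does not grow with the size of the dataset.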

Parallel Stochastic Gradient Descent (PSGD) is an approach to parallelizing the SGD algorithm by using multiple gradient-calculating "workers." The PSGD approach is based on the distributed computing paradigm and, until recently, has been implemented in a centralized configuration to deal with the challenges involved in consistent reads of and updates to model parameters. In such a configuration, there is, conceptually, a single centralized "master" parameter server responsible for synchronized reading and updating. Two of the most commonly used centralized PSGD approaches are all-reduce and asynchronous SGD (ASGD).
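The synchronous, centralized flavor can be sketched as a single-process simulation: every worker computes a gradient on its own data shard, the gradients are averaged (the value an all-reduce step would produce), and one shared copy of the parameters is updated. The names `sync_psgd` and `shards`, the toy data and the hyperparameters are assumptions for illustration; no real network communication takes place here.

```python
import random

def grad(w, b, x, y):
    """Gradient of 0.5 * (w*x + b - y)**2 with respect to (w, b)."""
    err = (w * x + b) - y
    return err * x, err

def sync_psgd(shards, lr=0.1, steps=2000, seed=0):
    """Simulate synchronous PSGD with ONE central copy of the parameters."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0                     # the single, central version of the weights
    for _ in range(steps):
        # Every worker reads the same (w, b) and samples from its own shard.
        grads = [grad(w, b, *rng.choice(shard)) for shard in shards]
        # Synchronization point: all workers must finish before the average exists.
        gw = sum(g[0] for g in grads) / len(grads)
        gb = sum(g[1] for g in grads) / len(grads)
        w -= lr * gw
        b -= lr * gb
    return w, b

# Hypothetical toy data from y = 2x + 1, split across 4 simulated workers.
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
shards = [data[i::4] for i in range(4)]
w, b = sync_psgd(shards)
print(round(w, 3), round(b, 3))
```

In a real deployment the averaging line is where every worker waits for the slowest one, which is why a single slow device stalls the whole step.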

During the past three years, I have been studying and working on the all-reduce based approach, and designing and testing several protocol variants for ASGD. The two approaches have unique pros and cons. The all-reduce approach has predictable convergence but, being synchronous, is slow. The ASGD-based approach allows workers to run at different speeds (without synchronization) and is fast, but has unpredictable convergence behavior. In practice, depending upon the approach taken, you are either praying that there is no computing device hiccup so the all-reduce runtime is not abysmal, or hoping the ASGD algorithm will just converge, keeping the accuracy vs. training cost curve in check.

It took me about two and a half years of work to realize that both approaches are essentially the same – centralized configurations. The only difference is that in all-reduce, there is one version of the central weights and, in ASGD, there are multiple versions. In any case, the centralized parameter server, unfortunately, becomes the single point of failure and a runtime bottleneck for both.

The obvious question then was: what if we could get rid of the parameter server altogether and let each worker run in a peer-to-peer fashion and, eventually, reach a consensus? If it worked, it would simplify distributed deep learning system design tremendously – not only could we get rid of the single point of failure, but we could also reduce the training cost. However, nobody knew if such a decentralized distributed approach would generate correct results for complicated optimization problems such as deep learning training.

To our surprise and delight, we have found that our approach works and can achieve convergence rates similar to those of traditional SGD. Our solution has been successfully applied to deep learning training across different types of workloads (open-source workloads and proprietary IBM workloads) and is easy to incorporate into various open-source frameworks (already accomplished for Torch). The technique works extremely well in slow network configurations, as it involves a low volume of handshake messages.
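The peer-to-peer idea can be illustrated with a small single-process simulation. This is a sketch under stated assumptions, not the implementation described in the paper: workers are assumed to sit on a ring, each averages its parameters with its two neighbors and then applies its own local gradient, and the name `decentralized_psgd`, the toy data and the hyperparameters are hypothetical.

```python
import random

def decentralized_psgd(shards, lr=0.1, steps=2000, seed=0):
    """Simulate PSGD with no parameter server: each worker keeps its own (w, b)."""
    rng = random.Random(seed)
    n = len(shards)
    params = [(0.0, 0.0)] * n           # one private copy of the weights per worker
    for _ in range(steps):
        new_params = []
        for i, shard in enumerate(shards):
            w, b = params[i]
            x, y = rng.choice(shard)
            err = (w * x + b) - y       # local gradient uses the worker's own weights
            # Exchange weights with the two ring neighbors only (assumed topology).
            lw, lb = params[(i - 1) % n]
            rw, rb = params[(i + 1) % n]
            avg_w = (lw + w + rw) / 3.0
            avg_b = (lb + b + rb) / 3.0
            new_params.append((avg_w - lr * err * x, avg_b - lr * err))
        params = new_params             # all workers advance together in this simulation
    return params

# Hypothetical toy data from y = 2x + 1, split across 4 simulated workers.
data = [(k / 10.0, 2.0 * (k / 10.0) + 1.0) for k in range(-10, 11)]
shards = [data[i::4] for i in range(4)]
final = decentralized_psgd(shards)
for w, b in final:
    print(round(w, 3), round(b, 3))
```

Each worker talks to only two peers per step regardless of how many workers there are, yet repeated neighbor averaging still pulls all the private copies toward the same solution.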

Provided by IBM

Citation: Distributing control of deep learning training delivers 10x performance improvement (2017, December 12) retrieved 17 July 2024 from https://phys.org/news/2017-12-deep-10x.html

This document is subject to copyright. Apart from any fair dealing for the purpose of private study or research, no part may be reproduced without the written permission. The content is provided for information purposes only.
