High Performance Distributed Deep Learning: A Beginner's Guide
AI/Machine Learning/Deep Learning
Time: Sunday, June 16th, 2pm - 6pm
Description: The current wave of advances in Deep Learning (DL) has led to many exciting
challenges and opportunities for Computer Science and Artificial Intelligence
researchers alike. DL frameworks like TensorFlow, PyTorch, Caffe, and several
others have emerged that offer ease of use and flexibility to describe, train,
and deploy various types of Deep Neural Networks (DNNs). In this tutorial, we
will provide an overview of interesting trends in DNN design and how
cutting-edge hardware architectures are playing a key role in moving the field
forward. We will also present an overview of different DNN architectures and DL
frameworks. Most DL frameworks started with a single-node/single-GPU design.
However, approaches to parallelizing DNN training are being actively
explored. The DL community has adopted a range of distributed
training designs that exploit communication runtimes like gRPC, MPI, and NCCL.
In this context, we highlight new challenges and opportunities for communication
runtimes to efficiently support distributed DNN training. We also highlight some
of our co-design efforts to utilize CUDA-Aware MPI for large-scale DNN training
on modern GPU clusters. Finally, we include hands-on exercises to enable
attendees to gain first-hand experience of running distributed DNN training
experiments on a modern GPU cluster.
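Distributed data-parallel training designs of the kind discussed above typically average gradients across workers with an allreduce collective; runtimes such as NCCL and MPI commonly implement it with a ring schedule. Below is a minimal single-process sketch that simulates the ring-allreduce pattern (reduce-scatter followed by allgather); the function name and list-based "workers" are illustrative assumptions, and a real system would instead call `MPI_Allreduce` or `ncclAllReduce` across processes or GPUs.

```python
def ring_allreduce(grads_per_worker):
    """Simulate a ring allreduce: element-wise sum across workers.

    grads_per_worker: list of equal-length lists, one per worker.
    Returns the buffer each worker would hold afterwards (the full sum).
    Illustrative only -- real runtimes exchange chunks over the network.
    """
    n = len(grads_per_worker)
    bufs = [list(g) for g in grads_per_worker]  # copy workers' buffers
    size = len(bufs[0])
    assert size % n == 0, "sketch assumes vector length divisible by n"
    chunk = size // n

    def sl(c):
        # Slice covering chunk index c of a worker's buffer.
        return slice(c * chunk, (c + 1) * chunk)

    # Phase 1: reduce-scatter. Each worker i passes a chunk to its ring
    # neighbor (i+1); after n-1 steps, worker i holds the complete sum
    # for chunk (i+1) % n.
    for step in range(n - 1):
        sends = [bufs[i][sl((i - step) % n)] for i in range(n)]
        for i in range(n):
            src = (i - 1) % n
            s = sl((src - step) % n)
            bufs[i][s] = [a + b for a, b in zip(bufs[i][s], sends[src])]

    # Phase 2: allgather. The completed chunks circulate around the ring
    # so every worker ends up with the full reduced vector.
    for step in range(n - 1):
        sends = [bufs[i][sl((i + 1 - step) % n)] for i in range(n)]
        for i in range(n):
            src = (i - 1) % n
            bufs[i][sl((src + 1 - step) % n)] = sends[src]

    return bufs
```

In data-parallel SGD each worker would then divide the summed gradient by the number of workers before applying its optimizer step; the ring schedule's appeal is that each worker sends and receives roughly 2(n-1)/n times the vector size regardless of n.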
Content Level: 60% beginner, 30% intermediate, and 10% advanced.
Target Audience: This tutorial targets newcomers as well as scientists, engineers, researchers, and students working in the areas of DL and MPI-based distributed DNN training on modern HPC clusters with high-performance interconnects.
Prerequisites: There are no fixed prerequisites. Attendees with a general knowledge of HPC and networking will be able to follow and appreciate the material.