Deep learning is a rapidly growing field of machine learning and has proven successful in many domains, including computer vision, language translation, and speech recognition. Training deep neural networks is resource intensive, requiring compute accelerators such as GPUs and large amounts of storage, memory, and network bandwidth. Preparing the training data also requires extensive tooling for data cleansing, data merging, ambiguity resolution, and similar tasks. Sophisticated middleware abstractions are needed to schedule resources, manage the distributed training job, and visualize how well training is progressing. Likewise, serving large neural network models under low-latency constraints can require middleware to manage model caching, selection, and refinement.
All the major cloud providers, including Amazon, Google, IBM, and Microsoft, have started in the last year or so to offer cloud services to train and/or serve deep neural network models. In addition, there is a lot of activity in open source middleware for deep learning, including TensorFlow, Theano, Caffe2, PyTorch, and MXNet. There are also efforts to extend existing platforms such as Spark for deep learning workloads.
This workshop focuses on the tools, frameworks, and algorithms that support executing deep learning algorithms in a distributed environment. As new hardware and accelerators become available, the middleware and systems need to be able to exploit their capabilities and ensure they are utilized efficiently.
The workshop is scheduled for the afternoon of Dec 11, 2017.
Record questions and ideas in this Google Doc: https://goo.gl/YRUKsz
Introduction and tutorial on deep learning (1:30 - 2:00) Bishwaranjan Bhattacharjee (IBM Research)
Paper presentations #1 (2:00 - 3:00)
The TensorFlow Partitioning and Scheduling Problem: It's the Critical Path! Ruben Mayer (University of Stuttgart), Christian Mayer (University of Stuttgart), Larissa Laich (University of Stuttgart)
Orchestrating Deep Learning workloads on distributed infrastructure. Seetharami Seelam (IBM Research)
TensorView: Visualizing the Training of Convolutional Neural Network Using Paraview. Xinyu Chen (University of New Mexico), Qiang Guan (Los Alamos National Laboratory), Xin Liang (University of California, Riverside), Li-Ta Lo (Los Alamos National Laboratory), Simon Su (US Army Research Laboratory), Trilce Estrada (University of New Mexico), James Ahrens (Los Alamos National Laboratory)
Break (3:00 - 3:30)
Paper presentations #2 (3:30 - 3:50)
Balanced System Design for Distributed Deep Learning with fast GPUs. Bishwaranjan Bhattacharjee (IBM Research)
Panel discussion (3:50 - 5:00)
Bishwaranjan Bhattacharjee, IBM Research
Vatche Ishakian, Bentley University
Hans-Arno Jacobsen, Middleware Systems Research Group
Vinod Muthusamy, IBM Research
Ian Foster, Argonne National Laboratory and the University of Chicago
Benoit Huet, Eurecom
Pietro Michiardi, Eurecom
Peter Pietzuch, Imperial College
Evgenia Smirni, College of William and Mary
Yandong Wang, Citadel Securities
Chuan Wu, University of Hong Kong