DIDL 2020 - Fourth Workshop on Distributed Infrastructures for Deep Learning

The DIDL workshop is co-located with ACM/IFIP Middleware 2020, which takes place December 7-11, 2020, in Delft, The Netherlands.

Deep learning is a rapidly growing field of machine learning that has proven successful in many domains, including computer vision, language translation, and speech recognition. Training deep neural networks is resource intensive, requiring compute accelerators such as GPUs as well as large amounts of storage, memory, and network bandwidth. Additionally, preparing the training data requires extensive tooling for data cleansing, data merging, ambiguity resolution, etc. Sophisticated middleware abstractions are needed to schedule resources, manage distributed training jobs, and visualize how well training is progressing. Likewise, serving large neural network models under low-latency constraints can require middleware to manage model caching, selection, and refinement.

All the major cloud providers, including Amazon, Google, IBM, and Microsoft, have started in the last year or so to offer cloud services to train and/or serve deep neural network models. In addition, there is a lot of activity in open source middleware for deep learning, including TensorFlow, Theano, Caffe2, PyTorch, and MXNet. There are also efforts to extend existing platforms such as Spark for deep learning workloads.

This workshop focuses on the tools, frameworks, and algorithms that support executing deep learning algorithms in a distributed environment. As new hardware and accelerators become available, the middleware and systems need to be able to exploit their capabilities and ensure they are utilized efficiently.

The workshop is scheduled to be in the morning on Dec 7, 2020.

Workshop Agenda

Introduction (9:00 - 9:10 ET / 15:00 - 15:10 CET)

Keynote: Making Training in Distributed Machine Learning Adaptive (9:10 - 9:55 ET / 15:10 - 15:55 CET)
Prof. Peter Pietzuch, Imperial College London

Abstract: When using distributed machine learning (ML) systems to train models on a cluster of worker machines, users must configure a large number of parameters: hyper-parameters (e.g. the batch size and the learning rate) affect model convergence; system parameters (e.g. the number of workers and their communication topology) impact training performance. In current systems, adapting such parameters during training is ill-supported.

In this talk, I will describe our recent work on KungFu, a distributed ML library for TensorFlow and PyTorch that is designed to enable adaptive training. KungFu allows users to express high-level Adaptation Policies (APs) that describe how to change hyper- and system parameters during training. APs take real-time monitored metrics (e.g. signal-to-noise ratios and noise scale) as input and trigger control actions (e.g. cluster rescaling or synchronisation strategy updates). For execution, APs are translated into monitoring and control operators, which are embedded in the dataflow graph. APs exploit an efficient asynchronous collective communication layer, which ensures concurrency and consistency of monitoring and adaptation operations. (This work has appeared in USENIX OSDI 2020.)

Bio: Peter Pietzuch is a Professor of Distributed Systems at Imperial College London, where he leads the Large-scale Data & Systems (LSDS) group. His research work focuses on the design and engineering of scalable, reliable and secure large-scale software systems, with a particular interest in performance, data management and security issues when supporting machine learning applications. He has published papers in premier scientific venues, including OSDI/SOSP, SIGMOD, VLDB, ASPLOS, USENIX ATC, EuroSys, SoCC, ICDCS, DEBS, and Middleware. Currently he is a Visiting Researcher with Microsoft Research and serves as the Director of Research in the Department, the Chair of the ACM SIGOPS European Chapter, and an Associate Editor for IEEE TKDE and TCC. Before joining Imperial College London, he was a post-doctoral Fellow at Harvard University. He holds PhD and MA degrees from the University of Cambridge.

Break (9:55 - 10:05 ET / 15:55 - 16:05 CET)

Paper presentations (10:05 - 11:05 ET / 16:05 - 17:05 CET)

Graph Representation Matters in Device Placement
Milko Mitropolitsky, Zainab Abbas, Amir H. Payberah (KTH Royal Institute of Technology)

Tools and Techniques for Privacy-aware, Edge-centric Distributed Deep Learning
Ziran Min*, Robert E. Canady*, Akram Hakiri^, Uttam Ghosh*, Aniruddha Gokhale*
* Vanderbilt University ^ University of Carthage

Break (11:05 - 11:15 ET / 17:05 - 17:15 CET)

Keynote: Architecture Transferability in Large Scale Neural Architecture Search (11:15 - 12:00 ET / 17:15 - 18:00 CET)
Rameswar Panda, IBM Research

Abstract: Neural Architecture Search (NAS) is an open and challenging problem in machine learning. While NAS offers great promise, the prohibitive computational demand of most existing NAS methods makes it difficult to search for architectures directly on large-scale tasks. The typical way of conducting large-scale NAS is to search for an architectural building block on a small dataset and then transfer the block to a larger dataset. In this talk, I will briefly review recent progress and challenges in the architecture transferability of different NAS methods, discuss the transfer value of different proxy datasets, and outline a few directions that machine learning researchers should focus on when designing future NAS algorithms that are not only efficient but also more effective at large scale.

Bio: Rameswar Panda is currently a Research Staff Member at the MIT-IBM Watson AI Lab, Cambridge, USA. Prior to joining the MIT-IBM lab, he obtained his Ph.D. in Electrical and Computer Engineering from the University of California, Riverside in 2018. During his Ph.D., Rameswar worked at NEC Labs America, Adobe Research and Siemens Corporate Research. His primary research interests span the areas of computer vision, machine learning and multimedia. In particular, his current focus is on image and video understanding, including efficient dynamic neural networks, large-scale neural architecture search and learning with limited supervision. His work has been published in top-tier conferences such as CVPR, ICCV, ECCV, and NeurIPS, as well as high-impact journals such as TIP and TMM. He actively participates as a program committee member for many top AI conferences and was leading co-chair of the Workshop on Multi-modal Video Analysis at ECCV 2020 and the Workshop on Neural Architecture Search at CVPR 2020. More details can be found at https://rpand002.github.io/.

Workshop conclusion (12:00 - 12:10 ET / 18:00 - 18:10 CET)

Workshop call for papers

Call For Papers (CFP)

Workshop Co-chairs

Bishwaranjan Bhattacharjee, IBM Research
Vatche Ishakian, Bentley University
Vinod Muthusamy, IBM Research

Program Committee

Parag Chandakkar, Walmart Labs
Ian Foster, Argonne National Laboratory and the University of Chicago
Matthew Hill, Dataminr
Mayoore Jaiswal, Nvidia
Gauri Joshi, Carnegie Mellon University
Jayaram K. R., IBM Research
Ruben Mayer, Technical University of Munich
Pietro Michiardi, Eurecom
Phuong Nguyen, eBay
Peter Pietzuch, Imperial College London
Chuan Wu, University of Hong Kong
