Fifth Workshop on Distributed Infrastructures for Deep Learning (DIDL) 2021

Middleware 2021 Workshops

The DIDL workshop is co-located with ACM/IFIP Middleware 2021, which takes place December 6-10 in Québec City, Canada.

Deep learning is a rapidly growing field of machine learning that has proven successful in many domains, including computer vision, language translation, and speech recognition. Training deep neural networks is resource intensive, requiring compute accelerators such as GPUs as well as large amounts of storage, memory, and network bandwidth. Additionally, preparing the training data requires substantial tooling for data cleansing, data merging, ambiguity resolution, and related tasks. Sophisticated middleware abstractions are needed to schedule resources, manage distributed training jobs, and visualize training progress. Likewise, serving large neural network models under low-latency constraints can require middleware to manage model caching, selection, and refinement.

All the major cloud providers, including Amazon, Google, IBM, and Microsoft, have recently started offering cloud services to train and/or serve deep neural network models. In addition, there is a great deal of activity in open source middleware for deep learning, including but not limited to TensorFlow, Theano, Caffe2, PyTorch, MXNet, Hugging Face, and fairseq. There are also efforts to extend existing platforms such as Spark and Ray for various aspects of deep learning.

This workshop focuses on the tools, frameworks, and algorithms that support executing deep learning algorithms in a distributed environment. As new hardware and accelerators become available, middleware and systems need to be able to exploit their capabilities and ensure they are utilized efficiently.

Workshop Agenda (Tentative)

Introduction (11:00 - 11:10 EST)

Keynote: AI/ML Pipelines using CodeFlare (11:10 - 12:10 EST)
Mudhakar Srivatsa, IBM Research

Abstract: Pipelines have become a ubiquitous construct in machine learning, spanning tasks that range from data cleaning and preprocessing to training foundation models, model optimization, transfer learning, and low-latency inferencing. While the pipeline construct has existed for many years (e.g., scikit-learn pipelines, Spark pipelines), this talk will focus on a process-calculus-style definition of pipelines, called CodeFlare Pipelines, that makes complex AI/ML workflows readily amenable to scaling on a commodity cluster. CodeFlare Pipelines not only enable data scientists to introduce compute, data, and multi-stage parallelism with simple annotations on the pipeline graph, but also operationalize those pipelines on a hybrid cloud platform (Red Hat OpenShift), making the solution deployable just about anywhere while leveraging the benefits of serverless computing. This talk will cover a basic realization of CodeFlare Pipelines on the Ray platform (1.7.0 release) that has shown near-linear scalability.
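For concreteness, here is a minimal sketch of the kind of multi-stage, data-parallel execution described above, written against plain Ray tasks rather than the CodeFlare Pipelines API itself; the stage functions (preprocess, fit_partition) and the toy shards are illustrative placeholders.

    import ray

    ray.init()

    @ray.remote
    def preprocess(shard):
        # Stage 1: clean/transform one data shard (placeholder logic).
        return [x * 2 for x in shard]

    @ray.remote
    def fit_partition(shard):
        # Stage 2: fit on one preprocessed shard (placeholder logic).
        return sum(shard)

    # Data parallelism: each shard flows through both stages independently,
    # and Ray schedules the resulting tasks across the cluster.
    shards = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    cleaned = [preprocess.remote(s) for s in shards]
    fitted = [fit_partition.remote(c) for c in cleaned]
    print(ray.get(fitted))

    ray.shutdown()

Passing the object references from the first stage directly into the second lets Ray overlap the stages without materializing intermediate results on the driver.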

Bio: Mudhakar Srivatsa is a distinguished research staff member in the Distributed AI department at the IBM T. J. Watson Research Center. His work is focused on cloud-native scaling of AI/ML workloads, with applications to large-scale spatial and time-series data. He has led the deployment of AI-assisted solutions for air traffic control, IT operations, combating piracy in the maritime domain, and public safety in dense urban environments such as stadiums and music festivals.

Break (12:10 - 12:30 EST)

Tutorial: Use of CodeFlare and Ray for Deep Learning Tasks (12:30 - 1:00 EST)
Linsong Chu, IBM Research

Abstract: Benchmarking is crucial for natural language understanding systems, but also very challenging. Training, evaluating, and analyzing such systems requires a variety of resource collections and, in turn, a large-scale distributed deep learning system. In this talk, Linsong will show how Ray, and Ray's integration with Horovod, can be used to train and evaluate deep learning models at scale for tasks like NLP benchmarking. He will demonstrate this workflow on two sample applications: GLUE benchmarking using Ray, and anomaly detection on remote sensing data using the Ray+Horovod integration.
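As a rough illustration of the Ray+Horovod pattern mentioned above, the sketch below runs a data-parallel PyTorch training function on Ray workers through Horovod's RayExecutor; the linear model and random batches are placeholders, and exact RayExecutor arguments may vary across Horovod releases.

    import ray
    from horovod.ray import RayExecutor

    def train_fn():
        import torch
        import torch.nn.functional as F
        import horovod.torch as hvd

        hvd.init()
        torch.manual_seed(42)

        model = torch.nn.Linear(10, 1)
        # Common Horovod idiom: scale the learning rate by the worker count.
        opt = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())

        # Start all workers from identical model and optimizer state.
        hvd.broadcast_parameters(model.state_dict(), root_rank=0)
        hvd.broadcast_optimizer_state(opt, root_rank=0)

        # Wrap the optimizer so gradients are averaged across workers each step.
        opt = hvd.DistributedOptimizer(opt, named_parameters=model.named_parameters())

        for _ in range(10):
            x, y = torch.randn(32, 10), torch.randn(32, 1)
            opt.zero_grad()
            loss = F.mse_loss(model(x), y)
            loss.backward()
            opt.step()
        return loss.item()

    ray.init()
    settings = RayExecutor.create_settings(timeout_s=30)
    executor = RayExecutor(settings, num_workers=2, use_gpu=False)
    executor.start()
    print(executor.run(train_fn))  # one result per worker
    executor.shutdown()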

Bio: Linsong Chu is a Research Engineer at IBM Research, specializing in large-scale machine learning and spatiotemporal analysis.

Paper presentations (1:00 - 1:40 EST)

RAMPS: Next Generation Platform for Real Time and Resilient IoT Analytics using MmWave and Programmable Switches
Vishal Shrivastav (Purdue University), Dimitrios Koutsonikolas (Northeastern University), Saurabh Bagchi (Purdue University)

Reproducible Model Sharing for AI Practitioners
Amin Moradi (Leiden University), Alexandru Uta (Leiden University)

Workshop conclusion (1:40 - 1:45 EST)

Workshop Call for Papers (CFP)

Workshop Co-chairs

Bishwaranjan Bhattacharjee, IBM Research
Vatche Ishakian, IBM Research
Vinod Muthusamy, IBM Research

Program Committee (Tentative)

Parag Chandakkar, Walmart Labs
Ian Foster, Argonne National Laboratory and the University of Chicago
Matthew Hill, Dataminr
Mayoore Jaiswal, Nvidia
Gauri Joshi, Carnegie Mellon University
Jayaram K. R., IBM Research
Ruben Mayer, Technical University of Munich
Pietro Michiardi, Eurecom
Phuong Nguyen, eBay
Peter Pietzuch, Imperial College London
Chuan Wu, University of Hong Kong