
Agenda (all times PST):
- 4:55 - 5:00pm: Intro
- 5:00 - 5:45pm: Tech talk 1 + Q&A
- 5:45 - 6:30pm: Tech talk 2 + Q&A
- 6:30pm: Close
Talk 1: Large-scale data shuffle in Ray with Exoshuffle
Shuffle is a key primitive in large-scale data processing applications. The difficulty of large-scale shuffle has inspired a myriad of implementations. While these have greatly improved shuffle performance and reliability over time, they come at a cost: flexibility. We show that, contrary to conventional wisdom, shuffle can be implemented with high performance and reliability on a general-purpose system for distributed computing: Ray.
In this talk, we present Exoshuffle, an application-level shuffle system that outperforms Spark and achieves 82% of theoretical performance sorting 100TB on 100 nodes. In Ray 2.0, we have integrated Exoshuffle with the Datasets library to provide high-performance, large-scale shuffle for ML users.
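For a sense of what this integration looks like from the user's side, here is a minimal sketch, assuming a Ray 2.x installation; the dataset is synthetic, and the exact mechanism for enabling push-based shuffle varies by release, so treat the details as illustrative rather than as the talk's own example.

```python
import ray

# Minimal sketch: the Exoshuffle work sits behind the ordinary Datasets
# API, so a global shuffle is a single call on the dataset.
# (In some Ray 2.x releases, push-based shuffle is opt-in via the
# RAY_DATASET_PUSH_BASED_SHUFFLE=1 environment variable; check the docs
# for your version.)
ray.init()

ds = ray.data.range(1_000_000)   # synthetic data for illustration
shuffled = ds.random_shuffle()   # distributed, all-to-all shuffle
print(shuffled.take(5))
```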
Talk 2: Scaling Training and Batch Inference: A Deep Dive into Ray AIR Data Processing Engine
Are you looking to scale your ML pipeline to multiple machines? Are you hitting a data ingest bottleneck that keeps your GPUs from staying saturated? This talk covers how Ray AIR uses Ray Datasets for efficient data loading and preprocessing in both training and batch inference, and dives into how Datasets gives AIR its performance and scalability.
We start with an overview of building training and batch inference pipelines with Ray AIR. Next, we dive into Ray Datasets internals, detailing features such as distributed data sharding, parallel + distributed I/O and transformations, pipelining of CPU and GPU compute, autoscaling pools of inference workers, and efficient per-epoch shuffling. Finally, we present case studies of users who have deployed such AIR workloads to production and seen the performance + scalability benefits.
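To make those building blocks concrete, here is a minimal sketch of the Datasets-level pieces the abstract mentions (distributed transforms, an autoscaling actor pool for inference, per-epoch shuffling), assuming a Ray 2.x cluster; the data, the `normalize` transform, and the `Predictor` class are illustrative stand-ins, not AIR's own examples.

```python
import ray
from ray.data import ActorPoolStrategy

ray.init()

# Parallel + distributed ingest; from_items stands in for read_parquet /
# read_csv on real data.
ds = ray.data.from_items([{"x": float(i)} for i in range(10_000)])

# CPU preprocessing as a distributed transformation (toy per-batch scaling).
def normalize(batch):
    batch["x"] = batch["x"] / batch["x"].max()
    return batch

ds = ds.map_batches(normalize, batch_format="pandas")

# Batch inference on an autoscaling pool of stateful workers. The "model"
# is a placeholder; real use would load a checkpoint in __init__ and pass
# num_gpus=1 to map_batches to place each worker on a GPU.
class Predictor:
    def __init__(self):
        self.weight = 2.0  # stand-in for an expensive model load

    def __call__(self, batch):
        batch["pred"] = batch["x"] * self.weight
        return batch

preds = ds.map_batches(
    Predictor,
    batch_format="pandas",
    compute=ActorPoolStrategy(min_size=2, max_size=8),  # autoscaling pool
)
print(preds.take(3))

# Per-epoch shuffling for training ingest (Ray 2.x pipeline API; newer
# releases expose this differently, so treat this as a sketch).
pipe = ds.repeat(2).random_shuffle_each_window()
for epoch_ds in pipe.iter_epochs():
    pass  # feed each freshly shuffled epoch to the trainer
```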