Large-scale Data Processing and ML Pipelines

Oct 19, 04:55 PM PST (11:55 PM GMT)
  • Virtual SF Big Analytics
  • Free, 91 attendees

Agenda (PST time zone):
- 4:55 - 5:00pm: Intro
- 5:00 - 5:45pm: Tech talk 1 + QA
- 5:45 - 6:30pm: Tech talk 2 + QA
- 6:30pm: Close

Talk 1: Large-scale data shuffle in Ray with Exoshuffle

Shuffle is a key primitive in large-scale data processing applications. The difficulty of large-scale shuffle has inspired a myriad of implementations. While these have greatly improved shuffle performance and reliability over time, the gains have come at a cost: flexibility. We show that, contrary to popular wisdom, shuffle can be implemented with high performance and reliability on a general-purpose system for distributed computing: Ray.

In this talk, we present Exoshuffle, an application-level shuffle system that outperforms Spark and achieves 82% of theoretical performance on a 100TB sort on 100 nodes. In Ray 2.0, we have integrated Exoshuffle with the Datasets library to provide high-performance large-scale shuffle for ML users.
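
For readers unfamiliar with the shuffle primitive the abstract refers to, here is a minimal single-process sketch of a two-stage (map, then reduce) shuffle used to sort data. It is plain Python, not Ray, and it is not Exoshuffle's implementation; the names `map_task`, `reduce_task`, `shuffle_sort`, and `boundaries` are hypothetical and chosen only for illustration. In a distributed system each call to `map_task` and `reduce_task` would run as a separate task, and the all-to-all exchange between them is the expensive step.

```python
import bisect
import heapq


def map_task(block, boundaries):
    """Sort one input block and split it into one partition per reducer.

    `boundaries` is a sorted list of split points; reducer r receives the
    values v with boundaries[r-1] <= v < boundaries[r].
    """
    parts = [[] for _ in range(len(boundaries) + 1)]
    for value in sorted(block):
        # The number of boundaries <= value picks the target reducer.
        parts[bisect.bisect_right(boundaries, value)].append(value)
    return parts


def reduce_task(parts):
    """Merge the already-sorted partitions destined for one reducer."""
    return list(heapq.merge(*parts))


def shuffle_sort(blocks, boundaries):
    """Run all map tasks, then the all-to-all exchange, then all reduce tasks."""
    map_outputs = [map_task(block, boundaries) for block in blocks]
    num_reducers = len(boundaries) + 1
    # All-to-all exchange: reducer r reads partition r from every map output.
    return [
        reduce_task([out[r] for out in map_outputs])
        for r in range(num_reducers)
    ]
```

Concatenating the reducer outputs in order yields the globally sorted data, since each reducer owns a disjoint key range.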

Talk 2: Scaling Training and Batch Inference: A Deep Dive into Ray AIR Data Processing Engine

Are you looking to scale your ML pipeline to multiple machines? Are you encountering an ingest bottleneck that prevents you from saturating your GPUs? This talk covers how Ray AIR uses Ray Datasets to load and preprocess data efficiently for both training and batch inference, and how AIR uses Datasets to achieve high performance and scalability.

We start with an overview of creating training and batch inference pipelines using Ray AIR. Next, we dive into the Ray Datasets internals, detailing features such as distributed data sharding, parallel and distributed I/O and transformations, pipelining of CPU and GPU compute, autoscaling pools of inference workers, and efficient per-epoch shuffling. Finally, we present case studies of users who have deployed such AIR workloads to production and have seen the performance and scalability benefits.
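
Two of the features listed above, data sharding and per-epoch shuffling, can be illustrated with a small sketch. This is plain Python under stated assumptions, not the Ray Datasets API or its internals; the function `epoch_shards` and its parameters are hypothetical. The idea is that every epoch reshuffles the full dataset with an epoch-dependent seed and then deals the records out to the training workers, so each worker sees a different slice in each epoch while the epoch as a whole still covers every record exactly once.

```python
import random


def epoch_shards(records, num_workers, epoch, seed=0):
    """Return one shard per worker for the given epoch.

    The shuffle order depends on (seed, epoch), so it is reproducible for a
    given epoch but changes from one epoch to the next.
    """
    rng = random.Random(seed * 1_000_003 + epoch)
    order = list(records)
    rng.shuffle(order)
    # Round-robin split: shard sizes differ by at most one record.
    return [order[w::num_workers] for w in range(num_workers)]
```

For example, `epoch_shards(range(10), num_workers=3, epoch=0)` returns three lists that together contain each of the ten records once.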

Jules Damji (Anyscale)

Stephanie Wang
PhD student in distributed systems at UC Berkeley, a software engineer at Anyscale, and a lead committer for the Ray project.

Jiajun Yao
Software engineer at Anyscale and a committer for the Ray project.

Jules Damji
Lead developer advocate at Anyscale Inc., an MLflow contributor, and co-author of Learning Spark, 2nd Edition.