Spark RAPIDS ML: GPU Accelerated Distributed ML in Spark Clusters

May 09, 12:00 PM PDT
  • Virtual SF Big Analytics
  • 42 RSVP

Spark ML, a key component of Apache Spark for large-scale machine learning, provides built-in implementations of many popular machine learning algorithms. These implementations were created a decade ago and do not take advantage of modern computing accelerators such as GPUs. In this talk, we will explore how to enable GPU acceleration of Spark machine learning applications.

We will introduce Spark RAPIDS ML, a recently open-sourced PySpark package that provides GPU-based distributed implementations of machine learning algorithms behind standard Spark ML APIs. Spark RAPIDS ML is built on the open-source, GPU-accelerated RAPIDS cuML library, which implements algorithms for regression, classification, clustering, and dimensionality reduction. We will give an overview of Spark RAPIDS ML's features, design, and implementation; go over usage and examples; and share benchmarking results demonstrating significant performance and cost benefits relative to CPU-based Spark ML. The project is joint work with Bobby Wang and Lee Yang of NVIDIA.

Jinfeng Li (NVIDIA)
Senior engineer in machine learning at NVIDIA.

Erik Ordentlich (NVIDIA)
Senior manager at NVIDIA, leading a group working on GPU-accelerated distributed ML and DL for Spark clusters and federated learning.