Spark ML is a key component of Apache Spark for large-scale machine learning and provides built-in implementations of many popular machine learning algorithms. These implementations were created a decade ago and do not take advantage of modern computing accelerators such as GPUs. In this talk, we will explore how to enable GPU acceleration of Spark machine learning applications.
We will introduce Spark RAPIDS ML, a recently open-sourced PySpark package (https://github.com/NVIDIA/spark-rapids-ml) that provides GPU-based distributed implementations of machine learning algorithms behind the standard Spark ML APIs. Spark RAPIDS ML is built on the open-source, GPU-accelerated RAPIDS cuML library (https://github.com/rapidsai/cuml), which implements various algorithms for regression, classification, clustering, and dimensionality reduction. We will give an overview of Spark RAPIDS ML’s features, design, and implementation, walk through usage and examples, and share benchmarking results demonstrating significant performance and cost benefits relative to CPU-based Spark ML. The project is joint work with Bobby Wang and Lee Yang from NVIDIA.
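As a rough illustration of what sharing the Spark ML API can look like in practice, the sketch below switches a KMeans fit between the CPU and GPU implementations by changing only the module the estimator is imported from. The helper names `kmeans_module` and `fit_kmeans` are hypothetical and exist only for this example. The `spark_rapids_ml.clustering` module path and the `KMeans(k=...)` / `setFeaturesCol` calls follow the project's README. The input DataFrame is assumed to have a `features` column, and the GPU path assumes a Spark cluster with GPUs and the package installed.

```python
import importlib

# Module paths providing a KMeans estimator with the Spark ML interface.
CPU_MODULE = "pyspark.ml.clustering"       # built-in Spark ML (CPU)
GPU_MODULE = "spark_rapids_ml.clustering"  # Spark RAPIDS ML (GPU)


def kmeans_module(use_gpu: bool) -> str:
    """Return the module path to import KMeans from (hypothetical helper)."""
    return GPU_MODULE if use_gpu else CPU_MODULE


def fit_kmeans(df, k: int, use_gpu: bool = True):
    """Fit KMeans on a Spark DataFrame that has a 'features' column.

    Only the import location differs between the CPU and GPU paths;
    the estimator calls are the same. The import is deferred so this
    file can be loaded on a machine without Spark or a GPU.
    """
    module = importlib.import_module(kmeans_module(use_gpu))
    estimator = module.KMeans(k=k).setFeaturesCol("features")
    return estimator.fit(df)
```

Calling `fit_kmeans(df, k=8, use_gpu=False)` would run the stock Spark ML estimator, which makes it straightforward to compare the two paths on the same data.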