To effectively support deep learning at LinkedIn, we first need to address the data processing issues. Most of the datasets used by our ML algorithms (e.g., LinkedIn's large-scale personalization engine, Photon-ML) are in Avro format. Each record in an Avro dataset is essentially a sparse vector and can be easily consumed by most modern classifiers. However, the format cannot be directly used by TensorFlow -- the leading deep learning package. The main blocker is that the sparse vector representation does not match TensorFlow's Tensor format.
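To make the format gap concrete, here is a minimal sketch in plain Python (no TensorFlow required). The Avro-style record layout and field names below are hypothetical, for illustration only: a sparse vector stored as parallel index/value arrays must be rearranged into the (indices, values, dense_shape) triple that TensorFlow's SparseTensor constructor expects.

```python
# Hypothetical Avro record encoding a sparse vector as parallel
# index/value arrays plus the vector's dimensionality.
avro_record = {"indices": [3, 17, 42], "values": [1.0, 0.5, 2.0], "dim": 100}

def to_sparse_tensor_components(record):
    """Rearrange a sparse-vector record into the (indices, values, dense_shape)
    components that tf.SparseTensor expects."""
    return {
        # TensorFlow wants 2-D indices: one [row] per nonzero entry.
        "indices": [[i] for i in record["indices"]],
        "values": record["values"],
        "dense_shape": [record["dim"]],
    }

components = to_sparse_tensor_components(avro_record)
print(components["indices"])      # [[3], [17], [42]]
print(components["dense_shape"])  # [100]
```

In the real pipeline this reshaping has to happen for every record in a multi-terabyte dataset, which is why a one-off script does not suffice.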
Many companies have vast amounts of ML data in a similar sparse vector format, and the Tensor format is still relatively new to many of them. Avro2TF bridges this gap by providing a scalable, Spark-based transformation and extension mechanism to efficiently convert the data into TFRecord files that can be readily consumed by TensorFlow. With this technology, engineers can improve their productivity by focusing on model building rather than data processing.
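The conversion can be pictured as a per-record map, which is what makes it a natural fit for Spark. The sketch below is a single-machine approximation with hypothetical record layouts and feature names; it is not Avro2TF's actual API. A real pipeline would serialize each resulting feature dict as a tf.train.Example and write TFRecord files in parallel.

```python
def record_to_features(record):
    """Convert one sparse-vector record into the feature dict that would
    be serialized into a TF Example (assumed record layout)."""
    return {
        "indices": record["indices"],
        "values": record["values"],
        "label": record["label"],
    }

# A toy in-memory "dataset"; in production this would be an Avro dataset
# loaded as a Spark DataFrame/RDD.
dataset = [
    {"indices": [0, 5], "values": [1.0, 2.0], "label": 1},
    {"indices": [2],    "values": [0.5],      "label": 0},
]

# With Spark this would be dataset_rdd.map(record_to_features);
# here, a plain Python list comprehension stands in.
tf_examples = [record_to_features(r) for r in dataset]
print(len(tf_examples))  # 2
```

Because each record is transformed independently, the job parallelizes trivially across a Spark cluster.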
In this talk, we will go over the data processing issues common to many machine learning pipelines and how we solve them, then take a deep dive into the open-sourced tool Avro2TF: how it works, its technical architecture, and its usage.
Senior Software Engineer at LinkedIn. His research interests include distributed big data systems and machine learning. He is currently working in the AI foundation team at LinkedIn and is leading the deep learning effort.