Next Generation Open Source Data Infra

Nov 13, 06:30 PM PST (02:30 AM GMT).
  • Free 178 Attendees
Talk #1: Introducing Iceberg, Tables designed for object stores

This talk will focus on Iceberg, a new table metadata format that's designed for managing huge tables backed by S3 storage. Iceberg decreases job planning time from minutes to under a second, while also isolating reads from writes and fixing longstanding problems like reliable schema evolution. This talk will include an overview of how Iceberg works and details about how Netflix is using Iceberg to make big data easier and more reliable.


Ryan Blue works on the Netflix big data platform team. He contributes to Apache Spark and is a PMC member of Apache Parquet and Apache Avro.

Talk #2: Scaling Apache Spark Usage at Lyft

In this talk, Li will cover how Apache Spark is currently used at Lyft and how Lyft scales that usage for machine learning and ETL-type workloads through a managed multi-cluster model. He will also show how Lyft operates Apache Spark with autoscaling and high-availability support, and how Spark coexists with Apache Hive and other data infrastructure services as a portfolio offered to a wide range of customers.


Li Gao is the tech lead in the Apache Spark domain in the Data Infrastructure org at Lyft. Prior to Lyft, Li worked at Salesforce, Fitbit, Marin Software, and a few startups in various technical leadership positions on cloud-native and hybrid-cloud data platforms at scale. Besides Spark, Li has scaled and productionized other open source projects, such as Presto, Apache HBase, Apache Phoenix, Apache Kafka, Apache Airflow, Apache Hive, and Apache Cassandra.

Talk #3: From flat files to deconstructed database: The evolution and future of the big data ecosystem

In this talk, Julien discusses the key open source components of the big data ecosystem—including Apache Calcite, Parquet, Arrow, Avro, and Kafka as well as batch and streaming systems—and explains how they relate to each other and how they make the ecosystem more of a database and less of a filesystem. (Parquet is the columnar data layout to optimize data at rest for querying. Arrow is the in-memory representation for maximum throughput execution and overhead-free data exchange. Calcite is the optimizer to make the most of our infrastructure capabilities.) Julien also explores the emerging components that are still missing or haven’t become standard yet to fully materialize the transformation to an extremely flexible database that lets you innovate with your data.


Julien Le Dem is the coauthor of Apache Parquet and the PMC chair of the project. He is also a committer and PMC member on Apache Pig, Apache Arrow, and a few other projects. Julien is a principal engineer at WeWork where he works on the data platform architecture. Previously, he was an architect at Dremio; tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_); and a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.

Ryan, Li, and Julien

The event ended.