Best practices towards a production-ready pipeline with Apache Beam


May 13, 10:00 AM PDT
  • Virtual
  • 340 RSVP
Description
Introducing BeamLearningMonth in May 2020! In collaboration with the Google Cloud team, we are hosting a series of practical introductory sessions on Apache Beam!

Apache Beam is an open source, unified model for defining both batch and streaming data-parallel processing pipelines. Using one of the open source Beam SDKs, you build a program that defines the pipeline. The pipeline is then executed by one of Beam’s supported distributed processing back-ends, which include Apache Apex, Apache Flink, Apache Spark, and Google Cloud Dataflow.

This is session 2 of the series:
Data engineering is a very interesting field, with lots of new technologies and opportunities. Unfortunately, it takes a long time to master, and there are not many resources for intermediate practitioners.
In this talk, we will take a public dataset and a concept, and build an Apache Beam pipeline that is stable and ready to run in production. We will walk through the workflow of starting the project in an IDE, writing and organizing pipeline code, and writing and running tests. You can adapt this model for your own pipeline, and I will be happy to answer your questions!

For more talks on Apache Beam, join and watch Session 1 of the series on May 6, 10:00 AM PDT.

Pablo Estrada

Pablo is a software engineer at Google and a PMC member for Apache Beam. He has been on the Google Cloud Dataflow team for almost 4 years, and has worked on many areas of the Beam and Dataflow stacks, such as monitoring, IO, and local execution. Pablo lives in Seattle, and misses going to the office.
The event has ended.
Watch Recording
*Recordings are hosted on YouTube; clicking the link opens the YouTube page.