Hien’s 15 Picks for Data+AI Summit

May 7, 2021

The Data + AI Summit is only a few weeks away, running May 24–28 (FYI: the virtual general conference is free). The first two days are reserved for complimentary and paid training. The last three days are for the hundreds of interesting sessions that cover a wide range of topics, such as Spark internals and best practices, Data Engineering, Data Architecture, Data Science, Deep Learning and ML, SQL Analytics, Business Intelligence, Spark production use cases, and so on.

Here are the 15 sessions that I am excited about.

Productionalizing Machine Learning Solutions with Effective Tracking, Monitoring and Management: Sumanth Venkatasubbaiah and Pankaj Rastogi from Intuit will present a system that continuously tracks and monitors ML models across the various development lifecycle stages at Intuit to ensure the health of their AI solutions.
Scaling Online ML Predictions at DoorDash: Hien Luu and Arbaz Khan from DoorDash will share the journey of building and scaling the ML platform at DoorDash and particularly the prediction service, which supports up to billions of predictions per day with a peak request rate above 1 million per second.
Koalas: Does Koalas Work Well or Not?: Takuya Ueshin and Xinrong Meng from Databricks will introduce Koalas, describe its current status, and compare it with Dask.
Consolidating MLOps at One of Europe’s Biggest Airports: Floris Hoogenboom from Royal Schiphol Group will take us through the way they rely on MLflow to release multiple versions of a model per week in a controlled fashion at Amsterdam Schiphol airport. Their ML use cases include predicting passenger flow and analyzing what is happening around the aircraft.
Structured Streaming Use-Cases at Apple: Kristine Guo and Liang-Chi Hsieh from Apple will share the interesting solutions they came up with to augment Structured Streaming so that it can maintain large amounts of state, support session windows in a general way, compute aggregates over dynamic batches, and perform stream-stream joins.
YOLO with Data-Driven Software: Brooke Wenig from Databricks and Tim Hunter from ABN AMRO will show how to treat data like code through the concept of Data-Driven Software (DDS) to allow data engineers and data scientists to YOLO: you only load your data once. This technique enables data scientists to use the intermediate results while collaborating with their peers without having to compute everything from scratch.
Importance of ML Reproducibility & Applications with MLflow: Gray Gwizdz and Marcy Grace Moesta from Databricks will outline the challenges and importance of building and maintaining reproducible, efficient, and governed ML solutions and how these challenges can be addressed using the point in time snapshots of Delta Lake and the governance capabilities provided by MLflow.
Project Zen: Making Data Science Easier in PySpark: Hyukjin Kwon and Haejoon Lee from Databricks will present some of the awesome improvements and useful features in PySpark, such as the newly redesigned pandas UDFs and pandas function APIs with Python type hints.
Intuitive & Scalable Hyperparameter Tuning with Apache Spark + Fugue: Hang Wang from Lyft will show how using Fugue-Tune and Spark together can eliminate some of the pain points in performing hyperparameter tuning at scale.
Simplify Data Conversion from Spark to TensorFlow and PyTorch: Liang Zhang from Databricks will show how to use the Apache Spark Dataset Converter, an open-source tool that simplifies the data conversion from Spark to deep learning frameworks, such as TensorFlow and PyTorch. This enables distributed training on an Apache Spark cluster.
Deep Dive into the New Features of Apache Spark 3.1: Wenchen Fan and Xiao Li from Databricks will walk us through the exciting new developments in Apache Spark 3.1, such as the SQL features for ANSI SQL compliance, new streaming features, and the performance enhancements and new tuning tricks in the query compiler.
Giving Away The Keys To The Kingdom: Using Terraform To Automate Databricks: Hamilton Ford and Serge Smertin from Scribd will share how Scribd gives its internal customers the flexibility to meet their own needs in Databricks without the platform team acting as gatekeepers: all it takes is a pull request.
Scaling your Data Pipelines with Apache Spark on Kubernetes: Rajesh Thallam from Google Cloud will help us understand key traits of Apache Spark on Kubernetes, such as autoscaling, and he will demonstrate running analytics pipelines on Spark, orchestrated with Apache Airflow, on a Kubernetes cluster.
Observability for Data Pipelines with OpenLineage: Julien Le Dem from Datakin, co-creator of Apache Parquet, will present the open-source project called Marquez, which instruments data pipelines to collect lineage and metadata. The collected metadata are extremely valuable for understanding the dependencies between the many teams that consume and produce data in a constantly changing data ecosystem.
Code Once Use Often with Declarative Data Pipelines: Anthony Awuley and Carter Kilgour from FlashFood will present their declarative data pipeline implementation, which lets less specialized personnel build and set up ETL pipelines easily and quickly. The boilerplate logic has been abstracted away to create highly configurable Apache Spark applications that can be orchestrated by Airflow.