Hien’s 15 Picks for Data+AI Summit 2022

May 9, 2022

The Data + AI Summit North America 2022 (June 27–30) is about 6 weeks away. It is one of the largest Data & AI conferences in the world.

There are hundreds of interesting sessions that cover a wide range of topics, such as Data Analytics, Data Lakes, Data Engineering, Data Architecture, Data Science & Machine Learning, MLOps and DataOps, and more. This year, the conference sessions will be available in a hybrid mode: in-person and virtual. The virtual pass is FREE.

To continue the tradition I started last year, here are the 15 sessions I am excited about.

Evolution of Data Architecture and How to Build a Lakehouse: Vini Jaiswal will discuss the data landscape and why Lakehouses are becoming the de facto choice for organizations building scalable data architecture.
Presto on Spark: A Unified SQL Experience: Ariel Weisberg and Shradha Ambekar will share how the integration between Presto and Spark enables a unified SQL experience across batch and interactive use cases.
How Robinhood Built a Streaming Lakehouse to Bring Data Freshness from 24h to Less Than 15 mins: Balaji Varadarajan and Vikrant Goel will share their journey of building a scalable streaming data lakehouse with Spark, Postgres, and other open source technologies to power the business analytics, experimentation, and ML use cases at Robinhood.
Scaling AI Workloads with Ray Ecosystem: Jules Damji will give an overview of Ray, covering its architecture, core concepts, and primitives. Ray is an exciting, up-and-coming distributed system designed to easily scale Python applications and ML workloads.
Distributed Machine Learning at Lyft: Anindya Saha and Shiraz Zaman will demonstrate how their ML platform provides unified, distributed training to manage ML pipelines that run on Spark on Kubernetes and are built using Fugue (their homegrown data processing and parameter optimization framework built on top of Spark).
MLOps on Databricks: A How-To Guide: Niall Turbitt and Joseph Bradley will unpack general principles that can guide your organization’s decisions for MLOps and share recommended ways to deploy ML models and pipelines on Databricks.
Tackling Challenges of Distributed Deep Learning with Open Source Solutions: Amog Kamsetty will share the merits of the ML open source ecosystem for distributed DL and then introduce Ray Train, an open source library built on the Ray distributed execution framework.
Data Lakehouse and Data Mesh — Two Sides of the Same Coin: Max Schultze and Arif Wider will share how these two architectural approaches are not competing with each other. Rather, they are orthogonal and can work very well together.
Deep Dive into the New Features of Apache Spark 3.2 and 3.3: Xiao Li and Wenchen Fan will talk about the high-level features and improvements, and will deep dive into these features: the Pandas API on Apache Spark, productionizing adaptive query execution, and using the RocksDB state store to make state processing more scalable.
How To Make Spark on Kubernetes Run Reliably on Spot Instances: Jean-Yves Stephan and Hudson Buzby will cover concrete guidelines on how to make Spark run reliably on spot instances, with code examples from real-world use cases.
Data Lineage with Spark, Delta Lake, and Unity Catalog: Tao Feng will talk about how to capture table and column lineage for Spark, Delta Lake, and Unity Catalog, and how users can leverage data lineage to manage data change, ensure data quality, and implement data governance in a data-driven organization.
Spark Inception: Exploiting the Spark REPL to Build Streaming Notebooks: This session is for the hardcore Spark geeks. Scott Haines will unveil the magic behind the Spark REPL curtain and will show how it works. Then he will teach us how to build our own Notebook-style service on top of Spark & Scala ILoop.
Scaling up Machine Learning in Instacart Search for the 2020 Surge in Online Shopping: Tejaswi Tenneti will cover the architecture of Instacart’s search engine, the issues they faced in training and serving ML models due to the increase in scale, and how they overcame those challenges by using more sophisticated models.
Real-time Cost Reduction Monitoring and Alerting: Ofer Ohana and David Sellam will present several use cases for which their real-time cost monitoring infrastructure enables them to detect problematic code, infrastructure and individual use of their infrastructure.
Scaling Real-Time ML at CashApp with Tecton: Michael Barnathan will show how CashApp has leveraged Tecton to build scalable ML serving infrastructure that provides high-quality search results to users in less than 100 milliseconds.

While you are here, allow me to share a shameless plug for my talk — MLOps at DoorDash 😃

Hope to see you there in person!!

Written by Hien Luu