
Operating thousands of Apache Flink jobs that power real-time decisioning, pricing, and operational workflows is no small feat, especially when every deployment can impact millions of users. At Uber, maintaining reliability, scalability, and speed across our Flink-as-a-Service (FaaS) platform demanded a new approach to deployment safety at scale.
In this talk, we introduce Uber’s Deployment Safety Framework for Flink Jobs, a system that delivers safe, fast, and automated deployments through full-lifecycle quality control. Learn how we built an ecosystem that combines progressive rollouts, automated testing, and intelligent rollback mechanisms to ensure stability without slowing down innovation.
Key topics include:
- Deployment Incrementality – Progressive rollouts that limit blast radius and ensure safety (see the sketch after this list).
- Automation & CI/CD Guardrails – Consistent code and config validation across environments.
- Unit & End-to-End Testing – Catching risky changes early through automated checks and traffic injection.
- Smart Rollbacks – Automated, metric-triggered rollbacks that prevent widespread failures.
- Tenant-Aware Testing – Validation of behavior across real workloads and environments.
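To make the rollout and rollback ideas above more concrete, here is a minimal sketch of a progressive rollout loop with a metric-triggered rollback. It is not Uber's actual framework: the stage fractions, the error-rate threshold, and the `deploy_fraction`, `rollback_all`, and `fetch_error_rate` callables are hypothetical placeholders for whatever the deployment platform and metrics store provide.

```python
import time
from dataclasses import dataclass
from typing import Callable

# Hypothetical cohort sizes: fraction of jobs that receive the new
# version at each stage of the progressive rollout.
ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.00]

# Hypothetical health threshold: abort if the error rate of upgraded
# jobs exceeds this value at any stage.
ERROR_RATE_THRESHOLD = 0.01


@dataclass
class RolloutResult:
    completed: bool
    stages_finished: int


def progressive_rollout(
    deploy_fraction: Callable[[float], None],
    rollback_all: Callable[[], None],
    fetch_error_rate: Callable[[], float],
    soak_seconds: int = 300,
) -> RolloutResult:
    """Roll a new job version out in growing cohorts, reverting on regression.

    All three callables are placeholders for platform-specific operations:
    `deploy_fraction` upgrades the given fraction of jobs, `rollback_all`
    reverts every upgraded job, and `fetch_error_rate` reads a health
    metric for the upgraded cohort (e.g. from a metrics store).
    """
    for stage, fraction in enumerate(ROLLOUT_STAGES, start=1):
        deploy_fraction(fraction)   # limit blast radius to this cohort
        time.sleep(soak_seconds)    # let the cohort soak before judging it

        if fetch_error_rate() > ERROR_RATE_THRESHOLD:
            rollback_all()          # metric-triggered rollback
            return RolloutResult(completed=False, stages_finished=stage - 1)

    return RolloutResult(completed=True, stages_finished=len(ROLLOUT_STAGES))
```

In a real system the soak time, cohort definitions, and health signals would be tuned per tenant and environment, and the rollback path would be wired into alerting; the sketch only shows the shape of the control loop.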
We’ll share lessons learned from scaling this system to thousands of streaming jobs and how it strengthened Uber’s real-time, event-driven data platform.
Whether you manage Flink at scale or operate mission-critical streaming systems, this session offers practical insights into building safe, self-healing deployment pipelines for modern data platforms.