
Operating thousands of Apache Flink jobs that power real-time decisioning, pricing, and operational workflows is no small feat, especially when every deployment can impact millions of users. At Uber, maintaining reliability, scalability, and speed across our Flink-as-a-Service (FaaS) platform demanded a new approach to deployment safety at scale.
In this talk, we introduce Uber’s Deployment Safety Framework for Flink Jobs, a system that delivers safe, fast, and automated deployments through full-lifecycle quality control. Learn how we built an ecosystem that combines progressive rollouts, automated testing, and intelligent rollback mechanisms to ensure stability without slowing down innovation.
Key topics include:
- Deployment Incrementality – Progressive rollouts that limit blast radius and ensure safety (see the sketch after this list).
- Automation & CI/CD Guardrails – Consistent code and config validation across environments.
- Unit & End-to-End Testing – Catching risky changes early through automated checks and traffic injection.
- Smart Rollbacks – Automated, metric-triggered rollbacks that prevent widespread failures.
- Tenant-Aware Testing – Validation of behavior across real workloads and environments.
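To make the rollout and rollback ideas above more concrete, here is a minimal sketch of a progressive rollout loop with a metric-triggered rollback. It is not Uber's actual framework: the stage fractions, the error-rate threshold, and the `deploy_fraction`, `rollback_all`, and `fetch_error_rate` callables are hypothetical placeholders for whatever the deployment platform and metrics store provide.

```python
import time
from dataclasses import dataclass
from typing import Callable

# Hypothetical cohort sizes: fraction of jobs that receive the new
# version at each stage of the progressive rollout.
ROLLOUT_STAGES = [0.01, 0.05, 0.25, 1.00]

# Hypothetical health threshold: abort if the error rate of upgraded
# jobs exceeds this value at any stage.
ERROR_RATE_THRESHOLD = 0.01


@dataclass
class RolloutResult:
    completed: bool
    stages_finished: int


def progressive_rollout(
    deploy_fraction: Callable[[float], None],
    rollback_all: Callable[[], None],
    fetch_error_rate: Callable[[], float],
    soak_seconds: int = 300,
) -> RolloutResult:
    """Roll a new job version out in growing cohorts, reverting on regression.

    All three callables are placeholders for platform-specific operations:
    `deploy_fraction` upgrades the given fraction of jobs, `rollback_all`
    reverts every upgraded job, and `fetch_error_rate` reads a health
    metric for the upgraded cohort (e.g. from a metrics store).
    """
    for stage, fraction in enumerate(ROLLOUT_STAGES, start=1):
        deploy_fraction(fraction)   # limit blast radius to this cohort
        time.sleep(soak_seconds)    # let the cohort soak before judging it

        if fetch_error_rate() > ERROR_RATE_THRESHOLD:
            rollback_all()          # metric-triggered rollback
            return RolloutResult(completed=False, stages_finished=stage - 1)

    return RolloutResult(completed=True, stages_finished=len(ROLLOUT_STAGES))
```

In a real system the soak time, cohort definitions, and health signals would be tuned per tenant and environment, and the rollback path would be wired into alerting; the sketch only shows the shape of the control loop.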
We’ll share lessons learned from scaling this system to thousands of streaming jobs and how it strengthened Uber’s real-time, event-driven data platform.
Whether you manage Flink at scale or operate mission-critical streaming systems, this session offers practical insights into building safe, self-healing deployment pipelines for modern data platforms.