TL;DR
The session addresses the challenge of achieving high-throughput, conflict-free streaming ingestion in real-time data processing. The solution presented is Apache Hudi's Non-Blocking Concurrency Control (NBCC), which allows multiple streams to write concurrently to the same table without conflicts. This enables efficient real-time data processing, improving data freshness and reducing wasted compute.
Opening
In the fast-paced world of data streaming, Uber's early data processing challenges highlight a common industry pain point: the struggle to maintain data freshness and processing efficiency amidst high-volume, concurrent data ingestion. Their journey from a cumbersome 24-hour data refresh cycle to real-time updates laid the groundwork for Apache Hudi's evolution. This transformation was driven by the need for a system capable of handling updates, deletes, and upserts while ensuring atomicity and consistency, paving the way for the development of Hudi's Non-Blocking Concurrency Control.
What You'll Learn (Key Takeaways)
- Apache Hudi's Non-Blocking Concurrency Control (NBCC) – Learn how NBCC manages concurrent writes without conflicts, enhancing data throughput and freshness in streaming architectures.
- Integration with Apache Flink – Discover the synergy between Flink and Hudi, leveraging NBCC for efficient, real-time data processing pipelines.
- Innovative File Layout and Indexing – Understand how Hudi's file layout and bucket index support event-time ordering and conflict-free ingestion.
- Future of NBCC – Explore upcoming enhancements like extensions to metadata tables, clustering, and various index types, promising even greater efficiency.
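To make the Flink + Hudi integration concrete, here is a minimal sketch of a Flink SQL table definition that enables NBCC on a merge-on-read table with a bucket index, as the takeaways above describe. This is an illustrative configuration, not from the session itself: the table name, schema, path, and bucket count are assumptions, and option names reflect the Hudi Flink connector (Hudi 1.0+, where `hoodie.write.concurrency.mode` accepts `NON_BLOCKING_CONCURRENCY_CONTROL`); consult the Hudi docs for your version before relying on them.

```sql
-- Hypothetical Flink SQL DDL: a MOR Hudi table set up for
-- non-blocking concurrent writers (NBCC requires the bucket index
-- and a merge-on-read table so concurrent deltas can be merged later).
CREATE TABLE rides (
  ride_id   STRING PRIMARY KEY NOT ENFORCED,
  driver_id STRING,
  fare      DOUBLE,
  event_ts  TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 's3://warehouse/rides',                -- illustrative path
  'table.type' = 'MERGE_ON_READ',
  'index.type' = 'BUCKET',                        -- bucket index: deterministic
  'hoodie.bucket.index.num.buckets' = '4',        -- record-to-file mapping
  'hoodie.write.concurrency.mode' = 'NON_BLOCKING_CONCURRENCY_CONTROL',
  'precombine.field' = 'event_ts'                 -- event-time ordering on merge
);
```

Because the bucket index maps each record key to a fixed bucket, two concurrent Flink jobs writing to this table produce deltas for the same file groups without locking each other out; the merge (driven by the precombine field) resolves ordering at compaction or read time.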
Q&A Highlights
Q: How does Hudi differ from Iceberg in handling write-heavy workloads?
A: Hudi excels in low-latency, write-heavy scenarios due to its rich indexing and native table management services, making it ideal for high-frequency streaming workloads.

Q: What optimizations does Hudi provide for low latency compared to other data lakehouses?
A: Hudi's design includes built-in compaction, clustering, and indexing, allowing for efficient upserts and fast data processing, ideal for real-time analytics.

Q: Is table maintenance built into Hudi?
A: Yes, Hudi incorporates native table management services, allowing for seamless compaction, clustering, and maintenance without relying on external scheduling or compute engines.
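As a hedged illustration of the built-in table services mentioned in the last answer, the options below show how asynchronous compaction and clustering can be scheduled from the same Flink job that writes the table, with no external scheduler. The option names are from the Hudi Flink connector, but the thresholds are invented for the example; verify names and defaults against the Hudi configuration reference for your release.

```sql
-- Hypothetical maintenance options on a Hudi Flink table:
-- compaction and clustering run as async table services inside
-- the writing job, instead of via an external orchestrator.
ALTER TABLE rides SET (
  'compaction.async.enabled' = 'true',      -- merge MOR deltas in the background
  'compaction.delta_commits' = '5',         -- trigger after N delta commits (illustrative)
  'clustering.schedule.enabled' = 'true',   -- periodically rewrite small files
  'clustering.async.enabled' = 'true'
);
```

The design choice here is that maintenance is a first-class part of the write path: the same process that ingests data schedules and executes compaction plans, which is what the answer means by not relying on external scheduling or compute engines.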