High-throughput streaming in Lakehouse with Non-Blocking Concurrency Control in Apache Flink & Hudi
Dipankar Mazumdar
Sagar Sumit

TL;DR

This session tackles a central challenge in real-time data processing: achieving high-throughput, conflict-free streaming ingestion. The solution presented is Apache Hudi's Non-Blocking Concurrency Control (NBCC), which lets multiple streams write concurrently to the same table without conflicting with one another. The result is more efficient real-time processing, fresher data, and fewer resources wasted on aborted or retried writes.

Opening

Uber's early data processing struggles illustrate a common industry pain point: maintaining data freshness and processing efficiency amid high-volume, concurrent ingestion. The company's journey from a cumbersome 24-hour data refresh cycle to real-time updates laid the groundwork for Apache Hudi's evolution. That transformation demanded a system capable of handling updates, deletes, and upserts while guaranteeing atomicity and consistency, and it paved the way for Hudi's Non-Blocking Concurrency Control.

What You'll Learn (Key Takeaways)

  • Apache Hudi's Non-Blocking Concurrency Control (NBCC) – Learn how NBCC manages concurrent writes without conflicts, enhancing data throughput and freshness in streaming architectures.
  • Integration with Apache Flink – Discover the synergy between Flink and Hudi, leveraging NBCC for efficient, real-time data processing pipelines (a configuration sketch follows this list).
  • Innovative File Layout and Indexing – Understand how Hudi's file layout and bucket index support event-time ordering and conflict-free ingestion.
  • Future of NBCC – Explore upcoming enhancements like extensions to metadata tables, clustering, and various index types, promising even greater efficiency.
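
To make the Flink integration concrete, here is a minimal, hypothetical sketch of how such a table might be declared from a Flink job. Table name, schema, and path are invented for illustration; the option keys (index.type, precombine.field, hoodie.bucket.index.num.buckets, hoodie.write.concurrency.mode) follow recent Hudi releases and should be verified against your version.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

    // Sketch: a Hudi MERGE_ON_READ table with a bucket index and non-blocking
    // concurrency control, so two Flink jobs can write to it concurrently.
    public class NbccTableSketch {
      public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        tEnv.executeSql(
            "CREATE TABLE rides ("
                + "  ride_id STRING PRIMARY KEY NOT ENFORCED,"
                + "  fare DOUBLE,"
                + "  ts TIMESTAMP(3)"
                + ") WITH ("
                + "  'connector' = 'hudi',"
                + "  'path' = 'file:///tmp/hudi/rides',"   // hypothetical path
                + "  'table.type' = 'MERGE_ON_READ',"
                // Bucket index: keys hash to a fixed set of file groups, so
                // concurrent writers deterministically target the same layout.
                + "  'index.type' = 'BUCKET',"
                + "  'hoodie.bucket.index.num.buckets' = '4',"
                // Event-time ordering: the later ts wins when keys collide.
                + "  'precombine.field' = 'ts',"
                // NBCC: writers proceed without blocking each other; conflicts
                // are resolved later at compaction via record-merge semantics.
                + "  'hoodie.write.concurrency.mode' = 'NON_BLOCKING_CONCURRENCY_CONTROL'"
                + ")");
      }
    }

Because the bucket index fixes the file-group layout up front, concurrent writers never race to create the same file group, which is what allows NBCC to skip pessimistic locking entirely.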

Q&A Highlights

Q: How does Hudi differ from Iceberg in handling write-heavy workloads?
A: Hudi excels in low-latency, write-heavy scenarios due to its rich indexing and native table management services, making it ideal for high-frequency streaming workloads.

Q: What optimizations does Hudi provide for low latency compared to other data lakehouses?
A: Hudi's unique design includes built-in compaction, clustering, and indexing, allowing for efficient upserts and fast data processing, ideal for real-time analytics.
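
As a small illustration of the upsert path, continuing the hypothetical rides table and tEnv from the sketch above: with a primary key declared, the Hudi Flink connector treats INSERT INTO as an upsert, so re-sending a key updates the row in place rather than appending a duplicate.

    // Continuing the earlier sketch (same tEnv, hypothetical `rides` table).
    // Re-inserting key 'r1' updates the existing row instead of duplicating it;
    // the precombine field (ts) keeps the later event.
    tEnv.executeSql("INSERT INTO rides VALUES ('r1', 10.5, TIMESTAMP '2024-01-01 10:00:00')");
    tEnv.executeSql("INSERT INTO rides VALUES ('r1', 12.0, TIMESTAMP '2024-01-01 10:05:00')");
    // A subsequent snapshot read returns a single row for 'r1' with fare 12.0.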

Q: Is table maintenance built into Hudi?
A: Yes, Hudi incorporates native table management services, allowing for seamless compaction, clustering, and maintenance without relying on external scheduling or compute engines.
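
A hedged sketch of what "built into Hudi" can look like in practice with the Flink connector: compaction is scheduled and executed inside the same streaming job via table options, with no external scheduler or separate compute engine. Table name and path are hypothetical; the compaction option names reflect recent Hudi releases.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

    // Sketch: table services configured inline on a Hudi table, so the Flink
    // streaming job runs compaction itself alongside ingestion.
    public class TableServicesSketch {
      public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        tEnv.executeSql(
            "CREATE TABLE rides_managed ("
                + "  ride_id STRING PRIMARY KEY NOT ENFORCED,"
                + "  fare DOUBLE,"
                + "  ts TIMESTAMP(3)"
                + ") WITH ("
                + "  'connector' = 'hudi',"
                + "  'path' = 'file:///tmp/hudi/rides_managed',"  // hypothetical path
                + "  'table.type' = 'MERGE_ON_READ',"
                // Async compaction runs inside this job, merging row-based log
                // files into columnar base files every 5 delta commits.
                + "  'compaction.async.enabled' = 'true',"
                + "  'compaction.delta_commits' = '5'"
                + ")");
      }
    }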

Dipankar Mazumdar
Staff Data Engineer Advocate, Onehouse.ai

Dipankar is currently a Staff Data Engineer Advocate at Onehouse, where he focuses on open-source projects such as Apache Hudi and XTable to help engineering teams build and scale robust analytics platforms. Before this, he worked on critical open-source projects such as Apache Iceberg and Arrow at Dremio. For most of his career, Dipankar has worked at the intersection of data engineering and machine learning. He is also the author of the book "Engineering Lakehouses using Open Table Formats". Dipankar has spoken at numerous conferences, including Data+AI, ApacheCon, Scale By the Bay, and Data Day Texas.

Sagar Sumit
Software Engineer, Onehouse

Sagar Sumit is a Database Engineer at Onehouse and an Apache Hudi committer. He works on Hudi's transactional engine, and his current project involves the design of new indexing schemes. He is also a contributor to the Presto and Trino projects. In the past, he worked on the team that built Amazon Aurora, a relational database built for the cloud that now powers mission-critical applications for AWS customers. He began his career working on Oracle GoldenGate, replicating committed transactions across heterogeneous database systems.
