High-throughput streaming in Lakehouse with Non-Blocking Concurrency Control in Apache Flink & Hudi
Dipankar Mazumdar
Sagar Sumit

TL;DR

This session tackles a central challenge in real-time data processing: achieving high-throughput, conflict-free streaming ingestion. The solution presented is Apache Hudi's Non-Blocking Concurrency Control (NBCC), which lets multiple streams write concurrently to the same table without conflicting with one another. The result is more efficient real-time processing, fresher data, and fewer resources wasted on aborted or retried writes.

Opening

Uber's early data processing struggles illustrate a common industry pain point: maintaining data freshness and processing efficiency amid high-volume, concurrent ingestion. The company's journey from a cumbersome 24-hour data refresh cycle to real-time updates laid the groundwork for Apache Hudi's evolution. That transformation demanded a system capable of handling updates, deletes, and upserts while guaranteeing atomicity and consistency, and it paved the way for Hudi's Non-Blocking Concurrency Control.

What You'll Learn (Key Takeaways)

  • Apache Hudi's Non-Blocking Concurrency Control (NBCC) – Learn how NBCC manages concurrent writes without conflicts, enhancing data throughput and freshness in streaming architectures.
  • Integration with Apache Flink – Discover the synergy between Flink and Hudi, leveraging NBCC for efficient, real-time data processing pipelines (a configuration sketch follows this list).
  • Innovative File Layout and Indexing – Understand how Hudi's file layout and bucket index support event-time ordering and conflict-free ingestion.
  • Future of NBCC – Explore upcoming enhancements like extensions to metadata tables, clustering, and various index types, promising even greater efficiency.
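
To make the Flink integration concrete, here is a minimal, hypothetical sketch of how such a table might be declared from a Flink job. Table name, schema, and path are invented for illustration; the option keys (index.type, precombine.field, hoodie.bucket.index.num.buckets, hoodie.write.concurrency.mode) follow recent Hudi releases and should be verified against your version.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

    // Sketch: a Hudi MERGE_ON_READ table with a bucket index and non-blocking
    // concurrency control, so two Flink jobs can write to it concurrently.
    public class NbccTableSketch {
      public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        tEnv.executeSql(
            "CREATE TABLE rides ("
                + "  ride_id STRING PRIMARY KEY NOT ENFORCED,"
                + "  fare DOUBLE,"
                + "  ts TIMESTAMP(3)"
                + ") WITH ("
                + "  'connector' = 'hudi',"
                + "  'path' = 'file:///tmp/hudi/rides',"   // hypothetical path
                + "  'table.type' = 'MERGE_ON_READ',"
                // Bucket index: keys hash to a fixed set of file groups, so
                // concurrent writers deterministically target the same layout.
                + "  'index.type' = 'BUCKET',"
                + "  'hoodie.bucket.index.num.buckets' = '4',"
                // Event-time ordering: the later ts wins when keys collide.
                + "  'precombine.field' = 'ts',"
                // NBCC: writers proceed without blocking each other; conflicts
                // are resolved later at compaction via record-merge semantics.
                + "  'hoodie.write.concurrency.mode' = 'NON_BLOCKING_CONCURRENCY_CONTROL'"
                + ")");
      }
    }

Because the bucket index fixes the file-group layout up front, concurrent writers never race to create the same file group, which is what allows NBCC to skip pessimistic locking entirely.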

Q&A Highlights

Q: How does Hudi differ from Iceberg in handling write-heavy workloads?
A: Hudi excels in low-latency, write-heavy scenarios due to its rich indexing and native table management services, making it ideal for high-frequency streaming workloads.

Q: What optimizations does Hudi provide for low latency compared to other data lakehouses?
A: Hudi's unique design includes built-in compaction, clustering, and indexing, allowing for efficient upserts and fast data processing, ideal for real-time analytics.
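
As a small illustration of the upsert path, continuing the hypothetical rides table and tEnv from the sketch above: with a primary key declared, the Hudi Flink connector treats INSERT INTO as an upsert, so re-sending a key updates the row in place rather than appending a duplicate.

    // Continuing the earlier sketch (same tEnv, hypothetical `rides` table).
    // Re-inserting key 'r1' updates the existing row instead of duplicating it;
    // the precombine field (ts) keeps the later event.
    tEnv.executeSql("INSERT INTO rides VALUES ('r1', 10.5, TIMESTAMP '2024-01-01 10:00:00')");
    tEnv.executeSql("INSERT INTO rides VALUES ('r1', 12.0, TIMESTAMP '2024-01-01 10:05:00')");
    // A subsequent snapshot read returns a single row for 'r1' with fare 12.0.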

Q: Is table maintenance built into Hudi?
A: Yes, Hudi incorporates native table management services, allowing for seamless compaction, clustering, and maintenance without relying on external scheduling or compute engines.
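
A hedged sketch of what "built into Hudi" can look like in practice with the Flink connector: compaction is scheduled and executed inside the same streaming job via table options, with no external scheduler or separate compute engine. Table name and path are hypothetical; the compaction option names reflect recent Hudi releases.

    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

    // Sketch: table services configured inline on a Hudi table, so the Flink
    // streaming job runs compaction itself alongside ingestion.
    public class TableServicesSketch {
      public static void main(String[] args) {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

        tEnv.executeSql(
            "CREATE TABLE rides_managed ("
                + "  ride_id STRING PRIMARY KEY NOT ENFORCED,"
                + "  fare DOUBLE,"
                + "  ts TIMESTAMP(3)"
                + ") WITH ("
                + "  'connector' = 'hudi',"
                + "  'path' = 'file:///tmp/hudi/rides_managed',"  // hypothetical path
                + "  'table.type' = 'MERGE_ON_READ',"
                // Async compaction runs inside this job, merging row-based log
                // files into columnar base files every 5 delta commits.
                + "  'compaction.async.enabled' = 'true',"
                + "  'compaction.delta_commits' = '5'"
                + ")");
      }
    }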

Dipankar Mazumdar
Staff Data Engineer Advocate, Onehouse.ai

Dipankar is currently a Staff Data Engineer Advocate at Onehouse, where he focuses on open-source projects such as Apache Hudi and XTable to help engineering teams build and scale robust analytics platforms. Before this, he worked on critical open-source projects such as Apache Iceberg and Arrow at Dremio. For most of his career, Dipankar has worked at the intersection of data engineering and machine learning. He is also the author of the book "Engineering Lakehouses using Open Table Formats". Dipankar has spoken at numerous conferences, including Data+AI, ApacheCon, Scale By the Bay, and Data Day Texas.

Sagar Sumit
Software Engineer, Onehouse

Sagar Sumit is a Database Engineer at Onehouse and an Apache Hudi committer. He works on Hudi's transactional engine, and his current project involves the design of new indexing schemes. He is also a contributor to the Presto and Trino projects. In the past, he worked on the team that built Amazon Aurora, a relational database built for the cloud that now powers mission-critical applications for AWS customers. He began his career working on Oracle GoldenGate, replicating committed transactions across heterogeneous database systems.
