Video · May 29, 2025 · 30 min

High-throughput streaming in Lakehouse with Non-Blocking Concurrency Control in Apache Flink & Hudi


Session Overview

Discover how Apache Hudi's Non-Blocking Concurrency Control boosts real-time data ingestion efficiency. Learn to implement conflict-free streaming pipelines!

TL;DR

The session addresses the challenge of achieving high-throughput and conflict-free streaming ingestion in real-time data processing. The solution presented is Apache Hudi's Non-Blocking Concurrency Control (NBCC), which allows multiple streams to write concurrently to the same table without conflicts. This innovation enables efficient real-time data processing, enhancing data freshness and reducing resource wastage.

Opening

In the fast-paced world of data streaming, Uber's early data processing challenges highlight a common industry pain point: the struggle to maintain data freshness and processing efficiency amidst high-volume, concurrent data ingestion. Their journey from a cumbersome 24-hour data refresh cycle to real-time updates laid the groundwork for Apache Hudi's evolution. This transformation was driven by the need for a system capable of handling updates, deletes, and upserts while ensuring atomicity and consistency, paving the way for the development of Hudi's Non-Blocking Concurrency Control.

What You'll Learn (Key Takeaways)

  • Apache Hudi's Non-Blocking Concurrency Control (NBCC) – Learn how NBCC manages concurrent writes without conflicts, enhancing data throughput and freshness in streaming architectures.
  • Integration with Apache Flink – Discover the synergy between Flink and Hudi, leveraging NBCC for efficient, real-time data processing pipelines.
  • Innovative File Layout and Indexing – Understand how Hudi's file layout and bucket index support event-time ordering and conflict-free ingestion.
  • Future of NBCC – Explore upcoming enhancements like extensions to metadata tables, clustering, and various index types, promising even greater efficiency.
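As a rough sketch of how these pieces fit together, the Flink SQL DDL below enables NBCC on a Hudi table. The table name, path, and schema are hypothetical; the option keys reflect the Hudi Flink connector, and exact names or defaults may vary by Hudi version:

```sql
-- Hypothetical example: table name, path, and schema are illustrative.
CREATE TABLE rides_hudi (
  ride_id   STRING PRIMARY KEY NOT ENFORCED,
  driver_id STRING,
  fare      DOUBLE,
  event_ts  TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/rides_hudi',
  -- NBCC is designed for MERGE_ON_READ tables with a bucket index
  'table.type' = 'MERGE_ON_READ',
  'index.type' = 'BUCKET',
  -- Let multiple streaming writers append deltas to the same table
  -- without taking a table-level lock on each commit
  'hoodie.write.concurrency.mode' = 'NON_BLOCKING_CONCURRENCY_CONTROL',
  -- With concurrent writers, failed writes are cleaned up lazily
  'hoodie.cleaner.policy.failed.writes' = 'LAZY'
);
```

Two Flink jobs pointed at this table could then ingest from separate streams concurrently, with conflicts resolved at compaction time rather than at write time.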

Q&A Highlights

Q: How does Hudi differ from Iceberg in handling write-heavy workloads?
A: Hudi excels in low-latency, write-heavy scenarios due to its rich indexing and native table management services, making it ideal for high-frequency streaming workloads.

Q: What optimizations does Hudi provide for low latency compared to other data lakehouses?
A: Hudi's unique design includes built-in compaction, clustering, and indexing, allowing for efficient upserts and fast data processing, ideal for real-time analytics.

Q: Is table maintenance built into Hudi?
A: Yes, Hudi incorporates native table management services, allowing for seamless compaction, clustering, and maintenance without relying on external scheduling or compute engines.
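As an illustrative sketch of what "built-in" maintenance looks like in practice, the Flink connector can schedule and run compaction and clustering inside the streaming writer itself. The option keys below follow the Hudi Flink connector, but the values are hypothetical and defaults may differ across Hudi versions:

```sql
-- Illustrative options only; no external scheduler or compute engine is used.
CREATE TABLE events_hudi (
  event_id STRING PRIMARY KEY NOT ENFORCED,
  payload  STRING,
  event_ts TIMESTAMP(3)
) WITH (
  'connector' = 'hudi',
  'path' = 'file:///tmp/events_hudi',
  'table.type' = 'MERGE_ON_READ',
  -- Run compaction asynchronously inside the streaming job
  'compaction.async.enabled' = 'true',
  -- Trigger a compaction after every 5 delta commits (hypothetical cadence)
  'compaction.delta_commits' = '5',
  -- Schedule and run clustering from the same writer pipeline
  'clustering.schedule.enabled' = 'true',
  'clustering.async.enabled' = 'true'
);
```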

About Speaker

Dipankar Mazumdar

Dipankar is currently a Staff Data Engineer Advocate at Onehouse, where he focuses on open-source projects such as Apache Hudi and XTable to help engineering teams build and scale robust analytics pla...

Sagar Sumit

Sagar Sumit is a Database Engineer at Onehouse and an Apache Hudi committer. He works on Hudi's transactional engine and his current project involves the design of new indexing schemes. He is also a c...