keynote
35 min
Fluss: Reinventing Kafka for the Real-Time Lakehouse
Jark Wu

TL;DR

The session examines Kafka's limitations as the streaming layer of the real-time lakehouses that modern AI applications demand. Fluss is introduced as a streaming storage system built from scratch to integrate natively with lakehouse architectures, enabling real-time streaming and analytics while improving data freshness and significantly reducing costs.

Opening

Imagine trying to power a cutting-edge AI application with outdated, sluggish data infrastructure. That is the reality for many teams that rely on traditional batch-oriented lakehouses and legacy streaming tools like Kafka. As Jark Wu highlighted in his session, the industry is shifting urgently toward real-time lakehouses, driven by the demands of generative AI and the need for immediate, accurate data. Wu opened with a LinkedIn discussion about reimagining Kafka, which set the stage for introducing Fluss, a system designed from scratch to overcome these challenges.

What You'll Learn (Key Takeaways)

  • Fluss as a Unified Solution – Fluss is designed as a Lakehouse-native streaming storage system, seamlessly integrating with modern lakehouse architectures to provide real-time streaming analytics.
  • Overcoming Kafka's Limitations – Fluss addresses Kafka's lack of native schema support, data model mismatches, and the inability to handle updates and deletes, making it a robust solution for real-time lakehouse needs.
  • Real-World Applications – Fluss is already in production at Alibaba, replacing Kafka in various scenarios, reducing operational costs by up to 80%, and managing over one petabyte of data.
  • Future Directions – The Fluss project is rapidly evolving, with plans for broader format support, enhanced query engine compatibility, and deeper integration with lakehouse systems.

Q&A Highlights

Q: How does Fluss handle columnar data for real-time workloads?
A: Fluss stores streaming logs in a columnar format (Apache Arrow), so readers can push down column projections and transfer only the columns a query needs, which significantly cuts network I/O and downstream processing.
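
To make the column-pruning benefit concrete, here is a minimal Java sketch using the Apache Arrow Java API (the columnar format referenced later in this Q&A). It illustrates the general principle rather than Fluss's actual client code; the field names, row count, and payload size are invented for the example.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.BigIntVector;
import org.apache.arrow.vector.VarCharVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

public class ColumnarProjectionDemo {

    public static void main(String[] args) {
        // Hypothetical two-column log schema: a small key column and a large payload column.
        Schema schema = new Schema(List.of(
                Field.nullable("user_id", new ArrowType.Int(64, true)),
                Field.nullable("payload", new ArrowType.Utf8())));

        try (BufferAllocator allocator = new RootAllocator();
             VectorSchemaRoot batch = VectorSchemaRoot.create(schema, allocator)) {

            BigIntVector userId = (BigIntVector) batch.getVector("user_id");
            VarCharVector payload = (VarCharVector) batch.getVector("payload");

            int rows = 1_000;
            byte[] blob = "x".repeat(512).getBytes(StandardCharsets.UTF_8); // 512-byte payload per row
            for (int i = 0; i < rows; i++) {
                userId.setSafe(i, i);
                payload.setSafe(i, blob);
            }
            batch.setRowCount(rows);

            // A row-oriented log has to ship every byte of every record to every consumer.
            long fullBatchBytes = (long) userId.getBufferSize() + payload.getBufferSize();
            // A columnar log can serve a query that only touches user_id from that column's buffers alone.
            long projectedBytes = userId.getBufferSize();

            System.out.printf("full batch: %d bytes, user_id only: %d bytes (%.1f%% of the data)%n",
                    fullBatchBytes, projectedBytes, 100.0 * projectedBytes / fullBatchBytes);
        }
    }
}
```

Because each column lives in its own buffers, a reader that only needs user_id never touches the payload buffers at all, which is the saving a row-oriented log cannot offer.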

Q: How does Fluss compare with other Lakehouse-native solutions?
A: Fluss integrates with existing lakehouse architectures without keeping extra copies of the data, whereas some other Lakehouse-native solutions still require the same data to be stored multiple times.

Q: Does Fluss utilize local disks, and how does it manage data storage?
A: Yes. Fluss uses tiered storage: recent data is served from local disks, while historical data is tiered into lakehouse tables through its Lake Tiering service.
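
The following is a hypothetical sketch of that tiering idea under stated assumptions: fresh segments stay on local disk for low-latency reads, older ones move to cheaper remote storage, and historical data lands in lakehouse tables. The class, thresholds, and tier names are illustrative only and do not reflect Fluss's actual configuration or Lake Tiering API.

```java
import java.time.Duration;
import java.time.Instant;

/**
 * Hypothetical tiering decision: hot data on local disk, warm data in remote
 * log storage, cold data compacted into lakehouse tables. Illustrative only.
 */
public class TieringPolicySketch {

    enum Tier { LOCAL_DISK, REMOTE_LOG, LAKEHOUSE_TABLE }

    private final Duration localRetention;   // e.g. keep the last hour on local disk
    private final Duration remoteRetention;  // e.g. keep the last day in remote log storage

    TieringPolicySketch(Duration localRetention, Duration remoteRetention) {
        this.localRetention = localRetention;
        this.remoteRetention = remoteRetention;
    }

    /** Decide where a closed log segment should live, based on its age. */
    Tier tierFor(Instant segmentClosedAt, Instant now) {
        Duration age = Duration.between(segmentClosedAt, now);
        if (age.compareTo(localRetention) <= 0) {
            return Tier.LOCAL_DISK;          // hot: serve tailing readers with low latency
        } else if (age.compareTo(remoteRetention) <= 0) {
            return Tier.REMOTE_LOG;          // warm: cheaper storage, still readable as a stream
        } else {
            return Tier.LAKEHOUSE_TABLE;     // cold: available to lakehouse query engines
        }
    }
}
```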

Q: Is the Arrow buffer constructed on the client or server side?
A: For log tables, the Arrow buffer is constructed client-side, while for primary key tables, it's constructed server-side to handle updates and deletes efficiently.
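
As a toy illustration of why primary-key tables push this work to the server, the sketch below (hypothetical, not Fluss code) resolves each upsert or delete against the current value for its key and emits the resulting change event; a client cannot finalize such a batch on its own because it does not know a key's previous value.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy model of primary-key semantics: every write is resolved against the
 * current value for that key before a change event is produced. Illustrative
 * only; Fluss's server-side implementation is not shown here.
 */
public class PrimaryKeyUpsertSketch {

    record Change(String kind, String key, String before, String after) {}

    private final Map<String, String> current = new HashMap<>();

    /** Apply an upsert and emit an INSERT (no prior value) or UPDATE (with prior value). */
    Change upsert(String key, String value) {
        String before = current.put(key, value);
        return before == null
                ? new Change("INSERT", key, null, value)
                : new Change("UPDATE", key, before, value);
    }

    /** Apply a delete and emit a DELETE carrying the prior value, or null if the key is unknown. */
    Change delete(String key) {
        String before = current.remove(key);
        return before == null ? null : new Change("DELETE", key, before, null);
    }

    public static void main(String[] args) {
        PrimaryKeyUpsertSketch table = new PrimaryKeyUpsertSketch();
        System.out.println(table.upsert("user-1", "Alice"));   // INSERT
        System.out.println(table.upsert("user-1", "Alicia"));  // UPDATE with old value
        System.out.println(table.delete("user-1"));            // DELETE with old value
    }
}
```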

Q: Is there a Fluss community for collaboration and participation?
A: Yes. Fluss has an active community; interested contributors can join the discussions and contribute via GitHub, or connect with Jark Wu on LinkedIn for more information.

Jark Wu
Head of Fluss and Flink SQL, Alibaba Cloud

Jark Wu is a committer and PMC member of Apache Flink. He leads both the Fluss and Flink SQL teams at Alibaba Cloud. With a decade of experience in Flink, he has been deeply involved in developing and evolving Flink SQL from its inception to where it is today. Along the way, he also initiated and incubated the Flink CDC and Fluss projects, further expanding the Flink ecosystem.
