TL;DR
The session addresses Kafka's limitations as the streaming layer for the real-time lakehouses that modern AI applications demand. Fluss is introduced as a new system, built from scratch to integrate natively with lakehouse architectures. This integration enables real-time streaming and analytics over lakehouse data, improving data freshness while cutting storage and operational costs.
Opening
Imagine trying to power a cutting-edge AI application with outdated, sluggish data infrastructure. This is the reality faced by many who rely on traditional batch-oriented lakehouses and legacy streaming tools like Kafka. As Jark Wu highlighted in his session, the industry is witnessing an urgent shift towards real-time lakehouses—driven by the demands of Gen AI and the need for immediate, accurate data. Wu's exploration of a LinkedIn discussion about reimagining Kafka set the stage for introducing Fluss, a revolutionary system designed to overcome these challenges.
What You'll Learn (Key Takeaways)
- Fluss as a Unified Solution – Fluss is designed as a Lakehouse-native streaming storage system, seamlessly integrating with modern lakehouse architectures to provide real-time streaming analytics.
- Overcoming Kafka's Limitations – Fluss addresses Kafka's lack of native schema support, data model mismatches, and the inability to handle updates and deletes, making it a robust solution for real-time lakehouse needs.
- Real-World Applications – Fluss is already in production at Alibaba, replacing Kafka in various scenarios, reducing operational costs by up to 80%, and managing over one petabyte of data.
- Future Directions – The Fluss project is rapidly evolving, with plans for broader format support, enhanced query engine compatibility, and deeper integration with lakehouse systems.
Q&A Highlights
Q: How does Fluss handle columnar data for real-time workloads? A: Fluss uses columnar storage for streaming logs, significantly improving performance by reducing network I/O and processing only necessary data.
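The point above can be made concrete with a small sketch. This is not Fluss code, just an illustration of why a columnar log layout lets a reader project one field without transferring the rest, whereas a row layout forces a full scan (field widths here are arbitrary assumptions):

```python
import struct

# Conceptual sketch (not the Fluss API): why columnar log storage cuts
# network I/O. With a row layout, reading one field still transfers every
# field; with a column layout, a projection touches only the bytes it needs.

rows = [(i, float(i) * 1.5, b"payload-%04d" % i) for i in range(1000)]

# Row-oriented encoding: 8-byte id + 8-byte value + 12-byte payload per row.
row_encoded = b"".join(
    struct.pack("<qd", rid, val) + payload for rid, val, payload in rows
)

# Column-oriented encoding: each field stored contiguously.
col_ids = struct.pack("<%dq" % len(rows), *(r[0] for r in rows))
col_vals = struct.pack("<%dd" % len(rows), *(r[1] for r in rows))
col_payloads = b"".join(r[2] for r in rows)

# Projecting just the `value` column:
row_scan_bytes = len(row_encoded)  # must read every row in full
col_scan_bytes = len(col_vals)     # reads only the projected column

print(row_scan_bytes, col_scan_bytes)  # → 28000 8000
```

Here the projection reads less than a third of the bytes; the gap widens as rows grow wider, which is why columnar streaming reads pay off for analytical consumers.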
Q: How does Fluss compare with other Lakehouse-native solutions? A: Fluss integrates with existing lakehouse architectures without requiring extra data copies, unlike alternative solutions that still maintain separate copies of the same data.
Q: Does Fluss utilize local disks, and how does it manage data storage? A: Fluss uses a tiered storage system, including local disks, to efficiently manage historical data and integrate with lakehouse tables through its Lake Tiering service.
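As a rough mental model of that tiered read path (class and method names are hypothetical, not Fluss internals): recent offsets are served from local disk, while a background tiering job moves older data into the lake tier, where historical reads are served:

```python
# Conceptual sketch of tiered storage (names are hypothetical, not Fluss's
# actual classes): recent offsets stay on local disk; a background tiering
# job moves older offsets into lake storage.

class TieredLog:
    def __init__(self, tier_threshold_offset):
        self.tier_threshold_offset = tier_threshold_offset
        self.local = {}  # offset -> record (hot, on local disk)
        self.lake = {}   # offset -> record (cold, tiered to the lakehouse)

    def append(self, offset, record):
        self.local[offset] = record

    def tier(self):
        # Background job: move everything below the threshold to the lake tier.
        for off in [o for o in self.local if o < self.tier_threshold_offset]:
            self.lake[off] = self.local.pop(off)

    def read(self, offset):
        if offset in self.local:
            return ("local", self.local[offset])
        return ("lake", self.lake[offset])

log = TieredLog(tier_threshold_offset=100)
for off in range(105):
    log.append(off, f"event-{off}")
log.tier()

print(log.read(50))   # → ('lake', 'event-50')
print(log.read(102))  # → ('local', 'event-102')
```

The real system adds many concerns this omits (segments, remote object storage, lake table formats), but the shape of the read path is the same: hot reads never leave the local tier, and historical reads go to the lake.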
Q: Is the Arrow buffer constructed on the client or server side? A: For log tables, the Arrow buffer is constructed client-side, while for primary key tables, it's constructed server-side to handle updates and deletes efficiently.
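The asymmetry in that answer can be sketched in a few lines. This is not the Fluss wire protocol, just an illustration of why the two table types build their columnar buffers in different places: log-table appends are immutable, so the client can transpose them into an Arrow-style columnar batch before sending, while primary key tables must first resolve updates and deletes against current state, which only the server holds:

```python
# Conceptual sketch (helper names are hypothetical, not Fluss APIs).

def build_columnar_batch(rows):
    """Client-side (log table): transpose appended rows into column buffers."""
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}

def merge_upserts(state, changes):
    """Server-side (primary key table): apply upserts/deletes by key
    before any columnar buffer can be built."""
    for change in changes:
        if change["op"] == "delete":
            state.pop(change["id"], None)
        else:  # upsert
            state[change["id"]] = change["row"]
    return state

# Log table: the client batches immutable appends column by column.
batch = build_columnar_batch([{"id": 1, "v": "a"}, {"id": 2, "v": "b"}])
print(batch)  # → {'id': [1, 2], 'v': ['a', 'b']}

# Primary key table: the server merges the changelog first; an update then
# a delete on the same key leaves no row behind.
state = merge_upserts({}, [
    {"op": "upsert", "id": 1, "row": {"id": 1, "v": "a"}},
    {"op": "upsert", "id": 1, "row": {"id": 1, "v": "a2"}},
    {"op": "delete", "id": 1},
])
print(state)  # → {}
```

Building the buffer where the authoritative state lives is the key design choice: pushing it to the client for log tables saves server CPU, while keeping it server-side for primary key tables is what makes updates and deletes correct.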
Q: Is there a Fluss community for collaboration and participation? A: Yes, Fluss has an active community; interested individuals can join the discussions and contribute via GitHub, or connect with Jark Wu on LinkedIn for more information.

