TL;DR
The session addresses Kafka's limitations as the streaming layer for the real-time lakehouses that modern AI applications demand. Fluss is introduced as a new system, built from scratch to integrate natively with lakehouse architectures. This integration enables real-time streaming and analytics over lakehouse data, improving data freshness while cutting storage and operational costs.
Opening
Imagine trying to power a cutting-edge AI application with outdated, sluggish data infrastructure. This is the reality faced by many who rely on traditional batch-oriented lakehouses and legacy streaming tools like Kafka. As Jark Wu highlighted in his session, the industry is witnessing an urgent shift towards real-time lakehouses—driven by the demands of Gen AI and the need for immediate, accurate data. Wu's exploration of a LinkedIn discussion about reimagining Kafka set the stage for introducing Fluss, a revolutionary system designed to overcome these challenges.
What You'll Learn (Key Takeaways)
- Fluss as a Unified Solution – Fluss is designed as a Lakehouse-native streaming storage system, seamlessly integrating with modern lakehouse architectures to provide real-time streaming analytics.
- Overcoming Kafka's Limitations – Fluss addresses Kafka's lack of native schema support, data model mismatches, and the inability to handle updates and deletes, making it a robust solution for real-time lakehouse needs.
- Real-World Applications – Fluss is already in production at Alibaba, replacing Kafka in various scenarios, reducing operational costs by up to 80%, and managing over one petabyte of data.
- Future Directions – The Fluss project is rapidly evolving, with plans for broader format support, enhanced query engine compatibility, and deeper integration with lakehouse systems.
Q&A Highlights
Q: How does Fluss handle columnar data for real-time workloads? A: Fluss uses columnar storage for streaming logs, significantly improving performance by reducing network I/O and processing only necessary data.
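The point above can be made concrete with a small sketch. This is not Fluss code, just an illustration of why a columnar log layout lets a reader project one field without transferring the rest, whereas a row layout forces a full scan (field widths here are arbitrary assumptions):

```python
import struct

# Conceptual sketch (not the Fluss API): why columnar log storage cuts
# network I/O. With a row layout, reading one field still transfers every
# field; with a column layout, a projection touches only the bytes it needs.

rows = [(i, float(i) * 1.5, b"payload-%04d" % i) for i in range(1000)]

# Row-oriented encoding: 8-byte id + 8-byte value + 12-byte payload per row.
row_encoded = b"".join(
    struct.pack("<qd", rid, val) + payload for rid, val, payload in rows
)

# Column-oriented encoding: each field stored contiguously.
col_ids = struct.pack("<%dq" % len(rows), *(r[0] for r in rows))
col_vals = struct.pack("<%dd" % len(rows), *(r[1] for r in rows))
col_payloads = b"".join(r[2] for r in rows)

# Projecting just the `value` column:
row_scan_bytes = len(row_encoded)  # must read every row in full
col_scan_bytes = len(col_vals)     # reads only the projected column

print(row_scan_bytes, col_scan_bytes)  # → 28000 8000
```

Here the projection reads less than a third of the bytes; the gap widens as rows grow wider, which is why columnar streaming reads pay off for analytical consumers.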
Q: How does Fluss compare with other Lakehouse-native solutions? A: Fluss integrates with existing lakehouse architectures without requiring extra data copies, unlike alternative solutions that still maintain separate copies of the same data.
Q: Does Fluss utilize local disks, and how does it manage data storage? A: Fluss uses a tiered storage system, including local disks, to efficiently manage historical data and integrate with lakehouse tables through its Lake Tiering service.
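As a rough mental model of that tiered read path (class and method names are hypothetical, not Fluss internals): recent offsets are served from local disk, while a background tiering job moves older data into the lake tier, where historical reads are served:

```python
# Conceptual sketch of tiered storage (names are hypothetical, not Fluss's
# actual classes): recent offsets stay on local disk; a background tiering
# job moves older offsets into lake storage.

class TieredLog:
    def __init__(self, tier_threshold_offset):
        self.tier_threshold_offset = tier_threshold_offset
        self.local = {}  # offset -> record (hot, on local disk)
        self.lake = {}   # offset -> record (cold, tiered to the lakehouse)

    def append(self, offset, record):
        self.local[offset] = record

    def tier(self):
        # Background job: move everything below the threshold to the lake tier.
        for off in [o for o in self.local if o < self.tier_threshold_offset]:
            self.lake[off] = self.local.pop(off)

    def read(self, offset):
        if offset in self.local:
            return ("local", self.local[offset])
        return ("lake", self.lake[offset])

log = TieredLog(tier_threshold_offset=100)
for off in range(105):
    log.append(off, f"event-{off}")
log.tier()

print(log.read(50))   # → ('lake', 'event-50')
print(log.read(102))  # → ('local', 'event-102')
```

The real system adds many concerns this omits (segments, remote object storage, lake table formats), but the shape of the read path is the same: hot reads never leave the local tier, and historical reads go to the lake.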
Q: Is the Arrow buffer constructed on the client or server side? A: For log tables, the Arrow buffer is constructed client-side, while for primary key tables, it's constructed server-side to handle updates and deletes efficiently.
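The asymmetry in that answer can be sketched in a few lines. This is not the Fluss wire protocol, just an illustration of why the two table types build their columnar buffers in different places: log-table appends are immutable, so the client can transpose them into an Arrow-style columnar batch before sending, while primary key tables must first resolve updates and deletes against current state, which only the server holds:

```python
# Conceptual sketch (helper names are hypothetical, not Fluss APIs).

def build_columnar_batch(rows):
    """Client-side (log table): transpose appended rows into column buffers."""
    if not rows:
        return {}
    return {key: [row[key] for row in rows] for key in rows[0]}

def merge_upserts(state, changes):
    """Server-side (primary key table): apply upserts/deletes by key
    before any columnar buffer can be built."""
    for change in changes:
        if change["op"] == "delete":
            state.pop(change["id"], None)
        else:  # upsert
            state[change["id"]] = change["row"]
    return state

# Log table: the client batches immutable appends column by column.
batch = build_columnar_batch([{"id": 1, "v": "a"}, {"id": 2, "v": "b"}])
print(batch)  # → {'id': [1, 2], 'v': ['a', 'b']}

# Primary key table: the server merges the changelog first; an update then
# a delete on the same key leaves no row behind.
state = merge_upserts({}, [
    {"op": "upsert", "id": 1, "row": {"id": 1, "v": "a"}},
    {"op": "upsert", "id": 1, "row": {"id": 1, "v": "a2"}},
    {"op": "delete", "id": 1},
])
print(state)  # → {}
```

Building the buffer where the authoritative state lives is the key design choice: pushing it to the client for log tables saves server CPU, while keeping it server-side for primary key tables is what makes updates and deletes correct.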
Q: Is there a Fluss community for collaboration and participation? A: Yes, Fluss has an active community; interested individuals can join the discussions and contribute via GitHub, or connect with Jark Wu on LinkedIn for more information.

