Melting Icebergs: Enabling Analytical Access to Kafka Data through Iceberg Projections
Roman Kolesnev
Tom Scott

TL;DR

This session addressed the challenge of bridging operational and analytical data estates by creating a logical projection of Kafka data in an Iceberg-compatible format. This approach eliminates the need for traditional ETL processes, allowing direct analytical access to Kafka data. The key benefit is a unified data architecture that maintains operational performance while providing analytical flexibility.

Opening

The challenge of unifying operational and analytical data often feels like navigating a labyrinth—data moves slowly from real-time systems like Kafka to analytical platforms due to complex ETL pipelines. In this session, Roman Kolesnev and Tom Scott from Streambased propose a groundbreaking approach that could streamline this process. Instead of moving data, they create a logical projection of Kafka data into Iceberg-compatible tables, allowing analytical access without the cumbersome data transformation steps.

What You'll Learn (Key Takeaways)

  • Logical Projections over ETL – By creating a logical view of Kafka data compatible with Iceberg, practitioners can avoid moving data across systems, thus maintaining a single source of truth without additional storage costs.
  • Seamless Integration – The approach leverages Kafka's existing ecosystem, including Schema Registry and consumer groups, ensuring that analytical tools can directly access Kafka data without specialized adjustments.
  • Enhanced Performance and Cost Efficiency – By dynamically generating necessary metadata and data files on-demand, the solution reduces the overhead typically associated with ETL processes, leading to improved performance and reduced costs.
  • Indexing for Efficient Queries – Advanced indexing allows more precise data retrieval, significantly optimizing query performance and reducing data read volumes.
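The "logical projection" idea in the takeaways above can be sketched in a few lines. The sketch below is purely illustrative and does not reflect Streambased's actual implementation: it shows how Iceberg-style manifest entries could be generated on demand from Kafka partition offset ranges, so a query engine sees "data files" while no records are ever copied. All names (`OffsetRange`, `ProjectedDataFile`, the `kafka://` virtual path scheme) are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical in-memory projection: map Kafka partition/offset ranges
# to Iceberg-style manifest entries without copying any records.

@dataclass
class OffsetRange:
    topic: str
    partition: int
    start_offset: int
    end_offset: int  # exclusive

@dataclass
class ProjectedDataFile:
    # Iceberg-style manifest entry, generated on demand; the "path" is a
    # virtual locator that a reader would resolve back to Kafka itself.
    path: str
    record_count: int

def project_topic(ranges: list[OffsetRange]) -> list[ProjectedDataFile]:
    """Lazily derive manifest entries from offset ranges (no data movement)."""
    return [
        ProjectedDataFile(
            path=f"kafka://{r.topic}/{r.partition}/{r.start_offset}-{r.end_offset}",
            record_count=r.end_offset - r.start_offset,
        )
        for r in ranges
    ]

ranges = [
    OffsetRange("orders", 0, 0, 1000),
    OffsetRange("orders", 1, 0, 800),
]
manifest = project_topic(ranges)
print(sum(f.record_count for f in manifest))  # prints 1800
```

The point of the sketch is that the metadata layer is cheap to compute and can be regenerated per query, which is what lets the approach skip persistent ETL output entirely.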

Q&A Highlights

Q: How does your approach differ from other Kafka-Iceberg solutions?
A: Unlike other solutions that require data transformation, our approach uses a logical projection, maintaining a single data store while presenting it in an Iceberg-compatible format for analytics.

Q: Can the indexing service be used without Iceberg projections?
A: Yes, the indexing service originated as a separate consumer feature, allowing efficient data retrieval directly from Kafka without requiring Iceberg.
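The kind of index described here can be illustrated with a minimal min/max sketch (an assumption on our part, not Streambased's actual index format): each fixed chunk of a partition records the min and max of an indexed field, and a query skips chunks whose range cannot match its predicate.

```python
from dataclasses import dataclass

# Hypothetical min/max index over offset chunks of a Kafka partition:
# queries skip chunks whose value range cannot satisfy the predicate.

@dataclass
class ChunkIndex:
    start_offset: int
    end_offset: int  # exclusive
    min_value: int   # min of the indexed field within this chunk
    max_value: int   # max of the indexed field within this chunk

def chunks_to_read(index: list[ChunkIndex], lo: int, hi: int) -> list[ChunkIndex]:
    """Return only chunks whose [min_value, max_value] overlaps [lo, hi]."""
    return [c for c in index if c.max_value >= lo and c.min_value <= hi]

index = [
    ChunkIndex(0, 100, min_value=1, max_value=50),
    ChunkIndex(100, 200, min_value=51, max_value=120),
    ChunkIndex(200, 300, min_value=121, max_value=300),
]
# Predicate: indexed field BETWEEN 60 AND 100 -> only the middle chunk qualifies.
hits = chunks_to_read(index, 60, 100)
print(len(hits), hits[0].start_offset)  # prints: 1 100
```

Pruning at the chunk level is what drives the reduced read volumes mentioned in the takeaways: only the matching offset ranges are ever fetched from Kafka.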

Q: Is there any actual disk writing involved when querying data?
A: No, the data retrieval and transformation occur in-memory, avoiding temporary materialized files and reducing disk I/O costs.

Q: Are there any plans for open-source indexing in Kafka?
A: Currently, there is no open-source indexing project for Kafka, but the need is increasingly recognized due to long-term data storage trends.

Q: Does using Avro instead of Parquet affect performance?
A: Avro allows efficient in-memory operations, but it sacrifices some benefits of columnar formats like Parquet; this trade-off is mitigated by Kafka's typically row-oriented access patterns.

Roman Kolesnev
Principal Software Engineer, Streambased

Roman is a Principal Software Engineer at Streambased. His experience includes building business-critical event streaming applications and distributed systems in the financial and technology sectors.

Tom Scott
CEO, Streambased

A long-time enthusiast of Kafka and all things data integration, Tom has more than 10 years of experience (5+ with Kafka) in innovative and efficient ways to store, query, and move data. Tom is pioneering the Streaming Datalake at Streambased, an exciting new approach to raw and historical data management in event streaming infrastructure.
