
TL;DR
This session addressed the challenge of bridging operational and analytical data estates by creating a logical projection of Kafka data in an Iceberg-compatible format. The approach eliminates traditional ETL processes and allows direct analytical access to Kafka data. The key benefit is a unified data architecture that preserves operational performance while providing analytical flexibility.
Opening
The challenge of unifying operational and analytical data often feels like navigating a labyrinth: data moves slowly from real-time systems like Kafka to analytical platforms because of complex ETL pipelines. In this session, Roman Kolesnev and Tom Scott from Streambased propose a different approach that could streamline this process. Instead of moving data, they create a logical projection of Kafka data as Iceberg-compatible tables, giving analytical tools access without the cumbersome transformation steps.
What You'll Learn (Key Takeaways)
- Logical Projections over ETL – By creating a logical view of Kafka data compatible with Iceberg, practitioners can avoid moving data across systems, thus maintaining a single source of truth without additional storage costs.
- Seamless Integration – The approach leverages Kafka's existing ecosystem, including Schema Registry and consumer groups, ensuring that analytical tools can directly access Kafka data without specialized adjustments.
- Enhanced Performance and Cost Efficiency – By generating the necessary metadata and data files on demand, the solution avoids the overhead and duplicate storage typically associated with ETL pipelines.
- Indexing for Efficient Queries – Advanced indexing allows more precise data retrieval, significantly optimizing query performance and reducing data read volumes.
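To make the indexing takeaway concrete, here is a minimal sketch of min/max pruning over Kafka offset ranges. The structures and names below are illustrative assumptions, not Streambased's actual implementation: the idea is that an index records, per offset range, the min and max of an indexed field, so a query with a range predicate only reads the offset ranges that could match.

```python
# Hypothetical sketch of min/max index pruning over Kafka offset ranges.
# SegmentIndex and prune() are illustrative names, not a real Streambased API.
from dataclasses import dataclass

@dataclass
class SegmentIndex:
    start_offset: int
    end_offset: int
    min_value: int  # min of the indexed field within this offset range
    max_value: int  # max of the indexed field within this offset range

def prune(segments, lo, hi):
    """Keep only the offset ranges that could contain values in [lo, hi]."""
    return [s for s in segments
            if s.max_value >= lo and s.min_value <= hi]

index = [
    SegmentIndex(0, 999, min_value=5, max_value=40),
    SegmentIndex(1000, 1999, min_value=41, max_value=90),
    SegmentIndex(2000, 2999, min_value=91, max_value=150),
]

# A query filtering on values between 50 and 80 needs only the middle range,
# so two thirds of the topic data is never read from Kafka.
hits = prune(index, 50, 80)
print([(s.start_offset, s.end_offset) for s in hits])  # [(1000, 1999)]
```

The same pruning logic is what makes the index useful even without Iceberg projections, as the Q&A below notes: a plain consumer can seek directly to the surviving offset ranges.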
Q&A Highlights
Q: How does your approach differ from other Kafka-Iceberg solutions?
A: Unlike other solutions that require data transformation, our approach uses a logical projection, maintaining a single data store while presenting it in an Iceberg-compatible format for analytics.
Q: Can the indexing service be used without Iceberg projections?
A: Yes, the indexing service originated as a separate consumer feature, allowing efficient data retrieval directly from Kafka without requiring Iceberg.
Q: Is there any actual disk writing involved when querying data?
A: No, the data retrieval and transformation occur in-memory, avoiding temporary materialized files and reducing disk I/O costs.
Q: Are there any plans for open-source indexing in Kafka?
A: Currently, there is no open-source indexing project for Kafka, but the need is increasingly recognized due to long-term data storage trends.
Q: Does using Avro instead of Parquet affect performance?
A: While Avro allows for efficient in-memory operations, some benefits of columnar formats like Parquet are sacrificed, though this is mitigated by Kafka's typical access patterns.
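The row-versus-columnar trade-off in that last answer can be sketched with a toy model in plain Python (no Avro or Parquet libraries; real readers are far more sophisticated). The point is that a row layout, like Avro's, must decode whole records even when a query touches one field, while a columnar layout, like Parquet's, can read just that field.

```python
# Toy model of the row-oriented vs columnar trade-off (illustrative only).

# Row-oriented, as in Avro: each record is stored and decoded as a whole.
rows = [
    {"user": "a", "amount": 10},
    {"user": "b", "amount": 25},
    {"user": "c", "amount": 7},
]

# Columnar, as in Parquet: each field is stored contiguously, so a query
# that only needs "amount" never touches the "user" values at all.
columns = {
    "user": ["a", "b", "c"],
    "amount": [10, 25, 7],
}

# Row layout: every record is visited even though one field is needed.
total_from_rows = sum(r["amount"] for r in rows)

# Columnar layout: read a single column directly.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)  # 42 42
```

Since Kafka workloads often read most fields of recent records rather than a single column across deep history, the loss of columnar access matters less in practice, which is the mitigation the speakers refer to.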