Streaming with Iceberg: From Zero to Hero
Yuval Yogev

TL;DR

Streaming data into Iceberg raises problems that batch pipelines rarely hit, chiefly the steady creation of small files and the query slowdown that follows. In this session, Yuval Yogev covered remedies such as optimizing partitioning, choosing a compaction strategy, and tuning configuration for streaming writes. Applied together, these practices lead to robust, cost-effective streaming pipelines that make full use of Iceberg's capabilities.

Opening

Imagine constantly battling a deluge of small files that slow down your queries, inflate your cloud storage costs, and complicate your data strategy. This is a common pain point when streaming data into Iceberg, a powerful open table format that is rapidly gaining popularity. As more organizations integrate real-time streaming into their data platforms, mastering these challenges becomes crucial to maintaining performance and cost-effectiveness.

What You'll Learn (Key Takeaways)

  • Optimizing File Management – Learn how to manage the creation of small files by leveraging data compaction strategies, ensuring efficient query performance and reduced storage costs.
  • Strategic Compaction – Understand when and what to compact, balancing compaction costs with performance needs using a tailored approach based on data usage patterns.
  • Active Storage Monitoring – Implement a more aggressive snapshot expiration strategy to keep active storage costs in check and ensure efficient data management.
  • Efficient Merges – Explore the use of Storage-Partitioned Join (SPJ) in Spark to optimize merge operations by avoiding the costly shuffle stage, keeping latency low in streaming scenarios.
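The "when and what to compact" takeaway can be illustrated with a minimal sketch. It assumes per-partition statistics are already available (for example, collected from table metadata). The `PartitionStats` type, the `reads_last_7d` usage metric, and all thresholds are hypothetical choices for illustration, not values from the session.

```python
from dataclasses import dataclass

MB = 1024 * 1024


@dataclass
class PartitionStats:
    partition: str       # partition identifier, e.g. a date string
    file_count: int      # number of data files in the partition
    total_bytes: int     # total size of those files
    reads_last_7d: int   # hypothetical usage metric: recent query count


def compaction_candidates(stats, small_file_bytes=32 * MB, min_files=10, min_reads=1):
    """Return partitions worth compacting, most fragmented first.

    A partition qualifies when its average file is small, it holds enough
    files for a rewrite to pay off, and it is actually being queried.
    """
    picked = []
    for s in stats:
        if s.file_count == 0:
            continue
        avg_bytes = s.total_bytes / s.file_count
        if (
            avg_bytes < small_file_bytes
            and s.file_count >= min_files
            and s.reads_last_7d >= min_reads
        ):
            picked.append(s)
    # Compact the partitions with the most files first.
    picked.sort(key=lambda s: s.file_count, reverse=True)
    return [s.partition for s in picked]
```

Skipping partitions that nobody reads is one way to balance compaction cost against query benefit; a real policy would likely weigh more signals than this sketch does.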

Q&A Highlights

Q: Is it possible to clean snapshot history to reduce size and cost in storage?
A: Yes, cleaning snapshot history is not only possible but highly recommended. A more aggressive expiration policy, such as retaining snapshots for fewer days or capping the number of snapshots kept, can prevent metadata and storage costs from ballooning.
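
As a rough illustration, both knobs map onto the `older_than` and `retain_last` arguments of Iceberg's `expire_snapshots` Spark procedure. The sketch below only builds the SQL string; the catalog name, table name, and default values are assumptions for illustration, not recommendations from the session.

```python
from datetime import datetime, timedelta


def expire_snapshots_sql(catalog, table, now, max_age_hours=24, retain_last=10):
    """Build a CALL statement for Iceberg's expire_snapshots procedure.

    Snapshots older than the cutoff are expired, but the most recent
    `retain_last` snapshots are always kept.
    """
    cutoff = now - timedelta(hours=max_age_hours)
    ts = cutoff.strftime("%Y-%m-%d %H:%M:%S")
    return (
        f"CALL {catalog}.system.expire_snapshots("
        f"table => '{table}', "
        f"older_than => TIMESTAMP '{ts}', "
        f"retain_last => {retain_last})"
    )


# Hypothetical catalog and table names; pass the result to spark.sql(...).
statement = expire_snapshots_sql("my_catalog", "db.events", datetime(2025, 1, 2, 12, 0, 0))
```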

Q: How does Iceberg fare with real-time streaming use cases?
A: Iceberg is increasingly being adopted for real-time streaming scenarios, offering robust performance with some tuning. The key is in effectively managing challenges like compaction and merges.

Q: What are some strategies for scheduling maintenance jobs to avoid commit conflicts?
A: Coordinating maintenance tasks with your streaming jobs is crucial to avoiding commit conflicts, especially in high-frequency streaming environments. Monitoring commit metrics and setting alerts on them helps you catch conflicts early.
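
One common way to reduce conflicts, not necessarily the one described in the session, is to restrict compaction to partitions the stream has finished writing. The sketch below builds a `rewrite_data_files` call with a `where` filter. The `event_date` partition column, the one-day lag, and the catalog and table names are hypothetical, and the literal format in the filter depends on the column's actual type.

```python
from datetime import date, timedelta


def cold_partition_compaction_sql(catalog, table, today, lag_days=1, date_column="event_date"):
    """Build a rewrite_data_files call limited to partitions older than a lag.

    Leaving the most recent partitions alone keeps compaction away from
    files the streaming job may still be modifying.
    """
    cutoff = today - timedelta(days=lag_days)
    # Assumes a date-typed partition column; adjust the literal otherwise.
    where = f'{date_column} < "{cutoff.isoformat()}"'
    return (
        f"CALL {catalog}.system.rewrite_data_files("
        f"table => '{table}', "
        f"where => '{where}')"
    )


# Hypothetical names; pass the result to spark.sql(...) from the maintenance job.
compaction_call = cold_partition_compaction_sql("my_catalog", "db.events", date(2025, 1, 3))
```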

Yuval Yogev
CTO, Ryft

I started as an algorithms developer, working for 2 years at Mobileye on image processing algorithms for self-driving cars. After that, I worked at Sygnia, building high-scale security analytics products that ingest tens of TB per day. I love building new products and designing large data pipelines, and I am enthusiastic about data. I am currently building a new product focused on the open lakehouse architecture.
