Avro, Arrow, Protobuf, Parquet and Why
Ben Gamble

TL;DR

In the session "Avro, Arrow, Protobuf, Parquet and Why," Ben Gamble addresses the complexities of choosing data formats for streaming, emphasizing the shortcomings of JSON due to its overhead. He highlights Avro, Protobuf, and Parquet as key serialization and storage formats, with Apache Arrow as their in-memory counterpart, and explains their roles in optimizing data processing efficiency. By selecting appropriate formats, practitioners can enhance performance, manage schema evolution, and improve data portability across systems.

Opening

Imagine managing a SaaS platform for same-day deliveries where a sudden influx of JSON requests overwhelms your system, causing data conflicts and service disruptions. This scenario illustrates the pressing need to optimize data serialization for scalability and efficiency. In data streaming, where speed and accuracy are paramount, choosing the right data format can make the difference between seamless operations and catastrophic failures.

What You'll Learn (Key Takeaways)

  • Selecting the Right Data Format – Discover the benefits of using Avro for its dynamic schema capabilities, Protobuf for strong schema guarantees in RPC, and Parquet for efficient long-term data storage.
  • Managing Schema Evolution – Learn how proper schema management can prevent data conflicts and ensure backward and forward compatibility, although it's generally safer to version topics with schema changes (see the Avro sketch after this list).
  • Optimizing Data Processing – Understand how Apache Arrow's columnar in-memory format can enhance computational efficiency, particularly in analytical streaming systems (see the Arrow/Parquet sketch after this list).
  • Real-World Applications – Explore how these formats apply to large-scale streaming systems, from Flink's role in managing petabytes of data to Parquet's use in data lake architectures for long-term storage.
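
To make the schema-evolution point concrete, here is a minimal sketch in Python using the fastavro package (the library choice is an assumption; the session does not prescribe one). A reader schema that adds a field with a default can still decode records written under the older schema, which is the backward-compatibility guarantee the takeaway refers to.

    import io
    import fastavro

    # Writer schema: version 1 of an order event.
    writer_schema = fastavro.parse_schema({
        "type": "record",
        "name": "OrderEvent",
        "fields": [
            {"name": "order_id", "type": "string"},
            {"name": "amount", "type": "double"},
        ],
    })

    # Reader schema: version 2 adds a field with a default, so records
    # written under version 1 remain readable (backward compatibility).
    reader_schema = fastavro.parse_schema({
        "type": "record",
        "name": "OrderEvent",
        "fields": [
            {"name": "order_id", "type": "string"},
            {"name": "amount", "type": "double"},
            {"name": "currency", "type": "string", "default": "USD"},
        ],
    })

    # Encode with the old schema, then decode with the new one.
    buf = io.BytesIO()
    fastavro.writer(buf, writer_schema, [{"order_id": "o-1", "amount": 9.99}])
    buf.seek(0)
    for record in fastavro.reader(buf, reader_schema):
        print(record)  # {'order_id': 'o-1', 'amount': 9.99, 'currency': 'USD'}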
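
The Arrow and Parquet takeaways pair naturally: Arrow is the columnar layout in memory, Parquet the columnar layout at rest. Below is a sketch using pyarrow (again an assumption about tooling; the field names are illustrative).

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Arrow holds data column by column in memory, which is what makes
    # vectorized, analytical computation cheap.
    events = pa.table({
        "order_id": ["o-1", "o-2", "o-3"],
        "amount": [9.99, 24.50, 3.75],
    })

    # Parquet persists the same columnar idea on disk, compressed,
    # which suits long-term storage in a data lake.
    pq.write_table(events, "events.parquet")

    # Reading back only the columns you need skips the rest of the
    # file entirely: the core win of columnar formats.
    amounts = pq.read_table("events.parquet", columns=["amount"])
    print(amounts.column("amount").to_pylist())  # [9.99, 24.5, 3.75]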

Q&A Highlights

Q: How do large language models fit into the streaming data world dominated by binary formats?
A: Large language models excel with text and JSON, but struggle with schemas. MessagePack offers a middle ground: a binary, JSON-like format that is efficient for wire transmission.
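
A small illustration of that middle ground, using the msgpack Python package (the package choice and the sample payload are assumptions for illustration):

    import json
    import msgpack

    event = {"order_id": "o-1", "amount": 9.99, "tags": ["express", "fragile"]}

    packed = msgpack.packb(event)            # compact binary encoding
    as_json = json.dumps(event).encode()     # textual JSON for comparison

    print(len(packed), len(as_json))         # the binary form is typically smaller
    assert msgpack.unpackb(packed) == event  # round-trips without a schema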

Q: Does Parquet work well for high-frequency logging?
A: Parquet can handle high-frequency logging, but its lack of a variant type is a real limitation; that gap is why most logging systems do not use Parquet as their backend (a common workaround is sketched below).
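
To show what the missing variant type means in practice, here is a sketch of a common workaround (a widespread pattern, not one the session prescribes): because a Parquet column must hold a single consistent type, heterogeneous log payloads are serialized into a plain string column.

    import json
    import pyarrow as pa
    import pyarrow.parquet as pq

    logs = [
        {"level": "INFO", "payload": {"user": "alice"}},
        {"level": "ERROR", "payload": {"code": 500, "retries": 3}},
    ]

    # Payloads of different shapes cannot share one strongly typed
    # Parquet column, so the variable part is stored as a JSON string.
    table = pa.table({
        "level": [r["level"] for r in logs],
        "payload_json": [json.dumps(r["payload"]) for r in logs],
    })
    pq.write_table(table, "logs.parquet")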

For further insights, connect with Ben Gamble on LinkedIn, where he shares his extensive knowledge on data streaming and more.

Ben Gamble
Field CTO, Ververica

A long-time builder of AI-powered games, simulations, and collaborative user experiences, Ben has previously built a global logistics company, large-scale online games, and augmented-reality apps. He currently works to make fast data and AI a reality for everyone.
