Data Streaming Trends from Kafka Summit London 2024
I recently attended Kafka Summit London, a major data streaming conference by Confluent. This exciting event brought together a large community of messaging and data-streaming enthusiasts. Our team had valuable discussions with attendees, vendors, and colleagues about the growing importance of these technologies in today's industries.
I am excited to share the key trends we observed at the conference.
These insights confirmed that the design choices behind Apache Pulsar and the ONE StreamNative Platform were sound. They address common challenges faced by the data streaming community, and they have been ahead of the curve for several years.
Kafka as a Protocol
The data streaming platform landscape is becoming more diverse, and established players face growing competition. Aiven and Redpanda are prominent examples, having built a presence over several years. Newer entrants like WarpStream are bringing innovative approaches, which we will discuss further in this blog post.
This diversity offers significant benefits to the data streaming community. We now have a more comprehensive range of choices with distinct value propositions. These solutions address common Kafka challenges with unique approaches.
Notably, some vendors provide implementations that remain compatible with Kafka clients to varying degrees while being independent of the original Apache Kafka codebase. Quoting Chris Riccomini's recent blog post: "isn’t it time to accept the Kafka protocol is what really matters?"
This trend of considering Kafka's protocol independently of its implementation might actually be a more general trend in the world of data infrastructure. Indeed, this is what happened with S3, and it's also happening with Postgres.
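To make the "protocol, not implementation" idea concrete, here is a minimal sketch in plain Python. It assumes a client configured with standard Kafka client keys (`bootstrap.servers`, `client.id`, `acks`). Both hostnames and the `producer_config` and `changed_keys` helpers are hypothetical and exist only for illustration.

```python
# Hypothetical endpoints: neither hostname comes from this post.
APACHE_KAFKA = "kafka.example.internal:9092"
COMPATIBLE_SERVICE = "kafka-compatible.example.com:9093"


def producer_config(bootstrap_servers: str) -> dict:
    """Build a producer configuration from standard Kafka client keys."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "client.id": "orders-service",
        "acks": "all",
    }


def changed_keys(a: dict, b: dict) -> set:
    """Return the configuration keys whose values differ between a and b."""
    return {k for k in a.keys() | b.keys() if a.get(k) != b.get(k)}


# Pointing the same application at a different implementation of the protocol.
diff = changed_keys(
    producer_config(APACHE_KAFKA),
    producer_config(COMPATIBLE_SERVICE),
)
print(diff)
```

In this idealized sketch only the endpoint changes. In practice, authentication settings often differ too, and compatibility holds "to varying degrees", so each vendor's supported feature set should be checked.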
With our ONE StreamNative Platform, we are proud to be part of this evolving ecosystem. We are fully Kafka-compatible while providing Pulsar's unique advantages, such as multi-tenancy, unparalleled elasticity, and tiered storage. StreamNative allows the Kafka community to leverage these Pulsar features, which we believe is a significant benefit.
The emergence of ‘serverless’ object storage for reducing costs
A pervasive topic at Kafka Summit London was moving streaming data out of the cluster to so-called 'serverless' storage systems (a marketing buzzword that, in this context, simply means servers managed by someone else). This approach, primarily facilitated by cloud storage solutions like Amazon S3, offers a compelling blend of cost-efficiency and scalability that's hard to overlook.
It's fascinating to observe that what is considered a novel trend isn't really new. In fact, it's been over a decade since Apache Pulsar introduced an architecture that separates storage from compute. Moreover, Pulsar has natively supported Tiered Storage for over five years.
It’s encouraging to see the market finally recognize the Apache Pulsar approach as superior to the traditional Kafka architecture.
Cost and Scalability
The driving force behind this trend is the cost-effectiveness and scalability of cloud storage services. With virtually unlimited capacity at a low cost, cloud storage platforms like S3 are becoming the go-to solution for businesses looking to manage their data more efficiently.
A prime example of this trend in action is WarpStream's use of a stateless message broker, coupled with the placement of all data within S3. This architectural decision underscores the benefits of scalability and reduced storage costs. However, it's important to note that such a model may not be universally applicable. It's particularly suited to use cases where low latency isn't a critical requirement and dependency on a specific cloud storage provider is an acceptable trade-off.
Tiered Storage
The concept of Tiered Storage is also gaining traction, especially as a means to manage 'cold' data efficiently. Businesses can significantly reduce operational costs by relocating less frequently accessed data to cost-effective storage solutions like S3.
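The idea can be sketched with a toy model. The snippet below assumes a purely age-based policy: segments older than a "hot window" are eligible for offloading to object storage. The `Segment` type, the `split_by_tier` helper, and all sizes and ages are hypothetical. Real systems may trigger offloading differently, for example on size thresholds.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    name: str
    size_gb: float
    age_hours: float


def split_by_tier(segments, hot_window_hours):
    """Separate segments kept on local disks from those eligible for offload."""
    hot = [s for s in segments if s.age_hours <= hot_window_hours]
    cold = [s for s in segments if s.age_hours > hot_window_hours]
    return hot, cold


# Hypothetical topic made of three segments of different ages.
segments = [
    Segment("seg-0", size_gb=10.0, age_hours=72.0),
    Segment("seg-1", size_gb=10.0, age_hours=30.0),
    Segment("seg-2", size_gb=4.0, age_hours=1.0),
]

hot, cold = split_by_tier(segments, hot_window_hours=24.0)
local_gb = sum(s.size_gb for s in hot)      # stays on cluster storage
offloaded_gb = sum(s.size_gb for s in cold)  # moves to the cheaper tier
print(local_gb, offloaded_gb)
```

With these made-up numbers, most of the topic's data leaves the cluster's disks while the recent, frequently read segment stays local.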
This approach has been available in Apache Pulsar and the ONE StreamNative Platform for years, but it is still nascent within the Kafka ecosystem. Tiered Storage for Kafka is implemented in a proprietary, commercial solution, and no other production-ready implementation exists yet. Navigating these competing implementations is not easy for a Kafka user.
Using S3 as a workaround
Scaling Apache Kafka clusters becomes increasingly challenging as data volumes grow. The traditional partition-based storage model hinders elasticity because of partition reassignment: the more data there is to manage, the slower and more resource-intensive reassignments become, at the expense of performance and reliability. Our booth discussions with attendees and several conference talks highlighted these concerns.
One strategy gaining traction involves offloading data to Amazon S3. The idea is to keep as little data as possible on the cluster nodes, or even none at all, in order to circumvent the inherent limitations of Kafka's partition-based storage model. The less data there is in partitions, the less painful partition reassignment becomes.
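A back-of-the-envelope calculation illustrates the reasoning. The sketch below assumes that reassignment time is dominated by copying the partition's locally stored bytes at a fixed replication throughput. The `reassignment_seconds` helper and all figures are hypothetical, not measurements.

```python
def reassignment_seconds(local_bytes: int, throughput_bytes_per_s: int) -> float:
    """Estimate the time needed to copy a partition's local data to another broker."""
    return local_bytes / throughput_bytes_per_s


THROUGHPUT = 100_000_000  # assumed 100 MB/s of spare replication bandwidth

# All data kept on the broker: 2 TB must be copied.
all_local = reassignment_seconds(2_000_000_000_000, THROUGHPUT)

# 90% offloaded to object storage: only 200 GB remains local.
mostly_offloaded = reassignment_seconds(200_000_000_000, THROUGHPUT)

print(all_local, mostly_offloaded)
```

Under these assumptions, the copy time drops from roughly five and a half hours to about half an hour, which is why shrinking the local data set is attractive.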
However, this approach introduces trade-offs, including latency increases and a strong dependency on S3. Migrating data out of the cluster should be a strategic decision, not solely a workaround for the limitations of the core Kafka storage model based on partitions.
No dilemma with Pulsar
In contrast, Apache Pulsar has offered a compelling alternative for over ten years.
Pulsar was designed from the start with a separation of compute and storage. This allows for a level of elasticity well beyond what you can achieve with traditional Kafka. The decisive advantage of Pulsar is that it doesn't force you to choose between latency and elasticity. With Pulsar, there's no dilemma:
- You can benefit from Tiered Storage to reduce the storage costs of cold data. Pulsar’s Tiered Storage has been battle-tested for years and is available as open-source.
- Thanks to an alternative storage model, you can also benefit from elasticity without the need for cloud storage and without sacrificing latency.
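The difference between the two storage models can be illustrated with a toy simulation. It assumes a simplified round-robin placement for the partition model and a "least-loaded node" placement for the segment model. Both policies, the function names, and the node names are hypothetical and are not the actual placement logic of Kafka or Pulsar.

```python
def bytes_moved_partition_model(partition_sizes, old_brokers, new_brokers):
    """Sum the sizes of partitions whose broker changes after a rebalance.

    Assumes partition i lives entirely on broker i % broker_count.
    """
    return sum(
        size
        for idx, size in enumerate(partition_sizes)
        if idx % old_brokers != idx % new_brokers
    )


def place_new_segments(segment_counts, new_segments):
    """Place each new segment on the least-loaded node; existing data stays put."""
    counts = dict(segment_counts)
    placements = []
    for _ in range(new_segments):
        target = min(counts, key=counts.get)
        counts[target] += 1
        placements.append(target)
    return placements, counts


# Partition model: six 100 GB partitions, cluster grows from 3 to 4 brokers.
moved_gb = bytes_moved_partition_model([100] * 6, old_brokers=3, new_brokers=4)

# Segment model: an empty fourth node joins and absorbs newly written segments.
before = {"node-1": 3, "node-2": 3, "node-3": 3, "node-4": 0}
placements, after = place_new_segments(before, new_segments=3)

print(moved_gb, placements)
```

In this sketch, scaling the partition-based cluster copies 300 GB of existing data, whereas the segment-based cluster moves nothing: the new node simply starts receiving new segments.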
For more information, feel free to read the resources shared at the end of this blog post.
Stream & batch processing convergence
Historically, a separation existed between analytics and streaming data. These domains functioned within distinct infrastructures and ecosystems. Analytics relies on querying tables in batches while streaming data flows continuously.
However, there are signs of convergence. The boundaries are blurring, as evidenced by Confluent's recent announcement of Tableflow, which exposes a topic’s data as Iceberg tables.
This recent announcement is particularly interesting. StreamNative has addressed this need for a long time, allowing users to seamlessly integrate streaming data with their data lake platform and leverage the native query capabilities of data warehouses and data lakehouses. While the announcement itself was positive, it also validated the approach we implemented more than a year ago with the introduction of Pulsar’s Lakehouse Tiered Storage.
This industry trend of converging towards solutions bridging the gap between streaming and analytics aligns perfectly with StreamNative's position as a thought leader. It reinforces our belief that we're delivering the functionality users demand: the ability to capture streaming data, analyze it, and make it readily available for data-driven decision-making. Queryable open formats like Iceberg play a crucial role in this.
Given Confluent’s recent acquisition of Immerok, there was an increased focus on Flink at this year's conference. Notably, Flink supports processing in both streaming and batch modes using a unified programming model, further contributing to the dissolving boundaries.
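The idea of a unified programming model can be sketched in plain Python. This is not Flink's API: it only assumes that the same transformation is written once against an iterable, which can be either a bounded batch or an unbounded stream. The `count_by_key` and `endless_events` names are hypothetical.

```python
from itertools import count, islice


def count_by_key(events):
    """Yield (key, running_count) per event, for bounded or unbounded input."""
    counts = {}
    for key in events:
        counts[key] = counts.get(key, 0) + 1
        yield key, counts[key]


def endless_events():
    """Simulate an unbounded stream alternating between two keys."""
    for i in count():
        yield "a" if i % 2 == 0 else "b"


# Batch mode: the input is a finite collection, so the result is complete.
batch_result = list(count_by_key(["a", "b", "a"]))

# Streaming mode: same logic, but results are consumed incrementally.
stream_result = list(islice(count_by_key(endless_events()), 4))

print(batch_result)
print(stream_result)
```

The processing logic is identical in both cases. Only the nature of the source, finite or endless, changes.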
The prominence of Flink at the Kafka Summit, with approximately thirty dedicated presentations, underscores its growing importance. Additionally, Confluent announced the general availability of their managed Flink offering during their keynote.
However, several alternatives, such as RisingWave and Timeplus, exhibit significant potential to capture substantial market share.
Another popular option is Databricks' Lakehouse platform. This mature platform provides seamless data streaming and analytics integration by combining Apache Spark Structured Streaming for stream processing and Delta Lake for storage. This platform ensures that streaming data is immediately ready for analytics.
Conclusion
Kafka Summit London provided valuable insights, particularly into how Apache Pulsar, the technology powering ONE StreamNative, addresses the challenges highlighted in presentations, attendee discussions, and emerging trends. This underscores the advanced capabilities and continued relevance of ONE StreamNative in the streaming data landscape.
Want to Learn More About Apache Pulsar and StreamNative?
- Use our ONE StreamNative Platform to spin up a Kafka-compatible Pulsar cluster in minutes. Get started today with a $200 credit.
- Deep dive into the partition vs segment models: Data Streaming Patterns: What You Didn't Know About Partitioning
- Explore the hurdles encountered in managing data retention in traditional Kafka, and the comparative benefits Pulsar & our Kafka-compatible platform provide: Challenges in Kafka: the Data Retention Stories of Kevin and Patricia
- Engage with the Pulsar community by joining the Pulsar Slack channel.