Oct 2, 2023

Emerging Trends in Data Streaming: Insights from Current 2023

Sijie Guo
CEO and Co-Founder, StreamNative, Apache Pulsar PMC Member

Last week, I had the pleasure of attending Current 2023, a data streaming conference hosted by Confluent that brought together thousands of messaging and data streaming practitioners. The conference gave our team the chance to hold in-depth discussions with many participants, vendors, and peers, and it made clear how much the impact of messaging and data streaming is growing across the industry. I am thrilled to share some of the insights and observations I gathered there. They confirmed that our design choices for Apache Pulsar and StreamNative Cloud have been pioneering in several areas for years, including queue semantics, cost-efficient multi-tenancy, integration with Flink, and the BYOC model.

Message queuing for Kafka users

Kafka has established its reputation as an event streaming system, designed primarily for moving data between pipelines and services. Confluent’s positioning of Kafka as a messaging system has gained wide acceptance in the community and industry, albeit with some missing pieces. Kafka lacks several queuing features found in conventional message queuing systems, such as scheduled messages, delayed messages, Time-To-Live (TTL), dead-letter queues, and individual acknowledgment.

This limitation of the Kafka API and its semantics has led adopters with extensive engineering resources, such as tech unicorns, to layer additional queuing semantics atop Kafka. Others, feeling the constraints, turn to alternatives like Apache Pulsar. Apache Pulsar has positioned itself as a unified data streaming platform with a flexible messaging model that supports both event streaming and message queuing, which appeals to a broad user base seeking the scalability of Kafka together with advanced message queue semantics.

To address these gaps, Confluent is working to bring queue semantics to Kafka through KIP-932, aiming for a unified data streaming platform that serves both event streaming and message queuing needs, something Pulsar has offered for years.
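
To make one of these gaps concrete, here is a minimal sketch of why individual acknowledgment matters. It is a pure in-memory model, not Kafka or Pulsar client code, and the function names and the five-message scenario are hypothetical. It assumes an at-least-once consumer: under an offset-commit model the consumer can only safely commit up to the first unprocessed message, while under an individual-ack model each message is acknowledged on its own.

```python
def redelivered_with_offset_commit(processed, total):
    """Offset-commit model (assumed at-least-once): only the contiguous
    prefix of processed messages can be committed without losing data,
    so everything from the first gap onward is delivered again."""
    committed = 0
    while committed in processed:
        committed += 1
    return list(range(committed, total))


def redelivered_with_individual_ack(processed, total):
    """Individual-ack model: each message is acknowledged separately,
    so only the unacknowledged messages are delivered again."""
    return [i for i in range(total) if i not in processed]


# Hypothetical scenario: five messages (0..4); message 2 failed processing.
processed = {0, 1, 3, 4}
print(redelivered_with_offset_commit(processed, 5))   # [2, 3, 4]
print(redelivered_with_individual_ack(processed, 5))  # [2]
```

In this toy scenario, one slow or failed message forces the offset-based consumer to reprocess everything behind it, which is the kind of behavior teams work around when they layer queuing semantics on top of a log.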

Cost efficiency is what everyone cares about

Given the current economic recession, it's imperative for companies globally to prioritize cost reduction. During the keynote at Current 2023, Confluent unveiled Kora, its cloud-native Kafka engine. This introduction promises a potential cost reduction of up to 40% for Confluent customers.

Similarly, Redpanda is another vendor focused on cost reduction, claiming a sixfold lower cloud spend compared to conventional Kafka offerings. The ongoing battle over cost savings between Redpanda and Confluent is evident. Notably, Redpanda has been running a challenge that aims to cut Confluent customers’ bills by 50%. However, Redpanda’s cost argument rests mainly on the performance of a single cluster. Jack Vanlightly’s post provides an extensive analysis comparing the performance of Kafka and Redpanda and disputes Redpanda’s claims.

It's crucial to acknowledge that although each of these systems may offer distinct benefits in particular scenarios, the performance of any streaming data system is bounded by the network and disk bandwidth of the underlying resources. Real cost reductions therefore come from making better use of network and disk. In practice, the largest share of the cost stems from the lack of multi-tenancy. Without it, most organizations end up running a separate Kafka cluster for each team and overprovisioning each one to accommodate projected growth. Conversations with Kafka users managing dozens of Kafka clusters revealed that approximately 70-80% of those clusters are underutilized.

In contrast to the approaches of Kafka (Confluent) and Redpanda, StreamNative’s answer to this problem is the native multi-tenancy built into the core of Apache Pulsar. It is therefore available in every Pulsar deployment, whether that deployment runs open-source Pulsar or StreamNative products.

Multi-tenancy is emerging as the next significant trend in streaming data systems for achieving cost efficiency. I expect that, to catch up with Pulsar and StreamNative, Confluent will inevitably add this capability to its future product offerings.
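
The cost argument in this section can be sketched as back-of-the-envelope arithmetic. Everything in the example is an assumption made for illustration, not a measurement: the per-broker capacity, the growth headroom, the three-broker minimum per cluster, and the team workloads are all hypothetical, and the helper names are my own.

```python
import math


def brokers_needed(load_mbps, per_broker_mbps, headroom, min_brokers):
    """Brokers required for a load, with growth headroom and a minimum
    cluster size (all parameters are illustrative assumptions)."""
    return max(min_brokers, math.ceil(load_mbps * headroom / per_broker_mbps))


def dedicated_vs_shared(team_loads_mbps, per_broker_mbps=100.0,
                        headroom=1.5, min_brokers=3):
    """Compare one cluster per team against a single multi-tenant cluster."""
    dedicated = sum(
        brokers_needed(load, per_broker_mbps, headroom, min_brokers)
        for load in team_loads_mbps
    )
    shared = brokers_needed(
        sum(team_loads_mbps), per_broker_mbps, headroom, min_brokers
    )
    return dedicated, shared


# Hypothetical peak loads (MB/s) for five teams.
dedicated, shared = dedicated_vs_shared([20, 35, 10, 60, 25])
print(dedicated, shared)  # 15 3
```

Under these made-up numbers, every team's dedicated cluster sits at the minimum size and is mostly idle, while one shared cluster absorbs the combined load. Real savings depend on workload shape, isolation requirements, and replication settings, none of which this sketch models.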

Bring Your Own Cloud (BYOC) is the path to untangle data privacy and data sovereignty in the cloud

While Jack Vanlightly's recent post, "The Future Of Cloud Services And BYOC", made for an engaging read, it notably leans towards a preference for Confluent Cloud. However, conversations with numerous vendors at the Current conference revealed a considerable gap in perspective. A majority of vendors, including Decodable, Ververica, DeltaStream, RisingWave, and others, expressed unequivocal support for the Bring Your Own Cloud (BYOC) deployment model.

Currently, the prevalent offerings to manage data streaming platforms are self-hosted and vendor-hosted (SaaS). Both have distinct advantages and disadvantages. 

Self-hosted solutions offer unmatched control over data, which makes them particularly appealing to organizations that emphasize data privacy, security, and sovereignty. However, they require significant upfront investment in infrastructure and staff.

In contrast, SaaS solutions take care of setup, monitoring, maintenance, and scaling, but can raise concerns about transparency, access control, and data residency, potentially resulting in trust issues.

Vendors championing BYOC assert that it combines the advantages of both self-hosted and SaaS solutions. It enables companies to set up their clusters within their own Virtual Private Cloud (VPC), keeping data within their environment while outsourcing operations and maintenance. This approach not only supports data privacy and compliance but also lets the clusters scale on the organization's own infrastructure, in line with data sovereignty requirements.

Furthermore, BYOC allows organizations to capitalize on the substantial discounts that cloud providers offer in return for long-term spending commitments. In the prevailing economic recession, BYOC stands out as a beneficial approach because it lets organizations make full use of their existing cloud commitments.

Although the appeal of a fully SaaS model is undeniable, in practice BYOC offers a managed cloud model that preserves data sovereignty. Jack’s contention in his blog post is that BYOC falls short in delivering operational efficiency to customers, a claim that is only partly true. It does hold for many systems, including Kafka, primarily because the lack of multi-tenancy forces vendors to deploy multiple Kafka clusters into a customer’s VPC. However, this is not the case for Apache Pulsar. Given its native multi-tenancy support, Pulsar achieves operational efficiency even when deployed via BYOC.

The unfolding debate between Confluent and other BYOC vendors is indeed riveting. At StreamNative, we are steadfast in our belief that BYOC is the path forward for data privacy and sovereignty. It enables the provision of operational efficiency through native multi-tenancy and lays down robust foundations for ensuring data privacy and sovereignty.

The rise of a Data Streaming Platform; Flink is the de facto standard for stream processing

I am uncertain whether the subtle shift in how Confluent describes its platform (from event streaming, to data in motion, and now to a data streaming platform) has caught widespread attention. This shift occurred after Confluent’s acquisition of Immerok. It arguably signals the limitations of Kafka Streams and ksqlDB, since supporting two disparate processing technologies within a single data streaming platform seems counterintuitive.

While Confluent is not the pioneer of Apache Flink, it has played a significant educational role in promoting the technology, inadvertently helping other Flink vendors by bringing Flink into mainstream discussions. Ververica has kept its momentum after the Immerok spinoff. The Ververica team's enthusiasm was palpable in our conversations, making the upcoming Flink Forward Seattle 2023 in November a highly anticipated event. Beyond Confluent and Ververica, Decodable is abstracting away Flink's lower-level details to offer users simplified stream processing, and DeltaStream is introducing a serverless streaming SQL platform powered by Apache Flink.

Apache Flink, a major project in the big-data ecosystem, has drawn criticism for its usability and cost-efficiency. Both Ververica and Confluent are addressing these challenges by providing fully managed Flink and Flink SQL services. However, newer entrants like RisingWave and Timeplus show considerable potential to win a larger share of the market.

Moreover, streaming SQL continues to generate discussion among vendors of stream processing products. We at StreamNative are slated to moderate a panel discussion, “Streaming SQL: Databases Meet Stream Processing”, with these vendors at the forthcoming Pulsar Summit North America 2023 on Wednesday, October 25, in San Francisco. For those interested in industry trends around streaming SQL, the summit is a valuable opportunity to engage with the creators and vendors shaping it.

Summary

Current 2023 showcased intriguing trends in data streaming and pointed to a future of multi-tenant data streaming platforms. These platforms are poised to support both event streaming and message queuing, connect microservices with data pipelines and services, and offer SQL and stream processing capabilities. The event was highly enlightening.

For those who have a keen interest in delving deeper into data streaming trends, I extend an invitation to attend the Pulsar Summit North America 2023 on October 25, 2023. Register now to continue exploring the exciting realm of data streaming in San Francisco!

Sijie Guo
Sijie’s journey with Apache Pulsar began at Yahoo! where he was part of the team working to develop a global messaging platform for the company. He then went to Twitter, where he led the messaging infrastructure group and co-created DistributedLog and Twitter EventBus. In 2017, he co-founded Streamlio, which was acquired by Splunk, and in 2019 he founded StreamNative. He is one of the original creators of Apache Pulsar and Apache BookKeeper, and remains VP of Apache BookKeeper and PMC Member of Apache Pulsar. Sijie lives in the San Francisco Bay Area of California.
