SQL has been a fundamental tool for software engineers in building applications for several decades. However, the world has undergone significant transformations since SQL's inception in 1974. In this article, we will explore how SQL is adapting to the changing landscape of data usage in application development. We'll delve into the history of SQL, its traditional role in data systems, and its evolution to meet the demands of streaming data. Additionally, we'll take a closer look at various technologies and vendors that are shaping the field of Streaming SQL.
The Role of SQL Today
Over fifty years ago, Edgar Codd introduced the relational model of data; soon after, Donald Chamberlin and Raymond Boyce designed SQL as a query language for it. At that time, our world and technological landscape were vastly different from what we experience today. Relational databases were designed to work with a single shared data set, enabling efficient query processing and computations to enhance productivity. SQL played a crucial role in simplifying data manipulation and storage, revolutionizing tasks such as inventory management and financial accounting.
However, our present-day reality is characterized by constant online activity and the incessant generation and consumption of data. Data never rests; it flows from many sources, driving computations and business logic in our applications. Unlike in the past, we no longer work with a single, shared data set; instead, we interact with data streams that originate from diverse locations. In this new era, our infrastructures must be able to process these data streams in real time. The traditional relational database and SQL, however, were not designed to meet these demands.
To address the challenges posed by streaming data, technologies like Apache Pulsar and Apache Kafka emerged, enabling the creation, collection, storage, and processing of streaming and messaging data. While these advancements have significantly improved the field of stream processing, the developer experience for working with streaming data is still a far cry from the simplicity and familiarity of writing declarative SQL statements in a traditional relational database.
Introducing Streaming SQL
One of the primary obstacles faced by companies adopting stream processing technologies is the steep learning curve associated with stream processing systems. Unlike conventional databases like MySQL and PostgreSQL, which provide SQL as the interactive interface, most streaming systems require users to learn platform-specific programming interfaces, often in Java, to manipulate streaming data. This learning process can be daunting, especially for non-technical individuals. Additionally, stream processing systems represent data in a different manner than databases, necessitating the creation of complex data extraction logic to facilitate data transit between streaming systems and databases.
Given the evolving landscape of data streaming and the need for user-friendly solutions, the concept of "Streaming SQL" has emerged. Streaming SQL aims to provide new language abstractions and query semantics that can handle both streaming and static data, simplifying the process of solving complex use cases. By leveraging the familiar declarative nature of SQL, Streaming SQL allows users to focus on what they want to achieve, while the underlying stream processing engine handles the intricacies of execution.
When using Streaming SQL, several key distinctions become apparent. Traditional SQL queries on a database return static results from a specific point in time. In contrast, Streaming SQL queries operate on data streams, rendering point-in-time answers less relevant. Instead, continuous queries that update themselves, often referred to as materialized views, become more valuable in the streaming context. Each Streaming SQL vendor has its own approach to achieving materialized views.
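To make the contrast concrete, here is a sketch of a continuously maintained materialized view in a Postgres-flavored streaming dialect. Exact syntax varies by vendor, and the `page_views` source and its columns are hypothetical:

```sql
-- A traditional query such as
--   SELECT page, COUNT(*) FROM page_views GROUP BY page;
-- returns a one-off, point-in-time answer.

-- A streaming materialized view keeps the same result
-- continuously up to date as new events arrive:
CREATE MATERIALIZED VIEW page_view_counts AS
SELECT page, COUNT(*) AS views
FROM page_views
GROUP BY page;

-- Reading the view at any moment returns the latest maintained result:
SELECT * FROM page_view_counts WHERE page = '/pricing';
```

Rather than rescanning the data for every request, the engine updates the view incrementally as each event flows in.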
Similarly, the concept of response time in traditional databases differs from the notion of lag in streaming SQL systems. While traditional databases focus on query response times, streaming SQL systems introduce the concept of time lag, which represents the delay between input events and the corresponding output results. Understanding the existence of time lag helps users write and utilize streaming SQL in ways that avoid potential issues.
Another distinction lies in the work creation process. Traditional databases remain idle until a query is received, whereas streaming SQL systems generate work based on incoming data from the stream. Different vendors employ various strategies to handle this work creation process.
The Benefits and Challenges
Streaming SQL is particularly well suited to use cases that involve repetitive queries, such as dashboards, reports, and automation. However, its introduction also presents challenges. The official SQL standard lacks support for Streaming SQL functionality, so vendors have adopted their own syntax or extended existing SQL dialects such as Postgres's. As a result, users face the challenge of choosing the right Streaming SQL system among the diverse offerings on the market.
To help users navigate these challenges, we categorize Streaming SQL vendors into three groups.
- Stream Processors: Apache Spark and Apache Flink (Flink vendors include but are not limited to Ververica, Confluent, Decodable, and DeltaStream)
- Stream Storage Systems: Apache Pulsar and Apache Kafka
- New Vendors Building Streaming SQL Solutions: RisingWave, Timeplus, etc.
Apache Spark and Apache Flink are the most popular data processing engines that support both batch and stream processing. While Apache Flink is considered the de facto standard for stream processing, Apache Spark is widely used for batch processing. Both systems offer SQL layers on top of their data processing engines, simplifying the writing of data processing jobs. Ververica, the company founded by the original creators of Flink, pioneered the commercialization of Apache Flink. Confluent and Decodable also provide product offerings based on Apache Flink that ease the operational burden of managing it. DeltaStream, founded by the creator of KSQL, aims to provide a powerful solution powered by Apache Flink, offering both streaming analytics and a streaming database in one comprehensive package.
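To give a flavor of these SQL layers, a tumbling-window aggregation in Flink SQL might look like the following. The `orders` table and its columns are hypothetical, and the windowing syntax shown should be checked against the Flink documentation for the version in use:

```sql
-- Count orders per one-minute tumbling window over the event-time
-- column `order_time`, using Flink's TUMBLE table-valued function.
SELECT
    window_start,
    window_end,
    COUNT(*) AS order_count
FROM TABLE(
    TUMBLE(TABLE orders, DESCRIPTOR(order_time), INTERVAL '1' MINUTE)
)
GROUP BY window_start, window_end;
```

The declarative query runs identically over a bounded table in batch mode or an unbounded stream, which is precisely the appeal of putting SQL on top of these engines.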
Apache Kafka and Apache Pulsar are two highly popular streaming data storage systems. Confluent, the company behind Kafka, introduced Kafka Streams and KSQL years ago, providing users with tools for data processing and streaming SQL within Kafka. This allowed Confluent to compete with the Flink ecosystem for data processing. However, after acquiring Immerok earlier this year, Confluent decided to adopt Apache Flink as its data processing engine. On the other hand, the founders of StreamNative took a different approach. Instead of introducing a separate data processing engine, they built Pulsar Functions, a lightweight serverless event processing framework with an emphasis on simplicity and seamless integration with Pulsar. Pulsar Functions address a significant portion of common stream processing use cases, helping users avoid a steep learning curve and heavy maintenance overhead. StreamNative additionally introduced a SQL extension for Pulsar Functions called "pfSQL" during Pulsar Summit San Francisco 2022, enabling Pulsar users to write SQL-like declarative statements for event processing.
While it is common to see stream processors and stream storage systems incorporating SQL to simplify stream processing, new players in the data streaming space completely abstract the data processing layer from end users. Instead, they introduce streaming SQL directly as the user interface for interaction.
RisingWave, for example, is an open-source distributed SQL streaming database designed for the cloud. It is built from scratch in Rust and integrates seamlessly with the PostgreSQL ecosystem.
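Because this category of system speaks plain SQL over the Postgres wire protocol, working with it can feel like ordinary PostgreSQL. A sketch of the typical workflow (the connector options, topic, and column names here are illustrative, not exact product syntax):

```sql
-- Ingest a Kafka topic as a streaming source (options are illustrative).
CREATE SOURCE orders (
    order_id INT,
    amount   DECIMAL,
    ts       TIMESTAMP
) WITH (
    connector = 'kafka',
    topic = 'orders',
    properties.bootstrap.server = 'localhost:9092'
) FORMAT PLAIN ENCODE JSON;

-- Maintain a continuously updated aggregate over the stream.
CREATE MATERIALIZED VIEW revenue AS
SELECT SUM(amount) AS total FROM orders;

-- Any Postgres-compatible client (psql, drivers, BI tools) can read it:
SELECT total FROM revenue;
```

The stream processing engine itself never surfaces to the user; SQL is the entire interface.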
Timeplus takes a different approach by offering a unified platform for streaming and historical OLAP. They have recently open-sourced their core stream processing engine, Proton.
Choosing the Right Solution
With the growing prominence of Streaming SQL, selecting the appropriate streaming SQL product can be a daunting task. To address this challenge, StreamNative is organizing a keynote panel discussion, “Streaming SQL: Databases Meet Streaming”, during Pulsar Summit North America 2023 on Wednesday, October 25, in San Francisco. The event will bring together core technologists from leading vendors such as Databricks, DeltaStream, RisingWave, Timeplus, and StreamNative to discuss Streaming SQL and explore the future of data streaming. This conference offers an excellent opportunity for data streaming community users to connect with like-minded enthusiasts and engage in insightful discussions about the future of data streaming.
Streaming SQL has emerged as a game-changer in the data streaming landscape, enabling users to simplify their stream processing tasks. However, the lack of standardized support poses challenges for users seeking a unified solution. By understanding the different categories of vendors and their offerings, users can make informed decisions when selecting a Streaming SQL solution. Events like Pulsar Summit North America 2023 provide a platform for industry experts to share their insights and collectively shape the future of data streaming.