

May 31, 2024

5 min read

Streaming Data into the Future of Generative AI

Sijie Guo

Co-Founder and CEO, StreamNative

Generative AI is revolutionizing the tech landscape, offering businesses unprecedented capabilities like hyper-personalization, data monetization, and enhanced customer interactions. The backbone of generative AI is its reliance on large language models (LLMs), which are trained on vast datasets to create outputs that reflect learned patterns.

The potential is transformative: McKinsey estimates that generative AI could contribute between $2.6 and $4.4 trillion to the global economy annually. To realize this potential, however, we must pair LLMs with the right real-time data, because integrating generative AI into a business requires domain-specific, real-time data. This is particularly crucial in fields such as customer service, where the relevance and timeliness of the data can significantly impact the quality of service delivered. For example, an airline customer service agent using a generative AI tool needs current, specific information about flight statuses and company policies to provide accurate assistance. Why is that?

LLMs trained on general data require domain-specific, real-time inputs.

Large language models (LLMs) are trained on extensive public datasets, but to address specific queries such as "Is my flight delayed?" or "Can I upgrade to first class?", they require domain-specific, real-time data. The answers to such questions depend on personal details about the traveler, the airline, and the flight timing. An LLM cannot answer these questions on its own, because it is trained on public, historical data and cannot access private, real-time data.

This limitation cannot be overcome merely by enhancing OpenAI's capabilities or by integrating ChatGPT with search engines like Bing, which access only publicly available information. Instead, the airline must securely integrate its internal data sources with the LLM to provide accurate, real-time responses to customer inquiries. This approach diverges significantly from traditional machine learning infrastructure.

In traditional machine learning setups, most data engineering tasks are performed during model training, using a specific dataset to optimize the model through feature engineering. Once trained, the model is generally static and tailored to a particular task. In contrast, LLMs are trained on massive general datasets with deep learning algorithms, which yields a broad and reusable model. This shift means that LLMs, such as those offered by OpenAI and Google, are adapted to a problem through the prompts they receive at request time rather than through one-time, problem-specific training. Consequently, data engineering must handle real-time data streams so that each prompt carries accurate, current information.

The shift to data streaming is crucial as companies adapt LLMs to their specific domains through prompt engineering and fine-tuning techniques. Prompt engineering involves crafting textual inputs that effectively communicate with LLMs, anchoring the AI in a domain-specific context to enhance accuracy and narrow the scope of semantic interpretation. Alternatively, fine-tuning adjusts pre-trained models with targeted datasets, aligning them more closely with specific business needs. However, fine-tuning can sometimes overwrite previous knowledge, potentially degrading the model’s performance on tasks it was originally trained on.

Effective use of LLMs in domains such as customer service requires adapting the model to handle industry-specific queries and ensuring that the model can access and utilize real-time data. For example, an AI assistant designed to manage flight delays must be informed about the current specifics of the flight in question, not just the general data the model was trained on. This necessitates a system where data flows in real time to the LLM at the moment of request, enabling truly intelligent, automated responses. This real-time data integration unlocks AI's full potential in domain-specific applications.
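To make the idea of data reaching the LLM "at the moment of request" concrete, here is a minimal sketch in Python. It assumes a stream of flight status events is folded into a latest-value view, and that the prompt is assembled from that view when the customer asks. The names `FlightStatus`, `FlightStatusView`, and `build_prompt`, the flight number, and the prompt wording are all hypothetical illustrations, not part of any StreamNative or LLM vendor API.

```python
from dataclasses import dataclass


@dataclass
class FlightStatus:
    flight: str
    status: str
    departure: str  # local departure time, e.g. "18:45"


class FlightStatusView:
    """Keeps only the latest event per flight, as a stream consumer might."""

    def __init__(self):
        self._latest = {}

    def apply(self, event: FlightStatus) -> None:
        # Later events overwrite earlier ones for the same flight.
        self._latest[event.flight] = event

    def get(self, flight: str):
        return self._latest.get(flight)


def build_prompt(question: str, flight: str, view: FlightStatusView) -> str:
    """Assemble the prompt at request time from the freshest known status."""
    status = view.get(flight)
    if status is None:
        context = f"No live status is available for flight {flight}."
    else:
        context = (
            f"Flight {status.flight} is currently {status.status}; "
            f"departure is now {status.departure}."
        )
    return (
        f"Context: {context}\n"
        f"Customer question: {question}\n"
        "Answer using only the context above."
    )


# Hypothetical events arriving on a stream, oldest first.
view = FlightStatusView()
view.apply(FlightStatus("SN123", "on time", "17:30"))
view.apply(FlightStatus("SN123", "delayed", "18:45"))

prompt = build_prompt("Is my flight delayed?", "SN123", view)
print(prompt)
```

In a real deployment the view would be fed by a consumer on the data streaming platform and the prompt would be sent to an LLM; both steps are left out here to keep the sketch self-contained.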

We live in a world that needs data streaming more than ever.

Data Streaming: Enabling Real-Time Generative AI Applications

If you liken LLMs to rockets, then data streaming is the fuel that powers them. Without the real-time, business-specific, highly contextual knowledge provided by data streams, no LLM can function effectively.

Generative AI is transforming how we approach data engineering, business operations, and interactions with data. Data streaming catalyzes this change by enabling real-time generative AI applications not constrained by where the data lives. It liberates data from various silos, making it readily available and accessible for generative AI applications.

At StreamNative, we developed the ONE StreamNative Platform—a data streaming platform designed to ensure that the right data is available at the right place and time by routing relevant data streams anywhere they’re needed in the business, all in real-time.

Two weeks ago at the Pulsar Summit, we were excited to introduce the Ursa engine during our keynote presentation. As Matteo highlighted, Ursa is vital to our grand data streaming vision. We believe that there are four core pillars of a data streaming platform essential for helping enterprises achieve real success with real-time data streams:

Stream: This foundational layer stores data streams and supplies real-time data feeds to other applications or services.
Connect: This feature enables the integration of segregated data sources with a data streaming platform, facilitating the flow of domain-specific and real-time knowledge into your business operations.
Secure: Our platform is designed to ensure that data stream access is secure and trustworthy. Robust governance ensures you know the data’s origin and lineage, creating a reliable data stream that teams can trust and access securely.
Everywhere: In today’s complex and hybrid environments, a data streaming platform must be versatile enough to operate anywhere to effectively deliver the right data to the right place at the right time.

Stream: Deliver Fresh Data as Streams

The foundation of a data streaming platform is a storage layer that persists data streams and serves the same data as real-time feeds to other applications and services. Ursa is the core data streaming engine that fulfills our technology vision of enabling data sharing across different teams, departments, and organizations. The Ursa engine provides the following major capabilities:

Kafka API Compatible: Ursa is Kafka API compatible, allowing you to continue using the Kafka API to build your streaming applications without needing to rewrite them. Additionally, Ursa is a multi-protocol engine that supports Pulsar and MQTT. This flexibility lets you choose the protocol that best meets your business needs, enabling you to utilize the Kafka ecosystem immediately and focus on building your generative AI applications.

Built on Top of Lakehouse: Ursa maximizes data sharing by storing data streams in lakehouse table formats. Because these are open lakehouse formats, you don't need to build bespoke integrations to bring data streams into data lakes, which keeps the data used to train your models fresh.

Designed for a Hybrid World: The Ursa engine is not merely designed for on-premises or solely for cloud environments. It adheres to architectural principles suited for hybrid settings, offering latency-optimized and cost-optimized data streams for various workloads and environments. This flexibility allows you to balance trade-offs between latency (performance), availability, and cost.

Overall, the Ursa engine offers a cost-effective solution to provide fresh data as streams for your business, allowing you to allocate saved capital toward advancing your generative AI journey.

Connect: Bring Domain-Specific and Real-Time Data to Your Business

While Ursa supports multiple protocols, allowing users to choose how they write their streaming applications, not every piece of software is already designed with data streaming in mind. Some data generated by legacy software or other methods remains crucial for powering your generative AI applications. Connectors, including those specific to domain knowledge built using tools like Pulsar Functions, are vital for linking domain-specific data from various silos to a data streaming platform. These connectors make domain-specific knowledge readily available and easily accessible to generative AI-enabled applications.

Kafka Connect and Pulsar I/O are two common frameworks used to facilitate the integration of data from disparate silos into a data streaming platform. Traditionally, StreamNative has supported only Pulsar I/O connectors. However, as announced at the Pulsar Summit, we are enhancing the Function Mesh framework to create a unified connector framework that can accommodate both Kafka Connect and Pulsar I/O connectors. This development means you no longer need to consider whether a connector is specifically for Pulsar or Kafka. The unified connectors are designed to efficiently transport data into and out of a data streaming platform, delivering domain-specific, real-time data to your generative AI applications.

WASM is another significant innovation in our connector space. It enables users to write transformation logic in any programming language of their choice.

Secure: Ensure Data is Secured and Trusted

While the "Stream" pillar provides the engine for storing data streams effectively, and "Connect" facilitates the integration of different systems with data streams to deliver domain-specific, real-time data to your business, the "Secure" aspect focuses on ensuring that access to your data is both secure and trustworthy. Data streaming platforms enforce robust governance measures so you know the origin and lineage of your data. With this knowledge, you have a reliable data stream that teams can confidently trust and access securely.

Features such as multi-tenancy and role-based access control are foundational to guaranteeing data security and trust. These features help manage and safeguard access, ensuring that only authorized personnel have the right level of interaction with sensitive information.
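As a rough illustration of how multi-tenancy and role-based access control combine, the sketch below scopes every grant to a tenant, so a role held in one tenant confers nothing in another. This is a simplified model written for this post's airline example; the class names, tenant names, roles, and actions are hypothetical and do not reflect the actual authorization API of StreamNative or Apache Pulsar.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Topic:
    tenant: str
    name: str


@dataclass
class AccessPolicy:
    # Maps (tenant, role) to the set of actions granted on that tenant's topics.
    grants: dict = field(default_factory=dict)

    def grant(self, tenant, role, *actions):
        self.grants.setdefault((tenant, role), set()).update(actions)

    def is_allowed(self, roles_by_tenant, topic, action):
        # A principal only has the roles assigned within the topic's tenant.
        roles = roles_by_tenant.get(topic.tenant, set())
        return any(
            action in self.grants.get((topic.tenant, role), set())
            for role in roles
        )


policy = AccessPolicy()
policy.grant("airline-ops", "agent-assistant", "consume")
policy.grant("airline-ops", "flight-tracker", "produce", "consume")

# The AI assistant holds one role, and only in the airline-ops tenant.
assistant = {"airline-ops": {"agent-assistant"}}
flights = Topic("airline-ops", "flight-status")
payroll = Topic("airline-hr", "payroll")

print(policy.is_allowed(assistant, flights, "consume"))  # True
print(policy.is_allowed(assistant, flights, "produce"))  # False
print(policy.is_allowed(assistant, payroll, "consume"))  # False
```

The design choice worth noting is that the tenant is part of every lookup key, so there is no way to express a grant that accidentally crosses tenant boundaries.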

Everywhere: Deploy Anywhere to Handle the Complex and Hybrid World

This brings us to the final pillar of data streaming: the need to manage data that may be generated and stored in diverse locations, including different regions, data centers, or cloud providers across the globe. To ensure the right data is delivered to the right places at the right time, data streaming platforms need the capability to be deployed anywhere the business requires.

The ONE StreamNative platform was purposefully designed based on Kubernetes, allowing it to be deployed anywhere Kubernetes can run. In addition to these cloud-native capabilities, StreamNative offers various deployment options, ranging from SaaS to BYOC (Bring Your Own Cloud) to Private Cloud licenses. This flexibility lets you choose the best deployment option for your business needs.

Beyond our existing cloud offerings, we have recently expanded our services to include Azure and will soon introduce self-service BYOC capabilities in our UI.

All Four Pillars Together

By integrating all four pillars, data streaming platforms provide the infrastructure needed to support real-time generative AI applications. These platforms facilitate the seamless flow of targeted data streams, ensuring that large language models (LLMs) receive the most relevant and current information. This capability is essential for maintaining the accuracy and reliability of AI-driven solutions, as it enables immediate responses to changing conditions and inputs. Data streaming platforms enable real-time generative applications at scale by offering the following:

Real-time integration of diverse operational data, which enhances the reliability and usability of business-specific knowledge.
Organization of unstructured data into structured formats that AI systems can process more easily.
Decoupling of customer-facing applications from backend AI processes, which allows for scalable and efficient customer interactions.
A modular architecture that supports ongoing technological upgrades without disrupting existing operations.
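The decoupling point in the list above can be sketched with two in-memory queues standing in for data streams. This is only an analogy under stated assumptions: Python's standard `queue.Queue` plays the role of a topic, the `ai_worker` function is a hypothetical stand-in for a backend AI process, and no real LLM is called.

```python
import queue
import threading

requests = queue.Queue()   # stands in for a stream of customer questions
responses = queue.Queue()  # stands in for a stream of generated answers


def ai_worker():
    # Stand-in for the backend AI process; a real worker would call an LLM here.
    while True:
        item = requests.get()
        if item is None:  # sentinel value: shut the worker down
            break
        request_id, question = item
        responses.put((request_id, f"answer to: {question}"))


worker = threading.Thread(target=ai_worker)
worker.start()

# The customer-facing side only enqueues work; it never calls the worker directly.
questions = ["Is my flight delayed?", "Can I upgrade to first class?"]
for request_id, question in enumerate(questions):
    requests.put((request_id, question))
requests.put(None)
worker.join()

# Collect the answers once the worker has finished.
results = {}
while not responses.empty():
    request_id, answer = responses.get()
    results[request_id] = answer
print(results)
```

Because the two sides share only the queues, the worker could be scaled out, replaced, or upgraded without any change to the code that accepts customer questions.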

Enable Data Streaming Throughout the Organization

Generative AI represents a paradigm shift for the entire software and tech industry. It changes not only how we interact with data but also how we engage with people. No matter what generative AI applications you build, they should not be treated as just another traditional engineering project. Instead, there needs to be a shift in mindset about how data is used, from batch processing to data streaming, so that data can flow throughout the organization. With this approach, teams can selectively incorporate valuable data as needed and treat data streams like modular building blocks, which fosters experimentation and adaptation.

Traditional project-based engineering approaches, which often rely on periodic data updates, can lead to outdated or irrelevant data. In contrast, data streaming offers a dynamic and continuous data integration strategy. This approach meets the immediate needs of generative AI applications and facilitates rapid adaptation and experimentation with new data sources and AI models.

Ultimately, embracing data streaming is not just about enhancing current capabilities but is a strategic move towards future-proofing business operations and leveraging real-time data for competitive advantage. Organizations should consider incorporating data streaming into their operational model to fully harness the potential of generative AI, ensuring they remain at the forefront of technological innovation and service excellence. StreamNative supports your transition to generative AI with the most cost-effective data streaming platform. Talk to us if you want to learn more about data streaming and generative AI.


Sijie Guo

Sijie’s journey with Apache Pulsar began at Yahoo! where he was part of the team working to develop a global messaging platform for the company. He then went to Twitter, where he led the messaging infrastructure group and co-created DistributedLog and Twitter EventBus. In 2017, he co-founded Streamlio, which was acquired by Splunk, and in 2019 he founded StreamNative. He is one of the original creators of Apache Pulsar and Apache BookKeeper, and remains VP of Apache BookKeeper and PMC Member of Apache Pulsar. Sijie lives in the San Francisco Bay Area of California.
