Introduction: The Challenge of Scalable Customer Engagement
Blueshift is an AI-driven customer engagement platform that combines customer data, cross-channel marketing, and AI to deliver personalized experiences at every stage of the customer journey. Achieving this vision at scale is no small feat: Blueshift ingests billions of events per day and processes over 50 terabytes of data daily for hundreds of enterprise customers. Its backend comprises 250+ microservices and dozens of databases, handling both real-time and batch data flows. However, Blueshift’s legacy architecture began to strain under this growth. The system was tightly coupled – hardwired service dependencies meant that a slowdown in one microservice could trigger failures across multiple other services. The legacy stack also relied on several disparate messaging systems (Kafka for streaming, NSQ for pub/sub, Sidekiq for job queues), which added complexity and operational overhead. The result was occasional cascading failures and reliability issues under extreme load. In short, the architecture lacked one critical quality: “resilience” – the ability to gracefully handle failures, spikes, and unpredictable conditions without collapsing.
Blueshift’s engineering team realized a fundamental overhaul was needed. They envisioned a new architecture with resilience as the core focus. Key requirements included adopting a fully event-driven design with asynchronous processing (to decouple services), introducing customer-level and service-level SLAs (to isolate workloads and avoid noisy neighbors), enabling data stream fan-out (so one event could feed multiple consumers), and seamless auto-scaling to handle traffic spikes. They also sought higher fault tolerance and infrastructure consolidation – replacing the tangle of Kafka/NSQ/Sidekiq with a single unified messaging platform. In reimagining the platform, Blueshift “decided to rebuild with [a] new architecture” centered on these goals.
Enter Apache Pulsar. To meet these requirements, Blueshift turned to Apache Pulsar as the event-driven backbone of its next-generation system. Pulsar offered a reliable publish/subscribe foundation with the flexibility, scalability, and durability needed to connect hundreds of microservices in real time. The following sections describe why Blueshift chose Pulsar and how it transformed the company’s architecture to achieve massive scale with resilience.
Why Blueshift Chose Pulsar as Its Backbone
Several factors led Blueshift to select Apache Pulsar as the messaging heart of its platform, replacing the legacy mix of systems:
- Unified Streaming and Queueing: Pulsar allowed Blueshift to consolidate Kafka, NSQ, and Sidekiq into a single platform, eliminating the operational overhead of managing multiple messaging technologies. With one unified cluster handling pub/sub, queuing, and streaming, the team reduced infrastructure complexity, training requirements, and costs.
- Decoupling for Fault Isolation: Pulsar’s native publish-subscribe model decouples producers and consumers, enabling loose coupling between microservices. Blueshift’s services now communicate through Pulsar topics instead of direct calls, so a slowdown in one component no longer cascades to others. This event-driven architecture provides true fault isolation, vastly improving overall system resiliency. If one service lags, its messages queue in Pulsar without bringing down the entire pipeline.
- Multi-Tenancy and Isolation: Apache Pulsar was designed with multi-tenancy (tenants and namespaces), which Blueshift leverages to isolate data streams per customer and service. In the new design, different teams and features operate in separate Pulsar namespaces, and each customer gets dedicated topics for their data. This prevents the “noisy neighbor” problem – one client’s traffic spikes can’t interfere with others – making per-customer SLAs technically feasible. Blueshift can guarantee a minimum processing throughput for each customer by segregating workloads at the topic level.
- Durable Storage (No Data Loss): Pulsar’s segmented storage (backed by Apache BookKeeper) persists every event and guards against data loss. Unlike the old system, where outages could drop events, Pulsar’s durable log retains all data until consumers acknowledge it. Blueshift “never need[s] to worry about message loss” anymore thanks to Pulsar’s highly durable storage architecture – a critical requirement given the volume of valuable customer interaction data being handled.
- Scalable Fan-Out: Many of Blueshift’s pipelines require the same event to drive multiple actions (for example, a user activity event might update profiles, trigger a campaign, and index into search). Pulsar supports consumer fan-out, allowing multiple independent subscriptions on the same topic. Blueshift no longer needs to build duplicate data pipelines or topic clones for each new consumer. Each service simply subscribes to the relevant Pulsar topic, and Pulsar delivers a copy of each message to every subscription (a minimal client sketch appears at the end of this section). This drastically simplifies the architecture for cross-cutting data flows and keeps data consistent across services without extra overhead.
In addition to the above, Pulsar brought other out-of-the-box features that Blueshift found valuable, such as broker-side dispatch rate limiting to throttle consumers (useful for protecting downstream systems) and flexible retention policies for different data types (hot data vs. cold data). All these capabilities aligned perfectly with Blueshift’s needs for a multi-tenant, scalable, and robust messaging backbone.
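As a rough illustration of the fan-out pattern above, the sketch below shows two independent subscriptions on the same topic using Pulsar’s Java client. The service URL, tenant, namespace, topic, and subscription names are invented for this example; they are not Blueshift’s actual configuration.

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionType;

public class FanOutSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // assumed local broker
                .build();

        String topic = "persistent://acme/user-updates/customer1"; // hypothetical topic

        // Two independent subscriptions on the same topic: each receives its own
        // copy of every message, so profile updates and campaign triggers can
        // consume the same stream without duplicate pipelines.
        Consumer<byte[]> profileConsumer = client.newConsumer()
                .topic(topic)
                .subscriptionName("profile-updater")
                .subscriptionType(SubscriptionType.Shared) // lets this service scale out
                .subscribe();

        Consumer<byte[]> campaignConsumer = client.newConsumer()
                .topic(topic)
                .subscriptionName("campaign-trigger")
                .subscriptionType(SubscriptionType.Shared)
                .subscribe();

        // Each subscription tracks its own cursor; acknowledging here does not
        // affect what the campaign subscription still has to process.
        Message<byte[]> msg = profileConsumer.receive(); // blocks until a message arrives
        profileConsumer.acknowledge(msg);

        profileConsumer.close();
        campaignConsumer.close();
        client.close();
    }
}
```

Because each subscription is independent, adding a new downstream service is a matter of choosing a new subscription name rather than standing up another pipeline.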
Architecting Blueshift with Pulsar: Key Improvements
Blueshift’s new architecture, built around Apache Pulsar, introduced several powerful patterns and operational improvements that solved the legacy challenges:
- Per-Customer Topics for Isolated SLAs: In the Pulsar model, Blueshift segregates event streams by customer account to eliminate contention. Within each Pulsar namespace (grouped by domain, such as “user updates” or “campaign events”), Blueshift enabled automatic topic creation so that every new customer gets their own set of topics for their data. For example, under the user-updates namespace there might be topics for Customer1, Customer2, Customer3, and so on, created on the fly as each customer onboards. This provides strong isolation between customers’ event flows – one client’s surge won’t backlog another’s, since they are on different topics. Blueshift also applies Pulsar’s namespace policies (such as per-namespace retention and rate limits) to give each data stream the appropriate SLA (a policy sketch follows this list). High-priority events (e.g. real-time user clicks) use topics with short retention and aggressive rate limits, while less urgent data (e.g. weekly analytics) can live in topics with longer retention and moderated throughput. By organizing topics per customer and use case, Blueshift avoids noisy neighbors and can guarantee minimum service levels for each client’s data – a crucial business requirement as the company scales.
- Seamless Scaling and Fault Isolation: In the revamped architecture, services are decoupled via Pulsar, so the system handles component failures gracefully. Suppose a particular microservice (e.g. the event processing service) slows down or goes offline – instead of rippling failures, Pulsar buffers incoming events in a backlog on that service’s topic. Other services continue to function normally, and the platform as a whole stays online (the status page might simply show a delay in one area rather than a full outage). Once the affected service is restored, it replays the queued messages and catches up. Blueshift configured auto-scaling for such scenarios: when a backlog builds up, additional consumer instances automatically spin up to drain the queue faster (a backlog-monitoring sketch follows this list). Pulsar distributes the accumulated messages across the newly scaled-out consumers, and the backlog returns to normal levels without manual intervention. This elastic scaling ensures that recovery is quick and that throughput can surge to meet demand, all while isolating the incident to the troubled service. The new Pulsar-driven design thus contains failures and spikes to single subsystems – a stark contrast to the old architecture, where one slow database could drag everything down. Small outages that previously caused widespread downtime are now non-events, handled transparently by Pulsar’s buffering and backpressure capabilities.
- Message Replay for Easy Recovery: Pulsar’s persistent storage and cursor management give Blueshift the ability to reprocess events on demand – a feature that has vastly improved operability. If a downstream system experiences a transient issue or data needs to be reloaded, the team can simply replay messages from Pulsar rather than building custom scripts or asking clients to resend data. For example, if the database that feeds a particular report was temporarily down, Blueshift can rewind the consumer to an earlier position or use Pulsar’s built-in replay tools to re-deliver recent events once the database recovers (a replay sketch follows this list). This capability means no data is permanently lost or skipped because of outages. The team highlighted that they can “easily go and replay messages right from Pulsar and [don’t] have to involve the consumer at all” to backfill missing data. Similarly, for use cases that benefit from periodic reprocessing (say, rebuilding a machine learning feature store), they can consume past events from Pulsar’s log without impacting live ingestion. Pulsar’s replay and infinite-retention options act as a safety net, making recovery and maintenance tasks far less painful than in the past.
- Zero-Downtime Elasticsearch Maintenance: One striking example of Pulsar’s impact is how Blueshift revamped its Elasticsearch indexing pipeline. Blueshift’s platform relies on Elasticsearch for powering user profile search and segmentation, with hundreds of indices across multiple clusters ingesting billions of documents (user data, events, etc.). In the past, intensive maintenance tasks such as reindexing or shard reconfiguration risked performance degradation or required downtime. By integrating Apache Pulsar into its architecture, Blueshift introduced a new approach that decouples live indexing from maintenance workflows. Pulsar’s durable, replayable message streams and native subscription fan-out model allow multiple consumers to independently process the same data streams, enabling Blueshift to run parallel maintenance or migration operations without affecting production indexing. This design ensures continuity of data during long-running background tasks, allowing index updates, rebalances, or optimizations to complete seamlessly with no service interruption or performance degradation. As a result, complex Elasticsearch operations that once required downtime can now be executed transparently with far greater operational agility.
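To make the per-namespace SLA policies concrete, here is a minimal sketch using Pulsar’s Java admin client, assuming a recent Pulsar release in which DispatchRate exposes a builder. The tenant (acme), namespace names, retention windows, and rate limits are illustrative placeholders, not Blueshift’s published configuration.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.DispatchRate;
import org.apache.pulsar.common.policies.data.RetentionPolicies;

public class NamespacePolicySketch {
    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080") // assumed admin endpoint
                .build();

        // Hot, high-priority events: short retention, aggressive dispatch throttling
        // so downstream consumers are never overwhelmed.
        admin.namespaces().setRetention("acme/user-updates",
                new RetentionPolicies(60, 1024)); // 60 minutes, 1 GB per topic

        admin.namespaces().setDispatchRate("acme/user-updates",
                DispatchRate.builder()
                        .dispatchThrottlingRateInMsg(5000) // messages per period
                        .dispatchThrottlingRateInByte(-1)  // no byte limit
                        .ratePeriodInSecond(1)
                        .build());

        // Less urgent analytics data: longer retention, moderate throughput.
        admin.namespaces().setRetention("acme/weekly-analytics",
                new RetentionPolicies(7 * 24 * 60, 10 * 1024)); // 7 days, 10 GB per topic

        admin.close();
    }
}
```

Because these are namespace-level policies, every per-customer topic auto-created inside the namespace inherits the same SLA without extra per-topic setup.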
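The backlog-driven auto-scaling can be wired up in several ways; one simple approach is to poll subscription backlog through Pulsar’s admin API and feed that number to whatever autoscaler runs the consumers (for example, a Kubernetes HPA on a custom metric). The sketch below assumes a recent Pulsar release where TopicStats exposes getters; the topic, subscription name, and threshold are hypothetical.

```java
import org.apache.pulsar.client.admin.PulsarAdmin;
import org.apache.pulsar.common.policies.data.SubscriptionStats;
import org.apache.pulsar.common.policies.data.TopicStats;

public class BacklogCheckSketch {
    // Illustrative threshold; a real deployment would tune this per workload.
    private static final long BACKLOG_THRESHOLD = 100_000;

    public static void main(String[] args) throws Exception {
        PulsarAdmin admin = PulsarAdmin.builder()
                .serviceHttpUrl("http://localhost:8080")
                .build();

        // Read current stats for one hypothetical per-customer topic.
        TopicStats stats = admin.topics()
                .getStats("persistent://acme/user-updates/customer1");

        SubscriptionStats sub = stats.getSubscriptions().get("profile-updater");
        long backlog = (sub == null) ? 0 : sub.getMsgBacklog();

        if (backlog > BACKLOG_THRESHOLD) {
            // Signal the platform to add consumer instances; with a Shared
            // subscription, Pulsar spreads the queued messages across them.
            System.out.println("Backlog " + backlog + " exceeds threshold, scale out");
        }

        admin.close();
    }
}
```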
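Finally, a sketch of the two replay building blocks Pulsar exposes in its Java client: rewinding an existing subscription’s cursor, and attaching a brand-new subscription that starts from the earliest retained message, the pattern behind the zero-downtime Elasticsearch maintenance described above. The topic, subscription names, and one-hour rewind window are invented for illustration.

```java
import java.util.concurrent.TimeUnit;

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.SubscriptionInitialPosition;
import org.apache.pulsar.client.api.SubscriptionType;

public class ReplaySketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        String topic = "persistent://acme/user-updates/customer1";

        // Recovery after a downstream outage: move the cursor back one hour
        // (by publish time) and Pulsar re-delivers everything retained since then.
        Consumer<byte[]> indexer = client.newConsumer()
                .topic(topic)
                .subscriptionName("search-indexer")
                .subscribe();
        indexer.seek(System.currentTimeMillis() - TimeUnit.HOURS.toMillis(1));

        // Zero-downtime migration: a separate subscription starts from the
        // earliest retained message, so a rebuilt Elasticsearch index can be
        // backfilled while the live "search-indexer" subscription keeps
        // serving production traffic unaffected.
        Consumer<byte[]> migration = client.newConsumer()
                .topic(topic)
                .subscriptionName("search-indexer-migration")
                .subscriptionType(SubscriptionType.Shared)
                .subscriptionInitialPosition(SubscriptionInitialPosition.Earliest)
                .subscribe();

        indexer.close();
        migration.close();
        client.close();
    }
}
```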
Benefits and Business Impact
By rebuilding its data infrastructure around Apache Pulsar, Blueshift realized significant technical and business benefits:
- Reduced Complexity and Cost: Simplifying from three messaging systems to one Pulsar-based platform immediately lowered Blueshift’s operational complexity and expenses. The team no longer maintains separate Kafka, NSQ, and Sidekiq clusters – a consolidated Pulsar cluster handles all streaming, queueing, and pub/sub needs. This infrastructure consolidation cuts down on maintenance effort, infrastructure footprint, and training, allowing engineers to focus on innovation rather than babysitting multiple systems.
- Higher Reliability and Resilience: The Pulsar-driven event architecture has virtually eliminated cascading failures that previously caused platform outages. Services are insulated by Pulsar topics, so an issue in one area results at most in a backlog and localized delay, not a platform-wide crash. Blueshift’s platform now stays operational through hiccups like machine failures or sudden traffic spikes – precisely the resilient behavior the team set out to achieve. This improved reliability translates into better uptime and trust for Blueshift’s customers, as the system can handle unexpected disturbances “within some acceptable degradation” rather than going down.
- Guaranteed Customer SLAs: Thanks to Pulsar, Blueshift can confidently offer per-customer performance guarantees. Each client’s data streams are isolated in their own set of topics, protected by Pulsar’s tenant and namespace isolation. One customer uploading millions of records will not slow down another customer’s processing. This not only avoids awkward conversations (“we’re slow because another customer overloaded the system”), but it also ensures consistent, predictable service for all clients big and small. In terms of business impact, this isolation is a competitive advantage – Blueshift can handle large enterprise workloads without letting any single tenant degrade overall platform performance.
- Streamlined Operations and Recovery: Pulsar’s rich feature set has made day-to-day data operations much easier. The ability to replay data from Pulsar means the team can recover from errors or backfill data at any time, without special tooling. Complex maintenance tasks, like the Elasticsearch reindexing, are now done with zero downtime using Pulsar to keep systems in sync. Moreover, scaling up throughput is as simple as adding more consumer instances to a topic – Pulsar handles load balancing – which gives Blueshift headroom to grow on demand. These improvements translate to less firefighting for the engineering team and more agility in rolling out new features or handling traffic peaks. Overall, Apache Pulsar has become a force multiplier for Blueshift’s developers and SREs, reducing risk and toil while improving service quality.
Conclusion: Pulsar as the Foundation for a Resilient Data Platform
Blueshift’s journey illustrates how a robust messaging backbone can unlock the full potential of a data-intensive platform. By adopting Apache Pulsar, Blueshift transformed a fragile legacy system into a scalable, event-driven architecture that powers real-time customer engagement on a global scale. Pulsar now serves as the “central nervous system” of Blueshift’s platform, connecting hundreds of microservices and data pipelines in a decoupled, reliable manner. Features like persistent storage, multi-tenancy, and flexible subscriptions enabled Blueshift to achieve a level of flexibility and resilience that would have been impractical otherwise. With Pulsar ensuring no data is lost and no service is overwhelmed, Blueshift’s team can innovate faster and deliver new capabilities knowing the backbone will scale and recover gracefully. The end result is a win-win: developers spend less time on plumbing and more on product, while customers experience a highly reliable, real-time personalization service even as data volumes explode.
Blueshift’s next-gen infrastructure, built on Pulsar, is a compelling blueprint for any organization facing growth challenges with legacy architecture. It demonstrates that modern event streaming technology can replace brittle, monolithic designs with cloud-native resiliency – enabling mission-critical systems to meet strict SLAs and adapt to change with ease. As Blueshift continues to expand its AI-driven customer engagement platform, Apache Pulsar remains the backbone that ensures every message reaches its destination and every customer action is processed promptly, come what may.
To learn more about Blueshift’s Pulsar journey and architectural insights, watch the full talk, “Next-Gen Data Infra: Building Resilient, Scalable Architecture with Apache Pulsar,” from Data Streaming Summit 2025 on YouTube.
