Success Story · Dec 18, 2023 · 10 min

Scaling IoT Control Center with Pulsar Integration at Cisco

Chandra Ganguly, Sr Director of IoT Control Center, Cisco
Alec Hothan, Principal Engineer, Cisco
Company: Cisco
Industry: Telecommunications
Size: 10,000+ employees
Apache Pulsar

"We needed a messaging system that had to be reliable, scalable, and had extremely light overhead. Everything needs to be geo-replicated and encrypted/secure. What we are investing today, we are investing for the future."

Chandra Ganguly

Sr Director of IoT Control Center at Cisco

Executive Summary

  • Massive IoT Scale: Cisco's IoT Control Center manages 245 million devices (including 84 million connected cars) across 17 geo-replicated deployments, serving 35+ service providers and 35,000 enterprises with 4.5 billion API calls per month.
  • Legacy Modernization: Cisco replaced its legacy Kestrel messaging broker with Apache Pulsar, gaining reliable pub/sub messaging with geo-replication, encryption, and the scalability needed to support 300+ applications per instance.
  • Kubernetes-Native Operations: By combining Pulsar Operator with Flux CD on Kubernetes, Cisco achieved fully declarative, Git-driven deployments — rolling out new Pulsar versions to any cluster within 30 minutes with zero service interruption.

Watch the Presentation

Chandra Ganguly and Alec Hothan presented Cisco IoT Control Center's Pulsar deployment at Pulsar Summit 2023.

Customer Overview

Cisco IoT Control Center is a device lifecycle management platform for IoT connectivity. It enables service providers and enterprises to manage connected devices — from connected cars and smart meters to alarm systems and industrial sensors — through a multi-tenant SaaS platform.

"We have about 245 million and growing number of devices today... about 84 million of connected cars... it's one of our biggest and fastest growing business."

The scale of the platform is substantial: 17 geo-replicated deployments worldwide, over 35 service providers, more than 35,000 enterprises, and 4.5 billion API calls every month. The platform processes around 2 million automation rules every two minutes, making reliable and performant messaging infrastructure a critical requirement.

Challenges

As the IoT Control Center grew in both device count and customer density, Cisco's existing messaging infrastructure began to show its limits:

  • Legacy Broker Limitations: The platform had been using Kestrel, Twitter's open-source message queue, as its messaging backbone. While functional, Kestrel lacked the features needed for a modern, globally distributed IoT platform — including native geo-replication and encryption capabilities.
  • Scale and Cost Pressure: With the device count growing rapidly and customer density increasing, Cisco needed a messaging system that could scale efficiently without proportional cost increases. The system had to support over 300 applications per instance using pub/sub messaging patterns.
  • Geo-Replication and Security: Every deployment required data to be replicated between paired data centers for disaster recovery. All communication needed to be encrypted end-to-end, and the system had to integrate with customer networks that might use different technologies like Kafka.
  • Future Investment: Cisco was making a long-term infrastructure bet. Whatever system they chose had to be extensible enough to support evolving requirements for years to come.

"We needed a messaging system that had to be reliable, scalable, and had extremely light overhead. Everything needs to be geo-replicated and encrypted/secure. What we are investing today, we are investing for the future."

Solution

Cisco chose Apache Pulsar as the replacement for Kestrel, building a comprehensive CDR (Call Detail Record) processing pipeline on top of it. The architecture flows through several stages: ingestion, protobuf normalization, deduplication, rating, automation, and database storage.

A key optimization was the switch from JSON to Protocol Buffers for message serialization:

"It used to be JSON and JSON is very slow for encoding and decoding. The size of messages with JSON was over 1 to 2 kilobytes per message, and with protobuf we could reduce this to around 400 bytes."

This reduction in message size — from 1–2 KB down to roughly 400 bytes — significantly improved throughput and reduced network overhead across the platform's 17 deployments.
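The size difference is easy to see with a toy CDR. The sketch below uses Python's `struct` packing as a stand-in for protobuf's binary encoding (the field names are hypothetical, not Cisco's actual schema): JSON repeats every field name and spells numbers out as digits, while a binary layout carries only the values.

```python
import json
import struct

# Hypothetical CDR fields -- illustrative only, not Cisco's protobuf schema.
cdr = {
    "device_id": 1234567890,
    "session_id": 9876543210,
    "bytes_up": 52431,
    "bytes_down": 918273,
    "timestamp": 1702857600,
}

# JSON repeats field names and encodes integers as decimal text.
json_size = len(json.dumps(cdr).encode("utf-8"))

# A fixed binary layout (stand-in for protobuf's wire format):
# five unsigned 64-bit integers packed back to back = 40 bytes.
binary_size = len(struct.pack("<5Q", *cdr.values()))

print(f"JSON: {json_size} bytes, binary: {binary_size} bytes")
```

Real protobuf does even better than this fixed layout, since varint encoding shrinks small integers further; the point is that the per-message field-name overhead disappears entirely.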

Cluster sizes range from 50,000 messages per second for smaller deployments to 500,000 messages per second for the largest, with most clusters running at 10–20% of capacity to absorb traffic spikes. Cisco leverages Pulsar's built-in geo-replication to mirror data between paired data centers, providing automatic disaster recovery.

The platform handles predictable daily and weekly traffic patterns — with peaks during business hours and quieter periods at night — as well as unexpected spikes. Pulsar's architecture proved well-suited to absorbing these fluctuations without intervention.

The deduplication stage is particularly demanding: it maintains a lookup window of one month across billions of records, ensuring that duplicate CDRs are identified and filtered before downstream processing.
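The shape of that stage can be sketched as a time-windowed filter. This is an assumed design for illustration, not Cisco's implementation: it remembers each CDR id's first sighting and lazily evicts ids that have aged out of the lookup window.

```python
from collections import OrderedDict

WINDOW_SECONDS = 30 * 24 * 3600  # roughly a one-month lookup window

class CdrDeduplicator:
    """Sketch of a time-windowed dedup filter (assumed design, not Cisco's code)."""

    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.seen = OrderedDict()  # cdr_id -> first-seen timestamp

    def is_duplicate(self, cdr_id, now):
        # Lazily evict ids whose first sighting fell out of the window.
        while self.seen:
            oldest_id, ts = next(iter(self.seen.items()))
            if now - ts <= self.window:
                break
            self.seen.popitem(last=False)
        if cdr_id in self.seen:
            return True
        self.seen[cdr_id] = now
        return False

dedup = CdrDeduplicator()
print(dedup.is_duplicate("cdr-001", now=0))                   # False: first sighting
print(dedup.is_duplicate("cdr-001", now=1000))                # True: within window
print(dedup.is_duplicate("cdr-001", now=WINDOW_SECONDS + 1))  # False: evicted
```

At billions of records, an in-memory dict would not suffice; a production filter would back this window with a compacted store or an external key-value layer, but the eviction logic is the same.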

Deployment

Cisco's deployment model is built on Kubernetes with Flux CD (a CNCF GitOps project) for continuous delivery, using two custom operators to manage Pulsar infrastructure declaratively:

  • Pulsar Operator: Handles the deployment and lifecycle management of Pulsar clusters themselves — brokers, bookies, ZooKeeper nodes, and proxies.
  • Pulsar Resource Operator: Manages Pulsar resources like namespaces, topics, and geo-replication configuration as Kubernetes custom resources.

"With Pulsar operator, I just have to declare what resource I need, and the operator will take care of all the deployment."

The entire deployment workflow is Git-driven. Engineers commit changes to a repository, Flux CD detects the updates, and the operators reconcile the desired state with the running clusters. A custom CI/CD import tool copies container images into Cisco's internal registry and rebuilds Helm charts for each deployment.
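The reconcile step at the heart of that workflow can be sketched in a few lines. This is illustrative only; the real Pulsar Operator is a Kubernetes controller, and the topic names below are made up. Desired state comes from Git, actual state from the running cluster, and the operator computes the difference:

```python
def reconcile(desired: dict, actual: dict) -> dict:
    """Compute the actions needed to move `actual` toward `desired`."""
    return {
        "create": sorted(set(desired) - set(actual)),
        "delete": sorted(set(actual) - set(desired)),
        "update": sorted(k for k in desired.keys() & actual.keys()
                         if desired[k] != actual[k]),
    }

# Desired state as declared in the Git repo (hypothetical topic specs).
desired = {"topic-cdr-raw": {"partitions": 8}, "topic-cdr-rated": {"partitions": 4}}
# Actual state observed on the running cluster.
actual = {"topic-cdr-raw": {"partitions": 4}, "topic-legacy": {"partitions": 2}}

print(reconcile(desired, actual))
```

Running the loop repeatedly converges the cluster on whatever the repository declares, which is why a Git commit is all an engineer needs to touch.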

Security is handled through multiple layers: Istio service mesh provides mTLS between all services, and JWT tokens are used for Pulsar client authentication. Observability is centralized through a cloud logging system that provides a single pane of glass across dozens of clusters.
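With Pulsar's token authentication, the broker reads the client's role from the JWT's `sub` claim. The sketch below hand-builds a token purely for illustration (not a real credential, and the role name is made up) and decodes the claim the way an authorizing service would, minus signature verification:

```python
import base64
import json

def b64url(data: dict) -> bytes:
    """Base64url-encode a JSON object, padding stripped, per the JWT format."""
    return base64.urlsafe_b64encode(json.dumps(data).encode()).rstrip(b"=")

# Hand-built example token -- NOT a real credential; "cdr-ingest-service" is
# a hypothetical role name for this sketch.
header = b64url({"alg": "RS256", "typ": "JWT"})
payload = b64url({"sub": "cdr-ingest-service"})
token = b".".join([header, payload, b"signature-goes-here"]).decode()

def token_role(jwt: str) -> str:
    """Extract the subject claim (no signature verification -- sketch only)."""
    _, body, _ = jwt.split(".")
    body += "=" * (-len(body) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(body))["sub"]

print(token_role(token))  # the role Pulsar authorizes against
```

In production the broker verifies the signature against a configured key before trusting the claim; the decode step above only shows where the role lives.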

For storage, Cisco uses either local NVMe drives for high-performance deployments or network-attached storage (NetApp) where flexibility is preferred. In practice, network-attached storage proved sufficient for most use cases, simplifying operations by decoupling storage from specific nodes.

Results

After several months in production, Pulsar has delivered on Cisco's requirements:

  • Reliability Under Load: The system has proven reliable across all 17 deployments, absorbing daily traffic patterns and unexpected spikes without issue. The combination of running clusters at 10–20% capacity with Pulsar's elastic architecture provides substantial headroom.
  • Rapid, Painless Updates: New Pulsar versions are deployed to any cluster within 30 minutes using the GitOps workflow. Cisco maintains a 3-week release cycle without service interruption, and quarterly failover exercises between paired data centers have been consistently successful.

"This combination of using Pulsar Operator with something like Flux works really well for us and allows us to do really painless updates. A new update of Pulsar — we can get it deployed on any cluster within half an hour."

  • Low Operational Overhead: Once proper alerts and monitoring are in place, the Pulsar clusters require very little day-to-day maintenance. The declarative, operator-driven model means that infrastructure changes are version-controlled and reproducible.
  • Storage Flexibility: Network-attached storage proved performant enough for the majority of workloads, reducing the operational complexity of managing local NVMe storage across many clusters. Local storage remains available for the most demanding deployments.

Future Plans

Cisco's team continues to evolve their Pulsar deployment with several initiatives on the roadmap:

  • Autoscaling Pulsar: Automatically adjusting cluster resources based on traffic patterns to optimize costs and handle growth without manual intervention.
  • Transactions and Functions: Leveraging Pulsar's built-in transaction support and serverless Functions framework to simplify pipeline stages and ensure exactly-once processing semantics.
  • Ambient Mesh: Migrating from Istio's sidecar proxy model to ambient mesh, which eliminates per-pod sidecars in favor of node-level proxies — reducing resource overhead and simplifying the networking layer.
  • Tiered Storage: For deployments using local NVMe storage, tiered storage will allow older data to automatically move to cheaper object storage, reducing the cost of long-term data retention without sacrificing accessibility.