Apache Pulsar is a cloud-native, distributed messaging and streaming platform that manages hundreds of billions of events per day. Pulsar was originally developed and deployed inside Yahoo as the consolidated messaging platform connecting critical Yahoo applications such as Yahoo Finance, Yahoo Mail, and Flickr, to data. Pulsar was contributed to open source by Yahoo in 2016 and became a top-level Apache Software Foundation project in 2018.
Pulsar is a multi-tenant, high-performance solution for server-to-server messaging, including features such as native support for multiple clusters in a Pulsar instance, seamless geo-replication of messages across clusters, low publish and end-to-end latency, seamless scalability to over a million topics, and guaranteed message delivery with persistent message storage provided by Apache BookKeeper, among others.
Pulsar’s decoupled architecture enables you to reduce architectural complexity and put millions of topics into a single cluster. Pulsar’s distributed storage system design is segment-based. This hierarchical topic namespace enables Pulsar users to maintain millions of topics in a single cluster. Compared to monolithic system architectures, which often require organizations to segment and silo their data and workloads across multiple clusters, Pulsar reduces complexity and management overheard. Pulsar also provides granular resource management with both hard and soft isolation, making it possible to prevent producers, consumers, and topics from overwhelming the cluster. This ensures that consistent and predictable economy-scale performance can be maintained, even as the number of clients and data volumes grow.
Apache Pulsar was designed with a multi-layer architecture in which each layer is scalable, distributed, and decoupled from the other layers. With Pulsar, you can add new topics as needed and seamlessly scale performance. Scaling in Pulsar is a simple, non-disruptive operation. With partition-centric storage architectures, expanding capacity often requires partition rebalancing, which in turn requires recopying entire partitions to new nodes. Recopying data is expensive and error prone, and it consumes network bandwidth and I/O. With Pulsar, you can scale as needed without the expense and disruption of repartitioning. Pulsar’s decoupling of processing, message serving and storage layers means you can independently scale resources.
Apache Pulsar replicates each message to multiple storage nodes and was designed to be able to handle both single and multiple node failures. With stateless brokers, Pulsar only needs to update metadata to transfer ownership from one broker to another when a topic is moved to a different broker--no data is copied during this transfer of ownership. Pulsar supports full-mesh replication, in which individual topics can be replicated to any number of external datacenters and replication is done in multiple directions, all with a simple configuration setting at the tenant level. In Pulsar, all replica repairs happen in the background and are handled by the storage layer to avoid impacting other workloads being processed by brokers.