Apache BookKeeper is a scalable, fault-tolerant, and low-latency log storage service optimized for real-time workloads. BookKeeper was originally developed at Yahoo! Research, was incubated as a subproject of Apache ZooKeeper beginning in 2011, and graduated to a top-level Apache project in January 2015. Since its introduction, BookKeeper has been widely adopted by enterprises including Twitter, Yahoo, and Salesforce for storing and serving mission-critical data across a variety of use cases. In this blog post, I’ll look at how BookKeeper provides durability, consistency, and low latency, and I’ll highlight its guarantees and key features, all of which are available out of the box, in open source.
In a previous blog post, I gave a high-level overview of Apache BookKeeper and introduced several BookKeeper concepts and terms. A BookKeeper cluster is composed of:
BookKeeper clients can use either the higher-level DistributedLog API (aka the log stream API) or a lower-level ledger API that enables direct interaction with bookies. A typical BookKeeper installation is shown in Figure 1 directly below.
Figure 1: A typical BookKeeper installation (plus applications connecting via multiple APIs)
As we described in the Intro to Apache BookKeeper blog post, a real-time storage platform should meet all of the following requirements:
BookKeeper meets all of these requirements by providing the following guarantees:
| Guarantee | Description |
|---|---|
| Replication | For the sake of fault tolerance, data is replicated and durably stored on multiple machines and even, optionally, in multiple datacenters. |
| Durability | Data is committed to persistent storage when replication is successful. fsync is enforced prior to sending acknowledgments to clients. |
| Consistency | A simple, repeatable read consistency model is used to guarantee consistency between readers. |
| Availability | Ensemble changes and speculative reads are used for improving write and read availability while also enforcing consistency and durability. |
| Low latency | Write and read latency are protected through I/O isolation while also maintaining consistency and durability. |
BookKeeper stores several identical copies of each data record (typically three or five) on different machines within a datacenter or, optionally, across multiple datacenters. Unlike distributed systems that replicate data between replicas with a master/slave or pipeline algorithm (such as Apache HDFS, Ceph, and Kafka), Apache BookKeeper uses a quorum-vote parallel replication algorithm, which keeps latency low and predictable. Figure 2 below illustrates replication within a BookKeeper ensemble.
Figure 2: Ensemble, write, and ack quorums in BookKeeper replication
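To make the quorum idea concrete, here is a minimal Python sketch. It is a toy model rather than BookKeeper code: the function names are hypothetical, entries are assumed to be striped round-robin across the ensemble, and pipeline replication is simplified to the sum of per-hop latencies.

```python
def write_set(entry_id, ensemble, write_quorum):
    """Bookies that store one entry, assuming round-robin striping."""
    n = len(ensemble)
    return [ensemble[(entry_id + i) % n] for i in range(write_quorum)]


def quorum_write_latency(latencies_ms, ack_quorum):
    """Parallel writes finish when the ack_quorum-th fastest bookie responds."""
    return sorted(latencies_ms)[ack_quorum - 1]


def pipeline_write_latency(latencies_ms):
    """Chained replication waits for each hop in turn (simplified as a sum)."""
    return sum(latencies_ms)


if __name__ == "__main__":
    ensemble = ["bookie-0", "bookie-1", "bookie-2", "bookie-3", "bookie-4"]
    for entry_id in range(3):
        print(entry_id, write_set(entry_id, ensemble, write_quorum=3))

    # One slow bookie (50 ms) among the three that receive the entry.
    latencies = [2, 3, 50]
    print("quorum:", quorum_write_latency(latencies, ack_quorum=2), "ms")
    print("pipeline:", pipeline_write_latency(latencies), "ms")
```

With an ack quorum of two, the slow bookie does not hold up the write: in this model the quorum write completes in 3 ms, while the chained write takes 55 ms.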
In the diagram above:
BookKeeper replication is based on a set of core ideas:
These two core tenets ensure that BookKeeper replication is able to:
Every data record written to BookKeeper is guaranteed to be replicated and durably stored in the configured number of bookies. This is achieved using explicit disk fsync and write confirmation.
Most recent NoSQL-type databases, distributed filesystems, and messaging systems (such as Apache Kafka) assume that replicating data to the volatile memory of multiple nodes is sufficient for best-effort durability. The problem with this approach is that it leaves room for data loss, for example when several replicas fail before the data has reached disk.
BookKeeper was designed to provide much stronger durability guarantees, with the goal of fully preventing data loss and thus meeting stringent enterprise requirements.
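The difference between "written" and "durably written" can be sketched in a few lines of Python. This is only an illustration of the sync-before-acknowledge pattern using standard library calls. It is not BookKeeper's journal implementation, and `append_and_sync` is a hypothetical helper.

```python
import os
import tempfile


def append_and_sync(path, record):
    """Append a record and return only after it has been fsynced.

    Returning from this function stands in for sending the acknowledgment.
    """
    with open(path, "ab") as journal:
        journal.write(record)
        journal.flush()              # move data from Python's buffer to the OS
        os.fsync(journal.fileno())   # ask the OS to push it to the device
    return os.path.getsize(path)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "journal.log")
        size = append_and_sync(path, b"entry-0\n")
        print("acknowledged after syncing", size, "bytes")
```

A system that acknowledges right after `write` (or after `flush`) can lose the record if the machine loses power before the operating system writes its cache back to disk.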
Guaranteeing consistency is a common problem in distributed systems, particularly when replication is introduced to provide durability and availability. BookKeeper provides a simple but strong consistency guarantee—repeatable read consistency—for data stored in logs:
This repeatable read consistency is accomplished by a LastAddConfirmed (LAC) protocol used by BookKeeper. I’ll walk you through some additional details in a future blog post.
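Until then, the core idea can be sketched with a toy model. The class and method names below are hypothetical, and the model assumes that the LAC is the highest entry id for which that entry and every earlier entry have been acknowledged. The real protocol involves more than this.

```python
class LacTracker:
    """Toy writer-side model of LastAddConfirmed (LAC)."""

    def __init__(self):
        self.lac = -1          # nothing confirmed yet
        self._pending = set()  # acknowledged entries still above the LAC

    def on_ack(self, entry_id):
        """Record an acknowledgment and advance the LAC past any closed gap."""
        self._pending.add(entry_id)
        while self.lac + 1 in self._pending:
            self.lac += 1
            self._pending.discard(self.lac)
        return self.lac

    def readable(self, entry_id):
        """Readers may only see entries at or below the LAC."""
        return 0 <= entry_id <= self.lac


if __name__ == "__main__":
    tracker = LacTracker()
    for entry_id in [1, 0, 3, 2]:   # acknowledgments arrive out of order
        print("ack", entry_id, "-> LAC", tracker.on_ack(entry_id))
```

Because readers never go past the LAC, every reader sees the same entries in the same order, even while later entries are still in flight.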
In terms of the CAP (Consistency, Availability, Partition tolerance) theorem, BookKeeper is a CP system. In practice, however, Apache BookKeeper offers very high availability even in the presence of hardware, network, and other failures. To guarantee high write and read availability, BookKeeper uses the following mechanisms:
| Goal | Mechanism | Description |
|---|---|---|
| Write availability | Ensemble changes | Clients will reconfigure record placement (which bookies they write records to) when bookie failures occur. This ensures that writes will always succeed as long as there are enough total bookies remaining in the cluster. |
| Read availability | Speculative reads | Unlike other systems that only read from one storage node designated as the leader, such as Apache Kafka, BookKeeper enables clients to read records from any bookie in an ensemble. This helps spread read traffic across bookies, which has the added benefit of reducing tail read latency. |
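Both mechanisms can be illustrated with a small Python model. The function names and the timing model are assumptions made for this sketch: an ensemble change swaps a failed bookie for a spare one, and a speculative read sends the request to the next replica whenever the previous ones have not answered within a timeout. The actual client logic is more involved.

```python
def replace_failed_bookie(ensemble, failed, cluster):
    """Return a new ensemble with the failed bookie swapped for a spare."""
    spares = [b for b in cluster if b not in ensemble]
    if not spares:
        raise RuntimeError("not enough bookies left to form an ensemble")
    return [spares[0] if b == failed else b for b in ensemble]


def speculative_read_latency(latencies_ms, speculative_timeout_ms):
    """Time until the first response when replicas are tried one by one.

    Replica i is contacted at i * speculative_timeout_ms, but only if no
    earlier replica has answered by then. float("inf") models a dead bookie.
    """
    best = float("inf")
    for i, latency in enumerate(latencies_ms):
        send_time = i * speculative_timeout_ms
        if best <= send_time:
            break  # an earlier replica already answered
        best = min(best, send_time + latency)
    return best


if __name__ == "__main__":
    cluster = ["bookie-0", "bookie-1", "bookie-2", "bookie-3"]
    ensemble = ["bookie-0", "bookie-1", "bookie-2"]
    print(replace_failed_bookie(ensemble, "bookie-1", cluster))

    # First replica is slow (500 ms); the second answers in 5 ms.
    print(speculative_read_latency([500, 5, 5], speculative_timeout_ms=10))
```

In this model, a read that would have taken 500 ms on the slow replica completes in 15 ms (10 ms of waiting plus 5 ms on the second replica), which is how speculation trims tail latency.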
Providing strong durability and consistency guarantees is a hard distributed systems problem, especially when combined with enterprise-grade low latency requirements. BookKeeper fulfills these requirements in the following ways:
Finally, it is worth pointing out that durability (through explicit fsync and write confirmation) and repeatable read consistency are crucial to stateful processing, especially effectively-once processing in streaming applications.
In this blog post, I examined which guarantees you can expect from BookKeeper regarding durability, consistency, availability, and low latency. I hope that this post provides a compelling set of reasons to choose BookKeeper as a storage platform for your real-time workloads. In the next few blog posts, I’ll delve more deeply into how BookKeeper replicates data and which mechanisms it uses to ensure low latency while still guaranteeing consistency and durability.
If you’re interested in BookKeeper or DistributedLog, you may want to join the growing BookKeeper community via the BookKeeper mailing list or the Slack channel. You can also download the latest release (version 4.5.0) from here to get started.