In a previous blog post, I examined the guarantees that you can expect from Apache BookKeeper regarding durability, consistency, availability, and low latency. I hope that that post provided you with a compelling set of reasons why BookKeeper offers a best-of-breed storage platform for your real-time workloads. In this second part, I’ll highlight a number of other compelling features of Apache BookKeeper, particularly in domains I/O isolation, data distribution, scalability, and operability.
Predictable low latency is important for real-time applications, especially for mission-critical online services like core business services and databases. Take messaging systems as an example. In most messaging systems, a slow consumer can cause a message backlog, which can potentially lead to general performance degradation. The problem is that slow consumers force the storage system to read the data from a persistent storage medium, which leads to I/O thrashing and page cache swap-in-and-out. This happens when the storage I/O component shares a single path for writes, tailing reads, and catch-up reads.
In BookKeeper, bookies (individual BookKeeper storage nodes) are designed to utilize three separate I/O paths for writes, tailing reads, and catch-up reads, respectively. Separating these paths is important because writes and tailing reads require predictable low latency while throughput is more important for catch-up reads. Providing physical isolation between these workloads means that BookKeeper is able to fully utilize:
I/O isolation means that BookKeeper can provide these different advantages without impeding the others. For more details, you can read my previous blog post on bookie storage.
Services built on top of BookKeeper (such as Apache Pulsar) store log streams as segmented ledgers in BookKeeper. Those segments (ledgers) are replicated to multiple bookies. This maximizes data placement options, which yields several benefits, such as high write availability, traffic balancing, and a simplified operational experience. I will highlight a few benefits from a deployment and operational perspective.
First, the storage capacity for a single log stream will never be limited by the storage capacity of a single host. The data can be stored as long as the whole cluster has enough capacity to do so.
Secondly, there is no log stream rebalancing involved when expanding the bookkeeper cluster. Administrators can simply drop in new machines to expand the bookkeeper cluster. The new bookies will be discovered by the cluster, and will be immediately available for writing the segments. BookKeeper also provides various placement policies including rack-aware, region-aware and weight-based, to maximize the placement options.
Thirdly, it leads to much fast and more efficient replica repair on machine failures. So when a segment is missing (due to machine failures) or corrupted (due to disk failures), bookkeeper can identify what segments to repair (re-replicating entries to meet the replicas requirement) and repair them concurrently from and to multiple hosts.
BookKeeper’s horizontal-scalability is advantageous compared to a partition-centric system like Apache Kafka, in which a log stream (known as a Kafka partition) is stored sequentially only on a subset of machines, and expanding a Kafka cluster requires expensive data rebalancing which is a resource-intensive, error-prone, and expensive operation. Additionally, on a partition-centric system, a single broken or corrupted disk will require the system to copy the entire log stream into a new disk to meet replication requirements.
Figure 1.Log stream: all log segments are replicated to a configurable number of bookies (replication=3 here) across N possible bookies (N=4 here). Log segments are evenly distributed to achieve horizontable scalability with no rebalancing.
As a real-time log stream storage, it is important to be able to scale as traffic increases or more and more data are written into the system. The Apache BookKeeper’s scalability is due to the following factors.
Stream scalability is the ability of a log stream storage to support a large number of streams, ranging from hundreds to millions of ledgers and streams, while continuing to provide consistent performance. The key to achieving this is the storage format. If ledgers and streams are stored in dedicated files, it will have trouble scaling because I/O will be scattered across the disk, as these files will be flushed from the page cache to disk periodically. BookKeeper stores the data from ledgers and streams in an interleaved storage format, where entries from different ledgers and streams are aggregated to store in large files and then indexes. This reduces both number of files and also I/O contention, allow BookKeeper to scale to a large number of ledgers and streams.
Bookie scalability is the ability of a log stream storage to support rapidly growing traffic by adding more bookies (storage nodes in BookKeeper). In BookKeeper, there is no direct interaction between bookies. This allows BookKeeper to expand the cluster by just simply adding new machines. Also, because of the way BookKeeper distributes the data on bookies, when expanding a bookkeeper cluster, there is no expensive partition data rebalancing that exhausts the system network and I/O bandwidth. This allows the cluster size to grow regardless of how data is distributed. Both Yahoo and Twitter have been running BookKeeper with hundreds to thousands of bookies in a single cluster.
Client scalability is the ability for log stream storage to support a large amount of concurrent clients and support large fanouts. This is achieved by BookKeeper in multiple places:
Applications can increase throughput by using more streams or adding more bookies. Additionally, BookKeeper also provides the ability to scale the throughput for a single stream by increasing the size of ensemble (an ensemble is a subset of bookies used for storing a given ledger or stream) and striping the data across the bookies. This is critical for some stateful applications that require data ordering on a single stream.
Apache BookKeeper is designed for operational simplicity. You can easily expand capacity by adding more bookie nodes while the system is in operation. If a bookie node becomes unavailable automatically all the entries it contains are marked as under replicated and bookkeeper autorecovery daemons will automatically re-replicate the data from other available replicas into new bookies. BookKeeper provides a read-only mode on running bookie nodes. A bookie is automatically turned into a read-only mode under some certain situations, such disks are filling up, disks become broken. A read-only bookie will not accept new writes any more, while still be able to serve read traffic. This self-healing property reduces a lot of operational pain points.
Additionally, BookKeeper provides multiple ways for managing the cluster, using admin CLI tools, using Java admin library and the other using HTTP REST API. The REST API provides flexibility to write your own external tool or incorporate certain operations into an existing tool.
Apache BookKeeper supports a pluggable authentication mechanism that applications can use to authenticate themselves. BookKeeper can also be configured to support multiple authentication mechanisms. The purpose of an authentication provider is to establish the identity of the client and to assign the client an identifier, which can be used to determine which operations the client is authorized to perform. BookKeeper by default supports two authentication providers out of the box: TLS and SASL (Kerberos).
If you’re interested in BookKeeper, its history and use case, you can also read my previous blog post Messaging, Storage or Both or you can watch the videos that Matteo Merli and I presented for our Strata talk. Those videos can be found on YouTube.
If you’re interested in BookKeeper or DistributedLog, you may want to join the BookKeeper community via the BookKeeper mailing list or the Slack channel. You can also download the latest release (version 4.5.0) from here to get started.