sharetwitterlinkedIn

Benchmarking Pulsar and Kafka - A More Accurate Perspective on Pulsar’s Performance

November 9, 2020
head img

Note: This post presents StreamNative’s response to Confluent’s recent article “Benchmarking Apache Kafka, Apache Pulsar, and RabbitMQ: Which is the Fastest?”. For a brief overview of these systems, see our review of Pulsar vs. Kafka (part 1, part 2).

Executive Summary

Today, many companies are looking at real-time data streaming applications to develop new products and services. Organizations must first understand the advantages and differentiators of the different event streaming systems before they can select the technology best-suited to meet their business needs.

Benchmarks are one method organizations use to compare and measure the performance of different technologies. In order for these benchmarks to be meaningful, they must be done correctly and provide accurate information. Unfortunately, it is all too easy for benchmarks to fail to provide accurate insights due to any number of issues.

Confluent recently ran a benchmark to evaluate how Kafka, Pulsar, and RabbitMQ compare in terms of throughput and latency. According to Confluent’s blog, Kakfa was able to achieve the “best throughput” with “low latency” and RabbitMQ was able to provide “low latency” at “lower throughputs”. Overall, their benchmark declared Kafka the clear winner in terms of “speed”.

While Kafka is an established technology, Pulsar is the top streaming technology of choice for many companies today, from global corporations to innovative start-ups. In fact, at the recent Splunk summit, conf20, Sendur Sellakumar, Splunk’s Chief Product Officer, discussed their decision to adopt Pulsar over Kafka:

"... we've shifted to Apache Pulsar as our underlying streaming. It is our bet on the long term architecture for enterprise-grade multi-tenant streaming."

    - Sendur Sellakumar, CPO, Splunk 

This is just one of many examples of companies adopting Pulsar. These companies choose Pulsar because it provides the ability to horizontally and cost effectively scale to massive data volumes, with no single point of failure, in modern elastic cloud environments, like Kubernetes. At the same time, built-in features like automatic data rebalancing, multi-tenancy, geo-replication, and tiered storage with infinite retention, simplify operations and make it easier for teams to focus on business goals.

Ultimately, developers are adopting Pulsar for its features, performance, and because all of the unique aspects of Pulsar, mentioned above, make it well suited to be the backbone for streaming data.

Knowing what we know, we had to take a closer look at Confluent’s benchmark to try to understand their results. We found two issues that were highly problematic. First, and the largest source of inaccuracy, is Confluent’s limited knowledge of Pulsar. Without understanding the technology, they were not able to set-up the test in a way that could accurately measure Pulsar’s performance.

Second, their performance measurements were based on a narrow set of test parameters. This limited the applicability of the results and failed to provide readers with an accurate picture of the technologies’ capabilities across different workloads and real-world use cases.

In order to provide the community a more accurate picture, we decided to address these issues and repeat the test. Key updates included:

  1. We updated the benchmark setup to include all of the durability levels supported by Pulsar and Kafka. This allowed us to compare throughput and latency at the same level of durability.
  2. We fixed the OpenMessaging Benchmark (OMB) framework to eliminate the variants introduced by using different instances, and corrected configuration errors in their OMB Pulsar driver.
  3. Finally, we measured additional performance factors and conditions, such as varying numbers of partitions and mixed workloads that contain writes, tailing-reads, and catch-up reads to provide a more comprehensive view of performance.

With these updates made, we repeated the test. The result - Pulsar significantly outperformed Kafka in scenarios that more closely resembled real-world workloads and matched Kafka’s performance in the basic scenario Confluent used.

The following section highlights the most important findings. A more comprehensive performance report in the section StreamNative Benchmark Results also gives detail of our test setup and additional commentary.

StreamNative Benchmark Result Highlights

#1 With the same durability guarantee as Kafka, Pulsar achieves 605 MB/s publish and end-to-end throughput (same as Kafka), and 3.5 GB/s catch-up read throughput (3.5 times higher than Kafka). Increasing the number of partitions and changing durability levels have no impact on Pulsar's throughput. However, the Kafka's throughput was severely impacted when changing the number of partitions or changing durability levels.

Table 1: Throughput differences between Pulsar and Kafka under different workloads with different durability guarantees

Durability Levels Partitions Pulsar Kafka
Peak Publish + Tailing Reads Throughput (MB/s) Level-1 Durability 1 300 MB/s 160 MB/s
100 300 MB/s 420 MB/s
2000 300 MB/s 300 MB/s
Level-2 Durability 1 300 MB/s 180 MB/s
100 605 MB/s 605 MB/s
2000 605 MB/s 300 MB/s
Peak Catch-up Reads Throughput (MB/s) Level-1 Durability 100 1.7 GB/s 1 GB/s
Level-2 Durability 100 3.5 GB/s 1 GB/s

For details on "Level-1 Durability", see An Overview of Durability in Distributed Systems section for detailed discussion on durability differences between Pulsar and Kafka.

#2: Pulsar delivers significantly better latency than Kafka in each of the different test cases (including different number of subscriptions, different number of topics, and different durability guarantees).

Pulsar’s 99th percentile latency is within the range of 5 and 15 milliseconds. Kafka’s 99th percentile latency can go up to seconds and is hugely impacted by the number of topics, subscriptions and different durability guarantees.

Table 2: End-to-End P99 Latency between Pulsar and Kafka of different number of subscriptions with different durability guarantees

Partitions & Subscriptions Local Durability Replication Durability Pulsar Kafka
End-to-End P99 Latency (ms)
(Publish + Tailing Reads)
100 Partitions, 1 Subscription Sync Ack-1 5.86 18.75
Ack-2 11.64 64.62
Async Ack-1 5.33 6.94
Ack-2 5.55 10.43
100 Partitions, 10 Subscriptions Sync Ack-1 7.12 145.10
Ack-2 14.65 1599.79
Async Ack-1 6.84 89.80
Ack-2 6.94 1295.78

Table 3: End-to-End P99 Latency between Pulsar and Kafka of different number of topics with different durability guarantees

Local Durability Replication Durability Partitions Pulsar Kafka
End-to-End P99 Latency (ms)
(Publish + Tailing Reads)
Sync Ack-1 100 5.86 18.75
5000 6.26 79236
10000 6.67 187840
Ack-2 100 11.64 64.62
5000 14.38 157960
10000 15.78 197140
Async Ack-1 100 5.33 6.94
5000 5.75 86641
10000 6.64 184513
Ack-2 100 5.55 10.43
5000 6.20 116028
10000 7.50 200793

#3: Pulsar provides significantly better I/O isolation than Kafka. Pulsar’s 99th percentile publish latency remains around 5 milliseconds when there are consumers catching up on reading historic data. In contrast, Kafka’s latency is severely impacted by catchup reads. Kafka’s 99th percentile publish latency can increase from milliseconds to multiple seconds.

Table 4: Publish P99 Latency between Pulsar and Kafka with catching up reads

Local Durability Replication Durability Pulsar Kafka
Publish P99 Latency (ms)
(Mixed Workload)
Sync Ack-1 5.89 13.48
Ack-2 15.39 2091.31
Async Ack-1 10.44 9.51
Ack-2 35.51 1014.95

All of our benchmarks are open source, so curious readers can repeat the test for themselves. Additionally, you can dig deeper into the results and metrics, which are available in the repository.

Although our benchmark is more accurate and more comprehensive than Confluent’s, it doesn’t cover every scenario. Ultimately, no benchmark can replace testing done on your own hardware with your own workloads. We encourage you to evaluate additional variables and scenarios and to test using your own setups and environments.

CONTENT

A Deeper Look at Confluent Benchmark

Confluent used the OpenMessaging Benchmark (OMB) Framework as the basis for their benchmark with a few modifications. In this section, we describe the issues we found in Confluent’s benchmark and explain how they impacted Confluent’s test results and led to erroneous conclusions.

Issues with Confluent Setup

A fundamental problem with Confluent’s benchmark is that Pulsar was not set up properly. (We will talk more about this in the section on the StreamNative Benchmark). In addition to the issues tuning Pulsar, Pulsar and Kafka were set up with different durability guarantees. Because the level of durability impacts performance, the durability settings on both systems must be equivalent for a comparison to be meaningful.

Confluent’s engineers used the default durability guarantee for Pulsar, which is a much stronger guarantee than the configuration which was used for Kafka. Because increasing durability negatively impacts latency and throughput, Confluent’s test placed a much higher demand on Pulsar than it did on Kafka. In the version of Pulsar used by Confluent, there was not yet support for reducing the durability down to a level to match Kafka, but such a future will be released as part of Pulsar in an upcoming release and was used in this test. Had their engineers used this new equivalent durability setting on both systems, the test results would have allowed for an accurate comparison. We certainly don’t fault Confluent’s engineers for not using a not-yet-released feature, however, the writeup failed to provide the necessary context for these results and treated that as though they were equivalent, that additional context which will be provided here.

Issues with OMB Framework

Confluent’s benchmark followed OMB Framework guidelines, which recommend using the same instance type across multiple event streaming systems. However, in our testing we found large amounts of variance among different instances of the same type, particularly when it came to disk IO. In order to minimize this variance, we used the same instances from run to run for both Pulsar and Kafka, which we found significantly helped to increase the accuracy of the results, as even small variations in disk IO performance can result in much larger variance in overall system performance. We would suggest the OMB framework guidelines be updated to include this recommendation in the future.

Issues with Confluent Methodology

Confluent’s benchmark measured only a few, limited scenarios. For example, real-world workloads consist of writes, tailing reads, and catch-up reads. Tailing-reads occur when a consumer is reading recent messages near the “tail” of the log, which was the only scenario tested by Confluent. In contrast, a catch-up read is when a consumer has a large amount of backlog it must consume to “catch-up” to the tail of the log, which is a common and critical task in a real-world system. Catch-up reads, when not taken into account, can severely impact the latency of writes and tailing reads. As Confluent’s benchmark focused only on throughput and end-to-end latency, it fails to give a complete picture of expected behavior across a variety of workloads. Likewise to further give a result closer to real-world use cases, we also considered it important to run the benchmark with varying numbers of subscriptions and partitions. Very few organizations only care about a few topics with a handful of partitions and consumers, they need the ability to have large numbers of different consumers with a large number of distinct topics/partitions to map to their business use cases

To summarize, we have outlined specific issues with Confluent’s methodology in the following table:

Table 5: Issues with Confluent’s benchmark methodology

Parameters Tested Exclusions Limitations
Writes and tailing reads Catch-up reads While maximum throughput and end-to-end latency are useful to illustrate the basic performance characteristics of an event streaming system, limiting the study to two parameters provides only a partial view of system performance.
1 subscription Varying numbers of subscriptions / consumer groups Did not show how the number of subscriptions impacts throughput and latency.
100 partitions Varying numbers of partitions Did not show how the number of partitions impacts throughput and latency.

Many of the issues with Confluent’s benchmark stem from a limited understanding of Pulsar. To help others avoid these problems when running benchmarks in the future, we’ll provide some insights on the technology.

Understanding Pulsar’s durability guarantees is required in order to run an accurate benchmark, so we will begin our discussion there. We’ll start with a general overview of durability in distributed systems, and then explain the differences between the durability guarantees offered by Pulsar and Kafka.

An Overview of Durability in Distributed Systems

Durability refers to maintaining system consistency and availability in the face of external problems, such as a hardware or operating system failure. Single-node storage systems (such as RDBMS) will “fsync” writes to disk to ensure maximum durability. Operating systems will typically buffer writes, which can be lost in the event of failure, but an fsync will ensure these data is written to physical storage. In distributed systems, durability typically comes from replication, with multiple copies of the data being distributed to different nodes that can fail independently. However, it is important not to conflate local durability (fsyncing data) with replication durability, as they both have a distinct purpose. In the following sections, we explain some key differences between these features and why both are important.

Replication durability and local durability

Distributed systems typically provide both replication durability and local durability. Separate mechanisms control each type of durability. You can use these mechanisms in various combinations to set the desired level of durability.

Replication durability is achieved by using an algorithm to create multiple copies of data so the same data can be stored in several locations to improve availability and accessibility. The number of replicas N, determines the system’s failure tolerance, with many systems requiring a “quorum”, or N/2 + 1 nodes, to acknowledge a write. Some systems offer the ability to continue to serve existing data with any single replica still available. This replication mechanism is key to handling the total loss of an instance, with a new instance able to re-replicate data from the existing replicas, while also being critical to availability and consensus (which is beyond the scope of this discussion).

By contrast, local durability determines what an acknowledgement means at the individual node level. This requires fsyncing data to a persistent medium to ensure no data is lost, even if a power failure or hardware failure occurs. An fsync of data ensures that in the event of transient failure, where the machine can recover, the node has all the data for it’s previous acknowledgements.

Durability Modes: Sync vs. Async

Different types of systems offer varying levels of durability guarantees. In general, the level of overall durability any given system can provide depends on the following:

  • Whether the system fsyncs data to local disks
  • Whether the system replicates data to multiple locations
  • When the system acknowledges replication to a peer
  • When the system acknowledges writes to the client

Among different systems, these choices vary widely and not all systems give users the option to control these values, but in general, systems that lack some of these mechanisms (such as replication in non-distributed systems) offer less durability.

To discuss this more concretely, we can define two durability modes which control when a system acknowledges writes, both internally for replication and to the client. These are “sync” and “async”. These modes operate as described below.

  • Sync Durability: The system returns a write response to the peer/client ONLY AFTER the data has been successfully f-synced to local disks (local durability) or replicated to multiple locations (replication durability).
  • Async Durability: The system returns a write response to the peer/client BEFORE the data has been successfully f-synced to local disks (local durability) or replicated to multiple locations (replication durability).

Durability Levels: Measuring durability guarantees

Durability guarantees can take many forms and depend on the following variables:

  • Whether data is stored locally, replicated in multiple locations, or both
  • When writes are acknowledged (sync vs. async)

Like with durability mode, we define some durability “levels” which we can use to differentiate between different distributed systems. We define four levels. Table 6 describes each, from the highest level of durability to the lowest.

Table 6: Durability Levels of Distributed Systems

Level Replication Local Operation
1 Sync Sync The system returns a write response to the client ONLY AFTER the data has been replicated to multiple (at least the majority of) locations AND each replica has been successfully fsync-ed to local disks.
2 Sync Async The system returns a write response to the client ONLY AFTER the data has been replicated to multiple (at least the majority of) locations.There is no guarantee that each replica has successfully fsync-ed to local disks.
3 Async Sync The system returns a write response to the client when one replica has been successfully fsync-ed to a local disk. There is no guarantee that data is replicated to the other locations.
4 Async Async The system returns a write response to the client immediately after the data has been replicated to multiple locations. There are no replication or local durability guarantees.

Most distributed relational database management systems (such as NewSQL databases) guarantee the highest level of durability; therefore, they would be categorized as Level 1.

Much like a database, Pulsar is a Level 1 system that provides the highest level of durability by default. In addition, Pulsar allows the option to customize the desired durability level for each application individually. By contrast, most of Kafka production deployments are configured to operate as either a Level 2 or Level 4 system. Based on our limited knowledge on Kafka, Kafka can be configured to operate as a Level 1 system by setting flush.messages to 1 and flush.ms to 0. But configuring those 2 settings has a severe impact on both throughput and latency. We will discuss it more in our benchmark results.

We’ll look at the durability capabilities of each in detail, beginning with Pulsar.

Durability in Pulsar

Pulsar offers durability guarantees at all levels. It can replicate data to multiple locations and fsync data to local disks. Pulsar has two durability modes (sync and async described earlier). Each option is individually configurable. You can use them in various combinations to customize settings for individual use cases.

Pulsar controls replication durability using a raft-equivalent, quorum-based replication protocol. You can tune the durability mode for replication durability by adjusting the ack-quorum-size and write-quorum-size parameters. The settings for these parameters are described in Table 7 below. The durability levels supported by Pulsar are described in Table 8 below. (A detailed discussion of Pulsar’s replication protocol and consensus algorithm are beyond the scope of this article; however, we will explore these areas in depth in a future blog post.)

Table 7: Durability Configuration Settings in Pulsar

Configuration Settings Durability Mode
Replication ackQuorumSize = 1 Async
ackQuorumSize ≥ writeQuorumSize / 2 + 1 Sync
Local (default)
journalWriteData = true
journalSyncData = true
Sync
journalWriteData = true
journalSyncData = false
Async
journalWriteData = false
journalSyncData = false
Async

Table 8: Durability Levels in Pulsar

Durability Level Replication Durability Local Durability
Level 1 Sync:
ackQuorumSize ≥ writeQuorumSize / 2 + 1
Sync:
journalWriteData = true
journalSyncData = true
Level 3 Async:
ackQuorumSize = 1
Sync:
journalWriteData = true
journalSyncData = true
Level 2 Sync:
ackQuorumSize ≥ writeQuorumSize / 2 + 1
Async:
journalWriteData = true
journalSyncData = false
Level 4 Async:
ackQuorumSize = 1
Async:
journalWriteData = true
journalSyncData = false
Level 2 Sync:
ackQuorumSize ≥ writeQuorumSize / 2 + 1
Async:
journalWriteData = false
journalSyncData = false
Level 4 Async:
ackQuorumSize = 1
Async:
journalWriteData = false
journalSyncData = false

Pulsar controls local durability by writing and/or fsyncing data to a journal disk(s). Pulsar also provides options for tuning the local durability mode using the configuration parameters in Table 9:

Table 9: Pulsar’s Local Durability Mode Parameters

Parameter Description Values
journalWriteData Controls whether a bookie writes data to its journal disks before persisting data to the ledger disks true = enable journaling
false= disable journaling
journalSyncData Controls whether a bookie fsyncs data to journal disks before returning a write acknowledgement to brokers true = enable fsync
false= disable fsync

Durability in Kafka

Kafka offers three durability levels: Level 1, Level 2 and Level 4. Kafka can provide replication durability at Level 2 (default settings), but offers no durability guarantees at Level 4 because it lacks the ability to fsync data to disks before acknowledging writes. Kafka can be configured to operate as a Level 1 system by setting flush.messages to 1 and flush.ms to 0. However such setup is rarely seen in Kafka production deployments.

Kafka’s ISR replication protocol controls replication durability. You can tune Kafka’s replication durability mode by adjusting the acks and min.insync.replicas parameters associated with this protocol. The settings for these parameters are described in Table 10 below. The durability levels supported by Kafka are described in Table 11 below. (A detailed explanation of Kafka’s replication protocol is beyond the scope of this article; however, we will explore how Kafka’s protocol differs from Pulsar’s in a future blog post.)

Table 10: Durability Configuration Settings in Kafka

Configuration Settings Durability Mode
Replication acks = 1 Async
acks = all Sync
Local Default fsync settings Async
flush.messages = 1
flush.ms = 0
Sync

Table 11: Durability Levels in Kafka

Durability Level Replication Durability Local Durability
Level 2 Sync:
acks = all
Async:
Default fsync settings
Level 4 Async:
acks = 1
Async:
Default fsync settings
Level 1 Sync:
acks = all
Sync:
flush.messages = 1
flush.ms = 0
Level 4 Async:
acks = 1
Sync:
flush.messages = 1
flush.ms = 0

Unlike Pulsar, Kafka does not write data to a separate journal disk(s). Instead, Kafka acknowledges writes before fsyncing data to disks. This operation minimizes I/O contention between writes and reads, and prevents performance degradation.

Kafka’s does offer the ability to fsync after every message, with the above flush.messages = 1 and flush.ms = 0, and while this can be used to greatly reduce the likelihood of message loss, however it severely impacts the throughput and latency, which ultimately means such settings is rarely used in production deployments.

Kafka’s inability to journal data makes it vulnerable to data loss in the event of a machine failure or power outage. This is a significant weakness, and one of the main reasons why Tencent chose Pulsar for their new billing system.

Durability Differences Between Pulsar and Kafka

Pulsar’s durability settings are highly configurable and allow users to optimize durability settings to meet the requirements of an individual application, use case, or hardware configuration.

Because Kafka offers less flexibility, depending on the scenario, it is not always possible to establish equivalent durability settings in both systems. This makes benchmarking difficult. To address this, the OMB Framework recommends using the closest settings available.

With this background, we can now describe the gaps in Confluent’s benchmark. Confluent attempted to simulate Pulsar’s fsyncing behavior. In Kafka, the settings Confluent chose provide async durability. However, the settings they chose for Pulsar provide sync durability. This discrepancy produced flawed test results that inaccurately portrayed Pulsar’s performance as inferior. As you will see when we review the results of our own benchmark later, Pulsar performs as well as or better than Kafka, while offering stronger durability guarantees.

StreamNative Benchmark

To get a more accurate picture of Pulsar’s performance, we needed to address the issues with the Confluent benchmark. We focused on tuning Pulsar’s configuration, ensuring the durability settings on both systems were equivalent, and including additional performance factors and conditions, such as varying numbers of partitions and mixed workloads, to enable us to measure performance across different use cases. The following sections explain the changes we made in detail.

StreamNative Setup

Our benchmarking setup included all the durability levels supported by Pulsar and Kafka. This allowed us to compare throughput and latency at the same level of durability. The durability settings we used are described below.

Replication Durability Setup

Our replication durability setup was identical to Confluent’s. Although we made no changes, we are sharing the specific settings we used in Table 12 for completeness.

Table 12: Replication Durability Setup Settings

Durability Mode Configurations
Pulsar Sync ensemble-size=3
write-quorum-size=3
ack-quorum-size=2
Async ensemble-size=3
write-quorum-size=3
ack-quorum-size=1
Kafka Sync replicas=3
acks=all
min.insync.replicas=2
Async replicas=3
acks=1
min.insync.replicas=2

A new Pulsar feature gives applications the option to skip journaling, which relaxes the local durability guarantee, avoids write amplification, and improves write throughput. (This feature will be available in the next release of Apache BookKeeper). However, this feature will not be made the default, nor do we recommend it for most scenarios, as it still introduces the potential for message loss.

We used this feature in our benchmark to ensure an accurate performance comparison between the two systems. Bypassing journaling on Pulsar provides the same local durability guarantee as Kafka’s default fsync settings.

Pulsar’s new feature includes a new local durability mode (Async - Bypass journal). We used this mode to configure Pulsar to match Kafka’s default level of local durability. Table 13 shows the specific settings for our benchmark.

Table 13: Local Durability Setup Settings for StreamNative’s Benchmark

Durability Mode Configurations
Pulsar Sync (default) journalWriteData=true
journalSyncData=true
journalMaxGroupWaitMSec=1
Async
(write to journal)
journalWriteData=true
journalSyncData=false
journalMaxGroupWaitMSec=1
journalPageCacheFlushIntervalMSec=1000
Async
(bypass journaling)
journalWriteData=false
journalSyncData=false
Kafka Sync flush.messages=1
flush.ms=0
Async (default) flush.messages=10000 (default)
flush.ms=1000 (default)

StreamNative Framework

We fixed some issues in Confluent’s OMB Framework fork and corrected configuration errors in their OMB Pulsar driver. The new benchmarking code we developed, including the fixes described below, is available as open source.

Fixes in the OMB Framework

Confluent followed the OMB Framework’s recommendation to use two sets of instances—one for Kafka and another for Pulsar. For our benchmark, we allocated one set of three instances to eliminate variations. In our first test, we deployed all three instances on Pulsar. Then, we repeated the test on Kafka using the same set of instances.

Because we used the same machines for benchmarking different systems, we cleared the filesystem pagecache before each run. This ensured the current test would not be impacted by previous activity.

Fixes in the OMB Pulsar Driver Configuration

We fixed a number of errors in Confluent’s OMB Pulsar driver configuration. The following sections explain the specific changes we made to the broker, bookie, producer, consumer, and Pulsar image.

Broker Changes

Pulsar brokers use the managedLedgerNewEntriesCheckDelayInMillis parameter to determine the length of time (in milliseconds) a catch-up subscription must wait before dispatching messages to its consumers. In the OMB Framework, the value for this parameter was set to 10. This was the main reason why Confluent’s benchmark inaccurately showed Pulsar to have higher latency than Kafka. We changed the value to 0 to emulate Kafka’s latency behavior on Pulsar. After making this change, Pulsar showed significantly better latency than Kafka in all test cases.

Additionally, to optimize performance, we increased the value of the bookkeeperNumberOfChannelsPerBookieparameter from 16 to 64 to prevent any single Netty channel between a broker and a bookie from becoming a bottleneck. Such bottlenecks cause high latency when large volumes of messages accumulate in a Netty IO queue.

We intend on providing this guidance more clearly in the Pulsar documentation to help users who are looking to optimize entirely for end-to-end latency.

Bookie Changes

We added a new bookie configuration to test Pulsar’s performance when bypassing journaling. See the Durability section for a discussion on this and recall that with this feature, we more closely match Kafka’s durability guarantees.

To test the performance of this feature, we built a customized image based on the official Pulsar 2.6.1 release to include this change. (For more details, see Pulsar Image.)

We configured the following settings manually to bypass journaling in Pulsar.

journalWriteData=false
journalSyncData=false

Additionally, we changed the value of the journalPageCacheFlushIntervalMSec parameter from 1 to 1000 to benchmark async local durability (journalSyncData=false) in Pulsar. Increasing the value enabled Pulsar to simulate Kafka’s flushing behavior as described below.

Kafka ensures local durability by flushing dirty pages in the filesystem page cache to disks. Data is flushed by a set of background threads called pdflush. Pdflush is configurable and the wait time between flushes is typically set to 5 seconds. Setting Pulsar’s journalPageCacheFlushIntervalMSec parameter to 1000 is equivalent to a 5-second pdflush interval on Kafka. Making this change enabled us to benchmark async local durability more precisely and achieve a more accurate comparison between Pulsar and Kafka.

Producer Changes

Our batching configuration was identical to Confluent’s with one exception: We increased the switch interval to make it longer than the batch interval. Specifically, we changed the value of the batchingPartitionSwitchFrequencyByPublishDelay parameter from 1 to 2. This change ensured Pulsar’s producer would focus on only one partition during each batching period.

Setting the switch interval and the batch interval to the same value can cause Pulsar to switch partitions more often than necessary, which generates too many small batches and can potentially impact throughput. Making the switch interval larger than the batch interval minimizes this risk.

Consumer Changes

Pulsar clients use receiver queues to apply back pressure when applications are unable to process incoming messages fast enough. The size of the consumer receiver queue can affect end-to-end latency. A larger queue can pre-fetch and buffer more messages than a smaller one.

Two parameters determine the size of the receiver queue: receiverQueueSize and maxTotalReceiverQueueSizeAcrossPartitions. Pulsar calculates the receiver queue size as follows:

Math.min(receiverQueueSize, maxTotalReceiverQueueSizeAcrossPartitions / number of partitions)

For example, if maxTotalReceiverQueueSizeAcrossPartitions is set to 50000 and you have 100 partitions, the Pulsar client sets the consumer’s receiver queue size to 500 on each partition.

For our benchmark, we increased maxTotalReceiverQueueSizeAcrossPartitions from 50000 to 5000000. This tuning optimization ensured consumers would not apply back pressure.

Pulsar Image

We built a customized Pulsar release (v. 2.6.1-sn-16) to include the Pulsar and BookKeeper fixes described above. Version 2.6.1-sn-16 is based on the official Pulsar 2.6.1 release and available to download at here.

StreamNative Methodology

We updated Confluent’s benchmarking methodology to get a more comprehensive view of performance using real-world workloads. Specifically, we made the following changes for our test:

  • Added catch-up reads to evaluate the following:
    • The maximum level of throughput each system can achieve when processing catch-up reads
    • How reads impact publish and end-to-end latency
  • Varied the number of partitions to see how each change impacted throughput and latency
  • Varied the number of subscriptions to see how each change impacted throughput and latency

Our benchmark scenarios measured the following types of workloads:

  • Maximum Throughput: What is the maximum throughput each system can achieve?
  • Publish and Tailing Read Latency: What are the minimum publish and end-to-end tailing latency levels each system can achieve for a given throughput?
  • Catch-up Reads: What is the maximum throughput each system can achieve when reading messages from a large backlog?
  • Mixed Workload: What are the minimum publish and end-to-end tailing latency levels each system can achieve while consumers are catching up? How do catch-up reads impact publish latency and end-to-end tailing latency?

Testbed

The OMB Framework recommends specific testbed definitions (for instance types and JVM configurations) and workload driver configurations (for the producer, consumer, and server side). Our benchmark used the same testbed definitions as Confluent’s. These testbed definitions can be found in our fork within Confluent’s OMB repository.

Below, we highlight the disk throughput and disk fsync latency we observed. These hardware metrics are important to consider when interpreting benchmark results.

Disk Throughput

Our benchmark used the same instance type as Confluent’s—specifically,i3en.2xlarge (with 8 vCores, 64 GB RAM, 2 x 2, 500 GB NVMe SSDs). We confirmed that i3en.2xlarge instances can support up to ~655 MB/s of write throughput across two disks. See the dd result below.

Disk 1
dd if=/dev/zero of=/mnt/data-1/test bs=1M count=65536 oflag=direct
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB) copied, 210.08 s, 327 MB/s

Disk 2
dd if=/dev/zero of=/mnt/data-2/test bs=1M count=65536 oflag=direct
65536+0 records in
65536+0 records out
68719476736 bytes (69 GB) copied, 209.635 s, 328 MB/s

Disk data sync latency

It is critical to capture the fsync latency on NVMe SSDs when running latency-related tests. We observed that the 99th percentile fsync latency on these 3 instances varies from 1 millisecond to 6 milliseconds as shown in the following diagram. As was mentioned earlier, we saw large amounts of variance of disks from different instances. That was primarily manifested in this latency and we found a set of instances that exhibited consistent latency.

Figure 14: 99th percentile fsync latency on 3 different instances

StreamNative Benchmark Results

We have summarized our benchmark results below. You can find our complete benchmark report.

Maximum Throughput Test

See the full report of “Maximum Throughput Test” here.

The Maximum Throughput Test was designed to determine the maximum throughput each system can achieve when processing workloads that include publish and tailing-reads under different durability guarantees. We also varied the number of topic partitions to see how each change impacted the maximum throughput.

We found that:

  1. When configured to provide level-1 durability (sync replication durability and sync local durability), Pulsar achieved a throughput of ~300 MB/s, which reached the physical limit of the journal disk’s bandwidth. Pulsar is implemented on top of a scalable and durable log storage (Apache BookKeeper) to make maximum use of disk bandwidth without sacrificing durability guarantees. Kafka was able to achieve ~420 MB/s with 100 partitions. It should be noted that when providing level-1 durability, Pulsar was configured to use one disk as journal disk for writes and the other disk as ledger disk for reads, comparing to Kafka use both disks for writes and reads. While Pulsar's setup is able to provide better I/O isolation, its throughput was also limited by the maximum bandwidth of a single disk (~300 MB/s). Alternative disk configurations can be beneficial to Pulsar and allow for more cost effective operation, which will be discussed in a later blog post.
  2. When configured to provide level-2 durability (sync replication durability and async local durability), Pulsar and Kafka each achieved a max throughput of ~600 MB/s. Both systems reached the physical limit of disk bandwidth.
  3. The maximum throughput of Kafka on one partition is only ½ of the max throughput of Pulsar.
  4. Varying the number of partitions had no effect on Pulsar’s throughput, but it did affect Kafka’s.
    • Pulsar sustained maximum throughput (~300 MB/s under a level-1 durability guarantee and ~600 MB/s under a level-2 durability guarantee) as the number of partitions was increased from 100 to 2000.
    • Kafka’s throughput decreased by half as the number of partitions was increased from 100 to 2000.

Publish and End-to-End Latency Test

See the full report of “Publish and End-to-End Latency Test” here.

The Publish and End-to-End Latency Test was designed to determine the lowest latency each system can achieve when processing workloads that consist of publish and tailing-reads under different durability guarantees. We varied the number of subscriptions and the number of partitions to see how each change impacted both publish and end-to-end latency.

We found that

  1. Pulsar’s publish and end-to-end latency were significantly (up to hundreds of times) lower than Kafka’s in all test cases, which evaluated various durability guarantees and varying numbers of partitions and subscriptions. Pulsar’s 99th percentile publish latency and end-to-end latency stayed within 10 milliseconds, even as the number of partitions was increased from 100 to 10000 or as the number of subscriptions was increased from 1 to 10.
  2. Kafka’s publish and end-to-end latency was greatly affected by variations in the numbers of subscriptions and partitions.
    • Both publish and end-to-end latency increased from ~5 milliseconds to ~13 seconds as the number of subscriptions was increased from 1 to 10.
    • Both publish and end-to-end latency increased from ~5 milliseconds to ~200 seconds as the number of topic partitions was increased from 100 to 10000.

Catch-up Read Test

See the full report of “Catch-up Read Test” here.

The Catch-up Read Test was designed to determine the maximum throughput each system can achieve when processing workloads that contain catch-up reads only. At the beginning of the test, a producer sent messages at a fixed rate of 200K per second. When the producer had sent 512GB of data, consumers began to read the messages that had been received. The consumers processed the accumulated messages and had no difficulty keeping up with the producer, which continued to send new messages at the same speed.

When processing catch-up reads, Pulsar’s maximum throughput was 3.5 times faster than Kafka’s. Pulsar achieved a maximum throughput of 3.5 GB/s (3.5 million messages/second) while Kafka achieved a throughput of only 1 GB/s (1 million messages/second).

Mixed Workload Test

See the full report of “Mixed Workload Test” here.

This Mixed Workload Test was designed to determine the impact of catch-up reads on publish and tailing reads in mixed workloads. At the beginning of the test, producers sent messages at a fixed rate of 200K per second and consumers consume messages in tailing mode. After the producer produces 512GB of messages, it will start a new set of catch-up consumers to read all the messages from the beginning. At the same time, producers and existing tailing-read consumers continued to publish and consume messages at the same speed.

We tested Kafka and Pulsar using different durability settings and found that catch-up reads seriously affected Kafka’s publish latency, but had little impact on Pulsar. Kafka’s 99th percentile publish latency increased from 5 milliseconds to 1-3 seconds. However, Pulsar maintained a 99th percentile publish latency ranging from several milliseconds to tens of milliseconds.

The links below provide convenient access to individual sections of our benchmark report.

All the raw data of the benchmark results are also available at here.

Conclusion

A tricky aspect of benchmarks is that they often represent only a narrow combination of business logic and configuration options, which may or may not reflect real-world use cases or best practices. Benchmarks can further be compromised by issues in their framework, set-up, and methodology. We noted all of these issues in the recent Confluent benchmark.

At the community’s request, the team at StreamNative set out to run this benchmark in order to provide knowledge, insights, and transparency into Pulsar’s true performance capabilities. In order to run a more accurate benchmark, we identified and fixed the issues with the Confluent benchmark, and also added new test parameters that would provide insights into how the technologies compared in more real-world use cases.

The results to our benchmark showed that, with the same durability guarantee as Kafka, Pulsar is able to outperform Kafka in workloads resembling real-world use cases and to achieve the same end-to-end through as Kafka in Confluent’s limited use case. Furthermore, Pulsar delivers significantly better latency than Kafka in each of the different test cases, including varying subscriptions, topics, and durability guarantees, and better I/O isolation than Kafka.

As noted, no benchmark can replace testing done on your own hardware with your own workloads. We encourage you to test Pulsar and Kafka using your own setups and workloads in order to understand how each system performs in your particular production environment. If you have any questions on Pulsar best practices as you go through, please reach out to us directly, or feel free to join the Pulsar Slack channel.

In the next few months we will publish a series of blog posts to help the community better understand and leverage Pulsar to meet their business needs. Specifically, we will show the performance of Pulsar in different workloads and setups, how to select and size your hardware across different cloud providers and on-prem environments, and show how you can use Pulsar to build the most cost effective streaming platform.

If you have any questions about Pulsar’s storage architecture, its performance characteristics, or the results of this benchmark, we encourage you to join the Pulsar Slack channel to discuss.

© StreamNative, Inc. 2022Apache, Apache Pulsar, Apache BookKeeper, Apache Flink, and associated open source project names are trademarks of the Apache Software Foundation.TermsPrivacy