August 14, 2025
6 min read

Kafka’s Approach to Multi-Region Streaming (Geo-Replicated Architectures with Kafka and Pulsar 2/6)

Matteo Meril
Co-Founder and CTO, StreamNative

TL;DR

  • Apache Kafka achieves geo-replication through separate clusters and replication tools. A Kafka cluster is usually confined to one region, so multi-region streaming means running multiple clusters and copying data between them.
  • MirrorMaker 2 (MM2) is Kafka’s primary tool for cross-cluster replication. It runs as a Kafka Connect-based process that consumes from source topics and produces to target clusters, propagating messages continuously for disaster recovery, data distribution, or migrations.
  • Kafka’s geo-replication is asynchronous and typically active-passive: one cluster is primary (active) and others are secondary copies. Active-active (bidirectional) setups are possible with MirrorMaker 2 but introduce complexity (e.g. naming conflicts, offset translation).
  • Newer options such as Confluent’s Cluster Linking (a Confluent Platform feature) aim to simplify multi-region replication by replicating data broker-to-broker and preserving offsets. However, in open-source Kafka, MirrorMaker 2 remains the go-to solution for multi-region continuity.
  • Key challenges of Kafka’s approach include extra operational overhead (managing MirrorMaker connectors), offset inconsistencies between clusters, and ensuring failover processes so consumers can switch clusters with minimal downtime.

Kafka Clusters and Geographic Boundaries

By design, an Apache Kafka cluster is a tightly coupled set of brokers with a single distributed log. While Kafka’s broker replication (the intra-cluster replication) keeps data redundant across nodes and Availability Zones, it doesn’t by itself replicate data across distant regions. Stretching a single Kafka cluster over high-latency links (e.g. brokers spread across continents) is not recommended – it would suffer from slow replication and leader election issues. Instead, the typical approach is to deploy independent Kafka clusters per region or datacenter, then mirror data between clusters for geo-replication.

In a multi-region Kafka deployment, you might have a cluster “Kafka-US-East” and another “Kafka-US-West,” each serving local clients. To keep them in sync (say, to have US-West as a backup for US-East, or to aggregate events globally), Kafka provides MirrorMaker. The fundamental idea is simple: use a Kafka consumer-producer pair to continuously copy messages from topics in one cluster to topics in another. This can be one-way (unidirectional replication) or two-way (bi-directional between two clusters, if needed).
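
To make that fundamental idea concrete, here is a minimal sketch of the consumer-producer pattern that MirrorMaker packages up for you. The broker addresses, topic name, and group name are placeholders; a real deployment would add error handling, offset tracking, and parallelism.

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class NaiveMirror {
    public static void main(String[] args) {
        // Consume from the source cluster (e.g. "Kafka-US-East") ...
        var consumer = new KafkaConsumer<byte[], byte[]>(Map.of(
                "bootstrap.servers", "kafka-us-east:9092",   // placeholder address
                "group.id", "naive-mirror",
                "key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer",
                "value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer"));

        // ... and produce the same records into the target cluster ("Kafka-US-West").
        var producer = new KafkaProducer<byte[], byte[]>(Map.of(
                "bootstrap.servers", "kafka-us-west:9092",   // placeholder address
                "key.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer",
                "value.serializer", "org.apache.kafka.common.serialization.ByteArraySerializer"));

        consumer.subscribe(List.of("orders"));               // hypothetical topic to mirror
        while (true) {
            for (ConsumerRecord<byte[], byte[]> rec : consumer.poll(Duration.ofSeconds(1))) {
                // Preserve key and value; partitioning is left to the target cluster.
                producer.send(new ProducerRecord<>("orders", rec.key(), rec.value()));
            }
            consumer.commitSync();  // at-least-once: duplicates possible if we crash mid-batch
        }
    }
}
```

MirrorMaker does essentially this, but as a managed, fault-tolerant process with offset tracking, retries, and monitoring built in.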

MirrorMaker 2: Kafka’s Geo-Replication Workhorse

MirrorMaker 2 (MM2) is the evolution of Kafka’s original MirrorMaker (MM1) and is the standard way to replicate data across Kafka clusters in open-source Kafka. MirrorMaker 2 is built on the Kafka Connect framework, which makes it essentially a set of connectors (source, heartbeat, and checkpoint connectors) dedicated to replication. Running MirrorMaker 2 means running a Connect cluster (which can be on separate machines or even co-located on the Kafka brokers) configured to replicate topics of interest from one cluster to another.

Some key aspects of MirrorMaker 2 architecture:

  • It uses Kafka Connect source connectors to consume from the source cluster’s topics, while the Connect framework’s producer writes those records into the target cluster. Internally, it’s effectively the same as a custom consumer reading from Cluster A and producing to Cluster B, but packaged and managed as a fault-tolerant Connect job.
  • MirrorMaker 2 includes a checkpoint connector that periodically records the source cluster’s consumer group offsets in the target cluster (stored in special internal topics). This enables offset translation: if consumers fail over to the target cluster, they know where to resume consuming based on their positions in the old cluster.
  • A heartbeat connector emits periodic heartbeats that are replicated along with regular traffic, which makes it possible to monitor the health and lag of the replication process and tell whether the mirror is “caught up” or falling behind.
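
As a rough illustration (the cluster aliases, addresses, topic patterns, and intervals below are placeholders, not a recommended production configuration), a dedicated MirrorMaker 2 deployment is typically driven by a single properties file passed to connect-mirror-maker.sh:

```properties
# connect-mirror-maker.properties -- minimal sketch of a one-way us-east -> us-west flow
clusters = us-east, us-west
us-east.bootstrap.servers = kafka-us-east:9092
us-west.bootstrap.servers = kafka-us-west:9092

# Enable replication in one direction and pick the topics to mirror
us-east->us-west.enabled = true
us-east->us-west.topics = orders.*, payments.*

# Internal topics used by the checkpoint and heartbeat connectors
checkpoints.topic.replication.factor = 3
heartbeats.topic.replication.factor = 3
offset-syncs.topic.replication.factor = 3

# How often to emit heartbeats and consumer-group checkpoints
emit.heartbeats.interval.seconds = 5
emit.checkpoints.interval.seconds = 60

# Periodically sync translated group offsets into us-west (Kafka 2.7+)
sync.group.offsets.enabled = true
```

Started with ./bin/connect-mirror-maker.sh connect-mirror-maker.properties, this runs the source, checkpoint, and heartbeat connectors described above as a dedicated Connect cluster.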

Using MirrorMaker 2, operators can implement various geo-replication patterns:

  • Active-Passive (Disaster Recovery): Continuously replicate from a primary cluster to a secondary cluster that sits idle until needed. In a DR scenario, the secondary has all the data up to the failure point and can take over serving consumers. This is the classic use case – if Cluster A goes down, Cluster B (which has been mirroring A) becomes the new source of truth.
  • Active-Active (Bidirectional): Set up MM2 to replicate in both directions between two clusters, so that both clusters receive each other’s data. This can support active-active applications (where each region produces some unique data and needs a global view). However, careful configuration is needed to avoid loops – typically, MirrorMaker applies cluster prefixes to mirrored topics to distinguish them. For example, data from Cluster A mirrored to Cluster B might be stored under the topic name “A.topicName” on Cluster B, which prevents the same data from being re-mirrored back and forth endlessly. (A consumer-side sketch of this naming convention follows this list.)
  • Fan-out or Aggregation: One cluster can mirror to multiple target clusters (fan-out), or multiple source clusters can mirror into one central cluster (aggregation). For instance, you might aggregate logs from regional Kafka clusters into one global Kafka cluster for analytics. MirrorMaker 2 allows multiple independent source->target flows to run concurrently.
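
To make the prefix convention concrete, here is a sketch (broker address, topic names, and group name are hypothetical) of a consumer on the West cluster that wants a global view of an orders stream: it subscribes both to the local topic and to the mirrored copy that MM2 created under the east. prefix.

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class GlobalView {
    public static void main(String[] args) {
        var consumer = new KafkaConsumer<String, String>(Map.of(
                "bootstrap.servers", "kafka-us-west:9092",   // placeholder address
                "group.id", "global-orders-view",
                "key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer",
                "value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer"));

        // "orders" holds events produced locally in the West region;
        // "east.orders" is MM2's mirrored copy of the East region's topic.
        consumer.subscribe(List.of("orders", "east.orders"));

        while (true) {
            for (ConsumerRecord<String, String> rec : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("[%s] key=%s value=%s%n", rec.topic(), rec.key(), rec.value());
            }
        }
    }
}
```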

Operationally, MirrorMaker 2 is a separate component to manage. A best practice is to run the MirrorMaker (Connect) workers in the target region (the cluster receiving the data) for efficiency. This “consume remote, produce local” pattern leverages the fact that Kafka producers can be more latency-sensitive than consumers. By consuming over the WAN (which the source cluster can handle) and producing into the local cluster, you reduce latency and the chance of network issues affecting the producer side. It also avoids burdening the source cluster with additional producer load. In practice, that means if you’re mirroring from Region A to Region B, you’d deploy the MirrorMaker connectors in Region B close to the B brokers, so that writes are fast and any network slowness only impacts the reads from A.

Kafka’s documentation and community also emphasize monitoring replication lag – the delay between an event being available in the source and it appearing in the target. MirrorMaker 2’s checkpointing and heartbeats help track this. If lag grows unexpectedly, it might indicate network issues or that MirrorMaker is under-scaled (e.g., too few Connect tasks or workers).

Challenges and Considerations in Kafka’s Geo-Replication

While MirrorMaker 2 gets the job done, it introduces some challenges that architects need to consider:

  • Operational Complexity: You’re effectively running a Kafka Connect cluster alongside your Kafka clusters. This means additional moving parts – configuration, scaling of connect workers, monitoring another system’s health. It’s an extra layer where things can fail independently. As noted in Kafka Improvement Proposals, MirrorMaker (being external to the brokers) can experience outages even when the Kafka clusters are healthy. This requires robust monitoring and possibly automation to restart or reconfigure mirroring if it stops.
  • Delayed Consistency: Kafka’s geo-replication via MM2 is asynchronous. There will always be some lag between the source and the replica – ideally a few seconds or less, but it can grow under load or network strain. During that lag, data written to the source may not yet be in the target. If a disaster strikes at that moment, those last few messages may be missing on the backup (this lag defines your Recovery Point Objective, or RPO). Most setups aim for seconds of lag at most.
  • Consumer Offset Translation: One of the trickiest aspects is handling consumer failover. In Kafka, consumer group offsets are stored in an internal topic (__consumer_offsets). When you replicate data to a second cluster, the consumer offsets in that cluster are not automatically in sync with the primary. MirrorMaker 2’s checkpoint connector addresses this by translating and syncing offsets periodically. Still, if a failover happens, consumers may have to seek to the translated offset positions in the new cluster, and there is potential for confusion or even duplicate/missed messages if the offset sync isn’t fully up to date (a code sketch appears in the failover example below). Confluent’s enterprise feature “Cluster Linking” improves this by preserving offsets exactly (it effectively replicates the log byte-for-byte between clusters), but in open-source Kafka, some careful planning or manual intervention is needed to smoothly switch consumers to a new cluster.
  • Topic Naming and Filtering: By default, MirrorMaker 2 replicates all topics matching a configurable include/exclude pattern. In multi-region architectures, it’s common to use prefixes to prevent collisions or loops. For example, cluster “east” might prepend east. to its topic names when mirroring to “west,” so on the West cluster you have east.topic1 as the mirror of topic1 from East. This means consumers on the West cluster may subscribe to a different topic name. It’s a design decision: do you keep the same topic names on the secondary (and be careful to avoid feeding mirrored data back), or do you use distinct names? The Dattell comparison notes that Kafka “topics in the remote cluster must have a different name than the original” in two-way replication setups – without that, the mirrored data would be picked up again by MirrorMaker on the return path. This complexity is something architects must plan for (often by segregating namespaces or relying on MM2’s built-in prefix-based naming convention for active-active links).
  • Active-Active Conflict Handling: If you attempt an active-active deployment (both clusters accepting writes), you must ensure the same data isn’t produced in both places, or you’ll end up with duplicates. Typically this is done by geo-partitioning the writes (each region writes a subset of messages that the other doesn’t). For instance, Region A handles all events for customers in the Americas, Region B for customers in EMEA, and they replicate to each other so that each holds a read-only copy of the full global dataset. There is no inherent conflict resolution in Kafka’s mirror mechanism; it’s up to your application design to avoid two regions producing the same logical record. (A routing sketch follows this list.)
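
One minimal way to enforce this at the application layer is to route each write to its home region’s producer, so a given customer’s events originate in exactly one cluster. The region mapping, topic, and cluster addresses below are purely illustrative:

```java
import java.util.Map;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class GeoRouter {
    private final Producer<String, String> americas;
    private final Producer<String, String> emea;

    GeoRouter() {
        americas = newProducer("kafka-us-east:9092");   // placeholder addresses
        emea = newProducer("kafka-eu-west:9092");
    }

    /** Each customer has exactly one home region, so no record is produced in both clusters. */
    void send(String customerId, String homeRegion, String payload) {
        Producer<String, String> home = "EMEA".equals(homeRegion) ? emea : americas;
        home.send(new ProducerRecord<>("orders", customerId, payload));
    }

    private static Producer<String, String> newProducer(String bootstrap) {
        return new KafkaProducer<>(Map.of(
                "bootstrap.servers", bootstrap,
                "key.serializer", "org.apache.kafka.common.serialization.StringSerializer",
                "value.serializer", "org.apache.kafka.common.serialization.StringSerializer"));
    }
}
```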

Despite these challenges, Kafka’s approach has been battle-tested in industry. Companies like Uber have built complex mirroring topologies for multi-region Kafka, using tools like uReplicator (Uber’s enhancement of MirrorMaker) to handle massive scale. MirrorMaker 2 significantly improved on MirrorMaker 1 by adding the offset sync and better fault tolerance via Connect.

Additionally, Confluent Cluster Linking (a feature of Confluent Platform since version 7.x) provides an integrated alternative to MirrorMaker. Cluster Linking allows one Kafka cluster’s brokers to directly stream data from another cluster’s brokers, preserving message offsets and transaction semantics. It essentially treats the log as the unit of replication rather than going through producers/consumers. For those using Confluent, this can simplify multi-region setup: you configure a link and the clusters share topics with the same names, making failover easier (no renaming). However, Cluster Linking is not part of Apache Kafka open source (as of 2025, proposals like KIP-786/986 are still in discussion to bring similar ideas into OSS). So, in this series, we’ll focus on the open-source tools, namely MirrorMaker 2, while acknowledging these newer developments.

Example: Disaster Recovery Failover with Kafka

To ground this, imagine you have two Kafka clusters: Cluster A (primary) in us-east, and Cluster B (secondary) in us-west. MirrorMaker 2 is set up to replicate all topics from A to B. Producers and consumers normally all connect to Cluster A. Suddenly, Cluster A experiences a regional outage (network cut or major failure). Here’s what happens in a well-planned setup:

  1. Detection: Monitoring alerts that Cluster A is down. The MirrorMaker processes in Region B also start failing to fetch new data (and heartbeats stop arriving).
  2. Failover Switch: Operators (or an automated system) update client connection info to point producers/consumers to Cluster B, the DR cluster.
  3. Consumer Offset Sync: Thanks to MirrorMaker’s checkpoints, Cluster B knows the offsets each consumer group had reached on A. Consumers either start automatically at the correct position (if using Confluent Cluster Linking, MM2’s group-offset sync, or a custom failover script that reads the stored checkpoints), or they must be pointed to the translated offsets that MirrorMaker maintained (see the sketch after this list).
  4. Resume Operations: Producers now send to Cluster B’s topics, and consumers read from Cluster B. All data up until the outage was already in Cluster B (maybe lagging by a second or two), so the stream continues with minimal interruption.
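
For step 3, MM2 ships a small client utility, RemoteClusterUtils, that reads the checkpoint data in the target cluster and returns translated offsets per partition. A minimal sketch of a failover seek, assuming the MM2 source cluster alias is "us-east" and a hypothetical group name "orders-app" (broker addresses are placeholders):

```java
import java.time.Duration;
import java.util.Map;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.connect.mirror.RemoteClusterUtils;

public class FailoverSeek {
    public static void main(String[] args) throws Exception {
        // Connection properties for the *target* (DR) cluster, Cluster B in us-west.
        Map<String, Object> props = Map.of("bootstrap.servers", "kafka-us-west:9092");

        // Ask MM2's checkpoint data on Cluster B where the group "orders-app"
        // had reached on Cluster A (alias "us-east" in the MM2 config).
        Map<TopicPartition, OffsetAndMetadata> translated =
            RemoteClusterUtils.translateOffsets(props, "us-east", "orders-app", Duration.ofSeconds(30));

        // Seek a consumer on Cluster B to the translated positions before resuming.
        try (var consumer = new KafkaConsumer<byte[], byte[]>(Map.of(
                "bootstrap.servers", "kafka-us-west:9092",
                "group.id", "orders-app",
                "key.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer",
                "value.deserializer", "org.apache.kafka.common.serialization.ByteArrayDeserializer"))) {
            consumer.assign(translated.keySet());
            translated.forEach((tp, om) -> consumer.seek(tp, om.offset()));
            consumer.commitSync(translated);  // persist the new starting point in Cluster B
        }
    }
}
```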

When Cluster A is restored, one could optionally mirror data back or even make Cluster A the primary again (fail-back). However, doing so without duplicates requires careful coordination (perhaps wiping Cluster A and re-mirroring from B to A to get it caught up). This is why many Kafka deployments in practice run in an active-passive DR mode, where the passive is only used during failover, rather than true active-active.

Cross-Region Considerations for Kafka

A few additional points architects should note:

  • Networking and Security: Cross-data-center links should be high-bandwidth and secure. MirrorMaker will happily saturate a network pipe with data. Ensure you have enough throughput between regions and use encryption (Kafka supports SSL for inter-cluster traffic).
  • Ordering Guarantees: MirrorMaker does not provide global ordering across clusters; each cluster maintains its own order per partition. During a failover, consumers might observe some reordering around the switch (especially if some messages were in transit). In critical systems, you might attach an event ID to each message so you can deduplicate or reorder across clusters if needed (see the sketch after this list).
  • Schema and Metadata: Kafka doesn’t automatically replicate schema registry data or other metadata. If you use Confluent Schema Registry, you’d need to replicate schema definitions to the DR site (Confluent offers Schema Linking for that). Likewise, ACLs and security settings need to be mirrored via external processes (or managed uniformly via an IaC approach).
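
As one possible pattern for the deduplication point above (the header name and in-memory cache are illustrative, not a prescribed mechanism), producers can stamp each record with a unique event ID, and consumers can skip IDs they have already processed around a failover:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;
import java.util.UUID;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.header.Header;

public class EventIdDedup {

    /** Producer side: stamp every record with a unique "event-id" header before sending. */
    static ProducerRecord<String, String> withEventId(String topic, String key, String value) {
        ProducerRecord<String, String> rec = new ProducerRecord<>(topic, key, value);
        rec.headers().add("event-id",
                UUID.randomUUID().toString().getBytes(StandardCharsets.UTF_8));
        return rec;
    }

    /** Consumer side: skip records whose event ID was already processed (e.g. around a failover). */
    private final Set<String> processed = new HashSet<>();  // bound or persist this in real systems

    boolean isDuplicate(ConsumerRecord<String, String> rec) {
        Header h = rec.headers().lastHeader("event-id");
        if (h == null) {
            return false;                                    // no ID: treat as new
        }
        String id = new String(h.value(), StandardCharsets.UTF_8);
        return !processed.add(id);                           // add() returns false if already seen
    }
}
```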

In summary, Kafka’s multi-region story is one of multiple clusters bridged by replication processes. It gives you flexibility – you can tailor what to replicate and where – but demands careful setup. The benefit is proven technology and the vast Kafka ecosystem support. The downside is complexity in operating those bridges and in handling edge cases during failover.

In the next part, we’ll see how Apache Pulsar tackles geo-replication from a different angle – baking it into the system itself. Pulsar’s approach can simplify some of the above challenges (like built-in offset management across regions and no need for external processes). But it comes with its own trade-offs, as we’ll explore.

Key Takeaways

  • Kafka itself confines replication to within a cluster (for fault tolerance within a region). Geo-replication in Kafka is achieved by running multiple clusters and copying data between them using tools like MirrorMaker 2.
  • MirrorMaker 2 (Kafka Connect-based) is the de facto solution for multi-region Kafka. It continuously consumes from source topics and produces to target clusters, supporting DR, data migration, and multi-region data distribution use cases.
  • Kafka geo-replication is asynchronous and eventually consistent. There will be a replication lag, so a backup cluster may be seconds behind the primary. Active-passive (one-way replication) is simpler; active-active (bi-directional) requires careful configuration (like distinct topic names or prefixes to avoid infinite loops).
  • Operating Kafka across regions introduces extra overhead: you must manage MirrorMaker/Connect processes, handle offset translation so consumers can fail over without reprocessing data, and ensure monitoring of replication lag and health.
  • Newer features (e.g. Confluent Cluster Linking) improve the story by removing the external process and preserving offsets, but these are not part of open-source Kafka yet. Open-source users rely on MM2 or third-party tools for multi-region setups.
  • In practice, architecting Kafka for multi-region involves planning for failover procedures (how to switch clusters), client reconfiguration, and possibly tooling to synchronize metadata like schemas and ACLs across clusters.
  • Bottom line: Kafka can certainly be made to work in multi-region and hybrid-cloud scenarios – it powers many global systems – but it requires careful architecture. Next, we’ll look at Pulsar, which takes a more integrated approach to geo-replication, potentially simplifying multi-region streaming for architects.
Matteo Meril
Matteo is the CTO at StreamNative, where he brings rich experience in distributed pub-sub messaging platforms. Matteo was one of the co-creators of Apache Pulsar during his time at Yahoo!. Matteo worked to create a global, distributed messaging system for Yahoo!, which would later become Apache Pulsar. Matteo is the PMC Chair of Apache Pulsar, where he helps to guide the community and ensure the success of the Pulsar project. He is also a PMC member for Apache BookKeeper. Matteo lives in Menlo Park, California.
