Latency Numbers Every Data Streaming Engineer Should Know

TL;DR
What “real-time” usually means
- Ultra-low latency (E2E): < 5 ms — tight budgets; no cross-region hops; avoid disk fsync on the hot path.
- Low latency (E2E): 5–100 ms — good for interactive dashboards, alerts, online features.
- Latency-relaxed (E2E): > 100 ms to minutes — fine for near-real-time analytics/ETL; enables aggressive batching & cost savings.
Storage / durability costs
- HDD seek/fsync: 5–20 ms (one flush can consume your entire ultra-low budget).
- SATA/NVMe SSD fsync: ~0.05–1 ms (device & kernel dependent).
- Object storage PUT (e.g., S3): ~10–100 ms until write completes; listing/metadata may add more.
Network reality (one way; round-trip is ~2×)
- Same host / loopback: < 0.1 ms (µs range).
- Same rack / same AZ: ~0.1–0.5 ms one way (~0.2–1 ms RTT).
- Cross-AZ, same region: ~0.5–2 ms one way.
- Cross-region, same continent: ~15–40 ms one way (~30–80 ms RTT).
- Intercontinental: ~50–150 ms one way (~100–300+ ms RTT).
Broker publish (producer → log)
- acks=0/1, same-AZ, SSD: ~0.2–2 ms per write (no replica wait).
- acks=all (sync to quorum), same-AZ: ~0.3–5 ms (adds network + replica fsync).
- Sync replication across AZs: +1–5 ms.
- Sync replication across regions: +50–200+ ms (generally incompatible with sub-100 ms goals).
- Producer batching (linger): adds +N ms intentionally (typical 5–50 ms) to trade latency for throughput.
Consumer side
- Long-poll / push-like fetch: ~sub-millisecond to a few ms once available.
- Polling interval (misconfigured): adds 0–500+ ms directly to E2E.
- Light in-memory transform: typically < 1 ms per record; heavy I/O dominates instead.
End-to-end (wire → result)
- Well-tuned, single-region, durable: commonly ~10–50 ms p50; watch p99.9 tails (GC, bursts) ~50–200+ ms.
- With cross-region sync: expect ~100–300+ ms minimum, dominated by RTT.
Table visibility (e.g., Iceberg)
- Commit interval governs freshness:
- “Fast” configs: ~5–30 s visibility.
- Common enterprise configs: ~1–10 min visibility.
- Rule: visibility ≈ commit cadence (plus seconds for object store/metadata). Use shorter commits for freshness, longer for efficiency.
Sync vs async (what it costs)
- Synchronous replication/commits: add ≥ 1 RTT per replica/round, but give stronger durability/ordering.
- Asynchronous replication/commits: near-local latency, but risk temporary lag or data loss on failure.
Handy heuristics
- If your path includes disk fsync OR cross-AZ, budgeting < 1 ms is unrealistic.
- If your path includes cross-region sync, budgeting < 100 ms is unrealistic.
- To stay < 10 ms E2E, keep everything in one AZ, avoid per-record fsync, and minimize batching/poll delays.
- For cost-optimized pipelines, aim seconds–minutes latency via batching and table commits; keep only hours of “hot” data in the stream.
What Real-Time Really Means (Latency Classes)
In data streaming, “real-time” can mean different things depending on context and requirements. Generally, it implies that data flows and is processed with minimal delay. However, not all real-time systems demand the same speed. We can break down latency targets into a few categories for clarity:
- Ultra-Low Latency (< 5 ms): This is the realm of hard real-time responsiveness. Systems requiring ultra-low latency (on the order of a few milliseconds or less) are typically found in high-frequency trading, real-time control systems, or in-memory data processing. Achieving single-digit-millisecond end-to-end latency usually means highly optimized, specialized infrastructure – for example, colocating services on the same machine (or in the same process), using kernel-bypass networking, or similar techniques. It’s latency-optimized at the expense of higher cost and complexity, since even a single HDD disk seek (≈10 ms) would blow the entire budget. (For perspective, 100 ms is often cited as the threshold at which a response feels instantaneous to a human, so these budgets are at least an order of magnitude tighter than a typical UI interaction.)
- Low Latency (5–100 ms): This range covers the interactive real-time experiences most users and applications consider “real-time.” Anything under a couple hundred milliseconds generally feels immediate for interactive applications (hence the classic “<100 ms” rule of thumb for instant UI feedback). In data streaming, latencies in the tens of milliseconds are often sufficient for use cases like live dashboards, online analytics, or alerting systems. Achieving 5–100 ms latency typically still requires streaming (event-at-a-time) processing rather than long micro-batches, but it’s more forgiving than the ultra-low range. Many real-time stream processing platforms (like Apache Flink with event-at-a-time processing) target latencies well under a second – often in the tens of milliseconds or below when tuned correctly. This level usually involves some optimizations (e.g. in-memory buffering, minimal disk flushes), but may trade off a bit of throughput or cost to stay fast.
- Latency-Relaxed (> 100 ms): Above roughly 100 milliseconds, we’re in near-real-time or latency-relaxed territory. Latencies from several hundred milliseconds to a few seconds might be acceptable for certain analytics, reporting, or ETL scenarios where “real-time” means “within a second or two” rather than instant. In practice, many so-called real-time pipelines in industry tolerate second-level or even minute-level delays if the use case isn’t user-facing or time-critical. For example, updating an analytics dashboard every 5 seconds or even every minute can be fine for business intelligence needs. This latency-relaxed approach often allows more cost-optimized designs – e.g. using micro-batches, compressing data, writing to cheaper storage – because we’re not racing the clock on each event. Essentially, if your application can accept >100 ms delays, you have the freedom to batch and buffer data for efficiency, gaining throughput or reducing cost at the expense of immediacy. (In fact, many “real-time” data streaming systems integrating with data lakes or warehouses are happy with 1–5 minute latencies, which is near-real-time by the broader definition.)
Latency vs. Cost Trade-offs: It’s important to recognize that pushing into ultra-low latency often comes with exponential cost or complexity. For instance, keeping data in an in-memory store or a hot streaming cluster for instant access is far more expensive than writing it to a data lake and querying it with a slight delay. A study at Netflix found that storing long-term data in Kafka was 38× more expensive than storing it in an Apache Iceberg data lake, so they keep only a few hours of hot data in Kafka and tier the rest to the data lake. In general, achieving lower latency might require more computing resources, more careful tuning, or specialized hardware, whereas relaxing latency requirements can dramatically lower costs by allowing more batching and using cheaper storage or network options. Always ask, “Do I need this result in milliseconds, or just soon enough?” – the answer will guide whether to optimize for lowest latency or prioritize simplicity and cost-efficiency.
Understanding the Physics of Latency (Hardware and Network)
Real-time data streaming performance is grounded in some unavoidable physical realities. To design and troubleshoot streaming systems, engineers should know the ballpark latency of fundamental operations in hardware and networks – these form the lower bounds of any system’s latency.
Figure: Relative scale of latency for various operations (logarithmic). A CPU L1 cache reference happens in under a nanosecond, whereas a disk seek may take ~10 milliseconds, about 7 orders of magnitude slower. Network latencies vary by distance: a round-trip within the same data center is ~0.5 ms, while a transcontinental round trip (California ↔ Netherlands) is ~150 ms. These physical delays set the floor for streaming latency.
Storage Latency – HDD vs. SSD/NVMe vs. Memory: Not all storage is equal. Traditional hard disk drives (HDDs) have mechanical seek times on the order of milliseconds. For example, a single disk seek or fsync (force write to disk) might take ~10–20 ms on a spinning disk. This is 10,000× slower than in-memory operations. In a streaming context, if your pipeline depends on flushing to an HDD for durability on each message, that alone could add tens of milliseconds latency (and severely cap throughput to only ~50–100 writes per second per disk). Modern solid-state drives (SSD) and especially NVMe drives are much faster. An SSD can complete a random read in ~150 μs (0.15 ms), and high-end NVMe drives can fsync writes in under 1 ms. In fact, one test showed an Intel Optane NVMe (an extremely fast storage device) could sync a write in about 0.043 ms (that’s 43 microseconds) on average – over 400× faster than an HDD’s 18 ms flush. This huge difference means streaming platforms that write to disk (for example, persisting logs or state) can achieve far lower latency with SSD/NVMe storage. Many distributed log systems (like Apache Kafka or Apache BookKeeper) rely on sequential writes which are faster, but they still benefit from SSDs for low latency commits. The key takeaway: if you need low latency and must touch disk, use the fastest storage possible (or amortize the cost of slow storage with batching), because hardware can be a limiting factor.
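To get a feel for these storage numbers on your own hardware, it helps to time an fsync directly. Below is a minimal, illustrative Java sketch (the file path, record size, and iteration count are arbitrary choices for the example, not anything prescribed above) that measures how long it takes to force a small write to disk:

```java
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class FsyncLatency {
    public static void main(String[] args) throws Exception {
        Path path = Path.of("fsync-test.log"); // place this file on the disk you want to measure
        try (FileChannel ch = FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE)) {
            ByteBuffer record = ByteBuffer.wrap(new byte[1024]); // a 1 KiB "event"
            int iterations = 100;
            long totalNanos = 0;
            for (int i = 0; i < iterations; i++) {
                record.rewind();
                ch.write(record);
                long start = System.nanoTime();
                ch.force(true);            // fsync: block until the data is durable on disk
                totalNanos += System.nanoTime() - start;
            }
            System.out.printf("avg fsync latency: %.3f ms%n",
                    totalNanos / (double) iterations / 1_000_000);
        }
    }
}
```

On a spinning disk, expect results in the millisecond range; on a good NVMe SSD, typically well under a millisecond – the same gap described above.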
Network Latency – Local vs. Cross-Domain: When data doesn’t stay on one machine, network hops become a major contributor to latency. The speed of light (and network infrastructure) imposes delays that no amount of software optimization can eliminate. For instance, a packet round-trip within the same data center (or Availability Zone) is often around 0.5 ms or less. Cloud providers design AZ networks to be very fast and local – latencies ~sub-millisecond are typical within one zone or region. If you communicate across availability zones in the same region, latency might bump up to the low single-digit milliseconds. Measurements in AWS show cross-AZ pings around 1–2 ms on average. This is still quite low, but not zero – crossing outside a single facility adds a slight delay. Now, consider cross-region or long-distance communication: latency grows with distance. Sending data across a continent or ocean will take on the order of tens to hundreds of milliseconds. For example, a round-trip from the west coast of the US to the Netherlands (~5,000 miles) is about 150 ms. Even between New York and San Francisco (around 2,900 miles), a ping might be ~60–80 ms. The rule of thumb is ~5 μs of latency per km of fiber (or ~8 ms per 1000 miles) one way, plus routing overhead – so geography directly impacts network delay. For global data streaming, this means if you’re replicating or forwarding events to another region, you instantly introduce perhaps 50–200+ ms of latency just due to the speed of light and network hops.
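The ~5 µs/km rule of thumb also lets you sanity-check a latency budget before building anything. A quick back-of-the-envelope sketch (the distance is a rough approximation of the New York–San Francisco fiber path):

```java
public class PropagationDelay {
    // ~5 microseconds of one-way latency per km of fiber (rule of thumb from above)
    static final double US_PER_KM = 5.0;

    static double oneWayMs(double km)    { return km * US_PER_KM / 1000.0; }
    static double roundTripMs(double km) { return 2 * oneWayMs(km); }

    public static void main(String[] args) {
        // New York <-> San Francisco, roughly 4,700 km of fiber
        System.out.printf("NYC-SF: one-way ~%.0f ms, round trip ~%.0f ms (propagation only)%n",
                oneWayMs(4_700), roundTripMs(4_700));
    }
}
```

That works out to roughly 23 ms one way and ~47 ms round trip from propagation alone; the observed 60–80 ms ping is that floor plus real-world fiber routing and equipment delays – no amount of software tuning gets below it.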
Implications for Streaming Systems: All these physical latencies add up in a streaming pipeline. If your producer must write to disk, that disk’s latency is a hard floor on how fast you can acknowledge an event. If your stream has to replicate data to a far-away data center, you incur at least the network round-trip latency in doing so. For instance, a streaming system that synchronously replicates messages to another region will always have at least, say, ~100 ms latency minimum just from networking, no matter how optimized the code is. Likewise, if using a distributed storage, a cross-AZ write might add ~1–2 ms each way – seemingly small, but significant when you’re aiming for ~10 ms total. This is why geo-distributed streaming designs often involve trade-offs: either accept higher latency for stronger consistency (data replicated everywhere before use), or relax consistency (async replication) to keep latency low (more on this later). It’s also why co-locating stream processors close to their data sources and sinks is important for low latency – every meter of distance and every hardware boundary (memory vs disk vs network) adds delay. Understanding these baseline numbers (disk = milliseconds, local network = sub-ms, cross-country = dozens of ms) helps an engineer set realistic expectations and choose architectures that meet their latency goals.
Publish, Consume, and End-to-End Latency in Streaming
When we talk about “latency” in a data streaming context, it’s useful to distinguish where that latency comes from. Generally, the user cares about end-to-end latency – the delay from an event being produced to that event being fully processed/visible at its destination. We can break this into two pieces: publish latency on the producer side, and consume latency on the consumer side. Let’s define each:
- Publish Latency (Producer → Broker): This is the time it takes for an event to go from the producer to being durably stored and available on the streaming platform. It includes network transit from the producer to the broker (which could be sub-ms in the same data center, or more if remote), plus any processing the broker does (e.g. writing to a log, replicating to followers, etc.). For example, a producer sending a message to Kafka will typically wait for an acknowledgment. If the broker writes to disk and replicates to followers before acknowledging (for durability), the publish latency includes the disk write and the network hop to followers. A synchronous publish (waiting for replicas) will have higher latency than an async fire-and-forget publish. Tuning factors like Kafka’s acks setting illustrate this: requiring acknowledgement from all replicas (acks="all") adds latency but guarantees durability, whereas acks=1 or 0 responds faster but risks data loss. Batching on the producer side also affects publish latency – e.g., a producer might wait 50 ms to batch multiple events into one request for efficiency, which adds a fixed delay (this is configurable via linger time in Kafka producers, for instance). In summary, publish latency is influenced by network hop from producer to stream, broker processing (disk I/O, etc.), and replication strategy.
- Consume Latency (Broker → Consumer): This is the time it takes for a stored event to be delivered to or fetched by the consumer after it’s available on the broker. In a push-based system this can be very fast (brokers push immediately), whereas in pull-based systems (like Kafka’s default consumer model), there might be a slight delay depending on the polling interval. For example, if a consumer polls for new messages every 100 ms, then an event might sit up to 0–100 ms before the consumer picks it up. Many streaming frameworks and message queues allow long-polling or event-driven consumption to minimize this, so consume latency can often be only a few milliseconds or less once the data is available. That said, consumer side processing can add to latency as well – e.g., how quickly the consumer code or downstream system can process the event once received. If the consumer is doing heavy computation or is bottlenecked, that contributes to end-to-end latency. In practice, a well-tuned streaming consumer will fetch data almost as soon as it arrives (often yielding end-to-end latencies only marginally above the publish latency).
- End-to-End Latency: This is what users ultimately experience – the total time from when data is generated to when it’s processed/useable at its destination. End-to-end latency = publish latency + consume latency (plus any processing time in between). For example, suppose a sensor emits an event at time T0. It’s published to a stream and acknowledged at T0+20 ms, and a consumer picks it up at T0+30 ms, then processes and stores the result by T0+50 ms. The end-to-end latency is 50 ms. In a well-designed streaming pipeline, publish and consume latencies can often be on the order of only a few milliseconds each, yielding end-to-end latencies perhaps tens of milliseconds above the raw network and processing time. But if either side is misconfigured, latency can creep up. For instance, if the producer batches for 100 ms or the consumer only polls every 200 ms, those will directly add to end-to-end delay. P99 latency (99th percentile) is also critical – even if average latency is 50 ms, the slowest 1% of events might take significantly longer due to occasional stalls, GC pauses, or bursts of load. Streaming engineers need to monitor and optimize for these tail latencies as well, since a “real-time” system is only as responsive as its slowest pertinent result.
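One practical way to watch those tail latencies is to stamp events with their creation time and measure the delta when they arrive. The sketch below is a hedged example (broker address, topic name, and sample count are placeholders, and it assumes producer and consumer clocks are reasonably in sync); it relies on Kafka’s record timestamp, which by default is the producer-side create time:

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class EndToEndLatencyProbe {
    public static void main(String[] args) {
        Properties c = new Properties();
        c.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        c.put(ConsumerConfig.GROUP_ID_CONFIG, "latency-probe");
        List<Long> samples = new ArrayList<>();
        try (KafkaConsumer<String, String> consumer =
                 new KafkaConsumer<>(c, new StringDeserializer(), new StringDeserializer())) {
            consumer.subscribe(List.of("events"));
            while (samples.size() < 10_000) {
                consumer.poll(Duration.ofMillis(100)).forEach(record ->
                        // record.timestamp() defaults to producer create time, so the
                        // delta approximates end-to-end delay (assumes synced clocks).
                        samples.add(System.currentTimeMillis() - record.timestamp()));
            }
        }
        samples.sort(null);
        System.out.printf("p50=%d ms  p99=%d ms  p99.9=%d ms%n",
                samples.get(samples.size() / 2),
                samples.get((int) (samples.size() * 0.99)),
                samples.get((int) (samples.size() * 0.999)));
    }
}
```

Plotting p50 alongside p99/p99.9 over time is usually far more revealing than any single average.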
To reduce end-to-end latency, one typically does things like: use small batches (or no batching) on the producer, configure low linger or flush intervals, ensure the broker has adequate I/O throughput (SSD disks, etc.), and have consumers that process promptly. However, each of these can impact throughput or resource usage – again underscoring the latency vs. throughput trade-off. It’s often a balancing act: “How many events per second can I handle at 50 ms latency?” versus “If I allow 500 ms latency, I could batch more and handle much higher throughput.” There’s no free lunch, but understanding where the latency comes from (network, disk, acks, poll intervals) helps target the right optimizations.
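As a concrete illustration of those knobs, here is a hedged sketch of latency-oriented Kafka client settings (the broker address, topic name, and exact values are illustrative assumptions, not recommendations from a benchmark):

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class LowLatencyClients {
    public static void main(String[] args) {
        // Producer: don't hold records back waiting for a bigger batch.
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        p.put(ProducerConfig.LINGER_MS_CONFIG, "0");       // send immediately; raising this trades latency for throughput
        p.put(ProducerConfig.BATCH_SIZE_CONFIG, "16384");  // small batches are fine when linger is 0
        try (KafkaProducer<String, String> producer =
                 new KafkaProducer<>(p, new StringSerializer(), new StringSerializer())) {
            producer.send(new ProducerRecord<>("events", "key", "value"));
        }

        // Consumer: return fetches as soon as any data is available.
        Properties c = new Properties();
        c.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        c.put(ConsumerConfig.GROUP_ID_CONFIG, "low-latency-demo");
        c.put(ConsumerConfig.FETCH_MIN_BYTES_CONFIG, "1");    // don't wait to accumulate bytes
        c.put(ConsumerConfig.FETCH_MAX_WAIT_MS_CONFIG, "10"); // cap broker-side wait during long poll
        try (KafkaConsumer<String, String> consumer =
                 new KafkaConsumer<>(c, new StringDeserializer(), new StringDeserializer())) {
            consumer.subscribe(List.of("events"));
            consumer.poll(Duration.ofMillis(100)).forEach(r ->
                    System.out.println("received: " + r.value()));
        }
    }
}
```

Turning the same dials the other way – a linger.ms of 5–50 ms and a larger fetch.min.bytes – is how you buy throughput and efficiency at the cost of end-to-end delay.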
Data Visibility Latency with Analytical Storage (e.g. Apache Iceberg)
Streaming data often doesn’t end with an in-memory consumer; many pipelines flow into analytical databases or data lakes for further use (such as Apache Iceberg, Delta Lake, etc. which store data on cloud storage). It’s crucial to understand that these systems have a different model of latency – typically micro-batch commits – which can introduce a substantial delay in data visibility. “Data visibility latency” refers to how quickly data that was ingested into the table format becomes queryable or visible to downstream consumers (like analytics jobs or queries on the table).
With Apache Iceberg (a popular table format for data lakes), data is committed in snapshots. A streaming job (e.g., Flink writing to Iceberg) will buffer a set of events into a data file and then commit that file as a new table snapshot. This commit might happen, say, every 5 minutes or every 1 minute – it’s configurable, but there’s a trade-off. Frequent commits = lower latency, but too many small files and metadata overhead; infrequent commits = higher latency, but more efficient batching. In practice, “One to 10 minutes commit intervals are pretty common.” Many organizations choose to commit every few minutes. If you commit a new snapshot every 5 minutes, that means data written to the table may not be visible to readers until that commit occurs. The latency to visibility is thus on the order of the commit interval (plus a tiny processing lag). For example, if an event arrived just after the last commit, it might wait nearly the full interval (almost 5 minutes) before it’s committed and visible; on average, you’d see a couple minutes of latency. As one data engineer put it, this pattern is essentially incremental batch processing – you’ve traded sub-second streaming latency for minute-level latency in exchange for efficiency and lower cost.
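In a Flink-to-Iceberg pipeline, that commit cadence is typically tied to Flink’s checkpoint interval, since the Iceberg sink commits a snapshot when a checkpoint completes. A minimal sketch of that one knob (the 10-second interval is just an example, and the actual source/sink wiring is stubbed out):

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class IcebergCommitCadence {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // The Flink Iceberg sink commits a new table snapshot as each checkpoint
        // completes, so data visibility roughly tracks this interval: 10 s here for
        // "fast" freshness, 1–10 minutes for the more common cost-oriented setups.
        env.enableCheckpointing(10_000); // milliseconds

        // Placeholder pipeline; a real job would read from a stream and end in the
        // Iceberg sink (see the Iceberg Flink connector for the sink builder).
        env.fromSequence(1, 100).print();

        env.execute("iceberg-commit-cadence-demo");
    }
}
```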
To illustrate, in one experiment with an Iceberg streaming source, using a 10-second commit interval caused the downstream consumer to see events with a median latency of ~10 seconds, and a max latency under ~40 seconds. The median matched the commit interval (10 s) because data arrives in chunks at each commit. When the commit interval was larger, say 1 minute, the latency would similarly track around that scale. This stop-and-go pattern is due to Iceberg’s design: it offers atomic, transactional commits for reliability (no partial data visible), but that means holding data until commit. The benefit is strong consistency – readers either see an entire batch or nothing – and excellent throughput (writing big files to S3 efficiently), at the cost of latency. In many analytics scenarios, this is acceptable. As noted by experts, a lot of streaming use cases are “fine with minute-level latency” and explicitly not looking for sub-second results. By using a table format like Iceberg, they get a more cost-effective, simpler pipeline (no continuously running hot storage for every event) and can still achieve end-to-end latencies on the order of minutes, rather than hours for traditional batch.
However, if you need faster visibility, you can shorten the commit interval – some users commit every few seconds for “real-time analytics” tables. Iceberg (and similar systems) can support that, but beware of generating too many small files and excessive metadata load if commits are too frequent. There’s a sweet spot: commonly 1–10 minutes as mentioned, but some do 5 or 10 seconds in specialized cases. Also, some emerging techniques like continuous streaming sinks or table change feeds are making it possible to get lower latencies from data lakes by tailing the commits. Still, as a data streaming engineer, you should know to account for this extra latency when integrating streaming with data lake tables. The “real-time” portion might be snappy, but once you hand off to a system that commits to cloud storage, the latency jumps to whatever the commit policy is. When designing your pipeline, you’d decide: does this use case truly need second-level updates, or can it tolerate a 1–2 minute delay (with a big drop in cost)? Understanding data visibility latency ensures you set the right expectations with consumers of the data. If truly low latency is required all the way, you might keep the data in a fast store (like Kafka or a real-time database) rather than immediately landing it in Iceberg – or use hybrid approaches (hot data in Kafka for instant use, cold data in Iceberg for cost-efficiency).
Synchronous vs. Asynchronous Operations (Impact on Latency and Consistency)
A fundamental design choice in distributed streaming systems is whether operations are done synchronously (blocking/waiting for a result) or asynchronously (proceeding without an immediate confirmation). This choice has big implications for latency (and data safety). Two areas where this comes up are data replication and event processing:
In synchronous replication, every event (or transaction) is durably stored in multiple places before the system acknowledges it as “done”. For a streaming platform, that could mean when a producer publishes a message, the broker waits until, say, 2 other nodes have written the message too (replicated) before sending an ACK to the producer. The obvious advantage is strong consistency and durability – you won’t lose data even if one node crashes right after. The cost, however, is extra latency. The producer’s publish latency now includes one or more network round-trips and disk writes to the replicas. If those replicas are in the same rack, the delay might be small (maybe 1–2 ms); if they’re across data centers, it could be tens of milliseconds or more. Each additional replica or distant node increases the latency, because the commit has to travel further or to more endpoints. As CockroachDB’s engineers note, synchronous replication is robust but comes “at the cost of very high write latency” in widespread clusters – it can be “crippling for many applications” if they can’t tolerate the wait. Thus, not every streaming pipeline uses fully synchronous replication for all data; there’s often a configuration (like Kafka’s acks setting) to choose how many replicas must ack. If you set acks=all (fully sync), you get strongest durability with higher latency; if you set acks=1 (just leader ack), you get lower latency but risk that if the leader dies before followers catch up, that message could be lost.
In asynchronous replication, the idea is “send and pray”. The producer or primary doesn’t wait for the followers/replicas to confirm. For example, a database might commit locally and return “success” to the user, then ship the data to a backup server a moment later. In streaming, you might publish with acks=0 (no wait at all) or acks=1 (wait only for the leader’s own write). This minimizes latency – essentially you’re only as slow as the primary write, which could be just a local disk write. But the trade-off is obvious: if something fails at the wrong time, data might not make it to the backup. Asynchronous systems introduce the possibility of temporary inconsistency (followers lag behind) and data loss if the primary node crashes before sending out the buffered events. In practice, many systems choose a middle ground (like running Kafka producers with acks=1 on a replicated topic – a balance of some durability with minimal latency impact). Also, some systems offer “semi-synchronous” modes or parallel async replication where at least one replica is waited on but others are async, to balance safety and speed.
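To make the trade-off concrete, here is an illustrative Kafka setup (topic name, partition and replica counts, and broker address are placeholder assumptions): the topic demands an in-sync quorum of two replicas, and the producer’s acks setting decides whether a publish waits for that quorum or just for the leader.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.NewTopic;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SyncVsAsyncPublish {
    public static void main(String[] args) throws Exception {
        // Topic with 3 replicas that rejects acks=all writes unless 2 replicas are in sync.
        Properties admin = new Properties();
        admin.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        try (AdminClient client = AdminClient.create(admin)) {
            NewTopic topic = new NewTopic("payments", 6, (short) 3)
                    .configs(Map.of("min.insync.replicas", "2"));
            client.createTopics(List.of(topic)).all().get();
        }

        // "Synchronous" publish: wait for the in-sync quorum before the send completes.
        Properties p = new Properties();
        p.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        p.put(ProducerConfig.ACKS_CONFIG, "all"); // durability first; adds replica network + fsync time
        // p.put(ProducerConfig.ACKS_CONFIG, "1"); // leader-only ack: lower latency, at risk if leader fails
        try (KafkaProducer<String, String> producer =
                 new KafkaProducer<>(p, new StringSerializer(), new StringSerializer())) {
            long start = System.nanoTime();
            producer.send(new ProducerRecord<>("payments", "order-42", "captured")).get(); // block for the ack
            System.out.printf("publish latency: %.2f ms%n", (System.nanoTime() - start) / 1e6);
        }
    }
}
```

Running the same timed send with acks=1 – or fire-and-forget without the .get() – shows the asynchronous end of the spectrum: smaller numbers, weaker guarantees.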
Beyond replication, the sync vs async concept applies to processing workflows too. For instance, consider an event that triggers two downstream actions. If done synchronously, the first action must complete (and maybe respond) before the second starts – ensuring order or simplicity but incurring the cumulative latency of both steps. In an asynchronous (or concurrent) design, those actions could happen in parallel or in a fire-and-forget manner, reducing overall response time seen by the initial trigger. The downside is added complexity: you need to handle out-of-order completions, correlate results, and possibly deal with partial failures. Another example: a stream processing job might do an external database call for each event. If it does so synchronously, each event might incur, say, 50 ms to get a response, severely slowing the pipeline (and if hundreds are concurrent, they queue up). If instead the job uses an asynchronous, non-blocking approach (issuing requests without waiting, and handling responses as they come), it can pipeline those calls and achieve much higher throughput and lower per-event latency – at the cost of a more complex design (e.g. using async I/O, callbacks or promises, etc.). The general rule is, synchronous = simpler but can add latency through waiting, asynchronous = faster throughput and lower waiting time, but more complex and potentially inconsistent intermediate state.
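Frameworks increasingly give you this pattern out of the box; Flink’s Async I/O operator is one example. The sketch below (the external lookup is simulated with a delayed CompletableFuture, and the names, timeout, and capacity are illustrative) keeps up to 100 requests in flight instead of blocking on each one:

```java
import java.util.Collections;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import org.apache.flink.streaming.api.datastream.AsyncDataStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.async.ResultFuture;
import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;

public class AsyncEnrichment {
    // Non-blocking per-event lookup: issue the request now, complete the result later.
    static class LookupFunction extends RichAsyncFunction<Long, String> {
        @Override
        public void asyncInvoke(Long id, ResultFuture<String> resultFuture) {
            CompletableFuture.supplyAsync(() -> {
                try { Thread.sleep(50); } catch (InterruptedException ignored) { } // pretend DB call (~50 ms)
                return "enriched-" + id;
            }).thenAccept(value -> resultFuture.complete(Collections.singleton(value)));
        }
    }

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        DataStream<Long> events = env.fromSequence(1, 1_000);

        // Up to 100 lookups in flight at once; unordered results minimize added latency.
        AsyncDataStream
                .unorderedWait(events, new LookupFunction(), 1, TimeUnit.SECONDS, 100)
                .print();

        env.execute("async-enrichment-demo");
    }
}
```

With a 50 ms lookup per event, the synchronous version of this job would spend almost all of its time waiting; with 100 requests in flight the waits overlap, at the price of handling out-of-order completions.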
When designing a streaming system, decide where you need strong ordering or guarantees (which might force synchronous steps) and where you can afford asynchrony. For example, if losing one event is absolutely unacceptable, you’ll lean towards synchronous replication (and pay the latency cost). If throughput and speed are paramount and occasional loss is tolerable (or mitigated by upstream retry), you might go async. Modern cloud databases highlight this trade-off: a fully synchronous commit across regions might take ~200 ms (too slow for many apps), whereas an async commit might be 5–10 ms but risks losing a second or so of data if a failure occurs. It’s all about business requirements. As a streaming engineer, knowing the latency hit of synchronous operations is key – it might be worth ~50 ms to guarantee order/durability, or it might not, depending on your use case. Often, systems offer tunable consistency levels so you can choose per workload. But whichever path, the influence on latency must be understood and communicated.
Conclusion
Latency is a critical aspect of data streaming systems – it’s often the very reason we choose streaming over batch processing. But “real-time” isn’t one-size-fits-all: it spans from a few milliseconds to a few minutes, and knowing the difference is essential. Every data streaming engineer should internalize the key latency numbers and concepts: how fast hardware can realistically move data (nanoseconds in CPU, microseconds in memory/SSD, milliseconds on disk or across networks), and how those translate into end-to-end pipeline delays. Recognizing these orders of magnitude helps in design (e.g. you won’t expect a cross-country streaming pipeline to ever be 5 ms; physics won’t allow it). It also helps in debugging – if you see 100 ms delays, is it network? Disk? Queuing? The numbers give clues.
Furthermore, engineering for low latency is a balancing act with throughput, cost, and complexity. You’ve seen how batching and relaxing real-time requirements can cut costs dramatically (e.g. using Iceberg with minute-level latency vs. a live feed at sub-second latency). Always tie your latency goals to business needs: if a dashboard updates in 5 seconds instead of 0.5 seconds, does it matter? If not, you might save a lot of money with a simpler, slightly slower design. On the other hand, if milliseconds matter (say, in fraud detection or user experience), then you know where to focus investment – like faster storage, avoiding cross-region hops, or using asynchronous processing to shave off waits.
In summary, the “latency numbers” every streaming engineer should know aren’t just abstract timings – they’re guideposts for making architectural decisions. By knowing what truly constitutes real-time in your context, understanding the physical limits, and measuring publish/consume/visibility delays in your pipeline, you can design streaming systems that meet their SLAs without guesswork. Jeff Dean’s famous list of latency numbers taught programmers to respect the reality of time in computing; similarly, our tour of streaming latency shows that latency is a feature you must budget and engineer for. Equipped with this knowledge, you can more confidently build systems that strike the right balance between blinding speed and practical efficiency – delivering data when and where it’s needed, in real real-time.