Kafka Under Pressure: Netflix's Blueprint for Unshakeable Kafka Resilience
Jorge Rodriguez
Vinay Rayini

How does a Kafka cluster handle 10× or even 100× traffic spikes while maintaining high throughput and availability? At Netflix, live streaming events place unprecedented demands on our core Kafka infrastructure, requiring innovative solutions to keep services resilient under extreme load.

In this talk, we share Netflix’s blueprint for Kafka resilience, covering strategies that go beyond out-of-the-box configurations to maximize uptime, minimize data loss, and maintain service performance during peak loads.

Key topics include:

  • Broker Stability Under Overload: Techniques to ensure Kafka brokers remain stable even during extreme traffic surges.
  • Adaptive Clients: Transforming producers and consumers into active participants that dynamically adjust behavior in real time to protect cluster health.
  • Operational Insights: Lessons learned from scaling Kafka at Netflix, including monitoring, failure mitigation, and proactive management strategies.
  • High-Throughput Design Patterns: Architectures and operational patterns to sustain performance during unpredictable traffic spikes.
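The abstract does not spell out how an "adaptive client" protects cluster health, but a common pattern for this kind of self-throttling is AIMD (additive increase, multiplicative decrease), the same feedback loop TCP uses for congestion control. The sketch below is purely illustrative and not Netflix's actual mechanism: a producer watches its own observed produce latency and backs off sharply when the broker looks stressed, then probes upward gently when it looks healthy. All names and thresholds here are hypothetical.

```python
class AdaptiveThrottle:
    """Toy AIMD rate controller (hypothetical sketch, not Netflix's
    implementation). A producer could feed each observed produce
    latency into observe() and cap its send rate at self.rate."""

    def __init__(self, initial_rate=1000.0, min_rate=10.0,
                 max_rate=100_000.0, latency_target_ms=50.0):
        self.rate = initial_rate          # permitted sends per second
        self.min_rate = min_rate          # never stall completely
        self.max_rate = max_rate          # never exceed provisioned capacity
        self.latency_target_ms = latency_target_ms

    def observe(self, latency_ms):
        """Adjust the permitted rate from one observed produce latency."""
        if latency_ms > self.latency_target_ms:
            # Broker looks overloaded: back off multiplicatively.
            self.rate = max(self.min_rate, self.rate * 0.5)
        else:
            # Broker looks healthy: probe upward additively.
            self.rate = min(self.max_rate, self.rate + 100.0)
        return self.rate
```

The multiplicative decrease is what makes clients "active participants" in protecting the cluster: when many producers see latency rise at once, their aggregate load drops quickly, while the additive increase recovers throughput gradually once the surge passes.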

Whether you’re a Kafka engineer, platform architect, or operations lead, this talk provides actionable strategies and insights for building resilient, scalable, and high-performing Kafka infrastructures capable of surviving even the most demanding workloads.

Jorge Rodriguez
Senior Software Engineer, Netflix

Jorge Rodriguez is a Senior Software Engineer on the Data Movement Engines team at Netflix. For the past four years, he has been contributing to the Apache Kafka and Flink platforms to enable real-time data processing at Netflix.

Vinay Rayini
Software Engineer, Netflix

Vinay is a Software Engineer on the Data Movement Engines team at Netflix, where he has spent the last two years developing and scaling the Kafka as a Service platform. This platform is crucial for collecting and transporting over 23 trillion events and 50 petabytes of data daily. Previously, he worked at Microsoft and Google on distributed systems and real-time data processing initiatives.
