Kafka Under Pressure: Netflix's Blueprint for Unshakeable Kafka Resilience
Jorge Rodriguez
Vinay Rayini

How does a Kafka cluster handle 10× or even 100× traffic spikes while maintaining high throughput and availability? At Netflix, live streaming events place unprecedented demands on our core Kafka infrastructure, requiring innovative solutions to keep services resilient under extreme load.

In this talk, we share Netflix’s blueprint for Kafka resilience, covering strategies that go beyond out-of-the-box configurations to maximize uptime, minimize data loss, and maintain service performance during peak loads.

Key topics include:

  • Broker Stability Under Overload: Techniques to ensure Kafka brokers remain stable even during extreme traffic surges.
  • Adaptive Clients: Transforming producers and consumers into active participants that dynamically adjust behavior in real time to protect cluster health.
  • Operational Insights: Lessons learned from scaling Kafka at Netflix, including monitoring, failure mitigation, and proactive management strategies.
  • High-Throughput Design Patterns: Architectures and operational patterns to sustain performance during unpredictable traffic spikes.
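The abstract does not spell out how an "adaptive client" protects cluster health, but a common pattern for this kind of self-throttling is AIMD (additive increase, multiplicative decrease), the same feedback loop TCP uses for congestion control. The sketch below is purely illustrative and not Netflix's actual mechanism: a producer watches its own observed produce latency and backs off sharply when the broker looks stressed, then probes upward gently when it looks healthy. All names and thresholds here are hypothetical.

```python
class AdaptiveThrottle:
    """Toy AIMD rate controller (hypothetical sketch, not Netflix's
    implementation). A producer could feed each observed produce
    latency into observe() and cap its send rate at self.rate."""

    def __init__(self, initial_rate=1000.0, min_rate=10.0,
                 max_rate=100_000.0, latency_target_ms=50.0):
        self.rate = initial_rate          # permitted sends per second
        self.min_rate = min_rate          # never stall completely
        self.max_rate = max_rate          # never exceed provisioned capacity
        self.latency_target_ms = latency_target_ms

    def observe(self, latency_ms):
        """Adjust the permitted rate from one observed produce latency."""
        if latency_ms > self.latency_target_ms:
            # Broker looks overloaded: back off multiplicatively.
            self.rate = max(self.min_rate, self.rate * 0.5)
        else:
            # Broker looks healthy: probe upward additively.
            self.rate = min(self.max_rate, self.rate + 100.0)
        return self.rate
```

The multiplicative decrease is what makes clients "active participants" in protecting the cluster: when many producers see latency rise at once, their aggregate load drops quickly, while the additive increase recovers throughput gradually once the surge passes.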

Whether you’re a Kafka engineer, platform architect, or operations lead, this talk provides actionable strategies and insights for building resilient, scalable, and high-performing Kafka infrastructures capable of surviving even the most demanding workloads.

Jorge Rodriguez
Senior Software Engineer, Netflix

Jorge Rodriguez is a Senior Software Engineer on the Data Movement Engines team at Netflix. For the past four years, he has been contributing to the Apache Kafka and Flink platforms to enable real-time data processing at Netflix.

Vinay Rayini
Software Engineer, Netflix

Vinay is a Software Engineer on the Data Movement Engines team at Netflix, where he has spent the last two years developing and scaling the Kafka as a Service platform. This platform is crucial for collecting and transporting over 23 trillion events and 50 petabytes of data daily. Previously, he worked at Microsoft and Google on distributed systems and real-time data processing initiatives.
