
TL;DR
Adobe Experience Platform met the challenge of processing over 1 million identity graphs per second by leveraging Spark Structured Streaming and Delta Lake, scaling its pipelines 10x while maintaining privacy compliance. The approach reduced latency, streamlined data handling, and kept the system stable even during peak loads.
Opening
Imagine processing identity data at the scale of a bustling city like San Francisco, every second. Adobe Experience Platform does precisely that. With a staggering 70 billion records flowing through its systems daily, Adobe faces the daunting task of keeping that data fresh while staying compliant with privacy standards. This session dives into how the team tackled these challenges using Spark Structured Streaming and Delta Lake, sharing how they scaled their data pipelines 10x while maintaining performance and compliance.
What You'll Learn (Key Takeaways)
- Leveraging Micro-Batching for Efficiency – By tuning Spark micro-batch trigger intervals and implementing deterministic deduplication, Adobe cut redundant data processing by over 80%, stabilizing workloads and minimizing resource consumption (see the deduplication sketch below).
- Async Task Processing for Latency Reduction – An asynchronous execution model for data ingestion let Adobe offload I/O-heavy tasks, balancing resource utilization and reducing latency without increasing infrastructure costs (see the async ingestion sketch below).
- Addressing Data Skew with Repartitioning – Adobe resolved data skew by introducing explicit repartitioning logic, improving parallelism and reducing task imbalance by over 40% (see the repartitioning sketch below).
- Ensuring Compliance with Delta Lake – Delta Lake's vacuum feature, combined with marker files, let Adobe manage data retention and regulatory obligations, ensuring secure and compliant data deletion (see the retention sketch below).
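
To make the first takeaway concrete, here is a minimal PySpark sketch of a fixed micro-batch trigger combined with watermarked, deterministic deduplication. The paths, column names, trigger interval, and watermark window are illustrative assumptions, not Adobe's actual configuration.

```python
# Minimal sketch: stable micro-batches plus deterministic deduplication.
# All paths, columns, and intervals below are assumptions for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("identity-ingest").getOrCreate()

events = (
    spark.readStream
    .format("delta")
    .load("/tmp/identity/events")  # hypothetical streaming source
)

deduped = (
    events
    # Bound the dedup state so old entries can be evicted deterministically.
    .withWatermark("event_time", "10 minutes")
    # Records with the same (record_id, event_time) inside the watermark
    # window are processed exactly once, cutting redundant work.
    .dropDuplicates(["record_id", "event_time"])
)

query = (
    deduped.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/identity/checkpoints/ingest")
    # A fixed trigger interval keeps micro-batch sizes predictable under
    # bursty input, which is what stabilizes the workload.
    .trigger(processingTime="30 seconds")
    .start("/tmp/identity/graphs")
)
```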
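
The async takeaway can be sketched with foreachBatch plus a driver-side thread pool: the Delta write stays on the critical path while slow side effects are deferred. This is an alternative wiring of the deduped stream from the sketch above; notify_downstream is a hypothetical I/O-heavy call.

```python
# Hedged sketch of asynchronous I/O offloading inside foreachBatch.
from concurrent.futures import ThreadPoolExecutor

io_pool = ThreadPoolExecutor(max_workers=4)  # runs on the driver

def notify_downstream(batch_id):
    """Hypothetical I/O-heavy task, e.g. refreshing an external index."""
    ...

def process_batch(batch_df, batch_id):
    # Keep the critical path synchronous: persist the batch to Delta first.
    batch_df.write.format("delta").mode("append").save("/tmp/identity/graphs")
    # Defer the slow I/O so it never blocks the next micro-batch.
    io_pool.submit(notify_downstream, batch_id)

async_query = (
    deduped.writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "/tmp/identity/checkpoints/async")
    .start()
)
```

Note that deferring work this way trades strict exactly-once guarantees on the side effect for lower latency, so it fits idempotent or best-effort I/O.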
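
For the skew takeaway, explicit repartitioning with key salting is one common way to achieve the parallelism described; the graph_id column and salt factor here are assumptions.

```python
# Sketch: salt hot keys, then repartition explicitly to even out task sizes.
from pyspark.sql import functions as F

SALT_BUCKETS = 16  # assumed fan-out factor

df = spark.read.format("delta").load("/tmp/identity/graphs")  # hypothetical path

salted = (
    df
    # A random salt fans one hot graph_id out across SALT_BUCKETS partitions.
    .withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
    # Explicit repartitioning on (graph_id, salt) balances the tasks.
    .repartition("graph_id", "salt")
)

# Aggregate per (graph_id, salt) first, then merge the partials, so no
# single task has to process an entire hot key by itself.
partials = salted.groupBy("graph_id", "salt").agg(F.count("*").alias("cnt"))
totals = partials.groupBy("graph_id").agg(F.sum("cnt").alias("records"))
```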
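
Finally, the compliance takeaway maps onto Delta Lake's delete-then-vacuum flow. The predicate and retention window below are illustrative, and the marker-file step is a hypothetical stand-in for Adobe's internal convention.

```python
# Sketch of a compliant deletion flow using the Delta Lake Python API.
from delta.tables import DeltaTable

graphs = DeltaTable.forPath(spark, "/tmp/identity/graphs")

# Logically delete every record tied to the erasure request.
graphs.delete("consumer_id = 'abc-123'")  # hypothetical predicate

# Hypothetical marker file: an auditable record that the deletion ran.
(spark.createDataFrame([("abc-123",)], ["consumer_id"])
    .write.mode("append").json("/tmp/identity/deletion-markers"))

# Physically remove deleted files once the retention window elapses, so
# erased data can no longer be recovered via time travel.
graphs.vacuum(retentionHours=168)  # 7 days, Delta's default minimum
```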
Q&A Highlights
Q: How did you evaluate Spark versus Flink, and why choose Spark for this use case?
A: We opted for Spark because it integrates seamlessly with our existing batch pipelines, reducing duplicate code and management overhead. Spark also benefits from managed services on platforms like Databricks, which was crucial for operational efficiency.
Q: Can you explain what identity graphs are and how they work with Spark?
A: Identity graphs unify fragmented identifiers into a single profile, crucial for personalization. Spark's distributed processing handles the large-scale data ingestion efficiently, supporting our proprietary algorithms for identity resolution.
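
To illustrate the shape of the problem (Adobe's actual resolution algorithm is proprietary), identity stitching is often modeled as connected components over an identifier graph. Below is a minimal sketch using the GraphFrames library, assuming its package is on the Spark classpath; the identifiers are made up.

```python
# Hedged illustration: unify fragmented identifiers via connected components.
from graphframes import GraphFrame

# Each identifier is a vertex; an edge means two identifiers were observed
# together (e.g., a login linking an email to a cookie).
vertices = spark.createDataFrame(
    [("email:a@x.com",), ("cookie:123",), ("device:9f",)], ["id"]
)
edges = spark.createDataFrame(
    [("email:a@x.com", "cookie:123"), ("cookie:123", "device:9f")],
    ["src", "dst"],
)

g = GraphFrame(vertices, edges)
spark.sparkContext.setCheckpointDir("/tmp/identity/cc-checkpoints")

# Every connected component becomes one unified profile: an identity graph.
profiles = g.connectedComponents()
profiles.show()
```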
Q: How do you ensure data freshness and avoid stale data issues with in-memory snapshots?
A: We use Netflix Hollow for in-memory snapshots and implement a custom heartbeat check to ensure data remains fresh, preventing stale data from affecting long-running applications.
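
The heartbeat idea can be mirrored in a few lines: a producer periodically records a timestamp, and long-running consumers refuse to trust a snapshot older than a threshold. Hollow itself is a Java library, so this sketch only mirrors the pattern, not its API; the path and threshold are hypothetical.

```python
# Minimal heartbeat-style freshness check (pattern only, not Hollow's API).
import json
import time

HEARTBEAT_PATH = "/tmp/identity/snapshot-heartbeat.json"
MAX_STALENESS_SECONDS = 300  # assumed freshness budget

def write_heartbeat():
    """Producer side: record the time of the latest successful snapshot."""
    with open(HEARTBEAT_PATH, "w") as f:
        json.dump({"ts": time.time()}, f)

def snapshot_is_fresh():
    """Consumer side: reject snapshots older than the staleness budget."""
    with open(HEARTBEAT_PATH) as f:
        ts = json.load(f)["ts"]
    return (time.time() - ts) <= MAX_STALENESS_SECONDS
```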
Q: How does your deployment strategy ensure no downtime?
A: We use a blue-green deployment strategy, relying on dependency injection to switch between environments and monitoring new deployments closely to ensure a smooth transition without any service disruption.
By sharing their journey, Adobe provides valuable insights into scaling real-time data processing while maintaining compliance, offering a robust example for data streaming practitioners seeking to optimize their own pipelines.