
TL;DR
Adobe Experience Platform met the challenge of processing over 1 million identity graphs per second by leveraging Spark Structured Streaming and Delta Lake, scaling its pipelines 10x while maintaining privacy compliance. The approach reduced latency, streamlined data handling, and kept the system stable even during peak loads.
Opening
Imagine processing identity data at the scale of a bustling city like San Francisco, every second. Adobe Experience Platform does precisely that. With a staggering 70 billion records flowing through its systems daily, Adobe faces the daunting task of keeping that data fresh while staying compliant with privacy standards. This session dives into how the team tackled these challenges using Spark Structured Streaming and Delta Lake, sharing how they scaled their data pipelines 10x while maintaining performance and compliance.
What You'll Learn (Key Takeaways)
- Leveraging Micro-Batching for Efficiency – By tuning Spark micro-batch trigger intervals and implementing deterministic deduplication, Adobe cut redundant data processing by over 80%, stabilizing workloads and minimizing resource consumption (see the deduplication sketch below).
- Async Task Processing for Latency Reduction – An asynchronous execution model for data ingestion let Adobe offload I/O-heavy tasks, balancing resource utilization and reducing latency without increasing infrastructure costs (see the async ingestion sketch below).
- Addressing Data Skew with Repartitioning – Adobe resolved data skew by introducing explicit repartitioning logic, improving parallelism and reducing task imbalance by over 40% (see the repartitioning sketch below).
- Ensuring Compliance with Delta Lake – Delta Lake's vacuum feature, combined with marker files, let Adobe manage data retention and regulatory obligations, ensuring secure and compliant data deletion (see the retention sketch below).
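
To make the first takeaway concrete, here is a minimal PySpark sketch of a fixed micro-batch trigger combined with watermarked, deterministic deduplication. The paths, column names, trigger interval, and watermark window are illustrative assumptions, not Adobe's actual configuration.

```python
# Minimal sketch: stable micro-batches plus deterministic deduplication.
# All paths, columns, and intervals below are assumptions for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("identity-ingest").getOrCreate()

events = (
    spark.readStream
    .format("delta")
    .load("/tmp/identity/events")  # hypothetical streaming source
)

deduped = (
    events
    # Bound the dedup state so old entries can be evicted deterministically.
    .withWatermark("event_time", "10 minutes")
    # Records with the same (record_id, event_time) inside the watermark
    # window are processed exactly once, cutting redundant work.
    .dropDuplicates(["record_id", "event_time"])
)

query = (
    deduped.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/identity/checkpoints/ingest")
    # A fixed trigger interval keeps micro-batch sizes predictable under
    # bursty input, which is what stabilizes the workload.
    .trigger(processingTime="30 seconds")
    .start("/tmp/identity/graphs")
)
```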
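
The async takeaway can be sketched with foreachBatch plus a driver-side thread pool: the Delta write stays on the critical path while slow side effects are deferred. This is an alternative wiring of the deduped stream from the sketch above; notify_downstream is a hypothetical I/O-heavy call.

```python
# Hedged sketch of asynchronous I/O offloading inside foreachBatch.
from concurrent.futures import ThreadPoolExecutor

io_pool = ThreadPoolExecutor(max_workers=4)  # runs on the driver

def notify_downstream(batch_id):
    """Hypothetical I/O-heavy task, e.g. refreshing an external index."""
    ...

def process_batch(batch_df, batch_id):
    # Keep the critical path synchronous: persist the batch to Delta first.
    batch_df.write.format("delta").mode("append").save("/tmp/identity/graphs")
    # Defer the slow I/O so it never blocks the next micro-batch.
    io_pool.submit(notify_downstream, batch_id)

async_query = (
    deduped.writeStream
    .foreachBatch(process_batch)
    .option("checkpointLocation", "/tmp/identity/checkpoints/async")
    .start()
)
```

Note that deferring work this way trades strict exactly-once guarantees on the side effect for lower latency, so it fits idempotent or best-effort I/O.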
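
For the skew takeaway, explicit repartitioning with key salting is one common way to achieve the parallelism described; the graph_id column and salt factor here are assumptions.

```python
# Sketch: salt hot keys, then repartition explicitly to even out task sizes.
from pyspark.sql import functions as F

SALT_BUCKETS = 16  # assumed fan-out factor

df = spark.read.format("delta").load("/tmp/identity/graphs")  # hypothetical path

salted = (
    df
    # A random salt fans one hot graph_id out across SALT_BUCKETS partitions.
    .withColumn("salt", (F.rand() * SALT_BUCKETS).cast("int"))
    # Explicit repartitioning on (graph_id, salt) balances the tasks.
    .repartition("graph_id", "salt")
)

# Aggregate per (graph_id, salt) first, then merge the partials, so no
# single task has to process an entire hot key by itself.
partials = salted.groupBy("graph_id", "salt").agg(F.count("*").alias("cnt"))
totals = partials.groupBy("graph_id").agg(F.sum("cnt").alias("records"))
```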
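
Finally, the compliance takeaway maps onto Delta Lake's delete-then-vacuum flow. The predicate and retention window below are illustrative, and the marker-file step is a hypothetical stand-in for Adobe's internal convention.

```python
# Sketch of a compliant deletion flow using the Delta Lake Python API.
from delta.tables import DeltaTable

graphs = DeltaTable.forPath(spark, "/tmp/identity/graphs")

# Logically delete every record tied to the erasure request.
graphs.delete("consumer_id = 'abc-123'")  # hypothetical predicate

# Hypothetical marker file: an auditable record that the deletion ran.
(spark.createDataFrame([("abc-123",)], ["consumer_id"])
    .write.mode("append").json("/tmp/identity/deletion-markers"))

# Physically remove deleted files once the retention window elapses, so
# erased data can no longer be recovered via time travel.
graphs.vacuum(retentionHours=168)  # 7 days, Delta's default minimum
```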
Q&A Highlights
Q: How did you evaluate Spark versus Flink, and why choose Spark for this use case?
A: We opted for Spark because it integrates seamlessly with our existing batch pipelines, reducing duplicate code and management overhead. Spark also benefits from managed services on platforms like Databricks, which was crucial for operational efficiency.
Q: Can you explain what identity graphs are and how they work with Spark?
A: Identity graphs unify fragmented identifiers into a single profile, crucial for personalization. Spark's distributed processing handles the large-scale data ingestion efficiently, supporting our proprietary algorithms for identity resolution.
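
To illustrate the shape of the problem (Adobe's actual resolution algorithm is proprietary), identity stitching is often modeled as connected components over an identifier graph. Below is a minimal sketch using the GraphFrames library, assuming its package is on the Spark classpath; the identifiers are made up.

```python
# Hedged illustration: unify fragmented identifiers via connected components.
from graphframes import GraphFrame

# Each identifier is a vertex; an edge means two identifiers were observed
# together (e.g., a login linking an email to a cookie).
vertices = spark.createDataFrame(
    [("email:a@x.com",), ("cookie:123",), ("device:9f",)], ["id"]
)
edges = spark.createDataFrame(
    [("email:a@x.com", "cookie:123"), ("cookie:123", "device:9f")],
    ["src", "dst"],
)

g = GraphFrame(vertices, edges)
spark.sparkContext.setCheckpointDir("/tmp/identity/cc-checkpoints")

# Every connected component becomes one unified profile: an identity graph.
profiles = g.connectedComponents()
profiles.show()
```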
Q: How do you ensure data freshness and avoid stale data issues with in-memory snapshots?
A: We use Netflix Hollow for in-memory snapshots and implement a custom heartbeat check to ensure data remains fresh, preventing stale data from affecting long-running applications.
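
The heartbeat idea can be mirrored in a few lines: a producer periodically records a timestamp, and long-running consumers refuse to trust a snapshot older than a threshold. Hollow itself is a Java library, so this sketch only mirrors the pattern, not its API; the path and threshold are hypothetical.

```python
# Minimal heartbeat-style freshness check (pattern only, not Hollow's API).
import json
import time

HEARTBEAT_PATH = "/tmp/identity/snapshot-heartbeat.json"
MAX_STALENESS_SECONDS = 300  # assumed freshness budget

def write_heartbeat():
    """Producer side: record the time of the latest successful snapshot."""
    with open(HEARTBEAT_PATH, "w") as f:
        json.dump({"ts": time.time()}, f)

def snapshot_is_fresh():
    """Consumer side: reject snapshots older than the staleness budget."""
    with open(HEARTBEAT_PATH) as f:
        ts = json.load(f)["ts"]
    return (time.time() - ts) <= MAX_STALENESS_SECONDS
```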
Q: How does your deployment strategy ensure no downtime?
A: We use a blue-green deployment strategy, relying on dependency injection to switch between environments and monitoring new deployments closely to ensure a smooth transition without any service disruption.
By sharing their journey, Adobe provides valuable insights into scaling real-time data processing while maintaining compliance, offering a robust example for data streaming practitioners seeking to optimize their own pipelines.