Jul 24, 2023

Enhance Your Compliance and Data Governance with Apache Pulsar and StreamNative

Gilles Barbier
Head Of Community, StreamNative
Marshall Portwood
Account Executive, StreamNative

Engineering teams in modern enterprises face several challenges at once: meeting strict deadlines, adhering to regulatory requirements, and establishing robust data governance.

Non-compliance can lead to severe penalties, reputational damage, and loss of customer trust. Therefore, it is imperative for organizations to leverage robust technologies that can facilitate comprehensive data compliance, especially in regulated and compliance-driven industries.

One such technology is Apache Pulsar, an open-source distributed streaming and messaging system originally created at Yahoo and now part of the Apache Software Foundation. Several of its features are directly relevant to teams concerned with data compliance:

  • Its multitenancy feature facilitates logical data separation based on teams, applications, or customers.
  • It enables message replay for verifying data processing activities and long-term data retention for maintaining an audit trail.
  • Its built-in schema registry ensures that only predefined data schemas are accepted.
  • It offers fine-grained access control, end-to-end data encryption for secure transport, and support for multiple enterprise-grade authentication protocols to prevent unauthorized access.

We will discuss how these features meet the most demanding technical requirements for building a compliant data management system today, taking GDPR as an example.

These features are just a few of the reasons that enterprises are turning to Apache Pulsar to solve these issues, and choosing StreamNative, a company founded by the original creators of Apache Pulsar, to provide an enterprise-grade, managed solution.

Technical Requirements for Building a Compliant Data Management System Today

Building a compliant data management system is a complex but essential task. It requires key technical capabilities to ensure data security, privacy, integrity, and accessibility to comply with regulatory requirements:

  • Data Security: A compliant data management system must have robust security measures in place. These include encryption of data, strong access controls to prevent unauthorized access, and secure data transfer protocols.
  • Data Privacy: Privacy is a fundamental aspect of data compliance. The system should have measures such as anonymization and pseudonymization to protect sensitive data. It should also ensure secure data storage and provide controls for data subjects to manage their data.
  • Data Integrity: Ensuring the accuracy, consistency, and reliability of data is crucial. The system should have data validation and integrity checks to prevent data corruption or loss. It should also support data versioning to track changes over time.
  • Data Retention and Deletion: Regulatory requirements often specify how long certain types of data should be retained and when they should be deleted. The system should have clear data retention and deletion policies and mechanisms to enforce them.
  • Auditability: A compliant data management system should have comprehensive logging and monitoring capabilities. This allows for auditing and accountability, ensuring that all data processing activities are transparent and traceable.
  • Data Portability: Regulations like GDPR require that individuals be able to move, copy, or transfer their personal data easily from one IT environment to another. The system should support data portability to comply with such requirements.
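
As an illustration of the pseudonymization measure listed under Data Privacy, here is a minimal Python sketch that replaces a user identifier with a keyed hash before an event is published. It is only one possible approach, not something Apache Pulsar provides out of the box. The function names, the `user_id` field, and the key handling are assumptions made for this example.

```python
import hashlib
import hmac


def pseudonymize(user_id: str, secret_key: bytes) -> str:
    """Return a stable keyed pseudonym for user_id (HMAC-SHA256, hex encoded)."""
    return hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_event(event: dict, secret_key: bytes, field: str = "user_id") -> dict:
    """Return a copy of event with the identifying field replaced by its pseudonym.

    The field name "user_id" is a hypothetical default for this sketch.
    """
    safe = dict(event)  # shallow copy so the original event is left untouched
    safe[field] = pseudonymize(str(event[field]), secret_key)
    return safe
```

Because the hash is keyed, the same user always maps to the same pseudonym, so events can still be correlated downstream, while anyone without the key cannot recompute the mapping from a list of known identifiers.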

Building a compliant data management system with these capabilities can be a challenging task. However, technologies like Apache Pulsar can significantly simplify this process. In the following sections, we will explore how.

Key Features of Apache Pulsar for Data Compliance

Apache Pulsar's design and features make it a powerful tool for building a compliant data streaming and messaging system. Here are some key features contributing to data compliance:

  • Multitenancy: Apache Pulsar's multitenancy feature allows for logical separation of data within the same Pulsar instance. This means that data from different teams, applications, or customers (tenants) can be isolated and remain invisible to others. Different tenants can have different policies. For example, one tenant might require data to be retained for seven years due to regulatory requirements, while another tenant might only need data to be retained for one year. Furthermore, having one Pulsar instance (instead of multiple Kafka clusters, for example) inherently reduces risk by allowing centralized management of data security and privacy.
  • Message Replay: Apache Pulsar allows for message replay, which means that data can be reprocessed from a certain point in time. Message replay can be used to verify the accuracy of data processing activities. For example, in an audit, it might be necessary to replay messages to verify that all transactions were processed correctly. 
  • Long-term Retention: With Apache Pulsar, data can be stored for extended periods at a reasonable cost, thanks to its tiered-storage feature. Retaining messages for a defined period gives organizations an audit trail of their data. This can be crucial for investigations or audits to verify that data processing activities comply with internal policies and external regulations. This feature is essential for compliance with regulations that require data to be retained for specific periods.
  • Schema Registry: Apache Pulsar has a built-in schema registry. Using schemas ensures that only data conforming to a predefined schema is accepted, preventing data corruption. Schemas ensure consistency and reliability of data across multiple producers and consumers throughout the organization. The schema registry supports schema evolution, which means that schemas can be updated over time while maintaining compatibility with older versions. This is crucial for data compliance as it allows organizations to adapt to changing data requirements while ensuring that older data is still valid and accessible.
  • Fine-Grained Access Control: Apache Pulsar allows administrators to control who can publish to a topic and who can consume from it. Permissions can be configured at the namespace level or the individual topic level, providing a high degree of flexibility and control. This feature is crucial for maintaining data security and privacy, as many regulations require organizations to ensure that only authorized individuals can access certain types of data.
  • End-to-End Encryption: Apache Pulsar supports end-to-end encryption of data. Messages are encrypted by the producer and decrypted only by the intended consumer, so they remain protected both in transit and while stored, safeguarding data privacy and security.
  • Enterprise-Grade Authentication Protocols: Apache Pulsar supports multiple authentication providers, including JWT, Athenz, Kerberos, and TLS. These enterprise-grade authentication protocols prevent unauthorized access to data.
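
To make the multitenancy and per-tenant policy ideas more concrete, here is a small Python sketch. It relies on one documented fact: a fully qualified Pulsar topic name has the form `persistent://tenant/namespace/topic`. Everything else is an assumption for illustration: the `TopicName` helper, the tenant and namespace names, and the in-memory `RETENTION_DAYS` table, which stands in for policies that a real deployment would set through Pulsar's admin tooling. The parser also ignores the legacy name form that includes a cluster segment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicName:
    """A fully qualified Pulsar topic name split into its parts."""
    domain: str      # "persistent" or "non-persistent"
    tenant: str
    namespace: str
    topic: str

    @classmethod
    def parse(cls, full_name: str) -> "TopicName":
        domain, sep, rest = full_name.partition("://")
        parts = rest.split("/")
        valid_domain = domain in ("persistent", "non-persistent")
        if not sep or not valid_domain or len(parts) != 3 or not all(parts):
            raise ValueError(f"not a fully qualified topic name: {full_name!r}")
        return cls(domain, *parts)

    @property
    def namespace_path(self) -> str:
        """The tenant/namespace pair that policies are attached to in this sketch."""
        return f"{self.tenant}/{self.namespace}"


# Hypothetical policy table: retention in days per tenant/namespace.
RETENTION_DAYS = {
    "finance/ledger": 7 * 365,       # regulated data kept for seven years
    "marketing/campaigns": 365,      # less sensitive data kept for one year
}


def retention_days_for(full_name: str) -> int:
    """Look up the retention period that applies to a topic via its namespace."""
    return RETENTION_DAYS[TopicName.parse(full_name).namespace_path]
```

With this table, `retention_days_for("persistent://finance/ledger/payments")` resolves to the seven-year policy, while a topic under `marketing/campaigns` resolves to one year, mirroring the example in the multitenancy bullet.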

These features of Apache Pulsar support data compliance while also providing flexibility and scalability, making it a suitable choice for organizations of all sizes.

In the next section, we will discuss how Apache Pulsar aligns with the requirements of the General Data Protection Regulation (GDPR).

Apache Pulsar and GDPR Compliance

The General Data Protection Regulation (GDPR) is a critical regulation in the data privacy landscape that applies to all organizations processing the personal data of individuals in the European Union. It imposes strict requirements on data security, privacy, and governance. Let's explore how Apache Pulsar's features align with these requirements:

  • Data Minimization and Purpose Limitation: GDPR mandates that only necessary data should be collected and processed for specified, explicit, and legitimate purposes. Apache Pulsar's schema registry ensures that only data conforming to a predefined schema is accepted, thereby supporting data minimization. Its multitenancy feature allows for logical separation of data, helping ensure that data is processed only for its intended purpose.
  • Data Accuracy: GDPR requires that personal data should be accurate and kept up to date. Apache Pulsar's schema validation helps maintain data accuracy by ensuring that only data conforming to the schema is accepted.
  • Data Security: GDPR requires organizations to implement appropriate technical and organizational measures to ensure data security. Apache Pulsar's features such as access control, end-to-end encryption, and multitenancy provide robust data security.
  • Accountability and Transparency: Under GDPR, organizations must be able to demonstrate compliance with data protection principles and provide transparent information to data subjects about how their data is processed. Apache Pulsar's message replay and comprehensive logging capabilities support auditing and transparency.
  • Data Portability: GDPR gives individuals the right to receive their personal data in a structured, commonly used, and machine-readable format. Apache Pulsar's flexible data processing capabilities make it easy to retrieve and aggregate data of a specific user.
  • Data Retention: GDPR mandates that personal data should not be retained longer than necessary. Apache Pulsar's configurable retention policies allow organizations to define and enforce how long data is kept.
  • Right to Erasure: Under Article 17 of the GDPR, individuals have the right to have their personal data erased in certain circumstances, for example when the data is no longer necessary for the purpose for which it was originally collected. This can be implemented in a few different ways with Apache Pulsar:
      • Topic deletion: The high cardinality of topics in Apache Pulsar allows for an architecture where individual topics can be dedicated to specific users or customers. By deleting these dedicated topics, organizations can effectively honor the right to erasure.
      • Encryption techniques: By deleting the encryption key associated with a particular user's data, an organization effectively renders that user's data irretrievable.
      • Data retention policies: Apache Pulsar's ability to set data retention policies at the namespace or topic level can be leveraged for GDPR compliance. For example, by configuring a policy to discard messages immediately after they are consumed and acknowledged, organizations can ensure that personal data is not unnecessarily retained, aligning with GDPR's data minimization principle and right to erasure.
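
The encryption-based approach to erasure, often called crypto-shredding, can be sketched in a few lines of Python. This is a conceptual illustration only and does not use Pulsar's own encryption API. The `KeyStore` class and the function names are hypothetical, and the SHAKE-256 keystream is a toy stand-in for a vetted authenticated cipher such as AES-GCM, so it should not be used in production.

```python
import hashlib
import secrets


class KeyStore:
    """Hypothetical in-memory store holding one encryption key per user."""

    def __init__(self) -> None:
        self._keys = {}

    def get_or_create(self, user_id: str) -> bytes:
        # Create a fresh 256-bit key the first time a user is seen.
        return self._keys.setdefault(user_id, secrets.token_bytes(32))

    def get(self, user_id: str) -> bytes:
        # Raises KeyError once the user's key has been erased.
        return self._keys[user_id]

    def erase(self, user_id: str) -> None:
        # Deleting the key is the "erasure": stored ciphertext becomes unreadable.
        self._keys.pop(user_id, None)


def _xor_with_keystream(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # Toy stream cipher for illustration only; use a real AEAD cipher in practice.
    stream = hashlib.shake_256(key + nonce).digest(len(data))
    return bytes(a ^ b for a, b in zip(data, stream))


def encrypt_for_user(store: KeyStore, user_id: str, plaintext: bytes) -> bytes:
    """Encrypt plaintext under the user's key; the random nonce is prepended."""
    nonce = secrets.token_bytes(16)
    key = store.get_or_create(user_id)
    return nonce + _xor_with_keystream(key, nonce, plaintext)


def decrypt_for_user(store: KeyStore, user_id: str, payload: bytes) -> bytes:
    """Decrypt a payload; fails with KeyError if the user's key was erased."""
    nonce, body = payload[:16], payload[16:]
    return _xor_with_keystream(store.get(user_id), nonce, body)
```

In this sketch, messages for a user can sit in long-term storage indefinitely. Once `store.erase(user_id)` is called, `decrypt_for_user` fails for every message of that user, which is the effect the encryption bullet describes. The design trades per-message deletion for careful management of the key store itself.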

By aligning with these GDPR requirements, Apache Pulsar can help organizations not only achieve compliance but also build trust with their customers by ensuring the protection of their personal data.

Conclusion

As we have explored in this article, Apache Pulsar, with its robust and flexible features, provides a comprehensive solution to meet the complex requirements of data compliance, especially in highly regulated industries where data lineage, governance, and compliance are critical.

Apache Pulsar's features, such as multitenancy, schema registry, long-term retention, message replay, access control, and end-to-end encryption, all contribute to ensuring data security, privacy, and integrity. For example, Apache Pulsar helps you align with the General Data Protection Regulation (GDPR), making it a compelling choice for organizations operating in or dealing with the European Union.

Of course, it's important to remember that while technology provides the tools for data compliance, it is the organization's responsibility to implement and maintain these tools effectively. Compliance is not a one-time task but an ongoing process that requires continuous monitoring, evaluation, and improvement.

Interested in Pulsar? StreamNative helps engineering teams worldwide make the move to Pulsar. Founded by the original creators of Apache Pulsar, StreamNative is one of the leading contributors to the open-source project and the author of the StreamNative Operators for running Apache Pulsar on Kubernetes, and of StreamNative Cloud, a fully managed service to help teams accelerate time-to-production.
