Announcing the Hudi Sink Connector for Apache Pulsar
We’re excited to announce the general availability of the Hudi Sink connector for Apache Pulsar. The connector enables seamless integration between Apache Hudi and Apache Pulsar, improving the diversity of the Apache Pulsar ecosystem. The Hudi + Pulsar connector offers a convenient, efficient, and flexible approach to moving data from Pulsar to Hudi without requiring user code.
For more information on why lakehouse technologies are growing in popularity, check out this blog.
See the Hudi Sink connector in action during the Pulsar Summit SF 2022 talk Unlocking the Power of Lakehouse Architectures with Pulsar and Hudi from Addison Higham (Chief Architect, StreamNative) and Alexey Kudinkin (Founding Engineer, Onehouse).
What is the Hudi Sink connector?
The Hudi Sink connector is a Pulsar IO connector that pulls data from Apache Pulsar topics and persists data to Hudi tables.
Why develop the Hudi Sink connector?
In the last 5 years, the rise of streaming data and the need for lower data latency have pushed data lakes to their limits. As a result, lakehouse technologies such as Apache Hudi have seen rapid adoption. Apache Pulsar, a distributed, open-source pub-sub messaging and streaming platform for real-time workloads, is a natural fit for lakehouse architectures. Integrating Apache Pulsar with Lakehouse streamlines data lifecycle management and data analysis.
StreamNative built the Hudi Sink Connector to provide Hudi users with a way to connect the flow of messages from Pulsar and use more powerful features, while avoiding problems with connectivity that can appear when there are intrinsic differences1 between systems or privacy requirements.
The connector solves this problem by fully integrating with Pulsar (including its serverless functions, per-message processing, and event-stream processing). The connector presents a low-code solution with out-of-the-box capabilities such as multi-tenant connectivity, geo-replication, protocols for direct connection to end-user mobile clients or IoT clients, and more.
What are the benefits of using the Hudi Sink connector?
The integration between Hudi and Apache Pulsar provides three key benefits:
- Simplicity: Quickly move data from Apache Pulsar to Hudi without any user code.
- Efficiency: Reduce your time spent configuring the data layer. This means you more time to discover the maximum business value from real-time data in an effective way.
- Scalability: Run in different modes (standalone or distributed). This allows you to build reactive data pipelines to meet business and operational needs in real time.
How do I get started with the Hudi Sink connector?
The following example shows how to configure the connector running in a standalone Pulsar service.
Prerequisites
First, you need to prepare these components:
- Pulsar 2.10.1
- Lakehouse connector 2.10.1.1
- Python3
- pyspark 3.2.1
Initialize the pyspark environment.
1. Create a virtualenv with python3.
2. Download pyspark.
Apache Pulsar provides a Pulsar IO feature to run the connector. Follow the steps below to quickly get the connector up and running.
Configure the sink connector with local Filesystem
1. Decompress the Pulsar package and go to its root directory.
2. Start the Pulsar service with the daemon client tool.
3. Create a directory for storing table data.
4. Create the sink configuration file hudi-sink.json. Note that you need to update archive and hoodie.base.path to the correct path.
5. Submit the Hudi sink with pulsar-admin.
6. Check the sink status to confirm it is running.
The expected output:
numRunning shows 1 and running shows true mean that the sink is started successfully.
7. Produce 100 messages to the topic test-hudi-pulsar to make Hudi flush records to the table hudi-connector-test.
8. Check the sink status to confirm the message consumed.
The expected output:
numReadFromPulsar shows 10 and numWrittenToSink shows 10 mean that the messages are written into the sink.
9. Start pyspark with Hudi.
10. Execute the following code in the pyspark.
Then it will show the table hudi-connector-test content, which is produced from the Pulsar topic test-hudi-pulsar.
How can I get involved?
The Hudi Sink connector is a major step in the journey of integrating lakehouse systems into the Pulsar ecosystem. To get involved with the Hudi Sink connector for Apache Pulsar, check out the following featured resources:
- Try out the Hudi Sink connector. To get started, download the connector and refer to the ReadMe that walks you through the whole process.
- Make a contribution. The Hudi Sink connector is a community-driven service, which hosts its source code on the StreamNative GitHub repository. If you have any feature requests or bug reports, do not hesitate to share your feedback and ideas and submit a pull request.
- Contact us. Feel free to create an issue on GitHub, send emails to the Pulsar mailing list, or message us on Twitter to get answers from Pulsar experts.
- Pulsar Summit Europe 2023 is taking place virtually on May 23rd. Engage with the community by submitting a CFP or becoming a community sponsor (no fee required).
1. Intrinsic differences exist between platforms that have no notion of schema and the ones that have sophisticated schema capabilities because there is no simple way to translate between them. These platform differences range from traditional messaging like Amazon SQS to multi-level hierarchical Avro schema written to a data lake. Distinctions also exist between platforms relying on different data representations, such as Pandas DataFrames and simple messages.
Newsletter
Our strategies and tactics delivered right to your inbox