Pulsar Flink Connector 2.5.0
Pulsar Flink connector 2.5.0 is released on August 28, 2020, thank Pulsar community for the great efforts. The Pulsar Flink connector integrates Apache Pulsar and Apache Flink (the data processing engine), allowing Apache Flink to read/write data from/to Apache Pulsar.
I will introduce some major features in Pulsar Flink connector 2.5.0.
Backgrounds
Apache Flink is a distributed computing engine that is upgraded rapidly. In version 1.11, Apache Flink supports the following new features:
- The core engine introduces unaligned checkpoints, which improves the fault tolerance mechanism of Flink, and improves checkpointing performance under heavy backpressure.
- A new Source API that simplifies the implementation of (custom) sources by unifying batch and streaming execution, as well as offloading internals such as event-time handling, watermark generation or idleness detection to Flink.
- Flink SQL supports Change Data Capture (CDC) to easily consume and interpret database changelogs from tools like Debezium. The renewed FileSystem Connector also expands the set of use cases and formats supported in the Table API/SQL, enabling scenarios like streaming data directly from Kafka to Hive.
- Multiple performance optimizations to PyFlink, including support for vectorized User-defined Functions (Python UDFs). This improves interoperability with libraries like Pandas and NumPy, making Flink more powerful for data science and ML workloads.
After Apache Flink 1.11 was released, we upgraded the Pulsar Flink connector to support Apache Flink 1.11. We met some difficulties in upgrade:
- Public APIs supported by Apache Flink 1.11 are changed greatly.
- Schema, which is originally checked through Table API, is checked at the start-up stage.
- The connector is converted to Catalog at runtime.
The new Pulsar Flink connector is not compatible with the previous versions. Therefore, we decided to upgrade Pulsar Flink connector through the following two components:
- pulsar-flink-1.11 module
- Pulsar Schema
- Pulsar Schema contains the type structure information of the message. Therefore, Pulsar Schema works well with Flink Table. In Apache Flink 1.9, the SQL type is bound to the physical type and used as Pulsar SchemaType. However, in Apache Flink 1.11, after the Table is changed, the SQL type can only use the default physical type, and Pulsar SchemaType does not support the default physical type of the Apache Flink date and event. We added new native types to Pulsar Schema so that Pulsar Schema can work with the Flink SQL type system.
Major features
Here are some major features introduced in Pulsar Flink connector 2.5.0.
pulsar-flink-1.11 module
This section describes some new features about the pulsar-flink-1.11 module.
Support Apache Flink 1.11 and Flink SQL DDL
In Apache Flink 1.11, some public APIs are added or deleted, causing the Pulsar Flink connectors of Apache Flink 1.9 and Apache Flink 1.11 incompatible. Therefore, the project is divided into two modules to support different Apache Flink versions.
- Support Apache Flink 1.11.
- Support Flink SQL Data Definition Language (DDL).
- Update the topic partition policy to consume/dispatch messages evenly.
- Make Apache Flink 1.11 compatible with Pulsar Schema.
For more information about implementation, see PR-115.
Support pulsardeserializationSchema interface
Add a PulsarDeserializationSchema interface between actual deserialization and user-defined deserialization schema, so users can use the custom deserialization schema to consume messages.
For more information about implementation, see PR-95.
Support JSON Schema in Flink Sink
Flink Sink supports JSON schema. For more information about implementation, see PR-116.
Implement PulsarCatalog based on GenericInMemoryCatalog
Implement PulsarCatalog based on GenericInMemoryCatalog by extending PulsarCatalog from in-memory catalog. For more information about implementation, see PR-91.
Pulsar Schema
Add Java 8 time and date type to Pulsar primitive Schema. Support Instant, LocalDate, LocalTime, LocalDateTime types in Pulsar primitive Schema. For more information about implementation, see PR-7874.
Thanks
The release of Pulsar Flink connector 2.5.0 is a big milestone for this rapidly developing project. Special thanks to Hang Chen, Zhanpeng Wu, Sijie Guo, and Jianyun Zhao who contributed to this release.
Looking forward to your contributions to Pulsar Flink connector.
Reference
Newsletter
Our strategies and tactics delivered right to your inbox