Sep 17, 2020
3 min read

Pulsar Flink Connector 2.5.0

Jianyun Zhao

Pulsar Flink connector 2.5.0 is released on August 28, 2020, thank Pulsar community for the great efforts. The Pulsar Flink connector integrates Apache Pulsar and Apache Flink (the data processing engine), allowing Apache Flink to read/write data from/to Apache Pulsar.

I will introduce some major features in Pulsar Flink connector 2.5.0.

Backgrounds

Apache Flink is a distributed computing engine that is upgraded rapidly. In version 1.11, Apache Flink supports the following new features:

  • The core engine introduces unaligned checkpoints, which improves the fault tolerance mechanism of Flink, and improves checkpointing performance under heavy backpressure.
  • A new Source API that simplifies the implementation of (custom) sources by unifying batch and streaming execution, as well as offloading internals such as event-time handling, watermark generation or idleness detection to Flink.
  • Flink SQL supports Change Data Capture (CDC) to easily consume and interpret database changelogs from tools like Debezium. The renewed FileSystem Connector also expands the set of use cases and formats supported in the Table API/SQL, enabling scenarios like streaming data directly from Kafka to Hive.
  • Multiple performance optimizations to PyFlink, including support for vectorized User-defined Functions (Python UDFs). This improves interoperability with libraries like Pandas and NumPy, making Flink more powerful for data science and ML workloads.

After Apache Flink 1.11 was released, we upgraded the Pulsar Flink connector to support Apache Flink 1.11. We met some difficulties in upgrade:

  • Public APIs supported by Apache Flink 1.11 are changed greatly.
  • Schema, which is originally checked through Table API, is checked at the start-up stage.
  • The connector is converted to Catalog at runtime.

The new Pulsar Flink connector is not compatible with the previous versions. Therefore, we decided to upgrade Pulsar Flink connector through the following two components:

  • pulsar-flink-1.11 module
  • Pulsar Schema
  • Pulsar Schema contains the type structure information of the message. Therefore, Pulsar Schema works well with Flink Table. In Apache Flink 1.9, the SQL type is bound to the physical type and used as Pulsar SchemaType. However, in Apache Flink 1.11, after the Table is changed, the SQL type can only use the default physical type, and Pulsar SchemaType does not support the default physical type of the Apache Flink date and event. We added new native types to Pulsar Schema so that Pulsar Schema can work with the Flink SQL type system.

Major features

Here are some major features introduced in Pulsar Flink connector 2.5.0.

pulsar-flink-1.11 module

This section describes some new features about the pulsar-flink-1.11 module.

Support Apache Flink 1.11 and Flink SQL DDL

In Apache Flink 1.11, some public APIs are added or deleted, causing the Pulsar Flink connectors of Apache Flink 1.9 and Apache Flink 1.11 incompatible. Therefore, the project is divided into two modules to support different Apache Flink versions.

  • Support Apache Flink 1.11.
  • Support Flink SQL Data Definition Language (DDL).
  • Update the topic partition policy to consume/dispatch messages evenly.
  • Make Apache Flink 1.11 compatible with Pulsar Schema.

For more information about implementation, see PR-115.

Support pulsardeserializationSchema interface

Add a PulsarDeserializationSchema interface between actual deserialization and user-defined deserialization schema, so users can use the custom deserialization schema to consume messages.

For more information about implementation, see PR-95.

Support JSON Schema in Flink Sink

Flink Sink supports JSON schema. For more information about implementation, see PR-116.

Implement PulsarCatalog based on GenericInMemoryCatalog

Implement PulsarCatalog based on GenericInMemoryCatalog by extending PulsarCatalog from in-memory catalog. For more information about implementation, see PR-91.

Pulsar Schema

Add Java 8 time and date type to Pulsar primitive Schema. Support Instant, LocalDate, LocalTime, LocalDateTime types in Pulsar primitive Schema. For more information about implementation, see PR-7874.

Thanks

The release of Pulsar Flink connector 2.5.0 is a big milestone for this rapidly developing project. Special thanks to Hang Chen, Zhanpeng Wu, Sijie Guo, and Jianyun Zhao who contributed to this release.

Looking forward to your contributions to Pulsar Flink connector.

Reference

Jianyun Zhao
Jianyun Zhao is a software engineer at StreamNative. Before that, he was responsible for the development of a Real-time computing system at Zhaopin.com.

Related articles

Apr 11, 2024
5 min read

The New CAP Theorem for Data Streaming: Understanding the Trade-offs Between Cost, Availability, and Performance

Mar 31, 2024
5 min read

Data Streaming Trends from Kafka Summit London 2024

Newsletter

Our strategies and tactics delivered right to your inbox

Thank you! Your submission has been received!
Oops! Something went wrong while submitting the form.
Apache Pulsar Announcements
Pulsar Connectors