We are excited to announce the general availability of the Google Cloud BigQuery sink connector for Apache Pulsar. This connector seamlessly synchronizes Pulsar data to BigQuery in real time, enabling Google Cloud BigQuery to leverage Pulsar and expanding the Apache Pulsar ecosystem.
Why develop the Google Cloud BigQuery sink connector?
Google Cloud BigQuery is a fully managed enterprise data warehouse that enables users to manage and analyze data with built-in features like machine learning, geospatial analysis, and business intelligence.
The Google Cloud BigQuery sink connector provides you with a way to write data from Pulsar to BigQuery in real time. It is a low-code solution with out-of-the-box capabilities such as fault tolerance, scalability, automatic creation and update of table schemas, partitioned tables, clustered tables, and more.
Before this connector was available, the only option was to use the Cloud Storage Sink connector for Pulsar to move data to Cloud Storage and then query that data from BigQuery through external tables (refer to Integrating Apache Pulsar with BigQuery). However, external tables in BigQuery have many limitations, such as no support for clustered tables and poor query performance. This connector enables you to write data directly from Pulsar to BigQuery and supports partitioned and clustered tables.
What are the benefits of using the Google Cloud BigQuery sink connector?
The integration between Google Cloud BigQuery and Apache Pulsar provides four key benefits.
Simplicity: Quickly move data from Apache Pulsar to Google Cloud BigQuery without any user code.
Efficiency: Spend less time configuring the data layer, which leaves more time to extract business value from real-time data.
Scalability: Run in different modes (standalone or distributed). This allows you to build reactive data pipelines to meet the business and operational needs in real time.
Auto Schema: Automatically create and update a table’s schema based on the Pulsar topic schema.
How to get started with the Google Cloud BigQuery sink connector
First, you must run an Apache Pulsar cluster and a Google Cloud BigQuery service.
Prepare the Pulsar service. You can quickly run a Pulsar cluster anywhere by running $PULSAR_HOME/bin/pulsar standalone. Refer to the documentation for details.
Prepare the Google Cloud BigQuery service. See Google Cloud BigQuery Quickstarts for details. Note that you need to set the GOOGLE_APPLICATION_CREDENTIALS environment variable so that the connector can access Google Cloud BigQuery.
Set up the Google BigQuery connector. Download the connector from the Releases page, and then move the jar package to $PULSAR_HOME/connectors.
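Before starting the connector, it can help to confirm that the three preparation steps above actually took effect. The following is a minimal sketch of such a pre-flight check, not part of the connector itself. It assumes that PULSAR_HOME is exported as an environment variable (the text only uses it as a shell path), and the function name check_prerequisites is hypothetical.

```python
import os
from pathlib import Path
from typing import List, Mapping


def check_prerequisites(env: Mapping[str, str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems: List[str] = []

    # Google client libraries read the service-account key path from this variable.
    creds = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not creds:
        problems.append("GOOGLE_APPLICATION_CREDENTIALS is not set")
    elif not Path(creds).is_file():
        problems.append(f"credentials file not found: {creds}")

    # Assumption: PULSAR_HOME is exported; the connector jar lives in its connectors directory.
    pulsar_home = env.get("PULSAR_HOME")
    if not pulsar_home:
        problems.append("PULSAR_HOME is not set")
    else:
        connectors = Path(pulsar_home) / "connectors"
        jars = sorted(connectors.glob("*.jar")) if connectors.is_dir() else []
        if not jars:
            problems.append(f"no jar package found in {connectors}")

    return problems


if __name__ == "__main__":
    found = check_prerequisites(os.environ)
    for problem in found:
        print("missing prerequisite:", problem)
    if not found:
        print("all prerequisites look fine")
```

The check only verifies that files and variables exist. It does not validate the credentials against Google Cloud or confirm that the jar is the BigQuery connector.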
Apache Pulsar provides a Pulsar IO feature to run the connector. Follow the steps below to get the connector up and running.
Configure the sink connector
Create a configuration file named google-bigquery-sink-config.json. The configured connector writes messages from the public/default/google-bigquery-pulsar topic to the test-pulsar table in BigQuery.
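As an illustration of what such a file might contain, here is a minimal sketch that generates google-bigquery-sink-config.json. The top-level fields (tenant, namespace, name, inputs, archive, parallelism, configs) are the standard Pulsar IO sink configuration fields. The keys inside configs (projectId, datasetName, tableName) are assumptions and should be checked against the connector README, as are the sink name, the project and dataset values, and the archive placeholder, none of which appear in the text above.

```python
import json
from pathlib import Path


def build_sink_config(project_id: str, dataset_name: str, table_name: str) -> dict:
    """Build a Pulsar IO sink configuration as a plain dictionary."""
    return {
        # Standard Pulsar IO sink fields.
        "tenant": "public",
        "namespace": "default",
        "name": "google-bigquery-sink",  # hypothetical sink instance name
        # Fully qualified form of the public/default/google-bigquery-pulsar topic.
        "inputs": ["persistent://public/default/google-bigquery-pulsar"],
        # Placeholder: replace with the jar file you moved to $PULSAR_HOME/connectors.
        "archive": "connectors/<connector-jar-file>",
        "parallelism": 1,
        # Connector-specific settings; these key names are assumptions.
        "configs": {
            "projectId": project_id,
            "datasetName": dataset_name,
            "tableName": table_name,
        },
    }


def write_sink_config(config: dict, path: str = "google-bigquery-sink-config.json") -> Path:
    """Serialize the configuration to a JSON file and return its path."""
    out = Path(path)
    out.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return out


if __name__ == "__main__":
    # Hypothetical project and dataset names; test-pulsar is the table named in the text.
    cfg = build_sink_config("my-gcp-project", "my_dataset", "test-pulsar")
    print(f"wrote {write_sink_config(cfg)}")
```

A file like this is typically passed to Pulsar IO with pulsar-admin sinks create --sink-config-file google-bigquery-sink-config.json. Refer to the connector README for the authoritative list of options.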
The Google BigQuery sink connector is a major step in the journey of integrating Pulsar with other big data systems. To get involved with the Google Cloud BigQuery sink connector for Apache Pulsar, check out the following featured resources:
Try out the Google BigQuery sink connector. To get started, download the connector and refer to the README, which walks you through the whole process.
Make a contribution. The Google BigQuery sink connector is a community-driven project whose source code is hosted in the StreamNative GitHub repository. If you have feature requests or bug reports, do not hesitate to share your feedback and ideas or to submit a pull request.
Baodi is a platform engineer at StreamNative. He previously worked at a fintech company for five years, where he was mainly responsible for middleware development. His work focuses on event sourcing, domain-driven design, and real-time computing.