Announcing the Google Cloud BigQuery Sink Connector for Apache Pulsar

Baodi Shi

Platform Engineer at StreamNative

We are excited to announce the general availability of the Google Cloud BigQuery sink connector for Apache Pulsar. This connector seamlessly synchronizes Pulsar data to BigQuery in real time, enabling Google Cloud BigQuery to leverage Pulsar and expanding the Apache Pulsar ecosystem.

What is the Google Cloud BigQuery sink connector?

The Google BigQuery sink connector pulls data from Pulsar topics and persists data to Google Cloud BigQuery tables.

Why develop the Google Cloud BigQuery sink connector?

Google Cloud BigQuery is a fully managed enterprise data warehouse that enables users to manage and analyze data with built-in features like machine learning, geospatial analysis, and business intelligence.

The Google Cloud BigQuery sink connector provides you with a way to write data from Pulsar to BigQuery in real time. It presents a low-code solution with out-of-the-box capabilities like strong fault tolerance, great scalability, automatic creation and update of table schema, partitioned tables, clustered tables, and many more.

Before this connector was available, you could only use the Cloud Storage sink connector for Pulsar to move data to Cloud Storage, and then query that data from BigQuery through external tables (refer to Integrating Apache Pulsar with BigQuery). However, external tables in BigQuery have many limitations, such as no support for clustered tables and poor query performance. This connector lets you write data directly from Pulsar to BigQuery and supports partitioned and clustered tables.

What are the benefits of using the Google Cloud BigQuery sink connector?

The integration between Google Cloud BigQuery and Apache Pulsar provides four key benefits.

Simplicity: Quickly move data from Apache Pulsar to Google Cloud BigQuery without any user code.
Efficiency: Spend less time configuring the data layer, which leaves more time to extract business value from real-time data.
Scalability: Run in different modes (standalone or distributed). This allows you to build reactive data pipelines to meet the business and operational needs in real time.
Auto Schema: Automatically create and update a table’s schema based on the Pulsar topic schema.

How to get started with the Google Cloud BigQuery sink connector

Prerequisites

First, you must run an Apache Pulsar cluster and a Google Cloud BigQuery service.

Prepare the Pulsar service. You can quickly run a Pulsar cluster anywhere by running $PULSAR_HOME/bin/pulsar standalone. Refer to the documentation for details.
Prepare the Google Cloud BigQuery service. See Google Cloud BigQuery Quickstarts for details. Note that you need to set up the GOOGLE_APPLICATION_CREDENTIALS environment variable to access Google BigQuery.
Set up the Google BigQuery connector. Download the connector from the Releases page, and then move the jar package to $PULSAR_HOME/connectors.
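The prerequisites above can be sketched as a few shell commands. This is a minimal sketch, not an official setup script: the install path, the key file location, and the assumption that the connector jar was downloaded into the current directory are all hypothetical and should be adjusted to your environment.

```shell
# Assumption: Pulsar is unpacked at this hypothetical path; adjust as needed.
export PULSAR_HOME="${PULSAR_HOME:-$HOME/apache-pulsar}"

# Assumption: hypothetical location of a service account key file with access to BigQuery.
export GOOGLE_APPLICATION_CREDENTIALS="${GOOGLE_APPLICATION_CREDENTIALS:-$HOME/keys/bigquery-service-account.json}"

# Move any downloaded connector jar from the current directory into the connectors directory.
mkdir -p "$PULSAR_HOME/connectors"
for jar in ./pulsar-io-bigquery-*.jar; do
  if [ -e "$jar" ]; then
    mv "$jar" "$PULSAR_HOME/connectors/"
  fi
done

# Start a local standalone cluster (runs in the foreground, so it is left commented out here):
# "$PULSAR_HOME/bin/pulsar" standalone
```

Exporting the variables in the same shell that later launches the connector keeps the credentials visible to the sink process.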

Apache Pulsar provides a Pulsar IO feature to run the connector. Follow the steps below to get the connector up and running.

Configure the sink connector

1. Create a configuration file named google-bigquery-sink-config.json. The configured connector writes messages from the public/default/google-bigquery-pulsar topic to the test-pulsar table in BigQuery.

<script>
{
  "name": "google-bigquery-sink",
  "archive": "$PULSAR_HOME/connectors/pulsar-io-bigquery-{{connector:version}}.jar",
  "className": "org.apache.pulsar.ecosystem.io.bigquery.BigQuerySink",
  "tenant": "public",
  "namespace": "default",
  "inputs": [
    "google-bigquery-pulsar"
  ],
  "parallelism": 1,
  "configs": {
    "projectId": "SECRETS",
    "datasetName": "pulsar-io-google-bigquery",
    "tableName": "test-pulsar"
  }
}
</script>

2. Run the sink connector.

<script>
$PULSAR_HOME/bin/pulsar-admin sinks localrun \
--sink-config-file google-bigquery-sink-config.json
</script>

3. Send messages to the public/default/google-bigquery-pulsar topic, then view them in BigQuery.
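One way to exercise step 3 is the `pulsar-client` command-line tool that ships with Pulsar. The sketch below is illustrative only: the install path and the message body are hypothetical. Because the connector derives the table schema from the topic schema, a plain text message like this mainly checks that the topic and sink are wired together; a schema-aware producer (for example, Avro) may be needed before rows actually appear in BigQuery.

```shell
# Assumption: PULSAR_HOME points at your Pulsar installation (hypothetical default below).
PULSAR_HOME="${PULSAR_HOME:-$HOME/apache-pulsar}"
TOPIC="public/default/google-bigquery-pulsar"

send_test_messages() {
  # --messages sets the message body; --num-produce repeats the send.
  "$PULSAR_HOME/bin/pulsar-client" produce "$TOPIC" \
    --messages "hello-bigquery" \
    --num-produce 3
}

# Only attempt to send when the CLI is actually present.
if [ -x "$PULSAR_HOME/bin/pulsar-client" ]; then
  send_test_messages
else
  echo "pulsar-client not found under $PULSAR_HOME/bin" >&2
fi
```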

For more information, see the Google Cloud BigQuery Sink documentation.

How can you get involved?

The Google BigQuery sink connector is a major step in the journey of integrating Pulsar with other big data systems. To get involved with the Google Cloud BigQuery sink connector for Apache Pulsar, check out the following featured resources:

Try out the Google BigQuery sink connector. To get started, download the connector and refer to the README, which walks you through the whole process.
Make a contribution. The Google BigQuery sink connector is a community-driven project whose source code is hosted in the StreamNative GitHub repository. If you have feature requests or bug reports, share your feedback and ideas or submit a pull request.
Contact us. Feel free to create an issue on GitHub, send an email to the Pulsar mailing list, or message us on Twitter to get answers from Pulsar experts.

More resources

Make an inquiry: Interested in a fully-managed Pulsar offering built by the original creators of Pulsar? Contact us now.
Pulsar Summit Europe 2023 is taking place virtually on May 23rd. Engage with the community by submitting a CFP or becoming a community sponsor (no fee required).
Learn the Pulsar Fundamentals: Sign up for StreamNative Academy, developed by the original creators of Pulsar, and learn at your own pace with on-demand courses and hands-on labs.
Read the 2022 Pulsar vs. Kafka Benchmark Report for the latest performance comparison on maximum throughput, publish latency, and historical read rate.

Baodi Shi

Baodi is a platform engineer at StreamNative. He previously worked at a fintech company for five years, where he was mainly responsible for middleware development. His work focuses on event sourcing, domain-driven design, and real-time computing.