Announcing the Google Cloud BigQuery Sink Connector for Apache Pulsar

We are excited to announce the general availability of the Google Cloud BigQuery sink connector for Apache Pulsar. This connector seamlessly synchronizes Pulsar data to BigQuery in real time, enabling Google Cloud BigQuery to leverage Pulsar and expanding the Apache Pulsar ecosystem.

What is the Google Cloud BigQuery sink connector?

The Google BigQuery sink connector pulls data from Pulsar topics and persists data to Google Cloud BigQuery tables.

Why develop the Google Cloud BigQuery sink connector?

Google Cloud BigQuery is a fully managed enterprise data warehouse that enables users to manage and analyze data with built-in features like machine learning, geospatial analysis, and business intelligence.

The Google Cloud BigQuery sink connector provides you with a way to write data from Pulsar to BigQuery in real time. It is a low-code solution with out-of-the-box capabilities such as fault tolerance, scalability, automatic creation and updating of table schemas, partitioned tables, and clustered tables.

Before this connector was available, the only option was to use the Cloud Storage sink connector for Pulsar to move data to Cloud Storage, and then query that data from BigQuery as external tables (refer to Integrating Apache Pulsar with BigQuery). However, external tables in BigQuery have many limitations, such as no support for clustered tables and poor query performance. This connector enables you to write data directly from Pulsar to BigQuery and supports partitioned and clustered tables.

What are the benefits of using the Google Cloud BigQuery sink connector?

The integration between Google Cloud BigQuery and Apache Pulsar provides four key benefits.

  • Simplicity: Quickly move data from Apache Pulsar to Google Cloud BigQuery without any user code.
  • Efficiency: Reduce the time you spend configuring the data layer, so you have more time to extract business value from real-time data.
  • Scalability: Run in different modes (standalone or distributed). This allows you to build reactive data pipelines to meet the business and operational needs in real time.
  • Auto Schema: Automatically create and update a table’s schema based on the Pulsar topic schema.

How to get started with the Google Cloud BigQuery sink connector

Prerequisites

First, you must run an Apache Pulsar cluster and a Google Cloud BigQuery service.

  1. Prepare the Pulsar service. You can quickly run a Pulsar cluster anywhere by running $PULSAR_HOME/bin/pulsar standalone. Refer to the documentation for details.
  2. Prepare the Google Cloud BigQuery service. See Google Cloud BigQuery Quickstarts for details. Note that you need to set up the GOOGLE_APPLICATION_CREDENTIALS environment variable to access Google BigQuery.
  3. Set up the Google BigQuery connector. Download the connector from the Releases page, and then move the jar package to $PULSAR_HOME/connectors.
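
Before moving on, it can help to confirm that the prerequisites above are in place. The following Python sketch is one way to do that. The function name `check_prerequisites` is hypothetical, and the jar pattern `pulsar-io-bigquery-*.jar` is an assumption based on the file name used in the configuration later in this post; neither is part of the connector itself.

```python
import glob
import os


def check_prerequisites(env):
    """Return a list of problems; an empty list means every check passed."""
    problems = []

    # Google client libraries read the path of the credentials file
    # from the GOOGLE_APPLICATION_CREDENTIALS environment variable.
    creds = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not creds:
        problems.append("GOOGLE_APPLICATION_CREDENTIALS is not set")
    elif not os.path.isfile(creds):
        problems.append(f"credentials file not found: {creds}")

    # The connector jar is expected under $PULSAR_HOME/connectors.
    # The glob pattern is an assumption about the jar's file name.
    pulsar_home = env.get("PULSAR_HOME")
    if not pulsar_home:
        problems.append("PULSAR_HOME is not set")
    else:
        pattern = os.path.join(pulsar_home, "connectors", "pulsar-io-bigquery-*.jar")
        if not glob.glob(pattern):
            problems.append(f"no connector jar matches {pattern}")

    return problems


if __name__ == "__main__":
    for problem in check_prerequisites(os.environ):
        print("MISSING:", problem)
```

The function takes the environment as a plain mapping rather than reading `os.environ` directly, which makes it easy to try against different setups.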

Apache Pulsar provides a Pulsar IO feature to run the connector. Follow the steps below to get the connector up and running.

Configure the sink connector

  1. Create a configuration file named google-bigquery-sink-config.json. With this configuration, the connector writes messages from the public/default/google-bigquery-pulsar topic to the test-pulsar table in BigQuery.
<script>
{
  "name": "google-bigquery-sink",
  "archive": "$PULSAR_HOME/connectors/pulsar-io-bigquery-{{connector:version}}.jar",
  "className": "org.apache.pulsar.ecosystem.io.bigquery.BigQuerySink",
  "tenant": "public",
  "namespace": "default",
  "inputs": [
    "google-bigquery-pulsar"
  ],
  "parallelism": 1,
  "configs": {
    "projectId": "SECRETS",
    "datasetName": "pulsar-io-google-bigquery",
    "tableName": "test-pulsar"
  }
}
 <script> 
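
The `archive` value in the configuration above contains `$PULSAR_HOME` and a version placeholder. A JSON file is read literally, so it is probably safest to substitute both with concrete values before running the connector (this is an assumption about how the path is handled, so check it against your Pulsar version). Here is a minimal Python sketch that renders the same configuration with those values filled in. The function name `render_sink_config` and the example arguments are illustrative, not part of the connector.

```python
import json
import os


def render_sink_config(pulsar_home, connector_version, project_id, dataset, table):
    """Return the sink configuration as JSON text with a concrete jar path."""
    jar = f"pulsar-io-bigquery-{connector_version}.jar"
    config = {
        "name": "google-bigquery-sink",
        "archive": os.path.join(pulsar_home, "connectors", jar),
        "className": "org.apache.pulsar.ecosystem.io.bigquery.BigQuerySink",
        "tenant": "public",
        "namespace": "default",
        "inputs": ["google-bigquery-pulsar"],
        "parallelism": 1,
        "configs": {
            "projectId": project_id,
            "datasetName": dataset,
            "tableName": table,
        },
    }
    # Fail early on blank connector settings instead of at sink startup.
    empty = [key for key, value in config["configs"].items() if not value]
    if empty:
        raise ValueError("empty connector settings: " + ", ".join(empty))
    return json.dumps(config, indent=2)


if __name__ == "__main__":
    print(render_sink_config(
        pulsar_home=os.environ.get("PULSAR_HOME", "."),
        connector_version="x.y.z",    # placeholder: use the version you downloaded
        project_id="my-gcp-project",  # hypothetical project ID
        dataset="pulsar-io-google-bigquery",
        table="test-pulsar",
    ))
```

Redirecting the script's output to google-bigquery-sink-config.json produces a file with the same keys as the one shown above.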

  2. Run the sink connector.

<script>
$PULSAR_HOME/bin/pulsar-admin sinks localrun \
  --sink-config-file google-bigquery-sink-config.json
<script> 

  3. Send messages to the public/default/google-bigquery-pulsar topic, then view them in BigQuery.
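
One way to try the last step is with the `pulsar-client` CLI that ships with Pulsar and the `bq` CLI from the Google Cloud SDK. The Python sketch below only assembles the two command lines so you can inspect them before running anything. The helper names, the sample message, and the project ID are hypothetical. Whether the connector accepts plain string messages like this one or requires a topic schema is an assumption to check against the connector documentation.

```python
import os
import shlex

TOPIC = "public/default/google-bigquery-pulsar"


def produce_command(pulsar_home, message, count):
    """argv for pulsar-client: send `message` to TOPIC `count` times."""
    return [
        os.path.join(pulsar_home, "bin", "pulsar-client"),
        "produce", TOPIC,
        "-m", message,
        "-n", str(count),
    ]


def count_query(project_id, dataset, table):
    """Standard SQL that counts the rows in the target table."""
    # Backticks quote the table path, which contains hyphens here.
    return f"SELECT COUNT(*) AS row_count FROM `{project_id}.{dataset}.{table}`"


def bq_command(project_id, dataset, table):
    """argv for the bq CLI running the count query with standard SQL."""
    return ["bq", "query", "--use_legacy_sql=false",
            count_query(project_id, dataset, table)]


if __name__ == "__main__":
    home = os.environ.get("PULSAR_HOME", ".")
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(produce_command(home, "hello-bigquery", 10)))
    print(shlex.join(bq_command("my-gcp-project",  # hypothetical project ID
                                "pulsar-io-google-bigquery", "test-pulsar")))
```

Each helper returns an argument list, so it can also be passed directly to `subprocess.run` once the printed commands look right.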

For more information, see the Google Cloud BigQuery Sink documentation.

How can you get involved?

The Google BigQuery sink connector is a major step in the journey of integrating Pulsar with other big data systems. To get involved with the Google Cloud BigQuery sink connector for Apache Pulsar, check out the following featured resources:

  • Try out the Google BigQuery sink connector. To get started, download the connector and refer to the README, which walks you through the whole process.
  • Make a contribution. The Google BigQuery sink connector is a community-driven project whose source code is hosted in the StreamNative GitHub repository. If you have feature requests or bug reports, do not hesitate to share your feedback and ideas or submit a pull request.
  • Contact us. Feel free to create an issue on GitHub, send an email to the Pulsar mailing list, or message us on Twitter to get answers from Pulsar experts.

More resources

Pulsar has become one of the most active Apache projects over the past few years, with a vibrant community that continues to drive innovation and improvements to the project.

  • Start your Pulsar training today. Take the self-paced Pulsar courses or instructor-led Pulsar training developed by the original creators of Pulsar. This will get you started with Pulsar and help accelerate your learning.
  • Spin up a Pulsar cluster in minutes with StreamNative Cloud. StreamNative Cloud provides a simple, fast, and cost-effective way to run Pulsar in the public cloud.
  • Save your spot at the Pulsar Summit San Francisco. The first in-person Pulsar Summit is taking place this August! Sign up today to join the Pulsar community and the messaging and event streaming community.
Baodi Shi
Baodi is a platform engineer at StreamNative. He previously worked at a fintech company for five years, where he was mainly responsible for middleware development. His work focuses on event sourcing, domain-driven design, and real-time computing.
