Skip to main content
This article explains the Spark–BigQuery connector: what it is, when to use it, and how it integrates into a Dataproc + BigQuery workflow. It covers typical usage patterns, code examples, operational considerations, and best practices for production workloads. Why use Spark with BigQuery?
  • BigQuery excels at large-scale SQL analytics, but some workflows require capabilities beyond SQL:
    • Complex transformations and multi-step pipelines that are easier or more efficient in Spark.
    • Custom machine learning training and feature engineering with Spark MLlib.
    • Enrichment of BigQuery data with files in Cloud Storage or other external sources.
  • Spark provides distributed, in-memory processing for iterative workloads and advanced transformations that complement BigQuery’s SQL capabilities.
A presentation slide titled "Why Move Data From BigQuery to Spark?" that says to leverage Spark’s distributed in-memory processing for complex transformations, custom ML workflows, and data enrichment. It adds these use cases go beyond standard SQL capabilities.
What the connector does
  • The Spark–BigQuery connector bridges Spark and BigQuery so Spark jobs can:
    • Read BigQuery tables into Spark DataFrames.
    • Write Spark DataFrames back to BigQuery tables.
  • On Dataproc, the connector is included with the image and uses the BigQuery Storage API to read data in parallel for improved performance.
  • Typical workflow: extract (BigQuery) → transform/train (Spark on Dataproc) → load (BigQuery) — often described as ETTL (extract, transform, train, load).
How it works (high level)
  1. Read path:
    • Spark requests data via the connector, which uses the BigQuery Storage API to stream data in parallel into Spark partitions.
    • Optionally the connector uses temporary Cloud Storage files for certain jobs or to optimize writes.
  2. Transform/train:
    • Data is processed in-memory across the Dataproc cluster; you can join, filter, and run MLlib jobs.
  3. Write path:
    • DataFrames are written back to BigQuery by the connector using BigQuery streaming inserts or load jobs (which may use GCS as temporary storage).
Typical usage examples (PySpark)
  • Read a BigQuery table into a Spark DataFrame:
# PySpark: read from BigQuery
df = spark.read.format("bigquery") \
    .option("table", "my-project.my_dataset.my_table") \
    .load()
  • Write a Spark DataFrame back to BigQuery:
# PySpark: write to BigQuery
df.write.format("bigquery") \
    .option("table", "my-project.my_dataset.output_table") \
    .mode("overwrite") \
    .save()
  • Read via SQL (register as temp view, then run Spark SQL):
# Register BigQuery table as a temporary view, then query
df = spark.read.format("bigquery") \
    .option("table", "my-project.my_dataset.my_table") \
    .load()
df.createOrReplaceTempView("bq_table")
result = spark.sql("SELECT user_id, COUNT(*) AS cnt FROM bq_table GROUP BY user_id")
Operational notes and IAM
  • Dataproc: the connector is preinstalled in Dataproc images — no extra JARs required when using supported images.
  • Authentication:
    • Dataproc clusters use the cluster service account to authenticate to BigQuery and Cloud Storage.
    • If running Spark outside Dataproc, you must add the connector artifact and configure credentials (e.g., ADC or service account key).
  • Performance considerations:
    • Use the BigQuery Storage API for faster reads.
    • Tune Spark cluster size and partitioning to match the input table size.
    • For large writes, prefer load jobs (temporary GCS files) over streaming inserts when appropriate.
Permissions quick reference
OperationMinimum IAM role (example)
Read from BigQueryroles/bigquery.dataViewer
Write to BigQuery (create/overwrite)roles/bigquery.dataEditor or roles/bigquery.dataOwner
Use temporary GCS for writesroles/storage.objectAdmin (or scoped permissions to the bucket)
Dataproc cluster actionsroles/dataproc.worker / roles/dataproc.editor as applicable
On Dataproc the Spark–BigQuery connector comes preinstalled with the image; this lets you read and write BigQuery tables from Spark jobs without manually adding connector jars or dependencies.
Real-world scenario (example flow) A retail company trains a recommendations model:
  1. Query historical sales and user interaction tables from BigQuery into Spark DataFrames on Dataproc.
  2. Enrich those DataFrames with product metadata stored in Cloud Storage.
  3. Train a recommendation model using Spark MLlib and evaluate it across partitions.
  4. Write model outputs (predictions, feature tables, or aggregates) back to BigQuery for dashboards and downstream consumers.
Exam tip
  • If asked which Google Cloud service includes the Spark–BigQuery connector out of the box, the answer is Dataproc.
Ensure the Dataproc cluster’s service account has the necessary BigQuery and Cloud Storage IAM roles. The connector will be present, but read/write operations will fail without proper permissions.
Additional resources Summary
  • The Spark–BigQuery connector simplifies moving data between BigQuery and Spark, unlocking advanced transformations, iterative ML workflows, and enrichment with external files.
  • Dataproc makes integration straightforward by including the connector and managing the runtime; ensure IAM and temporary GCS access are configured for production runs.
Another option for accessing external data is BigQuery federated queries, where BigQuery queries data that remains in an external system without importing it.

Watch Video