BigQuery Federated Queries

Welcome — in this lesson you’ll learn what BigQuery federated queries are, when to use them, and how they work in practice. A federated query lets BigQuery read external data at query time and combine it with native BigQuery tables without first ingesting the data. This is ideal for quick, ad-hoc analysis and multi-source joins when building and maintaining an ETL pipeline is impractical.

Key points

Federated queries let BigQuery read external data sources at query time and join that data with native BigQuery tables.
Supported external sources include Cloud Storage (CSV, JSON, Avro, Parquet, ORC), Google Drive / Google Sheets, Cloud SQL (MySQL/PostgreSQL) and Cloud Spanner via BigQuery connections, Cloud Bigtable, and other sources exposed through connectors.
Conceptually similar to query engines like Presto that present a single SQL interface across multiple data sources.

Supported external sources (summary)

External resource	Access method / formats	Typical use case
Cloud Storage	`CSV`, `JSON`, `Avro`, `Parquet`, `ORC` via external table `OPTIONS(uris=...)`	Query logs or exported files without loading them into BigQuery
Google Sheets / Drive	External tables with `GOOGLE_SHEETS` driver	Blend client-provided spreadsheets with production datasets
Cloud SQL / Cloud Spanner	BigQuery Connections + federated queries or `EXTERNAL_QUERY()`	Ad-hoc analysis of transactional DBs (use cautiously)
Cloud Bigtable	Connector for large, low-latency key-value data	Analytical joins with time-series or wide-column data

When to use federated queries

Short-lived, ad-hoc analysis where ingesting data into BigQuery is slower or unnecessary.
Prototyping, proof-of-concept, or exploratory analysis to validate hypotheses before committing to ingestion and ETL.
Blending small client-supplied datasets (for example, a Google Sheet) with a production BigQuery dataset to avoid building a pipeline for transient data.

Important cautions

Federated queries can increase latency and may generate charges from both BigQuery and the external data source. Querying transactional systems (Cloud SQL or Cloud Spanner) can affect production performance. For recurring, large-scale analytics, ingesting data into BigQuery is generally more performant and cost-effective.

How it works (high-level)

Define an external table or create a BigQuery Connection that references the remote data source.
When your SQL references that external table, BigQuery reads the source data at query time and processes it together with native tables.
External data is not persisted in BigQuery unless you explicitly load it into a native table.

Example: define and query a Cloud Storage external table

Create an external table that points to CSV files in Cloud Storage, then join it with a native table:

-- Create an external table that reads CSV files directly from Cloud Storage
CREATE EXTERNAL TABLE `mydataset.external_orders`
OPTIONS (
  format = 'CSV',
  uris = ['gs://my-bucket/orders/*.csv'],
  skip_leading_rows = 1
);

-- Query and join with a native BigQuery table
SELECT t.order_id, t.total_amount, e.region
FROM `myproject.mydataset.transactions` AS t
LEFT JOIN `myproject.mydataset.external_orders` AS e
  ON t.order_id = CAST(e.order_id AS INT64)
WHERE t.event_date >= '2024-01-01';

Example: query a Google Sheet as an external table

Create an external table with GOOGLE_SHEETS format and query it in place (useful for small, client-managed sheets).

Cost and performance tips

Federated queries are excellent for quick results and small-to-medium datasets. For high-frequency queries or large datasets, prefer loading data into BigQuery to reduce latency and cost. Monitor query plan and bytes processed, and avoid scanning unnecessary columns or files in the external source.

Best practices

Limit the scanned data: use file filtering (for Cloud Storage URIs), column projection, and WHERE clauses to reduce bytes read.
Use federated queries for prototyping and low-frequency use; move stable, high-volume data into BigQuery for repeated analysis.
For Cloud SQL or Spanner, consider creating read replicas or exporting snapshots if analytical queries would impact production performance.
Track and monitor costs that may come from both BigQuery processing and the external source (e.g., Cloud SQL I/O).

Links and references

Summary BigQuery federated queries let you query and join external data where it lives without first ingesting it into BigQuery. They’re powerful for ad-hoc analysis, prototyping, and blending external client data, but consider latency, cost, and the potential impact on source systems when using them. See you in the next lesson.

Watch Video

BigQuery Spark BigQuery Connector

Looker Studio for Visualization

Introduction

GCP Networking

Identity and Access Management (IAM) in GCP

Cloud Observability

Development & CI/CD

Data Security & Encryption

Data Ingestion Options

Data Storage Options

Database (SQL, NoSQL and memory)

Data Orchestration Options

Data Processing

Data Integration & Transformation Tools

Data Warehouse & Analytics Options

Machine Learning Options

Multi-Cloud & Lakehouse Solutions

Data Management and Governance

GCP Data Engineering Architecture and Landscape

GCP Core Fundamentals & Understanding

BigQuery Federated Queries

Watch Video