Skip to main content
Welcome — in this lesson you’ll learn how table partitioning and clustering work in BigQuery, why they matter for cost and performance, and when to use each. If you’ve used Hive or other data warehouses, these concepts will feel familiar. This article maps those ideas to BigQuery with a concise example to show the practical differences and benefits.

Example scenario

Imagine a simple orders table with these columns:
  • order_id
  • country
  • status
  • order_date
Start with the table unpartitioned and unclustered, then examine how queries behave as you apply clustering and partitioning.

Unoptimized table behavior

When the table has no partitioning or clustering, BigQuery cannot prune by date or specific column values. For example:
SELECT * FROM dataset.orders;
This query scans the entire table. Even a filtered query such as:
SELECT * FROM dataset.orders WHERE country = 'US';
may still scan much of the table because the storage layout does not group rows by country or order_date. On very large tables this increases query latency and cost—especially with billions of rows.

Clustering by a column (example: country)

Clustering organizes data so rows with similar values for the clustered columns are stored physically close together. If you cluster by country, values like Japan, UK, and US will tend to reside in nearby blocks on disk. When you run:
SELECT * FROM dataset.orders WHERE country = 'US';
BigQuery can use cluster metadata to skip blocks that are unlikely to contain US rows, reducing the amount of data scanned. Clustering is most effective when your queries frequently filter on the clustered columns (equality predicates, GROUP BY, or JOIN keys) and when the clustered columns have moderate cardinality.

Partitioning (example: order_date)

Partitioning splits a table into segments, often by DATE or TIMESTAMP. A typical pattern is partitioning by order_date, so each day (or range) lives in a separate partition. If your query filters on the partition column, e.g.:
SELECT * FROM dataset.orders WHERE order_date = '2022-08-05';
BigQuery performs partition pruning first and scans only the relevant partition(s). This reduces the scanned data dramatically for time-bounded queries.

Combining partitioning and clustering

The best efficiency often comes from combining partitioning and clustering:
  • Partitioning prunes large swaths of data (for example, specific date ranges).
  • Clustering then prunes within those partitions to find rows for specific values (for example, country).
Example:
SELECT * FROM dataset.orders
WHERE order_date = '2022-08-05' AND country = 'US';
Execution flow:
  1. Partition pruning: BigQuery scans only the 2022-08-05 partition.
  2. Cluster pruning: Within that partition, BigQuery reads only the blocks that likely contain country = 'US'.
This two-stage pruning minimizes scanned bytes, lowers query latency, and reduces cost.

Sample DDL for a partitioned and clustered table

Use this DDL to create a table partitioned by order_date and clustered by country:
CREATE TABLE dataset.orders (
  order_id STRING,
  country STRING,
  status STRING,
  order_date DATE
)
PARTITION BY order_date
CLUSTER BY country;

Best practices and guidance

TopicRecommendationWhy it matters
Partition columnChoose a column used commonly in date or range filters (e.g., order_date).Partition pruning avoids scanning irrelevant time ranges.
Clustering columnsCluster on columns used in equality filters, GROUP BY, or JOIN predicates. Prefer moderate-cardinality columns (not unique IDs).Enables block-level pruning and speeds selective queries.
Number of clustering columnsKeep it small (typically 1–4 columns).More columns dilute clustering effectiveness and increase maintenance overhead.
When to define partitionsDefine partitioning at table creation where possible. To partition an existing table, create a new partitioned table and copy data.Initial partitioning produces the cleanest storage layout and best early performance.
Adding clusteringYou can add clustering to existing tables; subsequent writes will benefit.Enables incremental improvement without full rewrite, though initial data may not be optimally clustered.

Quick example queries

SELECT * FROM dataset.orders;
SELECT * FROM dataset.orders WHERE country = 'US';
SELECT * FROM dataset.orders WHERE order_date = '2022-08-05' AND country = 'US';
Partition pruning (by the partition column) runs first; clustering then prunes within the selected partitions. Together, they significantly reduce scanned data and query cost.
If your queries rarely filter on the partition column or clustered columns, BigQuery will still scan large portions of the table. Design partitions and clusters based on real query patterns to avoid unexpected costs.

Further reading and references

Designing partitioning and clustering thoughtfully for base models is a core responsibility for data engineers — these decisions directly affect performance and cost for everyone who queries the data. See you in the next lesson.

Watch Video