Example scenario
Imagine a simpleorders table with these columns:
order_idcountrystatusorder_date
Unoptimized table behavior
When the table has no partitioning or clustering, BigQuery cannot prune by date or specific column values. For example:country or order_date. On very large tables this increases query latency and cost—especially with billions of rows.
Clustering by a column (example: country)
Clustering organizes data so rows with similar values for the clustered columns are stored physically close together. If you cluster bycountry, values like Japan, UK, and US will tend to reside in nearby blocks on disk.
When you run:
US rows, reducing the amount of data scanned. Clustering is most effective when your queries frequently filter on the clustered columns (equality predicates, GROUP BY, or JOIN keys) and when the clustered columns have moderate cardinality.
Partitioning (example: order_date)
Partitioning splits a table into segments, often byDATE or TIMESTAMP. A typical pattern is partitioning by order_date, so each day (or range) lives in a separate partition.
If your query filters on the partition column, e.g.:
Combining partitioning and clustering
The best efficiency often comes from combining partitioning and clustering:- Partitioning prunes large swaths of data (for example, specific date ranges).
- Clustering then prunes within those partitions to find rows for specific values (for example,
country).
- Partition pruning: BigQuery scans only the
2022-08-05partition. - Cluster pruning: Within that partition, BigQuery reads only the blocks that likely contain
country = 'US'.
Sample DDL for a partitioned and clustered table
Use this DDL to create a table partitioned byorder_date and clustered by country:
Best practices and guidance
| Topic | Recommendation | Why it matters |
|---|---|---|
| Partition column | Choose a column used commonly in date or range filters (e.g., order_date). | Partition pruning avoids scanning irrelevant time ranges. |
| Clustering columns | Cluster on columns used in equality filters, GROUP BY, or JOIN predicates. Prefer moderate-cardinality columns (not unique IDs). | Enables block-level pruning and speeds selective queries. |
| Number of clustering columns | Keep it small (typically 1–4 columns). | More columns dilute clustering effectiveness and increase maintenance overhead. |
| When to define partitions | Define partitioning at table creation where possible. To partition an existing table, create a new partitioned table and copy data. | Initial partitioning produces the cleanest storage layout and best early performance. |
| Adding clustering | You can add clustering to existing tables; subsequent writes will benefit. | Enables incremental improvement without full rewrite, though initial data may not be optimally clustered. |
Quick example queries
Partition pruning (by the partition column) runs first; clustering then prunes within the selected partitions. Together, they significantly reduce scanned data and query cost.
If your queries rarely filter on the partition column or clustered columns, BigQuery will still scan large portions of the table. Design partitions and clusters based on real query patterns to avoid unexpected costs.
Further reading and references
- BigQuery partitioned tables: https://cloud.google.com/bigquery/docs/partitioned-tables
- BigQuery clustered tables: https://cloud.google.com/bigquery/docs/clustered-tables