BigQuery query performance
4 min readJan 23, 2021
- Control projection — Query only the columns that you need.
Projection refers to the number of columns that are read by your query. Projecting excess columns incurs additional (wasted) I/O and materialization (writing results). Avoid SELECT *. - Prune partitioned queries
When querying a partitioned table, use the_PARTITIONTIME
pseudo column to filter the partitions. - Denormalizing data
Denormalization is a common strategy for increasing read performance for relational datasets that were previously normalized.
The recommended way to denormalize data in BigQuery is to use nested and repeated fields. It’s best to use this strategy when the relationships are hierarchical and frequently queried together, such as in parent-child relationships. Avoid denormalization in these use cases:
(1) You have a star schema with frequently changing dimensions.
(2) BigQuery complements an Online Transaction Processing (OLTP) system with row-level mutation but can’t replace it. - Using nested and repeated fields
BigQuery doesn’t require a completely flat denormalization.
You can use nested and repeated fields to maintain relationships. Nesting data (STRUCT
). Repeated data (ARRAY
). Nested and repeated data (ARRAY
ofSTRUCT
s). - Use external data sources appropriately
If query performance is a top priority, do not use an external data source. - Avoid excessive wildcard tables
When querying wildcard tables, use the most granular prefix possible. - Reduce data before using a
JOIN
Reduce the amount of data that is processed before aJOIN
clause. Trim the data as early in the query as possible, before the query performs aJOIN
. If you reduce data early in the processing cycle, shuffling and other complex operations only execute on the data that you need. - Do not treat
WITH
clauses as prepared statements
UseWITH
clauses primarily for readability.WITH
clauses are used primarily for readability because they are not materialized.
For example, placing all your queries inWITH
clauses and then runningUNION ALL
is a misuse of theWITH
clause. If a query appears in more than oneWITH
clause, it executes in each clause. - Avoid tables sharded by date
Do not use tables sharded by date (also called date-named tables) in place of time-partitioned tables. - Avoid oversharding tables
Avoid creating too many table shards. If you are sharding tables by date, use time-partitioned tables instead. - Avoid repeatedly transforming data via SQL queries
If you are using SQL to perform ETL operations, avoid situations where you are repeatedly transforming the same data. - Avoid JavaScript user-defined functions
Avoid using JavaScript user-defined functions. Use native UDFs instead. - Use approximate aggregation functions
If your use case supports it, use an approximate aggregation function. - Order query operations to maximize performance
UseORDER BY
only in the outermost query or within window clauses (analytic functions). Push complex operations to the end of the query. - Optimize your join patterns
For queries that join data from multiple tables, optimize your join patterns. Start with the largest table. - Prune partitioned queries
When querying a partitioned table, use the_PARTITIONTIME
pseudo column to filter the partitions. - Avoid repeated joins and subqueries
Avoid repeatedly joining the same tables and using the same subqueries. - Carefully consider materializing large result sets
Carefully consider materializing large result sets to a destination table. Writing large result sets has performance and cost impacts. - Use a
LIMIT
clause with large sorts
If you are sorting a very large number of values, use aLIMIT
clause.
Writing results for a query with anORDER BY
clause can result inResources exceeded
errors. Because the final sorting must be done on a single slot, if you are attempting to order a very large result set, the final sorting can overwhelm the slot that is processing the data. If you are using anORDER BY
clause, also use aLIMIT
clause. - Self-joins
Avoid self-joins. Use a window function instead. - Data skew
If your query processes keys that are heavily skewed to a few values, filter your data as early as possible.
Partition skew, sometimes called data skew, is when data is partitioned into very unequally sized partitions. This creates an imbalance in the amount of data sent between slots. You can’t share partitions between slots, so if one partition is especially large, it can slow down, or even crash the slot that processes the oversized partition. - Unbalanced joins
Data skew can also appear when you useJOIN
clauses. Because BigQuery shuffles data on each side of the join, all data with the same join key goes to the same shard. This shuffling can overload the slot.
To avoid performance issues that are associated with unbalanced joins:
(1) Pre-filter rows from the table with the unbalanced key.
(2) If possible, split the query into two queries. - Cross joins (Cartesian product)
Avoid joins that generate more outputs than inputs. When aCROSS JOIN
is required, pre-aggregate your data. - DML statements that update or insert single rows
Avoid point-specific DML statements (updating or inserting 1 row at a time). Batch your updates and inserts.