BigQuery query performance

Suds Kumar
4 min readJan 23, 2021
  1. Control projection — Query only the columns that you need.
    Projection refers to the number of columns that are read by your query. Projecting excess columns incurs additional (wasted) I/O and materialization (writing results). Avoid SELECT *.
  2. Prune partitioned queries
    When querying a partitioned table, use the _PARTITIONTIME pseudo column to filter the partitions.
  3. Denormalizing data
    Denormalization is a common strategy for increasing read performance for relational datasets that were previously normalized.
    The recommended way to denormalize data in BigQuery is to use nested and repeated fields. It’s best to use this strategy when the relationships are hierarchical and frequently queried together, such as in parent-child relationships. Avoid denormalization in these use cases:
    (1) You have a star schema with frequently changing dimensions.
    (2) BigQuery complements an Online Transaction Processing (OLTP) system with row-level mutation but can’t replace it.
  4. Using nested and repeated fields
    BigQuery doesn’t require a completely flat denormalization.
    You can use nested and repeated fields to maintain relationships. Nesting data (STRUCT). Repeated data (ARRAY). Nested and repeated data (ARRAY of STRUCTs).
  5. Use external data sources appropriately
    If query performance is a top priority, do not use an external data source.
  6. Avoid excessive wildcard tables
    When querying wildcard tables, use the most granular prefix possible.
  7. Reduce data before using a JOIN
    Reduce the amount of data that is processed before a JOIN clause. Trim the data as early in the query as possible, before the query performs a JOIN. If you reduce data early in the processing cycle, shuffling and other complex operations only execute on the data that you need.
  8. Do not treat WITH clauses as prepared statements
    Use WITH clauses primarily for readability.WITH clauses are used primarily for readability because they are not materialized.
    For example, placing all your queries in WITH clauses and then running UNION ALL is a misuse of the WITH clause. If a query appears in more than one WITH clause, it executes in each clause.
  9. Avoid tables sharded by date
    Do not use tables sharded by date (also called date-named tables) in place of time-partitioned tables.
  10. Avoid oversharding tables
    Avoid creating too many table shards. If you are sharding tables by date, use time-partitioned tables instead.
  11. Avoid repeatedly transforming data via SQL queries
    If you are using SQL to perform ETL operations, avoid situations where you are repeatedly transforming the same data.
  12. Avoid JavaScript user-defined functions
    Avoid using JavaScript user-defined functions. Use native UDFs instead.
  13. Use approximate aggregation functions
    If your use case supports it, use an approximate aggregation function.
  14. Order query operations to maximize performance
    Use ORDER BY only in the outermost query or within window clauses (analytic functions). Push complex operations to the end of the query.
  15. Optimize your join patterns
    For queries that join data from multiple tables, optimize your join patterns. Start with the largest table.
  16. Prune partitioned queries
    When querying a partitioned table, use the _PARTITIONTIME pseudo column to filter the partitions.
  17. Avoid repeated joins and subqueries
    Avoid repeatedly joining the same tables and using the same subqueries.
  18. Carefully consider materializing large result sets
    Carefully consider materializing large result sets to a destination table. Writing large result sets has performance and cost impacts.
  19. Use a LIMIT clause with large sorts
    If you are sorting a very large number of values, use a LIMIT clause.
    Writing results for a query with an ORDER BY clause can result in Resources exceeded errors. Because the final sorting must be done on a single slot, if you are attempting to order a very large result set, the final sorting can overwhelm the slot that is processing the data. If you are using an ORDER BY clause, also use a LIMIT clause.
  20. Self-joins
    Avoid self-joins. Use a window function instead.
  21. Data skew
    If your query processes keys that are heavily skewed to a few values, filter your data as early as possible.
    Partition skew, sometimes called data skew, is when data is partitioned into very unequally sized partitions. This creates an imbalance in the amount of data sent between slots. You can’t share partitions between slots, so if one partition is especially large, it can slow down, or even crash the slot that processes the oversized partition.
  22. Unbalanced joins
    Data skew can also appear when you use JOIN clauses. Because BigQuery shuffles data on each side of the join, all data with the same join key goes to the same shard. This shuffling can overload the slot.
    To avoid performance issues that are associated with unbalanced joins:
    (1) Pre-filter rows from the table with the unbalanced key.
    (2) If possible, split the query into two queries.
  23. Cross joins (Cartesian product)
    Avoid joins that generate more outputs than inputs. When a CROSS JOIN is required, pre-aggregate your data.
  24. DML statements that update or insert single rows
    Avoid point-specific DML statements (updating or inserting 1 row at a time). Batch your updates and inserts.

--

--