Skip to content

Possible feature: Aggregating Indexes #9871

@nicktobey

Description

@nicktobey

A small number of aggregate functions have the property that if the input rows R are partitioned into R1, R2, ..., RN, then:

F(R1 || R2 || ... || RN)) = F( F(R1), F(R2), ... , F(RN) )

  • Aggregate functions that have this property include BIT_AND, BIT_OR, and SUM
  • MIN, MAX, and COUNT also meet the criteria, but aggregating indexes are not needed to optimize those.
  • AVG doesn't have this exact property but can be optimized in a similar way by indexing both SUM and COUNT

Imagine a user wants to regularly run one of these aggregate functions on a range of rows.

Example: Suppose you're using Dolt to represent metadata on a set of files, where each row represents a different file and has columns for things like size, hash, and other application-specific metadata. You want to know the total size of the every file in the set that matches some filter.

You can write an aggregate function query: SELECT key, SUM(size) FROM files GROUP BY key;.

The key column is indexed, so this can get evaluated by scanning that index. But this still requires scanning the entire table and reading the size column of every field.

We could leverage the tree structure of the index to massively improve this by including an extra value in each tree node containing the result of the aggregate function on that node's children. When the index is modified, newly written tree nodes will recompute these values based on their children.

During queries like the example above, the engine will be able to identify tree nodes where all of the node's children match the filter and use the function results cached in that node instead of visiting any of the children nodes.

An index that optimizes one or more aggregate functions this way is called an aggregating index. Aggregating indexes don't exist in MySQL, but here's an example of what they look like in FireBolt: https://docs.firebolt.io/reference-sql/commands/data-definition/create-aggregating-index

CREATE AGGREGATING INDEX my_table_agg_idx1 ON my_table (
  origin_country,
  product_name,
  COUNT(DISTINCT source),
  SUM(amount)
);

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions