Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ES|QL: Support aggregation commands on histogram fields #103060

Open
dgieselaar opened this issue Dec 6, 2023 · 8 comments
Open

ES|QL: Support aggregation commands on histogram fields #103060

dgieselaar opened this issue Dec 6, 2023 · 8 comments
Labels
:Analytics/ES|QL AKA ESQL >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@dgieselaar
Copy link
Member

dgieselaar commented Dec 6, 2023

Description

Currently, running an ES|QL aggregation command on a histogram field results in an error. Aggregations should be supported on histogram fields, similar to how _search supports aggregations on histogram fields.

Use cases

In APM we use histograms to store latency distribution data (in transaction.duration.histogram). The aggregations we currently run on this field are: avg, pxx, sum, value_count.

@dgieselaar dgieselaar added >enhancement needs:triage Requires assignment of a team area label :Analytics/ES|QL AKA ESQL labels Dec 6, 2023
@elasticsearchmachine elasticsearchmachine added Team:QL (Deprecated) Meta label for query languages team and removed needs:triage Requires assignment of a team area label labels Dec 6, 2023
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-ql (Team:QL)

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/elasticsearch-esql (:Query Languages/ES|QL)

@nik9000
Copy link
Member

nik9000 commented Dec 6, 2023

Do you have any examples of things? I can guess, but it'd be nice to have an example of the kinds of STATS you expect.

One problem with the STATS here is that ESQL allows a lot more slicing that _search does so it'd be easier to put the query into a state where it wouldn't have the data. I'm kind of imagining something like FROM foo | WHERE hostname = 'blah' | STATS PERCENTILES(bytes_out) where hostname is a field that got removed in a downsampling operation. I suppose that thing's just not supported. I guess we'd get it for free by the field just not being there. Though maybe the error message should be different? I dunno.

@dgieselaar
Copy link
Member Author

@nik9000 AVG, SUM, MIN, MAX, Pxx. I'm not sure if I follow your example?

@wchaparro wchaparro added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Jan 2, 2024
@elasticsearchmachine elasticsearchmachine removed the Team:QL (Deprecated) Meta label for query languages team label Jan 2, 2024
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@luigidellaquila
Copy link
Contributor

I think first of all we'll have to support histogram field type (read and output at least).
Since a histogram field is practically an object containing two arrays, I can imagine it returned as a JSON.
Supporting new field types has some cost by itself and is not trivial.

After that, we can start defining the behavior for the single agg functions, starting from min, max, count and avg.
I guess it won't be much different from how _search implements them, eg. for the sample data reported here

PUT my-index-000001
{
  "mappings" : {
    "properties" : {
      "my_histogram" : {
        "type" : "histogram"
      },
      "my_text" : {
        "type" : "keyword"
      }
    }
  }
}

PUT my-index-000001/_doc/1
{
  "my_text" : "histogram_1",
  "my_histogram" : {
      "values" : [0.1, 0.2, 0.3, 0.4, 0.5], 
      "counts" : [3, 7, 23, 12, 6] 
   }
}

PUT my-index-000001/_doc/2
{
  "my_text" : "histogram_2",
  "my_histogram" : {
      "values" : [0.1, 0.25, 0.35, 0.4, 0.45, 0.5], 
      "counts" : [8, 17, 8, 7, 6, 2] 
   }
}

I guess the ESQL usage will be something like:

from my-index* | stats max = max(my_histogram), count = count(my_histogram) by my_text;

my_text     | max    | count
histogram_1 | 0.5    | 51
histogram_2 | 0.5    | 48

where max(my_histogram) is calculated on the "values", while count(my_histogram) is the sum of the "counts".
We will have to define the behavior of each single aggregation function, but at a first look it seems pretty natural at least for the basic aggs, and we can start from this as a guideline.

Wondering if it makes sense to allow histogram fields in other commands apart from STATS. Maybe they can be used in EVAL for simple assignment (no manipulation, at least in a first phase) and KEEP/DROP, but it's hard for me to imagine how to use them in commands like SORT, ENRICH and so on.

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@not-napoleon
Copy link
Member

I suggest we wait to implement histogram support until we encode the algorithm in the field (see #108208). This will let us choose the appropriate sketch for percentiles against the histogram, at a minimum, and may influence the implementation of other aggregations.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)
Projects
None yet
Development

No branches or pull requests

6 participants