-
Notifications
You must be signed in to change notification settings - Fork 181
Description
1. Summary
PPL allows specifying field order using names and wildcards.
Current Calcite PPL implementation preserve field order as row type, but it would not work when dynamic fields are introduced by #4984 since all the dynamic fields will be stored in single _MAP field.
This RFC proposes a method to produce query result with expected field order even when the result contains dynamic fields.
2. Motivation
Users expect deterministic output data layout, especially when exporting data.
We would like to replicate the Splunk behavior so that existing Splunk users can easily onboard OpenSearch PPL.
3. Expected behavior of field ordering
- Default: fields sorted UTF-8 lexicographically.
fields: replaces ordering; affects later fields.stats/timechart: define their own output order.join: ignores right-side ordering.rename: modify field order (replace fields in field order)
4. Proposed Solution
- Decide a field order based on the query.
- Field order is a list of string, which contains specific field name and/or wild card notation (e.g.
abc,*,prefix*, etc.) - The last command which emit field order will decide the query's field order. (The field order specified earlier command will be ignored.)
- Only a few commands emit field order (
fields,stats,timechart,rename)
- Once query execution is done, reorganize the result by using the field order from step 1
- Come up with the list of all the available fields in dynamic fields
- Reorder fields based on the field order from step 1
- Fields will be sorted lexicographically for wild card pattern and remaining fields
Examples:
source=idx | eval a=1, b=2 | fields a, * | eval c=3 | fields b, *, a
# Field order => [b, *, a]
# Result schema => [b, c, a]
source=idx | eval a=1, b=2, bc=3, bd=4 | fields a, bd, b*
# Field order => [a, bd, b*]
# Result schema => [a, bd, b, bc]
source=idx | eval a=1, b=2, bc=3, bd=4 | fields *, a
# Field order => [*, a]
# Result schema => [b, bc, bd, a]
source=idx | eval a=1, b=2, bc=3, bd=4 | fields b*, a | eval bx=5
# Field order => [b*, a]
# Result schema => [b, bc, bd, bx, a]
source=idx | eval a=1, b=2 | stats count as c by b, a
# Field order => [b, a, c]
# Result schema => [b, a, c]
source=idx | eval a=1, b=2 | stats count as c by b, a | fields *
# Field order => [*]
# Result schema => [a, b, c]
5. Difference from current spec
This method causes discrepancy with current Calcite PPL logic where:
- Current logic uses field order in the input schema as default
- Current logic append new fields at the end when a command adds fields (some command specify fields)
- Current logic applies wild card to the fields available when
fieldscommand is applied
Examples:
# suppose idx schema = [id, name, address]
source=idx | eval a=1
# current logic -> [id, name, address, a]
# new logic -> [a, address, id, name]
source=idx | fields a*, name, * | eval a=1
# current logic -> [address, name, id, a] (`a` comes at the end as it was added after `fields` command)
# new logic -> [a, address, name, id] (`a` comes first as it matches `a*`)
Option1: Apply new logic to all the query
- Consistent behavior among all the queries
- It could change the result layout for existing query
Option2(recommended): Apply new logic only when result contains dynamic fields
- It only affect the queries executable with dynamic fields feature (No surprise to existing users).
- Could be confusing as it change behavior when dynamic field is introduced.
- Apply new logic to all the query in the next major version
6. Alternative Solutions considered
- Store field order information in each data record
- Not efficient
Metadata
Metadata
Assignees
Labels
Type
Projects
Status
Status