Skip to content

[RFC] PPL Field Ordering #5045

@ykmr1224

Description

@ykmr1224

1. Summary

PPL allows specifying field order using names and wildcards.
Current Calcite PPL implementation preserve field order as row type, but it would not work when dynamic fields are introduced by #4984 since all the dynamic fields will be stored in single _MAP field.
This RFC proposes a method to produce query result with expected field order even when the result contains dynamic fields.

2. Motivation

Users expect deterministic output data layout, especially when exporting data.

We would like to replicate the Splunk behavior so that existing Splunk users can easily onboard OpenSearch PPL.

3. Expected behavior of field ordering

  • Default: fields sorted UTF-8 lexicographically.
  • fields: replaces ordering; affects later fields.
  • stats / timechart: define their own output order.
  • join: ignores right-side ordering.
  • rename: modify field order (replace fields in field order)

4. Proposed Solution

  1. Decide a field order based on the query.
  • Field order is a list of string, which contains specific field name and/or wild card notation (e.g. abc, *, prefix*, etc.)
  • The last command which emit field order will decide the query's field order. (The field order specified earlier command will be ignored.)
  • Only a few commands emit field order (fields, stats, timechart, rename)
  1. Once query execution is done, reorganize the result by using the field order from step 1
  • Come up with the list of all the available fields in dynamic fields
  • Reorder fields based on the field order from step 1
  • Fields will be sorted lexicographically for wild card pattern and remaining fields

Examples:

source=idx | eval a=1, b=2 | fields a, * | eval c=3 | fields b, *, a
# Field order => [b, *, a]
# Result schema => [b, c, a]

source=idx | eval a=1, b=2, bc=3, bd=4 | fields a, bd, b*
# Field order => [a, bd, b*]
# Result schema => [a, bd, b, bc]

source=idx | eval a=1, b=2, bc=3, bd=4 | fields *, a
# Field order => [*, a]
# Result schema => [b, bc, bd, a]

source=idx | eval a=1, b=2, bc=3, bd=4 | fields b*, a | eval bx=5
# Field order => [b*, a]
# Result schema => [b, bc, bd, bx, a]

source=idx | eval a=1, b=2 | stats count as c by b, a 
# Field order => [b, a, c]
# Result schema => [b, a, c]

source=idx | eval a=1, b=2 | stats count as c by b, a | fields *
# Field order => [*]
# Result schema => [a, b, c]

5. Difference from current spec

This method causes discrepancy with current Calcite PPL logic where:

  1. Current logic uses field order in the input schema as default
  2. Current logic append new fields at the end when a command adds fields (some command specify fields)
  3. Current logic applies wild card to the fields available when fields command is applied

Examples:

# suppose idx schema = [id, name, address]

source=idx | eval a=1
# current logic -> [id, name, address, a]
# new logic -> [a, address, id, name]

source=idx | fields a*, name, * | eval a=1
# current logic -> [address, name, id, a]  (`a` comes at the end as it was added after `fields` command)
# new logic -> [a, address, name, id] (`a` comes first as it matches `a*`)

Option1: Apply new logic to all the query

  • Consistent behavior among all the queries
  • It could change the result layout for existing query

Option2(recommended): Apply new logic only when result contains dynamic fields

  • It only affect the queries executable with dynamic fields feature (No surprise to existing users).
  • Could be confusing as it change behavior when dynamic field is introduced.
  • Apply new logic to all the query in the next major version

6. Alternative Solutions considered

  • Store field order information in each data record
    • Not efficient

Metadata

Metadata

Assignees

Labels

PPLPiped processing languageRFCRequest For Commentscalcitecalcite migration releated

Type

No type

Projects

Status

New

Status

Not Started

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions