[RFC] PPL convert Command #5001

@aalva500-prog

Description

Problem Statement

PPL currently lacks a unified command for converting field values between different data types and formats. Users frequently need to convert data for analysis, calculations, and reporting purposes, but must rely on chaining multiple evaluation functions or writing complex expressions. Common conversion needs include:

  • Converting string representations to numeric values
  • Transforming timestamp formats for readability or calculations
  • Converting duration strings to seconds for mathematical operations
  • Normalizing memory values to consistent units
  • Extracting numeric values from mixed alphanumeric strings

This creates verbose queries and increases the learning curve for users who need to remember multiple function names and their specific syntax.

Current State

Users must currently use evaluation functions within the eval command to perform conversions:

source=logs | eval balance_num = tonumber(balance) | eval time_readable = strftime(_time, "%H:%M:%S")

Limitations of the current approach:

  • Requires knowledge of specific function names scattered across documentation
  • No unified interface for common conversion operations
  • Difficult to apply conversions to multiple fields efficiently
  • No support for specialized conversions like duration-to-seconds or memory unit normalization
  • Verbose syntax for simple operations

Long-Term Goals

  • Provide a unified convert command that consolidates common data type conversion operations
  • Simplify PPL queries by reducing the need for multiple eval statements
  • Support bulk field conversions with wildcard patterns
  • Enable in-place field conversion or creation of new fields via AS clause
  • Maintain high performance with large datasets through optimized conversion implementations
  • Ensure the solution is extensible for future conversion function additions
  • Align with SQL plugin architecture and PPL command patterns
  • Address the most common user conversion needs seen in log analysis workloads

Proposal

Implement a convert command in PPL with the following conversion functions:

  1. Numeric Conversions

    • auto() - Automatically converts fields to numbers using best-fit heuristics
    • num() - Converts to numbers, removing non-convertible values
    • rmcomma() - Removes commas from numeric strings
    • rmunit() - Extracts leading numbers and removes trailing text
  2. Time Conversions

    • ctime() - Converts UNIX timestamps to human-readable format
    • mktime() - Converts human-readable time strings to epoch time
    • mstime() - Converts MM:SS.SSS format to seconds
    • dur2sec() - Converts duration format [D+]HH:MM:SS to seconds
  3. Memory Conversions

    • memk() - Converts memory values (with k/m/g suffixes) to kilobytes
  4. Field Control

    • none() - Excludes specific fields from wildcard conversions

Syntax

... | convert [timeformat=<string>] <convert-function>(<field>) [AS <field>] [<convert-function>(<field>) [AS <field>]]...

Parameters

  • timeformat: (Optional) Output format for time conversions (used by ctime and mktime)
  • <convert-function>: One of the conversion functions listed above
  • <field>: Field name to convert, supports wildcards (*)
  • AS <field>: (Optional) Create new field with converted value, preserving original

Approach

Phase 1: Grammar and Parser Updates

  1. Update OpenSearchPPLParser.g4

    • Add convertCommand rule to commands section
    • Define convertFunction rule with all conversion function types
    • Add convertClause for handling multiple conversions in one command
    • Support timeformat parameter and AS clause
  2. Update OpenSearchPPLLexer.g4

    • Add keywords: CONVERT, AUTO, NUM, NONE, CTIME, MKTIME, MSTIME, DUR2SEC, MEMK, RMCOMMA, RMUNIT, TIMEFORMAT
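
The Phase 1 grammar changes might look roughly like the following sketch. Rule and token names here are assumptions for illustration, not the final grammar; the actual parenthesis and alias tokens should follow whatever OpenSearchPPLLexer.g4 already defines:

```antlr
// Hypothetical OpenSearchPPLParser.g4 fragment -- names are illustrative only.
convertCommand
    : CONVERT (TIMEFORMAT EQUAL stringLiteral)? convertClause+
    ;

convertClause
    : convertFunction LT_PRTHS fieldExpression RT_PRTHS (AS fieldExpression)?
    ;

convertFunction
    : AUTO | NUM | CTIME | MKTIME | MSTIME | DUR2SEC | MEMK | RMCOMMA | RMUNIT | NONE
    ;
```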

Phase 2: AST Building

  1. Create AstBuilder visitor method
    • Implement visitConvertCommand() in AstBuilder.java
    • Parse conversion function types and field expressions
    • Handle wildcard field patterns
    • Extract timeformat parameter and AS aliases
    • Build Convert UnresolvedPlan node with list of ConvertFunction objects

Phase 3: Calcite UDF Implementation

  1. Create BaseConversionUDF abstract class

    • Extend ImplementorUDF to integrate with Calcite
    • Provide single constructor accepting conversion method name
    • Use ConversionImplementor (NotNullImplementor) for code generation
    • Implement getReturnTypeInference() returning nullable DOUBLE type
    • Define toDoubleOrNull() helper for consistent null handling
  2. Implement individual UDF classes

    • Create AutoConvertFunction extending BaseConversionUDF
    • Create NumConvertFunction extending BaseConversionUDF
    • Create RmcommaConvertFunction extending BaseConversionUDF
    • Create RmunitConvertFunction extending BaseConversionUDF
    • Each class: simple constructor calling super with method name (e.g., "autoConvert")
    • Register each function in Calcite's function repository
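
The "single constructor accepting a method name" pattern can be illustrated with a framework-free sketch. The real classes would extend Calcite's ImplementorUDF and let generated code perform the dispatch; the reflection below, and the stub ConversionUtils method, are assumptions used only to show the shape of the design:

```java
// Framework-free sketch of the BaseConversionUDF pattern (illustrative only).
import java.lang.reflect.Method;

abstract class BaseConversionUDF {
    private final Method conversionMethod;

    // Single constructor accepting the ConversionUtils method name.
    protected BaseConversionUDF(String methodName) {
        try {
            conversionMethod = ConversionUtils.class.getMethod(methodName, Object.class);
        } catch (NoSuchMethodException e) {
            throw new IllegalArgumentException("Unknown conversion: " + methodName, e);
        }
    }

    // In the plugin, Calcite-generated code performs this call instead of reflection.
    public Object eval(Object value) {
        try {
            return conversionMethod.invoke(null, value);
        } catch (ReflectiveOperationException e) {
            return null;
        }
    }
}

// Minimal stand-in for the Phase 4 utility class (one method shown).
class ConversionUtils {
    public static Object numConvert(Object value) {
        if (value == null) return null;
        try {
            return Double.parseDouble(value.toString().replace(",", ""));
        } catch (NumberFormatException e) {
            return null; // num() drops non-convertible values
        }
    }
}

// Each concrete UDF reduces to a constructor call with the method name.
class NumConvertFunction extends BaseConversionUDF {
    NumConvertFunction() { super("numConvert"); }
}
```

This keeps each of the four UDF classes trivially small while concentrating the conversion behavior in one utility class.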

Phase 4: Conversion Logic Implementation

  1. Create ConversionUtils utility class

    • Implement static methods for each conversion function:
      • autoConvert() - Intelligent conversion with comma/unit handling
      • numConvert() - Number extraction with comma removal
      • rmcommaConvert() - Pure comma removal and parsing
      • rmunitConvert() - Leading number extraction
    • Each method signature: public static Object methodName(Object value)
    • Return Double on success, null on failure (consistent with eval+cast)
  2. Conversion logic details

    • auto(): Uses ConversionStrategy.COMPREHENSIVE with fallback chain

    • num(): Uses ConversionStrategy.STANDARD with numeric focus

    • rmcomma(): ConversionStrategy.COMMA_ONLY - regex comma removal

    • rmunit(): ConversionStrategy.UNIT_ONLY - extract leading numbers

    • All use private helper methods:

      • tryParseDouble() - Safe Double.parseDouble with null handling
      • isPotentiallyConvertible() - Quick validation before conversion
      • convertStandard(), convertComprehensive(), etc. - Strategy implementations
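
A minimal sketch of ConversionUtils is shown below. Method names follow the phase description; the exact fallback order inside autoConvert() and the precise semantics of numConvert() are assumptions, not the final behavior:

```java
// Hypothetical sketch of ConversionUtils -- behavior details are assumed.
final class ConversionUtils {
    private ConversionUtils() {}

    // auto(): plain parse, then comma removal, then leading-number extraction.
    public static Object autoConvert(Object value) {
        Object result = tryParseDouble(value);
        if (result == null) result = rmcommaConvert(value);
        if (result == null) result = rmunitConvert(value);
        return result;
    }

    // num(): numeric focus -- comma removal plus plain parse, null otherwise.
    public static Object numConvert(Object value) {
        return rmcommaConvert(value);
    }

    // rmcomma(): strip commas, then parse.
    public static Object rmcommaConvert(Object value) {
        if (value == null) return null;
        return tryParseDouble(value.toString().replace(",", ""));
    }

    // rmunit(): extract the leading number and drop trailing text.
    public static Object rmunitConvert(Object value) {
        if (value == null) return null;
        java.util.regex.Matcher m = java.util.regex.Pattern
                .compile("^\\s*(-?\\d+(?:\\.\\d+)?)").matcher(value.toString());
        return m.find() ? tryParseDouble(m.group(1)) : null;
    }

    // Safe Double.parseDouble with null handling.
    private static Double tryParseDouble(Object value) {
        if (value == null) return null;
        try {
            return Double.parseDouble(value.toString().trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }
}
```

Returning Double on success and null on failure keeps the contract identical across all four functions, which is what lets BaseConversionUDF treat them interchangeably.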

Phase 5: Integration and Execution

  1. Calcite integration flow

    • Convert command parsed → AST Convert node created
    • Calcite planner resolves UDF function calls
    • BaseConversionUDF.ConversionImplementor generates execution code
    • Generated code calls ConversionUtils.methodName()
    • Result passed through toDoubleOrNull() for type safety
    • Calcite handles expression evaluation and null propagation
  2. No custom operators needed

    • Leverage Calcite's expression evaluation engine
    • Benefits from Calcite's optimization passes
    • Integrates seamlessly with other Calcite features
    • Standard UDF pattern familiar to developers

Phase 6: Testing and Documentation

  1. Unit tests (ConversionUtilsTest.java)

    • Test each conversion function with various inputs
    • Verify null handling for non-convertible values
    • Test edge cases (empty strings, special characters, etc.)
  2. Integration tests

    • Test command parsing and execution
    • Test with AS clause for new field creation
    • Test wildcard patterns and multiple conversions
  3. Documentation (docs/user/ppl/cmd/convert.md)

    • Document each conversion function with examples
    • Explain null behavior on conversion failure
    • Provide usage patterns and best practices

Examples

Example 1: Convert all fields to numeric values

source=accounts | convert auto(*)

Input:

account_number balance age
1 "39,225" "32"
2 "5,686" "28"

Output:

account_number balance age
1 39225 32
2 5686 28

Note: auto() converts string values to numbers where possible: "39,225" becomes 39225 after comma removal, and "32" becomes 32.

Example 2: Convert specific fields while excluding others

source=accounts | convert auto(*) none(account_id)

Input:

account_number balance age account_id
1 "39,225" "32" "ACC001"
2 "5,686" "28" "ACC002"

Output:

account_number balance age account_id
1 39225 32 "ACC001"
2 5686 28 "ACC002"

Note: auto() converts balance and age (removing commas), while none(account_id) excludes account_id from conversion, preserving it as a string.

Example 3: Convert duration to seconds

source=sessions | convert dur2sec(session_time) dur2sec(login_duration)

Input:

user_id session_time login_duration
user1 "00:15:30" "00:05:45"
user2 "01:20:15" "00:10:30"

Output:

user_id session_time login_duration
user1 930 345
user2 4815 630

Note: dur2sec() converts HH:MM:SS format to total seconds. 00:15:30 = (0×3600) + (15×60) + 30 = 930 seconds, 00:05:45 = 345 seconds.

Example 4: Convert with extended duration format (days + time)

source=logs | convert dur2sec(delay)

Input:

request_id delay
req1 "1+02:10:15"
req2 "00:05:30"
req3 "3+00:02:45"

Output:

request_id delay
req1 94215
req2 330
req3 259365

Note: dur2sec() handles extended format D+HH:MM:SS where D+ represents days. 1+02:10:15 = (1×86400) + (2×3600) + (10×60) + 15 = 94215 seconds.
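
The day-carrying arithmetic in Examples 3 and 4 can be sketched as follows. Class and method names are illustrative, not the plugin's:

```java
// Sketch of dur2sec parsing for the [D+]HH:MM:SS format (names are assumed).
final class Dur2Sec {
    // Returns total seconds, or null if the value does not match the format.
    static Double dur2sec(String value) {
        if (value == null) return null;
        java.util.regex.Matcher m = java.util.regex.Pattern
                .compile("^(?:(\\d+)\\+)?(\\d{1,2}):(\\d{2}):(\\d{2})$")
                .matcher(value.trim());
        if (!m.matches()) return null;
        long days = m.group(1) == null ? 0 : Long.parseLong(m.group(1));
        long h = Long.parseLong(m.group(2));
        long min = Long.parseLong(m.group(3));
        long s = Long.parseLong(m.group(4));
        return (double) (days * 86400 + h * 3600 + min * 60 + s);
    }
}
```

The optional `(\d+)\+` group makes the plain HH:MM:SS case and the D+HH:MM:SS case fall out of a single pattern.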

Example 5: Remove units from field values

source=metrics | convert rmunit(duration)

Input:

metric_id duration
m1 "212 sec"
m2 "450 minutes"
m3 "30 hours"

Output:

metric_id duration
m1 212
m2 450
m3 30

Note: rmunit() extracts numbers from the beginning of values and removes trailing text. "212 sec" becomes 212, "450 minutes" becomes 450.

Example 6: Convert memory values to kilobytes

source=processes | convert memk(memory_usage)

Input:

process_id memory_usage
p1 "512m"
p2 "1g"
p3 "256k"

Output:

process_id memory_usage
p1 524288
p2 1048576
p3 256

Note: memk() converts memory units to kilobytes. "512m" = 512×1024 = 524288 KB, "1g" = 1×1024×1024 = 1048576 KB, "256k" = 256 KB.
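
The unit multipliers above can be sketched like this. The class name is illustrative, and treating a bare number the same as a "k" suffix is an assumption the RFC would need to confirm:

```java
// Sketch of memk(): converts "<num>[kKmMgG]" strings to kilobytes.
final class MemK {
    static Double memk(String value) {
        if (value == null) return null;
        java.util.regex.Matcher m = java.util.regex.Pattern
                .compile("^\\s*(\\d+(?:\\.\\d+)?)\\s*([kKmMgG]?)\\s*$")
                .matcher(value);
        if (!m.matches()) return null;
        double n = Double.parseDouble(m.group(1));
        switch (m.group(2).toLowerCase()) {
            case "m": return n * 1024;            // megabytes -> KB
            case "g": return n * 1024 * 1024;     // gigabytes -> KB
            default:  return n;                   // "k" or no suffix: already KB (assumed)
        }
    }
}
```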

Example 7: Convert UNIX timestamp to readable format

source=events | convert timeformat="%H:%M:%S" ctime(unix_time) AS readable_time

Input:

event_id unix_time
e1 1522196414
e2 1522196465
e3 1522196823

Output:

event_id unix_time readable_time
e1 1522196414 "00:20:14"
e2 1522196465 "00:21:05"
e3 1522196823 "00:27:03"

Note: ctime() converts UNIX timestamps to human-readable format. timeformat="%H:%M:%S" shows only hours:minutes:seconds. Original unix_time field is preserved with AS clause.
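
A java.time-based sketch of this conversion is below. Note two assumptions: the example output implies UTC rendering, and the PPL syntax would accept strftime-style patterns ("%H:%M:%S") that the implementation must translate to java.time patterns ("HH:mm:ss"), which this sketch takes as already translated:

```java
// Sketch of ctime(): format an epoch-second timestamp (UTC assumed).
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

final class CTime {
    // `pattern` is a java.time pattern; mapping from strftime tokens is out of scope here.
    static String ctime(long epochSeconds, String pattern) {
        return DateTimeFormatter.ofPattern(pattern)
                .withZone(ZoneOffset.UTC)
                .format(Instant.ofEpochSecond(epochSeconds));
    }
}
```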

Example 8: Convert MM:SS.SSS to seconds

source=responses | convert mstime(response_time)

Input:

request_id response_time
r1 "02:15.500"
r2 "01:30.250"
r3 "03:45.750"

Output:

request_id response_time
r1 135.5
r2 90.25
r3 225.75

Note: mstime() converts MM:SS.SSS to total seconds. "02:15.500" = (2×60) + 15.5 = 135.5 seconds.
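
The minutes-plus-fractional-seconds arithmetic can be sketched as follows (class and method names are illustrative):

```java
// Sketch of mstime(): converts "MM:SS.SSS" to total seconds.
final class MsTime {
    static Double mstime(String value) {
        if (value == null) return null;
        java.util.regex.Matcher m = java.util.regex.Pattern
                .compile("^(\\d{1,2}):(\\d{2}(?:\\.\\d+)?)$")
                .matcher(value.trim());
        if (!m.matches()) return null;
        // minutes * 60 + (seconds, including the fractional part)
        return Double.parseDouble(m.group(1)) * 60 + Double.parseDouble(m.group(2));
    }
}
```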

Example 9: Remove commas from numeric values

source=accounts | convert rmcomma(balance)

Input:

account_id balance
acc1 "39,225.50"
acc2 "5,686.75"
acc3 "32,838.25"

Output:

account_id balance
acc1 39225.50
acc2 5686.75
acc3 32838.25

Note: rmcomma() removes all commas from values. "39,225.50" becomes 39225.50 for proper numeric calculations.

Example 10: Multiple conversions in one command

source=data | convert rmcomma(balance) auto(age) memk(memory) rmunit(duration)

Input:

id balance age memory duration
1 "39,225" "32" "512m" "45 minutes"
2 "5,686" "28" "1g" "30 seconds"

Output:

id balance age memory duration
1 39225 32 524288 45
2 5686 28 1048576 30

Note: Multiple convert functions can be used in one command. rmcomma() removes commas, auto() converts to numbers, memk() converts to KB, rmunit() removes text.

Example 11: Create new fields with AS clause

source=logs | convert dur2sec(time_elapsed) AS time_seconds | stats sum(time_seconds) by user_id

Input:

user_id time_elapsed
user1 "01:30:45"
user2 "02:15:30"
user3 "00:45:15"

After convert:

user_id time_elapsed time_seconds
user1 "01:30:45" 5445
user2 "02:15:30" 8130
user3 "00:45:15" 2715

Note: dur2sec() converts HH:MM:SS to seconds, then the AS clause creates a new field time_seconds while preserving the original time_elapsed field. The converted values can then be used in aggregations like stats sum().

Alternative

Temporary workarounds users can employ:

  1. Continue using eval with tonumber(), tostring(), and other existing functions
  2. Pre-process data during ingestion to normalize formats
  3. Create custom UDFs for specific conversion needs

However, these alternatives:

  • Don't provide the unified interface that reduces cognitive load
  • Require more verbose queries
  • Don't cover all conversion scenarios (e.g., dur2sec, memk)
  • Don't support efficient bulk conversions with wildcards

Implementation Discussion

  1. Phased Implementation: Should we implement all 10 functions in the initial release, or prioritize based on usage patterns?

    • Recommendation: Start with numeric conversions (auto, num, rmcomma, rmunit) and time conversions (ctime, dur2sec) as these cover 80% of use cases
  2. Error Handling: How should the command handle conversion failures?

    • Option A: Silently leave unconvertible values unchanged
    • Option B: Set failed conversions to null (matches eval behavior) ✓
    • Option C: Provide warning logs for troubleshooting
    • Recommendation: Option B for consistency with existing PPL behavior
  3. Performance Optimization: For wildcard conversions, should we:

    • Analyze all field values to determine convertibility before applying?
    • Apply conversion attempts and roll back if too many failures occur?
    • Recommendation: Use sampling approach - analyze first N rows to determine field convertibility
  4. Timeformat Default: Should we use locale-aware defaults or fixed format?

    • Recommendation: Use ISO 8601 format (%Y-%m-%d %H:%M:%S) as default for international compatibility
  5. Extensibility: Should we design an interface for adding custom conversion functions?

    • Recommendation: No new interface needed (the BaseConversionUDF abstract class is sufficient)
  6. Wildcard Performance: How to optimize wildcard matching with large field counts?

    • Recommendation: Build field pattern cache at planning time, apply conversions in streaming fashion
  7. AS Clause Scoping: When using AS clause, should original field remain accessible downstream?

    • Recommendation: Yes, both original and converted fields should be available unless explicitly dropped
