Description
Problem Statement
PPL currently lacks a unified command for converting field values between different data types and formats. Users frequently need to convert data for analysis, calculations, and reporting purposes, but must rely on chaining multiple evaluation functions or writing complex expressions. Common conversion needs include:
- Converting string representations to numeric values
- Transforming timestamp formats for readability or calculations
- Converting duration strings to seconds for mathematical operations
- Normalizing memory values to consistent units
- Extracting numeric values from mixed alphanumeric strings
This creates verbose queries and increases the learning curve for users who need to remember multiple function names and their specific syntax.
Current State
Users must currently use evaluation functions within the eval command to perform conversions:
`source=logs | eval balance_num = tonumber(balance) | eval time_readable = strftime(_time, "%H:%M:%S")`
Limitations of the current approach:
- Requires knowledge of specific function names scattered across documentation
- No unified interface for common conversion operations
- Difficult to apply conversions to multiple fields efficiently
- No support for specialized conversions like duration-to-seconds or memory unit normalization
- Verbose syntax for simple operations
Long-Term Goals
- Provide a unified `convert` command that consolidates common data type conversion operations
- Simplify PPL queries by reducing the need for multiple eval statements
- Support bulk field conversions with wildcard patterns
- Enable in-place field conversion or creation of new fields via AS clause
- Maintain high performance with large datasets through optimized conversion implementations
- Ensure the solution is extensible for future conversion function additions
- Align with SQL plugin architecture and PPL command patterns
- Cover the large majority of user conversion needs observed in common log analysis patterns
Proposal
Implement a convert command in PPL with the following conversion functions:
- Numeric Conversions
  - `auto()` - Automatically converts fields to numbers using best-fit heuristics
  - `num()` - Converts to numbers, removing non-convertible values
  - `rmcomma()` - Removes commas from numeric strings
  - `rmunit()` - Extracts leading numbers and removes trailing text
- Time Conversions
  - `ctime()` - Converts UNIX timestamps to human-readable format
  - `mktime()` - Converts human-readable time strings to epoch time
  - `mstime()` - Converts MM:SS.SSS format to seconds
  - `dur2sec()` - Converts duration format [D+]HH:MM:SS to seconds
- Memory Conversions
  - `memk()` - Converts memory values (with k/m/g suffixes) to kilobytes
- Field Control
  - `none()` - Excludes specific fields from wildcard conversions
Syntax
`... | convert [timeformat=<string>] <convert-function>(<field>) [AS <field>] [<convert-function>(<field>) [AS <field>]]...`
Parameters
- `timeformat`: (Optional) Output format for time conversions (used by ctime and mktime)
- `<convert-function>`: One of the conversion functions listed above
- `<field>`: Field name to convert; supports wildcards (*)
- `AS <field>`: (Optional) Create a new field with the converted value, preserving the original
Approach
Phase 1: Grammar and Parser Updates
- Update OpenSearchPPLParser.g4
  - Add `convertCommand` rule to the `commands` section
  - Define `convertFunction` rule with all conversion function types
  - Add `convertClause` for handling multiple conversions in one command
  - Support the timeformat parameter and AS clause
- Update OpenSearchPPLLexer.g4
  - Add keywords: CONVERT, AUTO, CTIME, MKTIME, MSTIME, DUR2SEC, MEMK, RMCOMMA, RMUNIT, TIMEFORMAT
Phase 2: AST Building
- Create AstBuilder visitor method
  - Implement `visitConvertCommand()` in AstBuilder.java
  - Parse conversion function types and field expressions
  - Handle wildcard field patterns
  - Extract timeformat parameter and AS aliases
  - Build `Convert` UnresolvedPlan node with a list of `ConvertFunction` objects
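For illustration, the plan node built here might carry little more than the parsed conversion specs. A minimal sketch, assuming hypothetical field names on `ConvertFunction` (the actual AST classes may differ):

```java
/** Hypothetical AST node for one conversion spec, e.g. dur2sec(session_time) AS seconds. */
public class ConvertFunction {
  private final String functionName; // e.g. "dur2sec", "auto", "none"
  private final String fieldPattern; // field name or wildcard pattern, e.g. "session_*"
  private final String alias;        // optional AS target; null means convert in place

  public ConvertFunction(String functionName, String fieldPattern, String alias) {
    this.functionName = functionName;
    this.fieldPattern = fieldPattern;
    this.alias = alias;
  }

  public String getFunctionName() { return functionName; }
  public String getFieldPattern() { return fieldPattern; }
  public String getAlias() { return alias; }
}
```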
Phase 3: Calcite UDF Implementation
- Create BaseConversionUDF abstract class
  - Extend `ImplementorUDF` to integrate with Calcite
  - Provide a single constructor accepting the conversion method name
  - Use `ConversionImplementor` (a `NotNullImplementor`) for code generation
  - Implement `getReturnTypeInference()` returning a nullable DOUBLE type
  - Define `toDoubleOrNull()` helper for consistent null handling
- Implement individual UDF classes
  - Create `AutoConvertFunction` extending BaseConversionUDF
  - Create `NumConvertFunction` extending BaseConversionUDF
  - Create `RmcommaConvertFunction` extending BaseConversionUDF
  - Create `RmunitConvertFunction` extending BaseConversionUDF
  - Each class: a simple constructor calling super with the method name (e.g., "autoConvert")
  - Register each function in Calcite's function repository
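Under this design each concrete UDF stays tiny. A sketch of one subclass, assuming the BaseConversionUDF constructor proposed above (not an existing plugin class):

```java
/** Hypothetical UDF wiring: the subclass only names the backing static conversion method. */
public class RmcommaConvertFunction extends BaseConversionUDF {
  public RmcommaConvertFunction() {
    // Resolved to ConversionUtils.rmcommaConvert(Object) during Calcite code generation.
    super("rmcommaConvert");
  }
}
```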
Phase 4: Conversion Logic Implementation
- Create ConversionUtils utility class
  - Implement static methods for each conversion function:
    - `autoConvert()` - Intelligent conversion with comma/unit handling
    - `numConvert()` - Number extraction with comma removal
    - `rmcommaConvert()` - Pure comma removal and parsing
    - `rmunitConvert()` - Leading number extraction
  - Each method signature: `public static Object methodName(Object value)`
  - Return `Double` on success, `null` on failure (consistent with eval + cast)
- Conversion logic details
  - `auto()`: Uses ConversionStrategy.COMPREHENSIVE with a fallback chain
  - `num()`: Uses ConversionStrategy.STANDARD with numeric focus
  - `rmcomma()`: ConversionStrategy.COMMA_ONLY - regex comma removal
  - `rmunit()`: ConversionStrategy.UNIT_ONLY - extract leading numbers
  - All use private helper methods:
    - `tryParseDouble()` - Safe Double.parseDouble with null handling
    - `isPotentiallyConvertible()` - Quick validation before conversion
    - `convertStandard()`, `convertComprehensive()`, etc. - Strategy implementations
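A minimal sketch of two of these static methods, assuming the signatures outlined above; the real fallback chains for auto() and num() would layer additional strategies on top:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of the proposed ConversionUtils; method names follow the phase 4 outline. */
public final class ConversionUtils {
  private static final Pattern LEADING_NUMBER = Pattern.compile("^\\s*(-?\\d+(?:\\.\\d+)?)");

  private ConversionUtils() {}

  /** rmcomma(): strip commas, then parse; null when the result is not numeric. */
  public static Object rmcommaConvert(Object value) {
    if (value == null) return null;
    return tryParseDouble(value.toString().replace(",", ""));
  }

  /** rmunit(): keep the leading number, drop trailing text such as "sec" or "minutes". */
  public static Object rmunitConvert(Object value) {
    if (value == null) return null;
    Matcher m = LEADING_NUMBER.matcher(value.toString());
    return m.find() ? tryParseDouble(m.group(1)) : null;
  }

  /** Safe parse: Double on success, null on failure (mirrors eval + cast semantics). */
  private static Double tryParseDouble(String s) {
    try {
      return Double.parseDouble(s.trim());
    } catch (NumberFormatException e) {
      return null;
    }
  }
}
```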
Phase 5: Integration and Execution
- Calcite integration flow
  - Convert command parsed → AST `Convert` node created
  - Calcite planner resolves UDF function calls
  - `BaseConversionUDF.ConversionImplementor` generates execution code
  - Generated code calls `ConversionUtils.methodName()`
  - Result passed through `toDoubleOrNull()` for type safety
  - Calcite handles expression evaluation and null propagation
- No custom operators needed
  - Leverages Calcite's expression evaluation engine
  - Benefits from Calcite's optimization passes
  - Integrates seamlessly with other Calcite features
  - Standard UDF pattern familiar to developers
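As a rough sketch of the code-generation hook, assuming Calcite's standard NotNullImplementor interface and the ConversionUtils class from phase 4 (the plugin's ImplementorUDF base class may wrap this differently):

```java
import java.lang.reflect.Method;
import java.util.List;
import org.apache.calcite.adapter.enumerable.NotNullImplementor;
import org.apache.calcite.adapter.enumerable.RexToLixTranslator;
import org.apache.calcite.linq4j.tree.Expression;
import org.apache.calcite.linq4j.tree.Expressions;
import org.apache.calcite.linq4j.tree.Types;
import org.apache.calcite.rex.RexCall;

/** Hypothetical implementor: emits a call to the named static method on ConversionUtils. */
class ConversionImplementor implements NotNullImplementor {
  private final String methodName; // e.g. "autoConvert", "rmcommaConvert"

  ConversionImplementor(String methodName) {
    this.methodName = methodName;
  }

  @Override
  public Expression implement(RexToLixTranslator translator, RexCall call,
      List<Expression> translatedOperands) {
    // Generated code ends up calling ConversionUtils.<methodName>(fieldValue).
    Method method = Types.lookupMethod(ConversionUtils.class, methodName, Object.class);
    return Expressions.call(method, translatedOperands);
  }
}
```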
Phase 6: Testing and Documentation
- Unit tests (ConversionUtilsTest.java)
  - Test each conversion function with various inputs
  - Verify null handling for non-convertible values
  - Test edge cases (empty strings, special characters, etc.)
- Integration tests
  - Test command parsing and execution
  - Test with AS clause for new field creation
  - Test wildcard patterns and multiple conversions
- Documentation (docs/user/ppl/cmd/convert.md)
  - Document each conversion function with examples
  - Explain null behavior on conversion failure
  - Provide usage patterns and best practices
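The unit tests can assert directly on the static helpers; a sketch in JUnit 5 style, using the hypothetical ConversionUtils methods from the phase 4 sketch:

```java
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNull;

import org.junit.jupiter.api.Test;

/** Sketch of the unit-test style proposed for ConversionUtilsTest.java. */
class ConversionUtilsTest {

  @Test
  void rmcommaParsesCommaSeparatedNumbers() {
    assertEquals(39225.50, (Double) ConversionUtils.rmcommaConvert("39,225.50"), 0.001);
  }

  @Test
  void rmunitExtractsLeadingNumber() {
    assertEquals(212.0, (Double) ConversionUtils.rmunitConvert("212 sec"), 0.001);
  }

  @Test
  void nonConvertibleValuesReturnNull() {
    assertNull(ConversionUtils.rmcommaConvert("not-a-number"));
    assertNull(ConversionUtils.rmunitConvert(""));
  }
}
```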
Examples
Example 1: Convert all fields to numeric values
`source=accounts | convert auto(*)`
Input:
| account_number | balance | age |
|---|---|---|
| 1 | "39,225" | "32" |
| 2 | "5,686" | "28" |
Output:
| account_number | balance | age |
|---|---|---|
| 1 | 39225 | 32 |
| 2 | 5686 | 28 |
Note: auto() automatically converts string values to numbers where possible. "39,225" becomes 39225 by removing commas, "32" becomes 32.
Example 2: Convert specific fields while excluding others
`source=accounts | convert auto(*) none(account_id)`
Input:
| account_number | balance | age | account_id |
|---|---|---|---|
| 1 | "39,225" | "32" | "ACC001" |
| 2 | "5,686" | "28" | "ACC002" |
Output:
| account_number | balance | age | account_id |
|---|---|---|---|
| 1 | 39225 | 32 | "ACC001" |
| 2 | 5686 | 28 | "ACC002" |
Note: auto() converts balance and age (removing special characters), but none(account_id) excludes account_id from conversion, preserving it as a string.
Example 3: Convert duration to seconds
`source=sessions | convert dur2sec(session_time) dur2sec(login_duration)`
Input:
| user_id | session_time | login_duration |
|---|---|---|
| user1 | "00:15:30" | "00:05:45" |
| user2 | "01:20:15" | "00:10:30" |
Output:
| user_id | session_time | login_duration |
|---|---|---|
| user1 | 930 | 345 |
| user2 | 4815 | 630 |
Note: dur2sec() converts HH:MM:SS format to total seconds. 00:15:30 = (0×3600) + (15×60) + 30 = 930 seconds, 00:05:45 = 345 seconds.
Example 4: Convert with extended duration format (days + time)
`source=logs | convert dur2sec(delay)`
Input:
| request_id | delay |
|---|---|
| req1 | "1+02:10:15" |
| req2 | "00:05:30" |
| req3 | "3+00:02:45" |
Output:
| request_id | delay |
|---|---|
| req1 | 94215 |
| req2 | 330 |
| req3 | 259365 |
Note: dur2sec() handles extended format D+HH:MM:SS where D+ represents days. 1+02:10:15 = (1×86400) + (2×3600) + (10×60) + 15 = 94215 seconds.
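The arithmetic above maps to a small parser; a sketch of how dur2sec() could be computed (an illustrative helper, not the plugin's actual implementation):

```java
/** Converts "[D+]HH:MM:SS" to total seconds, e.g. "1+02:10:15" -> 94215; null if malformed. */
static Double dur2sec(String duration) {
  if (duration == null) return null;
  try {
    long days = 0;
    String rest = duration.trim();
    int plus = rest.indexOf('+');
    if (plus >= 0) {                       // optional "D+" prefix for days
      days = Long.parseLong(rest.substring(0, plus));
      rest = rest.substring(plus + 1);
    }
    String[] parts = rest.split(":");      // HH:MM:SS
    if (parts.length != 3) return null;
    long hours = Long.parseLong(parts[0]);
    long minutes = Long.parseLong(parts[1]);
    double seconds = Double.parseDouble(parts[2]);
    return days * 86400 + hours * 3600 + minutes * 60 + seconds;
  } catch (NumberFormatException e) {
    return null;
  }
}
```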
Example 5: Remove units from field values
`source=metrics | convert rmunit(duration)`
Input:
| metric_id | duration |
|---|---|
| m1 | "212 sec" |
| m2 | "450 minutes" |
| m3 | "30 hours" |
Output:
| metric_id | duration |
|---|---|
| m1 | 212 |
| m2 | 450 |
| m3 | 30 |
Note: rmunit() extracts numbers from the beginning of values and removes trailing text. "212 sec" becomes 212, "450 minutes" becomes 450.
Example 6: Convert memory values to kilobytes
`source=processes | convert memk(memory_usage)`
Input:
| process_id | memory_usage |
|---|---|
| p1 | "512m" |
| p2 | "1g" |
| p3 | "256k" |
Output:
| process_id | memory_usage |
|---|---|
| p1 | 524288 |
| p2 | 1048576 |
| p3 | 256 |
Note: memk() converts memory units to kilobytes. "512m" = 512×1024 = 524288 KB, "1g" = 1×1024×1024 = 1048576 KB, "256k" = 256 KB.
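A sketch of the multiplier logic behind memk(), assuming a single trailing k/m/g suffix and treating a bare number as already being in kilobytes (illustrative only):

```java
/** Converts "512m", "1g", "256k", or a bare number to kilobytes; null if malformed. */
static Double memk(String value) {
  if (value == null || value.isEmpty()) return null;
  String v = value.trim().toLowerCase();
  double multiplier = 1.0;                           // default: value already in KB
  char suffix = v.charAt(v.length() - 1);
  if (suffix == 'k' || suffix == 'm' || suffix == 'g') {
    if (suffix == 'm') multiplier = 1024.0;          // MB -> KB
    if (suffix == 'g') multiplier = 1024.0 * 1024.0; // GB -> KB
    v = v.substring(0, v.length() - 1);
  }
  try {
    return Double.parseDouble(v) * multiplier;
  } catch (NumberFormatException e) {
    return null;
  }
}
```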
Example 7: Convert UNIX timestamp to readable format
`source=events | convert timeformat="%H:%M:%S" ctime(unix_time) AS readable_time`
Input:
| event_id | unix_time |
|---|---|
| e1 | 1522196414 |
| e2 | 1522196465 |
| e3 | 1522196823 |
Output:
| event_id | unix_time | readable_time |
|---|---|---|
| e1 | 1522196414 | "00:20:14" |
| e2 | 1522196465 | "00:21:05" |
| e3 | 1522196823 | "00:27:03" |
Note: ctime() converts UNIX timestamps to human-readable format. timeformat="%H:%M:%S" shows only hours:minutes:seconds. Original unix_time field is preserved with AS clause.
Example 8: Convert MM:SS.SSS to seconds
`source=responses | convert mstime(response_time)`
Input:
| request_id | response_time |
|---|---|
| r1 | "02:15.500" |
| r2 | "01:30.250" |
| r3 | "03:45.750" |
Output:
| request_id | response_time |
|---|---|
| r1 | 135.5 |
| r2 | 90.25 |
| r3 | 225.75 |
Note: mstime() converts MM:SS.SSS to total seconds. "02:15.500" = (2×60) + 15.5 = 135.5 seconds.
Example 9: Remove commas from numeric values
`source=accounts | convert rmcomma(balance)`
Input:
| account_id | balance |
|---|---|
| acc1 | "39,225.50" |
| acc2 | "5,686.75" |
| acc3 | "32,838.25" |
Output:
| account_id | balance |
|---|---|
| acc1 | 39225.50 |
| acc2 | 5686.75 |
| acc3 | 32838.25 |
Note: rmcomma() removes all commas from values. "39,225.50" becomes 39225.50 for proper numeric calculations.
Example 10: Multiple conversions in one command
`source=data | convert rmcomma(balance) auto(age) memk(memory) rmunit(duration)`
Input:
| id | balance | age | memory | duration |
|---|---|---|---|---|
| 1 | "39,225" | "32" | "512m" | "45 minutes" |
| 2 | "5,686" | "28" | "1g" | "30 seconds" |
Output:
| id | balance | age | memory | duration |
|---|---|---|---|---|
| 1 | 39225 | 32 | 524288 | 45 |
| 2 | 5686 | 28 | 1048576 | 30 |
Note: Multiple convert functions can be used in one command. rmcomma() removes commas, auto() converts to numbers, memk() converts to KB, rmunit() removes text.
Example 11: Create new fields with AS clause
`source=logs | convert dur2sec(time_elapsed) AS time_seconds | stats sum(time_seconds) by user_id`
Input:
| user_id | time_elapsed |
|---|---|
| user1 | "01:30:45" |
| user2 | "02:15:30" |
| user3 | "00:45:15" |
After convert:
| user_id | time_elapsed | time_seconds |
|---|---|---|
| user1 | "01:30:45" | 5445 |
| user2 | "02:15:30" | 8130 |
| user3 | "00:45:15" | 2715 |
Note: dur2sec() converts HH:MM:SS to seconds, then the AS clause creates a new field time_seconds while preserving the original time_elapsed field. The converted values can then be used in aggregations like stats sum().
Alternative
Temporary workarounds users can employ:
- Continue using eval with tonumber(), tostring(), and other existing functions
- Pre-process data during ingestion to normalize formats
- Create custom UDFs for specific conversion needs
However, these alternatives:
- Don't provide the unified interface that reduces cognitive load
- Require more verbose queries
- Don't cover all conversion scenarios (e.g., dur2sec, memk)
- Don't support efficient bulk conversions with wildcards
Implementation Discussion
- Phased Implementation: Should we implement all 10 functions in the initial release, or prioritize based on usage patterns?
  - Recommendation: Start with numeric conversions (auto, num, rmcomma, rmunit) and time conversions (ctime, dur2sec) as these cover 80% of use cases
- Error Handling: How should the command handle conversion failures?
  - Option A: Silently leave unconvertible values unchanged
  - Option B: Set failed conversions to null (matches eval behavior) ✓
  - Option C: Provide warning logs for troubleshooting
  - Recommendation: Option B for consistency with existing PPL behavior
- Performance Optimization: For wildcard conversions, should we:
  - Analyze all field values to determine convertibility before applying?
  - Apply conversion attempts and roll back if too many failures occur?
  - Recommendation: Use a sampling approach - analyze the first N rows to determine field convertibility
- Timeformat Default: Should we use locale-aware defaults or a fixed format?
  - Recommendation: Use an ISO 8601-style format (%Y-%m-%d %H:%M:%S) as the default for international compatibility
- Extensibility: Should we design an interface for adding custom conversion functions?
  - Recommendation: No interface needed (the BaseConversionUDF abstract class is sufficient)
- Wildcard Performance: How to optimize wildcard matching with large field counts?
  - Recommendation: Build a field pattern cache at planning time, apply conversions in streaming fashion
- AS Clause Scoping: When using the AS clause, should the original field remain accessible downstream?
  - Recommendation: Yes, both original and converted fields should be available unless explicitly dropped