Skip to content

Conversation

TheSisb
Copy link

@TheSisb TheSisb commented Sep 26, 2025

Hey all 👋 ,

tl;dr

I'm using danfojs in a project and noticed some severe performance issues:
image

I tracked it down to the .agg() calls:
image

16seconds seemed unreasonable, so I figured there would be easy wins to make.

After my fixes, I brought .agg down to <20ms from 16s

image

Note: all changes remain fully backwards compatible and all existing tests pass.

(I'm lazy so I had AI write this part)

Technical Optimizations Implemented

Data Structure Modernization

  • Replaced Objects with Maps: Migrated from {} to Map<string, data> for O(1) lookups vs O(n) property access
  • Efficient key-value caching: Map<string, ArrayType1D> for keyToValue storage
  // Before
  colDict: { [key: string]: {} } = {}
  keyToValue: { [key: string]: ArrayType1D } = {}

  // After
  private _colDict: Map<string, { [key: string]: ArrayType1D }> = new
  Map()
  keyToValue: Map<string, ArrayType1D> = new Map()

Smart Key Generation with Type-Aware Optimization

  • Cached key generators: Analyze column types once, reuse optimized functions
  • Fast paths for common cases: Single integers, numeric columns, mixed types
  • 25% improvement from eliminating redundant string operations
  // Single integer optimization - fastest path
  if (allInteger && this.colIndex.length === 1) {
    keyGenerator = (values: ArrayType1D) => String(values[0]);
  }

  // Custom concatenation for multiple columns (faster than join)
  keyGenerator = (values: ArrayType1D) => {
    let result = String(values[0]);
    for (let i = 1; i < values.length; i++) {
      result += "-" + String(values[i]);
    }
    return result;
  };

Optimized Array Operations

  • Direct array manipulation: Eliminated expensive Array.fill() and spread operations
  • Pre-allocated arrays: Reduced memory allocation overhead
  • Batch operations: Process multiple values efficiently
  // Before: Expensive array operations
  data[keyName] = Array(valueLen).fill(keyValue)
  data[colName] = [...data[colName], ...dataValue]

  // After: Direct assignment and efficient appending
  const keyArray = new Array(valueLen)
  for (let i = 0; i < valueLen; i++) {
    keyArray[i] = keyValue
  }
  // Direct array extension without recreating

Algorithm Improvements

  • Pre-computed column indices: Avoid repeated array lookups
  • Single-pass group construction: Build all groups in one iteration
  • Cached key-to-value mapping: Store relationships once, reuse everywhere

@TheSisb TheSisb changed the title improve groupBy and agg perf by a factor of 10 Improve groupby's agg performance on large data; from 16s to 20ms for 20k rows Sep 26, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant