Replace df.iterrows() with df.itertuples() for Significant Performance Gains #86

@aditya0by0

Description

The current codebase uses pandas.DataFrame.iterrows() to iterate over rows. This issue proposes replacing it with pandas.DataFrame.itertuples(), which is typically much faster and lighter on memory because it yields one namedtuple per row instead of constructing a Series for each row.

📉 Performance Hierarchy

The general performance order for row-wise operations in pandas (from fastest to slowest) is roughly:

  1. Vectorization (best)
  2. Custom Cython routines
  3. apply with a vectorized function
  4. Cythonized reductions
  5. List comprehensions
  6. itertuples()
  7. iterrows() ⚠️
  8. Row-by-row updates with .loc on an empty DataFrame (worst)

This hierarchy is visualized in the following benchmark chart:

[Image: benchmark chart, from a Medium blog post]
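
The ordering can be checked locally with a small timing harness. The following is only a sketch: the synthetic DataFrame, its size, the column names a and b, and the three helper functions are assumptions for illustration, not code from this repository. Absolute timings will vary with pandas version, dtypes, and DataFrame shape.

```python
import timeit

import pandas as pd

# Synthetic data; size and column names are illustrative assumptions.
df = pd.DataFrame({"a": range(5_000), "b": range(5_000)})


def sum_vectorized(frame):
    # Column-wise arithmetic; no Python-level loop over rows.
    return int((frame["a"] * frame["b"]).sum())


def sum_itertuples(frame):
    # One namedtuple per row, fields accessed as attributes.
    return sum(row.a * row.b for row in frame.itertuples(index=False))


def sum_iterrows(frame):
    # One (index, Series) pair per row, fields accessed by label.
    return sum(row["a"] * row["b"] for _, row in frame.iterrows())


if __name__ == "__main__":
    for fn in (sum_vectorized, sum_itertuples, sum_iterrows):
        seconds = timeit.timeit(lambda: fn(df), number=3)
        print(f"{fn.__name__}: {seconds:.4f} s for 3 runs")
```

All three functions compute the same value, so the comparison measures only iteration overhead.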

🚀 Why Use itertuples()?

  • Returns namedtuples (faster object creation and field access).
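  • Avoids the per-row overhead of iterrows(), which builds a new Series (with its own index) for every row and upcasts mixed column dtypes to a common dtype in the process.
  • Benchmarks commonly show itertuples() running 10× to 20× faster than iterrows(), though the exact factor depends on the DataFrame's shape and dtypes.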
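
The dtype point can be seen directly. A minimal sketch, assuming a two-column frame with hypothetical column names qty (integer) and ratio (float): because iterrows() packs each row into a single Series, the integer value comes back as a float, while itertuples() leaves it as an integer.

```python
import pandas as pd

# Hypothetical columns: one integer, one float.
df = pd.DataFrame({"qty": [1, 2], "ratio": [0.5, 1.5]})

# iterrows() yields (index, Series); a Series has a single dtype,
# so the integer column is upcast to float64 within the row.
_, first_series = next(df.iterrows())
qty_from_iterrows = first_series["qty"]

# itertuples() yields a namedtuple; each field keeps its column's type.
first_tuple = next(df.itertuples(index=False))
qty_from_itertuples = first_tuple.qty

print(type(qty_from_iterrows).__name__, type(qty_from_itertuples).__name__)
```

This matters during a migration: code that relied on the float upcast from iterrows() may see integers after switching to itertuples().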

✅ Example

import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'age': [25, 30]
})

# Efficient way
for row in df.itertuples(index=False):
    print(f"{row.name} is {row.age} years old")
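
For existing loops that also use the row index, a migration might look like the sketch below. The index labels u1 and u2, the helper function names, and the "first name" column are hypothetical. Two details are worth noting: with the default index=True the index is exposed as row.Index, and column names that are not valid Python identifiers are replaced by positional field names.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Alice", "Bob"], "age": [25, 30]},
    index=["u1", "u2"],  # hypothetical index labels
)


def describe_with_iterrows(frame):
    # Before: unpack (index, Series) and look fields up by label.
    return [f"{idx}: {row['name']} is {row['age']}" for idx, row in frame.iterrows()]


def describe_with_itertuples(frame):
    # After: the index is available as row.Index, columns as attributes.
    return [f"{row.Index}: {row.name} is {row.age}" for row in frame.itertuples()]


# A column name with a space cannot be an attribute name, so the
# namedtuple field is renamed to its position ("_0" here).
odd = pd.DataFrame({"first name": ["Alice"], "age": [25]})
odd_fields = next(odd.itertuples(index=False))._fields

print(describe_with_itertuples(df))
print(odd_fields)
```

Where columns have such names, renaming them up front or accessing fields by position (row[0]) keeps the loop readable.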
