Status: Open
Labels: priority: low
Description:
The current codebase uses pandas.DataFrame.iterrows() to iterate over rows. This issue proposes replacing it with pandas.DataFrame.itertuples(), which is typically much faster and lighter on memory because it does not build a Series object for every row.
📉 Performance Hierarchy
The general performance order for row-wise operations in Pandas (from fastest to slowest) is:
- Vectorization (best)
- Custom Cython routines
- apply with a vectorized function
- Cythonized reductions
- List comprehensions
- itertuples() ✅
- iterrows() ⚠️
- Row-by-row updates with .loc on an empty DataFrame (worst)
This hierarchy is visualized in a benchmark chart (image from Medium Blog, not reproduced here).
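To get a feel for the ordering on your own machine, a minimal timing sketch such as the one below can help. The synthetic DataFrame, its column names (a, b), and the three helper functions are illustrative assumptions, not code from this repository, and absolute timings will vary with pandas version and hardware.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic data; sizes and column names are illustrative assumptions.
rng = np.random.default_rng(0)
df = pd.DataFrame({"a": rng.integers(0, 100, 2_000), "b": rng.random(2_000)})


def total_vectorized(frame):
    # Whole-column arithmetic, no Python-level loop.
    return float((frame["a"] * frame["b"]).sum())


def total_itertuples(frame):
    # One namedtuple per row.
    return sum(row.a * row.b for row in frame.itertuples(index=False))


def total_iterrows(frame):
    # One Series per row.
    return sum(row["a"] * row["b"] for _, row in frame.iterrows())


for fn in (total_vectorized, total_itertuples, total_iterrows):
    seconds = timeit.timeit(lambda: fn(df), number=3)
    print(f"{fn.__name__}: {seconds:.4f}s for 3 runs")
```

All three functions compute the same total, so only the elapsed time should differ between them.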
🚀 Why Use itertuples()?
- Returns namedtuples (faster object creation and field access).
- Avoids the overhead of dtype conversion and index handling present in iterrows(), which builds a Series for every row.
- Benchmarks show itertuples() can be 10× to 20× faster than iterrows().
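The dtype point can be checked directly. The sketch below uses a hypothetical two-column frame (qty, ratio) chosen only to mix integer and float data. Because iterrows() packs each row into a single Series, the integer value is upcast to float, whereas itertuples() keeps each column's own type.

```python
import pandas as pd

# Hypothetical frame mixing an integer and a float column.
df = pd.DataFrame({"qty": [1, 2], "ratio": [0.5, 1.5]})

# iterrows(): the row is a Series with one shared dtype, so qty becomes a float.
_, first_series = next(df.iterrows())
print(first_series.dtype, type(first_series["qty"]))

# itertuples(): the row is a namedtuple, so qty stays an integer.
first_tuple = next(df.itertuples(index=False))
print(type(first_tuple.qty), type(first_tuple.ratio))
```

If the loop body relies on integer semantics (for example, using a value as a list index), this difference is a correctness concern as well as a performance one.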
✅ Example
import pandas as pd

df = pd.DataFrame({
    'name': ['Alice', 'Bob'],
    'age': [25, 30]
})

# Efficient way: each row is a namedtuple with one attribute per column
for row in df.itertuples(index=False):
    print(f"{row.name} is {row.age} years old")
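When migrating existing loops, two details are worth keeping in mind. The sketch below is illustrative only: the index labels (u1, u2) and the "first name" column are made up. With the default index=True, the index is exposed as row.Index. Column names that are not valid Python identifiers are renamed positionally.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Alice", "Bob"], "age": [25, 30]},
    index=["u1", "u2"],  # hypothetical index labels
)

# Before: iterrows() yields (index, Series) pairs.
before = [f"{idx}: {row['name']} ({row['age']})" for idx, row in df.iterrows()]

# After: itertuples() exposes the index as row.Index by default.
after = [f"{row.Index}: {row.name} ({row.age})" for row in df.itertuples()]

print(before == after)

# A column name with a space is not a valid identifier, so it is renamed
# to a positional field name.
odd = pd.DataFrame({"first name": ["Alice"], "age": [25]})
odd_row = next(odd.itertuples(index=False))
print(odd_row._fields)
```

For frames with such column names, renaming the columns first or accessing fields by position tends to be less surprising than relying on the generated names.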