
Missing numpy dependency when running with PyArrow >= 18 #2380

Open
kien-truong opened this issue Mar 6, 2025 · 2 comments · May be fixed by #2397
Assignees
Labels
bug Something isn't working

Comments

@kien-truong

dlt version

1.7

Describe the problem

Because PyArrow >= 18 moved numpy to its optional runtime dependencies, pipelines that use SQL sources with the pyarrow backend fail with an import error:

ModuleNotFoundError: No module named 'numpy'

Expected behavior

Pipelines using the pyarrow backend work with pyarrow >= 18.

Steps to reproduce

  • Install dlt with pyarrow >= 18
  • Run a dlt pipeline using SQL source and pyarrow backend

Operating system

Linux

Runtime environment

Local

Python version

3.12

dlt data source

Any SQL source

dlt destination

No response

Other deployment details

No response

Additional information

No response

@rudolfix rudolfix moved this from Todo to Planned in dlt core library Mar 10, 2025
@rudolfix rudolfix added the bug Something isn't working label Mar 10, 2025
@rudolfix
Collaborator

rudolfix commented Mar 10, 2025

@kien-truong thanks for pointing this out. numpy should be optional and dlt should not raise here. Could you paste the full stack trace?


note on how to fix this:

  1. make sure sql_database works without numpy (arrow backend). There must be a top-level import somewhere after the refactor!
  2. add numpy to arrow extras explicitly
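One common way to implement point 1 is to guard the optional import and fail with an actionable message instead of a bare `ModuleNotFoundError`. A minimal sketch, assuming a generic helper (`require_optional` is a hypothetical name for illustration, not dlt's actual API):

```python
import importlib
from types import ModuleType


def require_optional(module_name: str, install_hint: str) -> ModuleType:
    """Import an optional dependency, or raise a clear, actionable error.

    Hypothetical helper: pyarrow >= 18 no longer pulls in numpy, so a
    lazy import like this turns a bare ModuleNotFoundError into a
    message that tells the user how to fix their environment.
    """
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        raise ModuleNotFoundError(
            f"{module_name} is required for this backend but is not installed. "
            f"{install_hint}"
        ) from exc
```

Usage inside the failing function would then be `np = require_optional("numpy", "Install it with: pip install numpy")` instead of a plain `import numpy as np`.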

@kien-truong
Author

My stack trace is mangled, but the exception comes from this line:

def transpose_rows_to_columns(
    rows: TDataItems, column_names: Iterable[str]
) -> dict[str, Any]:  # dict[str, np.ndarray]
    """Transpose rows (data items) into columns (numpy arrays). Returns a dictionary of {column_name: column_data}
    Uses pandas if available. Otherwise, use numpy, which is slower
    """
    import numpy as np

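For context, the transpose that the quoted function performs can also be done in pure Python with no numpy or pandas at all (slower, but dependency-free). This is an illustrative sketch of the same row-to-column semantics, not dlt's implementation:

```python
from typing import Any, Iterable, Sequence


def transpose_rows_to_columns_pure(
    rows: Sequence[Sequence[Any]], column_names: Iterable[str]
) -> dict[str, list[Any]]:
    """Transpose rows into {column_name: list_of_values}.

    Pure-Python fallback for illustration only; dlt's real function
    uses pandas or numpy because they are much faster on large batches.
    """
    names = list(column_names)
    columns: dict[str, list[Any]] = {name: [] for name in names}
    for row in rows:
        # Pair each value with its column name, positionally.
        for name, value in zip(names, row):
            columns[name].append(value)
    return columns
```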
@zilto zilto linked a pull request Mar 11, 2025 that will close this issue