[BUG] fugue_sql intermittently throwing segmentation fault errors #462

Open
jstammers opened this issue Apr 11, 2023 · 3 comments


@jstammers

jstammers commented Apr 11, 2023

Minimal Code To Reproduce

Describe the bug
I have a set of unit tests that check code using the fugue_sql API with a DuckDB backend. When I run these tests locally, they all pass. However, when I run them as part of a GitHub Actions workflow, I frequently encounter a segmentation fault at the following location:

Current thread 0x00007f4e61554740 (most recent call first):
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue_duckdb/dataframe.py", line 101 in as_arrow
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue_duckdb/dataframe.py", line 110 in as_local_bounded
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue/dataframe/dataframe.py", line 90 in as_local
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue_duckdb/execution_engine.py", line 521 in convert_yield_dataframe
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue/workflow/_tasks.py", line 147 in set_result
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue/workflow/_tasks.py", line 293 in execute
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/adagio/instances.py", line 683 in run
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/adagio/instances.py", line 171 in run_single
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/adagio/instances.py", line 155 in run_tasks
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/adagio/instances.py", line 129 in run
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/adagio/instances.py", line 270 in run
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue/workflow/_workflow_context.py", line 54 in run
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue/workflow/workflow.py", line 1584 in run
  File "/home/runner/.cache/pypoetry/virtualenvs/****-CeyU5fzd-py3.10/lib/python3.10/site-packages/fugue/sql/api.py", line 107 in fugue_sql

The function that fails has the following form:

import pandas as pd
from fugue import api as fa

def filter_df(
    df: pd.DataFrame,
    outlets: pd.DataFrame,
    adjustments: pd.DataFrame,
):
    query = """keys = SELECT DateId, ProductId, LocationId, AdjustmentFactor, AdjustmentType, id
    FROM adjustments INNER JOIN outlets USING (LocationId)
    fdt = SELECT * FROM keys INNER JOIN df USING (DateId, ProductId, LocationId)"""
    result = fa.fugue_sql(
        query,
        df=df,
        outlets=outlets,
        adjustments=adjustments,
        engine='duckdb',
        as_fugue=True,
    )
    return result.as_pandas()

I have multiple unit tests that call this function. The problem is difficult to isolate because I can't reproduce it locally.
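The "Current thread ... (most recent call first)" layout in the trace above is the format produced by Python's standard `faulthandler` module. As a rough sketch for gathering more detail on CI, the helpers below turn on crash dumps for all threads and show what such a dump looks like. The names `enable_crash_tracebacks` and `current_stacks` are hypothetical, and this only helps with diagnosis; it is not a fix for the segmentation fault.

```python
import faulthandler
import sys
import tempfile


def enable_crash_tracebacks() -> None:
    """Dump every thread's stack to stderr on SIGSEGV and other fatal signals."""
    # Assumes sys.stderr is backed by a real file descriptor.
    if not faulthandler.is_enabled():
        faulthandler.enable(file=sys.stderr, all_threads=True)


def current_stacks() -> str:
    """Return the stacks of all threads right now, in the crash-dump format."""
    # faulthandler writes straight to the file descriptor, so a real file is needed.
    with tempfile.TemporaryFile(mode="w+") as f:
        faulthandler.dump_traceback(file=f, all_threads=True)
        f.seek(0)
        return f.read()
```

Calling `enable_crash_tracebacks()` once at the start of a test session (for example from a `conftest.py`) would make a crash report the stacks of all threads, which may show whether another thread was touching DuckDB at the time.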

In this instance, I have been able to refactor my function to use the Fugue API, but it would be good to be able to use the fugue_sql API for more complex queries where SQL syntax is more suitable.

from fugue import api as fa

df = fa.join(...)
df = fa.filter(...)

Expected behavior
I would expect these unit tests to run successfully.

Environment (please complete the following information):

  • Backend: pandas (duckdb)
  • Backend version: 0.8.2
  • Python version: 3.10
  • OS: linux
@goodwanghan
Collaborator

@jstammers thanks for reporting. What duckdb version are you using?

I remember that in earlier DuckDB versions (<3) I often saw segmentation faults, but I have never seen them in later versions.

@goodwanghan
Collaborator

One problem I have seen in unit tests with DuckDB is erratic behavior when a DuckDB connection is not properly closed at some step, which then causes issues in the steps that follow.

@jstammers
Author

Hi @goodwanghan, thanks for looking into this. I'm currently using 0.7.1, which I believe is the latest version.
It wouldn't surprise me if it's related to a previous DuckDB connection not being properly closed, but for now I will stick with the Fugue API.
