
SQLAlchemy + Pandas very slow when compared to AWS Wrangler #254

Open
avibrazil opened this issue Nov 25, 2021 · 17 comments

@avibrazil

avibrazil commented Nov 25, 2021

I tested the same query, which returns more than 2 million rows.

AWS Wrangler takes 1m34s to return a DataFrame.
SQLAlchemy + PyAthena takes 16m37s to return a DataFrame.

See attached notebook for proof and methods.

Also, PyAthena apparently returns an object almost 9% larger. But this may be due to Pandas, SQLAlchemy, data types, and other minor factors that I won't worry about right now.

@avibrazil
Author

Thank you. I've seen all these links and methods before.

But they don't clarify how to use this with SQLAlchemy, and I need to use SQLAlchemy.

So, is there any way to use PyAthena with SQLAlchemy and Pandas that achieves acceptable performance?

Maybe a flag in the URL or something.

@laughingman7743
Owner

I do not have an implementation that uses a combination of SQLAlchemy and Pandas. What is the use case for that?

@avibrazil
Author

I have a Pandas application that is database-independent. SQLAlchemy provides a single, unified interface, so the code doesn't need to change when switching database backends; only the URL does.

@laughingman7743
Owner

For now, the current implementation does not allow using PandasCursor and SQLAlchemy in combination.
It would need some modification, but by implementing cursor switching via the connection URL, we could possibly speed up the retrieval of query results.

@avibrazil
Author

avibrazil commented Nov 26, 2021

Are there any guides on how to use it? I'm using the SQLAlchemy query builder to query Athena through pandas' read_sql_query(), that is, the use case I put in code in the notebook attached to this bug report.

Also, I'd suggest being more explicit about this problem when using PyAthena with SQLAlchemy and Pandas. Simply suggesting to bypass SQLAlchemy altogether is not a solution.

@laughingman7743
Owner

What do you need a guide for, a guide for PandasCursor?
It has a DB-API 2.0 interface, the same as the default cursor. Data is fetched the same way.

@avibrazil
Author

I don't know how to use it in this situation:

import sqlalchemy
import pandas as pd

conn_str = "awsathena+rest://:@athena.{region_name}.amazonaws.com:443/{schema_name}?work_group={work_group}&compression=snappy"

pyathena_sqlalchemy_conn = sqlalchemy.create_engine(
    url=conn_str.format(
        region_name=region_name,
        schema_name=schema_name,
        work_group=work_group,
    ),
    echo_pool=True,
)

pyathena_df = pd.read_sql_query(query, con=pyathena_sqlalchemy_conn)

What should I pass to pandas.read_sql_query() in the con argument in order to use PandasCursor? And how to get it with SQLAlchemy?

Thank you in advance

@laughingman7743
Owner

In the current implementation, SQLAlchemy and PandasCursor cannot be used in combination. As long as you use SQLAlchemy, you cannot change the default cursor.

If you do not use SQLAlchemy, you can use PandasCursor as follows.

from pyathena import connect
from pyathena.pandas.cursor import PandasCursor

cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
                 region_name="REGION_NAME",
                 schema_name="SCHEMA_NAME",
                 work_group="WORK_GROUP",
                 cursor_class=PandasCursor).cursor()
df = cursor.execute(query).as_pandas()

@laughingman7743
Owner

If you want to use PandasCursor in combination with SQLAlchemy, you would probably need to modify the part that assembles the connection object's arguments from the URL.
https://github.com/laughingman7743/PyAthena/blob/master/pyathena/sqlalchemy_athena.py#L302-L332
There is no guarantee that SQLAlchemy will work correctly after switching to PandasCursor.
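As a rough illustration of what "assembling the arguments from the URL" could look like, here is a minimal sketch that resolves a cursor class from a query parameter in the connection URL. The `cursor_class` parameter name, the mapping, and both cursor classes are stand-ins, not PyAthena's actual API:

```python
# Hypothetical sketch: pick a cursor class based on a query parameter in the
# connection URL, similar to what a dialect's connect-args assembly could do.
# All names below are illustrative assumptions, not PyAthena's real API.
from urllib.parse import urlparse, parse_qs


class DefaultCursor:          # stand-in for pyathena.cursor.Cursor
    pass


class PandasCursor:           # stand-in for pyathena.pandas.cursor.PandasCursor
    pass


CURSOR_CLASSES = {
    "default": DefaultCursor,
    "pandas": PandasCursor,
}


def connect_args_from_url(url):
    """Build a connect-args dict, resolving an optional cursor_class parameter."""
    parsed = urlparse(url)
    query = parse_qs(parsed.query)
    args = {"schema_name": parsed.path.lstrip("/")}
    cursor_name = query.get("cursor_class", ["default"])[0]
    args["cursor_class"] = CURSOR_CLASSES[cursor_name]
    return args


args = connect_args_from_url(
    "awsathena+rest://:@athena.us-east-1.amazonaws.com:443/mydb?cursor_class=pandas"
)
# args["cursor_class"] is PandasCursor; args["schema_name"] is "mydb"
```

The point is that the cursor choice becomes part of the URL, so an application that only configures a connection string never has to import PyAthena classes directly.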

@avibrazil
Author

avibrazil commented Nov 26, 2021

This new method apparently reduced query time from 16 min to 6m35s. But that is still far from beating AWS Wrangler's 1m15s (without SQLAlchemy, granted):

import sqlalchemy
import pandas as pd

conn_str = "awsathena+rest://:@athena.{region_name}.amazonaws.com:443/{schema_name}?work_group={work_group}&compression=snappy"

sqlalchemy_connect_args = {}
if "awsathena+rest" in conn_str:
    from pyathena.pandas.cursor import PandasCursor
    sqlalchemy_connect_args.update(
        dict(
            cursor_class=PandasCursor
        )
    )

pyathena_sqlalchemy_conn = sqlalchemy.create_engine(
    url=conn_str.format(
        region_name=region_name,
        schema_name=schema_name,
        work_group=work_group,
    ),
    connect_args=sqlalchemy_connect_args,
    echo_pool=True,
)

# raw_connection() returns a DB-API connection whose cursors are PandasCursor
raw_conn = pyathena_sqlalchemy_conn.raw_connection()

pyathena_df = pd.read_sql_query(query, con=raw_conn)

pyathena_df.info()

It consists of getting a PandasCursor-backed connection via SQLAlchemy's raw_connection().

But I don't know yet whether I can reuse the cursor or should get a new one for each query.
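For what it's worth, DB-API 2.0 cursors are generally designed to be reused: each execute() discards the previous result set. Whether PandasCursor behaves identically isn't confirmed in this thread, but the general pattern, sketched here with the stdlib sqlite3 module as a stand-in for a PyAthena connection, is:

```python
# Sketch of DB-API 2.0 cursor reuse, using sqlite3 as a stand-in for a
# PyAthena connection. One cursor can run execute() repeatedly; each
# execute() replaces the previous result set.
import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()

cursor.execute("CREATE TABLE t (id INTEGER, name TEXT)")
cursor.execute("INSERT INTO t VALUES (1, 'a'), (2, 'b')")

# Reuse the same cursor for several queries.
cursor.execute("SELECT COUNT(*) FROM t")
count = cursor.fetchone()[0]                   # 2

cursor.execute("SELECT name FROM t ORDER BY id")
names = [row[0] for row in cursor.fetchall()]  # ['a', 'b']

conn.close()
```

If the driver follows the spec, a fresh cursor per query is only needed when you want to interleave fetches from two result sets at once.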

@avibrazil
Author

Here is a simplification: there's no need to get a raw_connection() at all. Simply using the engine constructed with a PandasCursor gives me a 7m14s query time for my 2-million-row query, instead of 17 min.

import sqlalchemy
import pandas as pd

conn_str = "awsathena+rest://:@athena.{region_name}.amazonaws.com:443/{schema_name}?work_group={work_group}&compression=snappy"

sqlalchemy_connect_args = {}
if "awsathena+rest" in conn_str:
    from pyathena.pandas.cursor import PandasCursor
    sqlalchemy_connect_args.update(
        dict(
            cursor_class=PandasCursor
        )
    )

pyathena_sqlalchemy_conn = sqlalchemy.create_engine(
    url=conn_str.format(
        region_name=region_name,
        schema_name=schema_name,
        work_group=work_group,
    ),
    connect_args=sqlalchemy_connect_args,
    echo_pool=True,
)

pyathena_df = pd.read_sql_query(query, con=pyathena_sqlalchemy_conn)

Still, PyAthena alone or AWS Wrangler (both without SQLAlchemy) run the same query in 1m15s for the 2-million-row result.

The villain here is SQLAlchemy, or the integration between SQLAlchemy and PyAthena.

I hope this can be solved in the future so that PyAthena + SQLAlchemy delivers the same performance as PyAthena alone. SQLAlchemy is desirable because it makes code more elegant and succinct, and keeps the application independent of the database engine.

@avibrazil
Author

For reference, here is the notebook used in this experiment:
AWS Wrangler × PyAthena × SQLAlchemy.zip.

@laughingman7743
Owner

The PandasCursor itself already holds a DataFrame, but it exposes fetching from that DataFrame to comply with the DB-API interface.
https://github.com/laughingman7743/PyAthena/blob/master/pyathena/pandas/result_set.py#L100-L133
Pandas' read_sql_query method calls the cursor's fetchall method to create the DataFrame.
https://github.com/pandas-dev/pandas/blob/945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5/pandas/io/sql.py#L1521-L1602
The process generates an array of tuples from the DataFrame and then re-creates the DataFrame, which is a slow process in general.
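The round trip described above can be sketched with a mock result set: columnar data that already exists (standing in for PandasCursor's DataFrame) must be flattened into row tuples to satisfy fetchall(), and the caller then rebuilds the columnar table from those tuples. The class and function names here are illustrative, not PyAthena's:

```python
# Minimal sketch of the overhead: a result set that already holds columnar
# data still has to emit rows as tuples for DB-API fetchall(), and the
# caller rebuilds the table from those tuples. Two full copies of the data
# are made even though the columnar form existed from the start.
class MockPandasResultSet:
    def __init__(self, columns, column_data):
        self.columns = columns            # e.g. ["id", "name"]
        self.column_data = column_data    # columnar storage, like a DataFrame

    def fetchall(self):
        # Flatten columnar data into a list of row tuples (first extra copy).
        return list(zip(*self.column_data))


def rebuild_table(columns, rows):
    # What read_sql_query effectively does: turn the row tuples back into
    # columns (second extra copy).
    return {col: list(vals) for col, vals in zip(columns, zip(*rows))}


rs = MockPandasResultSet(["id", "name"], [[1, 2, 3], ["a", "b", "c"]])
rows = rs.fetchall()                      # [(1, 'a'), (2, 'b'), (3, 'c')]
table = rebuild_table(rs.columns, rows)   # identical content, copied twice
```

For a 2-million-row result set, both copies are over millions of Python tuples and objects, which is plausibly where the SQLAlchemy path loses its time.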

@avibrazil
Author

avibrazil commented Nov 26, 2021

So indeed all this abstraction from SQLAlchemy comes at a cost.
Anyway, although not optimal, I'm happy to get a performance improvement with PandasCursor.

laughingman7743 added a commit that referenced this issue Jan 24, 2022
cansjt pushed a commit to cansjt/PyAthena that referenced this issue Jan 31, 2022
laughingman7743 added a commit that referenced this issue Feb 27, 2022
laughingman7743 added a commit that referenced this issue Mar 4, 2022
Use to_dict instead of Iterrows (ref #254)
@laughingman7743
Owner

laughingman7743 commented Dec 18, 2022

I have implemented various performance improvements.

I think the Pandas cursor can fetch a large result set at the same speed as awswrangler (aws-sdk-pandas) by combining the unload option with our own S3 filesystem implementation (implemented in v2.18.0).
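A hedged sketch of what enabling the unload option might look like; the exact parameter names and the availability of a Pandas-flavored SQLAlchemy dialect depend on the installed PyAthena version, so treat this as a pattern to verify against the project's README rather than a confirmed API:

```python
# The direct-PyAthena form requires AWS credentials, so it is left as a
# comment here (parameter names per the project's documented examples;
# verify against your installed version):
#
# from pyathena import connect
# from pyathena.pandas.cursor import PandasCursor
# cursor = connect(s3_staging_dir="s3://YOUR_S3_BUCKET/path/to/",
#                  region_name="REGION_NAME",
#                  cursor_class=PandasCursor).cursor(unload=True)
# df = cursor.execute(query).as_pandas()

# For SQLAlchemy, newer versions reportedly accept the option as a
# connection-URL parameter; only the URL construction is executed here.
conn_str = (
    "awsathena+pandas://:@athena.{region_name}.amazonaws.com:443/"
    "{schema_name}?work_group={work_group}&unload=true"
)
url = conn_str.format(
    region_name="us-east-1", schema_name="mydb", work_group="primary"
)
```

With unload enabled, Athena writes results to S3 as Parquet instead of CSV, which is what closes the gap with aws-sdk-pandas for large result sets.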
