
datacompy v0.12 spark sample with 5 rows only takes more than a minute to execute on databricks #300

Closed
satniks opened this issue May 18, 2024 · 10 comments
Labels: bug, help wanted, spark

Comments


satniks commented May 18, 2024

I executed the default Spark usage sample in a Databricks notebook (compute cluster running Apache Spark 3.5.0). Surprisingly, it took more than a minute for these sample dataframes of only 5 rows each. The legacy spark compare works nicely and returns results in a few seconds.

Sample code: I just removed the Spark session creation, since a session already exists in a Databricks notebook (a sketch of the sample is reproduced below).
https://capitalone.github.io/datacompy/spark_usage.html

Has anyone verified datacompy 0.12 with Databricks Spark? Does it work as expected, with reasonable performance?
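
For reference, here is a minimal sketch of the kind of comparison being timed, assuming the v0.12 API from the linked docs (pandas-on-Spark DataFrames passed to datacompy.SparkCompare; the column names, tolerances, and data are illustrative, not the exact docs sample):

import pyspark.pandas as ps
import datacompy

# Two tiny illustrative frames, similar in size to the docs sample.
df1 = ps.DataFrame({"acct_id": [1, 2, 3, 4, 5], "amount": [10.0, 20.0, 30.0, 40.0, 50.0]})
df2 = ps.DataFrame({"acct_id": [1, 2, 3, 4, 6], "amount": [10.0, 20.0, 30.1, 40.0, 60.0]})

# v0.12's SparkCompare is built on the pandas-on-Spark API;
# the class name and signature changed in later releases.
compare = datacompy.SparkCompare(
    df1,
    df2,
    join_columns="acct_id",  # column(s) to join the two frames on
    abs_tol=0,               # absolute tolerance for numeric matches
    rel_tol=0,               # relative tolerance for numeric matches
)
print(compare.report())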


fdosani commented May 18, 2024

Interesting. It shouldn't take minutes for 5 rows. I'm wondering if something isn't set up right with the new version; it's using the pandas-on-Spark API under the hood. I can try and test it later today.


fdosani commented May 18, 2024

I just ran the example code you pointed to above, and it took maybe 2–3 seconds to return the results. I haven't tested it in Databricks. For context, this is running on my home desktop with just the default Spark settings.

%%timeit
print(compare.report())
...
...
3.66 s ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Are you able to share your cluster settings?

fdosani added the bug, help wanted, and spark labels May 18, 2024

satniks commented May 19, 2024

Thanks @fdosani for the quick check.

I am using a single-node cluster on AWS Databricks with the following configuration. Nothing else is running on this cluster.

Runtime: 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12)
Node Type: m5d.xlarge, 16GB memory, 4 cores
Spark Config:
spark.master local[*, 4]
spark.databricks.cluster.profile singleNode

Nothing in the init script.
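
Not from this thread, but one common culprit for tiny datasets on Spark is the default of 200 shuffle partitions, each of which carries fixed per-task scheduling overhead in every shuffle stage. A hedged sketch of lowering it in the notebook session (the setting is standard Spark; whether it explains the Databricks slowdown here is an assumption):

# spark is the pre-created session in a Databricks notebook.
# Reduce the default 200 shuffle partitions for small data;
# each partition adds per-task scheduling overhead.
spark.conf.set("spark.sql.shuffle.partitions", "4")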


fdosani commented May 19, 2024

Not really sure how to help here. Can you try running on your local (non-cloud) machine and see if it runs faster? I'm using Spark 3.5.1 on Linux.


satniks commented May 19, 2024


Sure, I will check locally. It would be great if someone from the community could test on Databricks as well.


fdosani commented May 31, 2024

@satniks I was able to recreate this on Databricks. It is odd, but I don't think this is a datacompy issue, rather something under the hood with Databricks. One thing to consider is that for such small data SparkCompare will perform poorly in general. You really need to get into the millions to billions of records, since that is what it is intended for. If you have smallish data, Polars and pandas will be better in all cases.
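
For small data, the pandas path looks roughly like this; a minimal sketch assuming datacompy's standard pandas Compare API (column names and data are illustrative):

import pandas as pd
import datacompy

df1 = pd.DataFrame({"acct_id": [1, 2, 3], "amount": [10.0, 20.0, 30.0]})
df2 = pd.DataFrame({"acct_id": [1, 2, 4], "amount": [10.0, 20.5, 40.0]})

# datacompy.Compare is the pandas-based comparison class.
compare = datacompy.Compare(df1, df2, join_columns="acct_id")
print(compare.report())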


satniks commented May 31, 2024


Thanks @fdosani for confirming on a small dataset on Databricks. I also tried with a few thousand records and it ran for several minutes, so I cancelled the job assuming it would not succeed. I will check again with thousands and more records next week and see how long it takes.


fdosani commented May 31, 2024


Just to confirm, I ran it with various sizes and never had issues with it completing. I just did 100M rows and it took about 3-4 minutes. I'm going to close this issue for now, but if you have other issues feel free to reopen it.

fdosani closed this as completed May 31, 2024

satniks commented Jun 5, 2024

@fdosani, is it possible to keep the legacy spark compare support for a while, as it works seamlessly for our data size?


fdosani commented Jun 5, 2024

@satniks Of course. I've been working on a vanilla Spark version which doesn't use the pandas-on-Spark API. It's much faster if you want to try it out. It's on this branch: https://github.com/capitalone/datacompy/tree/vanilla-spark

The legacy version will stick around for a while; I just don't plan any enhancements.
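
For anyone pinning to the legacy path in the meantime, a hedged sketch; the legacy class was renamed and its import path moved across releases, so LegacySparkCompare and the constructor arguments below are assumptions to verify against your installed version:

from pyspark.sql import SparkSession
from datacompy.legacy import LegacySparkCompare  # import path is a guess; pre-0.12 this was datacompy.SparkCompare

spark = SparkSession.builder.getOrCreate()  # on Databricks the notebook session already exists

# The legacy API compares native Spark DataFrames and takes the session explicitly.
base_df = spark.createDataFrame([(1, 10.0), (2, 20.0)], ["acct_id", "amount"])
compare_df = spark.createDataFrame([(1, 10.0), (2, 21.0)], ["acct_id", "amount"])

comparison = LegacySparkCompare(spark, base_df, compare_df, join_columns=["acct_id"])
comparison.report()  # prints the comparison summary to stdout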
