datacompy v0.12 spark sample with 5 rows only takes more than a minute to execute on databricks #300
Interesting. It shouldn't take minutes for 5 rows. I'm wondering if something isn't set up right with the new version. It's using the pandas-on-Spark API under the hood. I can try to test it later today.
I just ran the example code you pointed to above, and it took maybe 2–3 seconds to return the results. I haven't tested it in Databricks. For context, this is running on my home desktop with just the default Spark settings.

```
%%timeit
print(compare.report())
...
3.66 s ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Are you able to share your cluster settings?
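To reproduce this measurement outside a notebook (where the `%%timeit` magic isn't available), the standard-library `timeit` module gives an equivalent repeated-run timing. This is a minimal sketch: the `report` function here is a placeholder standing in for the real `compare.report()` call, not datacompy code.

```python
import timeit

def report():
    # Placeholder for compare.report(); swap in the real call when profiling.
    return sum(range(100_000))

# Mirror %%timeit's "7 runs, 1 loop each" measurement.
runs = timeit.repeat(report, repeat=7, number=1)
mean = sum(runs) / len(runs)
print(f"{mean:.4f} s per loop (mean of {len(runs)} runs, 1 loop each)")
```

Comparing the local mean against the Databricks run for the same data is a quick way to confirm whether the slowdown is cluster-specific.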
Thanks @fdosani for the quick check. I am using a single-node cluster on AWS Databricks with the following configuration, and there is nothing else running on this cluster. Runtime: 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12). There is nothing in the init script.
Not really sure how to help here. Can you try running on your local (non-cloud) machine and see if it runs faster? I'm using Spark 3.5.1 on Linux.
Sure, I will check locally. It would be great if someone from the community could test on Databricks as well.
@satniks I was able to recreate this on Databricks. It is odd, but I don't think this is a datacompy issue but rather something under the hood with Databricks. One thing to consider is that for such small data SparkCompare will perform poorly in general. You really need to get into the millions to billions of records, since that is what it is intended for. If you have smallish data, Polars and Pandas will be better in all cases.
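To illustrate why pandas is the better fit at this scale: a row-level comparison of two small frames is just an outer merge, with no cluster or job-scheduling overhead. This is a hypothetical sketch (the frames and column names are made up, and this is not datacompy's actual implementation), showing the kind of diff a compare produces:

```python
import pandas as pd

# Hypothetical 5-row frames keyed on acct_id.
df1 = pd.DataFrame({"acct_id": [1, 2, 3, 4, 5],
                    "amount": [10.0, 20.0, 30.0, 40.0, 50.0]})
df2 = pd.DataFrame({"acct_id": [1, 2, 3, 4, 6],
                    "amount": [10.0, 20.0, 31.0, 40.0, 60.0]})

# Outer merge with indicator tells us which rows exist on which side.
merged = df1.merge(df2, on="acct_id", how="outer",
                   suffixes=("_df1", "_df2"), indicator=True)
only_df1 = merged[merged["_merge"] == "left_only"]
only_df2 = merged[merged["_merge"] == "right_only"]
both = merged[merged["_merge"] == "both"]
mismatched = both[both["amount_df1"] != both["amount_df2"]]

print(len(only_df1), len(only_df2), len(mismatched))  # 1 1 1
```

At five rows this runs in milliseconds, which is why spinning up Spark tasks for data this size can only add overhead.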
Thanks @fdosani for confirming the small-dataset behavior on Databricks. I also tried with a few thousand records and it ran for several minutes, so I cancelled the job assuming it would not succeed. I will check again with thousands of records and more next week and see how long it takes.
Just to confirm: I ran it with various sizes and never had issues with it completing. I just did 100M rows and it took about 3–4 minutes. I'm going to close this issue for now, but if you have other issues feel free to reopen it.
@fdosani, is it possible to keep legacy SparkCompare support for a while, as it works seamlessly for our data size?
@satniks of course. I've been working on a vanilla Spark version which doesn't use the pandas-on-Spark API. It's much faster if you want to try it out, on this branch: https://github.com/capitalone/datacompy/tree/vanilla-spark The legacy version will stick around for a while; I just don't plan any enhancements.
I executed the default Spark usage sample in a Databricks notebook (on a compute cluster running Apache Spark 3.5.0). Surprisingly, it took more than a minute for these sample dataframes with only 5 rows each. The legacy SparkCompare works nicely and returns results in a few seconds.
Sample code: I just removed the Spark session creation, since a session already exists in a Databricks notebook.
https://capitalone.github.io/datacompy/spark_usage.html
Has anyone verified datacompy 0.12 with Databricks Spark? Does it work as expected, with reasonable performance?