datacompy v0.12 spark sample with 5 rows only takes more than a minute to execute on databricks #300
Interesting. It shouldn't take minutes for 5 rows. I'm wondering if something isn't set up right with the new version. It's using the pandas-on-Spark API under the hood. I can try to test it later today.
I just ran the example code you pointed to above, and it took maybe 2–3 seconds to return the results. I haven't tested it in Databricks. For context, this is running on my home desktop with just the default Spark settings.

```
%%timeit
print(compare.report())
...
3.66 s ± 159 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

Are you able to share your cluster settings?
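To reproduce this measurement outside a notebook (where the `%%timeit` magic isn't available), the standard-library `timeit` module gives an equivalent repeated-run timing. This is a minimal sketch: the `report` function here is a placeholder standing in for the real `compare.report()` call, not datacompy code.

```python
import timeit

def report():
    # Placeholder for compare.report(); swap in the real call when profiling.
    return sum(range(100_000))

# Mirror %%timeit's "7 runs, 1 loop each" measurement.
runs = timeit.repeat(report, repeat=7, number=1)
mean = sum(runs) / len(runs)
print(f"{mean:.4f} s per loop (mean of {len(runs)} runs, 1 loop each)")
```

Comparing the local mean against the Databricks run for the same data is a quick way to confirm whether the slowdown is cluster-specific.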
Thanks @fdosani for the quick check. I am using a single-node cluster on AWS Databricks with the following configuration, and there is nothing else running on this cluster. Runtime: 14.3 LTS (includes Apache Spark 3.5.0, Scala 2.12). There is nothing in the init script.
Not really sure how to help here. Can you try running on your local (non-cloud) machine and see if it runs faster? I'm using Spark 3.5.1 on Linux.
Sure, I will check locally. It would be great if someone from the community could test on Databricks as well.
@satniks I was able to recreate this on Databricks. It is odd, but I don't think this is a datacompy issue but rather something under the hood with Databricks. One thing to consider is that for such small data SparkCompare will perform poorly in general. You really need to get into the millions to billions of records, since that is what it is intended for. If you have smallish data, Polars and Pandas will be better in all cases.
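To illustrate why pandas is the better fit at this scale: a row-level comparison of two small frames is just an outer merge, with no cluster or job-scheduling overhead. This is a hypothetical sketch (the frames and column names are made up, and this is not datacompy's actual implementation), showing the kind of diff a compare produces:

```python
import pandas as pd

# Hypothetical 5-row frames keyed on acct_id.
df1 = pd.DataFrame({"acct_id": [1, 2, 3, 4, 5],
                    "amount": [10.0, 20.0, 30.0, 40.0, 50.0]})
df2 = pd.DataFrame({"acct_id": [1, 2, 3, 4, 6],
                    "amount": [10.0, 20.0, 31.0, 40.0, 60.0]})

# Outer merge with indicator tells us which rows exist on which side.
merged = df1.merge(df2, on="acct_id", how="outer",
                   suffixes=("_df1", "_df2"), indicator=True)
only_df1 = merged[merged["_merge"] == "left_only"]
only_df2 = merged[merged["_merge"] == "right_only"]
both = merged[merged["_merge"] == "both"]
mismatched = both[both["amount_df1"] != both["amount_df2"]]

print(len(only_df1), len(only_df2), len(mismatched))  # 1 1 1
```

At five rows this runs in milliseconds, which is why spinning up Spark tasks for data this size can only add overhead.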
Thanks @fdosani for confirming the small-dataset behavior on Databricks. I also tried with a few thousand records and it ran for several minutes, so I cancelled the job assuming it would not succeed. I will check again with thousands of records and more next week and see how long it takes.
Just to confirm: I ran it with various sizes and never had issues with it completing. I just did 100M rows and it took about 3–4 minutes. I'm going to close this issue for now, but if you have other issues feel free to reopen it.
@fdosani, is it possible to keep legacy SparkCompare support for a while, as it works seamlessly for our data size?
@satniks of course. I've been working on a vanilla Spark version which doesn't use the pandas-on-Spark API. It's much faster if you want to try it out, on this branch: https://github.com/capitalone/datacompy/tree/vanilla-spark The legacy version will stick around for a while; I just don't plan any enhancements.
I executed the default Spark usage sample in a Databricks notebook (on a compute cluster running Apache Spark 3.5.0). Surprisingly, it took more than a minute for these sample dataframes with only 5 rows each. The legacy SparkCompare works nicely and returns results in a few seconds.
Sample code: I just removed the Spark session creation, since a session already exists in a Databricks notebook.
https://capitalone.github.io/datacompy/spark_usage.html
Has anyone verified datacompy 0.12 with Databricks Spark? Does it work as expected, with reasonable performance?