Error message when using data with 100s of cols #105
Comments
Hi @ronald-smith-angel, thanks for opening this issue. What I do not know yet is: how do we catch the OOM Java error, and how do we add a test for this? Also, maybe we should implement this in the … Anyway, what I expect is a … @ronald-smith-angel: Would you be interested in contributing?
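On the first question, a minimal sketch of one way the JVM-side OOM could be surfaced as a clearer Python error. This assumes the metric collection goes through PySpark's Py4J bridge, which wraps a Java `OutOfMemoryError` in a `Py4JJavaError` whose message contains the Java class name; `collect_metrics` is a hypothetical stand-in for the internal metric-collection step:

```python
from py4j.protocol import Py4JJavaError


def collect_metrics_safely(collect_metrics, df):
    """Run metric collection and translate a JVM OOM into a clear error.

    `collect_metrics` is a hypothetical callable standing in for the
    internal step that gathers per-column metrics on the driver.
    """
    try:
        return collect_metrics(df)
    except Py4JJavaError as e:
        # String-matching on the Java class name is an assumption about
        # how Py4J reports the error; a hard OOM may also kill the JVM
        # before this handler runs.
        if "java.lang.OutOfMemoryError" in str(e):
            raise MemoryError(
                "Metric collection exceeded driver memory. The final result "
                "is too large; filter the dataset or split the scan across "
                "column subsets."
            ) from e
        raise
```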
Depending on where/how the scan is run, it could also make sense to give the user the ability to optionally pass a SparkSession. This would allow the user to adjust the cluster configuration (such as upping driver memory or heap space) in a way that might be more suitable for running the specific tests efficiently. I think this is one of the few common use cases for actually having two SparkSessions existing in parallel. If this sounds sensible to you, I'm happy to make a PR for it.
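As an illustration only, a rough sketch of what such an optional parameter could look like. The `Scan` class and `execute` signature mirror the call shown in this issue, but the `spark_session` parameter and the `getOrCreate()` fallback are assumptions, not the current soda-sql API:

```python
from typing import Optional

from pyspark.sql import DataFrame, SparkSession


class Scan:
    """Hypothetical sketch of a scan accepting a caller-provided session."""

    def execute(self, scan_definition, df: DataFrame,
                spark_session: Optional[SparkSession] = None):
        # Prefer the caller-supplied session, so the scan can run on a
        # session configured for it; otherwise fall back to the default.
        spark = spark_session or SparkSession.builder.getOrCreate()
        ...  # run the scan using `spark`
```

A caller could then build a session tuned for the scan workload and pass it in as `scan.execute(scan_definition, df, spark_session=tuned_session)`.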
I have changed the logic in the Spark dialect for the many-columns case. We were firing one query for every column to get the column metadata; now it should be just one. Please try reinstalling and/or pulling the latest soda-sql-spark (2.1.5) and see if it is still a problem with too many columns. cc @stiebels @ronald-smith-angel

@stiebels, regardless of this issue, I think your suggestion makes a lot of sense. Please do open a PR when you have time 🙏🏽
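For context, a hedged sketch of the kind of change described: collapsing N per-column metadata queries into a single aggregation over all columns. The `MAX(LENGTH(...))` metric and the sample DataFrame are illustrative only; the actual metadata soda-sql gathers may differ:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a", 1), ("bb", 2)], ["name", "value"])

# Before (illustrative): one aggregation job per column.
# for c in df.columns:
#     df.agg(F.max(F.length(F.col(c).cast("string")))).collect()

# After: a single aggregation covering every column at once.
aggs = [
    F.max(F.length(F.col(c).cast("string"))).alias(f"{c}_max_len")
    for c in df.columns
]
row = df.agg(*aggs).collect()[0]
```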
Thanks for the info! Great, I'll open a PR soon.
When a dataset is large and has a big number of columns, the scan function
scan.execute(scan_definition, df)
fails with a Spark OOM error on the master due to the collection step of the metrics. A more meaningful message here would help avoid misleading the developer, and would let them know that the final result is too large and should be either filtered or split.
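Until a clearer error lands, a hedged sketch of the split workaround suggested above: running the scan over column subsets so each driver-side metric collection stays small. The `scan.execute` call matches the one in this issue, but the helper, its name, and the chunk size are assumptions for illustration:

```python
def execute_in_chunks(scan, scan_definition, df, chunk_size=50):
    """Run the scan over column subsets to keep driver-side results small.

    Sketch only: assumes scan.execute accepts any projection of the
    original DataFrame, as in the call shown in this issue.
    """
    cols = df.columns
    results = []
    for i in range(0, len(cols), chunk_size):
        # Project a slice of the columns and scan just that slice.
        subset = df.select(*cols[i:i + chunk_size])
        results.append(scan.execute(scan_definition, subset))
    return results
```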