Error message when using a dataset with 100s of cols #105


Open
ronald-smith-angel opened this issue Jan 11, 2022 · 4 comments

Comments

@ronald-smith-angel

When a dataset is large and has a big number of columns, the scan function scan.execute(scan_definition, df) fails with a Spark OOM error on the master, caused by collecting the metrics. A more meaningful message here would avoid misleading the developer and let them know that the final result is too large and should be either filtered or split.

@JCZuurmond
Contributor

Hi @ronald-smith-angel, thanks for opening this issue. What I do not know yet is: how do we catch the Java OOM error, and how do we add a test for this?

Also, maybe we should implement this in soda-sql-spark instead of here. @vijaykiran: what do you think?

Anyway, what I expect is a try/except around the scan.execute. Preferably we catch a specific error and add a more meaningful message with a raise RuntimeError("Dataset too large, try row filtering or column selection") from e, or something alike.
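
For concreteness, a minimal sketch of that wrapping. The exact exception type is an assumption (here a Py4JJavaError carrying the driver-side OOM), not something this thread confirms:

```python
from py4j.protocol import Py4JJavaError


def execute_with_clear_error(scan, scan_definition, df):
    """Wrap scan.execute and re-raise driver OOM failures with a clearer message."""
    try:
        return scan.execute(scan_definition, df)
    except Py4JJavaError as e:
        # Assumption: a java.lang.OutOfMemoryError on the driver reaches Python
        # through Py4J; inspect the Java-side exception text to recognise it.
        if "OutOfMemoryError" in str(e.java_exception):
            raise RuntimeError(
                "Dataset too large, try row filtering or column selection"
            ) from e
        raise
```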

@ronald-smith-angel : Would you be interested in contributing?

@stiebels
Copy link

stiebels commented Feb 28, 2022

Depending on where/how the scan is run, it could also make sense to let the user optionally pass a spark.conf object to the scan.execute function, which would trigger the creation of a dedicated (potentially additional) SparkSession via the newSession method that is then used to execute the scan (currently it's getOrCreate [here], fetching the global session or creating a vanilla new one).

This would allow the user to adjust the cluster configuration (such as upping driver memory or heap space) in a way that might be more suitable for running the specific tests efficiently. I think this is one of the few common use cases for actually having two SparkSessions existing in parallel.
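
A minimal sketch of what that could look like; the spark_conf parameter and helper name are hypothetical. Note that sessions created via newSession share one SparkContext, so only runtime (session-scoped) SQL settings can differ per session, while static settings such as driver memory are fixed when the context is created:

```python
from typing import Mapping, Optional

from pyspark.sql import SparkSession


def session_for_scan(spark_conf: Optional[Mapping[str, str]] = None) -> SparkSession:
    """Return the global session, or a dedicated child session tuned for the scan."""
    base = SparkSession.builder.getOrCreate()
    if not spark_conf:
        return base
    # newSession() shares the SparkContext but gets its own SQL conf and temp views,
    # so the scan can run with its own runtime settings without touching the caller's.
    session = base.newSession()
    for key, value in spark_conf.items():
        session.conf.set(key, value)
    return session


# e.g. session_for_scan({"spark.sql.shuffle.partitions": "64"})
```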

If you think this makes sense, I'm happy to make a PR for it.

@vijaykiran
Contributor

I have changed the logic in the Spark dialect for the case where there are many columns. We were firing one query per column to get the column metadata; now it should be just one query. Please try reinstalling and/or pulling the latest soda-sql-spark (2.1.5) and see if too many columns is still a problem.
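
Roughly, the shape of that change (illustrative only, not the actual dialect code):

```python
from pyspark.sql import DataFrame, functions as F


def column_metrics_per_column(df: DataFrame) -> dict:
    """Old pattern: one Spark job per column."""
    return {c: df.select(F.count(c)).first()[0] for c in df.columns}


def column_metrics_single_query(df: DataFrame) -> dict:
    """New pattern: one aggregation covering every column."""
    row = df.agg(*[F.count(c).alias(c) for c in df.columns]).first()
    return row.asDict()
```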

cc @stiebels @ronald-smith-angel

@stiebels regardless of this issue, I think your suggestion makes a lot of sense. Please do open a PR when you have time 🙏🏽

@stiebels

stiebels commented Mar 1, 2022

Thanks for the info! Great, I'll open a PR soon.
