Error message when using a dataset with 100s of cols #105


Open
ronald-smith-angel opened this issue Jan 11, 2022 · 4 comments

Comments

@ronald-smith-angel

When a dataset is large and has a big number of columns, the scan function scan.execute(scan_definition, df) fails with a Spark OOM error on the master, caused by collecting the metrics. A more meaningful message here would avoid misleading the developer and let them know that the final result is too large and should be either filtered or split.

@JCZuurmond
Contributor

Hi @ronald-smith-angel, thanks for opening this issue. What I do not know yet is: how do we catch the Java OOM error, and how do we add a test for this?

Also, maybe we should implement this in soda-sql-spark instead of here. @vijaykiran: what do you think?

Anyway, what I expect is a try/except around the scan.execute. Preferably we catch a specific error and add a more meaningful message with a raise RuntimeError("Dataset too large, try row filtering or column selection") from e, or something alike.
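
For concreteness, a minimal sketch of that wrapping. The exact exception type is an assumption (here a Py4JJavaError carrying the driver-side OOM), not something this thread confirms:

```python
from py4j.protocol import Py4JJavaError


def execute_with_clear_error(scan, scan_definition, df):
    """Wrap scan.execute and re-raise driver OOM failures with a clearer message."""
    try:
        return scan.execute(scan_definition, df)
    except Py4JJavaError as e:
        # Assumption: a java.lang.OutOfMemoryError on the driver reaches Python
        # through Py4J; inspect the Java-side exception text to recognise it.
        if "OutOfMemoryError" in str(e.java_exception):
            raise RuntimeError(
                "Dataset too large, try row filtering or column selection"
            ) from e
        raise
```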

@ronald-smith-angel : Would you be interested in contributing?

@stiebels
Copy link

stiebels commented Feb 28, 2022

Depending on where/how the scan is run, it could also make sense to let the user optionally pass a spark.conf object to the scan.execute function, which would trigger the creation of a dedicated (potentially additional) SparkSession via the newSession method that is then used to execute the scan (currently it's getOrCreate [here], fetching the global session or creating a vanilla new one).

This would allow the user to adjust the cluster configuration (such as upping driver memory or heap space) in a way that might be more suitable for running the specific tests efficiently. I think this is one of the few common use cases for actually having two SparkSessions existing in parallel.
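
A minimal sketch of what that could look like; the spark_conf parameter and helper name are hypothetical. Note that sessions created via newSession share one SparkContext, so only runtime (session-scoped) SQL settings can differ per session, while static settings such as driver memory are fixed when the context is created:

```python
from typing import Mapping, Optional

from pyspark.sql import SparkSession


def session_for_scan(spark_conf: Optional[Mapping[str, str]] = None) -> SparkSession:
    """Return the global session, or a dedicated child session tuned for the scan."""
    base = SparkSession.builder.getOrCreate()
    if not spark_conf:
        return base
    # newSession() shares the SparkContext but gets its own SQL conf and temp views,
    # so the scan can run with its own runtime settings without touching the caller's.
    session = base.newSession()
    for key, value in spark_conf.items():
        session.conf.set(key, value)
    return session


# e.g. session_for_scan({"spark.sql.shuffle.partitions": "64"})
```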

If you think this makes sense, I'm happy to make a PR for it.

@vijaykiran
Contributor

I have changed the logic in the Spark dialect for the case where there are many columns. We were firing one query per column to get the column metadata; now it should be just one query. Please try reinstalling and/or pulling the latest soda-sql-spark (2.1.5) and see if too many columns is still a problem.
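
Roughly, the shape of that change (illustrative only, not the actual dialect code):

```python
from pyspark.sql import DataFrame, functions as F


def column_metrics_per_column(df: DataFrame) -> dict:
    """Old pattern: one Spark job per column."""
    return {c: df.select(F.count(c)).first()[0] for c in df.columns}


def column_metrics_single_query(df: DataFrame) -> dict:
    """New pattern: one aggregation covering every column."""
    row = df.agg(*[F.count(c).alias(c) for c in df.columns]).first()
    return row.asDict()
```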

cc @stiebels @ronald-smith-angel

@stiebels regardless of this issue, I think your suggestion makes a lot of sense. Please do open a PR when you have time 🙏🏽

@stiebels

stiebels commented Mar 1, 2022

Thanks for the info! Great, I'll open a PR soon.
