Fix spark logic in DLO strategy generation #293
Merged
Summary
problem: after deploying changes to the spark cluster, the lambda expression pushed down to the spark executors fails at runtime
solution: change the lambda function to be executed within a java stream instead of a spark stream
explanation:
- this bug cannot be reproduced in a unit test
- I was unable to reproduce it in spark-shell either, though possibly due to differences between the actual code and the re-created environment's code
- there were challenges to using a JDWP debugger against spark: `FileContent` class-not-found and `FileContent.DATA` method-not-found errors, possibly due to IntelliJ not handling the shaded class well
- only guesses could be gathered from debugging/googling/investigating the stack trace, made less deterministic since nearly the same logic is executed a few lines earlier without any issue
- so we decided to avoid pushing the lambda expression down to the spark executor entirely and to rely on just the java stream (see the sketch below)
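
A minimal sketch of the change, using hypothetical stand-ins (`FileInfo` for the actual file metadata type, a string-valued `content`) rather than the exact types from the DLO strategy generation code:

```java
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.api.java.function.FilterFunction;
import org.apache.spark.sql.Dataset;

// Hypothetical stand-in for the actual file metadata type.
interface FileInfo {
  String getContent();
}

class DloFilterSketch {
  // Before: the lambda is serialized and pushed down to the spark executors,
  // where it failed at runtime after deployment.
  static Dataset<FileInfo> filterOnExecutors(Dataset<FileInfo> files, String content) {
    return files.filter((FilterFunction<FileInfo>) f -> f.getContent().equals(content));
  }

  // After: materialize the dataset on the driver and filter with a plain
  // java stream, so no lambda is shipped to the executors.
  static List<FileInfo> filterOnDriver(Dataset<FileInfo> files, String content) {
    return files.collectAsList().stream()
        .filter(f -> f.getContent().equals(content))
        .collect(Collectors.toList());
  }
}
```

The trade-off is that the file list is materialized on the driver instead of being filtered distributedly across executors.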
Testing Done
Since the JDWP debugger could not effectively troubleshoot this issue, I built open-source OpenHouse:

```sh
./gradlew build
```

then copied the uber jar into li-openhouse's spark deployable, with a reference in gradle like:

```gradle
implementation(files("/Users/chbush/code/li/openhouse/build/openhouse-spark-apps_2.12/libs/openhouse-spark-apps_2.12-1.0.0-uber.jar"))
```
I opened the jar:

```sh
jar xf openhouse-spark-apps_2.12-1.0.0-uber.jar
```

and inspected the byte code of the new deployable, confirming the updated code was within. I then deployed this to a staging Artifactory repo and referenced the jar in spark. Once it was executing, I hooked in with a JDWP debugger to confirm the line of code I expected was being triggered:

```java
files.collectAsList().stream().filter(file -> file.getContent().equals(content));
```
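
Note that, as quoted, this line has no terminal operation, so the filter on its own is lazy; it is presumably an excerpt from a longer chain that collects the filtered result. Hooking in with JDWP here refers to the standard remote-debug setup, e.g. launching the driver JVM with `-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5005` and attaching a remote debugger from the IDE.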
We see the entire task execute, but it finishes with a new error:

This is expected on the staging cluster since these columns do not exist yet 🥳

Fixing the staging cluster with the desired columns results in success.