Add analyze_join_columns / get_recommended_join_columns helper functions #387
base: develop
@mahangu Thanks for the PR, sorry for the delay. Taking a look through some of this still. We were discussing some things internally. Would you be able to articulate the use case for
I think I see a use case. It would be even better if several ranked join-column candidates were emitted: the best choice, the second best, the third, and so on. Then you could have a UI that cycles through the likely candidates.
@fdosani Sure! The main use-case we have for this at the moment internally is in a regression test we have created for our internal data transformation system. Users can generate a scratch table from a changed transformation: e.g - They can then run a compare test on both tables using datacompy, to see whether their changes have caused any unforeseen regressions. Currently in this compare script, they need to specify the Doing this programmatically for them would decrease friction and make this compare step much smoother in two ways:
Of course, they can also specify custom
@paddymul I have tried to do something like this by adding a Thank you both for the reviews! 🙏🏾 Looking forward to chatting further. |
@ak-gupta do you have any thoughts on this? I'm OK with adding the helper functions, but I'd like to refine the intent here and have a larger discussion on what we need.
We use datacompy for regression testing, and for our use-case it can be useful to try and programmatically work out join columns.
Further, when recently prototyping a browser-based UI for datacompy, https://github.com/mahangu/datacompy-web-ui, we worked on a method to recommend join columns.
This PR attempts to bring some of that functionality into datacompy core, via two helper functions:
analyze_join_columns
get_recommended_join_columns
This allows you to pass in get_recommended_join_columns() as a function to the join_columns parameter:
This is just a rough / basic proof-of-concept and likely requires some refinement to fit into your codebase.
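To make the idea of recommending join columns concrete, here is a minimal sketch of one possible heuristic. It is not the implementation from this PR: the names score_join_candidates and recommend_join_columns are hypothetical, the scoring rule (uniqueness in each table multiplied by value overlap) is an assumption, and plain dictionaries of column lists stand in for DataFrames so the example has no dependencies. It returns a ranked list, so a caller could also offer the second or third best candidate.

```python
from typing import Dict, List, Tuple

# Stand-in for a DataFrame: column name -> list of (hashable) values.
Table = Dict[str, list]


def score_join_candidates(left: Table, right: Table) -> List[Tuple[str, float]]:
    """Rank columns present in both tables by how key-like they look.

    Hypothetical heuristic (an assumption, not datacompy's method):
    score = uniqueness in left * uniqueness in right * value overlap.
    """
    scores = []
    for col in left.keys() & right.keys():
        lvals, rvals = left[col], right[col]
        if not lvals or not rvals:
            continue  # an empty column cannot identify rows
        lset, rset = set(lvals), set(rvals)
        # Fraction of distinct values in each table (1.0 means fully unique).
        uniqueness = (len(lset) / len(lvals)) * (len(rset) / len(rvals))
        # Jaccard overlap of the distinct values seen in the two tables.
        overlap = len(lset & rset) / len(lset | rset)
        scores.append((col, uniqueness * overlap))
    # Highest score first; ties broken by column name for determinism.
    return sorted(scores, key=lambda item: (-item[1], item[0]))


def recommend_join_columns(left: Table, right: Table, top_n: int = 1) -> List[str]:
    """Return the names of the top_n best-scoring candidate columns."""
    return [col for col, _ in score_join_candidates(left, right)[:top_n]]


# Example: a base table and a scratch table produced by a changed transformation.
base = {
    "id": [1, 2, 3, 4],
    "region": ["a", "a", "b", "b"],
    "amount": [10.0, 20.0, 30.0, 40.0],
}
scratch = {
    "id": [1, 2, 3, 4],
    "region": ["a", "a", "b", "b"],
    "amount": [10.0, 20.0, 30.0, 41.0],
}

ranked = score_join_candidates(base, scratch)
best = recommend_join_columns(base, scratch)
```

With real DataFrames, a result like best could then be handed to the join_columns argument of a datacompy comparison. A heuristic of this kind can be fooled (here the amount column ranks second only because its values happen to be unique), which is one reason to keep an explicit override available.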
So far, I have got basic tests passing for the Pandas/Polars/Spark implementations, but haven't tested with Snowflake:
If you think this is something that would be useful to add to datacompy, I'm happy to work with y'all on refining this PR further.
Thank you for taking a look!