Add analyze_join_columns / get_recommended_join_columns helper functions #387

Open
wants to merge 2 commits into
base: develop

@mahangu mahangu commented Mar 8, 2025

We use datacompy for regression testing, and for our use-case it can be useful to try and programmatically work out join columns.

Further, when recently prototyping a browser-based UI for datacompy, https://github.com/mahangu/datacompy-web-ui, we worked on a method to recommend join columns.

This PR attempts to bring some of that functionality into datacompy core, via two helper functions:

  1. analyze_join_columns
  2. get_recommended_join_columns

This allows you to pass the return value of get_recommended_join_columns() directly to the join_columns parameter:

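import pandas as pd
from datacompy import Compare
# get_recommended_join_columns is the helper proposed in this PR.

df1 = pd.DataFrame(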
    {
        "id": [1, 2, 3, 4, 5],
        "name": ["a", "b", "c", "d", "e"],
        "value1": [10, 20, 30, 40, 50],
    }
)
df2 = pd.DataFrame(
    {
        "id": [1, 2, 3, 4, 6],
        "name": ["a", "b", "c", "d", "f"],
        "value2": [11, 21, 31, 41, 61],
    }
)

compare = Compare(df1, df2, join_columns=get_recommended_join_columns(df1, df2))

This is just a rough / basic proof-of-concept and likely requires some refinement to fit into your codebase.
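
For readers who want a feel for the idea without opening the diff, here is a minimal pandas-only sketch of a uniqueness-based heuristic. It is not the code in this PR: the function names (`sketch_analyze_join_columns`, `sketch_get_recommended_join_columns`), the returned dictionary layout, and the default threshold of 0.95 are all assumptions made for illustration.

```python
import pandas as pd


def sketch_analyze_join_columns(df1, df2, threshold=0.95):
    """Score every column shared by both frames (hypothetical helper).

    The score is the share of rows holding a distinct non-null value,
    taken as the lower of the two frames, so a column must be close to
    unique on both sides to be recommended.
    """
    shared = [col for col in df1.columns if col in df2.columns]
    analysis = {}
    for col in shared:
        per_frame = []
        for df in (df1, df2):
            # nunique() ignores nulls, so null-heavy columns score lower
            per_frame.append(df[col].nunique() / len(df) if len(df) else 0.0)
        score = min(per_frame)
        analysis[col] = {
            "uniqueness_score": score,
            "recommended": score >= threshold,  # threshold is an assumed default
        }
    return analysis


def sketch_get_recommended_join_columns(df1, df2, threshold=0.95):
    """Return the shared columns whose score clears the threshold."""
    analysis = sketch_analyze_join_columns(df1, df2, threshold)
    return [col for col, info in analysis.items() if info["recommended"]]


left = pd.DataFrame(
    {"id": [1, 2, 3, 4, 5], "group": ["x", "x", "y", "y", "z"], "value1": [10, 20, 30, 40, 50]}
)
right = pd.DataFrame(
    {"id": [1, 2, 3, 4, 6], "group": ["x", "y", "y", "z", "z"], "value2": [11, 21, 31, 41, 61]}
)

recommended = sketch_get_recommended_join_columns(left, right)
```

In this sketch only `id` is recommended: `group` is shared by both frames but has three distinct values across five rows, which falls below the assumed threshold.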

So far, I have got basic tests passing for the Pandas/Polars/Spark implementations, but haven't tested with Snowflake:

tests/test_helper.py::test_analyze_join_columns PASSED [  6%]
tests/test_helper.py::test_analyze_join_columns_with_nulls PASSED [ 13%]
tests/test_helper.py::test_analyze_join_columns_with_threshold PASSED [ 20%]
tests/test_helper.py::test_analyze_join_columns_unique_in_only_one_df PASSED [ 26%]
tests/test_helper.py::test_get_recommended_join_columns PASSED [ 33%]
tests/test_helper.py::test_get_recommended_join_columns_directly_in_compare PASSED [ 40%]
tests/test_helper.py::test_analyze_join_columns_polars PASSED [ 46%]
tests/test_helper.py::test_analyze_join_columns_with_nulls_polars PASSED [ 53%]
tests/test_helper.py::test_get_recommended_join_columns_polars PASSED [ 60%]
tests/test_helper.py::test_analyze_join_columns_spark PASSED [ 66%]
tests/test_helper.py::test_analyze_join_columns_with_nulls_spark PASSED [ 73%]
tests/test_helper.py::test_get_recommended_join_columns_spark PASSED [ 80%]
tests/test_helper.py::test_analyze_join_columns_snowflake SKIPPED (Snowflake is not installed) [ 86%]
tests/test_helper.py::test_analyze_join_columns_with_nulls_snowflake SKIPPED (Snowflake is not installed) [ 93%]
tests/test_helper.py::test_get_recommended_join_columns_snowflake SKIPPED (Snowflake is not installed) [100%]

================================================================================== warnings summary ==================================================================================
tests/test_helper.py:8
  /Users/mahangu/code/datacompy/tests/test_helper.py:8: UserWarning: SparkPandasCompare currently only supports Numpy < 2.Please note that the SparkPandasCompare functionality will not work and currently is not supported.
    from datacompy import Compare

tests/test_helper.py: 32 warnings
  /Users/mahangu/code/datacompy/spark-only-venv/lib/python3.10/site-packages/pyspark/sql/pandas/utils.py:37: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    if LooseVersion(pandas.__version__) < LooseVersion(minimum_pandas_version):

tests/test_helper.py: 16 warnings
  /Users/mahangu/code/datacompy/spark-only-venv/lib/python3.10/site-packages/pyspark/sql/pandas/utils.py:64: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
    if LooseVersion(pyarrow.__version__) < LooseVersion(minimum_pyarrow_version):

tests/test_helper.py: 22 warnings
  /Users/mahangu/code/datacompy/spark-only-venv/lib/python3.10/site-packages/pyspark/sql/pandas/serializers.py:224: DeprecationWarning: is_categorical_dtype is deprecated and will be removed in a future version. Use isinstance(dtype, pd.CategoricalDtype) instead
    if is_categorical_dtype(series.dtype):

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
===================================================================== 12 passed, 3 skipped, 71 warnings in 4.56s =====================================================================

If you think this is something that would be useful to add to datacompy, I'm happy to work with y'all on refining this PR further.

Thank you for taking a look!

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@mahangu mahangu marked this pull request as ready for review March 8, 2025 02:45
@fdosani
Member

fdosani commented Mar 11, 2025

@mahangu Thanks for the PR, and sorry for the delay. I'm still looking through some of this, and we were discussing a few things internally. Would you be able to articulate the use case for analyze_join_columns? Are you finding that it is sometimes helpful to have this when comparing? I'm asking mostly out of curiosity.

cc: @ak-gupta @rhaffar

@paddymul

I think I see a use case.
Many times I want to compare dataframes that I'm not super familiar with. I could manually look at each column and figure out the best join columns, relying on my own internalized heuristics. It would be great to have a function that expresses those join heuristics so I don't have to scan the columns and do that on my own.

It would be even better if several candidate join columns were emitted in ranked order ("what's the second-best choice?", "the third?"). Then you could have a UI that cycles through the likely candidates.
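
One way to picture the ranked output described here is a list of (column, score) pairs sorted from best to worst candidate. The sketch below is an illustration only: `ranked_join_candidates` is a hypothetical name, and scoring a column by the lower of its distinct-value ratios in the two frames is an assumed heuristic, not something confirmed by the PR.

```python
import pandas as pd


def ranked_join_candidates(df1, df2):
    """Return shared columns ordered from most to least unique (hypothetical)."""

    def uniqueness(df, col):
        # Share of rows holding a distinct non-null value
        return df[col].nunique() / len(df) if len(df) else 0.0

    shared = [col for col in df1.columns if col in df2.columns]
    scored = [(col, min(uniqueness(df1, col), uniqueness(df2, col))) for col in shared]
    # sorted() is stable, so tied columns keep their order from df1
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


left = pd.DataFrame(
    {"id": [1, 2, 3, 4], "region": ["n", "n", "s", "s"], "day": [1, 2, 3, 3]}
)
right = pd.DataFrame(
    {"id": [1, 2, 3, 5], "region": ["n", "s", "s", "s"], "day": [1, 2, 2, 3]}
)

candidates = ranked_join_candidates(left, right)
best, second_best, third_best = (col for col, _ in candidates)
```

A UI could then step through `candidates` in order, offering `id` first, `day` second, and `region` last.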

@mahangu
Author

mahangu commented Mar 12, 2025

Would you be able to articulate the use case for analyze_join_columns? Are you finding that it is sometimes helpful to have this when comparing? I'm asking mostly out of curiosity.

@fdosani Sure! The main use-case we have for this at the moment internally is in a regression test we have created for our internal data transformation system. Users can generate a scratch table from a changed transformation:

For example:
original: marketing.page_views
scratch: scratch.marketing__page_views

They can then run a compare test on both tables using datacompy, to see whether their changes have caused any unforeseen regressions. Currently in this compare script, they need to specify the join_columns manually.

Doing this programmatically for them would decrease friction and make this compare step much smoother in two ways:

  1. We could use analyze_join_columns() to output a list of possible join columns they can choose from.
  2. We could allow them to just use get_recommended_join_columns() directly if they want to run a quick/first comparison on the default join columns that datacompy recommends.

Of course, they can also specify custom join_columns if they want to. Adding this functionality could just make the initial UX a bit smoother for users. We were initially going to add this feature to our internal script but wanted to explore adding it to datacompy core as we feel it could help other datacompy users as well.
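
The precedence described above (explicit join_columns win, otherwise fall back to a recommendation) could be wired into a compare script roughly as follows. This is a sketch under assumptions: `resolve_join_columns` is a hypothetical wrapper, and the recommender is passed in as a callable so the example runs on its own; in practice that callable would be the helper proposed in this PR.

```python
import pandas as pd


def resolve_join_columns(df1, df2, recommender, user_columns=None):
    """Pick join columns: user-supplied ones first, else the recommender's."""
    if user_columns:
        # Fail early if a requested column is absent from either frame
        missing = [
            col for col in user_columns
            if col not in df1.columns or col not in df2.columns
        ]
        if missing:
            raise ValueError(f"join columns not present in both frames: {missing}")
        return list(user_columns)

    recommended = list(recommender(df1, df2))
    if not recommended:
        raise ValueError("no join columns could be recommended; specify them manually")
    return recommended


def stub_recommender(df1, df2):
    """Stand-in for the PR's helper, used only to keep this example runnable."""
    return ["id"]


original = pd.DataFrame({"id": [1, 2, 3], "name": ["a", "b", "c"]})
scratch = pd.DataFrame({"id": [1, 2, 4], "name": ["a", "b", "d"]})

default_columns = resolve_join_columns(original, scratch, stub_recommender)
custom_columns = resolve_join_columns(
    original, scratch, stub_recommender, user_columns=["name"]
)
```

Raising when nothing can be recommended keeps the quick first comparison from silently joining on an arbitrary column.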

It would be even better if several candidate join columns were emitted in ranked order ("what's the second-best choice?", "the third?"). Then you could have a UI that cycles through the likely candidates.

@paddymul I have tried to do something like this by adding a uniqueness_score that we can order by when providing the join columns to the user (https://github.com/capitalone/datacompy/pull/387/files#diff-0d330e20f29f9779d6ec676715299be6bc6505293f7d15323b134c1ccbe2a2e6R137), but I'm open to other ideas too!

Thank you both for the reviews! 🙏🏾 Looking forward to chatting further.

@fdosani
Member

fdosani commented Mar 24, 2025

@ak-gupta do you have any thoughts on this? I'm OK with adding the helper functions, but I'd like to refine the intent here and have a larger discussion on what we need.

4 participants