Inequality when comparing two empty numpy arrays #58

simonwongwong · 2020-03-10T19:14:44Z

Comparison of two empty numpy arrays currently return False, which results in showing diffs where there shouldn't be.

This is due to the way numpy compares empty arrays.
Running bool(np.array([]) == np.array([])) returns False and throws this warning:

The truth value of an empty array is ambiguous. Returning False, but in future this will result in an error. Use `array.size > 0` to check that an array is not empty.

Reproduce this bug with:

df1 = pd.DataFrame({"some_col": [np.array([]) for _ in range(10)], "id": [i for i in range(10)]})
df2 = pd.DataFrame({"some_col": [np.array([]) for _ in range(10)], "id": [i for i in range(10)]})

pdcompare = datacompy.Compare(df1, df2, join_columns="id")
print(pdcompare.report())

output:

DataComPy Comparison
--------------------

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0       df1        2    10
1       df2        2    10

Column Summary
--------------

Number of columns in common: 2
Number of columns in df1 but not in df2: 0
Number of columns in df2 but not in df1: 0

Row Summary
-----------

Matched on: id
Any duplicates on match values: No
Absolute Tolerance: 0
Relative Tolerance: 0
Number of rows in common: 10
Number of rows in df1 but not in df2: 0
Number of rows in df2 but not in df1: 0

Number of rows with some compared columns unequal: 10
Number of rows with all compared columns equal: 0

Column Comparison
-----------------

Number of columns compared with some values unequal: 1
Number of columns compared with all values equal: 1
Total number of values which compare unequal: 10

Columns with Unequal Values or Types
------------------------------------

     Column df1 dtype df2 dtype  # Unequal  Max Diff  # Null Diff
0  some_col    object    object         10         0            0

Sample Rows with Unequal Values
-------------------------------

   id some_col (df1) some_col (df2)
9   9             []             []
0   0             []             []
3   3             []             []
7   7             []             []
5   5             []             []
1   1             []             []
4   4             []             []
2   2             []             []
8   8             []             []
6   6             []             []

The text was updated successfully, but these errors were encountered:

fdosani · 2020-04-13T13:50:16Z

Hey @simonwongwong hope all is well!
I'll take a closer look at this sometime this week. Thanks for bringing this up and opening up the issue.

simonwongwong · 2020-04-15T04:29:39Z

fdosani · 2020-05-14T20:14:53Z

@jborchma Just want to circle back on this. Thoughts on just checking if both are empty and throwing an exception? This might be something which is never encountered (comparing 2 empty dataframes)

jborchma · 2020-05-14T20:30:39Z

So technically two empty dataframes should be equal. Maybe we could return True and a log message?

fdosani · 2020-05-14T22:23:44Z

I'm aligned with that. I'll try and do a quick PR here.

fdosani · 2020-05-14T22:39:40Z

@jborchma So it seems like @simonwongwong is comparing arrays here. So obviously empty arrays make sense. But the following also doesn't work

df1 = pd.DataFrame({"some_col": [np.array([1,2]) for _ in range(10)], "id": [i for i in range(10)]}) 
df2 = pd.DataFrame({"some_col": [np.array([1,2]) for _ in range(10)], "id": [i for i in range(10)]})

Mainly due to the fact that: ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

0 is a issue. 1 element works fine, and than >1 is also an issue.
I guess this isn't a simple change but I'll look into it. I guess we never really had a usecase where the field was an array.

fdosani · 2020-05-15T00:00:29Z

@simonwongwong Are you comparing a lot of np.arrays? (Could you have arrays of > 1 length?) I'd like to think about the use case a bit more if you have thoughts.

simonwongwong · 2020-05-16T00:05:57Z

My use case was reading CSV files with empty arrays using pandas -- pandas will read arrays as numpy arrays and two empty numpy arrays cannot be equal

fdosani · 2020-05-16T00:24:07Z

Makes sense. I think the issue boils down to how Pandas internalises the dtype for an array. It will be an O (Object). But so is string, or another other items which isn't a base dtype that Pandas supports. We could check the first item of that columns and check to see if it is a np.ndarray but that seems really ugly to me. I'm open to other thoughts or suggestions. For your use case will it always be empty?

simonwongwong · 2020-05-16T03:27:14Z

In my case it wasn't always empty.
Another option could be to convert them to Python lists instead of np.ndarray

fdosani · 2020-05-17T00:15:12Z

@theianrobertson Thoughts on this issue? Dataframes with numpy arrays in columns.

jborchma · 2020-05-17T14:56:36Z

So what Simon really wants is elementwise comparison of the arrays, right?

fdosani · 2020-05-17T15:20:24Z

Yup that’s exactly it. Not an actual empty data frame.

…

On Sun, May 17, 2020 at 11:56 AM Jan Borchmann ***@***.***> wrote: So what Simon really wants is elementwise comparison of the arrays, right? — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub <#58 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACPJA5T7DIWYEVLRBRZLRNLRR73LBANCNFSM4LFGD7SQ> .

jborchma · 2020-05-17T15:25:15Z

I guess we would want to use something like the numpy function to compare arrays.

simonwongwong · 2020-05-17T15:25:22Z

Yeah, on non-empty arrays it'll do a normal element wise comparison, but empty np.arrays will never be equal

fdosani · 2020-05-17T15:29:15Z

The main issue is detecting if it is a np array. Since it comes up as an object in pandas for the dtype. We’d need to inspect the actual element to know for sure what we are dealing with. But then the questions comes up of we should support other things.

…

On Sun, May 17, 2020 at 12:25 PM Simon Wong ***@***.***> wrote: Yeah, on non-empty arrays it'll do a normal element wise comparison, but empty np.arrays will never be equal — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub <#58 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ACPJA5SOHKSNZUMZXZZQCT3RR76W5ANCNFSM4LFGD7SQ> .

fdosani · 2020-05-17T15:30:55Z

Yeah, on non-empty arrays it'll do a normal element wise comparison, but empty np.arrays will never be equal

If you look at my above example I’m not sure datacompy will automatically work. It will complain and suggest any or all.

fdosani · 2020-06-09T02:14:02Z

@jborchma Any further thoughts on this. I think the main issue is where and if you draw the line of things to compare.

fdosani · 2021-08-05T23:11:33Z

@simonwongwong This was a while back now. I'm going to close this issue, but feel free to reopen if it seems like something we need to rehash. Trying to organize our backlog and work through some of these older issue if needed.

jonashaag · 2021-09-03T12:59:56Z

Hi everyone! This is a feature we're missing and I'm happy to spend some time implementing a solution (and also coming up with a proposal how to move forward, if you want).

fdosani · 2021-09-03T13:04:51Z

Hey @jonashaag yes please. Would love contributions and thoughts from others. Happy to have you take this on. Appreciate you willing to help out. 🚀

jonashaag · 2021-09-13T12:27:26Z

Had a look into the implementation -- the actual column comparison code (def columns_equal) seems rather unflexible/specifically built for use cases at Capital One. Here's two ideas how to deal with the NumPy array issue:

A) Add new fixed logic for NumPy arrays: try to detect NumPy array columns by looking at the actual series values. Use .all() for NumPy arrays.

B) Add a new system for custom declaration of "comparators", ie. give more flexibility to the user to configure how columns are compared. We would ship a default configuration that mimics the current behavior, and users would be free to change the configuration to their liking. This could be as simple as giving a list of comparators that are tried in order until one of them "understand" the data, ie. the user could pass something like:

columns_equal(..., comparators=[
    FloatComparator(rtol=1e-3),
    StringComparator(case_sensitive=False),
    ArrayComparator(aggregate="all")  # calls .all()
])

Or it could be an explicit list of comparators for each column, or something similar.

fdosani · 2021-09-18T16:01:22Z

@jonashaag ill take a look at this on Monday. Been on vacation all week. Thanks for your help with this. I do think datacompy could be ready for a major refactor to be honest. Especially aligning the spark and pandas APIs

fdosani · 2021-09-20T13:40:43Z

Reopening this issue. I like the idea of option B @jonashaag . But that seems like a bit of a refactor and something I've been thinking about with the package. I'd like to revisit it and see if there are opportunities to one make it more flexible and also play nicer with Spark/Pandas all in one spot. I was thinking maybe koalas might be a good option here. Option A would be the quickest and solve this direct issue immediately it seems.

Thoughts? @jonashaag @jborchma @elzzhu @theianrobertson ?

jonashaag · 2021-09-20T15:17:01Z

I have little experience with Spark and I'm not sure if I'll be able to invest the learning time right now.

fdosani · 2021-09-20T15:17:54Z

That is perfectly fine, that is something I can lean into.

dan9401 · 2025-04-02T18:46:32Z

Just to add an example that comparing list also seems to have the same issue:

df1 = pd.DataFrame(
    data={
        'test_col': [[], [1,2], [1,3], [2,3], [1,2,3]]
    },
    index=[0,1,3,4,5]
)

df2 = df1.copy(deep=True)

compare = datacompy.Compare(df1, df2, on_index=True)

Output:

Total number of values which compare unequal: 5

Columns with Unequal Values or Types
----------------------------------

    Column df1 dtype df2 dtype  # Unequal  Max Diff  # Null Diff
0  test_col   object   object         5       0.0            0

Sample Rows with Unequal Values
-----------------------------

   test_col (df1) test_col (df2)
4        [2, 3]        [2, 3]
3        [1, 3]        [1, 3]
1        [1, 2]        [1, 2]
0           []           []
5    [1, 2, 3]     [1, 2, 3]

fdosani self-assigned this Apr 13, 2020

fdosani added the bug Something isn't working label Apr 13, 2020

fdosani added help wanted Extra attention is needed question Further information is requested labels May 16, 2020

fdosani closed this as completed Aug 5, 2021

fdosani reopened this Sep 20, 2021

fdosani mentioned this issue Mar 28, 2025

Redesign for compare logic into modular "Comparators" #397

Open

fdosani mentioned this issue Apr 17, 2025

feat: Compatibility for Array and Lists #404

Merged

fdosani closed this as completed in #404 Apr 21, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Inequality when comparing two empty numpy arrays #58

Inequality when comparing two empty numpy arrays #58

simonwongwong commented Mar 10, 2020

fdosani commented Apr 13, 2020

simonwongwong commented Apr 15, 2020

fdosani commented May 14, 2020

jborchma commented May 14, 2020

fdosani commented May 14, 2020

fdosani commented May 14, 2020

fdosani commented May 15, 2020

simonwongwong commented May 16, 2020

fdosani commented May 16, 2020

simonwongwong commented May 16, 2020

fdosani commented May 17, 2020

jborchma commented May 17, 2020

fdosani commented May 17, 2020 via email

jborchma commented May 17, 2020

simonwongwong commented May 17, 2020

fdosani commented May 17, 2020 via email

fdosani commented May 17, 2020

fdosani commented Jun 9, 2020

fdosani commented Aug 5, 2021

jonashaag commented Sep 3, 2021

fdosani commented Sep 3, 2021

jonashaag commented Sep 13, 2021

fdosani commented Sep 18, 2021 •

edited

Loading

fdosani commented Sep 20, 2021 •

edited

Loading

jonashaag commented Sep 20, 2021

fdosani commented Sep 20, 2021

dan9401 commented Apr 2, 2025

Inequality when comparing two empty numpy arrays #58

Inequality when comparing two empty numpy arrays #58

Comments

simonwongwong commented Mar 10, 2020

fdosani commented Apr 13, 2020

simonwongwong commented Apr 15, 2020

fdosani commented May 14, 2020

jborchma commented May 14, 2020

fdosani commented May 14, 2020

fdosani commented May 14, 2020

fdosani commented May 15, 2020

simonwongwong commented May 16, 2020

fdosani commented May 16, 2020

simonwongwong commented May 16, 2020

fdosani commented May 17, 2020

jborchma commented May 17, 2020

fdosani commented May 17, 2020 via email

jborchma commented May 17, 2020

simonwongwong commented May 17, 2020

fdosani commented May 17, 2020 via email

fdosani commented May 17, 2020

fdosani commented Jun 9, 2020

fdosani commented Aug 5, 2021

jonashaag commented Sep 3, 2021

fdosani commented Sep 3, 2021

jonashaag commented Sep 13, 2021

fdosani commented Sep 18, 2021 • edited Loading

fdosani commented Sep 20, 2021 • edited Loading

jonashaag commented Sep 20, 2021

fdosani commented Sep 20, 2021

dan9401 commented Apr 2, 2025

fdosani commented Sep 18, 2021 •

edited

Loading

fdosani commented Sep 20, 2021 •

edited

Loading