Skip to content

Conversation

@zhengruifeng
Copy link
Contributor

@zhengruifeng zhengruifeng commented Nov 18, 2025

What changes were proposed in this pull request?

Fix the hashability of UserDefinedType

Why are the changes needed?

UDT is not hashable, e.g.

In [11]: from pyspark.testing.objects import ExamplePointUDT

In [12]: e = ExamplePointUDT()

In [13]: {e: 0}
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[13], line 1
----> 1 {e: 0}

TypeError: unhashable type: 'ExamplePointUDT'

In [14]: from pyspark.ml.linalg import VectorUDT

In [15]: v = VectorUDT()

In [16]: {v: 1}
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[16], line 1
----> 1 {v: 1}

TypeError: unhashable type: 'VectorUDT'

see https://docs.python.org/3/reference/datamodel.html#object.__hash__

If a class does not define an eq() method it should not define a hash() operation either; if it defines eq() but not hash(), its instances will not be usable as items in hashable collections.

A class that overrides eq() and does not define hash() will have its hash() implicitly set to None. When the hash() method of a class is None, instances of the class will raise an appropriate TypeError when a program attempts to retrieve their hash value, and will also be correctly identified as unhashable when checking isinstance(obj, collections.abc.Hashable).

Does this PR introduce any user-facing change?

yes, hash(udt) will work after this fix

How was this patch tested?

added tests

Was this patch authored or co-authored using generative AI tooling?

no

@zhengruifeng zhengruifeng changed the title Fix the hashable of UserDefinedType [PYTHON] Fix the hashable of UserDefinedType Nov 18, 2025
@zhengruifeng zhengruifeng changed the title [PYTHON] Fix the hashable of UserDefinedType [SPARK-54397][PYTHON] Fix the hashable of UserDefinedType Nov 18, 2025
@zhengruifeng zhengruifeng changed the title [SPARK-54397][PYTHON] Fix the hashable of UserDefinedType [SPARK-54397][PYTHON] Make UserDefinedType hashable Nov 18, 2025
@zhengruifeng
Copy link
Contributor Author

also cc @gaogaotiantian

@gaogaotiantian
Copy link
Contributor

gaogaotiantian commented Nov 18, 2025

This is a very bad practice and I highly suggest that we don't do it. I don't believe UDT should be hashable. If I understand correctly, UDT can be mutable right? Mutable object should not be hashable. Also if two objects are equal, they must have the same hash - the current implementation does not guarantee that. Why is the __eq__ implementation so strange? Why do we need this to be hashable? No every object should be hashable.

Apologize about the comment above. I misunderstood the code. It looks a bit shady but if the user use it as we expected it should work properly. I was thinking about the actual objects. When the UDT is only created by SomeClassInheritsUDT() (so all instances are equal) then this should work.

However, I still have some concerns (the code itself still has a smell). Is it possible that the user creates different UDT instances with the same type?

For example

class VectorType(UserDefinedType):
    def __init__(self, dimension):
        self.dimension = dimension
    def __repr__(self):
        return f"Vector({self.dimension})"

type1 = VectorType(2)  # a 2-d vector
type2 = VectorType(3)  # a 3-d vector

assert type1 == type2  # True
assert hash(type1) == hash(type2)  # False

Is this valid code that users can write?

Also is it possible for the user to create their own container type like StructType? Do we assume the type instances are immutable?

@ueshin
Copy link
Member

ueshin commented Nov 18, 2025

@gaogaotiantian That's a good point.

assert type1 == type2  # True

I expect this case to be False.

E.g., the builtin type DecimalType returns False if the argument is different:

>>> from pyspark.sql.types import DecimalType
>>> DecimalType(10,0) == DecimalType(20,0)
False

We may need to revisit __eq__ as well?

@gaogaotiantian
Copy link
Contributor

I think the general direction should be making __eq__ strict. It's not too bad to fallback to the simple is comparison. If the user wants it to be smarter, they can implement their own. Without knowing the structure of an arbitrary user defined type, it's too dangerous to guess whether two instances are the same.

@zhengruifeng
Copy link
Contributor Author

I was trying to cache the converters {datetype: converter} and then found that UDT cannot be used as the key.

I think the general direction should be making eq strict.

@gaogaotiantian sounds good, another solution is to just remove the eq method in UDT, then it should use DataType.eq

class DataType:
"""Base class for data types."""
def __repr__(self) -> str:
return self.__class__.__name__ + "()"
def __hash__(self) -> int:
return hash(str(self))
def __eq__(self, other: Any) -> bool:
return isinstance(other, self.__class__) and self.__dict__ == other.__dict__

@gaogaotiantian
Copy link
Contributor

@zhengruifeng using __eq__ from DataType is much better, but I still have concerns. __dict__ does not contain all the info - even at Python level.

For example:

class Vector(UserDefinedType):
    __slots__ = ["dimension"]

    def __init__(self, dimension):
        super(Vector, self).__init__()
        self.dimension = dimension

type1 = Vector(3)
type2 = Vector(2)


assert type1.__dict__ == type2.__dict__  # This will be falsly True

This is a valid (even encouraged) usage of Python classes, I know it might not be common, but our code will have issues.

I think the safest way is to force users to write their own __eq__ and __hash__ methods so they can't blame us for failed code.

@zhengruifeng
Copy link
Contributor Author

zhengruifeng commented Nov 19, 2025

I think the safest way is to force users to write their own eq and hash methods

@gaogaotiantian how does this affect existing UDTs written by users? type1 == type2 will fail?

@gaogaotiantian
Copy link
Contributor

So, the only case it will work is the two objects are the same object

type1 = Vector(3)
type2 = type1

assert type2 == type1

We can even enforce this by raising an NotImplementedError in __eq__ so it will explicitly ask users to implement their own __eq__ method - because there's no way we can guarantee the equal comparison.

However, it's possible that these proposals are a bit too strict and we want to provide some heuristics to the users. Or we just have to keep some backward compatibility for UDT. Falling back to DataType is not too bad - but technically it's not 100% correct. We probably need to think about tradeoff here.

@zhengruifeng
Copy link
Contributor Author

I agree that we don't have a strict __eq__ may cause some correctness issue, I think it is another issue and existing workflow might be affected if we raise NotImplementedError. We can revisit it in separate thread.

@gaogaotiantian
Copy link
Contributor

NotImplementedError is a bit too strict I agree - if we designed it from scratch that might be the way to go. How about return self is other? Do we have a lot of use case where we need to compare

type1 = Vector(3)
type2 = Vector(3)
assert type1 == type2

@holdenk
Copy link
Contributor

holdenk commented Nov 21, 2025

I also like a default impl of is, that supports the use case of caching.

@zhengruifeng
Copy link
Contributor Author

NotImplementedError is a bit too strict I agree - if we designed it from scratch that might be the way to go. How about return self is other? Do we have a lot of use case where we need to compare

type1 = Vector(3)
type2 = Vector(3)
assert type1 == type2

@gaogaotiantian We can modify the builtin UDTs (VectorUDT/MatrixUDT/etc) but I am not sure whether __eq__ is used a lot in user code. I think we can revisit it in a separate thread/PR.

@gaogaotiantian
Copy link
Contributor

Yeah for now falling back to the default implementation is not too bad. We can change it when users have issues. It should cover most of the use cases and the users can always implement their own if they are not happy with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants