-
Notifications
You must be signed in to change notification settings - Fork 28.9k
[SPARK-54397][PYTHON] Make UserDefinedType hashable
#53113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
UserDefinedTypeUserDefinedType
UserDefinedTypeUserDefinedType
UserDefinedTypeUserDefinedType hashable
|
also cc @gaogaotiantian |
|
Apologize about the comment above. I misunderstood the code. It looks a bit shady but if the user use it as we expected it should work properly. I was thinking about the actual objects. When the UDT is only created by However, I still have some concerns (the code itself still has a smell). Is it possible that the user creates different UDT instances with the same type? For example class VectorType(UserDefinedType):
def __init__(self, dimension):
self.dimension = dimension
def __repr__(self):
return f"Vector({self.dimension})"
type1 = VectorType(2) # a 2-d vector
type2 = VectorType(3) # a 3-d vector
assert type1 == type2 # True
assert hash(type1) == hash(type2) # FalseIs this valid code that users can write? Also is it possible for the user to create their own container type like |
|
@gaogaotiantian That's a good point. assert type1 == type2 # TrueI expect this case to be E.g., the builtin type >>> from pyspark.sql.types import DecimalType
>>> DecimalType(10,0) == DecimalType(20,0)
FalseWe may need to revisit |
|
I think the general direction should be making |
|
I was trying to cache the converters
@gaogaotiantian sounds good, another solution is to just remove the eq method in UDT, then it should use DataType.eq spark/python/pyspark/sql/types.py Lines 115 to 125 in 9380e91
|
|
@zhengruifeng using For example: class Vector(UserDefinedType):
__slots__ = ["dimension"]
def __init__(self, dimension):
super(Vector, self).__init__()
self.dimension = dimension
type1 = Vector(3)
type2 = Vector(2)
assert type1.__dict__ == type2.__dict__ # This will be falsly TrueThis is a valid (even encouraged) usage of Python classes, I know it might not be common, but our code will have issues. I think the safest way is to force users to write their own |
@gaogaotiantian how does this affect existing UDTs written by users? |
|
So, the only case it will work is the two objects are the same object type1 = Vector(3)
type2 = type1
assert type2 == type1We can even enforce this by raising an However, it's possible that these proposals are a bit too strict and we want to provide some heuristics to the users. Or we just have to keep some backward compatibility for UDT. Falling back to |
|
I agree that we don't have a strict |
|
type1 = Vector(3)
type2 = Vector(3)
assert type1 == type2 |
|
I also like a default impl of is, that supports the use case of caching. |
@gaogaotiantian We can modify the builtin UDTs (VectorUDT/MatrixUDT/etc) but I am not sure whether |
|
Yeah for now falling back to the default implementation is not too bad. We can change it when users have issues. It should cover most of the use cases and the users can always implement their own if they are not happy with it. |
What changes were proposed in this pull request?
Fix the hashability of
UserDefinedTypeWhy are the changes needed?
UDT is not hashable, e.g.
see https://docs.python.org/3/reference/datamodel.html#object.__hash__
Does this PR introduce any user-facing change?
yes,
hash(udt)will work after this fixHow was this patch tested?
added tests
Was this patch authored or co-authored using generative AI tooling?
no