trino to arrow type conversion for spooling protocol #24891

Open
dysn opened this issue Feb 3, 2025 · 2 comments

dysn commented Feb 3, 2025

to support the arrow ipc format for spooling, we need to be able to convert trino's internal data format to arrow

arrow types do not directly map to all trino types. in general, we can handle mismatches with some combination of the following:

  1. perform a lossy conversion, e.g. truncate trino picosecond timestamps to arrow nanosecond timestamps
  2. use an arrow extension type. arrow extension types allow a user to define a type and how it maps to underlying arrow types, e.g. create a 12 byte integer for picosecond timestamps
  3. fail queries that produce types that can't be converted
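
a minimal sketch of the three options applied to a picosecond timestamp, in plain python with no arrow dependency. it assumes the value arrives as a pair (microseconds since epoch, picoseconds within that microsecond); that pair, the function names and the 8+4 byte layout are illustrative assumptions, not trino's or arrow's actual api.

```python
import struct

PICOS_PER_NANO = 1_000
NANOS_PER_MICRO = 1_000


def to_nanos_lossy(epoch_micros: int, picos_of_micro: int) -> int:
    """Option 1: truncate to nanoseconds, dropping the sub-nanosecond digits."""
    return epoch_micros * NANOS_PER_MICRO + picos_of_micro // PICOS_PER_NANO


def to_extension_storage(epoch_micros: int, picos_of_micro: int) -> bytes:
    """Option 2: keep full precision in a 12-byte value (8-byte micros + 4-byte picos).

    This is the kind of fixed-width storage an extension type could wrap.
    """
    return struct.pack("<qi", epoch_micros, picos_of_micro)


def from_extension_storage(raw: bytes) -> tuple:
    """Inverse of to_extension_storage: recover (epoch_micros, picos_of_micro)."""
    return struct.unpack("<qi", raw)


def to_nanos_strict(epoch_micros: int, picos_of_micro: int) -> int:
    """Option 3: fail instead of silently losing precision."""
    if picos_of_micro % PICOS_PER_NANO != 0:
        raise ValueError("value has sub-nanosecond precision and cannot be converted")
    return to_nanos_lossy(epoch_micros, picos_of_micro)
```

the lossy and strict variants only differ in what happens when the last three picosecond digits are non-zero, so the choice between options 1 and 3 could plausibly be a per-session or per-client setting.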

the following types are known to have issues

  1. time and timestamps. trino can represent times with picosecond precision; arrow is limited to nanoseconds.
  2. timestamp and time with timezone. trino can represent timestamps with a value specific timezone. arrow (like parquet) cannot store a per-value timezone; in arrow the timezone is part of the timestamp type, so it is fixed for the whole column
  3. hyperloglog
  4. user defined types
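
to illustrate item 2: when every value carries its own zone but the column can declare only one, an obvious workaround is to normalize all values to utc. that keeps the instant but drops the original zone. the sketch below uses only the python standard library with fixed offsets; the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def normalize_column_to_utc(values):
    """Convert each zoned value to UTC so the whole column shares one timezone."""
    return [v.astimezone(timezone.utc) for v in values]


# Two rows describing the same instant, each with its own offset.
rows = [
    datetime(2025, 2, 3, 12, 0, tzinfo=timezone(timedelta(hours=1))),
    datetime(2025, 2, 3, 6, 0, tzinfo=timezone(timedelta(hours=-5))),
]

utc_rows = normalize_column_to_utc(rows)
# After normalization both rows carry a zero offset; the +01:00 / -05:00
# information from the original values is no longer recoverable.
```

keeping the zone would need either a second column holding it or an extension type, which is the trade-off listed in the options above.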

summarizing a slack discussion on this topic

@dain suggested that there may be some dependency on client protocol
"The difficulty with Arrow is everyone tries to mix together “what should we do for Trino (java) clients” with “What should we do for Python”. For python, my opinion is you map everything to standard arrow types. For everything else, my opinion is you use a custom protocol as you can support all types, and be much more efficient."

@wendigo suggested that we can ignore user defined types for now

"I think that we can narrow the Arrow-support scope only to the Trino-owned types, without the support for custom types at the moment (I’ve never seen a single external type usage)"

@dain suggested that we could support user defined types similarly to how arrow supports extension types
"I think we need to extend types so that they can describe how they are mapped to low level primitive types. Then if we are in a situation where we don’t understand how to serialize a type (e.g., compatible arrow), the type gives us direction."
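
one way to read this suggestion: each type carries a description of its layout in terms of low level primitives, and a serializer that does not know the type falls back to that description. everything below (the class, the `storage_layout` field, the mapping table) is hypothetical and not an existing trino spi; it is only meant to make the idea concrete.

```python
from dataclasses import dataclass

# Types the serializer already knows how to map (illustrative subset).
KNOWN_ARROW_MAPPINGS = {"bigint": "int64", "varchar": "utf8"}


@dataclass(frozen=True)
class TypeDescriptor:
    name: str
    # Primitives the type says it is built from; empty if it gives no direction.
    storage_layout: tuple = ()


def resolve_arrow_storage(t: TypeDescriptor) -> tuple:
    """Pick arrow storage: built-in mapping first, then the type's own description."""
    if t.name in KNOWN_ARROW_MAPPINGS:
        return (KNOWN_ARROW_MAPPINGS[t.name],)
    if t.storage_layout:
        return t.storage_layout
    raise TypeError(f"no arrow mapping for type {t.name!r}")


# A made-up user defined type that describes itself as two doubles.
point = TypeDescriptor("my_point", ("float64", "float64"))
```

here `resolve_arrow_storage(point)` returns the layout the type declared, while a type with no built-in mapping and no declared layout raises.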

wendigo (Contributor) commented Feb 3, 2025

I agree that we should start small:

  • in the first iteration we should support only the Trino types that Arrow supports directly (without adding extension types)
  • in the second iteration we can add extension types for the types outside of the Arrow type system
  • in the third iteration we can add support for custom Trino types (this is more of a nice-to-have than a must-have)

dysn (Author) commented Feb 4, 2025

that makes sense to me.

for the first iteration, do you think we should just return an error when incompatible types are used?

if you want, i think i can get the first and the second iterations done
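
a rough sketch of what "return an error" could look like in the first iteration: check the result schema before any data is spooled, so the query fails fast and lists every unsupported column. the supported set, the exception name and the function names are illustrative assumptions, not an agreed list or an existing api.

```python
# Illustrative guess at a first-iteration allow list, keyed by base type name.
FIRST_ITERATION_SUPPORTED = {
    "boolean", "integer", "bigint", "double", "varchar", "varbinary", "date",
}


class UnsupportedArrowTypeError(Exception):
    """Raised when a result column has no arrow mapping yet."""


def base_type(type_signature: str) -> str:
    """Strip parameters, e.g. 'varchar(10)' -> 'varchar'."""
    return type_signature.split("(", 1)[0].strip().lower()


def check_result_schema(columns):
    """columns: list of (column_name, trino_type_signature) pairs."""
    unsupported = [
        (name, sig) for name, sig in columns
        if base_type(sig) not in FIRST_ITERATION_SUPPORTED
    ]
    if unsupported:
        details = ", ".join(f"{name} {sig}" for name, sig in unsupported)
        raise UnsupportedArrowTypeError(
            f"cannot encode result as arrow, unsupported columns: {details}"
        )
```

for example, `check_result_schema([("id", "bigint"), ("hll", "hyperloglog")])` raises and names the `hll` column, while a schema made only of supported types passes silently.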
