Distributed Execution on Beam #16

Open
alxmrs opened this issue Feb 17, 2024 · 6 comments

@alxmrs
Owner

alxmrs commented Feb 17, 2024

Figure out a way to distribute all layers of SQL execution #10 on Apache Beam.

@alxmrs
Owner Author

alxmrs commented Feb 18, 2024

Dataframes: https://beam.apache.org/documentation/dsls/dataframes/overview/
Xarray: Xarray-Beam

@alxmrs
Owner Author

alxmrs commented Mar 12, 2024

Beam's DataFrame library supports multi-indexes.

https://beam.apache.org/releases/pydoc/current/apache_beam.dataframe.io.html

This alone makes Beam worth exploring sooner rather than later.
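
To make the multi-index point concrete, here is a minimal pandas-only sketch (no Beam involved) of how a small gridded variable flattens into a table with one index level per dimension. The dimension names, grid sizes, and the `temperature` column are made up for illustration and are not taken from this project.

```python
import numpy as np
import pandas as pd

# Hypothetical 3-D grid: 2 time steps x 3 latitudes x 4 longitudes.
times = pd.date_range("2024-01-01", periods=2, freq="D")
lats = [-10.0, 0.0, 10.0]
lons = [0.0, 90.0, 180.0, 270.0]
values = np.arange(2 * 3 * 4, dtype="float64").reshape(2, 3, 4)

# One index level per dimension. A C-order ravel of the array matches the
# row order produced by MultiIndex.from_product (last level varies fastest).
index = pd.MultiIndex.from_product(
    [times, lats, lons], names=["time", "lat", "lon"]
)
df = pd.DataFrame({"temperature": values.ravel()}, index=index)

# A SQL-style aggregation (GROUP BY lat) over a single index level.
mean_by_lat = df.groupby(level="lat")["temperature"].mean()
print(mean_by_lat)
```

Because every row carries its full coordinates in the index, a `GROUP BY` over any dimension needs no knowledge of the original array layout.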

@cisaacstern
Collaborator

Interesting!

@alxmrs
Owner Author

alxmrs commented Mar 12, 2024

Some general thoughts on this issue in no particular order:

  • I think this would make a good 0.1 release
  • Users can call xarray_sql.beam.read_xarray()
  • Beam supports the pandas API with small differences, most of which stem from the fact that PCollections are unordered.
  • Unknown: is there a from_map-like interface to make implementing this easy?
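
The ordering caveat in the list above can be illustrated with a rough sketch. It uses plain pandas rather than Beam, and simulates an unordered collection by reversing the rows; the column names and values are invented for the example. Which operations Beam actually rejects should be checked against its documentation.

```python
import pandas as pd

df = pd.DataFrame(
    {"lat": [0.0, 0.0, 10.0, 10.0], "temperature": [1.0, 3.0, 5.0, 7.0]}
)
# Simulate an unordered collection: same rows, different arrival order.
shuffled = df.iloc[::-1].reset_index(drop=True)

# Order-independent: a grouped aggregation gives the same answer either way.
mean_a = df.groupby("lat")["temperature"].mean()
mean_b = shuffled.groupby("lat")["temperature"].mean()
same_aggregate = mean_a.equals(mean_b)

# Order-dependent: "the first two rows" changes with the row order.
same_head = df.head(2).equals(shuffled.head(2))

print(same_aggregate, same_head)  # True False
```

Most SQL aggregations fall in the first group, which suggests the unordered model may be less of an obstacle for this use case than for general pandas code.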

@alxmrs
Owner Author

alxmrs commented Mar 12, 2024

This may not be feasible after all. It looks like HDF5 is intentionally not supported because it is a random-access format. I think Xarray would share this characteristic.

https://beam.apache.org/releases/pydoc/current/_modules/apache_beam/dataframe/io.html

Maybe this warrants creating an xarray-beam-like library for pandas or Dask? Can a pd.(Multi)Index mimic an xbeam key?
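
One way to explore the key question is a small pandas-only sketch, assuming an xbeam key is essentially a record of per-dimension integer offsets for a chunk. The `split_by_time` helper and the plain-dict "key" below are hypothetical stand-ins, not part of xarray-beam or of this project.

```python
import pandas as pd

# Hypothetical table indexed by (time, lat): 4 time steps x 3 latitudes.
index = pd.MultiIndex.from_product(
    [range(4), range(3)], names=["time", "lat"]
)
df = pd.DataFrame(
    {"temperature": [float(i) for i in range(len(index))]}, index=index
)


def split_by_time(frame, chunk_size):
    """Yield (offsets, chunk) pairs; `offsets` is a key-like plain dict."""
    level = frame.index.get_level_values("time")
    time_values = level.unique()
    for start in range(0, len(time_values), chunk_size):
        wanted = time_values[start:start + chunk_size]
        # Each chunk starts at position `start` along time and 0 along lat.
        yield {"time": start, "lat": 0}, frame[level.isin(wanted)]


chunks = list(split_by_time(df, chunk_size=2))
for offsets, chunk in chunks:
    print(offsets, len(chunk))

# Each chunk carries its own coordinates in the MultiIndex, so the chunks
# can be recombined in any order and sorted back into place.
restored = pd.concat([chunk for _, chunk in reversed(chunks)]).sort_index()
```

In this sketch the offsets are redundant with the index values themselves, which hints that a MultiIndex could carry the same positional information as a key, at the cost of storing coordinates on every row.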

@alxmrs
Owner Author

alxmrs commented Mar 13, 2024

A core question to answer: do we really need random access?
