Open
Labels: bug (Something isn't working)
Description
Describe the bug
Reported by an external user. They are running the removal workflow, which calls
pd.read_parquet(removal_path, **read_kwargs)
where removal_path is gs://bucket/path.
I'm unsure whether read_kwargs contains filesystem, storage_options, or neither (i.e. authenticated through IAM).
File "/tmp/ray/session_2025-10-30_14-15-10_692448_4479/runtime_resources/working_dir_files/gs_storage-bucket-cld-eyrl8sj57qhhpw9nl79rmfh1al_org_ty7jnxgzhmr7fin5j6kgk3clxl_cld_eyrl8sj57qhhpw9nl79rmfh1al_runtime_env_packages_pkg_35890417d05366313a5fcae2e351efba/olympus/projects/pythia/data/ray/dedup/semantic/removal.py", line 89, in process
removal_df = pd.read_parquet(
^^^^^^^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/pandas/io/parquet.py", line 667, in read_parquet
return impl.read(
^^^^^^^^^^
File "/home/ray/anaconda3/lib/python3.12/site-packages/pandas/io/parquet.py", line 274, in read
pa_table = self.api.parquet.read_table(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-30_14-15-10_692448_4479/runtime_resources/pip/717bcc3c8edf5a7a09cb329c9c31382fb4635bb9/virtualenv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1844, in read_table
dataset = ParquetDataset(
^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-30_14-15-10_692448_4479/runtime_resources/pip/717bcc3c8edf5a7a09cb329c9c31382fb4635bb9/virtualenv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1424, in __init__
self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-30_14-15-10_692448_4479/runtime_resources/pip/717bcc3c8edf5a7a09cb329c9c31382fb4635bb9/virtualenv/lib/python3.12/site-packages/pyarrow/dataset.py", line 790, in dataset
return _filesystem_dataset(source, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/tmp/ray/session_2025-10-30_14-15-10_692448_4479/runtime_resources/pip/717bcc3c8edf5a7a09cb329c9c31382fb4635bb9/virtualenv/lib/python3.12/site-packages/pyarrow/dataset.py", line 480, in _filesystem_dataset
factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow/_dataset.pyx", line 3418, in pyarrow._dataset.FileSystemDatasetFactory.__init__
File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'bucket/path/ffb90690f0e3.parquet', which is outside base dir 'gs://bucket/path'
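The error message shows a listed file path without the gs:// scheme being compared against a base dir that still carries it. One plausible explanation, which is an assumption and not confirmed by this report, is that read_kwargs contains an explicit filesystem object that lists files without the scheme. A minimal sketch of a possible workaround under that assumption follows; normalize_parquet_path is a hypothetical helper, not part of pandas or pyarrow.

```python
from urllib.parse import urlsplit


def normalize_parquet_path(path, read_kwargs):
    """Return a path matching the kwargs that will be passed to pd.read_parquet.

    Assumption (not confirmed in the report): when an explicit `filesystem`
    is supplied, it lists files without the URL scheme, so the base path
    should drop the scheme as well.
    """
    if read_kwargs.get("filesystem") is None:
        # No explicit filesystem: keep the full URL so the backend can be inferred.
        return path
    parts = urlsplit(path)
    if not parts.scheme:
        # Already scheme-less, nothing to strip.
        return path
    # "gs://bucket/path" -> "bucket/path"
    return f"{parts.netloc}{parts.path}"
```

If the assumption holds, the call site could become pd.read_parquet(normalize_parquet_path(removal_path, read_kwargs), **read_kwargs). When read_kwargs has no filesystem entry, the path is passed through unchanged.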
Steps/Code to reproduce bug
Please list minimal steps or code snippet for us to be able to reproduce the bug.
A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports
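No reproduction has been provided yet. Since it is unknown which of the three cases (filesystem, storage_options, or neither) the user hit, a small diagnostic like the sketch below could be logged just before the read to narrow it down. describe_read_kwargs is a hypothetical helper introduced here for illustration only.

```python
def describe_read_kwargs(read_kwargs):
    """Summarize how a pd.read_parquet call is expected to reach storage.

    Hypothetical diagnostic: it only inspects the kwargs dict and does not
    touch the network or validate credentials.
    """
    kwargs = read_kwargs or {}
    if kwargs.get("filesystem") is not None:
        # An explicit filesystem object was supplied by the caller.
        return f"filesystem={type(kwargs['filesystem']).__name__}"
    if kwargs.get("storage_options"):
        # Non-empty storage_options are forwarded to the storage backend.
        return "storage_options"
    # Neither given: access relies on ambient credentials (e.g. IAM).
    return "default credentials"
```

Logging the returned string alongside removal_path in the failing job would show which code path leads to the ArrowInvalid error.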
Expected behavior
A clear and concise description of what you expected to happen.
Additional context
Add any other context about the problem here.