pandas read_parquet on a directory might give error on cloud files #1217

@praateekmahajan

Description

Describe the bug
Reported by an external user running the removal workflow, which calls

pd.read_parquet(removal_path, **read_kwargs)

Where removal_path is gs://bucket/path

I'm unsure whether read_kwargs contains filesystem, storage_options, or neither (i.e. authenticated through IAM)
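
To narrow this down, something like the helper below could be logged right before the read. It is only a sketch: describe_auth_mode is a hypothetical name, and it assumes read_kwargs is either None or a plain dict.

```python
from typing import Any, Mapping, Optional


def describe_auth_mode(read_kwargs: Optional[Mapping[str, Any]]) -> str:
    """Report which cloud-access option the caller supplied (hypothetical helper)."""
    kwargs = read_kwargs or {}
    if kwargs.get("filesystem") is not None:
        # An explicit filesystem object was passed through to the reader.
        return "filesystem"
    if kwargs.get("storage_options") is not None:
        # Credentials/options are passed for the filesystem to be built from the URL.
        return "storage_options"
    # Neither key is set, so access relies on ambient credentials such as IAM.
    return "ambient"
```

Logging describe_auth_mode(read_kwargs) together with removal_path would show which of the three cases the failing run was in.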

  File "/tmp/ray/session_2025-10-30_14-15-10_692448_4479/runtime_resources/working_dir_files/gs_storage-bucket-cld-eyrl8sj57qhhpw9nl79rmfh1al_org_ty7jnxgzhmr7fin5j6kgk3clxl_cld_eyrl8sj57qhhpw9nl79rmfh1al_runtime_env_packages_pkg_35890417d05366313a5fcae2e351efba/olympus/projects/pythia/data/ray/dedup/semantic/removal.py", line 89, in process
    removal_df = pd.read_parquet(
                 ^^^^^^^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/pandas/io/parquet.py", line 667, in read_parquet
    return impl.read(
           ^^^^^^^^^^
  File "/home/ray/anaconda3/lib/python3.12/site-packages/pandas/io/parquet.py", line 274, in read
    pa_table = self.api.parquet.read_table(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-10-30_14-15-10_692448_4479/runtime_resources/pip/717bcc3c8edf5a7a09cb329c9c31382fb4635bb9/virtualenv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1844, in read_table
    dataset = ParquetDataset(
              ^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-10-30_14-15-10_692448_4479/runtime_resources/pip/717bcc3c8edf5a7a09cb329c9c31382fb4635bb9/virtualenv/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 1424, in __init__
    self._dataset = ds.dataset(path_or_paths, filesystem=filesystem,
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-10-30_14-15-10_692448_4479/runtime_resources/pip/717bcc3c8edf5a7a09cb329c9c31382fb4635bb9/virtualenv/lib/python3.12/site-packages/pyarrow/dataset.py", line 790, in dataset
    return _filesystem_dataset(source, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/ray/session_2025-10-30_14-15-10_692448_4479/runtime_resources/pip/717bcc3c8edf5a7a09cb329c9c31382fb4635bb9/virtualenv/lib/python3.12/site-packages/pyarrow/dataset.py", line 480, in _filesystem_dataset
    factory = FileSystemDatasetFactory(fs, paths_or_selector, format, options)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 3418, in pyarrow._dataset.FileSystemDatasetFactory.__init__
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: GetFileInfo() yielded path 'bucket/path/ffb90690f0e3.parquet', which is outside base dir 'gs://bucket/path'
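
One plausible cause (not confirmed for this report) is that an fsspec filesystem was passed via filesystem while removal_path still carried the gs:// prefix: the filesystem lists files as 'bucket/path/...', which pyarrow then compares against the base dir 'gs://bucket/path'. A possible workaround, sketched below, is to drop the scheme from the path whenever an explicit filesystem is supplied. strip_protocol and normalize_read_args are hypothetical names, not part of pandas or pyarrow.

```python
from typing import Any, Dict, Mapping, Optional, Tuple


def strip_protocol(path: str) -> Tuple[Optional[str], str]:
    """Split 'gs://bucket/path' into ('gs', 'bucket/path'); leave bare paths alone."""
    if "://" in path:
        protocol, _, rest = path.partition("://")
        return protocol, rest
    return None, path


def normalize_read_args(
    path: str, read_kwargs: Optional[Mapping[str, Any]]
) -> Tuple[str, Dict[str, Any]]:
    """Return (path, kwargs) with the scheme removed when a filesystem is given.

    Assumption: the supplied filesystem expects scheme-less paths, as fsspec
    filesystems do when listing files.
    """
    kwargs = dict(read_kwargs or {})
    if kwargs.get("filesystem") is not None:
        _, path = strip_protocol(path)
    return path, kwargs


# Intended use in the removal workflow (not executed here):
#   path, kwargs = normalize_read_args(removal_path, read_kwargs)
#   removal_df = pd.read_parquet(path, **kwargs)
```

When no filesystem is passed, the path is left untouched so that pandas can still resolve gs:// itself.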


Metadata

Labels

bug (Something isn't working)
