ParquetDataset has an invalid parameter validate_schema #803

Open
ayushkarnawat opened this issue Jan 29, 2024 · 1 comment

Description

Parquet files cannot be read into a ParquetDataset object when using make_batch_reader. petastorm passes the deprecated parameter validate_schema=False to pq.ParquetDataset, and that parameter was removed in pyarrow v15.0.0.
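
To check whether an environment is affected, the installed pyarrow version can be compared against the release named above. This is only a minimal sketch: the helper names `major_version` and `accepts_validate_schema` are hypothetical, and the 15.0.0 cutoff is taken from this report rather than verified against the pyarrow changelog.

```python
from importlib import metadata


def major_version(version: str) -> int:
    """Return the leading integer of a version string such as '15.0.0'."""
    return int(version.split(".")[0])


def accepts_validate_schema(pyarrow_version: str) -> bool:
    # Assumption from this issue: the keyword was removed in pyarrow 15.0.0.
    return major_version(pyarrow_version) < 15


if __name__ == "__main__":
    try:
        installed = metadata.version("pyarrow")
    except metadata.PackageNotFoundError:
        print("pyarrow is not installed")
    else:
        print(installed, "accepts validate_schema:", accepts_validate_schema(installed))
```

Running this inside the failing environment should show whether the installed pyarrow is on the side of the cutoff that rejects the keyword.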

Actual behavior

/opt/omniai/work/instance1/jupyter/projects/luke/luke-env/lib/python3.9/site-packages/keras/src/backend.py:452: UserWarning: `tf.keras.backend.set_learning_phase` is deprecated and will be removed after 2020-10-11. To update it, simply pass a True/False value to the `training` argument of the `__call__` method of your layer or model.
  warnings.warn(
2024-01-26 19:39:09,352 [INFO] /opt/omniai/work/instance1/jupyter/projects/luke/docs/examples/data/parquet
INFO:luke_logger:/opt/omniai/work/instance1/jupyter/projects/luke/docs/examples/data/parquet
2024-01-26 19:39:09,352 [INFO] Mode: val
INFO:luke_logger:Mode: val
2024-01-26 19:39:09,352 [INFO] Reading from: /opt/omniai/work/instance1/jupyter/projects/luke/docs/examples/data/parquet
INFO:luke_logger:Reading from: /opt/omniai/work/instance1/jupyter/projects/luke/docs/examples/data/parquet
2024-01-26 19:39:09,352 [INFO] num_epochs is: None
INFO:luke_logger:num_epochs is: None
2024-01-26 19:39:09,352 [INFO] Reader pool type: thread
INFO:luke_logger:Reader pool type: thread
2024-01-26 19:39:09,353 [INFO] file:///opt/omniai/work/instance1/jupyter/projects/luke/docs/examples/data/parquet
INFO:luke_logger:file:///opt/omniai/work/instance1/jupyter/projects/luke/docs/examples/data/parquet
/opt/omniai/work/instance1/jupyter/projects/luke/luke-env/lib/python3.9/site-packages/petastorm/fs_utils.py:88: FutureWarning: pyarrow.localfs is deprecated as of 2.0.0, please use pyarrow.fs.LocalFileSystem instead.
  self._filesystem = pyarrow.localfs
KILL ALL!
Traceback (most recent call last):
  File "/usr/local/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/usr/local/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/omniai/work/instance1/jupyter/projects/luke/luke/train/run.py", line 315, in <module>
    main(obj={})
  File "/opt/omniai/work/instance1/jupyter/projects/luke/luke-env/lib/python3.9/site-packages/click/core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "/opt/omniai/work/instance1/jupyter/projects/luke/luke-env/lib/python3.9/site-packages/click/core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "/opt/omniai/work/instance1/jupyter/projects/luke/luke-env/lib/python3.9/site-packages/click/core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/omniai/work/instance1/jupyter/projects/luke/luke-env/lib/python3.9/site-packages/click/core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/omniai/work/instance1/jupyter/projects/luke/luke-env/lib/python3.9/site-packages/click/core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "/opt/omniai/work/instance1/jupyter/projects/luke/luke-env/lib/python3.9/site-packages/click/decorators.py", line 33, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/opt/omniai/work/instance1/jupyter/projects/luke/luke/train/run.py", line 169, in from_config
    run_steps(ctx)
  File "/opt/omniai/work/instance1/jupyter/projects/luke/luke/train/run.py", line 298, in run_steps
    luke_run.run_steps()
  File "/opt/omniai/work/instance1/jupyter/projects/luke/luke/train/runner_class_tf.py", line 69, in run_steps
    result = step.execute(self.global_config, self.model)
  File "/opt/omniai/work/instance1/jupyter/projects/luke/luke/train/steps_tf/train_evaluate_step.py", line 137, in execute
    evaluator_results = model.train_and_evaluate(
  File "/opt/omniai/work/instance1/jupyter/projects/luke/luke/model_defs/models_tf/base_keras_model.py", line 221, in train_and_evaluate
    history = self.fit(
  File "/opt/omniai/work/instance1/jupyter/projects/luke/luke/model_defs/models_tf/base_model.py", line 168, in fit
    return self.training_fn(
  File "/opt/omniai/work/instance1/jupyter/projects/luke/luke/model_defs/models_tf/base_keras_model.py", line 122, in training_fn
    validation_data = validation_data()
  File "/opt/omniai/work/instance1/jupyter/projects/luke/luke/data/base_reader.py", line 116, in val_input_fn
    dataset = dataset_fn(
  File "/opt/omniai/work/instance1/jupyter/projects/luke/luke/data/base_reader.py", line 280, in get_petastorm_dataset
    reader = self.get_reader(  # pylint: disable=E1123
  File "/opt/omniai/work/instance1/jupyter/projects/luke/luke/data/local_reader.py", line 33, in get_reader
    return make_batch_reader(
  File "/opt/omniai/work/instance1/jupyter/projects/luke/luke-env/lib/python3.9/site-packages/petastorm/reader.py", line 298, in make_batch_reader
    dataset_metadata.get_schema_from_dataset_url(dataset_url_or_urls, hdfs_driver=hdfs_driver,
  File "/opt/omniai/work/instance1/jupyter/projects/luke/luke-env/lib/python3.9/site-packages/petastorm/etl/dataset_metadata.py", line 402, in get_schema_from_dataset_url
    dataset = pq.ParquetDataset(path_or_paths, filesystem=fs, validate_schema=False, metadata_nthreads=10)
TypeError: __init__() got an unexpected keyword argument 'validate_schema'

Expected behavior

The dataset is loaded properly into the ParquetDataset object so that it can be consumed downstream.
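
One way a call site like the one in the traceback could tolerate both older and newer constructors is to drop keyword arguments the target does not accept before calling it. The sketch below is not petastorm's fix: `supported_kwargs`, `OldDataset`, and `NewDataset` are hypothetical stand-ins, and whether `inspect.signature` reports the real `pq.ParquetDataset` parameters accurately on every pyarrow version is an untested assumption.

```python
import inspect


def supported_kwargs(func, kwargs):
    """Return only the keyword arguments that func's signature accepts."""
    params = inspect.signature(func).parameters
    # A callable that takes **kwargs accepts every keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    return {k: v for k, v in kwargs.items() if k in params}


# Hypothetical stand-ins for an older and a newer constructor; not pyarrow classes.
class OldDataset:
    def __init__(self, path, filesystem=None, validate_schema=True, metadata_nthreads=1):
        self.path = path


class NewDataset:
    def __init__(self, path, filesystem=None):
        self.path = path


options = {"filesystem": None, "validate_schema": False, "metadata_nthreads": 10}

# The same options work against both constructors once they are filtered.
old = OldDataset("data/parquet", **supported_kwargs(OldDataset, options))
new = NewDataset("data/parquet", **supported_kwargs(NewDataset, options))
```

Silently dropping options can hide behavior changes, so a real patch would likely log or document which keywords were discarded.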


zae-park commented Apr 9, 2024

It seems the validate_schema argument was removed in a pyarrow update.
I resolved this issue by using petastorm=0.12.1 and pyarrow=10.0.1.
With petastorm's Reader and make_reader, the data loads successfully.
However, deprecation warnings are displayed because of the older versions, which is not pretty.
I hope someone can share a cleaner solution that works with the latest version.

Here is what worked for me:

import s3fs
from petastorm import make_reader
fs = s3fs.S3FileSystem(key="ACCESS_KEY", secret="SECRET_KEY", endpoint_url="ENDPOINT")
reader = make_reader(dataset_url="s3a://YOUR/DATA/PATH", filesystem=fs)

or

reader = Reader(dataset_path = "s3a://YOUR/DATA/PATH")

#758 (comment)
