Using a dict in schema parameter in LazyFrame constructor implicitly sets expected column order #21659

Open
ekorchmar opened this issue Mar 8, 2025 · 0 comments
Labels
documentation Improvements or additions to documentation

Description

The documentation wording does not make this behavior clear: a dict passed as schema fixes the expected column order as well as the dtypes.

Python dictionaries have preserved insertion order for a while now (guaranteed since Python 3.7), but ordering is not their primary purpose. I tried to read a file while supplying a predefined schema and got dtype conversion errors, because the columns in the file were not in the same order as the keys of my dict. Passing the same dict as schema_overrides instead worked as expected.

I agree with this behavior, but the error message is very unhelpful:

  File "<MY_CODE_CALLING_POLARS>.py", line 111, in materialize
    self.data = self._lazy_frame.collect()
                ~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "venv/lib/python3.13/site-packages/polars/lazyframe/frame.py", line 2065, in collect
    return wrap_df(ldf.collect(callback))
                   ~~~~~~~~~~~^^^^^^^^^^
polars.exceptions.ComputeError: could not parse `Drug` as dtype `date` at column 'valid_end_date' (column number 8)

The current offset in the file is 44 bytes.

You might want to try:
- increasing `infer_schema_length` (e.g. `infer_schema_length=10000`),
- specifying correct dtype with the `schema_overrides` argument
- setting `ignore_errors` to `True`,
- adding `Drug` to the `null_values` list.

Original error: ```could not find a 'date/datetime' pattern for 'Drug'

The schema parameter should be documented as specifying the column order in addition to the data types.

Alternatively, reordering the schema to match the file could be enabled through a separate flag parameter. Since schema_overrides already covers the use case of loosely structured data, this may be less important.

Use cases it would help:

  • Writing forward compatible code, where schema can be extended over time
  • Writing code that operates on aggregated data from many sources, where column order may not always be stable
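
For illustration, here is a minimal caller-side sketch of the kind of reordering described above. It reads the header row and rebuilds the schema dict in that order. The helper name `align_schema_to_header`, the `concept_class` column and the sample data are hypothetical; only `valid_end_date` and the value `Drug` come from the traceback. Dtypes are plain strings so the sketch runs without Polars installed.

```python
import csv
import io


def align_schema_to_header(header, schema):
    """Return a copy of `schema` whose keys follow the order of `header`.

    Raises ValueError if the two disagree on column names.
    """
    missing = [name for name in header if name not in schema]
    extra = [name for name in schema if name not in header]
    if missing or extra:
        raise ValueError(f"schema mismatch: missing={missing}, extra={extra}")
    return {name: schema[name] for name in header}


# Hypothetical two-column file (assumption, not the real input).
sample = io.StringIO("concept_class,valid_end_date\nDrug,2099-12-31\n")
header = next(csv.reader(sample))

# Declared in a different order than the file. The string values stand in
# for Polars dtypes such as pl.Date and pl.String.
declared = {"valid_end_date": "date", "concept_class": "string"}

aligned = align_schema_to_header(header, declared)
print(list(aligned))  # ['concept_class', 'valid_end_date']
```

With real Polars dtypes as values, the aligned dict could then be passed as schema, since its key order now matches the file. The helper fails on any name mismatch because schema expects the complete set of columns.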

Link

https://docs.pola.rs/api/python/stable/reference/lazyframe/index.html

@ekorchmar ekorchmar added the documentation Improvements or additions to documentation label Mar 8, 2025
ekorchmar added a commit to ekorchmar/ohdsi-hekate that referenced this issue Mar 8, 2025