Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fread(logicalYN=TRUE) could avoid detecting 'y'/'n' in header detection? #6643

Open
MichaelChirico opened this issue Dec 10, 2024 · 1 comment
Labels

Comments

@MichaelChirico
Copy link
Member

MichaelChirico commented Dec 10, 2024

Follow-up to #4564. There we set logicalYN=FALSE by default for back-compatibility because currently auto-detection of header has a bad interaction with logicalYN -- fread finds y in the header row and thinks "that's non-character data, namely, TRUE", thereby concluding header=FALSE. See the tests in that PR.

Ideally we can disable this parser for the header, but I don't think we have any logic for subsetting the valid parsers for the header only.

We'd also want to think through if doing so will create any undesirable edge cases. The only thing I can think of is something pathological like a file where header=FALSE that consists only of columns named y or n, then we might incorrectly detect header=TRUE based on the string-only data in the first row. AFAICT any realistic example where there are other columns would do fine. But perhaps we can also revisit the header detection logic -- I might think something like "do type detection on row 1; then do type detection on samples of rows 2...N, and compare the inferred types" would be fine.

Anyway, if we can remove these false positives of detecting y --> TRUE, I think we could switch to logicalYN=TRUE by default.

@MichaelChirico
Copy link
Member Author

Ah, well regarding switching the default, I'm reminded of the discussion for logical01:

19. In the NEWS for v1.11.0 (May 2018), section "NOTICE OF INTENDED FUTURE POTENTIAL BREAKING CHANGES" item 2, we stated the intention to eventually change `logical01` to be `TRUE` by default. After some consideration, reflection, and community input, we have decided not to move forward with this plan, i.e., `logical01` will remain `FALSE` by default in both `fread()` and `fwrite()`. See discussion in #5856; most importantly, we think changing the default would be a major disruption to reading "sharded" CSVs where data with the same schema is split into many files, some of which could be converted to `logical` while others remain `integer`. We will retain the option `datatable.logical01` for users who wish to use a different default -- for example, if you are doing input/output on tables with a large number of logical columns, where writing '0'/'1' to the CSV many millions of times is preferable to writing 'TRUE'/'FALSE'.

So we probably shouldn't plan to switch the default. Though we still have some vestiges of that discussion to remove from the docs...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant