Vroom ignores col_names when dealing with imperfect data #522

Open
D3SL opened this issue Nov 13, 2023 · 0 comments
Real-world data is almost never perfect. Minor raggedness in a CSV can have any number of causes, ranging from missing quotes around a string that contains the delimiter to simple typos. One of R's greatest strengths is how good it is at dealing with situations like this. For example, the trivial solution used to be defining placeholder columns in col_names (or equivalent). That let you read the data and then clean it inside R:

library(readr)

# Edition 1 parser: five names for data whose header row has only four
with_edition(1,
  read_csv(
    col_names = c("testrow","name","region","region2","test"),
    skip = 1,
    I("testrow,name,region,test\n
1,jim,footown,06\n
2,bob,footown,41\n
3,tom,footown, bobstreet,99\n
4,steve,footown, bobstreet,47\n
5,george,footown, bobstreet,62\n")
  )
)

# A tibble: 5 × 5
  testrow name   region  region2    test
    <dbl> <chr>  <chr>   <chr>     <dbl>
1       1 jim    footown 06           NA
2       2 bob    footown 41           NA
3       3 tom    footown bobstreet    99
4       4 steve  footown bobstreet    47
5       5 george footown bobstreet    62

In vroom, and now new versions of readr, this is impossible. Even with col_names explicitly defined there is no way to force readr/vroom to do the right thing.

vroom::vroom(
  col_names = c("testrow","name","region","region2","test"),
  skip=1,
I("testrow,name,region,test\n
1,jim,footown,06\n
2,bob,footown,41\n
3,tom,footown, bobstreet,99\n
4,steve,footown, bobstreet,47\n
5,george,footown, bobstreet,62\n"))


ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 5 × 4
  testrow name   region  region2     
    <dbl> <chr>  <chr>   <chr>       
1       1 jim    footown 06          
2       2 bob    footown 41          
3       3 tom    footown bobstreet,99
4       4 steve  footown bobstreet,47
5       5 george footown bobstreet,62
Warning message:
One or more parsing issues, call `problems()` on your data frame for details, e.g.:
  dat <- vroom(...)

There are three very important issues here.

The first is that a de facto monopoly in the R ecosystem has once again made a very user-hostile breaking change without any announcement, warning, or even documentation.

The second is the documentation. Not only is this behavior not documented, the documentation that does exist explicitly leads users to believe the opposite will happen:

col_names
Either TRUE, FALSE or a character vector of column names.

If TRUE, the first row of the input will be used as the column names, and will not be included in the data frame. If FALSE, column names will be generated automatically: X1, X2, X3 etc.

If col_names is a character vector, the values will be used as the names of the columns, and the first row of the input will be read into the first row of the output data frame.

Missing (NA) column names will generate a warning, and be filled in with dummy names ...1, ...2 etc. Duplicate column names will generate a warning and be made unique, see name_repair to control how this is done.

And the third is the behavior itself. It's a severe antipattern to have an argument like col_names and then silently ignore the user's input, leaving them wondering why they provided 5 column names and the function is reporting parsing problems about expecting 4 columns.

The ideal solution is obviously that user input should be authoritative. If a user supplies 5 columns vroom should return 5 columns with NAs where appropriate. But at absolute minimum the documentation should be changed to explicitly state that col_names is only a suggestion and will be ignored based on what vroom decides under the hood.
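In the meantime, a possible stopgap outside vroom/readr is base R's read.csv, which (with fill = TRUE) takes the column count from col.names and pads short rows with NA. This is only a sketch and not part of the report above: the txt variable is a hypothetical stand-in for the real file, and it assumes base R's parsing speed is acceptable for the data at hand.

```r
# Hypothetical stand-in for the ragged file from the examples above.
txt <- "testrow,name,region,test
1,jim,footown,06
2,bob,footown,41
3,tom,footown, bobstreet,99
4,steve,footown, bobstreet,47
5,george,footown, bobstreet,62"

dat <- read.csv(
  text = txt,
  header = FALSE,      # the names come from col.names instead
  skip = 1,            # drop the 4-column header row
  col.names = c("testrow", "name", "region", "region2", "test"),
  fill = TRUE,         # pad rows that have fewer than 5 fields
  strip.white = TRUE   # trim the space in " bobstreet"
)

print(dat)
```

The short rows keep their last value in region2 and get NA in test, so the misaligned values still have to be cleaned up in R afterwards, as with the edition 1 result.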
