tibble: A tbl may contain an external pointer via attribute problems set by readr #9

Open
HenrikBengtsson opened this issue Oct 19, 2023 · 4 comments

Comments

@HenrikBengtsson
Collaborator

A tbl may contain an external pointer via attribute problems, e.g.

spc_tbl_ [25,000 × 3] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ Index        : num [1:25000] 1 2 3 4 5 6 7 8 9 10 ...
 $ Height_Inches: num [1:25000] 65.8 71.5 69.4 68.2 67.8 ...
 $ Weight_Pounds: num [1:25000] 113 136 153 142 144 ...
 - attr(*, "spec")=
  .. cols(
  ..   Index = col_double(),
  ..   Height_Inches = col_double(),
  ..   Weight_Pounds = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
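
A quick way to check whether an object carries such attributes is to scan its attributes for the `externalptr` type. The following is only a minimal base-R sketch: `extptr_attrs()` is a hypothetical helper name, not part of readr, vroom, or tibble, and `methods::new("externalptr")` is used purely as a stand-in for the pointer that readr attaches.

```r
# Hypothetical helper: names of all attributes of `x` that are external pointers
extptr_attrs <- function(x) {
  attrs <- attributes(x)
  if (is.null(attrs)) return(character(0))
  is_ptr <- vapply(attrs, function(a) typeof(a) == "externalptr", logical(1))
  names(attrs)[is_ptr]
}

# Stand-in object (assumption): a data frame with a pointer-valued attribute
x <- data.frame(Index = 1:3)
attr(x, "problems") <- methods::new("externalptr")

extptr_attrs(x)
```

An object without attributes, or with only regular attributes, yields `character(0)`.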
@HenrikBengtsson
Collaborator Author

It's actually the readr package that adds the problems attribute. From help("problems", package = "readr"):

"Readr functions will only throw an error if parsing fails in an unrecoverable way. However, there are lots of potential problems that you might want to know about - these are stored in the problems attribute of the output ..."

@HenrikBengtsson HenrikBengtsson changed the title from "tibble: A tbl may contain an external pointer via attribute problems" to "tibble: A tbl may contain an external pointer via attribute problems set by **readr**" Dec 13, 2023
@HenrikBengtsson
Collaborator Author

marshal() on a tbl object could simply drop the problems attribute.

@HenrikBengtsson HenrikBengtsson changed the title from "tibble: A tbl may contain an external pointer via attribute problems set by **readr**" to "tibble: A tbl may contain an external pointer via attribute problems set by readr" Dec 13, 2023
@HenrikBengtsson
Collaborator Author

HenrikBengtsson commented Dec 13, 2023

marshal() on a tbl object could simply drop the problems attribute.

Ah, the problems attribute may also contain non-pointer objects, so we don't always have to drop it. For example,

> x <- parse_integer(c("1X", "blah", "3"))
Warning: 2 parsing failures.
row col               expected actual
  1  -- no trailing characters   1X  
  2  -- no trailing characters   blah

> str(x)
 int [1:3] NA NA 3
 - attr(*, "problems")= tibble [2 × 4] (S3: tbl_df/tbl/data.frame)
  ..$ row     : int [1:2] 1 2
  ..$ col     : int [1:2] NA NA
  ..$ expected: chr [1:2] "no trailing characters" "no trailing characters"
  ..$ actual  : chr [1:2] "1X" "blah"
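
Given that, the "drop" idea could be made conditional: remove the problems attribute only when it is an external pointer and leave a regular tibble or data frame in place. This is a sketch under assumptions, not the package's implementation: `drop_pointer_problems()` is a hypothetical name, and `methods::new("externalptr")` again stands in for the pointer set by readr.

```r
# Hypothetical helper: drop "problems" only if it is an external pointer
drop_pointer_problems <- function(x) {
  problems <- attr(x, "problems", exact = TRUE)
  if (typeof(problems) == "externalptr") {
    attr(x, "problems") <- NULL
  }
  x
}

# Non-pointer case (as with parse_integer() above): the attribute is kept
y <- c(NA, NA, 3L)
attr(y, "problems") <- data.frame(row = 1:2, actual = c("1X", "blah"))
y <- drop_pointer_problems(y)

# Pointer case (stand-in pointer, an assumption): the attribute is removed
z <- data.frame(Index = 1:3)
attr(z, "problems") <- methods::new("externalptr")
z <- drop_pointer_problems(z)
```

Dropping loses the recorded parsing problems in the pointer case, which is the trade-off against materializing them first.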

More clues about alternatives can be found in:

readr:::problems
function (x = .Last.value) 
{
    problems <- probs(x)
    if (is.null(problems)) {
        return(invisible(no_problems))
    }
    if (inherits(problems, "tbl_df")) {
        return(problems)
    }
    vroom::problems(x)
}

So, it looks like vroom might be involved too:

> vroom::problems
function (x = .Last.value, lazy = FALSE) 
{
    if (!inherits(x, "tbl_df")) {
        cli::cli_abort(c("The {.arg x} argument of {.fun vroom::problems} must be a data frame created by vroom:", 
            x = "{.arg x} has class {.cls {class(x)}}"))
    }
    if (!isTRUE(lazy)) {
        vroom_materialize(x, replace = FALSE)
    }
    probs <- attr(x, "problems")
    if (typeof(probs) != "externalptr") {
        cli::cli_abort(c("The {.arg x} argument of {.fun vroom::problems} must be a data frame created by vroom:", 
            x = "{.arg x} seems to have been created with something else, maybe readr?"))
    }
    probs <- vroom_errors_(probs)
    probs <- probs[!duplicated(probs), ]
    probs <- probs[order(probs$file, probs$row, probs$col), ]
    tibble::as_tibble(probs)
}
<environment: namespace:vroom>

@HenrikBengtsson
Collaborator Author

HenrikBengtsson commented Dec 13, 2023

From the above, marshalling of tbl_df (sic!) could rely on the following "pruning" method:

prune.tbl_df <- function(x, ...) {
  problems <- attr(x, "problems", exact = TRUE)

  ## Materialize `problems` stored elsewhere in this process?
  if (typeof(problems) == "externalptr") {
    ## vroom::problems() returns the problems as a regular tibble
    problems <- vroom::problems(x)
    attr(x, "problems") <- problems
  }

  x
}

Comment: We could use NextMethod("prune") at the end.

Comment 2: We've punted on the idea of having prune() methods thus far, but maybe this is an argument for having them. Maybe it should be named something other than "prune", because pruning could also mean "drop unnecessary content".
