Extract only certain files from zip #460

Open
ggrothendieck opened this issue Dec 16, 2024 · 3 comments

Comments

@ggrothendieck

I am currently doing this to extract only the csv files from a zip file, and wondered whether there is a more direct way. It would be nice if which= could be a pattern (regular expression or glob).

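library(rio)

import_csvs_from_zip <- function(x, ...) {
  # .list_archive() is an unexported rio function, hence the `:::`
  filenames <- rio:::.list_archive(x)
  # keep only the names ending in .csv
  csv_names <- grep("\\.csv$", filenames, value = TRUE)
  import_list(x, which = csv_names, ...)
}

import_csvs_from_zip("myzip.zip", rbind = TRUE)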
@chainsawriot
Collaborator

@ggrothendieck I agree that's a nice idea, but it would significantly change how import_list() behaves.

#' @param which If `file` is a single file path, this specifies which objects should be extracted (passed to [import()]'s `which` argument). Ignored otherwise.

We can only implement this kind of breaking change in the next major version, if it must be done through which. Another approach is to add a separate parameter, e.g. pattern, similar to the one in list.files().
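A separate pattern argument could be sketched as a thin wrapper that leaves which untouched. This is only an illustration: match_archive_names() and import_list_pattern() are hypothetical names, not part of rio, and the sketch lists the archive members with utils::unzip(list = TRUE) instead of rio's internal helper.

```r
# Hypothetical helper, not part of rio: pick archive member names by regex,
# mirroring the `pattern` argument of list.files().
match_archive_names <- function(filenames, pattern = NULL) {
  if (is.null(pattern)) {
    return(filenames)  # no pattern: keep every member
  }
  grep(pattern, filenames, value = TRUE)
}

# Hypothetical wrapper: list the zip's members with base R, filter them,
# then pass the surviving names to rio::import_list() through `which`.
import_list_pattern <- function(file, pattern = NULL, ...) {
  filenames <- utils::unzip(file, list = TRUE)$Name
  rio::import_list(file, which = match_archive_names(filenames, pattern), ...)
}
```

Under these assumptions, import_list_pattern("myzip.zip", pattern = "\\.csv$", rbind = TRUE) would read only the csv members, and existing calls that use which would be unaffected.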

@ggrothendieck
Author

ggrothendieck commented Dec 20, 2024

Some options to maintain backwards compatibility would be:

  • have a different parameter, e.g. regex=, which is like which but takes a regular expression
  • have another argument, such as fixed = TRUE, that affects how which is interpreted
  • if the which argument has class "AsIs" (or some other agreed-upon class), interpret it as a regular expression, e.g. which = I(".*\\.csv$"). The tidyverse does something similar but with its own class and wrapper, stringr::regex(...); for example, see the delim= argument in ?separate_wider_delim
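The "AsIs" option from the last bullet could be handled with a small dispatch step before the names are read. The sketch below is an assumption about how that might look, not rio code; resolve_which() is a hypothetical helper name.

```r
# Hypothetical helper, not part of rio: a `which` value wrapped in I()
# is treated as a regular expression over the archive's member names.
resolve_which <- function(which, filenames) {
  if (inherits(which, "AsIs")) {
    pattern <- as.character(unclass(which))  # drop the "AsIs" class
    return(grep(pattern, filenames, value = TRUE))
  }
  which  # anything else keeps its current meaning
}

filenames <- c("a.csv", "notes.txt", "sub/b.csv")
resolve_which(I("\\.csv$"), filenames)  # regex: selects the two csv names
resolve_which("a.csv", filenames)       # plain string: passed through as-is
```

Because only values wrapped in I() are reinterpreted, plain character and numeric which values would behave as they do today.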

@ggrothendieck
Author

ggrothendieck commented Dec 22, 2024

Another possibility would be to allow which= to be a logical-valued function that is applied to each name in the zip. Only the names for which the function returns TRUE are read. This would also be backwards compatible (if which= is a function it acts as described, and if not it acts as it does now), and it is powerful because the function can use any approach. For example, the user could specify any of these to read only csv files:

which = \(x) endsWith(x, ".csv")
which = \(x) grepl("\\.csv$", x)
which = \(x) grepl(glob2rx("*.csv"), x)
which = \(x) substring(x, nchar(x) - 3) == ".csv"
which = \(x) tools::file_ext(x) == "csv"

If FUN is any of these functions, or some other function, and x is a character vector of all names in the zip, then the code below could be used internally to determine which names to read.

Filter(FUN, x)
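Putting the Filter() call together with the backwards-compatibility check might look like the following sketch. select_names() is a hypothetical helper name, not part of rio, and the file names are made up for illustration.

```r
# Hypothetical helper, not part of rio: if `which` is a function, use it as
# a predicate over the member names; otherwise return `which` unchanged.
select_names <- function(which, filenames) {
  if (is.function(which)) {
    return(Filter(which, filenames))
  }
  which
}

filenames <- c("a.csv", "notes.txt", "sub/b.csv")
select_names(\(x) endsWith(x, ".csv"), filenames)          # predicate on the suffix
select_names(\(x) tools::file_ext(x) == "csv", filenames)  # predicate on the extension
select_names("a.csv", filenames)                           # non-function: unchanged
```

Filter() calls the predicate once per name, so the function only needs to handle a single string and return a single TRUE or FALSE.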
