Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow for downloading zipped input data from URL #445

Merged
merged 4 commits into from
Jun 21, 2024
Merged

Conversation

alexdewar
Copy link
Contributor

This PR adds support for entering a URL to the input data instead of a file path on the command line. I implemented this using a C++ wrapper for curl.

Note that files are downloaded to a temporary folder and that there is no caching of URLs, just caching to check whether a zip file matching a particular checksum has already been extracted. While we could implement this by allowing users to pass a checksum for the input data as a command line argument, I don't think that's particularly user friendly. The better option is to let users specify it via the config file (#407), which I'll do later.

Closes #364.

Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

src/HealthGPS.Console/command_options.h Outdated Show resolved Hide resolved
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clang-tidy made some suggestions

src/HealthGPS.Console/command_options.h Show resolved Hide resolved
Copy link
Contributor

@jamesturner246 jamesturner246 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM 👍 Only questions.

fmt::print(fg(fmt::color::red), "\nFile storage folder: {} not found.\n",
cmd.storage_folder.string());
cmd.exit_code = EXIT_FAILURE;
cmd.data_path_or_url = result["storage"].as<std::string>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it doesn't make sense any more to use a relative path in the config file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it does! I must have just deleted this without looking too closely at it. Oops 🙈

I'll fix it up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh wait. This is a bit weird. It converts relative paths to absolute ones, but it doesn't actually change what the path points to. I assumed it was treating relative paths as if they were relative to the config file's path (and then fixing them up), but they're actually just relative to the CWD.

I suppose either way, if users are expecting to see absolute paths in log files etc., we should probably keep this behaviour though. What do you think?

If we were writing this from scratch, I think a better behaviour would be for paths to be relative to the config file, but I guess we'll break things if we change it now.

Copy link
Contributor

@jamesturner246 jamesturner246 Jun 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose either way, if users are expecting to see absolute paths in log files etc., we should probably keep this behaviour though. What do you think?

Yeah, probably for the best. Rather than fiddling around with data dir location and config relative paths -- sounds error prone. If it's easy to do the relative paths, though, go ahead if you like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's easy enough to do, but I don't want to break people's existing config files.
Hopefully going forward everyone will use URLs instead anyway!

Comment on lines +57 to +65
std::filesystem::path path = path_or_url;
if (std::filesystem::is_directory(path)) {
root_ = std::move(path);
} else if (std::filesystem::is_regular_file(path) && path.extension() == ".zip") {
root_ = extract_zip_file_or_load_from_cache(path);
} else {
throw std::runtime_error(fmt::format(
"Path must either point to a zip file or a directory: {}", path_or_url));
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this implicitly check if the file exists?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. I think under the hood stat() is used to both check whether a file/dir exists and what it is (at least on *nix -- think it's the same on Windows).

throw std::runtime_error(fmt::format("Failed to create file {}", download_path.string()));
}

curlpp::Cleanup cleanup;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this do?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleans up curl's global resources when it goes out of scope. Uses this function: https://curl.se/libcurl/c/curl_global_cleanup.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks a bit weird but I guess that's the C++/RAII way!

@alexdewar alexdewar enabled auto-merge June 20, 2024 13:48
alexdewar and others added 3 commits June 20, 2024 14:49
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
@alexdewar alexdewar merged commit 75a6008 into main Jun 21, 2024
5 checks passed
@alexdewar alexdewar deleted the input_data_from_url branch June 24, 2024 11:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow for reading data from URL
2 participants