-
Notifications
You must be signed in to change notification settings - Fork 1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Risk factor refactor #195
Risk factor refactor #195
Conversation
clang-tidy review says "All clean, LGTM! 👍" |
example/France.EBHLM.json
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is autogenerated from a folder with 50+ files and the country data file, not sure we gain much by hand editing this, perhaps changing the script that generate this file would be more beneficial. Alternatively, you could remove this section from the config file and read from here instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, the respective healthgps-tools script should also be updated. I'm considering a Python rewrite of the scripts, to sidestep the Windows R binding dependency.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point @israel-vieira.
Though we do want the config files in this repo to be up to date and working with the current version of the code for testing purposes. Whether we edit it directly or just the script for generating it isn't so important I think, as long as it's correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made an issue for later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The list of diseases can be long, 30+. The current validation will list all the invalid diseases key, if any, before exiting. The exception alternative will break at the first invalid key, the use will need to keep on trying until all are correct, in the end, both methods are accomplishing the required validation, it is just a matter of style.
Mapping the config with the back-end in a case-sensitive world is never user friend, the identifier type tries to minimize the pain by making all keys lowercase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Case-sensitivity: fair point. In which case, the task then becomes removing the parts of the codebase that are case-sensitive. I've spotted a few.
Diseases: agree, a style choice. My concern in these cases has always been that by warning and continuing for dodgy input, that it makes the problem slightly less obvious to diagnose when it fails later on in the program, than if it were to immediately fail on loading. Perhaps this can be replaced with a function which reads all diseases, detects missing ones with printed debug info, before throwing an exception for any missing inputs, as opposed to a case-by-case exception.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm with @israel-vieira on this - collecting all that is invalid and providing a full list to the user (within meaningful indication of what is wrong) results on a better user experience than having to try again and again until all issues are sorted. We have another project where this is the approach and is indeed much better.
All in all, having a strong validation and input pre-processing steps where things can fail early and inconsistencies fixed (casing, proxis, etc.) is a good thing that will result in simpler, easier to understand (and diagnose) code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally, I don't consider users input validation as an exceptional circumstance, the validation should focus on helping the users to fix the wrong or missing information provided, perhaps the system should show valid options before exiting gracefully.
In general, users don't read or understand stack traces with a message hidden inside a large block of text, this is kind of information is more relevant for developers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are two issues here:
- Whether to "fast fail" when loading the config or provide a list of failures at the end
- Whether to implement this using exceptions or some other mechanism (e.g.
std::optional
)
My view is:
- Yes, if the validation fails, we should continue attempting to load the file and provide a list of errors at the end, rather than aborting immediately (for the most part)
- @jamesturner246's refactor of
get_country
andget_disease_info
to use exceptions is sensible and makes things clearer
The problem with at least some of the config loading code atm is that if validation fails, it doesn't abort immediately, but it also doesn't abort after the validation step either, so while you may get a warning on the console, the program will blithely plough on using default values, which is not good (see #161).
I'm a big believer in using exceptions to propagate errors in C++; it can make code much cleaner and safer. I'll give an example of how you might do "lazy failure" when loading disease info using exceptions:
std::vector<core::DiseaseInfo> get_diseases_info(core::Datastore &data_api, Configuration &config) {
const auto diseases = data_api.get_diseases();
fmt::print("\nThere are {} diseases in storage, {} selected.\n", diseases.size(),
config.diseases.size());
std::vector<core::DiseaseInfo> result;
std::set<std::string> missing_diseases;
for (const auto &code : config.diseases) {
try {
result.emplace_back(data_api.get_disease_info(code));
} catch (const std::invalid_argument &) {
missing_diseases.emplace(code);
}
}
if (!missing_diseases.empty()) {
std::cerr << "ERROR: The following diseases were not found:\n";
for (const auto &disease : missing_diseases) {
std::cerr << "\t" << disease << "\n";
}
throw std::runtime_error{"Some invalid diseases were given"};
}
return result;
}
(There might be things we could tweak -- e.g. I haven't used the fmt
library -- but this is just to give an idea.)
This code assumes that the caller will appropriately handle any thrown std::runtime_error
, which it absolutely should do (we might want to use a bespoke error type rather than std::runtime_error
though). So it means you a) tell the user what errors have occurred and b) propagate the errors up the call stack. Ultimately, the main config loading function should be wrapped in a try...catch
and if an error occurs, it just tells the user and exits the program (the user will already have been informed of the specific issues with the config loading by this point).
Note that this code is also safer, because a std::runtime_error
will also be thrown if the "disease"
value is missing, meaning that the caller knows that something has gone wrong, as opposed to the current state of affairs where just a warning message is displayed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with this ^. The user gets helpful human readable messages noting which keys were missing, and execution stops before there's a chance of accidental continuation on default values. I'm happy to merge, and happy to leave it with you if you already have the idea in mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is the right approach on the whole. I've made some minor suggestions.
I agree with @dalonsoa and @israel-vieira that we do probably want to do "lazy failure" when loading the config and print errors as we go along, rather than aborting as soon as an error is found. I think what you've done is an improvement though, so I'd be inclined to merge this as is for now and worry about that later. As I was planning on tackling this myself (see #161), I'm happy for you to leave this task for me, if you like.
example/France.EBHLM.json
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point @israel-vieira.
Though we do want the config files in this repo to be up to date and working with the current version of the code for testing purposes. Whether we edit it directly or just the script for generating it isn't so important I think, as long as it's correct.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think there are two issues here:
- Whether to "fast fail" when loading the config or provide a list of failures at the end
- Whether to implement this using exceptions or some other mechanism (e.g.
std::optional
)
My view is:
- Yes, if the validation fails, we should continue attempting to load the file and provide a list of errors at the end, rather than aborting immediately (for the most part)
- @jamesturner246's refactor of
get_country
andget_disease_info
to use exceptions is sensible and makes things clearer
The problem with at least some of the config loading code atm is that if validation fails, it doesn't abort immediately, but it also doesn't abort after the validation step either, so while you may get a warning on the console, the program will blithely plough on using default values, which is not good (see #161).
I'm a big believer in using exceptions to propagate errors in C++; it can make code much cleaner and safer. I'll give an example of how you might do "lazy failure" when loading disease info using exceptions:
std::vector<core::DiseaseInfo> get_diseases_info(core::Datastore &data_api, Configuration &config) {
const auto diseases = data_api.get_diseases();
fmt::print("\nThere are {} diseases in storage, {} selected.\n", diseases.size(),
config.diseases.size());
std::vector<core::DiseaseInfo> result;
std::set<std::string> missing_diseases;
for (const auto &code : config.diseases) {
try {
result.emplace_back(data_api.get_disease_info(code));
} catch (const std::invalid_argument &) {
missing_diseases.emplace(code);
}
}
if (!missing_diseases.empty()) {
std::cerr << "ERROR: The following diseases were not found:\n";
for (const auto &disease : missing_diseases) {
std::cerr << "\t" << disease << "\n";
}
throw std::runtime_error{"Some invalid diseases were given"};
}
return result;
}
(There might be things we could tweak -- e.g. I haven't used the fmt
library -- but this is just to give an idea.)
This code assumes that the caller will appropriately handle any thrown std::runtime_error
, which it absolutely should do (we might want to use a bespoke error type rather than std::runtime_error
though). So it means you a) tell the user what errors have occurred and b) propagate the errors up the call stack. Ultimately, the main config loading function should be wrapped in a try...catch
and if an error occurs, it just tells the user and exits the program (the user will already have been informed of the specific issues with the config loading by this point).
Note that this code is also safer, because a std::runtime_error
will also be thrown if the "disease"
value is missing, meaning that the caller knows that something has gone wrong, as opposed to the current state of affairs where just a warning message is displayed.
Co-authored-by: Alex Dewar <[email protected]>
Co-authored-by: Alex Dewar <[email protected]>
clang-tidy review says "All clean, LGTM! 👍" |
1 similar comment
clang-tidy review says "All clean, LGTM! 👍" |
clang-tidy review says "All clean, LGTM! 👍" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, nothing to add that @alexdewar has not explained much, much better - and with examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
Fixes #166
(not really)
This has been somewhat of a lacklustre attempt that fell short of my ambition... All that was really achieved is the removal of duplicate risk factor listings from
France.EBHLM.json
, and changing a couple of data manager methods to fail on missing data, rather than warn and continue. As such, it should be quite quick to review. The main issue still seems to be the presence of code for handling Hierarchical static and dynamic models in the wider codebase, which is resisting refactor efforts.In a moment of catharsis, I've now convinced myself that the best solution going forward is to merge static and dynamic RF model into a single model, which encapsulates all this extra HLM-specific code within it.
See #194. I'd appreciate your input, as I'm itching to get started merging them, rather than doing more piecemeal refactors that aren't solving the underlying problem.