Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Formulation_Manager.hpp to use boost regex #843

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
7 changes: 3 additions & 4 deletions include/realizations/catchment/Formulation_Manager.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -9,8 +9,8 @@
#include <functional>
#include <dirent.h>
#include <sys/stat.h>
#include <regex>

#include <boost/regex.hpp>
#include <boost/property_tree/ptree.hpp>
#include <boost/property_tree/json_parser.hpp>
#include <FeatureBuilder.hpp>
Expand Down Expand Up @@ -410,7 +410,7 @@ namespace realization {
}

// Create a regular expression used to identify proper file names
std::regex pattern(filepattern);
boost::regex pattern(filepattern);

// A stream providing the functions necessary for evaluating a directory:
// https://www.gnu.org/software/libc/manual/html_node/Opening-a-Directory.html#Opening-a-Directory
Expand Down Expand Up @@ -463,8 +463,7 @@ namespace realization {
if (directory != nullptr) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about something like this?
I just copied the code that stats the file after the regex match.
I tested it just using std and it matched the performance of adding {{id}} to the path variable, 27s serial 3.2s parallel 56x

Suggested change
if (directory != nullptr) {
if (directory != nullptr) {
// check if directory + file_pattern is a file before attempting to iterate
struct stat st;
if (stat((path + filepattern).c_str(), &st) == 0) {
if (S_ISREG(st.st_mode)) {
return forcing_params(
path + filepattern,
provider,
simulation_time_config.start_time,
simulation_time_config.end_time
);
}
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correction* 3.7s in 56x parallel. So not as fast as changing path to accept {{id}}, but probably only because I didn't stat the file when I did that, I just accepted the path without verifying if it existed. the speed difference serially is ~27.2 vs ~27.8 but my testing it just manually running it 5 times so I haven't been including the decimal place in the longer timings

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed an apparent bug in that code. We probably shouldn't be checking S_ISREG, since that would (I think) exclude symbolic links, which should be legitimate, as long as they point at an existent file.

All of this code kinda suffers from the syndrome of "asking permission, rather than forgiveness" - we should ideally just be trying to directly open the files in question.

Copy link
Author

@JoshCu JoshCu Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I misunderstood the original code a bit too, it's looping over all the files and checking matches, then returning at the first match it finds right? I mistook it to return all matching files. If it's only returning the first match I can have a go at reworking it to

  1. try and open the file without stat
  2. if that fails, build a list of every file in the directory
  3. run regex against that list
  4. if more than one match is found try to open the first match

The line between performance optimization and hack is blurry but I was wondering if it would be faster to get a string of all filenames, separated by some illegal filename character like hash or pipe, then run regex once against that string to pull out the matching files. Rather than running regex once per file?

Copy link
Author

@JoshCu JoshCu Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hopefully this is more clear than my fumbling with suggested changes master...JoshCu:ngen:open_files_first_ask_questions_later

How about this? I modified it to just try and open the file immediately like you suggested, then if that fails fall back to the regex matching, which also just tries to open the file rather than using stat. The bit at the top of the diff is just inverting the if directory != nullptr logic to unindent the code one layer.

In terms of performance, it's fractionally slower (~3% serial, ~10% Parallelx55 with a lot of deviation) than modifying path to accept {{id}} and bypassing the entire directory opening and regex section #843 (comment)

I'm not a fan of littering file and directory closes around before every possible exit path so if there's some nicer way to consolidate the cleanup I'd love to hear it.

bool match;
while ((entry = readdir(directory))) {
match = std::regex_match(entry->d_name, pattern);
if( match ) {
if (filepattern == entry->d_name || boost::regex_match(entry->d_name, pattern)) {
// If the entry is a regular file or symlink AND the name matches the pattern,
// we can consider this ready to be interpretted as valid forcing data (even if it isn't)
#ifdef _DIRENT_HAVE_D_TYPE
Expand Down
Loading