Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Formulation_Manager.hpp to use boost regex #843

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

JoshCu
Copy link

@JoshCu JoshCu commented Jun 26, 2024

Update the formulation manager to use boost::regex instead of std.

When matching complex regex for finding forcing files, std::regex becomes a significant portion of Ngen init. Especially with large numbers of catchments per core.

For an extreme example, when running serially for wb-479197 and it's upstreams ~6500 catchments. NGen::init takes ~153s, if I use the line "forcing": {"file_pattern": ".*{{id}}.*.csv" in my realization.
removing those wildcards reduces the time to 69 seconds.

For use cases where ngen is being run over a large area, this init time can become a non-insignificant portion of the total runtime, in the serial example, it took 153 seconds to init, 43s to run the models, 19s to route for a 24h simulation.

Changes

  • std::regex -> boost::regex

Testing

This was all tested on ngiab images using ngen f91e2ea
The run package was generated using the ngiab_preprocessor
24h run of ~6500 catchments, sloth + cfe + noaa-owp-modular + troute, subset from wb-479197 on hydrofabric v20.1

Hardware used Model name: Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz Thread(s) per core: 2 Core(s) per socket: 14 Socket(s): 2 CPU max MHz: 3600.0000 CPU min MHz: 1200.0000 RAM 2100mhz DDR4 Storage ~500mbs sata ssd
Realization.json
{
    "global": {
        "formulations": [
            {
                "name": "bmi_multi",
                "params": {
                    "name": "bmi_multi",
                    "model_type_name": "bmi_multi",
                    "main_output_variable": "Q_OUT",
                    "forcing_file": "",
                    "init_config": "",
                    "allow_exceed_end_time": true,
                    "modules": [
                        {
                            "name": "bmi_c++",
                            "params": {
                                "name": "bmi_c++",
                                "model_type_name": "SLOTH",
                                "main_output_variable": "z",
                                "init_config": "/dev/null",
                                "allow_exceed_end_time": true,
                                "fixed_time_step": false,
                                "uses_forcing_file": false,
                                "model_params": {
                                    "sloth_ice_fraction_schaake(1,double,m,node)": 0.0,
                                    "sloth_ice_fraction_xinanjiang(1,double,1,node)": 0.0,
                                    "sloth_soil_moisture_profile(1,double,1,node)": 0.0
                                },
                                "library_file": "/dmod/shared_libs/libslothmodel.so",
                                "registration_function": "none"
                            }
                        },
                        {
                            "name": "bmi_fortran",
                            "params": {
                                "name": "bmi_fortran",
                                "model_type_name": "NoahOWP",
                                "library_file": "/dmod/shared_libs/libsurfacebmi.so",
                                "forcing_file": "",
                                "init_config": "./config/cat_config/NOAH-OWP-M/{{id}}.input",
                                "allow_exceed_end_time": true,
                                "main_output_variable": "QINSUR",
                                "variables_names_map": {
                                    "PRCPNONC": "precip_rate",
                                    "Q2": "SPFH_2maboveground",
                                    "SFCTMP": "TMP_2maboveground",
                                    "UU": "UGRD_10maboveground",
                                    "VV": "VGRD_10maboveground",
                                    "LWDN": "DLWRF_surface",
                                    "SOLDN": "DSWRF_surface",
                                    "SFCPRS": "PRES_surface"
                                },
                                "uses_forcing_file": false
                            }
                        },
                        {
                            "name": "bmi_c",
                            "params": {
                                "name": "bmi_c",
                                "model_type_name": "CFE",
                                "main_output_variable": "Q_OUT",
                                "init_config": "./config/cat_config/CFE/{{id}}.ini",
                                "allow_exceed_end_time": true,
                                "fixed_time_step": false,
                                "uses_forcing_file": false,
                                "registration_function": "register_bmi_cfe",
                                "variables_names_map": {
                                    "water_potential_evaporation_flux": "EVAPOTRANS",
                                    "atmosphere_water__liquid_equivalent_precipitation_rate": "QINSUR",
                                    "ice_fraction_schaake": "sloth_ice_fraction_schaake",
                                    "ice_fraction_xinanjiang": "sloth_ice_fraction_xinanjiang",
                                    "soil_moisture_profile": "sloth_soil_moisture_profile"
                                },
                                "library_file": "/dmod/shared_libs/libcfebmi.so.1.0.0"
                            }
                        }
                    ],
                    "uses_forcing_file": false
                }
            }
        ],
        "forcing": {
            "file_pattern": "{{id}}.csv",
            "path": "./forcings/by_catchment/",
            "provider": "CsvPerFeature"
        }
    },
    "time": {
        "start_time": "2010-01-01 00:00:00",
        "end_time": "2010-01-02 00:00:00",
        "output_interval": 3600,
        "nts": 288.0
    },
    "routing": {
        "t_route_config_file_with_path": "/ngen/ngen/data/config/ngen.yaml"
    },
    "output_root": "/ngen/ngen/data/outputs/ngen"
}

boost+exact first checks if file_pattern matched the file exactly, then runs regex if it does not match

exact path is where the "path" formulation variable accepts the {{id}} placeholder allowing for file_pattern to be omitted and no regex being run.

Serial

library .*{{id}}.*.cat
{{id}}.cat
std 153s 69s
boost 62s 55s
re2 37s 36s
exact path NA 27s

10 mpi ranks

library {{id}}.cat
std 8.6s
boost 7.4s
re2 5.2s
exact path 4.2s

56 mpi ranks

library {{id}}.cat
std 4.8s
boost 4.5s
re2 3.9s
exact path 3.2s

Screenshots

interactive version of the graph here
image

std, boost, re2, on 10 cores

image

perf Flamegraph of unmodified fomulation manager running serially with double wildcards

kernel

Notes

  • re2 is faster, but I don't know if it's worth adding the dependency when we're already using boost
  • If it's worth it, I can open a PR for a version using re2
  • I'm not sure how common these wildcarded regex use cases are, or how many people would be running serially for large numbers of catchments
  • read.cpp in the geopackage code also uses std::regex, but it's called so infrequently that the performance benefit is negligible. I left it unchanged to reduce code modification, although I'm unsure if that's the correct decision.
  • changing {{id}}.cat to {{id}}\.cat reduces all times further, but I noticed the lack of escapement after I'd finished all my testing
  • I can upload the various docker images used to test to dockerhub if needed

Future work

I wanted to keep changes minimal because I'm not too familiar with the rest of ngen, but would it be worth modifying the formulation manager further to optionally disable regex completely?
For my use-case, the forcing files are just named cat-1234.csv and there is one per catchment so I don't need to use regex at all after the formulation manager replaces the {{id}} placeholder.

@PhilMiller
Copy link
Contributor

PhilMiller commented Jun 26, 2024

Hi Josh, and thanks for noticing the issue, looking into it, and proposing this fix.

We'll need to check whether boost::regex needs to be built, installed, and linked against, or is effectively 'header-only'. Everything else we use from Boost is in the latter category right now, so there may be some resistance to changing that. This sort of performance improvement may be worth it, though.

We'd definitely like to bring down excessive run times like you've observed. Would you be open to making some other changes and testing them, since you've already got this teed up?

@PhilMiller PhilMiller self-assigned this Jun 26, 2024
@JoshCu
Copy link
Author

JoshCu commented Jun 26, 2024

Hi Josh, and thanks for noticing the issue, looking into it, and proposing this fix.

We'll need to check whether boost::regex needs to be built, installed, and linked against, or is effectively 'header-only'. Everything else we use from Boost is in the latter category right now, so there may be some resistance to changing that. This sort of performance improvement may be worth it, though.

We'd definitely like to bring down excessive run times like you've observed. Would you be open to making some other changes and testing them, since you've already got this teed up?

Yeah of course! Happy to do any additional testing, just let me know what I should try.

As for boost regex being header only, I don't think it needs to be built? the dockerfile I'm using just downloads boost.bzip, unzips it, then adds the folder to path before building ngen. I don't have to make install it or anything if that answers the question?

my docker patch is just two sed commands with no additional changes to how ngen's built

RUN sed -i 's|#include <regex>|#include <boost/regex.hpp>|g' include/realizations/catchment/Formulation_Manager.hpp
RUN sed -i 's/std::regex/boost::regex/g' include/realizations/catchment/Formulation_Manager.hpp

the dockerfile is this one if it's of any use/interest :)

@JoshCu
Copy link
Author

JoshCu commented Jun 26, 2024

@PhilMiller What about changing path property to also accept {{id}}?
Changing this

if(forcing_prop_map.count("path") != 0){
    path = forcing_prop_map.at("path").as_string();
}

to

if(forcing_prop_map.count("path") != 0){
    path = forcing_prop_map.at("path").as_string();
    int id_index = path.find("{{id}}");
    if (id_index != std::string::npos) {
        path = path.replace(id_index, sizeof("{{id}}") - 1, identifier);
    }
}

means that I can put the {{id}} in the path, then remove file_pattern from my realization so that it hits the early return before any regex is used.
Testing it gives me 27s in serial and 3.2s in parallel 56 cores

@PhilMiller PhilMiller requested review from robertbartel and aaraney and removed request for hellkite500 June 28, 2024 14:29
@PhilMiller
Copy link
Contributor

The idea for {{id}} substitution in path sounds good to me. I don't know job setups too well, though, so I'll ask Bobby and/or Austin to comment on that.

Meanwhile, I added lines in the PR description's timing tables for "boost+exact" describing the code as amended to check against literal filepattern. Could you please put up numbers for that the current PR code?

@PhilMiller
Copy link
Contributor

If you're feeling diligent, std+exact might be helpful too, if we conclude that we don't want to add the dependency on Boost regex.

@aaraney
Copy link
Member

aaraney commented Jun 28, 2024

@PhilMiller, thanks for looping me in. I don't foresee this being an issue. @robertbartel, do you feel differently?

@robertbartel
Copy link
Contributor

Well, according to this, Boost.Regex is not header-only (someone sanity check me to make sure I'm not missing something).

I don't see a problem with adding support for {{id}} replacement within forcing.path. It seems to me like something worth doing anyway. We don't organize data like this in DMOD right now, though we could consider adapting to it, and I could easily see some users naturally wanting (i.e., regardless of performance implications) to arrange forcings files and other data like BMI configs together in catchment-specific directories.

@JoshCu
Copy link
Author

JoshCu commented Jun 28, 2024

I was surprised the timings with the filepattern exact check before attempting the regex weren't measurably faster. Without attaching a debugger I'd guess that the filepattern match only returns true at most once per pattern, but still runs the regex against every other non-exact file match. So in this example it would only prevent regex being used 1/6500th of the time? I'll rerun to confirm and update the tables

@PhilMiller
Copy link
Contributor

Oh, yeah, if we're going to try for exact match, we should just try opening the specific file, rather than testing against the entire enumerated contents of the directory.

@@ -463,8 +463,7 @@ namespace realization {
if (directory != nullptr) {
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about something like this?
I just copied the code that stats the file after the regex match.
I tested it just using std and it matched the performance of adding {{id}} to the path variable, 27s serial 3.2s parallel 56x

Suggested change
if (directory != nullptr) {
if (directory != nullptr) {
// check if directory + file_pattern is a file before attempting to iterate
struct stat st;
if (stat((path + filepattern).c_str(), &st) == 0) {
if (S_ISREG(st.st_mode)) {
return forcing_params(
path + filepattern,
provider,
simulation_time_config.start_time,
simulation_time_config.end_time
);
}
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

correction* 3.7s in 56x parallel. So not as fast as changing path to accept {{id}}, but probably only because I didn't stat the file when I did that, I just accepted the path without verifying if it existed. the speed difference serially is ~27.2 vs ~27.8 but my testing it just manually running it 5 times so I haven't been including the decimal place in the longer timings

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed an apparent bug in that code. We probably shouldn't be checking S_ISREG, since that would (I think) exclude symbolic links, which should be legitimate, as long as they point at an existent file.

All of this code kinda suffers from the syndrome of "asking permission, rather than forgiveness" - we should ideally just be trying to directly open the files in question.

Copy link
Author

@JoshCu JoshCu Jul 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I misunderstood the original code a bit too, it's looping over all the files and checking matches, then returning at the first match it finds right? I mistook it to return all matching files. If it's only returning the first match I can have a go at reworking it to

  1. try and open the file without stat
  2. if that fails, build a list of every file in the directory
  3. run regex against that list
  4. if more than one match is found try to open the first match

The line between performance optimization and hack is blurry but I was wondering if it would be faster to get a string of all filenames, separated by some illegal filename character like hash or pipe, then run regex once against that string to pull out the matching files. Rather than running regex once per file?

Copy link
Author

@JoshCu JoshCu Jul 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hopefully this is more clear than my fumbling with suggested changes master...JoshCu:ngen:open_files_first_ask_questions_later

How about this? I modified it to just try and open the file immediately like you suggested, then if that fails fall back to the regex matching, which also just tries to open the file rather than using stat. The bit at the top of the diff is just inverting the if directory != nullptr logic to unindent the code one layer.

In terms of performance, it's fractionally slower (~3% serial, ~10% Parallelx55 with a lot of deviation) than modifying path to accept {{id}} and bypassing the entire directory opening and regex section #843 (comment)

I'm not a fan of littering file and directory closes around before every possible exit path so if there's some nicer way to consolidate the cleanup I'd love to hear it.

@JoshCu
Copy link
Author

JoshCu commented Aug 28, 2024

@PhilMiller If this is still of interest, what are the next steps?

Options

  1. Add boost regex?
  2. Add {{id}} substitution to the path variable?
  3. Try to open the file, if that fails, then begin regex matching?

Any combination of these work, but 2 might not be needed with 3 as it is only marginally quicker than 3

1 | This is the smallest code change, a moderate speed improvement, but boost.regex is not header only #843 (comment)

2 | This is the fastest, slightly larger change, but doesn't verify that the file exists before accepting it

3 | Slightly slower than 2, much larger change, seems to be a good option assuming my code is ok?

I'm happy to redo any changes and put then in a new pull request if needed :)

@program--
Copy link
Contributor

program-- commented Aug 28, 2024

As a note, boost.regex is header-only in the general case (at least as of 1.79), as noted here. There is a build requirement for C++03 and below, and when using ICU support, but I don't think that is needed in this PR unless I'm missing something.

edit: looks like in 1.76 boost.regex changed to being header-only.

Regex:

  • Regex is now header only except in C++03 mode.
  • Support for C++03 is now deprecated.
  • The library can now be used "standalone" without the rest of Boost being present.

@JoshCu
Copy link
Author

JoshCu commented Aug 28, 2024

Is that why I didn't have to do anything other than building boost for the regex change to work? The highest version in rocky 9 epel is/was 1.75 so I had to build 1.79 from source anyway https://github.com/JoshCu/NGIAB-CloudInfra/blob/7cef18572883f4b9ec04471b650a87f860e488e3/docker/Dockerfile#L28

@program--
Copy link
Contributor

program-- commented Aug 28, 2024

Is that why I didn't have to do anything other than building boost for the regex change to work? ...

Yeah that'd be my assumption. Another verification that boost.regex doesn't need any additional building: our CI builds the changes successfully, and we only link to the Boost::boost/Boost::headers CMake target https://github.com/NOAA-OWP/ngen/actions/runs/9687130620/job/26813637727?pr=843#step:3:594

For older boost versions, boost.regex requires the regex component when calling find_package(Boost ...) to include linking to the compiled code AFAIK, otherwise it's only headers.

edit: to summarize, I think we are good to use boost.regex without worrying about extra dependencies.

@robertbartel
Copy link
Contributor

As a note, boost.regex is header-only in the general case (at least as of 1.79), as noted here. There is a build requirement for C++03 and below, and when using ICU support, but I don't think that is needed in this PR unless I'm missing something.

Thanks, @program--. Looks like they (still) haven't updated Boost's Getting Started page to clearly reflect that and I didn't look hard enough earlier. If what's stated here in the dependency doc is still correct, a non-header-only scenario for this shouldn't be possible.

A bit of an aside: I wanted to quickly see if any other Boost libraries listed as "must be built" didn't really belong there, so I took a brief look at Boost.Chrono (the first one in that list). Besides details on Boost.Chrono directly, I came across this interesting line about its Boost.System dependency:

Boost.System has an undocumented feature (use of macro BOOST_ERROR_CODE_HEADER_ONLY) to make it header only.

So, buyer beware on any of the Getting Started "must be built" libraries.

@JoshCu JoshCu mentioned this pull request Oct 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants