Update Formulation_Manager.hpp to use boost regex #843

JoshCu · 2024-06-26T19:48:50Z

Update the formulation manager to use boost::regex instead of std::regex.

When matching complex regexes to find forcing files, std::regex becomes a significant portion of ngen init time, especially with large numbers of catchments per core.

For an extreme example, when running serially for wb-479197 and its upstreams (~6500 catchments), NGen::init takes ~153s if I use the line "forcing": {"file_pattern": ".*{{id}}.*.csv" in my realization.
Removing those wildcards reduces the time to 69 seconds.

For use cases where ngen is run over a large area, this init time can become a significant portion of the total runtime. In the serial example, a 24h simulation took 153 seconds to init, 43s to run the models, and 19s to route.
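To make the cost concrete, here is a rough sketch of the lookup pattern described above: one regex per catchment, tried against directory entries until one matches. The function name, its parameters, and the counter are hypothetical and are not taken from Formulation_Manager.hpp; the pattern is assumed to already have {{id}} substituted.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical stand-in for the forcing lookup: returns the first file name
// that fully matches `pattern`, and counts how many regex matches were tried.
std::string first_match(const std::vector<std::string>& file_names,
                        const std::string& pattern,
                        std::size_t& regex_calls) {
    const std::regex re(pattern);  // compiled once per catchment
    for (const std::string& name : file_names) {
        ++regex_calls;  // every entry scanned before the hit costs one match attempt
        if (std::regex_match(name, re)) {
            return name;
        }
    }
    return "";  // no entry matched
}
```

With ~6500 catchments each scanning a directory of ~6500 files, the number of match attempts grows roughly with the square of the catchment count, so any per-attempt overhead from the wildcards is multiplied many times over.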

Changes

std::regex -> boost::regex

Testing

This was all tested on ngiab images using ngen f91e2ea
The run package was generated using the ngiab_preprocessor
24h run of ~6500 catchments, sloth + cfe + noaa-owp-modular + troute, subset from wb-479197 on hydrofabric v20.1

Hardware used

Model name: Intel(R) Xeon(R) CPU E5-2697 v3 @ 2.60GHz
Thread(s) per core: 2
Core(s) per socket: 14
Socket(s): 2
CPU max MHz: 3600.0000
CPU min MHz: 1200.0000
RAM: 2100 MHz DDR4
Storage: ~500 MB/s SATA SSD

Realization.json

{
    "global": {
        "formulations": [
            {
                "name": "bmi_multi",
                "params": {
                    "name": "bmi_multi",
                    "model_type_name": "bmi_multi",
                    "main_output_variable": "Q_OUT",
                    "forcing_file": "",
                    "init_config": "",
                    "allow_exceed_end_time": true,
                    "modules": [
                        {
                            "name": "bmi_c++",
                            "params": {
                                "name": "bmi_c++",
                                "model_type_name": "SLOTH",
                                "main_output_variable": "z",
                                "init_config": "/dev/null",
                                "allow_exceed_end_time": true,
                                "fixed_time_step": false,
                                "uses_forcing_file": false,
                                "model_params": {
                                    "sloth_ice_fraction_schaake(1,double,m,node)": 0.0,
                                    "sloth_ice_fraction_xinanjiang(1,double,1,node)": 0.0,
                                    "sloth_soil_moisture_profile(1,double,1,node)": 0.0
                                },
                                "library_file": "/dmod/shared_libs/libslothmodel.so",
                                "registration_function": "none"
                            }
                        },
                        {
                            "name": "bmi_fortran",
                            "params": {
                                "name": "bmi_fortran",
                                "model_type_name": "NoahOWP",
                                "library_file": "/dmod/shared_libs/libsurfacebmi.so",
                                "forcing_file": "",
                                "init_config": "./config/cat_config/NOAH-OWP-M/{{id}}.input",
                                "allow_exceed_end_time": true,
                                "main_output_variable": "QINSUR",
                                "variables_names_map": {
                                    "PRCPNONC": "precip_rate",
                                    "Q2": "SPFH_2maboveground",
                                    "SFCTMP": "TMP_2maboveground",
                                    "UU": "UGRD_10maboveground",
                                    "VV": "VGRD_10maboveground",
                                    "LWDN": "DLWRF_surface",
                                    "SOLDN": "DSWRF_surface",
                                    "SFCPRS": "PRES_surface"
                                },
                                "uses_forcing_file": false
                            }
                        },
                        {
                            "name": "bmi_c",
                            "params": {
                                "name": "bmi_c",
                                "model_type_name": "CFE",
                                "main_output_variable": "Q_OUT",
                                "init_config": "./config/cat_config/CFE/{{id}}.ini",
                                "allow_exceed_end_time": true,
                                "fixed_time_step": false,
                                "uses_forcing_file": false,
                                "registration_function": "register_bmi_cfe",
                                "variables_names_map": {
                                    "water_potential_evaporation_flux": "EVAPOTRANS",
                                    "atmosphere_water__liquid_equivalent_precipitation_rate": "QINSUR",
                                    "ice_fraction_schaake": "sloth_ice_fraction_schaake",
                                    "ice_fraction_xinanjiang": "sloth_ice_fraction_xinanjiang",
                                    "soil_moisture_profile": "sloth_soil_moisture_profile"
                                },
                                "library_file": "/dmod/shared_libs/libcfebmi.so.1.0.0"
                            }
                        }
                    ],
                    "uses_forcing_file": false
                }
            }
        ],
        "forcing": {
            "file_pattern": "{{id}}.csv",
            "path": "./forcings/by_catchment/",
            "provider": "CsvPerFeature"
        }
    },
    "time": {
        "start_time": "2010-01-01 00:00:00",
        "end_time": "2010-01-02 00:00:00",
        "output_interval": 3600,
        "nts": 288.0
    },
    "routing": {
        "t_route_config_file_with_path": "/ngen/ngen/data/config/ngen.yaml"
    },
    "output_root": "/ngen/ngen/data/outputs/ngen"
}

boost+exact first checks if file_pattern matched the file exactly, then runs regex if it does not match

"exact path" is the variant where the forcing "path" property accepts the {{id}} placeholder, so file_pattern can be omitted and no regex is run.

Serial

library	.*{{id}}.*.cat	{{id}}.cat
std	153s	69s
boost	62s	55s
re2	37s	36s
exact path	NA	27s

10 mpi ranks

library	{{id}}.cat
std	8.6s
boost	7.4s
re2	5.2s
exact path	4.2s

56 mpi ranks

library	{{id}}.cat
std	4.8s
boost	4.5s
re2	3.9s
exact path	3.2s

Screenshots

interactive version of the graph here

std, boost, re2, on 10 cores

perf flamegraph of the unmodified formulation manager running serially with double wildcards

Notes

re2 is faster, but I don't know if it's worth adding the dependency when we're already using boost
If it's worth it, I can open a PR for a version using re2
I'm not sure how common these wildcarded regex use cases are, or how many people would be running serially for large numbers of catchments
read.cpp in the geopackage code also uses std::regex, but it's called so infrequently that the performance benefit is negligible. I left it unchanged to reduce code modification, although I'm unsure if that's the correct decision.
Changing {{id}}.cat to {{id}}\.cat reduces all times further, but I noticed the missing escape after I'd finished all my testing
I can upload the various docker images used to test to dockerhub if needed
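
On the escaping note above: a small sketch of what the two patterns accept, using std::regex with a made-up catchment id. The helper name is illustrative only. An unescaped "." matches any single character, while "\." matches only a literal dot (in a JSON realization the backslash itself would need to be written as "\\").

```cpp
#include <regex>
#include <string>

// Returns true when the whole file name matches the pattern
// (std::regex_match requires the entire string to match).
bool matches(const std::string& name, const std::string& pattern) {
    return std::regex_match(name, std::regex(pattern));
}
```

For example, matches("cat-12xcsv", "cat-12.csv") is true because the dot is a wildcard, whereas matches("cat-12xcsv", "cat-12\\.csv") is false.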

Future work

I wanted to keep changes minimal because I'm not too familiar with the rest of ngen, but would it be worth modifying the formulation manager further to optionally disable regex completely?
For my use-case, the forcing files are just named cat-1234.csv and there is one per catchment so I don't need to use regex at all after the formulation manager replaces the {{id}} placeholder.

PhilMiller · 2024-06-26T20:02:44Z

Hi Josh, and thanks for noticing the issue, looking into it, and proposing this fix.

We'll need to check whether boost::regex needs to be built, installed, and linked against, or is effectively 'header-only'. Everything else we use from Boost is in the latter category right now, so there may be some resistance to changing that. This sort of performance improvement may be worth it, though.

We'd definitely like to bring down excessive run times like you've observed. Would you be open to making some other changes and testing them, since you've already got this teed up?

JoshCu · 2024-06-26T20:20:48Z

Yeah of course! Happy to do any additional testing, just let me know what I should try.

As for boost regex being header-only, I don't think it needs to be built? The Dockerfile I'm using just downloads boost.bzip, unzips it, then adds the folder to the path before building ngen. I don't have to make install it or anything, if that answers the question?

my docker patch is just two sed commands with no additional changes to how ngen's built

RUN sed -i 's|#include <regex>|#include <boost/regex.hpp>|g' include/realizations/catchment/Formulation_Manager.hpp
RUN sed -i 's/std::regex/boost::regex/g' include/realizations/catchment/Formulation_Manager.hpp

the dockerfile is this one if it's of any use/interest :)

JoshCu · 2024-06-26T22:39:20Z

@PhilMiller What about changing path property to also accept {{id}}?
Changing this

if(forcing_prop_map.count("path") != 0){
    path = forcing_prop_map.at("path").as_string();
}

to

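if(forcing_prop_map.count("path") != 0){
    path = forcing_prop_map.at("path").as_string();
    // substitute the catchment identifier for the first {{id}} placeholder, if present
    const std::string id_token = "{{id}}";
    std::size_t id_index = path.find(id_token);
    if (id_index != std::string::npos) {
        path.replace(id_index, id_token.size(), identifier);
    }
}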

This means that I can put {{id}} in the path, then remove file_pattern from my realization so that it hits the early return before any regex is used.
Testing it gives me 27s in serial and 3.2s in parallel on 56 cores.
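
The change above replaces only the first {{id}} in the path. If a layout with the placeholder in more than one place were wanted (say a per-catchment directory plus a per-catchment file name), a loop along these lines would cover it. This is a standalone sketch with a hypothetical function name, not code from ngen.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: replace every "{{id}}" in `path` with `identifier`.
std::string substitute_all_ids(std::string path, const std::string& identifier) {
    const std::string token = "{{id}}";
    std::size_t pos = 0;
    while ((pos = path.find(token, pos)) != std::string::npos) {
        path.replace(pos, token.size(), identifier);
        pos += identifier.size();  // resume searching after the inserted text
    }
    return path;
}
```

For example, substitute_all_ids("./forcings/{{id}}/{{id}}.csv", "cat-7") gives "./forcings/cat-7/cat-7.csv", and a path without the placeholder is returned unchanged.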

PhilMiller · 2024-06-28T14:32:14Z

The idea for {{id}} substitution in path sounds good to me. I don't know job setups too well, though, so I'll ask Bobby and/or Austin to comment on that.

Meanwhile, I added lines in the PR description's timing tables for "boost+exact", describing the code as amended to check against the literal filepattern. Could you please put up numbers for the current PR code?

PhilMiller · 2024-06-28T14:33:00Z

If you're feeling diligent, std+exact might be helpful too, if we conclude that we don't want to add the dependency on Boost regex.

aaraney · 2024-06-28T15:34:24Z

@PhilMiller, thanks for looping me in. I don't foresee this being an issue. @robertbartel, do you feel differently?

robertbartel · 2024-06-28T15:36:14Z

Well, according to this, Boost.Regex is not header-only (someone sanity check me to make sure I'm not missing something).

I don't see a problem with adding support for {{id}} replacement within forcing.path. It seems to me like something worth doing anyway. We don't organize data like this in DMOD right now, though we could consider adapting to it, and I could easily see some users naturally wanting (i.e., regardless of performance implications) to arrange forcings files and other data like BMI configs together in catchment-specific directories.

JoshCu · 2024-06-28T15:42:54Z

I was surprised the timings with the filepattern exact check before attempting the regex weren't measurably faster. Without attaching a debugger I'd guess that the filepattern match only returns true at most once per pattern, but still runs the regex against every other non-exact file match. So in this example it would only prevent regex being used 1/6500th of the time? I'll rerun to confirm and update the tables
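
A back-of-the-envelope sketch of that guess. It assumes (this is not verified against the actual code path) one forcing file per catchment, and that catchment k finds its file at position k of an n-entry directory listing. The function name is made up for illustration.

```cpp
#include <cstdint>

// Total regex match attempts across n catchments under the assumptions above.
std::uint64_t regex_calls(std::uint64_t n, bool exact_check_first) {
    std::uint64_t total = 0;
    for (std::uint64_t k = 1; k <= n; ++k) {
        // An exact string comparison only replaces the regex on the matching
        // entry itself; the k - 1 entries scanned before it still use regex.
        total += exact_check_first ? (k - 1) : k;
    }
    return total;
}
```

For n = 6500 this gives 21,128,250 attempts without the exact check and 21,121,750 with it, a saving of about 0.03%, which would be consistent with the timings not moving measurably.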

PhilMiller · 2024-06-28T16:33:46Z

Oh, yeah, if we're going to try for exact match, we should just try opening the specific file, rather than testing against the entire enumerated contents of the directory.

JoshCu · 2024-06-28T18:06:32Z

include/realizations/catchment/Formulation_Manager.hpp

@@ -463,8 +463,7 @@ namespace realization {
                if (directory != nullptr) {


How about something like this?
I just copied the code that stats the file after the regex match.
I tested it just using std and it matched the performance of adding {{id}} to the path variable, 27s serial 3.2s parallel 56x

Suggested change

-                if (directory != nullptr) {
+                if (directory != nullptr) {
+                    // check if directory + file_pattern is a file before attempting to iterate
+                    struct stat st;
+                    if (stat((path + filepattern).c_str(), &st) == 0) {
+                        if (S_ISREG(st.st_mode)) {
+                            return forcing_params(
+                                path + filepattern,
+                                provider,
+                                simulation_time_config.start_time,
+                                simulation_time_config.end_time
+                            );
+                        }
+                    }

JoshCu · 2024-06-28

Correction: 3.7s in 56x parallel, so not as fast as changing path to accept {{id}}, but probably only because I didn't stat the file when I did that; I just accepted the path without verifying that it existed. The speed difference serially is ~27.2s vs ~27.8s, but my testing is just manually running it 5 times, so I haven't been including the decimal place in the longer timings.

PhilMiller · 2024-07-03

I just noticed an apparent bug in that code. We probably shouldn't be checking S_ISREG, since that would (I think) exclude symbolic links, which should be legitimate, as long as they point at an existent file.

All of this code kinda suffers from the syndrome of "asking permission, rather than forgiveness" - we should ideally just be trying to directly open the files in question.

JoshCu · 2024-07-03

I think I misunderstood the original code a bit too. It's looping over all the files and checking matches, then returning at the first match it finds, right? I mistook it to return all matching files. If it's only returning the first match I can have a go at reworking it to:

1. try and open the file without stat
2. if that fails, build a list of every file in the directory
3. run regex against that list
4. if more than one match is found try to open the first match

The line between performance optimization and hack is blurry, but I was wondering if it would be faster to get a string of all filenames, separated by some illegal filename character like hash or pipe, then run regex once against that string to pull out the matching files, rather than running regex once per file?

Hopefully this is clearer than my fumbling with suggested changes: master...JoshCu:ngen:open_files_first_ask_questions_later

JoshCu · 2024-07-10

How about this? I modified it to just try and open the file immediately like you suggested, then if that fails fall back to the regex matching, which also just tries to open the file rather than using stat. The bit at the top of the diff is just inverting the if directory != nullptr logic to unindent the code one layer.

In terms of performance, it's fractionally slower (~3% serial, ~10% Parallelx55 with a lot of deviation) than modifying path to accept {{id}} and bypassing the entire directory opening and regex section #843 (comment)

I'm not a fan of littering file and directory closes around before every possible exit path so if there's some nicer way to consolidate the cleanup I'd love to hear it.
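
One possible way to consolidate the cleanup, sketched here as a suggestion rather than as what the branch does, is to hold the DIR* in a std::unique_ptr with a custom deleter so that closedir() runs on every exit path. std::ifstream already closes itself in its destructor. The type and function names below are made up for illustration.

```cpp
#include <dirent.h>
#include <memory>
#include <string>

// Deleter so that std::unique_ptr calls closedir() when the handle goes out of scope.
struct DirCloser {
    void operator()(DIR* d) const { closedir(d); }
};
using DirHandle = std::unique_ptr<DIR, DirCloser>;

// Illustrative use: count directory entries with early returns and no explicit
// closedir() calls. Returns -1 if the directory cannot be opened.
long count_entries(const std::string& path) {
    DirHandle dir(opendir(path.c_str()));
    if (!dir) {
        return -1;  // nothing was opened, so nothing needs closing
    }
    long n = 0;
    while (readdir(dir.get()) != nullptr) {
        ++n;
    }
    return n;  // closedir() runs here via DirCloser
}
```

The same handle type could wrap the directory in the lookup code, leaving each return statement free of manual cleanup.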

JoshCu · 2024-08-28T16:20:44Z

@PhilMiller If this is still of interest, what are the next steps?

Options

1. Add boost regex?
2. Add {{id}} substitution to the path variable?
3. Try to open the file, if that fails, then begin regex matching?

Any combination of these works, but 2 might not be needed with 3, as it is only marginally quicker than 3

1 | This is the smallest code change, a moderate speed improvement, but boost.regex is not header only #843 (comment)

2 | This is the fastest, slightly larger change, but doesn't verify that the file exists before accepting it

3 | Slightly slower than 2, much larger change, seems to be a good option assuming my code is ok?

I'm happy to redo any changes and put them in a new pull request if needed :)

program-- · 2024-08-28T16:36:11Z

As a note, boost.regex is header-only in the general case (at least as of 1.79), as noted here. There is a build requirement for C++03 and below, and when using ICU support, but I don't think that is needed in this PR unless I'm missing something.

edit: looks like in 1.76 boost.regex changed to being header-only.

Regex:

Regex is now header only except in C++03 mode.

Support for C++03 is now deprecated.

The library can now be used "standalone" without the rest of Boost being present.

JoshCu · 2024-08-28T16:47:23Z

Is that why I didn't have to do anything other than building boost for the regex change to work? The highest version in rocky 9 epel is/was 1.75 so I had to build 1.79 from source anyway https://github.com/JoshCu/NGIAB-CloudInfra/blob/7cef18572883f4b9ec04471b650a87f860e488e3/docker/Dockerfile#L28

program-- · 2024-08-28T17:03:19Z

Is that why I didn't have to do anything other than building boost for the regex change to work? ...

Yeah that'd be my assumption. Another verification that boost.regex doesn't need any additional building: our CI builds the changes successfully, and we only link to the Boost::boost/Boost::headers CMake target https://github.com/NOAA-OWP/ngen/actions/runs/9687130620/job/26813637727?pr=843#step:3:594

For older boost versions, boost.regex requires the regex component when calling find_package(Boost ...) to include linking to the compiled code AFAIK, otherwise it's only headers.

edit: to summarize, I think we are good to use boost.regex without worrying about extra dependencies.

robertbartel · 2024-08-28T17:33:57Z

As a note, boost.regex is header-only in the general case (at least as of 1.79), as noted here. There is a build requirement for C++03 and below, and when using ICU support, but I don't think that is needed in this PR unless I'm missing something.

Thanks, @program--. Looks like they (still) haven't updated Boost's Getting Started page to clearly reflect that and I didn't look hard enough earlier. If what's stated here in the dependency doc is still correct, a non-header-only scenario for this shouldn't be possible.

A bit of an aside: I wanted to quickly see if any other Boost libraries listed as "must be built" didn't really belong there, so I took a brief look at Boost.Chrono (the first one in that list). Besides details on Boost.Chrono directly, I came across this interesting line about its Boost.System dependency:

Boost.System has an undocumented feature (use of macro BOOST_ERROR_CODE_HEADER_ONLY) to make it header only.

So, buyer beware on any of the Getting Started "must be built" libraries.

Update Formulation_Manager.hpp to use boost regex

a2a45ad

program-- requested a review from hellkite500 June 26, 2024 20:00

PhilMiller reviewed Jun 26, 2024

PhilMiller self-assigned this Jun 26, 2024

Update include/realizations/catchment/Formulation_Manager.hpp

201c071

Co-authored-by: Phil Miller - NOAA <[email protected]>

PhilMiller requested review from robertbartel and aaraney and removed request for hellkite500 June 28, 2024 14:29

JoshCu mentioned this pull request Jul 2, 2024

Remove MPI_Barriers before routing to increase speed. #846

Open

2 tasks

JoshCu requested a review from PhilMiller July 11, 2024 15:42

JoshCu mentioned this pull request Oct 28, 2024

MPI Scaling fix #894

Open
