Runtime issues with new test SMS_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.gcp12_gnu.eamxx-output-preset-1--eamxx-prod #6951

Open
ndkeen opened this issue Jan 28, 2025 · 10 comments
Labels: EAMxx (PRs focused on capabilities for EAMxx), GCP (google cloud platform)

ndkeen commented Jan 28, 2025

This new test is failing on several machines (as reported by cdash), and one possibility is that the testing harness is running in a different environment. However, on gcp12, the testing env should be the same as running "manually", so I tried a few tests here.

The error in this case is

14:  FAIL:
14: false
14: /home/ndk/E3SM/components/eamxx/src/share/atm_process/atmosphere_process.cpp:455
14: Error! Failed post-condition property check (cannot be repaired).
14:   - Atmosphere process name: homme
14:   - Property check name: NaN check for field o3_volume_mix_ratio
14:   - Atmosphere process MPI Rank: 14
14:   - Message: FieldNaNCheck failed.
14:   - field id: o3_volume_mix_ratio[Physics PG2] <double:ncol,lev>(8,72) [mol/mol]
14:   - indices (w/ global column index): (112,23)
14:   - lat/lon: (13.297919, 129.457411)
14:   - additional data (w/ local column index):
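
For readers unfamiliar with this kind of post-condition check, here is a minimal sketch of what a NaN scan over a `(ncol,lev)` field might look like. It is not the EAMxx `FieldNaNCheck` implementation: `NanHit` and `find_first_nan` are hypothetical names, the field is assumed to be a flat row-major array, and the reported column index is local to the rank (the log above reports the global column index instead).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical result type: location of the first NaN, if any.
struct NanHit {
  bool found;
  std::size_t icol;  // local column index
  std::size_t ilev;  // level index
};

// Hypothetical helper, not the EAMxx implementation.
// Scans a row-major (ncol, nlev) field and reports the first NaN entry.
NanHit find_first_nan(const std::vector<double>& field,
                      std::size_t ncol, std::size_t nlev) {
  assert(field.size() == ncol * nlev);
  for (std::size_t icol = 0; icol < ncol; ++icol) {
    for (std::size_t ilev = 0; ilev < nlev; ++ilev) {
      if (std::isnan(field[icol * nlev + ilev])) {
        return NanHit{true, icol, ilev};
      }
    }
  }
  return NanHit{false, 0, 0};
}
```

A check like this only says where the first bad value sits, not which process wrote it, so a failure attributed to homme may still originate in whatever filled the field earlier.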

And these are tests that hit the error:

SMS_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.gcp12_gnu.eamxx-output-preset-1--eamxx-prod
SMS_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.gcp12_gnu.eamxx-output-preset-1
SMS_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.gcp12_gnu.eamxx-prod
SMS_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.gcp12_gnu

Then trying with DEBUG, I see something different:

42: e3sm.exe: /home/ndk/E3SM/components/homme/src/share/cxx/GllFvRemapImpl.cpp:832: void Homme::GllFvRemapImpl::remap_tracer_dyn_to_fv_phys(int, int, const CPhys3T&, const Phys3T&): Assertion `qs_fv.extent_int(0) >= nelemd && qs_fv.extent_int(1) >= nf2 && qs_fv.extent_int(2) >= nq && qs_fv.extent_int(3) % packn == 0' failed.
42: 
42: Program received signal SIGABRT: Process abort signal.
42: 
42: Backtrace for this error:
42: #0  0x2afc81ab03ff in ???
42: #1  0x2afc81ab0387 in ???
42: #2  0x2afc81ab1a77 in ???
42: #3  0x2afc81aa91a5 in ???
42: #4  0x2afc81aa9251 in ???
42: #5  0x2dc5814 in _ZN5Homme14GllFvRemapImpl27remap_tracer_dyn_to_fv_physEiiRKN6Kokkos4ViewIPPPPKdJNS1_11LayoutRightENS1_6DeviceINS1_6SerialENS1_9HostSpaceEEENS1_12Experimental14EmptyViewHooksENS1_12MemoryTraitsILj9EEEEEERKNS2_IPPPPdJS8_SB_SG_EEE
42:     at /home/ndk/E3SM/components/homme/src/share/cxx/GllFvRemapImpl.cpp:832
42: #6  0x2db41e4 in _ZN5Homme10GllFvRemap27remap_tracer_dyn_to_fv_physEiiRKN6Kokkos4ViewIPPPPKdJNS1_11LayoutRightENS1_6DeviceINS1_6SerialENS1_9HostSpaceEEENS1_12Experimental14EmptyViewHooksENS1_12MemoryTraitsILj9EEEEEERKNS2_IPPPPdJS8_SB_SG_EEE
42:     at /home/ndk/E3SM/components/homme/src/share/cxx/GllFvRemap.cpp:80
42: #7  0x27d3122 in _ZN6scream13HommeDynamics33fv_phys_rrtmgp_active_gases_remapENS_7RunTypeE
42:     at /home/ndk/E3SM/components/eamxx/src/dynamics/homme/eamxx_homme_fv_phys.cpp:304
42: #8  0x2783446 in _ZN6scream13HommeDynamics15initialize_implENS_7RunTypeE
42:     at /home/ndk/E3SM/components/eamxx/src/dynamics/homme/eamxx_homme_process_interface.cpp:455
42: #9  0x37ce79c in _ZN6scream17AtmosphereProcess10initializeERKNS_4util9TimeStampENS_7RunTypeE
42:     at /home/ndk/E3SM/components/eamxx/src/share/atm_process/atmosphere_process.cpp:75
42: #10  0x37ff9c1 in _ZN6scream22AtmosphereProcessGroup15initialize_implENS_7RunTypeE
42:     at /home/ndk/E3SM/components/eamxx/src/share/atm_process/atmosphere_process_group.cpp:369
42: #11  0x37ce79c in _ZN6scream17AtmosphereProcess10initializeERKNS_4util9TimeStampENS_7RunTypeE
42:     at /home/ndk/E3SM/components/eamxx/src/share/atm_process/atmosphere_process.cpp:75
42: #12  0x25991ca in _ZN6scream7control16AtmosphereDriver20initialize_atm_procsEv
42:     at /home/ndk/E3SM/components/eamxx/src/control/atmosphere_driver.cpp:1628
42: #13  0x61645e in operator()
42:     at /home/ndk/E3SM/components/eamxx/src/mct_coupling/scream_cxx_f90_interface.cpp:261
42: #14  0x617326 in fpe_guard_wrapper<scream_init_atm()::<lambda()> >
42:     at /home/ndk/E3SM/components/eamxx/src/mct_coupling/scream_cxx_f90_interface.cpp:58
42: #15  0x61648d in scream_init_atm
42:     at /home/ndk/E3SM/components/eamxx/src/mct_coupling/scream_cxx_f90_interface.cpp:255
42: #16  0x60fbce in __atm_comp_mct_MOD_atm_init_mct
42:     at /home/ndk/E3SM/components/eamxx/src/mct_coupling/atm_comp_mct.F90:280
42: #17  0x452f5b in __component_mod_MOD_component_init_cc
42:     at /home/ndk/E3SM/driver-mct/main/component_mod.F90:248
42: #18  0x439910 in __cime_comp_mod_MOD_cime_init
42:     at /home/ndk/E3SM/driver-mct/main/cime_comp_mod.F90:1496
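
The assertion at `GllFvRemapImpl.cpp:832` combines four conditions, so the abort message alone does not say which one failed. A minimal sketch of checking the clauses one at a time follows. `failed_clauses` is a hypothetical helper, `extents` stands in for `qs_fv.extent_int(0..3)`, and nothing here is taken from the Homme source beyond the text of the assertion itself.

```cpp
#include <array>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-alone check mirroring the four clauses of the assertion.
// Returns the text of every clause that does not hold (empty means all pass).
std::vector<std::string> failed_clauses(const std::array<int, 4>& extents,
                                        int nelemd, int nf2, int nq, int packn) {
  assert(packn > 0);  // guard the modulo below
  std::vector<std::string> failed;
  if (!(extents[0] >= nelemd)) failed.push_back("extent(0) >= nelemd");
  if (!(extents[1] >= nf2)) failed.push_back("extent(1) >= nf2");
  if (!(extents[2] >= nq)) failed.push_back("extent(2) >= nq");
  if (!(extents[3] % packn == 0)) failed.push_back("extent(3) % packn == 0");
  return failed;
}
```

Printing the four extents next to `nelemd`, `nf2`, `nq`, and `packn` just before the assert would give the same information in a DEBUG run.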

Tests where I see this:

SMS_D_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.gcp12_gnu.eamxx-output-preset-1--eamxx-prod
SMS_D_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.gcp12_gnu.eamxx-output-preset-1
SMS_D_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.gcp12_gnu.eamxx-prod
SMS_D_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.gcp12_gnu

ndkeen added the EAMxx and GCP labels on Jan 28, 2025.

ndkeen commented Jan 28, 2025

With a Dec 12th checkout, I tried:

SMS_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.gcp12_gnu
SMS_D_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.gcp12_gnu

and both completed.


bartgol commented Jan 28, 2025

@brhillman Isn't o3 just an input field for rrtmgp? At least for runs without mam4xx (not sure otherwise), this should be constant for the whole run, read from file, no? So this points to some infrastructure issue.


ndkeen commented Jan 28, 2025

Is it possible that next is not quite right? I've been checking out various previous hashes (on master), and I'm not seeing these same sorts of failures. All of the below checkouts are passing these two tests:

SMS_D_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1
SMS_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1
master checkouts:
2025-01-14-PR6895-a8e4d8d 
2025-01-16-PR6901-1114bd6
2025-01-16-PR6902-f9c0747
2025-01-21-PR6868-ea3baaf
2025-01-22-PR6767-e973411
2025-01-22-PR6926-c47a588
2025-01-23-PR6933-9142903
2025-01-27-PR6698-544fb9d
2025-01-27-PR6942-edabec1

The first one does not have the eamxx-prod testmod. But after that, all of them pass:

SMS_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.gcp12_gnu.eamxx-output-preset-1--eamxx-prod

There are some failures with DEBUG, but the error is different. All of those checkouts fail this test, with the error shown after it:

SMS_D_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.gcp12_gnu.eamxx-output-preset-1--eamxx-prod

To be clear, I saw this error with checkout 2025-01-16-PR6901-1114bd6 and every more recent one (i.e., I think the test was born this way):

40: At line 502 of file /home/ndk/wacmy/b2025-01-16-PR6901-1114bd6/components/eam/src/physics/cosp2/external/src/cosp.F90
40: Fortran runtime error: Pointer argument 'cospout' is not associated

line 502:
    ! Set flag to deallocate rttov types (only done on final call to simulator)
    if (size(cospOUT%isccp_meantb) .eq. stop_idx) lrttov_cleanUp = .true.

I'm trying to check out current master and next just to confirm. With a Jan 28th checkout of both next and master, it looks like the only test that fails (for both next and master) is

SMS_D_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.gcp12_gnu.eamxx-output-preset-1--eamxx-prod

and it fails with the above cospout pointer error. The other 3 tests are completing OK -- which is mysterious:

SMS_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.gcp12_gnu.eamxx-output-preset-1--eamxx-prod
SMS_D_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.gcp12_gnu
SMS_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.gcp12_gnu


bartgol commented Jan 29, 2025

Just pointing out that SMS_D_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.gcp12_gnu.eamxx-output-preset-1--eamxx-prod could be changed to SMS_D_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.gcp12_gnu.eamxx-prod (no output preset testmod). The eamxx-prod testmod is already testing a fair amount of IO options. Besides, preset 1 is already tested in SMS_Lh4.ne4_ne4.F2010-SCREAMv1.eamxx-output-preset-1, which is also in the e3sm_developer testsuite. @jgfouca Maybe we can rm that testmod in the next round of de-scream-ification?


ndkeen commented Jan 30, 2025

It seems as though there is still an issue where the checkouts/runs of nightly testing (which don't use jenkins here) behave differently than checking out and running manually.

cdash results of the Jan 29th checkout of next on gcp12 still show the failure in the optimized build.
When I run a DEBUG build of the same test (with the same checkout from cdash testing), I get the original error from the top comment.

Then when I check out next locally and try the two tests, I'm back to what I posted in my last comment -- the cospout error with DEBUG and no failure with the optimized build.

During nightly testing, we know there might be slightly different environments.
But aren't there also some attempts at build-time savings when a create_test of a suite happens?
That is, an attempt for certain cases to re-use the build from a similar case?
I don't recall the details.

Note that the failing test on cdash (again for opt builds) is actually completing the run.
End of atm.log shows:

[EAMxx] Finalize ... done!


bartgol commented Jan 31, 2025

The cosp error seems to happen at line 502 of cosp.F90 in the cosp submodule. The error seems legit, since cospOUT%isccp_meantb is never allocated in eamxx's cosp initialization. @brhillman I noticed that LOTS of cosp outputs are never initialized (the boolean that triggers them to be allocated is hard-coded to false, and never touched). I assume this is just another var we don't care about?

The question then becomes: can I just wrap that bad line in something like if (associated(cospOUT%isccp_meantb))?


ndkeen commented Feb 4, 2025

On pm-cpu, I checked out master and next and tried:

SMS_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_gnu.eamxx-output-preset-1--eamxx-prod
SMS_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel.eamxx-output-preset-1--eamxx-prod

SMS_D_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_gnu.eamxx-output-preset-1--eamxx-prod
SMS_D_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel.eamxx-output-preset-1--eamxx-prod

All pass except the GNU DEBUG build, which also hits the cospout issue.
So again with pm-cpu, the nightly testing is behaving differently than running tests manually.


ndkeen commented Feb 6, 2025

The last cdash failure for this test has an odd error:

 3:  check_downscale_consistency ERROR: column-level forcing differs from gridcell-l
 3:  evel forcing for urban point
 3:  c, g =          156          11
 3:  ENDRUN:
 3:  ERROR in atm2lndMod.F90 at line 440

SMS_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.pm-cpu_intel.eamxx-output-preset-1--eamxx-prod


ndkeen commented Feb 6, 2025

The test:
SMS_Lh4.ne4pg2_ne4pg2.F2010-SCREAMv1.MACHINE.eamxx-output-preset-1--eamxx-prod

fails on chrysalis, pm-cpu, and gcp12 in the testing system.
It passes when run manually, individually.
But note, gcp12 does not use jenkins, so I don't think it's an issue in jenkins.
It could be that it fails when run as part of a suite.


ndkeen commented Feb 7, 2025

Back to gcp12 -- which is the first comment on this issue -- we still see that same error with the NaN check for field o3_volume_mix_ratio. And I see the same behavior with cdash testing and running manually. Earlier I said it was passing, but that was with previous checkouts. Just noting that the error with this test on gcp12 is different than on other machines and not related to testing env differences. I don't think we see this error anywhere else.
