fix access violation crash #2137

Merged: 3 commits, Feb 5, 2025

@likeuclinux likeuclinux commented Dec 3, 2024

TYPE: bug fix

KEYWORDS: crash, access violation error

SOURCE: Charlie Li, software developer at Lakes Environmental, Canada

DESCRIPTION OF CHANGES:
Problem:
WRF frequently crashes with an access violation, caused by the change made in PR #1991,
when namelist.input has sf_urban_physics = 0 and bl_pbl_physics = 1.

In inc/allocs_5.F, the condition
IF(okay_to_alloc.AND.in_use_for_config(id,'dlg_bep').AND.(.NOT.grid%is_intermediate))THEN
is false, so execution takes the following branch:

ELSE
   ALLOCATE(grid%dlg_bep(1,1,1),STAT=ierr)
   if (ierr.ne.0) then
      CALL wrf_error_fatal ( &
         'frame/module_domain.f: Failed to allocate grid%dlg_bep(1,1,1). ')
   endif
ENDIF

This allocates only a (1,1,1) placeholder, which then triggers the crash in phys/module_bl_ysu.F:

if(present(a_u_bep) .and. present(a_v_bep) .and. present(a_t_bep) .and. &
present(a_q_bep) .and. present(a_e_bep) .and. present(b_u_bep) .and. &
present(b_v_bep) .and. present(b_t_bep) .and. present(b_q_bep) .and. &
present(b_e_bep) .and. present(dlg_bep) .and. present(dl_u_bep) .and. &
present(sf_bep) .and. present(vl_bep) .and. present(frc_urb2d)) then

    do k = kts, kte
       do i = its,ite
          a_u_hv(i,k)  = a_u_bep(i,k,j)
          a_v_hv(i,k)  = a_v_bep(i,k,j)
          a_t_hv(i,k)  = a_t_bep(i,k,j)
          a_q_hv(i,k)  = a_q_bep(i,k,j)
          a_e_hv(i,k)  = a_e_bep(i,k,j)
          b_u_hv(i,k)  = b_u_bep(i,k,j)
          b_v_hv(i,k)  = b_v_bep(i,k,j)
          b_t_hv(i,k)  = b_t_bep(i,k,j)
          b_q_hv(i,k)  = b_q_bep(i,k,j)
          b_e_hv(i,k)  = b_e_bep(i,k,j)
          dlg_hv(i,k)  = dlg_bep(i,k,j)
          dl_u_hv(i,k) = dl_u_bep(i,k,j)
          vlk_hv(i,k) = vl_bep(i,k,j)
          sfk_hv(i,k)  = sf_bep(i,k,j)
       enddo
    enddo
    do i = its, ite
       frcurb_hv(i) = frc_urb2d(i,j)
    enddo

endif
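
The size mismatch behind the crash can be reproduced in isolation. The following is a minimal sketch, not WRF code: `covers_tile`, the array shapes, and the tile bounds are all hypothetical, and the dummy array is declared assumed-shape so that `size` reports the real extents (the routine in phys/module_bl_ysu.F may declare its arguments differently).

```fortran
module extent_check_mod
  implicit none
contains
  ! Hypothetical helper: .true. only if every index the copy loop would
  ! touch (its:ite, kts:kte, j) lies inside the bounds of arr.
  ! An assumed-shape dummy has lower bound 1 in each dimension.
  logical function covers_tile(arr, its, ite, kts, kte, j)
    real,    intent(in) :: arr(:,:,:)
    integer, intent(in) :: its, ite, kts, kte, j
    covers_tile = its >= 1 .and. ite <= size(arr, 1) .and. &
                  kts >= 1 .and. kte <= size(arr, 2) .and. &
                  j   >= 1 .and. j   <= size(arr, 3)
  end function covers_tile
end module extent_check_mod

program extent_check_demo
  use extent_check_mod
  implicit none
  real, allocatable :: placeholder(:,:,:), full(:,:,:)

  allocate(placeholder(1,1,1))   ! what the ELSE branch provides
  allocate(full(10,5,10))        ! illustrative full-size (i,k,j) array
  placeholder = 0.0
  full = 0.0

  ! Illustrative tile: i = 1..10, k = 1..5, j = 3
  print *, 'placeholder covers tile: ', covers_tile(placeholder, 1, 10, 1, 5, 3)
  print *, 'full array covers tile:  ', covers_tile(full, 1, 10, 1, 5, 3)
end program extent_check_demo
```

The placeholder fails the check while the full-size array passes it, which is why the unguarded copy loop reads outside the allocated memory. This check is only a diagnostic illustration; it is not the fix adopted in this PR.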

The present() check does not help, because the upper level in
dyn_em/module_first_rk_step_part1.F
always calls pbl_driver with DLG_BEP=grid%dlg_bep, which is then passed down through module_pbl_driver to module_bl_ysu.
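
A small stand-alone sketch of why present() cannot catch this case: the intrinsic only reports whether an actual argument was supplied, not how large it is. The module and function names below are hypothetical stand-ins for the driver and scheme layers, not the WRF routines themselves.

```fortran
module present_demo_mod
  implicit none
contains
  ! Stand-in for the driver layer: forwards its optional argument unchanged.
  logical function driver_sees_arg(dlg)
    real, intent(in), optional :: dlg(:,:,:)
    driver_sees_arg = scheme_sees_arg(dlg)
  end function driver_sees_arg

  ! Stand-in for the scheme layer: reports what present() returns.
  logical function scheme_sees_arg(dlg)
    real, intent(in), optional :: dlg(:,:,:)
    scheme_sees_arg = present(dlg)
  end function scheme_sees_arg
end module present_demo_mod

program present_demo
  use present_demo_mod
  implicit none
  real, allocatable :: dlg_bep(:,:,:)

  allocate(dlg_bep(1,1,1))   ! placeholder, as in the ELSE branch
  dlg_bep = 0.0

  ! The placeholder is still an actual argument, so present() is true.
  print *, 'placeholder passed: ', driver_sees_arg(dlg_bep)
  ! Only omitting the argument entirely makes present() false.
  print *, 'argument omitted:   ', driver_sees_arg()
end program present_demo
```

Because the caller always supplies grid%dlg_bep, the scheme sees the first case and enters the copy loop even though the array holds a single element.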

Solution:

The fix restores the v4.5 logic, as follows:

if(present(a_u_bep) .and. present(a_v_bep) .and. present(a_t_bep) .and. &
present(a_q_bep) .and. present(a_e_bep) .and. present(b_u_bep) .and. &
present(b_v_bep) .and. present(b_t_bep) .and. present(b_q_bep) .and. &
present(b_e_bep) .and. present(dlg_bep) .and. present(dl_u_bep) .and. &
present(sf_bep) .and. present(vl_bep) .and. present(frc_urb2d)) then

 ! following v4.5 logic to fix access violation
 if(flag_bep) then

    do k = kts, kte
       do i = its,ite
          a_u_hv(i,k)  = a_u_bep(i,k,j)
          a_v_hv(i,k)  = a_v_bep(i,k,j)
          a_t_hv(i,k)  = a_t_bep(i,k,j)
          a_q_hv(i,k)  = a_q_bep(i,k,j)
          a_e_hv(i,k)  = a_e_bep(i,k,j)
          b_u_hv(i,k)  = b_u_bep(i,k,j)
          b_v_hv(i,k)  = b_v_bep(i,k,j)
          b_t_hv(i,k)  = b_t_bep(i,k,j)
          b_q_hv(i,k)  = b_q_bep(i,k,j)
          b_e_hv(i,k)  = b_e_bep(i,k,j)
          dlg_hv(i,k)  = dlg_bep(i,k,j)
          dl_u_hv(i,k) = dl_u_bep(i,k,j)
          vlk_hv(i,k) = vl_bep(i,k,j)
          sfk_hv(i,k)  = sf_bep(i,k,j)
       enddo
    enddo
    do i = its, ite
       frcurb_hv(i) = frc_urb2d(i,j)
    enddo

 endif

endif

flag_bep is set as follows:

SELECT CASE(sf_urban_physics)
   CASE (BEPSCHEME)
      flag_bep=.true.
   CASE (BEP_BEMSCHEME)
      flag_bep=.true.
   CASE DEFAULT
      flag_bep=.false.
END SELECT

When namelist.input has sf_urban_physics = 0, flag_bep is false, so the code that indexes the (1,1,1)-allocated arrays does not execute.
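
The mapping from sf_urban_physics to flag_bep can be checked with a short stand-alone sketch. The function name `bep_enabled` is hypothetical, and the numeric values given to BEPSCHEME and BEP_BEMSCHEME are assumptions for illustration; in WRF these constants come from the model's own definitions.

```fortran
module bep_flag_mod
  implicit none
  ! Assumed values for illustration only (2 = BEP, 3 = BEP+BEM).
  integer, parameter :: BEPSCHEME = 2, BEP_BEMSCHEME = 3
contains
  ! Mirrors the SELECT CASE above: true only for the two BEP-based schemes.
  logical function bep_enabled(sf_urban_physics)
    integer, intent(in) :: sf_urban_physics
    select case (sf_urban_physics)
    case (BEPSCHEME, BEP_BEMSCHEME)
      bep_enabled = .true.
    case default
      bep_enabled = .false.
    end select
  end function bep_enabled
end module bep_flag_mod

program bep_flag_demo
  use bep_flag_mod
  implicit none
  integer :: opt

  ! Print the flag for each urban option from 0 (off) to 3.
  do opt = 0, 3
    print *, 'sf_urban_physics =', opt, ' flag_bep =', bep_enabled(opt)
  end do
end program bep_flag_demo
```

Under these assumed values, options 0 and 1 leave the flag false, so the guarded copy loop is skipped whenever the BEP arrays are only placeholders.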

LIST OF MODIFIED FILES:
phys/module_bl_ysu.F

TESTS CONDUCTED:
The Jenkins tests are all passing.

RELEASE NOTE: This PR fixes an access violation error with the PGI compiler in module_bl_ysu.F when the urban option is turned off.

@likeuclinux likeuclinux requested review from a team as code owners December 3, 2024 00:22
@weiwangncar
Collaborator

The regression test results:

Test Type              | Expected | Received | Failed
=====================================================
Number of Tests        |       23 |       24 |
Number of Builds       |       60 |       57 |
Number of Simulations  |      158 |      150 |      0
Number of Comparisons  |       95 |       86 |      0

Failed Simulations are: 
None
Which comparisons are not bit-for-bit: 
None

@islas islas changed the base branch from master to develop December 3, 2024 18:01
@islas
Collaborator

islas commented Dec 3, 2024

@likeuclinux Could you separate out the edits in wrf_timeseries.F and start_em.F into their own PR? As it sounds like they are a separate issue from the not fully allocated dlg_bep array, it would help the review process if each PR was limited in scope to the exact issue being resolved.

@likeuclinux
Author

Can I just create another PR for both wrf_timeseries.F and start_em.F changes?

@likeuclinux
Author

Great, I will do that now.

@islas
Collaborator

islas commented Dec 5, 2024

Yes, sorry. That is what I meant. Those two files' edits appear to be related to improper deallocations, which could be their own single PR.

@likeuclinux
Author

likeuclinux commented Dec 5, 2024

I just created this PR for the memory leak issue:
only memory leak related #2139

@likeuclinux
Author

So I don't need to re-create the current PR #2137 for the single-file change related to phys/module_bl_ysu.F?

Collaborator

@dudhia dudhia left a comment

@likeuclinux This one has #2139 as a subset. Need to remove one. Do we want all these changes?

dudhia previously approved these changes Jan 16, 2025
Collaborator

@dudhia dudhia left a comment

Note that #2139 looks like a subset and if so can be closed.

@dudhia dudhia mentioned this pull request Jan 16, 2025
@likeuclinux
Author

This issue is fatal: it causes the WRF executable to crash, whereas #2139 is just a memory leak.

@dudhia
Collaborator

dudhia commented Jan 28, 2025

@weiwangncar This needs another review. It has #2137 as a subset and so needs to be merged after that one.


!Allocate the arrays for wind components
#if ( EM_CORE == 1 )
ALLOCATE ( earth_u_profile(grid%max_ts_level), earth_v_profile(grid%max_ts_level) )
Collaborator

@likeuclinux Are the changes made in the PR for wrf_timeseries.F completely the same as in PR-2139?

Collaborator

@likeuclinux It looks like there is an extra empty line in the version of wrf_timeseries.F in this PR.

@weiwangncar
Collaborator

@likeuclinux It looks like when I merged PR-2139, it resulted in a conflict in this PR. In your local tree, you can delete this file and checkout one from the repository (hence remove the change to wrf_timeseries.F from this PR), or resolve the conflict as indicated explicitly.

@weiwangncar
Collaborator

@likeuclinux I was able to resolve the conflict. Once the regression tests pass, it is good to go.

@weiwangncar weiwangncar merged commit 33ce70c into wrf-model:develop Feb 5, 2025
3 checks passed