Delta GPUs results give NaNs in the viscous sub-grid bubble benchmark case whereas Phoenix GPUs do not #396

Closed
sbryngelson opened this issue Apr 12, 2024 · 6 comments · Fixed by #497
Labels: continuous-integration, enhancement, help wanted

Comments

@sbryngelson
Member

sbryngelson commented Apr 12, 2024

Delta GPU results give NaNs in the viscous sub-grid bubble benchmark case, whereas Phoenix GPU results do not.

This happens regardless of the "memory size" (I've checked 4 GB).

I've tested A100s and A40s on Delta; both show the issue, which is discussed further on Slack.

I tested A100s and V100s on Phoenix; neither shows the issue.

Both computers use NVHPC 22.11.

The error is:

 [ 40%]  Time step      358 of 901 @ t_step = 357
 [ 40%]  Time step      359 of 901 @ t_step = 358
 [ 40%]  Time step      360 of 901 @ t_step = 359
Warning: ieee_inexact is signaling
ERROR STOP NaN(s) in timestep output.
 NaN(s) in timestep output.            0            0            0            1
             0          360          198           99           99
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

One can run this case via something like
./mfc.sh run benchmarks/viscous_weno5_sgb_mono/case.py 4 -t pre_process simulation -c delta --gpu
if you are already on a node with GPUs and have loaded the appropriate modules.
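
When comparing runs across machines, it can help to pull the last reported time step and the NaN flag out of each log automatically. The following is a minimal sketch, assuming the log lines look like the excerpt above; `summarize_log` is a hypothetical helper written for illustration and is not part of MFC.

```python
import re

# Matches progress lines such as:
#  [ 40%]  Time step      360 of 901 @ t_step = 359
STEP_RE = re.compile(r"Time step\s+(\d+)\s+of\s+(\d+)")

# Text of the error message shown in the log excerpt above.
NAN_MARKER = "NaN(s) in timestep output"


def summarize_log(text):
    """Return (last_step, total_steps, nan_seen) for a simulation log.

    last_step and total_steps are None if no progress line was found.
    """
    last_step = None
    total_steps = None
    nan_seen = False
    for line in text.splitlines():
        match = STEP_RE.search(line)
        if match:
            last_step = int(match.group(1))
            total_steps = int(match.group(2))
        elif NAN_MARKER in line:
            nan_seen = True
    return last_step, total_steps, nan_seen
```

Running this on the logs from a Delta run and a Phoenix run of the same case gives a quick side-by-side of whether, and at which step, each run hit the NaN check.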

@sbryngelson
Member Author

Update: This issue is associated with parallel_IO = 'T'. I saw it again on a Rogues Gallery GH200 chip with NVHPC 24.1.
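
One way to confirm the association with parallel_IO is to run the same case twice, differing only in that one setting. The sketch below assumes the case parameters are available as a Python dictionary and that the key is spelled `parallel_io` with values 'T' or 'F'; both are assumptions, so check the case file for the exact spelling. `with_parallel_io` is a hypothetical helper, not part of MFC.

```python
def with_parallel_io(case, enabled):
    """Return a copy of a case dictionary with parallel I/O toggled.

    The key name "parallel_io" and the 'T'/'F' values are assumptions
    based on the setting named in this issue.
    """
    new_case = dict(case)  # shallow copy so the original case is untouched
    new_case["parallel_io"] = "T" if enabled else "F"
    return new_case
```

Running the two variants on the same node would show whether the NaNs appear only when parallel I/O is on.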

@sbryngelson
Member Author

I'm not sure if this is still "broken" or not.

@sbryngelson
Member Author

sbryngelson commented May 24, 2024

Update: This is still broken. It is related to PR #425.

Update 2: This does not fail when case optimization is disabled. It only fails with case optimization enabled (on non-Phoenix computers).

I get the feeling that this line is not actually invoking case optimization....

./mfc.sh bench --mem 8 -j $(nproc) -o "$job_slug.yaml" -- -c phoenix $device_opts -n $n_ranks

@henryleberre

Update 3: Update 2 is incorrect, and case optimization is not relevant.

@henryleberre
Member

@sbryngelson I'm pretty sure --case-optimization is enabled by default in bench.

@wilfonba
Collaborator

The logs indicate that case optimization is enabled on Phoenix for the benchmarking. The code is recompiled in the cases where I would expect recompilation due to case optimization.

@sbryngelson
Member Author

Never mind, you're both right, and it fails with and without case optimization on Delta (and presumably other computers).

@sbryngelson linked a pull request on May 24, 2024 that will close this issue
@sbryngelson added the enhancement and continuous-integration labels and removed the invalid label on May 28, 2024
@sbryngelson linked a pull request on Jul 1, 2024 that will close this issue
3 participants