
deode forecast runtime error #12

Open
iljamal opened this issue Jun 19, 2024 · 3 comments

iljamal commented Jun 19, 2024

Starting the demo forecast:

deode case ?deode/data/config_files/configurations/cy48t3_arome -o cy48t3_arome.toml --start-suite
The workflow reached the forecast step, which ends with a netCDF symbol error.

Snippet from the ecFlow output log:

[ECMWF-INFO -ecsbatch] - -------------------------------------------------------------------------------------
[ECMWF-INFO -ecsbatch] -  This is the ECMWF jobfilter
[ECMWF-INFO -ecsbatch] -  +++ Please report issues using the Support portal +++
[ECMWF-INFO -ecsbatch] -  +++ https://support.ecmwf.int                     +++
[ECMWF-INFO -ecsbatch] -  /usr/local/bin/ecsbatch: size: 49350, mtime: Thu Mar 14 09:29:45 2024
[ECMWF-INFO -ecsbatch] - -------------------------------------------------------------------------------------
[ECMWF-INFO -ecsbatch] - Time at submit: Wed Jun 19 07:29:51 2024 (1718782191.4708633) on ac6-209.bullx:/etc/ecmwf/nfs/dh1_home_b/eeim
[ECMWF-INFO -ecsbatch] - --- SLURM VARIABLES ---
[ECMWF-INFO -ecsbatch] - EC_CLUSTER=ac
[ECMWF-INFO -ecsbatch] - SLURM_EXPORT_ENV=ALL
[ECMWF-INFO -ecsbatch] - SBATCH_EXPORT=NONE
[ECMWF-INFO -ecsbatch] - -----------------------
[ECMWF-INFO -ecsbatch] - jobscript received on STDIN
[ECMWF-INFO -ecsbatch] - --- SCRIPT OPTIONS ---
[ECMWF-INFO -ecsbatch] - #SBATCH --output=/home/eeim/deode_ecflow/jobout/CY48t3_AROME_DEMO_60x80_2500m/20230916/0000/Cycle/Forecasting/Forecast.1
[ECMWF-INFO -ecsbatch] - #SBATCH --job-name=Forecast
[ECMWF-INFO -ecsbatch] - #SBATCH --qos=np
[ECMWF-INFO -ecsbatch] - #SBATCH --signal=USR1@30
[ECMWF-INFO -ecsbatch] - #SBATCH --time=01:00:00
[ECMWF-INFO -ecsbatch] - #SBATCH --nodes=2
[ECMWF-INFO -ecsbatch] - #SBATCH --ntasks=32
[ECMWF-INFO -ecsbatch] - -----------------------
[ECMWF-INFO -ecsbatch] - --- POST-PROCESSED OPTIONS ---
[ECMWF-INFO -ecsbatch] - ARG --job_name=Forecast
[ECMWF-INFO -ecsbatch] - ARG --ntasks=32
[ECMWF-INFO -ecsbatch] - ARG --nodes=2
[ECMWF-INFO -ecsbatch] - ARG --output=/home/eeim/deode_ecflow/jobout/CY48t3_AROME_DEMO_60x80_2500m/20230916/0000/Cycle/Forecasting/Forecast.1
[ECMWF-INFO -ecsbatch] - ARG --qos=np
[ECMWF-INFO -ecsbatch] - ARG --signal=USR1@30
[ECMWF-INFO -ecsbatch] - ARG --time=01:00:00
[ECMWF-INFO -ecsbatch] - ------------------------------
[ECMWF-INFO -ecsbatch] - jobtag: eeim-Forecast-2x512-/home/eeim/deode_ecflow/jobout/CY48t3_AROME_DEMO_60x80_2500m/20230916/0000/Cycle/Forecasting/Forecast
[ECMWF-INFO -ecsbatch] - ['/usr/bin/sbatch', '--job-name=Forecast', '--ntasks=32', '--nodes=2', '--output=/home/eeim/deode_ecflow/jobout/CY48t3_AROME_DEMO_60x80_2500m/20230916/0000/Cycle/Forecasting/Forecast.1', '--qos=np', '--signal=USR1@30', '--time=01:00:00', '--licenses=h2resw01', '--export=EC_user_time_limit=01:00:00']
[ECMWF-INFO -ecsbatch] - ecsbatch executed on ac
[ECMWF-INFO -ecsbatch] - Job queued on ac using method local
[ECMWF-INFO -ecsbatch] - Submitted batch job 38281261
[ECMWF-INFO -ecprofile] /usr/bin/bash NON_INTERACTIVE on ac1-2015 at 20240619_073000.871, PID: 2377574, JOBID: 38281261
[ECMWF-INFO -ecprofile] $SCRATCH=/ec/res4/scratch/eeim
[ECMWF-INFO -ecprofile] $PERM=/perm/eeim
[ECMWF-INFO -ecprofile] $HPCPERM=/ec/res4/hpcperm/eeim
[ECMWF-INFO -ecprofile] $TMPDIR=/dev/shm/_tmpdir_.eeim.38281261
[ECMWF-INFO -ecprofile] $SCRATCHDIR=/ec/res4/scratchdir/eeim/5/38281261

The following have been reloaded with a version change:
  1) ecmwf-toolbox/2024.04.0.0 => ecmwf-toolbox/2024.02.1.0


The following have been reloaded with a version change:
  1) hdf5/1.14.3 => hdf5/1.10.6


The following have been reloaded with a version change:
  1) netcdf4/4.9.2 => netcdf4/4.7.4


Lmod is automatically replacing "openmpi/4.1.5.4" with "hpcx-openmpi/2.9.0".


Due to MODULEPATH changes, the following have been reloaded:
  1) ecmwf-toolbox/2024.02.1.0     3) hdf5/1.10.6            5) netcdf4/4.7.4
  2) fftw/3.3.9                    4) hpcx-openmpi/2.9.0

The following have been reloaded with a version change:
  1) prgenv/gnu => prgenv/intel

2024-06-19 07:30:11 | INFO     |    Only wait 20 seconds, if the server cannot be contacted (note default is 24 hours) before failing
2024-06-19 07:30:11 | INFO     | Calling init at: 07:30:11
2024-06-19 07:30:12 | INFO     | Running task /CY48t3_AROME_DEMO_60x80_2500m/20230916/0000/Cycle/Forecasting/Forecast
2024-06-19 07:30:12 | INFO     | Task search path: ['/etc/ecmwf/nfs/dh1_home_b/eeim/Deode-Workflow/deode/tasks']
2024-06-19 07:30:12 | INFO     | Loading module deode.tasks.archive

< snip >

## EC_MEMINFO Detailed memory information for program /etc/ecmwf/nfs/dh1_perm_b/snh02/pack/bin/48t3_main.05.OMPIIFC2104.x/bin/MASTERODB -- wall-time :      0.714s
## EC_MEMINFO Running on 2 nodes (4-numa) with 24 compute + 4 I/O-tasks and 1+1 threads at 07:30:18.049 on 19-Jun-2024
## EC_MEMINFO The Job Name is Forecast and the Job ID is 38281261
## EC_MEMINFO 
## EC_MEMINFO                           | TC    | MEMORY USED(MB) | MEMORY FREE(MB)  -------------    -------------    -------------   INCLUDING CACHED|  %USED %HUGE  | Energy  Power
## EC_MEMINFO                           | Malloc| Inc Heap        | Numa region  0 | Numa region  1 | Numa region  2 | Numa region  3 |                |               |    (J)    (W)
## EC_MEMINFO Node Name                 | Heap  | RSS(sum)        | Small  Huge or | Small  Huge or | Small  Huge or | Small  Huge or | Total          |
## EC_MEMINFO                           | (sum) | Small    Huge   |  Only   Small  |  Only   Small  |  Only   Small  |  Only   Small  | Memfree+Cached |
## EC_MEMINFO    0 ac1-2015               33226    2379       0      4364   23238     2923   28364     2100   28714     1885   29088    243073    1312      1.0   0.0         0      0  Sm/p:oops:ifs_init
## EC_MEMINFO    1 ac1-2021               24996    1785       0      6282   21102     2647   27930     2927   27926     2391   28480    243353    1185      0.7   0.0         0      0  Sm/p:master:comput
/home/snh02/pack/48t3_main.05.OMPIIFC2104.x/bin/MASTERODB: symbol lookup error: /home/snh02/pack/48t3_main.05.OMPIIFC2104.x/bin/MASTERODB: undefined symbol: netcdf_mp_nf90_open_
srun: error: ac1-2015: task 0: Exited with exit code 127
srun: launch/slurm: _step_signal: Terminating StepId=38281261.0
slurmstepd: error: *** STEP 38281261.0 ON ac1-2015 CANCELLED AT 2024-06-19T07:30:20 ***
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
libifcoremt.so.5   0000149E3ABE478C  for__signal_handl     Unknown  Unknown
libpthread-2.28.s  0000149E11962CF0  Unknown               Unknown  Unknown
mca_coll_libnbc.s  0000149DF69F917A  ompi_coll_libnbc_     Unknown  Unknown
libopen-pal.so.40  0000149E0E3A9324  opal_progress         Unknown  Unknown
libopen-pal.so.40  0000149E0E3AFF9D  ompi_sync_wait_mt     Unknown  Unknown
libmpi.so.40.30.1  0000149E14FD22A8  ompi_request_defa     Unknown  Unknown
libmpi.so.40.30.1  0000149E1500C858  ompi_coll_base_bc     Unknown  Unknown
mca_coll_tuned.so  0000149DF63D3320  ompi_coll_tuned_b     Unknown  Unknown
libmpi.so.40.30.1  0000149E14FE6A74  MPI_Bcast             Unknown  Unknown
libmpi_mpifh.so.4  0000149E152E1E44  pmpi_bcast            Unknown  Unknown
MASTERODB          0000000004FC6088  mpl_broadcast_mod         901  mpl_broadcast_mod.F90
MASTERODB          0000000004C0E648  easy_netcdf_read_         201  easy_netcdf_read_mpi.F90
MASTERODB          0000000002953592  yomclim_mp_read_g         100  yomclim.F90
MASTERODB          000000000217DFAB  suecrad_                 2472  suecrad.F90
MASTERODB          0000000002168CE2  suphec_                   259  suphec.F90
MASTERODB          0000000000CE4DBC  suphy_                     82  suphy.F90
MASTERODB          0000000000AD04A3  su0yomb_                  537  su0yomb.F90
MASTERODB          000000000041B08B  cnt0_                     188  cnt0.F90
MASTERODB          0000000000412A7F  MAIN__                    246  master.F90
MASTERODB          0000000000412422  Unknown               Unknown  Unknown
libc-2.28.so       0000149E115C5D85  __libc_start_main     Unknown  Unknown
MASTERODB          000000000041232E  Unknown               Unknown  Unknown

It seems that many modules are reloaded at the beginning ... is that somehow interfering with my module environment from .bash_profile?
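For what it's worth, the symbol name in the error (netcdf_mp_nf90_open_) uses Intel Fortran name mangling, so the binary expects an Intel-built libnetcdff; a GNU-built netCDF picked up from the login environment would export differently mangled names. A hypothetical diagnostic (not part of the workflow; the binary path is taken from the log above) to see which library the executable actually resolves:

```shell
# Hypothetical diagnostic: inspect which netCDF library MASTERODB resolves
# at run time. A module mismatch typically shows up as a missing or
# wrong-toolchain libnetcdff here.
BIN=/home/snh02/pack/48t3_main.05.OMPIIFC2104.x/bin/MASTERODB  # path from the log above
if [ -x "$BIN" ]; then
    ldd "$BIN" | grep -i netcdf        # which libnetcdf/libnetcdff are picked up
    nm -D "$BIN" | grep -i nf90_open   # is the symbol listed as undefined (U)?
else
    echo "binary not available on this host"
fi
```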


iljamal commented Jun 19, 2024

Removing all module loads from .bash_profile seems to solve it, and the forecast step completed successfully :)

I was using:
module load prgenv/gnu cdo python3 nco ecmwf-toolbox
module load openmpi hdf5 netcdf4
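A sketch of a possible compromise (an assumption about the user's setup, not a tested recipe): keep the interactive module loads in ~/.bash_profile but guard them so non-interactive batch shells skip them.

```shell
# Sketch for ~/.bash_profile: only load modules in interactive shells.
# Batch shells are non-interactive, so $- contains no "i" and the
# block is skipped, leaving jobs with a clean module environment.
case $- in
  *i*)
    module load prgenv/gnu cdo python3 nco ecmwf-toolbox
    module load openmpi hdf5 netcdf4
    ;;
esac
```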


uandrae commented Jun 19, 2024

These are very useful experiences, although painful for you. It suggests that we should perhaps make sure that batch jobs run under a cleaner environment, enforced with the correct SBATCH directives.
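One way that could look (a sketch under assumptions, not the workflow's actual job headers; note the log above already shows SBATCH_EXPORT=NONE, so the leak here likely came from the login shell sourcing .bash_profile rather than from exported variables):

```shell
# Sketch: start the job from a clean environment instead of inheriting
# the submitter's shell, then load exactly the modules the forecast needs
# (versions taken from the log above).
#SBATCH --export=NONE        # do not propagate the submitting shell's environment
module purge
module load prgenv/intel hpcx-openmpi/2.9.0 hdf5/1.10.6 netcdf4/4.7.4 ecmwf-toolbox/2024.02.1.0
```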

Is it running now?


uandrae commented Jun 19, 2024

In general, putting a lot in .bashrc/.bash_profile brings surprises when you work on different projects with different needs.
