I have resolved the issue. After completing the build and installation, I forgot to run source /etc/profile manually, so $phos was empty. As a result, env $phos python3 ./train.py ran without the client library preloaded and never established communication with phosd.
The correct setting is:
export phos="LD_PRELOAD=cricket-client.so"
The build and install scripts already write this line into /etc/profile.
To avoid similar issues, I recommend checking that $phos is set correctly before running env $phos python3 ./train.py.
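As a rough sketch of that check, a small shell function can refuse to continue when $phos is empty or does not look like the expected preload setting. The function name check_phos is hypothetical and not part of PhOS. The pattern assumes the value should start with LD_PRELOAD= and mention cricket-client.so, as described above.

```shell
# Hypothetical helper (not part of PhOS): verify $phos before launching training.
check_phos() {
    # Succeed only if $phos is non-empty and looks like the expected
    # LD_PRELOAD assignment for cricket-client.so.
    case "${phos:-}" in
        LD_PRELOAD=*cricket-client.so*)
            return 0
            ;;
        *)
            echo "phos is unset or unexpected: '${phos:-}'; try: source /etc/profile" >&2
            return 1
            ;;
    esac
}

# Usage: only start training when the check passes.
#   check_phos && env $phos python3 ./train.py
```

If the check fails, running source /etc/profile in the same shell and retrying should be enough, assuming the install scripts wrote the export line as expected.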
Hello, I am working with a setup that uses three V100-32GB GPUs.
I first started PhOS Daemon in a shell:
root@gpu2:~/scripts/build_scripts# pos_cli --start --target daemon
From the output, it appears that the cricket-rpc-server is started correctly on the GPU as expected:
Next, I ran the ResNet training script in another shell:
ResNet appears to run correctly (the code deliberately stops training early once the batch iteration count reaches 64). However, when I run
pos_cli --dump --dir /root/ckpt --pid 41228
(41228 is the PID of python train.py) in a third shell while train.py is still running, I encounter the following error:
Additionally, there is no output from the PhOS Daemon:
Could you please help me understand what might be causing this issue and how to resolve it? Any assistance would be greatly appreciated!
Thank you very much!