Failure with pos_cli --dump Command When Running ResNet Training #19

Open
LiuMicheal opened this issue Feb 10, 2025 · 1 comment
@LiuMicheal
Hello, I am working with a setup that uses three V100-32GB GPUs.

I first started the PhOS daemon in a shell:

root@gpu2:~/scripts/build_scripts# pos_cli --start --target daemon

Image

From the output, it appears that the cricket-rpc-server is started correctly on the GPU as expected:

Image

Next, I ran the ResNet training script in another shell:

Image

ResNet appears to run correctly (the training loop stops early after 64 batch iterations, as the code intends). However, when I run pos_cli --dump --dir /root/ckpt --pid 41228 (41228 is the PID of python train.py) in a third shell while train.py is still running, I get the following error:

Image

Additionally, there is no output from the PhOS Daemon:

Image

Could you please help me understand what might be causing this issue and how to resolve it? Any assistance would be greatly appreciated!

Thank you very much!

@LiuMicheal
Author

I have resolved the issue. After completing the build and installation, I forgot to run source /etc/profile, so $phos was empty. As a result, env $phos python3 ./train.py started train.py without the LD_PRELOAD setting and never established communication with phosd.

The correct value for $phos should be:
export phos="LD_PRELOAD=cricket-client.so"
This is already written into /etc/profile by the build and install scripts.
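
To illustrate why an empty $phos goes unnoticed, here is a minimal sketch using a harmless stand-in variable. DEMO_VAR and show_env are hypothetical names for this demonstration only and are not part of PhOS. When the unquoted prefix is empty, the shell drops it entirely, so env runs the command with no error and no extra variable set:

```shell
# Hypothetical demo: DEMO_VAR stands in for LD_PRELOAD, show_env is not a PhOS tool.
unset DEMO_VAR

show_env() {
    # $1 is left unquoted on purpose, mirroring `env $phos python3 ./train.py`.
    # An empty value expands to nothing, so env sees only the command.
    env $1 sh -c 'echo "DEMO_VAR=${DEMO_VAR:-<unset>}"'
}

show_env ""                 # prints: DEMO_VAR=<unset>
show_env "DEMO_VAR=hello"   # prints: DEMO_VAR=hello
```

The same thing happens with an empty $phos: train.py still runs, which matches the training appearing to work while the daemon shows no output.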

To avoid similar issues, I recommend checking whether $phos is correctly set before running env $phos python3 ./train.py.
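
One possible way to do this check is a small guard function, sketched below. check_phos is a hypothetical helper name, not something shipped with PhOS; it only tests that the variable is non-empty and does not verify that the library can actually be loaded.

```shell
# Hypothetical helper (not part of PhOS): fail early if $phos is empty.
check_phos() {
    if [ -z "${phos:-}" ]; then
        echo "phos is not set; try: source /etc/profile" >&2
        return 1
    fi
    return 0
}

# Intended usage (left as a comment so this snippet has no side effects):
#   check_phos && env $phos python3 ./train.py
```

With this guard, a forgotten source /etc/profile produces an explicit message instead of a training run that starts without the preload.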

Image
