Reproduce the model #4

Open
yikehuaili opened this issue Feb 24, 2025 · 4 comments

@yikehuaili

Hello, thank you for your great contribution.
I am new to AI modeling. When reproducing the model, I encountered the error below. Could you advise how to resolve it?
Executed command: GPU=0 bash scripts/run_exp_pipe.sh pepbench_codesign configs/pepbench/autoencoder/train_codesign.yaml configs/pepbench/ldm/train_codesign.yaml configs/pepbench/ldm/setup_latent_guidance.yaml configs/pepbench/test_codesign.yaml
Error message:
100%|██████████| 170/170 [01:25<00:00, 1.98it/s, loss=0, version=0]

0%| | 0/4157 [00:00<?, ?it/s]
100%|██████████| 4157/4157 [00:00<00:00, 1097138.29it/s]
2025-02-21 21:34:14::INFO::validating ...

0%| | 0/6 [00:00<?, ?it/s]
0%| | 0/6 [00:01<?, ?it/s]
Traceback (most recent call last):
File "/home/dengxj/yanlin/deeplearning/PepGLAD/train.py", line 78, in
main(args, opt_args)
File "/home/dengxj/yanlin/deeplearning/PepGLAD/train.py", line 71, in main
trainer.train(args.gpus, args.local_rank)
File "/home/dengxj/yanlin/deeplearning/PepGLAD/trainer/abs_trainer.py", line 253, in train
self._valid_epoch(device)
File "/home/dengxj/yanlin/deeplearning/PepGLAD/trainer/abs_trainer.py", line 155, in _valid_epoch
metric = self.valid_step(batch, self.valid_global_step)
File "/home/dengxj/yanlin/deeplearning/PepGLAD/trainer/ldm_trainer.py", line 52, in valid_step
loss, loss_dict = self.model(**batch)
ValueError: not enough values to unpack (expected 2, got 1)
cat: ./exps/pepbench_fixseq/LDM/version_0/checkpoint/topk_map.txt: No such file or directory
usage: setup_latent_guidance.py [-h] --config CONFIG --ckpt CKPT [--gpu GPU]
setup_latent_guidance.py: error: argument --ckpt: expected one argument
usage: generate.py [-h] --config CONFIG --ckpt CKPT [--save_dir SAVE_DIR]
[--gpu GPU] [--n_cpu N_CPU]
generate.py: error: argument --ckpt: expected one argument
Traceback (most recent call last):
File "/home/dengxj/yanlin/deeplearning/PepGLAD/cal_metrics.py", line 228, in
main(parse())
File "/home/dengxj/yanlin/deeplearning/PepGLAD/cal_metrics.py", line 153, in main
with open(args.results, 'r') as fin:
FileNotFoundError: [Errno 2] No such file or directory: './exps/pepbench_fixseq/LDM/version_0/results/results.jsonl'

@kxz18 (Collaborator) commented Feb 24, 2025

Hi, could you share the complete logs? I found some mismatches in the logs provided so far: the executed command specifies training in codesign mode, but the later logs seem to look for checkpoints in folders with the "fixseq" suffix. Also, messages like "Training Autoencoder with config xxx" or "Using Autoencoder checkpoint" do not appear, which makes it hard to track which stage the run is in.

@yikehuaili (Author)

> Hi, could you share the complete logs? I found some mismatches in the logs provided so far: the executed command specifies training in codesign mode, but the later logs seem to look for checkpoints in folders with the "fixseq" suffix. Also, messages like "Training Autoencoder with config xxx" or "Using Autoencoder checkpoint" do not appear, which makes it hard to track which stage the run is in.

OK, thank you so much!

Attachment: nohup.txt

@kxz18 (Collaborator) commented Feb 24, 2025

Hi, it looks like you need to either use a GPU with more memory (at least 24 GB) or reduce the dynamic batch size here.
The logs keep throwing CUDA out-of-memory warnings at every step, which is likely also happening during validation. The forward function is wrapped with an OOM decorator that returns an OOM signal when CUDA memory runs out. This is not expected during validation and therefore not handled by the validation loop, which is why the run ends with "ValueError: not enough values to unpack (expected 2, got 1)".
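To make the failure mode concrete, here is a minimal, hypothetical sketch of the pattern described above. The names (oom_guard, OOM_SIGNAL, model_forward, valid_step) are illustrative, not PepGLAD's actual identifiers: a forward pass wrapped in an OOM guard returns a one-element signal instead of (loss, loss_dict), so the validation step's two-way unpack raises exactly this ValueError.

import torch

# Hypothetical sentinel returned instead of (loss, loss_dict) when the GPU runs out of memory.
OOM_SIGNAL = ('OOM',)  # a one-element tuple, so a two-way unpack fails

def oom_guard(forward_fn):
    # Wrap a forward pass and swallow CUDA OOM errors (illustrative only).
    def wrapper(*args, **kwargs):
        try:
            return forward_fn(*args, **kwargs)
        except RuntimeError as err:
            if 'out of memory' in str(err):
                torch.cuda.empty_cache()
                return OOM_SIGNAL  # the training loop knows to skip this batch
            raise
    return wrapper

@oom_guard
def model_forward(batch):
    # Simulate the model running out of GPU memory on a large validation batch;
    # normally this would return (loss, loss_dict).
    raise RuntimeError('CUDA out of memory (simulated)')

def valid_step(batch):
    # The validation loop assumes the forward pass always succeeds,
    # so the one-element OOM signal cannot be unpacked into two values.
    loss, loss_dict = model_forward(batch)
    return loss

valid_step({})  # ValueError: not enough values to unpack (expected 2, got 1)

Reducing the dynamic batch size (or using a GPU with more memory) avoids triggering the OOM guard in the first place.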

@yikehuaili (Author)

> Hi, it looks like you need to either use a GPU with more memory (at least 24 GB) or reduce the dynamic batch size here. The logs keep throwing CUDA out-of-memory warnings at every step, which is likely also happening during validation. The forward function is wrapped with an OOM decorator that returns an OOM signal when CUDA memory runs out. This is not expected during validation and therefore not handled by the validation loop, which is why the run ends with "ValueError: not enough values to unpack (expected 2, got 1)".

Thank you very much for your answer!
