Training error #2357

Open
someoneformulated opened this issue Jan 10, 2025 · 1 comment
Labels
bug Something isn't working

Comments

someoneformulated commented Jan 10, 2025

Hello, I'm running into a problem with training.

Below is my script (finetune.sh):

workspace=`pwd`

# which gpu to train or finetune
export CUDA_VISIBLE_DEVICES="0"
gpu_num=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')

# model_name from model_hub, or model_dir in local path

## option 1, download model automatically
model_name_or_model_dir="iic/SenseVoiceSmall"


## option 2, download model by git
#local_path_root=${workspace}/modelscope_models
#mkdir -p ${local_path_root}/${model_name_or_model_dir}
#git clone https://www.modelscope.cn/${model_name_or_model_dir}.git ${local_path_root}/${model_name_or_model_dir}
#model_name_or_model_dir=${local_path_root}/${model_name_or_model_dir}


# data dir, which contains: train.json, val.json
data_dir="../../../data/list"

# train_data="/home/whisper/qdm/finetune_whisper/pre_dataset/output_clean.jsonl"
# val_data="/home/whisper/qdm/finetune_whisper/pre_dataset/output_clean_test.jsonl"
train_data="${data_dir}/train.jsonl"
val_data="${data_dir}/val.jsonl"

train_data="${data_dir}/train2.jsonl"
val_data="${data_dir}/train2.jsonl"
# generate train.jsonl and val.jsonl from wav.scp and text.txt
# scp2jsonl \
# ++scp_file_list='["../../../data/list/train_wav.scp", "../../../data/list/train_text.txt"]' \
# ++data_type_list='["source", "target"]' \
# ++jsonl_file_out="${train_data}"

# scp2jsonl \
# ++scp_file_list='["../../../data/list/val_wav.scp", "../../../data/list/val_text.txt"]' \
# ++data_type_list='["source", "target"]' \
# ++jsonl_file_out="${val_data}"


# exp output dir
output_dir="./outputs"
log_file="${output_dir}/log.txt"

deepspeed_config=${workspace}/../../ds_stage1.json

mkdir -p ${output_dir}
echo "log_file: ${log_file}"

# DISTRIBUTED_ARGS="
#     --nnodes ${WORLD_SIZE:-1} \
#     --nproc_per_node $gpu_num \
#     --node_rank ${RANK:-0} \
#     --master_addr ${MASTER_ADDR:-10.12.2.110} \
#     --master_port ${MASTER_PORT:-26669}
# "

# single-node, single-GPU training
DISTRIBUTED_ARGS="
    --nnodes 1 \
    --nproc_per_node 1 
"

echo $DISTRIBUTED_ARGS

torchrun $DISTRIBUTED_ARGS \
../../../funasr/bin/train_ds.py \
++model="${model_name_or_model_dir}" \
++train_data_set_list="${train_data}" \
++valid_data_set_list="${val_data}" \
++dataset="AudioDataset" \
++dataset_conf.index_ds="IndexDSJsonl" \
++dataset_conf.data_split_num=1 \
++dataset_conf.batch_sampler="BatchSampler" \
++dataset_conf.batch_size=1  \
++dataset_conf.sort_size=1024 \
++dataset_conf.batch_type="token" \
++dataset_conf.num_workers=1 \
++train_conf.max_epoch=51 \
++train_conf.log_interval=10 \
++train_conf.resume=true \
++train_conf.validate_interval=10 \
++train_conf.save_checkpoint_interval=10 \
++train_conf.keep_nbest_models=20 \
++train_conf.avg_nbest_model=10 \
++train_conf.use_deepspeed=false \
++train_conf.deepspeed_config=${deepspeed_config} \
++optim_conf.lr=0.0002 \
++output_dir="${output_dir}" &> ${log_file}
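For reference, each line of the train.jsonl/val.jsonl files that scp2jsonl produces is a single JSON object. The field names below are assumptions based on FunASR's data-preparation examples, so verify them against a file actually generated by scp2jsonl on your machine. A minimal sketch that writes one such line and sanity-checks it:

```shell
# Hypothetical one-line jsonl sample (field names assumed from FunASR's
# scp2jsonl examples -- verify against your own generated train.jsonl).
cat > /tmp/train_example.jsonl <<'EOF'
{"key": "utt_0001", "source": "/data/wav/utt_0001.wav", "source_len": 16000, "target": "你好世界", "target_len": 4}
EOF

# Sanity check: every line must be valid JSON containing source/target.
python3 - <<'PY'
import json
with open("/tmp/train_example.jsonl") as f:
    for line in f:
        item = json.loads(line)
        assert "source" in item and "target" in item, item
print("jsonl OK")
PY
```

A malformed or empty jsonl often fails much later in the pipeline than where the real mistake is, so a check like this before launching torchrun can save a debugging round-trip.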

Note 1: if train_conf.max_epoch is 50, there is no error, but changing it to any other number raises an error.
And even though max_epoch=50 does not error, it feels odd: after launching, the run finishes very quickly. I was testing with only 4 samples here, but even after switching to 20k samples it still finished in just 4-5 minutes.
The output also differs from the official logs. The official log output looks like this:

tail log.txt
[2024-03-21 15:55:52,137][root][INFO] - train, rank: 3, epoch: 0/50, step: 6990/1, total step: 6990, (loss_avg_rank: 0.327), (loss_avg_epoch: 0.409), (ppl_avg_epoch: 1.506), (acc_avg_epoch: 0.795), (lr: 1.165e-04), [('loss_att', 0.259), ('acc', 0.825), ('loss_pre', 0.04), ('loss', 0.299), ('batch_size', 40)], {'data_load': '0.000', 'forward_time': '0.315', 'backward_time': '0.555', 'optim_time': '0.076', 'total_time': '0.947'}, GPU, memory: usage: 3.830 GB, peak: 18.357 GB, cache: 20.910 GB, cache_peak: 20.910 GB
[2024-03-21 15:55:52,139][root][INFO] - train, rank: 1, epoch: 0/50, step: 6990/1, total step: 6990, (loss_avg_rank: 0.334), (loss_avg_epoch: 0.409), (ppl_avg_epoch: 1.506), (acc_avg_epoch: 0.795), (lr: 1.165e-04), [('loss_att', 0.285), ('acc', 0.823), ('loss_pre', 0.046), ('loss', 0.331), ('batch_size', 36)], {'data_load': '0.000', 'forward_time': '0.334', 'backward_time': '0.536', 'optim_time': '0.077', 'total_time': '0.948'}, GPU, memory: usage: 3.943 GB, peak: 18.291 GB, cache: 19.619 GB, cache_peak: 19.619 GB

My output looks like this:


[... several hundred lines of raw tensor values from the model state dict elided ...]
('ctc.ctc_lo.weight', tensor(...)), ('ctc.ctc_lo.bias', tensor(...)), ('embed.weight', tensor(...))])} does not exist, avg the lastet checkpoint.
average_checkpoints: ['./outputs/model.pt.ep50', './outputs/model.pt.ep49', './outputs/model.pt.ep48', './outputs/model.pt.ep47', './outputs/model.pt.ep46', './outputs/model.pt.ep45', './outputs/model.pt.ep44', './outputs/model.pt.ep43', './outputs/model.pt.ep42', './outputs/model.pt.ep41']



Note 2:
If train_conf.max_epoch is changed to any other number, such as 60, 500, or 51, it errors (and although my script says 60, 500, or 51, the log still prints "Train epoch: 50"). The error log:


Model summary:
    Class Name: SenseVoiceSmall
    Total Number of model parameters: 234.00 M
    Number of trainable parameters: 234.00 M (100.0%)
    Type: torch.float32
[2025-01-10 14:09:41,022][root][INFO] - Build optim
[2025-01-10 14:09:41,026][root][INFO] - Build scheduler
[2025-01-10 14:09:41,026][root][INFO] - Build dataloader
[2025-01-10 14:09:41,026][root][INFO] - Build dataloader
[2025-01-10 14:09:41,026][root][INFO] - total_num of samplers: 4, ../../../data/list/train2.jsonl
[2025-01-10 14:09:41,026][root][INFO] - total_num of samplers: 4, ../../../data/list/train2.jsonl
0
Checkpoint loaded successfully from './outputs/model.pt'
[2025-01-10 14:09:41,411][root][INFO] - Train epoch: 50, rank: 0

[2025-01-10 14:09:41,416][root][INFO] - rank: 0, dataloader start from step: 0, batch_num: 5, after: 5
[2025-01-10 14:09:41,491][root][INFO] - rank: 0, dataloader start from step: 0, batch_num: 5, after: 5
Error executing job with overrides: ['++model=iic/SenseVoiceSmall', '++train_data_set_list=../../../data/list/train2.jsonl', '++valid_data_set_list=../../../data/list/train2.jsonl', '++dataset=AudioDataset', '++dataset_conf.index_ds=IndexDSJsonl', '++dataset_conf.data_split_num=1', '++dataset_conf.batch_sampler=BatchSampler', '++dataset_conf.batch_size=1', '++dataset_conf.sort_size=1024', '++dataset_conf.batch_type=token', '++dataset_conf.num_workers=1', '++train_conf.max_epoch=51', '++train_conf.log_interval=10', '++train_conf.resume=true', '++train_conf.validate_interval=10', '++train_conf.save_checkpoint_interval=10', '++train_conf.keep_nbest_models=20', '++train_conf.avg_nbest_model=10', '++train_conf.use_deepspeed=false', '++train_conf.deepspeed_config=/home/whisper/qdm/finetune_whisper/FunASR/examples/industrial_data_pretraining/paraformer_streaming/../../ds_stage1.json', '++optim_conf.lr=0.0002', '++output_dir=./outputs']
Traceback (most recent call last):
  File "/home/whisper/qdm/finetune_whisper/FunASR/examples/industrial_data_pretraining/paraformer_streaming/../../../funasr/bin/train_ds.py", line 225, in <module>
    main_hydra()
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/whisper/qdm/finetune_whisper/FunASR/examples/industrial_data_pretraining/paraformer_streaming/../../../funasr/bin/train_ds.py", line 56, in main_hydra
    main(**kwargs)
  File "/home/whisper/qdm/finetune_whisper/FunASR/examples/industrial_data_pretraining/paraformer_streaming/../../../funasr/bin/train_ds.py", line 173, in main
    trainer.train_epoch(
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/funasr/train_utils/trainer_ds.py", line 603, in train_epoch
    self.forward_step(model, batch, loss_dict=loss_dict)
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/funasr/train_utils/trainer_ds.py", line 670, in forward_step
    retval = model(**batch)
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: SenseVoiceSmall.forward() missing 4 required positional arguments: 'speech', 'speech_lengths', 'text', and 'text_lengths'
E0110 14:09:43.047000 919201 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 919299) of binary: /home/install/Anaconda3/envs/speech/bin/python
Traceback (most recent call last):
  File "/home/install/Anaconda3/envs/speech/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
../../../funasr/bin/train_ds.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-10_14:09:43
  host      : cdatc-NF5468M6
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 919299)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
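A plausible reading of the log above: with ++train_conf.resume=true, training first loads ./outputs/model.pt, which was saved at the end of epoch 50, so the run starts at "Train epoch: 50" regardless of the max_epoch override. A quick way to rule out stale-checkpoint effects is to start from a clean output directory (a sketch; the paths are the ones from finetune.sh above):

```shell
# Move any previous checkpoints aside so resume=true has nothing to load
# and training starts from epoch 0 (paths taken from finetune.sh above).
output_dir="./outputs"
if [ -d "${output_dir}" ]; then
    backup_dir="${output_dir}.bak.$(date +%s)"
    mv "${output_dir}" "${backup_dir}"      # keep old checkpoints around
    echo "moved old checkpoints to ${backup_dir}"
fi
mkdir -p "${output_dir}"
echo "ready: ${output_dir} is empty, next run starts from scratch"
```

This also explains why max_epoch=50 "finishes" in minutes: resuming at epoch 50 with max_epoch=50 means there is nothing left to train, so the run skips straight to checkpoint averaging.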

Could you take a look? Did I get something wrong somewhere? Thanks.

@someoneformulated someoneformulated added the bug Something isn't working label Jan 10, 2025
@yjlyjl666

Look carefully at the code: the model's own config.yaml overrides the parameters you set, so you need to modify the train section inside its config.yaml.
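To see which train settings the model's bundled config.yaml would impose, one can grep its train section. The cache path below is an assumption (adjust it to wherever ModelScope actually stored iic/SenseVoiceSmall on your machine):

```shell
# Inspect the train settings inside the model's bundled config.yaml.
# NOTE: the cache location below is an assumption -- adjust for your setup.
model_dir="${HOME}/.cache/modelscope/hub/iic/SenseVoiceSmall"
config="${model_dir}/config.yaml"

if [ -f "${config}" ]; then
    grep -n -A 10 "train_conf" "${config}"
else
    echo "config.yaml not found at ${config}"
    echo "locate it with: find ~ -name config.yaml -path '*SenseVoiceSmall*'"
fi
```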
