Hello, I'm running into a problem with my training.
Below is my script (finetune.sh):
workspace=`pwd`

# which gpu to train or finetune
export CUDA_VISIBLE_DEVICES="0"
gpu_num=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')

# model_name from model_hub, or model_dir in local path
## option 1, download model automatically
model_name_or_model_dir="iic/SenseVoiceSmall"

## option 2, download model by git
#local_path_root=${workspace}/modelscope_models
#mkdir -p ${local_path_root}/${model_name_or_model_dir}
#git clone https://www.modelscope.cn/${model_name_or_model_dir}.git ${local_path_root}/${model_name_or_model_dir}
#model_name_or_model_dir=${local_path_root}/${model_name_or_model_dir}

# data dir, which contains: train.json, val.json
data_dir="../../../data/list"
# train_data="/home/whisper/qdm/finetune_whisper/pre_dataset/output_clean.jsonl"
# val_data="/home/whisper/qdm/finetune_whisper/pre_dataset/output_clean_test.jsonl"
train_data="${data_dir}/train.jsonl"
val_data="${data_dir}/val.jsonl"
train_data="${data_dir}/train2.jsonl"
val_data="${data_dir}/train2.jsonl"

# generate train.jsonl and val.jsonl from wav.scp and text.txt
# scp2jsonl \
# ++scp_file_list='["../../../data/list/train_wav.scp", "../../../data/list/train_text.txt"]' \
# ++data_type_list='["source", "target"]' \
# ++jsonl_file_out="${train_data}"
# scp2jsonl \
# ++scp_file_list='["../../../data/list/val_wav.scp", "../../../data/list/val_text.txt"]' \
# ++data_type_list='["source", "target"]' \
# ++jsonl_file_out="${val_data}"

# exp output dir
output_dir="./outputs"
log_file="${output_dir}/log.txt"
deepspeed_config=${workspace}/../../ds_stage1.json
mkdir -p ${output_dir}
echo "log_file: ${log_file}"

# DISTRIBUTED_ARGS="
# --nnodes ${WORLD_SIZE:-1} \
# --nproc_per_node $gpu_num \
# --node_rank ${RANK:-0} \
# --master_addr ${MASTER_ADDR:-10.12.2.110} \
# --master_port ${MASTER_PORT:-26669}
# "

# single-node, single-GPU training
DISTRIBUTED_ARGS="
  --nnodes 1 \
  --nproc_per_node 1
"

echo $DISTRIBUTED_ARGS

torchrun $DISTRIBUTED_ARGS \
  ../../../funasr/bin/train_ds.py \
  ++model="${model_name_or_model_dir}" \
  ++train_data_set_list="${train_data}" \
  ++valid_data_set_list="${val_data}" \
  ++dataset="AudioDataset" \
  ++dataset_conf.index_ds="IndexDSJsonl" \
  ++dataset_conf.data_split_num=1 \
  ++dataset_conf.batch_sampler="BatchSampler" \
  ++dataset_conf.batch_size=1 \
  ++dataset_conf.sort_size=1024 \
  ++dataset_conf.batch_type="token" \
  ++dataset_conf.num_workers=1 \
  ++train_conf.max_epoch=51 \
  ++train_conf.log_interval=10 \
  ++train_conf.resume=true \
  ++train_conf.validate_interval=10 \
  ++train_conf.save_checkpoint_interval=10 \
  ++train_conf.keep_nbest_models=20 \
  ++train_conf.avg_nbest_model=10 \
  ++train_conf.use_deepspeed=false \
  ++train_conf.deepspeed_config=${deepspeed_config} \
  ++optim_conf.lr=0.0002 \
  ++output_dir="${output_dir}" &> ${log_file}
Situation 1: with train_conf.max_epoch set to 50 there is no error, but changing it to any other number raises an error. And even though max_epoch=50 doesn't error, something feels off: right after launching, the run finishes almost immediately. I tested with only 4 samples here, but even with 20k samples it still finishes in 4-5 minutes. The log also doesn't match the official one. The official log output looks like this:
tail log.txt
[2024-03-21 15:55:52,137][root][INFO] - train, rank: 3, epoch: 0/50, step: 6990/1, total step: 6990, (loss_avg_rank: 0.327), (loss_avg_epoch: 0.409), (ppl_avg_epoch: 1.506), (acc_avg_epoch: 0.795), (lr: 1.165e-04), [('loss_att', 0.259), ('acc', 0.825), ('loss_pre', 0.04), ('loss', 0.299), ('batch_size', 40)], {'data_load': '0.000', 'forward_time': '0.315', 'backward_time': '0.555', 'optim_time': '0.076', 'total_time': '0.947'}, GPU, memory: usage: 3.830 GB, peak: 18.357 GB, cache: 20.910 GB, cache_peak: 20.910 GB
[2024-03-21 15:55:52,139][root][INFO] - train, rank: 1, epoch: 0/50, step: 6990/1, total step: 6990, (loss_avg_rank: 0.334), (loss_avg_epoch: 0.409), (ppl_avg_epoch: 1.506), (acc_avg_epoch: 0.795), (lr: 1.165e-04), [('loss_att', 0.285), ('acc', 0.823), ('loss_pre', 0.046), ('loss', 0.331), ('batch_size', 36)], {'data_load': '0.000', 'forward_time': '0.334', 'backward_time': '0.536', 'optim_time': '0.077', 'total_time': '0.948'}, GPU, memory: usage: 3.943 GB, peak: 18.291 GB, cache: 19.619 GB, cache_peak: 19.619 GB
My output looks like this:
... (long dump of model parameter tensors, including 'ctc.ctc_lo.weight', 'ctc.ctc_lo.bias', and 'embed.weight') ... does not exist, avg the lastet checkpoint.
average_checkpoints: ['./outputs/model.pt.ep50', './outputs/model.pt.ep49', './outputs/model.pt.ep48', './outputs/model.pt.ep47', './outputs/model.pt.ep46', './outputs/model.pt.ep45', './outputs/model.pt.ep44', './outputs/model.pt.ep43', './outputs/model.pt.ep42', './outputs/model.pt.ep41']
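One more thing I can add (just my guess, not verified): my script sets ++train_conf.resume=true, and the Situation 2 log below shows "Checkpoint loaded successfully from './outputs/model.pt'" and "Train epoch: 50". If ./outputs still contains the checkpoints of an already finished 50-epoch run, resuming would start at epoch 50 and stop right away, which could explain the near-instant exit. A minimal sketch to rule that out, assuming nothing else important lives in ./outputs:

# Sketch (my assumption, not confirmed): move the old experiment dir aside so
# train_conf.resume=true cannot pick up a finished run, then relaunch.
# Paths and names follow finetune.sh above.
mv ./outputs ./outputs.bak_$(date +%Y%m%d_%H%M%S)
mkdir -p ./outputs
bash finetune.sh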
Situation 2: if I change train_conf.max_epoch to any other number, such as 60, 500, or 51, it errors out (and although my script says 60, 500, or 51, the log still prints "Train epoch: 50"). The error log:
Model summary:
    Class Name: SenseVoiceSmall
    Total Number of model parameters: 234.00 M
    Number of trainable parameters: 234.00 M (100.0%)
    Type: torch.float32
[2025-01-10 14:09:41,022][root][INFO] - Build optim
[2025-01-10 14:09:41,026][root][INFO] - Build scheduler
[2025-01-10 14:09:41,026][root][INFO] - Build dataloader
[2025-01-10 14:09:41,026][root][INFO] - Build dataloader
[2025-01-10 14:09:41,026][root][INFO] - total_num of samplers: 4, ../../../data/list/train2.jsonl
[2025-01-10 14:09:41,026][root][INFO] - total_num of samplers: 4, ../../../data/list/train2.jsonl
0
Checkpoint loaded successfully from './outputs/model.pt'
[2025-01-10 14:09:41,411][root][INFO] - Train epoch: 50, rank: 0
[2025-01-10 14:09:41,416][root][INFO] - rank: 0, dataloader start from step: 0, batch_num: 5, after: 5
[2025-01-10 14:09:41,491][root][INFO] - rank: 0, dataloader start from step: 0, batch_num: 5, after: 5
Error executing job with overrides: ['++model=iic/SenseVoiceSmall', '++train_data_set_list=../../../data/list/train2.jsonl', '++valid_data_set_list=../../../data/list/train2.jsonl', '++dataset=AudioDataset', '++dataset_conf.index_ds=IndexDSJsonl', '++dataset_conf.data_split_num=1', '++dataset_conf.batch_sampler=BatchSampler', '++dataset_conf.batch_size=1', '++dataset_conf.sort_size=1024', '++dataset_conf.batch_type=token', '++dataset_conf.num_workers=1', '++train_conf.max_epoch=51', '++train_conf.log_interval=10', '++train_conf.resume=true', '++train_conf.validate_interval=10', '++train_conf.save_checkpoint_interval=10', '++train_conf.keep_nbest_models=20', '++train_conf.avg_nbest_model=10', '++train_conf.use_deepspeed=false', '++train_conf.deepspeed_config=/home/whisper/qdm/finetune_whisper/FunASR/examples/industrial_data_pretraining/paraformer_streaming/../../ds_stage1.json', '++optim_conf.lr=0.0002', '++output_dir=./outputs']
Traceback (most recent call last):
  File "/home/whisper/qdm/finetune_whisper/FunASR/examples/industrial_data_pretraining/paraformer_streaming/../../../funasr/bin/train_ds.py", line 225, in <module>
    main_hydra()
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/hydra/main.py", line 94, in decorated_main
    _run_hydra(
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/hydra/_internal/utils.py", line 394, in _run_hydra
    _run_app(
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/hydra/_internal/utils.py", line 457, in _run_app
    run_and_report(
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/hydra/_internal/utils.py", line 223, in run_and_report
    raise ex
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/hydra/_internal/utils.py", line 220, in run_and_report
    return func()
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/hydra/_internal/utils.py", line 458, in <lambda>
    lambda: hydra.run(
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/hydra/_internal/hydra.py", line 132, in run
    _ = ret.return_value
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/hydra/core/utils.py", line 260, in return_value
    raise self._return_value
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/hydra/core/utils.py", line 186, in run_job
    ret.return_value = task_function(task_cfg)
  File "/home/whisper/qdm/finetune_whisper/FunASR/examples/industrial_data_pretraining/paraformer_streaming/../../../funasr/bin/train_ds.py", line 56, in main_hydra
    main(**kwargs)
  File "/home/whisper/qdm/finetune_whisper/FunASR/examples/industrial_data_pretraining/paraformer_streaming/../../../funasr/bin/train_ds.py", line 173, in main
    trainer.train_epoch(
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/funasr/train_utils/trainer_ds.py", line 603, in train_epoch
    self.forward_step(model, batch, loss_dict=loss_dict)
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/funasr/train_utils/trainer_ds.py", line 670, in forward_step
    retval = model(**batch)
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: SenseVoiceSmall.forward() missing 4 required positional arguments: 'speech', 'speech_lengths', 'text', and 'text_lengths'
E0110 14:09:43.047000 919201 site-packages/torch/distributed/elastic/multiprocessing/api.py:869] failed (exitcode: 1) local_rank: 0 (pid: 919299) of binary: /home/install/Anaconda3/envs/speech/bin/python
Traceback (most recent call last):
  File "/home/install/Anaconda3/envs/speech/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 355, in wrapper
    return f(*args, **kwargs)
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/torch/distributed/run.py", line 919, in main
    run(args)
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/torch/distributed/run.py", line 910, in run
    elastic_launch(
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 138, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/install/Anaconda3/envs/speech/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 269, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
../../../funasr/bin/train_ds.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-01-10_14:09:43
  host      : cdatc-NF5468M6
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 919299)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
Could you please take a look and tell me where I went wrong? Thanks.
Look at the code carefully: the model's own config.yaml overrides the parameters you set, so you need to modify the train section inside its config.yaml.
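For example (a sketch only: the cache path below assumes the model was auto-downloaded via ModelScope, and the exact key names inside config.yaml may differ on your machine):

# Locate the downloaded model's config.yaml and inspect its train section.
config=~/.cache/modelscope/hub/iic/SenseVoiceSmall/config.yaml
grep -n -A 8 "train_conf" "$config"
# Edit max_epoch there instead of relying only on the ++ CLI override,
# keeping a backup of the original file (the 50 -> 51 pattern is an example):
sed -i.bak 's/max_epoch: 50/max_epoch: 51/' "$config"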