shared_list does not have data_set in forward block with TIMIT tutorial #157

Open · hajime9652 opened this issue Aug 29, 2019 · 27 comments

@hajime9652 commented Aug 29, 2019

------------------------------ Epoch 23 / 23 ------------------------------
 
----- Summary epoch 23 / 23
Training on ['TIMIT_tr']
Loss = 0.932 | err = 0.298 
-----
Validating on TIMIT_dev
Loss = 1.811 | err = 0.468 
-----
Learning rate on architecture1 = 0.08 
-----
Elapsed time (s) = 574

 
Testing TIMIT_test chunk = 1 / 1
shared list []
shared list [None, None, None, {'mfcc': ['mfcc', 'exp/TIMIT_MLP_basic/exp_files/forward_TIMIT_test_ep23_ck0_mfcc.lst', 'apply-cmvn --utt2spk=ark:/home/sysadmin/kaldi/egs/timit/s5_0827_test/data/test/utt2spk  ark:/home/sysadmin/kaldi/egs/timit/s5_0827_test/mfcc/cmvn_test.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |', '5', '5']}, {}, {'MLP_layers1': ['architecture1', 'MLP_layers1', 0]}, {'input': None, 'ref': None}]
output folder exp/TIMIT_MLP_basic
data_set_dict <class 'dict'>
data_set_dict {'input': None, 'ref': None}
Traceback (most recent call last):
  File "run_exp.py", line 340, in <module>
    data_set_inp, data_set_ref = convert_numpy_to_torch(data_set_dict, save_gpumem, use_cuda)
  File "/home/sysadmin/pytorch-kaldi/core.py", line 46, in convert_numpy_to_torch
    data_set_inp=torch.from_numpy(data_set_dict['input']).float()
TypeError: expected np.ndarray (got NoneType)
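For context on the exception itself: torch.from_numpy only accepts NumPy arrays, so a data_set_dict whose 'input' slot was never populated fails exactly this way. A minimal sketch reproducing the message:

import torch

# The forward chunk ends up with an unpopulated data set:
data_set_dict = {'input': None, 'ref': None}

# core.py then does essentially this, which raises
# "TypeError: expected np.ndarray (got NoneType)":
data_set_inp = torch.from_numpy(data_set_dict['input']).float()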
@hajime9652 (Author) commented Aug 29, 2019

# --------FORWARD--------#
for forward_data in forward_data_lst:

    # Compute the number of chunks
    N_ck_forward = compute_n_chunks(out_folder, forward_data, ep, N_ep_str_format, 'forward')
    N_ck_str_format = '0' + str(max(math.ceil(np.log10(N_ck_forward)), 1)) + 'd'

    processes = list()
    info_files = list()
    for ck in range(N_ck_forward):

        if not is_production:
            print('Testing %s chunk = %i / %i' % (forward_data, ck + 1, N_ck_forward))
        else:
            print('Forwarding %s chunk = %i / %i' % (forward_data, ck + 1, N_ck_forward))

        # output file
        info_file = out_folder + '/exp_files/forward_' + forward_data + '_ep' + format(ep, N_ep_str_format) + '_ck' + format(ck, N_ck_str_format) + '.info'
        config_chunk_file = out_folder + '/exp_files/forward_' + forward_data + '_ep' + format(ep, N_ep_str_format) + '_ck' + format(ck, N_ck_str_format) + '.cfg'

        # Do forward if the chunk was not already processed
        if not (os.path.exists(info_file)):

            # Doing forward

            # getting the next chunk
            next_config_file = cfg_file_list[op_counter]

            # run chunk processing
            if _run_forwarding_in_subprocesses(config):
                shared_list = list()
                print("shared list", shared_list)
                output_folder = config['exp']['out_folder']
                save_gpumem = strtobool(config['exp']['save_gpumem'])
                use_cuda = strtobool(config['exp']['use_cuda'])
                p = read_next_chunk_into_shared_list_with_subprocess(read_lab_fea, shared_list, config_chunk_file, is_production, output_folder, wait_for_process=True)
                data_name, data_end_index_fea, data_end_index_lab, fea_dict, lab_dict, arch_dict, data_set_dict = extract_data_from_shared_list(shared_list)
                print("shared list", shared_list)
                print("output folder", output_folder)
                print("data_set_dict", type(data_set_dict))
                print("data_set_dict", data_set_dict)
                data_set_inp, data_set_ref = convert_numpy_to_torch(data_set_dict, save_gpumem, use_cuda)
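For reference, here is a minimal sketch of what extract_data_from_shared_list appears to do, assuming the 7-slot layout visible in the debug print above (the actual implementation in pytorch-kaldi's core.py may differ in detail). It shows why data_name, both end-index entries, and the 'input'/'ref' fields of the data set all come back as None here: the subprocess never filled those slots.

def extract_data_from_shared_list(shared_list):
    data_name = shared_list[0]           # None in the failing run
    data_end_index_fea = shared_list[1]  # None
    data_end_index_lab = shared_list[2]  # None
    fea_dict = shared_list[3]            # populated: {'mfcc': [...]}
    lab_dict = shared_list[4]            # {} (empty)
    arch_dict = shared_list[5]           # populated
    data_set = shared_list[6]            # {'input': None, 'ref': None}
    return data_name, data_end_index_fea, data_end_index_lab, fea_dict, lab_dict, arch_dict, data_set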

@hajime9652 (Author)

When does shared_list get overwritten, and how can I obtain the correct data_set?

@TParcollet (Collaborator)

Hi! Isn't it simply a problem with the path of the test dataset in the config file?

@mravanelli (Owner) commented Aug 29, 2019 via email

@hajime9652 (Author)

I will check again.

@hajime9652 (Author)

I'm still stuck.

Error message:

------------------------------ Epoch 23 / 23 ------------------------------
 
----- Summary epoch 23 / 23
Training on ['TIMIT_tr']
Loss = 0.932 | err = 0.298 
-----
Validating on TIMIT_dev
Loss = 1.812 | err = 0.468 
-----
Learning rate on architecture1 = 0.08 
-----
Elapsed time (s) = 489

 
Testing TIMIT_test chunk = 1 / 1
config chunk file exp/TIMIT_MLP_basic/exp_files/forward_TIMIT_test_ep23_ck0.cfg
shared list [None, None, None, {'mfcc': ['mfcc', 'exp/TIMIT_MLP_basic/exp_files/forward_TIMIT_test_ep23_ck0_mfcc.lst', 'apply-cmvn --utt2spk=ark:/home/sysadmin/kaldi/egs/timit/s5/data/test/utt2spk  ark:/home/sysadmin/kaldi/egs/timit/s5/mfcc/cmvn_test.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |', '5', '5']}, {}, {'MLP_layers1': ['architecture1', 'MLP_layers1', 0]}, {'input': None, 'ref': None}]
Traceback (most recent call last):
  File "run_exp.py", line 338, in <module>
    data_set_inp, data_set_ref = convert_numpy_to_torch(data_set_dict, save_gpumem, use_cuda)
  File "/home/sysadmin/pytorch-kaldi/core.py", line 46, in convert_numpy_to_torch
    data_set_inp=torch.from_numpy(data_set_dict['input']).float()
TypeError: expected np.ndarray (got NoneType)

Config file:

[dataset1]
data_name = TIMIT_tr
fea = fea_name=mfcc
        fea_lst=/home/sysadmin/kaldi/egs/timit/s5/data/train/feats.scp
        fea_opts=apply-cmvn --utt2spk=ark:/home/sysadmin/kaldi/egs/timit/s5/data/train/utt2spk  ark:/home/sysadmin/kaldi/egs/timit/s5/mfcc/cmvn_train.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |
        cw_left=5
        cw_right=5
        

lab = lab_name=lab_cd
        lab_folder=/home/sysadmin/kaldi/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali
        lab_opts=ali-to-pdf
        lab_count_file=auto
        lab_data_folder=/home/sysadmin/kaldi/egs/timit/s5/data/train/
        lab_graph=/home/sysadmin/kaldi/egs/timit/s5/exp/tri3/graph
        

n_chunks = 5

[dataset2]
data_name = TIMIT_dev
fea = fea_name=mfcc
        fea_lst=/home/sysadmin/kaldi/egs/timit/s5/data/dev/feats.scp
        fea_opts=apply-cmvn --utt2spk=ark:/home/sysadmin/kaldi/egs/timit/s5/data/dev/utt2spk  ark:/home/sysadmin/kaldi/egs/timit/s5/mfcc/cmvn_dev.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |
        cw_left=5
        cw_right=5
        

lab = lab_name=lab_cd
        lab_folder=/home/sysadmin/kaldi/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali_dev
        lab_opts=ali-to-pdf
        lab_count_file=auto
        lab_data_folder=/home/sysadmin/kaldi/egs/timit/s5/data/dev/
        lab_graph=/home/sysadmin/kaldi/egs/timit/s5/exp/tri3/graph
        

n_chunks = 1

[dataset3]
data_name = TIMIT_test
fea = fea_name=mfcc
        fea_lst=/home/sysadmin/kaldi/egs/timit/s5/data/test/feats.scp
        fea_opts=apply-cmvn --utt2spk=ark:/home/sysadmin/kaldi/egs/timit/s5/data/test/utt2spk  ark:/home/sysadmin/kaldi/egs/timit/s5/mfcc/cmvn_test.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |
        cw_left=5
        cw_right=5
        

lab = lab_name=lab_cd
        lab_folder=/home/sysadmin/kaldi/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali_test
        lab_opts=ali-to-pdf
        lab_count_file=auto
        lab_data_folder=/home/sysadmin/kaldi/egs/timit/s5/data/test/
        lab_graph=/home/sysadmin/kaldi/egs/timit/s5/exp/tri3/graph
        

n_chunks = 1

@hajime9652 (Author)

data_name, data_end_index_fea, and data_end_index_lab are None, lab_dict comes back empty, and data_set_dict holds only None values. In particular, why can't lab_dict be read?

shared list [None, None, None, {'mfcc': ['mfcc', 'exp/TIMIT_MLP_basic/exp_files/forward_TIMIT_test_ep23_ck0_mfcc.lst', 'apply-cmvn --utt2spk=ark:/home/sysadmin/kaldi/egs/timit/s5/data/test/utt2spk  ark:/home/sysadmin/kaldi/egs/timit/s5/mfcc/cmvn_test.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |', '5', '5']}, {}, {'MLP_layers1': ['architecture1', 'MLP_layers1', 0]}, {'input': None, 'ref': None}]

Contents of lab_folder:

$ ls /home/sysadmin/kaldi/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali_test
ali.1.gz  ali.2.gz  ali.3.gz  ali.4.gz  final.mdl  log  num_jobs  phones.txt  tree

exp/TIMIT_MLP_basic/exp_files/forward_TIMIT_test_ep23_ck0.cfg

[cfg_proto]
cfg_proto = proto/global.proto
cfg_proto_chunk = proto/global_chunk.proto

[exp]
cmd = 
run_nn_script = run_nn
out_folder = exp/TIMIT_MLP_basic
seed = 1257
use_cuda = False
multi_gpu = False
save_gpumem = False
n_epochs_tr = 24
production = False
to_do = forward
out_info = exp/TIMIT_MLP_basic/exp_files/forward_TIMIT_test_ep23_ck0.info

[batches]
batch_size_train = 128
max_seq_length_train = 1000
batch_size_valid = 128
max_seq_length_valid = 1000

[architecture1]
arch_name = MLP_layers1
arch_proto = proto/MLP.proto
arch_library = neural_networks
arch_class = MLP
arch_pretrain_file = exp/TIMIT_MLP_basic/exp_files/train_TIMIT_tr_ep23_ck4_architecture1.pkl
arch_freeze = False
arch_seq_model = False
dnn_lay = 1024,1024,1024,1024,1896
dnn_drop = 0.15,0.15,0.15,0.15,0.0
dnn_use_laynorm_inp = False
dnn_use_batchnorm_inp = False
dnn_use_batchnorm = True,True,True,True,False
dnn_use_laynorm = False,False,False,False,False
dnn_act = relu,relu,relu,relu,softmax
arch_lr = 0.08
arch_halving_factor = 0.5
arch_improvement_threshold = 0.001
arch_opt = sgd
opt_momentum = 0.0
opt_weight_decay = 0.0
opt_dampening = 0.0
opt_nesterov = False

[model]
model_proto = proto/model.proto
model = out_dnn1=compute(MLP_layers1,mfcc)
        loss_final=cost_nll(out_dnn1,lab_cd)
        err_final=cost_err(out_dnn1,lab_cd)

[forward]
forward_out = out_dnn1
normalize_posteriors = True
normalize_with_counts_from = exp/TIMIT_MLP_basic/exp_files/forward_out_dnn1_lab_cd.count
save_out_file = False
require_decoding = True

[data_chunk]
fea = fea_name=mfcc
        fea_lst=exp/TIMIT_MLP_basic/exp_files/forward_TIMIT_test_ep23_ck0_mfcc.lst
        fea_opts=apply-cmvn --utt2spk=ark:/home/sysadmin/kaldi/egs/timit/s5/data/test/utt2spk  ark:/home/sysadmin/kaldi/egs/timit/s5/mfcc/cmvn_test.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |
        cw_left=5
        cw_right=5
lab = lab_name=lab_cd
        lab_folder=/home/sysadmin/kaldi/egs/timit/s5/exp/dnn4_pretrain-dbn_dnn_ali_test
        lab_opts=ali-to-pdf
        lab_count_file=auto
        lab_data_folder=/home/sysadmin/kaldi/egs/timit/s5/data/test/
        lab_graph=/home/sysadmin/kaldi/egs/timit/s5/exp/tri3/graph

@spencerkirn
Did you find a solution to this? I am having the exact same issue. I double-checked all the paths in my cfg file and the same error occurs.

Note: I am running PyTorch-Kaldi on WSL without CUDA (WSL still has no CUDA support); I'm not sure whether that makes a difference.

@mravanelli (Owner) commented Oct 2, 2019 via email

@spencerkirn
Thank you for the quick reply, and I apologize if these are basic questions; I am new to Kaldi and this toolkit. I ran copy-feats ark:/home/spencer/kaldi/egs/timit/s5/mfcc/raw_mfcc_dev.1.ark ark,t:- and it behaved just as you said it should, printing a lot of numbers to the terminal. I then ran copy-feats ark:/home/spencer/kaldi/egs/timit/s5/mfcc/raw_mfcc_dev.1.ark ark:- | apply-cmvn --utt2spk=ark:/home/spencer/kaldi/egs/timit/data/dev/utt2spk ark:/home/spencer/kaldi/egs/timit/s5/data/cmvn_dev.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark,t:- and got the attached error. One thing I noticed is that there is no cmvn_dev.ark in my data folder (no .ark files at all in that folder). Is that file meant to be an output, or should a .ark file already exist there? The error seems to be centered on that file.

(screenshot attached: TIMITError)

@mravanelli (Owner) commented Oct 3, 2019 via email

@spencerkirn
No, as I said, there are no .ark files in that folder (or its subfolders). I thought this might be an output folder, but it looks like the issue lies in the creation of those files.

@mravanelli (Owner) commented Oct 3, 2019 via email

@spencerkirn
Yeah, I had the wrong path for the cmvn file, but when I run copy-feats ark:/home/spencer/kaldi/egs/timit/s5/mfcc/raw_mfcc_test.1.ark ark,t:- | apply-cmvn --utt2spk=ark:/home/spencer/kaldi/egs/timit/s5/data/test/utt2spk ark:/home/spencer/kaldi/egs/timit/s5/mfcc/cmvn_test.ark ark:- ark:- I now get a Kaldi fatal error.

(screenshot attached: TIMITError2)

@spencerkirn commented Oct 25, 2019

In case anyone else has this issue: I resolved it by bypassing the if statement on line 328 of run_exp.py. There was some issue in how the shared_list object was being created that I could not figure out, but the else branch runs the run_nn function in the same fashion as the training and validation steps.

So I commented out line 328 and created another variable set to False to bypass that if statement:

test = False
# if _run_forwarding_in_subprocesses(config):
if test:

@mravanelli (Owner) commented Oct 25, 2019 via email

@spencerkirn
Yes, I checked all the paths in the config file and they were all correct. Bypassing that if statement, though, gave a result that looked very similar to the one in the tutorial.

(screenshot attached: TIMITResult)

@mravanelli (Owner) commented Oct 25, 2019 via email

@spencerkirn
There is still an error in the log.log file, apparently (I had not checked that file when I got the correct result). It is something to do with decode_dnn.sh: the forward_TIMIT_test_ep*_ck*_out_dnn1_to_decode.ark files are not being created for some reason, though this does not seem to affect the outcome.

(screenshot attached: TIMITError3)

@mravanelli (Owner) commented Oct 25, 2019 via email

@kumarh22
I am also getting an error in the testing phase.

------------------------------ Epoch 23 / 23 ------------------------------
 
----- Summary epoch 23 / 23
Training on ['TIMIT_tr']
Loss = 0.916 | err = 0.290 
-----
Validating on TIMIT_dev
Loss = 1.674 | err = 0.450 
-----
Learning rate on architecture1 = 0.0025 
-----
Elapsed time (s) = 3338

 
Testing TIMIT_test chunk = 1 / 1
Traceback (most recent call last):
  File "run_exp.py", line 475, in <module>
    data_set_inp, data_set_ref = convert_numpy_to_torch(data_set_dict, save_gpumem, use_cuda)
  File "/home/dev_ds/pytorch-kaldi/core.py", line 53, in convert_numpy_to_torch
    data_set_inp = torch.from_numpy(data_set_dict["input"]).float()
TypeError: expected np.ndarray (got NoneType)

When I print shared_list (print(shared_list)) in run_exp.py, it looks like this:

[None, None, None, {'mfcc': ['mfcc', '/home/dev_ds/kaldi_dnn/egs/timit/s5/exp/TIMIT_MLP_basic/exp_files/forward_TIMIT_test_ep23_ck0_mfcc.lst', 'apply-cmvn --utt2spk=ark:/home/dev_ds/kaldi_dnn/egs/timit/s5/data/test/utt2spk ark:/home/dev_ds/kaldi_dnn/egs/timit/s5/mfcc/cmvn_test.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |', '5', '5']}, {}, {'MLP_layers1': ['architecture1', 'MLP_layers1', 0]}, {'input': None, 'ref': None}]

I used the same validation data (dev) as the test data. Training and validation run with no errors, but testing with that same data throws the error.

@zhang7346
@kumarh22 I have the same problem as you. Have you solved it?

@zhang7346 commented Dec 27, 2019

@mravanelli I also get the error in the test phase.

Testing TIMIT_test chunk = 1 / 1
info [None, None, None, {'mfcc': ['mfcc', 'exp/TIMIT_MLP_basic/exp_files/forward_TIMIT_test_ep23_ck0_mfcc.lst', 'apply-cmvn --utt2spk=ark:/home/zhang/code/kaldi_maked/egs/timit/s5/data/dev/utt2spk ark:/home/zhang/code/kaldi_maked/egs/timit/s5/mfcc/cmvn_dev.ark ark:- ark:- | add-deltas --delta-order=2 ark:- ark:- |', '5', '5']}, {}, {'MLP_layers1': ['architecture1', 'MLP_layers1', 0]}, {'input': None, 'ref': None}]
Traceback (most recent call last):
  File "run_exp.py", line 476, in <module>
    data_set_inp, data_set_ref = convert_numpy_to_torch(data_set_dict, save_gpumem, use_cuda)
  File "/data00/home/zhang/code/pytorch-kaldi/core.py", line 53, in convert_numpy_to_torch
    data_set_inp = torch.from_numpy(data_set_dict["input"]).float()
TypeError: expected np.ndarray (got NoneType)

I "manually" read the features to debug, as you suggested above. Step 2 works, and step 3 does not raise an error either (for step 3, it runs for a very long time but finishes without error; the same holds for the eval file), and log.log contains just "prov dopo prima".
P.S. I am using Python 3.7 and the CPU-only build of torch 1.0.
Could you help me?

@TParcollet (Collaborator)

Is the problem happening if you use the validation or training set as the test set?

TParcollet reopened this Dec 27, 2019
@zhang7346 commented Dec 27, 2019

Is the problem happening if you use the validation or training set as the test set?

Yes. I used the validation set as the test set, but it still happens.

@zhang7346

I found that when I use the GPU version, the problem does not appear again.

Serhiy-Shekhovtsov added a commit to sciforce/pytorch-kaldi that referenced this issue May 29, 2020
@Serhiy-Shekhovtsov commented May 29, 2020

Had the same issue today. Here are some findings.

Why does it only happen when running on CPU?

Because when the CPU is used, the forward pass runs in a subprocess, and the method that runs the forward pass in a subprocess uses another version of the read_lab_fea method, read_lab_fea_refac01, while the same-process forward pass uses the original read_lab_fea method.

So why does it crash when using the read_lab_fea_refac01 method?

First of all, because it switches to production mode when reading fea_dict, lab_dict, and arch_dict. By removing that line I fixed the initial issue. But there is another problem: it also returns -1 as data_end_index, and run_nn will crash anyway.

How to fix:

You can update this method to return False. I tried to use read_lab_fea instead of read_lab_fea_refac01, but that crashes anyway when trying to unpack the shared_list: the shared_list has 6 items, not 7, because there is only one item for the data_end_index data.
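For readers following along, here is a minimal sketch of that fix. It assumes that "this method" refers to _run_forwarding_in_subprocesses in run_exp.py (the same check spencerkirn bypassed earlier in this thread) and that it normally returns True on CPU-only setups; check your own checkout before patching.

# Hedged sketch of the workaround discussed above: force the forward
# pass to run in the same process (the else branch used for training
# and validation) instead of the subprocess path that goes through
# read_lab_fea_refac01 and leaves data_set_dict['input'] as None.
def _run_forwarding_in_subprocesses(config):
    return False  # unconditionally take the in-process code path

This mirrors spencerkirn's test = False bypass; either variant sidesteps the broken subprocess path rather than repairing read_lab_fea_refac01 itself.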
