Missing labels in training / decoding in tf_clean branch #169
Kalpesh,
Ramon would know best about the "v1-tf" recipe, but I can see an error message that says "Can't open data/local/dict_phn/lexicon1.txt: No such file or directory at local/swbd1_map_words.pl line 26.", which shows that you did not run the "phn" recipe before running the "char" recipe. You need to run it first, so that both recipes use the same vocabulary. Next, you can configure the location of the temp folder in path.sh; change it to "/tmp" or similar if you don't have "/scratch", which is the default on our cluster. There is also "exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/model/config.pkl: line 135: syntax error: unexpected end of file", which suggests that the training may not have started correctly, or not at all.
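Both points can be checked before launching the recipe. Below is a minimal sketch, written in Python rather than in the shell syntax of path.sh, and not part of EESEN: the helper names `pick_tmp_base` and `missing_files` are made up for illustration, and the only paths used are the ones quoted in this thread.

```python
import os
import tempfile


def pick_tmp_base(candidates=("/scratch", "/tmp")):
    """Return the first candidate that is an existing, writable directory."""
    for path in candidates:
        if os.path.isdir(path) and os.access(path, os.W_OK):
            return path
    # Fall back to whatever the platform considers its temp directory.
    return tempfile.gettempdir()


def missing_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]


if __name__ == "__main__":
    print("temp base:", pick_tmp_base())
    # Path taken from the error message; run this from the recipe directory.
    print("missing:", missing_files(["data/local/dict_phn/lexicon1.txt"]))
```

If the second line prints a non-empty list, the "phn" dictionary has not been prepared yet.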
Let me know if you have any other questions!
Florian
… On Jan 28, 2018, at 7:07 AM, Kalpesh Krishna ***@***.***> wrote:
Hello,
I am trying to run the TensorFlow-based EESEN setup for Switchboard. More specifically, I am using the tf_clean branch and trying to run the asr_egs/swbd/v1-tf/run_ctc_char.sh script. I am having some trouble with the training and decoding steps and would appreciate your help! @ramonsanabria <https://github.com/ramonsanabria> , @fmetze <https://github.com/fmetze>
During stage 3 (training), I get a number of messages of the form:
********************************************************************************
********************************************************************************
Warning: sw02018-B_012508-012721 has not been found in labels file: /scratch/tmp.1hi5uR4EIR/labels.cv
********************************************************************************
********************************************************************************
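A warning like the one above is what you would expect when an utterance appears in the feature list but has no entry in the labels file. The following is a rough sketch of that kind of filtering, not the actual EESEN code. It assumes that every scp and label line starts with the utterance id, and the sample entries are made up.

```python
def read_ids(lines):
    """Collect the first whitespace-separated field (the utterance id) of each non-empty line."""
    return {line.split(None, 1)[0] for line in lines if line.strip()}


def filter_scp(scp_lines, label_lines):
    """Split scp lines into those with a label entry and the ids that have none."""
    labelled = read_ids(label_lines)
    kept, missing = [], []
    for line in scp_lines:
        if not line.strip():
            continue
        utt_id = line.split(None, 1)[0]
        if utt_id in labelled:
            kept.append(line)
        else:
            missing.append(utt_id)
    return kept, missing


if __name__ == "__main__":
    # Made-up entries, only to show the shape of the data.
    scp = ["utt1 feats.ark:10", "utt2 feats.ark:250", "utt3 feats.ark:990"]
    labels = ["utt1 4 8 15", "utt3 16 23"]
    kept, missing = filter_scp(scp, labels)
    print("original scp length:", len(scp))
    print("scp deleted:", len(missing))
    print("final scp length:", len(kept))
```

Printing `missing` instead of its length gives the list of utterance ids to look up in the transcripts.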
The training logs follow. I suspect that "creating tr_y from scratch" is a problem?
cleaning done: /scratch/tmp.1hi5uR4EIR/cv_local.scp
original scp length: 4000
scp deleted: 270
final scp length: 3730
number of labels not found: 270
TRAINING STARTS [2018-Jan-28 06:02:05]
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
2.7.13 |Anaconda 4.3.1 (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
('now:', 'Sun 2018-01-28 06:02:08')
('tf:', '1.1.0')
('cwd:', '/share/data/lang/users/kalpesh/eesen/asr_egs/swbd/v1-tf')
('library:', '/share/data/lang/users/kalpesh/eesen')
('git:', 'heads/master-dirty')
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
reading training set
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
tr_x:
--------------------------------------------------------------------------------
non augmented (mix) training set found for language: no_name_language ...
preparing dictionary for no_name_language...
ordering all languages (from scratch) train batches...
Augmenting data x3 and win 3...
--------------------------------------------------------------------------------
tr_y:
--------------------------------------------------------------------------------
creating tr_y from scratch...
unilanguage setup detected (in labels)...
--------------------------------------------------------------------------------
cv_x:
--------------------------------------------------------------------------------
unilingual set up detected on test or set language...
cv (feats) found for language: no_name_language ...
preparing dictionary for no_name_language...
ordering all languages (from scratch) cv batches...
Augmenting data x3 and win 3...
--------------------------------------------------------------------------------
cv_y:
--------------------------------------------------------------------------------
creating cv_y from scratch...
unilanguage setup detected (in labels)...
languages checked ...
(cv_x vs cv_y vs tr_x vs tr_y)
Finally, here are my decoding logs:
(python2.7_tf1.4) ***@***.***:v1-tf$ ./run_ctc_char.sh
=====================================================================
Decoding eval200 using AM
=====================================================================
./steps/decode_ctc_am_tf.sh --config exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/model/config.pkl --data ./data/eval2000/ --weights exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/model/epoch25.ckpt --results exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/results/epoch25
exp/train_char_l4_c320_mdeepbilstm_w3_nfalse/model/config.pkl: line 135: syntax error: unexpected end of file
copy-feats 'ark,s,cs:apply-cmvn --norm-vars=true --utt2spk=ark:./data/eval2000//utt2spk scp:./data/eval2000//cmvn.scp scp:./data/eval2000//feats.scp ark:- |' ark,scp:/scratch/tmp.GgS1if0Wex/f.ark,/scratch/tmp.GgS1if0Wex/test_local.scp
apply-cmvn --norm-vars=true --utt2spk=ark:./data/eval2000//utt2spk scp:./data/eval2000//cmvn.scp scp:./data/eval2000//feats.scp ark:-
LOG (apply-cmvn[5.3.85~1-35950]:main():apply-cmvn.cc:159) Applied cepstral mean and variance normalization to 4458 utterances, errors on 0
LOG (copy-feats[5.3.85~1-35950]:main():copy-feats.cc:143) Copied 4458 feature matrices.
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
2.7.13 |Anaconda 4.3.1 (64-bit)| (default, Dec 20 2016, 23:09:15)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
('now:', 'Sun 2018-01-28 06:05:28')
('tf:', '1.1.0')
('cwd:', '/share/data/lang/users/kalpesh/eesen/asr_egs/swbd/v1-tf')
('library:', '/share/data/lang/users/kalpesh/eesen')
('git:', 'heads/master-dirty')
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
reading testing set
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
test_x:
--------------------------------------------------------------------------------
unilingual set up detected on test or set language...
test (feats) found for language: no_name_language ...
preparing dictionary for no_name_language...
ordering all languages (from scratch) test batches...
Augmenting data x3 and win 3...
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
test_y (for ter computation):
--------------------------------------------------------------------------------
unilanguage setup detected (in labels)...
no label files fins in /scratch/tmp.GgS1if0Wex with info_set: test
file: /share/data/lang/users/kalpesh/eesen/tf/ctc-am/reader/labels_reader/labels_reader.py function: __read_one_language line: 171
exiting...
Here are my logs from the first two stages (data preparation, fbank generation):
(python2.7_tf1.4) ***@***.***:v1-tf$ ./run_ctc_char.sh
=====================================================================
Data Preparation
=====================================================================
Switchboard-1 data preparation succeeded.
utils/fix_data_dir.sh: filtered data/train/segments from 264333 to 264072 lines based on filter /scratch/tmp.V26jBobg4D/recordings.
utils/fix_data_dir.sh: filtered /scratch/tmp.V26jBobg4D/speakers from 4876 to 4870 lines based on filter data/train/cmvn.scp.
utils/fix_data_dir.sh: filtered data/train/spk2utt from 4876 to 4870 lines based on filter /scratch/tmp.V26jBobg4D/speakers.
fix_data_dir.sh: kept 263890 utterances out of 264072
fix_data_dir.sh: old files are kept in data/train/.backup
Can't open data/local/dict_phn/lexicon1.txt: No such file or directory at local/swbd1_map_words.pl line 26.
Character-based dictionary (word spelling) preparation succeeded
Warning: for utterances en_4910-B_013563-013763 and en_4910-B_013594-013790, segments already overlap; leaving these times unchanged.
Warning: for utterances en_4910-B_025539-025791 and en_4910-B_025541-025674, segments already overlap; leaving these times unchanged.
Warning: for utterances en_4910-B_032263-032658 and en_4910-B_032299-032406, segments already overlap; leaving these times unchanged.
Warning: for utterances en_4910-B_035678-035757 and en_4910-B_035715-035865, segments already overlap; leaving these times unchanged.
Data preparation and formatting completed for Eval 2000
(but not MFCC extraction)
fix_data_dir.sh: kept 4458 utterances out of 4466
fix_data_dir.sh: old files are kept in data/eval2000/.backup
=====================================================================
FBank Feature Generation
=====================================================================
steps/make_fbank.sh --cmd run.pl --nj 32 data/train exp/make_fbank_pitch/train fbank_pitch
steps/make_fbank.sh: moving data/train/feats.scp to data/train/.backup
utils/validate_data_dir.sh: Successfully validated data-directory data/train
steps/make_fbank.sh [info]: segments file exists: using that.
Succeeded creating filterbank features for train
steps/compute_cmvn_stats.sh data/train exp/make_fbank_pitch/train fbank_pitch
Succeeded creating CMVN stats for train
fix_data_dir.sh: kept all 263890 utterances.
fix_data_dir.sh: old files are kept in data/train/.backup
steps/make_fbank.sh --cmd run.pl --nj 10 data/eval2000 exp/make_fbank_pitch/eval2000 fbank_pitch
steps/make_fbank.sh: moving data/eval2000/feats.scp to data/eval2000/.backup
utils/validate_data_dir.sh: Successfully validated data-directory data/eval2000
steps/make_fbank.sh [info]: segments file exists: using that.
Succeeded creating filterbank features for eval2000
steps/compute_cmvn_stats.sh data/eval2000 exp/make_fbank_pitch/eval2000 fbank_pitch
Succeeded creating CMVN stats for eval2000
fix_data_dir.sh: kept all 4458 utterances.
fix_data_dir.sh: old files are kept in data/eval2000/.backup
utils/subset_data_dir.sh: reducing #utt from 263890 to 4000
utils/subset_data_dir.sh: reducing #utt from 263890 to 259890
utils/subset_data_dir.sh: reducing #utt from 259890 to 100000
Reduced number of utterances from 100000 to 76615
Using fix_data_dir.sh to reconcile the other files.
fix_data_dir.sh: kept 76615 utterances out of 100000
fix_data_dir.sh: old files are kept in data/train_100k_nodup/.backup
Reduced number of utterances from 259890 to 192701
Using fix_data_dir.sh to reconcile the other files.
fix_data_dir.sh: kept 192701 utterances out of 259890
fix_data_dir.sh: old files are kept in data/train_nodup/.backup
|
Hi @fmetze ,
Yes, I hadn't run the
This is an irrelevant error, it's happening since a pickle configuration file is sourced in the
The training did happen successfully. Here are the training logs. As a confirmation, is it usual for the Kaldi setup to discard 270 dev utterances, 11 eval2000 utterances and 973 train utterances due to transcripts like
However, the decoding does not seem to budge. Here are the logs. The suspicious lines seem to be
|
Good. I am not sure about the pickle error, but if you say it does not affect the training, then things should be fine. You should be able to run the test script from stage 4 only for decoding, since the data should already be prepared. @ramonsanabria - any ideas about v1-tf here? |
Hi, the pickle error is irrelevant. The configuration is loaded properly. I will try to remove it as soon as I have time. @xinjli is cleaning up the swbd recipe. I have some experiments with different char-based units (removing numbers and noises) that, for now, seem to be improving things a bit. |
Today I also found that the char recipe cannot run without the phn recipe. The same issue also occurs in the swbd v1 recipe under the master branch. I will prepare a fix for it. |
Hi @ramonsanabria , @xinjli |
can you do: find /scratch/tmp.jihiXHPJkp ?
2018-02-07 2:02 GMT-05:00 Kalpesh Krishna <[email protected]>:
… Hi @ramonsanabria <https://github.com/ramonsanabria> , @xinjli
<https://github.com/xinjli>
Any idea about the no label files fins in /scratch/tmp.jihiXHPJkp with
info_set: test error I am receiving?
|
@ramonsanabria yes, I can find it.
I checked the code, the system searches for a file named What is the correct way to integrate this script into EESEN? |
I think we need a stage to generate labels.test for testing. It seems that we do not have any script for this now. |
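Until such a stage exists, a quick check of the temp directory shows which label files are actually there. This is only a sketch: the function name `find_label_files` is hypothetical, and the `labels.<set>` naming is inferred from the `labels.cv` and `labels.test` names mentioned in this thread.

```python
import os


def find_label_files(directory, sets=("cv", "test")):
    """Map each set name to the path of labels.<set> in directory, or None if absent."""
    found = {}
    for name in sets:
        path = os.path.join(directory, "labels." + name)
        found[name] = path if os.path.isfile(path) else None
    return found


if __name__ == "__main__":
    # Directory name copied from the error message earlier in the thread.
    for name, path in find_label_files("/scratch/tmp.jihiXHPJkp").items():
        print(name, "->", path if path else "missing")
```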
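We can probably use the following command to generate labels.test:
python ./local/swbd1_prepare_char_dict_tf.py --text_file ./data/eval2000/text --input_units ./data/local/dict_char/units.txt --output_labels $dir_am/labels.test
eval2000 contains the text we need for evaluation; just replace $dir_am with the corresponding directory in your environment. |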
You have:
https://github.com/srvk/eesen/blob/tf_clean/asr_egs/swbd/v1-tf/local/swbd1_prepare_char_dict_tf.py
This script can generate units.txt. If you pass --output_units, it will
produce the units that you will use later on (presumably from your
train text). The units produced this way are then passed as
--input_units to generate labels.cv or labels.test.
I am not sure which version is in the repository, but I did some cleaning
of swbd that we should discuss.
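To make the units-to-labels step concrete, here is a small sketch of the idea. It is not the logic of swbd1_prepare_char_dict_tf.py. It assumes a units file with one "unit id" pair per line and a Kaldi-style text file with "utt_id word word ..." lines. The `<space>` unit and the choice to skip utterances that contain unknown symbols are assumptions made for illustration only.

```python
def load_units(lines):
    """Parse 'unit id' lines into a dict mapping unit -> integer id."""
    units = {}
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            units[parts[0]] = int(parts[1])
    return units


def text_to_labels(text_lines, units, space_unit="<space>"):
    """Turn 'utt_id word word ...' lines into 'utt_id id id ...' lines, one id per character."""
    out = []
    for line in text_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # no transcript: skip the utterance
        utt_id, words = parts[0], parts[1:]
        chars = []
        for i, word in enumerate(words):
            if i > 0:
                chars.append(space_unit)  # hypothetical word-boundary unit
            chars.extend(word)
        if any(c not in units for c in chars):
            continue  # contains a symbol outside the unit inventory
        out.append(utt_id + " " + " ".join(str(units[c]) for c in chars))
    return out


if __name__ == "__main__":
    units = load_units(["<space> 1", "a 2", "b 3"])
    for line in text_to_labels(["utt1 ab ba", "utt2 abc", "utt3"], units):
        print(line)
```

With this toy inventory only utt1 survives: utt2 contains a character that is not in the units, and utt3 has no transcript.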
2018-02-07 16:37 GMT-05:00 Xinjian Li <[email protected]>:
… Probably we can use following code to generate labels.test
python ./local/swbd1_prepare_char_dict_tf.py --text_file
./data/eval2000/text --input_units ./data/local/dict_char/units.txt
--output_labels $dir_am/labels.test
eval2000 contains the text we need for evaluation and just replace $dir_am
with variable in your environment
|
@ramonsanabria could you describe the process you are using to compute the final WER of a trained model? I guess this is often called "scoring" in the Kaldi setup. Generally, raw transcripts are fed into |
Hi @ramonsanabria any update on ^? Also, how have you treated the space character? I cannot find an entry for the space in |
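For reference while the scoring question is open, the WER itself is the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below computes just that. It is not the scoring used by this recipe or by Kaldi, and it applies none of the transcript normalization that a real scoring pipeline would do before comparing.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(refs, hyps):
    """Word error rate in percent over paired reference/hypothesis strings."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return 100.0 * errors / words if words else 0.0


if __name__ == "__main__":
    # One deleted word out of four reference words.
    print(wer(["a b c d"], ["a b d"]))
```

For a character-based model, the hypothesis characters have to be joined back into words first, which is where the treatment of the space unit matters.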