This repository has been archived by the owner on Oct 31, 2023. It is now read-only.
I have a corpus pair for two languages, Hindi and English, but the corpus is not very large: the separate monolingual embeddings obtained from fastText training contain only 78,391 vectors in total (49,806 English + 28,585 Hindi).
When I run the evaluate.py script with the following command, I get very poor results:
python3 evaluate.py --src_lang en --tgt_lang hi --src_emb dumped/debug/eng-hin/vectors-en.txt --tgt_emb dumped/debug/eng-hin/vectors-hi.txt --max_vocab 200000
The results are:
============ Initialized logger ============
INFO - 05/31/20 14:35:24 - 0:00:00 - cuda: True
dico_eval: default
emb_dim: 300
exp_id:
exp_name: debug
exp_path: /home/anuj/MUSE/dumped/debug/zgz7rlm5p8
max_vocab: 200000
normalize_embeddings:
src_emb: dumped/debug/eng-hin/vectors-en.txt
src_lang: en
tgt_emb: dumped/debug/eng-hin/vectors-hi.txt
tgt_lang: hi
verbose: 2
INFO - 05/31/20 14:35:24 - 0:00:00 - The experiment will be stored in /home/anuj/MUSE/dumped/debug/zgz7rlm5p8
INFO - 05/31/20 14:35:27 - 0:00:03 - Loaded 49806 pre-trained word embeddings.
INFO - 05/31/20 14:35:32 - 0:00:08 - Loaded 28585 pre-trained word embeddings.
INFO - 05/31/20 14:35:32 - 0:00:08 - ====================================================================
INFO - 05/31/20 14:35:32 - 0:00:08 - Dataset Found Not found Rho
INFO - 05/31/20 14:35:32 - 0:00:08 - ====================================================================
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_SEMEVAL17 263 125 0.4523
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_MTurk-771 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_VERB-143 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_MTurk-287 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_RG-65 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_YP-130 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_RW-STANFORD 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_MEN-TR-3k 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_WS-353-SIM 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_SIMLEX-999 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_WS-353-REL 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_MC-30 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_WS-353-ALL 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - ====================================================================
INFO - 05/31/20 14:35:33 - 0:00:08 - Monolingual source word similarity score average: nan
INFO - 05/31/20 14:35:33 - 0:00:08 - Found 1054 pairs of words in the dictionary (846 unique). 978 other pairs contained at least one unknown word (259 in lang1, 924 in lang2)
INFO - 05/31/20 14:35:33 - 0:00:08 - 846 source words - nn - Precision at k = 1: 0.000000
INFO - 05/31/20 14:35:33 - 0:00:08 - 846 source words - nn - Precision at k = 5: 0.118203
INFO - 05/31/20 14:35:33 - 0:00:08 - 846 source words - nn - Precision at k = 10: 0.118203
INFO - 05/31/20 14:35:33 - 0:00:08 - Found 1054 pairs of words in the dictionary (846 unique). 978 other pairs contained at least one unknown word (259 in lang1, 924 in lang2)
INFO - 05/31/20 14:35:33 - 0:00:09 - 846 source words - csls_knn_10 - Precision at k = 1: 0.000000
INFO - 05/31/20 14:35:33 - 0:00:09 - 846 source words - csls_knn_10 - Precision at k = 5: 0.000000
INFO - 05/31/20 14:35:33 - 0:00:09 - 846 source words - csls_knn_10 - Precision at k = 10: 0.000000
Am I doing anything wrong, or is it just the corpus size that is affecting the results?
Kindly respond or comment on this issue.
Thank you
iamsainianuj changed the title from "Does the corpus size affectes the mapping learned?" to "Does the corpus size affect the mapping learned?" on May 31, 2020.