Does the corpus size affect the mapping learned? #167

iamsainianuj · 2020-05-31T09:10:27Z

I have a corpus pair for two languages, Hindi and English, but the corpus is small: the separate monolingual embeddings obtained from fastText training contain only 78391 vectors in total (49806 English + 28585 Hindi).
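To double-check the vector counts above, the header of each embedding file can be read directly. This is a minimal sketch that assumes the files use the word2vec-style text layout that fastText writes for `.vec` files, where the first line is `<vocab_size> <dim>`. The helper name `read_embedding_header` is hypothetical, and the in-memory sample stands in for the real file.

```python
import io

def read_embedding_header(stream):
    """Return (vocab_size, dim) from the first line of a text embedding file.

    Assumes a word2vec-style header line of the form "<vocab_size> <dim>".
    """
    header = stream.readline().split()
    if len(header) != 2:
        raise ValueError("first line is not a '<vocab_size> <dim>' header")
    return int(header[0]), int(header[1])

# Toy stand-in for:
#   open("dumped/debug/eng-hin/vectors-en.txt", encoding="utf-8")
# Only the header line matters here; the vector line is a placeholder.
sample = io.StringIO("49806 300\nthe 0.1 0.2\n")
vocab_size, dim = read_embedding_header(sample)
print(vocab_size, dim)
```

If the header is missing or malformed, the helper raises a `ValueError` instead of silently miscounting.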

When I run the evaluate.py script with the following command, I get very poor results:
python3 evaluate.py --src_lang en --tgt_lang hi --src_emb dumped/debug/eng-hin/vectors-en.txt --tgt_emb dumped/debug/eng-hin/vectors-hi.txt --max_vocab 200000

The results are:

============ Initialized logger ============
INFO - 05/31/20 14:35:24 - 0:00:00 - cuda: True
dico_eval: default
emb_dim: 300
exp_id:
exp_name: debug
exp_path: /home/anuj/MUSE/dumped/debug/zgz7rlm5p8
max_vocab: 200000
normalize_embeddings:
src_emb: dumped/debug/eng-hin/vectors-en.txt
src_lang: en
tgt_emb: dumped/debug/eng-hin/vectors-hi.txt
tgt_lang: hi
verbose: 2
INFO - 05/31/20 14:35:24 - 0:00:00 - The experiment will be stored in /home/anuj/MUSE/dumped/debug/zgz7rlm5p8
INFO - 05/31/20 14:35:27 - 0:00:03 - Loaded 49806 pre-trained word embeddings.
INFO - 05/31/20 14:35:32 - 0:00:08 - Loaded 28585 pre-trained word embeddings.
INFO - 05/31/20 14:35:32 - 0:00:08 - ====================================================================
INFO - 05/31/20 14:35:32 - 0:00:08 - Dataset Found Not found Rho
INFO - 05/31/20 14:35:32 - 0:00:08 - ====================================================================
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_SEMEVAL17 263 125 0.4523
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_MTurk-771 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_VERB-143 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_MTurk-287 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_RG-65 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_YP-130 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_RW-STANFORD 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_MEN-TR-3k 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_WS-353-SIM 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_SIMLEX-999 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_WS-353-REL 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_MC-30 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_WS-353-ALL 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - ====================================================================
INFO - 05/31/20 14:35:33 - 0:00:08 - Monolingual source word similarity score average: nan
INFO - 05/31/20 14:35:33 - 0:00:08 - Found 1054 pairs of words in the dictionary (846 unique). 978 other pairs contained at least one unknown word (259 in lang1, 924 in lang2)
INFO - 05/31/20 14:35:33 - 0:00:08 - 846 source words - nn - Precision at k = 1: 0.000000
INFO - 05/31/20 14:35:33 - 0:00:08 - 846 source words - nn - Precision at k = 5: 0.118203
INFO - 05/31/20 14:35:33 - 0:00:08 - 846 source words - nn - Precision at k = 10: 0.118203
INFO - 05/31/20 14:35:33 - 0:00:08 - Found 1054 pairs of words in the dictionary (846 unique). 978 other pairs contained at least one unknown word (259 in lang1, 924 in lang2)
INFO - 05/31/20 14:35:33 - 0:00:09 - 846 source words - csls_knn_10 - Precision at k = 1: 0.000000
INFO - 05/31/20 14:35:33 - 0:00:09 - 846 source words - csls_knn_10 - Precision at k = 5: 0.000000
INFO - 05/31/20 14:35:33 - 0:00:09 - 846 source words - csls_knn_10 - Precision at k = 10: 0.000000
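
As a sanity check on the log above, the dictionary coverage can be worked out from the reported pair counts. This is a small sketch that uses only numbers copied from the log; the variable names are illustrative. The last step rests on an assumption: that the precision values are printed as percentages. Under that reading, 0.118203 corresponds to a single source word out of 846.

```python
# Counts copied from the evaluation log above.
found_pairs = 1054    # dictionary pairs with both words in the vocabularies
missing_pairs = 978   # pairs containing at least one unknown word
source_words = 846    # unique source words that were evaluated

total_pairs = found_pairs + missing_pairs
coverage = found_pairs / total_pairs
print(f"{found_pairs}/{total_pairs} dictionary pairs covered ({coverage:.1%})")

# Assumption: precision is logged as a percentage, so one correct
# source word out of 846 would be printed as 100 * 1 / 846.
one_hit_percent = 100 * 1 / source_words
print(f"one correct word out of {source_words} = {one_hit_percent:.6f}")
```

Roughly half of the evaluation dictionary is not covered by the two vocabularies, with most of the unknown words on the Hindi side (924 in lang2 versus 259 in lang1).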

Am I doing something wrong, or is it just the size of the corpus that is affecting the results?

Kindly respond or comment on this issue.

Thank you


iamsainianuj changed the title ~~Does the corpus size affectes the mapping learned?~~ Does the corpus size affect the mapping learned? May 31, 2020
