This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Does the corpus size affect the mapping learned? #167

Open
iamsainianuj opened this issue May 31, 2020 · 0 comments

iamsainianuj commented May 31, 2020

I have a corpus pair for two languages, Hindi and English, but the corpus is not very large: the separate monolingual embeddings obtained from fastText training contain only 78,391 vectors in total (49,806 English + 28,585 Hindi).

When I run the evaluate.py script with the following command, I get very poor results:
python3 evaluate.py --src_lang en --tgt_lang hi --src_emb dumped/debug/eng-hin/vectors-en.txt --tgt_emb dumped/debug/eng-hin/vectors-hi.txt --max_vocab 200000

The results are:

============ Initialized logger ============
INFO - 05/31/20 14:35:24 - 0:00:00 - cuda: True
dico_eval: default
emb_dim: 300
exp_id:
exp_name: debug
exp_path: /home/anuj/MUSE/dumped/debug/zgz7rlm5p8
max_vocab: 200000
normalize_embeddings:
src_emb: dumped/debug/eng-hin/vectors-en.txt
src_lang: en
tgt_emb: dumped/debug/eng-hin/vectors-hi.txt
tgt_lang: hi
verbose: 2
INFO - 05/31/20 14:35:24 - 0:00:00 - The experiment will be stored in /home/anuj/MUSE/dumped/debug/zgz7rlm5p8
INFO - 05/31/20 14:35:27 - 0:00:03 - Loaded 49806 pre-trained word embeddings.
INFO - 05/31/20 14:35:32 - 0:00:08 - Loaded 28585 pre-trained word embeddings.
INFO - 05/31/20 14:35:32 - 0:00:08 - ====================================================================
INFO - 05/31/20 14:35:32 - 0:00:08 - Dataset Found Not found Rho
INFO - 05/31/20 14:35:32 - 0:00:08 - ====================================================================
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_SEMEVAL17 263 125 0.4523
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_MTurk-771 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_VERB-143 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_MTurk-287 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_RG-65 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_YP-130 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_RW-STANFORD 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_MEN-TR-3k 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_WS-353-SIM 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_SIMLEX-999 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_WS-353-REL 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_MC-30 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - EN_WS-353-ALL 0 1 nan
INFO - 05/31/20 14:35:33 - 0:00:08 - ====================================================================
INFO - 05/31/20 14:35:33 - 0:00:08 - Monolingual source word similarity score average: nan
INFO - 05/31/20 14:35:33 - 0:00:08 - Found 1054 pairs of words in the dictionary (846 unique). 978 other pairs contained at least one unknown word (259 in lang1, 924 in lang2)
INFO - 05/31/20 14:35:33 - 0:00:08 - 846 source words - nn - Precision at k = 1: 0.000000
INFO - 05/31/20 14:35:33 - 0:00:08 - 846 source words - nn - Precision at k = 5: 0.118203
INFO - 05/31/20 14:35:33 - 0:00:08 - 846 source words - nn - Precision at k = 10: 0.118203
INFO - 05/31/20 14:35:33 - 0:00:08 - Found 1054 pairs of words in the dictionary (846 unique). 978 other pairs contained at least one unknown word (259 in lang1, 924 in lang2)
INFO - 05/31/20 14:35:33 - 0:00:09 - 846 source words - csls_knn_10 - Precision at k = 1: 0.000000
INFO - 05/31/20 14:35:33 - 0:00:09 - 846 source words - csls_knn_10 - Precision at k = 5: 0.000000
INFO - 05/31/20 14:35:33 - 0:00:09 - 846 source words - csls_knn_10 - Precision at k = 10: 0.000000
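The "Found ... pairs" and "unknown word" counts in the log above depend only on vocabulary coverage, so they can be checked separately from the mapping quality. Below is a minimal sketch of such a check, not the MUSE implementation. It assumes the embedding files are in the fastText text format (an optional `count dim` header line, then one `word v1 v2 ...` line per word) and that the evaluation dictionary has already been parsed into (source, target) pairs. The function names `read_vocab` and `dictionary_coverage`, and the toy data, are hypothetical and only for illustration.

```python
def read_vocab(lines, max_vocab=200000):
    """Collect the words from fastText-style text embedding lines.

    Assumption: an optional first line "n_words dim" (two integers),
    followed by lines of the form "word v1 v2 ... vd".
    """
    vocab = set()
    for i, line in enumerate(lines):
        parts = line.rstrip().split(" ")
        # Skip the optional header line.
        if i == 0 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        if len(vocab) >= max_vocab:
            break
        vocab.add(parts[0])
    return vocab


def dictionary_coverage(pairs, src_vocab, tgt_vocab):
    """Count how many dictionary pairs are usable given both vocabularies."""
    found = [(s, t) for s, t in pairs if s in src_vocab and t in tgt_vocab]
    return {
        "found_pairs": len(found),
        "unique_source_words": len({s for s, _ in found}),
        "not_found_pairs": len(pairs) - len(found),
        "unknown_in_source": sum(1 for s, _ in pairs if s not in src_vocab),
        "unknown_in_target": sum(1 for _, t in pairs if t not in tgt_vocab),
    }


# Toy data standing in for the real files (hypothetical values).
src_lines = ["3 2", "cat 0.1 0.2", "dog 0.3 0.4", "house 0.5 0.6"]
tgt_lines = ["2 2", "बिल्ली 0.1 0.2", "घर 0.3 0.4"]
pairs = [("cat", "बिल्ली"), ("dog", "कुत्ता"), ("house", "घर"), ("tree", "पेड़")]

stats = dictionary_coverage(pairs, read_vocab(src_lines), read_vocab(tgt_lines))
print(stats)
```

With real files, `lines` could be an open file handle, for example `open("dumped/debug/eng-hin/vectors-hi.txt", encoding="utf-8")`. A high unknown-word count on one side would point to vocabulary coverage as a limiting factor, independently of how well the two spaces are aligned.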

Am I doing something wrong, or is it just the size of the corpus that is affecting the results?

Kindly respond or comment on this issue.

Thank you
