
How to get similarity score with 2 sentences test #2

briancannon opened this issue Nov 9, 2017 · 9 comments

@briancannon

The model's output is a torch.cuda.FloatTensor. How can I get the actual similarity score between 2 sentences?

@tuzhucheng (Owner)

Check out this line:

predictions.append((predict_classes * output.data.exp()).sum(dim=1))
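In case it helps, here is a minimal sketch of what that line is doing, assuming the model's forward pass returns log-probabilities over the discrete relatedness classes 1..5 (which is what the KL-divergence loss and output.data.exp() suggest); the function name and shapes below are illustrative, not this repo's API:

import torch

def similarity_score(log_probs):
    # log_probs: (batch, num_classes) log-probabilities, e.g. log_softmax output
    num_classes = log_probs.size(1)  # 5 for SICK relatedness scores 1..5
    classes = torch.arange(1, num_classes + 1,
                           device=log_probs.device, dtype=log_probs.dtype)
    probs = log_probs.exp()              # back to probabilities
    return (probs * classes).sum(dim=1)  # expected score, i.e. the "real" similarity

# e.g. score = similarity_score(model(sent_a, sent_b)) gives one score per pair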

@briancannon (Author)

I tried that out and got the score.
But when I split the test data set into smaller sets (64 sentence pairs each) and evaluate each of them, I get different results:

INFO - Evaluation metrics for test
INFO - pearson_r spearman_r KL-divergence loss
INFO - test 0.587159 0.65102088053 1.398514747619629

INFO - Evaluation metrics for test
INFO - pearson_r spearman_r KL-divergence loss
INFO - test -0.0634823 -0.0976152631988 1.9832178354263306

INFO - Evaluation metrics for test
INFO - pearson_r spearman_r KL-divergence loss
INFO - test 0.680005 0.517980672901 1.0506935119628906

Why is that? Is the model correct?

@tuzhucheng (Owner)

You mean evaluating batches of the test set, each consisting of 64 sentence pairs?

@briancannon (Author)

Yes. I just want to evaluate on different, smaller test data sets, not in any particular order or anything like that.

@tuzhucheng (Owner)

You showed three different sets of "Evaluation metrics for test". I'm guessing you are wondering why the results differ so much.

Do you mind explaining what you did to get the pearson_r, spearman_r, etc. for those three sets of data?

@briancannon (Author)

You're right, and that's why I'm wondering.
The test data set has more than 4000 sentence pairs. I tried to evaluate on 3 smaller data sets, each with 64 sentence pairs, and got different pearson_r and spearman_r results.

Could you explain this to me? Thanks.

@tuzhucheng (Owner)

How many epochs did you train for?

If the model is not trained very well (high bias on the training set), then we can expect poor results on the smaller test sets, and they will vary wildly because there is variation among the different small test sets you created. However, once the model is trained properly (low bias on the training and dev sets), I think you can expect better test set metrics and more consistent performance across different test sets. Note that for the model to train well, the hyperparameters also play an extremely important role.
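As a rough illustration of that variance point (with synthetic numbers, not the actual SICK predictions), correlation metrics computed on 64-pair chunks fluctuate much more than the same metric on the full ~4000-pair test set:

import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n = 4000
gold = rng.uniform(1, 5, size=n)          # stand-in gold relatedness scores
pred = gold + rng.normal(0, 1.0, size=n)  # stand-in noisy model predictions

print("full test set pearson_r:", pearsonr(gold, pred)[0])
for _ in range(3):
    idx = rng.choice(n, size=64, replace=False)  # a random 64-pair chunk
    print("64-pair chunk pearson_r:", pearsonr(gold[idx], pred[idx])[0])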

@briancannon (Author)

I trained with:
python main.py mpcnn.sick.model --dataset sick --epochs 19 --epsilon 1e-7 --dropout 0
And got (full test data set):
INFO - Evaluation metrics for test
INFO - pearson_r spearman_r KL-divergence loss
INFO - test 0.867389 0.808621796372 0.46649816802241434

You can use split -l 64 a.txt split_a.txt, then randomly select one of the resulting files to evaluate and see the result.
I tried this because when printing predictions.append((predict_classes * output.data.exp()).sum(dim=1)), I found the similarity scores are quite different from the expected results.

@tuzhucheng (Owner)

Hmm, sorry I missed the notification.

Doing some error analysis is on my TODO list.
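A simple starting point could be sorting pairs by the gap between the predicted and gold scores; the names and values below are hypothetical placeholders (a sketch, not code from this repo):

# hypothetical placeholders; fill with your per-pair scores from evaluation
predicted = [4.2, 1.3, 3.8]
gold = [4.9, 1.1, 2.0]

pairs = sorted(zip(predicted, gold), key=lambda pg: abs(pg[0] - pg[1]), reverse=True)
for pred_score, gold_score in pairs[:10]:  # up to ten largest prediction/gold gaps
    print("predicted=%.2f  gold=%.2f  gap=%.2f"
          % (pred_score, gold_score, abs(pred_score - gold_score)))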
