Performance on quora qa data set #7

Open
Chandrak1907 opened this issue Jan 13, 2020 · 1 comment

Comments

@Chandrak1907

I ran this model on the Quora question-pairs data set (http://qim.fs.quoracdn.net/quora_duplicate_questions.tsv). The resulting confusion matrix is below (rows are the true is_duplicate label, columns are the model output):
                 | Model_output = 0 | Model_output = 1
is_duplicate = 0 |          218,328 |           36,696
is_duplicate = 1 |           72,739 |           76,524
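
To put the table in more familiar terms, here is a minimal sketch that turns the four counts above into accuracy, precision, recall and F1. It assumes the rows are the true is_duplicate label and the columns are the thresholded model output, as in the crosstab call in the code below; the variable names are only illustrative.

```python
# Counts copied from the table above
# (rows: true is_duplicate label, columns: model output).
tn, fp = 218_328, 36_696   # is_duplicate = 0
fn, tp = 72_739, 76_524    # is_duplicate = 1

total = tn + fp + fn + tp

# Standard binary classification metrics, treating "duplicate" as the positive class.
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"pairs scored: {total}")
print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} f1={f1:.3f}")
```

Read this way, the model flags only about half of the true duplicates at this threshold, so recall is the weaker of the two error rates.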

Do you have any suggestions for improving the performance of the model?

Code is here:

import pandas as pd
from semantic_text_similarity.models import WebBertSimilarity
from semantic_text_similarity.models import ClinicalBertSimilarity

web_model = WebBertSimilarity(device='cuda', batch_size=10)  # defaults to GPU prediction

# Sanity check on a single sentence pair
web_model.predict([("She won an olympic gold medal","The women is an olympic champion")])

# Quora

def check_score(row):
    # Score one question pair; predict() takes a list of pairs and returns one score per pair
    return web_model.predict([(row['question1'], row['question2'])])[0]

t2 = pd.read_csv("./quora_duplicate_questions.tsv", sep='\t')
t3 = t2.dropna().copy()  # copy so the new column is not assigned on a slice of t2
t3['model_score'] = t3.apply(check_score, axis=1)
t3.to_csv("./t3_Jan10.csv", index=False)

t3 = pd.read_csv("./t3_Jan10.csv")
# Mean model score for non-duplicate and duplicate pairs
t3[t3.is_duplicate==0]['model_score'].mean()
t3[t3.is_duplicate==1]['model_score'].mean()

# Mark a pair as duplicate when its score exceeds 3.71
t3['model_output'] = 0
t3.loc[t3.model_score > 3.71, 'model_output'] = 1
pd.crosstab(t3.is_duplicate, t3.model_output)
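
The 3.71 cutoff above is a single fixed value. One way to see how much the result depends on it is to sweep candidate thresholds and compare F1 at each. The sketch below is only an illustration under stated assumptions: `best_f1_threshold` is a hypothetical helper, not part of semantic_text_similarity; the toy scores and labels are made up; and the candidate grid assumes scores on roughly a 0 to 5 scale. On the real data one would pass `t3.model_score` and `t3.is_duplicate`, ideally picking the threshold on one split and reporting on another.

```python
import numpy as np

def best_f1_threshold(scores, labels, candidates):
    """Return (threshold, f1) for the candidate cutoff with the highest F1.

    A pair is predicted duplicate when score > threshold, matching the
    rule used with the 3.71 cutoff. Hypothetical helper for illustration.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_t, best_f1 = None, -1.0
    for t in candidates:
        pred = scores > t
        tp = np.sum(pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        fn = np.sum(~pred & (labels == 1))
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:  # strict '>' keeps the lowest threshold on ties
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1

# Made-up toy data, NOT results from the Quora set.
toy_scores = [0.5, 1.2, 2.8, 3.1, 3.9, 4.4]
toy_labels = [0, 0, 1, 0, 1, 1]

threshold, f1 = best_f1_threshold(
    toy_scores, toy_labels, candidates=np.arange(0.0, 5.01, 0.25)
)
print(threshold, round(f1, 3))
```

F1 is used here only as an example objective; if false positives and false negatives have different costs, a different metric may be the better target.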

@AndriyMulyar
Owner

AndriyMulyar commented Jan 13, 2020 via email
