
Retriever Vectorizer is the bottleneck : Improving over TF-IDF & BM25 vectorizers #344

Open
raghavgurbaxani opened this issue Feb 20, 2020 · 0 comments

@raghavgurbaxani
Hi,
Thanks for the great work on this project. This is a very helpful library for closed domain Q&A.
That said, my experiments suggest that the retriever is the performance bottleneck (the reader performs quite well).

Looking into the code and the overall architecture also points to the retriever:
[screenshot: retriever → reader pipeline architecture]

The BERT reader is only invoked on the initial candidates returned by TF-IDF or BM25, so if those vectorizers miss the paragraphs that actually contain the answer, the reader has no chance of producing it. In other words, end-to-end accuracy is capped by the retriever's recall, no matter how accurate the reader is.
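To make that concrete, here is a rough sketch of the kind of check I have in mind (the `retrieve` method and the other names are placeholders, not this library's actual API): measure how often the paragraph containing the gold answer appears in the retriever's top-k, since end-to-end exact match can never exceed that number.

```python
# Rough sketch, not this library's API: recall@k of the retriever upper-bounds
# the accuracy of the whole pipeline.
def retriever_recall_at_k(retriever, dataset, k=10):
    """dataset: list of (question, gold_paragraph_id) pairs."""
    hits = 0
    for question, gold_paragraph_id in dataset:
        top_ids = retriever.retrieve(question, top_k=k)  # hypothetical retrieve method
        hits += int(gold_paragraph_id in top_ids)
    return hits / len(dataset)

# End-to-end accuracy <= retriever_recall_at_k(...), which is why the vectorizer
# is the bottleneck even when the BERT reader itself performs well.
```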

Do you have any thoughts on how to improve retriever accuracy, for example by using deep-learning-based information retrieval (perhaps sentence-similarity metrics)? Any suggestions for more advanced vectorizers?
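As an illustration of the sentence-similarity idea, here is a minimal sketch using the external sentence-transformers library (the model name and all variable names are just examples, nothing here is part of this repo): embed the paragraphs and the question, rank paragraphs by cosine similarity, and hand the top-k candidates to the BERT reader as before.

```python
# Illustrative only: dense retrieval via sentence embeddings instead of TF-IDF/BM25.
# Requires the external sentence-transformers package; not part of this project.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")  # any sentence encoder would do

paragraphs = ["paragraph one ...", "paragraph two ..."]  # the document store's paragraphs
question = "example question?"

# Encode paragraphs once offline; encode the question at query time.
para_emb = np.asarray(model.encode(paragraphs))
q_emb = np.asarray(model.encode([question]))

# Cosine similarity between the question and every paragraph.
para_emb = para_emb / np.linalg.norm(para_emb, axis=1, keepdims=True)
q_emb = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
scores = (para_emb @ q_emb.T).ravel()

# Take the top-k candidates and pass them to the BERT reader as before.
top_k = np.argsort(-scores)[:10]
candidates = [paragraphs[i] for i in top_k]
```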

Thanks. :)

@raghavgurbaxani raghavgurbaxani changed the title Retriever is the bottleneck : Improving over TF-IDF & BM25 vectorizers Retriever Vectorizer is the bottleneck : Improving over TF-IDF & BM25 vectorizers Feb 20, 2020