Elastic indexer #105

Merged
merged 23 commits into from
Jan 6, 2020

Conversation

AvinashBukkittu
Collaborator

This PR

  • Adds evaluator support to the pipeline: an evaluator can now be attached to the pipeline, and calling evaluate() on the pipeline runs the evaluation over a dataset.
  • Adds ElasticIndexer, along with the ElasticSearchIndexProcessor, to index documents.
  • Adds ElasticSearchProcessor for searching documents indexed by an ElasticIndexer.
  • Creates a Passage Reranker example for the MS Marco dataset, including a baseline ranking model that uses only the ElasticIndexer.
    • Adds an EvalReader to read the MS Marco eval dataset.
    • Adds the MS Marco eval script to the example.

@codecov

codecov bot commented Dec 30, 2019

Codecov Report

Merging #105 into master will increase coverage by 0.6%.
The diff coverage is 76.67%.


@@            Coverage Diff            @@
##           master     #105     +/-   ##
=========================================
+ Coverage   61.18%   61.78%   +0.6%     
=========================================
  Files          94      100      +6     
  Lines        6425     6684    +259     
=========================================
+ Hits         3931     4130    +199     
- Misses       2494     2554     +60
Impacted Files                                     Coverage Δ
forte/data/readers/__init__.py                     100% <100%> (ø)
forte/indexers/tests/indexers_test.py              100% <100%> (ø)
forte/processors/base/__init__.py                  100% <100%> (ø)
forte/processors/base/query_processor.py           89.47% <100%> (+1.23%)
forte/data/readers/tests/conllu_ud_reader_test.py  97.77% <100%> (ø)
forte/common/evaluation.py                         81.25% <100%> (+1.25%)
forte/processors/__init__.py                       100% <100%> (ø)
forte/processors/bert_based_query_creator.py       84.9% <100%> (+0.29%)
forte/data/readers/tests/corpus_reader_test.py     100% <100%> (ø)
forte/processors/elastic_search_processor.py       42.85% <42.85%> (ø)
... and 15 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 994ae19...9255a96.

Member

@hunterhector hunterhector left a comment

I think in general the PR is OK. There are a few simple comments here, plus the comments in #103. Maybe we can plan to merge them today after these are fixed.

@AvinashBukkittu
Collaborator Author

Importing relevant comments from #103

  1. Adding indexer+reranker inference pipeline; passage reranking bert model #103 (comment)

Let's have more typing here.

Added typing in MS Marco Evaluator

  1. Adding indexer+reranker inference pipeline; passage reranking bert model #103 (comment)

It would be better to store only the necessary information from the pack. Here we only need the doc_id?

Simplified the logic of MS Marco Evaluator in b7632b8
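
To illustrate the reviewer's point about keeping only the doc_id, here is a minimal sketch of an evaluator that stores nothing but ranked document ids per query. It scores them with MRR@10, the metric usually reported for MS MARCO passage ranking; whether the evaluator in b7632b8 computes exactly this is an assumption. The names RankingEvaluator, consume, and get_result are hypothetical and are not taken from this PR.

```python
from typing import Dict, List, Set


class RankingEvaluator:
    """Hypothetical evaluator that keeps only doc ids, never whole packs."""

    def __init__(self, relevant: Dict[str, Set[str]], cutoff: int = 10) -> None:
        # relevant maps each query id to the set of doc ids judged relevant.
        self.relevant = relevant
        self.cutoff = cutoff
        self.ranked: Dict[str, List[str]] = {}

    def consume(self, query_id: str, doc_ids: List[str]) -> None:
        # Keep only the top `cutoff` doc ids; nothing else is retained.
        self.ranked[query_id] = doc_ids[: self.cutoff]

    def get_result(self) -> float:
        """Mean reciprocal rank of the first relevant doc within the cutoff."""
        if not self.relevant:
            return 0.0
        total = 0.0
        for query_id, gold in self.relevant.items():
            ranking = self.ranked.get(query_id, [])
            for rank, doc_id in enumerate(ranking, start=1):
                if doc_id in gold:
                    total += 1.0 / rank
                    break
        return total / len(self.relevant)
```

Queries with no relevant document in the top ten contribute zero, so the score is averaged over all judged queries rather than only the ones that were answered.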

  1. Adding indexer+reranker inference pipeline; passage reranking bert model #103 (comment)

passage is too specific as a name. How about rank_list?

Renamed passages to results.

  1. Adding indexer+reranker inference pipeline; passage reranking bert model #103 (comment)

Add some docstring here to teach users to extend this method in order to create more complex queries.

Done in b7632b8

  1. Adding indexer+reranker inference pipeline; passage reranking bert model #103 (comment)

Do we need to benchmark the speed of the indexer? Hopefully, our wrapper won't decrease the speed a lot.

Added a benchmark testcase in b7632b8
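
One way to check that a wrapper adds little overhead is to time the same workload through the wrapper and through the underlying call. The sketch below does this with in-memory stand-ins; raw_index, WrappedIndexer, and measure_overhead are hypothetical names, and the actual benchmark test case added in b7632b8 may be structured differently.

```python
import time
from functools import partial
from typing import Callable, Dict, List, Tuple


def raw_index(store: Dict[str, str], doc_id: str, text: str) -> None:
    """Stand-in for a direct backend call (hypothetical)."""
    store[doc_id] = text


class WrappedIndexer:
    """Stand-in for a thin wrapper around the backend (hypothetical)."""

    def __init__(self) -> None:
        self.store: Dict[str, str] = {}

    def index(self, doc_id: str, text: str) -> None:
        raw_index(self.store, doc_id, text)


def time_calls(fn: Callable[[str, str], None],
               docs: List[Tuple[str, str]]) -> float:
    """Return the wall-clock seconds spent calling fn on every document."""
    start = time.perf_counter()
    for doc_id, text in docs:
        fn(doc_id, text)
    return time.perf_counter() - start


def measure_overhead(num_docs: int = 1000) -> Dict[str, float]:
    """Time the same workload directly and through the wrapper."""
    docs = [(str(i), f"document {i}") for i in range(num_docs)]
    direct_store: Dict[str, str] = {}
    direct = time_calls(partial(raw_index, direct_store), docs)
    wrapped = time_calls(WrappedIndexer().index, docs)
    return {"direct_s": direct, "wrapped_s": wrapped}
```

Timings from a single run are noisy, so a real benchmark would repeat the measurement and compare medians rather than assert on one number.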

  1. Adding indexer+reranker inference pipeline; passage reranking bert model #103 (comment)

The design of _process_query seems a little difficult, especially in returning input_pack.

I agree. Essentially, the following three lines

query = Query(pack=query_pack)
query.set_value(value=query_value)
query_pack.add_entry(query)

are common to every QueryProcessor. To abstract these details away in the _process method, I couldn't think of a better way than to return the query_pack and the query_value from _process_query.
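
The design described here is a template method: the base class owns the three shared lines, and subclasses only decide which pack receives the query and what its value is. A minimal sketch follows, with DataPack and Query replaced by simplified stand-ins (they are not Forte's real classes) and LowercaseQueryCreator invented purely for illustration.

```python
from typing import Any, List, Tuple


class DataPack:
    """Simplified stand-in for a data pack; not Forte's real class."""

    def __init__(self, text: str) -> None:
        self.text = text
        self.entries: List[Any] = []

    def add_entry(self, entry: Any) -> None:
        self.entries.append(entry)


class Query:
    """Simplified stand-in for the Query entry used in the three lines above."""

    def __init__(self, pack: DataPack) -> None:
        self.pack = pack
        self.value: Any = None

    def set_value(self, value: Any) -> None:
        self.value = value


class QueryProcessor:
    """Base class: _process holds the boilerplate shared by all subclasses."""

    def _process(self, input_pack: DataPack) -> None:
        query_pack, query_value = self._process_query(input_pack)
        query = Query(pack=query_pack)
        query.set_value(value=query_value)
        query_pack.add_entry(query)

    def _process_query(self, input_pack: DataPack) -> Tuple[DataPack, Any]:
        """Subclasses return the pack to attach the query to, and its value."""
        raise NotImplementedError


class LowercaseQueryCreator(QueryProcessor):
    """Hypothetical subclass: the query is just the lowercased pack text."""

    def _process_query(self, input_pack: DataPack) -> Tuple[DataPack, Any]:
        return input_pack, input_pack.text.lower()
```

Returning the pack alongside the value is what lets the base class stay generic: a subclass is free to attach the query to a different pack than the one it was given.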

  1. Adding indexer+reranker inference pipeline; passage reranking bert model #103 (comment)

At least in bulk mode, we should add a couple more documents.

Increased the limit to 10,000
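
For context, a bulk-mode test of this size needs a stream of index actions. The sketch below generates 10,000 synthetic documents as action dicts in the shape accepted by `elasticsearch.helpers.bulk` in the official Python client (`_index`, `_id`, `_source`). It does not import the client or contact a server, and the index name and the doc_id/content field names are assumptions, not the ones used in this PR.

```python
from typing import Any, Dict, Iterable, Iterator


def make_docs(count: int) -> Iterator[Dict[str, str]]:
    """Yield `count` synthetic documents (hypothetical field names)."""
    for i in range(count):
        yield {"doc_id": str(i), "content": f"document {i}"}


def bulk_actions(index_name: str,
                 docs: Iterable[Dict[str, str]]) -> Iterator[Dict[str, Any]]:
    """Turn documents into one bulk index action each."""
    for doc in docs:
        yield {
            "_index": index_name,
            "_id": doc["doc_id"],
            "_source": {"content": doc["content"]},
        }


# Built lazily, so the documents are never all held in memory at once.
actions = bulk_actions("test_index", make_docs(10_000))
```

With a running cluster, the generator could be passed as `helpers.bulk(client, actions)`; that step is left out here so the example runs without Elasticsearch.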

hunterhector previously approved these changes Jan 3, 2020
Member

@hunterhector hunterhector left a comment

Comments addressed in b7632b8

Conflicts:
	.travis.yml
	setup.py
@hunterhector hunterhector merged commit c3f5e01 into master Jan 6, 2020
@mgupta1410 mgupta1410 deleted the elastic-indexer branch February 28, 2020 15:47
3 participants