Hi,
At this link, you have published the detailed evaluation scores for each model and benchmark under different input lengths.
I have two questions about it:
- How can we use these numbers to obtain the same results as in Figure 6 of your paper?
- For tasks like RAG, which is split into subtasks such as HotpotQA, there are multiple files per subtask, named with "k20", "k220", "k440", and so on:
hotpotqa-dev-multikilt_1000_k20_dep3
hotpotqa-dev-multikilt_1000_k50_dep3
hotpotqa-dev-multikilt_1000_k105_dep3
hotpotqa-dev-multikilt_1000_k220_dep3
hotpotqa-dev-multikilt_1000_k440_dep3
hotpotqa-dev-multikilt_1000_k500_dep3
hotpotqa-dev-multikilt_1000_k1000_dep3

However, there are 7 of these files for evaluation, each with a different value of k.
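For reference, here is a minimal sketch of how the per-k scores could be collected and compared; it assumes each result is a JSON file with a top-level "score" field under a results/ directory (both the layout and the field name are my assumptions, not the repo's documented format):

```python
import json
import re
from pathlib import Path

# Hypothetical layout: one JSON file per (task, k) pair, e.g.
#   results/hotpotqa-dev-multikilt_1000_k220_dep3.json
RESULTS_DIR = Path("results")

scores_by_k = {}
for path in sorted(RESULTS_DIR.glob("hotpotqa-dev-multikilt_1000_k*_dep3*")):
    # Extract the k value (e.g. 20, 220, 1000) from the file name.
    match = re.search(r"_k(\d+)_", path.name)
    if match is None:
        continue
    k = int(match.group(1))
    with path.open() as f:
        data = json.load(f)
    # Assumes a top-level "score" field; adjust to the real schema.
    scores_by_k[k] = data["score"]

# Print scores in order of increasing k to compare across input lengths.
for k in sorted(scores_by_k):
    print(f"k={k}: score={scores_by_k[k]}")
```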