
Reproducing Paper Results #33

@framolfese

Description


Hi,

At this link you have published the detailed evaluation scores for each model and benchmark at different input lengths.

I have two questions about it:

  1. How can we use these numbers to reproduce the results shown in Figure 6 of your paper? (A rough sketch of what I have in mind is included after question 2.)

  2. For tasks like RAG, which is split into sub-tasks such as HotpotQA, there are multiple files per sub-task, named with different k values (k20, k220, k440, etc.):

hotpotqa-dev-multikilt_1000_k20_dep3
hotpotqa-dev-multikilt_1000_k50_dep3
hotpotqa-dev-multikilt_1000_k105_dep3
hotpotqa-dev-multikilt_1000_k220_dep3
hotpotqa-dev-multikilt_1000_k440_dep3
hotpotqa-dev-multikilt_1000_k500_dep3
hotpotqa-dev-multikilt_1000_k1000_dep3

There are 7 such files, each with a different value of $k$, but only 5 input lengths in the paper (8k, 16k, 32k, 64k, and 128k). Which input length corresponds to which file?
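
For reference, here is a rough sketch of how I would try to turn the published per-file scores into one number per input length for Figure 6. The k-to-length mapping, the directory layout, and the metric key in the JSON files are all my assumptions (the mapping is exactly what I am asking about in question 2), so please correct anything that is wrong:

```python
import json
from collections import defaultdict
from pathlib import Path

# ASSUMED mapping from the k in each file name to a context length;
# this is precisely the correspondence I am asking about in question 2.
K_TO_LENGTH = {50: "8k", 105: "16k", 220: "32k", 440: "64k", 1000: "128k"}

def aggregate_scores(score_dir: str) -> dict:
    """Average the per-file scores for each assumed input length."""
    sums, counts = defaultdict(float), defaultdict(int)
    for path in Path(score_dir).glob("*.json"):
        # File names look like hotpotqa-dev-multikilt_1000_k220_dep3;
        # pull out the k value between "_k" and the following "_".
        k = int(path.stem.split("_k")[1].split("_")[0])
        if k not in K_TO_LENGTH:
            continue  # skip k values that have no matching length in the paper
        with open(path) as f:
            scores = json.load(f)
        # ASSUMED metric key; the published files may use a different name.
        sums[K_TO_LENGTH[k]] += scores["substring_exact_match"]
        counts[K_TO_LENGTH[k]] += 1
    return {length: sums[length] / counts[length] for length in sums}

if __name__ == "__main__":
    # Hypothetical path to the published per-task score files.
    print(aggregate_scores("results/rag/hotpotqa"))
```

If this is roughly the right procedure (pick one file per input length, then average over sub-tasks and models), a pointer to the correct mapping and metric key would be enough for me to reproduce the figure.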
