
Reproducing Paper Results #33

@framolfese

Description


Hi,

At this link you have published the detailed evaluation scores for each model and benchmark at different input lengths.

I have two questions about it:

  1. How can we use these numbers to reproduce the results shown in Figure 6 of your paper? (A rough sketch of what I have in mind is included after question 2.)

  2. For tasks like RAG, which is split into sub-tasks such as HotpotQA, there are multiple files per sub-task, named with different k values (k20, k220, k440, etc.):

hotpotqa-dev-multikilt_1000_k20_dep3
hotpotqa-dev-multikilt_1000_k50_dep3
hotpotqa-dev-multikilt_1000_k105_dep3
hotpotqa-dev-multikilt_1000_k220_dep3
hotpotqa-dev-multikilt_1000_k440_dep3
hotpotqa-dev-multikilt_1000_k500_dep3
hotpotqa-dev-multikilt_1000_k1000_dep3

There are 7 such files, each with a different value of $k$, but only 5 input lengths in the paper (8k, 16k, 32k, 64k, and 128k). Which input length corresponds to which file?
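
For reference, here is a rough sketch of how I would try to turn the published per-file scores into one number per input length for Figure 6. The k-to-length mapping, the directory layout, and the metric key in the JSON files are all my assumptions (the mapping is exactly what I am asking about in question 2), so please correct anything that is wrong:

```python
import json
from collections import defaultdict
from pathlib import Path

# ASSUMED mapping from the k in each file name to a context length;
# this is precisely the correspondence I am asking about in question 2.
K_TO_LENGTH = {50: "8k", 105: "16k", 220: "32k", 440: "64k", 1000: "128k"}

def aggregate_scores(score_dir: str) -> dict:
    """Average the per-file scores for each assumed input length."""
    sums, counts = defaultdict(float), defaultdict(int)
    for path in Path(score_dir).glob("*.json"):
        # File names look like hotpotqa-dev-multikilt_1000_k220_dep3;
        # pull out the k value between "_k" and the following "_".
        k = int(path.stem.split("_k")[1].split("_")[0])
        if k not in K_TO_LENGTH:
            continue  # skip k values that have no matching length in the paper
        with open(path) as f:
            scores = json.load(f)
        # ASSUMED metric key; the published files may use a different name.
        sums[K_TO_LENGTH[k]] += scores["substring_exact_match"]
        counts[K_TO_LENGTH[k]] += 1
    return {length: sums[length] / counts[length] for length in sums}

if __name__ == "__main__":
    # Hypothetical path to the published per-task score files.
    print(aggregate_scores("results/rag/hotpotqa"))
```

If this is roughly the right procedure (pick one file per input length, then average over sub-tasks and models), a pointer to the correct mapping and metric key would be enough for me to reproduce the figure.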
