Observing eval accuracy considerably lower than reported?

Hi, thanks for open-sourcing this code. I'm noticing that my tests with GPT-2 variants show considerably lower eval accuracies than what's reported in the paper & charts. I'm using the command provided in the README. I do not think the eval code itself is incorrect --- testing it with LLaMA shows much higher eval accuracies (as I would expect). But I cannot replicate the GPT-2 results; any pointers on what the issue might be?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Observing eval accuracy considerably lower than reported? #3

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Observing eval accuracy considerably lower than reported? #3

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions