Description
Hi, thanks for open-sourcing this code. In my tests, GPT-2 variants show considerably lower eval accuracies than those reported in the paper and charts. I'm using the command provided in the README. I don't think the eval code itself is incorrect: testing it with LLaMA produces much higher eval accuracies, as I would expect. But I cannot replicate the GPT-2 results; any pointers on what the issue might be?