Reproducing HotpotQA Results #43

Open
haoyb22 opened this issue Aug 12, 2024 · 0 comments
haoyb22 commented Aug 12, 2024

Hi,

Thanks for the great work. Unfortunately, we are unable to reproduce your results for ReAct / Reflexion on HotpotQA.

For example, Table 5 of your paper reports a baseline accuracy of 0.26 for ReAct + gpt-3.5-turbo. However, when we try to reproduce this result with gpt-3.5-turbo on your dataset hotpot-qa-distractor-sample.joblib, we only get a baseline (first-trial) accuracy of 0.09. You can see the detailed trajectories here: https://github.com/haoyb22/Reflexion_hotpotqa/blob/main/100_questions_5_trials.txt

We also find that our trajectories look quite different from yours. For example, the ReAct agent with gpt-3.5-turbo often emits more than one step at a time instead of following the Thought/Action/Observation format. This breaks the parsing of Action steps and halts the program, so we modified your agents.py slightly just to keep it running. You can see the modified agents.py here: https://github.com/haoyb22/Reflexion_hotpotqa/blob/main/agents.py
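For reference, the kind of guard we needed is roughly like the sketch below: take only the first Action from a completion that may contain several steps, instead of assuming the model stopped after one. This is a simplified illustration, not the exact change in agents.py, and the parsing details (function name, expected action format) are assumptions.

```python
import re

def parse_first_action(llm_output: str):
    """Return only the first Action from a possibly multi-step completion.

    gpt-3.5-turbo sometimes emits several Thought/Action/Observation steps at
    once; taking just the first keeps the ReAct loop in sync. (Hypothetical
    helper, not the actual code in agents.py.)
    """
    match = re.search(r"Action(?: \d+)?:\s*(.*)", llm_output)
    if match is None:
        return None  # caller can retry or treat the step as malformed
    action_line = match.group(1).split("\n")[0].strip()
    # Expected form: Search[entity], Lookup[keyword], or Finish[answer]
    step = re.match(r"(\w+)\[(.*)\]", action_line)
    if step is None:
        return None
    return step.group(1), step.group(2)
```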

Could you please clarify how to reproduce the results you report?

@noahshinn
