Reproducing Results on Webshop #34

Open
ai-nikolai opened this issue Jan 18, 2025 · 2 comments

@ai-nikolai

@ysymyth @john-b-yang

Running the official repo with a variety of models, including gpt-4, gpt-4o, gpt-4o-mini, and gpt-3.5-turbo, gives terrible performance these days: success rates range from 0-10% using ReAct.

Could you be so kind as to re-run the official experiments and post updated results, so that we can use them as the "official" benchmark? (For example, reviewers often complain that new results are much worse than the original results.) We tried reproducing the results using your implementation, other implementations, and our own implementation, and always got very low scores.

Thank you very much.

Concrete ask:
If running the whole thing is too much, which we understand, could you maybe just run the first 30-50 examples and report the scores for a few models (e.g. gpt-3.5-turbo, gpt-4o and gpt-4o-mini)?

Thank you very much.

@ai-nikolai

noahshinn/reflexion#49

@ai-nikolai

@ysymyth @noahshinn -

Problem:
Results on Webshop are very bad these days. Could you please help?

Ask:
Could you run the ReAct agent on Webshop on a few test samples (e.g. 30) with some models, so that we can compare results: gpt-3.5-turbo-1106, gpt-3.5-turbo-0125, gpt-4o-mini?


Results:

Webshop results today, using existing code repos, are terrible:

| Agent | Model | Code Repo | No. Test Samples | Success Rate |
|---|---|---|---|---|
| ReAct | gpt-3.5-turbo-instruct | Reflexion | 30 | 13% |
| ReAct | gpt-4o-mini | Reflexion | 30 | 0.0% |
| ReAct | gpt-3.5-turbo-1106 | ADaPT | 30 | 0.0% |
| ReAct | gpt-3.5-turbo-0125 | ADaPT | 30 | 0.0% |
| ReAct | gpt-3.5-turbo-instruct | ReAct | 30 | 13% |
| ReAct | gpt-3.5-turbo-instruct | Ours | 30 | 13% |
| ReAct | gpt-3.5-turbo-1106 | Ours | 30 | 0.0% |
| ReAct | gpt-3.5-turbo-1106 | ReAct | 30 | 0.0% |
| ReAct | gpt-3.5-turbo-1106 | StateAct | 30 | 0.0% |

Webshop results using the same models 6 months ago:

| Agent | Model | Code Repo | No. Test Samples | Success Rate |
|---|---|---|---|---|
| ReAct | gpt-3.5-turbo-1106 | StateAct | 500 | 18.2% |
| ReAct | gpt-3.5-turbo-0125 | StateAct | 500 | 14.6% |
| ReAct (2 years ago) | gpt-3.5-turbo (version not reported) | ADaPT, ReAct & Reflexion | 100/500 | 30% |
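
In case it helps anyone comparing these numbers, here is a minimal sketch of how a success rate on the first N episodes could be computed, together with a rough uncertainty estimate. It is not taken from any of the repos above. It assumes each episode yields a reward in [0, 1] and that "success" means a reward of 1, which may not match every repo's logging. The function names `success_rate` and `wilson_interval` are made up for illustration, and the Wilson score interval is just one common choice for small samples.

```python
import math


def success_rate(rewards, first_n=None):
    """Fraction of episodes counted as successes.

    Assumption: a reward of 1.0 (or more) marks a successful episode.
    `first_n` restricts scoring to the first N episodes, e.g. 30.
    """
    episodes = rewards if first_n is None else rewards[:first_n]
    if not episodes:
        raise ValueError("no episodes to score")
    return sum(1 for r in episodes if r >= 1.0) / len(episodes)


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)


if __name__ == "__main__":
    # Hypothetical per-episode rewards, not real Webshop output.
    rewards = [1.0, 0.5, 0.0, 1.0, 0.75]
    print(success_rate(rewards, first_n=4))
    print(wilson_interval(0, 30))
```

With only 30 episodes the interval is wide: under these assumptions, even 0 successes out of 30 leaves an upper bound of roughly 11%, so small-sample runs are best compared with that in mind.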
