Reproducing Results on Webshop #34

ai-nikolai · 2025-01-18T16:13:12Z

Running the official repo with a variety of models including gpt-4, gpt-4o, gpt-4o-mini, gpt-3.5-turbo, etc. results in terrible performance. These days ranging from 0-10% success rate using ReAct.

Could you be so kind to re-run official results and post an updated version of results so that we can use these as "official" benchmark. (For example reviewers are often complaining that new results are much worse than original results). However, we tried reproducing result using your implementation, other implementations and our own implementation always yielding very low scores.

Thank you very much.

Concrete Ask:
If running the whole thing is too much, which we understand. Could you for example maybe just run the results on the first 30-50 examples and report the scores there for a few models (e.g. gpt-3.5-turbo, gpt-4o and gpt-4o-mini).

Thank you very much.

ai-nikolai · 2025-01-21T18:54:23Z

noahshinn/reflexion#49

ai-nikolai · 2025-01-21T18:54:43Z

@ysymyth @noahshinn -

Problem:
Results on Webshop are very bad these days. Could you please help!

Ask:
Could you run Webshop and ReAct agent on a few test samples (e.g. 30) with some models to compare results: gpt-3.5-turbo-1106, gpt-3.5-turbo-0125, gpt-4o-mini.

Results:

Webshop results are terrible today using existing code-repos:

Agent	Model	Code Repo	No. Test Samples	Success Rate
ReAct	gpt-3.5-turbo-instruct	Reflexion	30	13%
ReAct	gpt-4o-mini	Reflexion	30	0.0%
ReAct	gpt-3.5-turbo-1106	ADaPT	30	0.0%
ReAct	gpt-3.5-turbo-0125	ADaPT	30	0.0%
ReAct	gpt-3.5-turbo-instruct	ReAct	30	13%
ReAct	gpt-3.5-turbo-instruct	Ours	30	13%
ReAct	gpt-3.5-turbo-1106	Ours	30	0.0%
ReAct	gpt-3.5-turbo-1106	ReAct	30	0.0%
ReAct	gpt-3.5-turbo-1106	StateAct	30	0.0%

Webshop results using the same models 6 months ago:

Agent	Model	Code Repo	No. Test Samples	Success Rate
ReAct	gpt-3.5-turbo-1106	StateAct	500	18.2%
ReAct	gpt-3.5-turbo-0125	StateAct	500	14.6%
ReAct (2 years ago)	gpt-3.5-turbo (not reported Version)	ADaPT, ReAct & Reflexion	100/500	30%

This was referenced Jan 21, 2025

Reproducing Webshop Results noahshinn/reflexion#49

Open

Replicating Webshop Results archiki/ADaPT#10

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Reproducing Results on Webshop #34

Reproducing Results on Webshop #34

ai-nikolai commented Jan 18, 2025

ai-nikolai commented Jan 21, 2025

ai-nikolai commented Jan 21, 2025

Reproducing Results on Webshop #34

Reproducing Results on Webshop #34

Comments

ai-nikolai commented Jan 18, 2025

ai-nikolai commented Jan 21, 2025

ai-nikolai commented Jan 21, 2025