Running the official repo with a variety of models (gpt-4, gpt-4o, gpt-4o-mini, gpt-3.5-turbo, etc.) results in very poor performance: success rates currently range from 0-10% with ReAct.
Could you please re-run the official experiments and post updated results that we can use as the "official" benchmark? (Reviewers often complain that new results are much worse than the originally reported ones.) We have tried reproducing the results with your implementation, other implementations, and our own implementation, and all of them yield very low scores.
Thank you very much.
Concrete Ask:
If re-running the full benchmark is too much (which we understand), could you run just the first 30-50 examples and report the scores for a few models (e.g. gpt-3.5-turbo, gpt-4o, and gpt-4o-mini)?
Thank you very much.
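In case it helps pinpoint where our reproduction diverges from yours: the only substantive change we make to the repo's WebShop notebook is swapping the original text-davinci-002 Completions call for a Chat Completions call so the newer models can be used. A minimal sketch of our setup, assuming the current openai Python client; the helper name `llm` mirrors the notebook's helper, and the sampling settings shown are our choices, not necessarily yours:

```python
# Sketch of our reproduction setup (not the official script).
# Assumes openai>=1.0 and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def llm(prompt, stop=None, model="gpt-4o-mini"):
    """Drop-in replacement for the original text-davinci-002 Completions call."""
    resp = client.chat.completions.create(
        model=model,       # e.g. gpt-3.5-turbo, gpt-4o, gpt-4o-mini
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,   # greedy decoding
        max_tokens=100,
        stop=stop,         # e.g. stop=["\n"] to end after a single action
    )
    return resp.choices[0].message.content
```

Everything else (prompts, environment wrapper, the per-task loop over the first 30 test indices) we keep as in the notebook.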
Problem:
Results on Webshop are very poor these days. Could you please help?
Ask:
Could you run the ReAct agent on Webshop for a few test samples (e.g. 30) with the following models so we can compare results: gpt-3.5-turbo-1106, gpt-3.5-turbo-0125, gpt-4o-mini?
Results:
Webshop results obtained today with existing code repos are very poor:
| Agent | Model | Code Repo | No. Test Samples | Success Rate |
|-------|-------|-----------|------------------|--------------|
| ReAct | gpt-3.5-turbo-instruct | Reflexion | 30 | 13% |
| ReAct | gpt-4o-mini | Reflexion | 30 | 0.0% |
| ReAct | gpt-3.5-turbo-1106 | ADaPT | 30 | 0.0% |
| ReAct | gpt-3.5-turbo-0125 | ADaPT | 30 | 0.0% |
| ReAct | gpt-3.5-turbo-instruct | ReAct | 30 | 13% |
| ReAct | gpt-3.5-turbo-instruct | Ours | 30 | 13% |
| ReAct | gpt-3.5-turbo-1106 | Ours | 30 | 0.0% |
| ReAct | gpt-3.5-turbo-1106 | ReAct | 30 | 0.0% |
| ReAct | gpt-3.5-turbo-1106 | StateAct | 30 | 0.0% |
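For reference, the Success Rate column follows the usual WebShop convention: an episode counts as a success only when its final reward is 1.0 (averaging the raw rewards would give the separate "score" metric instead). A minimal sketch of that computation:

```python
# Success rate over a batch of WebShop episodes.
# `rewards` holds the final reward of each episode, each in [0, 1].
def success_rate(rewards):
    return sum(r == 1.0 for r in rewards) / len(rewards)

# Example: 4 successes out of 30 episodes -> ~13%, matching the table above.
print(f"{success_rate([1.0] * 4 + [0.0] * 26):.0%}")
```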
Webshop results using the same models 6 months ago:
@ysymyth @john-b-yang