Reproducing Webshop Results #49
Comments
Hey @ai-nikolai, have you taken a look at the errors or failure cases? Or could you point me to a log file that I can view?
@noahshinn - thanks for the quick response. I will send you some logs tomorrow; I don't have access to them atm. Generally, the confusing part is that 6 months ago the same models i.e. It would be great if you could run ReAct on Webshop with one of these models using:
Sounds great! I'll look out for the log files (or handful of examples) so that I can help.
@noahshinn here are some traces from Traces of the 10 environments
Here are traces using Traces of the 10 environments
@noahshinn as you can see in the traces, the models very often start producing the search results. What is very strange is that the same models produced much better results 6 months ago. Specifically, the 1106 model produced an 18% success rate (which is still much lower than 30%). It would be extremely helpful if you could run gpt-3.5-turbo-1106 on your end and report some results as well. Thank you very much.
@noahshinn any luck with the above?
Btw, here's an easy way of running Webshop these days:
See this README for more details:
@ysymyth @noahshinn -
Problem:
Results on Webshop are very bad these days. Could you please help?
Ask:
Could you run Webshop and the ReAct agent on a few test samples (e.g. 30) with some models to compare results: gpt-3.5-turbo-1106, gpt-3.5-turbo-0125, gpt-4o-mini.
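For anyone trying the comparison requested here, a minimal sketch of the bookkeeping might look like the following. It is not the code from this repo: `run_episode` is a hypothetical callable that you would implement around your own ReAct + Webshop setup and that returns the final episode reward in [0, 1], and counting a reward of exactly 1.0 as a success is an assumption about how the success rate is defined.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

# Model names taken from the request above.
MODELS = ("gpt-3.5-turbo-1106", "gpt-3.5-turbo-0125", "gpt-4o-mini")


def summarize(rewards: List[float]) -> Dict[str, float]:
    """Return success rate and mean reward for a list of episode rewards.

    Assumption: an episode counts as a success only if its reward is 1.0.
    """
    if not rewards:
        raise ValueError("no episodes to summarize")
    successes = sum(1 for r in rewards if r == 1.0)
    return {
        "success_rate": successes / len(rewards),
        "avg_reward": mean(rewards),
    }


def compare_models(
    run_episode: Callable[[str, int], float],
    models: Sequence[str] = MODELS,
    n_episodes: int = 30,
) -> Dict[str, Dict[str, float]]:
    """Run the same episode indices for every model and summarize each.

    `run_episode(model, idx)` is hypothetical: it should run one Webshop
    session with the given model and return the final reward.
    """
    results = {}
    for model in models:
        rewards = [run_episode(model, idx) for idx in range(n_episodes)]
        results[model] = summarize(rewards)
    return results
```

Running every model on the same episode indices keeps the numbers comparable across models; with only 30 samples the success rate is still noisy, so small differences should be read with care.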
Results:
Webshop results are terrible today using the existing code repos:
Webshop results using the same models 6 months ago: