[Bug] The success condition for alfworld might be incorrect #19

Closed
stevenyangyj opened this issue Aug 4, 2023 · 8 comments

Comments

@stevenyangyj

According to this line and the other line, the function alfworld_run returns True when the environment reaches the maximum allowed number of steps, regardless of whether the goal has been achieved. This results in a spuriously inflated success rate.
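
To make the failure mode concrete, here is a minimal sketch of how a step-limited loop can report success spuriously. It is not the repository's actual code: `ToyEnv`, `run_buggy`, and `run_fixed` are hypothetical names, and the sketch assumes the environment raises its `done` flag both when the goal is reached and when its own step limit is hit.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ToyEnv:
    """Hypothetical stand-in for an environment with a hard step limit."""
    max_steps: int
    solve_at: Optional[int] = None  # step count at which the goal is reached, if ever
    steps_taken: int = 0

    def step(self) -> Tuple[bool, bool]:
        """Advance one step and return (done, won)."""
        self.steps_taken += 1
        won = self.solve_at is not None and self.steps_taken >= self.solve_at
        # Assumption: `done` is also raised when the step limit is exhausted.
        done = won or self.steps_taken >= self.max_steps
        return done, won


def run_buggy(env: ToyEnv, budget: int) -> bool:
    """Treats any `done` as success, including a step-limit timeout."""
    for _ in range(budget):
        done, _won = env.step()
        if done:
            return True
    return False


def run_fixed(env: ToyEnv, budget: int) -> bool:
    """Reports success only when the environment says the goal was reached."""
    for _ in range(budget):
        done, won = env.step()
        if done:
            return won
    return False


# An episode that never solves the task is counted as a success by the buggy loop.
print(run_buggy(ToyEnv(max_steps=50), budget=50))
print(run_fixed(ToyEnv(max_steps=50), budget=50))
```

Under the same assumption, keeping the loop budget one step below the environment's limit also avoids the false positive, because the timeout `done` is never observed. Checking an explicit success flag is less fragile than relying on the two limits staying in sync.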

@noahshinn
Owner

You are right. We didn't notice this because we used a max of 30 steps per trajectory for the paper results.

@stevenyangyj
Author

Thanks so much for the quick response.

@stevenyangyj
Author

> You are right. We didn't notice this because we used a max of 30 steps per trajectory for the paper results.

By the way, when I apply your fix commit (changing 50 to 49), the performance drops drastically. Did you observe this?

@noahshinn
Owner

Which model are you using? Can you provide numbers for the drop in performance?

@stevenyangyj
Author

stevenyangyj commented Aug 5, 2023

I used gpt-3.5-turbo, the same model as in the default config file, and the success rate is 0.13 at trial 0. This is lower than the reported performance of roughly 0.6 shown in Fig. 3 of the paper.

I am still waiting for the remaining trials.

@stevenyangyj
Author

@noahshinn024 The results with 12 trials are below, and the success rate is much lower than the reported results.

***** Start Trial #11 *****

Environment #0 Trial #11: SUCCESS
Environment #1 Trial #11: FAIL
Environment #2 Trial #11: SUCCESS
Environment #3 Trial #11: SUCCESS
Environment #4 Trial #11: SUCCESS
Environment #5 Trial #11: FAIL
Environment #6 Trial #11: SUCCESS
Environment #7 Trial #11: FAIL
Environment #8 Trial #11: FAIL
Environment #9 Trial #11: FAIL
Environment #10 Trial #11: SUCCESS
Environment #11 Trial #11: SUCCESS
Environment #12 Trial #11: SUCCESS
Environment #13 Trial #11: SUCCESS
Environment #14 Trial #11: FAIL
Environment #15 Trial #11: SUCCESS
Environment #16 Trial #11: SUCCESS
Environment #17 Trial #11: SUCCESS
Environment #18 Trial #11: SUCCESS
Environment #19 Trial #11: SUCCESS
Environment #20 Trial #11: FAIL
Environment #21 Trial #11: SUCCESS
Environment #22 Trial #11: SUCCESS
Environment #23 Trial #11: FAIL
Environment #24 Trial #11: SUCCESS
Environment #25 Trial #11: SUCCESS
Environment #26 Trial #11: SUCCESS
Environment #27 Trial #11: FAIL
Environment #28 Trial #11: SUCCESS
Environment #29 Trial #11: SUCCESS
Environment #30 Trial #11: SUCCESS
Environment #31 Trial #11: SUCCESS
Environment #32 Trial #11: FAIL
Environment #33 Trial #11: FAIL
Environment #34 Trial #11: FAIL
Environment #35 Trial #11: SUCCESS
Environment #36 Trial #11: SUCCESS
Environment #37 Trial #11: FAIL
Environment #38 Trial #11: SUCCESS
Environment #39 Trial #11: SUCCESS
Environment #40 Trial #11: SUCCESS
Environment #41 Trial #11: SUCCESS
Environment #42 Trial #11: SUCCESS
Environment #43 Trial #11: SUCCESS
Environment #44 Trial #11: SUCCESS
Environment #45 Trial #11: SUCCESS
Environment #46 Trial #11: SUCCESS
Environment #47 Trial #11: SUCCESS
Environment #48 Trial #11: SUCCESS
Environment #49 Trial #11: SUCCESS
Environment #50 Trial #11: SUCCESS
Environment #51 Trial #11: SUCCESS
Environment #52 Trial #11: SUCCESS
Environment #53 Trial #11: SUCCESS
Environment #54 Trial #11: SUCCESS
Environment #55 Trial #11: SUCCESS
Environment #56 Trial #11: FAIL
Environment #57 Trial #11: SUCCESS
Environment #58 Trial #11: SUCCESS
Environment #59 Trial #11: SUCCESS
Environment #60 Trial #11: SUCCESS
Environment #61 Trial #11: SUCCESS
Environment #62 Trial #11: SUCCESS
Environment #63 Trial #11: FAIL
Environment #64 Trial #11: SUCCESS
Environment #65 Trial #11: SUCCESS
Environment #66 Trial #11: FAIL
Environment #67 Trial #11: SUCCESS
Environment #68 Trial #11: FAIL
Environment #69 Trial #11: FAIL
Environment #70 Trial #11: SUCCESS
Environment #71 Trial #11: FAIL
Environment #72 Trial #11: SUCCESS
Environment #73 Trial #11: SUCCESS
Environment #74 Trial #11: SUCCESS
Environment #75 Trial #11: SUCCESS
Environment #76 Trial #11: FAIL
Environment #77 Trial #11: SUCCESS
Environment #78 Trial #11: SUCCESS
Environment #79 Trial #11: SUCCESS
Environment #80 Trial #11: SUCCESS
Environment #81 Trial #11: SUCCESS
Environment #82 Trial #11: SUCCESS
Environment #83 Trial #11: SUCCESS
Environment #84 Trial #11: SUCCESS
Environment #85 Trial #11: SUCCESS
Environment #86 Trial #11: SUCCESS
Environment #87 Trial #11: FAIL
Environment #88 Trial #11: FAIL
Environment #89 Trial #11: SUCCESS
Environment #90 Trial #11: SUCCESS
Environment #91 Trial #11: SUCCESS
Environment #92 Trial #11: SUCCESS
Environment #93 Trial #11: SUCCESS
Environment #94 Trial #11: SUCCESS
Environment #95 Trial #11: FAIL
Environment #96 Trial #11: FAIL
Environment #97 Trial #11: SUCCESS
Environment #98 Trial #11: SUCCESS
Environment #99 Trial #11: SUCCESS
Environment #100 Trial #11: SUCCESS
Environment #101 Trial #11: SUCCESS
Environment #102 Trial #11: SUCCESS
Environment #103 Trial #11: SUCCESS
Environment #104 Trial #11: FAIL
Environment #105 Trial #11: SUCCESS
Environment #106 Trial #11: FAIL
Environment #107 Trial #11: SUCCESS
Environment #108 Trial #11: FAIL
Environment #109 Trial #11: FAIL
Environment #110 Trial #11: FAIL
Environment #111 Trial #11: SUCCESS
Environment #112 Trial #11: FAIL
Environment #113 Trial #11: SUCCESS
Environment #114 Trial #11: FAIL
Environment #115 Trial #11: SUCCESS
Environment #116 Trial #11: FAIL
Environment #117 Trial #11: SUCCESS
Environment #118 Trial #11: SUCCESS
Environment #119 Trial #11: SUCCESS
Environment #120 Trial #11: FAIL
Environment #121 Trial #11: SUCCESS
Environment #122 Trial #11: SUCCESS
Environment #123 Trial #11: SUCCESS
Environment #124 Trial #11: SUCCESS
Environment #125 Trial #11: SUCCESS
Environment #126 Trial #11: FAIL
Environment #127 Trial #11: SUCCESS
Environment #128 Trial #11: FAIL
Environment #129 Trial #11: SUCCESS
Environment #130 Trial #11: SUCCESS
Environment #131 Trial #11: SUCCESS
Environment #132 Trial #11: FAIL
Environment #133 Trial #11: SUCCESS


SUCCESS: 98
ADDITIONAL SUCCESS: 0
FAIL: 36
TOTAL: 134
ACCURACY: 0.73

***** End Trial #11 *****
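
The summary figures in this log can be re-derived from the per-environment lines. The sketch below is one way to do it, assuming each result line has the form `Environment #<n> Trial #<m>: SUCCESS|FAIL` as shown above; `summarize` is a hypothetical helper, not part of the repository.

```python
import re


def summarize(log_text: str) -> dict:
    """Count SUCCESS/FAIL result lines and compute accuracy to two decimals."""
    results = re.findall(r"Environment #\d+ Trial #\d+: (SUCCESS|FAIL)", log_text)
    total = len(results)
    success = results.count("SUCCESS")
    accuracy = round(success / total, 2) if total else 0.0
    return {
        "success": success,
        "fail": total - success,
        "total": total,
        "accuracy": accuracy,
    }


# Small illustrative log, not the full 134-environment run.
sample = """
Environment #0 Trial #11: SUCCESS
Environment #1 Trial #11: FAIL
Environment #2 Trial #11: SUCCESS
Environment #3 Trial #11: SUCCESS
"""
print(summarize(sample))

# The reported totals are consistent with each other: 98 + 36 = 134, 98 / 134 rounds to 0.73.
print(round(98 / 134, 2))
```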

@zhilizju

A month has passed. Have there been any further findings regarding this reproduction result? Have you successfully reproduced it?

@noahshinn
Owner

Hi @stevenyangyj, thanks for these findings! Did you look through the log files to check whether the errors can be explained by incorrect action choice or incorrect action specification? I ask because we used gpt-3.5 (text-davinci-003, not gpt-3.5-turbo) for the Alfworld runs. My guess is that the chat model may perform worse due to formatting errors, since the complete action space is not defined at each time step. Let me know if this aligns with your findings!
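
One rough way to run the log check suggested above is to count how often the environment rejected an action. The sketch below rests on two assumptions that are not confirmed in this thread: that each trajectory can be loaded as a list of (action, observation) pairs, and that a malformed or inadmissible action produces the observation "Nothing happens.". `classify_steps` is a hypothetical helper.

```python
from collections import Counter
from typing import Iterable, Tuple

# Assumption: the environment answers an unrecognised action with this text.
NO_OP_OBSERVATION = "Nothing happens."


def classify_steps(trajectory: Iterable[Tuple[str, str]]) -> Counter:
    """Count steps whose action was accepted versus rejected by the environment."""
    counts = Counter()
    for _action, observation in trajectory:
        if observation.strip() == NO_OP_OBSERVATION:
            counts["rejected"] += 1
        else:
            counts["accepted"] += 1
    return counts


# Illustrative trajectory, not taken from an actual run.
trajectory = [
    ("go to desk 1", "On the desk 1, you see a mug 1."),
    ("pick up mug", "Nothing happens."),
    ("take mug 1 from desk 1", "You pick up the mug 1 from the desk 1."),
]
print(classify_steps(trajectory))
```

A high share of rejected steps in failed episodes would point toward action formatting problems. A low share would suggest the model is choosing well-formed but unhelpful actions.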
