I'll share my understanding of this; further thoughts are welcome.
For this CountDown task, the LLM has to try different candidate expressions until one makes the equation valid. During RL, the model gradually learns to output more retries, because more retries raise the chance of a correct answer and therefore its reward. However, some harder questions need so many retries that the response exceeds the response length budget. Those responses get cut off before they reach a final answer, which leads to more answers with the format issue (red ones).
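To make the mechanism concrete, here is a toy sketch of how a length cap can turn a long but otherwise fine response into a format failure. It is not the repo's actual reward code. The `<answer>` tag convention, the `BUDGET` value, and the helper names (`has_valid_format`, `truncate`, `simulate_response`) are all assumptions made up for illustration.

```python
import re

# Assumed format rule: a response is well-formed only if it contains a
# complete <answer>...</answer> block. The tag name is hypothetical.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)

BUDGET = 20  # hypothetical response length budget, in tokens


def has_valid_format(response: str) -> bool:
    """Return True if the response contains a complete answer block."""
    return ANSWER_RE.search(response) is not None


def truncate(tokens: list[str], max_len: int) -> list[str]:
    """Cut a response at the length budget, as a generation cap would."""
    return tokens[:max_len]


def simulate_response(num_retries: int) -> list[str]:
    """Toy response: each retry costs 4 tokens, then the answer costs 3."""
    tokens: list[str] = []
    for i in range(num_retries):
        tokens += ["try", f"#{i}", "->", "wrong;"]
    tokens += ["<answer>", "(1+2)*3", "</answer>"]
    return tokens


for retries in (2, 4, 6):
    response = " ".join(truncate(simulate_response(retries), BUDGET))
    print(retries, has_valid_format(response))
```

With these toy numbers, 2 and 4 retries fit within the budget and pass the format check, while 6 retries use up all 20 tokens before the answer block is emitted, so the check fails. If the policy keeps growing its retry count on hard questions, more responses would fall into that last case.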
Thanks for your great work! However, I have a question: why do the answers with the format issue (red ones) continue to increase after step 88?