
label leaks may happen? #27

Open
LongLiveSocialism opened this issue Nov 3, 2023 · 4 comments
Comments

@LongLiveSocialism

Hi Noah, I'm reproducing your work. Generally, I view Reflexion as a kind of in-context few-shot SFT/RL, which requires a supervised signal (either from the environment or from labels). However, in your code, the evaluation on HotpotQA seems to use the validation-set label directly as this supervised signal, which means label leakage happens. I'm pretty confused here.
Did you run experiments on whether reflections on training samples generalize to validation samples? Or have I understood your idea correctly?

@noahshinn
Owner

Hi @LongLiveSocialism, thanks for the note.

Reflexion is a method to amplify binary rewards to natural language feedback that can be used to improve generative performance. The reward model can take many forms - as evidenced by our programming and decision-making tasks. Can you unpack your comment about "reflexion [being] some kind of in-context few-shot sft/rl"? The second half of your note seems to reference details that would be relevant if Reflexion were viewed as a supervised training process for the purpose of deployment to unseen samples, which was not the intent of the paper. The purpose is to do smart sampling conditioned on sparse feedback from the environment. I'd be happy to discuss this idea further though.
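
For concreteness, here is a rough sketch of that sampling loop in Python. The helper names (`agent.act`, `agent.reflect`, `env.evaluate`) are illustrative, not this repo's actual API:

```python
# Illustrative sketch of the Reflexion loop (hypothetical helper names, not
# this repo's actual API). The binary reward is amplified into natural-
# language feedback that conditions the next sample.

def reflexion_loop(task, agent, env, max_trials=5):
    reflections = []  # episodic memory of verbal self-reflections
    answer = None
    for _ in range(max_trials):
        # Sample conditioned on past reflections, never on the label itself.
        answer = agent.act(task, reflections)
        reward = env.evaluate(answer)  # sparse binary feedback: 1 = success
        if reward == 1:
            break
        # Amplify the failure signal into natural-language feedback.
        reflections.append(agent.reflect(task, answer))
    return answer
```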

@pengjiao123

pengjiao123 commented Nov 6, 2023


In my opinion, @LongLiveSocialism is not focused on the programming and decision-making tasks. Rather, the concern is that the experiment on the HotpotQA task may not be entirely reasonable.

Because the evaluation used real labels, he/she believes that in this experiment your Reflexion may amount to in-context few-shot SFT/RL. It is clear that the plain ReAct process lacks the ground-truth label. So the better performance of Reflexion is easy to understand (the main reason for the improvement may not be the Reflexion architecture itself).

The purpose of Reflexion is to do smart sampling conditioned on sparse feedback from the environment; that is fine.
But consider two approaches to the same problem: one encourages better reasoning to reach the possibly correct answer (the setup remains the same for each attempt), while the other builds on the first but adds feedback derived from a real label. Naturally, the second approach gets better results.

Firstly, in actual scenarios, or at least the vast majority of them, it is almost impossible to obtain the real label. Secondly, it may confuse others: is real-label supervision or your feedback mechanism the more important factor?

@lazyupdate

I agree that label leakage is a concern. Although calculating rewards through ground truth doesn't directly expose the correct answers to the model, it can influence the model's decision-making process, leading it toward the correct answers.

For instance, consider a binary classification problem where the model must output Yes or No, and suppose we run two rounds of iteration:

  1. Suppose the model errs in the first round: it outputs Yes while the ground truth is No.
  2. After the reward calculation, the reward is 0, effectively telling the model that Yes is incorrect.
  3. So what will the model likely output in the next round? It is highly probable that it outputs No.

Using this approach, we can obtain a model with nearly 100% accuracy after two iterations (see the sketch below). This strange performance boost is caused by label leakage rather than by the RL process.
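
A tiny Python illustration of the mechanism (the names are made up, just to show the point):

```python
# Toy illustration: in a two-way task, a binary reward computed by comparing
# the answer to the ground truth effectively reveals the label.

def reward(answer: str, ground_truth: str) -> int:
    return int(answer == ground_truth)  # the reward function sees the label

ground_truth = "No"
first_answer = "Yes"                                 # round 1: wrong guess
if reward(first_answer, ground_truth) == 0:
    # Only two options exist, so "Yes is wrong" implies the answer is "No".
    second_answer = "No" if first_answer == "Yes" else "Yes"
    assert reward(second_answer, ground_truth) == 1  # "correct" in round 2
```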

Of course, HotpotQA is not a simple binary classification task, and the model may not necessarily converge to the correct answer after several iterations. However, the true labels do have a substantial supervisory effect on the model. In reality, most tasks don't have ground truth available for model iteration, which limits the applicability of this method.

In my opinion, a more reasonable approach would be to have the model itself (or a powerful backend like GPT-4) score the results as rewards rather than computing them directly from ground truth. This would avoid the issue of label leakage and make the method applicable to real-world scenarios that lack ground-truth labels.
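
A sketch of what I mean, assuming some `call_llm` wrapper around whatever chat API is available (the prompt and names are made up):

```python
# Label-free reward sketch: ask a strong model to grade the answer instead of
# comparing it to the ground truth. `call_llm` is a hypothetical wrapper that
# takes a prompt string and returns the model's text response.

def self_evaluated_reward(question: str, answer: str, call_llm) -> int:
    prompt = (
        "Judge whether the following answer to the question is likely correct.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with exactly one word: CORRECT or INCORRECT."
    )
    verdict = call_llm(prompt).strip().upper()
    return int(verdict == "CORRECT")  # binary reward without label access
```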

@noahshinn
Owner

Sorry for the late response, but I should refer you to the ablation study shown in Figure 4 of our paper. In that study, we evaluated baseline sampling (blindly sampling N samples), episodic memory sampling (sampling conditioned on the previous samples and binary labels), and finally Reflexion sampling. We found that episodic memory sampling improved accuracy (which could be explained by the process-of-elimination effect suggested by @lazyupdate), but did not produce performance improvements as high as the Reflexion sampling strategy. Episodic memory sampling includes the labels and previous answers yet does not lead to the best performance. This eliminates label leakage as the sole contributor to the success of Reflexion on HotPotQA; a rough sketch of the three strategies is below. Let me know if there are further questions.
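
Roughly, the three strategies compare like this (illustrative Python with hypothetical helpers, not the evaluation code we ran):

```python
# Shapes of the three strategies from the Figure 4 ablation (hypothetical
# helper names). The episodic-memory baseline already sees past answers and
# their binary rewards, so a pure process-of-elimination gain shows up there;
# Reflexion adds natural-language reflections on top.

def baseline_sampling(agent, task, n):
    return [agent.act(task) for _ in range(n)]  # blind resampling

def episodic_memory_sampling(agent, env, task, n):
    history, answers = [], []                   # history: (answer, reward)
    for _ in range(n):
        answer = agent.act(task, memory=history)
        history.append((answer, env.evaluate(answer)))
        answers.append(answer)
    return answers

def reflexion_sampling(agent, env, task, n):
    history, reflections, answers = [], [], []
    for _ in range(n):
        answer = agent.act(task, memory=history, reflections=reflections)
        reward = env.evaluate(answer)
        history.append((answer, reward))
        answers.append(answer)
        if reward == 0:
            reflections.append(agent.reflect(task, answer))  # verbal feedback
    return answers
```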
