
Potential unexpected behavior in executor (that uses globals()) #40

Open
allanj opened this issue Apr 7, 2024 · 5 comments

Comments

@allanj

allanj commented Apr 7, 2024

I was running the script https://github.com/noahshinn/reflexion/blob/main/programming_runs/run_reflexion_codellama_multi.sh
with the CodeLLaMA model, simply changing codellama to codellama-7b:

CUDA_VISIBLE_DEVICES=$1 python main.py \
  --run_name "reflexion_codellama_$1" \
  --root_dir "root" \
  --dataset_path ./benchmarks/humaneval-py.jsonl \
  --strategy "reflexion" \
  --language "py" \
  --model "codellama-7b" \
  --pass_at_k "1" \
  --max_iters "2" \
  --verbose | tee ./logs/reflexion_codellama_$1

The performance is around 30~40%, though the run gets killed partway through the experiment.

Observation

Some predictions exhibit unexpected behavior while still being marked correct (is_solved = True). In the following prediction (also attached here):

  1. The first generated program actually failed the generated test cases (NOTE: these test cases were themselves generated by CodeLLaMA-7B), although it eventually passes the ground-truth test cases.
  2. After reflection, the model actually generates nothing (i.e., None).
  3. So presumably we should take "" as the final output, right?
  4. But is_solved is nevertheless assigned True.

[image: screenshot of the prediction log]

Reason

Because the executor always uses the function globals(): even though the program after reflection is empty, globals() already contains the program from the first attempt.

Since we never clear the information in globals(), this can lead to unexpected behavior.
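The leak is easy to reproduce outside the repo. Below is a minimal sketch (a bare exec() with a hand-rolled namespace dict, not the project's actual executor) showing how a stale definition from attempt 1 makes an empty attempt 2 still pass the tests:

```python
# Shared execution namespace, reused across attempts (mimicking globals()).
env = {}

# Attempt 1: a working implementation is exec'd into the shared namespace.
exec("def add(a, b):\n    return a + b", env)

# Attempt 2: the model returned no code, so an empty string is exec'd.
exec("", env)

# The ground-truth test still passes, because add() from attempt 1
# survives in the shared namespace.
assert env["add"](1, 2) == 3
```

Running the second attempt in a fresh dict instead would raise NameError, which is the behavior one would expect for an empty program.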

I'm a bit confused here; I hope the authors can clarify.

@allanj
Author

allanj commented Apr 7, 2024

In particular, globals() will definitely memorize previous implementations, which are then used whenever a future generation is None.

Or a future generation may reference functions (e.g., helpers) that it does not itself define but that are already memorized in globals(). That potentially contaminates the evaluation.

That means the order in which the evaluation samples are run matters.
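The order dependence can be demonstrated with a small sketch (hypothetical sample programs, not taken from the benchmark): sample B only passes because sample A ran first and left a helper behind in the shared namespace.

```python
env = {}  # one namespace shared by all samples, mimicking globals()

sample_a = "def helper():\n    return 42\ndef solve_a():\n    return helper()"
sample_b = "def solve_b():\n    return helper()"  # never defines helper()

exec(sample_a, env)  # defines helper() as a side effect
exec(sample_b, env)
assert env["solve_b"]() == 42  # passes only because sample_a ran first

# In a fresh namespace, sample_b fails as it should:
fresh = {}
exec(sample_b, fresh)
try:
    fresh["solve_b"]()
except NameError:
    pass  # helper() is undefined without sample_a's contamination
```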

@reyrey-app

So if I understand correctly, it could potentially cause incorrect answers to be marked correct?

@allanj
Author

allanj commented Apr 12, 2024

Yes, exactly!

But I assume that for more capable models like GPT-3.5/4 or other large models, this issue rarely arises.

I just got in touch with the authors and am currently working on a fix.

@allanj
Author

allanj commented Apr 12, 2024

I think the current issue is that we need to decide what to do when cur_func_impl is None.

The code always asserts on the current implementation:

assert isinstance(cur_func_impl, str)

but it is possible that the model response is pure natural language with no extractable code, which leaves cur_func_impl as None and makes the assertion fail.

My current attempt:

  1. For the first attempt, we just set it to the empty string "", since it is also the initial implementation:

if cur_func_impl is None:
    # Code could not be extracted from the response.
    cur_func_impl = ""

  2. For a reflection attempt, we can roll back and reflect again rather than substituting an empty string:

if reflected_func_impl is None:
    # Code could not be extracted from the response;
    # skip this iteration and reflect again.
    cur_iter += 1
    continue
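Putting the two branches together, the intended control flow might look like the sketch below. The names generate() and extract_code() are assumptions standing in for the repo's model call and code parser (where the parser returns None when no code block can be found in the response):

```python
def run_reflexion_loop(generate, extract_code, max_iters):
    # First attempt: fall back to "" if no code was extracted,
    # so the isinstance(cur_func_impl, str) assertion always holds.
    cur_func_impl = extract_code(generate("initial")) or ""
    assert isinstance(cur_func_impl, str)

    cur_iter = 0
    while cur_iter < max_iters:
        reflected_func_impl = extract_code(generate("reflect"))
        if reflected_func_impl is None:
            # Code could not be extracted: skip this iteration and
            # reflect again instead of clobbering the current program.
            cur_iter += 1
            continue
        cur_func_impl = reflected_func_impl
        cur_iter += 1
    return cur_func_impl

# A model whose reflections never yield parseable code leaves the
# initial implementation intact:
impl = run_reflexion_loop(
    generate=lambda prompt: prompt,
    extract_code=lambda s: "def f(): pass" if s == "initial" else None,
    max_iters=2,
)
assert impl == "def f(): pass"
```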

I'm making a PR for this and will post more results here.

Some changes: main...allanj:reflexion:fix_global
I'm running some experiments to verify the results.

@allanj
Author

allanj commented Apr 13, 2024

If I clear the globals() state each time, the performance seems to drop a lot (with CodeLLaMA-7B); I'm checking the reason.
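For comparison, one way to isolate evaluations (an assumption about the fix direction, not the repo's current code) is to give every evaluation its own fresh namespace, so an empty program can never pass on the strength of a stale definition:

```python
def run_isolated(program: str, test: str) -> bool:
    """Execute a candidate program and its test in a fresh namespace."""
    env = {}  # fresh namespace per evaluation; nothing leaks between calls
    try:
        exec(program, env)
        exec(test, env)
        return True
    except Exception:
        return False

# A real implementation passes its test:
assert run_isolated("def f(x):\n    return x * 2", "assert f(2) == 4")
# An empty program fails: there is no stale f() to fall back on.
assert not run_isolated("", "assert f(2) == 4")
```

If performance drops under this scheme, one possible explanation is that some generated programs were silently relying on imports or helpers defined by earlier attempts.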
