
How to Extract Complete, Non-redundant, and Correct Code from Messages When Testing on Benchmarks like HumanEval? #1216

Open
huoliangyu opened this issue Apr 20, 2024 · 5 comments

Comments

@huoliangyu

Is your feature request related to a problem? Please describe.

No response

Describe the solution you'd like

Hello,

I am exploring the effectiveness of open-interpreter on benchmarks like HumanEval and have encountered some challenges with the code generation process. Specifically, I've noticed that sometimes the interpreter only plans but does not generate actual code, and at other times the generated code contains errors and requires multiple rounds of modification.

Could you please advise on how best to extract complete, non-redundant, and correct code from messages to automatically test on HumanEval?

Thank you!

Describe alternatives you've considered

No response

Additional context

No response

@Steve235lab
Contributor

Apart from "non-redundant", your requirements can be met with some well-designed custom instructions: you need to tell the LLM the expected way to respond. However, "non-redundant" conflicts with "correct" in most cases. Because of the limited ability of current LLMs, they need to debug their code several times before reaching a final correct version, just like human programmers, which means there is always some redundant code in the conversation history. You may need to find your own way to filter the history and keep only the final correct code.
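For example, something along these lines could be a starting point (an untested sketch; it assumes the 0.2.x Python API, where `interpreter.chat()` returns a list of message dicts with `type`, `format`, and `content` keys, and the custom instruction wording is just a placeholder):

```python
# Sketch: force code-only answers via custom instructions, then keep only the
# last Python code block from the returned history. Message schema assumed
# from open-interpreter 0.2.x; verify against your installed version.
from interpreter import interpreter

interpreter.auto_run = True  # skip confirmation prompts while benchmarking
interpreter.custom_instructions = (
    "Always answer with one complete, self-contained Python function. "
    "Do not ask follow-up questions."
)

def last_python_code(messages):
    """Return the most recent Python code block from the message history."""
    blocks = [
        m["content"]
        for m in messages
        if m.get("type") == "code" and m.get("format") == "python"
    ]
    return blocks[-1] if blocks else None

messages = interpreter.chat("Write a function add(a, b) that returns a + b.")
print(last_python_code(messages))
```

Taking only the last code block is a crude way to drop the redundant intermediate attempts; whether it is really the final correct version still depends on the run.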

@huoliangyu
Author


Thank you for your quick response! Could you suggest any suitable prompt templates or methods for extracting code to test open-interpreter's performance on HumanEval? In my tests (where I've designed prompts to ensure the agent always outputs code), the performance of GPT-3.5 with open-interpreter seems somewhat inferior to using GPT-3.5 directly. Any advice would be greatly appreciated!

@Steve235lab
Contributor

Steve235lab commented Apr 25, 2024

GPT-3.5 is provided by OpenAI as a RESTful API, so I'm not sure what "using GPT-3.5 directly" means. Calling the API directly with curl? If you mean ChatGPT from OpenAI, then since ChatGPT's system prompts are OpenAI's own carefully tuned property, it's hard to compose something better than that. There are some tricks on the Internet for extracting ChatGPT's system prompts; maybe you can try those.
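Just to make sure we mean the same baseline, this is roughly what I would call calling the API "directly", with no agent loop around it (a sketch using the openai>=1.0 Python client; the prompt is only a placeholder):

```python
# Sketch: a single chat.completions call, i.e. "GPT-3.5 directly" with no
# agent loop. Uses the openai>=1.0 client; reads OPENAI_API_KEY from the env.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Complete the following Python function."},
        {"role": "user", "content": 'def add(a, b):\n    """Return a + b."""\n'},
    ],
)
print(response.choices[0].message.content)
```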

@Steve235lab
Contributor

Steve235lab commented Apr 25, 2024

By the way, the default embedded system prompt of OI may not be suitable for your task; it focuses heavily on telling the LLM how to handle OI's special message types. If custom instructions can't solve your problem, you can try modifying the embedded system prompt in the OI source code.
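Something like this is what I have in mind (a rough sketch; `system_message` is the attribute name in the 0.2.x Python API, so check your installed version before relying on it):

```python
# Sketch: adjust OI's embedded system prompt instead of (or in addition to)
# custom instructions. Attribute name assumed from open-interpreter 0.2.x.
from interpreter import interpreter

# Option 1: append task-specific guidance to the default prompt
interpreter.system_message += (
    "\nWhen asked for a function, reply with Python code only."
)

# Option 2: replace the default prompt entirely
interpreter.system_message = (
    "You are a Python coding assistant. For every task, output one complete, "
    "self-contained Python function and nothing else."
)
```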

@huoliangyu
Author

Thank you for your reply. I will try these methods and look forward to OI's continued updates.
