
How to Extract Complete, Non-redundant, and Correct Code from Messages When Testing on Benchmarks like HumanEval? #1216

Open
huoliangyu opened this issue Apr 20, 2024 · 5 comments

Comments

@huoliangyu

Is your feature request related to a problem? Please describe.

No response

Describe the solution you'd like

Hello,

I am exploring the effectiveness of open-interpreter on benchmarks like HumanEval and have encountered some challenges with the code generation process. Specifically, I've noticed that sometimes the interpreter only plans but does not generate actual code, and at other times the generated code contains errors and requires multiple rounds of modification.

Could you please advise on how best to extract complete, non-redundant, and correct code from messages to automatically test on HumanEval?

Thank you!

Describe alternatives you've considered

No response

Additional context

No response

@Steve235lab
Contributor

Apart from "non-redundant", your requirements can be met with some well-designed custom instructions: you need to tell the LLM the expected way to respond. However, "non-redundant" conflicts with "correct" in most cases. Because of the limited ability of current LLMs, they need to debug their code several times before reaching a final correct version, just like human programmers, which means there is always some redundant code in the conversation history. You may need to find your own way to filter the history and keep only the final correct code.
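For example, something along these lines could be a starting point (an untested sketch; it assumes the 0.2.x Python API, where `interpreter.chat()` returns a list of message dicts with `type`, `format`, and `content` keys, and the custom instruction wording is just a placeholder):

```python
# Sketch: force code-only answers via custom instructions, then keep only the
# last Python code block from the returned history. Message schema assumed
# from open-interpreter 0.2.x; verify against your installed version.
from interpreter import interpreter

interpreter.auto_run = True  # skip confirmation prompts while benchmarking
interpreter.custom_instructions = (
    "Always answer with one complete, self-contained Python function. "
    "Do not ask follow-up questions."
)

def last_python_code(messages):
    """Return the most recent Python code block from the message history."""
    blocks = [
        m["content"]
        for m in messages
        if m.get("type") == "code" and m.get("format") == "python"
    ]
    return blocks[-1] if blocks else None

messages = interpreter.chat("Write a function add(a, b) that returns a + b.")
print(last_python_code(messages))
```

Taking only the last code block is a crude way to drop the redundant intermediate attempts; whether it is really the final correct version still depends on the run.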

@huoliangyu
Author


Thank you for your quick response! Could you suggest any suitable prompt templates or methods for extracting code to test open-interpreter's performance on HumanEval? In my tests (where I've designed prompts to ensure the agent always outputs code), the performance of GPT-3.5 with open-interpreter seems somewhat inferior to using GPT-3.5 directly. Any advice would be greatly appreciated!

@Steve235lab
Contributor

Steve235lab commented Apr 25, 2024

GPT-3.5 is provided by OpenAI as a RESTful API, so I'm not sure what "using GPT-3.5 directly" means. Calling the API directly with curl? If you mean ChatGPT from OpenAI, then since ChatGPT's system prompts are OpenAI's own carefully tuned property, it's hard to compose something better than that. There are some tricks on the Internet for extracting ChatGPT's system prompts; maybe you can try those.
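Just to make sure we mean the same baseline, this is roughly what I would call calling the API "directly", with no agent loop around it (a sketch using the openai>=1.0 Python client; the prompt is only a placeholder):

```python
# Sketch: a single chat.completions call, i.e. "GPT-3.5 directly" with no
# agent loop. Uses the openai>=1.0 client; reads OPENAI_API_KEY from the env.
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Complete the following Python function."},
        {"role": "user", "content": 'def add(a, b):\n    """Return a + b."""\n'},
    ],
)
print(response.choices[0].message.content)
```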

@Steve235lab
Contributor

Steve235lab commented Apr 25, 2024

By the way, the default embedded system prompt of OI may not be suitable for your task; it focuses heavily on telling the LLM how to handle OI's special message types. If custom instructions can't solve your problem, you can try modifying the embedded system prompt in the OI source code.
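Something like this is what I have in mind (a rough sketch; `system_message` is the attribute name in the 0.2.x Python API, so check your installed version before relying on it):

```python
# Sketch: adjust OI's embedded system prompt instead of (or in addition to)
# custom instructions. Attribute name assumed from open-interpreter 0.2.x.
from interpreter import interpreter

# Option 1: append task-specific guidance to the default prompt
interpreter.system_message += (
    "\nWhen asked for a function, reply with Python code only."
)

# Option 2: replace the default prompt entirely
interpreter.system_message = (
    "You are a Python coding assistant. For every task, output one complete, "
    "self-contained Python function and nothing else."
)
```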

@huoliangyu
Author

Thank you for your reply. I will try these methods and look forward to OI's continued updates.
