Vision Agent v3 #89

Merged
merged 13 commits into from
May 21, 2024

dillonalaird
Member

@dillonalaird dillonalaird commented May 18, 2024

A couple of pros and cons of Data Interpreter and Agent Coder led to this design.

Data Interpreter Cons:

  • The subtasks are too small: it takes tasks that GPT-4o can handle fine and subdivides them into 5-7 unnecessarily small subtasks
  • When the subtasks are small, it tends to overwrite correct code a lot. For example, it will write correct code for subtask 1 and then remove it by subtask 5

Data Interpreter Pros:

  • Tool recommendation helps scale to more tools
  • Long term memory helps the agent framework "learn" to some degree
  • Planning helps with tool choice and task decomposition if it's executed in the correct way

Agent Coder Pros:

  • Does very well on coding tasks (first place on HumanEval)

Agent Coder Cons:

  • Can't plan long term
  • No tool recommendation, needs all tools at once
  • Nothing like long term memory to help it "learn"

This version of Vision Agent keeps planning, but executes the entire plan in one shot using the Agent Coder framework. It uses the plan for tool recommendation and also allows long-term memory lookup. For planning beyond the initial plan, it reflects on the result and decides whether it needs to execute an additional plan.
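
A rough sketch of that control flow, to make the description concrete. This is not the code in this PR: every name here (`run_agent`, `plan_fn`, `recommend_tools`, `write_and_test`, `reflect`, `Result`) is hypothetical, the callables stand in for LLM calls, and the substring-based memory lookup is only an assumption for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Result:
    """Hypothetical container for one coding attempt."""
    code: str
    passed: bool


def run_agent(
    task: str,
    plan_fn: Callable[[str, List[str]], List[str]],
    recommend_tools: Callable[[List[str]], List[str]],
    write_and_test: Callable[[str, List[str], List[str]], Result],
    reflect: Callable[[str, Result], bool],
    memory: Dict[str, str],
    max_rounds: int = 3,
) -> Result:
    """Plan, code the whole plan in one shot, then reflect and maybe replan."""
    # Long-term memory lookup (assumed here to be a simple keyword match).
    hints = [note for key, note in memory.items() if key in task]
    result = Result(code="", passed=False)
    for _ in range(max_rounds):
        # Produce the full plan up front instead of many tiny subtasks.
        plan = plan_fn(task, hints)
        # Use the plan to pick only the tools that are relevant.
        tools = recommend_tools(plan)
        # Agent Coder style step: write and test code for the entire plan at once.
        result = write_and_test(task, plan, tools)
        # Reflection decides whether the task is done or needs another plan.
        if reflect(task, result):
            break
        hints.append("previous attempt was judged incomplete")
    return result
```

The point of the shape is that the plan only informs tool choice and memory lookup; code is never written subtask by subtask, so there is no later subtask that can overwrite earlier correct code.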

Collaborator

@shankar-vision-eng shankar-vision-eng left a comment

Looks good. Left very minor comments wherever the prompt was not very clear.

vision_agent/agent/vision_agent_v3_prompts.py (outdated)
vision_agent/agent/vision_agent_v3_prompts.py (outdated)
vision_agent/agent/vision_agent_v3_prompts.py
"""


SIMPLE_TEST = """
Collaborator

What is the difference between TEST and SIMPLE_TEST? When do we use one over the other?

Member Author

Good question. TEST is based on the original Agent Coder, but it requires the tester to be very thorough, and for these vision tasks you can't really test a lot of the edge cases (like testing on images without a person, or where the text for OCR is in a different spot). As a result, it tends to hallucinate file names (even after I tell it specifically not to). So I made SIMPLE_TEST, which just asks it to test basic functionality. This seems to be working better on the pool case.
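
To illustrate the kind of check a basic-functionality test performs, here is a minimal sketch. It is not the prompt text or any output from this PR: `count_people` is a hypothetical stand-in for the generated solution, and the nested list stands in for the one image the user actually supplied.

```python
from typing import List


def count_people(image: List[List[int]]) -> int:
    """Hypothetical stand-in for generated code: treats nonzero pixels as detections."""
    return sum(1 for row in image for px in row if px)


def simple_test(image: List[List[int]]) -> None:
    """Check basic functionality only, on the input that was actually provided.

    No edge-case inputs are invented, so the test never needs file names
    other than the one it was given.
    """
    result = count_people(image)
    assert isinstance(result, int)
    assert result >= 0
```

A thorough test in the Agent Coder style would additionally need inputs for each edge case (no person, text in another position), which is where made-up file names tend to creep in.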

Collaborator

Just went through the code again. I see that it's not being used here and is only used in v1 (Agent Coder) and v2. Should we remove it from this file then?

Collaborator

Will leave it in the file for now; we can remove it later.

Collaborator

@shankar-vision-eng shankar-vision-eng left a comment

LGTM

@shankar-vision-eng shankar-vision-eng merged commit 54ae662 into main May 21, 2024
7 checks passed
@dillonalaird dillonalaird deleted the vision-agent-v3 branch May 29, 2024 00:25

2 participants