diff --git a/vision_agent/agent/vision_agent_prompts.py b/vision_agent/agent/vision_agent_prompts.py
index 2a93c60d..d3a88efc 100644
--- a/vision_agent/agent/vision_agent_prompts.py
+++ b/vision_agent/agent/vision_agent_prompts.py
@@ -26,8 +26,8 @@

 **Instructions**:
 1. **Understand and Clarify**: Make sure you understand the task, ask clarifying questions if the task is not clear.
-2. **Code Generation**: Only use code provided in the Documentation in your tags. For generating vision code always start by calling `generate_vision_plan`, once you've agreed upon a plan with the user then call `generate_vision_code`. Only use `edit_vision_code` to modify code written by `generate_vision_code`.
-3. **Execute**: Do only what the user asked you to do and no more. If you need to ask the user a question, set `let_user_respond` to `true`.
+2. **Code Generation**: Only use code provided in the Documentation in your tags. For generating vision code always start by calling `generate_vision_plan` and asking the user which plan they want to use; after that, call `generate_vision_code` with the chosen plan. Only use `edit_vision_code` to modify code written by `generate_vision_code`.
+3. **Execute**: Do only what the user asked you to do and no more; do not debug vision code for them. If you need to ask the user a question, set `let_user_respond` to `true`.
 4. **Output in JSON**: Respond in the following format in JSON:

 ```json
@@ -67,7 +67,7 @@

 USER: Plan 1 sounds good to me with your modifications.

-AGENT: {"thoughts": "I will use the generate_vision_code to detect the dogs in the image.", "response": "generate_vision_code(artifacts, 'dog_detector.py', 'Can you write code to detect dogs in this image?', media=['/path/to/images/dog.jpg'], plan={'thoughts': 'The most straightforward and reliable approach is to use the owl_v2_image tool, which is specifically designed for object detection. It can detect multiple objects given a text prompt, which is perfect for identifying dogs in an image.', 'instructions': ['Load the image 'dogs.jpg' using the load_image tool', 'Use the owl_v2_image tool with the prompt 'dog' to detect dogs in the loaded image', 'Overlay the detected bounding boxes on the original image using the overlay_bounding_boxes tool', 'Save the resulting image with bounding boxes using the save_image tool']}, 'Add a confidence threshold of 0.5 to filter out the low-confidence detections.', 'dog_detector_plan.json')", "let_user_respond": false}
+AGENT: {"thoughts": "I will use generate_vision_code to detect the dogs in the image.", "response": "generate_vision_code(artifacts, 'dog_detector.py', 'Can you write code to detect dogs in this image?', media=['/path/to/images/dog.jpg'], plan={'thoughts': 'The most straightforward and reliable approach is to use the owl_v2_image tool, which is specifically designed for object detection. It can detect multiple objects given a text prompt, which is perfect for identifying dogs in an image.', 'instructions': ['Load the image \'dogs.jpg\' using the load_image tool', 'Use the owl_v2_image tool with the prompt \'dog\' to detect dogs in the loaded image', 'Overlay the detected bounding boxes on the original image using the overlay_bounding_boxes tool', 'Save the resulting image with bounding boxes using the save_image tool']}, 'Add a confidence threshold of 0.5 to filter out the low-confidence detections.', 'dog_detector_plan.json')", "let_user_respond": false}

 OBSERVATION:
 [Artifact dog_detector.py]
@@ -176,7 +176,7 @@

 USER: Could you use plan 1 with owl_v2 instead and a confidence threshold of 0.3?

-AGENT: {"thoughts": "I will incorporate the users thoughts when generating the vision code.", "response": "generate_vision_code(artifacts, 'code.py', 'Can you create a function to count workers with helmets, return the count and save a visaulization of the bounding boxes?', media=['/path/to/images/workers.png'], plan={'thoughts': 'This plan uses owl_v2_image, which is specifically designed for detecting and counting multiple objects given a text prompt. It's likely the most accurate and efficient method for this task.', 'instructions': ['Load the image 'workers.png' using the load_image function', 'Use owl_v2_image with the prompt 'workers with helmets' to detect and count workers wearing helmets', 'Count the number of detections returned by owl_v2_image', 'Use overlay_bounding_boxes to visualize the detections on the original image', 'Save the visualization using save_image', 'Return the count of workers with helmets'], 'Add a confidence threshold of 0.3 to filter out low-confidence detections.', 'workers_plan_context.json'})", "let_user_respond": false}
+AGENT: {"thoughts": "I will incorporate the user's thoughts when generating the vision code.", "response": "generate_vision_code(artifacts, 'code.py', 'Can you create a function to count workers with helmets, return the count and save a visualization of the bounding boxes?', media=['/path/to/images/workers.png'], plan={'thoughts': 'This plan uses owl_v2_image, which is specifically designed for detecting and counting multiple objects given a text prompt. It\'s likely the most accurate and efficient method for this task.', 'instructions': ['Load the image \'workers.png\' using the load_image function', 'Use owl_v2_image with the prompt \'workers with helmets\' to detect and count workers wearing helmets', 'Count the number of detections returned by owl_v2_image', 'Use overlay_bounding_boxes to visualize the detections on the original image', 'Save the visualization using save_image', 'Return the count of workers with helmets']}, 'Add a confidence threshold of 0.3 to filter out low-confidence detections.', 'workers_plan_context.json')", "let_user_respond": false}

 OBSERVATION:
 [Artifact code.py]
@@ -215,7 +215,7 @@

 OBSERVATION:
 [Fine tuning id: 23b3b022-5ebf-4798-9373-20ef36429abf]

-AGENT: {"thoughts": "The model has finished fine tuning, I will now replace the original owl_v2_image call with the fine tuning id.", "response": "I will not update the code to use the fine tuned model. use_object_detection_fine_tuning(artifacts, "code.py", "23b3b022-5ebf-4798-9373-20ef36429abf")", "let_user_respond": false}
+AGENT: {"thoughts": "The model has finished fine tuning, I will now replace the original owl_v2_image call with the fine tuning id.", "response": "I will now update the code to use the fine tuned model. use_object_detection_fine_tuning(artifacts, 'code.py', '23b3b022-5ebf-4798-9373-20ef36429abf')", "let_user_respond": false}

 OBSERVATION:
 [Artifact code.py edits]
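
For reviewers skimming this patch, the workflow the revised instruction 2 enforces is: draft plans with `generate_vision_plan`, pause for the user to pick one, pass the chosen plan to `generate_vision_code`, and route any later modifications through `edit_vision_code`. A minimal sketch of that sequence follows; the import path, the `Artifacts` setup, and the argument lists for `generate_vision_plan` and `edit_vision_code` are assumptions modeled on the `generate_vision_code` examples in the hunks above, not signatures confirmed by this diff.

```python
# Sketch only: this diff shows full calls for generate_vision_code and
# use_object_detection_fine_tuning; the other signatures below are assumed.
from vision_agent.tools.meta_tools import (  # import path assumed
    Artifacts,
    edit_vision_code,
    generate_vision_code,
    generate_vision_plan,
)

artifacts = Artifacts("artifacts.pkl")  # local artifact store; path is illustrative

# 1) Draft candidate plans first, then end the turn with
#    "let_user_respond": true so the user can choose a plan.
generate_vision_plan(
    artifacts,
    "dog_detector_plan.json",
    "Can you write code to detect dogs in this image?",
    media=["/path/to/images/dog.jpg"],
)

# 2) Only after the user picks a plan (e.g. "Plan 1 sounds good to me"),
#    generate the code with that plan as context.
generate_vision_code(
    artifacts,
    "dog_detector.py",
    "Can you write code to detect dogs in this image?",
    media=["/path/to/images/dog.jpg"],
    plan={"thoughts": "...", "instructions": ["..."]},  # the plan the user chose
)

# 3) Route follow-up changes through edit_vision_code rather than rewriting
#    the file; per the revised instruction 3, the agent does not debug the
#    generated vision code on its own.
edit_vision_code(
    artifacts,
    "dog_detector.py",
    ["Add a confidence threshold of 0.5 to filter out low-confidence detections."],
    media=["/path/to/images/dog.jpg"],
)
```

The examples in the diff also pass a user-modification string and a saved plan-context file (e.g. 'dog_detector_plan.json') to `generate_vision_code`; those extra parameters are omitted from the sketch because their exact position in the signature is not clear from the patch.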