diff --git a/README.md b/README.md
index 87b39d22..4967461d 100644
--- a/README.md
+++ b/README.md
@@ -43,13 +43,32 @@ export OPENAI_API_KEY="your-api-key"
 
 ### Important Note on API Usage
 Please be aware that using the API in this project requires you to have API credits (minimum of five US dollars). This is different from the OpenAI subscription used in this chatbot. If you don't have credit, further information can be found [here](https://github.com/landing-ai/vision-agent?tab=readme-ov-file#how-to-get-started-with-openai-api-credits)
+
 ### Vision Agent
-#### Basic Usage
-You can interact with the agent as you would with any LLM or LMM model:
+There are two agents that you can use. Vision Agent is a conversational agent that has
+access to tools that allow it to write and navigate Python code. It can converse with
+the user in natural language. VisionAgentCoder is an agent that can write code for
+vision tasks, such as counting people in an image. However, it cannot converse and can
+only respond with code. VisionAgent can call VisionAgentCoder to write vision code.
 
+#### Basic Usage
 ```python
 >>> from vision_agent.agent import VisionAgent
 >>> agent = VisionAgent()
+>>> resp = agent("Hello")
+>>> print(resp)
+[{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "{'thoughts': 'The user has greeted me. I will respond with a greeting and ask how I can assist them.', 'response': 'Hello! How can I assist you today?', 'let_user_respond': True}"}]
+>>> resp.append({"role": "user", "content": "Can you count the number of people in this image?", "media": ["people.jpg"]})
+>>> resp = agent(resp)
+```
+
+### Vision Agent Coder
+#### Basic Usage
+You can interact with the agent as you would with any LLM or LMM model:
+
+```python
+>>> from vision_agent.agent import VisionAgentCoder
+>>> agent = VisionAgentCoder()
 >>> code = agent("What percentage of the area of the jar is filled with coffee beans?", media="jar.jpg")
 ```
 
@@ -90,7 +109,7 @@ To better understand how the model came up with it's answer, you can run it in d
 mode by passing in the verbose argument:
 
 ```python
->>> agent = VisionAgent(verbose=2)
+>>> agent = VisionAgentCoder(verbose=2)
 ```
 
 #### Detailed Usage
@@ -180,9 +199,11 @@ def custom_tool(image_path: str) -> str:
     return np.zeros((10, 10))
 ```
 
-You need to ensure you call `@va.tools.register_tool` with any imports it might use and
-ensure the documentation is in the same format above with description, `Parameters:`,
-`Returns:`, and `Example\n-------`. You can find an example use case [here](examples/custom_tools/).
+You need to ensure you call `@va.tools.register_tool` with any imports it uses. Global
+variables will not be captured by `register_tool`, so you need to include them in the
+function. Make sure the documentation is in the same format as above, with a description,
+`Parameters:`, `Returns:`, and `Example\n-------`. You can find an example use case
+[here](examples/custom_tools/), as this is what the agent uses to pick and use the tool.
 
 ### Azure Setup
 If you want to use Azure OpenAI models, you need to have two OpenAI model deployments:
@@ -209,7 +230,7 @@ You can then run Vision Agent using the Azure OpenAI models:
 
 ```python
 import vision_agent as va
-agent = va.agent.AzureVisionAgent()
+agent = va.agent.AzureVisionAgentCoder()
 ```
 
******************************************************************************************************************************
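To make the new custom-tool guidance above concrete, here is a minimal sketch of a registered tool that keeps its imports and would-be globals inside the function body, assuming `register_tool` accepts an `imports` list as in the README's full custom tool example; the tool name, threshold value, and NumPy/Pillow usage are illustrative only:

```python
import vision_agent as va


@va.tools.register_tool(imports=["import numpy as np", "from PIL import Image"])
def bright_area_fraction(image_path: str) -> float:
    """Returns the fraction of the image that is brighter than a fixed threshold.

    Parameters:
        image_path (str): The path to the image.

    Returns:
        float: Fraction of pixels above the brightness threshold.

    Example
    -------
        >>> bright_area_fraction("image.jpg")
    """
    # Imports and constants live inside the function body: register_tool does not
    # capture module-level globals, so the tool has to be self-contained.
    import numpy as np
    from PIL import Image

    brightness_threshold = 128  # would normally be a module-level global; inlined on purpose
    img = np.asarray(Image.open(image_path).convert("L"))
    return float((img > brightness_threshold).mean())
```

The docstring keeps the description, `Parameters:`, `Returns:`, and `Example` sections because that text is what the agent reads when deciding whether and how to call the tool.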
diff --git a/docs/index.md b/docs/index.md
index 68670be1..3d3c8ccc 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -35,13 +35,32 @@ export OPENAI_API_KEY="your-api-key"
 
 ### Important Note on API Usage
 Please be aware that using the API in this project requires you to have API credits (minimum of five US dollars). This is different from the OpenAI subscription used in this chatbot. If you don't have credit, further information can be found [here](https://github.com/landing-ai/vision-agent?tab=readme-ov-file#how-to-get-started-with-openai-api-credits)
+
 ### Vision Agent
-#### Basic Usage
-You can interact with the agent as you would with any LLM or LMM model:
+There are two agents that you can use. Vision Agent is a conversational agent that has
+access to tools that allow it to write and navigate Python code. It can converse with
+the user in natural language. VisionAgentCoder is an agent that can write code for
+vision tasks, such as counting people in an image. However, it cannot converse and can
+only respond with code. VisionAgent can call VisionAgentCoder to write vision code.
 
+#### Basic Usage
 ```python
 >>> from vision_agent.agent import VisionAgent
 >>> agent = VisionAgent()
+>>> resp = agent("Hello")
+>>> print(resp)
+[{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "{'thoughts': 'The user has greeted me. I will respond with a greeting and ask how I can assist them.', 'response': 'Hello! How can I assist you today?', 'let_user_respond': True}"}]
+>>> resp.append({"role": "user", "content": "Can you count the number of people in this image?", "media": ["people.jpg"]})
+>>> resp = agent(resp)
+```
+
+### Vision Agent Coder
+#### Basic Usage
+You can interact with the agent as you would with any LLM or LMM model:
+
+```python
+>>> from vision_agent.agent import VisionAgentCoder
+>>> agent = VisionAgentCoder()
 >>> code = agent("What percentage of the area of the jar is filled with coffee beans?", media="jar.jpg")
 ```
 
@@ -82,7 +101,7 @@ To better understand how the model came up with it's answer, you can run it in d
 mode by passing in the verbose argument:
 
 ```python
->>> agent = VisionAgent(verbose=2)
+>>> agent = VisionAgentCoder(verbose=2)
 ```
 
 #### Detailed Usage
@@ -172,9 +191,11 @@ def custom_tool(image_path: str) -> str:
     return np.zeros((10, 10))
 ```
 
-You need to ensure you call `@va.tools.register_tool` with any imports it might use and
-ensure the documentation is in the same format above with description, `Parameters:`,
-`Returns:`, and `Example\n-------`. You can find an example use case [here](examples/custom_tools/).
+You need to ensure you call `@va.tools.register_tool` with any imports it uses. Global
+variables will not be captured by `register_tool`, so you need to include them in the
+function. Make sure the documentation is in the same format as above, with a description,
+`Parameters:`, `Returns:`, and `Example\n-------`. You can find an example use case
+[here](examples/custom_tools/), as this is what the agent uses to pick and use the tool.
 
 ### Azure Setup
 If you want to use Azure OpenAI models, you need to have two OpenAI model deployments:
@@ -201,7 +222,7 @@ You can then run Vision Agent using the Azure OpenAI models:
 
 ```python
 import vision_agent as va
-agent = va.agent.AzureVisionAgent()
+agent = va.agent.AzureVisionAgentCoder()
 ```
 
******************************************************************************************************************************
diff --git a/vision_agent/agent/vision_agent.py b/vision_agent/agent/vision_agent.py
index cc0c04a6..5a77cc75 100644
--- a/vision_agent/agent/vision_agent.py
+++ b/vision_agent/agent/vision_agent.py
@@ -100,6 +100,20 @@ def parse_execution(response: str) -> Optional[str]:
 
 
 class VisionAgent(Agent):
+    """Vision Agent is an agent that can chat with the user and call tools or other
+    agents to generate code for it. Vision Agent uses Python code to execute actions for
+    the user. Vision Agent is inspired by OpenDevin
+    https://github.com/OpenDevin/OpenDevin and CodeAct https://arxiv.org/abs/2402.01030
+
+    Example
+    -------
+        >>> from vision_agent.agent import VisionAgent
+        >>> agent = VisionAgent()
+        >>> resp = agent("Hello")
+        >>> resp.append({"role": "user", "content": "Can you write a function that counts dogs?", "media": ["dog.jpg"]})
+        >>> resp = agent(resp)
+    """
+
     def __init__(
         self,
         agent: Optional[LMM] = None,
@@ -120,6 +134,17 @@ def __call__(
         self,
         input: Union[str, List[Message]],
         media: Optional[Union[str, Path]] = None,
     ) -> str:
+        """Chat with VisionAgent and get the conversation response.
+
+        Parameters:
+            input (Union[str, List[Message]]): A conversation in the format of
+                [{"role": "user", "content": "describe your task here..."}, ...] or a
+                string of just the contents.
+            media (Optional[Union[str, Path]]): The media file to be used in the task.
+
+        Returns:
+            str: The conversation response.
+        """
         if isinstance(input, str):
             input = [{"role": "user", "content": input}]
             if media is not None:
@@ -131,6 +156,20 @@ def chat_with_code(
         self,
         chat: List[Message],
     ) -> List[Message]:
+        """Chat with VisionAgent; it will use code to execute actions to accomplish
+        its tasks.
+
+        Parameters:
+            chat (List[Message]): A conversation
+                in the format of:
+                [{"role": "user", "content": "describe your task here..."}]
+                or if it contains media files, it should be in the format of:
+                [{"role": "user", "content": "describe your task here...", "media": ["image1.jpg", "image2.jpg"]}]
+
+        Returns:
+            List[Message]: The conversation response.
+        """
+
         if not chat:
             raise ValueError("chat cannot be empty")
 
diff --git a/vision_agent/agent/vision_agent_coder.py b/vision_agent/agent/vision_agent_coder.py
index e1f8edcf..d189badf 100644
--- a/vision_agent/agent/vision_agent_coder.py
+++ b/vision_agent/agent/vision_agent_coder.py
@@ -492,7 +492,7 @@ class VisionAgentCoder(Agent):
 
     Example
     -------
-        >>> from vision_agent import VisionAgentCoder
+        >>> from vision_agent.agent import VisionAgentCoder
         >>> agent = VisionAgentCoder()
         >>> code = agent("What percentage of the area of the jar is filled with coffee beans?", media="jar.jpg")
     """
diff --git a/vision_agent/tools/meta_tools.py b/vision_agent/tools/meta_tools.py
index 2f9c2337..67aaa385 100644
--- a/vision_agent/tools/meta_tools.py
+++ b/vision_agent/tools/meta_tools.py
@@ -6,6 +6,8 @@
 from vision_agent.lmm.types import Message
 from vision_agent.tools.tool_utils import get_tool_documentation
 
+# These tools are adapted from SWE-Agent https://github.com/princeton-nlp/SWE-agent
+
 CURRENT_FILE = None
 CURRENT_LINE = 0
 DEFAULT_WINDOW_SIZE = 100
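The meta_tools.py hunk only shows the shared module-level state behind the SWE-Agent-style file tools. As a rough, self-contained sketch of the windowed file-viewing pattern that `CURRENT_FILE`, `CURRENT_LINE`, and `DEFAULT_WINDOW_SIZE` support (the function below is hypothetical and not part of vision_agent's actual API):

```python
# Hypothetical sketch only: this is not the implementation in
# vision_agent/tools/meta_tools.py, just the windowed-view pattern that the
# CURRENT_FILE / CURRENT_LINE / DEFAULT_WINDOW_SIZE globals make possible.
from pathlib import Path
from typing import Optional, Union

CURRENT_FILE: Optional[str] = None
CURRENT_LINE = 0
DEFAULT_WINDOW_SIZE = 100


def view_file_window(file_path: Union[str, Path], line: int = 0, window: int = DEFAULT_WINDOW_SIZE) -> str:
    """Return roughly `window` numbered lines centered on `line`, remembering the position."""
    global CURRENT_FILE, CURRENT_LINE
    lines = Path(file_path).read_text().splitlines()
    start = max(0, line - window // 2)
    end = min(len(lines), start + window)
    # Persist the viewing position so a follow-up "scroll" style call could continue from here.
    CURRENT_FILE, CURRENT_LINE = str(file_path), line
    return "\n".join(f"{i + 1}: {text}" for i, text in enumerate(lines[start:end], start=start))
```

Keeping the cursor in module state is what lets otherwise stateless tool calls behave like an editor that can scroll, which is the pattern the SWE-Agent comment refers to.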