Commit 8fde0de

updated docs;
1 parent b0bafb9 commit 8fde0de

File tree

5 files changed: +98 −15 lines


README.md

Lines changed: 28 additions & 7 deletions
@@ -43,13 +43,32 @@ export OPENAI_API_KEY="your-api-key"
 ### Important Note on API Usage
 Please be aware that using the API in this project requires you to have API credits (minimum of five US dollars). This is different from the OpenAI subscription used in this chatbot. If you don't have credit, further information can be found [here](https://github.com/landing-ai/vision-agent?tab=readme-ov-file#how-to-get-started-with-openai-api-credits)
 
+
 ### Vision Agent
-#### Basic Usage
-You can interact with the agent as you would with any LLM or LMM model:
+There are two agents that you can use. Vision Agent is a conversational agent that has
+access to tools that allow it to write and navigate Python code. It can converse with
+the user in natural language. VisionAgentCoder is an agent that can write code for
+vision tasks, such as counting people in an image. However, it cannot converse and can
+only respond with code. VisionAgent can call VisionAgentCoder to write vision code.
 
+#### Basic Usage
 ```python
 >>> from vision_agent.agent import VisionAgent
 >>> agent = VisionAgent()
+>>> resp = agent("Hello")
+>>> print(resp)
+[{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "{'thoughts': 'The user has greeted me. I will respond with a greeting and ask how I can assist them.', 'response': 'Hello! How can I assist you today?', 'let_user_respond': True}"}]
+>>> resp.append({"role": "user", "content": "Can you count the number of people in this image?", "media": ["people.jpg"]})
+>>> resp = agent(resp)
+```
+
+### Vision Agent Coder
+#### Basic Usage
+You can interact with the agent as you would with any LLM or LMM model:
+
+```python
+>>> from vision_agent.agent import VisionAgentCoder
+>>> agent = VisionAgentCoder()
 >>> code = agent("What percentage of the area of the jar is filled with coffee beans?", media="jar.jpg")
 ```
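As the README example above shows, both agents work on a conversation that is a plain list of Message dicts. A small helper can pull out the most recent assistant reply from such a list; this `last_assistant_message` function is a hypothetical sketch for illustration, not part of the vision-agent API:

```python
def last_assistant_message(conv):
    """Return the content of the most recent assistant message, or None."""
    for msg in reversed(conv):
        if msg["role"] == "assistant":
            return msg["content"]
    return None

# A conversation in the same shape the agents return (see the example above).
conv = [
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hello! How can I assist you today?"},
]
print(last_assistant_message(conv))  # Hello! How can I assist you today?
```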

@@ -90,7 +109,7 @@ To better understand how the model came up with its answer, you can run it in debug
 mode by passing in the verbose argument:
 
 ```python
->>> agent = VisionAgent(verbose=2)
+>>> agent = VisionAgentCoder(verbose=2)
 ```
 
 #### Detailed Usage
@@ -180,9 +199,11 @@ def custom_tool(image_path: str) -> str:
     return np.zeros((10, 10))
 ```
 
-You need to ensure you call `@va.tools.register_tool` with any imports it might use and
-ensure the documentation is in the same format above with description, `Parameters:`,
-`Returns:`, and `Example\n-------`. You can find an example use case [here](examples/custom_tools/).
+You need to ensure you call `@va.tools.register_tool` with any imports it uses. Global
+variables will not be captured by `register_tool`, so you need to include them in the
+function. Make sure the documentation is in the same format above with description,
+`Parameters:`, `Returns:`, and `Example\n-------`. You can find an example use case
+[here](examples/custom_tools/) as this is what the agent uses to pick and use the tool.
 
 ### Azure Setup
 If you want to use Azure OpenAI models, you need to have two OpenAI model deployments:
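The docstring layout the hunk above requires (description, `Parameters:`, `Returns:`, `Example\n-------`) can be sketched as follows. The tool itself is hypothetical and the decorator is commented out since it needs the `vision_agent` package and its exact call signature is documented in the repo; only the docstring shape is the point:

```python
# Hypothetical custom tool illustrating the required docstring format.
# @va.tools.register_tool(...)  # decorator per the repo docs; not runnable here
def count_red_pixels(image_path: str) -> int:
    """Counts the number of red pixels in an image.

    Parameters:
        image_path (str): The path to the image.

    Returns:
        int: The number of red pixels.

    Example
    -------
        >>> count_red_pixels("image.jpg")
    """
    # Globals are not captured by register_tool, so imports live in the function.
    import random
    return random.randint(0, 100)  # placeholder body for illustration
```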
@@ -209,7 +230,7 @@ You can then run Vision Agent using the Azure OpenAI models:
 
 ```python
 import vision_agent as va
-agent = va.agent.AzureVisionAgent()
+agent = va.agent.AzureVisionAgentCoder()
 ```
 
******************************************************************************************************************************

docs/index.md

Lines changed: 28 additions & 7 deletions

Same changes as in README.md, applied to docs/index.md at its own line offsets
(hunks @@ -35,13 +35,32 @@, @@ -82,7 +101,7 @@, @@ -172,9 +191,11 @@, and
@@ -201,7 +222,7 @@).

******************************************************************************************************************************

vision_agent/agent/vision_agent.py

Lines changed: 39 additions & 0 deletions
@@ -100,6 +100,20 @@ def parse_execution(response: str) -> Optional[str]:
 
 
 class VisionAgent(Agent):
+    """Vision Agent is an agent that can chat with the user and call tools or other
+    agents to generate code for it. Vision Agent uses Python code to execute actions
+    for the user. Vision Agent is inspired by OpenDevin
+    https://github.com/OpenDevin/OpenDevin and CodeAct https://arxiv.org/abs/2402.01030
+
+    Example
+    -------
+        >>> from vision_agent.agent import VisionAgent
+        >>> agent = VisionAgent()
+        >>> resp = agent("Hello")
+        >>> resp.append({"role": "user", "content": "Can you write a function that counts dogs?", "media": ["dog.jpg"]})
+        >>> resp = agent(resp)
+    """
+
     def __init__(
         self,
         agent: Optional[LMM] = None,
@@ -120,6 +134,17 @@ def __call__(
         input: Union[str, List[Message]],
         media: Optional[Union[str, Path]] = None,
     ) -> str:
+        """Chat with VisionAgent and get the conversation response.
+
+        Parameters:
+            input (Union[str, List[Message]]): A conversation in the format of
+                [{"role": "user", "content": "describe your task here..."}, ...] or a
+                string of just the contents.
+            media (Optional[Union[str, Path]]): The media file to be used in the task.
+
+        Returns:
+            str: The conversation response.
+        """
         if isinstance(input, str):
             input = [{"role": "user", "content": input}]
         if media is not None:
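The context lines above show the start of the normalization `__call__` performs before running the chat loop: a bare string becomes a one-message conversation, and an optional media path is attached. A minimal self-contained sketch of that preprocessing, assuming the media path is added to the first message as a one-element list (the diff cuts off before that branch, so this shape is an assumption):

```python
def normalize_input(input, media=None):
    """Sketch of VisionAgent.__call__ input normalization (not the real method)."""
    if isinstance(input, str):
        # A bare string becomes a single user message.
        input = [{"role": "user", "content": input}]
    if media is not None:
        # Assumed attachment shape: a list of media paths on the first message.
        input[0]["media"] = [str(media)]
    return input

print(normalize_input("Hello"))  # [{'role': 'user', 'content': 'Hello'}]
```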
@@ -131,6 +156,20 @@ def chat_with_code(
         self,
         chat: List[Message],
     ) -> List[Message]:
+        """Chat with VisionAgent; it will use code to execute actions to accomplish
+        its tasks.
+
+        Parameters:
+            chat (List[Message]): A conversation
+                in the format of:
+                [{"role": "user", "content": "describe your task here..."}]
+                or, if it contains media files, in the format of:
+                [{"role": "user", "content": "describe your task here...", "media": ["image1.jpg", "image2.jpg"]}]
+
+        Returns:
+            List[Message]: The conversation response.
+        """
+
         if not chat:
             raise ValueError("chat cannot be empty")
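The two Message shapes the docstring describes can be checked with a small validator. This helper is hypothetical and not in the repo; only the non-empty check mirrors what the method itself does in the context lines above:

```python
def validate_chat(chat):
    """Check a chat list against the Message shapes described in the docstring."""
    if not chat:
        raise ValueError("chat cannot be empty")  # mirrors chat_with_code
    for msg in chat:
        if "role" not in msg or "content" not in msg:
            raise ValueError("each message needs 'role' and 'content'")
        if "media" in msg and not isinstance(msg["media"], list):
            raise ValueError("'media' must be a list of file paths")
    return True
```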

vision_agent/agent/vision_agent_coder.py

Lines changed: 1 addition & 1 deletion
@@ -492,7 +492,7 @@ class VisionAgentCoder(Agent):
 
     Example
     -------
-        >>> from vision_agent import VisionAgentCoder
+        >>> from vision_agent.agent import VisionAgentCoder
         >>> agent = VisionAgentCoder()
         >>> code = agent("What percentage of the area of the jar is filled with coffee beans?", media="jar.jpg")
     """

vision_agent/tools/meta_tools.py

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,8 @@
 from vision_agent.lmm.types import Message
 from vision_agent.tools.tool_utils import get_tool_documentation
 
+# These tools are adapted from SWE-Agent https://github.com/princeton-nlp/SWE-agent
+
 CURRENT_FILE = None
 CURRENT_LINE = 0
 DEFAULT_WINDOW_SIZE = 100
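The globals in this hunk (`CURRENT_FILE`, `CURRENT_LINE`, `DEFAULT_WINDOW_SIZE`) suggest a SWE-Agent-style windowed file viewer. A hypothetical sketch of how such a window is commonly computed; this function is not part of the diff, and the centering rule is an assumption:

```python
DEFAULT_WINDOW_SIZE = 100

def view_window(lines, current_line, window=DEFAULT_WINDOW_SIZE):
    """Return the half-open (start, end) line range centered on current_line."""
    start = max(0, current_line - window // 2)
    end = min(len(lines), start + window)
    return start, end

lines = [f"line {i}" for i in range(500)]
print(view_window(lines, 250))  # (200, 300)
```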
