Full Claude Sonnet 3.5 Support (#234)
* resize image for claude

* only resize if above size

* renamed claude to anthropic for consistency

* added openai classes and made anthropic default

* add ability to view images

* add florence2 fine tune to owl_v2 args

* added fine tune id for florence2sam2

* add generic OD fine tuning

* fixed type error

* added comment

* fix prompt for florence2 sam2 video tracking

* fixed import bug

* updated fine tuning names in prompts

* improve json parsing

* update json extract, add tests

* removed old code

* minor improvements to prompt to improve benchmark

* pass plan thoughts to coder

* fixed comments

* fix type and lint errors

* update tests

* make imports easier, pass more code info

* update prompts

* standardize fps to 1

* rename functions to make them easier to understand by llm

* add openai vision agent coder

* fix complexity

* fix type issue

* fix lmm version

* updated readme
dillonalaird authored Sep 23, 2024
1 parent fb03aad commit 696da6c
Showing 18 changed files with 696 additions and 219 deletions.
70 changes: 59 additions & 11 deletions README.md
@@ -33,10 +33,11 @@ To get started, you can install the library using pip:
pip install vision-agent
```

Ensure you have an OpenAI API key and set it as an environment variable (if you are
using Azure OpenAI please see the Azure setup section):
Ensure you have an Anthropic API key and an OpenAI API key and set them in your
environment variables (if you are using Azure OpenAI, please see the Azure setup section):

```bash
export ANTHROPIC_API_KEY="your-api-key"
export OPENAI_API_KEY="your-api-key"
```

@@ -71,6 +72,9 @@ You can find more details about the streamlit app [here](examples/chat/).
>>> resp = agent(resp)
```

`VisionAgent` currently uses Claude-3.5 as its default LMM and uses OpenAI
embeddings for tool searching.
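
For illustration, here is a minimal sketch of using `VisionAgent` with both keys set as
above. The call signature is an assumption modeled on the coder examples later in this
README:

```python
# Sketch only: assumes ANTHROPIC_API_KEY and OPENAI_API_KEY are exported as shown above,
# and that VisionAgent is invoked like the coder agents below (an assumption).
import vision_agent as va

agent = va.agent.VisionAgent()  # Claude-3.5 LMM by default, per the note above
resp = agent("Detect the dogs in this image", media="dogs.jpg")  # hypothetical prompt/image
print(resp)
```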

### Vision Agent Coder
#### Basic Usage
You can interact with the agent as you would with any LLM or LMM model:
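
The code example for this step sits in a collapsed hunk; below is a minimal sketch
modeled on the backend examples further down (treat the prompt and image path as
placeholders):

```python
# Sketch only: mirrors the AnthropicVisionAgentCoder/OpenAIVisionAgentCoder examples below.
import vision_agent as va

agent = va.agent.VisionAgentCoder()
# Per the sections below, calling the agent returns the generated code (an assumption here).
code = agent("Count the apples in the image", media="apples.jpg")
print(code)
```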
@@ -132,7 +136,8 @@ of the input is a list of dictionaries with the keys `role`, `content`, and `media`
"code": "from vision_agent.tools import ..."
"test": "calculate_filled_percentage('jar.jpg')",
"test_result": "...",
"plan": [{"code": "...", "test": "...", "plan": "..."}, ...],
"plans": {"plan1": {"thoughts": "..."}, ...},
"plan_thoughts": "...",
"working_memory": ...,
}
```
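
As a sketch of how the new `plans` and `plan_thoughts` fields might be consumed (key
names are taken from the excerpt above; everything else is an assumption):

```python
from typing import Any, Dict


def summarize_planning(result: Dict[str, Any]) -> None:
    # `result` is the dict returned by agent.chat_with_workflow(...), per the excerpt above.
    print(result["plan_thoughts"])              # reasoning used when picking the final plan
    for name, plan in result["plans"].items():  # e.g. "plan1"
        print(name, "->", plan["thoughts"])
```
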
@@ -169,20 +174,25 @@ result = agent.chat_with_workflow(conv)
### Tools
There are a variety of tools for the model or the user to use. Some are executed locally
while others are hosted for you. You can easily access them yourself; for example, if
you want to run `owl_v2` and visualize the output you can run:
you want to run `owl_v2_image` and visualize the output, you can run:

```python
import vision_agent.tools as T
import matplotlib.pyplot as plt

image = T.load_image("dogs.jpg")
dets = T.owl_v2("dogs", image)
dets = T.owl_v2_image("dogs", image)
viz = T.overlay_bounding_boxes(image, dets)
plt.imshow(viz)
plt.show()
```

You can also add custom tools to the agent:
You can find all available tools in `vision_agent/tools/tools.py`; however,
`VisionAgentCoder` only uses a subset of tools that have been tested and provide
the best performance. That subset is defined in the same file under the `TOOLS` variable.
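
As a rough illustration (assuming `TOOLS` is an importable list of tool functions, which
this excerpt does not confirm), you could inspect that subset like this:

```python
# Sketch only: assumes vision_agent.tools.tools exposes a TOOLS list of callables.
from vision_agent.tools.tools import TOOLS

for tool in TOOLS:
    # Print each tool's name and the first line of its docstring, if any.
    doc = (tool.__doc__ or "").strip().splitlines()
    print(tool.__name__, "-", doc[0] if doc else "no docstring")
```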

If you can't find the tool you are looking for, you can also add custom tools to the
agent:

```python
import vision_agent as va
@@ -217,9 +227,48 @@ Can't find the tool you need and want to add it to `VisionAgent`? Check out our
we add the source code for all the tools used in `VisionAgent`.

## Additional Backends
### Anthropic
`AnthropicVisionAgentCoder` uses Anthropic's models. To get started, you just need an
Anthropic API key set in your environment variables:

```bash
export ANTHROPIC_API_KEY="your-api-key"
```

Because Anthropic does not offer an embedding model, the default embedding model is
OpenAI's, so you will also need to set your OpenAI API key:

```bash
export OPENAI_API_KEY="your-api-key"
```

Usage is the same as `VisionAgentCoder`:

```python
>>> import vision_agent as va
>>> agent = va.agent.AnthropicVisionAgentCoder()
>>> agent("Count the apples in the image", media="apples.jpg")
```

### OpenAI
`OpenAIVisionAgentCoder` uses OpenAI's models. To get started, you just need an OpenAI
API key set in your environment variables:

```bash
export OPENAI_API_KEY="your-api-key"
```

Usage is the same as `VisionAgentCoder`:

```python
>>> import vision_agent as va
>>> agent = va.agent.OpenAIVisionAgentCoder()
>>> agent("Count the apples in the image", media="apples.jpg")
```


### Ollama
We also provide a `VisionAgentCoder` that uses Ollama. To get started you must download
a few models:
`OllamaVisionAgentCoder` uses Ollama. To get started, you must download a few models:

```bash
ollama pull llama3.1
@@ -240,9 +289,8 @@ tools. You can use it just like you would use `VisionAgentCoder`:
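
The usage example itself sits in a collapsed hunk; here is a minimal sketch based on the
`OllamaVisionAgentCoder` class imported elsewhere in this commit (default constructor
arguments are an assumption):

```python
# Sketch only: assumes a local Ollama server with the models pulled above.
import vision_agent as va

agent = va.agent.OllamaVisionAgentCoder()
code = agent("Count the apples in the image", media="apples.jpg")
print(code)
```
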
> WARNING: VisionAgent doesn't work well unless the underlying LMM is sufficiently powerful. Do not expect good results or even working code with smaller models like Llama 3.1 8B.
### Azure OpenAI
We also provide a `AzureVisionAgentCoder` that uses Azure OpenAI models. To get started
follow the Azure Setup section below. You can use it just like you would use=
`VisionAgentCoder`:
`AzureVisionAgentCoder` uses Azure OpenAI models. To get started, follow the Azure Setup
section below. You can use it just like you would use `VisionAgentCoder`:

```python
>>> import vision_agent as va
70 changes: 59 additions & 11 deletions docs/index.md
@@ -30,10 +30,11 @@ To get started, you can install the library using pip:
pip install vision-agent
```

Ensure you have an OpenAI API key and set it as an environment variable (if you are
using Azure OpenAI please see the Azure setup section):
Ensure you have an Anthropic API key and an OpenAI API key and set them in your
environment variables (if you are using Azure OpenAI, please see the Azure setup section):

```bash
export ANTHROPIC_API_KEY="your-api-key"
export OPENAI_API_KEY="your-api-key"
```

@@ -68,6 +69,9 @@ You can find more details about the streamlit app [here](examples/chat/).
>>> resp = agent(resp)
```

`VisionAgent` currently uses Claude-3.5 as its default LMM and uses OpenAI
embeddings for tool searching.

### Vision Agent Coder
#### Basic Usage
You can interact with the agent as you would with any LLM or LMM model:
@@ -129,7 +133,8 @@ of the input is a list of dictionaries with the keys `role`, `content`, and `media`
"code": "from vision_agent.tools import ..."
"test": "calculate_filled_percentage('jar.jpg')",
"test_result": "...",
"plan": [{"code": "...", "test": "...", "plan": "..."}, ...],
"plans": {"plan1": {"thoughts": "..."}, ...},
"plan_thoughts": "...",
"working_memory": ...,
}
```
@@ -166,20 +171,25 @@ result = agent.chat_with_workflow(conv)
### Tools
There are a variety of tools for the model or the user to use. Some are executed locally
while others are hosted for you. You can easily access them yourself; for example, if
you want to run `owl_v2` and visualize the output you can run:
you want to run `owl_v2_image` and visualize the output, you can run:

```python
import vision_agent.tools as T
import matplotlib.pyplot as plt

image = T.load_image("dogs.jpg")
dets = T.owl_v2("dogs", image)
dets = T.owl_v2_image("dogs", image)
viz = T.overlay_bounding_boxes(image, dets)
plt.imshow(viz)
plt.show()
```

You can also add custom tools to the agent:
You can find all available tools in `vision_agent/tools/tools.py`; however,
`VisionAgentCoder` only uses a subset of tools that have been tested and provide
the best performance. That subset is defined in the same file under the `TOOLS` variable.

If you can't find the tool you are looking for, you can also add custom tools to the
agent:

```python
import vision_agent as va
@@ -214,9 +224,48 @@ Can't find the tool you need and want to add it to `VisionAgent`? Check out our
we add the source code for all the tools used in `VisionAgent`.

## Additional Backends
### Anthropic
`AnthropicVisionAgentCoder` uses Anthropic's models. To get started, you just need an
Anthropic API key set in your environment variables:

```bash
export ANTHROPIC_API_KEY="your-api-key"
```

Because Anthropic does not offer an embedding model, the default embedding model is
OpenAI's, so you will also need to set your OpenAI API key:

```bash
export OPENAI_API_KEY="your-api-key"
```

Usage is the same as `VisionAgentCoder`:

```python
>>> import vision_agent as va
>>> agent = va.agent.AnthropicVisionAgentCoder()
>>> agent("Count the apples in the image", media="apples.jpg")
```

### OpenAI
`OpenAIVisionAgentCoder` uses OpenAI's models. To get started, you just need an OpenAI
API key set in your environment variables:

```bash
export OPENAI_API_KEY="your-api-key"
```

Usage is the same as `VisionAgentCoder`:

```python
>>> import vision_agent as va
>>> agent = va.agent.OpenAIVisionAgentCoder()
>>> agent("Count the apples in the image", media="apples.jpg")
```


### Ollama
We also provide a `VisionAgentCoder` that uses Ollama. To get started you must download
a few models:
`OllamaVisionAgentCoder` uses Ollama. To get started, you must download a few models:

```bash
ollama pull llama3.1
@@ -237,9 +286,8 @@ tools. You can use it just like you would use `VisionAgentCoder`:
> WARNING: VisionAgent doesn't work well unless the underlying LMM is sufficiently powerful. Do not expect good results or even working code with smaller models like Llama 3.1 8B.
### Azure OpenAI
We also provide a `AzureVisionAgentCoder` that uses Azure OpenAI models. To get started
follow the Azure Setup section below. You can use it just like you would use=
`VisionAgentCoder`:
`AzureVisionAgentCoder` uses Azure OpenAI models. To get started, follow the Azure Setup
section below. You can use it just like you would use `VisionAgentCoder`:

```python
>>> import vision_agent as va
43 changes: 41 additions & 2 deletions tests/integ/test_tools.py
@@ -21,8 +21,8 @@
grounding_dino,
grounding_sam,
ixc25_image_vqa,
ixc25_video_vqa,
ixc25_temporal_localization,
ixc25_video_vqa,
loca_visual_prompt_counting,
loca_zero_shot_counting,
ocr,
@@ -33,6 +33,8 @@
vit_nsfw_classification,
)

FINE_TUNE_ID = "65ebba4a-88b7-419f-9046-0750e30250da"


def test_grounding_dino():
img = ski.data.coins()
@@ -65,6 +67,18 @@ def test_owl_v2_image():
assert [res["label"] for res in result] == ["coin"] * len(result)


def test_owl_v2_fine_tune_id():
img = ski.data.coins()
result = owl_v2_image(
prompt="coin",
image=img,
fine_tune_id=FINE_TUNE_ID,
)
# this calls a fine-tuned florence2 model which is going to be worse at this task
assert 14 <= len(result) <= 26
assert [res["label"] for res in result] == ["coin"] * len(result)


def test_owl_v2_video():
frames = [
np.array(Image.fromarray(ski.data.coins()).convert("RGB")) for _ in range(10)
@@ -78,7 +92,7 @@ def test_owl_v2_video():
assert 24 <= len([res["label"] for res in result[0]]) <= 26


def test_object_detection():
def test_florence2_phrase_grounding():
img = ski.data.coins()
result = florence2_phrase_grounding(
image=img,
@@ -88,6 +102,18 @@ def test_object_detection():
assert [res["label"] for res in result] == ["coin"] * 25


def test_florence2_phrase_grounding_fine_tune_id():
img = ski.data.coins()
result = florence2_phrase_grounding(
prompt="coin",
image=img,
fine_tune_id=FINE_TUNE_ID,
)
# this calls a fine-tuned florence2 model which is going to be worse at this task
assert 14 <= len(result) <= 26
assert [res["label"] for res in result] == ["coin"] * len(result)


def test_template_match():
img = ski.data.coins()
result = template_match(
@@ -119,6 +145,19 @@ def test_florence2_sam2_image():
assert len([res["mask"] for res in result]) == 25


def test_florence2_sam2_image_fine_tune_id():
img = ski.data.coins()
result = florence2_sam2_image(
prompt="coin",
image=img,
fine_tune_id=FINE_TUNE_ID,
)
# this calls a fine-tuned florence2 model which is going to be worse at this task
assert 14 <= len(result) <= 26
assert [res["label"] for res in result] == ["coin"] * len(result)
assert len([res["mask"] for res in result]) == len(result)


def test_florence2_sam2_video():
frames = [
np.array(Image.fromarray(ski.data.coins()).convert("RGB")) for _ in range(10)
45 changes: 45 additions & 0 deletions tests/unit/test_utils.py
@@ -0,0 +1,45 @@
from vision_agent.agent.agent_utils import extract_code, extract_json


def test_basic_json_extract():
a = '{"a": 1, "b": 2}'
assert extract_json(a) == {"a": 1, "b": 2}


def test_side_case_quotes_json_extract():
a = "{'0': 'no', '3': 'no', '6': 'no', '9': 'yes', '12': 'no', '15': 'no'}"
a_json = extract_json(a)
assert len(a_json) == 6


def test_side_case_bool_json_extract():
a = "{'0': False, '3': False, '6': False, '9': True, '12': False, '15': False}"
a_json = extract_json(a)
assert len(a_json) == 6


def test_complicated_case_json_extract_1():
a = """```json { "plan1": { "thoughts": "This plan uses the owl_v2_video tool to detect the truck and then uses ocr to read the USDOT and trailer numbers. This approach is efficient as it can process the entire video at once for truck detection.", "instructions": [ "Use extract_frames to get frames from truck1.mp4", "Use owl_v2_video with prompt 'truck' to detect if a truck is present in the video", "If a truck is detected, use ocr on relevant frames to read the USDOT and trailer numbers", "Process the OCR results to extract the USDOT and trailer numbers", "Compile results into JSON format and save using save_json" ] }, "plan2": { "thoughts": "This plan uses florence2_sam2_video_tracking to segment and track the truck, then uses florence2_ocr for text detection. This approach might be more accurate for text detection as it can focus on the relevant parts of the truck.", "instructions": [ "Use extract_frames to get frames from truck1.mp4", "Use florence2_sam2_video_tracking with prompt 'truck' to segment and track the truck", "If a truck is segmented, use florence2_ocr on the segmented area to detect text", "Process the OCR results to extract the USDOT and trailer numbers", "Compile results into JSON format and save using save_json" ] }, "plan3": { "thoughts": "This plan uses ixc25_video_vqa to directly ask questions about the truck, USDOT number, and trailer number. This approach leverages the model's ability to understand and answer complex questions about video content.", "instructions": [ "Use extract_frames to get frames from truck1.mp4", "Use ixc25_video_vqa with the question 'Is there a truck in this video?' to detect the presence of a truck", "If a truck is present, use ixc25_video_vqa with the question 'What is the USDOT number on the truck?'", "Use ixc25_video_vqa with the question 'What is the trailer number on the truck?'", "Process the answers to extract the required information", "Compile results into JSON format and save using save_json" ] } } ```"""

a_json = extract_json(a)
assert len(a_json) == 3
assert "plan1" in a_json


def test_complicated_case_json_extract_2():
a = """{\n "predicted_answer": "2",\n "thoughts": "After analyzing the image and the tool outputs, I can see that there are indeed 2 dogs in the image. One is a small grey dog on the grass, and the other is a larger white dog on the patio. All three plans correctly identified 2 dogs, but I believe plan2 using the countgd_counting tool is the best choice for this task. Here\'s why:\n\n 1. Accuracy: The countgd_counting tool provided high confidence scores (0.92 and 0.9) for both dogs, which aligns with what I can see in the image.\n \n 2. Precision: The bounding boxes from the countgd_counting tool seem to be more precise and tightly fit around the dogs compared to the other tools.\n \n 3. Simplicity: While plan3 offers a more complex approach with additional verification, it\'s not necessary in this case as the dogs are clearly visible and easily identifiable. The extra steps in plan3 would add unnecessary complexity and potential for errors.\n \n 4. Efficiency: Plan2 is more straightforward and efficient than plan3, while potentially offering better accuracy than plan1 (owl_v2_image tool had lower confidence scores).",\n "best_plan": "plan2"\n}"""
a_json = extract_json(a)
assert len(a_json) == 3
assert "predicted_answer" in a_json


def test_basic_code_extract():
a = """```python
def test_basic_json_extract():
a = '{"a": 1, "b": 2}'
assert extract_json(a) == {"a": 1, "b": 2}
```
"""
a_code = extract_code(a)
assert "def test_basic_json_extract():" in a_code
assert "assert extract_json(a) == {" in a_code
3 changes: 2 additions & 1 deletion vision_agent/agent/__init__.py
@@ -1,8 +1,9 @@
from .agent import Agent
from .vision_agent import VisionAgent
from .vision_agent_coder import (
AnthropicVisionAgentCoder,
AzureVisionAgentCoder,
ClaudeVisionAgentCoder,
OllamaVisionAgentCoder,
OpenAIVisionAgentCoder,
VisionAgentCoder,
)
